Purpose

This vignette shows the code used to retrieve the vocabulary from the corpus.

Functions

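# Count the space-separated tokens in a single document string.
# strsplit() returns a list with one element per input string, so take
# the first element before measuring its length.
token_counts = function(x){
    return (length(strsplit(x, " ")[[1]]))
}
# Split a document into tokens. Note: this name masks base::split().
split = function(x) {
    return (strsplit(x, " "))
}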
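
Because strsplit() returns a list with one element per input string, calling length() on its result directly gives the number of input strings, not the number of tokens. The sketch below illustrates the difference on a made-up sentence. The names doc, pieces, list_length, token_length and vector_counts are hypothetical and are not part of the original pipeline.

```r
# Hypothetical example sentence, not taken from corpus.dat.
doc <- "the quick brown fox"

# strsplit() always returns a list, even for a single string.
pieces <- strsplit(doc, " ")

# Length of the list: one element per input string.
list_length <- length(pieces)

# Length of the first element: the number of tokens in that string.
token_length <- length(pieces[[1]])

# lengths() (base R >= 3.2.0) counts tokens for many documents at once.
vector_counts <- lengths(strsplit(c("a b c", "d e"), " "))
```

A vectorised call such as lengths() may be a convenient alternative to lapply() when an integer vector is preferred over a list.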

Getting vocabulary

Reading corpus

print ("reading in documents")
lines = readLines("corpus.dat")

Reading LDA output

print ("load from file")
var_theta = readRDS(file="./topicDocs.Rds")
var_phi = readRDS(file="./topicWords.Rds")

Getting word counts for each document

print("token counts for each document")
var_document_token_counts = lapply(lines, token_counts)

Getting a list of unique words in the corpus (the vocabulary)

print ("getting vocab")
var_vocabulary <- colnames(var_phi)
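
Taking the vocabulary from colnames(var_phi) assumes that the topic-word matrix was saved with one column per word and with the words as column names. The sketch below shows that assumption on a toy matrix and cross-checks it against the words found in a toy corpus. All names and values here (toy_phi, toy_lines, and so on) are hypothetical.

```r
# Hypothetical 2-topic, 3-word matrix; rows are topics, columns are words.
toy_phi <- matrix(
    c(0.5, 0.3, 0.2,
      0.1, 0.1, 0.8),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("topic1", "topic2"), c("apple", "banana", "cherry"))
)

# The vocabulary is the set of column names.
toy_vocabulary <- colnames(toy_phi)

# Cross-check: collect the unique space-separated words from toy documents.
toy_lines <- c("apple banana", "cherry apple")
toy_corpus_words <- sort(unique(unlist(strsplit(toy_lines, " "))))

# TRUE when both sources contain exactly the same words.
vocab_matches <- setequal(toy_vocabulary, toy_corpus_words)
```

If colnames() returns NULL on the real matrix, the column names were not stored with it and the vocabulary would have to come from another source.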

Saving

print ("saving to file")
vocabfile = file("vocab.txt")
print ("    doclens...")
saveRDS(var_document_token_counts, file="doc_lens.Rds")
print ("    vocab...")
writeLines(var_vocabulary, vocabfile)
close(vocabfile)  # release the connection created by file()
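
To confirm that both outputs can be read back, a round trip can be sketched as follows. It writes to a temporary directory instead of the working directory, and the file names and example values are hypothetical stand-ins for the real document lengths and vocabulary.

```r
# Write into a temporary directory so the real output files are untouched.
tmp_dir <- tempdir()
lens_path <- file.path(tmp_dir, "doc_lens_example.Rds")
vocab_path <- file.path(tmp_dir, "vocab_example.txt")

# Hypothetical stand-ins for var_document_token_counts and var_vocabulary.
example_lens <- list(2L, 3L)
example_vocab <- c("apple", "banana", "cherry")

# saveRDS() serialises a single R object; writeLines() writes one word per line.
# Passing a path to writeLines() lets R open and close the file itself.
saveRDS(example_lens, file = lens_path)
writeLines(example_vocab, vocab_path)

# Read both files back for comparison with the originals.
restored_lens <- readRDS(lens_path)
restored_vocab <- readLines(vocab_path)
```

Passing a file path rather than a connection object to writeLines() avoids having to close the connection by hand.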