Re: LDA input

ivan obeso Mon, 30 Apr 2012 00:27:11 -0700

Yes. You make a sequence file using, for example, the SequenceFile.Writer
class, writing the name of each file as the key and its entire content as
the value. You can write as many files as you want into the sequence file.
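
As a rough sketch of the key/value layout described above, the following
plain-Java example gathers one (file name, full content) pair per file in a
directory. The class name DocumentPairs and the temporary "corpus" directory
are made up for illustration, and the Hadoop write itself is left out so the
example depends only on the JDK.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of Hadoop or Mahout.
class DocumentPairs {

    // Returns file name -> full text for every regular file directly in dir.
    static Map<String, String> collect(Path dir) throws IOException {
        Map<String, String> docs = new TreeMap<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path file : stream) {
                if (Files.isRegularFile(file)) {
                    String content =
                        new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                    docs.put(file.getFileName().toString(), content);
                }
            }
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny throwaway corpus so the example runs anywhere.
        Path dir = Files.createTempDirectory("corpus");
        Files.write(dir.resolve("doc1.txt"),
                    "first document".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("doc2.txt"),
                    "second document".getBytes(StandardCharsets.UTF_8));

        for (Map.Entry<String, String> e : collect(dir).entrySet()) {
            // In a Hadoop program, this is where each pair would be appended
            // to the sequence file through SequenceFile.Writer.
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

In a real job the keys and values would be wrapped in Hadoop Writable types
(for example Text) before being appended.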

Then you use this *.seq file as the input to
DocumentProcessor.tokenizeDocuments to tokenize the documents (you can use
a stemmer here). The result is a folder of files containing the tokens.
That folder is the input to the
DictionaryVectorizer.createTermFrequencyVectors method, which creates the
TF vectors of the corpus. Finally, the folder of TF vectors is the input to
the LDA algorithm, which you can run with the "bin/mahout lda" tool or call
from a Java program.
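
To make the tokenize-then-vectorize steps more concrete, here is a small
plain-Java sketch of what they produce conceptually: tokens per document, a
dictionary mapping each term to an index, and a sparse term-frequency
vector. It is not Mahout's implementation; the class TermFrequencySketch,
its methods, and the simple lowercase/split tokenizer (no stemming) are
assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class, not part of Mahout.
class TermFrequencySketch {

    // Lowercase the text and split on runs of characters that are neither
    // letters nor digits; a simple stand-in for a real analyzer.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Assign each distinct term an integer index in order of first appearance.
    static Map<String, Integer> buildDictionary(List<List<String>> tokenizedDocs) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (List<String> doc : tokenizedDocs) {
            for (String term : doc) {
                if (!dictionary.containsKey(term)) {
                    dictionary.put(term, dictionary.size());
                }
            }
        }
        return dictionary;
    }

    // Sparse TF vector: term index -> number of occurrences in this document.
    static Map<Integer, Integer> termFrequencies(List<String> tokens,
                                                 Map<String, Integer> dictionary) {
        Map<Integer, Integer> tf = new TreeMap<>();
        for (String term : tokens) {
            Integer index = dictionary.get(term);
            if (index == null) {
                continue; // term not in the dictionary
            }
            Integer old = tf.get(index);
            tf.put(index, old == null ? 1 : old + 1);
        }
        return tf;
    }

    public static void main(String[] args) {
        List<List<String>> docs = new ArrayList<>();
        docs.add(tokenize("The cat, the hat!"));
        docs.add(tokenize("A cat sat."));

        Map<String, Integer> dictionary = buildDictionary(docs);
        System.out.println("dictionary: " + dictionary);
        for (List<String> doc : docs) {
            System.out.println(doc + " -> " + termFrequencies(doc, dictionary));
        }
    }
}
```

Per-document term counts of this kind are what the LDA step takes as its
input.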

You do not need to run a clustering step before the LDA algorithm, because
it performs a clustering process itself.

[https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html]

On Sun, Apr 29, 2012 at 1:11 AM, Aneesha <[email protected]> wrote:

> I create sequential file and create vector for k-means. Is it the same
> input we
> need to use for Latent Dirichlet Allocation????
>
>
