Re: how to prepare data efficiently for mahout

Gary Snider Sat, 31 Dec 2011 09:24:27 -0800

1) You can look in driver.classes.default.props to find the classes that
the 'shortcuts' map to:
seq2sparse = org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles
seqdirectory = org.apache.mahout.text.SequenceFilesFromDirectory
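
If you want to do that lookup programmatically, here is a minimal sketch.
It assumes each line of the props file is laid out as
"class = shortcut : description"; I have not checked this against every
Mahout release, so treat the layout as an assumption. The names
parse_driver_props and SAMPLE are made up for illustration, and the
descriptions in SAMPLE are placeholders.

```python
def parse_driver_props(text):
    """Map each shortcut name to its fully qualified class name."""
    shortcuts = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        class_name, sep, rest = line.partition("=")
        if not sep:
            continue  # not a "key = value" line
        # Drop the optional ": description" suffix after the shortcut.
        shortcut = rest.split(":", 1)[0].strip()
        if shortcut:
            shortcuts[shortcut] = class_name.strip()
    return shortcuts


# Assumed layout: class = shortcut : description (placeholder descriptions).
SAMPLE = """\
# sample entries only
org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles = seq2sparse : (description)
org.apache.mahout.text.SequenceFilesFromDirectory = seqdirectory : (description)
"""

if __name__ == "__main__":
    for name, cls in sorted(parse_driver_props(SAMPLE).items()):
        print(name, "->", cls)
```

To run it against a real checkout, read the file's contents and pass them
in place of SAMPLE.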


2) I'm new to Mahout as well and haven't ventured into LDA yet.
But so far this has held true: raw data must be converted into sequence
files before Mahout can use it.

Hope at least #1 can help you until someone else can address the LDA and
streaming question.

On Sat, Dec 31, 2011 at 11:36 AM, Allen <[email protected]> wrote:

> Hello there,
>
> I am new to Mahout and trying to get Mahout running on our data
> storage -- Cassandra. After poking around the LDA example on reuters
> data, I have several questions.
>
> 1) Where is the source code for seqdirectory and seq2sparse?
>
> 2) Before the algorithm can run, it looks like the raw text must be
> converted and materialized into a sequence file which represents some
> vectors. Is that true? If so, is there a more efficient way to handle
> the conversion, like streaming the data? In my project, all the data is
> in Cassandra. If I need to run some Mahout algorithm, it seems I need
> to get the data out, put it into a temporary directory in HDFS,
> convert it into a sequence file and finally turn it into tf-vectors
> format in HDFS. Then I can run the algorithm. Two temporary copies of
> the data are stored in this procedure, which will make the run slow.
>
> Many thanks.
>
> --
> Allen
>
