Re: how to prepare data efficiently for mahout

Sean Owen Sat, 31 Dec 2011 10:50:38 -0800

You might get some mileage out of this article I wrote about using
Cassandra as input for Hadoop/Mahout, though it's not specific to LDA:


http://www.acunu.com/blogs/sean-owen/scaling-cassandra-and-mahout-hadoop/

On Sat, Dec 31, 2011 at 10:36 AM, Allen <[email protected]> wrote:

> Hello there,
>
> I am new to Mahout and trying to get Mahout running on our data
> storage -- Cassandra. After poking around the LDA example on Reuters
> data, I have several questions.
>
> 1) Where is the source code for seqdirectory and seq2sparse?
>
> 2) Before the algorithm can run, it looks like the raw text must be
> converted and materialized into a sequence file that represents some
> vectors. Is that true? If so, is there a more efficient way to handle
> the conversion, such as streaming the data? In my project, all the data
> is in Cassandra. To run a Mahout algorithm, it seems I need to get the
> data out, put it into a temporary directory in HDFS, convert it into a
> sequence file, and finally turn it into tf-vectors format in HDFS. Only
> then can I run the algorithm. This procedure stores two temporary copies
> of the data, which will make the run slow.
>
> Many thanks.
>
> --
> Allen
>
