how to prepare data efficiently for mahout

Allen Sat, 31 Dec 2011 09:01:53 -0800

Hello there,

I am new to Mahout and am trying to get it running against our data
store, Cassandra. After poking around the LDA example on the Reuters
data, I have several questions.


1) Where is the source code for seqdirectory and seq2sparse?

2) Before the algorithm can run, it looks like the raw text must be
converted and materialized into a sequence file that represents some
vectors. Is that true? If so, is there a more efficient way to handle
the conversion, such as streaming the data? In my project, all the data
is in Cassandra. To run a Mahout algorithm, it seems I need to get the
data out, put it into a temporary directory in HDFS, convert it into a
sequence file, and finally turn that into tf-vectors in HDFS. Only then
can I run the algorithm. This procedure stores two temporary copies of
the data, which will make the run slow.
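
As a rough illustration of the streaming idea in question 2, the sketch
below turns (id, text) pairs into term-frequency vectors one document at
a time, without writing an intermediate copy. It is plain Python and does
not use Mahout, Hadoop, or Cassandra APIs. The names read_documents,
tokenize, and to_tf_vectors are hypothetical, and the in-memory `rows`
list only stands in for whatever a Cassandra query would return.

```python
import re
from collections import Counter


def read_documents(rows):
    """Yield (doc_id, text) pairs lazily.

    `rows` is a stand-in for a data-store query result (an assumption,
    not a real Cassandra client call).
    """
    for doc_id, text in rows:
        yield doc_id, text


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def to_tf_vectors(documents):
    """Yield (doc_id, term -> count) pairs, one document at a time."""
    for doc_id, text in documents:
        yield doc_id, Counter(tokenize(text))


# Hypothetical sample data in place of rows read from the data store.
rows = [
    ("doc1", "Mahout runs LDA on text"),
    ("doc2", "text text vectors"),
]

# Collected into a dict here only to inspect the result; a real job
# would consume the generator directly.
vectors = dict(to_tf_vectors(read_documents(rows)))
```

Because both steps are generators, each document is tokenized and
counted as it is read, so no temporary directory is needed between the
two stages in this toy version.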

Many thanks.

-- 
Allen
