Thanks for the information. I went through org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles and build-reuters.sh. It looks like its job is to turn <doc_id, content> sequence files into <doc_id, tf_vector> sequence files. What I don't understand is why it writes several temporary files along the way. As far as I can tell, it follows the procedure below, and the result of each transformation is saved to disk, which seems unnecessary:

<doc_id, content> => <doc_id, List<String>> => <word, wordcount> => <word, integer_id> => <doc_id, tf_vector>

If the content of each document is small enough, say a few MB, which is true of most plain-text documents, wouldn't it be better to run the whole procedure in memory? That is, read <doc_id, content> from somewhere (Cassandra in my case), do the tf_vector calculation entirely in memory, and dump the final result to some place.
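Roughly something like the sketch below, in plain Java. To be clear, this is only to illustrate the idea: the class name and the whitespace tokenizer are placeholders of mine, not Mahout APIs (the real seq2sparse tokenizes with a Lucene analyzer and, as far as I can tell, writes VectorWritable vectors to a sequence file).

import java.util.HashMap;
import java.util.Map;

// Sketch only: build <doc_id, tf_vector> entirely in memory.
// InMemoryTfVectorizer and its whitespace tokenizer are placeholders,
// not Mahout classes.
public class InMemoryTfVectorizer {

  // Global dictionary: word -> integer id, assigned on first sight.
  private final Map<String, Integer> dictionary = new HashMap<String, Integer>();

  // Input: doc_id -> raw content. Output: doc_id -> sparse tf vector,
  // represented here as a map from word id to term frequency.
  public Map<String, Map<Integer, Double>> vectorize(Map<String, String> docs) {
    Map<String, Map<Integer, Double>> tfVectors =
        new HashMap<String, Map<Integer, Double>>();
    for (Map.Entry<String, String> doc : docs.entrySet()) {
      Map<Integer, Double> tf = new HashMap<Integer, Double>();
      // Naive whitespace tokenization; the real seq2sparse runs a
      // Lucene analyzer at this step.
      for (String word : doc.getValue().toLowerCase().split("\\s+")) {
        if (word.length() == 0) {
          continue;
        }
        Integer id = dictionary.get(word);
        if (id == null) {
          id = dictionary.size();
          dictionary.put(word, id);
        }
        Double count = tf.get(id);
        tf.put(id, count == null ? 1.0 : count + 1.0);
      }
      tfVectors.put(doc.getKey(), tf);
    }
    return tfVectors;
  }
}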
On Sat, Dec 31, 2011 at 1:50 PM, Sean Owen <[email protected]> wrote:

> You might get some mileage out of this article I wrote about using
> Cassandra as input for Hadoop/Mahout, though it's not specific to LDA:
>
> http://www.acunu.com/blogs/sean-owen/scaling-cassandra-and-mahout-hadoop/
>
> On Sat, Dec 31, 2011 at 10:36 AM, Allen <[email protected]> wrote:
>
>> Hello there,
>>
>> I am new to Mahout and am trying to get Mahout running on our data
>> store -- Cassandra. After poking around the LDA example on the Reuters
>> data, I have several questions.
>>
>> 1) Where is the source code for seqdirectory and seq2sparse?
>>
>> 2) Before the algorithm can run, it looks like the raw text must be
>> converted and materialized into a sequence file that represents some
>> vectors. Is that true? If so, is there a more efficient way to handle
>> the conversion, like streaming the data? In my project, all the data is
>> in Cassandra. If I need to run a Mahout algorithm, it seems I have to
>> get the data out, put it into a temporary directory in HDFS, convert it
>> into a sequence file, and finally turn that into tf-vector format in
>> HDFS. Only then can I run the algorithm. Two sets of temporary data are
>> written in the above procedure, which will make the run slow.
>>
>> Many thanks.
>>
>> --
>> Allen

--
Allen
