Hello there, I am new to Mahout and am trying to run it against our data store, Cassandra. After poking around the LDA example on the Reuters data, I have several questions.
1) Where is the source code for seqdirectory and seq2sparse?

2) Before the algorithm can run, it looks like the raw text must be converted and materialized into a sequence file that represents vectors. Is that true? If so, is there a more efficient way to handle the conversion, such as streaming the data?

In my project, all the data is in Cassandra. To run a Mahout algorithm, it seems I need to get the data out, put it into a temporary directory in HDFS, convert it into a sequence file, and finally turn that into tf-vectors in HDFS. Only then can I run the algorithm. This procedure stores two temporary copies of the data, which will make the run slow.

Many thanks.
-- Allen
