You might get some mileage out of this article I wrote about using Cassandra as input for Hadoop/Mahout, though it's not specific to LDA:
http://www.acunu.com/blogs/sean-owen/scaling-cassandra-and-mahout-hadoop/

On Sat, Dec 31, 2011 at 10:36 AM, Allen <[email protected]> wrote:
> Hello there,
>
> I am new to Mahout and trying to get Mahout running on our data
> storage -- Cassandra. After poking around the LDA example on Reuters
> data, I have several questions.
>
> 1) Where is the source code for seqdirectory and seq2sparse?
>
> 2) Before the algorithm can run, it looks like the raw text must be
> converted and materialized into a sequence file that represents some
> vectors. Is that true? If so, is there a more efficient way to handle
> the conversion, like streaming the data? In my project, all the data is
> in Cassandra. If I need to run some Mahout algorithm, it seems I need
> to get the data out, put it into a temporary directory in HDFS,
> convert it into a sequence file, and finally turn it into tf-vector
> format in HDFS. Then I can run the algorithm. Two temporary copies of
> the data are stored in the above procedure, which will make the run slow.
>
> Many thanks.
>
> --
> Allen
