1) You can look in driver.classes.default.props to find the classes that
map to the 'shortcuts':

seq2sparse = org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles
seqdirectory = org.apache.mahout.text.SequenceFilesFromDirectory

2) I'm new to Mahout as well and haven't ventured into LDA yet. But so far
this has been true: raw data must be converted into sequence files for
Mahout.

Hope at least #1 can help you until someone else can address the LDA and
streaming question.

On Sat, Dec 31, 2011 at 11:36 AM, Allen <[email protected]> wrote:
> Hello there,
>
> I am new to Mahout and trying to get Mahout running on our data
> storage -- Cassandra. After poking around the LDA example on reuters
> data, I have several questions.
>
> 1) Where is the source code for seqdirectory and seq2sparse?
>
> 2) Before the algorithm can run, it looks like the raw text must be
> converted and materialized into a sequence file which represents some
> vectors. Is that true? If so, is there a more efficient way to handle
> the conversion like streaming the data? In my project, all the data is
> in Cassandra. If I need to run some Mahout algorithm, it seems I need
> to get the data out, put them into a temporal directory in HDFS,
> convert them into sequence file and finally turn them into tf-vectors
> format in HDFS. Then I can run the algorithm. 2 temporal data are
> stored in the above procedure which will make the run slow.
>
> Many thanks.
>
> --
> Allen
>
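
For point 1, here is a rough sketch of how the shortcut lookup could be automated. It is only an illustration, not Mahout code: it assumes each line of the props file has the form "<class> = <shortcut> : <description>" (check your own copy of the file, the layout may differ), and the helper names load_shortcuts and source_path and the sample text are made up for this example.

```python
# Assumption: each props line looks like
#   <fully.qualified.ClassName> = <shortcut> : <description>
# The excerpt below is illustrative and not copied from a Mahout release.
SAMPLE_PROPS = """\
# illustrative excerpt
org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles = seq2sparse : (description)
org.apache.mahout.text.SequenceFilesFromDirectory = seqdirectory : (description)
"""


def load_shortcuts(props_text):
    """Return a dict mapping shortcut name -> fully qualified class name."""
    shortcuts = {}
    for raw in props_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything without a key/value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        class_name, _, rest = line.partition("=")
        # The shortcut is whatever precedes the optional ": description".
        shortcut = rest.split(":", 1)[0].strip()
        if shortcut:
            shortcuts[shortcut] = class_name.strip()
    return shortcuts


def source_path(class_name):
    """Java convention: the package maps to a directory path under the source root."""
    return class_name.replace(".", "/") + ".java"


if __name__ == "__main__":
    for name, cls in sorted(load_shortcuts(SAMPLE_PROPS).items()):
        print(name, "->", source_path(cls))
```

Once you have the class name, the source file sits at the matching package path under whichever module's source root contains it.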
