The Reuters files are in SGML format (similar to XML) and need to be converted into other formats (SequenceFiles, Vectors, etc.) that other Mahout algorithms, most notably the clustering algorithms, can consume. ExtractReuters is a helper class that parses the document ID and the document content from the Reuters data set and writes them to a directory. The usual next step is to use SequenceFilesFromDirectory (bin/mahout seqdirectory) to convert the text files generated by ExtractReuters, which contain the doc ID and content, into SequenceFiles, and then to use SparseVectorsFromSequenceFiles (bin/mahout seq2sparse) to convert those SequenceFiles into Mahout Vectors. The resulting vectors are in a form suitable for clustering.
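As a rough sketch, the three steps above could be chained as follows. The ExtractReuters call and the reuters-sgm and reuters-out directories come from the build-reuters.sh snippet quoted in this thread. The other two output directory names (reuters-out-seqdir, reuters-out-sparse), the UTF-8 charset, and the DRY_RUN helper are my own assumptions for illustration, so adjust them to your setup.

```shell
#!/bin/sh
# Sketch of the Reuters preparation pipeline: SGML -> text -> SequenceFiles -> Vectors.
# Assumptions (not from the original script): the reuters-out-seqdir and
# reuters-out-sparse directory names, the UTF-8 charset, and the DRY_RUN switch.
MAHOUT="${MAHOUT:-bin/mahout}"
WORK="${WORK:-mahout-work}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

reuters_pipeline() {
  # Step 1: parse the SGML files into plain text files (doc ID + content).
  run "$MAHOUT" org.apache.lucene.benchmark.utils.ExtractReuters \
    "$WORK/reuters-sgm" "$WORK/reuters-out" || return 1

  # Step 2: convert the directory of text files into SequenceFiles.
  run "$MAHOUT" seqdirectory \
    -i "$WORK/reuters-out" -o "$WORK/reuters-out-seqdir" -c UTF-8 || return 1

  # Step 3: convert the SequenceFiles into Mahout Vectors.
  run "$MAHOUT" seq2sparse \
    -i "$WORK/reuters-out-seqdir" -o "$WORK/reuters-out-sparse" || return 1
}

reuters_pipeline
```

With the default DRY_RUN=1 the script only prints the three commands, so you can check the paths first. Set DRY_RUN=0 to execute them against a real Mahout installation.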
Now you can probably see the role of reuters-sgm and reuters-out. The former holds the Reuters documents in SGML format, whereas the latter holds the output of the ExtractReuters class: text files containing the parsed doc ID and doc content.

On Mon, Jun 27, 2011 at 11:55 AM, wine lover <[email protected]> wrote:
> Dear All,
>
> When studying to use the build-reuters.sh script, I noticed the following
> command snippet
>
> $MAHOUT org.apache.lucene.benchmark.utils.ExtractReuters \
>   mahout-work/reuters-sgm \
>   mahout-work/reuters-out
>
> Would you like to let me know what do the folders of reuters-sgm and
> reuters-out store? What exactly is the functionality of ExtractReuters
>
> Thanks,
>
