The Reuters files are in SGML format (similar to XML) and need to be
converted into other formats (SequenceFiles, Vectors, etc.) that can be
consumed by the algorithms in Mahout, most notably the clustering ones.
ExtractReuters is a helper class that parses the document ID and the
document content out of the Reuters data set and writes them to a
directory. The usual next step is to run SequenceFilesFromDirectory
(bin/mahout seqdirectory) to convert the text files generated by
ExtractReuters into SequenceFiles, and then SparseVectorsFromSequenceFiles
(bin/mahout seq2sparse) to convert those SequenceFiles into Mahout Vectors.
The resulting vectors are in a form suitable for clustering.
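
Put together, the chain looks roughly like the sketch below. Treat it as an
illustration rather than the exact contents of build-reuters.sh: the function
name reuters_to_vectors is made up for this mail, the output directory names
reuters-out-seqdir and reuters-out-seqdir-sparse are just my choice, and the
charset and chunk size passed to seqdirectory are assumed values that you may
want to change.

```shell
#!/bin/sh
# Sketch of the Reuters preprocessing chain.
# Assumption: MAHOUT defaults to bin/mahout, i.e. you run this from the
# Mahout home directory. Override MAHOUT if your launcher lives elsewhere.
MAHOUT=${MAHOUT:-bin/mahout}

# reuters_to_vectors is a hypothetical helper name, not part of Mahout.
# $1 is the work directory that already contains reuters-sgm.
reuters_to_vectors() {
  work=${1:-mahout-work}

  # 1. SGML files -> plain-text files with the parsed documents
  $MAHOUT org.apache.lucene.benchmark.utils.ExtractReuters \
    "$work/reuters-sgm" "$work/reuters-out" || return 1

  # 2. text files -> SequenceFiles (charset and chunk size are assumed values)
  $MAHOUT seqdirectory \
    -i "$work/reuters-out" -o "$work/reuters-out-seqdir" \
    -c UTF-8 -chunk 5 || return 1

  # 3. SequenceFiles -> sparse Mahout Vectors
  $MAHOUT seq2sparse \
    -i "$work/reuters-out-seqdir" -o "$work/reuters-out-seqdir-sparse"
}
```

After sourcing this, "reuters_to_vectors mahout-work" runs the three steps in
order and stops at the first one that fails. Setting MAHOUT=echo beforehand
only prints the arguments of each step, which is a cheap way to check the
paths before starting a real run.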

That should make the roles of reuters-sgm and reuters-out clear. The
former holds the Reuters documents in SGML format, whereas the latter holds
the output of the ExtractReuters class: text files containing the parsed
doc ID and doc content.

On Mon, Jun 27, 2011 at 11:55 AM, wine lover <[email protected]> wrote:

> Dear All,
>
> When studying to use the build-reuters.sh script, I noticed the following
> command snippet
>
> $MAHOUT org.apache.lucene.benchmark.utils.ExtractReuters \
>          mahout-work/reuters-sgm \
>          mahout-work/reuters-out
>
> Would you like to let me know what do the folders of reuters-sgm and
> reuters-out store? What exactly is the functionality of ExtractReuters
>
> Thanks,
>
