Re: Questions on compressed input, custom tokenizers, and feature selection

Suneel Marthi Fri, 15 Nov 2013 17:52:36 -0800

Hi Brian,

1. seqdirectory presently works only with plain text files, so you would have
to write your own utility to generate sequence files from gzipped input.

    It should be easy to write a MapReduce job that reads the gzip files and
writes sequence files.
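
    If the data fits on a local disk, a simpler route than a MapReduce job is
to decompress the tree first and run seqdirectory on the result unchanged.
Below is a minimal sketch of that workaround in Python, not the MapReduce job
itself. The function name gunzip_tree and the ".gz" suffix convention are
assumptions made for illustration; nothing here is part of Mahout.

```python
import gzip
import os
import shutil


def gunzip_tree(src_dir, dst_dir):
    """Mirror src_dir into dst_dir, decompressing every *.gz file on the way.

    Hypothetical helper (not part of Mahout). Files without a .gz suffix
    are copied as-is. Returns the sorted list of files written, as paths
    relative to dst_dir.
    """
    written = []
    for root, _dirs, files in os.walk(src_dir):
        # Recreate the same sub-directory layout under dst_dir.
        rel_root = os.path.relpath(root, src_dir)
        out_root = os.path.normpath(os.path.join(dst_dir, rel_root))
        os.makedirs(out_root, exist_ok=True)
        for name in sorted(files):
            src_path = os.path.join(root, name)
            if name.endswith(".gz"):
                out_name = name[:-3]   # drop the ".gz" suffix
                opener = gzip.open     # transparently decompresses
            else:
                out_name = name
                opener = open          # plain byte-for-byte copy
            dst_path = os.path.join(out_root, out_name)
            with opener(src_path, "rb") as fin, open(dst_path, "wb") as fout:
                shutil.copyfileobj(fin, fout)
            written.append(os.path.relpath(dst_path, dst_dir))
    return sorted(written)
```

    The output directory can then be passed to seqdirectory with -i in place
of the gzipped one. dst_dir should not be located inside src_dir, since the
walk would otherwise pick up its own output.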

2. Custom tokenizers:

    Could you provide more specifics here?

    If you are writing a custom Lucene tokenizer, you should be able to plug
it into the call to seq2sparse (the step that follows seqdirectory in Mahout's
processing pipeline).

On Friday, November 15, 2013 7:05 PM, Brian Rogoff <[email protected]> wrote:
 
Hi,
    I'm using Mahout 0.7 with Hadoop 0.20.2-cdh3u2, evaluating it for use
within our company. I have a few questions.

    I'd like to use Mahout classification on some data that I have which is
stored as gzipped files. I'd like to create the sequence data directly from
those compressed files. Is there some file filter class I can use which
will enable me to transparently work from the compressed data?

    In case that isn't clear, consider the 20news example in the
mahout-distribution-0.7. If I create a parallel directory to 20news-all
where all of the leaf files are gzipped, say gzipped-news-all, I'd like to
run

./bin/mahout seqdirectory -i ${WORK_DIR}/gzipped-news-all -o ${WORK_DIR}/gzipped-news-seq

perhaps with another argument to indicate that the input data is
compressed, and have gzipped-news-seq be identical to the 20news-seq dir
resulting from running

./bin/mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq

    I'd like to see how to substitute custom tokenizers into this flow, if
someone could point me to an example, and I'd also like to know if there
are examples of tweaking the feature selection algorithms.

    Thanks in advance!

-- Brian
