Or use seq2encoded; it does randomized hashing instead of tf-idf. In my
experience the performance is identical to seq2sparse, with a much smaller
model (if you give it a lower dimension to project onto).

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Fri, Jan 4, 2013 at 7:20 AM, Dan Filimon <[email protected]> wrote:

> I haven't actually done this myself, but look at
> DatasetSplitter.java's MarkPreferenceMapper.
> That class is responsible for the partitioning, and you can probably
> just copy it and replace the map() so that it extracts the year
> from the text somehow.
>
> So, while it's not exactly code-free, it's better than writing a new
> program. :)
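>
> A rough, untested sketch of what that replacement might look like (the
> key/value types, the year cutoff, and the extractYear() helper are all
> hypothetical; adapt them to however your records store the year):
>
>     import java.io.IOException;
>     import org.apache.hadoop.io.LongWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.Mapper;
>
>     public class YearMarkMapper extends Mapper<LongWritable, Text, Text, Text> {
>       @Override
>       protected void map(LongWritable key, Text line, Context ctx)
>           throws IOException, InterruptedException {
>         int year = extractYear(line.toString());
>         // Example cutoff: earlier years are in-sample, later out-of-sample.
>         String mark = year <= 2010 ? "TRAINING" : "TEST";
>         ctx.write(new Text(mark), line);
>       }
>
>       private static int extractYear(String record) {
>         // Hypothetical parsing: assumes the record starts with a
>         // four-digit year; adapt to your record layout.
>         return Integer.parseInt(record.substring(0, 4));
>       }
>     }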
>
> On Fri, Jan 4, 2013 at 2:38 AM, Adam Baron <[email protected]> wrote:
> > I went through the classify-20newsgroups.sh example and now want to use
> > Naïve Bayes to classify my own text corpus. The only difference is that
> > I'd prefer to define which documents are in the training set and test
> > set myself, rather than using the split command. My team prefers
> > accuracy comparisons between in-sample years and out-of-sample years
> > over a random selection across all years. I don't believe I should run
> > seq2sparse separately for each set, since I'd end up with different DFs
> > and, more concerning, different keys assigned to the same n-gram in
> > the dictionary.file-0.
> >
> > Is there an easy way to achieve this with pre-built Mahout
> > functionality? The only solution that comes to mind is to write a
> > MapReduce program that parses through the tfidf-vectors after running
> > seq2sparse and sorts the vectors into separate training and test sets
> > based on some variable I put in the vector name.
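> >
> > Roughly, I'm picturing something like the sketch below (untested; the
> > year-in-the-name convention and the two named outputs are assumptions,
> > and tfidf-vectors is read as a SequenceFile of Text/VectorWritable
> > pairs):
> >
> >     import java.io.IOException;
> >     import org.apache.hadoop.io.Text;
> >     import org.apache.hadoop.mapreduce.Mapper;
> >     import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
> >     import org.apache.mahout.math.VectorWritable;
> >
> >     public class SplitByNameMapper
> >         extends Mapper<Text, VectorWritable, Text, VectorWritable> {
> >
> >       private MultipleOutputs<Text, VectorWritable> mos;
> >
> >       @Override
> >       protected void setup(Context ctx) {
> >         // "training" and "test" must be registered on the Job with
> >         // MultipleOutputs.addNamedOutput() before running.
> >         mos = new MultipleOutputs<Text, VectorWritable>(ctx);
> >       }
> >
> >       @Override
> >       protected void map(Text name, VectorWritable vector, Context ctx)
> >           throws IOException, InterruptedException {
> >         // Assumption: the document name embeds the year, e.g. "/2011/doc42".
> >         String out = name.toString().contains("/2011/") ? "test" : "training";
> >         mos.write(out, name, vector);
> >       }
> >
> >       @Override
> >       protected void cleanup(Context ctx)
> >           throws IOException, InterruptedException {
> >         mos.close();
> >       }
> >     }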
> >
> > Thanks,
> >         Adam
>
