Or use seq2encoded; it does randomized hashing instead of tf-idf. The performance, as far as I have seen, is identical to seq2sparse, with a much smaller model size (if you give it a lower dimension to project onto).
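Roughly, the hashing looks like the sketch below, written against Mahout's StaticWordValueEncoder from org.apache.mahout.vectorizer.encoders. seq2encoded drives encoders from that same package, but take this as an illustration of the idea, not its exact code path; the class name and sample text are made up.

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class HashedEncodingSketch {
  public static void main(String[] args) {
    // The projection dimension is fixed up front: a smaller dimension means
    // smaller vectors and a smaller model, at the cost of more hash collisions.
    int cardinality = 10000;
    StaticWordValueEncoder encoder = new StaticWordValueEncoder("text");

    Vector vector = new RandomAccessSparseVector(cardinality);
    for (String token : "an example document to encode".split("\\s+")) {
      // Each token is hashed straight into a slot of the fixed-size vector,
      // so no dictionary.file-0 ever needs to be built or shipped.
      encoder.addToVector(token, vector);
    }
    System.out.println(vector);
  }
}

Collisions only start to hurt when the dimension gets very small, which is consistent with accuracy staying close to the tf-idf vectors while the vectors (and the model trained on them) shrink.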
Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.

On Fri, Jan 4, 2013 at 7:20 AM, Dan Filimon <[email protected]> wrote:

> I haven't actually done this myself, but look at
> DatasetSplitter.java's MarkPreferenceMapper.
> That class is responsible for the partitioning, and you can probably
> just copy that class and replace the map() so that you look at the
> year from the text somehow.
>
> So, while it's not exactly code-free, it's better than writing a new
> program. :)
>
> On Fri, Jan 4, 2013 at 2:38 AM, Adam Baron <[email protected]> wrote:
> > I went through the classify-20newsgroups.sh example and now want to use
> > Naïve Bayes to classify my own text corpus. The only difference is that
> > I'd prefer to define which documents are in the training set and test
> > set rather than using the split command. My team prefers accuracy
> > comparisons between in-sample years and out-of-sample years, as opposed
> > to a random selection across all years. I don't believe I should run
> > seq2sparse separately for each set, since I'd end up with different DFs
> > and, more concerning, different keys assigned to the same n-gram in
> > the dictionary.file-0.
> >
> > Is there an easy way to achieve this with pre-built Mahout
> > functionality? The only solution that comes to mind is to write a
> > MapReduce program that parses through the tfidf-vectors after running
> > seq2sparse and sorts the vectors into separate training and test sets
> > based on some variable I put in the vector name.
> >
> > Thanks,
> > Adam
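For reference, a rough sketch of the mapper swap Dan describes: read the <Text, VectorWritable> pairs that seq2sparse writes under tfidf-vectors/ and keep only the documents whose key falls in a given year range, running the job once for the in-sample years (training) and once for the out-of-sample years (test). The class name, configuration keys, and the assumption that the year appears in the document name are all hypothetical, not existing Mahout code.

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.VectorWritable;

public class YearFilterMapper
    extends Mapper<Text, VectorWritable, Text, VectorWritable> {

  // Hypothetical configuration keys; set them on the job before launching.
  public static final String MIN_YEAR = "yearfilter.min";
  public static final String MAX_YEAR = "yearfilter.max";

  private static final Pattern YEAR = Pattern.compile("(19|20)\\d{2}");

  private int minYear;
  private int maxYear;

  @Override
  protected void setup(Context ctx) {
    minYear = ctx.getConfiguration().getInt(MIN_YEAR, 0);
    maxYear = ctx.getConfiguration().getInt(MAX_YEAR, Integer.MAX_VALUE);
  }

  @Override
  protected void map(Text docName, VectorWritable vector, Context ctx)
      throws IOException, InterruptedException {
    // Assumes the year was embedded in the document name before running
    // seq2sparse, e.g. "/2009/some-doc-id". Out-of-range records are dropped.
    Matcher m = YEAR.matcher(docName.toString());
    if (m.find()) {
      int year = Integer.parseInt(m.group());
      if (year >= minYear && year <= maxYear) {
        ctx.write(docName, vector);
      }
    }
  }
}

Wire it up with SequenceFileInputFormat, SequenceFileOutputFormat, and zero reduce tasks so the tf-idf vectors pass through unchanged. Since both splits come out of the same seq2sparse run, the DF counts and dictionary.file-0 keys stay consistent across training and test, which was the original concern.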
