Thank you, Andrew, for your input. I will try the example in Scala. So the 20-newsgroups example cannot be used to test the model against other data sets once the split is done, is that right?
Thanks,
Alok Tanna

On Thu, Jan 14, 2016 at 4:26 PM, Andrew Palumbo <ap....@outlook.com> wrote:

> The poor results you are seeing in testing are because you've run
> seq2sparse on each set independently. This creates two different
> dictionaries, each of which serves as the vector index for the terms in
> its own vocabulary. You must use the same dictionary that you trained your
> model on to vectorize your holdout set. There is an example of doing this in
> Scala, using the new Mahout Samsara environment, here:
>
> http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html
>
> See the "Define a function to tokenize and vectorize new text using our
> current dictionary" section.
>
> ________________________________________
> From: Alok Tanna <tannaa...@gmail.com>
> Sent: Thursday, January 14, 2016 2:31 PM
> To: user@mahout.apache.org
> Subject: Mahout : 20-newsgroups Classification Example : Split command
>
> Hi,
>
> This request is in reference to the 20-newsgroups Classification Example at
> the link below:
> https://mahout.apache.org/users/classification/twenty-newsgroups.html
>
> I am able to run the example and get the results mentioned in the link,
> but when I try to run the example without the split command, the
> results are not the same. Also, when I run other test data against
> the same model, the results are not accurate.
>
> Can we run this example without the split command?
>
> Basically, I am trying to do this:
>
> I took two datasets, one for training and one for testing.
>
> Run the commands below on both sets:
> 1. seqdirectory
> 2. seq2sparse
>
> Now I have vectors generated for both datasets.
> - Run the trainnb command using the first dataset's vector output. So instead
>   of training a model on 80% of the data, I am using the whole dataset.
> - Run the testnb command using the second dataset's vector output. This is not
>   the 20% of the data; it's a completely new dataset, used solely for testing.
>
> So instead of using mahout split, I have specified a separate dataset for
> testing the model.
>
> The results for this exercise are totally different from what I get when I
> use the split command to split the data.
>
> Thanks & Regards,
>
> Alok R. Tanna

--
Thanks & Regards,

Alok R. Tanna
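
For illustration, here is a minimal sketch of the dictionary mismatch described in the quoted reply. It is plain Python, not Mahout code: `build_dictionary` and `vectorize` are hypothetical helpers, the documents are made up, and the term-to-index scheme (order of first appearance) is an assumption that only stands in for whatever seq2sparse actually does.

```python
def build_dictionary(docs):
    """Assign each distinct term an integer index in order of first appearance."""
    dictionary = {}
    for doc in docs:
        for term in doc.lower().split():
            if term not in dictionary:
                dictionary[term] = len(dictionary)
    return dictionary


def vectorize(doc, dictionary):
    """Return a sparse term-count vector {index: count}; unknown terms are dropped."""
    vector = {}
    for term in doc.lower().split():
        index = dictionary.get(term)
        if index is not None:
            vector[index] = vector.get(index, 0) + 1
    return vector


# Made-up toy corpora standing in for the two independently processed datasets.
train_docs = ["hockey game tonight", "space shuttle launch"]
test_docs = ["shuttle launch delayed", "hockey game score"]

train_dict = build_dictionary(train_docs)  # dictionary the model is trained on
test_dict = build_dictionary(test_docs)    # second, independent dictionary

# Mismatched: the test document is indexed with its own dictionary.
mismatched = vectorize(test_docs[0], test_dict)
# Consistent: the test document is indexed with the training dictionary.
consistent = vectorize(test_docs[0], train_dict)

print("index of 'shuttle':", train_dict["shuttle"], "vs", test_dict["shuttle"])
print("mismatched:", mismatched)
print("consistent:", consistent)
```

In this toy case the same term lands on different indices in the two dictionaries, so the mismatched vector sets the positions that the training dictionary assigns to "hockey", "game", and "tonight", even though the document is about a shuttle launch. Reusing the training dictionary keeps the indices aligned and simply drops terms the model has never seen.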