Hello again,

Looks like I figured out what the problem was. I was supposed to be using df-count, and not frequency.file-0. df-count does have a key of -1 with a value that looks like the total number of documents.
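In case it helps anyone who finds this thread later, this is roughly how I'm loading it now. A minimal sketch, assuming seq2sparse wrote its output under output/df-count/part-r-00000 (adjust the path and part file name to your run):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;

    // Load term document frequencies from df-count. Key -1 holds the
    // total document count; every other key is a term id from the
    // dictionary.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
    SequenceFile.Reader reader = new SequenceFile.Reader(fs,
        new Path("output/df-count/part-r-00000"), conf);
    IntWritable key = new IntWritable();
    LongWritable value = new LongWritable();
    while (reader.next(key, value)) {
      documentFrequency.put(key.get(), value.get());
    }
    reader.close();

    int documentCount = documentFrequency.get(-1).intValue(); // no NPE now

With documentCount coming from the -1 entry, the tfidf.calculate call from my earlier excerpt works without the NPE.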
Thanks again for the responses.

On Tue, Jul 29, 2014 at 2:22 PM, Jonathan Cooper-Ellis <[email protected]> wrote:

> Hi Suneel,
>
> Thanks for the response. Yes, I'm trying to determine it from the output
> of seq2sparse. Here's the relevant excerpt from my code:
>
> // Create a vector of wordId => weight using tf-idf.
> Vector vector = new RandomAccessSparseVector(10000);
> TFIDF tfidf = new TFIDF();
> int documentCount = documentFrequency.get(-1).intValue(); // THIS IS THROWING NPE
> for (Multiset.Entry<String> entry : words.entrySet()) {
>   String word = entry.getElement();
>   int count = entry.getCount();
>   Integer wordId = dictionary.get(word);
>   Long freq = documentFrequency.get(wordId);
>   double tfIdfValue = tfidf.calculate(count, freq.intValue(), wordCount, documentCount);
>   vector.setQuick(wordId, tfIdfValue);
> }
>
> I'm working off this tutorial:
> http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/
>
> On Tue, Jul 29, 2014 at 1:50 PM, Suneel Marthi <[email protected]> wrote:
>
>> I have been silently following this discussion for some time now.
>> Jonathan, if I understand you right, you are trying to determine the
>> number of docs in your corpus. Correct?
>>
>> One of the artifacts from seq2sparse should have the doc count; I'm not
>> sure which one off the top of my head, and I am not in front of a
>> computer.
>>
>> The other quick way to determine the number of docs would be to take the
>> tf-idf vectors generated and feed them as input to the RowId job. The
>> output of the RowId job is a matrix and a docIndex:
>>
>> docIndex - mapping of document names to integer ids
>> matrix - M x N matrix of M documents and N feature vectors
>>
>> docIndex should tell you the number of documents in your corpus.
>>
>> This is a quick and dirty way of doing it; I am sure there is a way to
>> infer that from the output of seq2sparse itself (but I am not in front
>> of my computer now).
>>
>> On Tue, Jul 29, 2014 at 10:40 AM, Jonathan Cooper-Ellis <[email protected]> wrote:
>>
>>> Hi Vaibhav,
>>>
>>> Thanks for the reply. It doesn't look like the total count of keys in
>>> frequency.file-0 corresponds to the number of documents, because I only
>>> used a couple hundred documents to build the model and there are
>>> thousands of keys in frequency.file-0. Am I misunderstanding something?
>>>
>>> On Tue, Jul 29, 2014 at 1:15 PM, vaibhav srivastava <[email protected]> wrote:
>>>
>>>> Hi, if I am correct, you want to know the number of documents by
>>>> reading frequency.file-0. You can use the SequenceFileReader to load
>>>> the frequency file and then count the number of keys; that will give
>>>> you the number of documents.
>>>>
>>>> Hope this helps,
>>>> Thanks,
>>>> vaibhav
>>>>
>>>> On Tue, Jul 29, 2014 at 10:32 PM, Jonathan Cooper-Ellis <[email protected]> wrote:
>>>>
>>>>> Hey guys,
>>>>>
>>>>> I'm trying to build a Bayesian classifier, but I'm having a hard time
>>>>> figuring out how to programmatically determine the value of the
>>>>> numDocs param for the calculate method in TFIDF, using the files
>>>>> generated by building the model on the command line.
>>>>>
>>>>> I saw some code that did it like this:
>>>>>
>>>>> int numDocs = documentFrequency.get(-1).intValue();
>>>>>
>>>>> where documentFrequency is a HashMap<Integer, Long> read from
>>>>> frequency.file-0, but there's no key -1 in the file, so it's giving
>>>>> me an NPE when I try to pass that to tfidf.calculate.
>>>>>
>>>>> Anyone know what I'm doing wrong?
>>>>>
>>>>> Best,
>>>>>
>>>>> jce
>>>>
>>>> --
>>>> Thanks and Regards,
>>>> Vaibhav Srivastava
>>>> Email-id: [email protected]
>>>> Mobile no.: 9552543029
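P.S. For completeness, Suneel's RowId suggestion works too. A rough sketch, assuming the RowId job wrote its output under a "rowid" directory; docIndex is a sequence file whose keys are the generated row ids (IntWritable) and whose values are the original document names (Text):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Count the entries in docIndex; there is one entry per document
    // in the corpus, so the count is numDocs.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs,
        new Path("rowid/docIndex"), conf);
    IntWritable rowId = new IntWritable();
    Text docName = new Text();
    int numDocs = 0;
    while (reader.next(rowId, docName)) {
      numDocs++;
    }
    reader.close();

Reading the -1 key from df-count is simpler, though, since it avoids running an extra job.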
