Have been silently following this discussion for some time now. Jonathan, if I understand you right, you are trying to determine the number of docs in your corpus. Correct?
One of the artifacts from seq2sparse should have the doc count. I'm not sure which one off the top of my head, and I am not in front of a computer.

The other quick way to determine the number of docs would be to take the tf-idf vectors generated and feed them as input to the RowId job. The outputs of the RowId job are matrix and docIndex:

docIndex - mapping of document names to integer IDs
matrix - M * N matrix of M documents and N features

docIndex should tell you the number of documents in your corpus.

This is a quick and dirty way of doing it. I am sure there's a way to infer that from the output of seq2sparse itself (but I am not in front of my computer now).

On Tue, Jul 29, 2014 at 10:40 AM, Jonathan Cooper-Ellis <[email protected]> wrote:

> Hi Vaibhav,
>
> Thanks for the reply. It doesn't look like the total count of keys in
> frequency.file-0 corresponds to the number of documents, because I only
> used a couple hundred documents to build the model and there are thousands
> of keys in frequency.file-0. Am I misunderstanding something?
>
>
> On Tue, Jul 29, 2014 at 1:15 PM, vaibhav srivastava <[email protected]>
> wrote:
>
> > Hi, if I am correct you want to know the number of documents by reading
> > frequency.file-0. You can use the SequenceFileReader to load the
> > frequency file and then count the number of keys; that will give you the
> > number of documents.
> > Hope this helps,
> > Thanks,
> > vaibhav
> >
> >
> > On Tue, Jul 29, 2014 at 10:32 PM, Jonathan Cooper-Ellis <[email protected]>
> > wrote:
> >
> > > Hey guys,
> > >
> > > I'm trying to make a Bayesian classifier, but I'm having a hard time
> > > figuring out how to programmatically determine the value of the numDocs
> > > param for the calculate method in TFIDF, using the files generated
> > > building the model on the command line.
> > >
> > > I saw some code that did it like this:
> > >
> > > int numDocs = documentFrequency.get(-1).intValue();
> > >
> > > Where documentFrequency is a HashMap<Integer,Long> read from
> > > frequency.file-0, but there's no key -1 in the file so it's giving me
> > > an NPE when I try to pass that to tfidf.calculate.
> > >
> > > Anyone know what I'm doing wrong?
> > >
> > >
> > > Best,
> > >
> > > jce
> > >
> >
> >
> >
> > --
> > Thanks and Regards,
> > Vaibhav Srivastava
> > Email-id: [email protected]
> > Mobile no.: 9552543029
> >
>
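
P.S. On the NPE in the quoted snippet: documentFrequency.get(-1) returns null when the map has no key -1, and calling intValue() on that null is what throws. Below is a minimal sketch of a guarded lookup. It uses a plain HashMap as a stand-in for whatever you load from frequency.file-0, and the class and method names (NumDocsLookup, numDocs) are made up for illustration; they are not part of Mahout. It does not tell you where the count lives, it only turns the NPE into a readable error.

```java
import java.util.HashMap;
import java.util.Map;

public class NumDocsLookup {
    // Key the quoted snippet expects to hold the document count.
    static final int DOC_COUNT_KEY = -1;

    // Returns the document count stored under key -1, or fails with a
    // descriptive error instead of a NullPointerException.
    public static int numDocs(Map<Integer, Long> documentFrequency) {
        Long count = documentFrequency.get(DOC_COUNT_KEY);
        if (count == null) {
            throw new IllegalStateException(
                "No entry for key " + DOC_COUNT_KEY
                + "; the document count must come from another source");
        }
        // Fails loudly if the count does not fit in an int.
        return Math.toIntExact(count);
    }

    public static void main(String[] args) {
        // Stand-in data: feature id -> document frequency (made-up values).
        Map<Integer, Long> df = new HashMap<>();
        df.put(0, 12L);
        df.put(1, 7L);

        try {
            numDocs(df);
        } catch (IllegalStateException e) {
            System.out.println("Lookup failed: " + e.getMessage());
        }

        // Once the count is available under key -1, the lookup succeeds.
        df.put(DOC_COUNT_KEY, 200L);
        System.out.println("numDocs = " + numDocs(df));
    }
}
```

If the count turns out to be stored elsewhere (or you get it from docIndex as described above), you can pass that value to tfidf.calculate directly and skip the map lookup altogether.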
