I have been silently following this discussion for some time now. Jonathan, if
I understand you right, you are trying to determine the number of documents in
your corpus. Correct?

One of the artifacts from seq2sparse should have the doc count, but I am not
sure which one off the top of my head and I am not in front of a computer.

The other quick way to determine the number of documents would be to take the
tf-idf vectors generated and feed them as input to the RowId job.
The RowId job produces two outputs, matrix and docIndex:

docIndex - mapping between document names and integer row IDs
matrix - M x N matrix of M documents (rows) and N features (columns)

docIndex should tell you the number of documents in your corpus, since it has
one entry per document.

This is a quick and dirty way of doing it. I am sure there's a way to infer
that from the output of seq2sparse itself (but I am not in front of my
computer now).
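
In the meantime, here is a minimal sketch of how the lookup could be guarded
so that it does not throw an NPE when key -1 is missing. It uses the stored
count when the key exists and otherwise falls back to the number of entries
counted in docIndex. The class name, the resolveNumDocs helper and the sample
numbers are all made up for illustration; reading the actual sequence files
is left out, so treat this as an assumption-laden outline rather than the
way Mahout does it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of Mahout.
class NumDocsSketch {

    // Key used for the total document count in the snippet quoted below in
    // this thread; your files may or may not contain it.
    static final int TOTAL_DOCS_KEY = -1;

    /**
     * Returns the count stored under key -1 if it is present; otherwise
     * falls back to the number of entries counted in docIndex.
     */
    public static int resolveNumDocs(Map<Integer, Long> documentFrequency,
                                     int docIndexEntries) {
        // Map.get returns null for a missing key; calling intValue() on that
        // null is what produces the NullPointerException.
        Long stored = documentFrequency.get(TOTAL_DOCS_KEY);
        if (stored != null) {
            return stored.intValue();
        }
        return docIndexEntries;
    }

    public static void main(String[] args) {
        // Made-up data: feature ID -> document frequency, with no -1 key.
        Map<Integer, Long> documentFrequency = new HashMap<>();
        documentFrequency.put(0, 12L);
        documentFrequency.put(1, 3L);

        // Made-up number of records counted in docIndex.
        int docIndexEntries = 200;

        System.out.println(resolveNumDocs(documentFrequency, docIndexEntries));
    }
}
```

The value returned here is what you would then pass as numDocs to
tfidf.calculate.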



On Tue, Jul 29, 2014 at 10:40 AM, Jonathan Cooper-Ellis <[email protected]>
wrote:

> Hi Vaibhav,
>
> Thanks for the reply. It doesn't look like the total count of keys in
> frequency.file-0 corresponds to the number of documents, because I only
> used a couple hundred documents to build the model and there are thousands
> of keys in frequency.file-0. Am I misunderstanding something?
>
>
> On Tue, Jul 29, 2014 at 1:15 PM, vaibhav srivastava <
> [email protected]>
> wrote:
>
> > Hi, if I am correct you want to know the number of documents by reading
> > frequency.file-0. You can use SequenceFile.Reader to load the frequency
> > file and then count the number of keys; that will give you the number of
> > documents.
> > Hope this helps,
> > Thanks,
> > vaibhav
> >
> >
> > On Tue, Jul 29, 2014 at 10:32 PM, Jonathan Cooper-Ellis <[email protected]>
> > wrote:
> >
> > > Hey guys,
> > >
> > > I'm trying to make a Bayesian classifier, but I'm having a hard time
> > > figuring out how to programmatically determine the value of the numDocs
> > > param for the calculate method in TFIDF, using the files generated when
> > > building the model on the command line.
> > >
> > > I saw some code that did it like this:
> > >
> > > int numDocs = documentFrequency.get(-1).intValue();
> > >
> > > Where documentFrequency is a HashMap<Integer,Long> read from
> > > frequency.file-0, but there's no key -1 in the file so it's giving me an
> > > NPE when I try to pass that to tfidf.calculate.
> > >
> > > Anyone know what I'm doing wrong?
> > >
> > >
> > > Best,
> > >
> > > jce
> > >
> >
> >
> >
> > --
> > Thanks and Regards,
> > Vaibhav Srivastava
> > Email-id: [email protected]
> > Mobile no.: 9552543029
> >
>
