Thanks, Daniel! I hadn't realized that, but it makes perfect sense now. I'll
take a look at the code to account for those cases.
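
A minimal sketch of one way to account for those cases, in plain Java rather
than the actual Mahout classes. Everything here is an assumption made for
illustration: the class name QueryVectorizer, the method tfidf, the in-memory
maps standing in for the stored dictionary and df-count, and the textbook
weight tf * ln(N / df), which is not necessarily what TFIDFConverter computes.
The point is only that terms missing from the training dictionary are skipped
instead of being looked up blindly.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout.
public class QueryVectorizer {

    /**
     * Builds a sparse (term index -> weight) map for one tokenized query,
     * using a dictionary and document frequencies from an earlier run.
     */
    static Map<Integer, Double> tfidf(List<String> tokens,
                                      Map<String, Integer> dictionary,
                                      Map<Integer, Long> docFreq,
                                      long numDocs) {
        // Term frequencies, keyed by the index from the training dictionary.
        Map<Integer, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            Integer index = dictionary.get(token);
            if (index == null) {
                continue; // term never seen in training: skip it
            }
            tf.merge(index, 1, Integer::sum);
        }

        // Weight each known term; TreeMap keeps indices in ascending order.
        Map<Integer, Double> weights = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : tf.entrySet()) {
            // Assumption: fall back to df = 1 if a count is missing.
            long df = docFreq.getOrDefault(e.getKey(), 1L);
            double idf = Math.log((double) numDocs / df);
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("apple", 0);
        dictionary.put("pie", 1);

        Map<Integer, Long> docFreq = new HashMap<>();
        docFreq.put(0, 2L);
        docFreq.put(1, 5L);

        // "zebra" is not in the dictionary and is silently dropped.
        System.out.println(
            tfidf(List.of("apple", "apple", "zebra"), dictionary, docFreq, 10L));
    }
}
```

If the result is copied into a real vector, its cardinality would presumably
need to be the size of the old dictionary; the range [0,0) in the trace below
suggests a vector that was created with size 0.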

Julian

2011/4/17 Daniel McEnnis <[email protected]>

> Julian,
>
> You're using a dictionary that has only the values seen in the
> training set.  Once you execute with a different document, you may
> have entries that are present in the new set but not in the old.
> Unless you deal with this case specifically, they will generate
> IndexOutOfBounds or NullPointer errors depending on how you implement
> the dictionary.
>
> Daniel
>
> On Sun, Apr 17, 2011 at 3:09 AM, Julian Limon <[email protected]> wrote:
> > Hello all,
> >
> > Sorry to bother you again, but I've been banging my head against the wall
> > for the last day and I can't seem to find the answer.
> >
> > I'm trying to create a new tfidf vector (or probably many vectors) out of a
> > new directory using something like seq2sparse. However, I want to create
> > these vectors based on the dictionary and idf values of a previously
> > executed directory. Let's say that I created my vectors using the whole
> > corpus and now I want to calculate new tfidf vectors for a few documents (or
> > more exactly, a few queries) that share the properties of the previous
> > corpus.
> >
> > I know that seq2sparse stores a dictionary and tf values in temporary
> > folders. My first attempt was to modify DictionaryVectorizer and
> > TFIDFConverter to have them use a dictionary and a df-count from a different
> > directory. So far it seems that I had some luck with both, but now I'm
> > getting an "index out of bound" exception. My guess is that some other class
> > or job determines the size of some array based on the document source.
> >
> > Do you guys have any ideas about what might be wrong? Or even better, do you
> > guys know of a better way to generate a vector (i.e., a query vector) using
> > previous matrix values (i.e., the index)?
> >
> > Thanks a lot,
> >
> > Julian
> >
> > P.S. The error I'm getting looks like this:
> >
> > Apr 17, 2011 12:05:31 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
> > WARNING: job_local_0002
> > org.apache.mahout.math.IndexException: Index 517 is outside allowable range of [0,0)
> > at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:392)
> > at org.apache.mahout.math.SequentialAccessSparseVector.<init>(SequentialAccessSparseVector.java:69)
> > at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.reduce(TFPartialVectorReducer.java:95)
> > at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.reduce(TFPartialVectorReducer.java:50)
> > at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
> > at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
> > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
> > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> > Apr 17, 2011 12:05:31 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > INFO:  map 100% reduce 0%
> >
>
