Thanks, Daniel! I hadn't realized that, but it makes perfect sense now. I'll take a look at the code to account for those cases.

Julian

2011/4/17 Daniel McEnnis <[email protected]>

> Julian,
>
> You're using a dictionary that has only the values seen in the
> training set. Once you execute with a different document, you may
> have entries that are present in the new set but not in the old.
> Unless you deal with this case specifically, they will generate
> IndexOutOfBounds or NullPointer errors depending on how you implement
> the dictionary.
>
> Daniel
>
> On Sun, Apr 17, 2011 at 3:09 AM, Julian Limon <[email protected]> wrote:
> > Hello all,
> >
> > Sorry to bother again, but I've been hitting my head against the wall
> > for the last day and I don't seem to find the answer.
> >
> > I'm trying to create a new tfidf vector (or probably many vectors) out
> > of a new directory using something like seq2sparse. However, I want to
> > create these vectors based on the dictionary and idf values of a
> > previously executed directory. Let's say that I created my vectors
> > using the whole corpus and now I want to calculate new tfidf vectors
> > for a few documents (or more exactly, a few queries) that share the
> > properties of the previous corpus.
> >
> > I know that seq2sparse stores a dictionary and tf values in temporary
> > folders. My first attempt was to modify DictionaryVectorizer and
> > TFIDFConverter to have them use a dictionary and a df-count from a
> > different directory. So far it seems that I had some luck with both,
> > but now I'm getting "index out of bound" exception. My guess is that
> > some other class or job determines the size of some array based on the
> > document source.
> >
> > Do you guys have any ideas about what might be wrong? Or even better,
> > do you guys know of a better way to generate a vector (i.e., a query
> > vector) using previous matrix values (i.e., the index)?
> >
> > Thanks a lot,
> >
> > Julian
> >
> > P.S. The error I'm getting looks like this:
> >
> > Apr 17, 2011 12:05:31 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
> > WARNING: job_local_0002
> > org.apache.mahout.math.IndexException: Index 517 is outside allowable range of [0,0)
> >     at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:392)
> >     at org.apache.mahout.math.SequentialAccessSparseVector.<init>(SequentialAccessSparseVector.java:69)
> >     at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.reduce(TFPartialVectorReducer.java:95)
> >     at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.reduce(TFPartialVectorReducer.java:50)
> >     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
> >     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
> >     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> > Apr 17, 2011 12:05:31 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> > INFO: map 100% reduce 0%
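
For reference, the case Daniel describes (query terms that are absent from the training dictionary) could be handled roughly as in the sketch below. It is plain Java and does not use any Mahout classes: `QueryVectorizer`, its fields, and the `tf * ln(N / df)` weighting are all assumptions made for illustration, and Mahout's own TFIDF weighting may differ. The idea is only that unseen terms are skipped instead of being looked up blindly, and that the vector size is derived from the existing dictionary instead of from the new documents.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of Mahout): builds a sparse tf-idf query vector. */
public class QueryVectorizer {
    private final Map<String, Integer> dictionary; // term -> index, from the training run
    private final Map<Integer, Long> docFreq;      // index -> document frequency
    private final long numDocs;                    // number of documents in the training corpus

    public QueryVectorizer(Map<String, Integer> dictionary,
                           Map<Integer, Long> docFreq,
                           long numDocs) {
        this.dictionary = dictionary;
        this.docFreq = docFreq;
        this.numDocs = numDocs;
    }

    /** Vector size taken from the training dictionary (largest index + 1). */
    public int cardinality() {
        int max = -1;
        for (int index : dictionary.values()) {
            max = Math.max(max, index);
        }
        return max + 1;
    }

    /** Returns index -> weight; terms missing from the dictionary are ignored. */
    public Map<Integer, Double> vectorize(List<String> tokens) {
        Map<Integer, Integer> termFreq = new HashMap<>();
        for (String token : tokens) {
            Integer index = dictionary.get(token);
            if (index == null) {
                continue; // unseen term: skip instead of failing on a null or bad index
            }
            termFreq.merge(index, 1, Integer::sum);
        }
        Map<Integer, Double> weights = new TreeMap<>();
        for (Map.Entry<Integer, Integer> entry : termFreq.entrySet()) {
            long df = docFreq.getOrDefault(entry.getKey(), 0L);
            if (df <= 0) {
                continue; // no document frequency recorded: skip to avoid dividing by zero
            }
            double idf = Math.log((double) numDocs / df); // assumed weighting, see note above
            weights.put(entry.getKey(), entry.getValue() * idf);
        }
        return weights;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("apple", 0);
        dictionary.put("banana", 1);
        Map<Integer, Long> docFreq = new HashMap<>();
        docFreq.put(0, 1L);
        docFreq.put(1, 2L);

        QueryVectorizer vectorizer = new QueryVectorizer(dictionary, docFreq, 4L);
        // "cherry" is not in the dictionary and is silently dropped.
        System.out.println(vectorizer.vectorize(Arrays.asList("apple", "apple", "cherry", "banana")));
        System.out.println("cardinality = " + vectorizer.cardinality());
    }
}
```

Dropping unseen terms is one option among several; another would be to map them to a reserved "unknown" index, which would require that index to exist in the training dictionary as well.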
