Hello all,

Sorry to bother you again, but I've been banging my head against the wall for the last day and I can't seem to find the answer.
I'm trying to create a new tfidf vector (or probably many vectors) from a new directory using something like seq2sparse. However, I want to build these vectors from the dictionary and idf values of a previously processed directory. Say I created my vectors from the whole corpus, and now I want to compute tfidf vectors for a few new documents (more precisely, a few queries) that share the properties of that corpus.

I know that seq2sparse stores a dictionary and tf values in temporary folders. My first attempt was to modify DictionaryVectorizer and TFIDFConverter so that they use a dictionary and a df-count from a different directory. I seem to have had some luck with both, but now I'm getting an "index out of bound" exception. My guess is that some other class or job determines the size of some array based on the document source.

Do you guys have any ideas about what might be wrong? Or even better, do you know of a better way to generate a vector (i.e., a query vector) using previous matrix values (i.e., the index)?

Thanks a lot,
Julian

P.S.
The error I'm getting looks like this:

Apr 17, 2011 12:05:31 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0002
org.apache.mahout.math.IndexException: Index 517 is outside allowable range of [0,0)
        at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:392)
        at org.apache.mahout.math.SequentialAccessSparseVector.<init>(SequentialAccessSparseVector.java:69)
        at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.reduce(TFPartialVectorReducer.java:95)
        at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.reduce(TFPartialVectorReducer.java:50)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
Apr 17, 2011 12:05:31 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 100% reduce 0%
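
To make the goal concrete, here is a minimal sketch in plain Java (no Mahout classes) of the computation I'm after: weight a query's terms using a dictionary and df counts that were built from the original corpus. Everything in it is an assumption for illustration: the class name QueryVectorizer, the in-memory maps standing in for the stored dictionary and df-count files, and the textbook weight tf * ln(N / df), which is not necessarily the weighting Mahout applies. One thing it does make explicit is that the vector's dimension should come from the size of the old dictionary rather than from the new documents. The range [0,0) in the trace suggests that dimension ended up as 0 in my modified job, though that is only a guess.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout: builds a sparse tf-idf vector
// for a query from a dictionary and df counts computed on another corpus.
public class QueryVectorizer {

    // Returns termId -> weight. Tokens missing from the dictionary are skipped,
    // so no index can fall outside [0, dictionary.size()).
    static Map<Integer, Double> vectorize(List<String> queryTokens,
                                          Map<String, Integer> dictionary,
                                          Map<Integer, Long> dfCounts,
                                          long numDocs) {
        // Term frequencies of the query, keyed by the corpus term id.
        Map<Integer, Integer> tf = new HashMap<>();
        for (String token : queryTokens) {
            Integer termId = dictionary.get(token);
            if (termId == null) {
                continue; // out-of-vocabulary term: no column exists for it
            }
            tf.merge(termId, 1, Integer::sum);
        }

        // Assumed textbook weighting: tf * ln(numDocs / df).
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : tf.entrySet()) {
            long df = dfCounts.getOrDefault(e.getKey(), 0L);
            if (df <= 0) {
                continue; // no df recorded for this id: cannot compute idf
            }
            vector.put(e.getKey(), e.getValue() * Math.log((double) numDocs / df));
        }
        return vector;
    }

    public static void main(String[] args) {
        // Toy stand-ins for the dictionary and df-count written by the first run.
        Map<String, Integer> dictionary = Map.of("mahout", 0, "vector", 517);
        Map<Integer, Long> dfCounts = Map.of(0, 10L, 517, 100L);
        long numDocs = 1000L;

        Map<Integer, Double> query = vectorize(
                List.of("vector", "vector", "unknown"), dictionary, dfCounts, numDocs);

        // The dimension comes from the old dictionary, not from the query.
        System.out.println("cardinality = " + dictionary.size());
        System.out.println("non-zero entries = " + query);
    }
}
```

In a real run the two maps would be loaded from the dictionary and df-count output of the earlier job instead of being hard-coded.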
