Create vector using existing dictionary and IDF values

Julian Limon Sun, 17 Apr 2011 00:09:37 -0700

Hello all,

Sorry to bother you again, but I've been banging my head against the wall
for the last day and I can't seem to find the answer.


I'm trying to create a new tfidf vector (or probably many vectors) from a
new directory using something like seq2sparse. However, I want to create
these vectors based on the dictionary and idf values of a previously
processed directory. Let's say that I created my vectors using the whole
corpus and now I want to calculate new tfidf vectors for a few documents (or,
more precisely, a few queries) that share the properties of the previous
corpus.
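
A minimal sketch of the computation described above, in plain Python rather
than Mahout's API. It assumes the textbook weighting tf * log(N / df), which
may differ from the weighting seq2sparse applies; the function name, the toy
dictionary and the document counts are all hypothetical.

```python
import math
from collections import Counter


def tfidf_query_vector(tokens, dictionary, doc_freq, num_docs):
    """Return a sparse {term_index: weight} vector for one query.

    dictionary: term -> index, taken from the ORIGINAL corpus run
    doc_freq:   index -> number of corpus documents containing the term
    num_docs:   number of documents in the original corpus
    """
    # Terms that are not in the existing dictionary have no index, so skip them.
    counts = Counter(t for t in tokens if t in dictionary)
    vector = {}
    for term, tf in counts.items():
        idx = dictionary[term]
        df = doc_freq.get(idx, 0)
        if df == 0:
            continue  # no document frequency recorded for this term
        # Assumed weighting: raw term frequency times log(N / df).
        vector[idx] = tf * math.log(num_docs / df)
    return vector


# Hypothetical values standing in for the stored dictionary and df-count.
dictionary = {"apache": 0, "mahout": 1, "vector": 2}
doc_freq = {0: 10, 1: 2, 2: 5}
query = ["mahout", "vector", "vector", "hadoop"]

vec = tfidf_query_vector(query, dictionary, doc_freq, num_docs=10)
print(vec)
```

The query is never used to build or extend the dictionary; its terms are only
looked up, so the resulting indices stay compatible with the corpus vectors.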

I know that seq2sparse stores a dictionary and tf values in temporary
folders. My first attempt was to modify DictionaryVectorizer and
TFIDFConverter to have them use a dictionary and a df-count from a different
directory. So far it seems that I had some luck with both, but now I'm
getting an "index out of bounds" exception. My guess is that some other class
or job determines the size of some array based on the document source.
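
A toy illustration of that guess, again in plain Python and not Mahout code.
The class below is a hypothetical stand-in for a fixed-cardinality sparse
vector; it only shows that an index taken from the old dictionary fails if
the vector's size comes from somewhere else and ends up as 0, which is what
the range [0,0) in the error quoted below would correspond to.

```python
class FixedSizeSparseVector:
    """Hypothetical sparse vector that rejects indices beyond its cardinality."""

    def __init__(self, cardinality):
        self.cardinality = cardinality
        self.values = {}

    def set(self, index, value):
        if not 0 <= index < self.cardinality:
            raise IndexError(
                f"Index {index} is outside allowable range of [0,{self.cardinality})"
            )
        self.values[index] = value


def build_vector(term_indices, cardinality):
    """Create a vector of the given size and mark each term index with 1.0."""
    vec = FixedSizeSparseVector(cardinality)
    for idx in term_indices:
        vec.set(idx, 1.0)
    return vec


old_dictionary_size = 1000  # hypothetical size of the existing dictionary

# Sized from the old dictionary: index 517 fits.
ok = build_vector([517], old_dictionary_size)

# Sized as 0 (for example, if the size was never set for the new run): fails.
try:
    build_vector([517], 0)
    error = None
except IndexError as exc:
    error = str(exc)

print(error)
```

If this is what is happening, the vector size would have to come from the
existing dictionary rather than from anything computed on the new documents.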

Do you guys have any ideas about what might be wrong? Or even better, do you
guys know of a better way to generate a vector (i.e., a query vector) using
previous matrix values (i.e., the index)?

Thanks a lot,

Julian

P.S. The error I'm getting looks like this:

Apr 17, 2011 12:05:31 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0002
org.apache.mahout.math.IndexException: Index 517 is outside allowable range of [0,0)
at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:392)
at org.apache.mahout.math.SequentialAccessSparseVector.<init>(SequentialAccessSparseVector.java:69)
at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.reduce(TFPartialVectorReducer.java:95)
at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.reduce(TFPartialVectorReducer.java:50)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
Apr 17, 2011 12:05:31 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 100% reduce 0%
