RowSimilarity incorrectly setting 'size'

Anna Lahoud Fri, 07 Sep 2012 15:49:56 -0700

I am running a RowSimilarityJob with a large dataset. When I call
Vector.size() on the resulting vector, it always returns Integer.MAX_VALUE.
At first I thought I might really have ended up with a cardinality that
exceeded the range of an int. Upon further checking, I found that the
cardinality of the rowid vectors was correct. Only the vectors produced
after the RowId job have an invalid size.


I did some looking into the job's temp directory (which in my Mahout V0.6
still exists after the job). Both the cooccurrence and the weight outputs
are also set to size=Integer.MAX_VALUE.

In searching for the problem, I found that in the VectorNormMapper, which is
the first of the three jobs that run, the vector is created with the
following line:

RandomAccessSparseVector partialColumnVector = new
RandomAccessSparseVector(Integer.MAX_VALUE);

which sets the size of the vector to Integer.MAX_VALUE. I believe that
value is then carried through to the remaining vectors throughout the jobs.
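
To make the distinction concrete, here is a minimal sketch using a
hypothetical stand-in class (SketchSparseVector is made up for illustration
and is not Mahout code). It assumes, as Mahout's Vector interface does, that
size() reports the cardinality declared at construction, while a separate
count tracks the entries actually stored. The helper names numNonZero() and
maxIndexPlusOne() are my own.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a sparse vector. NOT Mahout's RandomAccessSparseVector.
final class SketchSparseVector {
    // Declared dimension, fixed at construction time.
    private final int cardinality;
    // Only non-zero entries are stored, keyed by index.
    private final Map<Integer, Double> values = new HashMap<Integer, Double>();

    SketchSparseVector(int cardinality) {
        this.cardinality = cardinality;
    }

    void set(int index, double value) {
        if (index < 0 || index >= cardinality) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        if (value == 0.0) {
            values.remove(index);
        } else {
            values.put(index, value);
        }
    }

    // Reports the declared cardinality, regardless of what is stored.
    int size() {
        return cardinality;
    }

    // Number of entries actually stored (illustrative helper).
    int numNonZero() {
        return values.size();
    }

    // Smallest cardinality that would still hold every stored entry
    // (illustrative helper); 0 for an empty vector.
    int maxIndexPlusOne() {
        int max = -1;
        for (int index : values.keySet()) {
            max = Math.max(max, index);
        }
        return max + 1;
    }
}

class SizeDemo {
    public static void main(String[] args) {
        // Same pattern as the line quoted from VectorNormMapper.
        SketchSparseVector v = new SketchSparseVector(Integer.MAX_VALUE);
        v.set(3, 1.5);
        v.set(42, 0.25);

        System.out.println(v.size());            // 2147483647
        System.out.println(v.numNonZero());      // 2
        System.out.println(v.maxIndexPlusOne()); // 43
    }
}
```

If the real vectors behave the same way, then size() on the job output says
nothing about the data, and a caller who needs a meaningful number would
have to look at the count of stored entries (getNumNondefaultElements() in
Mahout) or at the largest index present.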

I don't know if this is a known bug or not.

Thanks,

Anna
