I am running a RowSimilarityJob on a large dataset. When I call Vector.size() on a resulting vector, it always returns Integer.MAX_VALUE. At first I thought I might really have ended up with a cardinality that exceeded the int range, but on further checking I found that the cardinality of the vectors produced by the RowId job was correct. Only the vectors produced after the RowId job have an invalid size.
I looked into the job's temp directory (which, in my Mahout 0.6, still exists after the job finishes). The vectors in both the cooccurrence and the weight outputs also have size=Integer.MAX_VALUE. While searching for the cause, I found that in VectorNormMapper, which belongs to the first of the three jobs that run, the vector is created with the following line: RandomAccessSparseVector partialColumnVector = new RandomAccessSparseVector(Integer.MAX_VALUE); This sets the size of the vector to Integer.MAX_VALUE. I believe that value is then carried through to the remaining vectors in the later jobs. I don't know whether this is a known bug.
Thanks,
Anna
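To illustrate what I think is happening, here is a minimal plain-Java sketch. SparseVec is a hypothetical stand-in I wrote for this post, not Mahout's RandomAccessSparseVector, and effectiveCardinality() is my own helper name. It only shows how a sparse vector constructed with a declared cardinality of Integer.MAX_VALUE would keep reporting that value from size(), no matter how few entries it actually holds.

```java
import java.util.HashMap;
import java.util.Map;

public class SparseVectorSketch {

    // Hypothetical stand-in for a sparse vector; NOT Mahout's implementation.
    static final class SparseVec {
        private final int cardinality;                        // declared size, fixed at construction
        private final Map<Integer, Double> entries = new HashMap<>();

        SparseVec(int cardinality) {
            this.cardinality = cardinality;
        }

        void set(int index, double value) {
            if (index < 0 || index >= cardinality) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            if (value == 0.0) {
                entries.remove(index);                        // zeros are not stored
            } else {
                entries.put(index, value);
            }
        }

        // Declared cardinality, independent of how many entries are stored.
        int size() {
            return cardinality;
        }

        // Number of entries actually stored.
        int numNonZero() {
            return entries.size();
        }

        // Highest stored index + 1, or 0 for an empty vector (assumed helper).
        int effectiveCardinality() {
            int max = -1;
            for (int index : entries.keySet()) {
                max = Math.max(max, index);
            }
            return max + 1;
        }
    }

    public static void main(String[] args) {
        SparseVec v = new SparseVec(Integer.MAX_VALUE);
        v.set(3, 1.0);
        v.set(41, 2.5);
        System.out.println("size() = " + v.size());
        System.out.println("numNonZero() = " + v.numNonZero());
        System.out.println("effectiveCardinality() = " + v.effectiveCardinality());
    }
}
```

If the real vectors behave like this, the stored data would be intact and only the declared size would be misleading, so a workaround might be to derive the real dimension from the stored indices rather than from size().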
