Very generally speaking, RowSimilarityJob starts with a matrix A',
transposes it back to A and computes A'A (with some slight modifications
that allow the embedding of similarity measures).

The way this multiplication is done is very similar to Jake's "outer
column" trick aka the column picture of matrix multiplication.
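
To illustrate the column picture, here is a minimal sketch in plain
Python. It is not Mahout's actual code; the function name and the
list-of-dicts layout for sparse rows are assumptions made for the
example. A'A is built up as the sum of the outer products of each row of
A with itself, so a row with n non-zero entries touches n*n cells.

```python
from collections import defaultdict

def ata_by_outer_products(rows):
    """Compute A'A from the sparse rows of A.

    rows: list of dicts {column_index: value}, one dict per row of A
          (hypothetical layout, chosen for illustration).
    Returns a dict {(i, j): value} holding the non-zero cells of A'A.
    """
    result = defaultdict(float)
    for row in rows:
        # Outer product of the row with itself; cost is len(row) ** 2.
        for i, vi in row.items():
            for j, vj in row.items():
                result[(i, j)] += vi * vj
    return dict(result)

# A = [[1, 2, 0],
#      [0, 3, 1]]
A = [
    {0: 1.0, 1: 2.0},
    {1: 3.0, 2: 1.0},
]
ata = ata_by_outer_products(A)
```

Each row can be processed independently and the partial products summed
afterwards, which is what makes this formulation attractive for a
distributed job.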

The crucial things to look at are the extremely long rows of A, which
correspond to the power users in recommendation lingo. Of course the
same problems arise in other domains such as document similarities, where
terms with a high document frequency would slow down the processing,
and techniques such as throwing away the 1% of terms with the highest df
are applied.
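
A rough sketch of that kind of pruning, again in plain Python and not
taken from Mahout: the function name, the dict-of-dicts layout and the
choice to round the number of dropped rows up are all assumptions made
for illustration.

```python
import math

def drop_longest_rows(rows, fraction=0.01):
    """Drop the given fraction of rows that have the most non-zero entries.

    rows: dict {row_id: {column_index: value}} (hypothetical layout).
    fraction: share of rows to discard, e.g. 0.01 for the top 1%.
    """
    # Rounding up is an assumption; it drops at least one row
    # whenever fraction > 0 and rows is non-empty.
    n_drop = math.ceil(len(rows) * fraction)
    by_length = sorted(rows, key=lambda r: len(rows[r]), reverse=True)
    dropped = set(by_length[:n_drop])
    return {r: v for r, v in rows.items() if r not in dropped}

# 100 rows, where row i has i + 1 non-zero entries.
rows = {i: {j: 1.0 for j in range(i + 1)} for i in range(100)}
pruned = drop_longest_rows(rows, fraction=0.01)
```

Because the cost of a row grows quadratically with its length, removing
even a small share of the longest rows can cut the work considerably.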

--sebastian

On 18.10.2011 10:24, Dan Brickley wrote:
> 2011/10/18 Sebastian Schelter <[email protected]>:
>> Hi Ramon,
>>
>> my first suggestion would be to use Mahout 0.6 as significant
>> improvements have been made to RowSimilarityJob and the 0.5 version has
>> known bugs.
>>
>> The runtime of RowSimilarityJob is not only determined by the size of
>> the input but also by the distribution of the interactions among the
>> users.
> 
> As an aside, I've noticed this 'users' terminology lurking in the
> background of RowSimilarityJob (eg. in JIRA discussion).
> 
> My use of it last week seemed perfectly reasonable; but rows were
> books (or bibliographic records), with feature columns from library
> topic codes. Does the 'user' terminology suggest it's really focussed
> on recommendations?
> 
> I'm used to seeing this in the Taste part of Mahout, where sometimes
> it's suggested we can re-use recommender pieces by eg. thinking more
> broadly and 'recommending topics to books' or vice versa. This makes
> sense but introduces an extra layer of conceptual confusion. Is there
> any important sense in which rows (or columns?) in RowSimilarityJob
> ought to be thought of as users? Or the values/weights as preferences?
> 
> cheers,
> 
> Dan
