This may not quite be true because RowSimilarityJob (RSJ) is able to take some liberties.

The origin of these liberties is that A'A can be viewed as a self join.  Thus, as each row of A is read, its cooccurrences can be emitted immediately.
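To make the self-join idea concrete, here is a minimal Python sketch (an illustration, not Mahout code): each sparse row of A, read once, emits every pair of columns present in that row, and summing those pairs over all rows yields the cooccurrence counts of A'A.

```python
from collections import Counter
from itertools import combinations

def cooccurrences(rows):
    """Count co-occurring column pairs (the nonzero pattern of A'A).

    `rows` is an iterable of sets of column ids, one set per sparse
    row of A.  Because A'A is a self join, each row alone yields its
    full contribution: every pair of columns present in that row
    co-occurs once, so pairs can be emitted as rows stream past.
    """
    counts = Counter()
    for row in rows:
        # Emit only the upper triangle; A'A is symmetric.
        for i, j in combinations(sorted(row), 2):
            counts[(i, j)] += 1
    return counts

# Columns b and c appear together in both rows, a and c in only one.
pairs = cooccurrences([{"a", "b", "c"}, {"b", "c"}])
```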

For B'A, we have to somehow get corresponding rows of A and B at the same time 
in the same place.  If both matrices are stored in sparse row-major form, then 
a map-side merge join would work at the cost of some locality.  You can recover 
that locality in special cases with a few tricks.  For instance, you might 
actually store A and B as adjoined rows, so that fetching a row of A 
inherently also gives the corresponding row of B.  I am not sure how such a 
layout would come about in practice, though.
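The merge join itself is simple once corresponding rows arrive together.  A small Python sketch (my illustration, with hypothetical names, assuming rows of B and A are streamed in the same row order): row k contributes the outer product of B's row with A's row, since (B'A)[i][j] = sum over k of B[k][i] * A[k][j].

```python
from collections import defaultdict

def btrans_a(rows_b, rows_a):
    """Compute B'A by a map-side merge join over aligned rows.

    Each argument is an iterable of sparse rows {column_id: value},
    with row k of B and row k of A arriving together.  Row k adds
    the outer product of the two rows:
        (B'A)[i][j] += B[k][i] * A[k][j]
    """
    result = defaultdict(float)
    for b_row, a_row in zip(rows_b, rows_a):
        for i, bi in b_row.items():
            for j, aj in a_row.items():
                result[(i, j)] += bi * aj
    return dict(result)
```

Note that nothing here needs random access: one synchronized sequential pass over both files suffices, which is exactly why co-locating the files matters.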

A second way to get the locality is to use a system like MapR (conflict of 
interest alert, vendor specific feature alert, yada yada).  In such a system, 
you can force files to be co-resident.  In MapR, this is done by setting chunk 
size to zero and storing A and B in the same volume.  That causes the volume 
to be stored in a single container, which forces all of the files in the 
volume to have exactly the same replication pattern.  It also means the 
volume does not scale as well.  When this is feasible, it can result in a 
massive speed improvement.  I know of one site that does this and reportedly 
achieves a 10-20x speedup because of the decrease in non-local reads.
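For reference, the setup looks roughly like this (a sketch assuming MapR's maprcli and hadoop mfs tools; the volume and path names are hypothetical, so check the MapR docs for your version):

```shell
# Hypothetical names; assumes MapR's maprcli and hadoop mfs tools.
# Create a volume to hold both matrices.
maprcli volume create -name cooc -path /cooc
# Chunk size zero keeps the volume's files in a single container.
hadoop mfs -setchunksize 0 /mapr/my.cluster.com/cooc
# Then store A and B under /cooc so they share that container's
# replication pattern and are always co-resident.
```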

A third option is to use a reduce-side join.  This would be necessary if A and 
B were not stored with rows in sequential order and were not randomly 
accessible either.  I would avoid this option if possible.
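In outline (again a Python illustration with hypothetical names, not Mahout code), the reduce-side variant tags each row with its matrix and row id, lets the shuffle bring matching row ids together, and has the reducer emit the same per-row outer products as the map-side join, at the cost of shuffling both matrices.

```python
from collections import defaultdict

def reduce_side_btrans_a(tagged_rows):
    """Reduce-side join sketch for B'A when rows arrive in no order.

    `tagged_rows` is a stream of (row_id, tag, sparse_row) records,
    where tag is "A" or "B" and sparse_row is {column_id: value}.
    Grouping on row_id stands in for the shuffle; the "reducer" then
    emits each row's contribution (B'A)[i][j] += B[k][i] * A[k][j].
    """
    groups = defaultdict(lambda: {"A": {}, "B": {}})
    for row_id, tag, row in tagged_rows:      # map output + shuffle
        groups[row_id][tag] = row
    result = defaultdict(float)
    for pair in groups.values():              # reduce
        for i, bi in pair["B"].items():
            for j, aj in pair["A"].items():
                result[(i, j)] += bi * aj
    return dict(result)
```

The extra shuffle of every row of A and B is exactly the overhead the earlier options avoid, which is why this is the fallback rather than the default.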

On Apr 3, 2013, at 10:21 AM, Sebastian Schelter wrote:

> I don't think you need to run RowSimilarityJob on B'A, I think you would
> need an equivalent of RowSimilarityJob to compute B'A. I guess you could
> extends the MatrixMultiplicationJob to use the similarity measures from
> RowSimilarityJob instead of standard dot products.
