This may not quite be true, because the RSJ can take some liberties. Those liberties stem from the fact that A'A can be viewed as a self join: as each row of A is read, its cooccurrences can be emitted immediately.
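To make the self-join point concrete, here is a minimal Python sketch (not Mahout code; the sparse-row representation as a dict of column -> value is an assumption for illustration). Since A'A[i][j] is the sum over rows of A[row][i] * A[row][j], each row contributes its cooccurrences independently as it streams by:

```python
from collections import defaultdict

def cooccurrences(rows):
    """Stream sparse rows of A and accumulate entries of A'A.

    A'A[i][j] = sum over rows of A[row][i] * A[row][j], so each row
    can be processed on its own as it is read -- a self join, with
    no need to pair the row against any other part of the input.
    """
    result = defaultdict(float)
    for row in rows:  # row: dict mapping column index -> value
        for i, vi in row.items():
            for j, vj in row.items():
                result[(i, j)] += vi * vj
    return dict(result)

# Tiny example: A has rows (1, 2) and (3, 0)
A = [{0: 1.0, 1: 2.0}, {0: 3.0}]
print(cooccurrences(A))  # (0,0) -> 1 + 9 = 10, (0,1) -> 2, (1,1) -> 4
```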
For B'A, we have to somehow get corresponding rows of A and B to the same place at the same time. If both matrices are stored in sparse row-major form, then a map-side merge join would work, at the cost of some locality.

You can recover that locality in special cases with a few tricks. One is to actually store A and B as adjoined rows, so that fetching a row of A inherently also fetches the corresponding row of B. I am not sure how that layout would come about in practice.

A second way to get the locality is to use a system like MapR (conflict of interest alert, vendor-specific feature alert, yada yada). In such a system, you can force files to be co-resident. In MapR, this is done by setting the chunk size to zero and storing A and B in the same volume. That volume is then stored in a single container, which forces all of the files in it to have exactly the same replication pattern. It also makes that volume scale less well. When this is feasible, it can result in a massive speed improvement: I know of one site that does this and reportedly achieves a 10-20x speedup from the decrease in non-local reads.

A third option is a reduce-side join. This would be necessary if A and B were not stored with rows in sequential order and were also not randomly accessible. I would avoid this option if possible.

On Apr 3, 2013, at 10:21 AM, Sebastian Schelter wrote:

> I don't think you need to run RowSimilarityJob on B'A, I think you would
> need an equivalent of RowSimilarityJob to compute B'A. I guess you could
> extend the MatrixMultiplicationJob to use the similarity measures from
> RowSimilarityJob instead of standard dot products.
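For what it's worth, the reason pairing rows suffices at all can be sketched in a few lines of Python (again an illustrative toy, not Mahout code; the dict-based sparse rows are an assumption). Since B'A[i][j] = sum over rows k of B[k][i] * A[k][j], once row k of B and row k of A arrive together (via merge join, adjoined storage, or a reduce-side join), each pair contributes its outer product independently:

```python
from collections import defaultdict

def cross_product_transpose(rows_b, rows_a):
    """Accumulate B'A from corresponding sparse rows of B and A.

    Each matched pair (row k of B, row k of A) contributes the outer
    product of the two rows; no other rows are needed, which is why a
    join that co-locates matching rows is enough.
    """
    result = defaultdict(float)
    for row_b, row_a in zip(rows_b, rows_a):
        for i, vb in row_b.items():
            for j, va in row_a.items():
                result[(i, j)] += vb * va
    return dict(result)

# B has rows (1, 0) and (0, 2); A has rows (1, 1) and (3, 0)
B = [{0: 1.0}, {1: 2.0}]
A = [{0: 1.0, 1: 1.0}, {0: 3.0}]
print(cross_product_transpose(B, A))  # B'A = [[1, 1], [6, 0]] as a sparse dict
```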
