Completely concur with that. MatrixMultiplicationJob is already using a
map-side merge join AFAIK.
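
To illustrate the idea, here is a minimal single-process sketch (not Mahout
code). It assumes A and B are each available as a stream of
(row index, sparse row) pairs sorted by row index, and accumulates B'A as the
sum of outer products of matching rows. The function name merge_join_bt_a and
the dict-based row representation are made up for this example.

```python
from collections import defaultdict


def merge_join_bt_a(rows_a, rows_b):
    """Compute B'A from two row-major sparse matrices via a merge join.

    rows_a, rows_b: iterables of (row_index, {column: value}) pairs,
    each sorted by row_index (hypothetical in-memory stand-ins for the
    sorted input files). Returns {(col_of_B, col_of_A): value}.
    """
    result = defaultdict(float)
    iter_a, iter_b = iter(rows_a), iter(rows_b)
    a, b = next(iter_a, None), next(iter_b, None)

    while a is not None and b is not None:
        if a[0] < b[0]:
            # Row present only in A so far: advance A.
            a = next(iter_a, None)
        elif a[0] > b[0]:
            # Row present only in B so far: advance B.
            b = next(iter_b, None)
        else:
            # Matching row index: add the outer product b_i' * a_i.
            for j, b_val in b[1].items():
                for k, a_val in a[1].items():
                    result[(j, k)] += b_val * a_val
            a, b = next(iter_a, None), next(iter_b, None)

    return dict(result)


A = [(0, {0: 1.0, 1: 2.0}), (2, {1: 3.0})]
B = [(0, {0: 4.0}), (1, {1: 5.0}), (2, {0: 1.0, 1: 2.0})]
product = merge_join_bt_a(A, B)
```

In an actual MapReduce job the partial products would presumably be emitted
from the mappers and summed downstream rather than held in one dictionary;
the sketch only shows the join logic itself.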


On 05.04.2013 15:04, Ted Dunning wrote:
> This may not quite be true because the RSJ is able to take some liberties.
> 
> These liberties come from the fact that A'A can be viewed as a self join.  
> Thus the cooccurrences can be emitted as the rows of A are read.
> 
> For B'A, we have to somehow get corresponding rows of A and B at the same 
> time in the same place.  If both matrices are stored in sparse row-major 
> form, then a map-side merge join would work at the cost of some locality.  
> You can recover that locality in special cases by a few tricks.  For 
> instance, you might actually store A and B as adjoined rows.  That means that 
> fetching a row of A inherently also gives a row of B.  Not sure how this 
> could come about.  
> 
> A second way to get the locality is to use a system like MapR (conflict of 
> interest alert, vendor specific feature alert, yada yada).  In such a system, 
> you can force files to be co-resident.  In MapR, this is done by setting 
> chunk size to zero and storing A and B in the same volume.  The volume is then 
> stored in a single container, which forces all of the files in that volume to 
> have exactly the same replication pattern.  It also means the volume does not 
> scale as well.  When this is feasible, it can result in a massive speed 
> improvement.  I know of one site that does this and reportedly achieves a 
> 10-20x speedup because of the decrease in non-local reads.
> 
> A third option is to use a reduce side join.  This would be necessary if A 
> and B were ever not stored with rows in sequential order and were also not 
> randomly accessible.  I would avoid this option if possible.
> 
> On Apr 3, 2013, at 10:21 AM, Sebastian Schelter wrote:
> 
>> I don't think you need to run RowSimilarityJob on B'A, I think you would
>> need an equivalent of RowSimilarityJob to compute B'A. I guess you could
>> extend the MatrixMultiplicationJob to use the similarity measures from
>> RowSimilarityJob instead of standard dot products.
> 
> 
