I guess I don't understand this issue.

In my case both the item ids and user ids of the separate DistributedRowMatrix 
will match, and I know the size of the entire space from a previous step where 
I create id maps. I suppose you are saying that the m/r code would be super 
simple if a row of B' and a column of A could be processed together, which I 
understand as an optimal implementation.

So calculating [B'A] seems like TransposeJob plus MultiplyJob and does seem to 
work. You lose the ability to substitute different RowSimilarityJob measures. 
I assume this creates something like the co-occurrence similarity measure. But 
oh, well. Maybe I'll look at that later.
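For a concrete sense of what [B'A] produces, here is a minimal numpy sketch (the matrices and sizes below are made up for illustration, not from any real data): with binary user-by-item matrices A and B that share the same user rows, each entry of B'A counts the users who interacted with both an item from B's space and an item from A's space, i.e. a raw cross-co-occurrence count.

```python
import numpy as np

# Hypothetical tiny example: 4 users, 3 items in A's space, 2 items in B's space.
# Rows are users, columns are items; entries are 0/1 interactions.
A = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 0, 0]])
B = np.array([[1, 0],
              [0, 1],
              [1, 1],
              [1, 0]])

# B'A: entry (i, j) = number of users who touched item i of B and item j of A.
cross = B.T @ A
print(cross)
# [[2 1 2]
#  [1 2 1]]
```

That is exactly the "co-occurrence similarity" case: plain dot products over the shared user dimension, with no LLR or other RowSimilarityJob measure applied.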

I also see why you say the two matrices A and B don't have to be the same 
size, since [B'A]H_v = [B'A]A', so the dimensions will work out as long as the 
users dimension is the same throughout.
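The dimension argument can be checked mechanically. A small numpy sketch with deliberately different item counts on the two sides (all sizes here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
users, items_a, items_b = 5, 4, 3   # hypothetical sizes; only `users` must agree

A = rng.integers(0, 2, size=(users, items_a))   # users x items_a
B = rng.integers(0, 2, size=(users, items_b))   # users x items_b

BtA = B.T @ A            # items_b x items_a
H_v = A.T                # items_a x users: each column is one user's history
R = BtA @ H_v            # items_b x users: B-space scores for every user
print(R.shape)
# (3, 5)
```

The inner dimensions cancel regardless of items_a and items_b; only the shared user dimension has to line up.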


On Apr 6, 2013, at 7:46 AM, Sebastian Schelter <[email protected]> wrote:

Completely concur with that. MatrixMultiplicationJob is already using a
mapside merge-join AFAIK.


On 05.04.2013 15:04, Ted Dunning wrote:
> This may not quite be true because the RSJ is able to take some liberties.
> 
> The origin of these is that A'A can be viewed as a self join.  Thus as rows 
> of A are read, the cooccurrences can be emitted as they are read.
> 
> For B'A, we have to somehow get corresponding rows of A and B at the same 
> time in the same place.  If both matrices are stored in sparse row-major 
> form, then a map-side merge join would work at the cost of some locality.  
> You can recover that locality in special cases by a few tricks.  For 
> instance, you might actually store A and B as adjoined rows.  That means that 
> fetching a row of A inherently also gives a row of B.  Not sure how this 
> could come about.  
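The map-side merge join described above can be sketched outside Hadoop. Assuming both inputs arrive as (key, row) pairs sorted by row key, which is the precondition a map-side merge join places on its inputs, a single linear pass pairs up matching rows (the keys and row payloads here are placeholders):

```python
def merge_join(rows_a, rows_b):
    """Pair rows of A and B that share a row key.

    Both inputs must be iterables of (key, row) sorted by key --
    the same precondition a map-side merge join imposes.
    """
    it_a, it_b = iter(rows_a), iter(rows_b)
    a = next(it_a, None)
    b = next(it_b, None)
    while a is not None and b is not None:
        if a[0] == b[0]:
            # Matching row keys: emit the corresponding rows together.
            yield a[0], a[1], b[1]
            a, b = next(it_a, None), next(it_b, None)
        elif a[0] < b[0]:
            a = next(it_a, None)
        else:
            b = next(it_b, None)

pairs = list(merge_join([(1, "a1"), (2, "a2"), (4, "a4")],
                        [(1, "b1"), (3, "b3"), (4, "b4")]))
print(pairs)  # [(1, 'a1', 'b1'), (4, 'a4', 'b4')]
```

The locality cost Ted mentions comes from the two sorted files living on different nodes; the join logic itself is just this two-pointer pass.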
> 
> A second way to get the locality is to use a system like MapR (conflict of 
> interest alert, vendor specific feature alert, yada yada).  In such a system, 
> you can force files to be co-resident.  In MapR, this is done by setting 
> chunk size to zero and storing A and B in the same volume.  This makes that 
> volume only be stored in a single container which forces all of the files in 
> that volume to have exactly the same replication pattern.  It also makes that 
> volume not scale as well.  When this is feasible, it can result in a massive 
> speed improvement.  I know of one site that does this and reportedly achieves 
> 10-20x speed up because of the decrease in non-local reads.
> 
> A third option is to use a reduce side join.  This would be necessary if A 
> and B were ever not stored with rows in sequential order and were also not 
> randomly accessible.  I would avoid this option if possible.
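For contrast, the reduce-side variant can be mimicked in a few lines: the "map" phase emits each row under its key tagged with its source, the shuffle groups by key, and the "reducer" joins whatever arrives. No sort-order assumption on the inputs, at the cost of a full shuffle (again a toy sketch, not Mahout code):

```python
from collections import defaultdict

def reduce_side_join(rows_a, rows_b):
    # "Map" phase: emit each row under its row key, tagged by source matrix.
    grouped = defaultdict(lambda: {"A": None, "B": None})
    for key, row in rows_a:
        grouped[key]["A"] = row
    for key, row in rows_b:
        grouped[key]["B"] = row
    # "Reduce" phase: each key now holds at most one row from each matrix;
    # keep only keys present in both.
    return {k: (v["A"], v["B"])
            for k, v in grouped.items()
            if v["A"] is not None and v["B"] is not None}

joined = reduce_side_join([(2, "a2"), (1, "a1")],   # unsorted input is fine here
                          [(1, "b1"), (3, "b3")])
print(joined)  # {1: ('a1', 'b1')}
```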
> 
> On Apr 3, 2013, at 10:21 AM, Sebastian Schelter wrote:
> 
>> I don't think you need to run RowSimilarityJob on B'A, I think you would
>> need an equivalent of RowSimilarityJob to compute B'A. I guess you could
>> extend the MatrixMultiplicationJob to use the similarity measures from
>> RowSimilarityJob instead of standard dot products.
> 
> 
