I guess I don't understand this issue. In my case the item ids and user ids of the separate DistributedRowMatrix objects will match, and I know the size of the entire space from a previous step where I create the id maps. I suppose you are saying that the M/R code would be very simple if a row of B' and the matching row of A could be processed together, which I understand to be the optimal implementation.
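To make the "process a row of B' and the matching row of A together" idea concrete, here is a small plain-Python sketch (illustrative only, not Mahout code, and the function name is mine): B'A is the sum over users u of the outer product of B's row u and A's row u, so a mapper that sees both rows for the same user id can emit cooccurrence contributions directly, just as RSJ does for the A'A self-join.

```python
def bt_a_streaming(rows_a, rows_b):
    """Compute B'A by streaming matched user rows.

    rows_a[u] is user u's row of A (length n_a);
    rows_b[u] is user u's row of B (length n_b).
    Returns B'A as an n_b x n_a list of lists.
    """
    n_a, n_b = len(rows_a[0]), len(rows_b[0])
    result = [[0.0] * n_a for _ in range(n_b)]
    # zip() plays the role of the merge-join on user id: each iteration
    # holds the corresponding rows of A and B in the same place.
    for a_row, b_row in zip(rows_a, rows_b):
        for i, b in enumerate(b_row):       # emit outer product b_u a_u'
            for j, a in enumerate(a_row):
                result[i][j] += b * a
    return result

# 3 users; A is users x 2 items, B is users x 3 items.
A = [[1, 0], [0, 1], [1, 1]]
B = [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
print(bt_a_streaming(A, B))  # B'A, a 3 x 2 matrix
```

The point of the sketch is that nothing beyond the two matched rows is needed per user, which is why co-locating the rows (adjoined storage, co-resident files, or a map-side merge join) is the whole game.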
So calculating [B'A] looks like TransposeJob plus MatrixMultiplicationJob, and that does seem to work. You lose the ability to substitute different RowSimilarityJob similarity measures; I assume this produces something like the co-occurrence similarity measure. But oh well, maybe I'll look at that later. I also see why you say the two matrices A and B don't have to be the same size: since [B'A]H_v = [B'A]A', the dimensions will work out as long as the users dimension is the same throughout.

On Apr 6, 2013, at 7:46 AM, Sebastian Schelter <[email protected]> wrote:

Completely concur with that. MatrixMultiplicationJob is already using a map-side merge-join AFAIK.

On 05.04.2013 15:04, Ted Dunning wrote:
> This may not quite be true, because RSJ is able to take some liberties.
>
> The origin of these is that A'A can be viewed as a self-join. Thus, as rows
> of A are read, the cooccurrences can be emitted as they are read.
>
> For B'A, we have to somehow get corresponding rows of A and B at the same
> time in the same place. If both matrices are stored in sparse row-major
> form, then a map-side merge join would work at the cost of some locality.
> You can recover that locality in special cases with a few tricks. For
> instance, you might actually store A and B as adjoined rows. That means that
> fetching a row of A inherently also gives a row of B. I'm not sure how this
> would come about.
>
> A second way to get the locality is to use a system like MapR (conflict-of-
> interest alert, vendor-specific-feature alert, yada yada). In such a system,
> you can force files to be co-resident. In MapR, this is done by setting the
> chunk size to zero and storing A and B in the same volume. This makes that
> volume be stored in a single container, which forces all of the files in
> that volume to have exactly the same replication pattern. It also makes that
> volume not scale as well. When this is feasible, it can result in a massive
> speed improvement.
> I know of one site that does this and reportedly achieves a
> 10-20x speedup because of the decrease in non-local reads.
>
> A third option is to use a reduce-side join. This would be necessary if A
> and B were ever not stored with rows in sequential order and were also not
> randomly accessible. I would avoid this option if possible.
>
> On Apr 3, 2013, at 10:21 AM, Sebastian Schelter wrote:
>
>> I don't think you need to run RowSimilarityJob on B'A; I think you would
>> need an equivalent of RowSimilarityJob to compute B'A. I guess you could
>> extend MatrixMultiplicationJob to use the similarity measures from
>> RowSimilarityJob instead of standard dot products.
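The dimension argument ([B'A]H_v = [B'A]A', with H the matrix of all user history vectors) can be checked with a tiny plain-Python sketch. This is illustrative only; the helper names are mine, and the numbers are arbitrary. With A (users x items_a) and B (users x items_b), B'A is items_b x items_a, and multiplying by A' (items_a x users) gives an items_b x users result: A and B need not share an item dimension, only the users dimension.

```python
def matmul(x, y):
    """Naive product of list-of-lists x (m x k) and y (k x n)."""
    assert len(x[0]) == len(y), "inner dimensions must match"
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def transpose(m):
    return [list(col) for col in zip(*m)]

# 4 users; A has 2 items, B has 3 items (deliberately different sizes).
A = [[1, 0], [0, 1], [1, 1], [0, 0]]             # users x items_a
B = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]] # users x items_b

bt_a = matmul(transpose(B), A)     # B'A: items_b x items_a
recs = matmul(bt_a, transpose(A))  # [B'A]A': items_b x users
print(len(bt_a), len(bt_a[0]))     # 3 2
print(len(recs), len(recs[0]))     # 3 4
```

Every multiplication along the way only requires the users dimension of A and B to agree, which is the point being made above.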
