I don't know that it's explicitly documented; it's just the jobs you see in RowSimilarityJob. Crudely: phase 1 computes some statistics per row (one vector per item) and transposes the matrix. Phase 2 does the pairwise similarity computation. Phase 3 puts the results together into the final similarity matrix.
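Here's a single-machine sketch of roughly what those three phases compute, assuming cosine as the similarity measure (RowSimilarityJob supports several). This is not Mahout's actual code; the class and variable names are mine:

    import java.util.*;

    // Single-machine sketch of the three phases, assuming cosine similarity.
    public class RowSimilaritySketch {

        public static void main(String[] args) {
            // Toy user-item matrix: userId -> (itemId -> rating)
            Map<Integer, Map<Integer, Double>> userVectors = Map.of(
                1, Map.of(10, 5.0, 11, 3.0),
                2, Map.of(10, 4.0, 12, 2.0),
                3, Map.of(11, 1.0, 12, 5.0));

            // Phase 1: transpose so each row is one item's vector, and
            // compute a per-row statistic (here, the vector norm).
            Map<Integer, Map<Integer, Double>> itemVectors = new HashMap<>();
            for (var user : userVectors.entrySet()) {
                for (var pref : user.getValue().entrySet()) {
                    itemVectors.computeIfAbsent(pref.getKey(), k -> new HashMap<>())
                               .put(user.getKey(), pref.getValue());
                }
            }
            Map<Integer, Double> norms = new HashMap<>();
            for (var item : itemVectors.entrySet()) {
                double sumSq = 0;
                for (double v : item.getValue().values()) sumSq += v * v;
                norms.put(item.getKey(), Math.sqrt(sumSq));
            }

            // Phase 2: pairwise similarity from co-occurring entries.
            List<Integer> items = new ArrayList<>(itemVectors.keySet());
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    double dot = 0;
                    for (var e : itemVectors.get(items.get(i)).entrySet()) {
                        Double other = itemVectors.get(items.get(j)).get(e.getKey());
                        if (other != null) dot += e.getValue() * other;
                    }
                    if (dot == 0) continue; // pruning: no co-occurrence at all
                    double sim = dot / (norms.get(items.get(i)) * norms.get(items.get(j)));
                    // Phase 3 would regroup these pairs into per-item rows.
                    System.out.printf("sim(%d,%d) = %.3f%n",
                        items.get(i), items.get(j), sim);
                }
            }
        }
    }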
At a high level it's no different from computing these values without Hadoop. Of course, the parallel implementation on Hadoop is very different in its details; see the sketch after the quoted message below for a rough picture of how the pairwise step decomposes into map and reduce. The result is in theory the same, but probably differs slightly due to bits of logic in the Hadoop job that prune small or insignificant data. Does that start to answer?

@bejoy, this is not what is described in MiA Chapter 6. This is RowSimilarityJob, which isn't described directly in the book.

On Mon, Nov 14, 2011 at 6:24 PM, Chris Schilling <[email protected]> wrote:
> Hi All,
>
> I was just curious if the job flow for the distributed similarity calculation
> is documented anywhere. What is the difference between calculating a
> similarity sequentially versus using distributed matrix operations on Hadoop?
> I am just looking for a high-level description of how to get from the
> user-item matrix to an item-item similarity score in map-reduce.
>
> Thanks!
> Chris
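As promised above, here is a rough simulation of how the phase-2 computation can decompose for Hadoop, again assuming cosine similarity. Each "map" call sees one user's preferences and emits a partial dot-product contribution for every pair of items that user co-rated; the "reduce" step sums the contributions per item pair. The Pair record and everything else here are my own illustrative names, not Mahout's:

    import java.util.*;

    // Simulation of the map/reduce decomposition of the pairwise step.
    public class PairwiseSketch {

        record Pair(int a, int b) {}

        public static void main(String[] args) {
            // Each element is one user's preferences: itemId -> rating.
            List<Map<Integer, Double>> userRows = List.of(
                Map.of(10, 5.0, 11, 3.0),
                Map.of(10, 4.0, 12, 2.0),
                Map.of(11, 1.0, 12, 5.0));

            // "Map": emit (itemA, itemB) -> valueA * valueB per co-rating user.
            // "Reduce": merge() sums the partial contributions per item pair.
            Map<Pair, Double> partialDots = new HashMap<>();
            for (Map<Integer, Double> row : userRows) {
                List<Integer> items = new ArrayList<>(row.keySet());
                Collections.sort(items); // canonical pair ordering
                for (int i = 0; i < items.size(); i++) {
                    for (int j = i + 1; j < items.size(); j++) {
                        Pair key = new Pair(items.get(i), items.get(j));
                        double contribution = row.get(items.get(i)) * row.get(items.get(j));
                        partialDots.merge(key, contribution, Double::sum);
                    }
                }
            }

            // Dividing each summed dot product by the two phase-1 norms would
            // give the cosine similarity; phase 3 regroups pairs into rows.
            partialDots.forEach((pair, dot) ->
                System.out.printf("dot(%d,%d) = %.1f%n", pair.a(), pair.b(), dot));
        }
    }

The key point for the sequential-versus-distributed question: the math is identical, but the distributed version never holds two full item vectors in one place. It scatters per-user partial products across mappers and relies on the shuffle to bring each item pair's contributions together.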
