I don't know that it's explicitly documented anywhere. It's just the
jobs you see in RowSimilarityJob, though. Crudely: phase 1 computes
some statistics per row (vector, item) and transposes the matrix.
Phase 2 does the pairwise similarity computation. Phase 3 assembles
the results.
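
To make that concrete, here's a minimal single-process sketch of the
same data flow, assuming cosine similarity. The class name and the
Map-based sparse matrix are just for illustration; they aren't
Mahout's actual types (needs Java 9+ for Map.of/List.of):

import java.util.*;

public class RowSimilaritySketch {

  public static void main(String[] args) {
    // Sparse input: row id -> (column id -> value), e.g. user-item data.
    Map<Integer, Map<Integer, Double>> rows = new HashMap<>();
    rows.put(0, Map.of(0, 1.0, 1, 2.0));
    rows.put(1, Map.of(1, 1.0, 2, 3.0));
    rows.put(2, Map.of(0, 2.0, 2, 1.0));

    // Phase 1: per-row statistics (L2 norms for cosine) and transpose.
    Map<Integer, Double> norms = new HashMap<>();
    Map<Integer, Map<Integer, Double>> cols = new HashMap<>();
    for (Map.Entry<Integer, Map<Integer, Double>> row : rows.entrySet()) {
      double sumSq = 0;
      for (Map.Entry<Integer, Double> e : row.getValue().entrySet()) {
        sumSq += e.getValue() * e.getValue();
        cols.computeIfAbsent(e.getKey(), k -> new HashMap<>())
            .put(row.getKey(), e.getValue());
      }
      norms.put(row.getKey(), Math.sqrt(sumSq));
    }

    // Phase 2: accumulate partial dot products per row pair. On Hadoop
    // each column goes to a mapper; a reducer sums the contributions.
    Map<List<Integer>, Double> dots = new HashMap<>();
    for (Map<Integer, Double> col : cols.values()) {
      List<Integer> rowIds = new ArrayList<>(col.keySet());
      Collections.sort(rowIds);
      for (int i = 0; i < rowIds.size(); i++) {
        for (int j = i + 1; j < rowIds.size(); j++) {
          List<Integer> pair = List.of(rowIds.get(i), rowIds.get(j));
          dots.merge(pair,
              col.get(rowIds.get(i)) * col.get(rowIds.get(j)),
              Double::sum);
        }
      }
    }

    // Phase 3: turn dot products into similarities and put the
    // per-row results together.
    for (Map.Entry<List<Integer>, Double> e : dots.entrySet()) {
      int a = e.getKey().get(0), b = e.getKey().get(1);
      double cosine = e.getValue() / (norms.get(a) * norms.get(b));
      System.out.printf("similarity(%d,%d) = %.4f%n", a, b, cosine);
    }
  }
}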

At a high level it's no different from computing these values without
Hadoop. Of course, the parallel implementation on Hadoop is very
different in its details. The result is in theory the same, but
probably differs slightly because of bits of logic in the Hadoop job
that prune small or insignificant data.
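
That pruning looks roughly like the following; the threshold and the
per-row cap here are invented for illustration, not the job's actual
parameters:

import java.util.*;

public class PruneSketch {
  // Drop similarities below a cutoff, then keep only the k strongest
  // for a given row.
  static Map<Integer, Double> prune(Map<Integer, Double> sims,
                                    double threshold, int k) {
    List<Map.Entry<Integer, Double>> kept = new ArrayList<>();
    for (Map.Entry<Integer, Double> e : sims.entrySet()) {
      if (e.getValue() >= threshold) {  // discard insignificant pairs
        kept.add(e);
      }
    }
    kept.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
    Map<Integer, Double> result = new LinkedHashMap<>();
    for (Map.Entry<Integer, Double> e
         : kept.subList(0, Math.min(k, kept.size()))) {
      result.put(e.getKey(), e.getValue());
    }
    return result;
  }

  public static void main(String[] args) {
    Map<Integer, Double> sims = Map.of(1, 0.9, 2, 0.05, 3, 0.4, 4, 0.7);
    System.out.println(prune(sims, 0.1, 2)); // keeps items 1 and 4
  }
}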

Does that start to answer?

@bejoy, this is not what is described in MiA Chapter 6. This is
RowSimilarityJob, which isn't described directly in the book.

On Mon, Nov 14, 2011 at 6:24 PM, Chris Schilling
<[email protected]> wrote:
> Hi All,
>
> I was just curious whether the job flow for the distributed similarity
> calculation is documented anywhere. What is the difference between
> calculating a similarity sequentially versus using distributed matrix
> operations on Hadoop? I am just looking for a high-level description of
> how to get from the User-Item matrix to an item-item similarity score
> in map-reduce.
>
> Thanks!
> Chris
>
>
