On Jul 1, 2010, at 11:02am, Jake Mannix wrote:

What were Jimmy's improvements, Ken? Improvements to the algorithm, or the
implementation, or what?

Improvements in implementation.

My take-away from the talk was:

1. Do combining of inbound link "juice" inside of mapper

This requires keeping a memory-resident map of target pages -> accumulated juice, which means you have to worry about memory management.

But the win is that you can emit only one (or at least many fewer) page/juice pairs from the mapper.

Arun felt that with proper tuning of the shuffle phase, primarily around memory allocation, you could reduce spills to disk and thus get essentially the same level of performance, without having to worry about memory management. As Ted noted, Jimmy disagreed.
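To make #1 concrete, here's a rough Python sketch of what in-mapper combining looks like. All names here are illustrative assumptions, not from Jimmy's actual code; a real Hadoop version would be a Mapper subclass flushing via context.write():

```python
class InMapperCombiningMapper:
    """Accumulates outbound PageRank contributions ("juice") in a
    memory-resident dict instead of emitting one pair per outlink."""

    def __init__(self, flush_threshold=100_000):
        self.juice = {}                  # target URL -> accumulated mass
        self.flush_threshold = flush_threshold
        self.emitted = []                # stands in for context.write()

    def map(self, page_url, rank, outlinks):
        share = rank / len(outlinks)     # each outlink gets an equal share
        for target in outlinks:
            self.juice[target] = self.juice.get(target, 0.0) + share
        # This is the memory-management worry: spill when the map gets big.
        if len(self.juice) >= self.flush_threshold:
            self.flush()

    def flush(self):
        for target, mass in self.juice.items():
            self.emitted.append((target, mass))
        self.juice.clear()

# Two source pages sharing a target produce ONE combined pair for it.
m = InMapperCombiningMapper()
m.map("a.com/1", 1.0, ["a.com/2", "b.com/1"])
m.map("a.com/3", 1.0, ["a.com/2"])
m.flush()
```

With plain per-link emits you'd get three pairs here; with combining you get two, and the savings grow with the number of duplicate targets.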

2. Order inputs to mapper by page URL

This improves the performance of change #1, because most links are intra-domain: sorting by URL groups pages from the same domain together, so the mapper sees many repeated targets within a given run of source pages, and the combining map stays small while catching more duplicates.
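A tiny illustration of that locality effect, on made-up data (the URLs and the window-based metric are my own assumptions, just to show the direction of the effect):

```python
# Each page links mostly within its own domain.
pages = {
    "b.com/1": ["b.com/2", "b.com/3"],
    "a.com/1": ["a.com/2", "a.com/3"],
    "a.com/2": ["a.com/3", "a.com/1"],
    "b.com/2": ["b.com/1", "b.com/3"],
}

def distinct_targets_in_windows(order, window=2):
    """Distinct targets seen per window of consecutive pages -- a rough
    proxy for how large the in-mapper combining map must grow before
    it stops finding duplicate targets."""
    total = 0
    for i in range(0, len(order), window):
        targets = set()
        for url in order[i:i + window]:
            targets.update(pages[url])
        total += len(targets)
    return total

unsorted_order = ["b.com/1", "a.com/1", "b.com/2", "a.com/2"]
sorted_order = sorted(pages)  # groups a.com/* before b.com/*

# Sorted input sees fewer distinct targets per window, i.e. more combining.
assert distinct_targets_in_windows(sorted_order) < distinct_targets_in_windows(unsorted_order)
```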

3. Don't pass the link graph structure through the mapreduce pipeline

Normally these algorithms emit special key/value pairs from the mapper that encode the link structure (not juice), so the link graph can be reconstructed at the end of the pipeline.

His "Schimmy" algorithm skips that, by maintaining the static link graph in a partitioned set of sorted sequence files.

In the reduce phase, incoming keys (URLs) are partitioned and sorted similarly, so that the reducer can do a sequential merge of incoming key/values and the link graph data, to emit the (same) link graph but with updated page rank scores.
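Here's my rough sketch of what that reduce-side merge could look like, in Python. The data layout is assumed (the real Schimmy implementation reads Hadoop sequence files, and real PageRank would also apply damping and handle dangling nodes); the point is just the sequential merge of two identically sorted streams:

```python
def schimmy_reduce(graph_records, shuffled):
    """graph_records: iterator of (url, outlinks), sorted by url --
    the static link graph partition read from the side sequence file.
    shuffled: iterator of (url, [juice values]), sorted the same way,
    containing only urls that received inbound mass.
    Yields (url, new_rank, outlinks): the same graph, updated scores."""
    shuffled = iter(shuffled)
    pending = next(shuffled, None)
    for url, outlinks in graph_records:
        rank = 0.0
        # Both streams are sorted identically, so one forward pass suffices.
        if pending is not None and pending[0] == url:
            rank = sum(pending[1])
            pending = next(shuffled, None)
        yield (url, rank, outlinks)

graph = [("a.com/1", ["a.com/2"]), ("a.com/2", ["a.com/1"]), ("b.com/1", [])]
juice = [("a.com/2", [0.5, 1.0]), ("b.com/1", [0.25])]
result = list(schimmy_reduce(graph, juice))
```

Because the graph structure never enters the shuffle, the only data moving through the MapReduce pipeline is the juice itself.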

All in all, seems like he wound up with a 3x speed win (31% of original time), IIRC.

-- Ken


On Thu, Jul 1, 2010 at 4:24 PM, Ken Krugler <[email protected]> wrote:


On Jul 1, 2010, at 8:16am, Andrzej Bialecki wrote:

On 2010-06-30 21:11, Grant Ingersoll wrote:


On Jun 27, 2010, at 12:10 PM, Manish Katyal wrote:

Is there an implementation of the page-rank algorithm in Mahout?


No, there isn't. However, do you mean one implemented specifically for link analysis, or a general-purpose one?


There is one in Nutch, but it's tied to the Nutch API.


It's likely we'll be contributing one to Mahout - either based on Jimmy Lin's enhancements as described during Hadoop Summit on Tuesday, or we might try the "do it all with SVD" approach as previously proposed by Ted, and mentioned by Jake.

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





