On Jul 1, 2010, at 11:02am, Jake Mannix wrote:
What were Jimmy's improvements, Ken? Improvements to the algorithm, or the implementation, or what?
Improvements in implementation.
My take-away from the talk was:
1. Do combining of inbound link "juice" inside the mapper

This requires keeping a memory-resident map of target page -> accumulated juice, which means you have to worry about memory management. But the win is that the mapper emits only one page/juice pair per distinct target (or at least many fewer pairs).
Arun felt that with proper tuning of the shuffle phase, primarily
around memory allocation, you could reduce spills to disk and thus get
essentially the same level of performance, without having to worry
about memory management. As Ted noted, Jimmy disagreed.
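The in-mapper combining idea above can be sketched roughly as follows. This is a minimal, framework-free illustration, not Jimmy's actual Hadoop code; the function names, the input tuple shape, and the flush threshold are all hypothetical:

```python
from collections import defaultdict

def map_with_combining(pages, memory_limit=100_000):
    """Sketch of in-mapper combining for PageRank.

    pages: iterable of (url, rank, outlinks) tuples handled by one mapper.
    Instead of emitting one (target, juice) pair per link, accumulate each
    target's share of rank in a local map and emit once per distinct target.
    """
    buffer = defaultdict(float)
    for url, rank, outlinks in pages:
        if outlinks:
            share = rank / len(outlinks)
            for target in outlinks:
                buffer[target] += share
        # This is the memory management the talk warned about: if the
        # map grows too large, flush it (accepting some duplicate keys,
        # which the reducer sums anyway).
        if len(buffer) >= memory_limit:
            yield from buffer.items()
            buffer.clear()
    yield from buffer.items()
```

With two source pages both linking to the same target, the mapper emits a single combined pair for that target instead of two, which is exactly the reduction in shuffle traffic being argued over here.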
2. Order inputs to the mapper by page URL

This improves the performance of change #1, because most links are intra-domain, so ordering by URL means you get many more shared targets for a given number of source page URLs, and the in-mapper combining map stays small and effective.
3. Don't pass the link graph structure through the MapReduce pipeline

Normally algorithms emit special key/values from the mapper that specify the link structure (not juice), so that the link graph can be reconstructed at the end of the pipe. His "Schimmy" algorithm skips that by maintaining the static link graph in a partitioned set of sorted sequence files. In the reduce phase, incoming keys (URLs) are partitioned and sorted the same way, so the reducer can do a sequential merge of the incoming key/values and the link graph data, emitting the (same) link graph but with updated PageRank scores.
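The reducer-side merge in that last step can be sketched as a sequential merge of two streams sorted on the same key. Again this is an illustrative sketch, not the Schimmy code itself; the names and data shapes are hypothetical:

```python
def schimmy_merge(incoming, graph):
    """Sketch of a Schimmy-style reducer-side merge.

    incoming: sorted iterator of (url, summed_juice) pairs from the shuffle.
    graph: sorted iterator of (url, outlinks) read from the side-loaded
    partition file, partitioned and sorted the same way as the shuffle data.
    Emits (url, new_score, outlinks), so the graph structure itself never
    travels through the MapReduce pipeline.
    """
    incoming = iter(incoming)
    pending = next(incoming, None)
    for url, outlinks in graph:
        # Skip shuffle keys with no entry in this graph partition
        # (e.g. links pointing outside the crawled set).
        while pending is not None and pending[0] < url:
            pending = next(incoming, None)
        juice = 0.0
        if pending is not None and pending[0] == url:
            juice = pending[1]
            pending = next(incoming, None)
        # A real implementation would apply the damping factor here;
        # this sketch just passes the summed juice through.
        yield (url, juice, outlinks)
```

Because both streams are sorted identically, the merge is a single sequential pass: no graph records compete with juice records for shuffle bandwidth, which is where the speedup comes from.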
All in all, it seems he wound up with roughly a 3x speedup (31% of the original running time), IIRC.
-- Ken
On Thu, Jul 1, 2010 at 4:24 PM, Ken Krugler <[email protected]> wrote:
On Jul 1, 2010, at 8:16am, Andrzej Bialecki wrote:
On 2010-06-30 21:11, Grant Ingersoll wrote:
On Jun 27, 2010, at 12:10 PM, Manish Katyal wrote:
Is there an implementation of the page-rank algorithm in Mahout?
No, there isn't. However, do you mean to implement one specifically for link analysis, or a general-purpose one?
There is one in Nutch, but it's tied to the Nutch API.
It's likely we'll be contributing one to Mahout - either based on Jimmy Lin's enhancements as described during Hadoop Summit on Tuesday, or we might try the "do it all with SVD" approach as previously proposed by Ted, and mentioned by Jake.
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
--------------------------------------------