On Jul 1, 2010, at 11:02am, Jake Mannix wrote:
What were Jimmy's improvements, Ken? Improvements to the algorithm, or the implementation, or what?
Improvements in implementation.
My take-away from the talk was:
1. Do combining of inbound link "juice" inside the mapper

This requires keeping a memory-resident map of target page -> accumulated juice, which means you have to worry about memory management. But the win is that the mapper emits only one page/juice pair per distinct target (or at least many fewer pairs).
Arun felt that with proper tuning of the shuffle phase, primarily
around memory allocation, you could reduce spills to disk and thus get
essentially the same level of performance, without having to worry
about memory management. As Ted noted, Jimmy disagreed.
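The in-mapper combining idea above can be sketched roughly as follows. This is a minimal, framework-free illustration, not Jimmy's actual Hadoop code; the function names, the input tuple shape, and the flush threshold are all hypothetical:

```python
from collections import defaultdict

def map_with_combining(pages, memory_limit=100_000):
    """Sketch of in-mapper combining for PageRank.

    pages: iterable of (url, rank, outlinks) tuples handled by one mapper.
    Instead of emitting one (target, juice) pair per link, accumulate each
    target's share of rank in a local map and emit once per distinct target.
    """
    buffer = defaultdict(float)
    for url, rank, outlinks in pages:
        if outlinks:
            share = rank / len(outlinks)
            for target in outlinks:
                buffer[target] += share
        # This is the memory management the talk warned about: if the
        # map grows too large, flush it (accepting some duplicate keys,
        # which the reducer sums anyway).
        if len(buffer) >= memory_limit:
            yield from buffer.items()
            buffer.clear()
    yield from buffer.items()
```

With two source pages both linking to the same target, the mapper emits a single combined pair for that target instead of two, which is exactly the reduction in shuffle traffic being argued over here.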
2. Order inputs to the mapper by page URL

This improves the performance of change #1, because most links are intra-domain, so ordering by URL means you get many more shared targets for a given number of source page URLs, and the in-mapper combining map stays small and effective.
3. Don't pass the link graph structure through the MapReduce pipeline

Normally algorithms emit special key/values from the mapper that specify the link structure (not juice), so that the link graph can be reconstructed at the end of the pipe. His "Schimmy" algorithm skips that by maintaining the static link graph in a partitioned set of sorted sequence files. In the reduce phase, incoming keys (URLs) are partitioned and sorted the same way, so the reducer can do a sequential merge of the incoming key/values and the link graph data, emitting the (same) link graph but with updated PageRank scores.
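The reducer-side merge in that last step can be sketched as a sequential merge of two streams sorted on the same key. Again this is an illustrative sketch, not the Schimmy code itself; the names and data shapes are hypothetical:

```python
def schimmy_merge(incoming, graph):
    """Sketch of a Schimmy-style reducer-side merge.

    incoming: sorted iterator of (url, summed_juice) pairs from the shuffle.
    graph: sorted iterator of (url, outlinks) read from the side-loaded
    partition file, partitioned and sorted the same way as the shuffle data.
    Emits (url, new_score, outlinks), so the graph structure itself never
    travels through the MapReduce pipeline.
    """
    incoming = iter(incoming)
    pending = next(incoming, None)
    for url, outlinks in graph:
        # Skip shuffle keys with no entry in this graph partition
        # (e.g. links pointing outside the crawled set).
        while pending is not None and pending[0] < url:
            pending = next(incoming, None)
        juice = 0.0
        if pending is not None and pending[0] == url:
            juice = pending[1]
            pending = next(incoming, None)
        # A real implementation would apply the damping factor here;
        # this sketch just passes the summed juice through.
        yield (url, juice, outlinks)
```

Because both streams are sorted identically, the merge is a single sequential pass: no graph records compete with juice records for shuffle bandwidth, which is where the speedup comes from.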
All in all, it seems he wound up with roughly a 3x speedup (31% of the original running time), IIRC.
-- Ken
On Thu, Jul 1, 2010 at 4:24 PM, Ken Krugler <[email protected]> wrote:
On Jul 1, 2010, at 8:16am, Andrzej Bialecki wrote:
On 2010-06-30 21:11, Grant Ingersoll wrote:
On Jun 27, 2010, at 12:10 PM, Manish Katyal wrote:
Is there an implementation of the page-rank algorithm in Mahout?
No, there isn't. However, do you mean to implement one specifically for link analysis, or a general-purpose one?
There is one in Nutch, but it's tied to the Nutch API.
It's likely we'll be contributing one to Mahout - either based on Jimmy Lin's enhancements as described during Hadoop Summit on Tuesday, or we might try the "do it all with SVD" approach as previously proposed by Ted, and mentioned by Jake.
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
--------------------------------------------