Will provide a patch tomorrow.
https://issues.apache.org/jira/browse/NUTCH-1335
On Mon, 16 Apr 2012 20:19:46 +0200, Markus Jelsma
<[email protected]> wrote:
It seems a single URL has about half a million outlinks connected to
it in the OutlinkDB! A pattern of 50 URLs repeats 100,000 times! I
can remedy this state by storing the links in a Set to enforce
uniqueness, which lets the job finish.
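To illustrate the remedy: a minimal sketch of Set-based deduplication, keyed on the target URL. The `Outlink` class here is a hypothetical stand-in for Nutch's own outlink type, not the real API; the point is that `Set.add()` returning false filters the repeats in one pass.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for Nutch's outlink type: target URL plus anchor text.
class Outlink {
    final String toUrl;
    final String anchor;
    Outlink(String toUrl, String anchor) { this.toUrl = toUrl; this.anchor = anchor; }
}

public class DedupOutlinks {
    // Collapse duplicate outlinks by target URL, keeping first-seen order.
    static List<Outlink> dedup(List<Outlink> outlinks) {
        Set<String> seen = new LinkedHashSet<>();
        List<Outlink> unique = new ArrayList<>();
        for (Outlink o : outlinks) {
            if (seen.add(o.toUrl)) {   // add() returns false for an already-seen URL
                unique.add(o);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<Outlink> links = new ArrayList<>();
        // Simulate the pathological pattern: the same 50 URLs repeated many times.
        for (int i = 0; i < 10_000; i++) {
            for (int j = 0; j < 50; j++) {
                links.add(new Outlink("http://example.com/page" + j, "anchor"));
            }
        }
        System.out.println(dedup(links).size());  // 50 distinct URLs remain
    }
}
```

The same idea works inside a reducer: only the set of seen keys has to stay in memory, not every duplicate record.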
However, I am curious how this is possible in the first place
and what we can do about it. This is obviously not caused by a
regular crawl, since we already deduplicate outlinks during parsing,
and this many outlinks would mean the page was over 15 MB, which
would exceed our limits.
Can anyone here offer a possible explanation of how an OutlinkDB can
make a mess of itself? Should we enforce uniqueness in the meantime?
On Tue, 10 Apr 2012 21:33:36 +0200, Markus Jelsma
<[email protected]> wrote:
Hi,
Recently a reducer got killed because of this. Increasing the heap
worked, but the next job some days later also failed. I looked at the
code and I cannot see why it would take more than 400 MB of RAM to
process the outlinks of a single record. We do limit outlinks, so the
HashSets for pages and domains are used. But we also limit the number
of outlinks per record in the parser to the default of 100, so I would
not expect the List and the two Sets in the reducer to use that much.
Also, URLs longer than about 400 characters are discarded anyway.
Any thoughts to share?
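A back-of-envelope estimate shows why the expected limits make the 400 MB figure surprising, and why half a million outlinks on one record would explain it. The per-entry overhead of 64 bytes is an assumption (actual JVM object and collection overhead varies); the 400-character URL cap and 100-outlink limit are from the numbers above.

```java
public class OutlinkMemoryEstimate {
    public static void main(String[] args) {
        // Rough per-outlink cost: 400 chars at 2 bytes each in a Java String,
        // plus an ASSUMED ~64 bytes of object/collection overhead per entry.
        long perOutlink = 400 * 2 + 64;

        // Expected case: parser caps outlinks at 100 per record.
        long expected = 100L * perOutlink;
        // Observed pathological case: ~500,000 outlinks on a single record.
        long observed = 500_000L * perOutlink;

        System.out.println(expected);  // 86400  (~86 KB: nowhere near 400 MB)
        System.out.println(observed);  // 432000000  (~432 MB: enough to kill the reducer)
    }
}
```

Under those assumptions, a well-behaved record is three orders of magnitude below the failure point, so the duplicate explosion alone accounts for the heap exhaustion.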
Thanks,
Markus
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350