It seems a single URL has about half a million outlinks connected to it in the OutlinkDB! A pattern of 50 URLs repeats 100,000 times! I can remedy this state by storing the links in a Set to enforce uniqueness, which lets the job finish.

However, I am curious how this is possible in the first place and what we can do about it. It is obviously not caused by a regular crawl: the parser already deduplicates outlinks, and this many outlinks would mean the page was over 15 MB, which would exceed our limits.

Can anyone offer a possible explanation of how an OutlinkDB can make a mess of itself? Should we enforce uniqueness in the meantime?
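For the stopgap, the reducer-side deduplication could look something like the sketch below. This is only an illustration: the class and field names are hypothetical stand-ins, not the actual OutlinkDb code, and it assumes outlinks arrive as plain URL strings.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of enforcing outlink uniqueness per record.
// OutlinkDedup and its method are hypothetical; the real reducer
// iterates over its own Outlink objects rather than Strings.
public class OutlinkDedup {

    // Collapses duplicate target URLs. LinkedHashSet keeps insertion
    // order, so the first occurrence of each link wins.
    public static Set<String> dedup(List<String> outlinkUrls) {
        Set<String> unique = new LinkedHashSet<>();
        for (String url : outlinkUrls) {
            // Mirror the existing ~400-character URL limit mentioned below.
            if (url.length() <= 400) {
                unique.add(url);
            }
        }
        return unique;
    }
}
```

With something like this in place, a pathological record where a pattern of 50 URLs repeats 100,000 times collapses to just 50 entries before being written back.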

On Tue, 10 Apr 2012 21:33:36 +0200, Markus Jelsma <[email protected]> wrote:
Hi,

Recently a reducer got killed because of this. Increasing the heap
helped, but the next job some days later also failed. I looked at the
code and I cannot see why it would take more than 400 MB of RAM to
process the outlinks of a single record. We do limit outlinks, so the
HashSets 'pages' and 'domains' are used. But we also limit the number
of outlinks per record in the parser to the default of 100, so I would
not expect the List and the two Sets in the reducer to use that much.
Also, URLs longer than about 400 characters are discarded anyway.

Any thoughts to share?

Thanks,
Markus