It seems a single URL has about half a million outlinks connected to it in the OutlinkDB! A pattern of 50 URLs repeats 100,000 times! I can remedy this state by storing the links in a Set to enforce uniqueness, which lets the job finish.

However, I am curious how this is possible in the first place and what we can do about it. It is obviously not caused by a regular crawl: the parser already deduplicates outlinks, and this many outlinks would mean the page was over 15 MB, which would exceed our limits.

Can anyone offer a possible explanation of how an OutlinkDB can make a mess of itself? Should we enforce uniqueness in the meantime?
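For the stopgap, the reducer-side deduplication could look something like the sketch below. This is only an illustration: the class and field names are hypothetical stand-ins, not the actual OutlinkDb code, and it assumes outlinks arrive as plain URL strings.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of enforcing outlink uniqueness per record.
// OutlinkDedup and its method are hypothetical; the real reducer
// iterates over its own Outlink objects rather than Strings.
public class OutlinkDedup {

    // Collapses duplicate target URLs. LinkedHashSet keeps insertion
    // order, so the first occurrence of each link wins.
    public static Set<String> dedup(List<String> outlinkUrls) {
        Set<String> unique = new LinkedHashSet<>();
        for (String url : outlinkUrls) {
            // Mirror the existing ~400-character URL limit mentioned below.
            if (url.length() <= 400) {
                unique.add(url);
            }
        }
        return unique;
    }
}
```

With something like this in place, a pathological record where a pattern of 50 URLs repeats 100,000 times collapses to just 50 entries before being written back.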

On Tue, 10 Apr 2012 21:33:36 +0200, Markus Jelsma <[email protected]> wrote:
Hi,

Recently a reducer got killed because of this. Increasing the heap
helped, but the next job some days later also failed. I looked at the
code and I cannot see why it would take more than 400 MB of RAM to
process the outlinks of a single record. We do limit outlinks, so the
HashSets 'pages' and 'domains' are used. But we also limit the number
of outlinks per record in the parser to the default of 100, so I would
not expect the List and the two Sets in the reducer to use that much.
Also, URLs longer than about 400 characters are discarded anyway.

Any thoughts to share?

Thanks,
Markus