Will provide a patch tomorrow.
https://issues.apache.org/jira/browse/NUTCH-1335
On Mon, 16 Apr 2012 20:19:46 +0200, Markus Jelsma
<[email protected]> wrote:
It seems a single URL has about half a million outlinks connected to
it in the OutlinkDB! A pattern of 50 URLs repeats 100,000 times! I
can remedy this state by storing the links in a Set to enforce
uniqueness, which lets the job finish.
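To illustrate the remedy: a minimal sketch of Set-based deduplication, keyed on the target URL. The `Outlink` class here is a hypothetical stand-in for Nutch's own outlink type, not the real API; the point is that `Set.add()` returning false filters the repeats in one pass.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for Nutch's outlink type: target URL plus anchor text.
class Outlink {
    final String toUrl;
    final String anchor;
    Outlink(String toUrl, String anchor) { this.toUrl = toUrl; this.anchor = anchor; }
}

public class DedupOutlinks {
    // Collapse duplicate outlinks by target URL, keeping first-seen order.
    static List<Outlink> dedup(List<Outlink> outlinks) {
        Set<String> seen = new LinkedHashSet<>();
        List<Outlink> unique = new ArrayList<>();
        for (Outlink o : outlinks) {
            if (seen.add(o.toUrl)) {   // add() returns false for an already-seen URL
                unique.add(o);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<Outlink> links = new ArrayList<>();
        // Simulate the pathological pattern: the same 50 URLs repeated many times.
        for (int i = 0; i < 10_000; i++) {
            for (int j = 0; j < 50; j++) {
                links.add(new Outlink("http://example.com/page" + j, "anchor"));
            }
        }
        System.out.println(dedup(links).size());  // 50 distinct URLs remain
    }
}
```

The same idea works inside a reducer: only the set of seen keys has to stay in memory, not every duplicate record.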
However, I am curious how this is possible in the first place
and what we can do about it. This is obviously not caused by a
regular crawl, since we already deduplicate outlinks during parsing,
and this many outlinks would mean the page was over 15 MB, which
would exceed our limits.
Can anyone here offer a possible explanation of how an OutlinkDB can
make a mess of itself? Should we enforce uniqueness in the meantime?
On Tue, 10 Apr 2012 21:33:36 +0200, Markus Jelsma
<[email protected]> wrote:
Hi,
Recently a reducer got killed because of this. Increasing the heap
worked, but the next job some days later also failed. I looked at the
code and I cannot see why it would take more than 400 MB of RAM to
process the outlinks of a single record. We do limit outlinks, so the
HashSets for pages and domains are used. But we also limit the number
of outlinks per record in the parser to the default of 100, so I would
not expect the List and the two Sets in the reducer to use that much.
Also, URLs longer than about 400 characters are discarded anyway.
Any thoughts to share?
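A back-of-envelope estimate shows why the expected limits make the 400 MB figure surprising, and why half a million outlinks on one record would explain it. The per-entry overhead of 64 bytes is an assumption (actual JVM object and collection overhead varies); the 400-character URL cap and 100-outlink limit are from the numbers above.

```java
public class OutlinkMemoryEstimate {
    public static void main(String[] args) {
        // Rough per-outlink cost: 400 chars at 2 bytes each in a Java String,
        // plus an ASSUMED ~64 bytes of object/collection overhead per entry.
        long perOutlink = 400 * 2 + 64;

        // Expected case: parser caps outlinks at 100 per record.
        long expected = 100L * perOutlink;
        // Observed pathological case: ~500,000 outlinks on a single record.
        long observed = 500_000L * perOutlink;

        System.out.println(expected);  // 86400  (~86 KB: nowhere near 400 MB)
        System.out.println(observed);  // 432000000  (~432 MB: enough to kill the reducer)
    }
}
```

Under those assumptions, a well-behaved record is three orders of magnitude below the failure point, so the duplicate explosion alone accounts for the heap exhaustion.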
Thanks,
Markus
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350