Hi,
Recently a reducer got killed because of this. Increasing the heap did work,
but the next job some days later failed as well. I looked at the code and I
cannot see why it would take more than 400 MB of RAM to process the
outlinks of a single record. We do limit outlinks, so the HashSets pages
and domains are used. But we also limit the number of outlinks per
record in the parser to the default of 100, so I would not expect the
List and both Sets in the reducer to use that much. Also, URLs
longer than about 400 characters are discarded anyway.
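For what it's worth, here is a rough back-of-envelope estimate of the worst case under the limits described above (100 outlinks per record, URLs capped at 400 characters). The per-object and per-entry overhead constants are assumptions, not measured values, but even generous numbers land far below 400 MB:

```java
// Rough worst-case memory estimate for one record's outlinks.
// Assumptions (not measured): ~2 bytes/char for a Java String,
// ~40 bytes of String object overhead, ~48 bytes per HashSet entry.
public class OutlinkMemoryEstimate {
    public static void main(String[] args) {
        long outlinks = 100;          // parser limit per record
        long urlChars = 400;          // longer URLs are discarded
        long bytesPerChar = 2;        // UTF-16 backing array
        long stringOverhead = 40;     // assumed object/header overhead
        long setEntryOverhead = 48;   // assumed HashSet entry overhead

        long perUrl = urlChars * bytesPerChar + stringOverhead + setEntryOverhead;
        // The List plus the two Sets each hold at most `outlinks` entries.
        long totalBytes = 3 * outlinks * perUrl;
        System.out.println(totalBytes); // prints 266400, i.e. ~260 KB
    }
}
```

So even if every structure were full of maximum-length URLs, the footprint should be in the hundreds of kilobytes, which is why the 400 MB figure is puzzling.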
Any thoughts to share?
Thanks,
Markus