I managed to work around the problem a while back. Just wanted to update
the forum with my work-around, in case anyone else is looking for a
solution.

Apparently memory was the root of the issue. I don't know the internals of
the parser yet; I have not looked at the code, but it seems to me that the
parser tries to spawn threads in proportion to the number of documents in
its queue. Again, I'm not sure whether that is 100% correct. I'm just
guessing based on the error message in hadoop.log.

The way I got it to work was to split the segments into smaller segments,
then fetch and parse each smaller segment one by one.

I split the segments using the slice option of bin/nutch mergesegs.
(source: http://wiki.apache.org/nutch/bin/nutch_mergesegs)
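For anyone who wants to script the loop, here is a rough sketch of what I mean
(Python, just shelling out to the Nutch command line). The paths, the slice
size and the exact flags are assumptions on my part, so please check the usage
output of bin/nutch mergesegs, fetch and parse for your Nutch version before
relying on it:

    #!/usr/bin/env python
    # Sketch of the work-around: slice one big segment into smaller ones,
    # then fetch and parse each slice in turn so the parser never has to
    # handle the whole batch at once. Paths and slice size are made up.
    import os
    import subprocess

    NUTCH = "bin/nutch"                        # path to the Nutch launcher (assumption)
    BIG_SEGMENT = "crawl/segments/SEGMENT_DIR" # hypothetical large segment
    SLICED_DIR = "crawl/segments_sliced"       # where the smaller segments go
    SLICE_SIZE = "10000"                       # max URLs per slice (assumption)

    # 1. Split the big segment using the segment merger's slice option.
    subprocess.check_call([NUTCH, "mergesegs", SLICED_DIR,
                           BIG_SEGMENT, "-slice", SLICE_SIZE])

    # 2. Fetch and parse each smaller segment one by one.
    for seg in sorted(os.listdir(SLICED_DIR)):
        seg_path = os.path.join(SLICED_DIR, seg)
        subprocess.check_call([NUTCH, "fetch", seg_path])
        subprocess.check_call([NUTCH, "parse", seg_path])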

I would love to take a closer look at the parser soon and come back with a
better answer. But for now, this works and gets the job done.


