I managed to work around the problem a while back. Just wanted to update the forum with my work-around, in case anyone else is looking for a solution.
Apparently memory was the root of the issue. I don't know the internals of the parser yet. I have not looked at the code, but it seems to me that the parser tries to spawn threads in proportion to the number of documents in its queue. Again, I'm not sure if I am 100% correct; I am just guessing this based on the error message in hadoop.log.

The way I got it to work was to split the segments into smaller segments, then fetch and parse each smaller segment one by one. I split the segments using the mergesegs command (source: http://wiki.apache.org/nutch/bin/nutch_mergesegs).

I would love to take a closer look at the parser soon and come back with a better answer. But for now, this works and gets the job done.

--
View this message in context: http://lucene.472066.n3.nabble.com/IOException-while-parsing-tp4123696p4123739.html
Sent from the Nutch - User mailing list archive at Nabble.com.
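For reference, the workaround looked roughly like this (a sketch only; the directory names and the slice size are assumptions you'd tune for your crawl and heap, and it assumes the stock bin/nutch mergesegs, fetch, and parse commands):

```shell
#!/bin/sh
# Hypothetical paths -- adjust to your own crawl layout.
CRAWL=crawl
SLICE_SIZE=50000   # max URLs per output slice; pick to fit your heap

# Split the oversized segment(s) into smaller slices.
bin/nutch mergesegs "$CRAWL/segments_sliced" -dir "$CRAWL/segments" -slice "$SLICE_SIZE"

# Fetch and parse each slice one at a time, so the parser's
# per-segment document queue (and thread count) stays small.
for seg in "$CRAWL/segments_sliced"/*; do
  bin/nutch fetch "$seg"
  bin/nutch parse "$seg"
done
```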

