If I take a certain crawl_generate input and run fetch on it, this always happens on the same URL. The URL points to a 300 MB .txt file, but other large files are fetched fine without this problem. Also, if I open the file myself with Notepad++ it is not corrupted; it opens fine.
I managed to debug it on the cluster and got to the specific point where the exception is thrown. It seems that the stream suddenly ends, and I can't understand why. There is no disk space problem, and the datanodes seem healthy.

Ferdy Galema wrote
> Hi,
>
> This most certainly has something to do with data corruption. When all your
> mappers succeed and the error is in reducer code (looks like your case), it
> may be mapreduce intermediate output that cannot be read successfully. Is
> there enough free space on your configured mapred.local.dir devices? Are
> they healthy? Perhaps try to run some sanity check jobs. (Something like
> Hadoop teragen and terasort).
>
> Ferdy.
>
> On Sun, Jul 22, 2012 at 3:56 PM, nutch.buddy@ <nutch.buddy@> wrote:
>
>> I'm still stuck with this problem.
>>
>> I now know that the NegativeArraySizeException is not related.
>>
>> During the fetch process, at 100% map / 98% reduce, I get the following exceptions:
>>
>> java.lang.RuntimeException: problem advancing post rec#9618
>>     at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1182)
>>     at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:246)
>>     at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:242)
>>     at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:40)
>>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:469)
>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>>     ...
>> Caused by: java.io.EOFException
>>     at java.io.DataInputStream.readByte(DataInputStream.java:250)
>>     at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
>>     at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
>>     at org.apache.hadoop.io.Text.readFields(Text.java:282)
>>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:73)
>>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:44)
>>     at org.apache.hadoop.mapred.Task$ValuesIterator.readNextKey(Task.java:1225)
>>     at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1180)
>>
>> I use an 8-node cluster with CDH4.
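
In case it helps anyone else hitting this, the sanity checks Ferdy suggests (free space on the mapred.local.dir devices, plus a teragen/terasort run) can be done roughly as below. The config and examples-jar paths are guesses for a CDH4 MRv1 package install, and the row count is arbitrary, so adjust both to your setup:

    # paths below assume a CDH4 MRv1 package layout; adjust to your install
    # find the configured local dirs, then check free space on their devices
    grep -A1 mapred.local.dir /etc/hadoop/conf/mapred-site.xml
    df -h /path/to/mapred/local/dir

    # generate ~1 GB of synthetic rows (100 bytes each), sort them, validate the output
    hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar teragen 10000000 /tmp/teragen-out
    hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar terasort /tmp/teragen-out /tmp/terasort-out
    hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar teravalidate /tmp/terasort-out /tmp/teravalidate-out

If terasort dies with the same kind of EOFException / "problem advancing" error in its reduce phase, that points at the intermediate-output path (local disks or shuffle) rather than at Nutch itself.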

