Hi,

Have you tried the mapreduce mailing list? This really looks like a Hadoop-specific error. (Note that crawl_generate is really just a sequence file.) What about using a different Hadoop version? It might be that CDH4 is simply incompatible with recent versions of Nutch; I know that CDH3 works, and CDH4 is a major upgrade with lots of changes.
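If you want to rule out a corrupt segment quickly, crawl_generate can be read with a plain SequenceFile.Reader. The following is only a rough sketch: it assumes the CDH3-era Hadoop 1.x API and the stock Text key / CrawlDatum value classes that Nutch writes, and the path comment and the 4096-byte threshold are made up for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;

    public class DumpCrawlGenerate {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // e.g. segment/crawl_generate/part-00000 (hypothetical path)
        Path part = new Path(args[0]);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
        Text key = new Text();
        CrawlDatum value = new CrawlDatum();
        long n = 0;
        try {
          while (reader.next(key, value)) {
            n++;
            // A corrupt record usually shows up as an absurdly long key,
            // or as an exception thrown right here while reading.
            if (key.getLength() > 4096) {
              System.out.println("suspicious key at record " + n
                  + ": length=" + key.getLength());
            }
          }
          System.out.println("read " + n + " records cleanly");
        } finally {
          reader.close();
        }
      }
    }

If this reader blows up partway through the file, that points at the segment data rather than at the Hadoop version.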
Because you can reproduce the error, you could also try executing the job with a debug session attached to the reducer. Include the Hadoop sources in your debugger so you can actually see what is happening; there are plenty of resources explaining how to debug a MapReduce task. My guess is that crawl_generate was created with a corrupted entry. A single URL from the crawldb can cause this. The int overflow you describe in readNextKey() would fit that picture; there is a small arithmetic illustration after the quoted message below.

Ferdy.

On Tue, Jul 31, 2012 at 10:53 AM, [email protected] <[email protected]> wrote:
> I have new information.
> It seems that in Task$ValueIterator.java, in the method readNextKey(),
> there is a call to keyIn.reset(...). In there it does
>   count = start + length,
> where 'start' gets the value of nextKeyBytes.getPosition() and 'length' gets
> the value of nextKeyBytes.getLength().
>
> The sum exceeds the integer limit, so count wraps to a negative number,
> which then causes an EOFException to be thrown.
>
> Any input from anyone regarding this new info?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/NegativeArraySizeException-and-problem-advancing-port-rec-during-fetching-tp3994633p3998304.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
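For what it is worth, the wrap-around described in the quoted message is plain 32-bit int overflow. A tiny illustration with made-up values (only the arithmetic matters; the variable names just mirror the description of readNextKey()):

    public class OverflowDemo {
      public static void main(String[] args) {
        // start = nextKeyBytes.getPosition(), length = nextKeyBytes.getLength()
        int start = 2000000000;      // hypothetical position near Integer.MAX_VALUE
        int length = 200000000;      // hypothetical key length
        int count = start + length;  // 2200000000 does not fit in a 32-bit int
        System.out.println(count);   // prints -2094967296
      }
    }

A negative count at that point would line up with the EOFException described above.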

