Hi

I am in the process of upgrading from Nutch 2.2.1 to Nutch 2.3-SNAPSHOT:

I have upgraded HBase from 0.90.4 to 0.94.13 and can scan all of the
pre-existing tables through the HBase shell. If I inject new URLs into a
new crawl table, everything works fine. However, when running a job (e.g.
FetcherJob) against the pre-existing tables, I encounter the following
exception from GoraRecordReader, which prevents FetcherMapper from
running:

java.io.EOFException
        at org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:473)
        at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:128)
        at org.apache.avro.io.ValidatingDecoder.readInt(ValidatingDecoder.java:83)
        at org.apache.avro.generic.GenericDatumReader.readInt(GenericDatumReader.java:376)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:156)
        at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
        at org.apache.gora.hbase.util.HBaseByteInterface.fromBytes(HBaseByteInterface.java:145)
        at org.apache.gora.hbase.util.HBaseByteInterface.fromBytes(HBaseByteInterface.java:114)
        at org.apache.gora.hbase.store.HBaseStore.setField(HBaseStore.java:713)
        at org.apache.gora.hbase.store.HBaseStore.setField(HBaseStore.java:679)
        at org.apache.gora.hbase.store.HBaseStore.setField(HBaseStore.java:644)
        at org.apache.gora.hbase.store.HBaseStore.newInstance(HBaseStore.java:625)
        at org.apache.gora.hbase.query.HBaseResult.readNext(HBaseResult.java:48)
        at org.apache.gora.hbase.query.HBaseScannerResult.nextInner(HBaseScannerResult.java:54)
        at org.apache.gora.query.impl.ResultBase.next(ResultBase.java:114)
        at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:119)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:531)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

As I said, working against a new table is fine; it is only against the
existing data (crawlIds) that the job fails. There seems to be something
about the data that Avro doesn't like. HBase itself seems fine, as I can
scan the tables and read the data directly.
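My working hypothesis (unconfirmed) is that the Avro schema Gora uses for WebPage changed between the Nutch versions, so the new reader schema expects more fields than the old rows actually contain, and BinaryDecoder.ensureBounds runs out of bytes mid-record. A toy sketch of that mechanism, in plain Python rather than the actual Gora/Avro code, using Avro-style zig-zag varints:

```python
import io

def write_varint(buf, n):
    """Write a zig-zag varint, as Avro's BinaryEncoder does for ints."""
    n = (n << 1) ^ (n >> 31)  # zig-zag encode the signed value
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            buf.write(bytes([b | 0x80]))  # continuation bit set
        else:
            buf.write(bytes([b]))
            return

def read_varint(buf):
    """Read a zig-zag varint; raise EOFError when bytes run out,
    mirroring BinaryDecoder.ensureBounds throwing java.io.EOFException."""
    shift, result = 0, 0
    while True:
        chunk = buf.read(1)
        if not chunk:
            raise EOFError("ran out of bytes mid-record")
        b = chunk[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)  # zig-zag decode

# Writer side (old schema): record serialized with only 2 int fields.
out = io.BytesIO()
write_varint(out, 7)
write_varint(out, 42)

# Reader side (new schema): expects 3 int fields -> EOF on the third.
data = io.BytesIO(out.getvalue())
fields = [read_varint(data), read_varint(data)]  # decodes fine: [7, 42]
eof_hit = False
try:
    fields.append(read_varint(data))  # third field was never written
except EOFError:
    eof_hit = True  # the same failure Gora surfaces from deep in Avro
```

If that is the cause, the old rows would presumably need to be re-read with the old (writer) schema or re-crawled, since the bytes in HBase simply don't carry the fields the new schema asks for.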

Any ideas?


Az
