Hello,

In the past I had an awesome experience with Nutch. About 8 years ago I ran
a process where I checked out each project in our SVN repo and ran
doxygen/javadoc on them, then unleashed Nutch on the output and set up a
searchable front end.

I am doing a video course, '10 Hadoopable Problems', and I want to pay some
tribute to Nutch by including a section on it.

Nutch 2.3.1, I have to say, I do not even understand it as a release.

First, I attempted to use the HBase Gora integration. I have a pretty
recent HBase. I noticed Nutch 2.3.1 has Gora 0.6.1 as a dep, so I checked
that out. Gora has an assembly target, so I ran that. It really did not
seem to make an assembly jar, so I spent about half an hour dragging HBase
jars around to deal with class-not-found errors. Finally I got one of those
darn errors like NoSuchMethod SetColumnFamily(string, string), which meant
this was a dead effort, because now I would also have to go install an
HBase to match Gora, and I felt that was a big time suck...
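(For anyone following along: the standard Nutch 2.x setup, as best I
remember it, is to point Gora at HBaseStore in conf/gora.properties and put
your hbase-site.xml on the classpath; the client and server versions still
have to line up, which is exactly where I got stuck.)

```
# conf/gora.properties -- select the HBase backend
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
```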

So, on to the Cassandra integration... Same process: hunting down hector,
cassandra-all, and thrift jars, because the assembly does not really
assemble them.

Ran a crawl; it failed at 33% with a super column problem... WTF, super
columns...?? Anyway, I went into Jira, and apparently this thing has never
worked... and there is maybe a new integration using CQL, but that is not
in Gora 0.6.1...
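(If I recall the property names right, the hector-based store in Gora 0.6.1
is wired up roughly like this in conf/gora.properties; the cluster and
keyspace names here are just placeholders, so treat this as a sketch:)

```
# conf/gora.properties -- hector-based Cassandra backend (pre-CQL)
gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
gora.cassandrastore.servers=localhost:9160
gora.cassandrastore.cluster=Test Cluster
gora.cassandrastore.keyspace=webpage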

So I figured, let me use the FILE support. Like, that is the bare
minimum... It has to work, right?
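(That is, switching gora.properties over to the plain Avro file store --
the log below confirms AvroStore gets picked up. The codec and path
properties here are from memory, so treat them as assumptions:)

```
# conf/gora.properties -- file-backed Avro store
gora.datastore.default=org.apache.gora.avro.store.AvroStore
gora.avrostore.codec.type=JSON
gora.avrostore.output.path=file:///tmp/nutch-avro
```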


/Users/ecapriolo/Downloads/apache-nutch-2.3.1/dist/apache-nutch-2.3.1-bin/bin/nutch
inject cdir -crawlId e
InjectorJob: starting at 2017-06-03 15:17:23
InjectorJob: Injecting urlDir: cdir
InjectorJob: Using class org.apache.gora.avro.store.AvroStore as the Gora
storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and
filtering: 2
Injector: finished at 2017-06-03 15:17:25, elapsed: 00:00:01
Sat Jun 3 15:17:25 EDT 2017 : Iteration 1 of 2
Generating batchId
Generating a new fetchlist
/Users/ecapriolo/Downloads/apache-nutch-2.3.1/dist/apache-nutch-2.3.1-bin/bin/nutch
generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
-crawlId e -batchId 1496517445-7390
GeneratorJob: starting at 2017-06-03 15:17:26
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2017-06-03 15:17:28, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1496517445-7390 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now


2017-06-03 15:17:24,847 ERROR store.AvroStore -

java.nio.channels.ClosedChannelException
        at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.checkClosed(ChecksumFileSystem.java:417)
        at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:98)
        at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at
org.codehaus.jackson.impl.Utf8Generator._flushBuffer(Utf8Generator.java:1754)
        at
org.codehaus.jackson.impl.Utf8Generator.flush(Utf8Generator.java:1088)
        at org.apache.avro.io.JsonEncoder.flush(JsonEncoder.java:73)
        at org.apache.gora.avro.store.AvroStore.close(AvroStore.java:119)
        at
org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:56)
        at
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:647)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
        at
org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
        at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
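For what it's worth, reading the trace bottom-up: GoraRecordWriter.close()
calls AvroStore.close(), which flushes its JsonEncoder into a Hadoop output
stream whose underlying channel is already closed -- a write-after-close.
A minimal standalone sketch (plain NIO, nothing Gora-specific, hypothetical
temp file) reproduces the same exception type:

```java
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ClosedChannelDemo {

    // Write through a FileChannel that has already been closed, mimicking
    // AvroStore flushing into an already-closed stream during close().
    static String writeAfterClose() throws Exception {
        Path tmp = Files.createTempFile("demo", ".bin");
        try {
            FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE);
            ch.close(); // first close succeeds, like the stream under AvroStore
            ch.write(ByteBuffer.wrap(new byte[]{1})); // late flush/write fails
            return "no exception";
        } catch (ClosedChannelException e) {
            return "ClosedChannelException";
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(writeAfterClose());
    }
}
```

So it smells like a double-close somewhere between the record writer and
the store, rather than anything wrong with my URLs.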


I just do not get the entire 2.3.1 release. It is very frustrating. The
web UIs tend to serve blank pages with no stack traces. It's unclear why
backends that do not work are even documented. How can even the file/Avro
support not work? Am I on crazy pills?

Thanks,
Edward
