Don't know about the pills, but pills should be good quality. If you just want 
to crawl, I would, certainly after reading this, recommend 1.x. Fewer 
moving parts, less fuss. 

Markus 
 
-----Original message-----
> From:Edward Capriolo <[email protected]>
> Sent: Saturday 3rd June 2017 21:27
> To: [email protected]
> Subject: What up with 2.3.1 ?
> 
> Hello,
> 
> In the past I had an awesome experience with Nutch. About 8 years ago I ran
> a process where I checked out each project in our SVN repo, ran
> doxygen/javadoc on them, then unleashed Nutch on the results and set up a
> searchable front end.
> 
> I am doing a video course, '10 Hadoop-able Problems', and I want to pay some
> tribute to Nutch by including a section on it.
> 
> I have to say, I do not even understand Nutch 2.3.1 as a release.
> 
> First, I attempted to use the HBase Gora integration. I have a pretty
> recent HBase. I noticed Nutch 2.3.1 has Gora 0.6.1 as a dep, so I checked
> that out. Gora has an assembly target, so I ran that. It really did not
> seem to make an assembly jar, so I spent about half an hour dragging HBase
> jars around to deal with class-not-found errors. Finally I got one of those
> darn errors like NoSuchMethod SetColumnFamily(string, string), which meant
> this was a dead effort, because now I would also have to go install an
> HBase to match Gora, and I felt that was a big time suck...
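(Editorially, for anyone hitting the same wall: as far as I can tell from the Nutch 2.x tutorial, the intended route is to enable the backend inside the Nutch build rather than assemble Gora separately. The snippet below is a sketch from that tutorial, not something verified here; the HBase version pairing is the one the docs name, and a newer HBase would explain the NoSuchMethodError.)

```
# ivy/ivy.xml -- uncomment the gora-hbase dependency:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6.1" conf="*->default" />

# conf/gora.properties -- select the HBase store:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

# then rebuild so ivy pulls a consistent jar set instead of
# hand-copying hbase jars:
#   ant runtime
# The 2.x docs pair Gora 0.6.1 with HBase 0.98.x.
```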
> 
> So on to the Cassandra integration.... Same process... finding Hector,
> cassandra-all, and Thrift jars, because the assembly does not really
> assemble them.
> 
> Run a crawl, fail @ 33% with a problem with super columns.... WTF, super
> columns....?? Anyway, so I go into Jira; apparently this thing has never
> worked... and there is maybe a new backend using CQL, but that is not in
> Gora 0.6.1...
> 
> So I figure let me use the FILE support. Like that is the bare
> minimum.... It has to work, right?
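(For reference, the AvroStore run below would have been selected via conf/gora.properties. Property names here are from my reading of the Gora 0.6 docs and the paths are placeholders, so treat this as a sketch, not a verified config; JSON codec matches the JsonEncoder in the trace further down.)

```
# conf/gora.properties -- file-backed store
gora.datastore.default=org.apache.gora.avro.store.AvroStore
gora.avrostore.codec.type=JSON
gora.avrostore.input.path=file:///tmp/gora-avrostore
gora.avrostore.output.path=file:///tmp/gora-avrostore
```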
> 
> 
> /Users/ecapriolo/Downloads/apache-nutch-2.3.1/dist/apache-nutch-2.3.1-bin/bin/nutch
> inject cdir -crawlId e
> InjectorJob: starting at 2017-06-03 15:17:23
> InjectorJob: Injecting urlDir: cdir
> InjectorJob: Using class org.apache.gora.avro.store.AvroStore as the Gora
> storage class.
> InjectorJob: total number of urls rejected by filters: 0
> InjectorJob: total number of urls injected after normalization and
> filtering: 2
> Injector: finished at 2017-06-03 15:17:25, elapsed: 00:00:01
> Sat Jun 3 15:17:25 EDT 2017 : Iteration 1 of 2
> Generating batchId
> Generating a new fetchlist
> /Users/ecapriolo/Downloads/apache-nutch-2.3.1/dist/apache-nutch-2.3.1-bin/bin/nutch
> generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapred.map.tasks.speculative.execution=false -D
> mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
> -crawlId e -batchId 1496517445-7390
> GeneratorJob: starting at 2017-06-03 15:17:26
> GeneratorJob: Selecting best-scoring urls due for fetch.
> GeneratorJob: starting
> GeneratorJob: filtering: false
> GeneratorJob: normalizing: false
> GeneratorJob: topN: 50000
> GeneratorJob: finished at 2017-06-03 15:17:28, time elapsed: 00:00:02
> GeneratorJob: generated batch id: 1496517445-7390 containing 0 URLs
> Generate returned 1 (no new segments created)
> Escaping loop: no more URLs to fetch now
> 
> 
> 2017-06-03 15:17:24,847 ERROR store.AvroStore -
> 
> java.nio.channels.ClosedChannelException
>         at
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.checkClosed(ChecksumFileSystem.java:417)
>         at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:98)
>         at
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
>         at java.io.DataOutputStream.write(DataOutputStream.java:107)
>         at
> org.codehaus.jackson.impl.Utf8Generator._flushBuffer(Utf8Generator.java:1754)
>         at
> org.codehaus.jackson.impl.Utf8Generator.flush(Utf8Generator.java:1088)
>         at org.apache.avro.io.JsonEncoder.flush(JsonEncoder.java:73)
>         at org.apache.gora.avro.store.AvroStore.close(AvroStore.java:119)
>         at
> org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:56)
>         at
> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:647)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>         at
> org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
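(The trace reads like a close-ordering bug: AvroStore.close() flushes its JsonEncoder into a Hadoop checksum stream that is already closed. A minimal sketch of that failure mode, using plain java.io stand-ins rather than the Gora/Hadoop classes, with hypothetical names:)

```java
import java.io.*;

public class CloseOrderDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Underlying stream that refuses writes after close, like
        // ChecksumFSOutputSummer.checkClosed() in the trace above.
        OutputStream checked = new FilterOutputStream(sink) {
            boolean closed = false;
            @Override public void write(int b) throws IOException {
                if (closed) throw new IOException("already closed");
                super.write(b);
            }
            @Override public void close() throws IOException {
                closed = true;
                super.close();
            }
        };
        // A buffering layer on top, standing in for the JsonEncoder.
        BufferedWriter w = new BufferedWriter(new OutputStreamWriter(checked));
        w.write("buffered data");   // stays in the buffer, not yet written
        checked.close();            // framework closes the inner stream first
        try {
            w.flush();              // outer layer flushes afterwards -> fails
            System.out.println("flush succeeded");
        } catch (IOException e) {
            System.out.println("flush failed: " + e.getMessage());
        }
    }
}
```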
> 
> 
> I just do not get the entire 2.3.1 release. It is very frustrating. The
> web UIs tend to serve blank pages with no stack traces. It's unclear why
> backends that do not work are even documented. How can even the file/avro
> support not work? Am I on crazy pills?
> 
> Thanks,
> Edward
> 
