Don't know about the pills, but they should be good quality. If you just want to crawl, I would, most certainly after reading this, recommend 1.x. Fewer moving parts, less fuss.
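For what it's worth, a basic 1.x crawl is just a handful of commands against a local crawldb, along the lines of the 1.x tutorial. A rough sketch (the seed URL and all paths here are illustrative, and flags like -topN can differ slightly between 1.x releases):

```shell
# Assumes a Nutch 1.x binary distribution unpacked in the current
# directory. Seed URL and directory names are illustrative only.
mkdir -p urls
echo 'http://nutch.apache.org/' > urls/seed.txt

bin/nutch inject crawl/crawldb urls                    # seed the crawldb
bin/nutch generate crawl/crawldb crawl/segments -topN 50
segment=$(ls -d crawl/segments/* | tail -1)            # newest segment
bin/nutch fetch "$segment"
bin/nutch parse "$segment"
bin/nutch updatedb crawl/crawldb "$segment"            # feed new links back
```

Repeat generate/fetch/parse/updatedb for more rounds. No Gora, no external store: the crawldb and segments are plain directories on local disk or HDFS, which is why there are fewer moving parts to go wrong.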
Markus

-----Original message-----
> From: Edward Capriolo <[email protected]>
> Sent: Saturday 3rd June 2017 21:27
> To: [email protected]
> Subject: What up with 2.3.1 ?
>
> Hello,
>
> In the past I had an awesome experience with Nutch. About 8 years ago I ran
> a process where I checked out each project in our SVN repo, ran
> doxygen/javadoc on them, then unleashed Nutch on them and set up a
> searchable front end.
>
> I am doing a video course '10 Hadoopable problems' and I want to pay some
> tribute to Nutch by including a section on it.
>
> Nutch 2.3.1, I have to say, I do not even understand it as a release.
>
> First, I attempted to use the HBase Gora integration. I have a pretty
> recent HBase. I noticed Nutch 2.3.1 has Gora 0.6.1 as a dep, so I checked
> that out. Gora has an assembly target, so I ran that. It really did not
> seem to make an assembly jar, so I spent about half an hour dragging HBase
> jars in to deal with class-not-found errors. Finally I got one of those
> darn errors like NoSuchMethod SetColumnFamily(string, string), which meant
> this was a dead effort, because now I would also have to go install an
> HBase to match Gora, and I felt that was a big time suck...
>
> So on to the Cassandra integration... Same process... finding hector,
> cassandra-all, thrift, because the assembly does not really assemble them.
>
> Run a crawl, fail at 33% with a problem with super columns... WTF,
> supercolumns...?? Anyway, I go into Jira, and apparently this thing has
> never worked... and there is maybe a new one using CQL, but that is not in
> Gora 0.6.1...
>
> So I figure let me use the FILE support. That is the bare
> minimum... It has to work, right?
>
> /Users/ecapriolo/Downloads/apache-nutch-2.3.1/dist/apache-nutch-2.3.1-bin/bin/nutch
> inject cdir -crawlId e
> InjectorJob: starting at 2017-06-03 15:17:23
> InjectorJob: Injecting urlDir: cdir
> InjectorJob: Using class org.apache.gora.avro.store.AvroStore as the Gora
> storage class.
> InjectorJob: total number of urls rejected by filters: 0
> InjectorJob: total number of urls injected after normalization and
> filtering: 2
> Injector: finished at 2017-06-03 15:17:25, elapsed: 00:00:01
> Sat Jun 3 15:17:25 EDT 2017 : Iteration 1 of 2
> Generating batchId
> Generating a new fetchlist
> /Users/ecapriolo/Downloads/apache-nutch-2.3.1/dist/apache-nutch-2.3.1-bin/bin/nutch
> generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapred.map.tasks.speculative.execution=false -D
> mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
> -crawlId e -batchId 1496517445-7390
> GeneratorJob: starting at 2017-06-03 15:17:26
> GeneratorJob: Selecting best-scoring urls due for fetch.
> GeneratorJob: starting
> GeneratorJob: filtering: false
> GeneratorJob: normalizing: false
> GeneratorJob: topN: 50000
> GeneratorJob: finished at 2017-06-03 15:17:28, time elapsed: 00:00:02
> GeneratorJob: generated batch id: 1496517445-7390 containing 0 URLs
> Generate returned 1 (no new segments created)
> Escaping loop: no more URLs to fetch now
>
> 2017-06-03 15:17:24,847 ERROR store.AvroStore -
> java.nio.channels.ClosedChannelException
>     at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.checkClosed(ChecksumFileSystem.java:417)
>     at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:98)
>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
>     at java.io.DataOutputStream.write(DataOutputStream.java:107)
>     at org.codehaus.jackson.impl.Utf8Generator._flushBuffer(Utf8Generator.java:1754)
>     at org.codehaus.jackson.impl.Utf8Generator.flush(Utf8Generator.java:1088)
>     at org.apache.avro.io.JsonEncoder.flush(JsonEncoder.java:73)
>     at org.apache.gora.avro.store.AvroStore.close(AvroStore.java:119)
>     at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:56)
>     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:647)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>
> I just do not get the entire 2.3.1 release. It is very frustrating. The
> web UIs tend to serve blank pages with no stack traces. It is unclear why
> backends that do not work are even documented. How can even the file/avro
> support not work? Am I on crazy pills?
>
> Thanks,
> Edward

