RE: What up with 2.3.1 ?
Forwarding with correct thread name.

-- Forwarded message --
From: lewis john mcgibbney
Date: Mon, Jun 5, 2017 at 2:50 PM
Subject: Re: user Digest 3 Jun 2017 19:27:20 - Issue 2758
To: "user@nutch.apache.org"

Hi Ed,

Disappointing to hear that this really got under your skin... it is never nice to hear that frustration becomes the outcome rather than successfully running the software. I've provided comments below.

On Sat, Jun 3, 2017 at 12:27 PM, wrote:

> From: Edward Capriolo
> To: "user@nutch.apache.org"
> Cc:
> Bcc:
> Date: Sat, 3 Jun 2017 15:27:06 -0400
> Subject: What up with 2.3.1 ?
>
> Nutch 2.3.1, I have to say, I do not even understand it as a release.

This could be understood... as a previous (historical) user of the Nutch 1.X series, you seem to have prior expectations which are/were based on a simplified technology stack. Nutch 2.X is aimed at using a different stack and focuses on use of more modern storage solutions, as you've found out. It has never really been touted as the go-to Nutch branch... you will notice that Nutch 1.X is the mainstream (master) branch. You'll also see that over a number of years the message has been consistent... Nutch 1.X is the go-to software both for users of source and release artifacts.

> First, I attempted to ...

If you want to use Nutch 2.3.1 with HBase, you should use the backend datastore versions stated in the release announcement. They are as follows:

    Apache Avro 1.7.6
    Apache Hadoop 1.2.1 and 2.5.2
    Apache HBase 0.98.8-hadoop2 (although also tested with 1.X)
    Apache Cassandra 2.0.2
    Apache Solr 4.10.3
    MongoDB 2.6.X
    Apache Accumulo 1.5.1
    Apache Spark 1.4.1

I've tried my best, alongside several others over at the Gora community, to ensure all of these datastores are documented over at http://gora.apache.org/current/index.html#gora-modules. It should be noted that since then, the Gora master branch contains datastore version upgrades for nearly every datastore.

> I just do not get the entire 2.3.1 release. It is very frustrating.

Yes, as I said, it is disappointing to see that you struggled so much with this. I've made best efforts to ensure our Nutch2 tutorial is up-to-date: https://wiki.apache.org/nutch/Nutch2Tutorial

> The webui's tend to fire blank pages with no stack traces.

Please feel free to log issues... if it is broken then we can try to fix it. Without a Jira issue or debug information we don't know it is broken.

> It's unclear why backends that do not work are even documented.

HBase is most widely used, followed by MongoDB... on the other end of the spectrum, Cassandra is least used and broken. It has not been maintained for quite some time... and yes, this is reflected by the use of Super Columns. We are currently re-writing the backend as part of a GSoC project.

> How can even the file/avro support not even work?

Please log your issue(s) in Jira and I can try to reproduce it using the 2.x branch. I do not use this backend when I deploy 2.X. I was not aware that it was broken.

Lewis

--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney
RE: What up with 2.3.1 ?
Don't know about the pills, but pills should be good quality. If you just want to crawl about, I would, most certainly after reading this, recommend 1.x. Less moving parts, less fuss.

Markus

-Original message-
> From: Edward Capriolo
> Sent: Saturday 3rd June 2017 21:27
> To: user@nutch.apache.org
> Subject: What up with 2.3.1 ?
>
> Hello,
>
> In the past I had an awesome experience with nutch. About 8 years ago I ran
> a process where I checked out each process in our SVN repo, ran
> doxygen/javadoc on them. Then unleashed nutch on them and set up a
> searchable front end.
>
> I am doing a video course '10 Hadoop able problems' and I want to pay some
> tribute to nutch by including a section on it.
>
> Nutch 2.3.1, I have to say, I do not even understand it as a release.
>
> First, I attempted to use the hbase gora integration. I have a pretty
> recent hbase. I notice nutch 2.3.1 has gora 0.6.1 as a dep so I checked that
> out. Gora has an assembly target so I ran that. It really did not seem to
> make an assembly jar, so I spent about a half hour dragging hbase jars to
> deal with class not found errors. Finally I got one of those darn errors
> like: NoSuchMethod SetColumnFamily(string, string) which meant this was a
> dead effort because now I would also have to go install an hbase to match
> gora and I felt that was a big time suck...
>
> So onto the Cassandra integration. Same process... finding hector,
> Cassandra-all, thrift, because the assembly does not really assemble them.
>
> Run a crawl, fail @ 33%, problem with super columns. WTF
> supercolumns?? Anyway, so I go in Jira; apparently this thing
> never worked... and there is maybe a new one using CQL but that is not in
> gora 0.6.1...
>
> So I figure let me use the FILE support. Like that is the bare
> minimum. It has to work, right?
>
> /Users/ecapriolo/Downloads/apache-nutch-2.3.1/dist/apache-nutch-2.3.1-bin/bin/nutch inject cdir -crawlId e
> InjectorJob: starting at 2017-06-03 15:17:23
> InjectorJob: Injecting urlDir: cdir
> InjectorJob: Using class org.apache.gora.avro.store.AvroStore as the Gora storage class.
> InjectorJob: total number of urls rejected by filters: 0
> InjectorJob: total number of urls injected after normalization and filtering: 2
> Injector: finished at 2017-06-03 15:17:25, elapsed: 00:00:01
> Sat Jun 3 15:17:25 EDT 2017 : Iteration 1 of 2
> Generating batchId
> Generating a new fetchlist
> /Users/ecapriolo/Downloads/apache-nutch-2.3.1/dist/apache-nutch-2.3.1-bin/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 5 -noNorm -noFilter -adddays 0 -crawlId e -batchId 1496517445-7390
> GeneratorJob: starting at 2017-06-03 15:17:26
> GeneratorJob: Selecting best-scoring urls due for fetch.
> GeneratorJob: starting
> GeneratorJob: filtering: false
> GeneratorJob: normalizing: false
> GeneratorJob: topN: 5
> GeneratorJob: finished at 2017-06-03 15:17:28, time elapsed: 00:00:02
> GeneratorJob: generated batch id: 1496517445-7390 containing 0 URLs
> Generate returned 1 (no new segments created)
> Escaping loop: no more URLs to fetch now
>
> 2017-06-03 15:17:24,847 ERROR store.AvroStore -
>
> java.nio.channels.ClosedChannelException
>     at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.checkClosed(ChecksumFileSystem.java:417)
>     at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:98)
>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
>     at java.io.DataOutputStream.write(DataOutputStream.java:107)
>     at org.codehaus.jackson.impl.Utf8Generator._flushBuffer(Utf8Generator.java:1754)
>     at org.codehaus.jackson.impl.Utf8Generator.flush(Utf8Generator.java:1088)
>     at org.apache.avro.io.JsonEncoder.flush(JsonEncoder.java:73)
>     at org.apache.gora.avro.store.AvroStore.close(AvroStore.java:119)
>     at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:56)
>     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:647)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>
> I just do not get the entire 2.3.1