RE: What up with 2.3.1 ?

2017-06-05 Thread lewis john mcgibbney
Forwarding with correct thread name.

-- Forwarded message --
From: lewis john mcgibbney 
Date: Mon, Jun 5, 2017 at 2:50 PM
Subject: Re: user Digest 3 Jun 2017 19:27:20 - Issue 2758
To: "user@nutch.apache.org" 


Hi Ed,
Disappointing to hear that this really got under your skin... never nice to
hear that frustration becomes the outcome rather than successfully running
the software. I've provided comments below.

On Sat, Jun 3, 2017 at 12:27 PM,  wrote:

>
> From: Edward Capriolo 
> To: "user@nutch.apache.org" 
> Cc:
> Bcc:
> Date: Sat, 3 Jun 2017 15:27:06 -0400
> Subject: What up with 2.3.1 ?
>
> Nutch 2.3.1, I have to say, I do not even understand it as a release.
>

This could be understood... as a previous (historical) user of the Nutch
1.X series... you seem to have prior expectations which are/were based on a
simplified technology stack. Nutch 2.X is aimed at using a different stack
and focuses on use of more modern storage solutions as you've found out. It
has never really been touted as the go-to Nutch branch... you will notice
that Nutch 1.X is the mainstream (master) branch. You'll also see that,
over a number of years, the message has been consistent... Nutch 1.X is the
go-to software both for users of source and release artifacts.


>
> First, I attempted to ...



If you want to use Nutch 2.3.1 with HBase, you should use the backend
datastore versions listed in the release announcement. They are as
follows:

Apache Avro 1.7.6
Apache Hadoop 1.2.1 and 2.5.2
Apache HBase 0.98.8-hadoop2 (although also tested with 1.X)
Apache Cassandra 2.0.2
Apache Solr 4.10.3
MongoDB 2.6.X
Apache Accumulo 1.5.1
Apache Spark 1.4.1

I've tried my best, alongside several others over at the Gora community, to
ensure all of these datastores are documented over at
http://gora.apache.org/current/index.html#gora-modules.
It should be noted that since then, Gora master branch contains datastore
version upgrades for nearly every datastore.
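
As a rough illustration of checking a local setup against the list above,
here is a minimal Python sketch. It is not part of Nutch or Gora. The names
TESTED, matches and is_tested are hypothetical, and treating a trailing "X"
as a wildcard is an assumption about how to read entries such as "2.6.X" and
"1.X".

```python
# Versions copied from the Nutch 2.3.1 list above.
# Helper names are hypothetical; "X" is assumed to mean "any value here".
TESTED = {
    "avro": ["1.7.6"],
    "hadoop": ["1.2.1", "2.5.2"],
    "hbase": ["0.98.8-hadoop2", "1.X"],
    "cassandra": ["2.0.2"],
    "solr": ["4.10.3"],
    "mongodb": ["2.6.X"],
    "accumulo": ["1.5.1"],
    "spark": ["1.4.1"],
}


def matches(pattern, version):
    """Return True if version fits pattern, with 'X' acting as a wildcard."""
    p_parts = pattern.split(".")
    v_parts = version.split(".")
    for i, p in enumerate(p_parts):
        if p.upper() == "X":
            # Wildcard: accept whatever follows in the installed version.
            return True
        if i >= len(v_parts) or v_parts[i] != p:
            return False
    # No wildcard seen, so the two versions must have the same length.
    return len(p_parts) == len(v_parts)


def is_tested(component, version):
    """Return True if version is among the listed versions for component."""
    patterns = TESTED.get(component.lower(), [])
    return any(matches(p, version) for p in patterns)
```

With this sketch, is_tested("hbase", "0.98.8-hadoop2") is true while
is_tested("cassandra", "3.11.0") is false, which flags a likely mismatch
before any jars are copied around.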


>
>
>
> I just do not get the entire 2.3.1 release. It is very frustrating.


Yes, as I said, it is disappointing to see that you struggled so much with
this. I've made best efforts to ensure our Nutch2 tutorial is
up-to-date:
https://wiki.apache.org/nutch/Nutch2Tutorial


> The
> web UIs tend to fire blank pages with no stack traces.


Please feel free to log issues... if it is broken then we can try to fix
it. Without a Jira issue or some debug information, we don't know it is
broken.


> It's unclear why
> backends that do not work are even documented.


HBase is most widely used, followed by MongoDB... on the other end of the
spectrum, Cassandra is least used and broken. It has not been maintained
for quite some time... and yes this is reflected by use of Super Columns.
We are currently re-writing the backend as part of a GSoC project.


> How can even the file/avro
> support not even work?
>

Please log your issue(s) in Jira and I can try to reproduce it using the 2.x
branch. I do not use this backend in my own 2.X deployments, so I was not
aware that it was broken.
Lewis


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney





RE: What up with 2.3.1 ?

2017-06-03 Thread Markus Jelsma
Don't know about the pills, but pills should be good quality. If you just want
to crawl about, I would, most certainly after reading this, recommend 1.x. Fewer
moving parts, less fuss.

Markus 
 
-Original message-
> From:Edward Capriolo 
> Sent: Saturday 3rd June 2017 21:27
> To: user@nutch.apache.org
> Subject: What up with 2.3.1 ?
> 
> Hello,
> 
> In the past I had an awesome experience with nutch. About 8 years ago I ran
> a process where I checked out each process in our SVN repo, ran
> doxygen/javadoc on them. Then unleashed nutch on them and set up a
> searchable front end.
> 
> I am doing a video course '10 Hadoop able problems' and I want to pay some
> tribute to nutch by including a section on it.
> 
> Nutch 2.3.1, I have to say, I do not even understand it as a release.
> 
> First, I attempted to use the hbase gora integration. I have a pretty
> recent hbase. I notice nutch 2.3.1 has gora 0.6.1 as a dep so I checked that
> out. Gora has an assembly target so I ran that. It really did not seem to
> make an assembly jar, so I spent about a half hour dragging hbase jars to
> deal with class not found errors. Finally I got one of those darn errors
> like: NoSuchMethod SetColumnFamily(string, string) which meant this was a
> dead effort because now I would also have to go install an hbase to match
> gora and I felt that was a big time suck...
> 
> So onto the Cassandra integration. Same process... finding hector,
> Cassandra-all, thrift, because the assembly does not really assemble them.
> 
> Run a crawl, fail @ 33%, problem with super columns... WTF
> supercolumns?? Anyway so I go in Jira, apparently this thing has
> never worked... and there is maybe a new one using CQL but that is not in
> gora 0.6.1...
> 
> So I figure let me use the FILE support. Like that is the bare
> minimum... It has to work right?
> 
> 
> /Users/ecapriolo/Downloads/apache-nutch-2.3.1/dist/apache-nutch-2.3.1-bin/bin/nutch
> inject cdir -crawlId e
> InjectorJob: starting at 2017-06-03 15:17:23
> InjectorJob: Injecting urlDir: cdir
> InjectorJob: Using class org.apache.gora.avro.store.AvroStore as the Gora
> storage class.
> InjectorJob: total number of urls rejected by filters: 0
> InjectorJob: total number of urls injected after normalization and
> filtering: 2
> Injector: finished at 2017-06-03 15:17:25, elapsed: 00:00:01
> Sat Jun 3 15:17:25 EDT 2017 : Iteration 1 of 2
> Generating batchId
> Generating a new fetchlist
> /Users/ecapriolo/Downloads/apache-nutch-2.3.1/dist/apache-nutch-2.3.1-bin/bin/nutch
> generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapred.map.tasks.speculative.execution=false -D
> mapred.compress.map.output=true -topN 5 -noNorm -noFilter -adddays 0
> -crawlId e -batchId 1496517445-7390
> GeneratorJob: starting at 2017-06-03 15:17:26
> GeneratorJob: Selecting best-scoring urls due for fetch.
> GeneratorJob: starting
> GeneratorJob: filtering: false
> GeneratorJob: normalizing: false
> GeneratorJob: topN: 5
> GeneratorJob: finished at 2017-06-03 15:17:28, time elapsed: 00:00:02
> GeneratorJob: generated batch id: 1496517445-7390 containing 0 URLs
> Generate returned 1 (no new segments created)
> Escaping loop: no more URLs to fetch now
> 
> 
> 2017-06-03 15:17:24,847 ERROR store.AvroStore -
> 
> java.nio.channels.ClosedChannelException
> at
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.checkClosed(ChecksumFileSystem.java:417)
> at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:98)
> at
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
> at java.io.DataOutputStream.write(DataOutputStream.java:107)
> at
> org.codehaus.jackson.impl.Utf8Generator._flushBuffer(Utf8Generator.java:1754)
> at
> org.codehaus.jackson.impl.Utf8Generator.flush(Utf8Generator.java:1088)
> at org.apache.avro.io.JsonEncoder.flush(JsonEncoder.java:73)
> at org.apache.gora.avro.store.AvroStore.close(AvroStore.java:119)
> at
> org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:56)
> at
> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:647)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> at
> org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 
> 
> I just do not get the entire 2.3.1