Hi Lewis,

I made progress. I downloaded and installed the release candidate from
https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1

I ran the "crawl" executable with a Mongo backend.

My gora.properties:
-------------------------------------------------------------------
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=method_centers
-------------------------------------------------------------------
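
For reference, a quick way to sanity-check what Gora wrote on the Mongo side.
This is only a sketch; the collection name below is my guess at what Gora
creates for this crawl ID, so adjust it to whatever getCollectionNames()
actually reports:

mongo method_centers --eval 'printjson(db.getCollectionNames())'
mongo method_centers --eval 'db.method_centers_webpage.count()'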


My nutch-site.xml:
-------------------------------------------------------------------

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>
        <property>
                <name>http.agent.name</name>
                <value>nutch Mongo Solr Crawler</value>
        </property>

        <property>
                <name>storage.data.store.class</name>
                <value>org.apache.gora.mongodb.store.MongoStore</value>
                <description>Default class for storing data</description>
        </property>

        <property>
                <name>plugin.includes</name>
                
                <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
        </property>
  

  
</configuration>
-------------------------------------------------------------------


I run with this command:
./bin/crawl urls method_centers http://localhost:8983/solr 2
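
For reference, my reading of the 2.x crawl script arguments (please correct me
if this is wrong):

./bin/crawl <seedDir> <crawlId> <solrURL> <numberOfRounds>

i.e. "urls" is my seed directory, "method_centers" is the crawl ID used as the
storage prefix, and 2 is the number of generate/fetch/parse/updatedb rounds.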


Nutch successfully injects into the Mongo backend but fails at the SOLR
indexing step. Here's the execution trace where Nutch errors out on the SOLR
indexing task:

FetcherJob: finished at 2015-09-28 18:27:57, time elapsed: 00:00:12
Parsing : 
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
mapred.skip.attempts.to.start.skipping=2 -D
mapred.skip.map.max.skip.records=1 1443490061-8003 -crawlId method_centers
ParserJob: starting at 2015-09-28 18:27:58
ParserJob: resuming:    false
ParserJob: forced reparse:      false
ParserJob: batchId:     1443490061-8003
ParserJob: success
ParserJob: finished at 2015-09-28 18:28:00, time elapsed: 00:00:02
CrawlDB update for method_centers
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true 1443490061-8003 -crawlId method_centers
DbUpdaterJob: starting at 2015-09-28 18:28:01
DbUpdaterJob: batchId: 1443490061-8003
DbUpdaterJob: finished at 2015-09-28 18:28:03, time elapsed: 00:00:02
Indexing method_centers on SOLR index -> http://localhost:8983/solr
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
solr.server.url=http://localhost:8983/solr -all -crawlId method_centers
IndexingJob: starting
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication


SolrIndexerJob: 
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
Expected content type application/octet-stream but got
text/html;charset=ISO-8859-1. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p><hr /><i><small>Powered by
Jetty://</small></i><br/>

<br/>              
<br/>
</body>
</html>

        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:455)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
        at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
        at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:146)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:146)
        at org.apache.nutch.indexer.IndexWriters.commit(IndexWriters.java:124)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:186)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)

Error running:
  /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
solr.server.url=http://localhost:8983/solr -all -crawlId method_centers
Failed with exit value 255.


I verified that my SOLR instance is up and running. The SOLR web GUI reports
solr-spec 5.1.0. Do I have to configure SOLR for Nutch indexing? If so, are
there instructions for configuring SOLR for Nutch?
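
My own guess, based on the 404 for /solr/update above: Solr 5.x serves the
update handler per core (i.e. /solr/<corename>/update), so the base URL
probably needs a core name appended. Something like the following, where the
core name "nutch" is just a placeholder I made up and the core would still
need the Nutch field definitions from conf/schema.xml:

bin/solr create -c nutch
./bin/crawl urls method_centers http://localhost:8983/solr/nutch 2

I haven't verified this yet, so pointers to the proper setup would be welcome.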


Unrelated question: how does Nutch crawl the links on the pages listed in the
seed.txt file? Is there a difference between a directory-style URL entry and a
specific page URL?
For example, let's say http://foo.com/index.html contains 100 links. Will
Nutch crawl these two seed.txt entries the same way (i.e., crawl all 100 links
for each)?
http://foo.com/index.html
http://foo.com 
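
My working assumption, which I'd like to confirm: with 2 rounds, the outlinks
discovered on the index page in round 1 become fetch candidates in round 2,
and what actually gets followed is governed by the number of rounds plus
regex-urlfilter.txt and properties like db.ignore.external.links, rather than
by whether the seed is a bare host or a specific page. For instance, to keep
the crawl on foo.com I would expect something like this in regex-urlfilter.txt
(my own sketch, not what ships by default):

# accept only foo.com and its subdomains, reject everything else
+^https?://([a-z0-9-]+\.)*foo\.com/
-.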


Thanks again for your help. I'll give a +1 vote for the 2.3.1 candidate once
SOLR indexing works ;).

Cheers,
Sherban



On 9/28/15, 11:55 AM, "Drulea, Sherban" <sdru...@rand.org> wrote:

>Hi Lewis,
>
>Thanks for your reply. You're right, there's no homebrew recipe for Nutch.
>I use the official nutch 2.3 OS X release download from the Apache
>website. I run nutch from /runtime/local/bin. The homebrew packages are
>other dependent software (mongo, cassandra, hbase, etc.).
>
>All the problems I described are with the nutch 2.3 download, not homebrew
>packages.
>
>Where do I download nutch 2.3.1? Should I just pull the latest from
>http://svn.apache.org/viewvc/nutch/trunk/ ?
>
>Cheers,
>Sherban
>
>
>
>On 9/27/15, 9:57 AM, "Lewis John Mcgibbney" <lewis.mcgibb...@gmail.com>
>wrote:
>
>>Hi Drulea,
>>
>>On Sun, Sep 27, 2015 at 7:36 AM, <user-digest-h...@nutch.apache.org>
>>wrote:
>>
>>>
>>> I'm using nutch 2.3 on OS X 10.9.5 with homebrew.
>>>
>>
>>
>>From the start I would like to point you at the current release candidate
>>for Nutch 2.3.1. The VOTE is currently open and the release candidate is
>>being tested by the community. There are a number of bugs fixed down in
>>Gora (particularly within the gora-mongodb module) which Nutch 2.3.1 will
>>benefit from.
>>It can be obtained from here
>>http://www.mail-archive.com/dev%40nutch.apache.org/msg19271.html
>>
>>Another thing here is that, AFAIK we are not publishing Homebrew recipes!
>>Wherever you got your recipe from I can guarantee you that it is not an
>>official Nutch one! I do, however, see two:
>>
>>lmcgibbn@LMC-032857 /usr/local(joshua) $ brew search nutch
>>No formula found for "nutch".
>>==> Searching pull requests...
>>Closed pull requests:
>>Added formula for Apache Nutch (
>>https://github.com/Homebrew/homebrew/pull/26587)
>>Added Apache Nutch 2.2.1
>>(https://github.com/Homebrew/homebrew/pull/22004)
>>
>>None of these are from the release managers at Nutch... maybe this is
>>something we should look in to.
>>
>>
>>>
>>> I¹ve been unable to use the crawl command with MySQL, Mongo, or
>>>Cassandra.
>>> The inject step fails in each configuration with the following arcane
>>> errors:
>>>
>>> 1.) MySQL (after downgrading to gora-core 0.2.1 in ivy.xml as per
>>>comments)
>>>
>>
>>
>>The MySQL backend for Gora is broken by now. Things have changed and moved on,
>>with the SQL module being left in the dust. Avro has also moved on
>>significantly and we now utilize a MUCH newer version of Avro, so your
>>NoSuchMethodError below is entirely understandable.
>>
>>
>>>       InjectorJob: Injecting urlDir: urls
>>>
>>
>>[...snip]
>>
>>
>>
>>>
>>>
>>> 2.) Mongo with default 0.5 gora
>>>
>>> InjectorJob: Injecting urlDir: urls
>>>
>>> InjectorJob: org.apache.gora.util.GoraException:
>>> java.lang.NullPointerException
>>>
>>>
>>>
>>[...snip]
>>
>>This is gone in the Nutch 2.3.1 release candidate.
>>
>>
>>> 3.) Mongo(upgrading to gora 0.6.1 to resolve previous issue above)
>>>
>>> InjectorJob: Injecting urlDir: urls
>>>
>>> InjectorJob: java.lang.UnsupportedOperationException: Not implemented
>>>by
>>> the DistributedFileSystem FileSystem implementation
>>>
>>>
>>>
>>[...snip]
>>
>>Can you please try with the 2.3.1 release candidate and provide the same
>>feedback?
>>
>>
>>> 4.) Cassandra using default gora 0.5
>>>
>>> InjectorJob: Injecting urlDir: urls
>>>
>>> Exception in thread "main" java.lang.NoSuchMethodError:
>>> org.apache.avro.Schema.access$1400()Ljava/lang/ThreadLocal;
>>>
>>>
>>>
>>[...snip]
>>
>>I've never seen this before. On another note, Renato and I are currently
>>overhauling the gora-cassandra driver from Hector --> Datastax Java
>>Driver.
>>Work is ongoing here
>>https://github.com/renato2099/gora/tree/gora-datastax-cassandra
>>
>>
>>> Does the "crawl" script inject task work with any backend storage
>>>reliably
>>> on OS X?
>>>
>>
>>Well, we can better answer that question if and when you and more people try
>>out the 2.3.1 release candidate.
>>
>>
>>
>>>
>>> Which backend is the most reliable to use with nutch 2.3?
>>>
>>
>>HBase 0.94.14
>>
>>
>>>
>>> It's frustrating that 3 common (and supposedly supported) backends don't
>>> work with nutch due to arcane errors.
>>>
>>>
>>I agree. But let's not throw the baby out with the bathwater here. How about
>>you try out the above and respond, and we can take it from there?
>>Would be great to have more developers submitting patches for 2.X branch.
>>If you are keen then it would be great to have you on board.
>>Thanks
>>Lewis
>

