Hi Lewis,

I made further progress. Following the instructions on
https://wiki.apache.org/nutch/NutchTutorial, I tried to copy the nutch
schema.xml to SOLR.

However, the nutch tutorial is out of date for SOLR 5.1.0: it references
directory structures that no longer exist. Furthermore, the schema.xml
shipped with 2.3.1 doesn’t appear to work.

I did the following:

1.) Created a folder called “nutch” in
/usr/local/Cellar/solr/5.1.0/server/solr
2.) Created a “conf” folder in the “nutch” folder.
3.) Copied /usr/local/Cellar/solr/5.1.0/server/solr/configsets/basic_configs/conf/solrconfig.xml to /usr/local/Cellar/solr/5.1.0/server/solr/nutch/conf/solrconfig.xml
4.) Copied ~/svn/release-2.3.1/runtime/local/conf/schema.xml to
/usr/local/Cellar/solr/5.1.0/server/solr/nutch/conf/schema.xml.
5.) I went into the SOLR admin UI and added a new core called “nutch”
6.) I get the following error (screenshot attached):
Error CREATEing SolrCore 'nutch': Unable to create core [nutch] Caused by:
enablePositionIncrements is not a valid option as of Lucene 5.0


7.) I deleted every enablePositionIncrements=“true” attribute in
/usr/local/Cellar/solr/5.1.0/server/solr/nutch/conf/schema.xml
8.) I tried creating the nutch core again (same as step #5).
9.) Now I get this error (screenshot attached):
Error CREATEing SolrCore 'nutch': Unable to create core [nutch] Caused by:
copyField source :'rawcontent' is not a glob and doesn't match any
explicit field or dynamicField.
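For anyone retracing steps 4 and 7, the edit boils down to a single sed pass. The sketch below demonstrates it on a scratch file so it is safe to run anywhere; the real target would be the schema.xml under the nutch core's conf directory:

```shell
# Demonstrate the step-7 edit on a scratch copy of a schema fragment.
# (The real file would be /usr/local/Cellar/solr/5.1.0/server/solr/nutch/conf/schema.xml.)
WORK=$(mktemp -d)
cat > "$WORK/schema.xml" <<'EOF'
<filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true"/>
EOF
# Lucene 5.0 removed this option, so strip every occurrence (keeping a .bak copy)
sed -i.bak 's/ enablePositionIncrements="true"//g' "$WORK/schema.xml"
cat "$WORK/schema.xml"
# prints: <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
```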


The schema.xml shipped with 2.3.1 seems incompatible with SOLR 5.1.0. Can
someone please provide a working schema.xml and document how to install
it in SOLR 5.1.0?
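In the meantime, one possible fix for the second error might be to declare the field that the copyField directive references. This is a guess I have not tested: the field name comes from the error message, but the type and flags below are my assumptions, not taken from the shipped schema:

```xml
<!-- Hypothetical declaration for the copyField source; the type and the
     indexed/stored flags are assumptions, not from the 2.3.1 schema.xml. -->
<field name="rawcontent" type="string" indexed="false" stored="true"/>
```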

Cheers,
Sherban

-- 
Sherban Drulea, RAND Corporation
Senior Research Software Engineer, Information Services
m5129   x7384   [email protected]

On 9/28/15, 6:38 PM, "Drulea, Sherban" <[email protected]> wrote:

>Hi Lewis,
>
>I made progress. I downloaded and installed the release candidate from
>https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>
>I ran the “crawl” executable with a Mongo backend.
>
>My gora.properties:
>-------------------------------------------------------------------
>gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
>gora.mongodb.override_hadoop_configuration=false
>gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
>gora.mongodb.servers=localhost:27017
>gora.mongodb.db=method_centers
>-------------------------------------------------------------------
>
>
>My nutch-site.xml:
>-------------------------------------------------------------------
>
><?xml version="1.0"?>
><?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
><!-- Put site-specific property overrides in this file. -->
><configuration>
>       <property>
>               <name>http.agent.name</name>
>               <value>nutch Mongo Solr Crawler</value>
>       </property>
>
>       <property>
>               <name>storage.data.store.class</name>
>               <value>org.apache.gora.mongodb.store.MongoStore</value>
>               <description>Default class for storing data</description>
>       </property>
>
>        <property>
>               <name>plugin.includes</name>
>               <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
>       </property>
>  
>
>  
></configuration>
>-------------------------------------------------------------------
>
>
>I run with this command:
>./bin/crawl urls method_centers http://localhost:8983/solr 2
>
>
>Nutch successfully injects into the Mongo backend but fails on the SOLR
>indexing. Here’s the execution trace where nutch errors out on SOLR
>indexing task …
>
>FetcherJob: finished at 2015-09-28 18:27:57, time elapsed: 00:00:12
>Parsing : 
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -D
>mapred.skip.attempts.to.start.skipping=2 -D
>mapred.skip.map.max.skip.records=1 1443490061-8003 -crawlId method_centers
>ParserJob: starting at 2015-09-28 18:27:58
>ParserJob: resuming:   false
>ParserJob: forced reparse:     false
>ParserJob: batchId:    1443490061-8003
>ParserJob: success
>ParserJob: finished at 2015-09-28 18:28:00, time elapsed: 00:00:02
>CrawlDB update for method_centers
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true 1443490061-8003 -crawlId method_centers
>DbUpdaterJob: starting at 2015-09-28 18:28:01
>DbUpdaterJob: batchId: 1443490061-8003
>DbUpdaterJob: finished at 2015-09-28 18:28:03, time elapsed: 00:00:02
>Indexing method_centers on SOLR index -> http://localhost:8983/solr
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -D
>solr.server.url=http://localhost:8983/solr -all -crawlId method_centers
>IndexingJob: starting
>Active IndexWriters :
>SOLRIndexWriter
>       solr.server.url : URL of the SOLR instance (mandatory)
>       solr.commit.size : buffer size when sending to SOLR (default 1000)
>       solr.mapping.file : name of the mapping file for fields (default
>solrindex-mapping.xml)
>       solr.auth : use authentication (default false)
>       solr.auth.username : username for authentication
>       solr.auth.password : password for authentication
>
>
>SolrIndexerJob: 
>org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>Expected content type application/octet-stream but got
>text/html;charset=ISO-8859-1. <html>
><head>
><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
><title>Error 404 Not Found</title>
></head>
><body><h2>HTTP ERROR 404</h2>
><p>Problem accessing /solr/update. Reason:
><pre>    Not Found</pre></p><hr /><i><small>Powered by
>Jetty://</small></i><br/>
>
><br/>             
><br/>
></body>
></html>
>
>       at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:455)
>       at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>       at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>       at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
>       at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:146)
>       at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:146)
>       at org.apache.nutch.indexer.IndexWriters.commit(IndexWriters.java:124)
>       at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:186)
>       at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>       at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
>
>Error running:
>  /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -D
>solr.server.url=http://localhost:8983/solr -all -crawlId method_centers
>Failed with exit value 255.
>
>
>I verified that my SOLR is up and running; the SOLR web GUI reports
>solr-spec 5.1.0. Do I have to configure SOLR for nutch indexing? If so,
>are there instructions for configuring SOLR for nutch?
>
>
>Unrelated question…
>How does nutch crawl every link in the pages listed in the seed.txt
>file? Is there a difference between a directory-style URL and a specific
>page URL? For example, say http://foo.com/index.html contains 100 links.
>Will nutch crawl these two seed.txt entries the same way (i.e. crawl all
>100 links)?
>http://foo.com/index.html
>http://foo.com
>
>
>Thanks again for your help. I’ll give +1 vote for 2.3.1 candidate once
>SOLR indexing works ;).
>
>Cheers,
>Sherban
>
>
>
>On 9/28/15, 11:55 AM, "Drulea, Sherban" <[email protected]> wrote:
>
>>Hi Lewis,
>>
>>Thanks for your reply. You're right, there's no homebrew recipe for
>>Nutch. I use the official nutch 2.3 OS X release download from the
>>Apache website. I run nutch from /runtime/local/bin. The homebrew
>>packages are other dependent software (mongo, cassandra, hbase, etc.).
>>
>>All the problems I described are with the nutch 2.3 download, not
>>homebrew
>>packages.
>>
>>Where do I download nutch 2.3.1? Should I just pull the latest from
>>http://svn.apache.org/viewvc/nutch/trunk/ ?
>>
>>Cheers,
>>Sherban
>>
>>
>>
>>On 9/27/15, 9:57 AM, "Lewis John Mcgibbney" <[email protected]>
>>wrote:
>>
>>>Hi Drulea,
>>>
>>>On Sun, Sep 27, 2015 at 7:36 AM, <[email protected]>
>>>wrote:
>>>
>>>>
>>>> I'm using nutch 2.3 on OS X 10.9.5 with homebrew.
>>>>
>>>
>>>
>>>From the start I would like to point you at the current release
>>>candidate
>>>for Nutch 2.3.1. The VOTE is currently open and the release candidate is
>>>being tested by the community. There are a number of bugs fixed down in
>>>Gora (particularly within the gora-mongodb module) which Nutch 2.3.1
>>>will
>>>benefit from.
>>>It can be obtained from here
>>>http://www.mail-archive.com/dev%40nutch.apache.org/msg19271.html
>>>
>>>Another thing here is that, AFAIK, we are not publishing Homebrew
>>>recipes! Wherever you got your recipe from, I can guarantee you that
>>>it is not an official Nutch one! I do, however, see two:
>>>
>>>lmcgibbn@LMC-032857 /usr/local(joshua) $ brew search nutch
>>>No formula found for "nutch".
>>>==> Searching pull requests...
>>>Closed pull requests:
>>>Added formula for Apache Nutch (
>>>https://github.com/Homebrew/homebrew/pull/26587)
>>>Added Apache Nutch 2.2.1
>>>(https://github.com/Homebrew/homebrew/pull/22004)
>>>
>>>None of these are from the release managers at Nutch... maybe this is
>>>something we should look in to.
>>>
>>>
>>>>
>>>> I've been unable to use the crawl command with MySQL, Mongo, or
>>>> Cassandra.
>>>> The inject step fails in each configuration with the following arcane
>>>> errors:
>>>>
>>>> 1.) MySQL (after downgrading to gora-core 0.2.1 in ivy.xml as per
>>>> comments)
>>>>
>>>
>>>
>>>The MySQL backend for Gora is broken by now. Things have changed and
>>>moved on, with the SQL module left in the dust. Avro has also moved on
>>>significantly, and we now use a MUCH newer version of Avro, so your
>>>NoSuchMethodError below is entirely understandable.
>>>
>>>
>>>>       InjectorJob: Injecting urlDir: urls
>>>>
>>>
>>>[...snip]
>>>
>>>
>>>
>>>>
>>>>
>>>> 2.) Mongo with default 0.5 gora
>>>>
>>>> InjectorJob: Injecting urlDir: urls
>>>>
>>>> InjectorJob: org.apache.gora.util.GoraException:
>>>> java.lang.NullPointerException
>>>>
>>>>
>>>>
>>>[...snip]
>>>
>>>This is gone in the Nutch 2.3.1 release candidate.
>>>
>>>
>>>> 3.) Mongo(upgrading to gora 0.6.1 to resolve previous issue above)
>>>>
>>>> InjectorJob: Injecting urlDir: urls
>>>>
>>>> InjectorJob: java.lang.UnsupportedOperationException: Not implemented
>>>>by
>>>> the DistributedFileSystem FileSystem implementation
>>>>
>>>>
>>>>
>>>[...snip]
>>>
>>>Can you please try with the 2.3.1 release candidate and provide the same
>>>feedback?
>>>
>>>
>>>> 4.) Cassandra using default gora 0.5
>>>>
>>>> InjectorJob: Injecting urlDir: urls
>>>>
>>>> Exception in thread "main" java.lang.NoSuchMethodError:
>>>> org.apache.avro.Schema.access$1400()Ljava/lang/ThreadLocal;
>>>>
>>>>
>>>>
>>>[...snip]
>>>
>>>I've never seen this before. On another note, Renato and I are
>>>currently overhauling the gora-cassandra driver from Hector to the
>>>Datastax Java Driver.
>>>Work is ongoing here
>>>https://github.com/renato2099/gora/tree/gora-datastax-cassandra
>>>
>>>
>>>> Does the “crawl” script inject task work with any backend storage
>>>> reliably on OS X?
>>>>
>>>
>>>Well, we can better answer that question if and when you and more
>>>people try out the 2.3.1 release candidate.
>>>
>>>
>>>
>>>>
>>>> Which backend is the most reliable to use with nutch 2.3?
>>>>
>>>
>>>HBase 0.94.14
>>>
>>>
>>>>
>>>> It's frustrating that 3 common (and supposedly supported) backends
>>>> don't work with nutch due to arcane errors.
>>>>
>>>>
>>>I agree, but let's not throw the baby out with the bathwater here. How
>>>about you try out the above and respond, and we can take it from there?
>>>It would be great to have more developers submitting patches for the
>>>2.X branch; if you are keen, we would love to have you on board.
>>>Thanks
>>>Lewis
>>
>
>
>__________________________________________________________________________
>
>This email message is for the sole use of the intended recipient(s) and
>may contain confidential information. Any unauthorized review, use,
>disclosure or distribution is prohibited. If you are not the intended
>recipient, please contact the sender by reply email and destroy all copies
>of the original message.
