I tried with SOLR 4.9.1.

I copied /release-2.3.1/runtime/local/conf/schema.xml to
solr-4.9.1/example/solr/collection1/conf/schema.xml
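
For the record, the steps were roughly the following (paths assume the
stock Solr 4.9.1 example layout; Solr needs a restart before collection1
picks up the new schema):

cp /release-2.3.1/runtime/local/conf/schema.xml \
   solr-4.9.1/example/solr/collection1/conf/schema.xml
cd solr-4.9.1/example && java -jar start.jar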

Result of /release-2.3.1/runtime/local/bin/crawl urls method_centers
http://localhost:8983/solr 2

Injecting seed URLs
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch inject urls
-crawlId method_centers
InjectorJob: starting at 2015-09-29 12:54:46
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the
Gora storage class.
InjectorJob: total number of urls rejected by filters: 1
InjectorJob: total number of urls injected after normalization and
filtering: 5
Injector: finished at 2015-09-29 12:54:49, elapsed: 00:00:02
Tue Sep 29 12:54:49 PDT 2015 : Iteration 1 of 2
Generating batchId
Generating a new fetchlist
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
-crawlId method_centers -batchId 1443556489-5775
GeneratorJob: starting at 2015-09-29 12:54:50
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2015-09-29 12:54:52, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1443556490-521927141 containing 5 URLs
Fetching : 
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D fetcher.timelimit.mins=180
1443556489-5775 -crawlId method_centers -threads 50
FetcherJob: starting at 2015-09-29 12:54:53
FetcherJob: batchId: 1443556489-5775
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1443567293080
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
[...snip: FetcherThread1 through FetcherThread48 finish identically...]
-finishing thread FetcherThread49, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
in 0 queues
-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
[...snip: FetcherThread1 through FetcherThread47 finish identically...]
-finishing thread FetcherThread48, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread49, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
in 0 queues
-activeThreads=0
FetcherJob: finished at 2015-09-29 12:55:05, time elapsed: 00:00:12
Parsing : 
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
mapred.skip.attempts.to.start.skipping=2 -D
mapred.skip.map.max.skip.records=1 1443556489-5775 -crawlId method_centers
ParserJob: starting at 2015-09-29 12:55:06
ParserJob: resuming:    false
ParserJob: forced reparse:      false
ParserJob: batchId:     1443556489-5775
ParserJob: success
ParserJob: finished at 2015-09-29 12:55:08, time elapsed: 00:00:02
CrawlDB update for method_centers
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true 1443556489-5775 -crawlId method_centers
DbUpdaterJob: starting at 2015-09-29 12:55:09
DbUpdaterJob: batchId: 1443556489-5775
DbUpdaterJob: finished at 2015-09-29 12:55:11, time elapsed: 00:00:02
Indexing method_centers on SOLR index -> http://localhost:8983/solr
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
solr.server.url=http://localhost:8983/solr -all -crawlId method_centers
IndexingJob: starting
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication


IndexingJob: done.
SOLR dedup -> http://localhost:8983/solr
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true http://localhost:8983/solr
Tue Sep 29 12:55:17 PDT 2015 : Iteration 2 of 2
Generating batchId
Generating a new fetchlist
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
-crawlId method_centers -batchId 1443556517-13841
GeneratorJob: starting at 2015-09-29 12:55:18
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2015-09-29 12:55:20, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1443556518-1067112789 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now

There are 6 URLs in my urls/seeds.txt file, and the injector reports 5
injected after filtering, yet the QueueFeeder picks up 0 records and the
second generate batch contains 0 URLs. Why does it say 0 URLs?
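
To help diagnose, the stored pages can be dumped with readdb; the options
below are from memory, so adjust them if the 2.3.1 syntax differs:

./bin/nutch readdb -crawlId method_centers -stats
./bin/nutch readdb -crawlId method_centers -dump dump_out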


The index job ran without errors, but there’s no data in SOLR. Is there a
known-good version of SOLR that works with the 2.3.1 schema.xml? Are the
tutorial instructions still valid?
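
A quick way to check whether anything reached the index at all
(collection1 is the core I copied the schema into; swap in your core name
as needed):

curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=0&wt=json"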



On 9/28/15, 8:53 PM, "Drulea, Sherban" <[email protected]> wrote:

>Hi Lewis,
>
>I made further progress. Following the instructions on
>https://wiki.apache.org/nutch/NutchTutorial, I tried to copy the nutch
>schema.xml to SOLR.
>
>However, the nutch tutorial is out of date for SOLR 5.1.0: it references
>a directory structure that no longer exists. Furthermore, the 2.3.1 SOLR
>schema.xml doesn’t appear to work.
>
>I did the following:
>
>1.) Created a “nutch” folder in
>/usr/local/Cellar/solr/5.1.0/server/solr
>2.) Created a “conf” folder in the “nutch” folder.
>3.) Copied
>/usr/local/Cellar/solr/5.1.0/server/solr/configsets/basic_configs/conf/solrconfig.xml
>to /usr/local/Cellar/solr/5.1.0/server/solr/nutch/conf/solrconfig.xml
>4.) Copied ~/svn/release-2.3.1/runtime/local/conf/schema.xml to
>/usr/local/Cellar/solr/5.1.0/server/solr/nutch/conf/schema.xml.
>5.) I went into the SOLR admin UI and added a new core called “nutch”
>6.) I get the following error (screenshot attached):
>Error CREATEing SolrCore 'nutch': Unable to create core [nutch] Caused by:
>enablePositionIncrements is not a valid option as of Lucene 5.0
>
>
>7.) I deleted every enablePositionIncrements=“true” attribute in
>/usr/local/Cellar/solr/5.1.0/server/solr/nutch/conf/schema.xml
>8.) I tried creating the nutch core again (same step as #6).
>9.) Now I get this error (screenshot attached):
>Error CREATEing SolrCore 'nutch': Unable to create core [nutch] Caused by:
>copyField source :'rawcontent' is not a glob and doesn't match any
>explicit field or dynamicField.
>
>
>The schema.xml shipped with 2.3.1 seems incompatible with SOLR 5.1.0. Can
>someone please provide a working schema.xml and document how to upload it
>to SOLR 5.1.0?
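>
>For completeness, the command-line equivalent of step 5, using the
>create_core command that ships with Solr 5 (the core name and conf path
>here are just my local choices):
>
>/usr/local/Cellar/solr/5.1.0/bin/solr create_core -c nutch \
>  -d /usr/local/Cellar/solr/5.1.0/server/solr/nutch/conf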
>
>Cheers,
>Sherban
>
>-- 
>Sherban Drulea, RAND Corporation
>Senior Research Software Engineer, Information Services
>m5129   x7384   [email protected]
>―
>
>
>
>
>
>
>On 9/28/15, 6:38 PM, "Drulea, Sherban" <[email protected]> wrote:
>
>>Hi Lewis,
>>
>>I made progress. I downloaded and installed the release candidate from
>>https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>>
>>I ran the “crawl” executable with a Mongo backend.
>>
>>My gora.properties:
>>-------------------------------------------------------------------
>>gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
>>gora.mongodb.override_hadoop_configuration=false
>>gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
>>gora.mongodb.servers=localhost:27017
>>gora.mongodb.db=method_centers
>>-------------------------------------------------------------------
>>
>>
>>My nutch-site.xml:
>>-------------------------------------------------------------------
>>
>><?xml version="1.0"?>
>><?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>><!-- Put site-specific property overrides in this file. -->
>><configuration>
>>      <property>
>>              <name>http.agent.name</name>
>>              <value>nutch Mongo Solr Crawler</value>
>>      </property>
>>
>>      <property>
>>              <name>storage.data.store.class</name>
>>              <value>org.apache.gora.mongodb.store.MongoStore</value>
>>              <description>Default class for storing data</description>
>>      </property>
>>
>>        <property>
>>              <name>plugin.includes</name>
>>              <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
>>      </property>
>>  
>>
>>  
>></configuration>
>>-------------------------------------------------------------------
>>
>>
>>I run with this command:
>>./bin/crawl urls method_centers http://localhost:8983/solr 2
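>>
>>To confirm the inject step actually wrote to Mongo, the database can be
>>inspected directly (the collection name Gora creates may vary with the
>>crawl id, hence the generic listing):
>>
>>mongo method_centers --eval "db.getCollectionNames()"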
>>
>>
>>Nutch successfully injects into the Mongo backend but fails on the SOLR
>>indexing. Here’s the execution trace where nutch errors out on SOLR
>>indexing task …
>>
>>FetcherJob: finished at 2015-09-28 18:27:57, time elapsed: 00:00:12
>>Parsing : 
>>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D
>>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>>mapred.reduce.tasks.speculative.execution=false -D
>>mapred.map.tasks.speculative.execution=false -D
>>mapred.compress.map.output=true -D
>>mapred.skip.attempts.to.start.skipping=2 -D
>>mapred.skip.map.max.skip.records=1 1443490061-8003 -crawlId
>>method_centers
>>ParserJob: starting at 2015-09-28 18:27:58
>>ParserJob: resuming:  false
>>ParserJob: forced reparse:    false
>>ParserJob: batchId:   1443490061-8003
>>ParserJob: success
>>ParserJob: finished at 2015-09-28 18:28:00, time elapsed: 00:00:02
>>CrawlDB update for method_centers
>>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D
>>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>>mapred.reduce.tasks.speculative.execution=false -D
>>mapred.map.tasks.speculative.execution=false -D
>>mapred.compress.map.output=true 1443490061-8003 -crawlId method_centers
>>DbUpdaterJob: starting at 2015-09-28 18:28:01
>>DbUpdaterJob: batchId: 1443490061-8003
>>DbUpdaterJob: finished at 2015-09-28 18:28:03, time elapsed: 00:00:02
>>Indexing method_centers on SOLR index -> http://localhost:8983/solr
>>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
>>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>>mapred.reduce.tasks.speculative.execution=false -D
>>mapred.map.tasks.speculative.execution=false -D
>>mapred.compress.map.output=true -D
>>solr.server.url=http://localhost:8983/solr -all -crawlId method_centers
>>IndexingJob: starting
>>Active IndexWriters :
>>SOLRIndexWriter
>>      solr.server.url : URL of the SOLR instance (mandatory)
>>      solr.commit.size : buffer size when sending to SOLR (default 1000)
>>      solr.mapping.file : name of the mapping file for fields (default
>>solrindex-mapping.xml)
>>      solr.auth : use authentication (default false)
>>      solr.auth.username : username for authentication
>>      solr.auth.password : password for authentication
>>
>>
>>SolrIndexerJob: 
>>org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>>Expected content type application/octet-stream but got
>>text/html;charset=ISO-8859-1. <html>
>><head>
>><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
>><title>Error 404 Not Found</title>
>></head>
>><body><h2>HTTP ERROR 404</h2>
>><p>Problem accessing /solr/update. Reason:
>><pre>    Not Found</pre></p><hr /><i><small>Powered by
>>Jetty://</small></i><br/>
>>
>><br/>            
>><br/>
>></body>
>></html>
>>
>>      at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:455)
>>      at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>>      at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>      at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
>>      at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:146)
>>      at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:146)
>>      at org.apache.nutch.indexer.IndexWriters.commit(IndexWriters.java:124)
>>      at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:186)
>>      at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
>>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>      at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
>>
>>Error running:
>>  /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
>>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>>mapred.reduce.tasks.speculative.execution=false -D
>>mapred.map.tasks.speculative.execution=false -D
>>mapred.compress.map.output=true -D
>>solr.server.url=http://localhost:8983/solr -all -crawlId method_centers
>>Failed with exit value 255.
>>
>>
>>I verified my SOLR is up and running; the SOLR web GUI reports solr-spec
>>5.1.0. Do I have to configure SOLR for nutch indexing? If so, are there
>>instructions for doing that?
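>>
>>One theory: Solr 5 has no default core, so /solr/update 404s and updates
>>must go to /solr/<corename>/update. If that is right, pointing the crawl
>>at a core-qualified URL (the core name "nutch" here is hypothetical)
>>might be all that's needed:
>>
>>./bin/crawl urls method_centers http://localhost:8983/solr/nutch 2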
>>
>>
>>Unrelated question…
>>How does nutch crawl the links on the pages listed in my seeds.txt file?
>>Is there a difference between a directory-style URL and a specific page
>>URL? For example, say http://foo.com/index.html contains 100 links. Will
>>nutch crawl these 2 seeds.txt entries the same way (i.e. crawl all 100
>>links for each)?
>>http://foo.com/index.html
>>http://foo.com
>>
>>
>>Thanks again for your help. I’ll give +1 vote for 2.3.1 candidate once
>>SOLR indexing works ;).
>>
>>Cheers,
>>Sherban
>>
>>
>>
>>On 9/28/15, 11:55 AM, "Drulea, Sherban" <[email protected]> wrote:
>>
>>>Hi Lewis,
>>>
>>>Thanks for your reply. You’re right, there’s no homebrew recipe for
>>>Nutch. I use the official nutch 2.3 OS X release download from the
>>>Apache website. I run nutch from /runtime/local/bin. The homebrew
>>>packages are other dependent software (mongo, cassandra, hbase, etc.)
>>>
>>>All the problems I described are with the nutch 2.3 download, not with
>>>homebrew packages.
>>>
>>>Where do I download nutch 2.3.1? Should I just pull the latest from
>>>http://svn.apache.org/viewvc/nutch/trunk/ ?
>>>
>>>Cheers,
>>>Sherban
>>>
>>>
>>>
>>>On 9/27/15, 9:57 AM, "Lewis John Mcgibbney" <[email protected]>
>>>wrote:
>>>
>>>>Hi Drulea,
>>>>
>>>>On Sun, Sep 27, 2015 at 7:36 AM, <[email protected]>
>>>>wrote:
>>>>
>>>>>
>>>>> I’m using nutch 2.3 on OS X 10.9.5 with homebrew.
>>>>>
>>>>
>>>>
>>>>From the start I would like to point you at the current release
>>>>candidate
>>>>for Nutch 2.3.1. The VOTE is currently open and the release candidate
>>>>is
>>>>being tested by the community. There are a number of bugs fixed down in
>>>>Gora (particularly within the gora-mongodb module) which Nutch 2.3.1
>>>>will
>>>>benefit from.
>>>>It can be obtained from here
>>>>http://www.mail-archive.com/dev%40nutch.apache.org/msg19271.html
>>>>
>>>>Another thing here is that, AFAIK we are not publishing Homebrew
>>>>recipes!
>>>>Wherever you got your recipe from I can guarantee you that it is not an
>>>>official Nutch one! I do, however, see two closed pull requests:
>>>>
>>>>lmcgibbn@LMC-032857 /usr/local(joshua) $ brew search nutch
>>>>No formula found for "nutch".
>>>>==> Searching pull requests...
>>>>Closed pull requests:
>>>>Added formula for Apache Nutch (
>>>>https://github.com/Homebrew/homebrew/pull/26587)
>>>>Added Apache Nutch 2.2.1
>>>>(https://github.com/Homebrew/homebrew/pull/22004)
>>>>
>>>>None of these are from the release managers at Nutch... maybe this is
>>>>something we should look in to.
>>>>
>>>>
>>>>>
>>>>> I’ve been unable to use the crawl command with MySQL, Mongo, or
>>>>>Cassandra.
>>>>> The inject step fails in each configuration with the following arcane
>>>>> errors:
>>>>>
>>>>> 1.) MySQL (after downgrading to gora-core 0.2.1 in ivy.xml as per
>>>>>comments)
>>>>>
>>>>
>>>>
>>>>The MySQL backend for Gora is broken by now. Things have changed and
>>>>moved on, with the SQL module being left in the dust. Avro has also
>>>>moved on significantly and we now utilize a MUCH newer version of Avro,
>>>>so your NoSuchMethodError below is entirely understandable.
>>>>
>>>>
>>>>>       InjectorJob: Injecting urlDir: urls
>>>>>
>>>>
>>>>[...snip]
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>> 2.) Mongo with default 0.5 gora
>>>>>
>>>>> InjectorJob: Injecting urlDir: urls
>>>>>
>>>>> InjectorJob: org.apache.gora.util.GoraException:
>>>>> java.lang.NullPointerException
>>>>>
>>>>>
>>>>>
>>>>[...snip]
>>>>
>>>>This is gone in the Nutch 2.3.1 release candidate.
>>>>
>>>>
>>>>> 3.) Mongo(upgrading to gora 0.6.1 to resolve previous issue above)
>>>>>
>>>>> InjectorJob: Injecting urlDir: urls
>>>>>
>>>>> InjectorJob: java.lang.UnsupportedOperationException: Not implemented
>>>>>by
>>>>> the DistributedFileSystem FileSystem implementation
>>>>>
>>>>>
>>>>>
>>>>[...snip]
>>>>
>>>>Can you please try with the 2.3.1 release candidate and provide the
>>>>same
>>>>feedback?
>>>>
>>>>
>>>>> 4.) Cassandra using default gora 0.5
>>>>>
>>>>> InjectorJob: Injecting urlDir: urls
>>>>>
>>>>> Exception in thread "main" java.lang.NoSuchMethodError:
>>>>> org.apache.avro.Schema.access$1400()Ljava/lang/ThreadLocal;
>>>>>
>>>>>
>>>>>
>>>>[...snip]
>>>>
>>>>I've never seen this before. On another note, Renato and me are
>>>>currently
>>>>overhauling the gora-cassandra driver from Hector --> Datastax Java
>>>>Driver.
>>>>Work is ongoing here
>>>>https://github.com/renato2099/gora/tree/gora-datastax-cassandra
>>>>
>>>>
>>>>> Does the “crawl” script inject task work with any backend storage
>>>>>reliably
>>>>> on OS X?
>>>>>
>>>>
>>>>Well we can better answer that question if and when you and more
>>>>people try out the 2.3.1 release candidate.
>>>>
>>>>
>>>>
>>>>>
>>>>> Which backend is the most reliable to use with nutch 2.3?
>>>>>
>>>>
>>>>HBase 0.94.14
>>>>
>>>>
>>>>>
>>>>> It’s frustrating that 3 common (and supposedly supported) backends
>>>>> don’t work with nutch due to arcane errors.
>>>>>
>>>>>
>>>>I agree. But let’s not throw the baby out with the bath water here. How
>>>>about you try out the above and respond, and we can take it from there?
>>>>It would be great to have more developers submitting patches for the
>>>>2.X branch.
>>>>If you are keen then it would be great to have you on board.
>>>>Thanks
>>>>Lewis
>>>
>>
>>
>
