3.3 will work perfectly, as there are no changes to the javabin format. However, one should update the schema version to reflect recent changes in branch 3.4-dev. This branch is likely to be released earlier than Nutch 1.4, which should be compatible with the most recent stable Solr release.
> Glad it worked for you on Solr 3.2. I did try Nutch 1.3 and Solr 3.3;
> however, I have not yet updated my blog for Solr 3.3. ;-)
>
> have fun!
>
> On Mon, Aug 8, 2011 at 1:57 PM, John R. Brinkema <[email protected]> wrote:
>
> > On 8/2/2011 11:21 PM, Way Cool wrote:
> >> Try changing uniqueKey from id to url as below in schema.xml and
> >> restart Solr:
> >>
> >>   <uniqueKey>url</uniqueKey>
> >>
> >> If that still did not work, that means you have an empty url. We
> >> can fix that.
> >>
> >> On Mon, Aug 1, 2011 at 12:45 PM, John R. Brinkema <[email protected]> wrote:
> >>
> >>> Friends,
> >>>
> >>> I am having the worst time getting Nutch and Solr to play together
> >>> nicely.
> >>>
> >>> I downloaded and installed the current binaries for both Nutch and
> >>> Solr. I edited the nutch-site.xml file to include:
> >>>
> >>>   <property>
> >>>     <name>http.agent.name</name>
> >>>     <value>Solr/Nutch Search</value>
> >>>   </property>
> >>>   <property>
> >>>     <name>plugin.includes</name>
> >>>     <value>protocol-http|urlfilter-regex|parse-(text|html|tika)|index-basic|query-(basic|stemmer|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >>>   </property>
> >>>   <property>
> >>>     <name>http.content.limit</name>
> >>>     <value>65536</value>
> >>>   </property>
> >>>   <property>
> >>>     <name>searcher.dir</name>
> >>>     <value>/opt/SolrSearch</value>
> >>>   </property>
> >>>
> >>> I installed them and tested them according to each of their respective
> >>> tutorials; in other words, I believe each is working separately. I
> >>> crawled a URL and the 'readdb -stats' report shows that I have
> >>> successfully collected some links. Most of the links are to '.pdf'
> >>> files.
> >>>
> >>> I followed the instructions to link Nutch and Solr; e.g. copy the
> >>> Nutch schema to become the Solr schema.
> >>> When I run the bin/nutch solrindex ... command I get the following
> >>> error:
> >>>
> >>>   java.io.IOException: Job failed!
> >>>
> >>> When I look in the log/hadoop.log file I see:
> >>>
> >>>   2011-08-01 13:10:00,086 INFO  solr.SolrMappingReader - source: content dest: content
> >>>   2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: site dest: site
> >>>   2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: title dest: title
> >>>   2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: host dest: host
> >>>   2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: segment dest: segment
> >>>   2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: boost dest: boost
> >>>   2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: digest dest: digest
> >>>   2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
> >>>   2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: url dest: id
> >>>   2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: url dest: url
> >>>   2011-08-01 13:10:00,537 WARN  mapred.LocalJobRunner - job_local_0001
> >>>   org.apache.solr.common.SolrException: Document [null] missing required field: id
> >>>
> >>>   Document [null] missing required field: id
> >>>
> >>>   request: http://localhost:8983/solr/update?wt=javabin&version=2
> >>>
> >>>     at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
> >>>     at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
> >>>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
> >>>     at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
> >>>     at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:82)
> >>>     at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
> >>>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
> >>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
> >>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> >>>   2011-08-01 13:10:01,050 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
> >>>
> >>> The same error appears in the Solr log.
> >>>
> >>> I have tried the 'sync solrj libraries' fix; that is, I copied
> >>> apache-solr-solrj-3.3.0.jar from the Solr lib to the Nutch lib with no
> >>> effect. Since I am running binaries, I did not, of course, run ant
> >>> job. Is that the magic?
> >>>
> >>> Any suggestions?
> >
> > Update from the trenches ....
> >
> > I followed Way Cool's suggestion (now called Dr. Cool since he has been
> > so helpful) of using Nutch 1.3 and Solr 3.2 ... which worked just fine.
> >
> > I am off using this pair until I get a breather, and then I will try
> > Nutch 1.3 and Solr 3.3 again, this time with Dr. Cool's latest
> > suggestion.
> >
> > Thanks to all. /jb
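For readers hitting the same error: the log lines above show that Nutch's mapping sends the crawled url into both the id and url Solr fields, yet the documents arrive without an id, so Solr rejects them against its required uniqueKey. The fix Way Cool proposes amounts to a one-line edit in the schema.xml that was copied over from Nutch. A minimal sketch, assuming the stock Nutch 1.3 schema (the exact surrounding field definitions may differ in your copy):

```xml
<!-- schema.xml (the copy served by Solr, originally from Nutch's conf/).
     Change the uniqueKey from "id" to "url": the solrindex mapping fills
     "url" for every document, so keying on it sidesteps the
     "Document [null] missing required field: id" rejection. -->
<uniqueKey>url</uniqueKey>
```

Restart Solr after editing so the schema is reloaded, then rerun the bin/nutch solrindex command.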

