Glad it worked for you on Solr 3.2. I did try Nutch 1.3 with Solr 3.3, but I haven't updated my blog for Solr 3.3 yet. ;-)
have fun!

On Mon, Aug 8, 2011 at 1:57 PM, John R. Brinkema <[email protected]> wrote:

> On 8/2/2011 11:21 PM, Way Cool wrote:
>> Try changing uniqueKey from id to url as below in schema.xml and restart Solr:
>>
>> <uniqueKey>url</uniqueKey>
>>
>> If that still doesn't work, it means you have an empty url. We can fix that.
>>
>> On Mon, Aug 1, 2011 at 12:45 PM, John R. Brinkema <[email protected]> wrote:
>>> Friends,
>>>
>>> I am having the worst time getting nutch and solr to play together nicely.
>>>
>>> I downloaded and installed the current binaries for both nutch and solr.
>>> I edited the nutch-site.xml file to include:
>>>
>>> <property>
>>>   <name>http.agent.name</name>
>>>   <value>Solr/Nutch Search</value>
>>> </property>
>>> <property>
>>>   <name>plugin.includes</name>
>>>   <value>protocol-http|urlfilter-regex|parse-(text|html|tika)|index-basic|query-(basic|stemmer|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>> </property>
>>> <property>
>>>   <name>http.content.limit</name>
>>>   <value>65536</value>
>>> </property>
>>> <property>
>>>   <name>searcher.dir</name>
>>>   <value>/opt/SolrSearch</value>
>>> </property>
>>>
>>> I installed and tested them according to their respective tutorials; in
>>> other words, I believe each is working separately. I crawled a url and
>>> the 'readdb -stats' report shows that I have successfully collected some
>>> links. Most of the links are to '.pdf' files.
>>>
>>> I followed the instructions to link nutch and solr; e.g. copied the nutch
>>> schema to become the solr schema.
>>>
>>> When I run the bin/nutch solrindex ... command I get the following error:
>>>
>>> java.io.IOException: Job failed!
>>>
>>> When I look in the log/hadoop.log file I see:
>>>
>>> 2011-08-01 13:10:00,086 INFO solr.SolrMappingReader - source: content dest: content
>>> 2011-08-01 13:10:00,087 INFO solr.SolrMappingReader - source: site dest: site
>>> 2011-08-01 13:10:00,087 INFO solr.SolrMappingReader - source: title dest: title
>>> 2011-08-01 13:10:00,087 INFO solr.SolrMappingReader - source: host dest: host
>>> 2011-08-01 13:10:00,087 INFO solr.SolrMappingReader - source: segment dest: segment
>>> 2011-08-01 13:10:00,087 INFO solr.SolrMappingReader - source: boost dest: boost
>>> 2011-08-01 13:10:00,087 INFO solr.SolrMappingReader - source: digest dest: digest
>>> 2011-08-01 13:10:00,087 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
>>> 2011-08-01 13:10:00,087 INFO solr.SolrMappingReader - source: url dest: id
>>> 2011-08-01 13:10:00,087 INFO solr.SolrMappingReader - source: url dest: url
>>> 2011-08-01 13:10:00,537 WARN mapred.LocalJobRunner - job_local_0001
>>> org.apache.solr.common.SolrException: Document [null] missing required field: id
>>>
>>> Document [null] missing required field: id
>>>
>>> request: http://localhost:8983/solr/update?wt=javabin&version=2
>>>
>>>     at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
>>>     at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>>>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
>>>     at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
>>>     at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:82)
>>>     at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
>>>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
>>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
>>> 2011-08-01 13:10:01,050 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
>>>
>>> The same error appears in the solr log.
>>>
>>> I have tried the 'sync solrj libraries' fix; that is, I copied
>>> apache-solr-solrj-3.3.0.jar from the solr lib to the nutch lib with no
>>> effect. Since I am running binaries, I, of course, did not run ant job.
>>> Is that the magic?
>>>
>>> Any suggestions?
>
> Update from the trenches ....
>
> I followed Way Cool's suggestion (now called Dr. Cool since he has been so
> helpful) of using Nutch 1.3 and Solr 3.2 ... which worked just fine.
>
> I am off using this pair until I get a breather and then try Nutch 1.3 and
> Solr 3.3 again, this time with Dr. Cool's latest suggestion.
>
> Thanks to all. /jb
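For anyone finding this thread later: the uniqueKey change can be made and sanity-checked from the command line before restarting Solr. The sketch below is illustrative only: the file path and the miniature schema are made-up assumptions, not an actual Nutch-generated schema.xml (which has many more fields), and on a real install you would edit the copy under Solr's conf directory and then restart Solr.

```shell
# Hypothetical miniature schema.xml, just to demonstrate the edit.
cat > /tmp/schema.xml <<'EOF'
<schema name="nutch" version="1.1">
  <uniqueKey>id</uniqueKey>
</schema>
EOF

# Swap the uniqueKey from "id" to "url", as suggested above.
# (GNU sed; on BSD/macOS use: sed -i '' 's|...|...|')
sed -i 's|<uniqueKey>id</uniqueKey>|<uniqueKey>url</uniqueKey>|' /tmp/schema.xml

# Confirm the change took effect before restarting Solr.
grep '<uniqueKey>' /tmp/schema.xml
```

The point of the change is that Nutch's solrindex mapping (visible in the SolrMappingReader log lines above) sends the crawled url into both the url and id fields; if the schema's required uniqueKey doesn't line up with what Nutch actually populates, the update request fails with the "missing required field" error shown in the trace.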

