I use the following for all my indexing work.

Usage:   bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
Example: bin/crawl urls/ TestCrawl/ http://localhost:8983/solr/ 2
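One more note on the URL shape: the `#/core_name` part is a route inside the browser-side admin UI, not an HTTP endpoint. Any HTTP client drops everything from `#` onward before sending the request, which is consistent with the failing log below showing `request: http://localhost:8983/solr/`. A minimal sketch of the distinction (the core name `foo` is just the example from this thread):

```shell
# "#/foo" is only meaningful to the JavaScript admin UI; an HTTP client
# strips the fragment before sending the request.
ADMIN_URL="http://localhost:8983/solr/#/foo"
POSTED_TO="${ADMIN_URL%%#*}"      # what a client like Nutch actually posts to
echo "$POSTED_TO"                 # -> http://localhost:8983/solr/

# What solrindex needs instead is the core's own path, with no fragment:
#   bin/nutch solrindex http://localhost:8983/solr/foo crawl/crawldb/ \
#     -linkdb crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize
```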
On Tue, Apr 7, 2015 at 11:20 AM, Anchit Jain <[email protected]> wrote:
> Yes, it is working correctly from the browser. I can also manually add
> documents from the web browser, but not through Nutch.
> I am not able to figure out where the problem is.
>
> Is there any manual way of adding the crawldb and linkdb to Nutch besides
> that command?
>
> On Tuesday 07 April 2015 09:47 PM, Jeff Cocking wrote:
>
>> There can be numerous reasons: hosts.conf, firewall, etc. These are all
>> unique to your system.
>>
>> Have you viewed the Solr admin panel via a browser? This is a critical
>> step in the installation. It validates that Solr can accept HTTP
>> commands.
>>
>> On Tue, Apr 7, 2015 at 9:53 AM, Anchit Jain <[email protected]> wrote:
>>
>>> I created a new core named *foo*. Then I copied the *schema.xml* from
>>> *nutch* into *var/solr/data/foo/conf* with the changes described in
>>> https://wiki.apache.org/nutch/NutchTutorial.
>>> I changed the URL to http://localhost:8983/solr/#/foo, so the new
>>> command is
>>> "bin/nutch solrindex http://localhost:8983/solr/#/foo crawl/crawldb/
>>> -linkdb crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize"
>>> But now I am getting the error
>>> *org.apache.solr.common.SolrException: HTTP method POST is not supported
>>> by this URL*
>>>
>>> Is some other change also required in the URL to support POST requests?
>>> Full log:
>>>
>>> 2015-04-07 20:10:56,068 INFO indexer.IndexingJob - Indexer: starting at 2015-04-07 20:10:56
>>> 2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
>>> 2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: URL filtering: true
>>> 2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: URL normalizing: true
>>> 2015-04-07 20:10:56,727 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>> 2015-04-07 20:10:56,727 INFO indexer.IndexingJob - Active IndexWriters :
>>> SOLRIndexWriter
>>>     solr.server.url : URL of the SOLR instance (mandatory)
>>>     solr.commit.size : buffer size when sending to SOLR (default 1000)
>>>     solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>>>     solr.auth : use authentication (default false)
>>>     solr.auth.username : use authentication (default false)
>>>     solr.auth : username for authentication
>>>     solr.auth.password : password for authentication
>>>
>>> 2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
>>> 2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
>>> 2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20150406231502
>>> 2015-04-07 20:10:57,205 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> 2015-04-07 20:10:58,020 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
>>> 2015-04-07 20:10:58,134 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>> 2015-04-07 20:11:00,114 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>> 2015-04-07 20:11:01,205 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>> 2015-04-07 20:11:01,344 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>> 2015-04-07 20:11:01,577 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>> 2015-04-07 20:11:01,788 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>> 2015-04-07 20:11:01,921 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: content dest: content
>>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: title dest: title
>>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: host dest: host
>>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: segment dest: segment
>>> 2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: boost dest: boost
>>> 2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: digest dest: digest
>>> 2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
>>> 2015-04-07 20:11:02,266 INFO solr.SolrIndexWriter - Indexing 250 documents
>>> 2015-04-07 20:11:02,267 INFO solr.SolrIndexWriter - Deleting 0 documents
>>> 2015-04-07 20:11:02,512 INFO solr.SolrIndexWriter - Indexing 250 documents
>>> *2015-04-07 20:11:02,576 WARN mapred.LocalJobRunner - job_local1831338118_0001*
>>> *org.apache.solr.common.SolrException: HTTP method POST is not supported by this URL*
>>>
>>> *HTTP method POST is not supported by this URL*
>>>
>>> request: http://localhost:8983/solr/
>>>     at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
>>>     at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>>>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
>>>     at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:135)
>>>     at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)
>>>     at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
>>>     at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
>>>     at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:458)
>>>     at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)
>>>     at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:323)
>>>     at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
>>>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)
>>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>>> 2015-04-07 20:11:02,724 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>>>     at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
>>>     at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>>>
>>> On Tuesday 07 April 2015 06:53 PM, Jeff Cocking wrote:
>>>
>>>> The command you are using is not pointing to the specific Solr index
>>>> you created. The http://localhost:8983/solr needs to be changed to the
>>>> URL of the core you created. It should look like
>>>> http://localhost:8983/solr/#/new_core_name.
>>>>
>>>> On Tue, Apr 7, 2015 at 2:33 AM, Anchit Jain <[email protected]> wrote:
>>>>
>>>>> I followed the instructions given on your blog, created a new core
>>>>> for the nutch data, and copied Nutch's schema.xml into it.
>>>>> Then I ran the following command in the Nutch working directory:
>>>>> bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb
>>>>> crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize
>>>>>
>>>>> But the same error comes up, just like in the previous runs.
>>>>>
>>>>> On Tue, 7 Apr 2015 at 10:00 Jeff Cocking <[email protected]> wrote:
>>>>>
>>>>>> Solr 5 is multicore by default. You have not finished the install by
>>>>>> setting up Solr 5's core. I would suggest you look at the link I
>>>>>> sent to finish up your setup.
>>>>>>
>>>>>> After you finish your install, your Solr URL will be
>>>>>> http://localhost:8983/solr/#/core_name.
>>>>>>
>>>>>> Jeff Cocking
>>>>>>
>>>>>> I apologize for my brevity.
>>>>>> This was sent from my mobile device while I should be focusing on
>>>>>> something else... like a meeting, driving, family, etc.
>>>>>>
>>>>>> On Apr 6, 2015, at 11:16 PM, Anchit Jain <[email protected]> wrote:
>>>>>>
>>>>>>> I have already installed Solr. I want to integrate it with Nutch.
>>>>>>> Whenever I try to issue this command to Nutch,
>>>>>>> "bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/
>>>>>>> -linkdb crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize"
>>>>>>> I always get an error:
>>>>>>>
>>>>>>> Indexer: java.io.IOException: Job failed!
>>>>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>>>>>>>     at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
>>>>>>>     at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>>     at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>>>>>>>
>>>>>>> Here is the complete hadoop log for the process. I have underlined
>>>>>>> the error part in it.
>>>>>>>
>>>>>>> 2015-04-07 09:38:06,613 INFO indexer.IndexingJob - Indexer: starting at 2015-04-07 09:38:06
>>>>>>> 2015-04-07 09:38:06,684 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
>>>>>>> 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL filtering: true
>>>>>>> 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL normalizing: true
>>>>>>> 2015-04-07 09:38:06,893 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>>>> 2015-04-07 09:38:06,893 INFO indexer.IndexingJob - Active IndexWriters :
>>>>>>> SOLRIndexWriter
>>>>>>>     solr.server.url : URL of the SOLR instance (mandatory)
>>>>>>>     solr.commit.size : buffer size when sending to SOLR (default 1000)
>>>>>>>     solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>>>>>>>     solr.auth : use authentication (default false)
>>>>>>>     solr.auth.username : use authentication (default false)
>>>>>>>     solr.auth : username for authentication
>>>>>>>     solr.auth.password : password for authentication
>>>>>>>
>>>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
>>>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
>>>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20150406231502
>>>>>>> 2015-04-07 09:38:07,036 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>>>>> 2015-04-07 09:38:07,540 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
>>>>>>> 2015-04-07 09:38:07,565 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>> 2015-04-07 09:38:09,552 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>> 2015-04-07 09:38:10,642 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>> 2015-04-07 09:38:10,734 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>> 2015-04-07 09:38:10,895 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>> 2015-04-07 09:38:11,088 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>> 2015-04-07 09:38:11,219 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: content dest: content
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: title dest: title
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: host dest: host
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: segment dest: segment
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: boost dest: boost
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: digest dest: digest
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
>>>>>>> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Indexing 250 documents
>>>>>>> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Deleting 0 documents
>>>>>>> 2015-04-07 09:38:11,644 INFO solr.SolrIndexWriter - Indexing 250 documents
>>>>>>> *2015-04-07 09:38:11,699 WARN mapred.LocalJobRunner - job_local1245074757_0001*
>>>>>>> *org.apache.solr.common.SolrException: Not Found*
>>>>>>>
>>>>>>> *Not Found*
>>>>>>>
>>>>>>> *request: http://localhost:8983/solr/update?wt=javabin&version=2*
>>>>>>> *    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)*
>>>>>>> *    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)*
>>>>>>> *    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)*
>>>>>>> *    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:135)*
>>>>>>> *    at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)*
>>>>>>> *    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)*
>>>>>>> *    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)*
>>>>>>> *    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:458)*
>>>>>>> *    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)*
>>>>>>> *    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:323)*
>>>>>>> *    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)*
>>>>>>> *    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)*
>>>>>>> *    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)*
>>>>>>> *    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)*
>>>>>>> *2015-04-07 09:38:12,408 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!*
>>>>>>> *    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)*
>>>>>>> *    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)*
>>>>>>> *    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)*
>>>>>>> *    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)*
>>>>>>> *    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)*
>>>>>>>
>>>>>>> On Tue, 7 Apr 2015 at 03:18 Jeff Cocking <[email protected]> wrote:
>>>>>>>
>>>>>>>> With Solr 5.0.0 you can skip that step.
>>>>>>>> Solr will auto-create your schema document based on the data being
>>>>>>>> provided.
>>>>>>>>
>>>>>>>> One of the new features in Solr 5 is the install/service feature. I
>>>>>>>> did a quick write-up on how to install Solr 5 on CentOS. There
>>>>>>>> might be something useful in it for you:
>>>>>>>>
>>>>>>>> http://www.cocking.com/apache-solr-5-0-install-on-centos-7/
>>>>>>>>
>>>>>>>> jeff
>>>>>>>>
>>>>>>>> On Mon, Apr 6, 2015 at 3:13 PM, Anchit Jain <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I want to index Nutch results using *Solr 5.0*, but as mentioned
>>>>>>>>> in https://wiki.apache.org/nutch/NutchTutorial, there is no
>>>>>>>>> directory ${APACHE_SOLR_HOME}/example/solr/collection1/conf/ in
>>>>>>>>> Solr 5.0. So where do I have to copy *schema.xml*?
>>>>>>>>> Also, there is no *start.jar* present in the example directory.
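On the original question about where schema.xml goes now that example/solr/collection1/conf/ and start.jar are gone: in Solr 5 the server is driven by the bin/solr script, and a core created with `bin/solr create` gets its own conf directory under server/solr/. A sketch under that assumption (the core name `nutch` is illustrative, not required):

```shell
# Solr 5 layout: a core created via bin/solr lives under server/solr/<core>/conf
SOLR_CORE="nutch"                          # illustrative core name
CONF_DIR="server/solr/$SOLR_CORE/conf"
echo "$CONF_DIR"                           # where Nutch's schema.xml would belong

# Typical sequence on a Solr 5 install (shown for reference, not run here):
#   bin/solr start                         # replaces "java -jar start.jar"
#   bin/solr create -c nutch               # create the core
#   cp "$NUTCH_HOME/conf/schema.xml" server/solr/nutch/conf/
#   bin/solr restart                       # reload so the schema takes effect
```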

