I use the following for all my indexing work.

Usage:   bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
Example: bin/crawl urls/ TestCrawl/ http://localhost:8983/solr/ 2
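One more note on the URL shape: the `#/core_name` part is a route inside the browser-side admin UI, not an HTTP endpoint. Any HTTP client drops everything from `#` onward before sending the request, which is consistent with the failing log below showing `request: http://localhost:8983/solr/`. A minimal sketch of the distinction (the core name `foo` is just the example from this thread):

```shell
# "#/foo" is only meaningful to the JavaScript admin UI; an HTTP client
# strips the fragment before sending the request.
ADMIN_URL="http://localhost:8983/solr/#/foo"
POSTED_TO="${ADMIN_URL%%#*}"      # what a client like Nutch actually posts to
echo "$POSTED_TO"                 # -> http://localhost:8983/solr/

# What solrindex needs instead is the core's own path, with no fragment:
#   bin/nutch solrindex http://localhost:8983/solr/foo crawl/crawldb/ \
#     -linkdb crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize
```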
On Tue, Apr 7, 2015 at 11:20 AM, Anchit Jain <[email protected]> wrote:
> Yes, it is working correctly from the browser. I can also manually add
> documents from the web browser, but not through Nutch.
> I am not able to figure out where the problem is.
>
> Is there any manual way of adding the crawldb and linkdb to Nutch besides
> that command?
>
> On Tuesday 07 April 2015 09:47 PM, Jeff Cocking wrote:
>
>> There can be numerous reasons: hosts.conf, firewall, etc. These are all
>> unique to your system.
>>
>> Have you viewed the Solr admin panel via a browser? This is a critical
>> step in the installation. It validates that Solr can accept HTTP
>> commands.
>>
>> On Tue, Apr 7, 2015 at 9:53 AM, Anchit Jain <[email protected]> wrote:
>>
>>> I created a new core named *foo*. Then I copied the *schema.xml* from
>>> *nutch* into *var/solr/data/foo/conf* with the changes described in
>>> https://wiki.apache.org/nutch/NutchTutorial.
>>> I changed the URL to http://localhost:8983/solr/#/foo, so the new
>>> command is
>>> "bin/nutch solrindex http://localhost:8983/solr/#/foo crawl/crawldb/
>>> -linkdb crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize"
>>> But now I am getting the error
>>> *org.apache.solr.common.SolrException: HTTP method POST is not supported
>>> by this URL*
>>>
>>> Is some other change also required in the URL to support POST requests?
>>> Full log:
>>>
>>> 2015-04-07 20:10:56,068 INFO indexer.IndexingJob - Indexer: starting at 2015-04-07 20:10:56
>>> 2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
>>> 2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: URL filtering: true
>>> 2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: URL normalizing: true
>>> 2015-04-07 20:10:56,727 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>> 2015-04-07 20:10:56,727 INFO indexer.IndexingJob - Active IndexWriters :
>>> SOLRIndexWriter
>>>     solr.server.url : URL of the SOLR instance (mandatory)
>>>     solr.commit.size : buffer size when sending to SOLR (default 1000)
>>>     solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>>>     solr.auth : use authentication (default false)
>>>     solr.auth.username : use authentication (default false)
>>>     solr.auth : username for authentication
>>>     solr.auth.password : password for authentication
>>>
>>> 2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
>>> 2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
>>> 2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20150406231502
>>> 2015-04-07 20:10:57,205 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> 2015-04-07 20:10:58,020 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
>>> 2015-04-07 20:10:58,134 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>> 2015-04-07 20:11:00,114 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>> 2015-04-07 20:11:01,205 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>> 2015-04-07 20:11:01,344 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>> 2015-04-07 20:11:01,577 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>> 2015-04-07 20:11:01,788 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>> 2015-04-07 20:11:01,921 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: content dest: content
>>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: title dest: title
>>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: host dest: host
>>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: segment dest: segment
>>> 2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: boost dest: boost
>>> 2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: digest dest: digest
>>> 2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
>>> 2015-04-07 20:11:02,266 INFO solr.SolrIndexWriter - Indexing 250 documents
>>> 2015-04-07 20:11:02,267 INFO solr.SolrIndexWriter - Deleting 0 documents
>>> 2015-04-07 20:11:02,512 INFO solr.SolrIndexWriter - Indexing 250 documents
>>> *2015-04-07 20:11:02,576 WARN mapred.LocalJobRunner - job_local1831338118_0001*
>>> *org.apache.solr.common.SolrException: HTTP method POST is not supported by this URL*
>>>
>>> *HTTP method POST is not supported by this URL*
>>>
>>> request: http://localhost:8983/solr/
>>>     at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
>>>     at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>>>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
>>>     at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:135)
>>>     at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)
>>>     at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
>>>     at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
>>>     at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:458)
>>>     at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)
>>>     at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:323)
>>>     at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
>>>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)
>>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>>> 2015-04-07 20:11:02,724 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>>>     at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
>>>     at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>>>
>>> On Tuesday 07 April 2015 06:53 PM, Jeff Cocking wrote:
>>>
>>>> The command you are using is not pointing to the specific Solr index
>>>> you created. The http://localhost:8983/solr needs to be changed to the
>>>> URL of the core you created. It should look like
>>>> http://localhost:8983/solr/#/new_core_name.
>>>>
>>>> On Tue, Apr 7, 2015 at 2:33 AM, Anchit Jain <[email protected]> wrote:
>>>>
>>>>> I followed the instructions given on your blog, created a new core
>>>>> for the nutch data, and copied Nutch's schema.xml into it.
>>>>> Then I ran the following command in the Nutch working directory:
>>>>> bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb
>>>>> crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize
>>>>>
>>>>> But the same error comes up, just like in the previous runs.
>>>>>
>>>>> On Tue, 7 Apr 2015 at 10:00 Jeff Cocking <[email protected]> wrote:
>>>>>
>>>>>> Solr 5 is multicore by default. You have not finished the install by
>>>>>> setting up Solr 5's core. I would suggest you look at the link I
>>>>>> sent to finish up your setup.
>>>>>>
>>>>>> After you finish your install, your Solr URL will be
>>>>>> http://localhost:8983/solr/#/core_name.
>>>>>>
>>>>>> Jeff Cocking
>>>>>>
>>>>>> I apologize for my brevity.
>>>>>> This was sent from my mobile device while I should be focusing on
>>>>>> something else... like a meeting, driving, family, etc.
>>>>>>
>>>>>> On Apr 6, 2015, at 11:16 PM, Anchit Jain <[email protected]> wrote:
>>>>>>
>>>>>>> I have already installed Solr. I want to integrate it with Nutch.
>>>>>>> Whenever I try to issue this command to Nutch,
>>>>>>> "bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/
>>>>>>> -linkdb crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize"
>>>>>>> I always get an error:
>>>>>>>
>>>>>>> Indexer: java.io.IOException: Job failed!
>>>>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>>>>>>>     at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
>>>>>>>     at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>>     at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>>>>>>>
>>>>>>> Here is the complete hadoop log for the process. I have underlined
>>>>>>> the error part in it.
>>>>>>>
>>>>>>> 2015-04-07 09:38:06,613 INFO indexer.IndexingJob - Indexer: starting at 2015-04-07 09:38:06
>>>>>>> 2015-04-07 09:38:06,684 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
>>>>>>> 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL filtering: true
>>>>>>> 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL normalizing: true
>>>>>>> 2015-04-07 09:38:06,893 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>>>> 2015-04-07 09:38:06,893 INFO indexer.IndexingJob - Active IndexWriters :
>>>>>>> SOLRIndexWriter
>>>>>>>     solr.server.url : URL of the SOLR instance (mandatory)
>>>>>>>     solr.commit.size : buffer size when sending to SOLR (default 1000)
>>>>>>>     solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>>>>>>>     solr.auth : use authentication (default false)
>>>>>>>     solr.auth.username : use authentication (default false)
>>>>>>>     solr.auth : username for authentication
>>>>>>>     solr.auth.password : password for authentication
>>>>>>>
>>>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
>>>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
>>>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20150406231502
>>>>>>> 2015-04-07 09:38:07,036 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>>>>> 2015-04-07 09:38:07,540 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
>>>>>>> 2015-04-07 09:38:07,565 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>> 2015-04-07 09:38:09,552 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>> 2015-04-07 09:38:10,642 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>> 2015-04-07 09:38:10,734 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>> 2015-04-07 09:38:10,895 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>> 2015-04-07 09:38:11,088 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>> 2015-04-07 09:38:11,219 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: content dest: content
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: title dest: title
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: host dest: host
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: segment dest: segment
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: boost dest: boost
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: digest dest: digest
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
>>>>>>> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Indexing 250 documents
>>>>>>> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Deleting 0 documents
>>>>>>> 2015-04-07 09:38:11,644 INFO solr.SolrIndexWriter - Indexing 250 documents
>>>>>>> *2015-04-07 09:38:11,699 WARN mapred.LocalJobRunner - job_local1245074757_0001*
>>>>>>> *org.apache.solr.common.SolrException: Not Found*
>>>>>>>
>>>>>>> *Not Found*
>>>>>>>
>>>>>>> *request: http://localhost:8983/solr/update?wt=javabin&version=2*
>>>>>>> *    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)*
>>>>>>> *    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)*
>>>>>>> *    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)*
>>>>>>> *    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:135)*
>>>>>>> *    at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)*
>>>>>>> *    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)*
>>>>>>> *    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)*
>>>>>>> *    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:458)*
>>>>>>> *    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)*
>>>>>>> *    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:323)*
>>>>>>> *    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)*
>>>>>>> *    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)*
>>>>>>> *    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)*
>>>>>>> *    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)*
>>>>>>> *2015-04-07 09:38:12,408 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!*
>>>>>>> *    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)*
>>>>>>> *    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)*
>>>>>>> *    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)*
>>>>>>> *    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)*
>>>>>>> *    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)*
>>>>>>>
>>>>>>> On Tue, 7 Apr 2015 at 03:18 Jeff Cocking <[email protected]> wrote:
>>>>>>>
>>>>>>>> With Solr 5.0.0 you can skip that step.
>>>>>>>> Solr will auto-create your schema document based on the data being
>>>>>>>> provided.
>>>>>>>>
>>>>>>>> One of the new features in Solr 5 is the install/service feature. I
>>>>>>>> did a quick write-up on how to install Solr 5 on CentOS. There
>>>>>>>> might be something useful in it for you:
>>>>>>>>
>>>>>>>> http://www.cocking.com/apache-solr-5-0-install-on-centos-7/
>>>>>>>>
>>>>>>>> jeff
>>>>>>>>
>>>>>>>> On Mon, Apr 6, 2015 at 3:13 PM, Anchit Jain <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I want to index Nutch results using *Solr 5.0*, but as mentioned
>>>>>>>>> in https://wiki.apache.org/nutch/NutchTutorial, there is no
>>>>>>>>> directory ${APACHE_SOLR_HOME}/example/solr/collection1/conf/ in
>>>>>>>>> Solr 5.0. So where do I have to copy *schema.xml*?
>>>>>>>>> Also, there is no *start.jar* present in the example directory.
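On the original question about where schema.xml goes now that example/solr/collection1/conf/ and start.jar are gone: in Solr 5 the server is driven by the bin/solr script, and a core created with `bin/solr create` gets its own conf directory under server/solr/. A sketch under that assumption (the core name `nutch` is illustrative, not required):

```shell
# Solr 5 layout: a core created via bin/solr lives under server/solr/<core>/conf
SOLR_CORE="nutch"                          # illustrative core name
CONF_DIR="server/solr/$SOLR_CORE/conf"
echo "$CONF_DIR"                           # where Nutch's schema.xml would belong

# Typical sequence on a Solr 5 install (shown for reference, not run here):
#   bin/solr start                         # replaces "java -jar start.jar"
#   bin/solr create -c nutch               # create the core
#   cp "$NUTCH_HOME/conf/schema.xml" server/solr/nutch/conf/
#   bin/solr restart                       # reload so the schema takes effect
```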

