I created a new core named *foo*. Then I copied the *schema.xml* from
*nutch* into *var/solr/data/foo/conf*, with the changes described
in *https://wiki.apache.org/nutch/NutchTutorial*.
I changed the URL to *http://localhost:8983/solr/#/foo*,
so the new command is
"*bin/nutch solrindex http://localhost:8983/solr/#/foo crawl/crawldb/
-linkdb crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize*"
But now I am getting the error:
*org.apache.solr.common.SolrException: HTTP method POST is not supported
by this URL*
Is some other change also required in the URL to support POST requests?
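One detail that may explain it (a quick standalone check with Python's urllib, nothing Nutch-specific): the *#/foo* part of the URL is a fragment, and fragments are never sent to the server in an HTTP request, so the POST actually lands on */solr/* rather than on the core.

```python
from urllib.parse import urlsplit, urlunsplit

# The "#/foo" in the Solr admin UI address is a URL fragment.
# Fragments are client-side only and are never transmitted over HTTP,
# so a POST to this address actually reaches /solr/ on the server.
admin_url = "http://localhost:8983/solr/#/foo"
parts = urlsplit(admin_url)
request_target = urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))

print(parts.fragment)    # /foo  (stays in the client)
print(request_target)    # http://localhost:8983/solr/  (what the server sees)
```

Note that *http://localhost:8983/solr/* is exactly the request URL shown in the log below.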
Full log:
2015-04-07 20:10:56,068 INFO indexer.IndexingJob - Indexer: starting at
2015-04-07 20:10:56
2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: deleting
gone documents: false
2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: URL
filtering: true
2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: URL
normalizing: true
2015-04-07 20:10:56,727 INFO indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.solr.SolrIndexWriter
2015-04-07 20:10:56,727 INFO indexer.IndexingJob - Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce -
IndexerMapReduce: crawldb: crawl/crawldb
2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce -
IndexerMapReduce: linkdb: crawl/linkdb
2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce -
IndexerMapReduces: adding segment: crawl/segments/20150406231502
2015-04-07 20:10:57,205 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
2015-04-07 20:10:58,020 INFO anchor.AnchorIndexingFilter - Anchor
deduplication is: off
2015-04-07 20:10:58,134 INFO regex.RegexURLNormalizer - can't find
rules for scope 'indexer', using default
2015-04-07 20:11:00,114 INFO regex.RegexURLNormalizer - can't find
rules for scope 'indexer', using default
2015-04-07 20:11:01,205 INFO regex.RegexURLNormalizer - can't find
rules for scope 'indexer', using default
2015-04-07 20:11:01,344 INFO regex.RegexURLNormalizer - can't find
rules for scope 'indexer', using default
2015-04-07 20:11:01,577 INFO regex.RegexURLNormalizer - can't find
rules for scope 'indexer', using default
2015-04-07 20:11:01,788 INFO regex.RegexURLNormalizer - can't find
rules for scope 'indexer', using default
2015-04-07 20:11:01,921 INFO indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.solr.SolrIndexWriter
2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: content
dest: content
2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: title
dest: title
2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: host
dest: host
2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: segment
dest: segment
2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: boost
dest: boost
2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: digest
dest: digest
2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: tstamp
dest: tstamp
2015-04-07 20:11:02,266 INFO solr.SolrIndexWriter - Indexing 250 documents
2015-04-07 20:11:02,267 INFO solr.SolrIndexWriter - Deleting 0 documents
2015-04-07 20:11:02,512 INFO solr.SolrIndexWriter - Indexing 250 documents
*2015-04-07 20:11:02,576 WARN mapred.LocalJobRunner -
job_local1831338118_0001*
*org.apache.solr.common.SolrException: HTTP method POST is not supported
by this URL*
*HTTP method POST is not supported by this URL*
request: http://localhost:8983/solr/
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:135)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
at
org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:458)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)
at
org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:323)
at
org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
2015-04-07 20:11:02,724 ERROR indexer.IndexingJob - Indexer:
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
On Tuesday 07 April 2015 06:53 PM, Jeff Cocking wrote:
The command you are using is not pointing to the specific Solr index you
created. The http://localhost:8983/solr needs to be changed to the URL of
the core you created. It should look like
http://localhost:8983/solr/#/new_core_name.
On Tue, Apr 7, 2015 at 2:33 AM, Anchit Jain <[email protected]>
wrote:
I followed the instructions given on your blog, created a new core for
nutch data, and copied the schema.xml of nutch into it.
Then I ran the following command in the nutch working directory:
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb
crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize
But I am still getting the same error as in the previous runs.
On Tue, 7 Apr 2015 at 10:00 Jeff Cocking <[email protected]> wrote:
Solr 5 is multicore by default. You have not finished the install; you
still need to set up a Solr 5 core. I would suggest you look at the link
I sent to finish up your setup.
After you finish your install, your Solr URL will be
http://localhost:8983/solr/#/core_name.
Jeff Cocking
I apologize for my brevity.
This was sent from my mobile device while I should be focusing on
something else.....
Like a meeting, driving, family, etc.
On Apr 6, 2015, at 11:16 PM, Anchit Jain <[email protected]>
wrote:
I have already installed Solr. I want to integrate it with Nutch.
Whenever I try to issue this command to nutch
""bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/
-linkdb
crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize"
I always get a error
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
Here is the complete hadoop log for the process. I have underlined the
error part in it.
2015-04-07 09:38:06,613 INFO indexer.IndexingJob - Indexer: starting
at
2015-04-07 09:38:06
2015-04-07 09:38:06,684 INFO indexer.IndexingJob - Indexer: deleting
gone
documents: false
2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL
filtering:
true
2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL
normalizing: true
2015-04-07 09:38:06,893 INFO indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.solr.SolrIndexWriter
2015-04-07 09:38:06,893 INFO indexer.IndexingJob - Active
IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
IndexerMapReduce:
crawldb: crawl/crawldb
2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
IndexerMapReduce:
linkdb: crawl/linkdb
2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
IndexerMapReduces:
adding segment: crawl/segments/20150406231502
2015-04-07 09:38:07,036 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes
where
applicable
2015-04-07 09:38:07,540 INFO anchor.AnchorIndexingFilter - Anchor
deduplication is: off
2015-04-07 09:38:07,565 INFO regex.RegexURLNormalizer - can't find
rules
for scope 'indexer', using default
2015-04-07 09:38:09,552 INFO regex.RegexURLNormalizer - can't find
rules
for scope 'indexer', using default
2015-04-07 09:38:10,642 INFO regex.RegexURLNormalizer - can't find
rules
for scope 'indexer', using default
2015-04-07 09:38:10,734 INFO regex.RegexURLNormalizer - can't find
rules
for scope 'indexer', using default
2015-04-07 09:38:10,895 INFO regex.RegexURLNormalizer - can't find
rules
for scope 'indexer', using default
2015-04-07 09:38:11,088 INFO regex.RegexURLNormalizer - can't find
rules
for scope 'indexer', using default
2015-04-07 09:38:11,219 INFO indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.solr.SolrIndexWriter
2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: content
dest: content
2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: title
dest:
title
2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: host
dest:
host
2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: segment
dest: segment
2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: boost
dest:
boost
2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: digest
dest:
digest
2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: tstamp
dest:
tstamp
2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Indexing 250
documents
2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Deleting 0
documents
2015-04-07 09:38:11,644 INFO solr.SolrIndexWriter - Indexing 250
documents
*2015-04-07 09:38:11,699 WARN mapred.LocalJobRunner -
job_local1245074757_0001*
*org.apache.solr.common.SolrException: Not Found*
*Not Found*
*request: http://localhost:8983/solr/update?wt=javabin&version=2*
* at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
request(CommonsHttpSolrServer.java:430)*
* at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
request(CommonsHttpSolrServer.java:244)*
* at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.
process(AbstractUpdateRequest.java:105)*
* at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(
SolrIndexWriter.java:135)*
* at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)*
* at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(
IndexerOutputFormat.java:50)*
* at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(
IndexerOutputFormat.java:41)*
* at
org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(
ReduceTask.java:458)*
* at
org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)*
* at
org.apache.nutch.indexer.IndexerMapReduce.reduce(
IndexerMapReduce.java:323)*
* at
org.apache.nutch.indexer.IndexerMapReduce.reduce(
IndexerMapReduce.java:53)*
* at org.apache.hadoop.mapred.ReduceTask.runOldReducer(
ReduceTask.java:522)*
* at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)*
* at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(
LocalJobRunner.java:398)*
*2015-04-07 09:38:12,408 ERROR indexer.IndexingJob - Indexer:
java.io.IOException: Job failed!*
* at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)*
* at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)*
* at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)*
* at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)*
* at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)*
On Tue, 7 Apr 2015 at 03:18 Jeff Cocking <[email protected]>
wrote:
With Solr 5.0.0 you can skip that step. Solr will auto-create your
schema based on the data being provided.
One of the new features in Solr 5 is the install/service feature. I
did a
quick write-up on how to install Solr 5 on CentOS. There might be
something useful there for you.
http://www.cocking.com/apache-solr-5-0-install-on-centos-7/
jeff
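For anyone reading along in the archives, core setup with the scripts bundled in Solr 5 is a sketch like the following (the core name *nutch* is just an example, and paths assume the default install layout):

```shell
# Start Solr (Solr 5 ships a bin/solr script instead of the old start.jar)
bin/solr start

# Create a core named "nutch"; Solr 5 generates a default, schemaless
# configuration for it under the server's data directory
bin/solr create -c nutch

# The core's HTTP API then lives at http://localhost:8983/solr/nutch
# (the #/nutch form is only the admin UI route in the browser)
```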
On Mon, Apr 6, 2015 at 3:13 PM, Anchit Jain <[email protected]>
wrote:
I want to index nutch results using *Solr 5.0*, but as mentioned in
https://wiki.apache.org/nutch/NutchTutorial there is no directory
${APACHE_SOLR_HOME}/example/solr/collection1/conf/
in Solr 5.0. So where do I have to copy *schema.xml*?
Also, there is no *start.jar* present in the example directory.