Same error :-( So is there no workaround for it?
On Tuesday 07 April 2015 10:06 PM, Jeff Cocking wrote:
I use the following for all my indexing work.

    Usage:   bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
    Example: bin/crawl urls/ TestCrawl/ http://localhost:8983/solr/ 2

On Tue, Apr 7, 2015 at 11:20 AM, Anchit Jain <[email protected]> wrote:

Yes, it is working correctly from the browser. I can also add documents manually from the web browser, but not through Nutch. I am not able to figure out where the problem is. Is there any manual way of indexing the crawldb and linkdb besides that command?

On Tuesday 07 April 2015 09:47 PM, Jeff Cocking wrote:

There can be numerous reasons: hosts.conf, firewall, etc. These are all unique to your system. Have you viewed the Solr admin panel via a browser? This is a critical step in the installation; it validates that Solr can accept HTTP commands.

On Tue, Apr 7, 2015 at 9:53 AM, Anchit Jain <[email protected]> wrote:

I created a new core named "foo". Then I copied the schema.xml from Nutch into var/solr/data/foo/conf, with the changes described in https://wiki.apache.org/nutch/NutchTutorial. I changed the URL to http://localhost:8983/solr/#/foo, so the new command is:

    bin/nutch solrindex http://localhost:8983/solr/#/foo crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize

But now I am getting the error:

    org.apache.solr.common.SolrException: HTTP method POST is not supported by this URL

Is some other change also required in the URL to support POST requests?
Full log:

    2015-04-07 20:10:56,068 INFO  indexer.IndexingJob - Indexer: starting at 2015-04-07 20:10:56
    2015-04-07 20:10:56,178 INFO  indexer.IndexingJob - Indexer: deleting gone documents: false
    2015-04-07 20:10:56,178 INFO  indexer.IndexingJob - Indexer: URL filtering: true
    2015-04-07 20:10:56,178 INFO  indexer.IndexingJob - Indexer: URL normalizing: true
    2015-04-07 20:10:56,727 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
    2015-04-07 20:10:56,727 INFO  indexer.IndexingJob - Active IndexWriters :
    SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : use authentication (default false)
        solr.auth : username for authentication
        solr.auth.password : password for authentication
    2015-04-07 20:10:56,772 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
    2015-04-07 20:10:56,772 INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
    2015-04-07 20:10:56,772 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20150406231502
    2015-04-07 20:10:57,205 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2015-04-07 20:10:58,020 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
    2015-04-07 20:10:58,134 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
    2015-04-07 20:11:00,114 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
    2015-04-07 20:11:01,205 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
    2015-04-07 20:11:01,344 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
    2015-04-07 20:11:01,577 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
    2015-04-07 20:11:01,788 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
    2015-04-07 20:11:01,921 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
    2015-04-07 20:11:01,986 INFO  solr.SolrMappingReader - source: content dest: content
    2015-04-07 20:11:01,986 INFO  solr.SolrMappingReader - source: title dest: title
    2015-04-07 20:11:01,986 INFO  solr.SolrMappingReader - source: host dest: host
    2015-04-07 20:11:01,986 INFO  solr.SolrMappingReader - source: segment dest: segment
    2015-04-07 20:11:01,987 INFO  solr.SolrMappingReader - source: boost dest: boost
    2015-04-07 20:11:01,987 INFO  solr.SolrMappingReader - source: digest dest: digest
    2015-04-07 20:11:01,987 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
    2015-04-07 20:11:02,266 INFO  solr.SolrIndexWriter - Indexing 250 documents
    2015-04-07 20:11:02,267 INFO  solr.SolrIndexWriter - Deleting 0 documents
    2015-04-07 20:11:02,512 INFO  solr.SolrIndexWriter - Indexing 250 documents
    2015-04-07 20:11:02,576 WARN  mapred.LocalJobRunner - job_local1831338118_0001
    org.apache.solr.common.SolrException: HTTP method POST is not supported by this URL

    HTTP method POST is not supported by this URL

    request: http://localhost:8983/solr/
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:135)
        at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
        at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:458)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:323)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
    2015-04-07 20:11:02,724 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)

On Tuesday 07 April 2015 06:53 PM, Jeff Cocking wrote:

The command you are using is not pointing to the specific Solr index you created. The http://localhost:8983/solr needs to be changed to the URL for the core you created. It should look like http://localhost:8983/solr/#/new_core_name.
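A note on why the "#" form misbehaves: everything after "#" in a URL is a client-side fragment that HTTP clients never transmit, so a request aimed at http://localhost:8983/solr/#/foo actually reaches http://localhost:8983/solr/ — exactly the URL shown in the failing "request:" log line above. A minimal shell sketch (the core name "foo" is taken from the thread) of recovering the endpoint Nutch needs:

```shell
# Admin-UI address copied from the browser; '#/foo' is only a fragment
ui_url="http://localhost:8983/solr/#/foo"

# HTTP clients drop the fragment, so this is what Solr actually sees:
wire_url="${ui_url%%#*}"
echo "$wire_url"    # http://localhost:8983/solr/

# The core endpoint to pass to bin/nutch solrindex has no '#' in it:
core_url="${ui_url%%/#*}/${ui_url##*#/}"
echo "$core_url"    # http://localhost:8983/solr/foo
```

With that URL the command would presumably read `bin/nutch solrindex http://localhost:8983/solr/foo crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize`, assuming the core exists as described.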
On Tue, Apr 7, 2015 at 2:33 AM, Anchit Jain <[email protected]> wrote:

I followed the instructions given on your blog, created a new core for the Nutch data, and copied Nutch's schema.xml into it. Then I ran the following command in the Nutch working directory:

    bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize

But the same error came up as in the previous runs.

On Tue, 7 Apr 2015 at 10:00 Jeff Cocking <[email protected]> wrote:

Solr 5 is multicore by default. You have not finished the install: you still need to set up Solr 5's core. I would suggest you look at the link I sent to finish your setup. Once you do, your Solr URL will be http://localhost:8983/solr/#/core_name.

Jeff Cocking
I apologize for my brevity. This was sent from my mobile device while I should be focusing on something else... like a meeting, driving, family, etc.

On Apr 6, 2015, at 11:16 PM, Anchit Jain <[email protected]> wrote:

I have already installed Solr and want to integrate it with Nutch. Whenever I issue this command to Nutch:

    bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize

I always get the error: Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)

Here is the complete hadoop log for the process. I have marked the error part in it.

    2015-04-07 09:38:06,613 INFO  indexer.IndexingJob - Indexer: starting at 2015-04-07 09:38:06
    2015-04-07 09:38:06,684 INFO  indexer.IndexingJob - Indexer: deleting gone documents: false
    2015-04-07 09:38:06,685 INFO  indexer.IndexingJob - Indexer: URL filtering: true
    2015-04-07 09:38:06,685 INFO  indexer.IndexingJob - Indexer: URL normalizing: true
    2015-04-07 09:38:06,893 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
    2015-04-07 09:38:06,893 INFO  indexer.IndexingJob - Active IndexWriters :
    SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : use authentication (default false)
        solr.auth : username for authentication
        solr.auth.password : password for authentication
    2015-04-07 09:38:06,898 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
    2015-04-07 09:38:06,898 INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
    2015-04-07 09:38:06,898 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20150406231502
    2015-04-07 09:38:07,036 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2015-04-07 09:38:07,540 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
    2015-04-07 09:38:07,565 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
    2015-04-07 09:38:09,552 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
    2015-04-07 09:38:10,642 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
    2015-04-07 09:38:10,734 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
    2015-04-07 09:38:10,895 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
    2015-04-07 09:38:11,088 INFO  regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
    2015-04-07 09:38:11,219 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
    2015-04-07 09:38:11,237 INFO  solr.SolrMappingReader - source: content dest: content
    2015-04-07 09:38:11,237 INFO  solr.SolrMappingReader - source: title dest: title
    2015-04-07 09:38:11,237 INFO  solr.SolrMappingReader - source: host dest: host
    2015-04-07 09:38:11,237 INFO  solr.SolrMappingReader - source: segment dest: segment
    2015-04-07 09:38:11,237 INFO  solr.SolrMappingReader - source: boost dest: boost
    2015-04-07 09:38:11,237 INFO  solr.SolrMappingReader - source: digest dest: digest
    2015-04-07 09:38:11,237 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
    2015-04-07 09:38:11,526 INFO  solr.SolrIndexWriter - Indexing 250 documents
    2015-04-07 09:38:11,526 INFO  solr.SolrIndexWriter - Deleting 0 documents
    2015-04-07 09:38:11,644 INFO  solr.SolrIndexWriter - Indexing 250 documents
    2015-04-07 09:38:11,699 WARN  mapred.LocalJobRunner - job_local1245074757_0001
    org.apache.solr.common.SolrException: Not Found

    Not Found

    request: http://localhost:8983/solr/update?wt=javabin&version=2
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:135)
        at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
        at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:458)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:323)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
    2015-04-07 09:38:12,408 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)

On Tue, 7 Apr 2015 at 03:18 Jeff Cocking <[email protected]> wrote:

With Solr 5.0.0 you can skip that step. Solr will auto-create your schema document based on the data being provided. One of the new features in Solr 5 is the install/service feature. I did a quick write-up on how to install Solr 5 on CentOS; there might be something useful there for you.
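This "Not Found" fails for a related reason: SolrJ appends /update to whatever base URL it is given, and in a multicore Solr 5 there is no update handler at the server root, only under a core's path. A small sketch of the two URL shapes (the core name "foo" is the one used elsewhere in the thread):

```shell
# Base URL passed to bin/nutch solrindex; it names no core
base="http://localhost:8983/solr"
core="foo"   # core name from elsewhere in the thread

# SolrJ appends /update to the base URL it is given:
echo "${base}/update"           # the URL the log shows returning 404
echo "${base}/${core}/update"   # per-core update handler that should exist
```

So the fix is the same in both failure modes: the URL given to solrindex must include the core name in its path.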
http://www.cocking.com/apache-solr-5-0-install-on-centos-7/

jeff

On Mon, Apr 6, 2015 at 3:13 PM, Anchit Jain <[email protected]> wrote:

I want to index Nutch results using Solr 5.0, but the directory ${APACHE_SOLR_HOME}/example/solr/collection1/conf/ mentioned in https://wiki.apache.org/nutch/NutchTutorial does not exist in Solr 5.0. So where do I have to copy schema.xml? Also, there is no start.jar present in the example directory.
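The paths the tutorial names are from the Solr 4.x layout. In Solr 5 the server is started with the bin/solr script and cores are created explicitly; a sketch of where schema.xml would then go, assuming a default Solr 5 install (the paths and the core name "nutch" are assumptions, not from the thread):

```shell
# Solr 5 ships neither example/solr/collection1/conf/ nor start.jar.
# From the Solr install directory one would instead run (not executed here):
#
#   bin/solr start            # replaces: java -jar start.jar
#   bin/solr create -c nutch  # creates a core named "nutch"
#
SOLR_HOME="server/solr"   # default core root in Solr 5 (an assumption)
CORE="nutch"              # hypothetical core name
# Nutch's schema.xml would be copied into this conf directory:
echo "${SOLR_HOME}/${CORE}/conf"
```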

