Apache nutch 1.9 error - Input path does not exist

gsamsa Wed, 24 Sep 2014 07:15:30 -0700

Hello guys,

I have installed *Apache Nutch 1.9* and *Solr 3.6.2*, both running on an Ubuntu
virtual machine in VirtualBox.


*Description of error*


I start a crawl like this:

*./bin/crawl urls/ -solr http://127.0.0.1:8983/solr/ 1*

However, I get the following error (this is my log from
`nutch/logs/hadoop.logs`):

  

        2014-09-24 14:39:46,252 INFO  crawl.Injector - Injector: starting at
2014-09-24 14:39:46
        2014-09-24 14:39:46,259 INFO  crawl.Injector - Injector: crawlDb:
-solr/crawldb
        2014-09-24 14:39:46,259 INFO  crawl.Injector - Injector: urlDir:
urls
        2014-09-24 14:39:46,260 INFO  crawl.Injector - Injector: Converting
injected urls to crawl db entries.
        2014-09-24 14:39:47,263 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
        2014-09-24 14:39:47,375 WARN  snappy.LoadSnappy - Snappy native
library not loaded
        2014-09-24 14:39:49,076 INFO  regex.RegexURLNormalizer - can't find
rules for scope 'inject', using default
        2014-09-24 14:39:49,132 INFO  regex.RegexURLNormalizer - can't find
rules for scope 'inject', using default
        2014-09-24 14:39:50,001 INFO  crawl.Injector - Injector: Total
number of urls rejected by filters: 0
        2014-09-24 14:39:50,002 INFO  crawl.Injector - Injector: Total
number of urls after normalization: 2
        2014-09-24 14:39:50,003 INFO  crawl.Injector - Injector: Merging
injected urls into crawl db.
        2014-09-24 14:39:51,046 INFO  crawl.Injector - Injector: overwrite:
false
        2014-09-24 14:39:51,046 INFO  crawl.Injector - Injector: update:
false
        2014-09-24 14:39:52,116 INFO  crawl.Injector - Injector: URLs
merged: 2
        2014-09-24 14:39:52,136 INFO  crawl.Injector - Injector: Total new
urls injected: 0
        2014-09-24 14:39:52,139 INFO  crawl.Injector - Injector: finished at
2014-09-24 14:39:52, elapsed: 00:00:05
        2014-09-24 14:39:55,557 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
        2014-09-24 14:39:55,571 INFO  crawl.Generator - Generator: starting
at 2014-09-24 14:39:55
        2014-09-24 14:39:55,574 INFO  crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
        2014-09-24 14:39:55,575 INFO  crawl.Generator - Generator:
filtering: false
        2014-09-24 14:39:55,575 INFO  crawl.Generator - Generator:
normalizing: true
        2014-09-24 14:39:55,575 INFO  crawl.Generator - Generator: topN:
50000
        2014-09-24 14:39:58,013 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
        2014-09-24 14:39:58,014 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
        2014-09-24 14:39:58,014 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
        2014-09-24 14:39:58,044 INFO  regex.RegexURLNormalizer - can't find
rules for scope 'partition', using default
        2014-09-24 14:39:58,291 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
        2014-09-24 14:39:58,292 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
        2014-09-24 14:39:58,292 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
        2014-09-24 14:39:58,370 INFO  regex.RegexURLNormalizer - can't find
rules for scope 'generate_host_count', using default
        2014-09-24 14:39:58,782 INFO  crawl.Generator - Generator:
Partitioning selected urls for politeness.
        2014-09-24 14:39:59,785 INFO  crawl.Generator - Generator: segment:
-solr/segments/20140924143959
        2014-09-24 14:40:00,313 INFO  regex.RegexURLNormalizer - can't find
rules for scope 'partition', using default
        2014-09-24 14:40:01,032 INFO  crawl.Generator - Generator: finished
at 2014-09-24 14:40:01, elapsed: 00:00:05
        2014-09-24 14:40:03,462 INFO  fetcher.Fetcher - Fetcher: starting at
2014-09-24 14:40:03
        2014-09-24 14:40:03,467 INFO  fetcher.Fetcher - Fetcher: segment:
-solr/segments
        2014-09-24 14:40:03,467 INFO  fetcher.Fetcher - Fetcher Timelimit
set for : 1411573203467
        2014-09-24 14:40:04,207 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
        2014-09-24 14:40:04,301 ERROR security.UserGroupInformation -
PriviledgedActionException as:testUser
cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist:
file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate
        2014-09-24 14:40:04,302 ERROR fetcher.Fetcher - Fetcher:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate
                at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
                at
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
                at
org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:106)
                at
org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
                at
org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
                at
org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
                at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
                at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
                at java.security.AccessController.doPrivileged(Native Method)
                at javax.security.auth.Subject.doAs(Subject.java:415)
                at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
                at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
                at 
org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
                at 
org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
                at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1432)
                at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1468)
                at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
                at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1441)

I have basically configured Solr as described in the tutorial on the Apache wiki
<http://wiki.apache.org/nutch/NutchTutorial#A6._Integrate_Solr_with_Nutch>:

    mv ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml.org
    cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
    vi ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml

    Copy exactly at line 351: <field name="_version_" type="long" indexed="true" stored="true"/>

This is what I get when I start Solr:

<http://lucene.472066.n3.nabble.com/file/n4160918/solr.jpg> 

*What I tried:*


According to this thread
<http://lucene.472066.n3.nabble.com/Exception-org-apache-hadoop-mapred-InvalidInputException-Input-path-does-not-exist-file-home-nutch-1a-td3572303.html>,
the issue should be fixed by deleting all segment files in
*-solr/segments*. However, that does not resolve the issue.
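
For reference, the cleanup I tried looks roughly like the sketch below. The helper name `clean_segments` is made up for illustration, and it assumes the crawl directory is the `-solr` folder that shows up in the log. Because that name starts with a dash, the path is prefixed with `./` so that `rm` does not read it as an option.

```shell
# Hypothetical helper (not part of Nutch): delete every segment under a
# crawl directory while keeping the segments/ folder itself.
# The default "./-solr" is an assumption taken from the paths in my log.
clean_segments() {
  crawl_dir="${1:-./-solr}"

  # A relative name such as "-solr" would be parsed as a command-line
  # option, so make sure the path starts with "/" or "./".
  case "$crawl_dir" in
    /*|./*) ;;
    *) crawl_dir="./$crawl_dir" ;;
  esac

  if [ -d "$crawl_dir/segments" ]; then
    # Remove the timestamped segment directories, e.g. 20140924143959.
    rm -rf -- "$crawl_dir"/segments/*
  fi
}
```

Calling `clean_segments -solr` from the Nutch directory and then re-running the crawl command still ends with the same "Input path does not exist" error for me.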

Any ideas where this error might come from and what I can do to fix
it?



