Re: Apache nutch 1.9 error - Input path does not exist

Jonathan Cooper-Ellis Wed, 24 Sep 2014 07:35:16 -0700

Hello,

It looks like you're confusing the usage of bin/crawl with the old
bin/nutch crawl command. You want to start the crawl like this:


bin/crawl <seed_directory> <crawl_directory> <solr_url> <number_of_rounds>

So, the script thinks "-solr" is your crawl directory (which does not
exist):

2014-09-24 14:40:04,302 ERROR fetcher.Fetcher - Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate
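
To make the argument order concrete, here is a minimal sketch. The helper check_crawl_args is hypothetical (it is not part of Nutch), and the crawl directory name crawl/ is only an assumption on my side. The helper just checks that four positional arguments were given, that none of them looks like an option, and then prints the bin/crawl command line it would run:

```shell
#!/bin/sh
# Hypothetical helper, NOT part of Nutch: sanity-checks the four positional
# arguments described above and prints the resulting bin/crawl command line.
check_crawl_args() {
    if [ "$#" -ne 4 ]; then
        echo "usage: check_crawl_args <seed_directory> <crawl_directory> <solr_url> <number_of_rounds>" >&2
        return 1
    fi
    # An argument such as "-solr" is taken as a positional value,
    # so reject anything that looks like an option.
    for arg in "$@"; do
        case "$arg" in
            -*) echo "unexpected option-style argument: $arg" >&2
                return 1 ;;
        esac
    done
    # The last argument must be a plain number of rounds.
    case "$4" in
        ''|*[!0-9]*) echo "number_of_rounds must be a number: $4" >&2
                     return 1 ;;
    esac
    echo "bin/crawl $1 $2 $3 $4"
}

# Example (crawl/ is an assumed directory name):
#   check_crawl_args urls/ crawl/ http://127.0.0.1:8983/solr/ 1
```

With your original arguments (urls/ -solr http://127.0.0.1:8983/solr/ 1) the check fails on "-solr", which is the value the script ended up using as the crawl directory.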

Hope that helps!

-jce



On Wed, Sep 24, 2014 at 9:36 AM, gsamsa <[email protected]> wrote:

> Hello guys,
>
> I have installed *Apache Nutch 1.9* and *Solr 3.6.2*, which run on an
> Ubuntu virtual machine in VirtualBox.
>
> *Description of error*
>
>
> I start a crawl like this:
>
> *./bin/crawl urls/ -solr http://127.0.0.1:8983/solr/ 1*
>
> However, I get the following error (this is my log from
> `nutch/logs/hadoop.logs`):
>
>
>
>     2014-09-24 14:39:46,252 INFO  crawl.Injector - Injector: starting at 2014-09-24 14:39:46
>     2014-09-24 14:39:46,259 INFO  crawl.Injector - Injector: crawlDb: -solr/crawldb
>     2014-09-24 14:39:46,259 INFO  crawl.Injector - Injector: urlDir: urls
>     2014-09-24 14:39:46,260 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
>     2014-09-24 14:39:47,263 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>     2014-09-24 14:39:47,375 WARN  snappy.LoadSnappy - Snappy native library not loaded
>     2014-09-24 14:39:49,076 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
>     2014-09-24 14:39:49,132 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
>     2014-09-24 14:39:50,001 INFO  crawl.Injector - Injector: Total number of urls rejected by filters: 0
>     2014-09-24 14:39:50,002 INFO  crawl.Injector - Injector: Total number of urls after normalization: 2
>     2014-09-24 14:39:50,003 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
>     2014-09-24 14:39:51,046 INFO  crawl.Injector - Injector: overwrite: false
>     2014-09-24 14:39:51,046 INFO  crawl.Injector - Injector: update: false
>     2014-09-24 14:39:52,116 INFO  crawl.Injector - Injector: URLs merged: 2
>     2014-09-24 14:39:52,136 INFO  crawl.Injector - Injector: Total new urls injected: 0
>     2014-09-24 14:39:52,139 INFO  crawl.Injector - Injector: finished at 2014-09-24 14:39:52, elapsed: 00:00:05
>     2014-09-24 14:39:55,557 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>     2014-09-24 14:39:55,571 INFO  crawl.Generator - Generator: starting at 2014-09-24 14:39:55
>     2014-09-24 14:39:55,574 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
>     2014-09-24 14:39:55,575 INFO  crawl.Generator - Generator: filtering: false
>     2014-09-24 14:39:55,575 INFO  crawl.Generator - Generator: normalizing: true
>     2014-09-24 14:39:55,575 INFO  crawl.Generator - Generator: topN: 50000
>     2014-09-24 14:39:58,013 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>     2014-09-24 14:39:58,014 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
>     2014-09-24 14:39:58,014 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
>     2014-09-24 14:39:58,044 INFO  regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
>     2014-09-24 14:39:58,291 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>     2014-09-24 14:39:58,292 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
>     2014-09-24 14:39:58,292 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
>     2014-09-24 14:39:58,370 INFO  regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
>     2014-09-24 14:39:58,782 INFO  crawl.Generator - Generator: Partitioning selected urls for politeness.
>     2014-09-24 14:39:59,785 INFO  crawl.Generator - Generator: segment: -solr/segments/20140924143959
>     2014-09-24 14:40:00,313 INFO  regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
>     2014-09-24 14:40:01,032 INFO  crawl.Generator - Generator: finished at 2014-09-24 14:40:01, elapsed: 00:00:05
>     2014-09-24 14:40:03,462 INFO  fetcher.Fetcher - Fetcher: starting at 2014-09-24 14:40:03
>     2014-09-24 14:40:03,467 INFO  fetcher.Fetcher - Fetcher: segment: -solr/segments
>     2014-09-24 14:40:03,467 INFO  fetcher.Fetcher - Fetcher Timelimit set for : 1411573203467
>     2014-09-24 14:40:04,207 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>     2014-09-24 14:40:04,301 ERROR security.UserGroupInformation - PriviledgedActionException as:testUser cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate
>     2014-09-24 14:40:04,302 ERROR fetcher.Fetcher - Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate
>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
>         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
>         at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:106)
>         at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
>         at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
>         at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1432)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1468)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1441)
>
> I have basically configured my Solr as in the tutorial on the Apache wiki
> <http://wiki.apache.org/nutch/NutchTutorial#A6._Integrate_Solr_with_Nutch>:
>
>     mv ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml.org
>     cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
>     vi ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml
>
>     Copy exactly at line 351: <field name="_version_" type="long" indexed="true" stored="true"/>
>
> This is what I get when I start solr:
>
> <http://lucene.472066.n3.nabble.com/file/n4160918/solr.jpg>
>
> *What I tried:*
>
>
> According to this thread
> <http://lucene.472066.n3.nabble.com/Exception-org-apache-hadoop-mapred-InvalidInputException-Input-path-does-not-exist-file-home-nutch-1a-td3572303.html>
> the issue should be fixed by deleting all segment files in
> *-solr/segments*. However, that does not resolve the issue.
>
> Any recommendations on where this error comes from and what I can do to fix
> it?
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Apache-nutch-1-9-error-Input-path-does-not-exist-tp4160918.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
