Hello,

It looks like you're confusing the usage of bin/crawl with the old bin/nutch crawl command. You want to start the crawl like this:

bin/crawl <seed_directory> <crawl_directory> <solr_url> <number_of_rounds>

So, the script thinks "-solr" is your crawl directory (which does not exist):

2014-09-24 14:40:04,302 ERROR fetcher.Fetcher - Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate

Hope that helps!

-jce

On Wed, Sep 24, 2014 at 9:36 AM, gsamsa <[email protected]> wrote:

> Hello guys,
>
> I have installed *apache nutch 1.9* and *solr 3.6.2*, which run on an ubuntu
> virtual machine in virtualbox.
>
> *Description of error*
>
> I start a crawl like that:
>
> *./bin/crawl urls/ -solr http://127.0.0.1:8983/solr/ 1*
>
> However, I get the following error(that is my log from
> `nutch/logs/hadoop.logs`):
>
> / 2014-09-24 14:39:46,252 INFO crawl.Injector - Injector: starting at 2014-09-24 14:39:46
> 2014-09-24 14:39:46,259 INFO crawl.Injector - Injector: crawlDb: -solr/crawldb
> 2014-09-24 14:39:46,259 INFO crawl.Injector - Injector: urlDir: urls
> 2014-09-24 14:39:46,260 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
> 2014-09-24 14:39:47,263 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2014-09-24 14:39:47,375 WARN snappy.LoadSnappy - Snappy native library not loaded
> 2014-09-24 14:39:49,076 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> 2014-09-24 14:39:49,132 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> 2014-09-24 14:39:50,001 INFO crawl.Injector - Injector: Total number of urls rejected by filters: 0
> 2014-09-24 14:39:50,002 INFO crawl.Injector - Injector: Total number of urls after normalization: 2
> 2014-09-24 14:39:50,003 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
> 2014-09-24 14:39:51,046 INFO crawl.Injector - Injector: overwrite: false
> 2014-09-24 14:39:51,046 INFO crawl.Injector - Injector: update: false
> 2014-09-24 14:39:52,116 INFO crawl.Injector - Injector: URLs merged: 2
> 2014-09-24 14:39:52,136 INFO crawl.Injector - Injector: Total new urls injected: 0
> 2014-09-24 14:39:52,139 INFO crawl.Injector - Injector: finished at 2014-09-24 14:39:52, elapsed: 00:00:05
> 2014-09-24 14:39:55,557 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2014-09-24 14:39:55,571 INFO crawl.Generator - Generator: starting at 2014-09-24 14:39:55
> 2014-09-24 14:39:55,574 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> 2014-09-24 14:39:55,575 INFO crawl.Generator - Generator: filtering: false
> 2014-09-24 14:39:55,575 INFO crawl.Generator - Generator: normalizing: true
> 2014-09-24 14:39:55,575 INFO crawl.Generator - Generator: topN: 50000
> 2014-09-24 14:39:58,013 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2014-09-24 14:39:58,014 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2014-09-24 14:39:58,014 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2014-09-24 14:39:58,044 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> 2014-09-24 14:39:58,291 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2014-09-24 14:39:58,292 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2014-09-24 14:39:58,292 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2014-09-24 14:39:58,370 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
> 2014-09-24 14:39:58,782 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
> 2014-09-24 14:39:59,785 INFO crawl.Generator - Generator: segment: -solr/segments/20140924143959
> 2014-09-24 14:40:00,313 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> 2014-09-24 14:40:01,032 INFO crawl.Generator - Generator: finished at 2014-09-24 14:40:01, elapsed: 00:00:05
> 2014-09-24 14:40:03,462 INFO fetcher.Fetcher - Fetcher: starting at 2014-09-24 14:40:03
> 2014-09-24 14:40:03,467 INFO fetcher.Fetcher - Fetcher: segment: -solr/segments
> 2014-09-24 14:40:03,467 INFO fetcher.Fetcher - Fetcher Timelimit set for : 1411573203467
> 2014-09-24 14:40:04,207 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2014-09-24 14:40:04,301 ERROR security.UserGroupInformation - PriviledgedActionException as:testUser cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate
> 2014-09-24 14:40:04,302 ERROR fetcher.Fetcher - Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate
>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
>     at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:106)
>     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
>     at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
>     at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1432)
>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1468)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1441)/
>
> I basically have configured my solr like in the tutorial on apache wiki
> <http://wiki.apache.org/nutch/NutchTutorial#A6._Integrate_Solr_with_Nutch>:
>
> / mv ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml.org
>
> cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
> vi ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml
>
> Copy exactly in 351 line: <field name="_version_" type="long" indexed="true" stored="true"/>
> /
> This is what I get when I start solr:
>
> <http://lucene.472066.n3.nabble.com/file/n4160918/solr.jpg>
>
> *What I tried:*
>
> According to this thread
> <http://lucene.472066.n3.nabble.com/Exception-org-apache-hadoop-mapred-InvalidInputException-Input-path-does-not-exist-file-home-nutch-1a-td3572303.html>
> the issue should be fixed by deleting all segments files in
> *-solr/segments*, however, that does not resolve the issue.
>
> Any recommendations where this error can come from and what I can do to fix
> it?
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Apache-nutch-1-9-error-Input-path-does-not-exist-tp4160918.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
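
For the command in the quoted message, the positional form from the reply would look roughly like the sketch below. The crawl directory name `crawl/` and the helper function `crawl_cmd` are assumptions made for illustration (neither appears in this thread). The helper only prints the command line; it rejects any argument that looks like an old-style flag such as `-solr`, which is what ended up being used as the crawl directory here.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): print a bin/crawl command line
# in the positional form quoted in the reply:
#   bin/crawl <seed_directory> <crawl_directory> <solr_url> <number_of_rounds>
crawl_cmd() {
  if [ "$#" -ne 4 ]; then
    echo "usage: crawl_cmd <seed_directory> <crawl_directory> <solr_url> <number_of_rounds>" >&2
    return 1
  fi
  # Refuse anything that looks like an option flag (e.g. "-solr"), since it
  # would otherwise be taken as one of the four positional arguments.
  for arg in "$@"; do
    case "$arg" in
      -*)
        echo "unexpected option '$arg': expected four positional arguments" >&2
        return 1
        ;;
    esac
  done
  echo "bin/crawl $1 $2 $3 $4"
}

# The call from the quoted message, with an assumed crawl directory "crawl/"
# in place of "-solr":
crawl_cmd urls/ crawl/ http://127.0.0.1:8983/solr/ 1
```

Running the printed command from the Nutch directory should then make the injector and generator write to `crawl/crawldb` and `crawl/segments` rather than `-solr/...`.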

