Hi, you're right. This will be fixed in Nutch 1.11.
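Until 1.11 is out, one possible workaround (just a sketch, reusing the
testcrawl paths from your log) is to run the indexing step by hand and
leave out the -linkdb argument, which should be optional for bin/nutch
index in 1.10. Note that without the linkdb no anchor texts get indexed:

  # hypothetical manual invocation, paths taken from your error output
  bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/testcrawl \
      testcrawl/crawldb testcrawl/segments/20151109135956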
Thanks,
Sebastian

On 11/09/2015 10:07 PM, Frumpus wrote:
> Ok, it seems as though I have run into a version of this problem:
>
> [NUTCH-2041] indexer fails if linkdb is missing - ASF JIRA
> https://issues.apache.org/jira/browse/NUTCH-2041
> "If the linkdb is missing the indexer fails with 2015-06-17 12:52:10,621
> ERROR ...cause:org.apache.hadoop.mapred.InvalidInputException: Input path
> does not exist: .../linkdb/current"
>
> Which is a result of the crawl script not being aware that I set
> ignore_external_links to true in my nutch-site.xml file.
> I am trying to crawl several sites and was hoping to keep my life simple
> by ignoring external links and leaving regex-urlfilter.txt alone (just
> using +.). Now it looks like I'll have to change that back to false and
> mess with regex filters for all of my urls. Hopefully I can get 1.11
> soon? It looks like this is fixed there.
>
> Subject: nutch 1.10 crawl fails at indexing with Input path does not
> exist .../linkdb/current
>
> I am running nutch 1.10 on Ubuntu 14.04 with Solr 5.3.1.
> I have set up a fairly simple instance with 1 seed url and it crawls
> fine, but when it attempts to index, it crashes with the following:
>
> Indexer: starting at 2015-11-09 14:00:17
> Indexer: deleting gone documents: false
> Indexer: URL filtering: false
> Indexer: URL normalizing: false
> Active IndexWriters :
> SOLRIndexWriter
>   solr.server.url : URL of the SOLR instance (mandatory)
>   solr.commit.size : buffer size when sending to SOLR (default 1000)
>   solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>   solr.auth : use authentication (default false)
>   solr.auth.username : username for authentication
>   solr.auth.password : password for authentication
>
> Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does
> not exist: file:/opt/apache-nutch-1.10/testcrawl/linkdb/current
>   at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
>   at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
>   at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
>   at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
>   at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
>   at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
>   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
>   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
>   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
>   at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
>   at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>   at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)
>
> Error running:
>   /opt/apache-nutch-1.10/bin/nutch index
>   -Dsolr.server.url=http://localhost:8983/solr/testcrawl testcrawl//crawldb
>   -linkdb testcrawl//linkdb testcrawl//segments/20151109135956
> Failed with exit value 255.
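Regarding the URL filters: if you do end up switching ignore_external_links
back to false, restricting a crawl to a handful of sites takes only a few
lines in regex-urlfilter.txt. A sketch with made-up hostnames (patterns are
tried top to bottom, the first match decides):

  # accept only the seed hosts (hypothetical examples)
  +^https?://(www\.)?site-one\.com/
  +^https?://(www\.)?site-two\.org/
  # reject everything else (this replaces the default catch-all +.)
  -.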
> The hadoop.log file has a little more detail that suggests a possible
> permissions problem, but running the crawl as root (using sudo), it seems
> like that should not be an issue.
>
> 2015-11-09 14:00:18,556 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: testcrawl/crawldb
> 2015-11-09 14:00:18,556 INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: testcrawl/linkdb
> 2015-11-09 14:00:18,556 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: testcrawl/segments/20151109135956
> 2015-11-09 14:00:19,059 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2015-11-09 14:00:19,287 ERROR security.UserGroupInformation - PriviledgedActionException as:root cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/apache-nutch-1.10/testcrawl/linkdb/current
> 2015-11-09 14:00:19,297 ERROR indexer.IndexingJob - Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/apache-nutch-1.10/testcrawl/linkdb/current
>   at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
>   [... same stack trace as above ...]
>
> I'm still learning here and could really use some guidance on how to
> troubleshoot this.
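As for troubleshooting: this one is not a permissions problem. A quick
check on the local filesystem shows whether the linkdb was ever created;
if the directory is missing, the indexer has nothing to read no matter
which user runs it:

  # "No such file or directory" here means the missing-linkdb bug above,
  # not a permissions issue
  ls -ld /opt/apache-nutch-1.10/testcrawl/linkdb/current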

