Ok, it seems as though I have run into a version of this problem:
[NUTCH-2041] indexer fails if linkdb is missing - ASF JIRA (issues.apache.org):
"If the linkdb is missing the indexer fails with 2015-06-17 12:52:10,621 ERROR
...cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: .../linkdb/current"
This is a result of the crawl script not being aware that I set
ignore_external_links to true in my nutch-site.xml file.
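For reference, the property I am talking about is db.ignore.external.links; the
block I added to my nutch-site.xml looks roughly like this (retyped from memory,
so the description wording is only approximate):

  <!-- retyped from memory; the description wording is approximate -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>If true, outlinks that lead to external hosts are ignored.</description>
  </property>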
I am trying to crawl several sites and was hoping to keep my life simple by
ignoring external links and leaving regex-urlfilter.txt alone (just using +.).
Now it looks like I'll have to change that back to false and mess with regex
filters for all of my URLs. Hopefully 1.11 will be out soon? It looks like this
is fixed there.
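If I do end up going back to per-site regex filters, my understanding is that
regex-urlfilter.txt would look something like this (the hosts below are
placeholders, not my actual seeds):

  # hypothetical example: only accept URLs from my own sites
  # (example.com / example.org stand in for my real hosts)
  +^https?://([a-z0-9.-]*\.)?example\.com/
  +^https?://([a-z0-9.-]*\.)?example\.org/
  # reject everything else
  -.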
Subject: nutch 1.10 crawl fails at indexing with Input path does not exist
.../linkdb/current
I am running nutch 1.10 on Ubuntu 14.04 with Solr 5.3.1
I have set up a fairly simple instance with 1 seed URL and it crawls fine, but
when it attempts to index, it crashes with the following:
Indexer: starting at 2015-11-09 14:00:17
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication
Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/apache-nutch-1.10/testcrawl/linkdb/current
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
        at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)
Error running:
  /opt/apache-nutch-1.10/bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/testcrawl testcrawl//crawldb -linkdb testcrawl//linkdb testcrawl//segments/20151109135956
Failed with exit value 255.
The hadoop.log file has a little more detail that suggests a possible
permissions problem, but since I am running the crawl as root (using sudo), it
seems like that should not be an issue.
2015-11-09 14:00:18,556 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: testcrawl/crawldb
2015-11-09 14:00:18,556 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: testcrawl/linkdb
2015-11-09 14:00:18,556 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: testcrawl/segments/20151109135956
2015-11-09 14:00:19,059 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-11-09 14:00:19,287 ERROR security.UserGroupInformation - PriviledgedActionException as:root cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/apache-nutch-1.10/testcrawl/linkdb/current
2015-11-09 14:00:19,297 ERROR indexer.IndexingJob - Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/apache-nutch-1.10/testcrawl/linkdb/current
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
        at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)
I'm still learning here and could really use some guidance on how to
troubleshoot this.