Ok, it seems as though I have run into a version of this problem:
[NUTCH-2041] indexer fails if linkdb is missing - ASF JIRA (issues.apache.org):
"If the linkdb is missing the indexer fails with 2015-06-17 12:52:10,621 ERROR
...cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: .../linkdb/current"
This is a result of the crawl script not being aware that I set
ignore_external_links to true in my nutch-site.xml file.
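For reference, the property I am talking about is db.ignore.external.links; the
block I added to my nutch-site.xml looks roughly like this (retyped from memory,
so the description wording is only approximate):

  <!-- retyped from memory; the description wording is approximate -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>If true, outlinks that lead to external hosts are ignored.</description>
  </property>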
I am trying to crawl several sites and was hoping to keep my life simple by
ignoring external links and leaving regex-urlfilter.txt alone (just using +.).
Now it looks like I'll have to change that back to false and mess with regex
filters for all of my URLs. Hopefully 1.11 will be out soon? It looks like this
is fixed there.
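If I do end up going back to per-site regex filters, my understanding is that
regex-urlfilter.txt would look something like this (the hosts below are
placeholders, not my actual seeds):

  # hypothetical example: only accept URLs from my own sites
  # (example.com / example.org stand in for my real hosts)
  +^https?://([a-z0-9.-]*\.)?example\.com/
  +^https?://([a-z0-9.-]*\.)?example\.org/
  # reject everything else
  -.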
Subject: nutch 1.10 crawl fails at indexing with Input path does not exist
.../linkdb/current
I am running nutch 1.10 on Ubuntu 14.04 with Solr 5.3.1
I have set up a fairly simple instance with 1 seed URL and it crawls fine, but
when it attempts to index, it crashes with the following:
Indexer: starting at 2015-11-09 14:00:17
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication
Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/apache-nutch-1.10/testcrawl/linkdb/current
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
        at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)
Error running:
  /opt/apache-nutch-1.10/bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/testcrawl testcrawl//crawldb -linkdb testcrawl//linkdb testcrawl//segments/20151109135956
Failed with exit value 255.
The hadoop.log file has a little more detail that suggests a possible
permissions problem, but since I am running the crawl as root (using sudo), it
seems like that should not be an issue.
2015-11-09 14:00:18,556 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: testcrawl/crawldb
2015-11-09 14:00:18,556 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: testcrawl/linkdb
2015-11-09 14:00:18,556 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: testcrawl/segments/20151109135956
2015-11-09 14:00:19,059 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-11-09 14:00:19,287 ERROR security.UserGroupInformation - PriviledgedActionException as:root cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/apache-nutch-1.10/testcrawl/linkdb/current
2015-11-09 14:00:19,297 ERROR indexer.IndexingJob - Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/apache-nutch-1.10/testcrawl/linkdb/current
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
        at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)
I'm still learning here and could really use some guidance on how to
troubleshoot this.