Hi,

you're right. This will be fixed in Nutch 1.11.

Thanks,
Sebastian
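
For anyone stuck on 1.10 in the meantime: the gist of the NUTCH-2041 fix is that the script should only pass `-linkdb` to the indexer when the linkdb directory actually exists. A minimal sketch of that guard (the variable name and the commented command line are illustrative, not the actual patch):

```shell
# Sketch of the NUTCH-2041 guard: only hand -linkdb to the indexer
# when the linkdb directory exists. CRAWL_PATH is a placeholder for
# the crawl directory (testcrawl in the report below).
CRAWL_PATH="testcrawl"

LINKDB_ARGS=""
if [ -d "$CRAWL_PATH/linkdb" ]; then
  LINKDB_ARGS="-linkdb $CRAWL_PATH/linkdb"
fi

# With the guard in place, the index step becomes roughly:
#   bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/testcrawl \
#       "$CRAWL_PATH/crawldb" $LINKDB_ARGS "$CRAWL_PATH/segments/20151109135956"
echo "linkdb args: [$LINKDB_ARGS]"
```

The same idea also works as a manual workaround: run the `bin/nutch index` step yourself and simply leave out the `-linkdb` option when no linkdb was built.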

On 11/09/2015 10:07 PM, Frumpus wrote:
> Ok, it seems as though I have run into a version of this problem:
> 
> 
> [NUTCH-2041] indexer fails if linkdb is missing - ASF JIRA
> "If the linkdb is missing the indexer fails with 2015-06-17 12:52:10,621 ERROR
> ...cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist: .../linkdb/current"
> 
> 
> Which is a result of the crawl script not being aware that I set 
> ignore_external_links to true in my nutch-site.xml file.
> I am trying to crawl several sites and was hoping to keep my life simple by 
> ignoring external links and leaving regex-urlfilter.txt alone (just using +.).
> Now it looks like I'll have to change that back to false and mess with regex 
> filters for all of my URLs. Hopefully I can get a 1.11 soon? It looks like 
> this is fixed there.
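
(For context, the setting being described is a single property in conf/nutch-site.xml. A sketch of what such an entry looks like, using the property name from the stock nutch-default.xml; the value is assumed from the description above, not copied from the reporter's file:)

```xml
<!-- In conf/nutch-site.xml: drop outlinks that lead to a different host,
     so regex-urlfilter.txt can stay at just "+." -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading to external hosts are ignored.</description>
</property>
```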
>      
>  Subject: nutch 1.10 crawl fails at indexing with Input path does not exist 
> .../linkdb/current
>    
> I am running nutch 1.10 on Ubuntu 14.04 with Solr 5.3.1.
> I have set up a fairly simple instance with 1 seed url and it crawls fine, 
> but when it attempts to index, it crashes with the following:
> Indexer: starting at 2015-11-09 14:00:17
> Indexer: deleting gone documents: false
> Indexer: URL filtering: false
> Indexer: URL normalizing: false
> Active IndexWriters :
> SOLRIndexWriter
>         solr.server.url : URL of the SOLR instance (mandatory)
>         solr.commit.size : buffer size when sending to SOLR (default 1000)
>         solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>         solr.auth : use authentication (default false)
>         solr.auth.username : username for authentication
>         solr.auth.password : password for authentication
> 
> Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/apache-nutch-1.10/testcrawl/linkdb/current
>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
>         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
>         at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
>         at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
>         at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
>         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
>         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)
> Error running:  /opt/apache-nutch-1.10/bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/testcrawl testcrawl//crawldb -linkdb testcrawl//linkdb testcrawl//segments/20151109135956
> Failed with exit value 255.
> 
> 
> The hadoop.log file has a little more detail that suggests a possible 
> permissions problem, but since I am running the crawl as root (using sudo), 
> it seems like that should not be an issue.
> 
> 
> 2015-11-09 14:00:18,556 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: testcrawl/crawldb
> 2015-11-09 14:00:18,556 INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: testcrawl/linkdb
> 2015-11-09 14:00:18,556 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: testcrawl/segments/20151109135956
> 2015-11-09 14:00:19,059 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2015-11-09 14:00:19,287 ERROR security.UserGroupInformation - PriviledgedActionException as:root cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/apache-nutch-1.10/testcrawl/linkdb/current
> 2015-11-09 14:00:19,297 ERROR indexer.IndexingJob - Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/apache-nutch-1.10/testcrawl/linkdb/current
>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
>         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
>         at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
>         at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
>         at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
>         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
>         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)
> I'm still learning here and could really use some guidance on how to 
> troubleshoot this. 
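
(One quick first step for any "Input path does not exist" failure is to check whether the inputs the job was given actually exist on disk. A small sketch, using the paths from the error above:)

```shell
# Check each input the indexer was handed; paths from the report above.
for p in testcrawl/crawldb testcrawl/linkdb/current testcrawl/segments/20151109135956; do
  if [ -e "$p" ]; then
    echo "present: $p"
  else
    echo "MISSING: $p"
  fi
done
```

In this thread the check would show `linkdb/current` missing, which points at the linkdb never having been built rather than at permissions.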
> 
>   
> 
