Actually I'm not sure whether I'm looking at the right log lines. Please
explain in more detail what exactly I should look for. Anyway, I found
the following line just before the error:

Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js:
failed(2,0): Can't retrieve Tika parser for mime-type text/javascript

But I can see that parsing errors like this already appeared earlier
during the crawl.
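For reference, this is roughly how I've been grepping the log, in case I'm matching the wrong patterns. The sample log lines below are made up just to illustrate the format I'm searching for (the real file is logs/hadoop.log in my local Nutch 1.3 install):

```shell
# Simulated excerpt of logs/hadoop.log -- fabricated lines, only to
# illustrate the patterns being matched.
cat > /tmp/hadoop-sample.log <<'EOF'
2011-07-12 12:31:55 INFO  fetcher.Fetcher - fetching http://eu.apachecon.com/js/jquery.akslideshow.js
2011-07-12 12:31:56 WARN  parse.ParseUtil - Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js: failed(2,0): Can't retrieve Tika parser for mime-type text/javascript
EOF

# Surface parse failures; these are the patterns I grep for.
grep -E "Error parsing|ParseException|Can't retrieve Tika parser" /tmp/hadoop-sample.log
```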



2011/7/12 Markus Jelsma <[email protected]>:
> Were there errors during parsing of that last segment?
>
>> I'm starting with nutch and I ran a simple job as described in the
>> nutch tutorial. After a while I get the following error:
>>
>>
>> CrawlDb update: URL filtering: true
>> CrawlDb update: Merging segment data into db.
>> CrawlDb update: finished at 2011-07-12 12:32:03, elapsed: 00:00:03
>> LinkDb: starting at 2011-07-12 12:32:03
>> LinkDb: linkdb: /Users/toom/Downloads/nutch-1.3/sites/linkdb
>> LinkDb: URL normalize: true
>> LinkDb: URL filter: true
>> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238
>> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732
>> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256
>> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
>> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
>> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
>> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
>> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238/parse_data
>> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732/parse_data
>> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256/parse_data
>>       at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>       at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>       at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>       at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>       at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>       at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>       at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>       at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
>>       at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
>>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>       at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
>
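To confirm which segments the InvalidInputException is complaining about, something like the following lists every segment that lacks the parse_data subdirectory the LinkDb job expects. The directory names here are created only to simulate the layout; against a real crawl you'd point SEGMENTS at sites/segments instead:

```shell
# Simulate a segments directory: one segment missing parse_data,
# one complete (made-up layout for illustration).
SEGMENTS=/tmp/segments-demo
mkdir -p "$SEGMENTS/20110707140238/crawl_fetch"
mkdir -p "$SEGMENTS/20110712123051/parse_data"

# Print every segment that lacks parse_data -- these are the ones
# LinkDb will reject with "Input path does not exist".
for seg in "$SEGMENTS"/*/; do
  [ -d "${seg}parse_data" ] || echo "missing parse_data: $seg"
done
```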
