> Actually I'm not sure if I'm looking at the right log lines. Please
> explain in more detail what exactly I should look for. Anyway, I
> found the following line just before the error:
> 
> Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js:
> failed(2,0): Can't retrieve Tika parser for mime-type text/javascript

These are not a problem; a missing Tika parser for text/javascript just means that content doesn't get parsed.

> 
> But I can see that parsing errors like this already appeared earlier
> during the crawl.

Well, either you didn't parse that segment, the parse output wasn't correctly 
written to disk, or your linkdb job is running out of space in your /tmp dir, 
which is common. In that last case you'll usually see a DiskChecker exception 
somewhere in the logs.
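
If you want to check which of those segments actually has its parse output 
before re-running anything, a quick sketch against the Hadoop FileSystem API 
like the one below will do it (the segments path is copied from your log and 
the class name is just something I made up; adjust as needed):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CheckSegments {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Defaults to the local segments dir from the log; pass another
      // path as the first argument to override it.
      Path segments = new Path(args.length > 0
          ? args[0]
          : "file:/Users/toom/Downloads/nutch-1.3/sites/segments");
      FileSystem fs = segments.getFileSystem(conf);
      FileStatus[] segs = fs.listStatus(segments);
      if (segs == null) {
        System.err.println("No such directory: " + segments);
        return;
      }
      for (FileStatus seg : segs) {
        // LinkDb reads <segment>/parse_data; if it's missing, the segment
        // was fetched but never successfully parsed.
        Path parseData = new Path(seg.getPath(), "parse_data");
        System.out.println(seg.getPath().getName() + " -> "
            + (fs.exists(parseData) ? "parsed" : "MISSING parse_data"));
      }
    }
  }

Any segment that comes back without parse_data should be re-parseable with 
bin/nutch parse <segment> before you run the linkdb step again.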

> 
> 2011/7/12 Markus Jelsma <[email protected]>:
> > Were there errors during parsing of that last segment?
> > 
> >> I'm starting with nutch and I ran a simple job as described in the
> >> nutch tutorial. After a while I get the following error:
> >> 
> >> 
> >> CrawlDb update: URL filtering: true
> >> CrawlDb update: Merging segment data into db.
> >> CrawlDb update: finished at 2011-07-12 12:32:03, elapsed: 00:00:03
> >> LinkDb: starting at 2011-07-12 12:32:03
> >> LinkDb: linkdb: /Users/toom/Downloads/nutch-1.3/sites/linkdb
> >> LinkDb: URL normalize: true
> >> LinkDb: URL filter: true
> >> LinkDb: adding segment:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238
> >> LinkDb: adding segment:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732
> >> LinkDb: adding segment:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256
> >> LinkDb: adding segment:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> >> LinkDb: adding segment:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
> >> LinkDb: adding segment:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
> >> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> >> Input path does not exist:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238/parse_data
> >> Input path does not exist:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732/parse_data
> >> Input path does not exist:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256/parse_data
> >>       at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> >>       at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> >>       at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> >>       at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> >>       at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> >>       at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >>       at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> >>       at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
> >>       at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
> >>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>       at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
