> Actually I'm not sure if I'm looking at the right log lines. Please
> explain in more detail what exactly I should look for. Anyway, I
> found the following line just before the error:
>
> Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js:
> failed(2,0): Can't retrieve Tika parser for mime-type text/javascript

These are not bad.

>
> But I can see that parsing errors like this already appeared earlier
> during the crawl.

Well, you either didn't parse that segment, or the parse hasn't been correctly written to disk, or your linkdb job is running out of space in your /tmp dir, which is common. You'll usually see a DiskChecker exception somewhere.

>
> 2011/7/12 Markus Jelsma <[email protected]>:
> > Were there errors during parsing of that last segment?
> >
> >> I'm starting with nutch and I ran a simple job as described in the
> >> nutch tutorial. After a while I get the following error:
> >>
> >>
> >> CrawlDb update: URL filtering: true
> >> CrawlDb update: Merging segment data into db.
> >> CrawlDb update: finished at 2011-07-12 12:32:03, elapsed: 00:00:03
> >> LinkDb: starting at 2011-07-12 12:32:03
> >> LinkDb: linkdb: /Users/toom/Downloads/nutch-1.3/sites/linkdb
> >> LinkDb: URL normalize: true
> >> LinkDb: URL filter: true
> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238
> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732
> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256
> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
> >> LinkDb: adding segment: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
> >> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> >> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238/parse_data
> >> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732/parse_data
> >> Input path does not exist: file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256/parse_data
> >>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> >>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> >>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> >>     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> >>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> >>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> >>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
> >>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
> >>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
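
A quick way to check the two likely causes named in the reply (segments that were never parsed, and a full /tmp) is sketched below. This is only a rough helper, not part of Nutch: the function names and the default "sites/segments" path are assumptions made for illustration, based on the paths in the log above. It lists every segment directory that has no parse_data subdirectory, which is exactly what the InvalidInputException complains about, and reports the free space on the filesystem holding /tmp.

```python
import os
import shutil
import sys


def segments_missing_parse(segments_dir):
    """Return the sorted names of segment dirs that lack a parse_data subdir."""
    missing = []
    for name in sorted(os.listdir(segments_dir)):
        segment = os.path.join(segments_dir, name)
        # Only directories count as segments; stray files are ignored.
        if os.path.isdir(segment) and not os.path.isdir(
            os.path.join(segment, "parse_data")
        ):
            missing.append(name)
    return missing


def free_bytes(path="/tmp"):
    """Return the free bytes on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    # Hypothetical default; pass your own segments directory as the first argument.
    segments_dir = sys.argv[1] if len(sys.argv) > 1 else "sites/segments"
    if os.path.isdir(segments_dir):
        for name in segments_missing_parse(segments_dir):
            print("no parse_data in segment:", name)
    else:
        print("segments directory not found:", segments_dir)
    if os.path.isdir("/tmp"):
        print("free space in /tmp: %.1f MB" % (free_bytes("/tmp") / 1e6))
```

Any segment it prints was either never parsed or its parse output did not reach disk. Such segments would need to be parsed again (in Nutch 1.x this should be `bin/nutch parse <segment>`, but check the usage output of bin/nutch for your version) or left out of the linkdb step.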

