> Actually I'm not sure if I'm looking at the right log lines. Please
> explain in more detail what exactly I should look for. Anyway, I
> found the following line just before the error:
>
> Error parsing: http://eu.apachecon.com/js/jquery.akslideshow.js:
> failed(2,0): Can't retrieve Tika parser for mime-type text/javascript
>
> But I can see that parsing errors like this already appeared earlier
> during the crawl.
>
>

This simply means that the javascript parser is not enabled in your conf
(which is the default behaviour); as a consequence the default parser
(Tika) was used to try to parse it, but it has no resources for doing so.
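For reference, enabling the javascript parser would mean adding parse-js to
plugin.includes in conf/nutch-site.xml. A sketch (the rest of the plugin list
below is illustrative only; copy the actual value from your nutch-default.xml):

```
<!-- conf/nutch-site.xml: add parse-js to plugin.includes (sketch;
     start from the plugin.includes value shipped in nutch-default.xml) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|js)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

As noted below, though, you probably don't want this parser at all.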

Note: we should probably add .js to the default URL filters. The javascript
parser has been deactivated by default because it generates atrocious URLs,
so we might as well prevent such URLs from being fetched in the first place.
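Concretely, that would mean extending the suffix-exclusion rule in
conf/regex-urlfilter.txt. A sketch (your file's exact rule may differ; the
change is just appending js to the list):

```
# skip URLs with these suffixes (js added to the existing list)
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
```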

Anyway, this is not the source of the problem. You seem to have unparsed
segments among the ones specified. It could be that you interrupted a
previous crawl, or that it failed, and you did not delete these segments or
the whole crawl directory. Removing the unparsed segments and running the
last couple of steps manually should do the trick.
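An unparsed segment is one whose directory lacks a parse_data subdirectory,
which is exactly what the LinkDb exception below complains about. A small
sketch to spot them (the sites/segments path is an assumption taken from
your output; point it at your own segments directory):

```shell
# list_unparsed DIR: print segment dirs under DIR that lack parse_data
list_unparsed() {
  for seg in "$1"/*/; do
    [ -d "$seg" ] || continue
    # a fetched-but-unparsed segment has no parse_data subdirectory
    [ -d "${seg}parse_data" ] || echo "unparsed: ${seg%/}"
  done
}

# Example (path is an assumption; adjust to your crawl directory):
list_unparsed sites/segments
```

Once identified, remove those segment directories (e.g. with rm -rf) and
re-run the last steps by hand against the remaining segments, i.e.
bin/nutch updatedb and bin/nutch invertlinks.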



>
>
>
> 2011/7/12 Markus Jelsma <[email protected]>:
> > Were there errors during parsing of that last segment?
> >
> >> I'm starting with nutch and I ran a simple job as described in the
> >> nutch tutorial. After a while I get the following error:
> >>
> >>
> >> CrawlDb update: URL filtering: true
> >> CrawlDb update: Merging segment data into db.
> >> CrawlDb update: finished at 2011-07-12 12:32:03, elapsed: 00:00:03
> >> LinkDb: starting at 2011-07-12 12:32:03
> >> LinkDb: linkdb: /Users/toom/Downloads/nutch-1.3/sites/linkdb
> >> LinkDb: URL normalize: true
> >> LinkDb: URL filter: true
> >> LinkDb: adding segment:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238
> >> LinkDb: adding segment:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732
> >> LinkDb: adding segment:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256
> >> LinkDb: adding segment:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122856
> >> LinkDb: adding segment:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712122908
> >> LinkDb: adding segment:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712123051
> >> Exception in thread "main"
> >> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238/parse_data
> >> Input path does not exist:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712113732/parse_data
> >> Input path does not exist:
> >> file:/Users/toom/Downloads/nutch-1.3/sites/segments/20110712114256/parse_data
> >>       at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> >>       at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> >>       at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> >>       at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> >>       at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> >>       at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >>       at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> >>       at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
> >>       at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
> >>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>       at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
