Hi All,

I am trying to parse already crawled segments by calling ParseSegment.parse(seg), where seg is the Path to the existing segment. This internally submits a new job, which fails with the following error:

Exception in thread "main" java.io.IOException: Segment already parsed!
    at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:80)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)

What I am trying to do is re-parse the already fetched data in order to test my HTML parse filter. It looks like this ParseSegment method is the one that gets called in the normal workflow of crawl, fetch, parse ...

I have modified org.apache.nutch.crawl.Crawl.run() so that it calls only ParseSegment.parse(segment), and I have commented out the injector, generator and fetcher parts. The segment name is passed on the command line.

Should I be calling some other method to test my HTML parse filter plugin without crawling again? Any pointers would be helpful.

Thanks,
Ashish
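P.S. In case it helps with diagnosis: my guess is that the check fails because the segment already contains parse output from the earlier crawl. Below is a minimal sketch I can use to confirm that. It assumes the segment sits on the local filesystem and that the parse output lives in subdirectories named crawl_parse, parse_data and parse_text; I have not verified those names against the source at ParseOutputFormat.java:80. The class and method names (SegmentParseCheck, existingParseDirs) are my own and are not part of Nutch.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class SegmentParseCheck {

    // Assumption: subdirectory names that hold parse output inside a segment.
    static final String[] PARSE_DIRS = {"crawl_parse", "parse_data", "parse_text"};

    // Returns the parse output directories that already exist under the segment.
    static List<Path> existingParseDirs(Path segment) {
        List<Path> found = new ArrayList<>();
        for (String name : PARSE_DIRS) {
            Path dir = segment.resolve(name);
            if (Files.isDirectory(dir)) {
                found.add(dir);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: SegmentParseCheck <segment-dir>");
            return;
        }
        List<Path> found = existingParseDirs(Paths.get(args[0]));
        if (found.isEmpty()) {
            System.out.println("No parse output found in segment.");
        } else {
            // Only reports what is there; it does not delete or move anything.
            System.out.println("Existing parse output: " + found);
        }
    }
}
```

If this does report existing directories, I would try the re-parse on a copy of the segment with those directories moved aside, rather than touching the original.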

