All,

I have been crawling the web for a few days now without any issues.
All of a sudden, today I came across this error:

Exception in thread "main" java.io.IOException: Segment already parsed!
at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:975)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:247)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:220)

Googling around, I found that I should be deleting my segment
sub-directories manually before each run, but I have not had to do this
previously. Here is the command I am running:

bin/crawl urls/seeds.txt crawl http://localhost:8983/solr 1
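For reference, the manual workaround those posts describe looks roughly like the sketch below. As I understand it, Nutch treats a segment as "already parsed" once it contains a crawl_parse/ sub-directory, and removing the parse output dirs lets the parse step run again. The demo layout here is made up; you would point SEGMENTS at your real crawl/segments directory instead.

```shell
#!/bin/sh
# Sketch of the manual segment cleanup (demo layout is made up --
# point SEGMENTS at your real crawl/segments dir).
SEGMENTS=$(mktemp -d)/segments
mkdir -p "$SEGMENTS/20130101000000/crawl_parse" \
         "$SEGMENTS/20130101000000/parse_data" \
         "$SEGMENTS/20130101000000/parse_text"

# Remove the parse outputs from any segment that has already been parsed,
# so that "bin/nutch parse <segment>" can be re-run without the IOException.
for seg in "$SEGMENTS"/*; do
  if [ -d "$seg/crawl_parse" ]; then
    echo "cleaning $seg"
    rm -rf "$seg/crawl_parse" "$seg/parse_data" "$seg/parse_text"
  fi
done
```

I have not confirmed this is the right fix here, since the bin/crawl script did not previously require it.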

What could I have changed to cause this "Segment already parsed" error
to appear? I can't seem to get rid of it!

Thanks,
Adam
