Hi Adam,

This should not happen: each cycle parses only the segment that was
just generated and fetched. What is your setting for this property?

<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content. Default is false, which means
  that a separate parsing step is required after fetching is finished.</description>
</property>

Is it still the default (=false)?
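
A quick way to check is to look for an override in your site configuration. If the property were set to true, the fetcher would already write parse output, and the separate parse step could then find the segment already parsed. The sketch below is only an illustration: the function name show_fetcher_parse is made up, it assumes the override lives in conf/nutch-site.xml, and it assumes the <value> line directly follows the <name> line, as in the snippet above.

```shell
# Hypothetical helper: print the fetcher.parse <name> line and the line
# after it from a Nutch config file. Returns non-zero if the property
# does not appear in the file (i.e. it is not overridden there).
# Usage (path is an assumption): show_fetcher_parse conf/nutch-site.xml
show_fetcher_parse() {
  # -A 1 also prints the line following the match, assumed to be <value>
  grep -A 1 '<name>fetcher.parse</name>' "$1"
}
```

If nothing is printed for conf/nutch-site.xml, the default from nutch-default.xml most likely still applies.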

In any case, more information is needed:
- Nutch version
- more logs, esp. command-line messages
  from bin/crawl ("Operating on segment : xxx", etc.)
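
It would also help to know which segments already contain parse output. A minimal sketch, assuming the segments are on the local filesystem (not HDFS) under crawl/segments, and assuming that the presence of a crawl_parse sub-directory is what marks a segment as parsed; the function name list_parsed_segments is made up for illustration:

```shell
# Hypothetical helper: list segment directories that already contain
# parse output. Assumes a local filesystem layout like
# crawl/segments/<timestamp>/crawl_parse.
# Usage (path is an assumption): list_parsed_segments crawl/segments
list_parsed_segments() {
  for seg in "$1"/*/; do
    # crawl_parse is assumed to be the marker of an already parsed segment
    if [ -d "${seg}crawl_parse" ]; then
      # strip the trailing slash left by the glob
      echo "${seg%/}"
    fi
  done
}
```

Comparing that list with the segment named in the "Operating on segment" message should show whether the failing segment was parsed in an earlier step.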

Thanks,
Sebastian

On 07/21/2014 10:21 PM, Adam Estrada wrote:
> All,
> 
> I have been crawling the web now for a few days without any issues.
> All of a sudden today I came across this error.
> 
> Exception in thread "main" java.io.IOException: Segment already parsed!
> at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:975)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
> at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
> at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:247)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:220)
> 
> I found on Google that I should be deleting my segment
> sub-directories manually for each run, but I have not had to do this
> previously. Here is the command I am running.
> 
> bin/crawl urls/seeds.txt crawl http://localhost:8983/solr 1
> 
> What could I have changed to cause this "Segment already parsed" error
> to appear? I can't seem to get rid of it!
> 
> Thanks,
> Adam
> 
