This also answers your other question about memory exceptions while fetching: if you are parsing at the same time, then you'll need more memory.
On 22 July 2014 14:40, Adam Estrada <[email protected]> wrote:

> Sebastian,
>
> Thanks so much for the quick response. You were right. I read
> somewhere that changing that property to true would help to speed up a
> crawl. That was wrong... I changed it back and everything went back to
> normal.
>
> <property>
>   <name>fetcher.parse</name>
>   <value>false</value>
>   <description>If true, fetcher will parse content. Default is false,
>   which means that a separate parsing step is required after fetching
>   is finished.</description>
> </property>
>
> Maybe you could shed some light on why this property exists so that
> other folks reading this thread can benefit?
>
> Thanks again!
> Adam
>
> On Mon, Jul 21, 2014 at 4:21 PM, Adam Estrada <[email protected]> wrote:
> >
> > All,
> >
> > I have been crawling the web now for a few days without any issues.
> > All of a sudden today I came across this error.
> >
> > Exception in thread "main" java.io.IOException: Segment already parsed!
> >     at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89)
> >     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:975)
> >     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:415)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
> >     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
> >     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
> >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
> >     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
> >     at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:247)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:220)
> >
> > I found on the GOOG that I should be deleting my segment
> > sub-directories manually for each run, but I have not had to do this
> > previously. Here is the command I am running:
> >
> > bin/crawl urls/seeds.txt crawl http://localhost:8983/solr 1
> >
> > What could I have changed to cause this "Segment already parsed"
> > error to appear? I can't seem to get rid of it!
> >
> > Thanks,
> > Adam

-- 
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
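For anyone else hitting this error: the stack trace above shows ParseOutputFormat.checkOutputSpecs refusing to run the parse step over a segment that already contains parse output. A minimal shell sketch of the kind of pre-check you could script before re-running the parse step (the demo/ paths here are hypothetical, created only for illustration; the underlying assumption, consistent with the check in the stack trace, is that a parsed segment directory contains a crawl_parse subdirectory):

```shell
#!/bin/sh
# Hypothetical segment layout for illustration; in a real crawl the
# segments live under your crawl dir, e.g. crawl/segments/<timestamp>.
mkdir -p demo/segments/20140721000000/crawl_parse  # already parsed
mkdir -p demo/segments/20140722000000              # fetched, not yet parsed

# Only hand segments without parse output to the parse step, so
# ParseSegment never sees an already-parsed segment.
for seg in demo/segments/*; do
  if [ -d "$seg/crawl_parse" ]; then
    echo "skipping (already parsed): $seg"
  else
    echo "needs parsing: $seg"
    # bin/nutch parse "$seg"   # uncomment in a real crawl directory
  fi
done
```

This is just a sketch of the idea, not a replacement for the bin/crawl script, which normally manages the fetch/parse/update cycle per segment itself.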

