This also answers your other question about the memory exceptions you saw while
fetching: if you parse at the same time as you fetch, you'll need more memory.
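For readers following along, the separate parsing step that `fetcher.parse=false` implies looks roughly like the sketch below. The segment path and `-topN` value are illustrative, not taken from Adam's setup; note that parsing a segment twice is what produces the "Segment already parsed!" error.

```shell
# With fetcher.parse=false, parsing is its own step after fetching.
# Segment directories are named by timestamp; pick the newest one.
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=crawl/segments/$(ls crawl/segments | sort | tail -1)
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"   # run once per segment; a second run fails with "Segment already parsed!"
bin/nutch updatedb crawl/crawldb "$SEGMENT"
```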




On 22 July 2014 14:40, Adam Estrada <[email protected]> wrote:

> Sebastian,
>
> Thanks so much for the quick response. You were right. I read
> somewhere that changing that property to true would help to speed up a
> crawl. That was wrong...I changed it back and everything went back to
> normal.
>
> <property>
>   <name>fetcher.parse</name>
>   <value>false</value>
>   <description>If true, fetcher will parse content. Default is false,
> which means
>   that a separate parsing step is required after fetching is
> finished.</description>
> </property>
>
> Maybe you could shed some light on why this property exists so that
> other folks reading this thread can benefit?
>
> Thanks again!
> Adam
>
> On Mon, Jul 21, 2014 at 4:21 PM, Adam Estrada <[email protected]>
> wrote:
> > All,
> >
> > I have been crawling the web now for a few days without any issues.
> > All of a sudden today I came across this error.
> >
> > Exception in thread "main" java.io.IOException: Segment already parsed!
> > at
> org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89)
> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:975)
> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:415)
> > at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
> > at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
> > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
> > at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
> > at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:247)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:220)
> >
> > I found on the GOOG that I should be deleting my segment
> > sub-directories manually for each run, but I have not had to do this
> > previously. Here is the command I am running.
> >
> > bin/crawl urls/seeds.txt crawl http://localhost:8983/solr 1
> >
> > What could I have changed to cause this "Segment already parsed" error
> > to appear? I can't seem to get rid of it!
> >
> > Thanks,
> > Adam
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
