Which method in Crawl.java triggers the invocation of the plugins?

On Nov 3, 2011, at 5:30 AM, Markus Jelsma <[email protected]> wrote:

> Remove the *parse* directories in the segment (typically crawl_parse,
> parse_data and parse_text) and you're good to go.
> 
> On Thursday 03 November 2011 13:16:40 Ashish Mehrotra wrote:
>> Hi All,
>> 
>> I am trying to parse already crawled segments using the method --
>> ParseSegment.parse(seg);
>> 
>> 
>> seg is the Path to the existing segment.
>> This internally fires a new job and the error thrown is --
>> 
>> Exception in thread "main" java.io.IOException: Segment already parsed!
>>     at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:80)
>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
>> 
>> What I am trying to do here is parse the already fetched data to test my
>> HTML Parse Filter. Looks like the above method of ParseSegment gets called
>> in the normal workflow of crawl, fetch, parse ...
>> 
>> I have modified org.apache.nutch.crawl.Crawl.run() to call only
>> ParseSegment, and commented out the injector, generator and fetcher
>> parts. I call ParseSegment.parse(segment) in the run() method and pass
>> the segment name on the command line.
>> 
>> Should I be calling some other method to test my HTML parser filter plugin
>> without crawling again?
>> 
>> Any pointers should be helpful.
>> 
>> Thanks,
>> Ashish
> 
> -- 
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
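
For anyone who wants to script Markus's suggestion rather than delete the directories by hand, here is a minimal sketch. It is an assumption-laden illustration, not Nutch code: the class name SegmentParseCleaner and the method removeParseOutput are made up for this example, it only handles a segment on the local filesystem using java.nio (a segment on HDFS would need the Hadoop FileSystem API instead), and it assumes the usual Nutch 1.x segment layout in which the parse output lives in directories whose names contain "parse".

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Nutch: clears earlier parse output from a
// segment stored on the LOCAL filesystem so that the segment can be parsed again.
public class SegmentParseCleaner {

    // Deletes every direct child of the segment whose name matches *parse*
    // and returns the names that were removed, sorted alphabetically.
    public static List<String> removeParseOutput(Path segment) throws IOException {
        // Collect the matches first so nothing is deleted while the
        // directory listing is still being iterated.
        List<Path> matches = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(segment, "*parse*")) {
            for (Path child : children) {
                matches.add(child);
            }
        }

        List<String> removed = new ArrayList<>();
        for (Path match : matches) {
            deleteRecursively(match);
            removed.add(match.getFileName().toString());
        }
        Collections.sort(removed);
        return removed;
    }

    // Deletes a file or a whole directory tree.
    private static void deleteRecursively(Path root) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: SegmentParseCleaner <segment directory>");
            return;
        }
        System.out.println("Removed: " + removeParseOutput(Paths.get(args[0])));
    }
}
```

Run it with the segment directory as the only argument, then call ParseSegment.parse(seg) again. The fetched data (content, crawl_fetch and so on) is left untouched, so no re-crawl should be needed to retest the HTML parse filter.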
