Hi,

Outlinks are added to the ParseData object before it is passed to an
HtmlParseFilter. In an HtmlParseFilter plugin you can obtain the outlinks and
remove those you don't want.

   Outlink[] outlinks =
       parseResult.get(content.getUrl()).getData().getOutlinks();

Use the setOutlinks() method to write your processed list to the ParseData.
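
As a rough illustration of the filtering step only, here is a minimal sketch in plain Java. It does not use the Nutch classes: plain strings stand in for Outlink objects (in a real plugin you would compare each outlink's getToUrl() value), and the class name OutlinkFilterSketch, the method keepAllowed and the example URLs are all made up for this example. It also assumes you have already collected the absolute URLs of the links inside your div from the DOM that the filter receives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in: strings play the role of Nutch Outlink objects.
class OutlinkFilterSketch {

    // Keeps only the outlinks whose target URL is in the allowed set,
    // preserving the original order of the outlinks.
    static String[] keepAllowed(String[] outlinks, Set<String> allowedUrls) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (allowedUrls.contains(url)) {
                kept.add(url);
            }
        }
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // All outlinks found on the page (assumed example data).
        String[] outlinks = {
            "http://example.com/a",
            "http://example.com/nav",
            "http://example.com/b"
        };
        // URLs of the links inside the wanted div (assumed example data).
        Set<String> inDiv = new HashSet<>(Arrays.asList(
            "http://example.com/a", "http://example.com/b"));

        String[] kept = keepAllowed(outlinks, inDiv);
        System.out.println(Arrays.toString(kept));
    }
}
```

In the plugin itself, the array returned by a method like this is what you would hand to setOutlinks().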

Cheers,
 
 
-----Original message-----
> From:刘?? <[email protected]>
> Sent: Fri 03-Aug-2012 15:45
> To: [email protected]
> Subject: Can I only add url in a specified div to the fetch list with nutch?
> 
> As the title says, I want to crawl a page with many urls, but only the ones
> in a specified div are meaningful to me. So I want to write a plugin to
> filter them, but I don't know which extension point I should choose.
> 
> The htmlparser filter can get the html content, but it seems to run
> after the "add to fetch list" operation. And the urlfilter can control the
> fetch list, but I can't get the html content in it.
> 
> Look forward to any helpful replies, thx.
> 
