Creating a plugin that implements HtmlParseFilter could be another option.
Since it is a plugin extension point, no modification to the Nutch build is needed.
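
A rough sketch of the filtering step such a plugin might perform is below. It is only an illustration and is not tested against Nutch: Link is a hypothetical stand-in for Nutch's Outlink, and AnchorTextFilter is a made-up interface (not part of Nutch) that follows what I understand to be the URLFilter convention of returning the URL to keep a link and null to drop it. Wiring this into an actual HtmlParseFilter and rebuilding ParseData is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for Nutch's Outlink (target URL plus anchor text),
// so that the sketch compiles without the Nutch jars.
final class Link {
    final String toUrl;
    final String anchor;

    Link(String toUrl, String anchor) {
        this.toUrl = toUrl;
        this.anchor = anchor;
    }
}

// Hypothetical interface, NOT part of Nutch. Like URLFilter it returns the
// URL to keep the link or null to drop it, but it also sees the anchor text.
interface AnchorTextFilter {
    String filter(String url, String anchor);
}

final class AnchorFilterSketch {

    // Builds a filter that keeps links whose anchor text contains the
    // keyword, ignoring case. Links without anchor text are dropped.
    static AnchorTextFilter containing(String keyword) {
        final String needle = keyword.toLowerCase(Locale.ROOT);
        return (url, anchor) ->
            anchor != null && anchor.toLowerCase(Locale.ROOT).contains(needle)
                ? url
                : null;
    }

    // Returns only the links the filter accepts, preserving their order.
    static List<Link> apply(List<Link> links, AnchorTextFilter filter) {
        List<Link> kept = new ArrayList<>();
        for (Link link : links) {
            if (filter.filter(link.toUrl, link.anchor) != null) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Link> links = Arrays.asList(
            new Link("http://a.example/", "Download PDF"),
            new Link("http://b.example/", "Home"),
            new Link("http://c.example/", null));
        // Only the first link has "download" in its anchor text.
        for (Link link : apply(links, containing("download"))) {
            System.out.println(link.toUrl);
        }
    }
}
```

In a real plugin the same loop would presumably run over the outlinks taken from the parse result, with getAnchor supplying the anchor text.
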
regards
Sourabh

On Mon, Jan 17, 2011 at 8:59 PM, Nobin Mathew <[email protected]> wrote:

> On Sat, Jan 15, 2011 at 3:49 AM, Žygimantas Medelis
> <[email protected]> wrote:
> > Hi,
> >
> > URLFilters allow filtering links based on the content of the URL. Is it
> > possible to extend filters so as to filter links based on their anchor
> > text? URLFilter takes only the URL as its parameter.
> >
> > One way to do this is to modify the parse-html plugin. There, out-links are
> > collected, and the Outlink class provides a getAnchor method. Out-links
> > that do not have the required anchor text are then left out when
> > ParseData is created, which prevents Nutch from crawling them.
>
> What about the write() function in ParseOutputFormat.java, where you get
> the outlink and its anchor text?
> You could create something like URLFilter that also takes the anchor
> text as input.
> I don't know how to get the sentence that contains the anchor text.
>
> >
> > Yet this does not seem like a good solution; parser plugins should not do
> > URL filtering. Is there a better way? What about extending this even
> > further and creating a filter based on the whole sentence an anchor is
> > located in?
> >
> > regards
> > zm
> >
>
