Creating a plugin that implements HtmlParseFilter could be another option. Since it is a plugin interface, no modification to the Nutch build is needed.

regards
Sourabh

On Mon, Jan 17, 2011 at 8:59 PM, Nobin Mathew <[email protected]> wrote:

> On Sat, Jan 15, 2011 at 3:49 AM, Žygimantas Medelis
> <[email protected]> wrote:
> > Hi,
> >
> > URLFilters allow filtering links based on the content of the URL. Is it
> > possible to extend filters so as to filter links based on their anchor text?
> > URLFilter takes only the URL as its parameter.
> >
> > One way to do this is to modify the parse-html plugin. There, out-links are
> > collected and the Outlink class provides a getAnchor method. Then
> > those out-links which do not have the required anchor text are not included
> > when ParseData is created, thus preventing Nutch from crawling them.
>
> What about the write() function in ParseOutputFormat.java, where you get
> the outlink and anchor text?
> You can create something like URLFilter which also takes the anchor
> text as input.
> I don't know how to get the sentence that contains the anchor text.
>
> > Yet this does not seem like a good solution; parser plugins should not do
> > URL filtering. Is there a better way? What about extending this even
> > further and creating a filter based on the whole sentence an anchor is
> > located in?
> >
> > regards
> > zm
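
A minimal sketch of the anchor-text filtering idea discussed in this thread, written as plain Java so it runs without Nutch on the classpath. The Link class here is a hypothetical stand-in for Nutch's Outlink (it only mirrors the getToUrl()/getAnchor() accessors), and keepMatching and the example pattern are illustrative names, not part of Nutch. In an actual plugin the same loop would presumably run over the collected out-links before ParseData is created; that wiring is an assumption and is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class AnchorTextFilterSketch {

    // Hypothetical stand-in for Nutch's Outlink: a target URL plus its anchor text.
    public static final class Link {
        private final String toUrl;
        private final String anchor;

        public Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }

        public String getToUrl() { return toUrl; }
        public String getAnchor() { return anchor; }
    }

    // Keep only the links whose anchor text contains a match for the pattern.
    // Links with no anchor text are dropped (an assumption; a real filter
    // might prefer to keep them).
    public static List<Link> keepMatching(List<Link> links, Pattern anchorPattern) {
        List<Link> kept = new ArrayList<>();
        for (Link link : links) {
            String anchor = link.getAnchor();
            if (anchor != null && anchorPattern.matcher(anchor).find()) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Link> links = Arrays.asList(
                new Link("http://example.com/report.pdf", "Download report"),
                new Link("http://example.com/contact", "Contact us"),
                new Link("http://example.com/logo.png", null));

        // Case-insensitive match on the word "download" in the anchor text.
        Pattern wanted = Pattern.compile("(?i)download");

        for (Link link : keepMatching(links, wanted)) {
            System.out.println(link.getToUrl() + " [" + link.getAnchor() + "]");
        }
    }
}
```

Filtering on the surrounding sentence, as asked at the end of the thread, would need the text around the anchor in addition to the anchor itself, which this sketch does not attempt.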

