On Sat, Jan 15, 2011 at 3:49 AM, Žygimantas Medelis <[email protected]> wrote:
> Hi,
>
> URLFilters allow to filter links based on content of the URL. Is it possible
> to extend filters so as to filter links based on their anchor text?
> URLFilter takes only url as its parameter.
>
> One way to do this is to modify parse-html plugin. There out-links are
> collected and Outlink class provides getAnchor method. Then
> those out-links which do not have required anchor text are not included when
> ParseData is created thus preventing Nutch from crawling them.

What about the write() function in ParseOutputFormat.java? There you get both the outlink and its anchor text. You could create something like URLFilter that also takes the anchor text as input. I don't know how to get the sentence that contains the anchor text.

> Yet this does not seem like a good solution, parser plugins should not do
> URL filtering. Is there a better way? What about extending this even further
> and creating a filter based on the whole sentence an anchor is located in?
>
> regards
> zm
>
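
To make the "something like URLFilter that also takes the anchor text" idea a bit more concrete, here is a rough sketch in plain Java with no Nutch dependencies. Everything in it is an assumption for illustration: Link is only a stand-in for Outlink, and AnchorTextFilter and KeywordAnchorFilter are hypothetical names, not existing Nutch classes or extension points. It borrows the URLFilter convention of returning the URL to keep a link and null to drop it. Where exactly this would be called from inside write() is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class AnchorFilterSketch {

    /** Hypothetical stand-in for an outlink: a target URL plus its anchor text. */
    static final class Link {
        final String url;
        final String anchor;

        Link(String url, String anchor) {
            this.url = url;
            this.anchor = anchor;
        }
    }

    /** Hypothetical filter interface: return the URL to keep the link, null to drop it. */
    interface AnchorTextFilter {
        String filter(String url, String anchor);
    }

    /** Keeps a link only if its anchor text contains the keyword, ignoring case. */
    static final class KeywordAnchorFilter implements AnchorTextFilter {
        private final String keyword;

        KeywordAnchorFilter(String keyword) {
            this.keyword = keyword.toLowerCase(Locale.ROOT);
        }

        @Override
        public String filter(String url, String anchor) {
            if (anchor == null) {
                return null; // no anchor text, nothing to match against
            }
            return anchor.toLowerCase(Locale.ROOT).contains(keyword) ? url : null;
        }
    }

    /** Returns the links that the filter accepts, in their original order. */
    static List<Link> apply(List<Link> links, AnchorTextFilter filter) {
        List<Link> kept = new ArrayList<>();
        for (Link link : links) {
            if (filter.filter(link.url, link.anchor) != null) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Link> links = Arrays.asList(
                new Link("http://example.com/a", "Download the report"),
                new Link("http://example.com/b", "Contact us"));
        // Print the URLs whose anchor text mentions "download".
        for (Link link : apply(links, new KeywordAnchorFilter("download"))) {
            System.out.println(link.url);
        }
    }
}
```

A sentence-based filter could follow the same shape with a third argument for the surrounding text, but that text would have to be captured during parsing, which this sketch does not attempt.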

