Hi, URLFilters allow filtering links based on the content of the URL. Is it possible to extend the filters so that links can also be filtered by their anchor text? URLFilter takes only the URL as its parameter.
One way to do this is to modify the parse-html plugin. The out-links are collected there, and the Outlink class provides a getAnchor method. Out-links that do not have the required anchor text could then be left out when ParseData is created, which prevents Nutch from crawling them. Still, this does not seem like a good solution: parser plugins should not do URL filtering. Is there a better way? What about extending this even further and creating a filter based on the whole sentence in which an anchor is located?
regards,
zm
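
P.S. To make the parse-html idea concrete, here is a rough sketch of the filtering step I have in mind. It is not actual Nutch code: Link is a hypothetical stand-in that only mimics the getToUrl/getAnchor getters of Outlink, and the class name, the filterByAnchor helper, the example URLs and the "report" pattern are all made up for illustration. In the plugin, I assume the same loop would run over the collected out-links before ParseData is created.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class AnchorFilterSketch {

    // Hypothetical stand-in for Nutch's Outlink, reduced to the two getters used here.
    static final class Link {
        private final String toUrl;
        private final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }

        String getToUrl() { return toUrl; }
        String getAnchor() { return anchor; }
    }

    // Keep only the links whose anchor text contains a match for the pattern.
    // Links without anchor text are dropped.
    static List<Link> filterByAnchor(List<Link> links, Pattern required) {
        List<Link> kept = new ArrayList<>();
        for (Link link : links) {
            String anchor = link.getAnchor();
            if (anchor != null && required.matcher(anchor).find()) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Link> links = new ArrayList<>();
        links.add(new Link("http://example.com/a", "Download the report"));
        links.add(new Link("http://example.com/b", "Contact us"));

        // Assumed requirement: the anchor text must mention "report".
        Pattern required = Pattern.compile("report", Pattern.CASE_INSENSITIVE);

        for (Link link : filterByAnchor(links, required)) {
            System.out.println(link.getToUrl());
        }
    }
}
```

With the two example links, only the first one survives the filter. A sentence-based filter would presumably need the surrounding text as a third field, which a link object with only a URL and an anchor does not carry.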

