They are called ParseFilters in 2.x (http://nutch.apache.org/apidocs-2.2/org/apache/nutch/parse/ParseFilter.html), as they are not limited to processing HTML documents: Tika generates SAX events for other MIME types too.
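
As a rough illustration of the kind of custom rule such a filter might apply to the DocumentFragment mentioned in the question, here is a minimal, self-contained sketch. It deliberately does not implement the Nutch ParseFilter interface (check the Javadoc linked above for the actual 2.2 method signature). The class name CustomRuleSketch, the helper extractHeading and the metadata key "my.heading" are all hypothetical, and a plain Map stands in for wherever Nutch keeps the page metadata.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical stand-in for the body of a parse filter; NOT the Nutch API.
class CustomRuleSketch {

    // Hypothetical metadata key, chosen only for this example.
    static final String KEY = "my.heading";

    // Depth-first walk that stores the trimmed text of the first <h1> found.
    static void extractHeading(Node node, Map<String, String> metadata) {
        if (metadata.containsKey(KEY)) {
            return; // already found a heading, keep the first one
        }
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "h1".equalsIgnoreCase(node.getNodeName())) {
            metadata.put(KEY, node.getTextContent().trim());
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            extractHeading(child, metadata);
        }
    }

    // Parses a well-formed XHTML string, wraps it in a DocumentFragment
    // (the DOM type named in the question) and applies the rule to it.
    static Map<String, String> run(String xhtml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            DocumentFragment fragment = document.createDocumentFragment();
            fragment.appendChild(document.getDocumentElement());
            Map<String, String> metadata = new HashMap<>();
            extractHeading(fragment, metadata);
            return metadata;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run("<html><body><h1> Hello </h1><p>text</p></body></html>"));
    }
}
```

In a real plugin the DOM walk would presumably stay much the same, while the extracted value would go into the page's metadata so that an indexing filter can pick it up later; how exactly that is done in 2.2 is not shown here.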
J.

On 12 June 2013 13:37, Tony Mullins <[email protected]> wrote:

> Hi,
>
> If I go to http://wiki.apache.org/nutch/AboutPlugins, it shows me that
> HTMLParseFilter is the extension point for adding custom metadata to HTML, and
> that its 'filter' method's signature is 'public ParseResult filter(Content
> content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment
> doc)', but that is in the 1.4 API doc.
>
> I am on Nutch 2.2 and there is no class named HTMLParseFilter in the v2.2
> API doc:
> http://nutch.apache.org/apidocs-2.2/allclasses-noframe.html.
>
> So please tell me which class to use in the v2.2 API for adding my custom rule
> to extract some data from an HTML page (is it ParseFilter?) and add it to the
> HTML metadata, so that later I can add it to my Solr index using an indexfilter
> plugin.
>
>
> Thanks,
> Tony.
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

