Hi Martin,

I am struggling to understand how the DocumentFragment (populated by the private method parseTagSoup or parseNeko, depending on your configuration in nutch-site.xml) can be null. You don't mention what actual problem this is causing you: is there one? I can't debug the code tonight, but I am interested to see what is going on here.

Lewis

On Thursday, May 23, 2013, Martin Aesch <[email protected]> wrote:
> Dear nutchers,
>
> I extended the ParseFilter extension point
>
> public Parse filter(String url, WebPage page, Parse parse,
> HTMLMetaTags metaTags, DocumentFragment doc) {
>
> From what I understand, plugin parse-html should populate the
> DocumentFragment doc.
>
> Unfortunately, doc is always null. I tried this with my own plugin, as
> well as with the nutch-shipped plugin microformats-reltag, which extends
> the same point.
>
> Both plugins are working, and they are called. I attached my debugger,
> and both for my own plugin as well as for the reltag-plugin, doc is
> always null.
>
> I checked parse-plugins.xml, yes, parse-html is called and my mime-types
> are those which call parse-html
> (extension-id="org.apache.nutch.parse.html.HtmlParser").
>
> What am I missing?
>
> Thanks,
> Martin
>

--
*Lewis*

