nutch custom parser plugin

Cam Bazz Fri, 15 Jul 2011 16:37:29 -0700

Hello,

I am experimenting with the plugin system, turning plugins on and off
to see their effects.


I understand that the index-basic plugin takes the parsed data from the
parse step and creates the necessary fields, and that this is where we
can add a field to the document that will show up in Solr (the new
field, of course, has to be defined in schema.xml beforehand). However,
from the index-basic plugin I cannot reach the raw content. I only have
access to the Parse object, which holds the parsed version created by
HtmlParser.

Looking at HtmlParser.java, I found:

    ParseData parseData = new ParseData(status, title, outlinks,
        content.getMetadata(), metadata);
    ParseResult parseResult = ParseResult.createParseResult(
        content.getUrl(), new ParseImpl(text, parseData));

    // run filters on parse
    ParseResult filteredParse = this.htmlParseFilters.filter(content,
        parseResult, metaTags, root);


I believe this is where the ParseResult is created and returned.

All I want to do is get access to the raw data through content, run my
own parsing scheme, and create another field named contentfocus to hold
the result. But I believe I am looking at the wrong point in the code.
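
To make the goal concrete, here is a rough sketch of the data flow I am
after. It uses stand-in types (RawContent, ParsedPage, and a plain Map as
the document) rather than the real Nutch classes, and the <h1> extraction
is only a placeholder for my own parsing scheme. It assumes there is a
hook that sees both the raw content and the parse, as the
htmlParseFilters.filter call above seems to, and that the extra value can
travel in the parse metadata until an indexing step copies it into a
field:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Sketch only: every type here is a hypothetical stand-in, NOT a Nutch class.
final class ContentFocusSketch {

    static final String FOCUS_KEY = "contentfocus";

    /** Stand-in for the fetched page (URL plus raw bytes). */
    static final class RawContent {
        final String url;
        final byte[] bytes;

        RawContent(String url, byte[] bytes) {
            this.url = url;
            this.bytes = bytes;
        }
    }

    /** Stand-in for the parse: extracted text plus free-form metadata. */
    static final class ParsedPage {
        final String text;
        final Map<String, String> metadata = new HashMap<>();

        ParsedPage(String text) {
            this.text = text;
        }
    }

    /** Placeholder for the custom parsing scheme: text of the first <h1>. */
    static String extractFocus(String html) {
        int start = html.indexOf("<h1>");
        if (start < 0) {
            return "";
        }
        start += "<h1>".length();
        int end = html.indexOf("</h1>", start);
        return end < 0 ? "" : html.substring(start, end).trim();
    }

    /** Parse-time step: sees the raw content, stores the result in metadata. */
    static void parseStage(RawContent content, ParsedPage parse) {
        // Assumes the page is UTF-8; a real plugin would detect the encoding.
        String html = new String(content.bytes, StandardCharsets.UTF_8);
        parse.metadata.put(FOCUS_KEY, extractFocus(html));
    }

    /** Index-time step: sees only the parse, copies values into fields. */
    static Map<String, String> indexStage(ParsedPage parse) {
        Map<String, String> doc = new HashMap<>();
        doc.put("content", parse.text);
        String focus = parse.metadata.get(FOCUS_KEY);
        if (focus != null && !focus.isEmpty()) {
            doc.put(FOCUS_KEY, focus);
        }
        return doc;
    }

    public static void main(String[] args) {
        byte[] raw = "<html><body><h1>Focus text</h1><p>Other</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        RawContent content = new RawContent("http://example.com/", raw);
        ParsedPage parse = new ParsedPage("Focus text Other");

        parseStage(content, parse);
        Map<String, String> doc = indexStage(parse);
        System.out.println(doc.get(FOCUS_KEY));
    }
}
```

The point of the split is that only the parse-time step needs the raw
content, while the index-time step keeps working from the parse alone,
which matches what I can see from index-basic.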

How can I get access to both the Nutch document and the content data?
Or, if this is not possible, how do I create another ParseResult to be
added as contentfocus, next to the normal content?

Best Regards,
C.B.
