Hello,
In my quest to create a custom parser, I have modified ParseImpl to
hold a second ParseText called features, as follows:

    public ParseImpl(String text, String features, ParseData data) {
        this(new ParseText(text), new ParseText(features), data, true);
    }

    public ParseImpl(ParseText text, ParseText features, ParseData data,
                     boolean isCanonical) {
        this.text = text;
        this.data = data;
        this.features = features;
        this.isCanonical = isCanonical;
    }

    public String getFeatures() {
        return this.features.getText();
    }

In HtmlParser.java I create the ParseImpl like this:

    ParseResult parseResult =
        ParseResult.createParseResult(content.getUrl(),
            new ParseImpl(text, features, parseData));

However, indexing fails when I call parse.getFeatures() in the
index-basic plugin. parse.getText() returns the correct text, but
with the parse.getFeatures() call in place I get:
SolrIndexer: starting at 2011-07-16 03:06:54
java.io.IOException: Job failed!
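
One possible cause, which is an assumption and not something the
output above confirms: "Job failed!" is only the generic job-level
message, and Nutch passes parse output between jobs through Hadoop
Writable serialization. If the new field is never written out and
read back, it would be null on the indexing side and getFeatures()
would throw. The sketch below illustrates that effect with plain
java.io only; the Record class, its includeFeatures flag, and
roundTrip() are hypothetical stand-ins, not Nutch code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class FeaturesRoundTrip {

    // Hypothetical stand-in for a Writable-style record (NOT the real ParseImpl).
    static class Record {
        String text;
        String features;

        Record() {}

        Record(String text, String features) {
            this.text = text;
            this.features = features;
        }

        // Serializes the record; the flag simulates forgetting the new field.
        void write(DataOutput out, boolean includeFeatures) throws IOException {
            out.writeUTF(text);
            if (includeFeatures) {
                out.writeUTF(features);
            }
        }

        // Must mirror write(): a field that was not written cannot be restored.
        void readFields(DataInput in, boolean includeFeatures) throws IOException {
            text = in.readUTF();
            features = includeFeatures ? in.readUTF() : null;
        }
    }

    // Writes the record to a byte array and reads it back into a fresh instance.
    static Record roundTrip(Record original, boolean includeFeatures) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes), includeFeatures);
            Record copy = new Record();
            copy.readFields(
                new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())),
                includeFeatures);
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Record original = new Record("page text", "feature-a feature-b");
        Record lossy = roundTrip(original, false);
        Record full = roundTrip(original, true);
        System.out.println(lossy.text + " / " + lossy.features);
        System.out.println(full.text + " / " + full.features);
    }
}
```

Running main prints "page text / null" for the round trip that skips
the field and "page text / feature-a feature-b" for the one that
includes it.
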
I am getting a much better understanding of how Nutch works. I don't
think my approach of butchering HtmlParser and ParseImpl is the best
one, and I am sure all of this could be moved into a separate plugin.
Best Regards,
C.B.