Hello,
ParseImpl implements the Writable interface, so Hadoop serializes and
deserializes it whenever it is transferred between nodes. If you add a
property to any class that implements "Writable", you also have to add
the code that writes and reads the new property to that class's write
and read methods (for ParseImpl, write and readFields). Those methods
tell Nutch what to do when serializing and deserializing ParseImpl.
Without them the new field does not survive the transfer, which is most
likely why parse.getFeatures() fails at indexing time.
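
As a rough sketch of that point (this is not the actual Nutch code):
FeatureParse below is a hypothetical, simplified analogue of ParseImpl,
and the local Writable interface is only a stand-in that mirrors the two
methods of org.apache.hadoop.io.Writable, so the example compiles
without Hadoop. Plain strings are used where ParseImpl holds ParseText
objects.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in with the same two methods as org.apache.hadoop.io.Writable.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical, simplified analogue of ParseImpl (assumption, not Nutch code).
class FeatureParse implements Writable {
    private String text = "";
    private String features = ""; // the newly added property

    FeatureParse() {} // no-arg constructor used when deserializing

    FeatureParse(String text, String features) {
        this.text = text;
        this.features = features;
    }

    String getText() { return text; }
    String getFeatures() { return features; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(text);
        out.writeUTF(features); // the new field must be written too
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        text = in.readUTF();
        features = in.readUTF(); // read back in the same order as write()
    }

    // Serializes to a byte array and reads it back, as a transfer would.
    static FeatureParse roundTrip(FeatureParse original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            FeatureParse copy = new FeatureParse();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        FeatureParse copy = roundTrip(new FeatureParse("body text", "f1,f2"));
        System.out.println(copy.getText() + " / " + copy.getFeatures());
    }
}
```

If the two lines that handle "features" are left out of write and
readFields, the copy comes back with the field unset, even though the
object was built correctly on the parsing side.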
P.S. Since "features" is a string, you could instead put it into the
parse metadata held by ParseData (ParseData.getParseMeta(), a map-like
structure), without changing any of Nutch's base classes.
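
A minimal sketch of that alternative, under some assumptions:
ParseDataSketch is a hypothetical stand-in that models only the metadata
map, and the key name "features" is made up for illustration. In Nutch
itself the equivalent calls should be set and get on the Metadata object
returned by ParseData.getParseMeta(), but please check that against your
Nutch version.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for ParseData: only the metadata map is modeled.
class ParseDataSketch {
    private final Map<String, String> parseMeta = new HashMap<>();

    Map<String, String> getParseMeta() {
        return parseMeta;
    }
}

class FeaturesInMetadata {
    // Hypothetical key name; any key not already used in the metadata works.
    static final String FEATURES_KEY = "features";

    // Parser side: store the string next to the other parse metadata.
    static void storeFeatures(ParseDataSketch data, String features) {
        data.getParseMeta().put(FEATURES_KEY, features);
    }

    // Indexing side: read it back; returns null if the parser stored nothing.
    static String loadFeatures(ParseDataSketch data) {
        return data.getParseMeta().get(FEATURES_KEY);
    }

    public static void main(String[] args) {
        ParseDataSketch data = new ParseDataSketch();
        storeFeatures(data, "f1,f2");
        System.out.println(loadFeatures(data));
    }
}
```

Because the metadata is already serialized together with ParseData, no
extra read or write code is needed for the new value.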
Regards,
Joey
On 07/16/2011 08:21 AM, Cam Bazz wrote:
Hello,
In my quest to create a custom parser, I have modified ParseImpl to
hold another ParseText called features, as follows:

    public ParseImpl(String text, String features, ParseData data) {
        this(new ParseText(text), new ParseText(features), data, true);
    }

    public ParseImpl(ParseText text, ParseText features, ParseData data,
            boolean isCanonical) {
        this.text = text;
        this.data = data;
        this.features = features;
        this.isCanonical = isCanonical;
    }

    public String getFeatures() {
        return this.features.getText();
    }

Although I create the ParseImpl in HtmlParser.java like this:

    ParseResult parseResult = ParseResult.createParseResult(
            content.getUrl(), new ParseImpl(text, features, parseData));

I get an error when indexing. parse.getText() returns the correct text,
but if I call parse.getFeatures() in the index-basic plugin I get:
SolrIndexer: starting at 2011-07-16 03:06:54
java.io.IOException: Job failed!
I am getting a much better understanding of how Nutch works. I don't
think my approach of butchering HtmlParser and ParseImpl is the best,
and I am sure all of this can be put inside another plugin.
Best Regards,
C.B.