Hello,

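ParseImpl implements Hadoop's Writable interface, so it is serialized and deserialized whenever Hadoop writes it to disk or moves it between nodes. If you add a field to any class that implements Writable, you also have to add the matching write and read code for that field to the class's write() and readFields() methods, which tell Hadoop (and so Nutch) how to serialize and deserialize the object. In your case that means the write() and readFields() methods of ParseImpl. Without that, "features" is never written out, so it is most likely missing by the time the indexing job reads the parse back, which is why parse.getText() works but parse.getFeatures() fails.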
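
To illustrate the point, here is a minimal sketch of the pattern. It is not the real ParseImpl code: the Writable interface below is a local stand-in with the same two methods as org.apache.hadoop.io.Writable, and ParseSketch is a hypothetical, simplified class that uses plain Strings instead of ParseText and ParseData, so it compiles without Hadoop or Nutch on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class WritableSketch {

    /** Local stand-in for org.apache.hadoop.io.Writable (same two methods). */
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Hypothetical, simplified stand-in for a ParseImpl with an extra field. */
    static class ParseSketch implements Writable {
        boolean isCanonical;
        String text = "";
        String features = "";

        /** No-arg constructor: the instance is filled in later by readFields(). */
        ParseSketch() {}

        ParseSketch(String text, String features, boolean isCanonical) {
            this.text = text;
            this.features = features;
            this.isCanonical = isCanonical;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeBoolean(isCanonical);
            out.writeUTF(text);
            out.writeUTF(features); // the new field has to be written ...
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            isCanonical = in.readBoolean();
            text = in.readUTF();
            features = in.readUTF(); // ... and read back in the same order
        }
    }

    /** Serializes to a byte array and deserializes into a fresh instance. */
    static ParseSketch roundTrip(ParseSketch original) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        ParseSketch copy = new ParseSketch();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws IOException {
        ParseSketch copy = roundTrip(new ParseSketch("page text", "feature list", true));
        System.out.println(copy.text + " | " + copy.features + " | " + copy.isCanonical);
    }
}
```

If the two lines that handle "features" are left out, the round trip still succeeds, but the copy comes back with an empty "features" field. In the real ParseImpl the new ParseText would be written and read through its own write() and readFields() methods rather than writeUTF/readUTF.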

P.S. Since "features" is just a string, you could instead put it into the parse metadata held by ParseData (parseMeta, a map-like structure), without changing any of Nutch's base classes.

Regards,
Joey


On 07/16/2011 08:21 AM, Cam Bazz wrote:
Hello,

In my quest to create a custom parser, I have modified ParseImpl to
hold another ParseText called features, like this:

   public ParseImpl(String text, String features, ParseData data) {
     this(new ParseText(text), new ParseText(features), data, true);
   }

   public ParseImpl(ParseText text, ParseText features, ParseData data,
       boolean isCanonical) {
     this.text = text;
     this.data = data;
     this.features = features;
     this.isCanonical = isCanonical;
   }

   public String getFeatures() {
     return this.features.getText();
   }


and although I create the parseImpl like

ParseResult parseResult = ParseResult.createParseResult(
    content.getUrl(), new ParseImpl(text, features, parseData));

in the HtmlParser.java

I get an error when indexing if I do parse.getFeatures() -
parse.getText() will return the correct text, but if I call
parse.getFeatures() in index-basic plugin I get:

SolrIndexer: starting at 2011-07-16 03:06:54
java.io.IOException: Job failed!


I am getting a much better understanding of how Nutch works. I don't
think my approach of butchering HtmlParser and ParseImpl is the best,
and I am sure all of this could be put inside another plugin.

Best Regards,
C.B.
