Re: modifying parse implementation

Joye Fri, 15 Jul 2011 17:44:22 -0700

Hello,

ParseImpl implements the Writable interface, so it is serialized and deserialized whenever it is transferred between the namenode and datanodes in Hadoop. If you add a property to any class that implements "Writable", you also need to add read and write code for the new property to the read and write functions of the ParseImpl class; these tell Nutch what to do when serializing and deserializing ParseImpl.
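
As a rough illustration of the point above, here is a minimal sketch that runs without Hadoop or Nutch on the classpath. SimpleWritable is a hypothetical stand-in that mirrors the two methods of Hadoop's Writable (write and readFields), and FeatureParse is a hypothetical, simplified analogue of a ParseImpl that stores plain strings. It is not the real Nutch class, only a demonstration that a new field has to appear in both methods, in the same order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in mirroring the two methods of org.apache.hadoop.io.Writable,
// so that this sketch compiles without Hadoop.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical, simplified analogue of a ParseImpl with an extra "features" field.
class FeatureParse implements SimpleWritable {
    private String text = "";
    private String features = "";
    private boolean isCanonical;

    // No-arg constructor: an empty instance is created first, then filled by readFields.
    FeatureParse() {}

    FeatureParse(String text, String features, boolean isCanonical) {
        this.text = text;
        this.features = features;
        this.isCanonical = isCanonical;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeBoolean(isCanonical);
        out.writeUTF(text);
        out.writeUTF(features); // the new field must be written here ...
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        isCanonical = in.readBoolean();
        text = in.readUTF();
        features = in.readUTF(); // ... and read back at the same position
    }

    String getText() { return text; }
    String getFeatures() { return features; }
    boolean isCanonical() { return isCanonical; }

    // Serializes to a byte array and deserializes into a fresh instance,
    // roughly what happens when an object is passed between tasks.
    static FeatureParse roundTrip(FeatureParse original) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));
        FeatureParse copy = new FeatureParse();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }
}

class WritableSketch {
    public static void main(String[] args) throws IOException {
        FeatureParse copy = FeatureParse.roundTrip(
            new FeatureParse("page text", "feature-a feature-b", true));
        System.out.println(copy.getFeatures());
    }
}
```

If the two lines that handle features were left out, the copy would come back without that field, which is consistent with getText() working while getFeatures() fails at indexing time.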

P.S. Since "features" is a string, you could put it into ParseData.parseData (it is a Map structure) without changing any of Nutch's base classes.
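
A small sketch of that alternative, with a plain java.util.Map standing in for the metadata that ParseData carries. The class name, the key "features" and both helper methods are hypothetical; in Nutch the map would be the one obtained from the ParseData object (I believe via getParseMeta(), but please check this against your version).

```java
import java.util.HashMap;
import java.util.Map;

class ParseMetaSketch {
    // Hypothetical key under which the extra string is stored.
    static final String FEATURES_KEY = "features";

    // Parser side: store the string in the metadata map instead of a new field.
    static void storeFeatures(Map<String, String> parseMeta, String features) {
        parseMeta.put(FEATURES_KEY, features);
    }

    // Indexing side: read it back, falling back to an empty string if absent.
    static String loadFeatures(Map<String, String> parseMeta) {
        return parseMeta.getOrDefault(FEATURES_KEY, "");
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new HashMap<>();
        storeFeatures(parseMeta, "feature-a feature-b");
        System.out.println(loadFeatures(parseMeta));
    }
}
```

Because the metadata is already serialized along with the parse data, nothing in ParseImpl has to change with this approach.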


Regards,
Joey


On 07/16/2011 08:21 AM, Cam Bazz wrote:

Hello,

In my quest to create a custom parser, I have modified ParseImpl to
hold another ParseText called features, like this:

   public ParseImpl(String text, String features, ParseData data) {
     this(new ParseText(text), new ParseText(features), data, true);
   }

   public ParseImpl(ParseText text, ParseText features, ParseData data, boolean isCanonical) {
     this.text = text;
     this.data = data;
     this.features = features;
     this.isCanonical = isCanonical;
   }

   public String getFeatures() {
     return this.features.getText();
   }


Although I create the ParseImpl in HtmlParser.java like this:

   ParseResult parseResult = ParseResult.createParseResult(
     content.getUrl(), new ParseImpl(text, features, parseData));

I get an error when indexing. parse.getText() returns the correct
text, but if I call parse.getFeatures() in the index-basic plugin I get:

SolrIndexer: starting at 2011-07-16 03:06:54
java.io.IOException: Job failed!


I am getting a much better understanding of how Nutch works. I don't
think my approach of butchering HtmlParser and ParseImpl is the best,
and I am sure all of this can be put inside another plugin.

Best Regards,
C.B.
