Re: modifying parse implementation

Joye Sat, 16 Jul 2011 19:52:07 -0700

Hello,

You could put the features into ParseData by calling


parseData.getParseMeta().set("features", valueOfFeatures);

When you want to use it, call parseData.getParseMeta().get("features") to get
it out, the same way you would use a Java Map.

There is no need to call the setter method. :-)
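
As a rough sketch of why a metadata entry survives serialization without any
change to the base classes: the container writes all of its entries
generically, so a new key such as "features" needs no extra code. MetaSketch
below is a hypothetical stand-in written for illustration, not the Nutch
metadata class, and its roundTrip helper is likewise an assumption made so the
example runs without Nutch or Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the object returned by getParseMeta().
// It only mimics the set/get usage and a write/readFields pair.
class MetaSketch {
    private final Map<String, String> entries = new LinkedHashMap<>();

    void set(String name, String value) {
        entries.put(name, value);
    }

    String get(String name) {
        return entries.get(name);
    }

    // Writes every entry generically, so a new key needs no new code here.
    void write(DataOutput out) throws IOException {
        out.writeInt(entries.size());
        for (Map.Entry<String, String> entry : entries.entrySet()) {
            out.writeUTF(entry.getKey());
            out.writeUTF(entry.getValue());
        }
    }

    // Reads the entries back in the same order they were written.
    void readFields(DataInput in) throws IOException {
        entries.clear();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            String name = in.readUTF();
            String value = in.readUTF();
            entries.put(name, value);
        }
    }

    // Serializes to bytes and rebuilds a fresh copy from those bytes.
    static MetaSketch roundTrip(MetaSketch original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            MetaSketch copy = new MetaSketch();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MetaSketch parseMeta = new MetaSketch();
        parseMeta.set("features", "f1 f2");
        // The value is still there after a serialize/deserialize cycle.
        System.out.println(roundTrip(parseMeta).get("features")); // prints: f1 f2
    }
}
```

In the real plugin the two calls shown in this message would take the place of
set and get on the stand-in.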

Regards,
Joey

On 07/17/2011 04:55 AM, Cam Bazz wrote:

Hello,

I did not understand ParseData.parseData -

In ParseData there are getContentMeta and getParseMeta

There is also a getMeta(String string) - it appears that there is no
setter for this.

There is also setParseMeta, but it appears content meta is not settable.

Best Regards,
C.B.




On Sat, Jul 16, 2011 at 3:43 AM, Joye<[email protected]>  wrote:

Hello,

ParseImpl implements the Writable interface, so it is serialized and
deserialized whenever it is transferred between nodes in Hadoop. If you add a
property to any class that implements "Writable", you should also add code
for the new property to that class's read and write functions (here, those of
ParseImpl), which tell Nutch how to serialize and deserialize the class.
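
A minimal sketch of that contract, assuming the usual write(DataOutput) and
readFields(DataInput) pair. ParseRecord is a hypothetical stand-in made up for
illustration, not the real ParseImpl, and it uses plain java.io so that it runs
without Hadoop or Nutch:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Writable class such as ParseImpl.
class ParseRecord {
    String text = "";
    String features = "";

    // Every field that must survive serialization is written here.
    void write(DataOutput out) throws IOException {
        out.writeUTF(text);
        out.writeUTF(features); // without this line, features would be lost
    }

    // Fields are read back in exactly the order they were written.
    void readFields(DataInput in) throws IOException {
        text = in.readUTF();
        features = in.readUTF();
    }

    // Serializes to bytes and rebuilds a fresh object from those bytes.
    static ParseRecord roundTrip(ParseRecord original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            ParseRecord copy = new ParseRecord();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ParseRecord record = new ParseRecord();
        record.text = "page body";
        record.features = "f1 f2";
        // Both fields come back because both are handled in write/readFields.
        System.out.println(roundTrip(record).features); // prints: f1 f2
    }
}
```

A field that is set in the constructor but never handled in these two methods
exists only in the original object, not in the copy rebuilt on the other side.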

P.S. Since "features" is a string, you could put it into
ParseData.parseData (it is a Map structure) without any changes to the base
classes of Nutch.

Regards,
Joey


On 07/16/2011 08:21 AM, Cam Bazz wrote:

Hello,

In my quest to create a custom parser, I have modified ParseImpl to
hold another ParseText called features, like this:

   public ParseImpl(String text, String features, ParseData data) {
     this(new ParseText(text), new ParseText(features), data, true);
   }

   public ParseImpl(ParseText text, ParseText features, ParseData data,
       boolean isCanonical) {
     this.text = text;
     this.data = data;
     this.features = features;
     this.isCanonical = isCanonical;
   }

   public String getFeatures() {
         return this.features.getText();
   }


and although I create the ParseImpl like

ParseResult parseResult = ParseResult.createParseResult(content.getUrl(),
    new ParseImpl(text, features, parseData));

in the HtmlParser.java

I get an error when indexing if I do parse.getFeatures() -
parse.getText() will return the correct text, but if I call
parse.getFeatures() in index-basic plugin I get:

SolrIndexer: starting at 2011-07-16 03:06:54
java.io.IOException: Job failed!


I am getting a much better understanding of how Nutch works. I don't
think my approach of butchering HtmlParser and ParseImpl is the best,
and I am sure all of this can be put inside another plugin.

Best Regards,
C.B.
