Forget this.
I am tripping, and the low counters were directly related to NUTCH-1591.
Sorry,
Lewis


On Wed, Jun 19, 2013 at 5:04 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi,
> We define the structure of ParseStatus [0] in our WebPage JSON schema [1].
> All good so far.
> What is not good (or not clear to me at least), is how we currently use
> methods within this class to define Hadoop counters for the parsing tasks.
> I parse a large number of URLs, but one of my jobs reports only the
> following counters and values:
>
> failed 11
> success 498
> notparsed 252
>
> I now digress slightly for some more technical stuff/observations. These
> are merely observations of me stepping through the Nutch code in an attempt
> to find out why the numbers are so (embarrassingly/surprisingly) low.
>
> I began where we actually increment the counter. This is at line #134 of
> ParserJob [2], where we do
>
> 133 if (pstatus != null) {
> 134   context.getCounter("ParserStatus",
> 135       ParseStatusCodes.majorCodes[pstatus.getMajorCode()]).increment(1);
> 136 }
>
> So I then wondered when ParseStatus.setMajorCode(int value) is actually
> called to assign one of "failed", "success" or "notparsed".
> It turns out that .setMajorCode(int value) is called in no fewer than two
> places: line #217 of HtmlParser [3]
>
> 216 ParseStatus status = new ParseStatus();
> 217 status.setMajorCode(ParseStatusCodes.SUCCESS);
> 218 if (metaTags.getRefresh()) {
>
> and numerous lines within ParseStatusUtils [4].
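
The counter bookkeeping described above can be sketched in isolation as follows. This is only an illustration, not Nutch code: the class ParseCounterSketch, its nested Status bean, and the map standing in for the Hadoop counter group are all hypothetical, and the ordering of the major codes (0 = notparsed, 1 = success, 2 = failed) is an assumption. Under that assumption, a status whose major code is never set lands in "notparsed".

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the counter logic discussed above; not Nutch code.
class ParseCounterSketch {
    // Assumed major codes and their counter names, indexed by code.
    static final int NOTPARSED = 0;
    static final int SUCCESS = 1;
    static final int FAILED = 2;
    static final String[] MAJOR_CODES = {"notparsed", "success", "failed"};

    // Simplified status bean: the major code defaults to 0 until it is set.
    static final class Status {
        private int majorCode = NOTPARSED;

        void setMajorCode(int value) {
            majorCode = value;
        }

        int getMajorCode() {
            return majorCode;
        }
    }

    // Plain map standing in for the "ParserStatus" counter group.
    private final Map<String, Long> counters = new TreeMap<>();

    // Mirrors the null check and the name lookup by major code.
    void record(Status pstatus) {
        if (pstatus != null) {
            counters.merge(MAJOR_CODES[pstatus.getMajorCode()], 1L, Long::sum);
        }
    }

    // Returns the current value of a counter, or 0 if it was never incremented.
    long get(String name) {
        return counters.getOrDefault(name, 0L);
    }

    public static void main(String[] args) {
        ParseCounterSketch job = new ParseCounterSketch();

        Status ok = new Status();
        ok.setMajorCode(SUCCESS);   // set directly on the bean
        job.record(ok);

        job.record(new Status());   // major code never set: counted as "notparsed"
        job.record(null);           // no status at all: not counted

        System.out.println(job.counters);
    }
}
```

In this sketch a page with no status at all never reaches a counter, so the counter totals can be lower than the number of URLs processed.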
>
> It therefore seems that there is a clear inconsistency in how we assign
> ParseStatusCodes to ParseStatus objects. My hope is that this is why the
> counters are all messed up.
>
> My suggestion is that implementations should follow the approach used in
> HtmlParser, where we access the ParseStatus bean directly. We could pass
> this through ParseStatusUtils, but to me that is unnecessary and just
> adds more confusion.
>
> I know this is a long post, and I apologize for that, but I would be
> really pleased if others were able to comment.
> I can then work towards a patch for this... if one is required.
>
> Thanks
>
> [0]
> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/storage/ParseStatus.java?view=markup
> [1]
> http://svn.apache.org/viewvc/nutch/branches/2.x/src/gora/webpage.avsc?view=markup
> [2]
> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserJob.java?view=markup
> [3]
> http://svn.apache.org/viewvc/nutch/branches/2.x/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?view=markup
> [4]
> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/parse/ParseStatusUtils.java?view=markup
> [5]
>
> --
> *Lewis*
>



-- 
*Lewis*
