Forget this. I am tripping, and the low counters were directly related to NUTCH-1591.

Sorry
Lewis

On Wed, Jun 19, 2013 at 5:04 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi,
>
> We define the structure of ParseStatus [0] in our WebPage JSON schema [1].
> All good so far.
> What is not good (or not clear to me at least) is how we currently use
> methods within this class to define Hadoop counters for the parsing tasks.
> I parse a large number of URLs, but the counters on one of my jobs show
> only the following values:
>
> failed 11
> success 498
> notparsed 252
>
> I now digress slightly for some more technical stuff/observations. These
> are merely observations of me stepping through the Nutch code in an attempt
> to find out why the numbers are so (embarrassingly/surprisingly) low.
>
> I began at where we actually initiate the counter. This can of course be
> located at line #134 of ParserJob [2], where we do
>
> 133 if (pstatus != null) {
> 134   context.getCounter("ParserStatus",
> 135       ParseStatusCodes.majorCodes[pstatus.getMajorCode()]).increment(1);
> 136 }
>
> So I then wondered when ParseStatus.setMajorCode(int value) is
> actually called to assign one of "failed", "success" or "notparsed"
> respectively.
> It turns out that .setMajorCode(int value) is called in no fewer than two
> places: line #217 of HtmlParser [3]
>
> 216 ParseStatus status = new ParseStatus();
> 217 status.setMajorCode(ParseStatusCodes.SUCCESS);
> 218 if (metaTags.getRefresh()) {
>
> and numerous lines within ParseStatusUtils [4].
>
> It therefore seems that there is a clear inconsistency in how our
> implementation assigns ParseStatusCodes to ParseStatus objects. My hope is
> that this is why the counters are all messed up.
>
> My suggestion: I believe that implementations should follow the approach
> used in HtmlParser, where we access the ParseStatus bean directly. We
> could pass this stuff through ParseStatusUtils, but for me this is
> unnecessary and just adds more confusion.
>
> I know this is a long post, and I apologize for that, but I would be
> really pleased if others were able to comment.
> I can then work towards a patch for this... if one is required.
>
> Thanks
>
> [0] http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/storage/ParseStatus.java?view=markup
> [1] http://svn.apache.org/viewvc/nutch/branches/2.x/src/gora/webpage.avsc?view=markup
> [2] http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserJob.java?view=markup
> [3] http://svn.apache.org/viewvc/nutch/branches/2.x/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?view=markup
> [4] http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/parse/ParseStatusUtils.java?view=markup
> [5]
>
> --
> *Lewis*

--
*Lewis*
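
For anyone reading this thread later: the counter logic quoted from ParserJob only counts a page when its status is non-null, so the three counters need not add up to the number of pages processed. The following is a minimal, self-contained sketch of that pattern in plain Java, not the Nutch code itself. The class name, the tally method and the ordering of the MAJOR_CODES array are assumptions made for illustration, and a plain map stands in for the Hadoop counters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the counter logic quoted from ParserJob.
// None of the names below are Nutch or Hadoop APIs.
final class ParseCounterSketch {

    // Assumed ordering; the real ParseStatusCodes.majorCodes may differ.
    static final String[] MAJOR_CODES = {"notparsed", "success", "failed"};

    // Counts one entry per non-null major code, mirroring the quoted
    // "if (pstatus != null)" guard: a null status is silently skipped.
    static Map<String, Long> tally(Integer[] majorCodes) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (Integer code : majorCodes) {
            if (code != null) {
                counters.merge(MAJOR_CODES[code], 1L, Long::sum);
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        // Six pages, two of which never had a status assigned.
        Integer[] statuses = {1, 1, null, 2, 0, null};
        System.out.println(tally(statuses));
    }
}
```

In this sketch six inputs produce counters that sum to four, because the two null entries contribute to no counter at all.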

