Hi,
We define the structure of ParseStatus [0] in our WebPage JSON schema [1].
All good so far.
What is not good (or at least not clear to me) is how we currently use
methods within this class to define Hadoop counters for the parsing tasks.
I parse large numbers of URLs, but the counters on one of my jobs show only
the following names and values:

failed 11
success 498
notparsed 252

I now digress slightly into some more technical observations. These come
from me stepping through the Nutch code in an attempt to find out why the
numbers are so (embarrassingly/surprisingly) low.

I began at where we actually initiate the counter. This can of course be
located at line #134 of ParserJob [2], where we do

133 if (pstatus != null) {
134   context.getCounter("ParserStatus",
135       ParseStatusCodes.majorCodes[pstatus.getMajorCode()]).increment(1);
136 }
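To make the counting step above concrete, here is a minimal, self-contained sketch of how a major code ends up as a counter name. It does not touch Hadoop or Nutch at all: CounterSketch, MAJOR_CODES and tally are hypothetical stand-ins, and the index-to-name ordering in MAJOR_CODES is my assumption rather than something copied from ParseStatusCodes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CounterSketch {
    // Assumed ordering; a stand-in for ParseStatusCodes.majorCodes.
    static final String[] MAJOR_CODES = {"notparsed", "success", "failed"};

    // Mimics the guarded increment in the snippet above:
    // a null status is skipped and contributes to no counter.
    static Map<String, Long> tally(Integer[] majorCodes) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (Integer code : majorCodes) {
            if (code != null) {
                // Look up the counter name by major code and add one.
                counters.merge(MAJOR_CODES[code], 1L, Long::sum);
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        // Five pages, one of which never received a status.
        Integer[] statuses = {1, 1, 2, null, 0};
        System.out.println(tally(statuses));
    }
}
```

If the real code behaves like this sketch, any page whose status is null is silently left out of all three counters, which would be one possible way for the totals to fall short of the number of URLs parsed.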
So I then wondered where ParseStatus.setMajorCode(int value) is actually
called to assign one of "failed", "success" or "notparsed".
It turns out that .setMajorCode(int value) is called in no fewer than two
places: line #217 of HtmlParser [3]

216 ParseStatus status = new ParseStatus();
217 status.setMajorCode(ParseStatusCodes.SUCCESS);
218 if (metaTags.getRefresh()) {
and numerous lines within ParseStatusUtils [4].

It therefore seems that there is a clear inconsistency in how we assign
ParseStatusCodes to ParseStatus objects. My hope is that this is why the
counters are all messed up.

My suggestion is that implementations should follow the approach used in
HtmlParser, where we access the ParseStatus bean directly. We could pass
this through ParseStatusUtils, but to me that is unnecessary and just adds
more confusion.

I know this is a long post, and I apologize for that, but I would be really
pleased if others were able to comment.
I can then work towards a patch for this... if one is required.

Thanks

[0]
http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/storage/ParseStatus.java?view=markup
[1]
http://svn.apache.org/viewvc/nutch/branches/2.x/src/gora/webpage.avsc?view=markup
[2]
http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserJob.java?view=markup
[3]
http://svn.apache.org/viewvc/nutch/branches/2.x/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?view=markup
[4]
http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/parse/ParseStatusUtils.java?view=markup

-- 
*Lewis*
