On Mon, Feb 4, 2013 at 7:18 PM, Lewis John Mcgibbney < [email protected]> wrote:
> Hi Kiran,
>
> You are using 2.x still?
>

Yes, I am using the 2.x version of Nutch.

> HttpBase [0] suggests that upon receipt of a 404 response code the
> ProtocolStatus is marked to ProtocolStatusCodes.NOTFOUND which appears
> to be 14! [1].
> What are you expecting to happen here?
>

Yes, the ProtocolStatus is changed to NOTFOUND, but I am talking about the
fetch status, which is still 1 (db_unfetched) rather than being set to 3
(db_gone).

We can see in this log file
(https://raw.github.com/salvager/NutchDev/master/runtime/local/table_fields/part-r-00000)
that URLs with protocolStatus NOTFOUND have a fetch status of 1
(db_unfetched). Shouldn't they be changed from status 1 to status 3?

The second column in the log file is fetchStatus and the third column is
protocolStatus.

Because of this, the output of (readdb -stats) is inconsistent. I am not
sure whether this is a problem only for me or for others as well. I have
done the crawl from scratch 3-4 times.

>
> > PS : I have made patch which dumps only particular fields through command
> > line (Example: ./bin/nutch readdb -dump table_fields -fields
> > "status,protocolStatus"). baseUrl is dumped by default along with other
> > fields requested. I can upload if anyone is interested.
>
> Please file an issue and attach your patch. Any potential addition to
> the codebase is welcomed.
>

Sure. Will do!

> [0]
> http://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
> [1]
> http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/protocol/ProtocolStatusCodes.java
>
> --
> Lewis
>

--
Kiran Chitturi

