On Mon, Feb 4, 2013 at 7:18 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Kiran,
>
> You are using 2.x still?
>
Yes, I am using the 2.x version of Nutch.

> HttpBase [0] suggests that upon receipt of a 404 response code the
> ProtocolStatus is marked to ProtocolStatusCodes.NOTFOUND which appears
> to be 14! [1].
> What are you expecting to happen here?
>
Yes, the ProtocolStatus is changed to NOTFOUND, but I am talking about the
fetch status, which is still 1 (db_unfetched) rather than being set to
3 (db_gone).
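
To make the expectation concrete, here is a minimal sketch of the mapping I
have in mind. It is not Nutch's actual code: the class and method names are
made up for illustration, and the constants are simply the values mentioned
in this thread (14 for NOTFOUND, 1 for db_unfetched, 3 for db_gone) rather
than anything imported from Nutch.

```java
// Hypothetical sketch, not Nutch source. Constants mirror the values
// quoted in this thread instead of being imported from Nutch.
final class ExpectedStatusSketch {
    static final int PROTOCOL_NOTFOUND = 14; // ProtocolStatusCodes.NOTFOUND, see [1]
    static final int DB_UNFETCHED = 1;       // fetch status currently observed
    static final int DB_GONE = 3;            // fetch status I would expect

    /** Returns the fetch status I would expect after a fetch attempt. */
    static int expectedFetchStatus(int currentFetchStatus, int protocolStatusCode) {
        if (protocolStatusCode == PROTOCOL_NOTFOUND) {
            return DB_GONE;
        }
        // Other protocol codes are out of scope here; leave the status unchanged.
        return currentFetchStatus;
    }

    public static void main(String[] args) {
        System.out.println(expectedFetchStatus(DB_UNFETCHED, PROTOCOL_NOTFOUND));
    }
}
```

In other words, a URL that comes back NOTFOUND should no longer be counted as
unfetched.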

We can see in this log file (
https://raw.github.com/salvager/NutchDev/master/runtime/local/table_fields/part-r-00000)
that URLs with protocolStatus NOTFOUND have a fetch status of 1
(db_unfetched). Shouldn't they be changed from status 1 to status 3? The
second column in the log file is fetchStatus and the third column is
protocolStatus.
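
A rough way to count the affected rows in such a dump is sketched below. It
rests on assumptions about the file layout that may not match the real
output: columns separated by whitespace, the fetch status written as the
number 1, and the protocol status column containing the text NOTFOUND. The
class name and the sample rows are made up.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: the dump layout is assumed, not taken from Nutch itself.
final class DumpInconsistencySketch {
    /** True if a row has fetchStatus 1 while its protocolStatus mentions NOTFOUND. */
    static boolean isInconsistent(String line) {
        String[] cols = line.trim().split("\\s+");
        if (cols.length < 3) {
            return false; // not a url / fetchStatus / protocolStatus row
        }
        return cols[1].equals("1") && cols[2].contains("NOTFOUND");
    }

    /** Counts the rows flagged by isInconsistent. */
    static long countInconsistent(List<String> lines) {
        return lines.stream().filter(DumpInconsistencySketch::isInconsistent).count();
    }

    public static void main(String[] args) {
        // Made-up rows in the assumed layout, not real crawl data.
        List<String> sample = Arrays.asList(
            "http://example.com/missing 1 NOTFOUND",
            "http://example.com/ok 2 SUCCESS");
        System.out.println(countInconsistent(sample));
    }
}
```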

Because of this, the readdb -stats output is inconsistent.

I am not sure whether this is a problem only for me or for others as well.
I have run the crawl from scratch 3-4 times.

>
> > PS : I have made patch which dumps only particular fields through command
> > line (Example: ./bin/nutch readdb -dump table_fields -fields
> > "status,protocolStatus"). baseUrl is dumped by default along with other
> > fields requested. I can upload if anyone is interested.
>
> Please file an issue and attach your patch. Any potential addition to
> the codebase is welcomed.
>
Sure. Will do!

> [0]
> http://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
> [1]
> http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/protocol/ProtocolStatusCodes.java
>
> --
> Lewis
>



-- 
Kiran Chitturi
