Re: double slash in path normalized away by Nutch 1.7

Steve Newcomb Thu, 31 Oct 2013 06:07:04 -0700

This was a problem of ignorance on my part.  The cause was in
regex-normalize.xml.  Evidently the default version of
regex-normalize.xml contains a rule that collapses double slashes,
so it's easy to fix by editing that file.


If there is a useful lesson here, it's that the contents of the default
regex-normalize.xml are examples to be adapted, nothing more.
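
For anyone who hits the same thing, here is a minimal sketch of how a
duplicate-slash rule could produce the behavior described in the quoted
messages below.  The pattern is an assumption, modeled on the kind of
rule a default regex-normalize.xml may contain; it is not copied from
the 1.7 distribution, so check the pattern in your own copy.  The
function name is hypothetical, and the seed URL is shortened to the
part that matters.

```python
import re

# ASSUMPTION: a rule of this shape (collapse runs of slashes unless the
# run directly follows a literal ':') stands in for the real entry in
# regex-normalize.xml. Verify against your own file.
DUPLICATE_SLASHES = re.compile(r"(?<!:)/{2,}")


def collapse_slashes(url: str) -> str:
    """Replace each run of two or more slashes with one slash,
    except where the run is immediately preceded by ':'."""
    return DUPLICATE_SLASHES.sub("/", url)


# Shortened form of the seed URL from the original report.
seed = (
    "https://www.pay.gov/paygov/forms/formInstance.html"
    "?nc=1356014395287&agencyFormId=44568890"
    "&userFormSearch=https%3A//www.pay.gov/paygov/keywordSearchForms.html"
)

normalized = collapse_slashes(seed)
print(normalized)
```

Under that assumption, the "//" after the scheme survives because it
follows a literal ":", but the "//" in the query follows "%3A" (a
percent-encoded colon), so the exception does not apply and it becomes
a single slash.  Removing or narrowing such a rule in a local copy of
regex-normalize.xml would avoid the rewrite.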

On 10/28/2013 11:06 AM, Steve Newcomb wrote:
> Correction: Where I said "path portion" in my note, I should have said
> "query portion".
> 
> On 10/28/2013 10:54 AM, Steve Newcomb wrote:
>> I think maybe Nutch is not working correctly with respect to URLs whose
>> path portions contain double slashes.  I'm using Nutch 1.7 (with the
>> protocol-httpclient plugin) to validate a carefully-maintained list of
>> URLs, so I'm paying unusually close attention, I guess, to what's
>> happening to every one of them.
>>
>> In Firefox, the following URL works:
>>
>> https://www.pay.gov/paygov/forms/formInstance.html?nc=1356014395287&agencyFormId=44568890&userFormSearch=https%3A//www.pay.gov/paygov/keywordSearchForms.html%3FshowingDetails=true&showingAll=false&sortProperty=agencyFormName&totalResults=1&keyword=apma&ascending=true&pageOffset=0
>>
>> Note the double slash after "https%3A" in the path portion of the URL.
>>
>> After using Nutch to check this URL along with many others, the segment
>> dump does not report this URL.  Instead, it reports another URL -- one
>> in which the double slash in the path portion of the URL has been
>> changed to a single slash.
>>
>> The altered URL reported in the Nutch dump is evidently incorrect.  When
>> I try the Nutch-reported URL in Firefox, I see that the server at
>> www.pay.gov can't resolve it successfully.
>>
>> The dump record for the altered URL reveals "robots denied", which is
>> useful information for me, and it may be *correct* information, too: the
>> URL is a form for users to fill out.  (I do not know what would happen
>> if robots were allowed by the server.  I suspect Nutch would report that
>> the resource does not exist, which would be incorrect for the URL I used
>> as a seed, and correct for the URL that Nutch reported.)
>>
>> But how can I find this information in the segment dump, since the
>> information appears under a *different* URL than the one I was
>> attempting to validate?  My current workaround is to normalize the path
>> portion of the URL I'm looking for in the same apparently-incorrect
>> fashion as Nutch does.  Not pretty.
>>
>>
>> Steve Newcomb
>>
