This was ignorance on my part: the problem was in regex-normalize.xml. Evidently the default version of regex-normalize.xml is what collapses the double slash, so it's easy to fix by editing that file.
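For anyone who hits the same thing, here is a minimal Python sketch of the effect. It is not Nutch code. The pattern is only my assumption of what a "collapse duplicate slashes" rule looks like (a `<pattern>`/`<substitution>` pair in regex-normalize.xml), so check your own copy of the file. The function names are made up for illustration, and the URL is a shortened form of the one in my original note.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# ASSUMPTION: a stand-in for a "remove duplicate slashes" rule. It collapses
# runs of slashes unless the run directly follows a literal ':' (as in
# "https://"). The actual pattern in your regex-normalize.xml may differ.
DUPLICATE_SLASHES = re.compile(r"(?<!:)/{2,}")


def collapse_everywhere(url):
    """Apply the rule to the whole URL string, query portion included."""
    return DUPLICATE_SLASHES.sub("/", url)


def collapse_path_only(url):
    """Hypothetical gentler variant: touch only the path, leave the query alone."""
    parts = urlsplit(url)
    clean_path = DUPLICATE_SLASHES.sub("/", parts.path)
    return urlunsplit(parts._replace(path=clean_path))


# Shortened version of the seed URL; the query holds a percent-encoded URL.
seed = ("https://www.pay.gov/paygov/forms/formInstance.html"
        "?nc=1356014395287"
        "&userFormSearch=https%3A//www.pay.gov/paygov/keywordSearchForms.html")

if __name__ == "__main__":
    print(collapse_everywhere(seed))
    print(collapse_path_only(seed))
```

The encoded colon is the catch: in "https%3A//" the double slash follows the letter "A", not ":", so a rule that only protects "://" still rewrites it. Restricting the substitution to the path is one possible way around that; simply deleting the rule is another.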
If there is a useful lesson here, it's that the contents of regex-normalize.xml are only examples, nothing more.

On 10/28/2013 11:06 AM, Steve Newcomb wrote:
> Correction: Where I said "path portion" in my note, I should have said
> "query portion".
>
> On 10/28/2013 10:54 AM, Steve Newcomb wrote:
>> I think maybe Nutch is not working correctly with respect to URLs whose
>> path portions contain double slashes. I'm using Nutch 1.7 (with the
>> protocol-httpclient plugin) to validate a carefully-maintained list of
>> URLs, so I'm paying unusually close attention, I guess, to what's
>> happening to every one of them.
>>
>> In Firefox, the following URL works:
>>
>> https://www.pay.gov/paygov/forms/formInstance.html?nc=1356014395287&agencyFormId=44568890&userFormSearch=https%3A//www.pay.gov/paygov/keywordSearchForms.html%3FshowingDetails=true&showingAll=false&sortProperty=agencyFormName&totalResults=1&keyword=apma&ascending=true&pageOffset=0
>>
>> Note the double slash after "https%3A" in the path portion of the URL.
>>
>> After using Nutch to check this URL along with many others, the segment
>> dump does not report this URL. Instead, it reports another URL -- one
>> in which the double slash in the path portion of the URL has been
>> changed to a single slash.
>>
>> The altered URL reported in the Nutch dump is evidently incorrect. When
>> I try the Nutch-reported URL in Firefox, I see that the server at
>> www.pay.gov can't resolve it successfully.
>>
>> The dump record for the altered URL reveals "robots denied", which is
>> useful information for me, and it may be *correct* information, too: the
>> URL is a form for users to fill out. (I do not know what would happen
>> if robots were allowed by the server. I suspect Nutch would report that
>> the resource does not exist, which would be incorrect for the URL I used
>> as a seed, and correct for the URL that Nutch reported.)
>>
>> But how can I find this information in the segment dump, since the
>> information appears under a *different* URL than the one I was
>> attempting to validate? My current workaround is to normalize the path
>> portion of the URL I'm looking for in the same apparently-incorrect
>> fashion as Nutch does. Not pretty.
>>
>>
>> Steve Newcomb
>>

