Hi,

That double slash is normalized by your regex-normalizer. Check the configuration file and either remove the normalization rule or change it so that it does not normalize a double slash that comes right after http or https.
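To illustrate the kind of rule involved, here is a minimal sketch in Python. It assumes the rule in the normalizer configuration is a "collapse duplicate slashes" regex that only spares a double slash directly after a literal colon; that pattern is an assumption, so check what your own configuration file actually contains. The names `ASSUMED_RULE`, `GUARDED_RULE` and `normalize` are hypothetical and exist only for this example, and the seed URL is shortened. Because the seed URL has a percent-encoded colon (`%3A`) before the double slash, a colon-only guard would not protect it.

```python
import re

# ASSUMPTION: a duplicate-slash rule that skips "//" only when it directly
# follows a literal ":" (as in "https://"). Verify against your own config.
ASSUMED_RULE = re.compile(r"(?<!:)/{2,}")

# Hypothetical variant: also skip "//" that follows a percent-encoded
# colon ("%3A" or "%3a").
GUARDED_RULE = re.compile(r"(?<!:)(?<!%3[Aa])/{2,}")


def normalize(url, rule):
    """Collapse each run of slashes matched by `rule` into a single slash."""
    return rule.sub("/", url)


# Shortened form of the seed URL from the original message.
seed = (
    "https://www.pay.gov/paygov/forms/formInstance.html"
    "?userFormSearch=https%3A//www.pay.gov/paygov/keywordSearchForms.html"
)

if __name__ == "__main__":
    print(normalize(seed, ASSUMED_RULE))  # the "%3A//" becomes "%3A/"
    print(normalize(seed, GUARDED_RULE))  # the seed URL is left unchanged
```

If the rule lives in an XML configuration file, remember that the `<` in a lookbehind such as `(?<!:)` has to be written as `&lt;` there. Removing the rule entirely is the simpler option if you never want duplicate slashes collapsed.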
Cheers,
Markus

-----Original message-----
> From:Steve Newcomb <[email protected]>
> Sent: Monday 28th October 2013 15:10
> To: [email protected]
> Subject: double slash in path normalized away by Nutch 1.7
>
> I think maybe Nutch is not working correctly with respect to URLs whose
> path portions contain double slashes. I'm using Nutch 1.7 (with the
> protocol-httpclient plugin) to validate a carefully-maintained list of
> URLs, so I'm paying unusually close attention, I guess, to what's
> happening to every one of them.
>
> In Firefox, the following URL works:
>
> https://www.pay.gov/paygov/forms/formInstance.html?nc=1356014395287&agencyFormId=44568890&userFormSearch=https%3A//www.pay.gov/paygov/keywordSearchForms.html%3FshowingDetails=true&showingAll=false&sortProperty=agencyFormName&totalResults=1&keyword=apma&ascending=true&pageOffset=0
>
> Note the double slash after "https%3A" in the path portion of the URL.
>
> After using Nutch to check this URL along with many others, the segment
> dump does not report this URL. Instead, it reports another URL -- one
> in which the double slash in the path portion of the URL has been
> changed to a single slash.
>
> The altered URL reported in the Nutch dump is evidently incorrect. When
> I try the Nutch-reported URL in Firefox, I see that the server at
> www.pay.gov can't resolve it successfully.
>
> The dump record for the altered URL reveals "robots denied", which is
> useful information for me, and it may be *correct* information, too: the
> URL is a form for users to fill out. (I do not know what would happen
> if robots were allowed by the server. I suspect Nutch would report that
> the resource does not exist, which would be incorrect for the URL I used
> as a seed, and correct for the URL that Nutch reported.)
>
> But how can I find this information in the segment dump, since the
> information appears under a *different* URL than the one I was
> attempting to validate? My current workaround is to normalize the path
> portion of the URL I'm looking for in the same apparently-incorrect
> fashion as Nutch does. Not pretty.
>
>
> Steve Newcomb
>

