RE: URL Normalization and the # sign

Iain Lopata Thu, 27 Mar 2014 05:39:15 -0700

Further analysis suggests that the Injector always invokes the normalizer
chain and that there is no command-line option to prevent this.


The basic normalizer drops the portion of the URL after the # when it
executes url.getFile() [line 100 in BasicURLNormalizer.java, version 1.6],
which does not return the fragment portion of the URL.
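
To illustrate the getFile() behavior in isolation, here is a minimal sketch
using only java.net.URL. It is not the Nutch source; the class name, helper
methods, and seed URL are hypothetical and exist only for this example.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class FragmentDemo {
    // Returns what URL.getFile() reports: the path plus the query string.
    // The fragment (everything from '#' onward) is not included.
    static String fileOf(String spec) {
        try {
            return new URL(spec).getFile();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Returns the fragment via URL.getRef(), or null if the URL has none.
    static String fragmentOf(String spec) {
        try {
            return new URL(spec).getRef();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical seed URL with a fragment after the query string.
        String seed = "http://example.com/search?q=nutch#page=2";
        System.out.println("getFile(): " + fileOf(seed));
        System.out.println("getRef():  " + fragmentOf(seed));
    }
}
```

Any code that rebuilds a URL from the protocol, host, port and getFile()
alone will therefore lose the fragment unless it also appends getRef().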

Does this seem correct?

Is there any other way of encoding the seed URL to prevent the fragment from
being dropped?  

Is it possible to omit the basic normalizer from the chain and implement the
same rules in the regex normalizer?

-----Original Message-----
From: Iain Lopata [mailto:[email protected]] 
Sent: Wednesday, March 26, 2014 9:19 PM
To: [email protected]
Subject: URL Normalization and the # sign

I need to inject a number of seed URLs that require a # sign in the
parameters.

 

Both the basic-normalizer and the regex-normalizer (with a default
configuration) would seem to remove the portion of the URL after the #.

 

I can modify the regex normalizer configuration file to eliminate this
behavior, but I cannot find a way to do this for the basic normalizer
without modifying the source.

 

I have tried encoding the # sign as %23 but this does not work with the site
I am crawling - it appears to need the #.

 

Any other suggestions?  If I don't invoke the basic normalizer, does anyone
have the additional entries for the regex normalizer config file that would
replace its functionality?

 

Thanks
