Further analysis suggests that the Injector invokes the normalizer chain and that there is no command line option to prevent this.
The basic normalizer drops the portion of the URL after the # when it calls url.getFile() [line 100 in BasicURLNormalizer.java, version 1.6], which does not return the fragment portion of the URL.

Does this seem correct? Is there any other way of encoding the seed URL to prevent the fragment from being dropped? Is it possible to omit the basic normalizer from the chain and implement the same rules in the regex normalizer?

-----Original Message-----
From: Iain Lopata [mailto:[email protected]]
Sent: Wednesday, March 26, 2014 9:19 PM
To: [email protected]
Subject: URL Normalization and the # sign

I need to inject a number of seed URLs that require a # sign in the parameters. Both the basic-normalizer and the regex-normalizer (with a default configuration) would seem to remove the portion of the URL after the #.

I can modify the regex normalizer configuration file to eliminate this behavior, but I cannot find a way to do this for the basic normalizer without modifying the source. I have tried encoding the # sign as %23, but this does not work with the site I am crawling; it appears to need the #.

Any other suggestions? If I don't invoke the basic normalizer, does anyone have the additional entries for the regex normalizer config file that would replace its functionality?

Thanks
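For anyone wanting to check the getFile() behavior described in the reply at the top of this thread without running Nutch, here is a minimal standalone sketch. It uses only java.net.URL and is not code from BasicURLNormalizer; the class name FragmentDemo, its helper methods, and the example.com URL are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical demo class, not part of Nutch.
public class FragmentDemo {

    // Parse a string into a java.net.URL, wrapping the checked exception.
    static URL parse(String s) {
        try {
            return URI.create(s).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // getFile() returns the path plus the query string, without the fragment.
    static String fileOf(String s) {
        return parse(s).getFile();
    }

    // getRef() returns the fragment (the part after '#'), or null if absent.
    static String refOf(String s) {
        return parse(s).getRef();
    }

    public static void main(String[] args) {
        String seed = "http://example.com/search?q=a#page=2"; // made-up seed URL
        System.out.println("getFile(): " + fileOf(seed)); // /search?q=a
        System.out.println("getRef():  " + refOf(seed));  // page=2
    }
}
```

Any code that rebuilds a URL from getProtocol(), getHost(), getPort() and getFile() alone will therefore lose the fragment unless it also appends getRef(). Whether that is the only place the fragment is lost in the normalizer chain is something the sketch does not show.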

