I need to inject a number of seed URLs that require a # sign in the
parameters.

 

Both the basic-normalizer and the regex-normalizer (with a default
configuration) would seem to remove the portion of the URL after the #

 

I can modify the regex normalizer configuration file to eliminate this
behavior, but I cannot find a way to do this for the basic normalizer
without modifying the source.

 

I have tried encoding the # sign as %23 but this does not work with the site
I am crawling - it appears to need the #.

 

Any other suggestions?  If I don't invoke the basic normalize does anyone
have the additional entries for the regex normalizer config file that would
replace its functionality?

 

Thanks

Reply via email to