Moyer, Brett <bmo...@tiaa.org> wrote:
> What is the best practice on URL case?
I work with web archiving, and URL normalisation is quite a tricky thing. The software we use is https://github.com/ukwa/webarchive-discovery and a lot of energy has been spent on the subject there. Long story short, we index 2 forms: the unmodified raw one and a heavily normalised one.

Question: Is https://www.example.com/FOO/ the same as http://example.com/foo ? Technically it is not, as

* There might be different content served for different protocols (highly unlikely)
* www might mean something (unlikely)
* FOO might be another resource than foo (unlikely)
* The trailing slash might be significant (seen on some Apache proxy-setups)

There are other rules, such as trying to remove session-ids, everything after # and so on. No individual step produces many false positives on its own, but they do add up.

For most practical purposes (URL-lookup & grouping, following links between archived pages, resolving embedded resources from pages) we use the heavily normalised URL.

- Toke Eskildsen
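
To make the rules above concrete, here is a minimal Python sketch of what a "heavily normalised" lookup key could look like. It is not taken from webarchive-discovery; the function name normalise_url and the SESSION_PARAMS list are made up for this illustration, and a real normaliser would need many more rules.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Hypothetical list of session-id parameter names (an assumption for this
# sketch; a production list would be longer and tuned to the collection).
SESSION_PARAMS = {"jsessionid", "phpsessid", "sessionid", "sid"}


def normalise_url(url: str) -> str:
    """Return a heavily normalised lookup key for a URL (illustrative only)."""
    # Treat FOO and foo as the same resource by lower-casing everything.
    parts = urlsplit(url.strip().lower())

    # Ignore the protocol entirely and drop a leading "www." label.
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    # Keep a port only when it is not a default HTTP/HTTPS port.
    if parts.port not in (None, 80, 443):
        host = f"{host}:{parts.port}"

    # Treat the trailing slash as insignificant.
    path = parts.path.rstrip("/")

    # Remove query parameters that look like session ids.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode([(k, v) for k, v in pairs if k not in SESSION_PARAMS])

    # Everything after '#' (parts.fragment) is simply never added back.
    return host + path + ("?" + query if query else "")


for raw in ("https://www.example.com/FOO/", "http://example.com/foo"):
    print(raw, "->", normalise_url(raw))
```

Both example URLs from the question end up with the same key, example.com/foo, which is what makes lookup and grouping work. The raw URL still has to be stored next to the key, since each of these steps can occasionally merge two resources that really are different.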