Moyer, Brett <bmo...@tiaa.org> wrote:
> What is the best practice on URL case?

I work with web archiving, and URL-normalisation is quite a tricky thing. The 
software we use is https://github.com/ukwa/webarchive-discovery and in there a 
lot of energy has been spent on the subject. Long story short, we index two 
forms: the unmodified raw one and a heavily normalised one.

Question: Is
  https://www.example.com/FOO/
the same as
  http://example.com/foo
?

Technically it is not, as
* There might be different content served for different protocols (highly 
unlikely)
* www might mean something (unlikely)
* FOO might be another resource than foo (unlikely)
* The trailing slash might be significant (seen on some Apache proxy-setups)

There are other rules, such as trying to remove session-ids, everything after # 
and so on. None of the individual steps results in many false positives by 
itself, but they do add up.
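
To make the steps above concrete, here is a rough Python sketch of that kind of 
heavy normalisation. It is not the webarchive-discovery code, only an 
illustration of the idea: the function name normalise and the SESSION_PARAMS 
list are made up for this example, and sorting the query parameters is my own 
addition.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Hypothetical list of query parameters treated as session ids;
# a real setup would need a longer, site-specific list.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}


def normalise(url: str) -> str:
    """Return a heavily normalised form of url, for lookup and grouping."""
    parts = urlsplit(url.strip())

    # Host: lowercase and drop a leading "www."
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    # Keep an explicit port only if it is not a default HTTP(S) port
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"

    # Path: lowercase and drop trailing slashes (but keep the root "/")
    path = parts.path.lower().rstrip("/") or "/"

    # Query: drop session ids, lowercase the keys, sort for a stable order
    query = sorted(
        (key.lower(), value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    )

    # Protocol is collapsed to http; the fragment (after #) is never copied
    result = f"http://{host}{path}"
    if query:
        result += "?" + urlencode(query)
    return result
```

With this sketch the two URLs from the question both end up as 
http://example.com/foo. The raw URL would still be stored next to the 
normalised one, so nothing is lost when one of the "unlikely" cases turns out 
to matter.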

For most practical purposes (URL-lookup & grouping, following links between 
archived pages, resolving embedded resources from pages) we use the heavily 
normalised URL.

- Toke Eskildsen
