Re: URL Case Sensitive/Insensitive
Lowercasing might work, or it might not.

Hostnames were originally case-insensitive, but that might have changed with I18N hostnames. Paths are interpreted by the web server. On Windows, paths are case-insensitive. On Unix, they are case-sensitive. Web servers might also be configured to use case-insensitive paths.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Dec 11, 2018, at 10:33 AM, Moyer, Brett wrote:
>
> Is there any issue if we just lowercase all URLs?
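A minimal sketch of the conservative option implied by the hostname/path distinction above: lowercase only the parts that are generally safe (scheme and host) and leave the path alone, since the web server decides whether path case matters. The function name `lowercase_host_only` is hypothetical, and the sketch assumes URLs without a user:password part in front of the host.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_host_only(url: str) -> str:
    """Lowercase the scheme and host, but keep the path, query and fragment as-is."""
    parts = urlsplit(url)
    # urlsplit already lowercases the scheme; the netloc is returned unchanged.
    # Assumption: no "user:password@" prefix, which would be case-sensitive.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, parts.query, parts.fragment)
    )


print(lowercase_host_only("HTTPS://WWW.Nuveen.com/Mutual-Funds/Foo"))
```

With this approach the two nuveen.com URLs from the question would still be treated as different, which is the safe default if the server's path handling is unknown.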
RE: URL Case Sensitive/Insensitive
https://www.nuveen.com/mutual-funds/nuveen-high-yield-municipal-bond-fund
https://www.nuveen.com/mutual-funds/Nuveen-High-Yield-Municipal-Bond-Fund

Is there any issue if we just lowercase all URLs? I can't think of an issue that would be caused, but that's why I'm asking the gurus!

Brett Moyer

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, December 11, 2018 12:41 PM
To: solr-user
Subject: Re: URL Case Sensitive/Insensitive
Re: URL Case Sensitive/Insensitive
Moyer, Brett wrote:
> What is the best practice on URL case?

I work with web archiving, and URL normalisation is quite a tricky thing. The software we use is https://github.com/ukwa/webarchive-discovery and a lot of energy has been spent on the subject there. Long story short, we index two forms: the unmodified raw one and a heavily normalised one.

Question: Is https://www.example.com/FOO/ the same as http://example.com/foo ?

Technically it is not, as

* There might be different content served for different protocols (highly unlikely)
* www might mean something (unlikely)
* FOO might be another resource than foo (unlikely)
* The trailing slash might be significant (seen on some Apache proxy-setups)

There are other rules, such as trying to remove session IDs, everything after #, and so on. None of the individual steps results in many false positives by itself, but they do add up.

For most practical purposes (URL lookup & grouping, following links between archived pages, resolving embedded resources from pages) we use the heavily normalised URL.

- Toke Eskildsen
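The rules listed above can be sketched roughly as follows. This is only an illustration of the idea, not the normalisation that webarchive-discovery actually implements; the function name `normalise_heavily` and the `SESSION_PARAMS` list are assumptions made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical list of query parameters treated as session IDs.
SESSION_PARAMS = {"jsessionid", "phpsessid", "sid", "sessionid"}


def normalise_heavily(url: str) -> str:
    """Build a lookup key that ignores protocol, www, case, trailing slash,
    fragment and (assumed) session-ID parameters. The port is dropped too."""
    parts = urlsplit(url.strip())
    host = parts.hostname or ""          # hostname is already lowercased
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path.lower().rstrip("/")
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    key = host + path
    if pairs:
        key += "?" + urlencode(pairs)
    return key                           # the fragment is never included


raw = "https://www.example.com/FOO/"
print(raw, "->", normalise_heavily(raw))
```

Keeping the raw URL next to the normalised key, as described above, means the aggressive rules are only used for lookup and grouping, while the original form is still available when a rule turns out to be wrong for a given site.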
Re: URL Case Sensitive/Insensitive
What do you mean by "url case"? No, I'm not being snarky.

The value returned in a doc is very different than the value searched. The stored data is the original input without going through any filters.

If you mean the value _returned_ by Solr from a stored field, then the case is exactly whatever was input originally. To get it in a consistent case, I'd change it on the client side before sending to Solr, or use, say, a ScriptUpdateProcessor to change it on the way in to Solr.

If you're talking about _searching_ the URL, you need to put the appropriate filters in your analysis chain. Most distributions have a "lowercase" type that is a KeywordTokenizer followed by a LowerCaseFilter. That still treats the searchable text as a single token, so for instance you wouldn't be able to search for url:com without pre-and-post wildcards, which is not a good pattern. If you want to search sub-parts of a URL, you'll use one of the text-based types to break it up into tokens.

Even in this case, though, the returned data is still the original case since it's the stored data that's returned.

Best,
Erick

On Tue, Dec 11, 2018 at 8:38 AM Moyer, Brett wrote:
>
> Hello, I'm new to Solr been using it for a few months. A recent question came
> up from our business partners about URL casing. Previously their URLs were
> upper case, they made a change and now all lower. Both pages/URLs are still
> accessible so there are duplicates in Solr. They are requesting all URLs be
> evaluated as lowercase. What is the best practice on URL case? Is there a
> negative to making all lowercase? I know I can drop the index and re-crawl to
> fix it, but long term how should URL case be treated? Thanks!
>
> Brett Moyer
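To make the stored-versus-searched distinction in Erick's reply concrete, here is a rough Python imitation of the two analysis styles he mentions. It does not call Solr and is not how Solr's tokenizers are implemented; `keyword_lowercase` and `simple_text_tokens` are hypothetical helpers that only mimic the outcome (one lowercased token versus several sub-part tokens).

```python
import re
from typing import List


def keyword_lowercase(value: str) -> List[str]:
    """Mimic a keyword tokenizer plus lowercase filter: one lowercased token."""
    return [value.lower()]


def simple_text_tokens(value: str) -> List[str]:
    """Crude stand-in for a text-based type: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", value.lower())


# The stored value keeps its original case; only the indexed tokens change.
stored = "https://www.nuveen.com/mutual-funds/Nuveen-High-Yield-Municipal-Bond-Fund"
single_token = keyword_lowercase(stored)
sub_parts = simple_text_tokens(stored)

print("com" in single_token)  # a plain term query for "com" has nothing to match
print("com" in sub_parts)     # the tokenised form exposes "com" as its own term
```

In both cases `stored` is untouched, which matches the point that the returned data is always the original input.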