Re: URL Case Sensitive/Insensitive

2018-12-11 Thread Walter Underwood
Lowercasing might work, it might not.

Hostnames were originally case-insensitive (DNS still compares ASCII names 
that way), but that might have changed with I18N hostnames.

Paths are interpreted by the web server. On Windows, file paths are normally 
case-insensitive. On Unix, they are normally case-sensitive. Web servers 
might also be configured to use case-insensitive paths.
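
A minimal sketch of the cautious option, assuming Python on the client side: 
lowercase only the scheme and host, and leave the path alone because the web 
server decides what it means. The function name is hypothetical, and note 
that lowercasing the whole netloc would also lowercase any userinfo in the 
URL.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_scheme_and_host(url):
    """Lowercase the scheme and network location; keep path, query, fragment."""
    parts = urlsplit(url)
    # Assumption: the URL carries no case-sensitive userinfo (user:password@).
    cleaned = parts._replace(
        scheme=parts.scheme.lower(),
        netloc=parts.netloc.lower(),
    )
    return urlunsplit(cleaned)


print(lowercase_scheme_and_host(
    "HTTPS://WWW.Nuveen.COM/mutual-funds/Nuveen-High-Yield-Municipal-Bond-Fund"
))
```

The path keeps its original case, so two URLs that differ only in path case 
still count as different after this step.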

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 11, 2018, at 10:33 AM, Moyer, Brett  wrote:
> 
> https://www.nuveen.com/mutual-funds/nuveen-high-yield-municipal-bond-fund
> https://www.nuveen.com/mutual-funds/Nuveen-High-Yield-Municipal-Bond-Fund
> 
> Is there any issue if we just lowercase all URLs? I can't think of an issue 
> that would be caused, but that's why I'm asking the gurus!
> 
> Brett Moyer



RE: URL Case Sensitive/Insensitive

2018-12-11 Thread Moyer, Brett
https://www.nuveen.com/mutual-funds/nuveen-high-yield-municipal-bond-fund
https://www.nuveen.com/mutual-funds/Nuveen-High-Yield-Municipal-Bond-Fund

Is there any issue if we just lowercase all URLs? I can't think of an issue 
that would be caused, but that's why I'm asking the gurus!

Brett Moyer
   



Re: URL Case Sensitive/Insensitive

2018-12-11 Thread Toke Eskildsen
Moyer, Brett  wrote:
> What is the best practice on URL case?

I work with web archiving, and URL normalisation is quite a tricky thing. The 
software we use is https://github.com/ukwa/webarchive-discovery and in there a 
lot of energy has been spent on the subject. Long story short, we index 2 
forms: the unmodified raw one and a heavily normalised one.

Question: Is
  https://www.example.com/FOO/
the same as
  http://example.com/foo
?

Technically it is not, because
* There might be different content served for different protocols (highly 
unlikely)
* www might mean something (unlikely)
* FOO might be another resource than foo (unlikely)
* The trailing slash might be significant (seen on some Apache proxy-setups)

There are other rules, such as trying to remove session IDs, everything after 
#, and so on. None of the individual steps produces many false positives by 
itself, but they do add up.

For most practical purposes (URL-lookup & grouping, following links between 
archived pages, resolving embedded resources from pages) we use the heavily 
normalised URL.
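
To make the idea concrete, here is a rough sketch of a "heavily normalised" 
lookup key in Python. It is not the webarchive-discovery code, only an 
illustration of the rules listed above; the function name and the set of 
session parameter names are assumptions made up for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical list of session-like query parameters to drop.
SESSION_PARAMS = {"jsessionid", "phpsessid", "sid"}


def heavy_normalise(url):
    """Build a lossy lookup key: no scheme, no www, lowercase, no fragment."""
    parts = urlsplit(url)
    host = parts.hostname or ""       # hostname is already lowercased, port dropped
    if host.startswith("www."):
        host = host[4:]               # treat www.example.com like example.com
    path = parts.path.lower().rstrip("/")  # ignore path case and trailing slash
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    key = host + path
    if kept:
        key += "?" + urlencode(kept)
    return key                        # the fragment (everything after #) is never used


raw_a = "https://www.example.com/FOO/"
raw_b = "http://example.com/foo"
print(heavy_normalise(raw_a) == heavy_normalise(raw_b))
```

Indexing both the raw URL and a key like this keeps the original available 
while grouping and lookups run against the normalised form.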

- Toke Eskildsen


Re: URL Case Sensitive/Insensitive

2018-12-11 Thread Erick Erickson
What do you mean by "url case"? No, I'm not being snarky.

The value returned in a doc is very different than the value searched.
The stored data is the original input without going through any
filters.

If you mean the value _returned_ by Solr from a stored field, then the
case is exactly whatever was input originally. To get a consistent
case, I'd change it on the client side before sending it to Solr, or
use, say, a ScriptUpdateProcessor to change it on the way in to Solr.

If you're talking about _searching_ the URL, you need to put the
appropriate filters in your analysis chain. Most distributions have a
"lowercase" type that is a KeywordTokenizer plus a LowerCaseFilter. That
still treats the searchable text as a single token, so for instance
you wouldn't be able to search for url:com except with pre-and-post
wildcards, which is not a good pattern. If you want to search sub-parts
of a URL, you'll use one of the text-based types to break it up into
tokens. Even in this case, though, the returned data is still in the
original case, since it's the stored data that's returned.
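
The difference between the stored value and the indexed tokens can be 
pictured with a toy model in Python. This is not Solr code and does not 
reproduce Solr's real tokenizers; both function names are made up, and the 
"text-like" splitter is only a rough stand-in for a text-based field type.

```python
import re


def keyword_lowercase_tokens(value):
    """Toy model of KeywordTokenizer + LowerCaseFilter: one lowercased token."""
    return [value.lower()]


def text_like_tokens(value):
    """Toy model of a text-based type: split on non-alphanumerics, lowercase."""
    return [tok.lower() for tok in re.split(r"[^0-9A-Za-z]+", value) if tok]


# The stored value is returned exactly as it was sent in.
stored = "https://www.nuveen.com/mutual-funds/Nuveen-High-Yield-Municipal-Bond-Fund"

single_token = keyword_lowercase_tokens(stored)
sub_tokens = text_like_tokens(stored)

print("com" in single_token)  # the whole URL is one token, so "com" is not a term
print("com" in sub_tokens)    # the split form has "com" as its own term
```

In both cases the searchable tokens are lowercased, while the stored string 
handed back in results is untouched.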

Best,
Erick
On Tue, Dec 11, 2018 at 8:38 AM Moyer, Brett  wrote:
>
> Hello, I'm new to Solr and have been using it for a few months. A recent 
> question came up from our business partners about URL casing. Previously 
> their URLs were upper case; they made a change and now they are all lower 
> case. Both pages/URLs are still 
> accessible so there are duplicates in Solr. They are requesting all URLs be 
> evaluated as lowercase. What is the best practice on URL case? Is there a 
> negative to making all lowercase? I know I can drop the index and re-crawl to 
> fix it, but long term how should URL case be treated? Thanks!
>
> Brett Moyer
>
> *
> This e-mail may contain confidential or privileged information.
> If you are not the intended recipient, please notify the sender immediately 
> and then delete it.
>
> TIAA
> *