Re: Case-insensitive string equality

2017-09-05 Thread Steve D'Aprano

On Wed, 6 Sep 2017 12:27 am, Grant Edwards wrote:

> On 2017-09-03, Gregory Ewing  wrote:
>> Stefan Ram wrote:
>>>   But of
>>>   course, actually the rules of orthography require "Maße" or
>>>   "Masse" and do not allow "MASSE" or "MASZE", just as in
>>>   English, "English" has to be written "English" and not
>>>   "english" or "ENGLISH".
>>
>> While "english" is wrong in English, there's no rule
>> against using  "ENGLISH" as an all-caps version.
> 
> Perhaps there's no "rule" in your book of rules, but it's almost
> universally considered bad style and you will lose points with your
> teacher, editor, etc.

And yet editors frequently use ALL CAPS for book titles and sometimes even
chapter headings, as well as the author's name. I have a shelf full of books by
STEPHEN KING behind me.

Likewise for movie titles on DVDs, album titles on CDs, address labels (I'm
looking at a letter addressed to MR S DAPRANO right now), labels on food and
medication.

The late and much-lamented English humourist Sir Terry Pratchett wrote almost
forty books including the character of Death, who SPEAKS IN CAPITALS. (And his
superior, Azrael, does the same in letters almost the size of the entire page.
Fortunately he says only a single word in the entire series.)

Many government forms and databases require all caps, presumably because it is
easier for handwriting recognition, or maybe it's just a leftover from the days
of typewriters.

> On the inter-tubes, it generally indicates you're shouting, or just a
> kook.  I guess if either of those is true, then it's good style.

It is true that in general we don't write ordinary prose in all caps, but there
are plenty of non-kook uses for it. But speaking of capitals on the Internet:

 HI EVERYBODY!!
 try pressing the the Caps Lock key
 O THANKS!!! ITS SO MUCH EASIER TO WRITE NOW!!!
 fuck me

http://bash.org/?835030

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Case-insensitive string equality

2017-09-05 Thread Chris Angelico

On Wed, Sep 6, 2017 at 12:27 AM, Grant Edwards
 wrote:
> On 2017-09-03, Gregory Ewing  wrote:
>> Stefan Ram wrote:
>>>   But of
>>>   course, actually the rules of orthography require "Maße" or
>>>   "Masse" and do not allow "MASSE" or "MASZE", just as in
>>>   English, "English" has to be written "English" and not
>>>   "english" or "ENGLISH".
>>
>> While "english" is wrong in English, there's no rule
>> against using  "ENGLISH" as an all-caps version.
>
> Perhaps there's no "rule" in your book of rules, but it's almost
> universally considered bad style and you will lose points with your
> teacher, editor, etc.
>
> On the inter-tubes, it generally indicates you're shouting, or just a
> kook.  I guess if either of those is true, then it's good style.

ENGLISH, Doc!

*Doc Brown proceeds to explain and/or demonstrate*

ChrisA

Re: Case-insensitive string equality

2017-09-05 Thread Grant Edwards

On 2017-09-03, Gregory Ewing  wrote:
> Stefan Ram wrote:
>>   But of
>>   course, actually the rules of orthography require "Maße" or
>>   "Masse" and do not allow "MASSE" or "MASZE", just as in
>>   English, "English" has to be written "English" and not
>>   "english" or "ENGLISH".
>
> While "english" is wrong in English, there's no rule
> against using  "ENGLISH" as an all-caps version.

Perhaps there's no "rule" in your book of rules, but it's almost
universally considered bad style and you will lose points with your
teacher, editor, etc.

On the inter-tubes, it generally indicates you're shouting, or just a
kook.  I guess if either of those is true, then it's good style.

-- 
Grant Edwards   grant.b.edwardsYow!
  at   
BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-
  gmail.com


Re: Case-insensitive string equality

2017-09-05 Thread Chris Angelico

On Tue, Sep 5, 2017 at 6:05 PM, Stefan Behnel  wrote:
> Steve D'Aprano schrieb am 02.09.2017 um 02:31:
>> - the German eszett, ß, which has two official[1] uppercase forms: 'SS'
>> and an uppercase eszett
>
> I wonder if there is an equivalent to Godwin's Law with respect to
> character case related discussions and the German ß.

Given that it's such a useful test case, I think it's inevitable (the
first part of Godwin's Law), but not a conversation killer (the second
part, and not (AFAIK) part of the original statement). Either that, or
the Turkish Iı/İi.

ChrisA

Re: Case-insensitive string equality

2017-09-05 Thread Stefan Behnel

Steve D'Aprano schrieb am 02.09.2017 um 02:31:
> - the German eszett, ß, which has two official[1] uppercase forms: 'SS'
> and an uppercase eszett

I wonder if there is an equivalent to Godwin's Law with respect to
character case related discussions and the German ß.

Stefan


Re: Case-insensitive string equality

2017-09-04 Thread Rick Johnson

Steven D'Aprano wrote:
[...]
> (1) Add a new string method, which performs a case-
> insensitive equality test. Here is a potential
> implementation, written in pure Python:
> 
> def equal(self, other):
>     if self is other:
>         return True
>     if not isinstance(other, str):
>         raise TypeError
>     if len(self) != len(other):
>         return False
>     casefold = str.casefold
>     for a, b in zip(self, other):
>         if casefold(a) != casefold(b):
>             return False
>     return True
> 
> Alternatively: how about a === triple-equals operator to do
> the same thing?

A good idea. But wouldn't that specific usage be
inconsistent (even backwards) with the semantics of "===" as
defined in most languages that use "==="?

For me -- and this comment will be going beyond the scope of
strings, and possibly, beyond the scope of this thread -- I
feel that Python is missing a pair of equality testing
devices (sugared or not; but preferably sugared), that
define a universal means by which all types can be tested
with either "superficial equality" (aka: ==) or "deep
equality" (aka: ===).

However, such a design (whilst quite intuitive) would break
equality testing as it exists today in Python. For instance,
it would mean that:

(1) Superficial Equality

>>> "abc" == "abc"
True
>>> "abc" == "ABC" 
True

(2) Deep Equality

>>> "abc" === "abc"
True
>>> "abc" === "ABC"
False

And I don't think even GvR's time machine will be much
help here. :-(


Re: Case-insensitive string equality

2017-09-04 Thread Tim Chase

On 2017-09-02 12:21, Steve D'Aprano wrote:
> On Fri, 1 Sep 2017 01:29 am, Tim Chase wrote:
> > I'd want to have an optional parameter to take locale into
> > consideration.  E.g.  
> 
> Does regular case-sensitive equality take the locale into
> consideration?

No.  Python says that .casefold()

https://docs.python.org/3/library/stdtypes.html#str.casefold

implements the Unicode case-folding specification

ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt

which calls out the additional processing for Turkic languages:

# T: special case for uppercase I and dotted uppercase I
#- For non-Turkic languages, this mapping is normally not used.
#- For Turkic languages (tr, az), this mapping can be used
#  instead of the normal mapping for these characters.
#  Note that the Turkic mappings do not maintain canonical
#  equivalence without additional processing.
#  See the discussions of case mapping in the Unicode Standard
#  for more information.

So it looks like what Python lacks is that "additional processing",
after which .casefold() should solve the problems.
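
A minimal sketch of what that "additional processing" could look like,
assuming the only extra step is to apply the two "T" mappings from
CaseFolding.txt (U+0049 -> U+0131 and U+0130 -> U+0069) before the default
fold. The name turkic_casefold is hypothetical; nothing by that name exists
in the standard library.

```python
# The two Turkic ("T") mappings listed in CaseFolding.txt:
#   U+0049 LATIN CAPITAL LETTER I                -> U+0131 (dotless i)
#   U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE -> U+0069 (i)
_TURKIC = str.maketrans({"I": "\u0131", "\u0130": "i"})


def turkic_casefold(s):
    """Hypothetical helper: apply the Turkic mappings, then the default fold."""
    return s.translate(_TURKIC).casefold()


# The default fold treats these two spellings as different; the Turkic one does not.
upper_name = "D\u0130YARBAKIR"   # dotted capital I, then plain capital I
lower_name = "diyarbak\u0131r"   # plain i, then dotless i
print(upper_name.casefold() == lower_name.casefold())
print(turkic_casefold(upper_name) == turkic_casefold(lower_name))
```

Whether the caller is dealing with Turkic text still has to be decided
outside the function; the sketch only shows that the mapping itself is small.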

According to my reading, if locale doesn't play a part in the equation

   s1.casefold() == s2.casefold()

should suffice.  Any case-insensitive code using .upper() or .lower()
instead of .casefold() becomes a code-smell.
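
To illustrate why .lower() and .upper() are the weaker choice, here is a
small sketch. The function name caseless_equal is only for illustration; the
test strings use the German sharp s and its capital form U+1E9E.

```python
def caseless_equal(s1, s2):
    """Locale-independent caseless comparison using the default case folding."""
    return s1.casefold() == s2.casefold()


# lower() leaves the sharp s alone, so this pair compares unequal with lower().
pair_a = ("Stra\u00dfe", "STRASSE")
# upper() leaves capital sharp s (U+1E9E) alone, so this pair fails with upper().
pair_b = ("STRA\u1e9eE", "stra\u00dfe")

for left, right in (pair_a, pair_b):
    print(left, right,
          "lower:", left.lower() == right.lower(),
          "upper:", left.upper() == right.upper(),
          "casefold:", caseless_equal(left, right))
```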

> If regular case-sensitive string comparisons don't support the
> locale, why should case-insensitive comparisons be required to?

Adding new code to Python that just does what is already available is
indeed bad justification. But adding *new* functionality that handles
the locale-aware-case-insensitive-comparison could be justified.

> As far as I'm concerned, the only "must have" is that ASCII letters
> do the right thing. Everything beyond that is a "quality of
> implementation" issue.

But for this use-case, we already have .casefold() which does the job
and even extends beyond plain 7-bit ASCII to most of the typical
i18n/Unicode use-cases.

-tkc


Re: Capital ß [was Re: Case-insensitive string equality]

2017-09-04 Thread MRAB


On 2017-09-04 03:28, Steve D'Aprano wrote:
> On Sat, 2 Sep 2017 01:48 pm, Stefan Ram wrote:
>> Steve D'Aprano  writes:
>>> [1] I believe that the German government has now officially recognised the
>>> uppercase form of ß.
>>
>>   [skip to the last paragraph for some "ß" content,
>>   unless you want to read details about German spelling rules.]
>>
>>   The German language is as free as the English one. It does
>>   not come from a government.
>
> Nevertheless, even in English there are de facto rules about what you can and
> cannot use as text for official purposes. In most countries, you cannot change
> your name to an unpronounceable "love symbol" as the late Artist Formerly Known
> As Prince did. You can't fill in your tax using Alienese
>
> http://futurama.wikia.com/wiki/Alienese
>
> or even Vietnamese, Greek or Arabic.
>
> In Australia, the Victorian state government Department of Births Deaths and
> Marriages doesn't even accept such unexceptional and minor variants as Zöe for
> Zoe. Of course you are free to call yourself Zöe when you sign your emails, but
> your birth certificate, passport and drivers licence will show it as Zoe.


Of course they reject "Zöe": the correct spelling is "Zoë". :-)

[snip]

Capital ß [was Re: Case-insensitive string equality]

2017-09-03 Thread Steve D'Aprano

On Sat, 2 Sep 2017 01:48 pm, Stefan Ram wrote:

> Steve D'Aprano  writes:
>>[1] I believe that the German government has now officially recognised the
>>uppercase form of ß.
> 
>   [skip to the last paragraph for some "ß" content,
>   unless you want to read details about German spelling rules.]
> 
>   The German language is as free as the English one. It does
>   not come from a government.

Nevertheless, even in English there are de facto rules about what you can and
cannot use as text for official purposes. In most countries, you cannot change
your name to an unpronounceable "love symbol" as the late Artist Formerly Known
As Prince did. You can't fill in your tax using Alienese

http://futurama.wikia.com/wiki/Alienese

or even Vietnamese, Greek or Arabic.

In Australia, the Victorian state government Department of Births Deaths and
Marriages doesn't even accept such unexceptional and minor variants as Zöe for
Zoe. Of course you are free to call yourself Zöe when you sign your emails, but
your birth certificate, passport and drivers licence will show it as Zoe.

>   The 16 states (Bundesländer), agreed to a common institution
>   ("Der Rat für deutsche Rechtschreibung") to write down the
>   rules for their /schools/. The federal government is not
>   involved. Most publishing houses volunteered to follow those
>   school rules. Outside of schools or binding contracts,
>   everyone is free to write as he likes.

I'm not suggesting that the Spelling Police will come arrest me in either
Germany or the UK/Australia if I were to write my name Ƽτευεη. Switzerland on
the other hand ... *wink*

But there are very strong conventions about what is acceptable, and often there
are actual laws in place that limit what letters are used in official
documentation and records, what is taught in schools, etc. 

The 1996 spelling reforms, and their legal status, are described here:

https://en.wikipedia.org/wiki/German_orthography_reform_of_1996

and Der Rat für deutsche Rechtschreibung:

https://en.wikipedia.org/wiki/Council_for_German_Orthography

(Sorry for not linking to the German versions as well.)

>   The "ß" sometimes has been uppercased to "SS" and sometimes
>   to "SZ". 

Historically, some German publishers used a distinct uppercase ß, while others
used a ligature of SZ, explicitly stating that this was an interim measure
until they decided on a good looking uppercase ß.

More about capital ß:

https://typography.guru/journal/germanys-new-character/

https://medium.com/@typefacts/the-german-capital-letter-eszett-e0936c1388f8

https://en.wikipedia.org/wiki/Capital_%E1%BA%9E

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.


Re: Case-insensitive string equality

2017-09-03 Thread Steve D'Aprano

On Mon, 4 Sep 2017 09:10 am, Gregory Ewing wrote:

> Stefan Ram wrote:
>>   But of
>>   course, actually the rules of orthography require "Maße" or
>>   "Masse" and do not allow "MASSE" or "MASZE", just as in
>>   English, "English" has to be written "English" and not
>>   "english" or "ENGLISH".
> 
> While "english" is wrong in English, there's no rule
> against using  "ENGLISH" as an all-caps version.

It's not always wrong. If you're e.e. cummings, then you're allowed to avoid
capitals. (If you're anyone else, you're either a derivative hack, or lazy.)
And if you are referring to the spin applied to billiard balls, it is
acceptable to write it as english. 

> Are you saying that all-caps text is not allowed in
> German? If so, that's very different from English.

Germans use ALLCAPS for headlines, book titles, emphasis etc just as English
speakers do. For example: http://www.spiegel.de/politik/index.html

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.


Re: Case-insensitive string equality

2017-09-03 Thread Gregory Ewing


Stefan Ram wrote:
>   But of
>   course, actually the rules of orthography require "Maße" or
>   "Masse" and do not allow "MASSE" or "MASZE", just as in
>   English, "English" has to be written "English" and not
>   "english" or "ENGLISH".


While "english" is wrong in English, there's no rule
against using  "ENGLISH" as an all-caps version.

Are you saying that all-caps text is not allowed in
German? If so, that's very different from English.

--
Greg

Re: Case-insensitive string equality

2017-09-03 Thread Pavol Lisy

On 9/3/17, Steve D'Aprano  wrote:
> On Sun, 3 Sep 2017 05:17 pm, Stephan Houben wrote:
>
>> Generally speaking, the more you learn about case normalization,
>> the more attractive case sensitivity looks
>
> Just because something is hard doesn't mean it's not worth doing.
>
> And just because you can't please all the people all the time doesn't mean
> it's not worthwhile.

I was thinking about a comparison where false positives are acceptable (and
a well-defined property).

For example, if somebody has a case-sensitive FS and wants to publish
files while avoiding name collisions on any case-insensitive FS, then a
comparison that errs towards false-positive equality could be useful.

The backward compatibility problem could then (theoretically) be
reduced to extending the equivalence classes in future.

I mean something like ->

equal = lambda a, b: any(f(a) == f(b) for f in C)  # where C is an
# extensible list of comparison functions

Do you think such an equivalence relation could solve the problems
which you described in the first mail in this thread?
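
Spelled out as a runnable sketch, the idea might look like this. The contents
of C below are assumptions chosen for illustration, and maybe_equal is a
hypothetical name. Note that taking any() over several key functions need not
be transitive, so strictly speaking the result may not be an equivalence
relation.

```python
import unicodedata

# Extensible list of key functions; two strings "may be equal" if any key agrees.
C = [
    str.casefold,
    str.upper,
    lambda s: unicodedata.normalize("NFKC", s).casefold(),
]


def maybe_equal(a, b, keys=C):
    """Deliberately generous comparison: false positives are acceptable."""
    return any(f(a) == f(b) for f in keys)


# Would these two file names collide on some case-insensitive file system?
print(maybe_equal("ReadMe.txt", "README.TXT"))
print(maybe_equal("notes.txt", "todo.txt"))
```

Adding a new function to C can only turn a False into a True, which is what
makes this kind of extension backward compatible for collision checking.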

And if we are trying to "solve" the Unicode problem, why not? ->

a ≐ b
a ⋵ L

Re: Case-insensitive string equality

2017-09-03 Thread Steve D'Aprano

On Sun, 3 Sep 2017 05:17 pm, Stephan Houben wrote:

> Generally speaking, the more you learn about case normalization,
> the more attractive case sensitivity looks

Just because something is hard doesn't mean it's not worth doing.

And just because you can't please all the people all the time doesn't mean it's
not worthwhile.

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.


Re: Case-insensitive string equality

2017-09-03 Thread Chris Angelico

On Sun, Sep 3, 2017 at 5:17 PM, Stephan Houben
 wrote:
> Generally speaking, the more you learn about case normalization,
> the more attractive case sensitivity looks ;-)

Absolutely agreed. My general recommendation is to have two vastly
different concepts: "equality matching" and "searching". Equality is
case sensitive and strict; NFC/NFD normalization is about all you can
do. Searching, on the other hand, can be case insensitive, do
NFKC/NFKD normalization, and can even (in many contexts) strip off
diacritical marks altogether, allowing people to search for "resume"
and find "résumé", or to search for "muller" and find "Müller". There
can be a whole swathe of other transformations done in search
normalization too (collapse whitespace, fold punctuation into a few
types, etc), though of course you need to be aware of context. But IMO
it's far safer to NOT define things in terms of "equality".
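
A rough sketch of such a search normalization, assuming the three steps named
above (NFKD, dropping combining marks, case folding) plus whitespace
collapsing. The name search_key is hypothetical, and stripping diacritics is
only appropriate in some contexts.

```python
import unicodedata


def search_key(text):
    """Build a lossy key for searching; not suitable for equality tests."""
    # Compatibility-decompose, so that accented letters become base + mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (accents, umlauts and so on).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold case and collapse runs of whitespace.
    return " ".join(stripped.casefold().split())


def search(needle, haystack):
    return search_key(needle) in search_key(haystack)


print(search("resume", "Please send your R\u00e9sum\u00e9"))
print(search("muller", "Herr M\u00fcller"))
```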

ChrisA

Re: Case-insensitive string equality

2017-09-03 Thread Stephan Houben

Op 2017-09-02, Pavol Lisy schreef :
> But problem is that if somebody like to have stable API it has to be
> changed to "do what the Unicode consortium said (at X.Y. )" :/

It is even more exciting. Presumably a reason to have case-insensitivity
is to be compatible with existing popular case-insensitive systems.

So here is, for your entertainment, how some of these systems work.

* Windows NTFS case-insensitive file system

  An NTFS file system contains a hidden table $UpCase which maps
  characters to their upcase variant. Note:

  1. This maps characters in the BMP *only*, so NTFS treats
     characters outside the BMP as case-sensitive.
  2. Every character can in turn only be mapped into a single
     BMP character, so ß -> SS is not possible.
  3. The table is in practice dependent on the language of the
     Windows system which created it (so a Turkish NTFS partition
     would contain i -> İ), but in theory can contain any allowed
     mapping: I can create an NTFS filesystem which maps a -> b.
  4. Older Windows versions generated tables which
     were broken for certain languages (NT 3.51/Georgian).
     You may still have some NTFS partition with such a table lying
     around.
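
As a toy model only, points 1 and 2 can be imitated in Python. The table here
is derived from str.upper(), which is an assumption made for illustration: a
real $UpCase table is stored on the volume, operates on UTF-16 code units, and
will differ in detail. The names build_upcase_table and names_collide are
hypothetical.

```python
def build_upcase_table():
    """Toy stand-in for $UpCase: one BMP character maps to one BMP character."""
    table = {}
    for cp in range(0x10000):
        upper = chr(cp).upper()
        # Keep only one-to-one mappings that stay inside the BMP.
        if len(upper) == 1 and ord(upper) < 0x10000 and ord(upper) != cp:
            table[cp] = ord(upper)
    return table


UPCASE = build_upcase_table()


def names_collide(a, b, table=UPCASE):
    """True if two names compare equal after per-character upcasing."""
    return a.translate(table) == b.translate(table)


print(names_collide("ReadMe.txt", "README.TXT"))
# No single-character uppercase for the sharp s, so no SS mapping is possible.
print(names_collide("stra\u00dfe", "STRASSE"))
# Deseret letters live outside the BMP, so the table never touches them.
print(names_collide("\U00010428", "\U00010400"))
```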

* macOS case-insensitive file system

  1. HFS+ is based on Unicode 3.2; this is fixed in stone because of
     backward compatibility.
  2. APFS is based on Unicode 9.0 and does normalization as well.

Generally speaking, the more you learn about case normalization,
the more attractive case sensitivity looks ;-)
There is also slim hope of having a single caseInsensitiveCompare function
which "Does The Right Thing"™.

Stephan

Re: Case-insensitive string equality

2017-09-02 Thread Pavol Lisy

On 9/2/17 at 4:21, Steve D'Aprano  wrote:
> If regular case-sensitive string comparisons don't support the locale, why
> should case-insensitive comparisons be required to?

I think that Chris answered this very well earlier:

On 9/2/17 at 2:53 AM, Chris Angelico  wrote:
> On Sat, Sep 2, 2017 at 10:31 AM, Steve D'Aprano
> But code is often *wrong* due to backward compatibility concerns. Then you
> have to decide whether, for a brand new API, it's better to "do the same as
> the regex module" or to "do what the Unicode consortium says".

But the problem is that if somebody would like to have a stable API, it has
to be changed to "do what the Unicode consortium said (at X.Y. )" :/

Maybe it is simpler to write an intelligent linter to catch wrong comparisons?

Re: Case-insensitive string equality

2017-09-01 Thread Chris Angelico

On Sat, Sep 2, 2017 at 12:21 PM, Steve D'Aprano
 wrote:
> On Fri, 1 Sep 2017 01:29 am, Tim Chase wrote:
>
>> On 2017-08-31 07:10, Steven D'Aprano wrote:
>>> So I'd like to propose some additions to 3.7 or 3.8.
>>
>> Adding my "yes, a case-insensitive equality-check would be useful"
>> with the following concerns:
>>
>> I'd want to have an optional parameter to take locale into
>> consideration.  E.g.
>
> Does regular case-sensitive equality take the locale into consideration?
>
> How do I convince Python to return true for these?
>
> 'i'.upper() == 'İ'
> 'I'.lower() == 'ı'
>
>
> I'm 99% sure that these are rhetorical questions where the answers are
> obviously:
>
> - No it doesn't.
> - And you can't.
>
> If regular case-sensitive string comparisons don't support the locale, why
> should case-insensitive comparisons be required to?
>
> We should not confuse "nice to have" for "must have". As far as I'm concerned,
> the only "must have" is that ASCII letters do the right thing. Everything
> beyond that is a "quality of implementation" issue.

The only "must have" is that the locale-independent conversions do the
right thing. We already have str.casefold() that correctly handles
>99% of situations; the easiest way to craft something like this is to
define it in terms of that.

> In any case, thanks to everyone for their feedback. Clearly there's not enough
> support for this for me to even bother taking it to Python-Ideas.

Agreed; if this were important enough for someone to want to run the
gauntlet of -ideas and -dev, I'd predict that it would be one of those
VERY hotly debated topics.

ChrisA


Re: Case-insensitive string equality

2017-09-01 Thread Steve D'Aprano

On Fri, 1 Sep 2017 01:29 am, Tim Chase wrote:

> On 2017-08-31 07:10, Steven D'Aprano wrote:
>> So I'd like to propose some additions to 3.7 or 3.8.
> 
> Adding my "yes, a case-insensitive equality-check would be useful"
> with the following concerns:
> 
> I'd want to have an optional parameter to take locale into
> consideration.  E.g.

Does regular case-sensitive equality take the locale into consideration?

How do I convince Python to return true for these?

'i'.upper() == 'İ'
'I'.lower() == 'ı'

I'm 99% sure that these are rhetorical questions where the answers are
obviously:

- No it doesn't.
- And you can't.

If regular case-sensitive string comparisons don't support the locale, why
should case-insensitive comparisons be required to?

We should not confuse "nice to have" for "must have". As far as I'm concerned,
the only "must have" is that ASCII letters do the right thing. Everything
beyond that is a "quality of implementation" issue.

In any case, thanks to everyone for their feedback. Clearly there's not enough
support for this for me to even bother taking it to Python-Ideas.

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.


Re: Case-insensitive string equality

2017-09-01 Thread Chris Angelico

On Sat, Sep 2, 2017 at 10:31 AM, Steve D'Aprano
 wrote:
> On Sat, 2 Sep 2017 01:41 am, Chris Angelico wrote:
>
>> Aside from lower(), which returns the string unchanged, the case
>> conversion rules say that this contains two letters.
>
> Do you have a reference to that?
>
> I mean, where in the Unicode case conversion rules is that stated? You cannot
> take the behaviour of Python as necessarily correct here -- it may be that the
> behaviour of Python is erroneous.

Yep! It's all in here.

ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt

> For what it's worth, even under Unicode's own rules, there are always going to
> be odd corner cases that surprise people. The most obvious cases are:
>
> You can't keep everybody happy. Doesn't mean we can't meet 99% of the
> use cases.
>
> After all, what do you think the regex case insensitive matching does?

Honestly, I don't know what it does without checking. But code is
often *wrong* due to backward compatibility concerns. Then you have to
decide whether, for a brand new API, it's better to "do the same as
the regex module" or to "do what the Unicode consortium says".

As it turns out, the Python 're' module doesn't match the letters
against the ligature:

>>> re.search("F", "\N{LATIN SMALL LIGATURE FI}", re.IGNORECASE)
>>> re.search("f", "\N{LATIN SMALL LIGATURE FI}", re.IGNORECASE)
>>> re.search("I", "\N{LATIN SMALL LIGATURE FI}", re.IGNORECASE)
>>> re.search("i", "\N{LATIN SMALL LIGATURE FI}", re.IGNORECASE)
>>> re.search("S", "\N{LATIN SMALL LETTER SHARP S}", re.IGNORECASE)
>>> re.search("s", "\N{LATIN SMALL LETTER SHARP S}", re.IGNORECASE)
>>>

I would consider that code to be incorrect.
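
For comparison, here is a sketch of the same checks done by folding both
sides first. The helper name folded_contains is hypothetical; it uses plain
substring search on the casefolded strings rather than the 're' module.

```python
import re

ligature = "\N{LATIN SMALL LIGATURE FI}"
sharp_s = "\N{LATIN SMALL LETTER SHARP S}"


def folded_contains(haystack, needle):
    """Caseless substring test: fold both sides, then search."""
    return needle.casefold() in haystack.casefold()


for needle, haystack in (("F", ligature), ("I", ligature), ("S", sharp_s)):
    via_re = re.search(needle, haystack, re.IGNORECASE) is not None
    via_fold = folded_contains(haystack, needle)
    print(repr(needle), "re:", via_re, "casefold:", via_fold)
```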

ChrisA

Re: Case-insensitive string equality

2017-09-01 Thread Steve D'Aprano

On Sat, 2 Sep 2017 01:41 am, Chris Angelico wrote:

> Aside from lower(), which returns the string unchanged, the case
> conversion rules say that this contains two letters.

Do you have a reference to that?

I mean, where in the Unicode case conversion rules is that stated? You cannot
take the behaviour of Python as necessarily correct here -- it may be that the
behaviour of Python is erroneous.

For what it's worth, even under Unicode's own rules, there are always going to be
odd corner cases that surprise people. The most obvious cases are:

- dotted and dotless i

- the German eszett, ß, which has two official[1] uppercase forms: 'SS' and an
uppercase eszett

- long s, ſ, which may or may not be treated as distinct from s

- likewise for ligatures -- is æ a ligature, or is it Old English ash?

You can't keep everybody happy. Doesn't mean we can't meet 99% of the use cases.

After all, what do you think the regex case insensitive matching does?

[1] I believe that the German government has now officially recognised the
uppercase form of ß.

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.


Re: Case-insensitive string equality

2017-09-01 Thread Chris Angelico

On Sat, Sep 2, 2017 at 10:09 AM, Steve D'Aprano
 wrote:
> The question wasn't what "\N{LATIN SMALL LIGATURE FI}".upper() would find,
> but "\N{LATIN SMALL LIGATURE FI}".
>
> Nor did they ask about
>
> "\N{LATIN SMALL LIGATURE FI}".replace("\N{LATIN SMALL LIGATURE FI}", "Surprise!")
>
>> So what's the definition of "case insensitive find"? The most simple
>> and obvious form is:
>>
>> def case_insensitive_find(self, other):
>>     return self.casefold().find(other.casefold())
>
> That's not the definition, that's an implementation.

It's a reference implementation that defines certain semantics.
Obviously you can *actually* implement it using some kind of C-level
loop or something, as long as you can define the semantics somehow.

So what IS your definition of "case insensitive find"? Do you
case-fold both strings or just one? What do you do about
length-changing case folding?

ChrisA

Re: Case-insensitive string equality

2017-09-01 Thread Steve D'Aprano

On Sat, 2 Sep 2017 01:41 am, Chris Angelico wrote:

> On Fri, Sep 1, 2017 at 11:22 PM, Steve D'Aprano
>  wrote:
>> On Fri, 1 Sep 2017 09:53 am, MRAB wrote:
>>
>>> What would you expect the result would be for:
>>>
>>>  "\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("F")
>>>
>>>  "\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("I")
>>
>> That's easy.
>>
>> -1 in both cases, since neither "F" nor "I" is found in either string. We can
>> prove this by manually checking:
>>
>> py> for c in "\N{LATIN SMALL LIGATURE FI}":
>> ...     print(c, 'F' in c, 'f' in c)
>> ...     print(c, 'I' in c, 'i' in c)
>> ...
>> ﬁ False False
>> ﬁ False False
>>
>>
>> If you want some other result, then you're not talking about case
>> sensitivity.
>
> >>> "\N{LATIN SMALL LIGATURE FI}".upper()
> 'FI'


The question wasn't what "\N{LATIN SMALL LIGATURE FI}".upper() would find,
but "\N{LATIN SMALL LIGATURE FI}".

Nor did they ask about

"\N{LATIN SMALL LIGATURE FI}".replace("\N{LATIN SMALL LIGATURE FI}", "Surprise!")


> So what's the definition of "case insensitive find"? The most simple
> and obvious form is:
>
> def case_insensitive_find(self, other):
>     return self.casefold().find(other.casefold())

That's not the definition, that's an implementation.


-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.


Re: Case-insensitive string equality

2017-09-01 Thread Chris Angelico

On Fri, Sep 1, 2017 at 11:22 PM, Steve D'Aprano
 wrote:
> On Fri, 1 Sep 2017 09:53 am, MRAB wrote:
>
>> What would you expect the result would be for:
>>
>>  "\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("F")
>>
>>  "\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("I")
>
> That's easy.
>
> -1 in both cases, since neither "F" nor "I" is found in either string. We can
> prove this by manually checking:
>
> py> for c in "\N{LATIN SMALL LIGATURE FI}":
> ...     print(c, 'F' in c, 'f' in c)
> ...     print(c, 'I' in c, 'i' in c)
> ...
> ﬁ False False
> ﬁ False False
>
>
> If you want some other result, then you're not talking about case sensitivity.

>>> "\N{LATIN SMALL LIGATURE FI}".upper()
'FI'
>>> "\N{LATIN SMALL LIGATURE FI}".lower()
'ﬁ'
>>> "\N{LATIN SMALL LIGATURE FI}".casefold()
'fi'

Aside from lower(), which returns the string unchanged, the case
conversion rules say that this contains two letters. So "F" exists in
the uppercased version of the string, and "f" exists in the casefolded
version.

So what's the definition of "case insensitive find"? The most simple
and obvious form is:

def case_insensitive_find(self, other):
    return self.casefold().find(other.casefold())

which would clearly return 0 and 1 for the two original searches.
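
Written as a standalone function (str cannot be given new methods from pure
Python, so the parameter names haystack and needle are an adaptation), the
sketch can be run directly. One caveat worth seeing in action: the returned
index refers to the casefolded string, which may be longer than the original.

```python
def case_insensitive_find(haystack, needle):
    """Return the index of needle in the casefolded haystack, or -1."""
    return haystack.casefold().find(needle.casefold())


ligature = "\N{LATIN SMALL LIGATURE FI}"
print(case_insensitive_find(ligature, "F"))
print(case_insensitive_find(ligature, "I"))

# The sharp s folds to two characters, so the index of the final "e" is
# one higher than its position in the original six-character string.
word = "Stra\u00dfe"
print(len(word), case_insensitive_find(word, "E"))
```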

ChrisA

Re: Case-insensitive string equality

2017-09-01 Thread Steve D'Aprano

On Thu, 31 Aug 2017 08:15 pm, Rhodri James wrote:

> I'd quibble about the name and the implementation (length is not
> preserved under casefolding), 

Yes, I'd forgotten about that.

> but I'd go for this.  The number of times 
> I've written something like this in different languages...


[...]
> The only way I can think of to get much traction with this is to have a
> separate case-insensitive string class.  That feels a bit heavyweight,
> though.

You might be right about that.



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.


Re: Case-insensitive string equality

2017-09-01 Thread Steve D'Aprano

On Fri, 1 Sep 2017 09:53 am, MRAB wrote:

> What would you expect the result would be for:
> 
>  "\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("F")
> 
>  "\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("I")

That's easy. 

-1 in both cases, since neither "F" nor "I" is found in either string. We can
prove this by manually checking:

py> for c in "\N{LATIN SMALL LIGATURE FI}":
...     print(c, 'F' in c, 'f' in c)
...     print(c, 'I' in c, 'i' in c)
...
ﬁ False False
ﬁ False False

If you want some other result, then you're not talking about case sensitivity.

If anyone wants to propose "normalisation-insensitive matching", I'll ask you to
please start your own thread rather than derailing this one with an unrelated,
and much more difficult, problem.

The proposal here is *case insensitive* matching, not Unicode normalisation. If
you want to decompose the strings, you know how to:

py> import unicodedata
py> unicodedata.normalize('NFKD', "\N{LATIN SMALL LIGATURE FI}")
'fi'

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.


Re: Case-insensitive string equality

2017-08-31 Thread Pete Forman

Steven D'Aprano  writes:

> Three times in the last week the devs where I work accidentally
> introduced bugs into our code because of a mistake with case-insensitive
> string comparisons. They managed to demonstrate three different failures:
>
> # 1
> a = something().upper()  # normalise string
> ... much later on
> if a == b.lower(): ...
>
>
> # 2
> a = something().upper()
> ... much later on
> if a == 'maildir': ...
>
>
> # 3
> a = something()  # unnormalised
> assert 'foo' in a
> ... much later on
> pos = a.find('FOO')
>
>
>
> Not every two line function needs to be in the standard library, but I've
> come to the conclusion that case-insensitive testing and searches should
> be. I've made these mistakes myself at times, as I'm sure most people
> have, and I'm tired of writing my own case-insensitive function over and
> over again.
>
>
> So I'd like to propose some additions to 3.7 or 3.8. If the feedback here
> is positive, I'll take it to Python-Ideas for the negative feedback :-)
>
>
> (1) Add a new string method, which performs a case-insensitive equality
> test. Here is a potential implementation, written in pure Python:
>
>
> def equal(self, other):
>     if self is other:
>         return True
>     if not isinstance(other, str):
>         raise TypeError
>     if len(self) != len(other):
>         return False
>     casefold = str.casefold
>     for a, b in zip(self, other):
>         if casefold(a) != casefold(b):
>             return False
>     return True
>
> Alternatively: how about a === triple-equals operator to do the same
> thing?
>
>
>
> (2) Add keyword-only arguments to str.find and str.index:
>
> casefold=False
>
> which does nothing if false (the default), and switches to a case-
> insensitive search if true.
>
>
>
>
> Alternatives:
>
> (i) Do nothing. The status quo wins a stalemate.
>
> (ii) Instead of str.find or index, use a regular expression.
>
> This is less discoverable (you need to know regular expressions) and
> harder to get right than to just call a string method. Also, I expect
> that invoking the re engine just for case insensitivity will be a lot
> more expensive than a simple search need be.
>
> (iii) Not every two line function needs to be in the standard library.
> Just add this to the top of every module:
>
> def equal(s, t):
>     return s.casefold() == t.casefold()
>
>
> That's the status quo wins again. It's an annoyance. A small
> annoyance, but multiplied by the sheer number of times it happens, it
> becomes a large annoyance. I believe the annoyance factor of
> case-insensitive comparisons outweighs the "two line function"
> objection.
>
> And the two-line "equal" function doesn't solve the problem for find
> and index, or for sets, dicts, list.index and the `in` operator either.
>
>
> Unsolved problems:
>
> This proposal doesn't help with sets and dicts, list.index and the `in`
> operator either.
>
>
>
> Thoughts?

This seems to me to be rather similar to sort() and sorted(). How about
giving equals() an optional parameter key, and perhaps the older cmp?
Using casefold or upper or lower would satisfy many use cases but also
allow Unicode or more locale specific normalization to be applied.
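
A minimal sketch of that idea, assuming a free function named equals (a
hypothetical name, not an existing API) whose key defaults to str.casefold.
The nfc_casefold key is just one possible stricter choice, shown to illustrate
how normalization could be plugged in:

```python
import unicodedata

def equals(a, b, key=str.casefold):
    """Hypothetical helper: compare two strings after applying key to each."""
    return key(a) == key(b)

def nfc_casefold(s):
    # One possible stricter key: canonical composition, then case folding.
    return unicodedata.normalize("NFC", s).casefold()

print(equals("Maildir", "MAILDIR"))                   # True
print(equals("Maildir", "MAILDIR", key=str.upper))    # True
print(equals("\u00f6", "o\u0308"))                    # False: different code points
print(equals("\u00f6", "o\u0308", key=nfc_casefold))  # True
```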

The shortcircuiting in a character based comparison holds little appeal
for me. I generally find that a string is a more useful concept than a
collection of characters.

+1 for using an affix in the name to represent a normalized version of
the input.

-- 
Pete Forman

Re: Case-insensitive string equality

2017-08-31 Thread Tim Chase

On 2017-09-01 00:53, MRAB wrote:
> What would you expect the result would be for:
> 
>>> "\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("F")

0

>>> "\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("I")

0.5

>>> "\N{LATIN SMALL LIGATURE FFI}".case_insensitive_find("I")

0.6

;-)

(mostly joking, but those are good additional tests to consider)

-tkc






Re: Case-insensitive string equality

2017-08-31 Thread MRAB


On 2017-08-31 16:29, Tim Chase wrote:
> On 2017-08-31 07:10, Steven D'Aprano wrote:
>> So I'd like to propose some additions to 3.7 or 3.8.
>
> Adding my "yes, a case-insensitive equality-check would be useful"
> with the following concerns:
>
> I'd want to have an optional parameter to take locale into
> consideration.  E.g.
>
>    "i".case_insensitive_equals("I") # depends on Locale
>    "i".case_insensitive_equals("I", Locale("TR")) == False
>    "i".case_insensitive_equals("I", Locale("US")) == True
>
> and other oddities like
>
>    "ß".case_insensitive_equals("SS") == True
>
> (though casefold() takes care of that latter one).  Then you get
> things like
>
>    "III".case_insensitive_equals("\N{ROMAN NUMERAL THREE}")
>    "iii".case_insensitive_equals("\N{ROMAN NUMERAL THREE}")
>    "FI".case_insensitive_equals("\N{LATIN SMALL LIGATURE FI}")
>
> where the decomposition might need to be considered.  There are just
> a lot of odd edge-cases to consider when discussing fuzzy equality.
>
>> (1) Add a new string method,
>
> This is my preferred avenue.
>
>> Alternatively: how about a === triple-equals operator to do the
>> same thing?
>
> No.  A strong -1 for new operators.  This peeves me in other
> languages (looking at you, PHP & JavaScript)
>
>> (2) Add keyword-only arguments to str.find and str.index:
>>
>> casefold=False
>>
>> which does nothing if false (the default), and switches to a
>> case-insensitive search if true.
>
> I'm okay with some means of conveying the insensitivity to
> str.find/str.index but have no interest in list.find/list.index
> growing similar functionality.  I'm meh on the "casefold=False"
> syntax, especially in light of my hope it would take a locale for the
> comparisons.
>
[snip]
What would you expect the result would be for:

"\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("F")

"\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("I")

Re: Case-insensitive string equality

2017-08-31 Thread Tim Chase

On 2017-08-31 07:10, Steven D'Aprano wrote:
> So I'd like to propose some additions to 3.7 or 3.8.

Adding my "yes, a case-insensitive equality-check would be useful"
with the following concerns:

I'd want to have an optional parameter to take locale into
consideration.  E.g.

  "i".case_insensitive_equals("I") # depends on Locale
  "i".case_insensitive_equals("I", Locale("TR")) == False
  "i".case_insensitive_equals("I", Locale("US")) == True

and other oddities like

  "ß".case_insensitive_equals("SS") == True

(though casefold() takes care of that latter one).  Then you get
things like

  "III".case_insensitive_equals("\N{ROMAN NUMERAL THREE}")
  "iii".case_insensitive_equals("\N{ROMAN NUMERAL THREE}")
  "FI".case_insensitive_equals("\N{LATIN SMALL LIGATURE FI}")

where the decomposition might need to be considered.  There are just
a lot of odd edge-cases to consider when discussing fuzzy equality.

> (1) Add a new string method,

This is my preferred avenue.

> Alternatively: how about a === triple-equals operator to do the
> same thing?

No.  A strong -1 for new operators.  This peeves me in other
languages (looking at you, PHP & JavaScript)

> (2) Add keyword-only arguments to str.find and str.index:
> 
> casefold=False
> 
> which does nothing if false (the default), and switches to a
> case-insensitive search if true.

I'm okay with some means of conveying the insensitivity to
str.find/str.index but have no interest in list.find/list.index
growing similar functionality.  I'm meh on the "casefold=False"
syntax, especially in light of my hope it would take a locale for the
comparisons.

> Unsolved problems:
> 
> This proposal doesn't help with sets and dicts, list.index and the
> `in` operator either.

I'd be less concerned about these.  If you plan to index a set/dict
by the key, normalize it before you put it in.  Or perhaps create a
CaseInsensitiveDict/CaseInsensitiveSet class.  For lists and 'in'
operator usage, it's not too hard to make up a helper function based
on the newly-grown method:

  def case_insensitive_in(itr, target, locale=None):
    return any(
      target.case_insensitive_equals(x, locale)
      for x in itr
      )

  def case_insensitive_index(itr, target, locale=None):
    for i, x in enumerate(itr):
      if target.case_insensitive_equals(x, locale):
        return i
    raise ValueError("Could not find %s" % target)
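
Since case_insensitive_equals and the locale argument are only proposals, here
is a version of the same two helpers that runs today. It is a sketch that
assumes plain str.casefold() is an acceptable comparison and simply drops the
locale parameter:

```python
def case_insensitive_in(items, target):
    """True if any item equals target after case folding."""
    folded = target.casefold()
    return any(folded == x.casefold() for x in items)

def case_insensitive_index(items, target):
    """Index of the first item equal to target after case folding."""
    folded = target.casefold()
    for i, x in enumerate(items):
        if folded == x.casefold():
            return i
    raise ValueError("Could not find %s" % target)

folders = ["Inbox", "Maildir", "Straße"]
print(case_insensitive_in(folders, "MAILDIR"))     # True
print(case_insensitive_index(folders, "STRASSE"))  # 2
```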

-tkc


Re: Case-insensitive string equality

2017-08-31 Thread Pavol Lisy

On 8/31/17, Steve D'Aprano  wrote:

>> Additionally: a proper "case insensitive comparison" should almost
>> certainly start with a Unicode normalization. But should it be NFC/NFD
>> or NFKC/NFKD? IMO that's a good reason to leave it in the hands of the
>> application.
>
> Normalisation is orthogonal to comparisons and searches. Python doesn't
> automatically normalise strings, as people have pointed out a bazillion
> times
> in the past, and it happily compares
>
> 'ö' LATIN SMALL LETTER O WITH DIAERESIS
>
> 'ö' LATIN SMALL LETTER O + COMBINING DIAERESIS
>
>
> as unequal. I don't propose to change that just so that we can get 'a'
> equals 'A' :-)

Locale-dependent Case Mappings. The principal example of a case mapping that
depends on the locale is Turkish, where U+0131 “ı” latin small letter dotless i
maps to U+0049 “I” latin capital letter i and U+0069 “i” latin small letter i
maps to U+0130 “İ” latin capital letter i with dot above. (source:
http://www.unicode.org/versions/Unicode10.0.0/ch05.pdf)

So 'SIKISIN'.casefold() could be dangerous ->
https://translate.google.com/#tr/en/sikisin%0As%C4%B1k%C4%B1s%C4%B1n
(although I am not sure if this story is true ->
https://www.theinquirer.net/inquirer/news/1017243/cellphone-localisation-glitch
)
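
To make the Turkish case concrete, here is a small check of what Python's
locale-independent str methods do with these two words. The variable names are
only for illustration:

```python
dotted = "sikisin"
dotless = dotted.replace("i", "\N{LATIN SMALL LETTER DOTLESS I}")  # "sıkısın"

# str.casefold() ignores the locale: capital I always folds to dotted i.
print("SIKISIN".casefold() == dotted)   # True
print("SIKISIN".casefold() == dotless)  # False

# Upper-casing maps dotless ı to plain capital I as well, so a comparison
# based on upper() treats the two words as equal while casefold() does not.
print(dotted.upper() == dotless.upper())        # True
print(dotted.casefold() == dotless.casefold())  # False
```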

Re: Case-insensitive string equality

2017-08-31 Thread John Gordon

Serhiy Storchaka writes:

> > But when there is a common source of mistakes, we can help prevent
> > that mistake.

> How can you do this? I know only one way -- teaching and practicing.

Modify the environment so that the mistake simply can't happen (or at
least happens much less frequently.)

-- 
John Gordon   A is for Amy, who fell down the stairs
gor...@panix.com  B is for Basil, assaulted by bears
-- Edward Gorey, "The Gashlycrumb Tinies"


Re: Case-insensitive string equality

2017-08-31 Thread Tim Chase

On 2017-08-31 18:17, Peter Otten wrote:
> A quick and dirty fix would be a naming convention:
> 
> upcase_a = something().upper()

I tend to use a "_u" suffix as my convention:

  something_u = something.upper()

which keeps the semantics of the original variable-name while hinting
at the normalization.

-tkc


Re: Case-insensitive string equality

2017-08-31 Thread Peter Otten

Steven D'Aprano wrote:

> Three times in the last week the devs where I work accidentally
> introduced bugs into our code because of a mistake with case-insensitive
> string comparisons. They managed to demonstrate three different failures:
> 
> # 1
> a = something().upper()  # normalise string
> ... much later on
> if a == b.lower(): ...

A quick and dirty fix would be a naming convention:

upcase_a = something().upper()

> ... much later on
> if upcase_a == b.lower(): ...

Wait, what...



Re: Case-insensitive string equality

2017-08-31 Thread Tim Chase

On 2017-08-31 23:30, Chris Angelico wrote:
> The method you proposed seems a little odd - it steps through the
> strings character by character and casefolds them separately. How is
> it superior to the two-line function? And it still doesn't solve any
> of your other cases.

It also breaks when casefold() returns multiple characters:

>>> s1 = 'ss'
>>> s2 = 'SS'
>>> s3 = 'ß'
>>> equal(s1,s2) # using Steve's equal() function
True
>>> equal(s1,s3)
False
>>> equal(s2,s3)
False
>>> s1.casefold() == s2.casefold()
True
>>> s1.casefold() == s3.casefold()
True
>>> s2.casefold() == s3.casefold()
True
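
The session above can be reproduced with the following script. Here equal is
the proposed implementation from the start of the thread turned into a plain
function, and equal_whole is a name made up for the two-line alternative:

```python
def equal(self, other):
    # Proposed per-character implementation, unchanged apart from indentation.
    if self is other:
        return True
    if not isinstance(other, str):
        raise TypeError
    if len(self) != len(other):
        return False
    casefold = str.casefold
    for a, b in zip(self, other):
        if casefold(a) != casefold(b):
            return False
    return True

def equal_whole(s, t):
    # Fold each string as a whole, then compare.
    return s.casefold() == t.casefold()

for pair in [("ss", "SS"), ("ss", "ß"), ("SS", "ß")]:
    print(pair, equal(*pair), equal_whole(*pair))
```

The length check is what rejects the pairs containing "ß": it has length 1,
but folds to the two characters "ss".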


-tkc




Re: Case-insensitive string equality

2017-08-31 Thread Rhodri James


On 31/08/17 15:03, Chris Angelico wrote:
> On Thu, Aug 31, 2017 at 11:53 PM, Stefan Ram wrote:
>> Chris Angelico writes:
>>> On Thu, Aug 31, 2017 at 10:49 PM, Steve D'Aprano wrote:
>>>> On Thu, 31 Aug 2017 05:51 pm, Serhiy Storchaka wrote:
>>>>> 31.08.17 10:10, Steven D'Aprano wrote:
>>>>>> def equal(s, t):
>>>>>>     return s.casefold() == t.casefold()
>>> The method you proposed seems a little odd - it steps through the
>>> strings character by character and casefolds them separately. How is
>>> it superior to the two-line function?
>>
>>    When the strings are long, casefolding both strings
>>    just to be able to tell that the first character of
>>    the left string is »;« while the first character of
>>    the right string is »'« and so the result is »False«
>>    might be slower than necessary.
>> [chomp]
>>    However, premature optimization is the root of all evil!
>
> Fair enough.
>
> However, I'm more concerned about the possibility of a semantic
> difference between the two. Is it at all possible for the case folding
> of an entire string to differ from the concatenation of the case
> foldings of its individual characters?
>
> Additionally: a proper "case insensitive comparison" should almost
> certainly start with a Unicode normalization. But should it be NFC/NFD
> or NFKC/NFKD? IMO that's a good reason to leave it in the hands of the
> application.

There's also the example in the documentation of str.casefold to 
consider.  We would rather like str.equal("ß", "ss") to be true.


--
Rhodri James *-* Kynesim Ltd

Re: Case-insensitive string equality

2017-08-31 Thread Serhiy Storchaka


31.08.17 17:38, Steve D'Aprano wrote:
> On Thu, 31 Aug 2017 11:45 pm, Serhiy Storchaka wrote:
>
>> It is not clear what is your problem exactly.
>
> That is fair. This is why I am discussing it here first, before taking it to
> Python-Ideas. At the moment my ideas on the matter are still half-formed.

What are you discussing? Without knowing what problem you are solving 
and what solution you are proposing, it is hard to discuss it.

>> The easy one-line function
>> solves the problem of testing case-insensitive string equality.
>
> True. Except that when a problem is as common as case-insensitive comparisons,
> there should be a standard solution, instead of having to re-invent the wheel
> over and over again. Even when the wheel is only two or three lines.

This *is* a standard solution. Don't invent the wheel, just use it properly.

> This is why we have dict.clear, for example, instead of:
>
>     Just add this function to the top of every module and script
>
>     def clear(d):
>         for key in list(d.keys()): del d[key]

No, there are other reasons for adding the clear() method in dict. 
Performance and atomicity (and the latter is more important).

> We say, *not every* two line function needs to be a builtin, rather than **no**
> two line function.

But there should be good reasons for this.

>> If you asked a solution that magically prevent people
>> from making simple programming mistakes, there is no such solution.
>
> Very true. But when there is a common source of mistakes, we can help prevent
> that mistake.

How can you do this? I know only one way -- teaching and practicing.


Re: Case-insensitive string equality

2017-08-31 Thread Steve D'Aprano

On Thu, 31 Aug 2017 11:45 pm, Serhiy Storchaka wrote:

> It is not clear what is your problem exactly. 

That is fair. This is why I am discussing it here first, before taking it to
Python-Ideas. At the moment my ideas on the matter are still half-formed.

> The easy one-line function 
> solves the problem of testing case-insensitive string equality. 

True. Except that when a problem is as common as case-insensitive comparisons,
there should be a standard solution, instead of having to re-invent the wheel
over and over again. Even when the wheel is only two or three lines.

This is why we have dict.clear, for example, instead of:

    Just add this function to the top of every module and script

    def clear(d):
        for key in list(d.keys()): del d[key]

We say, *not every* two line function needs to be a builtin, rather than **no**
two line function.

> Regular 
> expressions solve the problem of case-insensitive searching a position
> of a substring.

And now you have two problems... *wink*
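
For reference, a sketch of the regular-expression route. The helper name
find_ignorecase is made up here. One caveat worth knowing: re.IGNORECASE
matches character by character and does not apply full case folding, so it
will not find "SS" in "Straße", whereas casefolding both strings will:

```python
import re

def find_ignorecase(haystack, needle):
    """Hypothetical helper: position of needle in haystack, or -1."""
    # re.escape() keeps characters such as "." or "*" in needle literal.
    match = re.search(re.escape(needle), haystack, re.IGNORECASE)
    return match.start() if match else -1

print(find_ignorecase("xx FOO xx", "foo"))        # 3
print(find_ignorecase("Straße", "SS"))            # -1
print("Straße".casefold().find("SS".casefold()))  # 4
```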

> If you asked a solution that magically prevent people 
> from making simple programming mistakes, there is no such solution.

Very true. But when there is a common source of mistakes, we can help prevent
that mistake.

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.


Re: Case-insensitive string equality

2017-08-31 Thread Chris Angelico

On Fri, Sep 1, 2017 at 12:27 AM, Steve D'Aprano
 wrote:
>> Additionally: a proper "case insensitive comparison" should almost
>> certainly start with a Unicode normalization. But should it be NFC/NFD
>> or NFKC/NFKD? IMO that's a good reason to leave it in the hands of the
>> application.
>
> Normalisation is orthogonal to comparisons and searches. Python doesn't
> automatically normalise strings, as people have pointed out a bazillion times
> in the past, and it happily compares
>
> 'ö' LATIN SMALL LETTER O WITH DIAERESIS
>
> 'ö' LATIN SMALL LETTER O + COMBINING DIAERESIS
>
>
> as unequal. I don't propose to change that just so that we can get 'a'
> equals 'A' :-)

You may not, but others will. Which is just one of the reasons that
"case insensitive comparison" is not as simple as it initially seems,
and thus (IMO) is best NOT baked into the language.

ChrisA

Re: Case-insensitive string equality

2017-08-31 Thread Steve D'Aprano

On Fri, 1 Sep 2017 12:03 am, Chris Angelico wrote:

> On Thu, Aug 31, 2017 at 11:53 PM, Stefan Ram  wrote:
>> Chris Angelico  writes:

>>>The method you proposed seems a little odd - it steps through the
>>>strings character by character and casefolds them separately. How is
>>>it superior to the two-line function?
>>
>>   When the strings are long, casefolding both strings
>>   just to be able to tell that the first character of
>>   the left string is »;« while the first character of
>>   the right string is »'« and so the result is »False«
>>   might be slower than necessary.

Thanks Stefan, that was my reasoning.

Also, if the implementation was in C, doing the comparison character by
character is unlikely to be slower than doing the comparison all at once, since
the "all at once" comparison actually is character by character under the hood.

>> [chomp]
>>   However, premature optimization is the root of all evil!
> 
> Fair enough.

It's not premature optimization to plan ahead for obvious scenarios.

Sometimes we may want to compare two large strings. Calling casefold on them
temporarily doubles the memory taken by the two strings, which can be
significant. Assuming the method were written in C, you would be very unlikely
to build up two large temporary case-folded arrays before doing the comparison.

If I were subclassing str in pure Python, I wouldn't bother. The tradeoffs are
different.
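
For illustration only, here is one way the early-exit idea could look in pure
Python without building the two folded copies. It is a sketch, not the proposed
implementation: the name lazy_casefold_equal is made up, and it relies on
casefold() working one character at a time:

```python
from itertools import chain, zip_longest

def lazy_casefold_equal(s, t):
    """Hypothetical helper: case-insensitive equality with early exit."""
    # Fold one character at a time and flatten the results, so that a
    # character folding to several (such as "ß" -> "ss") is handled.
    folded_s = chain.from_iterable(c.casefold() for c in s)
    folded_t = chain.from_iterable(c.casefold() for c in t)
    # zip_longest pads the shorter stream with None, which never equals a
    # character, so strings of different folded length compare unequal.
    return all(a == b for a, b in zip_longest(folded_s, folded_t))

print(lazy_casefold_equal("Maße", "MASSE"))     # True
print(lazy_casefold_equal(";" * 10, "'" * 10))  # False, decided at the first character
```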

> However, I'm more concerned about the possibility of a semantic
> difference between the two. Is it at all possible for the case folding
> of an entire string to differ from the concatenation of the case
> foldings of its individual characters?

I don't believe so, but I welcome correction.
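
A quick spot check of that belief, which is evidence rather than proof: as far
as I know, str.casefold() folds each character independently, unlike
str.lower(), which looks at context for the Greek capital sigma. The helper
name and the sample strings are arbitrary:

```python
def folds_char_by_char(s):
    """True if folding s as a whole equals folding each character separately."""
    return s.casefold() == "".join(c.casefold() for c in s)

samples = ["Maße", "MASSE", "\N{LATIN SMALL LIGATURE FI}", "ΑΣ", "İstanbul"]
print(all(folds_char_by_char(s) for s in samples))  # True

# By contrast, lower() is context-sensitive: a word-final capital sigma
# becomes "ς", while a lone capital sigma becomes "σ".
word = "ΑΣ"
print(word.lower() == "".join(c.lower() for c in word))  # False
```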

> Additionally: a proper "case insensitive comparison" should almost
> certainly start with a Unicode normalization. But should it be NFC/NFD
> or NFKC/NFKD? IMO that's a good reason to leave it in the hands of the
> application.

Normalisation is orthogonal to comparisons and searches. Python doesn't
automatically normalise strings, as people have pointed out a bazillion times
in the past, and it happily compares 

'ö' LATIN SMALL LETTER O WITH DIAERESIS

'ö' LATIN SMALL LETTER O + COMBINING DIAERESIS

as unequal. I don't propose to change that just so that we can get 'a'
equals 'A' :-)

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.


Re: Case-insensitive string equality

2017-08-31 Thread Chris Angelico

On Thu, Aug 31, 2017 at 11:53 PM, Stefan Ram  wrote:
> Chris Angelico  writes:
>>On Thu, Aug 31, 2017 at 10:49 PM, Steve D'Aprano
>> wrote:
>>> On Thu, 31 Aug 2017 05:51 pm, Serhiy Storchaka wrote:
>>>> 31.08.17 10:10, Steven D'Aprano wrote:
>>>>> def equal(s, t):
>>>>>     return s.casefold() == t.casefold()
>>The method you proposed seems a little odd - it steps through the
>>strings character by character and casefolds them separately. How is
>>it superior to the two-line function?
>
>   When the strings are long, casefolding both strings
>   just to be able to tell that the first character of
>   the left string is »;« while the first character of
>   the right string is »'« and so the result is »False«
>   might be slower than necessary.
> [chomp]
>   However, premature optimization is the root of all evil!

Fair enough.

However, I'm more concerned about the possibility of a semantic
difference between the two. Is it at all possible for the case folding
of an entire string to differ from the concatenation of the case
foldings of its individual characters?

Additionally: a proper "case insensitive comparison" should almost
certainly start with a Unicode normalization. But should it be NFC/NFD
or NFKC/NFKD? IMO that's a good reason to leave it in the hands of the
application.

ChrisA

Re: Case-insensitive string equality

2017-08-31 Thread Serhiy Storchaka


31.08.17 15:49, Steve D'Aprano wrote:
> On Thu, 31 Aug 2017 05:51 pm, Serhiy Storchaka wrote:
>
>> 31.08.17 10:10, Steven D'Aprano wrote:
>>> (iii) Not every two line function needs to be in the standard library.
>>> Just add this to the top of every module:
>>>
>>> def equal(s, t):
>>>     return s.casefold() == t.casefold()
>>
>> This is my answer.
>>
>>> Unsolved problems:
>>>
>>> This proposal doesn't help with sets and dicts, list.index and the `in`
>>> operator either.
>>
>> This is the end of the discussion.
>
> Your answer of an equal() function doesn't help with sets and dicts either.

See rejected issue18986 [1], PEP 455 [2] and the corresponding mailing list 
discussions. The conclusion was that the need for such collections is low 
and that the problems they are supposed to solve can be solved with a 
normal dict (as they are solved now).

> So I guess we're stuck with no good standard answer:
>
> - the easy two-line function doesn't even come close to solving the problem
>   of case-insensitive string operations;

It is not clear what is your problem exactly. The easy one-line function 
solves the problem of testing case-insensitive string equality. Regular 
expressions solve the problem of case-insensitive searching a position 
of a substring. If you asked a solution that magically prevent people 
from making simple programming mistakes, there is no such solution.


[1] https://bugs.python.org/issue18986
[2] https://www.python.org/dev/peps/pep-0455/


Re: Case-insensitive string equality

2017-08-31 Thread Chris Angelico

On Thu, Aug 31, 2017 at 10:49 PM, Steve D'Aprano
 wrote:
> On Thu, 31 Aug 2017 05:51 pm, Serhiy Storchaka wrote:
>
>> 31.08.17 10:10, Steven D'Aprano wrote:
>>> (iii) Not every two line function needs to be in the standard library.
>>> Just add this to the top of every module:
>>>
>>> def equal(s, t):
>>>  return s.casefold() == t.casefold()
>>
>> This is my answer.
>>
>>> Unsolved problems:
>>>
>>> This proposal doesn't help with sets and dicts, list.index and the `in`
>>> operator either.
>>
>> This is the end of the discussion.
>
> Your answer of an equal() function doesn't help with sets and dicts either.
>
> So I guess we're stuck with no good standard answer:
>
> - the easy two-line function doesn't even come close to solving the problem
>   of case-insensitive string operations;
>
> - but we can't have case-insensitive string operations because too many
>   people say "just use this two-line function".

The method you proposed seems a little odd - it steps through the
strings character by character and casefolds them separately. How is
it superior to the two-line function? And it still doesn't solve any
of your other cases.

To make dicts/sets case insensitive, you really need something that
people have periodically asked for: a dict with a key function. It
would retain the actual key used, but for comparison purposes, would
use the modified version. Then you could use str.casefold as your key
function, and voila, you have a case insensitive dict.
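
A rough sketch of what such a dict could look like. KeyFuncDict is a made-up
name, not an existing class, and keeping the first spelling of a key on later
assignments is just one possible design choice:

```python
from collections.abc import MutableMapping

class KeyFuncDict(MutableMapping):
    """Hypothetical dict: compares keys via keyfunc, remembers the original key."""

    def __init__(self, keyfunc, items=()):
        self._keyfunc = keyfunc
        self._data = {}  # keyfunc(key) -> (original key, value)
        for key, value in items:
            self[key] = value

    def __getitem__(self, key):
        return self._data[self._keyfunc(key)][1]

    def __setitem__(self, key, value):
        folded = self._keyfunc(key)
        # Design choice: keep the spelling of the key that was stored first.
        original = self._data[folded][0] if folded in self._data else key
        self._data[folded] = (original, value)

    def __delitem__(self, key):
        del self._data[self._keyfunc(key)]

    def __iter__(self):
        return (original for original, _ in self._data.values())

    def __len__(self):
        return len(self._data)

d = KeyFuncDict(str.casefold)
d["Maildir"] = 1
d["MAILDIR"] = 2
print(d["maildir"], list(d), "mailDIR" in d)  # 2 ['Maildir'] True
```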

I don't know if that would solve your other problems though.

ChrisA

Re: Case-insensitive string equality

2017-08-31 Thread Steve D'Aprano

On Thu, 31 Aug 2017 05:51 pm, Serhiy Storchaka wrote:

> 31.08.17 10:10, Steven D'Aprano wrote:
>> (iii) Not every two line function needs to be in the standard library.
>> Just add this to the top of every module:
>> 
>> def equal(s, t):
>>  return s.casefold() == t.casefold()
> 
> This is my answer.
> 
>> Unsolved problems:
>> 
>> This proposal doesn't help with sets and dicts, list.index and the `in`
>> operator either.
> 
> This is the end of the discussion.

Your answer of an equal() function doesn't help with sets and dicts either.

So I guess we're stuck with no good standard answer:

- the easy two-line function doesn't even come close to solving the problem
  of case-insensitive string operations;

- but we can't have case-insensitive string operations because too many 
  people say "just use this two-line function".




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.


Re: Case-insensitive string equality

2017-08-31 Thread Rhodri James


On 31/08/17 08:10, Steven D'Aprano wrote:
> So I'd like to propose some additions to 3.7 or 3.8. If the feedback here
> is positive, I'll take it to Python-Ideas for the negative feedback :-)
>
> (1) Add a new string method, which performs a case-insensitive equality
> test. Here is a potential implementation, written in pure Python:
>
> def equal(self, other):
>     if self is other:
>         return True
>     if not isinstance(other, str):
>         raise TypeError
>     if len(self) != len(other):
>         return False
>     casefold = str.casefold
>     for a, b in zip(self, other):
>         if casefold(a) != casefold(b):
>             return False
>     return True

I'd quibble about the name and the implementation (length is not 
preserved under casefolding), but I'd go for this.  The number of times 
I've written something like this in different languages...

> Alternatively: how about a === triple-equals operator to do the same
> thing?

Much less keen on new syntax, especially when other languages use it for 
other purposes.

> (2) Add keyword-only arguments to str.find and str.index:
>
>     casefold=False
>
>     which does nothing if false (the default), and switches to a
>     case-insensitive search if true.

There's an implementation argument to be had about whether separate 
casefolded methods would be better, but yes.

> Unsolved problems:
>
> This proposal doesn't help with sets and dicts, list.index and the `in`
> operator either.

The only way I can think of to get much traction with this is to have a 
separate case-insensitive string class.  That feels a bit heavyweight, 
though.


--
Rhodri James *-* Kynesim Ltd

Re: Case-insensitive string equality

2017-08-31 Thread Serhiy Storchaka


31.08.17 10:10, Steven D'Aprano wrote:
> (iii) Not every two line function needs to be in the standard library.
> Just add this to the top of every module:
>
> def equal(s, t):
>     return s.casefold() == t.casefold()

This is my answer.

> Unsolved problems:
>
> This proposal doesn't help with sets and dicts, list.index and the `in`
> operator either.

This is the end of the discussion.


Re: Case-insensitive string equality

2017-08-31 Thread Antoon Pardon

IMO this should be solved by a company-wide library, and I would
go in the direction of a Normalized_String class.

This has the advantages
(1) that the company can choose whatever normalization suits them;
not all cases are suited by comparing case-insensitively,
(2) that individual devs in the company don't have to write their own,
(3) and that Normalized_Strings can be keys in dictionaries and members
of a set.
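
A bare-bones sketch of such a class, under the assumption that str.casefold is
the default normalization. The class name, attribute names and the normalize
parameter are all illustrative, and mixing instances built with different
normalize functions is not handled:

```python
class NormalizedString:
    """Hypothetical wrapper: keeps the original text, compares by a normalized form."""

    def __init__(self, text, normalize=str.casefold):
        self.text = text
        self._key = normalize(text)

    def __eq__(self, other):
        if isinstance(other, NormalizedString):
            return self._key == other._key
        return NotImplemented

    def __hash__(self):
        # Equal objects must hash equal, so hash the normalized form.
        return hash(self._key)

    def __repr__(self):
        return "NormalizedString(%r)" % self.text

folders = {NormalizedString("Maildir")}
print(NormalizedString("MAILDIR") in folders)                      # True
print(NormalizedString("Maildir") == NormalizedString("maildir"))  # True
```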

On 31-08-17 at 09:10, Steven D'Aprano wrote:
> Three times in the last week the devs where I work accidentally 
> introduced bugs into our code because of a mistake with case-insensitive 
> string comparisons. They managed to demonstrate three different failures:
>
> # 1
> a = something().upper()  # normalise string
> ... much later on
> if a == b.lower(): ...
>
>
> # 2
> a = something().upper()
> ... much later on
> if a == 'maildir': ...
>
>
> # 3
> a = something()  # unnormalised
> assert 'foo' in a
> ... much later on
> pos = a.find('FOO')
>
>
>
> Not every two line function needs to be in the standard library, but I've 
> come to the conclusion that case-insensitive testing and searches should 
> be. I've made these mistakes myself at times, as I'm sure most people 
> have, and I'm tired of writing my own case-insensitive function over and 
> over again.
>
>
> So I'd like to propose some additions to 3.7 or 3.8. If the feedback here 
> is positive, I'll take it to Python-Ideas for the negative feedback :-)
>
>
> (1) Add a new string method, which performs a case-insensitive equality 
> test. Here is a potential implementation, written in pure Python:
>
>
> def equal(self, other):
>     if self is other:
>         return True
>     if not isinstance(other, str):
>         raise TypeError
>     if len(self) != len(other):
>         return False
>     casefold = str.casefold
>     for a, b in zip(self, other):
>         if casefold(a) != casefold(b):
>             return False
>     return True
>
> Alternatively: how about a === triple-equals operator to do the same 
> thing?
>
>
>
> (2) Add keyword-only arguments to str.find and str.index:
>
> casefold=False
>
> which does nothing if false (the default), and switches to a case-
> insensitive search if true.
>
>
>
>
> Alternatives:
>
> (i) Do nothing. The status quo wins a stalemate.
>
> (ii) Instead of str.find or index, use a regular expression.
>
> This is less discoverable (you need to know regular expressions) and 
> harder to get right than to just call a string method. Also, I expect 
> that invoking the re engine just for case insensitivity will be a lot 
> more expensive than a simple search need be.
>
> (iii) Not every two line function needs to be in the standard library. 
> Just add this to the top of every module:
>
> def equal(s, t):
>     return s.casefold() == t.casefold()
>
>
> That's the status quo winning again. It's an annoyance. A small annoyance, 
> but multiplied by the sheer number of times it happens, it becomes a 
> large annoyance. I believe the annoyance factor of case-insensitive 
> comparisons outweighs the "two line function" objection.
>
> And the two-line "equal" function doesn't solve the problem for find and 
> index, or for sets, dicts, list.index and the `in` operator either.
>
>
> Unsolved problems:
>
> This proposal doesn't help with sets and dicts, list.index and the `in` 
> operator either.
>
>
>
> Thoughts?
>
>
>


Case-insensitive string equality

2017-08-31 Thread Steven D'Aprano

Three times in the last week the devs where I work accidentally 
introduced bugs into our code because of a mistake with case-insensitive 
string comparisons. They managed to demonstrate three different failures:

# 1
a = something().upper()  # normalise string
# ... much later on
if a == b.lower(): ...


# 2
a = something().upper()
# ... much later on
if a == 'maildir': ...


# 3
a = something()  # unnormalised
assert 'foo' in a
# ... much later on
pos = a.find('FOO')
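
To make the three failures concrete, here is a small sketch with made-up
values standing in for something() and b. The norm() helper is
hypothetical; it only illustrates routing every comparison through one
normalisation step.

```python
def norm(s):
    # Hypothetical helper: the single place where case is normalised.
    return s.casefold()


b = "Maildir"    # made-up stand-in for the value compared against
a = b.upper()    # "MAILDIR", normalised to upper case

print(a == b.lower())           # False: failure 1, upper compared with lower
print(a == "maildir")           # False: failure 2, upper compared with a lowercase literal
print("xxfooxx".find("FOO"))    # -1: failure 3, search on an unnormalised string

# The same three checks with both sides passed through norm():
print(norm(a) == norm(b))                   # True
print(norm(a) == norm("maildir"))           # True
print(norm("xxfooxx").find(norm("FOO")))    # 2
```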



Not every two line function needs to be in the standard library, but I've 
come to the conclusion that case-insensitive testing and searches should 
be. I've made these mistakes myself at times, as I'm sure most people 
have, and I'm tired of writing my own case-insensitive function over and 
over again.


So I'd like to propose some additions to 3.7 or 3.8. If the feedback here 
is positive, I'll take it to Python-Ideas for the negative feedback :-)


(1) Add a new string method, which performs a case-insensitive equality 
test. Here is a potential implementation, written in pure Python:


def equal(self, other):
    if self is other:
        return True
    if not isinstance(other, str):
        raise TypeError
    if len(self) != len(other):
        return False
    casefold = str.casefold
    for a, b in zip(self, other):
        if casefold(a) != casefold(b):
            return False
    return True

Alternatively: how about a === triple-equals operator to do the same 
thing?
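
One point worth checking against the pure-Python implementation above:
str.casefold can change the length of a string ("ß" folds to "ss"), so a
per-character comparison with an up-front length check can disagree with
folding the whole string. The sketch below only contrasts the two
behaviours; the function names are made up, and which answer is wanted is
a design decision.

```python
def equal_per_char(s, t):
    # Same logic as the per-character implementation: length check first,
    # then compare the casefolded characters pairwise.
    if len(s) != len(t):
        return False
    return all(a.casefold() == b.casefold() for a, b in zip(s, t))


def equal_whole(s, t):
    # Fold each string as a whole before comparing.
    return s.casefold() == t.casefold()


print(equal_per_char("Maildir", "MAILDIR"))   # True
print(equal_whole("Maildir", "MAILDIR"))      # True
print(equal_per_char("Straße", "STRASSE"))    # False (lengths 6 and 7 differ)
print(equal_whole("Straße", "STRASSE"))       # True (both fold to "strasse")
```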



(2) Add keyword-only arguments to str.find and str.index:

casefold=False

which does nothing if false (the default), and switches to a case-
insensitive search if true.
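
A rough sketch of how such a search could behave, written as a free
function because str.find has no casefold argument today. The name
find_casefold is hypothetical. Since casefolding can lengthen a string,
the sketch keeps a map from positions in the folded text back to
positions in the original, so the returned index refers to the original
string.

```python
def find_casefold(haystack, needle):
    """Hypothetical stand-in for haystack.find(needle, casefold=True)."""
    folded_parts = []
    origin = []   # origin[i] = index in haystack that produced folded character i
    for i, ch in enumerate(haystack):
        folded = ch.casefold()
        folded_parts.append(folded)
        origin.extend([i] * len(folded))

    pos = "".join(folded_parts).find(needle.casefold())
    if pos == -1:
        return -1
    if pos == len(origin):
        # Only reachable for an empty needle in an empty haystack.
        return len(haystack)
    return origin[pos]


print(find_casefold("xxFOOxx", "foo"))         # 2
print(find_casefold("Die Straße", "STRASSE"))  # 4
print(find_casefold("abc", "z"))               # -1
```

A match that begins in the middle of a folded character (for example the
second "s" of a folded "ß") is reported at the index of the original
character; whether that is the right answer is open to debate.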




Alternatives:

(i) Do nothing. The status quo wins a stalemate.

(ii) Instead of str.find or index, use a regular expression.

This is less discoverable (you need to know regular expressions) and 
harder to get right than to just call a string method. Also, I expect 
that invoking the re engine just for case insensitivity will be a lot 
more expensive than a simple search need be.
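
For comparison, a sketch of the regular expression route. re.search,
re.escape and re.IGNORECASE are standard; the wrapper name find_regex is
made up. Forgetting re.escape is one of the ways this is harder to get
right.

```python
import re


def find_regex(haystack, needle):
    # re.escape keeps characters such as "." or "*" in the needle literal.
    match = re.search(re.escape(needle), haystack, flags=re.IGNORECASE)
    return match.start() if match else -1


print(find_regex("xxfooxx", "FOO"))   # 2
print(find_regex("a.b", "A.B"))       # 0
print(find_regex("axb", "A.B"))       # -1 (would be 0 without re.escape)
```

As far as I know, re.IGNORECASE matches character by character, so it
would not treat "ß" and "ss" as equal the way str.casefold does.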

(iii) Not every two line function needs to be in the standard library. 
Just add this to the top of every module:

def equal(s, t):
    return s.casefold() == t.casefold()


That's the status quo winning again. It's an annoyance. A small annoyance, 
but multiplied by the sheer number of times it happens, it becomes a 
large annoyance. I believe the annoyance factor of case-insensitive 
comparisons outweighs the "two line function" objection.

And the two-line "equal" function doesn't solve the problem for find and 
index, or for sets, dicts, list.index and the `in` operator either.


Unsolved problems:

This proposal doesn't help with sets and dicts, list.index and the `in` 
operator either.
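
For what it's worth, the usual workarounds for these cases today look
something like the sketch below. The helper names are made up, and none
of this is part of the proposal.

```python
def contains_casefold(items, target):
    # Case-insensitive replacement for `target in items`.
    key = target.casefold()
    return any(item.casefold() == key for item in items)


def index_casefold(items, target):
    # Case-insensitive replacement for items.index(target).
    key = target.casefold()
    for i, item in enumerate(items):
        if item.casefold() == key:
            return i
    raise ValueError(f"{target!r} is not in list")


folders = ["Inbox", "Maildir", "Sent"]
print(contains_casefold(folders, "MAILDIR"))   # True
print(index_casefold(folders, "sent"))         # 2

# For sets and dicts, one option is to store casefolded keys up front.
seen = {name.casefold() for name in folders}
print("INBOX".casefold() in seen)              # True
```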



Thoughts?



-- 
Steven D'Aprano
“You are deluded if you think software engineers who can't write 
operating systems or applications without security holes, can write 
virtualization layers without security holes.” —Theo de Raadt
