Re: Case-insensitive string equality
On Wed, 6 Sep 2017 12:27 am, Grant Edwards wrote:

> On 2017-09-03, Gregory Ewing wrote:
>> Stefan Ram wrote:
>>> But of
>>> course, actually the rules of orthography require "Maße" or
>>> "Masse" and do not allow "MASSE" or "MASZE", just as in
>>> English, "English" has to be written "English" and not
>>> "english" or "ENGLISH".
>>
>> While "english" is wrong in English, there's no rule
>> against using "ENGLISH" as an all-caps version.
>
> Perhaps there's no "rule" in your book of rules, but it's almost
> universally considered bad style and you will lose points with your
> teacher, editor, etc.

And yet editors frequently use ALL CAPS for book titles and sometimes even
chapter headings, as well as the author's name. I have a shelf full of books
by STEPHEN KING behind me.

Likewise for movie titles on DVDs, album titles on CDs, address labels (I'm
looking at a letter addressed to MR S DAPRANO right now), labels on food and
medication.

The late and much-lamented English humourist Sir Terry Pratchett wrote almost
forty books including the character of Death, who SPEAKS IN CAPITALS. (And his
superior, Azrael, does the same in letters almost the size of the entire page.
Fortunately he says only a single word in the entire series.)

Many government forms and databases require all caps, presumably because it is
easier for handwriting recognition, or maybe it's just a leftover from the days
of typewriters.

> On the inter-tubes generally indicates you're shouting, or just a
> kook. I guess if either of those is true, then it's good style.

It is true that in general we don't write ordinary prose in all caps, but
there are plenty of non-kook uses for it.

But speaking of capitals on the Internet:

    HI EVERYBODY!!
    try pressing the the Caps Lock key
    O THANKS!!! ITS SO MUCH EASIER TO WRITE NOW!!!
    fuck me

http://bash.org/?835030

--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.
-- https://mail.python.org/mailman/listinfo/python-list
Re: Case-insensitive string equality
On Wed, Sep 6, 2017 at 12:27 AM, Grant Edwards wrote: > On 2017-09-03, Gregory Ewing wrote: >> Stefan Ram wrote: >>> But of >>> course, actually the rules of orthography require "Maße" or >>> "Masse" and do not allow "MASSE" or "MASZE", just as in >>> English, "English" has to be written "English" and not >>> "english" or "ENGLISH". >> >> While "english" is wrong in English, there's no rule >> against using "ENGLISH" as an all-caps version. > > Perhaps there's no "rule" in your book of rules, but it's almost > universally considered bad style and you will lose points with your > teacher, editor, etc. > > On the inter-tubes generally indicates you're shouting, or just a > kook. I guess if either of those is true, then it's good style. ENGLISH, Doc! *Doc Brown proceeds to explain and/or demonstrate* ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Case-insensitive string equality
On 2017-09-03, Gregory Ewing wrote: > Stefan Ram wrote: >> But of >> course, actually the rules of orthography require "Maße" or >> "Masse" and do not allow "MASSE" or "MASZE", just as in >> English, "English" has to be written "English" and not >> "english" or "ENGLISH". > > While "english" is wrong in English, there's no rule > against using "ENGLISH" as an all-caps version. Perhaps there's no "rule" in your book of rules, but it's almost universally considered bad style and you will lose points with your teacher, editor, etc. On the inter-tubes generally indicates you're shouting, or just a kook. I guess if either of those is true, then it's good style. -- Grant Edwards grant.b.edwardsYow! at BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI-BI- gmail.com -- https://mail.python.org/mailman/listinfo/python-list
Re: Case-insensitive string equality
On Tue, Sep 5, 2017 at 6:05 PM, Stefan Behnel wrote:

> Steve D'Aprano wrote on 02.09.2017 at 02:31:
>> - the German eszett, ß, which has two official[1] uppercase forms: 'SS'
>> and an uppercase eszett
>
> I wonder if there is an equivalent to Godwin's Law with respect to
> character case related discussions and the German ß.

Given that it's such a useful test case, I think it's inevitable (the
first part of Godwin's Law), but not a conversation killer (the second
part, and not (AFAIK) part of the original statement).

Either that, or the Turkish Iı/İi.

ChrisA
Re: Case-insensitive string equality
Steve D'Aprano wrote on 02.09.2017 at 02:31:

> - the German eszett, ß, which has two official[1] uppercase forms: 'SS'
> and an uppercase eszett

I wonder if there is an equivalent to Godwin's Law with respect to
character case related discussions and the German ß.

Stefan
Re: Case-insensitive string equality
Steven D'Aprano wrote:
[...]
> (1) Add a new string method, which performs a case-insensitive
> equality test. Here is a potential implementation, written in
> pure Python:
>
>     def equal(self, other):
>         if self is other:
>             return True
>         if not isinstance(other, str):
>             raise TypeError
>         if len(self) != len(other):
>             return False
>         casefold = str.casefold
>         for a, b in zip(self, other):
>             if casefold(a) != casefold(b):
>                 return False
>         return True
>
> Alternatively: how about a === triple-equals operator to do
> the same thing?

A good idea. But wouldn't that specific usage be inconsistent (even
backwards) with the semantics of "===" as defined in most languages
that use "==="?

For me -- and this comment will be going beyond the scope of strings,
and possibly, beyond the scope of this thread -- I feel that Python is
missing a pair of equality testing devices (sugared or not; but
preferably sugared), that define a universal means by which all types
can be tested with either "superficial equality" (aka: ==) or "deep
equality" (aka: ===).

However, such a design (whilst quite intuitive) would break equality
testing as it exists today in Python. For instance, it would mean that:

(1) Superficial Equality

    >>> "abc" == "abc"
    True
    >>> "abc" == "ABC"
    True

(2) Deep Equality

    >>> "abc" === "abc"
    True
    >>> "abc" === "ABC"
    False

And I don't think even GvR's time machine will be much help here. :-(
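A side note on the quoted equal() sketch: checking the lengths first and then folding character by character assumes that case folding preserves length, which is not guaranteed ("ß".casefold() is "ss"). A minimal alternative, assuming that folding the whole string is acceptable, might look like this. The name equal_casefold is hypothetical and is not something proposed in this thread.

```python
def equal_casefold(a, b):
    """Case-insensitive equality by folding the whole string.

    Hypothetical helper for illustration only. Unlike a per-character
    loop guarded by a length check, this tolerates folds that change
    the length of the string.
    """
    if not isinstance(a, str) or not isinstance(b, str):
        raise TypeError("both arguments must be str")
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    print(equal_casefold("straße", "STRASSE"))
    print(equal_casefold("abc", "ABD"))
```

The trade-off is that two temporary strings are built even when the inputs differ in the first character; whether that matters depends on the workload.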
Re: Case-insensitive string equality
On 2017-09-02 12:21, Steve D'Aprano wrote:
> On Fri, 1 Sep 2017 01:29 am, Tim Chase wrote:
> > I'd want to have an optional parameter to take locale into
> > consideration. E.g.
>
> Does regular case-sensitive equality take the locale into
> consideration?

No. Python says that .casefold()

https://docs.python.org/3/library/stdtypes.html#str.casefold

implements the Unicode case-folding specification

ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt

which calls out the additional processing for Turkic languages:

    # T: special case for uppercase I and dotted uppercase I
    #    - For non-Turkic languages, this mapping is normally not used.
    #    - For Turkic languages (tr, az), this mapping can be used
    #      instead of the normal mapping for these characters.
    #      Note that the Turkic mappings do not maintain canonical
    #      equivalence without additional processing.
    #      See the discussions of case mapping in the Unicode Standard
    #      for more information.

So it looks like what Python lacks is that "additional processing",
after which .casefold() should solve the problems.

According to my reading, if locale doesn't play a part in the equation,

    s1.casefold() == s2.casefold()

should suffice. Any case-insensitive code using .upper() or .lower()
instead of .casefold() becomes a code-smell.

> If regular case-sensitive string comparisons don't support the
> locale, why should case-insensitive comparisons be required to?

Adding new code to Python that just does what is already available is
indeed poor justification. But adding *new* functionality that handles
the locale-aware-case-insensitive-comparison could be justified.

> As far as I'm concerned, the only "must have" is that ASCII letters
> do the right thing. Everything beyond that is a "quality of
> implementation" issue.

But for this use-case, we already have .casefold() which does the job
and even extends beyond plain 7-bit ASCII to most of the typical
i18n/Unicode use-cases.

-tkc
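To make the Turkic caveat above concrete, here is a rough sketch of what applying the "T" mappings before the default fold could look like. The function name turkic_casefold is hypothetical, the two replacements are my reading of the T lines in CaseFolding.txt (U+0049 to U+0131, U+0130 to U+0069), and the sketch deliberately ignores the canonical-equivalence processing the file warns about.

```python
def turkic_casefold(s):
    """Fold case using the Turkic 'T' mappings, then the default fold.

    Hypothetical helper, for illustration only:
      - U+0049 LATIN CAPITAL LETTER I  -> U+0131 (dotless i)
      - U+0130 CAPITAL I WITH DOT ABOVE -> U+0069 (i)
    Normalisation (canonical equivalence) is NOT handled here.
    """
    s = s.replace("I", "\u0131").replace("\u0130", "i")
    return s.casefold()


if __name__ == "__main__":
    # The default fold keeps a combining dot after folding U+0130,
    # so it does not compare equal to plain ASCII "istanbul".
    print("\u0130stanbul".casefold() == "istanbul")
    print(turkic_casefold("\u0130stanbul") == "istanbul")
```

Note that the Turkic fold is wrong for non-Turkic text ("I" no longer matches "i"), which is why it would have to be opt-in rather than the default.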
Re: Capital ß [was Re: Case-insensitive string equality]
On 2017-09-04 03:28, Steve D'Aprano wrote:
> On Sat, 2 Sep 2017 01:48 pm, Stefan Ram wrote:
>
>> Steve D'Aprano writes:
>>> [1] I believe that the German government has now officially recognised the
>>> uppercase form of ß.
>>
>> [skip to the last paragraph for some "ß" content,
>> unless you want to read details about German spelling rules.]
>>
>> The German language is as free as the English one. It does
>> not come from a government.
>
> Nevertheless, even in English there are de facto rules about what you can and
> cannot use as text for official purposes. In most countries, you cannot change
> your name to an unpronounceable "love symbol" as the late Artist Formerly Known
> As Prince did. You can't fill in your tax using Alienese
>
> http://futurama.wikia.com/wiki/Alienese
>
> or even Vietnamese, Greek or Arabic.
>
> In Australia, the Victorian state government Department of Births Deaths and
> Marriages doesn't even accept such unexceptional and minor variants as Zöe for
> Zoe. Of course you are free to call yourself Zöe when you sign your emails, but
> your birth certificate, passport and driver's licence will show it as Zoe.

Of course they reject "Zöe": the correct spelling is "Zoë". :-)

[snip]
Capital ß [was Re: Case-insensitive string equality]
On Sat, 2 Sep 2017 01:48 pm, Stefan Ram wrote:

> Steve D'Aprano writes:
>>[1] I believe that the German government has now officially recognised the
>>uppercase form of ß.
>
> [skip to the last paragraph for some "ß" content,
> unless you want to read details about German spelling rules.]
>
> The German language is as free as the English one. It does
> not come from a government.

Nevertheless, even in English there are de facto rules about what you can and
cannot use as text for official purposes. In most countries, you cannot change
your name to an unpronounceable "love symbol" as the late Artist Formerly Known
As Prince did. You can't fill in your tax using Alienese

http://futurama.wikia.com/wiki/Alienese

or even Vietnamese, Greek or Arabic.

In Australia, the Victorian state government Department of Births Deaths and
Marriages doesn't even accept such unexceptional and minor variants as Zöe for
Zoe. Of course you are free to call yourself Zöe when you sign your emails, but
your birth certificate, passport and driver's licence will show it as Zoe.

> The 16 states (Bundesländer), agreed to a common institution
> ("Der Rat für deutsche Rechtschreibung") to write down the
> rules for their /schools/. The federal government is not
> involved. Most publishing houses volunteered to follow those
> school rules. Outside of schools or binding contracts,
> everyone is free to write as he likes.

I'm not suggesting that the Spelling Police will come arrest me in either
Germany or the UK/Australia if I were to write my name Ƽτευεη. Switzerland on
the other hand ... *wink*

But there are very strong conventions about what is acceptable, and often there
are actual laws in place that limit what letters are used in official
documentation and records, what is taught in schools, etc.

The 1996 spelling reforms, and their legal status, are described here:

https://en.wikipedia.org/wiki/German_orthography_reform_of_1996

and Der Rat für deutsche Rechtschreibung:

https://en.wikipedia.org/wiki/Council_for_German_Orthography

(Sorry for not linking to the German versions as well.)

> The "ß" sometimes has been uppercased to "SS" and sometimes
> to "SZ".

Historically, some German publishers used a distinct uppercase ß, while others
used a ligature of SZ, explicitly stating that this was an interim measure
until they decided on a good-looking uppercase ß.

More about capital ß:

https://typography.guru/journal/germanys-new-character/
https://medium.com/@typefacts/the-german-capital-letter-eszett-e0936c1388f8
https://en.wikipedia.org/wiki/Capital_%E1%BA%9E

--
Steve
Re: Case-insensitive string equality
On Mon, 4 Sep 2017 09:10 am, Gregory Ewing wrote: > Stefan Ram wrote: >> But of >> course, actually the rules of orthography require "Maße" or >> "Masse" and do not allow "MASSE" or "MASZE", just as in >> English, "English" has to be written "English" and not >> "english" or "ENGLISH". > > While "english" is wrong in English, there's no rule > against using "ENGLISH" as an all-caps version. It's not always wrong. If you're e.e. cummings, then you're allowed to avoid capitals. (If you're anyone else, you're either a derivative hack, or lazy.) And if you are referring to the spin applied to billiard balls, it is acceptable to write it as english. > Are you saying that all-caps text is not allowed in > German? If so, that's very different from English. Germans use ALLCAPS for headlines, book titles, emphasis etc just as English speakers do. For example: http://www.spiegel.de/politik/index.html -- Steve “Cheer up,” they said, “things could be worse.” So I cheered up, and sure enough, things got worse. -- https://mail.python.org/mailman/listinfo/python-list
Re: Case-insensitive string equality
Stefan Ram wrote:
> But of
> course, actually the rules of orthography require "Maße" or
> "Masse" and do not allow "MASSE" or "MASZE", just as in
> English, "English" has to be written "English" and not
> "english" or "ENGLISH".

While "english" is wrong in English, there's no rule
against using "ENGLISH" as an all-caps version.

Are you saying that all-caps text is not allowed in
German? If so, that's very different from English.

--
Greg
Re: Case-insensitive string equality
On 9/3/17, Steve D'Aprano wrote:
> On Sun, 3 Sep 2017 05:17 pm, Stephan Houben wrote:
>
>> Generally speaking, the more you learn about case normalization,
>> the more attractive case sensitivity looks
>
> Just because something is hard doesn't mean it's not worth doing.
>
> And just because you can't please all the people all the time doesn't mean
> it's not worthwhile.

I was thinking about a comparison where false positives are acceptable
(and a well-defined property).

For example, if somebody has a case-sensitive FS and wants to publish
files while avoiding name collisions on any case-insensitive FS, then a
comparison that may report false-positive equality could be useful.

Then the backward-compatibility problem could (theoretically) be
reduced to enlarging the equivalence classes in future.

I mean something like ->

    equal = lambda a, b: any(f(a) == f(b) for f in C)
    # where C is an extensible list of comparison-key functions

Do you think that such an equivalence relation could solve the problems
which you describe in the first mail in this thread?

And if we are trying to "solve" the Unicode problem, why not? ->

    a ≐ b
    a ⋵ L
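A runnable version of the any-of-several-keys idea above might look like the following. The contents of the list C and the name loosely_equal are assumptions made purely for illustration; the original message does not say which functions C should hold.

```python
import unicodedata

# Hypothetical, extensible list of key functions. Two strings count as
# "equal" if ANY of these keys agrees, so appending a function can only
# merge equivalence classes, never split them.
C = [
    lambda s: s,                                          # exact match
    str.casefold,                                         # full case folding
    lambda s: unicodedata.normalize("NFKC", s).casefold() # compat. + fold
]


def loosely_equal(a, b, keys=C):
    """Return True if any key function maps a and b to the same value."""
    return any(f(a) == f(b) for f in keys)


if __name__ == "__main__":
    print(loosely_equal("README", "readme"))
    print(loosely_equal("\N{LATIN SMALL LIGATURE FI}le", "FILE"))
```

One caveat with this design: an "any key matches" relation is reflexive and symmetric but not necessarily transitive, so strictly speaking it is a tolerance relation rather than an equivalence relation unless the keys are chosen so that each one refines the next.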
Re: Case-insensitive string equality
On Sun, 3 Sep 2017 05:17 pm, Stephan Houben wrote:

> Generally speaking, the more you learn about case normalization,
> the more attractive case sensitivity looks

Just because something is hard doesn't mean it's not worth doing.

And just because you can't please all the people all the time doesn't mean
it's not worthwhile.

--
Steve
Re: Case-insensitive string equality
On Sun, Sep 3, 2017 at 5:17 PM, Stephan Houben wrote: > Generally speaking, the more you learn about case normalization, > the more attractive case sensitivity looks ;-) Absolutely agreed. My general recommendation is to have two vastly different concepts: "equality matching" and "searching". Equality is case sensitive and strict; NFC/NFD normalization is about all you can do. Searching, on the other hand, can be case insensitive, do NFKC/NFKD normalization, and can even (in many contexts) strip off diacritical marks altogether, allowing people to search for "resume" and find "résumé", or to search for "muller" and find "Müller". There can be a whole swathe of other transformations done in search normalization too (collapse whitespace, fold punctuation into a few types, etc), though of course you need to be aware of context. But IMO it's far safer to NOT define things in terms of "equality". ChrisA -- https://mail.python.org/mailman/listinfo/python-list
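The "search normalization" described above could be sketched roughly as follows. This is only one possible pipeline under my own assumptions (NFKD, drop combining marks, casefold, collapse whitespace); the name search_key is hypothetical, and stripping diacritics is inappropriate in contexts where they distinguish words.

```python
import unicodedata


def search_key(text):
    """Build a lossy key for *searching*, not for equality.

    Steps (all assumptions of this sketch):
      1. NFKD-decompose, so accents become separate combining marks
      2. drop the combining marks
      3. casefold
      4. collapse runs of whitespace
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


if __name__ == "__main__":
    print(search_key("résumé") == search_key("resume"))
    print(search_key("  Müller   GmbH ") == search_key("muller gmbh"))
```

Because the key is lossy, it suits an index or a "did you mean" lookup; the original strings should still be stored and compared exactly.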
Re: Case-insensitive string equality
On 2017-09-02, Pavol Lisy wrote:
> But problem is that if somebody like to have stable API it has to be
> changed to "do what the Unicode consortium said (at X.Y. )" :/

It is even more exciting. Presumably a reason to have case-insensitivity
is to be compatible with existing popular case-insensitive systems.

So here is, for your entertainment, how some of these systems work.

* Windows NTFS case-insensitive file system

  An NTFS file system contains a hidden table $UpCase which maps
  characters to their upcase variant. Note:

  1. This maps characters in the BMP *only*, so NTFS treats
     characters outside the BMP as case-sensitive.

  2. Every character can in turn only be mapped into a single
     BMP character, so ß -> SS is not possible.

  3. The table is in practice dependent on the language of the
     Windows system which created it (so a Turkish NTFS partition
     would contain i -> İ), but in theory can contain any allowed
     mapping: I can create an NTFS filesystem which maps a -> b.

  4. Older Windows versions generated tables which were broken
     for certain languages (NT 3.51/Georgian). You may still have
     some NTFS partition with such a table lying around.

* macOS case-insensitive file system

  1. HFS+ is based on Unicode 3.2; this is fixed in stone because of
     backward compatibility.

  2. APFS is based on Unicode 9.0 and does normalization as well.

Generally speaking, the more you learn about case normalization,
the more attractive case sensitivity looks ;-)

There is also slim hope of having a single caseInsensitiveCompare
function which "Does The Right Thing"™.

Stephan
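To illustrate the consequences of a one-character-to-one-character, BMP-only upcase table, here is a toy model. It is not the real $UpCase table: the table below is derived from Python's own str.upper(), and the names build_toy_upcase_table and names_collide are hypothetical.

```python
def build_toy_upcase_table():
    """Toy stand-in for a BMP-only, single-character upcase table.

    Assumption of this sketch: an entry exists only when Python's
    str.upper() yields exactly one BMP character different from the input.
    """
    table = {}
    for cp in range(0x10000):
        ch = chr(cp)
        up = ch.upper()
        if up != ch and len(up) == 1 and ord(up) < 0x10000:
            table[ch] = up
    return table


def names_collide(a, b, table):
    """Compare two names after mapping each character through the table."""
    def fold(s):
        return "".join(table.get(ch, ch) for ch in s)
    return fold(a) == fold(b)


if __name__ == "__main__":
    table = build_toy_upcase_table()
    print(names_collide("readme.txt", "README.TXT", table))
    # ß uppercases to the two characters "SS", so it has no table entry.
    print(names_collide("straße", "STRASSE", table))
    # Deseret letters live outside the BMP, so they are never mapped.
    print(names_collide("\U00010428", "\U00010400", table))
```

Under this model the second and third comparisons report no collision, even though str.casefold() (for the eszett) and str.upper() (for the Deseret pair) would each treat the pair as a case variant.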
Re: Case-insensitive string equality
On 9/2/17 at 4:21, Steve D'Aprano wrote:
> If regular case-sensitive string comparisons don't support the locale, why
> should case-insensitive comparisons be required to?

I think that Chris answered this very well earlier:

On 9/2/17 at 2:53 AM, Chris Angelico wrote:
> On Sat, Sep 2, 2017 at 10:31 AM, Steve D'Aprano wrote:
> But code is often *wrong* due to backward compatibility concerns. Then you
> have to decide whether, for a brand new API, it's better to "do the same as
> the regex module" or to "do what the Unicode consortium says".

But the problem is that if somebody wants a stable API, it has to be
changed to "do what the Unicode consortium said (at X.Y. )" :/

Maybe it is simpler to write an intelligent linter to catch wrong comparisons?
Re: Case-insensitive string equality
On Sat, Sep 2, 2017 at 12:21 PM, Steve D'Aprano wrote:
> On Fri, 1 Sep 2017 01:29 am, Tim Chase wrote:
>
>> On 2017-08-31 07:10, Steven D'Aprano wrote:
>>> So I'd like to propose some additions to 3.7 or 3.8.
>>
>> Adding my "yes, a case-insensitive equality-check would be useful"
>> with the following concerns:
>>
>> I'd want to have an optional parameter to take locale into
>> consideration. E.g.
>
> Does regular case-sensitive equality take the locale into consideration?
>
> How do I convince Python to return true for these?
>
> 'i'.upper() == 'İ'
> 'I'.lower() == 'ı'
>
>
> I'm 99% sure that these are rhetorical questions where the answers are
> obviously:
>
> - No it doesn't.
> - And you can't.
>
> If regular case-sensitive string comparisons don't support the locale, why
> should case-insensitive comparisons be required to?
>
> We should not confuse "nice to have" for "must have". As far as I'm concerned,
> the only "must have" is that ASCII letters do the right thing. Everything
> beyond that is a "quality of implementation" issue.

The only "must have" is that the locale-independent conversions do the
right thing. We already have str.casefold() that correctly handles
>99% of situations; the easiest way to craft something like this is to
define it in terms of that.

> In any case, thanks to everyone for their feedback. Clearly there's not enough
> support for this for me to even bother taking it to Python-Ideas.

Agreed; if this were important enough for someone to want to run the
gauntlet of -ideas and -dev, I'd predict that it would be one of those
VERY hotly debated topics.

ChrisA
Re: Case-insensitive string equality
On Fri, 1 Sep 2017 01:29 am, Tim Chase wrote:

> On 2017-08-31 07:10, Steven D'Aprano wrote:
>> So I'd like to propose some additions to 3.7 or 3.8.
>
> Adding my "yes, a case-insensitive equality-check would be useful"
> with the following concerns:
>
> I'd want to have an optional parameter to take locale into
> consideration. E.g.

Does regular case-sensitive equality take the locale into consideration?

How do I convince Python to return true for these?

    'i'.upper() == 'İ'
    'I'.lower() == 'ı'

I'm 99% sure that these are rhetorical questions where the answers are
obviously:

- No it doesn't.
- And you can't.

If regular case-sensitive string comparisons don't support the locale, why
should case-insensitive comparisons be required to?

We should not confuse "nice to have" for "must have". As far as I'm concerned,
the only "must have" is that ASCII letters do the right thing. Everything
beyond that is a "quality of implementation" issue.

In any case, thanks to everyone for their feedback. Clearly there's not enough
support for this for me to even bother taking it to Python-Ideas.

--
Steve
Re: Case-insensitive string equality
On Sat, Sep 2, 2017 at 10:31 AM, Steve D'Aprano wrote:
> On Sat, 2 Sep 2017 01:41 am, Chris Angelico wrote:
>
>> Aside from lower(), which returns the string unchanged, the case
>> conversion rules say that this contains two letters.
>
> Do you have a reference to that?
>
> I mean, where in the Unicode case conversion rules is that stated? You cannot
> take the behaviour of Python as necessarily correct here -- it may be that the
> behaviour of Python is erroneous.

Yep! It's all in here.

ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt

> For what it's worth, even under Unicode's own rules, there are always going
> to be odd corner cases that surprise people. The most obvious cases are:
>
> You can't keep everybody happy. Doesn't mean we can't meet 99% of the
> use-cases.
>
> After all, what do you think the regex case insensitive matching does?

Honestly, I don't know what it does without checking. But code is
often *wrong* due to backward compatibility concerns. Then you have to
decide whether, for a brand new API, it's better to "do the same as
the regex module" or to "do what the Unicode consortium says".

As it turns out, the Python 're' module doesn't match the letters
against the ligature:

    >>> re.search("F", "\N{LATIN SMALL LIGATURE FI}", re.IGNORECASE)
    >>> re.search("f", "\N{LATIN SMALL LIGATURE FI}", re.IGNORECASE)
    >>> re.search("I", "\N{LATIN SMALL LIGATURE FI}", re.IGNORECASE)
    >>> re.search("i", "\N{LATIN SMALL LIGATURE FI}", re.IGNORECASE)
    >>> re.search("S", "\N{LATIN SMALL LETTER SHARP S}", re.IGNORECASE)
    >>> re.search("s", "\N{LATIN SMALL LETTER SHARP S}", re.IGNORECASE)
    >>>

I would consider that code to be incorrect.

ChrisA
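For comparison with the re.IGNORECASE results above, here is a small sketch that folds both sides with str.casefold() before searching. The helper name casefold_search is hypothetical, and note one limitation: the match offsets refer to the folded text, not the original string.

```python
import re

LIGATURE_FI = "\N{LATIN SMALL LIGATURE FI}"
SHARP_S = "\N{LATIN SMALL LETTER SHARP S}"


def casefold_search(needle, haystack):
    """Search for a literal needle after casefolding both strings.

    Hypothetical helper. The needle is treated as literal text
    (re.escape), and positions in the returned match object are
    positions in haystack.casefold(), not in haystack itself.
    """
    return re.search(re.escape(needle.casefold()), haystack.casefold())


if __name__ == "__main__":
    print(re.search("F", LIGATURE_FI, re.IGNORECASE))  # no match
    print(casefold_search("F", LIGATURE_FI) is not None)
    print(casefold_search("S", SHARP_S) is not None)
```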
Re: Case-insensitive string equality
On Sat, 2 Sep 2017 01:41 am, Chris Angelico wrote:

> Aside from lower(), which returns the string unchanged, the case
> conversion rules say that this contains two letters.

Do you have a reference to that?

I mean, where in the Unicode case conversion rules is that stated? You cannot
take the behaviour of Python as necessarily correct here -- it may be that the
behaviour of Python is erroneous.

For what it's worth, even under Unicode's own rules, there are always going to
be odd corner cases that surprise people. The most obvious cases are:

- dotted and dotless i

- the German eszett, ß, which has two official[1] uppercase forms: 'SS'
  and an uppercase eszett

- long s, ſ, which may or may not be treated as distinct from s

- likewise for ligatures -- is æ a ligature, or is it Old English ash?

You can't keep everybody happy. Doesn't mean we can't meet 99% of the
use-cases.

After all, what do you think the regex case insensitive matching does?


[1] I believe that the German government has now officially recognised the
uppercase form of ß.

--
Steve
Re: Case-insensitive string equality
On Sat, Sep 2, 2017 at 10:09 AM, Steve D'Aprano wrote:
> The question wasn't what "\N{LATIN SMALL LIGATURE FI}".upper() would find,
> but "\N{LATIN SMALL LIGATURE FI}".
>
> Nor did they ask about
>
> "\N{LATIN SMALL LIGATURE FI}".replace("\N{LATIN SMALL LIGATURE
> FI}", "Surprise!")
>
>> So what's the definition of "case insensitive find"? The most simple
>> and obvious form is:
>>
>>     def case_insensitive_find(self, other):
>>         return self.casefold().find(other.casefold())
>
> That's not the definition, that's an implementation.

It's a reference implementation that defines certain semantics.
Obviously you can *actually* implement it using some kind of C-level
loop or something, as long as you can define the semantics somehow.

So what IS your definition of "case insensitive find"? Do you
case-fold both strings or just one? What do you do about
length-changing case folding?

ChrisA
Re: Case-insensitive string equality
On Sat, 2 Sep 2017 01:41 am, Chris Angelico wrote:

> On Fri, Sep 1, 2017 at 11:22 PM, Steve D'Aprano wrote:
>> On Fri, 1 Sep 2017 09:53 am, MRAB wrote:
>>
>>> What would you expect the result would be for:
>>>
>>> "\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("F")
>>>
>>> "\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("I")
>>
>> That's easy.
>>
>> -1 in both cases, since neither "F" nor "I" is found in either string. We can
>> prove this by manually checking:
>>
>> py> for c in "\N{LATIN SMALL LIGATURE FI}":
>> ...     print(c, 'F' in c, 'f' in c)
>> ...     print(c, 'I' in c, 'i' in c)
>> ...
>> ﬁ False False
>> ﬁ False False
>>
>>
>> If you want some other result, then you're not talking about case
>> sensitivity.
>
> >>> "\N{LATIN SMALL LIGATURE FI}".upper()
> 'FI'

The question wasn't what "\N{LATIN SMALL LIGATURE FI}".upper() would find,
but "\N{LATIN SMALL LIGATURE FI}".

Nor did they ask about

    "\N{LATIN SMALL LIGATURE FI}".replace("\N{LATIN SMALL LIGATURE FI}", "Surprise!")

> So what's the definition of "case insensitive find"? The most simple
> and obvious form is:
>
>     def case_insensitive_find(self, other):
>         return self.casefold().find(other.casefold())

That's not the definition, that's an implementation.

--
Steve
Re: Case-insensitive string equality
On Fri, Sep 1, 2017 at 11:22 PM, Steve D'Aprano wrote:
> On Fri, 1 Sep 2017 09:53 am, MRAB wrote:
>
>> What would you expect the result would be for:
>>
>> "\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("F")
>>
>> "\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("I")
>
> That's easy.
>
> -1 in both cases, since neither "F" nor "I" is found in either string. We can
> prove this by manually checking:
>
> py> for c in "\N{LATIN SMALL LIGATURE FI}":
> ...     print(c, 'F' in c, 'f' in c)
> ...     print(c, 'I' in c, 'i' in c)
> ...
> ﬁ False False
> ﬁ False False
>
>
> If you want some other result, then you're not talking about case sensitivity.

    >>> "\N{LATIN SMALL LIGATURE FI}".upper()
    'FI'
    >>> "\N{LATIN SMALL LIGATURE FI}".lower()
    'ﬁ'
    >>> "\N{LATIN SMALL LIGATURE FI}".casefold()
    'fi'

Aside from lower(), which returns the string unchanged, the case
conversion rules say that this contains two letters. So "F" exists in
the uppercased version of the string, and "f" exists in the casefolded
version.

So what's the definition of "case insensitive find"? The most simple
and obvious form is:

    def case_insensitive_find(self, other):
        return self.casefold().find(other.casefold())

which would clearly return 0 and 1 for the two original searches.

ChrisA
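One wrinkle with the simple casefold().find() form is that the result is an index into the folded string, which can be longer than the original. A possible variant that reports the position in the original string is sketched below. It relies on the assumption that str.casefold() folds each code point independently, and the choice to report the index of the original character whose fold contains the start of the match is mine, not something settled in this thread.

```python
def case_insensitive_find(haystack, needle):
    """Find needle in haystack ignoring case; return an index into haystack.

    Sketch only. Each character is folded on its own so that every
    folded character can be traced back to the original index that
    produced it.
    """
    folded_parts = []
    origin = []  # origin[k] = index in haystack of folded character k
    for i, ch in enumerate(haystack):
        f = ch.casefold()
        folded_parts.append(f)
        origin.extend([i] * len(f))
    pos = "".join(folded_parts).find(needle.casefold())
    if pos == -1:
        return -1
    # An empty needle in an empty haystack matches at 0 with no origin entry.
    return origin[pos] if pos < len(origin) else len(haystack)


if __name__ == "__main__":
    lig = "\N{LATIN SMALL LIGATURE FI}"
    print(case_insensitive_find(lig, "F"), case_insensitive_find(lig, "I"))
    print(case_insensitive_find("Straße", "SSE"))
```

With this mapping both ligature searches report index 0, since "f" and "i" both come from the single original character; the simple form would report 0 and 1.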
Re: Case-insensitive string equality
On Thu, 31 Aug 2017 08:15 pm, Rhodri James wrote: > I'd quibble about the name and the implementation (length is not > preserved under casefolding), Yes, I'd forgotten about that. > but I'd go for this. The number of times > I've written something like this in different languages... [...] > The only way I can think of to get much traction with this is to have a > separate case-insensitive string class. That feels a bit heavyweight, > though. You might be right about that. -- Steve “Cheer up,” they said, “things could be worse.” So I cheered up, and sure enough, things got worse. -- https://mail.python.org/mailman/listinfo/python-list
Re: Case-insensitive string equality
On Fri, 1 Sep 2017 09:53 am, MRAB wrote:

> What would you expect the result would be for:
>
> "\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("F")
>
> "\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("I")

That's easy.

-1 in both cases, since neither "F" nor "I" is found in either string. We can
prove this by manually checking:

    py> for c in "\N{LATIN SMALL LIGATURE FI}":
    ...     print(c, 'F' in c, 'f' in c)
    ...     print(c, 'I' in c, 'i' in c)
    ...
    ﬁ False False
    ﬁ False False

If you want some other result, then you're not talking about case sensitivity.

If anyone wants to propose "normalisation-insensitive matching", I'll ask you
to please start your own thread rather than derailing this one with an
unrelated, and much more difficult, problem.

The proposal here is *case insensitive* matching, not Unicode normalisation.

If you want to decompose the strings, you know how to:

    py> import unicodedata
    py> unicodedata.normalize('NFKD', "\N{LATIN SMALL LIGATURE FI}")
    'fi'

--
Steve
Re: Case-insensitive string equality
Steven D'Aprano writes: > Three times in the last week the devs where I work accidentally > introduced bugs into our code because of a mistake with case-insensitive > string comparisons. They managed to demonstrate three different failures: > > # 1 > a = something().upper() # normalise string > ... much later on > if a == b.lower(): ... > > > # 2 > a = something().upper() > ... much later on > if a == 'maildir': ... > > > # 3 > a = something() # unnormalised > assert 'foo' in a > ... much later on > pos = a.find('FOO') > > > > Not every two line function needs to be in the standard library, but I've > come to the conclusion that case-insensitive testing and searches should > be. I've made these mistakes myself at times, as I'm sure most people > have, and I'm tired of writing my own case-insensitive function over and > over again. > > > So I'd like to propose some additions to 3.7 or 3.8. If the feedback here > is positive, I'll take it to Python-Ideas for the negative feedback :-) > > > (1) Add a new string method, which performs a case-insensitive equality > test. Here is a potential implementation, written in pure Python: > > > def equal(self, other): > if self is other: > return True > if not isinstance(other, str): > raise TypeError > if len(self) != len(other): > return False > casefold = str.casefold > for a, b in zip(self, other): > if casefold(a) != casefold(b): > return False > return True > > Alternatively: how about a === triple-equals operator to do the same > thing? > > > > (2) Add keyword-only arguments to str.find and str.index: > > casefold=False > > which does nothing if false (the default), and switches to a case- > insensitive search if true. > > > > > Alternatives: > > (i) Do nothing. The status quo wins a stalemate. > > (ii) Instead of str.find or index, use a regular expression. > > This is less discoverable (you need to know regular expressions) and > harder to get right than to just call a string method. 
Also, I expect > that invoking the re engine just for case insensitivity will be a lot > more expensive than a simple search need be. > > (iii) Not every two line function needs to be in the standard library. > Just add this to the top of every module: > > def equal(s, t): > return s.casefold() == t.casefold() > > > That's the status quo wins again. It's an annoyance. A small > annoyance, but multiplied by the sheer number of times it happens, it > becomes a large annoyance. I believe the annoyance factor of > case-insensitive comparisons outweighs the "two line function" > objection. > > And the two-line "equal" function doesn't solve the problem for find > and index, or for sets dicts, list.index and the `in` operator either. > > > Unsolved problems: > > This proposal doesn't help with sets and dicts, list.index and the `in` > operator either. > > > > Thoughts? This seems to me to be rather similar to sort() and sorted(). How about giving equals() an optional parameter key, and perhaps the older cmp? Using casefold or upper or lower would satisfy many use cases but also allow Unicode or more locale specific normalization to be applied. The shortcircuiting in a character based comparison holds little appeal for me. I generally find that a string is a more useful concept than a collection of characters. +1 for using an affix in the name to represent a normalized version of the input. -- Pete Forman -- https://mail.python.org/mailman/listinfo/python-list
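A rough sketch of the key-parameter idea, by analogy with sorted(): the function name `equals` and the helper `nfkd_casefold` are made up for illustration and are not part of any proposal text. The default key here is assumed to be str.casefold.

```python
import unicodedata
from typing import Callable


def equals(s: str, t: str, key: Callable[[str], str] = str.casefold) -> bool:
    """Compare two strings after applying the same key function to both."""
    return key(s) == key(t)


def nfkd_casefold(s: str) -> str:
    """Hypothetical key: compatibility-decompose, then casefold."""
    return unicodedata.normalize("NFKD", s).casefold()


print(equals("Straße", "STRASSE"))                  # default key: casefold
print(equals("ß", "SS", key=str.lower))             # lower() is not enough here
print(equals("\N{ROMAN NUMERAL THREE}", "iii", key=nfkd_casefold))
```

Passing the key explicitly leaves the choice of normalisation to the caller, which is the point being made above.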
Re: Case-insensitive string equality
On 2017-09-01 00:53, MRAB wrote:
> What would you expect the result would be for:

>>> "\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("F")
0
>>> "\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("I")
0.5
>>> "\N{LATIN SMALL LIGATURE FFI}".case_insensitive_find("I")
0.6

;-)

(mostly joking, but those are good additional tests to consider)

-tkc
Re: Case-insensitive string equality
On 2017-08-31 16:29, Tim Chase wrote: On 2017-08-31 07:10, Steven D'Aprano wrote: So I'd like to propose some additions to 3.7 or 3.8. Adding my "yes, a case-insensitive equality-check would be useful" with the following concerns: I'd want to have an optional parameter to take locale into consideration. E.g. "i".case_insensitive_equals("I") # depends on Locale "i".case_insensitive_equals("I", Locale("TR")) == False "i".case_insensitive_equals("I", Locale("US")) == True and other oddities like "ß".case_insensitive_equals("SS") == True (though casefold() takes care of that later one). Then you get things like "III".case_insensitive_equals("\N{ROMAN NUMERAL THREE}") "iii".case_insensitive_equals("\N{ROMAN NUMERAL THREE}") "FI".case_insensitive_equals("\N{LATIN SMALL LIGATURE FI}") where the decomposition might need to be considered. There are just a lot of odd edge-cases to consider when discussing fuzzy equality. (1) Add a new string method, This is my preferred avenue. Alternatively: how about a === triple-equals operator to do the same thing? No. A strong -1 for new operators. This peeves me in other languages (looking at you, PHP & JavaScript) (2) Add keyword-only arguments to str.find and str.index: casefold=False which does nothing if false (the default), and switches to a case- insensitive search if true. I'm okay with some means of conveying the insensitivity to str.find/str.index but have no interest in list.find/list.index growing similar functionality. I'm meh on the "casefold=False" syntax, especially in light of my hope it would take a locale for the comparisons. [snip] What would you expect the result would be for: "\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("F") "\N{LATIN SMALL LIGATURE FI}".case_insensitive_find("I) -- https://mail.python.org/mailman/listinfo/python-list
Re: Case-insensitive string equality
On 2017-08-31 07:10, Steven D'Aprano wrote: > So I'd like to propose some additions to 3.7 or 3.8. Adding my "yes, a case-insensitive equality-check would be useful" with the following concerns: I'd want to have an optional parameter to take locale into consideration. E.g. "i".case_insensitive_equals("I") # depends on Locale "i".case_insensitive_equals("I", Locale("TR")) == False "i".case_insensitive_equals("I", Locale("US")) == True and other oddities like "ß".case_insensitive_equals("SS") == True (though casefold() takes care of that later one). Then you get things like "III".case_insensitive_equals("\N{ROMAN NUMERAL THREE}") "iii".case_insensitive_equals("\N{ROMAN NUMERAL THREE}") "FI".case_insensitive_equals("\N{LATIN SMALL LIGATURE FI}") where the decomposition might need to be considered. There are just a lot of odd edge-cases to consider when discussing fuzzy equality. > (1) Add a new string method, This is my preferred avenue. > Alternatively: how about a === triple-equals operator to do the > same thing? No. A strong -1 for new operators. This peeves me in other languages (looking at you, PHP & JavaScript) > (2) Add keyword-only arguments to str.find and str.index: > > casefold=False > > which does nothing if false (the default), and switches to a > case- insensitive search if true. I'm okay with some means of conveying the insensitivity to str.find/str.index but have no interest in list.find/list.index growing similar functionality. I'm meh on the "casefold=False" syntax, especially in light of my hope it would take a locale for the comparisons. > Unsolved problems: > > This proposal doesn't help with sets and dicts, list.index and the > `in` operator either. I'd be less concerned about these. If you plan to index a set/dict by the key, normalize it before you put it in. Or perhaps create a CaseInsensitiveDict/CaseInsensitiveSet class. 
For lists and 'in' operator usage, it's not too hard to make up a helper function based on the newly-grown method: def case_insensitive_in(itr, target, locale=None): return any( target.case_insensitive_equals(x, locale) for x in itr ) def case_insensitive_index(itr, target, locale=None): for i, x in enumerate(itr): if target.case_insensitive_equals(x, locale): return i raise ValueError("Could not find %s" % target) -tkc -- https://mail.python.org/mailman/listinfo/python-list
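The helpers above rely on a str method that does not exist yet. As a stand-in that runs today, here is a sketch that drops the locale argument and uses casefold() directly; that substitution is my assumption and not what the post proposes.

```python
from typing import Iterable


def case_insensitive_in(items: Iterable[str], target: str) -> bool:
    """True if any item equals target after casefolding both."""
    folded = target.casefold()
    return any(x.casefold() == folded for x in items)


def case_insensitive_index(items: Iterable[str], target: str) -> int:
    """Index of the first item equal to target after casefolding both."""
    folded = target.casefold()
    for i, x in enumerate(items):
        if x.casefold() == folded:
            return i
    raise ValueError("Could not find %r" % target)


folders = ["Inbox", "MailDir", "Sent"]
print(case_insensitive_in(folders, "maildir"))
print(case_insensitive_index(folders, "SENT"))
```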
Re: Case-insensitive string equality
On 8/31/17, Steve D'Aprano wrote: >> Additionally: a proper "case insensitive comparison" should almost >> certainly start with a Unicode normalization. But should it be NFC/NFD >> or NFKC/NFKD? IMO that's a good reason to leave it in the hands of the >> application. > > Normalisation is orthogonal to comparisons and searches. Python doesn't > automatically normalise strings, as people have pointed out a bazillion > times > in the past, and it happily compares > > 'ö' LATIN SMALL LETTER O WITH DIAERESIS > > 'ö' LATIN SMALL LETTER O + COMBINING DIAERESIS > > > as unequal. I don't propose to change that just so that we can get 'a' > equals 'A' :-) Locale-dependent Case Mappings. The principal example of a case mapping that depends on the locale is Turkish, where U+0131 “ı” latin small letter dotless i maps to U+0049 “I” latin capital letter i and U+0069 “i” latin small letter i maps to U+0130 “İ” latin capital letter i with dot above. (source: http://www.unicode.org/versions/Unicode10.0.0/ch05.pdf) So 'SIKISIN'.casefold() could be dangerous -> https://translate.google.com/#tr/en/sikisin%0As%C4%B1k%C4%B1s%C4%B1n (although I am not sure if this story is true -> https://www.theinquirer.net/inquirer/news/1017243/cellphone-localisation-glitch ) -- https://mail.python.org/mailman/listinfo/python-list
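To make the Turkish case concrete: str.casefold() takes no locale, so it always applies the default Unicode folding. A short sketch of what that means for the dotted and dotless i (variable names are only illustrative):

```python
DOTLESS_I = "\N{LATIN SMALL LETTER DOTLESS I}"               # ı
DOTTED_CAPITAL_I = "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}"  # İ

# Default folding maps "I" to "i", never to the dotless form.
plain_fold = "I".casefold()

# Uppercasing the dotless i gives a plain "I"...
dotless_upper = DOTLESS_I.upper()

# ...but the two do not compare equal after casefolding, although a
# Turkish-aware comparison would treat them as the same letter.
turkish_pair_matches = DOTLESS_I.casefold() == "I".casefold()

# The dotted capital folds to "i" plus COMBINING DOT ABOVE, not to a bare "i".
dotted_fold = DOTTED_CAPITAL_I.casefold()

print(plain_fold, dotless_upper, turkish_pair_matches, len(dotted_fold))
```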
Re: Case-insensitive string equality
In Serhiy Storchaka writes: > > But when there is a common source of mistakes, we can help prevent > > that mistake. > How can you do this? I know only one way -- teaching and practicing. Modify the environment so that the mistake simply can't happen (or at least happens much less frequently.) -- John Gordon A is for Amy, who fell down the stairs gor...@panix.com B is for Basil, assaulted by bears -- Edward Gorey, "The Gashlycrumb Tinies" -- https://mail.python.org/mailman/listinfo/python-list
Re: Case-insensitive string equality
On 2017-08-31 18:17, Peter Otten wrote:
> A quick and dirty fix would be a naming convention:
>
> upcase_a = something().upper()

I tend to use a "_u" suffix as my convention:

something_u = something.upper()

which keeps the semantics of the original variable-name while hinting
at the normalization.

-tkc
Re: Case-insensitive string equality
Steven D'Aprano wrote: > Three times in the last week the devs where I work accidentally > introduced bugs into our code because of a mistake with case-insensitive > string comparisons. They managed to demonstrate three different failures: > > # 1 > a = something().upper() # normalise string > ... much later on > if a == b.lower(): ... A quick and dirty fix would be a naming convention: upcase_a = something().upper() > ... much later on > if upcase_a == b.lower(): ... Wait, what... -- https://mail.python.org/mailman/listinfo/python-list
Re: Case-insensitive string equality
On 2017-08-31 23:30, Chris Angelico wrote:
> The method you proposed seems a little odd - it steps through the
> strings character by character and casefolds them separately. How is
> it superior to the two-line function? And it still doesn't solve any
> of your other cases.

It also breaks when casefold() returns multiple characters:

>>> s1 = 'ss'
>>> s2 = 'SS'
>>> s3 = 'ß'
>>> equal(s1,s2) # using Steve's equal() function
True
>>> equal(s1,s3)
False
>>> equal(s2,s3)
False
>>> s1.casefold() == s2.casefold()
True
>>> s1.casefold() == s3.casefold()
True
>>> s2.casefold() == s3.casefold()
True

-tkc
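One way to keep the early exit without tripping over multi-character folds is to compare the folded characters lazily and drop the length check. This is only a sketch: it assumes that folding one character at a time gives the same result as folding the whole string, and the helper name `_folded_chars` is invented here.

```python
from itertools import chain, zip_longest


def _folded_chars(s):
    """Lazily yield the casefolded characters of s; one input character
    may contribute several output characters (for example 'ß' -> 's', 's')."""
    return chain.from_iterable(c.casefold() for c in s)


def equal(s, t):
    """Case-insensitive equality that stops at the first mismatch."""
    if not isinstance(s, str) or not isinstance(t, str):
        raise TypeError("both arguments must be str")
    missing = object()  # padding for the shorter stream; never equals a str
    pairs = zip_longest(_folded_chars(s), _folded_chars(t), fillvalue=missing)
    return all(a == b for a, b in pairs)


print(equal("ss", "SS"), equal("ss", "ß"), equal("SS", "ß"), equal("a", "ab"))
```

Because all() consumes the generator one pair at a time, two long strings that differ in their first character are rejected without folding the rest.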
Re: Case-insensitive string equality
On 31/08/17 15:03, Chris Angelico wrote: On Thu, Aug 31, 2017 at 11:53 PM, Stefan Ram wrote: Chris Angelico writes: On Thu, Aug 31, 2017 at 10:49 PM, Steve D'Aprano wrote: On Thu, 31 Aug 2017 05:51 pm, Serhiy Storchaka wrote: 31.08.17 10:10, Steven D'Aprano ???: def equal(s, t): return s.casefold() == t.casefold() The method you proposed seems a little odd - it steps through the strings character by character and casefolds them separately. How is it superior to the two-line function? When the strings are long, casefolding both strings just to be able to tell that the first character of the left string is »;« while the first character of the right string is »'« and so the result is »False« might be slower than necessary. [chomp] However, premature optimization is the root of all evil! Fair enough. However, I'm more concerned about the possibility of a semantic difference between the two. Is it at all possible for the case folding of an entire string to differ from the concatenation of the case foldings of its individual characters? Additionally: a proper "case insensitive comparison" should almost certainly start with a Unicode normalization. But should it be NFC/NFD or NFKC/NFKD? IMO that's a good reason to leave it in the hands of the application. There's also the example in the documentation of str.casefold to consider. We would rather like str.equal("ß", "ss") to be true. -- Rhodri James *-* Kynesim Ltd -- https://mail.python.org/mailman/listinfo/python-list
Re: Case-insensitive string equality
31.08.17 17:38, Steve D'Aprano wrote:
> On Thu, 31 Aug 2017 11:45 pm, Serhiy Storchaka wrote:
>> It is not clear what is your problem exactly.
>
> That is fair. This is why I am discussing it here first, before taking it to
> Python-Ideas. At the moment my ideas on the matter are still half-formed.

What are you discussing? Without knowing what problem you are solving and
what solution you are proposing, it is hard to discuss it.

>> The easy one-line function
>> solves the problem of testing case-insensitive string equality.
>
> True. Except that when a problem is as common as case-insensitive comparisons,
> there should be a standard solution, instead of having to re-invent the wheel
> over and over again. Even when the wheel is only two or three lines.

This *is* a standard solution. Don't reinvent the wheel, just use it
properly.

> This is why we have dict.clear, for example, instead of:
>
>     Just add this function to the top of every module and script
>
>     def clear(d):
>         for key in list(d.keys()): del d[key]

No, there are other reasons for adding the clear() method to dict:
performance and atomicity (and the latter is more important).

> We say, *not every* two line function needs to be a builtin, rather than **no**
> two line function.

But there should be good reasons for this.

>> If you asked a solution that magically prevent people
>> from making simple programming mistakes, there is no such solution.
>
> Very true. But when there is a common source of mistakes, we can help prevent
> that mistake.

How can you do this? I know only one way -- teaching and practicing.
Re: Case-insensitive string equality
On Thu, 31 Aug 2017 11:45 pm, Serhiy Storchaka wrote: > It is not clear what is your problem exactly. That is fair. This is why I am discussing it here first, before taking it to Python-Ideas. At the moment my ideas on the matter are still half-formed. > The easy one-line function > solves the problem of testing case-insensitive string equality. True. Except that when a problem is as common as case-insensitive comparisons, there should be a standard solution, instead of having to re-invent the wheel over and over again. Even when the wheel is only two or three lines. This is why we have dict.clear, for example, instead of: Just add this function to the top of every module and script def clear(d): for key in list(d.keys()): del d[key] We say, *not every* two line function needs to be a builtin, rather than **no** two line function. > Regular > expressions solve the problem of case-insensitive searching a position > of a substring. And now you have two problems... *wink* > If you asked a solution that magically prevent people > from making simple programming mistakes, there is no such solution. Very true. But when there is a common source of mistakes, we can help prevent that mistake. -- Steve “Cheer up,” they said, “things could be worse.” So I cheered up, and sure enough, things got worse. -- https://mail.python.org/mailman/listinfo/python-list
Re: Case-insensitive string equality
On Fri, Sep 1, 2017 at 12:27 AM, Steve D'Aprano wrote: >> Additionally: a proper "case insensitive comparison" should almost >> certainly start with a Unicode normalization. But should it be NFC/NFD >> or NFKC/NFKD? IMO that's a good reason to leave it in the hands of the >> application. > > Normalisation is orthogonal to comparisons and searches. Python doesn't > automatically normalise strings, as people have pointed out a bazillion times > in the past, and it happily compares > > 'ö' LATIN SMALL LETTER O WITH DIAERESIS > > 'ö' LATIN SMALL LETTER O + COMBINING DIAERESIS > > > as unequal. I don't propose to change that just so that we can get 'a' > equals 'A' :-) You may not, but others will. Which is just one of the reasons that "case insensitive comparison" is not as simple as it initially seems, and thus (IMO) is best NOT baked into the language. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Case-insensitive string equality
On Fri, 1 Sep 2017 12:03 am, Chris Angelico wrote:

> On Thu, Aug 31, 2017 at 11:53 PM, Stefan Ram wrote:
>> Chris Angelico writes:
>>>The method you proposed seems a little odd - it steps through the
>>>strings character by character and casefolds them separately. How is
>>>it superior to the two-line function?
>>
>> When the strings are long, casefolding both strings
>> just to be able to tell that the first character of
>> the left string is »;« while the first character of
>> the right string is »'« and so the result is »False«
>> might be slower than necessary.

Thanks Stefan, that was my reasoning.

Also, if the implementation were in C, doing the comparison character by
character is unlikely to be slower than doing the comparison all at once, since
the "all at once" comparison actually is character by character under the hood.

>> [chomp]
>> However, premature optimization is the root of all evil!
>
> Fair enough.

It's not premature optimization to plan ahead for obvious scenarios. Sometimes we
may want to compare two large strings. Calling casefold on them temporarily
doubles the memory taken by the two strings, which can be significant.
Assuming the method were written in C, you would be very unlikely to build up
two large temporary case-folded arrays before doing the comparison.

If I were subclassing str in pure Python, I wouldn't bother. The tradeoffs are
different.

> However, I'm more concerned about the possibility of a semantic
> difference between the two. Is it at all possible for the case folding
> of an entire string to differ from the concatenation of the case
> foldings of its individual characters?

I don't believe so, but I welcome correction.

> Additionally: a proper "case insensitive comparison" should almost
> certainly start with a Unicode normalization. But should it be NFC/NFD
> or NFKC/NFKD? IMO that's a good reason to leave it in the hands of the
> application.

Normalisation is orthogonal to comparisons and searches.
Python doesn't automatically normalise strings, as people have pointed out a bazillion times in the past, and it happily compares 'ö' LATIN SMALL LETTER O WITH DIAERESIS 'ö' LATIN SMALL LETTER O + COMBINING DIAERESIS as unequal. I don't propose to change that just so that we can get 'a' equals 'A' :-) -- Steve “Cheer up,” they said, “things could be worse.” So I cheered up, and sure enough, things got worse. -- https://mail.python.org/mailman/listinfo/python-list
Re: Case-insensitive string equality
On Thu, Aug 31, 2017 at 11:53 PM, Stefan Ram wrote: > Chris Angelico writes: >>On Thu, Aug 31, 2017 at 10:49 PM, Steve D'Aprano >> wrote: >>> On Thu, 31 Aug 2017 05:51 pm, Serhiy Storchaka wrote: 31.08.17 10:10, Steven D'Aprano ???: > def equal(s, t): > return s.casefold() == t.casefold() >>The method you proposed seems a little odd - it steps through the >>strings character by character and casefolds them separately. How is >>it superior to the two-line function? > > When the strings are long, casefolding both strings > just to be able to tell that the first character of > the left string is »;« while the first character of > the right string is »'« and so the result is »False« > might be slower than necessary. > [chomp] > However, premature optimization is the root of all evil! Fair enough. However, I'm more concerned about the possibility of a semantic difference between the two. Is it at all possible for the case folding of an entire string to differ from the concatenation of the case foldings of its individual characters? Additionally: a proper "case insensitive comparison" should almost certainly start with a Unicode normalization. But should it be NFC/NFD or NFKC/NFKD? IMO that's a good reason to leave it in the hands of the application. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
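The question of whether folding a whole string can differ from folding its characters one by one is easy to probe, though a few samples are an experiment and not a proof. The sketch below checks some known awkward strings, and contrasts casefold() with lower(), which is context sensitive for the Greek final sigma. The function name and sample list are my own.

```python
def folds_per_char(s: str) -> bool:
    """True if casefolding s as a whole equals folding each character separately."""
    return s.casefold() == "".join(c.casefold() for c in s)


samples = ["Straße", "\N{LATIN SMALL LIGATURE FI}nd", "ΑΣ", "İstanbul"]
results = [folds_per_char(s) for s in samples]

# By contrast, lower() looks at context: a word-final capital sigma becomes
# the final form, which lowering each character on its own never produces.
word = "ΑΣ"
lower_differs = word.lower() != "".join(c.lower() for c in word)

print(results, lower_differs)
```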
Re: Case-insensitive string equality
31.08.17 15:49, Steve D'Aprano пише: On Thu, 31 Aug 2017 05:51 pm, Serhiy Storchaka wrote: 31.08.17 10:10, Steven D'Aprano пише: (iii) Not every two line function needs to be in the standard library. Just add this to the top of every module: def equal(s, t): return s.casefold() == t.casefold() This is my answer. Unsolved problems: This proposal doesn't help with sets and dicts, list.index and the `in` operator either. This is the end of the discussion. Your answer of an equal() function doesn't help with sets and dicts either. See rejected issue18986 [1], PEP 455 [2] and corresponding mailing lists discussions. The conclusion was that the need of such collections is low and the problems that they are purposed to solve can be solved with normal dict (as they are solved now). So I guess we're stuck with no good standard answer: - the easy two-line function doesn't even come close to solving the problem of case-insensitive string operations; It is not clear what is your problem exactly. The easy one-line function solves the problem of testing case-insensitive string equality. Regular expressions solve the problem of case-insensitive searching a position of a substring. If you asked a solution that magically prevent people from making simple programming mistakes, there is no such solution. [1] https://bugs.python.org/issue18986 [2] https://www.python.org/dev/peps/pep-0455/ -- https://mail.python.org/mailman/listinfo/python-list
Re: Case-insensitive string equality
On Thu, Aug 31, 2017 at 10:49 PM, Steve D'Aprano wrote: > On Thu, 31 Aug 2017 05:51 pm, Serhiy Storchaka wrote: > >> 31.08.17 10:10, Steven D'Aprano пише: >>> (iii) Not every two line function needs to be in the standard library. >>> Just add this to the top of every module: >>> >>> def equal(s, t): >>> return s.casefold() == t.casefold() >> >> This is my answer. >> >>> Unsolved problems: >>> >>> This proposal doesn't help with sets and dicts, list.index and the `in` >>> operator either. >> >> This is the end of the discussion. > > Your answer of an equal() function doesn't help with sets and dicts either. > > So I guess we're stuck with no good standard answer: > > - the easy two-line function doesn't even come close to solving the problem > of case-insensitive string operations; > > - but we can't have case-insensitive string operations because too many > people say "just use this two-line function". The method you proposed seems a little odd - it steps through the strings character by character and casefolds them separately. How is it superior to the two-line function? And it still doesn't solve any of your other cases. To make dicts/sets case insensitive, you really need something that people have periodically asked for: a dict with a key function. It would retain the actual key used, but for comparison purposes, would use the modified version. Then you could use str.casefold as your key function, and voila, you have a case insensitive dict. I don't know if that would solve your other problems though. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
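A dict with a key function is not in the standard library, but a small version can be sketched on top of collections.abc.MutableMapping. The class name `KeyedDict` and its constructor signature are invented for this example; it stores the most recently used spelling of each key alongside the value.

```python
from collections.abc import MutableMapping


class KeyedDict(MutableMapping):
    """Mapping that compares keys via keyfunc(key) but remembers the key as given."""

    def __init__(self, keyfunc, items=()):
        self._keyfunc = keyfunc
        self._data = {}  # keyfunc(key) -> (key as given, value)
        for key, value in dict(items).items():
            self[key] = value

    def __getitem__(self, key):
        return self._data[self._keyfunc(key)][1]

    def __setitem__(self, key, value):
        self._data[self._keyfunc(key)] = (key, value)

    def __delitem__(self, key):
        del self._data[self._keyfunc(key)]

    def __iter__(self):
        return (original for original, _ in self._data.values())

    def __len__(self):
        return len(self._data)


headers = KeyedDict(str.casefold)
headers["Content-Type"] = "text/plain"
print(headers["content-type"], "CONTENT-TYPE" in headers, list(headers))
```

Membership tests, get() and the other mapping methods come for free from the abstract base class, since they are defined in terms of the five methods above.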
Re: Case-insensitive string equality
On Thu, 31 Aug 2017 05:51 pm, Serhiy Storchaka wrote: > 31.08.17 10:10, Steven D'Aprano пише: >> (iii) Not every two line function needs to be in the standard library. >> Just add this to the top of every module: >> >> def equal(s, t): >> return s.casefold() == t.casefold() > > This is my answer. > >> Unsolved problems: >> >> This proposal doesn't help with sets and dicts, list.index and the `in` >> operator either. > > This is the end of the discussion. Your answer of an equal() function doesn't help with sets and dicts either. So I guess we're stuck with no good standard answer: - the easy two-line function doesn't even come close to solving the problem of case-insensitive string operations; - but we can't have case-insensitive string operations because too many people say "just use this two-line function". -- Steve “Cheer up,” they said, “things could be worse.” So I cheered up, and sure enough, things got worse. -- https://mail.python.org/mailman/listinfo/python-list
Re: Case-insensitive string equality
On 31/08/17 08:10, Steven D'Aprano wrote:
> So I'd like to propose some additions to 3.7 or 3.8. If the feedback here
> is positive, I'll take it to Python-Ideas for the negative feedback :-)
>
>
> (1) Add a new string method, which performs a case-insensitive equality
> test. Here is a potential implementation, written in pure Python:
>
>
> def equal(self, other):
>     if self is other:
>         return True
>     if not isinstance(other, str):
>         raise TypeError
>     if len(self) != len(other):
>         return False
>     casefold = str.casefold
>     for a, b in zip(self, other):
>         if casefold(a) != casefold(b):
>             return False
>     return True

I'd quibble about the name and the implementation (length is not
preserved under casefolding), but I'd go for this. The number of times
I've written something like this in different languages...

> Alternatively: how about a === triple-equals operator to do the same
> thing?

Much less keen on new syntax, especially when other languages use it for
other purposes.

> (2) Add keyword-only arguments to str.find and str.index:
>
>     casefold=False
>
> which does nothing if false (the default), and switches to a case-
> insensitive search if true.

There's an implementation argument to be had about whether separate
casefolded methods would be better, but yes.

> Unsolved problems:
>
> This proposal doesn't help with sets and dicts, list.index and the `in`
> operator either.

The only way I can think of to get much traction with this is to have a
separate case-insensitive string class. That feels a bit heavyweight,
though.

--
Rhodri James *-* Kynesim Ltd
Re: Case-insensitive string equality
31.08.17 10:10, Steven D'Aprano wrote:
> (iii) Not every two line function needs to be in the standard library.
> Just add this to the top of every module:
>
> def equal(s, t):
>     return s.casefold() == t.casefold()

This is my answer.

> Unsolved problems:
>
> This proposal doesn't help with sets and dicts, list.index and the `in`
> operator either.

This is the end of the discussion.
Re: Case-insensitive string equality
IMO this should be solved by a company-wide library, and I would go in
the direction of a Normalized_String class.

This has the advantages that (1) the company can choose whatever
normalization suits them, since not all cases are suited by comparing
case-insensitively, (2) individual devs in the company don't have to
write their own, and (3) Normalized_Strings can be keys in dictionaries
and members of a set.

On 31-08-17 at 09:10, Steven D'Aprano wrote:
> Three times in the last week the devs where I work accidentally
> introduced bugs into our code because of a mistake with case-insensitive
> string comparisons. They managed to demonstrate three different failures:
>
> # 1
> a = something().upper() # normalise string
> ... much later on
> if a == b.lower(): ...
>
>
> # 2
> a = something().upper()
> ... much later on
> if a == 'maildir': ...
>
>
> # 3
> a = something() # unnormalised
> assert 'foo' in a
> ... much later on
> pos = a.find('FOO')
>
>
>
> Not every two line function needs to be in the standard library, but I've
> come to the conclusion that case-insensitive testing and searches should
> be. I've made these mistakes myself at times, as I'm sure most people
> have, and I'm tired of writing my own case-insensitive function over and
> over again.
>
>
> So I'd like to propose some additions to 3.7 or 3.8. If the feedback here
> is positive, I'll take it to Python-Ideas for the negative feedback :-)
>
>
> (1) Add a new string method, which performs a case-insensitive equality
> test. Here is a potential implementation, written in pure Python:
>
>
> def equal(self, other):
>     if self is other:
>         return True
>     if not isinstance(other, str):
>         raise TypeError
>     if len(self) != len(other):
>         return False
>     casefold = str.casefold
>     for a, b in zip(self, other):
>         if casefold(a) != casefold(b):
>             return False
>     return True
>
> Alternatively: how about a === triple-equals operator to do the same
> thing?
> > > > (2) Add keyword-only arguments to str.find and str.index: > > casefold=False > > which does nothing if false (the default), and switches to a case- > insensitive search if true. > > > > > Alternatives: > > (i) Do nothing. The status quo wins a stalemate. > > (ii) Instead of str.find or index, use a regular expression. > > This is less discoverable (you need to know regular expressions) and > harder to get right than to just call a string method. Also, I expect > that invoking the re engine just for case insensitivity will be a lot > more expensive than a simple search need be. > > (iii) Not every two line function needs to be in the standard library. > Just add this to the top of every module: > > def equal(s, t): > return s.casefold() == t.casefold() > > > That's the status quo wins again. It's an annoyance. A small annoyance, > but multiplied by the sheer number of times it happens, it becomes a > large annoyance. I believe the annoyance factor of case-insensitive > comparisons outweighs the "two line function" objection. > > And the two-line "equal" function doesn't solve the problem for find and > index, or for sets dicts, list.index and the `in` operator either. > > > Unsolved problems: > > This proposal doesn't help with sets and dicts, list.index and the `in` > operator either. > > > > Thoughts? > > > -- https://mail.python.org/mailman/listinfo/python-list
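The Normalized_String idea from the top of this message might look roughly like the sketch below. The class layout, the attribute names and the default of str.casefold are all my assumptions; a real in-house library would pick its own normalization.

```python
class NormalizedString:
    """Wrapper whose equality and hash use a normalized form of the text."""

    def __init__(self, text, normalize=str.casefold):
        self.text = text                # the string as originally written
        self._key = normalize(text)     # the form used for comparisons

    def __eq__(self, other):
        if isinstance(other, NormalizedString):
            return self._key == other._key
        return NotImplemented

    def __hash__(self):
        return hash(self._key)

    def __repr__(self):
        return "NormalizedString(%r)" % self.text


a = NormalizedString("MailDir")
b = NormalizedString("maildir")
mailboxes = {a: "local", NormalizedString("IMAP"): "remote"}
print(a == b, len({a, b}), mailboxes[b])
```

Because __eq__ and __hash__ agree on the normalized key, instances work as dict keys and set members without any change to dict or set themselves. Mixing instances built with different normalize functions is not handled here.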
Case-insensitive string equality
Three times in the last week the devs where I work accidentally introduced bugs into our code because of a mistake with case-insensitive string comparisons. They managed to demonstrate three different failures: # 1 a = something().upper() # normalise string ... much later on if a == b.lower(): ... # 2 a = something().upper() ... much later on if a == 'maildir': ... # 3 a = something() # unnormalised assert 'foo' in a ... much later on pos = a.find('FOO') Not every two line function needs to be in the standard library, but I've come to the conclusion that case-insensitive testing and searches should be. I've made these mistakes myself at times, as I'm sure most people have, and I'm tired of writing my own case-insensitive function over and over again. So I'd like to propose some additions to 3.7 or 3.8. If the feedback here is positive, I'll take it to Python-Ideas for the negative feedback :-) (1) Add a new string method, which performs a case-insensitive equality test. Here is a potential implementation, written in pure Python: def equal(self, other): if self is other: return True if not isinstance(other, str): raise TypeError if len(self) != len(other): return False casefold = str.casefold for a, b in zip(self, other): if casefold(a) != casefold(b): return False return True Alternatively: how about a === triple-equals operator to do the same thing? (2) Add keyword-only arguments to str.find and str.index: casefold=False which does nothing if false (the default), and switches to a case- insensitive search if true. Alternatives: (i) Do nothing. The status quo wins a stalemate. (ii) Instead of str.find or index, use a regular expression. This is less discoverable (you need to know regular expressions) and harder to get right than to just call a string method. Also, I expect that invoking the re engine just for case insensitivity will be a lot more expensive than a simple search need be. (iii) Not every two line function needs to be in the standard library. 
Just add this to the top of every module: def equal(s, t): return s.casefold() == t.casefold() That's the status quo wins again. It's an annoyance. A small annoyance, but multiplied by the sheer number of times it happens, it becomes a large annoyance. I believe the annoyance factor of case-insensitive comparisons outweighs the "two line function" objection. And the two-line "equal" function doesn't solve the problem for find and index, or for sets dicts, list.index and the `in` operator either. Unsolved problems: This proposal doesn't help with sets and dicts, list.index and the `in` operator either. Thoughts? -- Steven D'Aprano “You are deluded if you think software engineers who can't write operating systems or applications without security holes, can write virtualization layers without security holes.” —Theo de Raadt -- https://mail.python.org/mailman/listinfo/python-list
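For comparison, alternative (ii) can be sketched with the standard re module. The function name `find_ignorecase` is made up for this example. One caveat worth knowing: re.IGNORECASE matches character by character, so it does not apply full case folding and will not treat "ß" as "ss".

```python
import re


def find_ignorecase(haystack: str, needle: str) -> int:
    """Case-insensitive find via the re module; returns -1 when not found."""
    match = re.search(re.escape(needle), haystack, re.IGNORECASE)
    return match.start() if match else -1


pos = find_ignorecase("some foo here", "FOO")
not_folded = find_ignorecase("Straße", "ss")   # re.IGNORECASE is not casefold()
print(pos, not_folded)
```

Unlike a search on casefolded copies, the index returned here refers to the original haystack, so it can be used for slicing.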