Re: can [[:digit:]] match something other than 0123456789?
Garrett Wollman wrote:
 |< said:
 |> Also, my feeling is that [[:digit:]] should match just the digits
 |> that are actually relevant for that locale, e.g., just "western"
 |> digits for en_GB. And fractions and superscripts are not digits.
 |
 |Implementations often use the same character definitions for all
 |locales using the same character set -- such as the Unicode character
 |data file, for Unicode-based locales. I think changing this may be a
 |tough sell for many implementers, just given the sheer number of
 |characters (and bikeshed-painting debates about which particular
 |character class or collation element should include which characters
 |in which locales would not be welcome).

..and bugs are everywhere, ... and take a long time to fix.

I think Unicode is pretty clear on what is a digit or a number, and what is not. And I think they no longer officially support the toolchain that can be used to turn Unicode data tables into Unix/POSIX-compliant (a.k.a. localedef) tables. But of course D'Amore from the Solaris faction seems to have done a great job, and Daroussin imported that into FreeBSD (not forgetting the "this is how i like OpenSource software" message, or something very close to that).

--steffen
RE: can [[:digit:]] match something other than 0123456789?
< said:
> Also, my feeling is that [[:digit:]] should match just the digits
> that are actually relevant for that locale, e.g., just "western"
> digits for en_GB. And fractions and superscripts are not digits.

Implementations often use the same character definitions for all locales using the same character set -- such as the Unicode character data file, for Unicode-based locales. I think changing this may be a tough sell for many implementers, just given the sheer number of characters (and bikeshed-painting debates about which particular character class or collation element should include which characters in which locales would not be welcome).

-GAWollman
Re: can [[:digit:]] match something other than 0123456789?
Stephane CHAZELAS wrote:
> Is that a POSIX invention (the [a-z] based on collation) by the
> way, or does it come from implementations that already existed
> at the time?

Around 1993, all major UNIX platforms used the same code that was derived from IBM. Maybe this is the background...

Jörg

--
EMail: jo...@schily.net (home) Jörg Schilling D-13353 Berlin
       joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL:   http://cdrecord.org/private/ http://sf.net/projects/schilytools/files/
RE: can [[:digit:]] match something other than 0123456789?
> -Original Message- > From: Stephane Chazelas [mailto:stephane.chaze...@gmail.com] > Sent: Sunday, May 20, 2018 10:43 PM > To: Geoff Clare > Cc: austin-group-l@opengroup.org > Subject: Re: can [[:digit:]] match something other than 0123456789? > > Note that having [x-y] be based on collation order would mean that things > like [a-z] > would also match on uppercase letters in the latin script in locales where > case is > not considered in the first weight for sorting (as is typical for English > locales for > instance). > > > Now, in a en_GB.UTF-8 locale on GNU/Linux (here ubuntu 16.04) for instance, > both > bash's and ksh93's [0-9] matches on at least > 142 different characters (see below). That matches on 0123456789 but also > digits 0 > (sometime 1) to 8 (sometimes 9 like for U+0669 which sorts the same as 9 > there!) > in other scripts, and some other random decimal digits, and some non-digits > and > is far from including all the plethora of other decimal digits in Unicode. > (unicode --max 0 --regexp > 'digit.(one|two|three|four|five|six|seven|eight|nine)\b' | > grep -c '^U+' > retuns 696 with an old version of unicode, and that doesn't even include > things like > roman numerals). I'd find [0-9] matching on just "western" digits and [[:digit:]] matching on the locale's digits the most natural solution. If someone wanted to match on Devanagari or whatever digits, she could simply list them in the bracket expression, rather than using "western" digits. If [0-9] is understood to be [[:digit:]], how could one differentiate between "western" and, say, Devanagari digits (other than listing them each explicitly, [0123456789], as Stephane has done)? Same goes for [a-z]: these should match (or should be) the Roman letters, not alphabetic characters in general. Also, my feeling is that [[:digit:]] should match just the digits that are actually relevant for that locale, e.g., just "western" digits for en_GB. And fractions and superscripts are not digits. 
If you really want to match any digit in any language, you could add a "Unicode" locale or perhaps region.
Re: can [[:digit:]] match something other than 0123456789?
2018-05-23 22:44:46 +0100, Stephane CHAZELAS: [...] > [a-z] is not guaranteed to match on lower case letters only let > alone abcdefghijklmnopqrstuvwxyz only, it may even match on > characters outside the latin script. [...] Actually, I suspect that POSIX requires ranges in the POSIX locale to be based on collation (and unspecified in other locale) so that [a-z] be guaranteed to match on abcdefghijklmnopqrstuvwxyz only even when the POSIX locale's charset is something like EBCDIC where those characters are not contiguous. It's ironic that doing that for other locale would break the expectation that [a-z] should match on abcdefghijklmnopqrstuvwxyz while the locale's charset has them in the correct order. Is that a POSIX invention (the [a-z] based on collation) by the way, or does it come from implementations that already existed at the time? What about the [.elt.], [=equiv=], [:class:]? Is it a POSIX invention of specification of prior art? I've come across a past discussion on the GNU grep mailing list suggesting the based-on-collation ranges should only be done when using [[.a.]-[.z.]], while [a-z] should be based on code point. That sounds to me like a nice idea. -- Stephane
Re: can [[:digit:]] match something other than 0123456789?
2018-05-22 13:49:20 +0100, Stephane CHAZELAS: [...] > In the case of the fnmatch and regexp of most systems, I don't > know how they make so that [0-9] only matches on 0123456789 or > [a-z] not on uppercase letters. Possibly, that's with special > cases as well. [...] Sorry, my bad. It looks like I was basing my conclusions on tests I thought I remembered doing but probably never did. [0-9] matches on characters other than 0123456789 on many systems with grep and system regexps as well. On Solaris 10, in a en_GB.UTF-8 locale, with /usr/xpg4/bin/grep, it matches on hundreds of different characters many of which have nothing to do with digits or are not even assigned in Unicode. Its [a-z] matches on ABC...WXY and hundreds more and even parts of characters like the 0xf0..0xf4 of characters U+1 to U+10. On FreeBSD, [0-9] matches on U+2185 ROMAN NUMERAL SIX LATE FORM in addition to 0123456789 (!?). In GNU locales, whether [a-z] matches BCD..WXY or not depends on the locale and the version of glibc. [0-9] does not always include only 0123456789 either. For instance, in a th_TH.UTF-8 locale, grep '[a-z]' matches on M and grep '[0-9]' matches on U+0E50 THAI DIGIT ZERO U+0E51 THAI DIGIT ONE U+0E52 THAI DIGIT TWO U+0E53 THAI DIGIT THREE U+0E54 THAI DIGIT FOUR U+0E55 THAI DIGIT FIVE U+0E56 THAI DIGIT SIX U+0E57 THAI DIGIT SEVEN U+0E58 THAI DIGIT EIGHT (note the missing DIGIT NINE which would sort after 9). So, that confirms that it's not only a bash/ksh93 "issue", [0-9] cannot be used to match 0123456789 only and what it matches is random and useless and not what one would ever want. [a-z] is not guaranteed to match on lower case letters only let alone abcdefghijklmnopqrstuvwxyz only, it may even match on characters outside the latin script. LC_ALL=C grep '[0-9]' Would be OK, but not in locales that use charsets that have characters that contain the encoding of digits (like GB18030, BIG5...). It was requested that [[:digit:]] match only on 0123456789. 
While in practice, it seems to be the case for things that use the POSIX API, it's not always the case outside of it (where [0-9] generally matches on 0123456789 but [[:digit:]] can match on all sorts of decimal digits). Like perl -Mopen=locale -ne 'print if /[[:digit:]]/' So it would seem that [0123456789] is the only portable way to match on 0123456789 only. -- Stephane
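The conclusion above (spell out the ten digits rather than rely on `[0-9]` or `[[:digit:]]`) can be sketched as a small line filter. This is only an illustration, not something posted in the thread: the function name `ascii_numbers_only` is hypothetical, and the sketch assumes a POSIX `grep` that supports `-x` (match whole lines).

```shell
# Hypothetical helper: print only those input lines that consist
# entirely of the characters 0123456789 (at least one of them).
# The digits are listed explicitly instead of writing [0-9] or
# [[:digit:]], so the result does not depend on the locale's
# collation order or character classes.
ascii_numbers_only() {
    grep -x '[0123456789][0123456789]*'
}
```

For example, `printf '%s\n' 42 4x2 007 | ascii_numbers_only` should print only the lines `42` and `007`.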
Re: can [[:digit:]] match something other than 0123456789?
On 5/22/18 6:32 AM, Joerg Schilling wrote:
>> bash's [a-z] still matches on A..Y or B..Z though (source of
>> much confusion, many bugs and lots of ranting), and that
>> makes me realise that bash is actually one of those utilities
>
> This strange and unexpected behavior did cause once that bash removed
> important files for me. Sorry, I don't remember which locale I used
> at that time.

Bash's bracket expression range expressions use the collating sequence, as Posix specifies (though Posix limits its definition to the Posix locale), so any locale that collates like aAbBcC...zZ will pick up upper and lower case characters. There is a shell option that allows you to control the behavior.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
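The message does not name the shell option. The sketch below assumes it is bash's `globasciiranges` (a `shopt` option that, to my knowledge, appeared in bash 4.3 and makes bracket ranges compare as in the C locale). The function name `range_matches_upper` is hypothetical, and the `shopt` call is guarded so that the snippet is harmless in other shells.

```shell
# Assumption: the option meant is bash's `globasciiranges`.
# Other shells have no `shopt`, so only try it under bash.
if [ -n "${BASH_VERSION-}" ]; then
    shopt -s globasciiranges 2>/dev/null ||
        echo 'globasciiranges is not supported by this bash' >&2
fi

# Hypothetical probe: succeed if [a-z] matches an uppercase letter
# in the current shell and locale.
range_matches_upper() {
    case B in
        [a-z]) return 0 ;;
        *)     return 1 ;;
    esac
}

if range_matches_upper; then
    echo '[a-z] matches B in this shell and locale'
else
    echo '[a-z] does not match B in this shell and locale'
fi
```

Running the probe before and after toggling the option, in a locale that collates like aAbBcC...zZ, is one way to see whether the option has any effect on a given system.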
Re: can [[:digit:]] match something other than 0123456789?
2018-05-22 12:32:20 +0200, Joerg Schilling: [...] > > bash's [a-z] still matches on A..Y or B..Z though (source of > > much consusion, many bugs and lots of ranting), and that > > makes me realise that bash is actually one of those utilities > > This strange and unexpected behavior did cause once that bash > removed important files for me. Sorry, I don't remember which > locale I used at that time. > > I would call this behavior a security risk. [...] Note that (AFAICT from testing) ksh93 behaves like bash in that its ranges are based on collation order, but it has an extra feature in that - if both ends of the range are lowercase letters (or collating elements whose first character is a lowercase letter), then [-] matches on collating elements in between and PROVIDED their first character is lowercase. That's why for instance m is matched by [a-z], [A-z], [a-Z] but not [A-Z] and in a Hungarian locale on a GNU system, Dz is matched by [A-Z] (even though it contains a lowercase letter) and not [a-z]. - and the corresponding case for uppercase letters In the case of the fnmatch and regexp of most systems, I don't know how they make so that [0-9] only matches on 0123456789 or [a-z] not on uppercase letters. Possibly, that's with special cases as well. Note that GNU grep/sed do match Dz with [A-Z] in Hungarian locales, but not GNU "find -name '[A-Z]'" (fnmatch doesn't seem to handle collating elements there). zsh's ranged are based on byte value in locales with single-byte charsets and unicode codepoint (wide character, which probably corresponds to unicode code point on all systems where zsh has been ported) in multi-byte ones. To me, that's the most useful approach (also the one of most modern languages). -- Stephane
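The shell-by-shell differences described above can be checked with a small probe. This is a minimal sketch, not a script from the thread: the function name `probe_ranges` is hypothetical, and what it prints depends entirely on the shell and locale it is run in.

```shell
# Hypothetical probe: for one character, report which of four ranges
# the current shell's pattern matching accepts in the current locale.
probe_ranges() {
    c=$1
    for r in '[a-z]' '[A-z]' '[a-Z]' '[A-Z]'; do
        # $r is deliberately unquoted in the case pattern so that the
        # shell treats its value as a pattern, not as a literal string.
        case $c in
            $r) printf '%s matches %s\n' "$c" "$r" ;;
            *)  printf '%s does not match %s\n' "$c" "$r" ;;
        esac
    done
}
```

Calling `probe_ranges m` or `probe_ranges M` under bash, ksh93 and zsh in the same locale gives a quick side-by-side comparison. Note that a range such as `[a-Z]` may be treated as invalid in some locales, so its result is not meaningful everywhere.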
Re: can [[:digit:]] match something other than 0123456789?
I listed digits that were consecutive. I did not list Japanese or Chinese digits, but it would be easy to include them as well: you could just include character classes like zero, one, two, etc.

Best regards
Keld

On Tue, May 22, 2018 at 02:15:16PM +0200, Joerg Schilling wrote:
> "k...@keldix.com" wrote:
>
> > I already cited text from 14652 and 30112. That would be fine.
>
> I mentioned already that japanese/chinese numbers are not consecutive.
>
> > On Tue, May 22, 2018 at 11:45:26AM +0200, Joerg Schilling wrote:
> > > "k...@keldix.com" wrote:
> > >
> > > > Well, if ctype.h does not cover the functionality that we want,
> > > > then we need to specify new functionality. WG14 is looking into
> > > > some reentrant functionality in this area, in something that
> > > > could be a TS.
> > >
> > > Could you please explain what functionallity you like to see?
>
> Jörg
Re: can [[:digit:]] match something other than 0123456789?
Stephane Chazelas wrote:
> Note that having [x-y] be based on collation order would mean
> that things like [a-z] would also match on uppercase letters in
> the latin script in locales where case is not considered in the
> first weight for sorting (as is typical for English locales for
> instance).
...
> bash's [a-z] still matches on A..Y or B..Z though (source of
> much confusion, many bugs and lots of ranting), and that
> makes me realise that bash is actually one of those utilities

This strange and unexpected behavior did cause once that bash removed important files for me. Sorry, I don't remember which locale I used at that time.

I would call this behavior a security risk.

Jörg
Re: can [[:digit:]] match something other than 0123456789?
"k...@keldix.com" wrote:
> Well, if ctype.h does not cover the functionality that we want, then we
> need to specify new functionality. WG14 is looking into some reentrant
> functionality in this area, in something that could be a TS.

Could you please explain what functionality you would like to see?

Jörg
Re: can [[:digit:]] match something other than 0123456789?
2018-05-16 09:42:56 +0100, Geoff Clare:
> Stephane Chazelas wrote, on 15 May 2018:
> >
> > OK, so to rephrase and make sure I understand correctly. In
> > locales other than C, [[:digit:]] will be guaranteed to match on
> > 0123456789 only but not [0-9]. 0123456789 are guaranteed to be
> > in that order but [0-9] is unspecified anyway outside of the C
> > locale.
> >
> > That's a bit counter-intuitive
>
> Not really, when you consider that ranges should use the collation
> sequence, not character encodings. (For the C/POSIX locale that's
> required - for others it's not, but it's the obvious way to implement
> ranges with multibyte characters.)
>
> In languages where there are alternative "digit" representations,
> the locale definition might give the various representations of each
> "digit" the same primary weight in the collating sequence, in which
> case [0-9] would include some characters that are not true digits
> (according to iswdigit()).
[...]

Thanks all for replying.

Note that having [x-y] be based on collation order would mean that things like [a-z] would also match on uppercase letters in the latin script in locales where case is not considered in the first weight for sorting (as is typical for English locales for instance).

Hardly any implementation does it anymore. IIRC, GNU grep used to have [a-z] match on ABCDEF...Y in some locales, but it doesn't anymore, probably because they got too many bug reports about that. I don't know how they do it now. It still seems to be somehow based on collation order, as [a-e] matches on áâà, ć (but not êèé which come after e, evidence that it's not useful), on dz in Hungarian and so on, but not on ABC, and [0-9] matches only on 0123456789, and that's the same on all systems I tried including certified ones like Solaris.
bash's [a-z] still matches on A..Y or B..Z though (source of much confusion, many bugs and lots of ranting), and that makes me realise that bash is actually one of those utilities where [0-9] matches something other than 0123456789. When I said initially I wasn't aware of any that did, I had only considered fnmatch and EREs.

Now, in a en_GB.UTF-8 locale on GNU/Linux (here ubuntu 16.04) for instance, both bash's and ksh93's [0-9] matches on at least 142 different characters (see below). That matches on 0123456789 but also digits 0 (sometimes 1) to 8 (sometimes 9 like for U+0669 which sorts the same as 9 there!) in other scripts, and some other random decimal digits, and some non-digits, and is far from including all the plethora of other decimal digits in Unicode. (unicode --max 0 --regexp 'digit.(one|two|three|four|five|six|seven|eight|nine)\b' | grep -c '^U+' returns 696 with an old version of unicode, and that doesn't even include things like roman numerals).

Now when sorting text, that order makes as much sense as any. If some text ever happened to contain both English and Devanagari digits, it could make very much sense to sort them next to each other, but it makes little sense for [0-9] to match on those.

Using [0-9] is very common to validate input and make sure it contains only digits, for instance like:

    case $input in
      "" | *[!0-9]*) die invalid
    esac

That means it needs to be changed to *[!0123456789]* to actually work.
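The corrected validation can be wrapped in a function. This is a minimal sketch under the assumption that only the ten characters 0123456789 should be accepted. The name `is_ascii_decimal` is hypothetical and does not come from the thread.

```shell
# Hypothetical helper: succeed only if $1 is a non-empty string made up
# solely of the characters 0123456789. The digits are listed explicitly,
# so the check does not depend on how the locale collates [0-9].
is_ascii_decimal() {
    case $1 in
        '' | *[!0123456789]*) return 1 ;;
        *) return 0 ;;
    esac
}
```

A caller might write `is_ascii_decimal "$input" || die invalid`, with `die` being whatever error routine the script already has.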
What bash/ksh93's [0-9] match in en_GB.UTF-8: U+0030 DIGIT ZERO U+0031 DIGIT ONE U+0032 DIGIT TWO U+0033 DIGIT THREE U+0034 DIGIT FOUR U+0035 DIGIT FIVE U+0036 DIGIT SIX U+0037 DIGIT SEVEN U+0038 DIGIT EIGHT U+0039 DIGIT NINE U+00B2 SUPERSCRIPT TWO U+00B3 SUPERSCRIPT THREE U+00B9 SUPERSCRIPT ONE U+00BC VULGAR FRACTION ONE QUARTER U+00BD VULGAR FRACTION ONE HALF U+00BE VULGAR FRACTION THREE QUARTERS U+0660 ARABIC-INDIC DIGIT ZERO U+0661 ARABIC-INDIC DIGIT ONE U+0662 ARABIC-INDIC DIGIT TWO U+0663 ARABIC-INDIC DIGIT THREE U+0664 ARABIC-INDIC DIGIT FOUR U+0665 ARABIC-INDIC DIGIT FIVE U+0666 ARABIC-INDIC DIGIT SIX U+0667 ARABIC-INDIC DIGIT SEVEN U+0668 ARABIC-INDIC DIGIT EIGHT U+0669 ARABIC-INDIC DIGIT NINE U+06F0 EXTENDED ARABIC-INDIC DIGIT ZERO U+06F1 EXTENDED ARABIC-INDIC DIGIT ONE U+06F2 EXTENDED ARABIC-INDIC DIGIT TWO U+06F3 EXTENDED ARABIC-INDIC DIGIT THREE U+06F4 EXTENDED ARABIC-INDIC DIGIT FOUR U+06F5 EXTENDED ARABIC-INDIC DIGIT FIVE U+06F6 EXTENDED ARABIC-INDIC DIGIT SIX U+06F7 EXTENDED ARABIC-INDIC DIGIT SEVEN U+06F8 EXTENDED ARABIC-INDIC DIGIT EIGHT U+0966 DEVANAGARI DIGIT ZERO U+0967 DEVANAGARI DIGIT ONE U+0968 DEVANAGARI DIGIT TWO U+0969 DEVANAGARI DIGIT THREE U+096A DEVANAGARI DIGIT FOUR U+096B DEVANAGARI DIGIT FIVE U+096C DEVANAGARI DIGIT SIX U+096D DEVANAGARI DIGIT SEVEN U+096E DEVANAGARI DIGIT EIGHT U+09E6 BENGALI DIGIT ZERO U+09E7 BENGALI DIGIT ONE U+09E8 BENGALI DIGIT TWO U+09E9 BENGALI DIGIT THREE U+09EA BENGALI DIGIT FOUR U+09EB BENGALI DIGIT FIVE U+09EC BENGALI DIGIT SIX U+09ED BENGALI DIGIT SEVEN U+09EE BENGALI DIGIT EIGHT U+0A66 GURMUKHI DIGIT ZERO U+0A67 GURMUKHI DIGIT ONE U+0A68 GURMUKHI DIGIT TWO U+0A69 GURMUKHI DIGIT THREE U+0A6A GURMUKHI DIGIT
Re: can [[:digit:]] match something other than 0123456789?
On Fri, May 18, 2018 at 01:35:03PM -0500, Eric Blake wrote: > On 05/18/2018 12:24 PM, Wheeler, David A wrote: > >This conversation seems strange; many locales use digits other than 0-9 to > >represent numbers. > > > >The Eastern Arabic, Perso-Arabic variant, and Urdu variant all have > >digits, they just aren't 0-9. In Unicode/ISO-646 in particular there are > >the digits U+0660 through U+0669 and U+06F0 through U+06F9. When I > >visited Saudi Arabia I saw the Eastern Arabic digits everywhere, not just > >0-9. For more: > >https://en.wikipedia.org/wiki/Eastern_Arabic_numerals > > > >Here's an example, U+0662: > >http://www.fileformat.info/info/unicode/char/0662/index.htm > >This is a decimal digit with value 2. Java agrees. > > > >It sounds like there are different use cases. Maybe there needs to be a > >standard way to represent different cases, e.g., "exactly 0-9", "a digit > >in the current locale", and "a member of Unicode Character Category > >'Number, Decimal Digit'". I don't know if there's a need to distinguish > >the second and third cases. It seems to me that [[:digit::]] should mean > >the second or third case. > > The problem is that the definition of isdigit() means only the > first case (exactly the locale-independent 10 digits in the portable > file name character set, whether locales are based on ASCII or EBCDIC), > and the definition of [[:FOO:]] defers to isFOO() where > possible. 
Yes, it may be nice to have additional classification > routines, but as has been pointed out elsewhere in this thread, doing it > solely by one character at a time may not be sufficient to capture all > Unicode rules compared to what people really want to search for (for > example, when searching for a character with an accent, you want to be > able to find both the composed character, and the sequence of a plain > character plus combining mark character, that both represent the same > concept, but an iswFOO() test does not work on the latter example, since > it occupies more than one character). Well, if ctype.h does not cover the functionality that we want, then we need to specify new functionality. WG14 is looking into some reentrant functionality in this area, in something that could be a TS. Also for the comparison, SC35/wg5 has specified an API that takes care of much of these problems, both present in 14652 and 30112. This is an API that was meant for 14651 (the ISO sort standard) but had resistance from the Unicode people. Also the bidi spec was proposed for 10646 but some Unicode people resisted it. I get the impression that some people do not want ISO to specify things in this area which is not controlled by unicode. Best regards keld
Re: can [[:digit:]] match something other than 0123456789?
On 05/18/2018 12:24 PM, Wheeler, David A wrote: This conversation seems strange; many locales use digits other than 0-9 to represent numbers. The Eastern Arabic, Perso-Arabic variant, and Urdu variant all have digits, they just aren't 0-9. In Unicode/ISO-646 in particular there are the digits U+0660 through U+0669 and U+06F0 through U+06F9. When I visited Saudi Arabia I saw the Eastern Arabic digits everywhere, not just 0-9. For more: https://en.wikipedia.org/wiki/Eastern_Arabic_numerals Here's an example, U+0662: http://www.fileformat.info/info/unicode/char/0662/index.htm This is a decimal digit with value 2. Java agrees. It sounds like there are different use cases. Maybe there needs to be a standard way to represent different cases, e.g., "exactly 0-9", "a digit in the current locale", and "a member of Unicode Character Category 'Number, Decimal Digit'". I don't know if there's a need to distinguish the second and third cases. It seems to me that [[:digit::]] should mean the second or third case. The problem is that the definition of isdigit() means only the first case (exactly the locale-independent 10 digits in the portable file name character set, whether locales are based on ASCII or EBCDIC), and the definition of [[:FOO:]] defers to isFOO() where possible. Yes, it may be nice to have additional classification routines, but as has been pointed out elsewhere in this thread, doing it solely by one character at a time may not be sufficient to capture all Unicode rules compared to what people really want to search for (for example, when searching for a character with an accent, you want to be able to find both the composed character, and the sequence of a plain character plus combining mark character, that both represent the same concept, but an iswFOO() test does not work on the latter example, since it occupies more than one character). -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
RE: can [[:digit:]] match something other than 0123456789?
This conversation seems strange; many locales use digits other than 0-9 to represent numbers.

The Eastern Arabic, Perso-Arabic variant, and Urdu variant all have digits, they just aren't 0-9. In Unicode/ISO 10646 in particular there are the digits U+0660 through U+0669 and U+06F0 through U+06F9. When I visited Saudi Arabia I saw the Eastern Arabic digits everywhere, not just 0-9. For more:
https://en.wikipedia.org/wiki/Eastern_Arabic_numerals

Here's an example, U+0662:
http://www.fileformat.info/info/unicode/char/0662/index.htm
This is a decimal digit with value 2. Java agrees.

It sounds like there are different use cases. Maybe there needs to be a standard way to represent different cases, e.g., "exactly 0-9", "a digit in the current locale", and "a member of Unicode Character Category 'Number, Decimal Digit'". I don't know if there's a need to distinguish the second and third cases. It seems to me that [[:digit:]] should mean the second or third case.

--- David A. Wheeler
Re: can [[:digit:]] match something other than 0123456789?
On Thu, May 17, 2018 at 12:36:35PM +0200, Hans Åberg wrote: > > > On 17 May 2018, at 11:02, Joerg Schilling > >wrote: > > > > Hans Åberg wrote: > > > |I asked a person who speaks japanese and he told me that > | > | "\u4e00\u4e8c\u4e09" > | > |is similar to > | > | "one two three" > | > |and this is not used for computing. > > If i recall correctly this has been discussed already; if not here > then on the Unicode list. Unicode brings quite a lot of > codepoints, like CIRCLED DIGIT ONE, PARENTHESIZED DIGIT ONE, DIGIT > ONE FULL STOP etc. All these are marked "No", and i think the > discussion concluded that they should not be taken into account > when converting strings to numbers. > >> > >> The intent may be that the value of the digit character c can be computed > >> by the expression c - '0' when >= 0 and <= 9, and is otherwise a > >> non-digit. Then 'isdigit' and [[:digit:]] are tied to that, so it is > >> impossible to use any other decimal digits. > > > > This seems to be an important idea, as this japanese one two three > > is not in a contiguous order. > > It provides an efficient implementation, important on earlier computers. The > UTF-8 article [1], "History", mentions that they struggled around 1992 to > find proposals for that providing efficient implementations. > > 1. https://en.wikipedia.org/wiki/UTF-8 Oh, well. You should be able to implement efficient code for the specs from 14652 and 30112, one would be that you, after testing for isdigit, the you index into a 4-bit table with the binary value corresponding to the digit character. This is probably on par speedwise with subtracting the value for zero. Best regards keld
Re: can [[:digit:]] match something other than 0123456789?
> On 17 May 2018, at 11:02, Joerg Schilling >wrote: > > Hans Åberg wrote: > |I asked a person who speaks japanese and he told me that | | "\u4e00\u4e8c\u4e09" | |is similar to | | "one two three" | |and this is not used for computing. If i recall correctly this has been discussed already; if not here then on the Unicode list. Unicode brings quite a lot of codepoints, like CIRCLED DIGIT ONE, PARENTHESIZED DIGIT ONE, DIGIT ONE FULL STOP etc. All these are marked "No", and i think the discussion concluded that they should not be taken into account when converting strings to numbers. >> >> The intent may be that the value of the digit character c can be computed by >> the expression c - '0' when >= 0 and <= 9, and is otherwise a non-digit. >> Then 'isdigit' and [[:digit:]] are tied to that, so it is impossible to use >> any other decimal digits. > > This seems to be an important idea, as this japanese one two three > is not in a contiguous order. It provides an efficient implementation, important on earlier computers. The UTF-8 article [1], "History", mentions that they struggled around 1992 to find proposals for that providing efficient implementations. 1. https://en.wikipedia.org/wiki/UTF-8
Re: can [[:digit:]] match something other than 0123456789?
On Thu, May 17, 2018 at 11:02:48AM +0200, Joerg Schilling wrote:
> Hans Åberg wrote:
>
> > >> |I asked a person who speaks japanese and he told me that
> > >> |
> > >> | "\u4e00\u4e8c\u4e09"
> > >> |
> > >> |is similar to
> > >> |
> > >> | "one two three"
> > >> |
> > >> |and this is not used for computing.
> > >>
> > >> If i recall correctly this has been discussed already; if not here
> > >> then on the Unicode list. Unicode brings quite a lot of
> > >> codepoints, like CIRCLED DIGIT ONE, PARENTHESIZED DIGIT ONE, DIGIT
> > >> ONE FULL STOP etc. All these are marked "No", and i think the
> > >> discussion concluded that they should not be taken into account
> > >> when converting strings to numbers.
> >
> > The intent may be that the value of the digit character c can be computed
> > by the expression c - '0' when >= 0 and <= 9, and is otherwise a non-digit.
> > Then 'isdigit' and [[:digit:]] are tied to that, so it is impossible to use
> > any other decimal digits.
>
> This seems to be an important idea, as this japanese one two three
> is not in a contiguous order.

Well, the digits in other scripts are ordered consecutively, so the calculation could easily be done for the scripts I previously documented, as prescribed in ISO 14652. This is not rocket science.

Best regards
keld
Re: can [[:digit:]] match something other than 0123456789?
Hans Åbergwrote: > >> |I asked a person who speaks japanese and he told me that > >> | > >> | "\u4e00\u4e8c\u4e09" > >> | > >> |is similar to > >> | > >> | "one two three" > >> | > >> |and this is not used for computing. > >> > >> If i recall correctly this has been discussed already; if not here > >> then on the Unicode list. Unicode brings quite a lot of > >> codepoints, like CIRCLED DIGIT ONE, PARENTHESIZED DIGIT ONE, DIGIT > >> ONE FULL STOP etc. All these are marked "No", and i think the > >> discussion concluded that they should not be taken into account > >> when converting strings to numbers. > > The intent may be that the value of the digit character c can be computed by > the expression c - '0' when >= 0 and <= 9, and is otherwise a non-digit. Then > 'isdigit' and [[:digit:]] are tied to that, so it is impossible to use any > other decimal digits. This seems to be an important idea, as this japanese one two three is not in a contiguous order. Jörg -- EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.org/private/ http://sf.net/projects/schilytools/files/'
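The `c - '0'` computation quoted above relies on the ten digits being contiguous in the character set. A minimal shell sketch of the same idea follows. The function name `digit_value` is hypothetical, and the sketch assumes a `printf` that follows the POSIX rule that an argument with a leading single quote is converted to the numeric value of the character after it.

```shell
# Hypothetical helper: print the numeric value of one digit character,
# mirroring the C expression  c - '0'.
digit_value() {
    case $1 in
        [0123456789])
            # printf '%d' "'X" prints the numeric code of character X.
            code=$(printf '%d' "'$1")
            zero=$(printf '%d' "'0")
            echo $((code - zero))
            ;;
        *)
            return 1   # not one of the ten digits
            ;;
    esac
}
```

The same subtraction would not work for "\u4e00\u4e8c\u4e09", since those three code points are not adjacent, which is the point made in the message above.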
Re: can [[:digit:]] match something other than 0123456789?
> On 16 May 2018, at 18:13, Hans Åberg wrote:
>
>> On 16 May 2018, at 17:14, Steffen Nurpmeso wrote:
>>
>> Joerg Schilling wrote:
>> |Steffen Nurpmeso wrote:
>> |>|> have some Unicode support.
>> |>|
>> |>|What do you expect:
>> |>|
>> |>| strtol("\u4e00\u4e8c\u4e09", , 0);
>> |>
>> |> The entire is*() family cannot work with multibyte or stateful
>> |> encodings, right.
>> |
>> |I asked a person who speaks japanese and he told me that
>> |
>> | "\u4e00\u4e8c\u4e09"
>> |
>> |is similar to
>> |
>> | "one two three"
>> |
>> |and this is not used for computing.
>>
>> If i recall correctly this has been discussed already; if not here
>> then on the Unicode list. Unicode brings quite a lot of
>> codepoints, like CIRCLED DIGIT ONE, PARENTHESIZED DIGIT ONE, DIGIT
>> ONE FULL STOP etc. All these are marked "No", and i think the
>> discussion concluded that they should not be taken into account
>> when converting strings to numbers.

The intent may be that the value of the digit character c can be computed by the expression c - '0' when >= 0 and <= 9, and is otherwise a non-digit. Then 'isdigit' and [[:digit:]] are tied to that, so it is impossible to use any other decimal digits.
Re: can [[:digit:]] match something other than 0123456789?
> On 16 May 2018, at 17:14, Steffen Nurpmesowrote: > > Joerg Schilling wrote: > |Steffen Nurpmeso wrote: > |>|> have some Unicode support. > |>| > |>|What do you expect: > |>| > |>| strtol("\u4e00\u4e8c\u4e09", , 0); > |> > |> The entire is*() family cannot work with multibyte or stateful > |> encodings, right. > | > |I asked a person who speaks japanese and he told me that > | > | "\u4e00\u4e8c\u4e09" > | > |is similar to > | > | "one two three" > | > |and this is not used for computing. > > If i recall correctly this has been discussed already; if not here > then on the Unicode list. Unicode brings quite a lot of > codepoints, like CIRCLED DIGIT ONE, PARENTHESIZED DIGIT ONE, DIGIT > ONE FULL STOP etc. All these are marked "No", and i think the > discussion concluded that they should not be taken into account > when converting strings to numbers. Hans Åberg surely knows > better than I. I am happier the less I know about these issues, and UTF-8 was invented to help with that! :-) It was ICU Regular Expressions I had in mind, which can do matching on all Unicode classes this link says, including case insensitive matching where the cases have different length. http://userguide.icu-project.org/strings/regexp So as for the original question, I think the question is something like that one is supposed to define a C character set, and then those C functions act against those. Harbison & Steele says that the isdigit function tests if it is one of the ten digits one has defined, which is what [[:digit:]] is supposed to match, I think. So you can define your locale to have whatever ten characters you like and render them as you please as long as they are ten and are contiguous and have the intended function as decimal digits. Or so I think. If one wants other character classes matching outside of that, it is safest to do as ICU Regular Expressions, defining with respect to Unicode.
Re: can [[:digit:]] match something other than 0123456789?
Joerg Schillingwrote: |Steffen Nurpmeso wrote: |>|> have some Unicode support. |>| |>|What do you expect: |>| |>| strtol("\u4e00\u4e8c\u4e09", , 0); |> |> The entire is*() family cannot work with multibyte or stateful |> encodings, right. | |I asked a person who speaks japanese and he told me that | | "\u4e00\u4e8c\u4e09" | |is similar to | | "one two three" | |and this is not used for computing. If i recall correctly this has been discussed already; if not here then on the Unicode list. Unicode brings quite a lot of codepoints, like CIRCLED DIGIT ONE, PARENTHESIZED DIGIT ONE, DIGIT ONE FULL STOP etc. All these are marked "No", and i think the discussion concluded that they should not be taken into account when converting strings to numbers. Hans Åberg surely knows better than I. --steffen | |Der Kragenbaer,The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt)
Re: can [[:digit:]] match something other than 0123456789?
Steffen Nurpmesowrote: > |> have some Unicode support. > | > |What do you expect: > | > | strtol("\u4e00\u4e8c\u4e09", , 0); > > The entire is*() family cannot work with multibyte or stateful > encodings, right. I asked a person who speaks japanese and he told me that "\u4e00\u4e8c\u4e09" is similar to "one two three" and this is not used for computing. Jörg -- EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.org/private/ http://sf.net/projects/schilytools/files/'
Re: can [[:digit:]] match something other than 0123456789?
Joerg Schilling wrote:
 |Hans Åberg wrote:
 |>> On 16 May 2018, at 10:29, Joerg Schilling wrote:
 |>>
 |>> Robert Elz wrote:
 |>>
 |>>> How does one specify a locale for some area using Latin as its
 |>>> language, where I V X L C D M are the digits ?
 |>>
 |>> how do you like to specify a hexadecimal number in this locale?
 |>
 |> They have no need for that in Latin, as "hexa" is Greek. :-) Otherwise, \
 |> you might check what the ECMAscript and C++ regex library do, which \
 |> have some Unicode support.
 |
 |What do you expect:
 |
 |   strtol("\u4e00\u4e8c\u4e09", , 0);

The entire is*() family cannot work with multibyte or stateful
encodings, right. In my opinion it was an error to simply bolt
more and more functionality onto the old interfaces: take silent
thread-safety for standard I/O functions, or locale awareness for
functions which inherently cannot serve that purpose, like the
is*() family. Even the w*() family cannot work for all languages,
even if you do not use ISO 10646 codepoints in wchar_t, because of
the surrounding context that some languages had, have, and will
need -- that cannot be eaten up like a burger.

In fact almost all network protocol, cryptographic message syntax
(CMS) or whatever standards require plain ASCII as a base and only
sometimes require something else, mostly bounded. It is
beneficial to have a set of plain and reliable ASCII tools at hand
for these tasks. What is wrong with that? Granted, it could be
done in Cyrillic, Chinese, Korean, Japanese, or any other
language, but the history is a different one; and then, why not.
English can be an easy language if so desired, and is like that
for most standards, in the end.

(It can also be a very hard read, just take

  Roman soldiery, flung their gnarled arms over a thick carpet of
  the most delicious green sward; in some places they were
  intermingled with beeches, hollies, and copsewood of various
  descriptions, so closely as totally to intercept the level beams
  of the sinking sun; in others they receded from each other,
  forming those long sweeping vistas, in the intricacy of which
  the eye delights to lose itself, while imagination considers
  them as the paths to yet wilder scenes of silvan solitude.)

 |to return in a japanese locale and what do you expect:
 |
 |   strtol("0XC", , 0);
 |
 |to return in a latin locale?

That is hexadecimal for sure.

--steffen
Re: can [[:digit:]] match something other than 0123456789?
> On 16 May 2018, at 10:53, Joerg Schilling >wrote: > > Hans Åberg wrote: > >> >>> On 16 May 2018, at 10:29, Joerg Schilling >>> wrote: >>> >>> Robert Elz wrote: >>> How does one specify a locale for some area using Latin as its language, where I V X L C D M are the digits ? >>> >>> how do you like to specify a hexadecimal number in this locale? >> >> They have no need for that in Latin, as "hexa" is Greek. :-) Otherwise, you >> might check what the ECMAscript and C++ regex library do, which have some >> Unicode support. > > What do you expect: > > strtol("\u4e00\u4e8c\u4e09", , 0); > > to return in a japanese locale and what do you expect: > > strtol("0XC", , 0); > > to return in a latin locale? I'm on MacOS, which has no language set, only LC_CTYPE="UTF-8". And std::strtol does not seem to accept explicit Unicode strings [1]. And if you want to use Latin numerals, you should probably use "Ⅹ" U+2169 and "Ⅽ" U+216D, so it is a non-issue. 1. http://en.cppreference.com/w/cpp/string/byte/strtol
Re: can [[:digit:]] match something other than 0123456789?
On Wed, May 16, 2018 at 10:41:15AM +0200, Joerg Schilling wrote:
> Robert Elz wrote:
>
> > would be easy, but you say it also has to look for
> >
> >    (c) [[:latindigs:]]+
> >    (c) [[:vdigits:]]+
> >
> > (and how many more)? This is actually kind of important, as
> >
> >    (c) MMXVI
> >
> > type strings are not uncommon in certain environments (can't recall
> > ever seeing one written in Venusian though...)
>
> We discussed whether
>
>    \u4e00 \u4e8c \u4e09
>
> should be a valid number made of [[:digit:]] in a japanese locale,
> but it seems to be not a good idea.
>
> If we did this, any program that deals with digits would not only need
> to know the rules for the indian (frequently called arabic) numbers but
> also the rules for other schemes.

Well, for many scripts other than the normal ASCII digits, this is
already standardized. ISO 14652 says:

  digit   Define the characters to be classified as decimal digits.
          Digits corresponding to the values 0, 1, 2, 3, 4, 5, 6, 7, 8,
          and 9 can be specified in groups of 10 digits, and in
          ascending order of the values they represent. The digits of
          the portable character set are automatically included. If
          this keyword is not specified, the digits 0 through 9 of the
          portable character set automatically belong to this class,
          with application-defined character values. The "digit"
          keyword is used to specify which characters are accepted as
          digits in input to an application, such as characters typed
          in or scanned in from an input text file, and should list
          digits used with all the scripts supported by the FDCC-set.
          The keyword may be omitted.

And the standard i18n locale of 14652 has this for the digit class:

% The "digit" class of the "i18n" FDCC-set is reflecting
% the recommendations in TR 10176 annex A
digit /
% COLLECTION 1 BASIC LATIN/
   ..;/
% COLLECTION 15 ARABIC EXTENDED/
   ..;..;/
% COLLECTION 16 DEVANAGARI/
   ..;/
% COLLECTION 18 BENGALI/
   ..;/
% COLLECTION 18 GURMUKHI/
   ..;/
% COLLECTION 19 GUJARATI/
   ..;/
% COLLECTION 20 ORIYA/
   ..;/
% COLLECTION 21 TAMIL/
   <0>;..;/
% COLLECTION 22 TELUGU/
   ..;/
% COLLECTION 23 KANNADA/
   ..;/
% COLLECTION 24 MALAYALAM/
   ..;/
% COLLECTION 25 THAI/
   ..;/
% COLLECTION 26 LAO/
   ..;/
% COLLECTION 72 BASIC TIBETAN/
   ..;/
% COLLECTION 68 HALFWIDTH AND FULLWIDTH FORMS/
   ..
%

Best regards
keld
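The "groups of 10 digits, in ascending order" rule quoted above means a digit's value is its offset from the zero of its group. A rough sketch of that idea, assuming a small hand-picked table of Unicode zero code points (the table and the name decimal_value are illustrative only; this is not how any particular localedef or libc implements the class):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Code point of digit zero for a few scripts (illustrative subset). */
static const uint32_t zero_points[] = {
    0x0030, /* DIGIT ZERO (ASCII) */
    0x0660, /* ARABIC-INDIC DIGIT ZERO */
    0x06F0, /* EXTENDED ARABIC-INDIC DIGIT ZERO */
    0x0966, /* DEVANAGARI DIGIT ZERO */
    0x0E50, /* THAI DIGIT ZERO */
    0xFF10  /* FULLWIDTH DIGIT ZERO */
};

/* Hypothetical helper: value 0..9 if cp lies in one of the groups
 * of ten listed above, otherwise -1. */
int decimal_value(uint32_t cp)
{
    size_t i;

    assert(cp <= 0x10FFFF); /* must be a valid Unicode code point */
    for (i = 0; i < sizeof zero_points / sizeof zero_points[0]; i++)
        if (cp >= zero_points[i] && cp <= zero_points[i] + 9)
            return (int)(cp - zero_points[i]);
    return -1;
}
```

With this table, U+0663 maps to 3 and U+FF19 maps to 9, while U+4E00 (the ideograph for "one" discussed elsewhere in the thread) is not in any group of ten and yields -1.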
Re: can [[:digit:]] match something other than 0123456789?
For conforming charsets XBD 6 requires the range <0>-<9> to be
contiguous. By XBD 9.3.5, Rule 6, [:digit:] may include MBS elements
aside from the <0> to <9> in LC_CTYPE, but the range [0-9] depends on
whether additional characters have the same collation weight as digits.
If this is the case the locale may need to define collating symbols that
bracket the range of digits in the order list and use those in range
expressions to ensure everything the locale considers a decimal digit is
tested for. The collating sequence for Japan might be something like:

collating-symbol
collating-symbol
order_start forward
...
bgn-decimal
<0>          weight N
<1> <一>     weight N+1
<2> <二>
<3> <三>
...
<9>          weight N+8
end-decimal
...
order_end

and [0-9] would include and the other digits, but not . The range
[[.bgn-decimal.]-[.end-decimal.]] should include too. I'm ambivalent
about whether the standard should reserve symbol names like this for
common ranges like digits, though.

In a message dated 5/16/2018 4:49:44 AM Eastern Standard Time,
joerg.schill...@fokus.fraunhofer.de writes:

Geoff Clare wrote:

> Stephane Chazelas wrote, on 15 May 2018:
> >
> > OK, so to rephrase and make sure I understand correctly. In
> > locales other than C, [[:digit:]] will be guaranteed to match on
> > 0123456789 only but not [0-9]. 0123456789 are guaranteed to be
> > in that order but [0-9] is unspecified anyway outside of the C
> > locale.
> >
> > That's a bit counter-intuitive
>
> Not really, when you consider that ranges should use the collation
> sequence, not character encodings. (For the C/POSIX locale that's
> required - for others it's not, but it's the obvious way to implement
> ranges with multibyte characters.)

I believe the real problem is the IBM i18n implementation that internally
uses collating values to evaluate ranges. With characters, this can result
in strange effects, but it makes [[=o=]] easy to implement.

For digits, I would expect that there is no other glyph in between [0-9]
but it may not be contiguous in a collating value notation.

Jörg
Re: can [[:digit:]] match something other than 0123456789?
Hans Åbergwrote: > > > On 16 May 2018, at 10:29, Joerg Schilling > > wrote: > > > > Robert Elz wrote: > > > >> How does one specify a locale for some area using Latin as its > >> language, where I V X L C D M are the digits ? > > > > how do you like to specify a hexadecimal number in this locale? > > They have no need for that in Latin, as "hexa" is Greek. :-) Otherwise, you > might check what the ECMAscript and C++ regex library do, which have some > Unicode support. What do you expect: strtol("\u4e00\u4e8c\u4e09", , 0); to return in a japanese locale and what do you expect: strtol("0XC", , 0); to return in a latin locale? Jörg -- EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.org/private/ http://sf.net/projects/schilytools/files/'
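For reference, here is a small sketch of what strtol() does with these two inputs in the default "C" locale; it does not answer what a Japanese or Latin locale ought to do. The second argument is elided in the messages above, so passing an end pointer here is my own choice, and parse_auto is a made-up helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: parse s with base 0 (auto-detect 0x/0 prefixes)
 * and report how many bytes strtol() actually consumed. */
long parse_auto(const char *s, size_t *consumed)
{
    char *end;
    long value;

    assert(s != NULL && consumed != NULL);
    value = strtol(s, &end, 0);
    *consumed = (size_t)(end - s);
    return value;
}
```

In the "C" locale, parse_auto("0XC", &n) returns 12 with n == 3. For the UTF-8 encoding of U+4E00 U+4E8C U+4E09 (the bytes "\xe4\xb8\x80\xe4\xba\x8c\xe4\xb8\x89") it returns 0 with n == 0, meaning no conversion was performed.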
Re: can [[:digit:]] match something other than 0123456789?
> On 16 May 2018, at 10:29, Joerg Schilling >wrote: > > Robert Elz wrote: > >> How does one specify a locale for some area using Latin as its >> language, where I V X L C D M are the digits ? > > how do you like to specify a hexadecimal number in this locale? They have no need for that in Latin, as "hexa" is Greek. :-) Otherwise, you might check what the ECMAscript and C++ regex library do, which have some Unicode support. 1. http://en.cppreference.com/w/cpp/regex 2. http://en.cppreference.com/w/cpp/regex/ecmascript
Re: can [[:digit:]] match something other than 0123456789?
Geoff Clare wrote:

> Stephane Chazelas wrote, on 15 May 2018:
> >
> > OK, so to rephrase and make sure I understand correctly. In
> > locales other than C, [[:digit:]] will be guaranteed to match on
> > 0123456789 only but not [0-9]. 0123456789 are guaranteed to be
> > in that order but [0-9] is unspecified anyway outside of the C
> > locale.
> >
> > That's a bit counter-intuitive
>
> Not really, when you consider that ranges should use the collation
> sequence, not character encodings. (For the C/POSIX locale that's
> required - for others it's not, but it's the obvious way to implement
> ranges with multibyte characters.)

I believe the real problem is the IBM i18n implementation that internally
uses collating values to evaluate ranges. With characters, this can result
in strange effects, but it makes [[=o=]] easy to implement.

For digits, I would expect that there is no other glyph in between [0-9]
but it may not be contiguous in a collating value notation.

Jörg
Re: can [[:digit:]] match something other than 0123456789?
Stephane Chazelas wrote, on 15 May 2018:
>
> OK, so to rephrase and make sure I understand correctly. In
> locales other than C, [[:digit:]] will be guaranteed to match on
> 0123456789 only but not [0-9]. 0123456789 are guaranteed to be
> in that order but [0-9] is unspecified anyway outside of the C
> locale.
>
> That's a bit counter-intuitive

Not really, when you consider that ranges should use the collation
sequence, not character encodings. (For the C/POSIX locale that's
required - for others it's not, but it's the obvious way to implement
ranges with multibyte characters.)

In languages where there are alternative "digit" representations,
the locale definition might give the various representations of
each "digit" the same primary weight in the collating sequence,
in which case [0-9] would include some characters that are not
true digits (according to iswdigit()).

-- 
Geoff Clare
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
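To make the distinction concrete, a sketch that uses the named class rather than a range, so the result does not depend on how the locale collates. It uses the POSIX regcomp()/regexec() interface; the function name only_digits is an assumption of mine.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if s is a non-empty run of [[:digit:]]
 * characters and nothing else, 0 otherwise. */
int only_digits(const char *s)
{
    regex_t re;
    int rc;

    assert(s != NULL);
    /* Named class instead of the range [0-9], whose meaning is
     * unspecified outside the C/POSIX locale. */
    if (regcomp(&re, "^[[:digit:]]+$", REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Replacing the class with [0-9] would behave identically in the C locale; the point made above is that only the named class carries that guarantee elsewhere.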
Re: can [[:digit:]] match something other than 0123456789?
Robert Elz wrote:

> would be easy, but you say it also has to look for
>
>    (c) [[:latindigs:]]+
>    (c) [[:vdigits:]]+
>
> (and how many more)? This is actually kind of important, as
>
>    (c) MMXVI
>
> type strings are not uncommon in certain environments (can't recall
> ever seeing one written in Venusian though...)

We discussed whether

   \u4e00 \u4e8c \u4e09

should be a valid number made of [[:digit:]] in a Japanese locale,
but it does not seem to be a good idea.

If we did this, any program that deals with digits would need to know
not only the rules for the Indian (frequently called Arabic) numbers
but also the rules for other schemes.

Jörg
Re: can [[:digit:]] match something other than 0123456789?
Robert Elzwrote: > How does one specify a locale for some area using Latin as its > language, where I V X L C D M are the digits ? how do you like to specify a hexadecimal number in this locale? Jörg -- EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.org/private/ http://sf.net/projects/schilytools/files/'
Re: can [[:digit:]] match something other than 0123456789?
Yes, as static rosters it is nominally unworkable, so it isn't considered
portable enough to standardize, as far as I can see. K&R originally just
wanted to support decimal and octal in C, iirc, and octal only because DEC
did PDP core dumps that way. While Unicode provides some support for
rosters of arbitrary numbers as char32_t 'digits', this is still limited to
what an implementation is willing to support in terms of text fields and
numeric conversions; it is not something a portable application can add to
on the fly by defining a POSIX or CLDR locale with a "digit set factory"
that [:digit:] could be written to take into account automatically.

In a message dated 5/15/2018 7:29:49 PM Eastern Standard Time,
k...@munnari.oz.au writes:

    Date:        Tue, 15 May 2018 18:42:29 -0400
    From:        Shware Systems
    Message-ID:  <16365f81e7e-179a-29...@webjas-vab019.srv.aolmail.net>

  | That locale would define a latindigs charclass, same as Venusians are
  | required to define a vdigits for theirs, and it's up to the application
  | to do the equivalences to 1, 5, 10, 50, etc. in a latinstr2ull() routine.

That would be unworkable - it would mean that every application would
need to know the details of every locale that could possibly be used.

Eg: consider an application looking for copyright strings in files
(I can't type the c in a circle so I will use (c)). That is, a (c)
and a year (or a sequence of years perhaps), matching

   (c) [[:digit:]]+

would be easy, but you say it also has to look for

   (c) [[:latindigs:]]+
   (c) [[:vdigits:]]+

(and how many more)? This is actually kind of important, as

   (c) MMXVI

type strings are not uncommon in certain environments (can't recall
ever seeing one written in Venusian though...)

It gets worse if it is accepted that [:unknown:] is undefined/unspecified
rather than just "no match" - then the code actually has to adapt itself
to the locale that is actually in use, rather than simply covering all
known locales.

kre
Re: can [[:digit:]] match something other than 0123456789?
    Date:        Tue, 15 May 2018 18:42:29 -0400
    From:        Shware Systems
    Message-ID:  <16365f81e7e-179a-29...@webjas-vab019.srv.aolmail.net>

  | That locale would define a latindigs charclass, same as Venusians are
  | required to define a vdigits for theirs, and it's up to the application
  | to do the equivalences to 1, 5, 10, 50, etc. in a latinstr2ull() routine.

That would be unworkable - it would mean that every application would
need to know the details of every locale that could possibly be used.

Eg: consider an application looking for copyright strings in files
(I can't type the c in a circle so I will use (c)). That is, a (c)
and a year (or a sequence of years perhaps), matching

   (c) [[:digit:]]+

would be easy, but you say it also has to look for

   (c) [[:latindigs:]]+
   (c) [[:vdigits:]]+

(and how many more)? This is actually kind of important, as

   (c) MMXVI

type strings are not uncommon in certain environments (can't recall
ever seeing one written in Venusian though...)

It gets worse if it is accepted that [:unknown:] is undefined/unspecified
rather than just "no match" - then the code actually has to adapt itself
to the locale that is actually in use, rather than simply covering all
known locales.

kre
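The "easy" case from the message above can be sketched with POSIX regcomp()/regexec(). The function name has_copyright_year is invented for this example, and the literal parentheses have to be escaped in an extended regular expression.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if line contains "(c) " followed by at least
 * one [[:digit:]] character, 0 otherwise. */
int has_copyright_year(const char *line)
{
    regex_t re;
    int rc;

    assert(line != NULL);
    /* \( and \) match literal parentheses in an ERE. */
    if (regcomp(&re, "\\(c\\) [[:digit:]]+", REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

A line such as "(c) 2016 Example" matches, whereas "(c) MMXVI" does not, which is the gap the message is pointing at: covering it would need a separate pattern per numbering scheme.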
Re: can [[:digit:]] match something other than 0123456789?
That locale would define a latindigs charclass, same as Venusians are required to define a vdigits for theirs, and it's up to the application to do the equivalences to 1, 5, 10, 50, etc. in a latinstr2ull() routine. In a message dated 5/15/2018 6:31:31 PM Eastern Standard Time, k...@munnari.oz.au writes: Date: Tue, 15 May 2018 13:38:15 -0500 From: Eric BlakeMessage-ID: <08af8b99-dcf0-5775-3aed-533611cec...@redhat.com> | Please read http://austingroupbugs.net/view.php?id=1078 where this | wording has been tightened to cover ALL locales, not just the POSIX | locale, to better match with C requirements on isdigit(). How does one specify a locale for some area using Latin as its language, where I V X L C D M are the digits ? kre
Re: can [[:digit:]] match something other than 0123456789?
Stephane Chazelaswrote: |2018-05-15 16:55:45 -0500, Eric Blake: |> On 05/15/2018 03:43 PM, Stephane Chazelas wrote: |>>Does that mean that [0-9] is also guaranteed to match on |>>0123456789 only? And that then [[:digit:]] in regexp/fnmatch is |>>close to useless as it's longer than [0-9] |> |> Yes, I think that's a fair conclusion for the C locale, by virtue of the |> fact that the standard requires the encoding for 0-9 to be contiguous \ |> and in |> order. |> |>>and is a bit |>>misleading as it suggests it would be affected by localisation |>>(like the other character classes) while it's not. |> |> It's still useful in non-C locales within regexp, since ALL uses of - for |> ranges within [] has unspecified (or was it implementation-defined) |> semantics outside of the C locale. Using a named reference guarantees the |> desired semantics of exactly 10 characters, rather than skirting on the |> grounds of whether the range operator behaves as desired in all locales |> rather than just the C locale. |[...] | |OK, so to rephrase and make sure I understand correctly. In |locales other than C, [[:digit:]] will be guaranteed to match on |0123456789 only but not [0-9]. 0123456789 are guaranteed to be |in that order but [0-9] is unspecified anyway outside of the C |locale. | |That's a bit counter-intuitive and (as noted by @isaac at |https://unix.stackexchange.com/questions/414226/difference-between-0-9-digit\ |-and-d/414230?noredirect=1#comment804362_414230) |is the opposite of what perl (in unicode mode), php (in unicode |mode), pcre (with (*UCP)) do: their [0-9] matches 0123456789 |while their \d/[[:digit:]] match based on Unicode properties so |other decimal digits than the 0123456789 ones. Unicode knows about decimal numbers, hexdigits and ascii_hexdigit[s]. If i recall correctly the property of the former is to offer ten successive numbers which correspond to what we know as digits, while possibly looking different etc. 
Given the latter property it makes sense to treat [0-9] as ASCII compatible but let [:digit:] match whatever a language desires. --steffen | |Der Kragenbaer,The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt)
Re: can [[:digit:]] match something other than 0123456789?
Date:Tue, 15 May 2018 13:38:15 -0500 From:Eric BlakeMessage-ID: <08af8b99-dcf0-5775-3aed-533611cec...@redhat.com> | Please read http://austingroupbugs.net/view.php?id=1078 where this | wording has been tightened to cover ALL locales, not just the POSIX | locale, to better match with C requirements on isdigit(). How does one specify a locale for some area using Latin as its language, where I V X L C D M are the digits ? kre
Re: can [[:digit:]] match something other than 0123456789?
2018-05-15 16:55:45 -0500, Eric Blake: > On 05/15/2018 03:43 PM, Stephane Chazelas wrote: > > > >Does that mean that [0-9] is also guaranteed to match on > >0123456789 only? And that then [[:digit:]] in regexp/fnmatch is > >close to useless as it's longer than [0-9] > > Yes, I think that's a fair conclusion for the C locale, by virtue of the > fact that the standard requires the encoding for 0-9 to be contiguous and in > order. > > >and is a bit > >misleading as it suggests it would be affected by localisation > >(like the other character classes) while it's not. > > It's still useful in non-C locales within regexp, since ALL uses of - for > ranges within [] has unspecified (or was it implementation-defined) > semantics outside of the C locale. Using a named reference guarantees the > desired semantics of exactly 10 characters, rather than skirting on the > grounds of whether the range operator behaves as desired in all locales > rather than just the C locale. [...] OK, so to rephrase and make sure I understand correctly. In locales other than C, [[:digit:]] will be guaranteed to match on 0123456789 only but not [0-9]. 0123456789 are guaranteed to be in that order but [0-9] is unspecified anyway outside of the C locale. That's a bit counter-intuitive and (as noted by @isaac at https://unix.stackexchange.com/questions/414226/difference-between-0-9-digit-and-d/414230?noredirect=1#comment804362_414230) is the opposite of what perl (in unicode mode), php (in unicode mode), pcre (with (*UCP)) do: their [0-9] matches 0123456789 while their \d/[[:digit:]] match based on Unicode properties so other decimal digits than the 0123456789 ones. -- Stephane
Re: can [[:digit:]] match something other than 0123456789?
On 05/15/2018 03:43 PM, Stephane Chazelas wrote:

> Does that mean that [0-9] is also guaranteed to match on
> 0123456789 only? And that then [[:digit:]] in regexp/fnmatch is
> close to useless as it's longer than [0-9]

Yes, I think that's a fair conclusion for the C locale, by virtue of the
fact that the standard requires the encoding for 0-9 to be contiguous and
in order.

> and is a bit
> misleading as it suggests it would be affected by localisation
> (like the other character classes) while it's not.

It's still useful in non-C locales within regexp, since ALL uses of - for
ranges within [] have unspecified (or was it implementation-defined)
semantics outside of the C locale. Using a named reference guarantees the
desired semantics of exactly 10 characters, rather than skirting on the
grounds of whether the range operator behaves as desired in all locales
rather than just the C locale.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
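The "exactly 10 characters" property can be checked directly on the C side with isdigit(). A small sketch (count_isdigit is a name I made up; it only covers the single-byte isdigit(), not regex classes):

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>

/* Hypothetical helper: count how many unsigned char values
 * isdigit() classifies as decimal digits. */
int count_isdigit(void)
{
    int c, n = 0;

    for (c = 0; c <= UCHAR_MAX; c++)
        if (isdigit(c))
            n++;
    assert(n >= 10); /* '0'..'9' must always be included */
    return n;
}
```

On a conforming C implementation the count is 10, since isdigit() is defined to accept only the ten decimal digits.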
Re: can [[:digit:]] match something other than 0123456789?
For that hypothetical Venusian locale, as discussed for 1078, it would be expected to define a VDIGIT (sic) custom LC_CTYPE charclass for specifying other character names representing digits, and then using [[:digit:][:VDIGIT:]] to test for both. Code like this couldn't be considered strictly conforming, but might qualify for NLS-conforming. Also, application-specific locales can add names to a digit definition, with the same caveat, and then [:digit:] would be different from [0-9]. Similarly, VDIGIT could include [0-9] plus other names, and then code would only need to use [:VDIGIT:] to test for both. In a message dated 5/15/2018 4:55:20 PM Eastern Standard Time, stephane.chaze...@gmail.com writes: 2018-05-15 13:38:15 -0500, Eric Blake: > On 05/15/2018 12:50 PM, Stephane Chazelas wrote: [...] > >> digit > >> Define the characters to be classified as numeric digits. > >> > >> In the POSIX locale, only: > >> > >>0 1 2 3 4 5 6 7 8 9 > > Please read http://austingroupbugs.net/view.php?id=1078 where this wording > has been tightened to cover ALL locales, not just the POSIX locale, to > better match with C requirements on isdigit(). [...] Thanks. I somehow missed that one. Does that mean that [0-9] is also guaranteed to match on 0123456789 only? And that then [[:digit:]] in regexp/fnmatch is close to useless as it's longer than [0-9] and is a bit misleading as it suggests it would be affected by localisation (like the other character classes) while it's not. -- Stephane
Re: can [[:digit:]] match something other than 0123456789?
2018-05-15 13:38:15 -0500, Eric Blake: > On 05/15/2018 12:50 PM, Stephane Chazelas wrote: [...] > >> digit > >> Define the characters to be classified as numeric digits. > >> > >> In the POSIX locale, only: > >> > >>0 1 2 3 4 5 6 7 8 9 > > Please read http://austingroupbugs.net/view.php?id=1078 where this wording > has been tightened to cover ALL locales, not just the POSIX locale, to > better match with C requirements on isdigit(). [...] Thanks. I somehow missed that one. Does that mean that [0-9] is also guaranteed to match on 0123456789 only? And that then [[:digit:]] in regexp/fnmatch is close to useless as it's longer than [0-9] and is a bit misleading as it suggests it would be affected by localisation (like the other character classes) while it's not. -- Stephane
Re: can [[:digit:]] match something other than 0123456789?
On 05/15/2018 12:50 PM, Stephane Chazelas wrote:

You're a bit late to the party on this question :)

>> digit
>>     Define the characters to be classified as numeric digits.
>>
>>     In the POSIX locale, only:
>>
>>     0 1 2 3 4 5 6 7 8 9

Please read http://austingroupbugs.net/view.php?id=1078 where this wording
has been tightened to cover ALL locales, not just the POSIX locale, to
better match with C requirements on isdigit().

-- 
Eric Blake