Re: can [[:digit:]] match something other than 0123456789?

2018-05-25 Thread Steffen Nurpmeso

Garrett Wollman  wrote:
 |Schwarz, Konrad said:
 |> Also, my feeling is that [[:digit:]] should match just the digits
 |> that are actually relevant for that locale, e.g., just "western"
 |> digits for en_GB.  And fractions and superscripts are not digits.
 |
 |Implementations often use the same character definitions for all
 |locales using the same character set -- such as the Unicode character
 |data file, for Unicode-based locales.  I think changing this may be a
 |tough sell for many implementers, just given the sheer number of
 |characters (and bikeshed-painting debates about which particular
 |character class or collation element should include which characters
 |in which locales would not be welcome).

..and bugs are everywhere, ... and take a long time to fix.
I think Unicode is pretty clear on what is a digit or a number,
and what not.  And i think they no longer officially support the
toolchain that can be used to turn Unicode data tables to
Unix/POSIX compliant (a.k.a. localedef) tables.  But of course
D'Amore from the Solaris faction seems to have done a great job,
and Daroussin imported that into FreeBSD (unforgotten the "this is
how i like OpenSource software" message, or very nearby that).

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

RE: can [[:digit:]] match something other than 0123456789?

2018-05-24 Thread Garrett Wollman

Schwarz, Konrad said:

> Also, my feeling is that [[:digit:]] should match just the digits
> that are actually relevant for that locale, e.g., just "western"
> digits for en_GB.  And fractions and superscripts are not digits.

Implementations often use the same character definitions for all
locales using the same character set -- such as the Unicode character
data file, for Unicode-based locales.  I think changing this may be a
tough sell for many implementers, just given the sheer number of
characters (and bikeshed-painting debates about which particular
character class or collation element should include which characters
in which locales would not be welcome).

-GAWollman

Re: can [[:digit:]] match something other than 0123456789?

2018-05-24 Thread Joerg Schilling

Stephane CHAZELAS  wrote:

> Is that a POSIX invention (the [a-z] based on collation) by the
> way, or does it come from implementations that already existed
> at the time?

Around 1993, all major UNIX platforms used the same code that was derived from 
IBM.

Maybe this is the background...


Jörg

-- 
 EMail:jo...@schily.net (home) Jörg Schilling D-13353 Berlin
joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
 URL: http://cdrecord.org/private/ http://sf.net/projects/schilytools/files/'

RE: can [[:digit:]] match something other than 0123456789?

2018-05-24 Thread Schwarz, Konrad

> -Original Message-
> From: Stephane Chazelas [mailto:stephane.chaze...@gmail.com]
> Sent: Sunday, May 20, 2018 10:43 PM
> To: Geoff Clare
> Cc: austin-group-l@opengroup.org
> Subject: Re: can [[:digit:]] match something other than 0123456789?
> 

> Note that having [x-y] be based on collation order would mean that
> things like [a-z] would also match on uppercase letters in the latin
> script in locales where case is not considered in the first weight
> for sorting (as is typical for English locales for instance).
>
> Now, in an en_GB.UTF-8 locale on GNU/Linux (here ubuntu 16.04) for
> instance, both bash's and ksh93's [0-9] matches on at least 142
> different characters (see below). That matches on 0123456789 but also
> digits 0 (sometimes 1) to 8 (sometimes 9 like for U+0669 which sorts
> the same as 9 there!) in other scripts, and some other random decimal
> digits, and some non-digits and is far from including all the
> plethora of other decimal digits in Unicode.
> (unicode --max 0 --regexp
> 'digit.(one|two|three|four|five|six|seven|eight|nine)\b' | grep -c '^U+'
> returns 696 with an old version of unicode, and that doesn't even
> include things like roman numerals).

I'd find [0-9] matching on just "western" digits and [[:digit:]] matching
on the locale's digits the most natural solution.  If someone wanted to
match on Devanagari or whatever digits, she could simply list them in the
bracket expression, rather than using "western" digits.  If [0-9] is
understood to be [[:digit:]], how could one differentiate between "western"
and, say, Devanagari digits (other than listing them each explicitly,
[0123456789], as Stephane has done)?

Same goes for [a-z]: these should match (or should be) the Roman letters,
not alphabetic characters in general.

Also, my feeling is that [[:digit:]] should match just the digits that are
actually relevant for that locale, e.g., just "western" digits for en_GB.
And fractions and superscripts are not digits.

If you really want to match any digit in any language, you could add a
"Unicode" locale or perhaps region.

Re: can [[:digit:]] match something other than 0123456789?

2018-05-23 Thread Stephane CHAZELAS

2018-05-23 22:44:46 +0100, Stephane CHAZELAS:
[...]
> [a-z] is not guaranteed to match on lower case letters only let
> alone abcdefghijklmnopqrstuvwxyz only, it may even match on
> characters outside the latin script.
[...]

Actually, I suspect that POSIX requires ranges in the POSIX
locale to be based on collation (and leaves them unspecified in
other locales) so that [a-z] is guaranteed to match on
abcdefghijklmnopqrstuvwxyz only, even when the POSIX locale's
charset is something like EBCDIC where those characters are not
contiguous.

It's ironic that doing that for other locales would break the
expectation that [a-z] should match on
abcdefghijklmnopqrstuvwxyz even though the locale's charset has
them in the correct order.

Is that a POSIX invention (the [a-z] based on collation) by the
way, or does it come from implementations that already existed
at the time?

What about [.elt.], [=equiv=], [:class:]? Are they a POSIX
invention or a specification of prior art?

I've come across a past discussion on the GNU grep mailing list
suggesting the based-on-collation ranges should only be done
when using [[.a.]-[.z.]], while [a-z] should be based on code
point.

That sounds to me like a nice idea.

-- 
Stephane

Re: can [[:digit:]] match something other than 0123456789?

2018-05-23 Thread Stephane CHAZELAS

2018-05-22 13:49:20 +0100, Stephane CHAZELAS:
[...]
> In the case of the fnmatch and regexp of most systems, I don't
> know how they make so that [0-9] only matches on 0123456789 or
> [a-z] not on uppercase letters. Possibly, that's with special
> cases as well.
[...]

Sorry, my bad. It looks like I was basing my conclusions on
tests I thought I remembered doing but probably never did.

[0-9] matches on characters other than 0123456789 on many
systems with grep and system regexps as well.

On Solaris 10, in an en_GB.UTF-8 locale, with /usr/xpg4/bin/grep,
it matches on hundreds of different characters, many of which
have nothing to do with digits or are not even assigned in
Unicode. Its [a-z] matches on ABC...WXY and hundreds more and
even parts of characters like the 0xf0..0xf4 of characters
U+10000 to U+10FFFF.

On FreeBSD, [0-9] matches on U+2185 ROMAN NUMERAL SIX LATE FORM
in addition to 0123456789 (!?).

In GNU locales, whether [a-z] matches BCD..WXY or not depends on
the locale and the version of glibc. [0-9] does not always
include only 0123456789 either. For instance, in a th_TH.UTF-8
locale, grep '[a-z]' matches on M and grep '[0-9]' matches on

U+0E50 THAI DIGIT ZERO
U+0E51 THAI DIGIT ONE
U+0E52 THAI DIGIT TWO
U+0E53 THAI DIGIT THREE
U+0E54 THAI DIGIT FOUR
U+0E55 THAI DIGIT FIVE
U+0E56 THAI DIGIT SIX
U+0E57 THAI DIGIT SEVEN
U+0E58 THAI DIGIT EIGHT

(note the missing DIGIT NINE which would sort after 9).

So, that confirms that it's not only a bash/ksh93 "issue", [0-9]
cannot be used to match 0123456789 only and what it matches is
random and useless and not what one would ever want.

[a-z] is not guaranteed to match on lower case letters only let
alone abcdefghijklmnopqrstuvwxyz only, it may even match on
characters outside the latin script.

LC_ALL=C grep '[0-9]'

Would be OK, but not in locales that use charsets that have
characters that contain the encoding of digits (like GB18030,
BIG5...).

It was requested that [[:digit:]] match only on 0123456789.
While in practice, it seems to be the case for things that use
the POSIX API, it's not always the case outside of it (where
[0-9] generally matches on 0123456789 but [[:digit:]] can match
on all sorts of decimal digits). Like

perl -Mopen=locale -ne 'print if /[[:digit:]]/'

So it would seem that [0123456789] is the only portable way to
match on 0123456789 only.
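
A minimal sketch of that conclusion applied to grep, assuming
line-oriented input. The function name ascii_digit_lines is made up
for illustration and is not part of any tool discussed here:

```shell
# Print only the input lines made up entirely of the ten ASCII digits.
# The digits are listed explicitly instead of using [0-9] or
# [[:digit:]], so the result should not depend on collation order.
ascii_digit_lines() {
    grep -x '[0123456789][0123456789]*'
}

# Of the three lines fed in here (the last one empty), only "2018"
# consists solely of ASCII digits.
printf '%s\n' 2018 20x18 '' | ascii_digit_lines
```

Spelling out the list is verbose, but it avoids relying on what a
range or a character class happens to mean in the current locale.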

-- 
Stephane

Re: can [[:digit:]] match something other than 0123456789?

2018-05-22 Thread Chet Ramey

On 5/22/18 6:32 AM, Joerg Schilling wrote:

>> bash's [a-z] still matches on A..Y or B..Z though (source of
>> much confusion, many bugs and lots of ranting), and that
>> makes me realise that bash is actually one of those utilities
> 
> This strange and unexpected behavior did cause once that bash removed
> important files for me. Sorry, I don't remember which locale I used at
> that time.

Bash's bracket expression range expressions use the collating sequence,
as Posix specifies (though Posix limits its definition to the Posix
locale), so any locale that collates like aAbBcC...zZ will pick up
upper and lower case characters.

There is a shell option that allows you to control the behavior.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/

Re: can [[:digit:]] match something other than 0123456789?

2018-05-22 Thread Stephane CHAZELAS

2018-05-22 12:32:20 +0200, Joerg Schilling:
[...]
> > bash's [a-z] still matches on A..Y or B..Z though (source of
> > much confusion, many bugs and lots of ranting), and that
> > makes me realise that bash is actually one of those utilities
> 
> This strange and unexpected behavior did cause once that bash
> removed important files for me. Sorry, I don't remember which
> locale I used at that time.
> 
> I would call this behavior a security risk.
[...]

Note that (AFAICT from testing) ksh93 behaves like bash in that
its ranges are based on collation order, but it has an extra
feature in that

- if both ends of the range are lowercase letters (or collating
  elements whose first character is a lowercase letter), then
  [x-y] matches on collating elements in between
  x and y PROVIDED their first character is lowercase.

  That's why for instance m is matched by [a-z], [A-z], [a-Z]
  but not [A-Z] and in a Hungarian locale on a GNU system, Dz is
  matched by [A-Z] (even though it contains a lowercase letter)
  and not [a-z].

- and the corresponding case for uppercase letters


In the case of the fnmatch and regexp of most systems, I don't
know how they make so that [0-9] only matches on 0123456789 or
[a-z] not on uppercase letters. Possibly, that's with special
cases as well. Note that GNU grep/sed do match Dz with [A-Z] in
Hungarian locales, but not GNU "find -name '[A-Z]'" (fnmatch
doesn't seem to handle collating elements there).

zsh's ranges are based on byte value in locales with single-byte
charsets and unicode codepoint (wide character, which probably
corresponds to unicode code point on all systems where zsh has
been ported) in multi-byte ones. To me, that's the most useful
approach (also the one of most modern languages).

-- 
Stephane

Re: can [[:digit:]] match something other than 0123456789?

2018-05-22 Thread keld

I listed digits that were consecutive. I did not list Japanese or Chinese
digits.

But it would be easy to also include Japanese and Chinese digits.
You could just include character classes like zero, one, two etc.

Best regards
Keld

On Tue, May 22, 2018 at 02:15:16PM +0200, Joerg Schilling wrote:
> "k...@keldix.com"  wrote:
> 
> > I already cited text from 14652 and 30112.  That would be fine.
> 
> I mentioned already that japanese/chinese numbers are not consecutive.
> 
> > On Tue, May 22, 2018 at 11:45:26AM +0200, Joerg Schilling wrote:
> > > "k...@keldix.com"  wrote:
> > > 
> > > > Well, if ctype.h does not cover the functionality that we want,
> > > > then we need to specify new functionality.  WG14 is looking into
> > > > some reentrant functionality in this area, in something that could
> > > > be a TS.
> > > 
> > > Could you please explain what functionality you would like to see?
> 
> Jörg

Re: can [[:digit:]] match something other than 0123456789?

2018-05-22 Thread Joerg Schilling

Stephane Chazelas  wrote:

> Note that having [x-y] be based on collation order would mean
> that things like [a-z] would also match on uppercase letters in
> the latin script in locales where case is not considered in the
> first weight for sorting (as is typical for English locales for
> instance).

...

> bash's [a-z] still matches on A..Y or B..Z though (source of
> much confusion, many bugs and lots of ranting), and that
> makes me realise that bash is actually one of those utilities

This strange and unexpected behavior did cause once that bash removed important 
files for me. Sorry, I don't remember which locale I used at that time.

I would call this behavior a security risk.

Jörg


Re: can [[:digit:]] match something other than 0123456789?

2018-05-22 Thread Joerg Schilling

"k...@keldix.com"  wrote:

> Well, if ctype.h does not cover the functionality that we want, then we
> need to specify new functionality.  WG14 is looking into some reentrant
> functionality in this area, in something that could be a TS.

Could you please explain what functionality you would like to see?

Jörg


Re: can [[:digit:]] match something other than 0123456789?

2018-05-20 Thread Stephane Chazelas

2018-05-16 09:42:56 +0100, Geoff Clare:
> Stephane Chazelas  wrote, on 15 May 2018:
> >
> > OK, so to rephrase and make sure I understand correctly. In
> > locales other than C, [[:digit:]] will be guaranteed to match on
> > 0123456789 only but not [0-9]. 0123456789 are guaranteed to be
> > in that order but [0-9] is unspecified anyway outside of the C
> > locale.
> > 
> > That's a bit counter-intuitive
> 
> Not really, when you consider that ranges should use the collation
> sequence, not character encodings.  (For the C/POSIX locale that's
> required - for others it's not, but it's the obvious way to implement
> ranges with multibyte characters.)
> 
> In languages where there are alternative "digit" representations,
> the locale definition might give the various representations of each
> "digit" the same primary weight in the collating sequence, in which
> case [0-9] would include some characters that are not true digits
> (according to iswdigit()).
[...]

Thanks all for replying.

Note that having [x-y] be based on collation order would mean
that things like [a-z] would also match on uppercase letters in
the latin script in locales where case is not considered in the
first weight for sorting (as is typical for English locales for
instance).

Hardly any implementations do it anymore. IIRC, GNU grep used to
have [a-z] match on ABCDEF...Y in some locales, but they don't
anymore, probably because they got too many bug reports about
that.

I don't know how they do it now, it still seems to be somehow
based on collation order, as [a-e] matches on áâà, ć (but not
êèé which come after e, evidence that it's not useful), on dz in
Hungarian and so on, but not on ABC, and 0-9 matches only on
0123456789 and that's the same on all systems I tried including
certified ones like Solaris.

bash's [a-z] still matches on A..Y or B..Z though (source of
much confusion, many bugs and lots of ranting), and that
makes me realise that bash is actually one of those utilities
where 0-9 matches something other than 0123456789. When I said
initially I wasn't aware of any that did, I had only considered
fnmatch and EREs.

Now, in an en_GB.UTF-8 locale on GNU/Linux (here ubuntu 16.04)
for instance, both bash's and ksh93's [0-9] matches on at least
142 different characters (see below). That matches on 0123456789
but also digits 0 (sometimes 1) to 8 (sometimes 9 like for U+0669
which sorts the same as 9 there!) in other scripts, and some
other random decimal digits, and some non-digits and is far from
including all the plethora of other decimal digits in Unicode.
(unicode --max 0 --regexp
'digit.(one|two|three|four|five|six|seven|eight|nine)\b' | grep -c '^U+'
returns 696 with an old version of unicode, and that doesn't even
include things like roman numerals).

Now when sorting text, that order makes as much sense as any. If
some text ever happened to contain both English and Devanagari
digits, it could make very much sense to sort them next to each
other, but it makes little sense for [0-9] to match on those.

Using [0-9] is very common to validate input and make sure it
contains only digits for instance like:

case $input in
  "" | *[!0-9]*) die invalid
esac

That means it needs to be changed to *[!0123456789]* to actually
work.
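
A sketch of that change, assuming a POSIX sh. The helper name
is_ascii_number is hypothetical, and it reports failure through its
exit status instead of calling the die function used above:

```shell
# Succeed only if the argument is non-empty and consists solely of the
# ten ASCII digits. The explicit list avoids locale-dependent ranges.
is_ascii_number() {
    case $1 in
        "" | *[!0123456789]*) return 1 ;;
    esac
    return 0
}

# Example use; "input" is whatever the script received from outside.
if is_ascii_number "${input-}"; then
    echo "valid"
else
    echo "invalid" >&2
fi
```

The only difference from the [!0-9] version is the bracket
expression, so existing validation code can be converted mechanically.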

What bash/ksh93's [0-9] match in en_GB.UTF-8:

U+0030 DIGIT ZERO
U+0031 DIGIT ONE
U+0032 DIGIT TWO
U+0033 DIGIT THREE
U+0034 DIGIT FOUR
U+0035 DIGIT FIVE
U+0036 DIGIT SIX
U+0037 DIGIT SEVEN
U+0038 DIGIT EIGHT
U+0039 DIGIT NINE
U+00B2 SUPERSCRIPT TWO
U+00B3 SUPERSCRIPT THREE
U+00B9 SUPERSCRIPT ONE
U+00BC VULGAR FRACTION ONE QUARTER
U+00BD VULGAR FRACTION ONE HALF
U+00BE VULGAR FRACTION THREE QUARTERS
U+0660 ARABIC-INDIC DIGIT ZERO
U+0661 ARABIC-INDIC DIGIT ONE
U+0662 ARABIC-INDIC DIGIT TWO
U+0663 ARABIC-INDIC DIGIT THREE
U+0664 ARABIC-INDIC DIGIT FOUR
U+0665 ARABIC-INDIC DIGIT FIVE
U+0666 ARABIC-INDIC DIGIT SIX
U+0667 ARABIC-INDIC DIGIT SEVEN
U+0668 ARABIC-INDIC DIGIT EIGHT
U+0669 ARABIC-INDIC DIGIT NINE
U+06F0 EXTENDED ARABIC-INDIC DIGIT ZERO
U+06F1 EXTENDED ARABIC-INDIC DIGIT ONE
U+06F2 EXTENDED ARABIC-INDIC DIGIT TWO
U+06F3 EXTENDED ARABIC-INDIC DIGIT THREE
U+06F4 EXTENDED ARABIC-INDIC DIGIT FOUR
U+06F5 EXTENDED ARABIC-INDIC DIGIT FIVE
U+06F6 EXTENDED ARABIC-INDIC DIGIT SIX
U+06F7 EXTENDED ARABIC-INDIC DIGIT SEVEN
U+06F8 EXTENDED ARABIC-INDIC DIGIT EIGHT
U+0966 DEVANAGARI DIGIT ZERO
U+0967 DEVANAGARI DIGIT ONE
U+0968 DEVANAGARI DIGIT TWO
U+0969 DEVANAGARI DIGIT THREE
U+096A DEVANAGARI DIGIT FOUR
U+096B DEVANAGARI DIGIT FIVE
U+096C DEVANAGARI DIGIT SIX
U+096D DEVANAGARI DIGIT SEVEN
U+096E DEVANAGARI DIGIT EIGHT
U+09E6 BENGALI DIGIT ZERO
U+09E7 BENGALI DIGIT ONE
U+09E8 BENGALI DIGIT TWO
U+09E9 BENGALI DIGIT THREE
U+09EA BENGALI DIGIT FOUR
U+09EB BENGALI DIGIT FIVE
U+09EC BENGALI DIGIT SIX
U+09ED BENGALI DIGIT SEVEN
U+09EE BENGALI DIGIT EIGHT
U+0A66 GURMUKHI DIGIT ZERO
U+0A67 GURMUKHI DIGIT ONE
U+0A68 GURMUKHI DIGIT TWO
U+0A69 GURMUKHI DIGIT THREE
U+0A6A GURMUKHI DIGIT

Re: can [[:digit:]] match something other than 0123456789?

2018-05-18 Thread k...@keldix.com

On Fri, May 18, 2018 at 01:35:03PM -0500, Eric Blake wrote:
> On 05/18/2018 12:24 PM, Wheeler, David A wrote:
> >This conversation seems strange; many locales use digits other than
> >0-9 to represent numbers.
> >
> >The Eastern Arabic, Perso-Arabic variant, and Urdu variant all have
> >digits, they just aren't 0-9.  In Unicode/ISO 10646 in particular
> >there are the digits U+0660 through U+0669 and U+06F0 through U+06F9.
> >When I visited Saudi Arabia I saw the Eastern Arabic digits
> >everywhere, not just 0-9.  For more:
> >https://en.wikipedia.org/wiki/Eastern_Arabic_numerals
> >
> >Here's an example, U+0662:
> >http://www.fileformat.info/info/unicode/char/0662/index.htm
> >This is a decimal digit with value 2.  Java agrees.
> >
> >It sounds like there are different use cases.  Maybe there needs to
> >be a standard way to represent different cases, e.g., "exactly 0-9",
> >"a digit in the current locale", and "a member of Unicode Character
> >Category 'Number, Decimal Digit'".  I don't know if there's a need to
> >distinguish the second and third cases.  It seems to me that
> >[[:digit:]] should mean the second or third case.
> 
> The problem is that the definition of isdigit() means only the
> first case (exactly the locale-independent 10 digits in the portable
> file name character set, whether locales are based on ASCII or EBCDIC),
> and the definition of [[:FOO:]] defers to isFOO() where
> possible.  Yes, it may be nice to have additional classification
> routines, but as has been pointed out elsewhere in this thread, doing it
> solely by one character at a time may not be sufficient to capture all
> Unicode rules compared to what people really want to search for (for
> example, when searching for a character with an accent, you want to be
> able to find both the composed character, and the sequence of a plain
> character plus combining mark character, that both represent the same
> concept, but an iswFOO() test does not work on the latter example, since
> it occupies more than one character).

Well, if ctype.h does not cover the functionality that we want, then we need to
specify new functionality.  WG14 is looking into some reentrant functionality
in this area, in something that could be a TS.

Also for comparison, SC35/WG5 has specified an API that takes care of many of
these problems, present in both 14652 and 30112. This is an API that was meant
for 14651 (the ISO sort standard) but met resistance from the Unicode people.

Also the bidi spec was proposed for 10646 but some Unicode people resisted it.
I get the impression that some people do not want ISO to specify things in this
area that are not controlled by Unicode.

Best regards
keld

Re: can [[:digit:]] match something other than 0123456789?

2018-05-18 Thread Eric Blake

On 05/18/2018 12:24 PM, Wheeler, David A wrote:

> This conversation seems strange; many locales use digits other than 0-9 to
> represent numbers.
>
> The Eastern Arabic, Perso-Arabic variant, and Urdu variant all have digits,
> they just aren't 0-9. In Unicode/ISO 10646 in particular there are the digits
> U+0660 through U+0669 and U+06F0 through U+06F9. When I visited Saudi Arabia I
> saw the Eastern Arabic digits everywhere, not just 0-9. For more:
> https://en.wikipedia.org/wiki/Eastern_Arabic_numerals
>
> Here's an example, U+0662:
> http://www.fileformat.info/info/unicode/char/0662/index.htm
> This is a decimal digit with value 2. Java agrees.
>
> It sounds like there are different use cases. Maybe there needs to be a
> standard way to represent different cases, e.g., "exactly 0-9", "a digit in
> the current locale", and "a member of Unicode Character Category 'Number,
> Decimal Digit'". I don't know if there's a need to distinguish the second and
> third cases. It seems to me that [[:digit:]] should mean the second or third
> case.

The problem is that the definition of isdigit() means only the
first case (exactly the locale-independent 10 digits in the portable
file name character set, whether locales are based on ASCII or EBCDIC),
and the definition of [[:FOO:]] defers to isFOO() where
possible. Yes, it may be nice to have additional classification
routines, but as has been pointed out elsewhere in this thread, doing it
solely by one character at a time may not be sufficient to capture all
Unicode rules compared to what people really want to search for (for
example, when searching for a character with an accent, you want to be
able to find both the composed character, and the sequence of a plain
character plus combining mark character, that both represent the same
concept, but an iswFOO() test does not work on the latter example, since
it occupies more than one character).

--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org

RE: can [[:digit:]] match something other than 0123456789?

2018-05-18 Thread Wheeler, David A

This conversation seems strange; many locales use digits other than 0-9 to 
represent numbers.

The Eastern Arabic, Perso-Arabic variant, and Urdu variant all have digits, 
they just aren't 0-9.  In Unicode/ISO 10646 in particular there are the digits
U+0660 through U+0669 and U+06F0 through U+06F9.  When I visited Saudi Arabia I 
saw the Eastern Arabic digits everywhere, not just 0-9.  For more:
https://en.wikipedia.org/wiki/Eastern_Arabic_numerals

Here's an example, U+0662:
http://www.fileformat.info/info/unicode/char/0662/index.htm
This is a decimal digit with value 2.  Java agrees.

It sounds like there are different use cases.  Maybe there needs to be a
standard way to represent different cases, e.g., "exactly 0-9", "a digit in
the current locale", and "a member of Unicode Character Category 'Number,
Decimal Digit'".  I don't know if there's a need to distinguish the second and
third cases.  It seems to me that [[:digit:]] should mean the second or third
case.

--- David A. Wheeler

Re: can [[:digit:]] match something other than 0123456789?

2018-05-17 Thread keld

On Thu, May 17, 2018 at 12:36:35PM +0200, Hans Åberg wrote:
> 
> > On 17 May 2018, at 11:02, Joerg Schilling 
> >  wrote:
> > 
> > Hans Åberg  wrote:
> > 
> >>>> |I asked a person who speaks japanese and he told me that
> >>>> |
> >>>> | "\u4e00\u4e8c\u4e09"
> >>>> |
> >>>> |is similar to
> >>>> |
> >>>> | "one two three"
> >>>> |
> >>>> |and this is not used for computing.
> >>>> 
> >>>> If i recall correctly this has been discussed already; if not here
> >>>> then on the Unicode list.  Unicode brings quite a lot of
> >>>> codepoints, like CIRCLED DIGIT ONE, PARENTHESIZED DIGIT ONE, DIGIT
> >>>> ONE FULL STOP etc.  All these are marked "No", and i think the
> >>>> discussion concluded that they should not be taken into account
> >>>> when converting strings to numbers.
> >> 
> >> The intent may be that the value of the digit character c can be computed 
> >> by the expression c - '0' when >= 0 and <= 9, and is otherwise a 
> >> non-digit. Then 'isdigit' and [[:digit:]] are tied to that, so it is 
> >> impossible to use any other decimal digits.
> > 
> > This seems to be an important idea, as this japanese one two three
> > is not in a contiguous order.
> 
> It provides an efficient implementation, important on earlier computers. The 
> UTF-8 article [1], "History", mentions that they struggled around 1992 to 
> find proposals for that providing efficient implementations.
> 
> 1. https://en.wikipedia.org/wiki/UTF-8

Oh, well. You should be able to implement efficient code for the specs from
14652 and 30112. One way would be that, after testing for isdigit, you index
into a table of 4-bit entries holding the binary value corresponding to the
digit character. This is probably on par speedwise with subtracting the value
for zero.

Best regards
keld

Re: can [[:digit:]] match something other than 0123456789?

2018-05-17 Thread Hans Åberg


> On 17 May 2018, at 11:02, Joerg Schilling 
>  wrote:
> 
> Hans Åberg  wrote:
> 
>>>> |I asked a person who speaks japanese and he told me that
>>>> |
>>>> | "\u4e00\u4e8c\u4e09"
>>>> |
>>>> |is similar to
>>>> |
>>>> | "one two three"
>>>> |
>>>> |and this is not used for computing.
>>>> 
>>>> If i recall correctly this has been discussed already; if not here
>>>> then on the Unicode list.  Unicode brings quite a lot of
>>>> codepoints, like CIRCLED DIGIT ONE, PARENTHESIZED DIGIT ONE, DIGIT
>>>> ONE FULL STOP etc.  All these are marked "No", and i think the
>>>> discussion concluded that they should not be taken into account
>>>> when converting strings to numbers.
>> 
>> The intent may be that the value of the digit character c can be computed by 
>> the expression c - '0' when >= 0 and <= 9, and is otherwise a non-digit. 
>> Then 'isdigit' and [[:digit:]] are tied to that, so it is impossible to use 
>> any other decimal digits.
> 
> This seems to be an important idea, as this japanese one two three
> is not in a contiguous order.

It provides an efficient implementation, important on earlier computers. The 
UTF-8 article [1], "History", mentions that they struggled around 1992 to find 
proposals for that providing efficient implementations.

1. https://en.wikipedia.org/wiki/UTF-8

Re: can [[:digit:]] match something other than 0123456789?

2018-05-17 Thread keld

On Thu, May 17, 2018 at 11:02:48AM +0200, Joerg Schilling wrote:
> Hans Åberg  wrote:
> 
> > >> |I asked a person who speaks japanese and he told me that
> > >> |
> > >> | "\u4e00\u4e8c\u4e09"
> > >> |
> > >> |is similar to
> > >> |
> > >> | "one two three"
> > >> |
> > >> |and this is not used for computing.
> > >> 
> > >> If i recall correctly this has been discussed already; if not here
> > >> then on the Unicode list.  Unicode brings quite a lot of
> > >> codepoints, like CIRCLED DIGIT ONE, PARENTHESIZED DIGIT ONE, DIGIT
> > >> ONE FULL STOP etc.  All these are marked "No", and i think the
> > >> discussion concluded that they should not be taken into account
> > >> when converting strings to numbers.
> >
> > The intent may be that the value of the digit character c can be computed 
> > by the expression c - '0' when >= 0 and <= 9, and is otherwise a non-digit. 
> > Then 'isdigit' and [[:digit:]] are tied to that, so it is impossible to use 
> > any other decimal digits.
> 
> This seems to be an important idea, as this japanese one two three
> is not in a contiguous order.

Well, the digits in other scripts are ordered consecutively, so the calculation
could easily be done, for the scripts I previously documented, as prescribed
in ISO 14652.
This is not rocket science.

Best regards
keld

Re: can [[:digit:]] match something other than 0123456789?

2018-05-17 Thread Joerg Schilling

Hans Åberg  wrote:

> >> |I asked a person who speaks japanese and he told me that
> >> |
> >> | "\u4e00\u4e8c\u4e09"
> >> |
> >> |is similar to
> >> |
> >> | "one two three"
> >> |
> >> |and this is not used for computing.
> >> 
> >> If i recall correctly this has been discussed already; if not here
> >> then on the Unicode list.  Unicode brings quite a lot of
> >> codepoints, like CIRCLED DIGIT ONE, PARENTHESIZED DIGIT ONE, DIGIT
> >> ONE FULL STOP etc.  All these are marked "No", and i think the
> >> discussion concluded that they should not be taken into account
> >> when converting strings to numbers.
>
> The intent may be that the value of the digit character c can be computed by 
> the expression c - '0' when >= 0 and <= 9, and is otherwise a non-digit. Then 
> 'isdigit' and [[:digit:]] are tied to that, so it is impossible to use any 
> other decimal digits.

This seems to be an important idea, as this japanese one two three
is not in a contiguous order.

Jörg


Re: can [[:digit:]] match something other than 0123456789?

2018-05-16 Thread Hans Åberg



> On 16 May 2018, at 18:13, Hans Åberg  wrote:
> 
> 
>> On 16 May 2018, at 17:14, Steffen Nurpmeso  wrote:
>> 
>> Joerg Schilling  wrote:
>> |Steffen Nurpmeso  wrote:
>> |>|> have some Unicode support.
>> |>|
>> |>|What do you expect: 
>> |>|
>> |>| strtol("\u4e00\u4e8c\u4e09", , 0);
>> |>
>> |> The entire is*() family cannot work with multibyte or stateful
>> |> encodings, right.
>> |
>> |I asked a person who speaks japanese and he told me that
>> |
>> | "\u4e00\u4e8c\u4e09"
>> |
>> |is similar to
>> |
>> | "one two three"
>> |
>> |and this is not used for computing.
>> 
>> If i recall correctly this has been discussed already; if not here
>> then on the Unicode list.  Unicode brings quite a lot of
>> codepoints, like CIRCLED DIGIT ONE, PARENTHESIZED DIGIT ONE, DIGIT
>> ONE FULL STOP etc.  All these are marked "No", and i think the
>> discussion concluded that they should not be taken into account
>> when converting strings to numbers.

The intent may be that the value of the digit character c can be computed by 
the expression c - '0' when >= 0 and <= 9, and is otherwise a non-digit. Then 
'isdigit' and [[:digit:]] are tied to that, so it is impossible to use any 
other decimal digits.
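
A minimal C sketch of that expression, for illustration only; the function name digit_value is invented here. It relies on the C guarantee that '0' through '9' have consecutive, ascending values.

```c
/* Return 0..9 if c is one of the ten portable digits, else -1.
 * C guarantees that '0'..'9' are contiguous and ascending. */
int digit_value(int c)
{
    int v = c - '0';

    return (v >= 0 && v <= 9) ? v : -1;
}
```

Any character outside that run, whatever a locale may call it, falls through to -1.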

Re: can [[:digit:]] match something other than 0123456789?

2018-05-16 Thread Hans Åberg

> On 16 May 2018, at 17:14, Steffen Nurpmeso  wrote:
> 
> Joerg Schilling  wrote:
> |Steffen Nurpmeso  wrote:
> |>|> have some Unicode support.
> |>|
> |>|What do you expect: 
> |>|
> |>| strtol("\u4e00\u4e8c\u4e09", , 0);
> |>
> |> The entire is*() family cannot work with multibyte or stateful
> |> encodings, right.
> |
> |I asked a person who speaks japanese and he told me that
> |
> | "\u4e00\u4e8c\u4e09"
> |
> |is similar to
> |
> | "one two three"
> |
> |and this is not used for computing.
> 
> If i recall correctly this has been discussed already; if not here
> then on the Unicode list.  Unicode brings quite a lot of
> codepoints, like CIRCLED DIGIT ONE, PARENTHESIZED DIGIT ONE, DIGIT
> ONE FULL STOP etc.  All these are marked "No", and i think the
> discussion concluded that they should not be taken into account
> when converting strings to numbers.  Hans Åberg surely knows
> better than I.

I am happier the less I know about these issues, and UTF-8 was invented to help 
with that! :-)

It was ICU Regular Expressions I had in mind, which, according to this link,
can match on all Unicode classes, including case-insensitive matching where
the cases have different lengths.
  http://userguide.icu-project.org/strings/regexp

As for the original question, I think the idea is that one is supposed to
define a C character set, and the C functions then act on that. Harbison &
Steele says that the isdigit function tests whether a character is one of the
ten digits one has defined, which is what [[:digit:]] is supposed to match, I
think.

So you can define your locale to have whatever ten characters you like and
render them as you please, as long as there are ten of them, they are
contiguous, and they have the intended function as decimal digits. Or so I
think.

If one wants other character classes matching outside of that, it is safest to
do as ICU Regular Expressions does and define them with respect to Unicode.

Re: can [[:digit:]] match something other than 0123456789?

2018-05-16 Thread Steffen Nurpmeso

Joerg Schilling  wrote:
 |Steffen Nurpmeso  wrote:
 |>|> have some Unicode support.
 |>|
 |>|What do you expect: 
 |>|
 |>| strtol("\u4e00\u4e8c\u4e09", , 0);
 |>
 |> The entire is*() family cannot work with multibyte or stateful
 |> encodings, right.
 |
 |I asked a person who speaks japanese and he told me that
 |
 | "\u4e00\u4e8c\u4e09"
 |
 |is similar to
 |
 | "one two three"
 |
 |and this is not used for computing.

If i recall correctly this has been discussed already; if not here
then on the Unicode list.  Unicode brings quite a lot of
codepoints, like CIRCLED DIGIT ONE, PARENTHESIZED DIGIT ONE, DIGIT
ONE FULL STOP etc.  All these are marked "No", and i think the
discussion concluded that they should not be taken into account
when converting strings to numbers.  Hans Åberg surely knows
better than I.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

Re: can [[:digit:]] match something other than 0123456789?

2018-05-16 Thread Joerg Schilling

Steffen Nurpmeso  wrote:

>  |> have some Unicode support.
>  |
>  |What do you expect: 
>  |
>  | strtol("\u4e00\u4e8c\u4e09", , 0);
>
> The entire is*() family cannot work with multibyte or stateful
> encodings, right.

I asked a person who speaks japanese and he told me that

"\u4e00\u4e8c\u4e09"

is similar to

"one two three"

and this is not used for computing.

Jörg

-- 
 EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin
joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
 URL: http://cdrecord.org/private/ http://sf.net/projects/schilytools/files/'

Re: can [[:digit:]] match something other than 0123456789?

2018-05-16 Thread Steffen Nurpmeso

Joerg Schilling  wrote:
 |Hans Åberg  wrote:
 |>> On 16 May 2018, at 10:29, Joerg Schilling wrote:
 |>> 
 |>> Robert Elz  wrote:
 |>> 
 |>>> How does one specify a locale for some area using Latin as its
 |>>> language, where I V X L C D M are the digits ?
 |>> 
 |>> how do you like to specify a hexadecimal number in this locale?
 |>
 |> They have no need for that in Latin, as "hexa" is Greek. :-) Otherwise, \
 |> you might check what the ECMAscript and C++ regex library do, which \
 |> have some Unicode support.
 |
 |What do you expect: 
 |
 | strtol("\u4e00\u4e8c\u4e09", , 0);

The entire is*() family cannot work with multibyte or stateful
encodings, right.

In my opinion it was an error to simply bolt more and more
functionality onto the old interfaces: take silent thread-safety
for standard I/O functions, or locale awareness for functions
which inherently cannot serve that purpose, like the is*() family.
Even the w*() family cannot work for all languages, even if you do
not use ISO 10646 codepoints in wchar_t, because of the surrounding
context that some languages needed, need, and will need -- that
cannot be eaten up like a burger.

In fact almost all network protocol-, cryptographic message
syntax- (CMS) or whatever standards require plain ASCII as a base
and only sometimes require something else, mostly bounded.  It is
beneficial to have a set of plain and reliable ASCII tools at hand
for these tasks.  What is wrong with that?  Granted, it could be
done in Cyrillic, Chinese, Korean, Japanese, or any other
language, but for one the history is a different one, and then, why
not.  English can be an easy language if so desired, and is like
that for most standards, in the end.

(It can also be a very hard read, just take

  Roman soldiery, flung their gnarled arms over a thick carpet of
  the most delicious green sward; in some places they were
  intermingled with beeches, hollies, and copsewood of various
  descriptions, so closely as totally to intercept the level beams
  of the sinking sun; in others they receded from each other,
  forming those long sweeping vistas, in the intricacy of which
  the eye delights to lose itself, while imagination considers
  them as the paths to yet wilder scenes of silvan solitude.)

 |to return in a japanese locale and what do you expect:
 |
 | strtol("0XC", , 0);
 |
 |to return in a latin locale?

That is hexadecimal for sure.  
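
To make the two cases concrete, here is a small C sketch. The struct and function names are invented for the example, and the kanji are assumed to arrive as UTF-8 bytes. It only wraps strtol() with base 0 and reports how many bytes were consumed.

```c
#include <stddef.h>
#include <stdlib.h>

/* Result of one strtol() call: the value and the number of bytes consumed. */
struct parse_result {
    long value;
    size_t consumed;
};

/* Parse s with base 0, so strtol() itself picks octal, decimal or hex
 * from the prefix. */
struct parse_result parse_auto_base(const char *s)
{
    struct parse_result r;
    char *end;

    r.value = strtol(s, &end, 0);
    r.consumed = (size_t)(end - s);
    return r;
}
```

In the C locale, "0XC" yields 12 with three bytes consumed, while the UTF-8 bytes of \u4e00\u4e8c\u4e09 yield 0 with nothing consumed. The C standard allows additional locale-specific forms to be accepted in other locales, so what happens there is up to the implementation.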

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

Re: can [[:digit:]] match something other than 0123456789?

2018-05-16 Thread Hans Åberg


> On 16 May 2018, at 10:53, Joerg Schilling 
>  wrote:
> 
> Hans Åberg  wrote:
> 
>> 
>>> On 16 May 2018, at 10:29, Joerg Schilling 
>>>  wrote:
>>> 
>>> Robert Elz  wrote:
>>> 
>>>> How does one specify a locale for some area using Latin as its
>>>> language, where I V X L C D M are the digits ?
>>> 
>>> how do you like to specify a hexadecimal number in this locale?
>> 
>> They have no need for that in Latin, as "hexa" is Greek. :-) Otherwise, you 
>> might check what the ECMAscript and C++ regex library do, which have some 
>> Unicode support.
> 
> What do you expect: 
> 
>   strtol("\u4e00\u4e8c\u4e09", , 0);
> 
> to return in a japanese locale and what do you expect:
> 
>   strtol("0XC", , 0);
> 
> to return in a latin locale?

I'm on MacOS, which has no language set, only LC_CTYPE="UTF-8". And std::strtol 
does not seem to accept explicit Unicode strings [1]. And if you want to use 
Latin numerals, you should probably use "Ⅹ" U+2169 and "Ⅽ" U+216D, so it is a 
non-issue.

1. http://en.cppreference.com/w/cpp/string/byte/strtol

Re: can [[:digit:]] match something other than 0123456789?

2018-05-16 Thread keld

On Wed, May 16, 2018 at 10:41:15AM +0200, Joerg Schilling wrote:
> Robert Elz  wrote:
> 
> > would be easy, but you say it also has to look for
> >
> > (c) [[:latindigs:]]+
> > (c) [[:vdigits:]]+
> >
> > (and how many more)?   This is actually kind of important, as
> >
> > (c) MMXVI
> >
> > type strings are not uncommon in certain environments (can't recall
> > ever seeing one written in Venusian though...)
> 
> We discussed whether
> 
>   \u4e00 \u4e8c \u4e09
> 
> should be a valid number made of [[:digit:]] in a japanese locale, 
> but it seems to be not a good idea.
> 
> If we did this, any program that deals with digits would not only need to
> know the rules for the indian (frequently called arabic) numbers but also
> the rules for other schemes.

Well, for many scripts other than the normal ASCII digits, this is already
standardized; ISO 14652 says:

digit Define the characters to be classified as decimal digits. Digits 
corresponding to the values 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 can be specified 
in groups of 10 digits, and in ascending order of the values they represent. 
The digits of the portable character set are automatically included. If this 
keyword is not specified, the digits 0 through 9 of the portable character set 
automatically belong to this class, with application-defined character values. 
The "digit" keyword is used to specify which characters are accepted as digits 
in input to an application, such as characters typed in or scanned in from an 
input text file, and should list digits used with all the scripts supported by 
the FDCC-set. The keyword may be omitted.

And the standard i18n locale of ISO 14652 has this for the digit class:

% The "digit" class of the "i18n" FDCC-set is reflecting
% the recommendations in TR 10176 annex A
digit /
% COLLECTION 1 BASIC LATIN/
  ..;/
% COLLECTION 15 ARABIC EXTENDED/
  ..;..;/
% COLLECTION 16 DEVANAGARI/
  ..;/
% COLLECTION 18 BENGALI/
  ..;/
% COLLECTION 18 GURMUKHI/
  ..;/
% COLLECTION 19 GUJARATI/
  ..;/
% COLLECTION 20 ORIYA/
   ..;/
% COLLECTION 21 TAMIL/
   <0>;..;/
% COLLECTION 22 TELUGU/
   ..;/
% COLLECTION 23 KANNADA/
   ..;/
% COLLECTION 24 MALAYALAM/
   ..;/
% COLLECTION 25 THAI/
   ..;/
% COLLECTION 26 LAO/
   ..;/
% COLLECTION 72 BASIC TIBETAN/
   ..;/
% COLLECTION 68 HALFWIDTH AND FULLWIDTH FORMS/
   ..
%

Best regards
keld

Re: can [[:digit:]] match something other than 0123456789?

2018-05-16 Thread Shware Systems

For conforming charsets XBD 6 requires the range <0>-<9> to be contiguous. By 
XBD 9.3.5, Rule 6, [:digit:] may include MBS elements aside from the <0> to <9> 
in LC_CTYPE, but the range [0-9] depends on whether additional characters have 
the same collation weight as digits. If this is the case the locale may need to 
define collating symbols that bracket the range of digits in the order list and 
use those in range expressions to ensure everything the locale considers a 
decimal digit is tested for.

The collating sequence for Japan might be something like:
collating-symbol <bgn-decimal>
collating-symbol <end-decimal>
order_start forward
...
bgn-decimal
<0>
 weight N
<1>
<一>    weight N+1
<2>
<二>
<3>
<三>
...
<9>
 weight N+8
end-decimal
...
order_end

and [0-9] would include  and the other digits, but not . 
The range [[.bgn-decimal.]-[.end-decimal.]] should include  too.
I'm ambivalent about whether the standard should reserve symbol names like this 
for common ranges like digits, though.

In a message dated 5/16/2018 4:49:44 AM Eastern Standard Time, 
joerg.schill...@fokus.fraunhofer.de writes:

Geoff Clare  wrote:

> Stephane Chazelas  wrote, on 15 May 2018:
> >
> > OK, so to rephrase and make sure I understand correctly. In
> > locales other than C, [[:digit:]] will be guaranteed to match on
> > 0123456789 only but not [0-9]. 0123456789 are guaranteed to be
> > in that order but [0-9] is unspecified anyway outside of the C
> > locale.
> > 
> > That's a bit counter-intuitive
>
> Not really, when you consider that ranges should use the collation
> sequence, not character encodings. (For the C/POSIX locale that's
> required - for others it's not, but it's the obvious way to implement
> ranges with multibyte characters.)

I believe the real problem is the IBM i18n implementation that internally uses 
collating values to evaluate ranges. With characters, this can result in 
strange effects, but it makes [[=o=]] easy to implement.

For digits, I would expect that there is no other glyph in between [0-9] but it 
may not be contiguous in a collating value notation.

Jörg

-- 
 EMail:jo...@schily.net (home) Jörg Schilling D-13353 Berlin
joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
 URL: http://cdrecord.org/private/ http://sf.net/projects/schilytools/files/'

Re: can [[:digit:]] match something other than 0123456789?

2018-05-16 Thread Joerg Schilling

Hans Åberg  wrote:

>
> > On 16 May 2018, at 10:29, Joerg Schilling 
> >  wrote:
> > 
> > Robert Elz  wrote:
> > 
> >> How does one specify a locale for some area using Latin as its
> >> language, where I V X L C D M are the digits ?
> > 
> > how do you like to specify a hexadecimal number in this locale?
>
> They have no need for that in Latin, as "hexa" is Greek. :-) Otherwise, you 
> might check what the ECMAscript and C++ regex library do, which have some 
> Unicode support.

What do you expect: 

strtol("\u4e00\u4e8c\u4e09", , 0);

to return in a japanese locale and what do you expect:

strtol("0XC", , 0);

to return in a latin locale?

Jörg

-- 
 EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin
joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
 URL: http://cdrecord.org/private/ http://sf.net/projects/schilytools/files/'

Re: can [[:digit:]] match something other than 0123456789?

2018-05-16 Thread Hans Åberg

> On 16 May 2018, at 10:29, Joerg Schilling 
>  wrote:
> 
> Robert Elz  wrote:
> 
>> How does one specify a locale for some area using Latin as its
>> language, where I V X L C D M are the digits ?
> 
> how do you like to specify a hexadecimal number in this locale?

They have no need for that in Latin, as "hexa" is Greek. :-) Otherwise, you 
might check what the ECMAscript and C++ regex library do, which have some 
Unicode support.

1. http://en.cppreference.com/w/cpp/regex
2. http://en.cppreference.com/w/cpp/regex/ecmascript

Re: can [[:digit:]] match something other than 0123456789?

2018-05-16 Thread Joerg Schilling

Geoff Clare  wrote:

> Stephane Chazelas  wrote, on 15 May 2018:
> >
> > OK, so to rephrase and make sure I understand correctly. In
> > locales other than C, [[:digit:]] will be guaranteed to match on
> > 0123456789 only but not [0-9]. 0123456789 are guaranteed to be
> > in that order but [0-9] is unspecified anyway outside of the C
> > locale.
> > 
> > That's a bit counter-intuitive
>
> Not really, when you consider that ranges should use the collation
> sequence, not character encodings.  (For the C/POSIX locale that's
> required - for others it's not, but it's the obvious way to implement
> ranges with multibyte characters.)

I believe the real problem is the IBM i18n implementation that internally uses 
collating values to evaluate ranges. With characters, this can result in 
strange effects, but it makes [[=o=]] easy to implement.

For digits, I would expect that there is no other glyph in between [0-9] but it 
may not be contiguous in a collating value notation.

Jörg

-- 
 EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin
joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
 URL: http://cdrecord.org/private/ http://sf.net/projects/schilytools/files/'

Re: can [[:digit:]] match something other than 0123456789?

2018-05-16 Thread Geoff Clare

Stephane Chazelas  wrote, on 15 May 2018:
>
> OK, so to rephrase and make sure I understand correctly. In
> locales other than C, [[:digit:]] will be guaranteed to match on
> 0123456789 only but not [0-9]. 0123456789 are guaranteed to be
> in that order but [0-9] is unspecified anyway outside of the C
> locale.
> 
> That's a bit counter-intuitive

Not really, when you consider that ranges should use the collation
sequence, not character encodings.  (For the C/POSIX locale that's
required - for others it's not, but it's the obvious way to implement
ranges with multibyte characters.)

In languages where there are alternative "digit" representations,
the locale definition might give the various representations of each
"digit" the same primary weight in the collating sequence, in which
case [0-9] would include some characters that are not true digits
(according to iswdigit()).

-- 
Geoff Clare 
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England

Re: can [[:digit:]] match something other than 0123456789?

2018-05-16 Thread Joerg Schilling

Robert Elz  wrote:

> would be easy, but you say it also has to look for
>
>   (c) [[:latindigs:]]+
>   (c) [[:vdigits:]]+
>
> (and how many more)?   This is actually kind of important, as
>
>   (c) MMXVI
>
> type strings are not uncommon in certain environments (can't recall
> ever seeing one written in Venusian though...)

We discussed whether

\u4e00 \u4e8c \u4e09

should be a valid number made of [[:digit:]] in a japanese locale, 
but it seems to be not a good idea.

If we did this, any program that deals with digits would not only need to know 
the rules for the indian (frequently called arabic) numbers but also the rules 
for other schemes.

Jörg

-- 
 EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin
joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
 URL: http://cdrecord.org/private/ http://sf.net/projects/schilytools/files/'

Re: can [[:digit:]] match something other than 0123456789?

2018-05-16 Thread Joerg Schilling

Robert Elz  wrote:

> How does one specify a locale for some area using Latin as its
> language, where I V X L C D M are the digits ?

how do you like to specify a hexadecimal number in this locale?

Jörg

-- 
 EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin
joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
 URL: http://cdrecord.org/private/ http://sf.net/projects/schilytools/files/'

Re: can [[:digit:]] match something other than 0123456789?

2018-05-15 Thread Shware Systems

Yes, as static rosters it is nominally unworkable, so as far as I can see it
isn't considered portable enough to standardize. K&R originally just wanted to
support decimal and octal in C, iirc, and octal only because DEC did PDP core
dumps that way. While Unicode provides some support for rosters of arbitrary
numbers as char32_t 'digits', this is still limited to what an implementation
is willing to support in terms of text fields and numeric conversions; it is
not something a portable application can add to on the fly by defining a POSIX
or CLDR locale with a "digit set factory" that [:digit:] could be written to
take into account automatically.


In a message dated 5/15/2018 7:29:49 PM Eastern Standard Time, 
k...@munnari.oz.au writes:

 
Date: Tue, 15 May 2018 18:42:29 -0400

 From: Shware Systems 
 Message-ID: <16365f81e7e-179a-29...@webjas-vab019.srv.aolmail.net>

 | That locale would define a latindigs charclass, same as Venusians are
 | required to define a vdigits for theirs, and it's up to the application to
 | do the equivalences to 1, 5, 10, 50, etc. in a latinstr2ull() routine.

That would be unworkable - it would mean that every application would need to
know the details of every locale that could possibly be used.

Eg: consider an application looking for copyright strings in files (I can't 
type the c in a circle so I will use (c)). That is, a (c) and a year (or
a sequence of years perhaps), matching

 (c) [[:digit:]]+

would be easy, but you say it also has to look for

 (c) [[:latindigs:]]+
 (c) [[:vdigits:]]+

(and how many more)? This is actually kind of important, as

 (c) MMXVI

type strings are not uncommon in certain environments (can't recall
ever seeing one written in Venusian though...)

It gets worse if it is accepted that [:unknown:] is undefined/unspecified
rather than just "no match" - then the code actually has to adapt itself
to the locale that is actually in use, rather than simply covering all known
locales.

kre

Re: can [[:digit:]] match something other than 0123456789?

2018-05-15 Thread Robert Elz

Date:Tue, 15 May 2018 18:42:29 -0400
From:Shware Systems 
Message-ID:  <16365f81e7e-179a-29...@webjas-vab019.srv.aolmail.net>

  | That locale would define a latindigs charclass, same as Venusians are
  | required to define a vdigits for theirs, and it's up to the application to
  | do the equivalences to 1, 5, 10, 50, etc. in a latinstr2ull() routine.

That would be unworkable - it would mean that every application would need to
know the details of every locale that could possibly be used.

Eg: consider an application looking for copyright strings in files (I can't 
type the c in a circle so I will use (c)).   That is, a (c) and a year (or
a sequence of years perhaps), matching

(c) [[:digit:]]+

would be easy, but you say it also has to look for

(c) [[:latindigs:]]+
(c) [[:vdigits:]]+

(and how many more)?   This is actually kind of important, as

(c) MMXVI

type strings are not uncommon in certain environments (can't recall
ever seeing one written in Venusian though...)

It gets worse if it is accepted that [:unknown:] is undefined/unspecified
rather than just "no match" - then the code actually has to adapt itself
to the locale that is actually in use, rather than simply covering all known
locales.
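
For reference, the simple case might look like the following C sketch using POSIX regcomp(); the function name copyright_year is invented for this illustration. Note that in an ERE the literal parentheses of (c) have to be escaped.

```c
#include <regex.h>
#include <stdlib.h>

/* Return the first run of [[:digit:]] that follows "(c) " in line as a
 * number, or -1 if no such notice is found. */
long copyright_year(const char *line)
{
    regex_t re;
    regmatch_t m[2];
    long year = -1;

    /* The unescaped group captures the digits themselves. */
    if (regcomp(&re, "\\(c\\) ([[:digit:]]+)", REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, line, 2, m, 0) == 0) {
        /* m[1] starts at the first digit; strtol stops at the first
         * character that is not one of 0..9. */
        year = strtol(line + m[1].rm_so, NULL, 10);
    }
    regfree(&re);
    return year;
}
```

A line such as "(c) MMXVI" is simply not found by this pattern, which is exactly the problem: every further notation needs its own pattern and its own conversion.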

kre

Re: can [[:digit:]] match something other than 0123456789?

2018-05-15 Thread Shware Systems

That locale would define a latindigs charclass, same as Venusians are required 
to define a vdigits for theirs, and it's up to the application to do the 
equivalences to 1, 5, 10, 50, etc. in a latinstr2ull() routine.
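
A rough sketch of what such a routine could look like in C. latinstr2ull() is purely hypothetical and not part of any standard or library; this version accepts only the upper-case ASCII letters, returns 0 on anything else, and does not check that the numeral is in canonical form.

```c
#include <stddef.h>

/* Value of one Roman numeral letter, or 0 if c is not one. */
static unsigned roman_letter_value(char c)
{
    switch (c) {
    case 'I': return 1;
    case 'V': return 5;
    case 'X': return 10;
    case 'L': return 50;
    case 'C': return 100;
    case 'D': return 500;
    case 'M': return 1000;
    default:  return 0;
    }
}

/* Hypothetical latinstr2ull(): a letter smaller than its right-hand
 * neighbour is subtracted from it (IV = 4, CM = 900). Returns 0 for an
 * empty string or one containing any other character. */
unsigned long long latinstr2ull(const char *s)
{
    unsigned long long total = 0;

    for (size_t i = 0; s[i] != '\0'; i++) {
        unsigned cur = roman_letter_value(s[i]);
        unsigned next;

        if (cur == 0)
            return 0;
        next = roman_letter_value(s[i + 1]); /* 0 at the terminating NUL */
        if (cur < next) {
            total += next - cur; /* subtractive pair, consume both letters */
            i++;
        } else {
            total += cur;
        }
    }
    return total;
}
```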


In a message dated 5/15/2018 6:31:31 PM Eastern Standard Time, 
k...@munnari.oz.au writes:

 
Date: Tue, 15 May 2018 13:38:15 -0500

 From: Eric Blake 
 Message-ID: <08af8b99-dcf0-5775-3aed-533611cec...@redhat.com>

 | Please read http://austingroupbugs.net/view.php?id=1078 where this 
 | wording has been tightened to cover ALL locales, not just the POSIX 
 | locale, to better match with C requirements on isdigit().

How does one specify a locale for some area using Latin as its
language, where I V X L C D M are the digits ?

kre

Re: can [[:digit:]] match something other than 0123456789?

2018-05-15 Thread Steffen Nurpmeso

Stephane Chazelas  wrote:
 |2018-05-15 16:55:45 -0500, Eric Blake:
 |> On 05/15/2018 03:43 PM, Stephane Chazelas wrote:
 |>>Does that mean that [0-9] is also guaranteed to match on
 |>>0123456789 only? And that then [[:digit:]] in regexp/fnmatch is
 |>>close to useless as it's longer than [0-9]
 |> 
 |> Yes, I think that's a fair conclusion for the C locale, by virtue of the
 |> fact that the standard requires the encoding for 0-9 to be contiguous \
 |> and in
 |> order.
 |> 
 |>>and is a bit
 |>>misleading as it suggests it would be affected by localisation
 |>>(like the other character classes) while it's not.
 |> 
 |> It's still useful in non-C locales within regexp, since ALL uses of - for
 |> ranges within [] has unspecified (or was it implementation-defined)
 |> semantics outside of the C locale.  Using a named reference guarantees the
 |> desired semantics of exactly 10 characters, rather than skirting on the
 |> grounds of whether the range operator behaves as desired in all locales
 |> rather than just the C locale.
 |[...]
 |
 |OK, so to rephrase and make sure I understand correctly. In
 |locales other than C, [[:digit:]] will be guaranteed to match on
 |0123456789 only but not [0-9]. 0123456789 are guaranteed to be
 |in that order but [0-9] is unspecified anyway outside of the C
 |locale.
 |
 |That's a bit counter-intuitive and (as noted by @isaac at
 |https://unix.stackexchange.com/questions/414226/difference-between-0-9-digit\
 |-and-d/414230?noredirect=1#comment804362_414230)
 |is the opposite of what perl (in unicode mode), php (in unicode
 |mode), pcre (with (*UCP)) do: their [0-9] matches 0123456789
 |while their \d/[[:digit:]] match based on Unicode properties so
 |other decimal digits than the 0123456789 ones.

Unicode knows about decimal numbers, hexdigits and
ascii_hexdigit[s].  If i recall correctly the property of the
former is to offer ten successive numbers which correspond to what
we know as digits, while possibly looking different etc.  Given
the latter property it makes sense to treat [0-9] as ASCII
compatible but let [:digit:] match whatever a language desires.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

Re: can [[:digit:]] match something other than 0123456789?

2018-05-15 Thread Robert Elz

Date:Tue, 15 May 2018 13:38:15 -0500
From:Eric Blake 
Message-ID:  <08af8b99-dcf0-5775-3aed-533611cec...@redhat.com>

  | Please read http://austingroupbugs.net/view.php?id=1078 where this 
  | wording has been tightened to cover ALL locales, not just the POSIX 
  | locale, to better match with C requirements on isdigit().

How does one specify a locale for some area using Latin as its
language, where I V X L C D M are the digits ?

kre

Re: can [[:digit:]] match something other than 0123456789?

2018-05-15 Thread Stephane Chazelas

2018-05-15 16:55:45 -0500, Eric Blake:
> On 05/15/2018 03:43 PM, Stephane Chazelas wrote:
> >
> >Does that mean that [0-9] is also guaranteed to match on
> >0123456789 only? And that then [[:digit:]] in regexp/fnmatch is
> >close to useless as it's longer than [0-9]
> 
> Yes, I think that's a fair conclusion for the C locale, by virtue of the
> fact that the standard requires the encoding for 0-9 to be contiguous and in
> order.
> 
> >and is a bit
> >misleading as it suggests it would be affected by localisation
> >(like the other character classes) while it's not.
> 
> It's still useful in non-C locales within regexp, since ALL uses of - for
> ranges within [] has unspecified (or was it implementation-defined)
> semantics outside of the C locale.  Using a named reference guarantees the
> desired semantics of exactly 10 characters, rather than skirting on the
> grounds of whether the range operator behaves as desired in all locales
> rather than just the C locale.
[...]

OK, so to rephrase and make sure I understand correctly. In
locales other than C, [[:digit:]] will be guaranteed to match on
0123456789 only but not [0-9]. 0123456789 are guaranteed to be
in that order but [0-9] is unspecified anyway outside of the C
locale.

That's a bit counter-intuitive and (as noted by @isaac at
https://unix.stackexchange.com/questions/414226/difference-between-0-9-digit-and-d/414230?noredirect=1#comment804362_414230)
is the opposite of what perl (in unicode mode), php (in unicode
mode), pcre (with (*UCP)) do: their [0-9] matches 0123456789
while their \d/[[:digit:]] match based on Unicode properties so
other decimal digits than the 0123456789 ones.

-- 
Stephane

Re: can [[:digit:]] match something other than 0123456789?

2018-05-15 Thread Eric Blake


On 05/15/2018 03:43 PM, Stephane Chazelas wrote:


> Does that mean that [0-9] is also guaranteed to match on
> 0123456789 only? And that then [[:digit:]] in regexp/fnmatch is
> close to useless as it's longer than [0-9]


Yes, I think that's a fair conclusion for the C locale, by virtue of the 
fact that the standard requires the encoding for 0-9 to be contiguous 
and in order.



> and is a bit
> misleading as it suggests it would be affected by localisation
> (like the other character classes) while it's not.


It's still useful in non-C locales within regexp, since ALL uses of - 
for ranges within [] have unspecified (or was it implementation-defined) 
semantics outside of the C locale.  Using a named reference guarantees 
the desired semantics of exactly 10 characters, rather than relying on 
whether the range operator behaves as desired in all 
locales rather than just the C locale.
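
As a small illustration on the fnmatch side, here is a sketch in C; the helper name all_digits is made up for the example. It uses the named class in a negated bracket expression instead of a range.

```c
#include <fnmatch.h>

/* Return 1 if s is non-empty and every character is in [[:digit:]],
 * 0 otherwise. */
int all_digits(const char *s)
{
    if (*s == '\0')
        return 0;
    /* "*[![:digit:]]*" matches any string that contains a non-digit,
     * so FNM_NOMATCH means only digits were seen. */
    return fnmatch("*[![:digit:]]*", s, 0) == FNM_NOMATCH;
}
```

Writing the pattern with [!0-9] instead would bring back the dependence on how the implementation evaluates ranges in the current locale.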


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: can [[:digit:]] match something other than 0123456789?

2018-05-15 Thread Shware Systems

For that hypothetical Venusian locale, as discussed for 1078, it would be 
expected to define a VDIGIT (sic) custom LC_CTYPE charclass for specifying 
other character names representing digits, and then using [[:digit:][:VDIGIT:]] 
to test for both. Code like this couldn't be considered strictly conforming, 
but might qualify for NLS-conforming. Also, application-specific locales can 
add names to a digit definition, with the same caveat, and then [:digit:] would 
be different from [0-9]. Similarly, VDIGIT could include [0-9] plus other 
names, and then code would only need to use [:VDIGIT:] to test for both.

In a message dated 5/15/2018 4:55:20 PM Eastern Standard Time, 
stephane.chaze...@gmail.com writes:

2018-05-15 13:38:15 -0500, Eric Blake:

> On 05/15/2018 12:50 PM, Stephane Chazelas wrote:
[...]
> >> digit
> >> Define the characters to be classified as numeric digits.
> >>
> >> In the POSIX locale, only:
> >>
> >>0 1 2 3 4 5 6 7 8 9
> 
> Please read http://austingroupbugs.net/view.php?id=1078 where this wording
> has been tightened to cover ALL locales, not just the POSIX locale, to
> better match with C requirements on isdigit().
[...]

Thanks.

I somehow missed that one.

Does that mean that [0-9] is also guaranteed to match on
0123456789 only? And that then [[:digit:]] in regexp/fnmatch is
close to useless as it's longer than [0-9] and is a bit
misleading as it suggests it would be affected by localisation
(like the other character classes) while it's not.

-- 
Stephane

Re: can [[:digit:]] match something other than 0123456789?

2018-05-15 Thread Stephane Chazelas

2018-05-15 13:38:15 -0500, Eric Blake:
> On 05/15/2018 12:50 PM, Stephane Chazelas wrote:
[...]
> >>   digit
> >>   Define the characters to be classified as numeric digits.
> >>
> >>   In the POSIX locale, only:
> >>
> >>0 1 2 3 4 5 6 7 8 9
> 
> Please read http://austingroupbugs.net/view.php?id=1078 where this wording
> has been tightened to cover ALL locales, not just the POSIX locale, to
> better match with C requirements on isdigit().
[...]

Thanks.

I somehow missed that one.

Does that mean that [0-9] is also guaranteed to match on
0123456789 only? And that then [[:digit:]] in regexp/fnmatch is
close to useless as it's longer than [0-9] and is a bit
misleading as it suggests it would be affected by localisation
(like the other character classes) while it's not.

-- 
Stephane

Re: can [[:digit:]] match something other than 0123456789?

2018-05-15 Thread Eric Blake


On 05/15/2018 12:50 PM, Stephane Chazelas wrote:

You're a bit late to the party on this question :)


>   digit
>   Define the characters to be classified as numeric digits.
>
>   In the POSIX locale, only:
>
>   0 1 2 3 4 5 6 7 8 9


Please read http://austingroupbugs.net/view.php?id=1078 where this 
wording has been tightened to cover ALL locales, not just the POSIX 
locale, to better match with C requirements on isdigit().


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
