Re: Date era specifications in the standard

2022-01-24 Thread Hans Åberg via austin-group-l at The Open Group


> On 24 Jan 2022, at 10:50, Geoff Clare via austin-group-l at The Open Group 
>  wrote:
> 
> Thorsten Glaser wrote, on 23 Jan 2022:
>> 
>> I just saw “1 AD” (“anno domini”) in line 70246 of the
>> current working draft, and I’d like to ask that date+era
>> specifications do not use the abrahamistic designation (in
>> English, BC/AD) but the neutral one (BCE/CE, common era).
> 
> That particular occurrence in the time() FUTURE DIRECTIONS section
> has been removed by bug 1462, but there are a few other places (all
> relating to the "era" locale keyword) that use AD or BC.  Changing
> some of them would not be as simple as "s/BC/BCE/;s/AD/CE/", because
> they are given specifically as examples relating to "the Christian
> era".

It seems that the standard refers only to the Gregorian calendar, which was 
introduced on October 15, 1582. The Julian calendar ended the day before, on 
October 4, so 10 calendar days were skipped.

A way to keep track of this is by using Julian day numbers (JDN).

https://en.wikipedia.org/wiki/Julian_day
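
As a rough illustration of the JDN idea, here is a minimal C sketch using the 
widely published integer formulas for converting a calendar date to a Julian 
day number. The function names are hypothetical, years are in astronomical 
numbering, and the sketch assumes years later than -4800 so that all the 
integer divisions operate on non-negative values.

```c
/* Julian day number of a date in the (proleptic) Gregorian calendar. */
long jdn_from_gregorian(int year, int month, int day)
{
    long a = (14 - month) / 12;      /* 1 for January and February, else 0 */
    long y = (long)year + 4800 - a;  /* shifted year that starts in March */
    long m = month + 12 * a - 3;     /* March = 0, ..., February = 11 */

    return day + (153 * m + 2) / 5 + 365 * y
         + y / 4 - y / 100 + y / 400 - 32045;
}

/* Julian day number of a date in the Julian calendar. */
long jdn_from_julian(int year, int month, int day)
{
    long a = (14 - month) / 12;
    long y = (long)year + 4800 - a;
    long m = month + 12 * a - 3;

    return day + (153 * m + 2) / 5 + 365 * y + y / 4 - 32083;
}
```

With these, jdn_from_julian(1582, 10, 4) gives 2299160 and 
jdn_from_gregorian(1582, 10, 15) gives 2299161, so the two dates are 
consecutive days even though 10 calendar dates were skipped.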





Re: Date era specifications in the standard

2022-01-23 Thread Hans Åberg via austin-group-l at The Open Group


> On 23 Jan 2022, at 20:05, Thorsten Glaser via austin-group-l at The Open 
> Group  wrote:
> 
> I just saw “1 AD” (“anno domini”) in line 70246 of the
> current working draft, and I’d like to ask that date+era
> specifications do not use the abrahamistic designation (in
> English, BC/AD) but the neutral one (BCE/CE, common era).

It would be better to use astronomical year numbering.

https://en.wikipedia.org/wiki/Astronomical_year_numbering
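
To make the difference concrete: in astronomical year numbering there is a 
year 0, which corresponds to 1 BC (1 BCE), and earlier years are negative. A 
minimal sketch of the conversion follows; the function name and the boolean 
flag are hypothetical and only serve the illustration.

```c
#include <stdbool.h>

/*
 * Convert an era-based year (1, 2, 3, ...) to astronomical numbering.
 * before_era is true for BC/BCE years and false for AD/CE years.
 */
int astronomical_year(int era_year, bool before_era)
{
    return before_era ? 1 - era_year : era_year;
}
```

So 1 BC maps to 0, 2 BC maps to -1, and AD/CE years are unchanged. The era 
label disappears from the representation altogether.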





Re: LC_CTYPE=UTF-8

2020-06-25 Thread Hans Åberg


> On 25 Jun 2020, at 15:19, Ingo Schwarze  wrote:
> 
> Hans Aberg wrote on Thu, Jun 25, 2020 at 10:15:03AM +0200:
> 
>> MacOS sets as default LC_CTYPE=UTF-8, not appearing in the 'locale
>> -a' list. Then some software interprets this as though the locale
>> is C/POSIX, disregards the UTF-8 encoding, and converts all non-ASCII
>> (high bit set) char's into octal escape sequences. What is the
>> correct interpretation here?
> 
> The correct interpretation of "LC_CTYPE=UTF-8" is whatever the
> documentation of the respective operating system says.
> All POSIX says is:
> 
>  https://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html
> 
>  The locale argument is a pointer to a character string containing
>  the required setting of category.  The contents of this string are
>  implementation-defined.
> 
> POSIX only specifies the meaning of the strings "C" and "POSIX";
> any others are implementation-defined.

This is also what I thought. As for your other comments, rather than 
checking for a particular syntax, it seems that the software in question checks 
the 'locale -a' list and, if the name is not there, applies the C/POSIX locale.

Perhaps there should be a default UTF-8 locale: it seems that the current 
construct does not fit it very well.
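
For comparison, a program does not have to consult the 'locale -a' list at 
all. The sketch below, which is only an assumption about how software could 
behave and not a description of any particular program, asks setlocale() 
whether the implementation accepts the name and then reports the codeset via 
nl_langinfo(CODESET). The function name ctype_codeset is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L

#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/*
 * Try to select `name` for LC_CTYPE and return the codeset the
 * implementation associates with it, or NULL if the name is rejected.
 * Side effect: on success the process-wide LC_CTYPE setting is changed.
 */
const char *ctype_codeset(const char *name)
{
    if (setlocale(LC_CTYPE, name) == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

Whether ctype_codeset("UTF-8") succeeds, and what it returns, is 
implementation-defined; only "C" and "POSIX" are guaranteed to be accepted.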





LC_CTYPE=UTF-8

2020-06-25 Thread Hans Åberg


MacOS sets as default LC_CTYPE=UTF-8, not appearing in the 'locale -a' list. 
Then some software interprets this as though the locale is C/POSIX, disregards 
the UTF-8 encoding, and converts all non-ASCII (high bit set) char's into octal 
escape sequences. What is the correct interpretation here?





Re: About issue 0001108 and abs(INT_MIN)

2018-07-19 Thread Hans Åberg


> On 19 Jul 2018, at 14:13, Joseph Myers  wrote:
> 
> On Thu, 19 Jul 2018, Joerg Schilling wrote:
> 
>> Since the C++ people already think about making this happen in their next 
>> standard, it seems that the C compilers may do something similar in the 
>> future.
> 
> The latest version of the C++ proposal 
>  is 
> clear that it does not change undefined overflow (while adding the new 
> constraint to representations to reflect more or less universal existing 
> practice in that regard).  I quote:
> 
>    Status-quo If a signed operation would naturally produce a value that 
>    is not within the range of the result type, the behavior is undefined. 
>    The author had hoped to make this well-defined as wrapping (the 
>    operations produce the same value bits as for the corresponding 
>    unsigned type), but WG21 had strong resistance against this.

Is this true for the intN_t types, which require 2's complement? The 
uintN_t types are required to compute modulo 2^N, as I recall, so it would seem 
that CPUs that support those also use the same operations for the former, but 
with a different interpretation in the set of integers.
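
To illustrate what I mean by the same operation under a different 
interpretation, here is a minimal sketch for N = 32. The function name is 
hypothetical. The addition is done on uint32_t, where it is defined modulo 
2^32, and the resulting bits are then read back as int32_t. This is only a 
way to get wrapping explicitly; it does not change the fact that plain signed 
overflow is undefined.

```c
#include <stdint.h>
#include <string.h>

/* Add two int32_t values with wraparound instead of undefined overflow. */
int32_t wrapping_add_i32(int32_t a, int32_t b)
{
    /* Conversion to uint32_t and unsigned addition are defined mod 2^32. */
    uint32_t sum = (uint32_t)a + (uint32_t)b;
    int32_t result;

    /* int32_t is two's complement without padding, so copy the bits. */
    memcpy(&result, &sum, sizeof result);
    return result;
}
```

For example, wrapping_add_i32(INT32_MAX, 1) yields INT32_MIN, whereas 
INT32_MAX + 1 written directly on signed operands is undefined.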





Re: can [[:digit:]] match something other than 0123456789?

2018-05-17 Thread Hans Åberg

> On 17 May 2018, at 11:02, Joerg Schilling 
> <joerg.schill...@fokus.fraunhofer.de> wrote:
> 
> Hans Åberg <haber...@telia.com> wrote:
> 
>>>> |I asked a person who speaks japanese and he told me that
>>>> |
>>>> | "\u4e00\u4e8c\u4e09"
>>>> |
>>>> |is similar to
>>>> |
>>>> | "one two three"
>>>> |
>>>> |and this is not used for computing.
>>>> 
>>>> If i recall correctly this has been discussed already; if not here
>>>> then on the Unicode list.  Unicode brings quite a lot of
>>>> codepoints, like CIRCLED DIGIT ONE, PARENTHESIZED DIGIT ONE, DIGIT
>>>> ONE FULL STOP etc.  All these are marked "No", and i think the
>>>> discussion concluded that they should not be taken into account
>>>> when converting strings to numbers.
>> 
>> The intent may be that the value of the digit character c can be computed by 
>> the expression c - '0' when >= 0 and <= 9, and is otherwise a non-digit. 
>> Then 'isdigit' and [[:digit:]] are tied to that, so it is impossible to use 
>> any other decimal digits.
> 
> This seems to be an important idea, as this japanese one two three
> is not in a contiguous order.

It provides an efficient implementation, which was important on earlier 
computers. The UTF-8 article [1], in its "History" section, mentions that around 
1992 they struggled to find proposals that allowed efficient implementations.

1. https://en.wikipedia.org/wiki/UTF-8





Re: can [[:digit:]] match something other than 0123456789?

2018-05-16 Thread Hans Åberg


> On 16 May 2018, at 18:13, Hans Åberg <haber...@telia.com> wrote:
> 
> 
>> On 16 May 2018, at 17:14, Steffen Nurpmeso <stef...@sdaoden.eu> wrote:
>> 
>> Joerg Schilling <joerg.schill...@fokus.fraunhofer.de> wrote:
>> |Steffen Nurpmeso <stef...@sdaoden.eu> wrote:
>> |>|> have some Unicode support.
>> |>|
>> |>|What do you expect: 
>> |>|
>> |>| strtol("\u4e00\u4e8c\u4e09", , 0);
>> |>
>> |> The entire is*() family cannot work with multibyte or stateful
>> |> encodings, right.
>> |
>> |I asked a person who speaks japanese and he told me that
>> |
>> | "\u4e00\u4e8c\u4e09"
>> |
>> |is similar to
>> |
>> | "one two three"
>> |
>> |and this is not used for computing.
>> 
>> If i recall correctly this has been discussed already; if not here
>> then on the Unicode list.  Unicode brings quite a lot of
>> codepoints, like CIRCLED DIGIT ONE, PARENTHESIZED DIGIT ONE, DIGIT
>> ONE FULL STOP etc.  All these are marked "No", and i think the
>> discussion concluded that they should not be taken into account
>> when converting strings to numbers.

The intent may be that the value of the digit character c can be computed by 
the expression c - '0' when >= 0 and <= 9, and is otherwise a non-digit. Then 
'isdigit' and [[:digit:]] are tied to that, so it is impossible to use any 
other decimal digits.
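
A minimal sketch of that intent, with a hypothetical function name. C 
guarantees that the characters '0' through '9' have consecutive values, which 
is what makes the subtraction work.

```c
/* Value of the decimal digit c, or -1 if c is not one of '0'..'9'. */
int digit_value(int c)
{
    if (c < '0' || c > '9')
        return -1;
    return c - '0';
}
```

Any character outside that contiguous range, including every non-ASCII 
digit, is simply reported as a non-digit.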





Re: can [[:digit:]] match something other than 0123456789?

2018-05-16 Thread Hans Åberg

> On 16 May 2018, at 17:14, Steffen Nurpmeso <stef...@sdaoden.eu> wrote:
> 
> Joerg Schilling <joerg.schill...@fokus.fraunhofer.de> wrote:
> |Steffen Nurpmeso <stef...@sdaoden.eu> wrote:
> |>|> have some Unicode support.
> |>|
> |>|What do you expect: 
> |>|
> |>| strtol("\u4e00\u4e8c\u4e09", , 0);
> |>
> |> The entire is*() family cannot work with multibyte or stateful
> |> encodings, right.
> |
> |I asked a person who speaks japanese and he told me that
> |
> | "\u4e00\u4e8c\u4e09"
> |
> |is similar to
> |
> | "one two three"
> |
> |and this is not used for computing.
> 
> If i recall correctly this has been discussed already; if not here
> then on the Unicode list.  Unicode brings quite a lot of
> codepoints, like CIRCLED DIGIT ONE, PARENTHESIZED DIGIT ONE, DIGIT
> ONE FULL STOP etc.  All these are marked "No", and i think the
> discussion concluded that they should not be taken into account
> when converting strings to numbers.  Hans Åberg surely knows
> better than I.

I am happier the less I know about these issues, and UTF-8 was invented to help 
with that! :-)

It was ICU Regular Expressions I had in mind, which, according to this link, 
can do matching on all Unicode classes, including case insensitive matching 
where the cases have different lengths.
  http://userguide.icu-project.org/strings/regexp

So as for the original question, I think the idea is that one is supposed to 
define a C character set, and then those C functions act on it. Harbison & 
Steele says that the isdigit function tests whether a character is one of the 
ten digits one has defined, which is what [[:digit:]] is supposed to match, I 
think.

So you can define your locale to have whatever ten characters you like and 
render them as you please as long as they are ten and are contiguous and have 
the intended function as decimal digits. Or so I think.

If one wants other character classes matching outside of that, it is safest to 
do as ICU Regular Expressions, defining with respect to Unicode.





Re: can [[:digit:]] match something other than 0123456789?

2018-05-16 Thread Hans Åberg

> On 16 May 2018, at 10:53, Joerg Schilling 
> <joerg.schill...@fokus.fraunhofer.de> wrote:
> 
> Hans Åberg <haber...@telia.com> wrote:
> 
>> 
>>> On 16 May 2018, at 10:29, Joerg Schilling 
>>> <joerg.schill...@fokus.fraunhofer.de> wrote:
>>> 
>>> Robert Elz <k...@munnari.oz.au> wrote:
>>> 
>>>> How does one specify a locale for some area using Latin as its
>>>> language, where I V X L C D M are the digits ?
>>> 
>>> how do you like to specify a hexadecimal number in this locale?
>> 
>> They have no need for that in Latin, as "hexa" is Greek. :-) Otherwise, you 
>> might check what the ECMAscript and C++ regex library do, which have some 
>> Unicode support.
> 
> What do you expect: 
> 
>   strtol("\u4e00\u4e8c\u4e09", , 0);
> 
> to return in a japanese locale and what do you expect:
> 
>   strtol("0XC", , 0);
> 
> to return in a latin locale?

I'm on MacOS, which has no language set, only LC_CTYPE="UTF-8". And std::strtol 
does not seem to accept explicit Unicode strings [1]. And if you want to use 
Latin numerals, you should probably use "Ⅹ" U+2169 and "Ⅽ" U+216D, so it is a 
non-issue.

1. http://en.cppreference.com/w/cpp/string/byte/strtol
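
For what it is worth, here is a small sketch of what the two calls from the 
question do in the default "C" locale, where only the standard forms are 
accepted. The helper name parse_long is hypothetical, and overflow checking 
via errno is left out to keep it short.

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Parse s with strtol() and base 0 (so "0x"/"0X" and leading "0" are
 * recognized). Returns true and stores the value only if the whole
 * string was consumed.
 */
bool parse_long(const char *s, long *out)
{
    char *end;
    long value = strtol(s, &end, 0);

    if (end == s || *end != '\0')
        return false;   /* no digits found, or trailing garbage */
    *out = value;
    return true;
}
```

In the "C" locale, "0XC" parses as 12, while the UTF-8 bytes of 
"\u4e00\u4e8c\u4e09" are rejected because no conversion is performed. Other 
locales are allowed to accept additional forms, so this says nothing about a 
hypothetical Japanese or Latin locale.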





Re: can [[:digit:]] match something other than 0123456789?

2018-05-16 Thread Hans Åberg

> On 16 May 2018, at 10:29, Joerg Schilling 
>  wrote:
> 
> Robert Elz  wrote:
> 
>> How does one specify a locale for some area using Latin as its
>> language, where I V X L C D M are the digits ?
> 
> how do you like to specify a hexadecimal number in this locale?

They have no need for that in Latin, as "hexa" is Greek. :-) Otherwise, you 
might check what the ECMAScript and C++ regex libraries do [1, 2], which have 
some Unicode support.

1. http://en.cppreference.com/w/cpp/regex
2. http://en.cppreference.com/w/cpp/regex/ecmascript





Re: UTF-8 locale & POSIX text model

2017-11-27 Thread Hans Åberg


> On 27 Nov 2017, at 22:51, Chet Ramey <chet.ra...@case.edu> wrote:
> 
> On 11/27/17 1:12 PM, Hans Åberg wrote:
> 
>>>> On MacOS 10.13, one can set locale environment variables. The Terminal 
>>>> default login shell reads .profile; xterm reads .bashrc. There are other 
>>>> ways to set them system-wide, changing with the OS version.
>>> 
>>> Terminal has been able to pass the locale environment variables to the
>>> shell it starts for a long time.
>> 
>> With that, I only get LC_CTYPE=UTF-8.
> 
> That's odd. If I choose `UTF-8' as the text encoding and check the box for
> "set locale environment variables ..." I get en_US.UTF-8.

Indeed. Stephane said other MacOS users reported the same as me, so you may 
have gotten some system-wide alterations. I switched to using .profile as it 
seemed more reliable, and in the .bashrc, I have 'source ~/.profile' settling 
that part, too.





Re: UTF-8 locale & POSIX text model

2017-11-27 Thread Hans Åberg


> On 27 Nov 2017, at 22:04, Chet Ramey <chet.ra...@case.edu> wrote:
> 
> On 11/27/17 12:51 PM, Hans Åberg wrote:
> 
>> On MacOS 10.13, one can set locale environment variables. The Terminal 
>> default login shell reads .profile; xterm reads .bashrc. There are other 
>> ways to set them system-wide, changing with the OS version.
> 
> Terminal has been able to pass the locale environment variables to the
> shell it starts for a long time.

With that, I only get LC_CTYPE=UTF-8.

> xterm behaves differently (but you can
> set your XQuartz menu entry to start xterm with -ls so it starts as a
> login shell if you only want to change one startup file).

I recall different ways to start bash: Terminal starts a login shell, but xterm 
does not.





Re: UTF-8 locale & POSIX text model

2017-11-27 Thread Hans Åberg


> On 27 Nov 2017, at 19:35, Chet Ramey <chet.ra...@case.edu> wrote:
> 
> On 11/27/17 1:19 AM, Hans Åberg wrote:
> 
>>>> The deprecated HFS uses UTF-16, but MacOS has LC_CTYPE=UTF-8; thus with no 
>>>> additional qualifications like in LC_CTYPE=en_US.UTF-8. It would be 
>>>> interesting to know if it is POSIX conforming, as it causes confusion with 
>>>> some software. 
>>> 
>>> I don't see that:
>>> 
>>> $ uname -a
>>> Darwin jenna.local 16.7.0 Darwin Kernel Version 16.7.0: Wed Oct  4 00:17:00
>>> PDT 2017; root:xnu-3789.71.6~1/RELEASE_X86_64 x86_64
>>> $ locale
>>> LANG="en_US.UTF-8"
>>> LC_COLLATE="en_US.UTF-8"
>>> LC_CTYPE="en_US.UTF-8"
>>> LC_MESSAGES="en_US.UTF-8"
>>> LC_MONETARY="en_US.UTF-8"
>>> LC_NUMERIC="en_US.UTF-8"
>>> LC_TIME="en_US.UTF-8"
>>> LC_ALL=
>> 
>> Make sure they are not set in .profile or somewhere. I get:
> 
> They're not, but I have Terminal preferences set to export the appropriate
> environment variables to each (usually shell) process Terminal starts. I
> assume that's where it comes from.

On MacOS 10.13, one can set locale environment variables. The Terminal default 
login shell reads .profile; xterm reads .bashrc. There are other ways to set 
them system-wide, changing with the OS version.





Re: UTF-8 locale & POSIX text model

2017-11-27 Thread Hans Åberg


> On 27 Nov 2017, at 10:43, Stephane Chazelas <stephane.chaze...@gmail.com> 
> wrote:
> 
> 2017-11-26 22:40:45 +0100, Hans Åberg:
> [...]
>> The deprecated HFS uses UTF-16, but MacOS has LC_CTYPE=UTF-8;
>> thus with no additional qualifications like in
>> LC_CTYPE=en_US.UTF-8. It would be interesting to know if it is
>> POSIX conforming, as it causes confusion with some software. 
> 
> Yes, I've seen that being reported by several macOS users as
> well.
> 
> I've seen it causing problem when sshing to other systems (that
> have AcceptEnv LC* LANG in their sshd configuration) which
> generally don't have a locale named like that (but then again,
> it's a general issue when sshing between systems that have
> different locale names, POSIX doesn't specify locale names other
> than "C" and "POSIX").

I have seen it in some X Window programs, for example, some earlier version of 
xboard got the language display scrambled.

> Having LC_CTYPE set means it overrides the $LANG variable so
> could cause problem when users set $LANG to some other locale
> that uses a charset incompatible with UTF-8 (I don't know if
> macOS has such locales) or use LANG=C expecting it to cause the
> charset to be single-byte (they should use LC_ALL=C for that).
> 
> That probably means that macOS has no notion of regional
> differences in character classification or capitalisation, but
> then again, it's probably just as well.

I presume it is intentional, as they never supported any other locale. I 
suggested that they use UTF-8 when the shell system was about to become a 
default installation with OS X.

> In any case, that doesn't make macOS non-compliant, as POSIX
> only specifies a "C" and "POSIX" locale. IIRC, there was a
> proposal of specifying a C.UTF-8. I don't know how far that
> went.

It can cause compatibility problems, so it might be worth considering. UTF-8 
only seems fine.
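
The precedence mentioned in the quoted text above (LC_ALL overrides LC_CTYPE, 
which overrides LANG) can be sketched as follows. The function names are 
hypothetical, and the values are passed in as arguments rather than read with 
getenv() so that the logic is easy to try out. Unset and empty variables are 
treated the same way.

```c
#include <stddef.h>

/* True if an environment value is set and not empty. */
static int is_set(const char *value)
{
    return value != NULL && value[0] != '\0';
}

/*
 * Pick the locale name that applies to the LC_CTYPE category.
 * Returns NULL when nothing is set, in which case the
 * implementation-defined default locale applies.
 */
const char *effective_lc_ctype(const char *lc_all,
                               const char *lc_ctype,
                               const char *lang)
{
    if (is_set(lc_all))
        return lc_all;
    if (is_set(lc_ctype))
        return lc_ctype;
    if (is_set(lang))
        return lang;
    return NULL;
}
```

With LC_CTYPE=UTF-8 preset, setting LANG=C leaves the result at "UTF-8", 
while setting LC_ALL=C changes it to "C".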





Re: UTF-8 locale & POSIX text model

2017-11-27 Thread Hans Åberg


> On 27 Nov 2017, at 03:16, Chet Ramey <chet.ra...@case.edu> wrote:
> 
> On 11/26/17 1:40 PM, Hans Åberg wrote:
> 
>> The deprecated HFS uses UTF-16, but MacOS has LC_CTYPE=UTF-8; thus with no 
>> additional qualifications like in LC_CTYPE=en_US.UTF-8. It would be 
>> interesting to know if it is POSIX conforming, as it causes confusion with 
>> some software. 
> 
> I don't see that:
> 
> $ uname -a
> Darwin jenna.local 16.7.0 Darwin Kernel Version 16.7.0: Wed Oct  4 00:17:00
> PDT 2017; root:xnu-3789.71.6~1/RELEASE_X86_64 x86_64
> $ locale
> LANG="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_CTYPE="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_ALL=

Make sure they are not set in .profile or somewhere. I get:
$ uname -a
Darwin xxx.local 17.2.0 Darwin Kernel Version 17.2.0: Fri Sep 29 18:27:05 PDT 
2017; root:xnu-4570.20.62~3/RELEASE_X86_64 x86_64

$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

$ env
TERM_PROGRAM=Apple_Terminal
SHELL=/bin/bash
...
LC_CTYPE=UTF-8





Re: UTF-8 locale & POSIX text model

2017-11-26 Thread Hans Åberg

> On 26 Nov 2017, at 13:43, k...@keldix.com wrote:
> 
> Well, the pathname processing should be a function of the filesystem. E.g., if
> you have a Windows filesystem, or an Apple filesystem mounted on a Linux
> operating system, then the file names of the foreign system should be
> interpreted as for the originating system in question. I am not sure of the
> encoding of filenames on Windows and Apple systems, but their modern default
> character encoding is UTF-16.

The deprecated HFS uses UTF-16, but MacOS has LC_CTYPE=UTF-8; thus with no 
additional qualifications like in LC_CTYPE=en_US.UTF-8. It would be interesting 
to know if it is POSIX conforming, as it causes confusion with some software. 





Re: UTF-8 locale & POSIX text model

2017-11-26 Thread Hans Åberg


> On 26 Nov 2017, at 13:43, k...@keldix.com wrote:
> 
> I don't have Windows or Apple systems, but they run UTF-16 natively, and
> recent Windows 10 systems have a full Linux (Ubuntu) subsystem. I could also
> see problems with UTF-16 and POSIX, but at least Apple should have solved that
> problem with OS X and iOS.

APFS uses UTF-8.

https://developer.apple.com/library/content/documentation/FileManagement/Conceptual/APFS_Guide/FAQ/FAQ.html