Re: [issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread M.-A. Lemburg
On 19.03.2021 14:57, Inada Naoki wrote:
> 
> Background: PEP 597 adds a new `encoding="locale"` option to open() and 
> TextIOWrapper(). It is the same as `encoding=None` for now, but it means using 
> the "locale encoding" explicitly.
> 
> But this is wrong in UTF-8 mode.

Please address UTF-8 mode explicitly in open() or elsewhere. The locale
module is about the state of the lib C, not what Python enforces via
options in its own I/O layers.

As mentioned, both should ideally be synchronized, though, so
UTF-8 mode in Python should trigger setting a UTF-8 encoding
via setlocale().
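
A minimal sketch of the two views that can drift apart, assuming Python 3.7+ (for sys.flags.utf8_mode). The helper name describe_encodings() is made up for illustration and is not part of any proposal in this thread:

```python
import locale
import sys


def describe_encodings():
    """Collect what Python and the C library each report as the encoding."""
    info = {
        # 1 when Python runs in UTF-8 mode (e.g. -X utf8 or PYTHONUTF8=1)
        "utf8_mode": sys.flags.utf8_mode,
        # Python's view; this reports UTF-8 when UTF-8 mode is on
        "python_preferred": locale.getpreferredencoding(False),
    }
    # locale.nl_langinfo() is not available on every platform (e.g. Windows)
    if hasattr(locale, "nl_langinfo"):
        # The C library's view of the current LC_CTYPE codeset
        info["libc_codeset"] = locale.nl_langinfo(locale.CODESET)
    return info


if __name__ == "__main__":
    for key, value in describe_encodings().items():
        print(f"{key}: {value}")
```

Running this once normally and once with -X utf8 should show whether the two values agree on a given system.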

-- 
Marc-Andre Lemburg
eGenix.com

___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



Re: [issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread M.-A. Lemburg
On 19.03.2021 14:47, STINNER Victor wrote:
> 
> STINNER Victor added the comment:
> 
>> - If you add "current", people will rightly ask: then what do all the
>> other APIs in the locale module return ? Of course, they all return
>> the current state of settings :-) So this is unnecessary as well.
> 
> The problem is that there are two different "locale encodings", what I call:
> 
> * "current locale encoding": nl_langinfo(CODESET) in short
> * "Python locale encoding": "UTF-8" in some cases, nl_langinfo(CODESET) 
> otherwise

The UTF-8 mode is a Python invention. It doesn't have anything to
do with the lib C locale functions, which this module addresses and
interfaces to.

Please don't mix the two.

In fact, in order to avoid issues, Python should probably set the locale
encoding to UTF-8 as well, when run in UTF-8 mode. It's dangerous to
have Python and the lib C use different assumptions about the encoding,
esp. in embedded applications.
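
As a rough illustration of that synchronization, here is a sketch that tries to give the C library a UTF-8 LC_CTYPE whenever Python runs in UTF-8 mode. The function name and the candidate locale names ("C.UTF-8", "en_US.UTF-8") are assumptions; which UTF-8 locales exist depends on the system:

```python
import locale
import sys


def sync_libc_with_utf8_mode(candidates=("C.UTF-8", "en_US.UTF-8")):
    """If Python is in UTF-8 mode, try to switch LC_CTYPE to a UTF-8 locale.

    Returns the locale name that was set, or None if nothing was changed.
    """
    if not sys.flags.utf8_mode:
        return None  # Python follows the locale, nothing to synchronize
    for name in candidates:
        try:
            return locale.setlocale(locale.LC_CTYPE, name)
        except locale.Error:
            continue  # this locale is not installed, try the next one
    return None


if __name__ == "__main__":
    print("LC_CTYPE set to:", sync_libc_with_utf8_mode())
```

Note that setlocale() changes process-wide state, so an embedding application would have to decide whether such a call is acceptable.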

> It is unfortunate that the Python UTF-8 Mode which "ignores the locale" 
> changes the behavior of the locale module, of the 
> locale.getpreferredencoding() function. But the ship has sailed.
> 
> People are used to looking in the "locale" module to get the "locale" 
> encoding. So I prefer to put the function to get the "Python locale 
> encoding" in the locale module.
> 
> I propose to add "current" in the name since this encoding is not the one you 
> are looking for usually.
> 
> An alternative is to have a single function with an optional parameter. 
> Example:
> 
> * get_locale_encoding() or get_locale_encoding(True) returns the locale 
> encoding
> * get_locale_encoding(False) returns the current locale encoding

-1, both on the names and the idea to again add parameters which change
their meaning. We should have one function per meaning and really
only need the interface getencoding(), since the UTF-8 mode
doesn't fit into the locale module scope.

-- 
Marc-Andre Lemburg
eGenix.com




Re: [issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread M.-A. Lemburg
On 19.03.2021 12:35, Eryk Sun wrote:
> 
> Eryk Sun added the comment:
> 
>> Read the ANSI code page on Windows,
> 
> I don't see why the Windows implementation is inconsistent with POSIX here. 
> If it were changed to be consistent, the default encoding at startup would 
> remain the same, since setlocale(LC_CTYPE, "") uses the process code page 
> from GetACP().

I'm not sure I understand what you're saying (but then, I have little
experience with locales on Windows).

My assumption is that nl_langinfo(CODESET) does not work on Windows
or gives wrong results. Is that incorrect ?

If it does work, getencoding() could just be a shim over
nl_langinfo(CODESET) on all platforms.
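
A sketch of what such a shim might look like, under stated assumptions: getencoding_sketch() is a hypothetical stand-in, not the final API, and the Windows branch assumes the ANSI code page from GetACP() is the right answer there, since Python's locale module does not provide nl_langinfo() on Windows:

```python
import locale
import sys


def getencoding_sketch():
    """Hypothetical stand-in for the proposed locale.getencoding()."""
    if hasattr(locale, "nl_langinfo"):
        # POSIX: ask the C library for the codeset of the current LC_CTYPE.
        # No setlocale() call is made, so the locale state is left untouched.
        return locale.nl_langinfo(locale.CODESET)
    if sys.platform == "win32":
        # Assumption: report the process ANSI code page via GetACP()
        import ctypes
        return f"cp{ctypes.windll.kernel32.GetACP()}"
    raise NotImplementedError("no known way to query the locale encoding")


if __name__ == "__main__":
    print(getencoding_sketch())
```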

-- 
Marc-Andre Lemburg
eGenix.com




Re: [issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread M.-A. Lemburg
On 19.03.2021 12:26, STINNER Victor wrote:
> 
> STINNER Victor added the comment:
> 
> Recently, I spent some days properly documenting the encodings used by Python.

Thanks for documenting this.

I would prefer to keep the locale module as really just an interface
to the lib C locale logic and not add encoding details which are
specific to Python's view on I/O (sys or io) or the file system (os).

Hopefully, in a few years, we can get rid of all this and standardize
on UTF-8 everywhere.

-- 
Marc-Andre Lemburg
eGenix.com




Re: [issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread M.-A. Lemburg
On 19.03.2021 12:05, STINNER Victor wrote:
> I'm not sure what to do with locale.getdefaultlocale(). Should we deprecate 
> it? I never used this function. How is it used? For which purpose?
>
> I understand that in 2000, locale.getdefaultlocale() was interesting to avoid 
> calling setlocale(LC_CTYPE, ""). But Python 3 calls setlocale(LC_CTYPE, "") 
> by default at startup since the early versions, and it's now called on all 
> platforms since Python 3.8. Moreover, its internal database seems to be 
> outdated and is painful to maintain (especially if we consider all platforms 
> supported by Python, not only Linux, there are many issues on macOS).

Yes, deprecate it as well. If Python calls setlocale() by default now,
it has served its purpose.

The alias database is needed by the normalization engine. We may be
able to drop the encoding part, but this would have to be checked.
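
For reference, a small sketch of the normalization engine in use. locale.normalize() is the existing public entry point that consults the alias table; the helper name normalized() and the sample locale names are only illustrative:

```python
import locale


def normalized(names):
    """Map each locale name through the locale module's alias table."""
    return {name: locale.normalize(name) for name in names}


if __name__ == "__main__":
    # The results typically carry an encoding suffix taken from the alias
    # table, which is the "encoding part" that might be dropped.
    for name, result in normalized(["en_US", "de_DE", "C"]).items():
        print(f"{name} -> {result}")
```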

-- 
Marc-Andre Lemburg
eGenix.com




Re: [issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread M.-A. Lemburg
On 19.03.2021 11:36, STINNER Victor wrote:
> 
> STINNER Victor added the comment:
> 
>> locale.getencoding()
>>
>> which interfaces to nl_langinfo(CODESET) or the Windows code
>> page and does not try to do any magic, ie. does *not* call
>> setlocale(). It needs to return what the lib C currently
>> knows and uses as encoding.
> 
> This is locale.get_current_locale_encoding(). I would like to put "current" 
> in the name, because there is a lot of confusion between 
> get_current_locale_encoding() encoding and locale.getpreferredencoding(False) 
> encoding. In locale.getpreferredencoding(False), Python ignores the locale in 
> some cases, which is counterintuitive.

These attempts have caused much of the confusion around the locale
module. It's better not to create more of it.

- "locale" in the name is unnecessary, since this is the locale module.

- If you add "current", people will rightly ask: then what do all the
other APIs in the locale module return ? Of course, they all return
the current state of settings :-) So this is unnecessary as well.

locale.getencoding() works in the same way as locale.getlocale().
It interfaces to the lib C and returns the current encoding setting
as known by the lib C. It's just a more intuitive name than
locale.nl_langinfo(CODESET) and works on Windows as well.

And, again, locale.getpreferredencoding() should be deprecated.
The API has been misused in too many ways and is completely broken
by now. It was a good idea at the time, when Martin added it,
even though I never liked the name.

-- 
Marc-Andre Lemburg
eGenix.com




Re: [issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread M.-A. Lemburg
On 19.03.2021 10:17, STINNER Victor wrote:
> 
> New submission from STINNER Victor:
> 
> I propose to add two new functions:
> 
> * locale.get_locale_encoding(): it's exactly the same as 
> locale.getpreferredencoding(False).
> 
> * locale.get_current_locale_encoding(): always get the current locale 
> encoding. Read the ANSI code page on Windows, or nl_langinfo(CODESET) on 
> other platforms. Ignore the UTF-8 Mode. Don't always return "UTF-8" on macOS, 
> Android, VxWorks.

I'm not sure whether this would improve the situation much.

The problem is that the locale module is meant to expose the lib C
locale settings, but many of the recent additions actually do something
completely different: they look into the process and user environment
and try to determine external settings, which are not reflected in
the lib C locale settings.

I had added locale.getdefaultlocale() to give applications a chance
to determine the locale setting defined by the process environment
*without* calling setlocale(LC_ALL, '') and causing problems
in other threads. I used the X11 database for locale encodings,
which was the closest thing to a standard for encodings
at the time (around 2000).

Part of the return value is the encoding that such a setlocale() call would set.

Martin later added locale.getpreferredencoding(), which tries to
determine the encoding in a different, new way, based on
nl_langinfo(CODESET). As you mentioned, this intention was broken
on several platforms by forcing UTF-8 as output. And in many cases,
the API had to call setlocale() as well, causing the thread problems.

However, the problem with nl_langinfo(CODESET) is the same as
with setlocale(): it returns the current state of the lib C
settings, which may well point to the 'C' locale, not the ones
the user has configured in the OS environment. So while you get
an encoding defined by lib C for the current locale settings
(without guessing it as with locale.getdefaultlocale()), you
still don't get what the user really wants to use.
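
A short sketch of that effect, assuming a platform where locale.nl_langinfo() exists. The function name is made up for illustration. It queries the codeset once under the 'C' locale and once after adopting the environment's settings, then restores the previous state:

```python
import locale


def codeset_in_c_and_user_locale():
    """Return (codeset under 'C', codeset under the environment's locale)."""
    if not hasattr(locale, "nl_langinfo"):
        raise NotImplementedError("locale.nl_langinfo() is unavailable here")
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, changes nothing
    try:
        locale.setlocale(locale.LC_CTYPE, "C")
        in_c = locale.nl_langinfo(locale.CODESET)
        try:
            locale.setlocale(locale.LC_CTYPE, "")  # adopt the environment
            in_user = locale.nl_langinfo(locale.CODESET)
        except locale.Error:
            in_user = None  # the environment names an unknown locale
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore the old state
    return in_c, in_user


if __name__ == "__main__":
    print(codeset_in_c_and_user_locale())
```

The two values can differ, which is the point: the answer depends on the locale state at the time of the call, and switching that state is process-wide.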

Unfortunately, lib C does not provide a way to query the locale
database without changing the locale settings at the same time.
This is the main issue we're facing.

Now, the correct way in all this would be to just call
setlocale(LC_ALL, '') at the start of the application and
not try to apply all the magic to get around this. But this
has to be done by the application and not Python (which may
well be embedded into some other application).

I'd suggest adding a single new API:

locale.getencoding()

which interfaces to nl_langinfo(CODESET) or the Windows code
page and does not try to do any magic, ie. does *not* call
setlocale(). It needs to return what the lib C currently
knows and uses as encoding.

locale.getpreferredencoding() should then be deprecated.

It does not make sense to pretend to query information which is
not really directly available from the lib C locale system.

And the documentation should point out that applications should
call setlocale(LC_ALL, '') when they start up, if they want to
get the lib C locale, and thus the Python locale module, set up to
work based on what the user really wants -- instead of just
guessing at this.
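
In application code this could look roughly like the following sketch. The function name init_locale() and the fallback behavior on locale.Error are assumptions about what an application might want:

```python
import locale


def init_locale():
    """Adopt the user's locale at startup; keep the current one on failure."""
    try:
        # '' tells the C library to use the settings from the environment
        return locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale the C library does not know;
        # calling setlocale() without a value only queries the setting.
        return locale.setlocale(locale.LC_ALL)


if __name__ == "__main__":
    print("active locale:", init_locale())
```

This belongs in the application's entry point, before any threads are started, since setlocale() affects the whole process.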

PS: The locale module normally does not use underscores in
function names, so it's not a good idea to add more.

-- 
Marc-Andre Lemburg
eGenix.com
