[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-06-28 Thread STINNER Victor


STINNER Victor  added the comment:

PEP 597 was implemented successfully in Python 3.10 with this feature.

There is no agreement yet on what the "current locale encoding" is.

For now, I prefer to close the issue.

We can reconsider this feature once there are more user requests for such a 
function and once there is agreement on what the "current locale encoding" is.

--
resolution:  -> rejected
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-20 Thread Eryk Sun

Eryk Sun  added the comment:

> In my experience, most applications use the ANSI code page because 
> they use the ANSI flavor of the Windows API.

The default encoding at startup and in the "C" locale wouldn't change. It would 
only differ from the default if setlocale(LC_CTYPE, locale_name) sets it 
otherwise. The suggestion is to match the behavior of nl_langinfo(CODESET) in 
Linux and many other POSIX systems.

When I say the default encoding won't change, I mean that the Universal C 
Runtime (ucrt) system component uses the process ANSI code page as the default 
locale encoding for setlocale(LC_CTYPE, ""). This agrees with what Python has 
always done, but it disagrees with previous versions of the CRT in Windows. 
Personally, I think it's a misstep because the user locale isn't necessarily 
compatible with the process code page, but I'm not looking to change this 
decision. For example, if the user locale is "el_GR" (Greek, Greece) but the 
process code page is 1252 (Latin) instead of 1253 (Greek), I get the following 
result in Python 3.4 (VC++ 10) vs Python 3.5 (ucrt):

>py -3.4 -c "from locale import *; print(setlocale(LC_CTYPE, ''))"
Greek_Greece.1253

>py -3.5 -c "from locale import *; print(setlocale(LC_CTYPE, ''))"
Greek_Greece.1252

The result from VC++ 10 is consistent with the user locale. It's also 
consistent with multilingual user interface (MUI) text, such as error messages, 
or at least it should be, because the user locale and user preferred language 
(i.e. Windows display language) should be consistent. (The control panel dialog 
to set the user locale in Windows 10 has an option to match the display 
language, which is the recommended and default setting.)  For example, Python 
uses system error messages that are localized to the user's preferred language:

>py -c "import os; os.stat('spam')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
FileNotFoundError: [WinError 2] Δεν είναι δυνατή η εύρεση του καθορισμένου 
αρχείου από το σύστημα: 'spam'

This example is on a system where the process (system) ANSI code page is 1252 
(Latin), which cannot encode the user's preferred Greek text. Thankfully Python 
3.6+ uses the console's Unicode API, so neither the console session's output 
code page nor the process code page gets in the way. On the other hand, if this 
Greek text is written to a file or piped to a child process using 
subprocess.Popen(), Python's choice of locale encoding based on the process 
code page (Latin) is incompatible with Greek text, and thus it's incompatible 
with the current user's preferred locale and language settings.
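To make the mismatch concrete, here is a minimal sketch in Python. It only uses the standard codecs and the text of the error message quoted above; the helper name can_encode is made up for illustration. It checks whether the Greek text is representable in code page 1253 versus code page 1252:

```python
# Greek text taken from the localized error message quoted above.
text = "Δεν είναι δυνατή η εύρεση του καθορισμένου αρχείου από το σύστημα"


def can_encode(s, codec):
    """Return True if every character of s is representable in codec."""
    try:
        s.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


greek_ok = can_encode(text, "cp1253")  # Greek ANSI code page
latin_ok = can_encode(text, "cp1252")  # Latin ANSI code page
print(greek_ok, latin_ok)
```

Writing this text to a file opened with the cp1252 encoding would fail in the same way, unless an error handler such as "replace" is used.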

The process ANSI code page from GetACP() has its uses, which are important. 
It's a system setting that's independent of the current user locale and thus 
useful when interacting with the legacy system API and as a common encoding for 
inter-process data exchange when applications do not use Unicode and may be 
operating in different locales. So if you're writing to a legacy-encoded text 
file that's shared by multiple users or piping text to an arbitrary program, 
then using the ANSI code page is probably okay. Though, especially for IPC, 
there's a good chance that it's wrong since Windows has never set, let alone 
enforced, a standard in that case.

Using the process ANSI code page in the "C" locale makes sense to me. 

> What is the use case for using ___lc_codepage()? Is it a different 
> encoding?

I always forget the "_func" suffix in the name; it's ___lc_codepage_func() [1]. 
The lc_codepage value is the current LC_CTYPE codeset as an integer code page. 
It's the equivalent of nl_langinfo(CODESET) in POSIX. For UTF-8, the code page 
is CP_UTF8 (65001), but this gets displayed in locale strings as "UTF-8" (or 
variants such as "utf8"). It could be the LC_CTYPE encoding of just the current 
thread, but Python does not enable per-thread locales.
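As a rough illustration of how an integer code page could be turned into a Python codec name, here is a small sketch. The helper codepage_to_codec is hypothetical and not part of any Python or CRT API; it assumes that 0 stands for the "C" locale, as in the C snippet posted earlier in this issue, and leaves that case to the caller:

```python
import codecs

CP_UTF8 = 65001  # Windows code page identifier for UTF-8


def codepage_to_codec(cp):
    """Map an integer Windows code page to a Python codec name.

    Hypothetical helper for illustration only. Returns None for 0,
    the value assumed here to mean the "C" locale.
    """
    if cp == 0:
        return None
    name = "utf-8" if cp == CP_UTF8 else "cp%d" % cp
    codecs.lookup(name)  # raises LookupError if Python has no such codec
    return name


print(codepage_to_codec(1253), codepage_to_codec(CP_UTF8))
```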

The CRT has exported ___lc_codepage_func() since VC++ 7.0 (2002), and before 
that the current lc_codepage value itself was directly exported as 
__lc_codepage. However, this triple-dundered function is documented as internal 
and not recommended for use. That's why the code snippet I showed uses 
_get_current_locale() with locinfo cast to __crt_locale_data_public *. This 
takes "public" in the struct name at face value. Anything that's declared 
public should be safe to use, but the locale_t type is frustratingly 
undocumented even for this public data [2].

If neither approach is supported, locale.get_current_locale_encoding() could 
instead parse the current locale encoding from setlocale(LC_CTYPE, NULL). The 
resulting locale string usually includes the codeset (e.g. 
"Greek_Greece.1253"). The exceptions are the "C" locale and BCP-47 (RFC 5646) 
locales that do not explicitly use UTF-8 (e.g. "el_GR" or "el" instead of 
"el_GR.UTF-8"), but these cases can be handled reliably.
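The string-parsing fallback could look roughly like the following sketch. The function name codeset_from_locale_string is made up for illustration, and the handling of locale strings without a codeset is deliberately left to the caller, since the exact fallback rules are not spelled out here. The last dot is used as the separator because some Windows locale names contain dots themselves:

```python
def codeset_from_locale_string(locale_name):
    """Extract the codeset part of a setlocale() result, or None.

    Hypothetical helper. None means the string carries no explicit
    codeset (e.g. "C", "el_GR" or "el"), so the caller has to apply
    its own fallback rule.
    """
    # Split on the last dot: names such as
    # "Chinese (Traditional)_Hong Kong S.A.R..950" contain earlier dots.
    _, dot, codeset = locale_name.rpartition(".")
    if not dot or not codeset:
        return None
    # Drop a modifier such as "@euro" if one follows the codeset.
    return codeset.partition("@")[0]


for name in ("Greek_Greece.1253", "el_GR.UTF-8", "el_GR", "C"):
    print(name, "->", codeset_from_locale_string(name))
```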

---

[1] 

[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-20 Thread STINNER Victor


STINNER Victor  added the comment:

Python has used GetACP(), the ANSI code page of the operating system, for years. 
What is the advantage of using a different encoding? In my experience, most 
applications use the ANSI code page because they use the ANSI flavor of the 
Windows API.

What is the use case for using ___lc_codepage()? Is it a different encoding?




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread Eryk Sun


Eryk Sun  added the comment:

> But please discuss it in another issue.

What's returned by locale.get_locale_encoding() and 
locale.get_current_locale_encoding() is relevant to adding them as new 
functions and is a chance to implement this correctly in Windows. 

You're right that what open() does for encoding="locale" is a separate issue, 
with backwards compatibility problems. I think it was implemented badly and 
needlessly inconsistent with POSIX. But we may be stuck with the behavior 
considering scripts are within their rights, per documented behavior, to expect 
that calling setlocale(LC_CTYPE, locale_name) in Windows has no effect on the 
result of locale.getpreferredencoding(False), unlike POSIX generally, except 
for some platforms such as macOS and Android.




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread Inada Naoki

Inada Naoki  added the comment:

> Why is it being specified that the current LC_CTYPE encoding should be 
> ignored in Windows when a "locale" encoding is requested?

Because `encoding="locale"` must be a replacement for the current 
`encoding=None` (i.e. locale.getpreferredencoding(False)).

The `encoding=None` behavior will change if we change the default encoding or 
enable UTF-8 mode by default. So we are adding an explicit name for the current 
behavior.
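The relationship described above can be checked with a short sketch. It assumes nothing beyond the standard library; the helper same_codec is made up for illustration and only normalizes codec names before comparing them:

```python
import codecs
import io
import locale

# With encoding=None, TextIOWrapper picks its encoding implicitly.
implicit = io.TextIOWrapper(io.BytesIO()).encoding
# The value that encoding=None is described as standing for.
preferred = locale.getpreferredencoding(False)


def same_codec(a, b):
    """Compare codec names after normalization (e.g. 'UTF-8' vs 'utf-8')."""
    return codecs.lookup(a).name == codecs.lookup(b).name


print(implicit, preferred, same_codec(implicit, preferred))
```

Both values follow UTF-8 mode when it is enabled, which is the behavior under discussion for `encoding="locale"`.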

So it is not an option to assign another encoding. See PEP 597 for details.

I know you are proposing to use the CRT locale on Windows. If we change 
`locale.getpreferredencoding(False)` to use the CRT locale, `encoding="locale"` 
will follow it.
But please discuss it in another issue.




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread Eryk Sun


Eryk Sun  added the comment:

> But it is not what I want for now. I want to ignore UTF-8 mode 
> when `encoding="locale"` is specified.
> This is almost "only in Windows" issue, and users can use 
> `encoding="mbcs"` in Windows-only script.

Why is it being specified that the current LC_CTYPE encoding should be ignored 
in Windows when a "locale" encoding is requested? Cross-platform C code would 
use mbstowcs() and wcstombs(), with the current LC_CTYPE encoding. That's 
Latin-1 in the initial "C" locale and defaults to GetACP() if 
setlocale(LC_CTYPE, "") is called, but otherwise it's whatever locale is 
requested by the program and supported by the system (all Windows installations 
support pretty much every locale).




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread Marc-Andre Lemburg


Marc-Andre Lemburg  added the comment:

On 19.03.2021 16:15, Inada Naoki wrote:
> 
> `locale.getpreferredencoding()` is special, because it "Return the encoding 
> used for text data, according to user preferences. User preferences are 
> expressed differently on different systems, and might not be available 
> programmatically on some systems, so this function only returns a guess."

I already wrote earlier that we should deprecate this API, since the
overloading with different meanings in the past has turned it into
an unreliable source of information. At this point, it returns
"some encoding, which may or may not be what you want" :-)

We need to get things separated out clearly again: the locale
module is for the lib C locale state. What Python does in the
I/O layers has to be defined and queried at the appropriate
places elsewhere (e.g. os, sys or io modules).

>> As mentioned, both should ideally be synchronized, though, so
>> UTF-8 mode in Python should trigger setting a UTF-8 encoding
>> via setlocale().
> 
> There is PEP 538 already :)

Great :-)




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread Inada Naoki


Inada Naoki  added the comment:

> Please address UTF-8 mode explicitly in open() or elsewhere. The locale
> module is about the state of the lib C, not what Python enforces via
> options in its own I/O layers.

I agree with you. APIs in the locale module shouldn't be aware of UTF-8 mode.

`locale.getpreferredencoding()` is special, because it "Return the encoding 
used for text data, according to user preferences. User preferences are 
expressed differently on different systems, and might not be available 
programmatically on some systems, so this function only returns a guess."


> As mentioned, both should ideally be synchronized, though, so
> UTF-8 mode in Python should trigger setting a UTF-8 encoding
> via setlocale().

There is PEP 538 already :)




Re: [issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread M.-A. Lemburg
On 19.03.2021 14:57, Inada Naoki wrote:
> 
> Background: PEP 597 adds new `encoding="locale"`option to open() and 
> TextIOWrapper(). It is same to `encoding=None` for now, but it means using 
> "locale encoding" explicitly.
> 
> But this is wrong in UTF-8 mode.

Please address UTF-8 mode explicitly in open() or elsewhere. The locale
module is about the state of the lib C, not what Python enforces via
options in its own I/O layers.

As mentioned, both should ideally be synchronized, though, so
UTF-8 mode in Python should trigger setting a UTF-8 encoding
via setlocale().

-- 
Marc-Andre Lemburg
eGenix.com




Re: [issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread M.-A. Lemburg
On 19.03.2021 14:47, STINNER Victor wrote:
> 
> STINNER Victor  added the comment:
> 
>> - If you add "current", people will rightly ask: then what do all the
>> other APIs in the locale module return ? Of course, they all return
>> the current state of settings :-) So this is unnecessary as well.
> 
> The problem is that there are two different "locale encodings", what I call:
> 
> * "current locale encoding": nl_langinfo(CODESET) in short
> * "Python locale encoding": "UTF-8" in some cases, nl_langinfo(CODESET) 
> otherwise

The UTF-8 mode is a Python invention. It doesn't have anything to
do with the lib C locale functions, which this module addresses and
interfaces to.

Please don't mix the two.

In fact, in order to avoid issues, Python should probably set the locale
encoding to UTF-8 as well, when run in UTF-8 mode. It's dangerous to
have Python and the lib C use different assumptions about the encoding,
esp. in embedded applications.

> It is unfortunate that the Python UTF-8 Mode which "ignores the locale" 
> changes the behavior of the locale module, of the 
> locale.getpreferredencoding() function. But the ship has sailed.
> 
> People are used to look into the "locale" module to get the "locale" 
> encoding. So I prefer to put  the function to get the "Python locale 
> encoding" in the locale module.
> 
> I propose to add "current" in the name since this encoding is not the one you 
> are looking for usually.
> 
> An alternative is to have a single function with an optional parameter. 
> Example:
> 
> * get_locale_encoding() or get_locale_encoding(True) returns the locale 
> encoding
> * get_locale_encoding(False) returns the current locale encoding

-1, both on the names and the idea to again add parameters which change
their meaning. We should have one function per meaning and really
only need the interface getencoding(), since the UTF-8 mode
doesn't fit into the locale module scope.

-- 
Marc-Andre Lemburg
eGenix.com




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread STINNER Victor


STINNER Victor  added the comment:

Hmm, the latest messages are specific to PEP 597 (its implementation).

> I had forgot to consider about UTF-8 mode while finishing PEP 597.

I propose to continue the discussion about the PEP 597 in bpo-43510. I replied 
there.

I prefer to keep this issue to discuss the locale module.




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread Inada Naoki


Inada Naoki  added the comment:

> Is it about the current implementation of the PEP 597, or are you thinking at 
> the future Python which would use UTF-8 by default?

I forgot to consider UTF-8 mode while finishing PEP 597. If possible, 
I want to ignore UTF-8 mode when `encoding="locale"` is specified, starting 
with Python 3.10.
Otherwise, the behavior will change between Python 3.10 and 3.11.

> Currently, getpreferredencoding(False) respects the behavior that you 
> described, no?

getpreferredencoding(False) respects UTF-8 mode. That's what PEP 597 says 
(because the PEP doesn't define the behavior in UTF-8 mode) and what GH-19481 
implements.

But it is not what I want for now. I want to ignore UTF-8 mode when 
`encoding="locale"` is specified.

This is almost a Windows-only issue, and users can use `encoding="mbcs"` in 
Windows-only scripts.

But `encoding="locale"` is the new and recommended way to specify the "locale" 
encoding explicitly. When a user specifies the "locale" encoding explicitly, I 
think we should respect it regardless of UTF-8 mode.




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread STINNER Victor


STINNER Victor  added the comment:

> In UTF-8 mode, it's fine to `open(filename)` uses UTF-8. But I want to use 
> "locale encoding" for `open(filename, encoding="locale")` because "locale" 
> encoding is specified.

Is this about the current implementation of PEP 597, or are you thinking of 
a future Python which would use UTF-8 by default?

Currently, getpreferredencoding(False) respects the behavior that you 
described, no?




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread Inada Naoki


Inada Naoki  added the comment:

> I created this issue while reviewing the implementation of the PEP 597: PR 
> 19481.

What I want is the same as `locale.getpreferredencoding(False)`, but ignoring 
UTF-8 mode.

Background: PEP 597 adds a new `encoding="locale"` option to open() and 
TextIOWrapper(). It is the same as `encoding=None` for now, but it means using 
the "locale encoding" explicitly.

But this is wrong in UTF-8 mode.

In UTF-8 mode, it's fine for `open(filename)` to use UTF-8. But I want to use 
the "locale encoding" for `open(filename, encoding="locale")` because the 
"locale" encoding is specified.

I don't want to add a new meaning here. It should be the same as 
`locale.getpreferredencoding(False)` without UTF-8 mode. So I need "cp%d" % 
GetACP() on Windows, not the CRT locale encoding.
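For reference, the "cp%d" % GetACP() value can be obtained from Python code with ctypes. This is only a sketch: the function name ansi_code_page_encoding is made up for illustration, and it returns None on platforms other than Windows instead of guessing a value:

```python
import sys


def ansi_code_page_encoding():
    """Return "cp<N>" for the process ANSI code page, or None off Windows."""
    if sys.platform != "win32":
        return None
    import ctypes  # imported lazily: ctypes.windll only exists on Windows
    return "cp%d" % ctypes.windll.kernel32.GetACP()


print(ansi_code_page_encoding())
```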

I don't care about its name. Both sys.locale_encoding() and 
locale.get_encoding() are OK.




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread STINNER Victor


STINNER Victor  added the comment:

> - If you add "current", people will rightly ask: then what do all the
> other APIs in the locale module return ? Of course, they all return
> the current state of settings :-) So this is unnecessary as well.

The problem is that there are two different "locale encodings", what I call:

* "current locale encoding": nl_langinfo(CODESET) in short
* "Python locale encoding": "UTF-8" in some cases, nl_langinfo(CODESET) 
otherwise

It is unfortunate that the Python UTF-8 Mode, which "ignores the locale", 
changes the behavior of the locale module, namely of the 
locale.getpreferredencoding() function. But the ship has sailed.

People are used to looking in the "locale" module to get the "locale" encoding. 
So I prefer to put the function to get the "Python locale encoding" in the 
locale module.

I propose adding "current" to the name since this encoding is usually not the 
one you are looking for.

An alternative is to have a single function with an optional parameter. Example:

* get_locale_encoding() or get_locale_encoding(True) returns the locale encoding
* get_locale_encoding(False) returns the current locale encoding
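The two notions can be approximated in pure Python as follows. This is a sketch, not the proposed implementation: both function names are the ones proposed in this issue and do not exist in the locale module, only the UTF-8 Mode case of "UTF-8 in some cases" is modeled, and the fallback used when nl_langinfo() is unavailable (e.g. on Windows) is just a placeholder:

```python
import locale
import sys


def get_current_locale_encoding():
    """Hypothetical: encoding of the current LC_CTYPE locale."""
    if hasattr(locale, "nl_langinfo"):
        return locale.nl_langinfo(locale.CODESET)
    # No nl_langinfo() (e.g. Windows): fall back to what Python reports.
    # This is an approximation, not the behavior discussed in this issue.
    return locale.getpreferredencoding(False)


def get_locale_encoding():
    """Hypothetical: the "Python locale encoding"."""
    if sys.flags.utf8_mode:
        return "UTF-8"
    return get_current_locale_encoding()


print(get_locale_encoding(), get_current_locale_encoding())
```

Running the sketch with and without `-X utf8` shows where the two values can diverge.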




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread STINNER Victor


STINNER Victor  added the comment:

I created bpo-43557 "Deprecate getdefaultlocale(), getlocale() and normalize() 
functions". Let's discuss deprecating getdefaultlocale() there.




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread Marc-Andre Lemburg


Marc-Andre Lemburg  added the comment:

On 19.03.2021 13:25, Eryk Sun wrote:
>> My assumption is that nl_langinfo(CODESET) does not work on Windows
>> or gives wrong results. Is that incorrect ?
> 
> There is no such function for CRT locales. I provided two alternatives that 
> would allow implementing this consistent with POSIX, and thus avoid all of 
> the "except on Windows..." disclaimers that have to explain (apologize) that 
> only the process ANSI code page is used in Windows, and, for no good reason 
> as far as I can tell, the LC_CTYPE locale encoding is completely ignored.

Sounds good. If we can get consistent behavior on Windows as well,
all the better :-)




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread Eryk Sun


Eryk Sun  added the comment:

> If Python calls setlocale() per default now, it has served its purpose.

Except not for embedding applications if configure_locale [1] isn't set. But in 
that case determining the default locale isn't Python's problem to solve.

> My assumption is that nl_langinfo(CODESET) does not work on Windows
> or gives wrong results. Is that incorrect ?

There is no such function for CRT locales. I provided two alternatives that 
would allow implementing this consistent with POSIX, and thus avoid all of the 
"except on Windows..." disclaimers that have to explain (apologize) that only 
the process ANSI code page is used in Windows, and, for no good reason as far 
as I can tell, the LC_CTYPE locale encoding is completely ignored.

---

[1] 
https://docs.python.org/3/c-api/init_config.html#c.PyPreConfig.configure_locale




Re: [issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread M.-A. Lemburg
On 19.03.2021 12:35, Eryk Sun wrote:
> 
> Eryk Sun  added the comment:
> 
>> Read the ANSI code page on Windows,
> 
> I don't see why the Windows implementation is inconsistent with POSIX here. 
> If it were changed to be consistent, the default encoding at startup would 
> remain the same, since setlocale(LC_CTYPE, "") uses the process code page 
> from GetACP().

I'm not sure I understand what you're saying (but then, I have little
experience with locales on Windows).

My assumption is that nl_langinfo(CODESET) does not work on Windows
or gives wrong results. Is that incorrect ?

If it does work, getencoding() could just be a shim over
nl_langinfo(CODESET) on all platforms.

-- 
Marc-Andre Lemburg
eGenix.com




Re: [issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread M.-A. Lemburg
On 19.03.2021 12:26, STINNER Victor wrote:
> 
> STINNER Victor  added the comment:
> 
> Recently, I spent some days properly documenting the encodings used by Python.

Thanks for documenting this.

I would prefer to leave the locale module to really just an interface
to the lib C locale logic and not add encoding details which are
specific to Python's view on I/O (sys or io) or the file system (os).

Hopefully, in a few years, we can get rid of all this and standardize
on UTF-8 everywhere.

-- 
Marc-Andre Lemburg
eGenix.com




Re: [issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread M.-A. Lemburg
On 19.03.2021 12:05, STINNER Victor wrote:
> I'm not sure what to do with locale.getdefaultlocale(). Should we deprecate 
> it? I never used this function. How is it used? For which purpose?
>
> I understand that in 2000, locale.getdefaultlocale() was interesting to avoid 
> calling setlocale(LC_CTYPE, ""). But Python 3 calls setlocale(LC_CTYPE, "") 
> by default at startup since the early versions, and it's now called on all 
> platforms since Python 3.8. Moreover, its internal database seems to be 
> outdated and is painful to maintain (especially if we consider all platforms 
> supported by Python, not only Linux, there are many issues on macOS).

Yes, deprecate it as well. If Python calls setlocale() per default now,
it has served its purpose.

The alias database is needed by the normalization engine. We may be
able to drop the encoding part, but this would have to be checked.

-- 
Marc-Andre Lemburg
eGenix.com




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread Eryk Sun


Eryk Sun  added the comment:

> Read the ANSI code page on Windows,

I don't see why the Windows implementation is inconsistent with POSIX here. If 
it were changed to be consistent, the default encoding at startup would remain 
the same, since setlocale(LC_CTYPE, "") uses the process code page from 
GetACP(). In many if not most cases, no one would be the wiser. But it seems to 
me that if a script calls setlocale(LC_CTYPE, "el_GR"), then it clearly wants 
to encode Greek text (code page 1253). open() with encoding passed as None or 
"locale" should respect this. Similarly if it calls setlocale(LC_CTYPE, 
".UTF-8"), then it wants the default locale (language/region), but with UTF-8 
encoding.

The following is a snippet to get the current locale encoding with ucrt in 
Windows:

#include 

int cp = 0;
__crt_locale_data_public *locale_data;

_locale_t locale = _get_current_locale();
if (locale) {
    locale_data = (__crt_locale_data_public *)locale->locinfo;
    cp = locale_data->_locale_lc_codepage;
    _free_locale(locale);
}

if (cp == 0) {
    /* "C" locale. The CRT in effect uses Latin-1 (cp28591), but
       Windows Python prefers the process code page. */
    cp = GetACP();
}

With ucrt, the C runtime was changed to hide most of the locale definition that 
was previously public, but it intentionally defines __crt_locale_data_public, 
so I'm assuming it's there for programs to use. That said, the fact that we 
have to cast locinfo seems suspect to me. Steve Dower could maybe check with 
the ucrt devs to ensure that this is supported. 

There's also ___lc_codepage() to get the same value more simply, and also more 
efficiently since the current locale data doesn't have to be copied and freed. 
However, it's documented as internal and could be removed (unlikely as that is).

--
nosy: +eryksun




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread STINNER Victor


Change by STINNER Victor :


--
nosy: +methane




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread STINNER Victor


STINNER Victor  added the comment:

Recently, I spent some days properly documenting the encodings used by Python.

Python filesystem encoding:
https://docs.python.org/dev/c-api/init_config.html#c.PyConfig.filesystem_encoding

Python filesystem errors:
https://docs.python.org/dev/c-api/init_config.html#c.PyConfig.filesystem_errors

stdio encoding and errors:
https://docs.python.org/dev/c-api/init_config.html#c.PyConfig.stdio_encoding

Glossary: "Locale encoding"
https://docs.python.org/dev/glossary.html#term-locale-encoding

Glossary: "filesystem encoding and error handler"
https://docs.python.org/dev/glossary.html#term-filesystem-encoding-and-error-handler

Python UTF-8 Mode:
https://docs.python.org/dev/library/os.html#utf8-mode




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread STINNER Victor


STINNER Victor  added the comment:

> Martin later added locale.getpreferredencoding(), which tries to
> determine the encoding in a different, new way, based on
> nl_langinfo(CODESET). As you mentioned, this intention was broken
> on several platforms by forcing UTF-8 as output.

When I designed and implemented the PEP 540 (Python UTF-8 Mode), I tried to 
leave getpreferredencoding() unchanged. The problem was that I quickly got 
mojibake because too many functions call getpreferredencoding(False):

* open() and _pyio.open() -- in Python 3.10, open() now calls the C 
_Py_GetLocaleEncoding() function to fix issues during Python shutdown; it also 
avoids issues at startup.
* Many gettext functions
* cgi to decode the query string from the QUERY_STRING env var or sys.argv[1]
* xml.etree.ElementTree.write(encoding="unicode") in some cases
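As a reminder of what mojibake looks like when the writer and the reader disagree on the encoding, here is a minimal, self-contained sketch. The sample string and the choice of Latin-1 as the mismatched encoding are arbitrary and only serve as an illustration:

```python
data = "café".encode("utf-8")    # bytes written by a UTF-8 producer
wrong = data.decode("latin-1")   # reader assumes a different encoding
right = data.decode("utf-8")     # reader agrees with the producer
print(wrong, right)
```

No exception is raised in the mismatched case, which is why this kind of bug tends to go unnoticed until someone looks at the text.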

The Python UTF-8 Mode ignores the locale *on purpose*. But I agree that it's 
surprising and can lead to confusion. That's what I'm trying to fix here :-)




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread STINNER Victor


STINNER Victor  added the comment:

Attached encodings.py lists the different "locale encodings" used by Python. 
Example:
---
$ LANG=fr_FR ./python -X utf8 encodings.py fr_FR@euro
Set LC_CTYPE to 'fr_FR@euro'

LC_ALL env var: ''
LC_CTYPE env var: ''
LANG env var: 'fr_FR'
LC_CTYPE locale: 'fr_FR@euro'
Coerce C locale: 0
Python UTF-8 Mode: 1

(1) Python FS encoding
sys.getfilesystemencoding(): 'utf-8'

(2) Python locale encoding
_locale._get_locale_encoding(): 'UTF-8'
locale.getpreferredencoding(False): 'UTF-8'

(3) Current locale encoding
locale.get_current_locale_encoding(): 'ISO-8859-15'

(4) And more encodings for more fun!
locale.getdefaultlocale()[1]: 'ISO8859-1'
locale.getpreferredencoding(True): 'UTF-8'
---

Python starts with LC_CTYPE locale set to fr_FR (ISO8859-1), then the script 
sets the LC_CTYPE locale to fr_FR@euro (ISO-8859-15). The Python UTF-8 Mode is 
enabled explicitly. We get a funny combination of no fewer than 3 encodings!

* UTF-8
* ISO-8859-1
* ISO-8859-15

Which one is the correct one? Well... It depends :-)
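The choice matters in practice because the three encodings do not produce the same bytes. A minimal sketch with the euro sign, the character that distinguishes ISO-8859-15 from ISO-8859-1 (the codec names are the standard Python ones):

```python
euro = "\u20ac"  # EURO SIGN

encoded = {}
for codec in ("utf-8", "iso8859-15", "iso8859-1"):
    try:
        encoded[codec] = euro.encode(codec)
    except UnicodeEncodeError:
        encoded[codec] = None  # not representable in this encoding

print(encoded)
```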

(1) The Python filesystem encoding is used to call almost all operating system 
functions: encode to the OS and decode from the OS. Filenames, environment 
variables, command line options, etc.

(2) The "Python" locale encoding is used by open() when no encoding is specific.

(3) The current locale encoding is used for a limited number of functions that 
I listed in msg389063. Most users should not use it.

(4) locale.getpreferredencoding(True) is a weird beast. It is the Python locale 
encoding until setlocale(LC_CTYPE, locale) is called for the first time. But it 
can remain the same if the Python UTF-8 Mode is enabled. I'm not sure in which 
category we should put this function :-(

(4 bis) locale.getdefaultlocale()[1] is the only function returning the 
ISO-8859-1 encoding. This encoding is not used by any function. I'm not sure of 
the purpose of this function. It sounds confusing.
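
To make the categories above concrete, here is a minimal sketch that collects the first three encodings in one place. It is not the attached encodings.py; the helper name collect_encodings() is hypothetical, and it relies only on existing sys and locale calls. locale.getdefaultlocale() is left out on purpose.

```python
import locale
import sys


def collect_encodings():
    """Return the encodings discussed above, keyed by a short label."""
    encodings = {
        # (1) Used for filenames, environment variables, command line options.
        "filesystem": sys.getfilesystemencoding(),
        # (2) Used by open() when no encoding is specified.
        "python_locale": locale.getpreferredencoding(False),
    }
    # (3) What the C library reports for the current LC_CTYPE locale.
    # nl_langinfo() does not exist on Windows, so the call is guarded.
    if hasattr(locale, "nl_langinfo"):
        encodings["current_locale"] = locale.nl_langinfo(locale.CODESET)
    return encodings


if __name__ == "__main__":
    print("Python UTF-8 Mode:", sys.flags.utf8_mode)
    for label, name in collect_encodings().items():
        print(f"{label}: {name!r}")
```

Running it with and without -X utf8, and after a call to setlocale(LC_CTYPE, ...), is one way to see where the values start to diverge.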


I suggest deprecating locale.getpreferredencoding(True).

I'm not sure what to do with locale.getdefaultlocale(). Should we deprecate it? 
I never used this function. How is it used? For which purpose?

I understand that in 2000, locale.getdefaultlocale() was interesting to avoid 
calling setlocale(LC_CTYPE, ""). But Python 3 calls setlocale(LC_CTYPE, "") by 
default at startup since the early versions, and it's now called on all 
platforms since Python 3.8. Moreover, its internal database seems to be 
outdated and is painful to maintain (especially if we consider all platforms 
supported by Python, not only Linux, there are many issues on macOS).

--
Added file: https://bugs.python.org/file49894/encodings.py




Re: [issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread M.-A. Lemburg
On 19.03.2021 11:36, STINNER Victor wrote:
> 
> STINNER Victor  added the comment:
> 
>> locale.getencoding()
>>
>> which interfaces to nl_langinfo(CODESET) or the Windows code
>> page and does not try to do any magic, ie. does *not* call
>> setlocale(). It needs to return what the lib C currently
>> knows and uses as encoding.
> 
> This is locale.get_current_locale_encoding(). I would like to put "current" 
> in the name, because there is a lot of confusion between 
> get_current_locale_encoding() encoding and locale.getpreferredencoding(False) 
> encoding. In locale.getpreferredencoding(False), Python ignores the locale in 
> some cases which is counter intuitive.

These attempts have resulted in much of the confusion around the locale
module. It's better not to create more of it.

- "locale" in the name is unnecessary, since this is the locale module.

- If you add "current", people will rightly ask: then what do all the
other APIs in the locale module return ? Of course, they all return
the current state of settings :-) So this is unnecessary as well.

locale.getencoding() works in the same way as locale.getlocale().
It interfaces to the lib C and returns the current encoding setting
as known by the lib C. It's just a more intuitive name than
locale.nl_langinfo(CODESET) and works on Windows as well.
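
A rough sketch of what such a function could look like, written here only for illustration. The name getencoding_sketch() is hypothetical, and reading the ANSI code page through ctypes and GetACP() is an assumption about how the Windows side might be done, not a description of any actual implementation.

```python
import locale
import sys


def getencoding_sketch():
    """Return the encoding the C library currently uses, without calling setlocale()."""
    if sys.platform == "win32":
        import ctypes

        # GetACP() returns the process ANSI code page as an integer.
        return f"cp{ctypes.windll.kernel32.GetACP()}"
    # nl_langinfo(CODESET) reflects the current LC_CTYPE locale as known by
    # the C library. It raises AttributeError on platforms without nl_langinfo().
    return locale.nl_langinfo(locale.CODESET)


if __name__ == "__main__":
    print(getencoding_sketch())
```

The function only reads state. It never changes the locale, so it cannot disturb other threads.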

And, again, locale.getpreferredencoding() should be deprecated.
The API has been misused in too many ways and is completely broken
by now. It was a good idea at the time, when Martin added it,
even though I never liked the name.

-- 
Marc-Andre Lemburg
eGenix.com




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread STINNER Victor


STINNER Victor  added the comment:

> locale.getencoding()
>
> which interfaces to nl_langinfo(CODESET) or the Windows code
> page and does not try to do any magic, ie. does *not* call
> setlocale(). It needs to return what the lib C currently
> knows and uses as encoding.

This is locale.get_current_locale_encoding(). I would like to put "current" in 
the name, because there is a lot of confusion between 
get_current_locale_encoding() encoding and locale.getpreferredencoding(False) 
encoding. In locale.getpreferredencoding(False), Python ignores the locale in 
some cases, which is counterintuitive.

I propose to add new functions to reduce confusion and better document the 
subtle differences between the different "locale encodings".

That's also why I propose to rename the "locale encoding" to the "Python locale 
encoding" in the documentation: clarify that Python ignores the locale sometimes.

The PEP 538 (coerce the C locale) and PEP 540 (Python UTF-8 Mode) introduced 
confusion.

--




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread STINNER Victor


STINNER Victor  added the comment:

> Now, the correct way in all this would be to just call setlocale(LC_ALL, '') 
> at the start of the application

Python now does that during its initialization on all platforms. So 
getpreferredencoding(False) is what its documentation says: the user preferred 
encoding, the LC_CTYPE locale encoding.

On Python 3.7, _Py_SetLocaleFromEnv(LC_CTYPE) was called in 
_Py_InitializeCore() on Unix, but not on Windows.

Since Python 3.8, _PyPreConfig_Write() calls _Py_SetLocaleFromEnv(LC_CTYPE) on 
all platforms including Windows. See bpo-34485 and my article for more details 
("C locale on Windows" section):
https://vstinner.github.io/python3-locales-encodings.html

_Py_SetLocaleFromEnv(LC_CTYPE) calls setlocale(LC_CTYPE, ""), but has more 
complex code on Android.
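
As a small illustration of the point above, the current LC_CTYPE locale can be queried without changing it. This is a minimal sketch; the helper name query_ctype_locale() is hypothetical.

```python
import locale


def query_ctype_locale():
    """Return the current LC_CTYPE locale name without modifying it."""
    # With the second argument omitted, setlocale() only queries the setting.
    return locale.setlocale(locale.LC_CTYPE)


if __name__ == "__main__":
    # The value depends on the environment (LC_ALL, LC_CTYPE, LANG) and on
    # whether the interpreter set the locale at startup.
    print(query_ctype_locale())
```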

--




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread STINNER Victor


STINNER Victor  added the comment:

I created PR 24931 to add locale.get_current_locale_encoding(). I tried to 
clarify the differences between the "current locale encoding" and the "locale 
encoding".

Maybe we should rename the "locale encoding" to the "Python locale encoding", 
since it's not what most Unix developers would expect. What do you think?

While most locale functions have no underscore in their name, it seems like the 
current trend is to allow underscores in names for *new* functions. For 
example, the sys module has functions without underscores:

* sys.getallocatedblocks()
* sys.getdefaultencoding()
* sys.getfilesystemencodeerrors()
* ...

But it got new functions with underscores:

* sys.set_asyncgen_hooks()
* sys.set_coroutine_origin_tracking_depth()

... and there are some old functions with underscores:

* sys.exc_info()
* sys.call_tracing()
* sys._clear_type_cache()
* sys._current_frames()

In the locale module, there is one existing function with an underscore:

* locale.format_string()

--




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread STINNER Victor


STINNER Victor  added the comment:

I created this issue while reviewing the implementation of the PEP 597: PR 
19481.

Copy of my comments on the PR related to this issue.


_locale.get_locale_encoding() calls _Py_GetLocaleEncoding() which returns UTF-8 
if the Python UTF-8 Mode is enabled.

Maybe the function could have a flag: please don't lie to me and return the 
current locale encoding ;-)

Or we could add a function to get the *current* locale encoding: 
**locale.get_current_locale_encoding()**.

This one would ignore the UTF-8 Mode and call nl_langinfo(CODESET). There are 
APIs to use the *current* locale encoding: 
PyUnicode_EncodeLocale/PyUnicode_DecodeLocale and 
_Py_EncodeLocaleEx/_Py_DecodeLocaleEx with current_locale=1. You can see which 
functions use it:

* decode tm_zone field of localtime_r() and gmtime()
* decode tzname[0] and tzname[1] strings
* decode setlocale() result
* decode some localeconv() fields (this function requires switching to a 
different locale encoding, it's bad!)
* decode nl_langinfo() result
* decode gettext(), dgettext(), dcgettext(), textdomain(), bindtextdomain(), 
bind_textdomain_codeset() result
* decode strerror() and dlerror() result
* encode/decode in the readline module
* encode format string for strftime() in time.strftime() (only used on Windows, 
Unix provides wcsftime) and then decode strftime() result


> encoding="locale" : Uses locale encoding regardless UTF-8 mode.

Currently, open(encoding=None) doesn't work like that. For example, on macOS, 
Android and VxWorks, it always uses UTF-8. And if the UTF-8 Mode is used, UTF-8 
is used.
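
One way to observe this is to open a file without an encoding argument and look at what was chosen. This is a minimal sketch: the helper name default_text_encoding() is hypothetical, and os.devnull is used only because it exists on every platform.

```python
import locale
import os
import sys


def default_text_encoding():
    """Return the encoding open() picks when no encoding is passed."""
    # Deliberately no encoding= argument, to see the default choice.
    with open(os.devnull) as stream:
        return stream.encoding


if __name__ == "__main__":
    print("Python UTF-8 Mode:", sys.flags.utf8_mode)
    print("open() default:", default_text_encoding())
    print("getpreferredencoding(False):", locale.getpreferredencoding(False))
```

Comparing the output with and without -X utf8 shows whether the locale is being ignored.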

In PEP 597, I read that encoding="locale" is the same as encoding=None but 
doesn't emit an EncodingWarning. Where does PEP 597 change the chosen encoding 
for the encoding=None case? The PEP says "locale encoding" without specifying 
exactly what it is. In Python, it means different things depending on the 
context. There is a subtle difference between the **current** locale encoding 
and "the locale encoding". I agree that it needs some clarification :-)

While we discuss encodings, I never understood why open() gets the current 
locale encoding from nl_langinfo(CODESET), encoding which can change at runtime 
while Python is running. For example, if thread A calls open(filename, 
encoding=None), thread B calls locale.localeconv(), and the LC_MONETARY locale 
uses a different encoding than the LC_CTYPE locale, thread A can get the 
LC_MONETARY encoding because of how locale.localeconv() is currently 
implemented: it temporarily changes LC_CTYPE to LC_MONETARY to decode the 
monetary fields of localeconv() result.

I would prefer that Python uses the same encoding for the whole lifetime of the 
process, from beginning to end. The Python filesystem encoding is a 
good choice for that. It's the same as locale.getpreferredencoding(False) 
(currently used by open() and friends), but becomes different if the LC_CTYPE 
locale is changed (temporarily or permanently).

--




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread STINNER Victor


Change by STINNER Victor :


--
keywords: +patch
pull_requests: +23693
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/24931




Re: [issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread M.-A. Lemburg
On 19.03.2021 10:17, STINNER Victor wrote:
> 
> New submission from STINNER Victor :
> 
> I propose to add two new functions:
> 
> * locale.get_locale_encoding(): it's exactly the same than 
> locale.getpreferredencoding(False).
> 
> * locale.get_current_locale_encoding(): always get the current locale 
> encoding. Read the ANSI code page on Windows, or nl_langinfo(CODESET) on 
> other platforms. Ignore the UTF-8 Mode. Don't always return "UTF-8" on macOS, 
> Android, VxWorks.

I'm not sure whether this would improve the situation much.

The problem is that the locale module is meant to expose the lib C
locale settings, but many of the recent additions actually do something
completely different: they look into the process and user environment
and try to determine external settings, which are not reflected in
the lib C locale settings.

I had added locale.getdefaultlocale() to give applications a chance
to determine the locale setting defined by the process environment
*without* calling setlocale(LC_ALL, '') and causing problems
in other threads. I used the X11 database for locale encodings,
which was the closest you could get to in terms of a standard for
encodings at the time (around 2000).

Part of the return value is the encoding, which would be set.

Martin later added locale.getpreferredencoding(), which tries to
determine the encoding in a different, new way, based on
nl_langinfo(CODESET). As you mentioned, this intention was broken
on several platforms by forcing UTF-8 as output. And in many cases,
the API had to call setlocale() as well, causing the thread problems.

However, the problem with nl_langinfo(CODESET) is the same as
with setlocale(): it returns the current state of the lib C
settings, which may well point to the 'C' locale. Not the ones
the user has configured in the OS environment. So while you get
an encoding defined by lib C for the current locale settings
(without guessing it as with locale.getdefaultlocale()), you
still don't get what the user really wants to use.

Unfortunately, lib C does not provide a way to query the locale
database without changing the locale settings at the same time.
This is the main issue we're facing.

Now, the correct way in all this would be to just call
setlocale(LC_ALL, '') at the start of the application and
not try to apply all the magic to get around this. But this
has to be done by the application and not Python (which may
well be embedded into some other application).

I'd suggest to add a single new API:

locale.getencoding()

which interfaces to nl_langinfo(CODESET) or the Windows code
page and does not try to do any magic, ie. does *not* call
setlocale(). It needs to return what the lib C currently
knows and uses as encoding.

locale.getpreferredencoding() should then be deprecated.

It does not make sense to pretend to query information which is
not really directly available from the lib C locale system.

And the documentation should point out that applications should
call setlocale(LC_ALL, '') when they start up, if they want to
get the lib C locale, and thus Python locale module, set up to
work based on what the user really wants -- instead of just
guessing at this.

PS: The locale module normally does not use underscores in
function names, so it's not a good idea to add more.

-- 
Marc-Andre Lemburg
eGenix.com




[issue43552] Add locale.get_locale_encoding() and locale.get_current_locale_encoding()

2021-03-19 Thread STINNER Victor


New submission from STINNER Victor :

I propose to add two new functions:

* locale.get_locale_encoding(): it's exactly the same as 
locale.getpreferredencoding(False).

* locale.get_current_locale_encoding(): always get the current locale encoding. 
Read the ANSI code page on Windows, or nl_langinfo(CODESET) on other platforms. 
Ignore the UTF-8 Mode. Don't always return "UTF-8" on macOS, Android, VxWorks.


Technically, locale.get_locale_encoding() would simply expose 
_locale.get_locale_encoding() that I added recently. It calls the new private 
_Py_GetLocaleEncoding() function (which has no argument).

By the way, Python requires nl_langinfo(CODESET) to be built. It's not a new 
requirement of Python 3.10, but I wanted to note that, I noticed it when I 
implemented _locale.get_locale_encoding() :-)


Python has a bad habit of lying to the user: locale.getpreferredencoding(False) 
is *NOT* the current locale encoding in multiple cases.

* locale.getpreferredencoding(False) always returns "UTF-8" on macOS, Android 
and VxWorks
* locale.getpreferredencoding(False) always returns "UTF-8" if the UTF-8 Mode is 
enabled
* otherwise, it returns the current locale encoding: ANSI code page on Windows, 
or nl_langinfo(CODESET) on other platforms
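
The three cases above can be sketched as a small pure function. This only illustrates the decision order described in the list; it is not the C implementation. The function name and its parameters are hypothetical, and detecting Android through sys.getandroidapilevel is an assumption made for the sketch.

```python
import sys


def python_locale_encoding_sketch(
    current_locale_encoding,
    *,
    platform=sys.platform,
    utf8_mode=bool(sys.flags.utf8_mode),
    android=hasattr(sys, "getandroidapilevel"),
):
    """Mimic the decision order listed above for getpreferredencoding(False)."""
    # Case 1: platforms where UTF-8 is always returned.
    if platform in ("darwin", "vxworks") or android:
        return "UTF-8"
    # Case 2: the Python UTF-8 Mode ignores the locale.
    if utf8_mode:
        return "UTF-8"
    # Case 3: fall back to the current locale encoding (ANSI code page on
    # Windows, nl_langinfo(CODESET) elsewhere), supplied by the caller.
    return current_locale_encoding


if __name__ == "__main__":
    print(python_locale_encoding_sketch("ISO-8859-15", platform="linux",
                                        utf8_mode=False, android=False))
```

Passing the current locale encoding as an argument keeps the sketch deterministic and makes it obvious in which cases that value is simply discarded.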


Even if locale.getpreferredencoding(False) already exists, I propose to add 
locale.get_locale_encoding() because I dislike the locale.getpreferredencoding() 
API. By default, this function temporarily sets LC_CTYPE to the user preferred 
locale. It can cause mojibake in other threads since setlocale(LC_CTYPE, "") 
affects all threads :-( Calling locale.getpreferredencoding(), rather than 
locale.getpreferredencoding(False), is not what most people expect. This API 
can be misused.

On the other hand, locale.get_locale_encoding() does exactly what it says: it 
only *gets* the encoding, it doesn't temporarily *set* the locale to something 
else.

By the way, the locale.localeconv() function can temporarily change the LC_CTYPE 
locale to the LC_MONETARY locale, which can cause other threads to use the wrong 
LC_CTYPE locale! But this is a different issue.

--
components: Library (Lib)
messages: 389057
nosy: vstinner
priority: normal
severity: normal
status: open
title: Add  locale.get_locale_encoding() and 
locale.get_current_locale_encoding()
versions: Python 3.10
