[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-02 Thread Richard Damon
On 7/2/20 10:19 AM, Victor Stinner wrote: > Do you mean UTF-16 and UTF-32? UTF-16 supports the whole Unicode > character set but uses the annoying surrogate pairs for characters > outside the BMP.* Minor quibble, UTF-16 handles all of the CURRENTLY defined Unicode set, and there is a currently a

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-02 Thread Petr Viktorin
On 2020-07-02 14:57, Victor Stinner wrote: On Thu, 2 Jul 2020 at 14:44, Barry Scott wrote: It's not obvious to me why the latin1 encoding is in this list as it's just one of all the 8-bit char sets. Why is it needed? The Latin-1 (ISO 8859-1) charset is kind of special: it maps bytes

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-02 Thread Victor Stinner
UCS-2 means units of 16 bits so it's limited to Unicode BMP: U+0000-U+FFFF. UCS-4 means units of 32 bits and so gives access to the whole (current) Unicode character set. Do you mean UTF-16 and UTF-32? UTF-16 supports the whole Unicode character set but uses the annoying surrogate pairs for
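
The difference between a BMP character and one that needs a surrogate pair can be seen from Python itself. The following is only a minimal sketch, not part of the thread; the helper name `utf16_code_units` is made up for illustration.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of text in UTF-16 (hypothetical helper)."""
    data = text.encode("utf-16-le")
    # Each UTF-16 code unit is 2 bytes; unpack them as little-endian uint16.
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


# U+00E9 is inside the BMP: a single 16-bit unit is enough.
bmp_units = utf16_code_units("\u00e9")

# U+1F600 is outside the BMP: UTF-16 needs a high and a low surrogate.
astral_units = utf16_code_units("\U0001F600")

# UTF-32 always uses one 32-bit unit per code point.
utf32_size = len("\U0001F600".encode("utf-32-le"))

print([hex(u) for u in bmp_units])
print([hex(u) for u in astral_units])
print(utf32_size)
```

The two units for U+1F600 fall in the surrogate ranges (0xD800-0xDBFF and 0xDC00-0xDFFF), which is exactly what a pure UCS-2 consumer cannot interpret.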

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-02 Thread Barry Scott
> On 30 Jun 2020, at 13:43, Emily Bowman wrote: > > I completely agree with this, that UTF-8 has become the One True > Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside of the > Win32 API. Nearly all basic emoji can't be represented in UCS-2 wchar_t, let > alone composite

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-02 Thread Victor Stinner
On Thu, 2 Jul 2020 at 14:44, Barry Scott wrote: > It's not obvious to me why the latin1 encoding is in this list as it's just > one of all the 8-bit char sets. > Why is it needed? The Latin-1 (ISO 8859-1) charset is kind of special: it maps bytes 0x00-0xFF to Unicode characters U+0000-U+00FF

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-02 Thread Paul Moore
Latin-1 is the encoding that maps every byte (0-255) to the Unicode character with the same number. So it's special in that sense, and it gets used when mapping 8-bit bytes via Unicode "without encoding". Excuse my imprecise language here, I don't know the correct Unicode terms without going &
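
A short sketch of that property, written for illustration only: decoding every possible byte value as Latin-1 yields the code point with the same number, which is not true of other 8-bit charsets such as cp1252 (used here purely as a contrasting example).

```python
# Every byte value 0..255, once each.
all_bytes = bytes(range(256))

# Latin-1 decodes byte N to the Unicode code point U+00NN.
text = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in text]

# The mapping is lossless in both directions.
round_trip = text.encode("latin-1")

# Contrast: in cp1252, byte 0x80 is the euro sign U+20AC, not U+0080.
cp1252_char = b"\x80".decode("cp1252")

print(code_points == list(range(256)))
print(round_trip == all_bytes)
print(hex(ord(cp1252_char)))
```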

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-02 Thread Barry Scott
> On 29 Jun 2020, at 10:57, Victor Stinner wrote: > > I would prefer to only have a fast-path for the most common encodings: > ASCII, Latin1, UTF-8, Windows ANSI code page. That's all. It's not obvious to me why the latin1 encoding is in this list as it's just one of all the 8-bit char sets.

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-01 Thread Inada Naoki
On Thu, Jul 2, 2020 at 5:20 AM M.-A. Lemburg wrote: > > > The reasoning here is the same as for decoding: you have the original > data you want to process available in some array and want to turn > this into the Python object. > > The path Victor suggested requires always going via a Python

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-01 Thread Glenn Linderman
On 7/1/2020 1:20 PM, M.-A. Lemburg wrote: As an example application, think of a database module which provides the Unicode data as Py_UNICODE buffer. You want to write this as UTF-8 data to a file or a socket, so you have the PyUnicode_EncodeUTF8() API decode this for you into a bytes object

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-01 Thread M.-A. Lemburg
On 30.06.2020 15:17, Victor Stinner wrote: > On Tue, 30 Jun 2020 at 13:53, M.-A. Lemburg wrote: >>> I would prefer to analyze the list on a case by case basis. I don't >>> think that it's useful to expose every single encoding supported by >>> Python as a C function. >> >> (...) >> This does

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-01 Thread M.-A. Lemburg
On 28.06.2020 16:24, Inada Naoki wrote: > Hi, Lemburg. > > Thank you for quick response. > >> >> We can't just remove access to one half of a codec (the decoding >> part) without at least providing an alternative for C extensions >> to use. >> >> Py_UNICODE can be removed from the API, but only

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-30 Thread Rhodri James
On 30/06/2020 13:43, Emily Bowman wrote: I completely agree with this, that UTF-8 has become the One True Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside of the Win32 API. Nearly all basic emoji can't be represented in UCS-2 wchar_t, let alone composite emoji. You say that

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-30 Thread Victor Stinner
On Tue, 30 Jun 2020 at 13:53, M.-A. Lemburg wrote: > > I would prefer to analyze the list on a case by case basis. I don't > > think that it's useful to expose every single encoding supported by > > Python as a C function. > > (...) > This does not mean we have to give up the symmetry in the C

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-30 Thread Richard Damon
On 6/30/20 8:43 AM, Emily Bowman wrote: > I completely agree with this, that UTF-8 has become the One True > Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside > of the Win32 API. Nearly all basic emoji can't be represented in UCS-2 > wchar_t, let alone composite emoji. > > So

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-30 Thread Emily Bowman
I completely agree with this, that UTF-8 has become the One True Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside of the Win32 API. Nearly all basic emoji can't be represented in UCS-2 wchar_t, let alone composite emoji. So how to make that C-compatible? Make everything a

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-30 Thread Rhodri James
On 30/06/2020 13:16, Richard Damon wrote: On 6/30/20 7:53 AM, M.-A. Lemburg wrote: Since the C world has adopted wchar_t for this purpose, it's the natural choice. I would disagree with this comment. Microsoft Windows has chosen to use 'wchar_t' for Unicode, because they adopted UCS-2 before

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-30 Thread Richard Damon
On 6/30/20 7:53 AM, M.-A. Lemburg wrote: > Since the C world has adopted wchar_t for this purpose, it's the > natural choice. I would disagree with this comment. Microsoft Windows has chosen to use 'wchar_t' for Unicode, because they adopted UCS-2 before it morphed into UTF-16 due to the

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-30 Thread M.-A. Lemburg
On 29.06.2020 11:57, Victor Stinner wrote: > On Sun, 28 Jun 2020 at 11:22, M.-A. Lemburg wrote: >> as you may remember, I wasn't happy with the deprecations of the >> APIs in PEP 393, since there are no C API alternatives for >> the encoding APIs deprecated in the PEP, which allow direct >>

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-29 Thread Victor Stinner
On Mon, 29 Jun 2020 at 12:50, Inada Naoki wrote: > > > ## PyUnicode_EncodeDecimal > > > > > > It is not documented. It has not been deprecated by Py_DEPRECATED. > > > Plan: Add Py_DEPRECATED in Python 3.9 and remove it in 3.11. > > > > I understood that the replacement function is the private

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-29 Thread Inada Naoki
On Mon, Jun 29, 2020 at 6:51 PM Victor Stinner wrote: > > > I understand that these ".. deprecated" markups will be added to 3.8 > and 3.9 documentation, right? > They are documented as "Deprecated since version 3.3, will be removed in version 4.0" already. I am proposing s/4.0/3.10/ in 3.8 and

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-29 Thread Victor Stinner
On Mon, 29 Jun 2020 at 12:36, Inada Naoki wrote: > * UTF16 and UTF32; `int byteorder` parameter. * UTF-16 byte_order=0 means "UTF-16" encoding * UTF-16 byte_order<0 means "UTF-16-LE" encoding * UTF-16 byte_order>0 means "UTF-16-BE" encoding Same applies for UTF-32. > * UTF7; int
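
The three codec names mentioned here behave differently with respect to byte order and the byte order mark (BOM). This is a small illustrative sketch using only the Python-level codecs, not the C functions under discussion.

```python
import codecs
import sys

text = "A"

# Explicit byte order: no BOM is written.
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

# Plain "utf-16": a BOM is written, followed by native-order code units.
with_bom = text.encode("utf-16")
native = little if sys.byteorder == "little" else big

print(little, big, with_bom)
print(with_bom == codecs.BOM_UTF16 + native)

# On decoding, the BOM tells the "utf-16" codec which byte order to use.
print(with_bom.decode("utf-16") == text)
```

The same pattern holds for "utf-32", "utf-32-le" and "utf-32-be".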

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-29 Thread Inada Naoki
Many existing public APIs don't have a `const char *errors` argument. As there are very few users, we can ignore that limitation. On the other hand, some encodings have special options. * UTF16 and UTF32; `int byteorder` parameter. * UTF7; int base64SetO, int base64WhiteSpace So

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-29 Thread Victor Stinner
On Sun, 28 Jun 2020 at 17:21, Inada Naoki wrote: > More aggressive idea: override current PyUnicode_EncodeXXX() APIs. > Change from `Py_UNICODE *object` to `PyObject *unicode`. > > This idea might look crazy. But PyUnicode_EncodeXXX APIs have been > deprecated for a long time, and there are only a

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-29 Thread Victor Stinner
On Mon, 29 Jun 2020 at 08:41, Inada Naoki wrote: > That's all. > Now I think it is safe to override deprecated APIs with private APIs > that accept a Unicode object. > > * _PyUnicode_EncodeUTF7 -> PyUnicode_EncodeUTF7 Use PyUnicode_AsEncodedString("UTF-7"). This encoding is not common enough to
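
The idea of going through the generic, name-based codec lookup instead of a dedicated per-encoding function has a direct Python-level analogue. The sketch below is only an illustration of that approach with `str.encode()`; it says nothing about how the C function is implemented.

```python
text = "caf\u00e9 + \u2603"

# Generic path: the codec is selected by name at run time.
encoded = text.encode("utf-7")

# UTF-7 output is pure ASCII, and decoding restores the original string.
decoded = encoded.decode("utf-7")

print(encoded)
print(encoded.isascii())
print(decoded == text)
```

A rarely used encoding such as UTF-7 pays a small lookup cost this way, but needs no dedicated entry point.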

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-29 Thread Victor Stinner
On Sun, 28 Jun 2020 at 11:22, M.-A. Lemburg wrote: > as you may remember, I wasn't happy with the deprecations of the > APIs in PEP 393, since there are no C API alternatives for > the encoding APIs deprecated in the PEP, which allow direct > encoding provided by these important codecs. > >

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-29 Thread Victor Stinner
On Sun, 28 Jun 2020 at 04:39, Inada Naoki wrote: > ## Documented and have Py_DEPRECATED > > * PyLong_FromUnicode > * PyUnicode_AsUnicodeCopy > * PyUnicode_Encode > * PyUnicode_EncodeUTF7 > * PyUnicode_EncodeUTF8 > * PyUnicode_EncodeUTF16 > * PyUnicode_EncodeUTF32 > *

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-29 Thread Inada Naoki
On Mon, Jun 29, 2020 at 12:17 AM Inada Naoki wrote: > > > More aggressive idea: override current PyUnicode_EncodeXXX() apis. > Change from `Py_UNICODE *object` to `PyObject *unicode`. > This is a list of PyUnicode_Encode usage in top4000 packages.

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-28 Thread Inada Naoki
On Sun, Jun 28, 2020 at 11:24 PM Inada Naoki wrote: > > > So how about making them public, instead of undeprecating Py_UNICODE* encode > APIs? > > 1. Add PyUnicode_AsXXXBytes public APIs in Python 3.10. > Current private APIs can become macro (e.g. #define > _PyUnicode_AsAsciiString

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-28 Thread Inada Naoki
Hi, Lemburg. Thank you for quick response. > > We can't just remove access to one half of a codec (the decoding > part) without at least providing an alternative for C extensions > to use. > > Py_UNICODE can be removed from the API, but only if there are > alternative APIs which C extensions can

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-28 Thread M.-A. Lemburg
Hi Inada-san, as you may remember, I wasn't happy with the deprecations of the APIs in PEP 393, since there are no C API alternatives for the encoding APIs deprecated in the PEP, which allow direct encoding provided by these important codecs. AFAIK, the situation hasn't changed since then. We