[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-02 Thread Richard Damon
On 7/2/20 10:19 AM, Victor Stinner wrote:
> Do you mean UTF-16 and UTF-32? UTF-16 supports the whole Unicode
> character set but uses the annoying surrogate pairs for characters
> outside the BMP.*

Minor quibble: UTF-16 handles all of the CURRENTLY defined Unicode set,
and there is currently a promise not to extend Unicode past that, but
at some point they may need to break that promise.

UTF-8, as previously defined (and could be again), easily handles
U+00000000 to U+7FFFFFFF.
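As a rough illustration of that older, wider definition, here is a minimal Python sketch of an encoder that allows up to six bytes per code point (31 bits). The function name `encode_utf8_legacy` is made up for this example; Python's own "utf-8" codec deliberately stops at U+10FFFF, so the sketch does the bit packing by hand.

```python
def encode_utf8_legacy(cp: int) -> bytes:
    """Encode one code point using the original 1-to-6-byte UTF-8 layout."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit range")
    if cp < 0x80:
        return bytes([cp])  # plain ASCII is a single byte
    # (exclusive upper bound, lead-byte marker, number of continuation bytes)
    layouts = (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    )
    for limit, lead, count in layouts:
        if cp < limit:
            # Each continuation byte carries 6 bits, most significant first.
            tail = [0x80 | ((cp >> (6 * i)) & 0x3F)
                    for i in range(count - 1, -1, -1)]
            return bytes([lead | (cp >> (6 * count))] + tail)
    raise AssertionError("unreachable")


# Within today's Unicode range the result matches Python's own codec.
emoji_bytes = encode_utf8_legacy(0x1F600)
# Beyond U+10FFFF only the legacy layout has an answer (six bytes here).
largest = encode_utf8_legacy(0x7FFFFFFF)
```

The five- and six-byte forms are exactly what the current definition of UTF-8 dropped when the code space was capped at U+10FFFF.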

UTF-16 can handle U+00000000 to U+0010FFFF via the surrogate pairs and
stops there. To extend past that would require some form of heroics,
which is the reason that U+0010FFFF is currently defined as the highest
possible code point: allowing a higher value would break UTF-16, and there
currently isn't a desire to do so. At some point in the distant future,
we may run out of 'valid' code points, and this promise will need to be
broken.
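The ceiling falls straight out of the surrogate arithmetic: a pair carries 10 + 10 bits on top of the 0x10000 offset. A small Python sketch, with helper names invented for this example:

```python
HIGH_START = 0xD800  # first high (lead) surrogate
LOW_START = 0xDC00   # first low (trail) surrogate
MAX_UTF16_CODE_POINT = 0x10FFFF


def to_surrogate_pair(cp: int) -> tuple:
    """Split a code point above the BMP into its two UTF-16 code units."""
    if not 0x10000 <= cp <= MAX_UTF16_CODE_POINT:
        raise ValueError("not representable as a surrogate pair")
    offset = cp - 0x10000  # a 20-bit value
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine two UTF-16 code units into the original code point."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


pair = to_surrogate_pair(0x1F600)
top = to_surrogate_pair(MAX_UTF16_CODE_POINT)
```

The highest pair, (0xDBFF, 0xDFFF), uses up every bit the scheme has, which is why nothing above U+10FFFF fits without changing UTF-16 itself.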

UTF-16 grew out of a need to fix what has become UCS-2, which is the
encoding used for earlier Unicode standards, before the need for code
points above U+FFFF (now the BMP) was seen.

-- 
Richard Damon
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/HJ7R5Q25EVCSBS7CZFZ5CNYITXOLWWFG/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-02 Thread Petr Viktorin

On 2020-07-02 14:57, Victor Stinner wrote:
> On Thu, 2 Jul 2020 at 14:44, Barry Scott wrote:
>> It's not obvious to me why the latin1 encoding is in this list as it's
>> just one of all the 8-bit char sets.
>> Why is it needed?
>
> The Latin-1 (ISO 8859-1) charset is kind of special: it maps bytes
> 0x00-0xFF to Unicode characters U+0000-U+00FF and decoding from latin1
> cannot fail.

This apparently makes it useful for not-quite-text, not-quite-bytes
protocols like HTTP. In particular, WSGI (PEP 3333) uses latin-1 for
headers.
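A minimal Python sketch of that "bytes carried in a str" pattern follows. The helper names and the sample header value are invented for illustration and are not taken from any WSGI server.

```python
def bytes_to_native(raw: bytes) -> str:
    """Carry arbitrary header bytes in a str, one code point per byte."""
    return raw.decode("latin-1")


def native_to_bytes(value: str) -> bytes:
    """Recover the exact original bytes from such a str."""
    return value.encode("latin-1")


# Hypothetical header value that was really UTF-8 on the wire.
raw_header = "caf\u00e9".encode("utf-8")
native = bytes_to_native(raw_header)               # never raises
recovered = native_to_bytes(native).decode("utf-8")
```

The intermediate `native` string looks like mojibake, but no information is lost, so the application can re-decode it once it knows the real encoding.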




> It was commonly used as the locale encoding in Europe 10 years ago,
> but nowadays most Linux distributions use UTF-8 as the locale
> encoding.
>
> I'm also fine with restricting the list to 3 encodings: ASCII, UTF-8
> and Windows ANSI code page.


Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/DQI2UW5WOQ3EMHRP5VEGDG3MIU364I6K/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-02 Thread Victor Stinner
UCS-2 means units of 16 bits so it's limited to the Unicode BMP: U+0000-U+FFFF.

UCS-4 means units of 32 bits and so gives access to the whole
(current) Unicode character set.

Do you mean UTF-16 and UTF-32? UTF-16 supports the whole Unicode
character set but uses the annoying surrogate pairs for characters
outside the BMP.*

UTF-32 is UCS-4 in practice.
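To make the difference concrete, this short Python sketch counts code units per character in each encoding. The helper names are made up for the example.

```python
def code_units(ch: str) -> dict:
    """Count how many code units one character needs in each encoding."""
    return {
        "utf-8 bytes": len(ch.encode("utf-8")),
        "utf-16 units": len(ch.encode("utf-16-le")) // 2,
        "utf-32 units": len(ch.encode("utf-32-le")) // 4,
    }


def fits_in_ucs2(ch: str) -> bool:
    """UCS-2 has no surrogate pairs, so only BMP characters fit."""
    return ord(ch) <= 0xFFFF


euro = code_units("\u20ac")       # a BMP character
emoji = code_units("\U0001F600")  # outside the BMP
```

Only the non-BMP character needs two UTF-16 units, and it is exactly the kind of character UCS-2 cannot represent at all.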

Victor

On Thu, 2 Jul 2020 at 15:08, Barry Scott wrote:
>
>
>
> On 30 Jun 2020, at 13:43, Emily Bowman  wrote:
>
> I completely agree with this, that UTF-8 has become the One True 
> Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside of the 
> Win32 API. Nearly all basic emoji can't be represented in UCS-2 wchar_t, let 
> alone composite emoji.
>
>
> I use UCS-32 in my extensions, but never persist UCS-32 for which I use UTF-8.
>
> If you are calling WIN32 "unicode" APIs then you need UCS-16.
>
> My plan with PyCXX is to replace Py_UNICODE with UCS-32.
> I think all the UCS-32 APIs will still be present.
>
> Once I add that support to PyCXX all my users should easily port to a 
> non-Py_UNICODE world.
>
> Barry
>



-- 
Night gathers, and now my watch begins. It shall not end until my death.
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/K5MKE6EDM7HKAGFXQ4EYWKACDX6OCFFH/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-02 Thread Barry Scott


> On 30 Jun 2020, at 13:43, Emily Bowman  wrote:
> 
> I completely agree with this, that UTF-8 has become the One True 
> Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside of the 
> Win32 API. Nearly all basic emoji can't be represented in UCS-2 wchar_t, let 
> alone composite emoji.
> 

I use UCS-32 in my extensions, but I never persist UCS-32; for that I use UTF-8.

If you are calling WIN32 "unicode" APIs then you need UCS-16.

My plan with PyCXX is to replace Py_UNICODE with UCS-32.
I think all the UCS-32 APIs will still be present.

Once I add that support to PyCXX all my users should easily port to a 
non-Py_UNICODE world.

Barry

Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/YIKT5XGPZIMEIAPBJS3OQAZTWW4JM3Z2/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-02 Thread Victor Stinner
On Thu, 2 Jul 2020 at 14:44, Barry Scott wrote:
> It's not obvious to me why the latin1 encoding is in this list as it's just
> one of all the 8-bit char sets.
> Why is it needed?

The Latin-1 (ISO 8859-1) charset is kind of special: it maps bytes
0x00-0xFF to Unicode characters U+0000-U+00FF and decoding from latin1
cannot fail.
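Both properties are easy to check from Python. This is only a sketch, and `decodes_cleanly` is a helper invented for the example:

```python
all_bytes = bytes(range(256))

# Every byte value decodes, and byte N becomes code point N.
text = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in text]
round_tripped = text.encode("latin-1")


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Report whether `data` decodes under `encoding` without an error."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True
```

The same 256 bytes are rejected by both the "ascii" and "utf-8" codecs, so among the fast-path candidates Latin-1 is the only one that accepts arbitrary bytes.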

It was commonly used as the locale encoding in Europe 10 years ago,
but nowadays most Linux distributions use UTF-8 as the locale
encoding.

I'm also fine with restricting the list to 3 encodings: ASCII, UTF-8
and Windows ANSI code page.

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/YWL3WU3PN7NXIGZNLIBLAO7O225VF46C/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-02 Thread Paul Moore
Latin-1 is the encoding that maps every byte (0-255) to the Unicode
character with the same number. So it's special in that sense, and it
gets used when mapping 8-bit bytes via Unicode "without encoding".
Excuse my imprecise language here, I don't know the correct Unicode
terms without going & looking them up.

Paul

On Thu, 2 Jul 2020 at 13:48, Barry Scott  wrote:
>
>
>
> On 29 Jun 2020, at 10:57, Victor Stinner  wrote:
>
> I would prefer to only have a fast-path for the most common encodings:
> ASCII, Latin1, UTF-8, Windows ANSI code page. That's all.
>
>
> It's not obvious to me why the latin1 encoding is in this list as its just 
> one of all the 8-bit char sets.
> Why is it needed?
>
> Barry
>
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/QPFM5L5UEKTICPDVFIE3NT5M7RX4C4ID/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-02 Thread Barry Scott


> On 29 Jun 2020, at 10:57, Victor Stinner  wrote:
> 
> I would prefer to only have a fast-path for the most common encodings:
> ASCII, Latin1, UTF-8, Windows ANSI code page. That's all.

It's not obvious to me why the latin1 encoding is in this list, as it's just one
of all the 8-bit char sets.
Why is it needed?

Barry

Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/XGJ5NG4WPJKUOZY7KPWD2R3FP6XJDXPM/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-01 Thread Inada Naoki
On Thu, Jul 2, 2020 at 5:20 AM M.-A. Lemburg  wrote:
>
>
> The reasoning here is the same as for decoding: you have the original
> data you want to process available in some array and want to turn
> this into the Python object.
>
> The path Victor suggested requires always going via a Python Unicode
> object, but that it very expensive and not really an appropriate
> way to address the use case.
>

But the current PyUnicode_Encode* APIs call `PyUnicode_FromWideChar` internally.
So they are already not direct APIs.

Additionally, pyodbc, the only user of the encoder API, did
PyUnicode_EncodeUTF16(PyUnicode_AsUnicode(unicode), ...)
This is very inefficient: Unicode object -> Py_UNICODE* -> Unicode
object -> bytes object.

And as many others have already said, most of the C world uses UTF-8 to
represent Unicode, not wchar_t.

So I don't want to undeprecate the current API.


> As an example application, think of a database module which provides
> the Unicode data as Py_UNICODE buffer.

Py_UNICODE is deprecated.  So I assume you are talking about wchar_t.


> You want to write this as UTF-8
> data to a file or a socket, so you have the PyUnicode_EncodeUTF8() API
> encode this for you into a bytes object which you can then write out
> using the Python C APIs for this.

PyUnicode_FromWideChar + PyUnicode_AsUTF8AndSize is better than
PyUnicode_EncodeUTF8.

PyUnicode_EncodeUTF8 allocates a temporary Unicode object anyway, so it needs
to allocate a Unicode object *and* a char* buffer for UTF-8.
On the other hand, PyUnicode_AsUTF8AndSize can just expose the internal
data when the string is plain ASCII. Since ASCII strings are very common, this
is an effective optimization.
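For readers who want to poke at PyUnicode_AsUTF8AndSize without writing an extension, here is a hedged Python sketch that calls it through ctypes. It assumes CPython (where `ctypes.pythonapi` exposes the C API); the wrapper name `utf8_view` is invented for the example, and it copies the bytes out rather than demonstrating the zero-copy behaviour itself.

```python
import ctypes

# Assumption: running on CPython, where ctypes.pythonapi exposes the C API.
_as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
_as_utf8.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
# Use a raw pointer: c_char_p would stop at the first NUL byte.
_as_utf8.restype = ctypes.c_void_p


def utf8_view(text: str) -> bytes:
    """Copy out the UTF-8 buffer that PyUnicode_AsUTF8AndSize points to."""
    size = ctypes.c_ssize_t()
    pointer = _as_utf8(text, ctypes.byref(size))
    # The buffer belongs to `text`, so copy it while `text` is still alive.
    return ctypes.string_at(pointer, size.value)


ascii_copy = utf8_view("plain ascii")
accented_copy = utf8_view("h\u00e9llo")
```

In C the returned pointer can be used directly for as long as the str object is alive, which is the saving described above.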

Regards,
-- 
Inada Naoki  
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/UYOPQDKLSNOVPFGPCR5BIW3GHYB3V3KZ/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-01 Thread Glenn Linderman

On 7/1/2020 1:20 PM, M.-A. Lemburg wrote:
> As an example application, think of a database module which provides
> the Unicode data as Py_UNICODE buffer. You want to write this as UTF-8
> data to a file or a socket, so you have the PyUnicode_EncodeUTF8() API
> encode this for you into a bytes object which you can then write out
> using the Python C APIs for this.

But based on Victor's survey of usages in Python extensions, which found
few to no uses of these APIs, it would seem that hypothetical
applications are insufficient to justify the continued provision and
maintenance of these APIs.


After all, Python extensions are written as a way to interface "other 
stuff" to Python, and converting data to/from Python objects seems far 
more likely than converting data from one non-Python format to a 
different non-Python format. Not that such applications couldn't be 
written as Python extensions, but ... are they? ... and why?


A rich interface is nice, but an unused interface is a burden.
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/MA7U7A6OCO46TEWXRKHGDRBYH5FMWJUP/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-01 Thread M.-A. Lemburg
On 30.06.2020 15:17, Victor Stinner wrote:
> On Tue, 30 Jun 2020 at 13:53, M.-A. Lemburg wrote:
>>> I would prefer to analyze the list on a case by case basis. I don't
>>> think that it's useful to expose every single encoding supported by
>>> Python as a C function.
>>
>> (...)
>> This does not mean we have to give up the symmetry in the C API,
>> or that the encoding APIs are now suddenly useless. It only means
>> that we have to replace Py_UNICODE with one of the supported data
>> for storing Unicode.
> 
> Let's agree to disagree :-)
> 
> I don't think that completeness is a good rationale to design the C API.

Oh, if that's your opinion, then we definitely disagree :-)

I strongly believe that the success of Python was in major parts
built on the fact that Python does have a complete and easily
usable C API.

Without this, Python would never have convinced
the "Python is slow" advocates that you can actually build
fast applications in Python by using Python to orchestrate and
integrate with low level C libraries, and we'd be regarded as
yet another Tcl.

> The C API is too large, we have to make it smaller.

That's a different discussion, but I disagree with that perspective
as well: we have to refactor parts of the Python C API to make it
more consistent and remove hacks which developers sometimes added
as helper functions without considering the big picture.

The Unicode API has over the years grown a lot of such helpers
and there's certainly room for improvement, but simply ripping out
things is not always the right answer, esp. not when you touch the
very core of the design.

> A specialized
> function, like PyUnicode_AsUTF8String(), can be justified by different
> reasons:
> 
> * It is a very common use case and so it helps to write C extensions
> * It is significantly faster than the alternative generic function
> 
> In C, you can execute arbitrary Python code by calling methods on
> Python str objects. For example, "abc".encode("utf-8",
> "surrogateescape") in Python becomes PyObject_CallMethod(obj,
> "encode", "ss", "utf-8", "surrogateescape") in C. Well, there is already
> a more specialized and generic PyUnicode_AsEncodedObject() function.

You know as well as I do, that the Python call mechanism is by far
the slowest part in the Python C API, so telling developers to
use this as the main way to run tasks which can be run much faster,
easier and with less memory overhead or copying of data by directly
calling a simple C API, is not a good way to advocate for a useful
Python C API.

> We must not add a C API function for every single Python feature,
> otherwise it would be too expensive to maintain, and it would become
> impossible for other Python implementations to implement the fully C
> API. Well, even today, PyPy already only implements a small subset of
> the C API.

I honestly don't think that other Python implementations should
even try to implement the Python C API. Instead, they should build
a bridge to use the CPython runtime and integrate this into their
system.

>> Since the C world has adopted wchar_t for this purpose, it's the
>> natural choice.
> 
> In my experience, in C extensions, there are two kind of data:
> 
> * bytes is used as a "char*": array of bytes
> * Unicode is used as a Python object

Uhm, what about all those applications, libraries and OS calls
producing Unicode data ? It is not always feasible or even desired
to first convert this into a Python Unicode object.

> For the very rare cases involving wchar_t*, PyUnicode_FromWideChar()
> can be used. I don't think that performance justifies to duplicate
> each function, once for a Python str object, once for wchar_t*. I
> mostly saw code involving wchar_t* to initialize Python. But this code
> was wrong since it used PyUnicode function *before* Python was
> initialized. That's bad and can now crash in recent Python versions.

But that's an entirely unrelated issue, right? The C lib has
full support for wchar_t and provides plenty of APIs for using
it. The main() invocation is just one small part of the C lib
Unicode system.

> The new PEP 587 has a different design and avoids Python objects and
> anything related to the Python runtime:
> https://docs.python.org/dev/c-api/init_config.html#c.PyConfig_SetString
> 
> Moreover, CPython implements functions taking wchar_t* string by
> calling PyUnicode_FromWideChar() internally...

I mentioned wchar_t as buffer input replacement for the
PyUnicode_Encode*() API as alternative to the deprecated
Py_UNICODE.

Of course, you can convert all wchar_t data into a Python Unicode
object first and then apply operations on this, but the point of
the encode APIs is to have a low-level access to the Python codecs
which works directly on a data buffer - not a Unicode object.

Again, with the main intent to avoid unnecessary copying of data,
scanning, preparing, etc. etc. as is needed for
PyUnicode_FromWideChar().

>> PyUnicode_AsEncodedString() converts Unicode objects to a
>> 

[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-07-01 Thread M.-A. Lemburg
On 28.06.2020 16:24, Inada Naoki wrote:
> Hi, Lamburg.
> 
> Thank you for quick response.
> 
>>
>> We can't just remove access to one half of a codec (the decoding
>> part) without at least providing an alternative for C extensions
>> to use.
>>
>> Py_UNICODE can be removed from the API, but only if there are
>> alternative APIs which C extensions can use to the same effect.
>>
>> Given PEP 393, this would be APIs which use wchar_t instead of
>> Py_UNICODE.
>>
> 
> Decoding part is implemented as `const char *` -> `PyObject*` (Unicode 
> object).
> I think this is reasonable since `const char *` is perfect to abstract
> the encoded string,
> 
> In case of encoding part, `wchar_t *` is not perfect abstraction for
> (decoded) unicode
> string.

Note that the PyUnicode_Encode*() APIs are meant to make the
codec's encoding machinery available to C extensions, so that they
don't have to implement this again.

In that sense, their purpose is not to encode an existing Unicode
object, but instead, to provide access to the low-level buffer to
bytes object encoding.

The reasoning here is the same as for decoding: you have the original
data you want to process available in some array and want to turn
this into the Python object.

The path Victor suggested requires always going via a Python Unicode
object, but that is very expensive and not really an appropriate
way to address the use case.

As an example application, think of a database module which provides
the Unicode data as a Py_UNICODE buffer. You want to write this as UTF-8
data to a file or a socket, so you have the PyUnicode_EncodeUTF8() API
encode this for you into a bytes object which you can then write out
using the Python C APIs for this.

> Converting from Unicode object into `wchar_t *` is not zero-cost.
> I think `PyObject *` (Unicode object) -> `PyObject *` (bytes object)
> looks like a better signature for encoders than
> `wchar_t *` -> `PyObject *` (bytes object), because:

See above. The motivation for these APIs is different. They are
not about taking a Unicode object and converting it into bytes,
they are deliberately about taking a data buffer as input and
producing the Python bytes object as output (to implement symmetry
between decoding and encoding).
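As a Python-level analogue of the "buffer in, bytes out" shape being argued about, here is a small sketch. The function name and the sample code points are invented, and the buffer is assumed to hold little-endian 32-bit code points. In pure Python the temporary str object is unavoidable, and that intermediate object is the cost at issue.

```python
import struct


def ucs4_buffer_to_utf8(buffer: bytes) -> bytes:
    """Encode a little-endian UCS-4 buffer as UTF-8 bytes."""
    text = buffer.decode("utf-32-le")  # temporary str object
    return text.encode("utf-8")        # second allocation: the bytes object


# Hypothetical data as a database driver might hand it over.
code_points = [0x48, 0xE9, 0x20AC, 0x1F600]
raw = struct.pack("<%dI" % len(code_points), *code_points)
utf8 = ucs4_buffer_to_utf8(raw)
```

A direct C-level encoder would skip the `text` step; going through PyUnicode_FromWideChar corresponds to keeping it.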

> * Unicode object is more important than `wchar_t *` in Python.

Right, but as I tried to explain in my reply to Victor, I designed
the Unicode API in Python to be a rich API, which provides all
necessary tools to easily work with Unicode in C extensions as
well as in the CPython interpreter.

The API is not only focused on what the CPython interpreter needs.
It's an API which implements a concise interface to Unicode as
used in Python.

> * All PyUnicode_EncodeXXX APIs are implemented with PyUnicode_FromWideChar.
> 
> For example, we have these private encode APIs:
> 
> * PyObject* _PyUnicode_AsAsciiString(PyObject *unicode, const char *errors)
> * PyObject* _PyUnicode_AsLatin1String(PyObject *unicode, const char *errors)
> * PyObject* _PyUnicode_AsUTF8String(PyObject *unicode, const char *errors)
> * PyObject* _PyUnicode_EncodeUTF16(PyObject *unicode, const char
> *errors, int byteorder)
> ...
> 
> So how about making them public, instead of undeprecate Py_UNICODE* encode 
> APIs?

I'd be fine with keeping just a generic PyUnicode_Encode() API,
but this should then be encoding from a buffer to a bytes object.

The above all take Unicode objects as input and create the same
problem as I described above, with the temporary Unicode object being
created and all the associated malloc and scanning overhead needed
for this.

The reason I mention wchar_t as the new basis for the PyUnicode_Encode()
API is that wchar_t has grown to be accepted as the standard for
Unicode buffers in C. If you don't believe that this is good enough,
we could also force Py_UCS4, but this would alienate Windows extension
writers.

> 1. Add PyUnicode_AsXXXBytes public APIs in Python 3.10.
>    Current private APIs can become macro (e.g. #define
>    _PyUnicode_AsAsciiString PyUnicode_AsAsciiBytes),
>    or deprecated static inline function.
> 2. Remove Py_UNICODE* encode APIs in Python 3.12.

FWIW: I don't object to deprecating Py_UNICODE. I just don't
want to lose the symmetry in decoding/encoding and add the cost
of having to go via a Python Unicode object just to encode
to bytes.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-30 Thread Rhodri James

On 30/06/2020 13:43, Emily Bowman wrote:
> I completely agree with this, that UTF-8 has become the One True
> Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside of the
> Win32 API. Nearly all basic emoji can't be represented in UCS-2 wchar_t,
> let alone composite emoji.

You say that as if it's a bad thing :-)

> So how to make that C-compatible? Make everything a void* and it just comes
> back with as many bytes as it gets?


I'd be inclined to something like that.  You really don't want people 
trying to roll their own UTF-8 handling if you can help it.  That does 
imply the C API will need to be pretty comprehensive, though.


(If you want nightmares, take a look at the parsing code in Expat. 
Multiple layers of macros and function tables make it a horror to 
comprehend.)


--
Rhodri James *-* Kynesim Ltd
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/7HPGNVZ46ROP3HMRUJXJXX2WI4LI4JAL/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-30 Thread Victor Stinner
On Tue, 30 Jun 2020 at 13:53, M.-A. Lemburg wrote:
> > I would prefer to analyze the list on a case by case basis. I don't
> > think that it's useful to expose every single encoding supported by
> > Python as a C function.
>
> (...)
> This does not mean we have to give up the symmetry in the C API,
> or that the encoding APIs are now suddenly useless. It only means
> that we have to replace Py_UNICODE with one of the supported data
> for storing Unicode.

Let's agree to disagree :-)

I don't think that completeness is a good rationale to design the C API.

The C API is too large, we have to make it smaller. A specialized
function, like PyUnicode_AsUTF8String(), can be justified by different
reasons:

* It is a very common use case and so it helps to write C extensions
* It is significantly faster than the alternative generic function

In C, you can execute arbitrary Python code by calling methods on
Python str objects. For example, "abc".encode("utf-8",
"surrogateescape") in Python becomes PyObject_CallMethod(obj,
"encode", "ss", "utf-8", "surrogateescape") in C. Well, there is already
a more specialized and generic PyUnicode_AsEncodedObject() function.
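The choice of error handler matters here, because "surrogateescape" and "surrogatepass" are different handlers with different output. A short Python sketch (the helper `try_encode` is made up for the example):

```python
# A lone low surrogate, as produced when undecodable byte 0xFF is decoded
# with the "surrogateescape" handler.
lone = "\udcff"


def try_encode(text: str, errors: str):
    """Encode to UTF-8 with the given handler; return None on failure."""
    try:
        return text.encode("utf-8", errors)
    except UnicodeEncodeError:
        return None


strict_result = try_encode(lone, "strict")    # refuses lone surrogates
escaped = try_encode(lone, "surrogateescape")  # restores the original byte
passed = try_encode(lone, "surrogatepass")     # encodes the surrogate itself
```

So the C call only mirrors the Python call when both name the same handler.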

We must not add a C API function for every single Python feature,
otherwise it would be too expensive to maintain, and it would become
impossible for other Python implementations to implement the full C
API. Well, even today, PyPy already only implements a small subset of
the C API.


> Since the C world has adopted wchar_t for this purpose, it's the
> natural choice.

In my experience, in C extensions, there are two kinds of data:

* bytes is used as a "char*": array of bytes
* Unicode is used as a Python object

For the very rare cases involving wchar_t*, PyUnicode_FromWideChar()
can be used. I don't think that performance justifies duplicating
each function, once for a Python str object, once for wchar_t*. I
mostly saw code involving wchar_t* to initialize Python. But this code
was wrong since it used PyUnicode functions *before* Python was
initialized. That's bad and can now crash in recent Python versions.
The new PEP 587 has a different design and avoids Python objects and
anything related to the Python runtime:
https://docs.python.org/dev/c-api/init_config.html#c.PyConfig_SetString

Moreover, CPython implements functions taking wchar_t* string by
calling PyUnicode_FromWideChar() internally...


> PyUnicode_AsEncodedString() converts Unicode objects to a
> bytes object. This is not an symmetric replacement for the
> PyUnicode_Encode*() APIs, since those go from Py_UNICODE to
> a bytes object.

I don't see which feature is missing from PyUnicode_AsEncodedString().
If it's about parameters specific to some encodings like UTF-7, I
already replied in another email.


> Since the C API is not only meant to be used by the CPython interpreter,
> we should stick to standards rather than expecting the world to adapt
> to our implementations. This also makes the APIs future proof, e.g.
> in case we make another transition from the current hybrid internal
> data type for Unicode towards UTF-8 buffers as internal data type.

Do you know C extensions in the wild which are using wchar_t* on
purpose? I haven't seen such a C extension yet.

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/YW4FDLEHFXA6NKWZYVFOBNGPE33VQJ7U/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-30 Thread Richard Damon
On 6/30/20 8:43 AM, Emily Bowman wrote:
> I completely agree with this, that UTF-8 has become the One True
> Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside
> of the Win32 API. Nearly all basic emoji can't be represented in UCS-2
> wchar_t, let alone composite emoji.
>
> So how to make that C-compatible? Make everything a void* and it just
> comes back with as many bytes as it gets?

Actually, in C you would tend to represent UTF-8 as a char* (or maybe an
unsigned char*) type. This points out that straight 'ASCII' strings are
also UTF-8, and that many of the string functions will actually work OK
with UTF-8 strings. This was an intentional part of the design of UTF-8.
Anything looking for specific character values will tend to 'just work',
as long as those values really represent a character. The code also
needs to take into account that bytes != characters now, so if you want to
actually count how many characters are in a string, or avoid splitting
a string in the middle of a code point, you need to be aware of that, but a
lot will still just work.
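The two caveats come down to recognising continuation bytes (those of the form 10xxxxxx). A minimal Python sketch over a bytes object standing in for a char* buffer, with function names invented for the example:

```python
def count_code_points(data: bytes) -> int:
    """Count code points by skipping UTF-8 continuation bytes."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut to at most max_bytes without splitting a code point."""
    if max_bytes >= len(data):
        return data
    end = max_bytes
    # If the cut lands on a continuation byte, back up to its lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


data = "na\u00efve".encode("utf-8")  # 5 characters, 6 bytes
position = data.find(b"v")            # byte-oriented search still works
short = truncate_utf8(data, 3)        # a cut at 3 would split the i-diaeresis
```

Searching for an ASCII byte is safe because ASCII values never occur inside a multi-byte sequence, which is the 'just works' part of the design.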

-- 
Richard Damon
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/YQWLGMN2M4JDVFSOYGFMOPUB7QAAWH2U/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-30 Thread Emily Bowman
I completely agree with this, that UTF-8 has become the One True
Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside of the
Win32 API. Nearly all basic emoji can't be represented in UCS-2 wchar_t,
let alone composite emoji.

So how to make that C-compatible? Make everything a void* and it just comes
back with as many bytes as it gets?

On Tue, Jun 30, 2020 at 5:22 AM Richard Damon 
wrote:

> On 6/30/20 7:53 AM, M.-A. Lemburg wrote:
> > Since the C world has adopted wchar_t for this purpose, it's the
> > natural choice.
>
> I would disagree with this comment. Microsoft Windows has chosen to use
> 'wchar_t' for Unicode, because they adopted UCS-2 before it morphed into
> UTF-16 due to the expansion of Unicode above 16 bits. The *nix side of
> the world has chosen to use UTF-8 as the preferred way to store Unicode
> characters.
>
> Also, in Windows, wchar_t doesn't really meet the requirements for what
> C defines wchar_t to mean, as wchar_t is supposed to represent every
> character as a single unit, and thus would need to be at least a 21 bit
> type (typically, it would be a 32 bit type), but Windows makes it a 16
> bit type due to ABIs being locked before the Unicode expansion.
>
> --
> Richard Damon
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/KHFCEVSMTF6LIJAKHCAKTYAYWU6JEBNB/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-30 Thread Rhodri James

On 30/06/2020 13:16, Richard Damon wrote:
> On 6/30/20 7:53 AM, M.-A. Lemburg wrote:
>> Since the C world has adopted wchar_t for this purpose, it's the
>> natural choice.
>
> I would disagree with this comment. Microsoft Windows has chosen to use
> 'wchar_t' for Unicode, because they adopted UCS-2 before it morphed into
> UTF-16 due to the expansion of Unicode above 16 bits. The *nix side of
> the world has chosen to use UTF-8 as the preferred way to store Unicode
> characters.
>
> Also, in Windows, wchar_t doesn't really meet the requirements for what
> C defines wchar_t to mean, as wchar_t is supposed to represent every
> character as a single unit, and thus would need to be at least a 21 bit
> type (typically, it would be a 32 bit type), but Windows makes it a 16
> bit type due to ABIs being locked before the Unicode expansion.

Seconded.  I've had to do cross-platform (Linux and Windows)* unicode
work in C.  Using wchar_t was eventually rejected as infeasible.

* Sorry, I had a Blues Brothers moment.

--
Rhodri James *-* Kynesim Ltd
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/Y5SZNQYUULRY75CVHV34CSQTUI2FBUZ6/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-30 Thread Richard Damon
On 6/30/20 7:53 AM, M.-A. Lemburg wrote:
> Since the C world has adopted wchar_t for this purpose, it's the
> natural choice.

I would disagree with this comment. Microsoft Windows has chosen to use
'wchar_t' for Unicode, because they adopted UCS-2 before it morphed into
UTF-16 due to the expansion of Unicode above 16 bits. The *nix side of
the world has chosen to use UTF-8 as the preferred way to store Unicode
characters.

Also, in Windows, wchar_t doesn't really meet the requirements for what
C defines wchar_t to mean, as wchar_t is supposed to represent every
character as a single unit, and thus would need to be at least a 21 bit
type (typically, it would be a 32 bit type), but Windows makes it a 16
bit type due to ABIs being locked before the Unicode expansion.
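
The size argument can be checked from Python. This is only a small sketch; the sample character and the variable names are chosen for illustration, and the reported wchar_t size depends on the platform it runs on.

```python
import ctypes

# The highest Unicode code point, U+10FFFF, needs 21 bits.
MAX_CODE_POINT = 0x10FFFF
print(MAX_CODE_POINT.bit_length())    # 21

# A character outside the BMP takes two 16-bit units in UTF-16,
# but a single 32-bit unit in UTF-32.
ch = "\U0001F40D"                      # U+1F40D SNAKE
utf16_units = len(ch.encode("utf-16-le")) // 2
utf32_units = len(ch.encode("utf-32-le")) // 4
print(utf16_units, utf32_units)        # 2 1

# Size in bytes of wchar_t as seen by ctypes on this platform.
wchar_size = ctypes.sizeof(ctypes.c_wchar)
print(wchar_size)
```

With a 2-byte wchar_t, the character above cannot be held in a single wchar_t unit.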

-- 
Richard Damon
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/TA2ITVZY6ZGH2Y42JAXD243RSG7MONTV/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-30 Thread M.-A. Lemburg
On 29.06.2020 11:57, Victor Stinner wrote:
> On Sun, 28 Jun 2020 at 11:22, M.-A. Lemburg wrote:
>> as you may remember, I wasn't happy with the deprecations of the
>> APIs in PEP 393, since there are no C API alternatives for
>> the encoding APIs deprecated in the PEP, which allow direct
>> encoding provided by these important codecs.
>>
>> AFAIK, the situation hasn't changed since then.
> 
> I would prefer to analyze the list on a case by case basis. I don't
> think that it's useful to expose every single encoding supported by
> Python as a C function.

I designed the Unicode C API as a rich API, so that it's easy
to use from C extensions and the interpreter as well.

The main theme was to have a symmetric API for both encoding and
decoding. The PEP now suggests removing the API on the basis of
deprecating Py_UNICODE, which is a change in data type.

This does not mean we have to give up the symmetry in the C API,
or that the encoding APIs are now suddenly useless. It only means
that we have to replace Py_UNICODE with one of the supported data
types for storing Unicode.

Since the C world has adopted wchar_t for this purpose, it's the
natural choice.

> I would prefer to only have a fast-path for the most common encodings:
> ASCII, Latin1, UTF-8, Windows ANSI code page. That's all.
> 
> For any other encodings, the general PyUnicode_AsEncodedString() and
> PyUnicode_Decode() functions are good enough.

PyUnicode_AsEncodedString() converts Unicode objects to a
bytes object. This is not a symmetric replacement for the
PyUnicode_Encode*() APIs, since those go from a Py_UNICODE buffer to
a bytes object.
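
For a caller that only has a raw wide-character buffer, the object-based route takes two steps: build a Unicode object, then encode it. Below is a rough sketch of that route, driven from Python through ctypes.pythonapi. It assumes CPython; the helper name encode_wide_buffer is made up for the example, and only the two public functions named in the comments are called.

```python
import ctypes

api = ctypes.pythonapi

# PyObject *PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
api.PyUnicode_FromWideChar.argtypes = [ctypes.c_wchar_p, ctypes.c_ssize_t]
api.PyUnicode_FromWideChar.restype = ctypes.py_object

# PyObject *PyUnicode_AsEncodedString(PyObject *unicode,
#                                     const char *encoding, const char *errors)
api.PyUnicode_AsEncodedString.argtypes = [
    ctypes.py_object, ctypes.c_char_p, ctypes.c_char_p]
api.PyUnicode_AsEncodedString.restype = ctypes.py_object


def encode_wide_buffer(buf, length, encoding, errors=b"strict"):
    """Hypothetical helper: wchar_t buffer -> str object -> bytes object."""
    text = api.PyUnicode_FromWideChar(buf, length)   # step 1: copy into a str
    return api.PyUnicode_AsEncodedString(text, encoding, errors)  # step 2


buf = ctypes.create_unicode_buffer("caf\u00e9")      # a wchar_t[] buffer
encoded = encode_wide_buffer(buf, 4, b"latin-1")
print(encoded)                                        # b'caf\xe9'
```

The intermediate Unicode object is the extra cost compared with an API that encodes straight from the buffer.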

> If someone expects an overhead of passing a string, please prove it
> with a benchmark. But IMO a small overhead is acceptable for rare
> encodings.
> 
> Note: PyUnicode_AsEncodedString() and PyUnicode_Decode() also have
> "fast paths" for most common encodings: ASCII, UTF-8, "mbcs" (Python
> alias of the Windows ANSI code page), Latin1. But also UTF-16 and
> UTF-32: I'm not sure if it's really worth it to have these ones, but it was
> cheap to have them :-)
> 
> 
>> We can't just remove access to one half of a codec (the decoding
>> part) without at least providing an alternative for C extensions

Sorry, I meant the "encoding part".

>> to use.
> 
> I disagree, we can. The alternative has existed since Python 2:
> PyUnicode_AsEncodedString() and PyUnicode_Decode().

See above.

If we remove the direct encoding/decoding C APIs we should at the
very least provide generic alternatives which can be used as drop-in
replacements for the PyUnicode_Encode*() APIs.

>> Given PEP 393, this would be APIs which use wchar_t instead of
>> Py_UNICODE.
> 
> Using wchar_t is inefficient on all platforms using 16-bit wchar_t
> since surrogate pairs need a special code path. For example,
> PyUnicode_FromWideChar() has to scan the string twice: the first time
> to count the number of surrogate pairs, to allocate the exact buffer
> size.

If you want full UCS4 compatibility, that's true, but those platforms
suffer from this deficiency platform wide, so Python is in no way
special.

The main point is that wchar_t is the standard in C to represent
Unicode code points, so it's a natural choice as replacement for
Py_UNICODE.

Since the C API is not only meant to be used by the CPython interpreter,
we should stick to standards rather than expecting the world to adapt
to our implementations. This also makes the APIs future proof, e.g.
in case we make another transition from the current hybrid internal
data type for Unicode towards UTF-8 buffers as internal data type.

Cheers,
-- 
Marc-Andre Lemburg
eGenix.com

Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/XWYHZK6LPAM2MV3E7AXGKZSIPJ43MMFX/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-29 Thread Victor Stinner
On Mon, 29 Jun 2020 at 12:50, Inada Naoki wrote:
> > > ## PyUnicode_EncodeDecimal
> > >
> > > It is not documented.  It has not been deprecated by Py_DEPRECATED.
> > > Plan: Add Py_DEPRECATED in Python 3.9 and remove it in 3.11.
> >
> > I understood that the replacement function is the private
> > _PyUnicode_TransformDecimalAndSpaceToASCII() function. This function
> > is used by complex, float and int types to convert a string into a
> > number.
> >
>
> Should we make it public?

In the past, we exposed everything "just in case" someone would like to
use it. 30 years later, the C API has hundreds of functions, we don't
know which ones are used or not, the C API is not well tested, etc.

Unless there is a clear user request with a strong use case which
cannot be solved with existing functions, I suggest *not* adding any
new C API function.

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/KBTGRZLHNXPDN6CVP4CNMVMQN5Y3M5QS/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-29 Thread Inada Naoki
On Mon, Jun 29, 2020 at 6:51 PM Victor Stinner  wrote:
>
>
> I understand that these ".. deprecated" markups will be added to 3.8
> and 3.9 documentation, right?
>

They are documented as "Deprecated since version 3.3, will be removed
in version 4.0" already.
I am proposing s/4.0/3.10/ in 3.8 and 3.9 documents.

> For each function, it would be nice to suggest a replacement function.
> For example, PyUnicode_EncodeMBCS() (Py_UNICODE*) can be replaced with
> PyUnicode_EncodeCodePage() using code_page=CP_ACP (PyObject*).

Of course.

> > ## PyUnicode_EncodeDecimal
> >
> > It is not documented.  It has not been deprecated by Py_DEPRECATED.
> > Plan: Add Py_DEPRECATED in Python 3.9 and remove it in 3.11.
>
> I understood that the replacement function is the private
> _PyUnicode_TransformDecimalAndSpaceToASCII() function. This function
> is used by complex, float and int types to convert a string into a
> number.
>

Should we make it public?

>
> > ## _PyUnicode_ToLowercase, _PyUnicode_ToUppercase
> >
> > They are not deprecated by PEP 393, but bpo-12736.
> >
> > They are documented as deprecated, but don't have ``Py_DEPRECATED``.
> >
> > Plan: Add Py_DEPRECATED in 3.9, and remove them in 3.11.
> >
> > Note: _PyUnicode_ToTitlecase has Py_DEPRECATED. It can be removed in 3.10.
>
> bpo-12736 is "Request for python casemapping functions to use full not
> simple casemaps per Unicode's recommendation". IMHO the replacement
> function is to call the lower() and upper() methods of a Python str object.
>

We have private functions: _PyUnicode_ToTitleFull, _PyUnicode_ToLowerFull,
and _PyUnicode_ToUpperFull.
I am not sure whether we should make them public too.

> If you change the 3.9 documentation, please also update 3.8 doc.
>

I see.

-- 
Inada Naoki  
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/FMO57MPHTGZZULWL4RGEJHER3ZZFCYBO/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-29 Thread Victor Stinner
On Mon, 29 Jun 2020 at 12:36, Inada Naoki wrote:
> * UTF16 and UTF32; `int byteorder` parameter.

* UTF-16 byte_order=0 means "UTF-16" encoding (native byte order, with a BOM)
* UTF-16 byte_order<0 means "UTF-16-LE" encoding
* UTF-16 byte_order>0 means "UTF-16-BE" encoding

Same applies for UTF-32.
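
The three codec names can be compared directly from Python. This is a minimal sketch with an arbitrary sample string; it shows only the named codecs, not the integer byteorder argument of the C functions.

```python
import sys

text = "A\u20ac"                       # 'A' and the euro sign, both in the BMP

little = text.encode("utf-16-le")      # explicit little endian, no BOM
big = text.encode("utf-16-be")         # explicit big endian, no BOM
native = text.encode("utf-16")         # BOM first, then native byte order

print(little.hex())                    # 4100ac20
print(big.hex())                       # 004120ac

# After the 2-byte BOM, the generic codec matches the machine's byte order.
bom, payload = native[:2], native[2:]
print(payload == (little if sys.byteorder == "little" else big))   # True
```

Encoding with one of the explicit names avoids the BOM and gives the same bytes on every machine.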

> * UTF7;  int base64SetO, int base64WhiteSpace

Does anyone use these parameters? I would prefer to ensure that they
are used before continuing to maintain code to support these
parameters.

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/2RO2NFBNZTZQZ2O6PUNRDVKG25PLLURC/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-29 Thread Inada Naoki
Many existing public APIs don't have a `const char *errors` argument.
As there are very few users, we can ignore that limitation.

On the other hand, some encodings have special options.
* UTF16 and UTF32; `int byteorder` parameter.
* UTF7;  int base64SetO, int base64WhiteSpace

So PyUnicode_AsEncodedString cannot replace them.

Regards,
-- 
Inada Naoki  
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/KLICTHLKXCSHOMG3LEZLFTDMWJIJA2U4/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-29 Thread Victor Stinner
On Sun, 28 Jun 2020 at 17:21, Inada Naoki wrote:
> More aggressive idea: override current PyUnicode_EncodeXXX() apis.
> Change from `Py_UNICODE *object` to `PyObject *unicode`.
>
> This idea might look crazy.  But PyUnicode_EncodeXXX APIs are
> deprecated for a long time, and there are only a few users.
> I grepped from 3874 source packages in top 4000 downloaded packages.
> (126 packages are wheel-only)

IMO it's a violation of the C API stability warranty. I would prefer
to use different function names to ensure that building an old C
extension fails with a compiler error, rather than emit a compiler
warning and crash at runtime.

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/6KSSSYZLJK7J6CZIUYASUU53TQISEJ67/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-29 Thread Victor Stinner
On Mon, 29 Jun 2020 at 08:41, Inada Naoki wrote:
> That's all.
> Now I think it is safe to override the deprecated APIs with the private APIs
> that accept a Unicode object.
>
> * _PyUnicode_EncodeUTF7 -> PyUnicode_EncodeUTF7

Use PyUnicode_AsEncodedString() with the "UTF-7" encoding. This encoding
is not common enough to justify maintaining a public C API just for it.
Adding public C API functions has a cost in CPython, but also in
other Python implementations which then have to maintain them as well.

The C API is too large, we have to make it smaller, not larger.


> * _PyUnicode_AsUTF8String -> PyUnicode_EncodeUTF8

Use PyUnicode_AsUTF8String(), or PyUnicode_AsEncodedString() if you
need to pass errors.

> * _PyUnicode_EncodeUTF16 -> PyUnicode_EncodeUTF16

Use PyUnicode_AsUTF16String(), or PyUnicode_AsEncodedString() if you
need to pass errors or the byte order.

> * _PyUnicode_EncodeUTF32 -> PyUnicode_EncodeUTF32

Who uses UTF-32? There is PyUnicode_AsUTF32String().

> * _PyUnicode_AsLatin1String -> PyUnicode_EncodeLatin1

PyUnicode_AsLatin1String()

> * _PyUnicode_AsASCIIString -> PyUnicode_EncodeASCII

PyUnicode_AsASCIIString()

> * _PyUnicode_EncodeCharmap -> PyUnicode_EncodeCharmap

PyUnicode_AsCharmapString()
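
Several of these replacements can be tried without writing an extension module, by calling them from Python through ctypes.pythonapi. This is only a sketch: it assumes CPython, and the bind helper and the variable names are made up for the example.

```python
import ctypes

api = ctypes.pythonapi


def bind(name, argtypes):
    """Hypothetical helper: declare a C API function returning PyObject *."""
    func = getattr(api, name)
    func.argtypes = argtypes
    func.restype = ctypes.py_object
    return func


# Object-based encoders: PyObject *unicode -> bytes object.
as_utf8 = bind("PyUnicode_AsUTF8String", [ctypes.py_object])
as_latin1 = bind("PyUnicode_AsLatin1String", [ctypes.py_object])
as_ascii = bind("PyUnicode_AsASCIIString", [ctypes.py_object])
# Generic encoder, for when an error handler or a specific codec is needed.
as_encoded = bind("PyUnicode_AsEncodedString",
                  [ctypes.py_object, ctypes.c_char_p, ctypes.c_char_p])

text = "caf\u00e9"
utf8 = as_utf8(text)                                     # b'caf\xc3\xa9'
latin1 = as_latin1(text)                                 # b'caf\xe9'
plain = as_ascii("abc")                                  # b'abc'
ascii_replaced = as_encoded(text, b"ascii", b"replace")  # b'caf?'
utf16_be = as_encoded(text, b"utf-16-be", b"strict")
print(utf8, latin1, plain, ascii_replaced, utf16_be)
```

In this sketch the byte order is selected through the codec name passed to the generic function rather than through an integer argument.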

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/4B4HSQSW4ZFNOAJATV5XMPGWF5TSNZRN/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-29 Thread Victor Stinner
On Sun, 28 Jun 2020 at 11:22, M.-A. Lemburg wrote:
> as you may remember, I wasn't happy with the deprecations of the
> APIs in PEP 393, since there are no C API alternatives for
> the encoding APIs deprecated in the PEP, which allow direct
> encoding provided by these important codecs.
>
> AFAIK, the situation hasn't changed since then.

I would prefer to analyze the list on a case by case basis. I don't
think that it's useful to expose every single encoding supported by
Python as a C function.

I would prefer to only have a fast-path for the most common encodings:
ASCII, Latin1, UTF-8, Windows ANSI code page. That's all.

For any other encodings, the general PyUnicode_AsEncodedString() and
PyUnicode_Decode() functions are good enough.

If someone expects an overhead of passing a string, please prove it
with a benchmark. But IMO a small overhead is acceptable for rare
encodings.

Note: PyUnicode_AsEncodedString() and PyUnicode_Decode() also have
"fast paths" for most common encodings: ASCII, UTF-8, "mbcs" (Python
alias of the Windows ANSI code page), Latin1. But also UTF-16 and
UTF-32: I'm not sure if it's really worth it to have these ones, but it was
cheap to have them :-)


> We can't just remove access to one half of a codec (the decoding
> part) without at least providing an alternative for C extensions
> to use.

I disagree, we can. The alternative has existed since Python 2:
PyUnicode_AsEncodedString() and PyUnicode_Decode().


> Given PEP 393, this would be APIs which use wchar_t instead of
> Py_UNICODE.

Using wchar_t is inefficient on all platforms using 16-bit wchar_t
since surrogate pairs need a special code path. For example,
PyUnicode_FromWideChar() has to scan the string twice: the first time
to count the number of surrogate pairs, to allocate the exact buffer
size.
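
The two-pass idea can be sketched in Python. This is an illustration of the technique only, not the CPython code; the function name decode_utf16_units is hypothetical, and the input is assumed to be a list of 16-bit integers.

```python
def decode_utf16_units(units):
    """Turn a list of 16-bit code units into a list of code points."""
    # Pass 1: count surrogate pairs so the output can be sized exactly.
    pairs = 0
    i = 0
    while i < len(units) - 1:
        if 0xD800 <= units[i] <= 0xDBFF and 0xDC00 <= units[i + 1] <= 0xDFFF:
            pairs += 1
            i += 2
        else:
            i += 1
    out = [0] * (len(units) - pairs)

    # Pass 2: fill the pre-sized output, merging each pair into one code point.
    i = j = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            out[j] = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            out[j] = u          # BMP character, or a lone surrogate kept as-is
            i += 1
        j += 1
    return out


units = [0x0041, 0xD83D, 0xDC0D, 0x0042]   # 'A', U+1F40D as a pair, 'B'
code_points = decode_utf16_units(units)
print([hex(cp) for cp in code_points])     # ['0x41', '0x1f40d', '0x42']
```

With a 32-bit wchar_t there are no pairs to merge, so the length is known up front and the counting pass is unnecessary.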


Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/GANSSQW5TRR22H4TR3Z3QQ6XLZTO7MVL/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-29 Thread Victor Stinner
On Sun, 28 Jun 2020 at 04:39, Inada Naoki wrote:
> ## Documented and have Py_DEPRECATED
>
> * PyLong_FromUnicode
> * PyUnicode_AsUnicodeCopy
> * PyUnicode_Encode
> * PyUnicode_EncodeUTF7
> * PyUnicode_EncodeUTF8
> * PyUnicode_EncodeUTF16
> * PyUnicode_EncodeUTF32
> * PyUnicode_EncodeUnicodeEscape
> * PyUnicode_EncodeRawUnicodeEscape
> * PyUnicode_EncodeLatin1
> * PyUnicode_EncodeASCII
> * PyUnicode_EncodeCharmap
> * PyUnicode_TranslateCharmap
> * PyUnicode_EncodeMBCS
>
> These APIs are documented.  The document has ``.. deprecated:: 3.3
> 4.0`` directive.
> They have been `Py_DEPRECATED` since Python 3.6 too.
>
> Plan: Change the document to ``.. deprecated:: 3.3 3.10`` and remove
> them in Python 3.10.

".. deprecated" markups are nice, but not easy to discover. It would
help to add a "Deprecated" section to the C API Changes and list the
functions scheduled for removal in the next Python version:
https://docs.python.org/dev/whatsnew/3.9.html#c-api-changes

I understand that these ".. deprecated" markups will be added to 3.8
and 3.9 documentation, right?

For each function, it would be nice to suggest a replacement function.
For example, PyUnicode_EncodeMBCS() (Py_UNICODE*) can be replaced with
PyUnicode_EncodeCodePage() using code_page=CP_ACP (PyObject*).


> ## PyUnicode_EncodeDecimal
>
> It is not documented.  It has not been deprecated by Py_DEPRECATED.
> Plan: Add Py_DEPRECATED in Python 3.9 and remove it in 3.11.

I understood that the replacement function is the private
_PyUnicode_TransformDecimalAndSpaceToASCII() function. This function
is used by complex, float and int types to convert a string into a
number.
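
What such a transformation does can be approximated in pure Python. This is a simplified sketch, not the private C function: the name to_ascii_digits is hypothetical, and characters that are neither decimal digits nor whitespace are simply passed through here, which may differ from what the real function does.

```python
import unicodedata


def to_ascii_digits(text):
    """Map Unicode decimal digits to '0'-'9' and any whitespace to ' '."""
    out = []
    for ch in text:
        digit = unicodedata.decimal(ch, None)   # None if ch is not a decimal digit
        if digit is not None:
            out.append(str(digit))
        elif ch.isspace():
            out.append(" ")
        else:
            out.append(ch)                      # simplification: left untouched
    return "".join(out)


arabic_indic = "\u0661\u0662\u0663"    # ARABIC-INDIC DIGITS ONE, TWO, THREE
print(to_ascii_digits(arabic_indic))   # 123
print(int(arabic_indic))               # 123
```

The second print shows the user-visible effect: int() accepts decimal digits from other scripts.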


> ## PyUnicode_TransformDecimalToASCII
>
> It is documented, but doesn't have ``deprecated`` directive. It is not
> deprecated by Py_DEPRECATED.
>
> Plan: Add Py_DEPRECATED and ``deprecated 3.3 3.11`` directive in 3.9,
> and remove it in 3.11.

I don't think that we need to expose such function as part of the
public C API. IMHO it only was exposed to be consumed by Python
itself. So I don't think that we need to provide a replacement
function.

After the function is removed, if someone complains, we can
design a new replacement function. But I prefer to not *guess* what is
the exact use case.



> ## _PyUnicode_ToLowercase, _PyUnicode_ToUppercase
>
> They are not deprecated by PEP 393, but bpo-12736.
>
> They are documented as deprecated, but don't have ``Py_DEPRECATED``.
>
> Plan: Add Py_DEPRECATED in 3.9, and remove them in 3.11.
>
> Note: _PyUnicode_ToTitlecase has Py_DEPRECATED. It can be removed in 3.10.

bpo-12736 is "Request for python casemapping functions to use full not
simple casemaps per Unicode's recommendation". IMHO the replacement
is to call the lower() and upper() methods of a Python str object.
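
The difference between simple and full case mapping is visible from Python, since the str methods use the full mappings. A small sketch, with sample characters chosen for illustration:

```python
# Full case mapping can change the length of a string, which a
# one-code-point-in, one-code-point-out function cannot express.
sharp_s = "\u00df"                 # LATIN SMALL LETTER SHARP S
upper_sharp_s = sharp_s.upper()
print(upper_sharp_s)               # SS

dotted_i = "\u0130"                # LATIN CAPITAL LETTER I WITH DOT ABOVE
lower_dotted_i = dotted_i.lower()  # 'i' followed by COMBINING DOT ABOVE
print(len(lower_dotted_i))         # 2
```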

If you change the 3.9 documentation, please also update 3.8 doc.

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/TDP27YPEZH7CWKQ52ZEPW5YIIDXSLS55/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-29 Thread Inada Naoki
On Mon, Jun 29, 2020 at 12:17 AM Inada Naoki  wrote:
>
>
> More aggressive idea: override current PyUnicode_EncodeXXX() apis.
> Change from `Py_UNICODE *object` to `PyObject *unicode`.
>

This is a list of PyUnicode_Encode usage in the top 4000 packages.
https://gist.github.com/methane/0f97391c9dbf5b53a818aa39a8285a29

Scandir uses PyUnicode_EncodeMBCS only in a `#if PY_MAJOR_VERSION < 3 &&
defined(MS_WINDOWS)` block.
So it is a false positive.

Cython has prototypes of these APIs.  pyodbc uses PyUnicode_EncodeUTF16
and PyUnicode_EncodeUTF8.
But pyodbc is converting a Unicode object into a bytes object, so the
current API is very inefficient for it.

That's all.
Now I think it is safe to override the deprecated APIs with the private APIs
that accept a Unicode object.

* _PyUnicode_EncodeUTF7 -> PyUnicode_EncodeUTF7
* _PyUnicode_AsUTF8String -> PyUnicode_EncodeUTF8
* _PyUnicode_EncodeUTF16 -> PyUnicode_EncodeUTF16
* _PyUnicode_EncodeUTF32 -> PyUnicode_EncodeUTF32
* _PyUnicode_AsLatin1String -> PyUnicode_EncodeLatin1
* _PyUnicode_AsASCIIString -> PyUnicode_EncodeASCII
* _PyUnicode_EncodeCharmap -> PyUnicode_EncodeCharmap

-- 
Inada Naoki  
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/7PC5SYKPZSQQVC6KPOMO6GLGMOXGE76U/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-28 Thread Inada Naoki
On Sun, Jun 28, 2020 at 11:24 PM Inada Naoki  wrote:
>
>
> So how about making them public, instead of undeprecating the
> Py_UNICODE* encode APIs?
>
> 1. Add PyUnicode_AsXXXBytes public APIs in Python 3.10.
>    Current private APIs can become macros (e.g. #define
>    _PyUnicode_AsASCIIString PyUnicode_AsAsciiBytes),
>    or deprecated static inline functions.
> 2. Remove the Py_UNICODE* encode APIs in Python 3.12.
>

More aggressive idea: override the current PyUnicode_EncodeXXX() APIs.
Change from `Py_UNICODE *object` to `PyObject *unicode`.

This idea might look crazy.  But the PyUnicode_EncodeXXX APIs have been
deprecated for a long time, and there are only a few users.
I grepped 3874 source packages from the top 4000 downloaded packages.
(126 packages are wheel-only)

$ rg -w PyUnicode_EncodeASCII
Cython-0.29.20/Cython/Includes/cpython/unicode.pxd
424:bytes PyUnicode_EncodeASCII(Py_UNICODE *s, Py_ssize_t size, char *errors)

$ rg -w PyUnicode_EncodeLatin1
Cython-0.29.20/Cython/Includes/cpython/unicode.pxd
406:bytes PyUnicode_EncodeLatin1(Py_UNICODE *s, Py_ssize_t size, char *errors)

$ rg -w PyUnicode_EncodeUTF7
(no output)

$ rg -w PyUnicode_EncodeUTF8
subprocess32-3.5.4/_posixsubprocess_helpers.c
38:return PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(unicode),

pyodbc-4.0.30/src/params.cpp
1932:bytes = PyUnicode_EncodeUTF8(source, cb, "strict");

pyodbc-4.0.30/src/cnxninfo.cpp
45:Object bytes(PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(p), PyUnicode_GET_SIZE(p), 0));
50:Object bytes(PyUnicode_Check(p) ? PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(p), PyUnicode_GET_SIZE(p), 0) : 0);

Cython-0.29.20/Cython/Includes/cpython/unicode.pxd
304:bytes PyUnicode_EncodeUTF8(Py_UNICODE *s, Py_ssize_t size, char *errors)

Note that subprocess32 is a Python 2 only project.  Only pyodbc-4.0.30
uses this API.
https://github.com/mkleehammer/pyodbc/blob/b4ea03220dd8243e452c91689bef34823b2f7d8f/src/params.cpp#L1926-L1942
https://github.com/mkleehammer/pyodbc/blob/master/src/cnxninfo.cpp#L45

Anyway, the current PyUnicode_EncodeXXX APIs are not commonly used.
I don't think they are worth undeprecating.

Regards,

-- 
Inada Naoki  
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/SFZ64X5JIOERQYCGGAD63FLRTJ657WWM/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-28 Thread Inada Naoki
Hi, Lemburg.

Thank you for the quick response.

>
> We can't just remove access to one half of a codec (the decoding
> part) without at least providing an alternative for C extensions
> to use.
>
> Py_UNICODE can be removed from the API, but only if there are
> alternative APIs which C extensions can use to the same effect.
>
> Given PEP 393, this would be APIs which use wchar_t instead of
> Py_UNICODE.
>

The decoding part is implemented as `const char *` -> `PyObject *`
(Unicode object).
I think this is reasonable, since `const char *` is a perfect abstraction
for the encoded string.

For the encoding part, `wchar_t *` is not a perfect abstraction for the
(decoded) Unicode string.  Converting from a Unicode object into
`wchar_t *` is not zero-cost.
I think `PyObject *` (Unicode object) -> `PyObject *` (bytes object)
is a better signature for encoders than
`wchar_t *` -> `PyObject *` (bytes object), because:

* The Unicode object is more important than `wchar_t *` in Python.
* All PyUnicode_EncodeXXX APIs are implemented with PyUnicode_FromWideChar.

For example, we have these private encode APIs:

* PyObject* _PyUnicode_AsASCIIString(PyObject *unicode, const char *errors)
* PyObject* _PyUnicode_AsLatin1String(PyObject *unicode, const char *errors)
* PyObject* _PyUnicode_AsUTF8String(PyObject *unicode, const char *errors)
* PyObject* _PyUnicode_EncodeUTF16(PyObject *unicode, const char *errors, int byteorder)
...

So how about making them public, instead of undeprecating the
Py_UNICODE* encode APIs?

1. Add PyUnicode_AsXXXBytes public APIs in Python 3.10.
   Current private APIs can become macros (e.g. #define
   _PyUnicode_AsASCIIString PyUnicode_AsAsciiBytes),
   or deprecated static inline functions.
2. Remove the Py_UNICODE* encode APIs in Python 3.12.

Regards,

-- 
Inada Naoki  
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/Z4MT6IJBVWP2QOV3OMVJ32BZ5HLH5DG5/


[Python-Dev] Re: Plan to remove Py_UNICODE APis except PEP 623.

2020-06-28 Thread M.-A. Lemburg
Hi Inada-san,

as you may remember, I wasn't happy with the deprecations of the
APIs in PEP 393, since there are no C API alternatives for
the encoding APIs deprecated in the PEP, which allow direct
encoding provided by these important codecs.

AFAIK, the situation hasn't changed since then.

We can't just remove access to one half of a codec (the decoding
part) without at least providing an alternative for C extensions
to use.

Py_UNICODE can be removed from the API, but only if there are
alternative APIs which C extensions can use to the same effect.

Given PEP 393, this would be APIs which use wchar_t instead of
Py_UNICODE.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com



On 28.06.2020 04:35, Inada Naoki wrote:
> Hi, all.
> 
> I proposed PEP 623 to remove Unicode APIs deprecated by PEP 393.
> 
> In this thread, I am proposing removal of Py_UNICODE (not Unicode
> objects) APIs deprecated by PEP 393.
> Please reply for any comments.
> 
> 
> ## Undocumented, have Py_DEPRECATED
> 
> There is no problem to remove them in Python 3.10.  I will just do it.
> 
> * Py_UNICODE_str*** functions -- already removed in
> https://github.com/python/cpython/pull/21164
> * PyUnicode_GetMax()
> 
> 
> ## Documented and have Py_DEPRECATED
> 
> * PyLong_FromUnicode
> * PyUnicode_AsUnicodeCopy
> * PyUnicode_Encode
> * PyUnicode_EncodeUTF7
> * PyUnicode_EncodeUTF8
> * PyUnicode_EncodeUTF16
> * PyUnicode_EncodeUTF32
> * PyUnicode_EncodeUnicodeEscape
> * PyUnicode_EncodeRawUnicodeEscape
> * PyUnicode_EncodeLatin1
> * PyUnicode_EncodeASCII
> * PyUnicode_EncodeCharmap
> * PyUnicode_TranslateCharmap
> * PyUnicode_EncodeMBCS
> 
> These APIs are documented.  The document has ``.. deprecated:: 3.3
> 4.0`` directive.
> They have been `Py_DEPRECATED` since Python 3.6 too.
> 
> Plan: Change the document to ``.. deprecated:: 3.3 3.10`` and remove
> them in Python 3.10.
> 
> 
> ## PyUnicode_EncodeDecimal
> 
> It is not documented.  It has not been deprecated by Py_DEPRECATED.
> 
> Plan: Add Py_DEPRECATED in Python 3.9 and remove it in 3.11.
> 
> 
> ## PyUnicode_TransformDecimalToASCII
> 
> It is documented, but doesn't have ``deprecated`` directive. It is not
> deprecated by Py_DEPRECATED.
> 
> Plan: Add Py_DEPRECATED and ``deprecated 3.3 3.11`` directive in 3.9,
> and remove it in 3.11.
> 
> 
> ## _PyUnicode_ToLowercase, _PyUnicode_ToUppercase
> 
> They are not deprecated by PEP 393, but bpo-12736.
> They are documented as deprecated, but don't have ``Py_DEPRECATED``.
> 
> Plan: Add Py_DEPRECATED in 3.9, and remove them in 3.11.
> 
> Note: _PyUnicode_ToTitlecase has Py_DEPRECATED. It can be removed in 3.10.
> 
> 
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/RFVZQJIM3AZ4IJDC6QKWFH4PQZYQGRQD/