[issue45025] Reliance on C bit fields in C API is undefined behavior

2021-08-31 Thread STINNER Victor


STINNER Victor  added the comment:

> My use case for these low-level APIs is to write tests for low-level 
> string/encoding handling in my custom use of the PyPreConfig and PyConfig 
> structs. I wanted to verify that exact byte sequences were turned into 
> specific representations inside of Python strings. This includes ensuring 
> that certain byte sequences retain their appropriate "character" width in 
> internal storage.

CPython contains many checks, especially in debug mode, to ensure that a string 
always uses the most efficient storage. The C API should not allow creating a 
string with inefficient storage, unless you "abuse" the C API :-D

I'm not sure what you are testing.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45025] Reliance on C bit fields in C API is undefined behavior

2021-08-30 Thread Gregory Szorc


Gregory Szorc  added the comment:

My use case for these low-level APIs is to write tests for low-level 
string/encoding handling in my custom use of the PyPreConfig and PyConfig 
structs. I wanted to verify that exact byte sequences were turned into specific 
representations inside of Python strings. This includes ensuring that certain 
byte sequences retain their appropriate "character" width in internal storage.

I know there are alternative ways of performing this testing. But testing 
against the actual data structure used internally by CPython seemed the most 
precise since it isolates problems to the "store in Python" side of the problem 
and not "what does Python do once the data is stored."

--




[issue45025] Reliance on C bit fields in C API is undefined behavior

2021-08-30 Thread STINNER Victor


STINNER Victor  added the comment:

> PyUnicode_KIND does *not* expose the implementation details to the programmer.

PyUnicode_KIND() is very specific to the exact PEP 393 implementation. 
Documentation of this field:
---
/* Character size:

   - PyUnicode_WCHAR_KIND (0):

     * character type = wchar_t (16 or 32 bits, depending on the
       platform)

   - PyUnicode_1BYTE_KIND (1):

     * character type = Py_UCS1 (8 bits, unsigned)
     * all characters are in the range U+0000-U+00FF (latin1)
     * if ascii is set, all characters are in the range U+0000-U+007F
       (ASCII), otherwise at least one character is in the range
       U+0080-U+00FF

   - PyUnicode_2BYTE_KIND (2):

     * character type = Py_UCS2 (16 bits, unsigned)
     * all characters are in the range U+0000-U+FFFF (BMP)
     * at least one character is in the range U+0100-U+FFFF

   - PyUnicode_4BYTE_KIND (4):

     * character type = Py_UCS4 (32 bits, unsigned)
     * all characters are in the range U+0000-U+10FFFF
     * at least one character is in the range U+10000-U+10FFFF
 */
unsigned int kind:3;
---

I don't think that PyUnicode_KIND() makes sense if CPython uses UTF-8 tomorrow.
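
A minimal sketch of the rule this comment documents, assuming the usual 
boundaries (Latin-1 ends at U+00FF, the BMP at U+FFFF): the kind is the 
narrowest width that can hold the largest code point in the string. The 
`demo_*` and `DEMO_*` names are hypothetical and are not part of the CPython 
API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants: bytes per character for the three PEP 393 widths. */
enum { DEMO_1BYTE_KIND = 1, DEMO_2BYTE_KIND = 2, DEMO_4BYTE_KIND = 4 };

/* Return the narrowest kind able to store every code point in `chars`.
   An empty string falls back to the 1-byte kind. */
static int demo_min_kind(const uint32_t *chars, size_t n)
{
    uint32_t max = 0;
    for (size_t i = 0; i < n; i++) {
        assert(chars[i] <= 0x10FFFF);   /* must be a valid Unicode code point */
        if (chars[i] > max) {
            max = chars[i];
        }
    }
    if (max <= 0xFF) {
        return DEMO_1BYTE_KIND;         /* Latin-1, including ASCII */
    }
    if (max <= 0xFFFF) {
        return DEMO_2BYTE_KIND;         /* Basic Multilingual Plane */
    }
    return DEMO_4BYTE_KIND;             /* anything up to U+10FFFF */
}
```

A value like this only has meaning for a fixed-width representation, which is 
why it would not carry over to a UTF-8 based implementation.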


> If the internal representation of strings is switched to use masks and shifts 
> instead of bitfields, PyUnicode_KIND (and others) can be adapted to the new 
> details without breaking API compatibility.

PyUnicode_KIND() was exposed in the *public* C API because unicodeobject.h 
implements functions as macros for best performance, and these macros use 
PyUnicode_KIND() internally.

Macros like PyUnicode_READ(kind, data, index) are also designed for best 
performance with the exact PEP 393 implementation.

The public C API should only contain PyUnicode_READ_CHAR(unicode, index): this 
macro doesn't use "kind" or "data" which are (again) specific to PEP 393.

In the CPython implementation, we should use the most efficient code; it's fine 
to use macros that access structures directly.

But for the public C API, I would recommend only providing abstractions, even 
if they are a little bit slower.
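
A minimal sketch of such an abstraction, assuming a hypothetical `demo_str` 
type that is not CPython's real structure: the caller passes only the string 
and an index, as with PyUnicode_READ_CHAR(unicode, index), and never sees 
"kind" or "data".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical string type for illustration; NOT PyUnicodeObject. */
typedef struct {
    int kind;            /* bytes per character: 1, 2 or 4 */
    size_t length;       /* number of characters */
    const void *data;    /* array of uint8_t, uint16_t or uint32_t */
} demo_str;

/* The only entry point a caller needs. The dispatch on `kind` costs a
   branch per call, which is the "little bit slower" part. */
static uint32_t demo_read_char(const demo_str *s, size_t index)
{
    assert(index < s->length);
    switch (s->kind) {
    case 1:
        return ((const uint8_t *)s->data)[index];
    case 2:
        return ((const uint16_t *)s->data)[index];
    default:
        return ((const uint32_t *)s->data)[index];
    }
}
```

If the storage changed, only the body of `demo_read_char()` would need to be 
rewritten; code calling it would stay the same.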

--




[issue45025] Reliance on C bit fields in C API is undefined behavior

2021-08-30 Thread Petr Viktorin


Petr Viktorin  added the comment:

PyUnicode_KIND does *not* expose the implementation details to the programmer.

If the internal representation of strings is switched to use masks and shifts 
instead of bitfields, PyUnicode_KIND (and others) can be adapted to the new 
details without breaking API compatibility.
And that switch would fix this issue.
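
A minimal sketch of that idea, assuming a hypothetical bit assignment (the 
`DEMO_*` positions below are invented for illustration and are not CPython's 
layout): the state lives in one plain `unsigned int`, and the accessors keep 
the same signature however the bits are arranged.

```c
#include <assert.h>

/* Hypothetical layout: bits 2-4 hold the kind, bit 6 holds the ascii flag. */
#define DEMO_KIND_SHIFT  2u
#define DEMO_KIND_MASK   (0x7u << DEMO_KIND_SHIFT)
#define DEMO_ASCII_SHIFT 6u
#define DEMO_ASCII_MASK  (0x1u << DEMO_ASCII_SHIFT)

typedef struct {
    unsigned int state;   /* plain integer: the masks define the layout */
} demo_string;

/* Accessors in the spirit of PyUnicode_KIND() and PyUnicode_IS_ASCII(). */
static unsigned int demo_kind(const demo_string *s)
{
    return (s->state & DEMO_KIND_MASK) >> DEMO_KIND_SHIFT;
}

static int demo_is_ascii(const demo_string *s)
{
    return (s->state & DEMO_ASCII_MASK) != 0;
}

static void demo_set_kind(demo_string *s, unsigned int kind)
{
    assert(kind <= 7u);   /* must fit in the 3-bit field */
    s->state = (s->state & ~DEMO_KIND_MASK) | (kind << DEMO_KIND_SHIFT);
}

static void demo_set_ascii(demo_string *s, int ascii)
{
    if (ascii) {
        s->state |= DEMO_ASCII_MASK;
    } else {
        s->state &= ~DEMO_ASCII_MASK;
    }
}
```

Because the bit positions are spelled out in the masks, a non-C consumer can 
reproduce the same arithmetic without knowing how a particular compiler packs 
bit fields.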

--




[issue45025] Reliance on C bit fields in C API is undefined behavior

2021-08-30 Thread STINNER Victor


STINNER Victor  added the comment:

> In order to avoid undefined behavior, Python's C API should avoid all use of 
> bit fields.

See also PEP 620. More generally, IMO the C API should not expose structures, 
but provide ways to access them through getter and setter functions.
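
A minimal sketch of this pattern, assuming a hypothetical `demo_text` type 
(none of these names exist in CPython): the public declarations mention only 
an opaque pointer, so the private struct can keep its bit fields without any 
caller depending on their layout.

```c
#include <assert.h>
#include <stdlib.h>

/* --- What a public header would declare (hypothetical names) --- */
typedef struct demo_text demo_text;            /* opaque: no fields visible */
demo_text *demo_text_new(unsigned int kind, int ascii);
void demo_text_free(demo_text *t);
unsigned int demo_text_kind(const demo_text *t);
int demo_text_is_ascii(const demo_text *t);
void demo_text_set_ascii(demo_text *t, int ascii);

/* --- What stays private to the implementation file --- */
struct demo_text {
    unsigned int kind : 3;    /* only this file ever touches the bit fields */
    unsigned int ascii : 1;
};

demo_text *demo_text_new(unsigned int kind, int ascii)
{
    assert(kind == 1 || kind == 2 || kind == 4);
    demo_text *t = malloc(sizeof(*t));
    if (t != NULL) {
        t->kind = kind;
        t->ascii = ascii ? 1u : 0u;
    }
    return t;
}

void demo_text_free(demo_text *t) { free(t); }

unsigned int demo_text_kind(const demo_text *t) { return t->kind; }

int demo_text_is_ascii(const demo_text *t) { return t->ascii != 0; }

void demo_text_set_ascii(demo_text *t, int ascii) { t->ascii = ascii ? 1u : 0u; }
```

The trade-off is one function call per access instead of a direct field read.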

See bpo-40120 "Undefined C behavior going beyond end of struct via a [1] 
arrays" which is a similar issue.

--




[issue45025] Reliance on C bit fields in C API is undefined behavior

2021-08-30 Thread STINNER Victor


STINNER Victor  added the comment:

> The macro PyUnicode_KIND is part of the documented public C API.

IMO it was a mistake to expose it as part of the public C API. This is an 
implementation detail which should not be exposed. The C API should not expose 
*directly* how characters are stored in memory, but provide an abstract way to 
read and write Unicode characters.

The PEP 393 implementation broke the old C API in many ways because it exposed 
too many implementation details. Sadly, the new C API is... not better :-(

If tomorrow, CPython is modified to use UTF-8 internally (as PyPy does), the C 
API will likely be broken *again* in many (new funny) ways.

11 years after PEP 393 (Python 3.3), we are only starting to fix the old C API 
:-( The work will be completed in 2 or 3 Python releases (Python 3.12 or 3.13):

* https://www.python.org/dev/peps/pep-0623/
* https://www.python.org/dev/peps/pep-0624/

The C API for Unicode strings is causing a lot of issues in PyPy which uses 
UTF-8 internally. C extensions can fail to build on PyPy if they use functions 
(macros) like PyUnicode_KIND().

--
nosy: +methane, serhiy.storchaka




[issue45025] Reliance on C bit fields in C API is undefined behavior

2021-08-30 Thread Petr Viktorin


Petr Viktorin  added the comment:

The macro PyUnicode_KIND is part of the documented public C API. It accesses 
the bit field "state.kind" directly.

--




[issue45025] Reliance on C bit fields in C API is undefined behavior

2021-08-30 Thread STINNER Victor


STINNER Victor  added the comment:

> At least the PyASCIIObject struct in Include/cpython/unicodeobject.h uses bit 
> fields. Various preprocessor macros like PyUnicode_IS_ASCII() and 
> PyUnicode_KIND() access this struct's bit field.

What is your use case? Which functions do you need?

You should not access the PyASCIIObject structure directly. Python provides 
many functions to access the content of a Unicode string object.

--




[issue45025] Reliance on C bit fields in C API is undefined behavior

2021-08-30 Thread Georg Brandl


Change by Georg Brandl :


--
nosy: +georg.brandl




[issue45025] Reliance on C bit fields in C API is undefined behavior

2021-08-30 Thread Georg Brandl


Change by Georg Brandl :


--
nosy: +vstinner




[issue45025] Reliance on C bit fields in C API is undefined behavior

2021-08-27 Thread Erlend E. Aasland


Change by Erlend E. Aasland :


--
nosy: +petr.viktorin




[issue45025] Reliance on C bit fields in C API is undefined behavior

2021-08-26 Thread Gregory Szorc


Change by Gregory Szorc :


--
title: Reliance on C bit fields is C API is undefined behavior -> Reliance on C 
bit fields in C API is undefined behavior
