On 7/2/20 10:19 AM, Victor Stinner wrote:
> Do you mean UTF-16 and UTF-32? UTF-16 supports the whole Unicode
> character set but uses the annoying surrogate pairs for characters
> outside the BMP.*
Minor quibble, UTF-16 handles all of the CURRENTLY defined Unicode set,
and there is currently a
On 2020-07-02 14:57, Victor Stinner wrote:
Le jeu. 2 juil. 2020 à 14:44, Barry Scott a écrit :
It's not obvious to me why the latin1 encoding is in this list as it's just one
of all the 8-bit char sets.
Why is it needed?
The Latin-1 (ISO 8859-1) charset is kind of special: it maps bytes
UCS-2 means units of 16 bits so it's limited to Unicode BMP: U+0000-U+FFFF.
UCS-4 means units of 32 bits and so gives access to the whole
(current) Unicode character set.
Do you mean UTF-16 and UTF-32? UTF-16 supports the whole Unicode
character set but uses the annoying surrogate pairs for
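The difference between 16-bit and 32-bit units can be seen from Python itself. The following is only a minimal sketch: the variable names are made up for illustration, and U+1F600 is just one arbitrary character outside the BMP. It encodes that character as UTF-16 and UTF-32 and checks that the UTF-16 form is a surrogate pair.

```python
# U+1F600 lies outside the BMP (above U+FFFF); chosen only as an example.
ch = "\U0001F600"

utf16 = ch.encode("utf-16-be")  # big-endian, no BOM
utf32 = ch.encode("utf-32-be")  # big-endian, no BOM

# Split the UTF-16 bytes into 16-bit code units.
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

# A non-BMP character needs two units: a high and a low surrogate.
high, low = units
is_surrogate_pair = 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF

print([hex(u) for u in units], is_surrogate_pair, len(utf32))
```

A single 16-bit unit cannot hold this character, which is why a UCS-2-only buffer cannot represent it, while one 32-bit unit is enough.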
> On 30 Jun 2020, at 13:43, Emily Bowman wrote:
>
> I completely agree with this, that UTF-8 has become the One True
> Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside of the
> Win32 API. Nearly all basic emoji can't be represented in UCS-2 wchar_t, let
> alone composite
Le jeu. 2 juil. 2020 à 14:44, Barry Scott a écrit :
> It's not obvious to me why the latin1 encoding is in this list as it's just
> one of all the 8-bit char sets.
> Why is it needed?
The Latin-1 (ISO 8859-1) charset is kind of special: it maps bytes
0x00-0xFF to Unicode characters U+0000-U+00FF
Latin-1 is the encoding that maps every byte (0-255) to the Unicode
character with the same number. So it's special in that sense, and it
gets used when mapping 8-bit bytes via Unicode "without encoding".
Excuse my imprecise language here, I don't know the correct Unicode
terms without going &
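The byte-to-code-point identity described above can be checked directly from Python. This is a small sketch with illustrative variable names, not code from the thread:

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding as Latin-1 never fails: each byte becomes one character.
decoded = all_bytes.decode("latin-1")

# Each character's code point equals the byte it came from.
same_numbers = all(ord(c) == b for c, b in zip(decoded, all_bytes))

# Encoding back gives the original bytes, so the round trip is lossless.
round_trip = decoded.encode("latin-1") == all_bytes

print(len(decoded), same_numbers, round_trip)
```

No other 8-bit charset has this property for all 256 values, which is what makes a Latin-1 fast path cheap: the conversion is a plain widening copy.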
> On 29 Jun 2020, at 10:57, Victor Stinner wrote:
>
> I would prefer to only have a fast-path for the most common encodings:
> ASCII, Latin1, UTF-8, Windows ANSI code page. That's all.
It's not obvious to me why the latin1 encoding is in this list as it's just one
of all the 8-bit char sets.
On Thu, Jul 2, 2020 at 5:20 AM M.-A. Lemburg wrote:
>
>
> The reasoning here is the same as for decoding: you have the original
> data you want to process available in some array and want to turn
> this into the Python object.
>
> The path Victor suggested requires always going via a Python
On 7/1/2020 1:20 PM, M.-A. Lemburg wrote:
As an example application, think of a database module which provides
the Unicode data as Py_UNICODE buffer. You want to write this as UTF-8
data to a file or a socket, so you have the PyUnicode_EncodeUTF8() API
encode this for you into a bytes object
On 30.06.2020 15:17, Victor Stinner wrote:
> Le mar. 30 juin 2020 à 13:53, M.-A. Lemburg a écrit :
>>> I would prefer to analyze the list on a case by case basis. I don't
>>> think that it's useful to expose every single encoding supported by
>>> Python as a C function.
>>
>> (...)
>> This does
On 28.06.2020 16:24, Inada Naoki wrote:
> Hi, Lemburg.
>
> Thank you for the quick response.
>
>>
>> We can't just remove access to one half of a codec (the decoding
>> part) without at least providing an alternative for C extensions
>> to use.
>>
>> Py_UNICODE can be removed from the API, but only
On 30/06/2020 13:43, Emily Bowman wrote:
I completely agree with this, that UTF-8 has become the One True
Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside of the
Win32 API. Nearly all basic emoji can't be represented in UCS-2 wchar_t,
let alone composite emoji.
You say that
Le mar. 30 juin 2020 à 13:53, M.-A. Lemburg a écrit :
> > I would prefer to analyze the list on a case by case basis. I don't
> > think that it's useful to expose every single encoding supported by
> > Python as a C function.
>
> (...)
> This does not mean we have to give up the symmetry in the C
On 6/30/20 8:43 AM, Emily Bowman wrote:
> I completely agree with this, that UTF-8 has become the One True
> Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside
> of the Win32 API. Nearly all basic emoji can't be represented in UCS-2
> wchar_t, let alone composite emoji.
>
> So
I completely agree with this, that UTF-8 has become the One True
Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside of the
Win32 API. Nearly all basic emoji can't be represented in UCS-2 wchar_t,
let alone composite emoji.
So how to make that C-compatible? Make everything a
On 30/06/2020 13:16, Richard Damon wrote:
On 6/30/20 7:53 AM, M.-A. Lemburg wrote:
Since the C world has adopted wchar_t for this purpose, it's the
natural choice.
I would disagree with this comment. Microsoft Windows has chosen to use
'wchar_t' for Unicode, because they adopted UCS-2 before
On 6/30/20 7:53 AM, M.-A. Lemburg wrote:
> Since the C world has adopted wchar_t for this purpose, it's the
> natural choice.
I would disagree with this comment. Microsoft Windows has chosen to use
'wchar_t' for Unicode, because they adopted UCS-2 before it morphed into
UTF-16 due to the
On 29.06.2020 11:57, Victor Stinner wrote:
> Le dim. 28 juin 2020 à 11:22, M.-A. Lemburg a écrit :
>> as you may remember, I wasn't happy with the deprecations of the
>> APIs in PEP 393, since there are no C API alternatives for
>> the encoding APIs deprecated in the PEP, which allow direct
>>
Le lun. 29 juin 2020 à 12:50, Inada Naoki a écrit :
> > > ## PyUnicode_EncodeDecimal
> > >
> > > It is not documented. It has not been deprecated by Py_DEPRECATED.
> > > Plan: Add Py_DEPRECATED in Python 3.9 and remove it in 3.11.
> >
> > I understood that the replacement function is the private
On Mon, Jun 29, 2020 at 6:51 PM Victor Stinner wrote:
>
>
> I understand that these ".. deprecated" markups will be added to 3.8
> and 3.9 documentation, right?
>
They are documented as "Deprecated since version 3.3, will be removed
in version 4.0" already.
I am proposing s/4.0/3.10/ in 3.8 and
Le lun. 29 juin 2020 à 12:36, Inada Naoki a écrit :
> * UTF16 and UTF32; `int byteorder` parameter.
* UTF-16 byte_order=0 means "UTF-16" encoding (native byte order, with a BOM)
* UTF-16 byte_order<0 means "UTF-16-LE" encoding
* UTF-16 byte_order>0 means "UTF-16-BE" encoding
Same applies for UTF-32.
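The three codec names in the list above can be compared from Python. This is a hedged sketch that uses the Python-level codecs rather than the C functions under discussion; the variable names are illustrative only.

```python
import sys

text = "A"

le = text.encode("utf-16-le")       # little-endian, no BOM
be = text.encode("utf-16-be")       # big-endian, no BOM
with_bom = text.encode("utf-16")    # native byte order, BOM prepended

# Work out what "native" means on the machine running this.
native = le if sys.byteorder == "little" else be
bom = b"\xff\xfe" if sys.byteorder == "little" else b"\xfe\xff"

print(le, be, with_bom == bom + native)
```

The explicit "-LE" and "-BE" codecs never write a BOM, while the plain "UTF-16" codec does, so a single integer parameter is enough to select among the three.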
> * UTF7; int
Many existing public APIs don't have a `const char *errors` argument.
As there are very few users, we can ignore that limitation.
On the other hand, some encodings have special options.
* UTF16 and UTF32; `int byteorder` parameter.
* UTF7; int base64SetO, int base64WhiteSpace
So
Le dim. 28 juin 2020 à 17:21, Inada Naoki a écrit :
> More aggressive idea: override current PyUnicode_EncodeXXX() apis.
> Change from `Py_UNICODE *object` to `PyObject *unicode`.
>
> This idea might look crazy. But PyUnicode_EncodeXXX APIs are
> deprecated for a long time, and there are only a
Le lun. 29 juin 2020 à 08:41, Inada Naoki a écrit :
> That's all.
> Now I think it is safe to override deprecated APIs with private APIs
> that accept Unicode objects.
>
> * _PyUnicode_EncodeUTF7 -> PyUnicode_EncodeUTF7
Use PyUnicode_AsEncodedString("UTF-7"). This encoding is not common
enough to
Le dim. 28 juin 2020 à 11:22, M.-A. Lemburg a écrit :
> as you may remember, I wasn't happy with the deprecations of the
> APIs in PEP 393, since there are no C API alternatives for
> the encoding APIs deprecated in the PEP, which allow direct
> encoding provided by these important codecs.
>
>
Le dim. 28 juin 2020 à 04:39, Inada Naoki a écrit :
> ## Documented and have Py_DEPRECATED
>
> * PyLong_FromUnicode
> * PyUnicode_AsUnicodeCopy
> * PyUnicode_Encode
> * PyUnicode_EncodeUTF7
> * PyUnicode_EncodeUTF8
> * PyUnicode_EncodeUTF16
> * PyUnicode_EncodeUTF32
> *
On Mon, Jun 29, 2020 at 12:17 AM Inada Naoki wrote:
>
>
> More aggressive idea: override current PyUnicode_EncodeXXX() apis.
> Change from `Py_UNICODE *object` to `PyObject *unicode`.
>
This is a list of PyUnicode_Encode usage in top4000 packages.
On Sun, Jun 28, 2020 at 11:24 PM Inada Naoki wrote:
>
>
> So how about making them public, instead of undeprecating the Py_UNICODE* encode
> APIs?
>
> 1. Add PyUnicode_AsXXXBytes public APIs in Python 3.10.
>    Current private APIs can become macros (e.g. #define
> _PyUnicode_AsAsciiString
Hi, Lemburg.
Thank you for the quick response.
>
> We can't just remove access to one half of a codec (the decoding
> part) without at least providing an alternative for C extensions
> to use.
>
> Py_UNICODE can be removed from the API, but only if there are
> alternative APIs which C extensions can
Hi Inada-san,
as you may remember, I wasn't happy with the deprecations of the
APIs in PEP 393, since there are no C API alternatives for
the encoding APIs deprecated in the PEP, which allow direct
encoding provided by these important codecs.
AFAIK, the situation hasn't changed since then.
We