Hi, all.

I updated PEP 597 yesterday.
Please review it to move it forward.

PEP: https://www.python.org/dev/peps/pep-0597/
Previous thread: https://discuss.python.org/t/3880

Main differences from the previous version:

* Added a new warning category, EncodingWarning.
* Added a dedicated option to enable the warning instead of using dev mode.


Abstract
========

Add a new warning category, ``EncodingWarning``. It is emitted when
the ``encoding`` option is omitted and the default encoding is the
locale encoding.

The warning is disabled by default. A new ``-X warn_encoding``
command-line option and a new ``PYTHONWARNENCODING`` environment
variable are used to enable it.


Motivation
==========

Using the default encoding is a common mistake
----------------------------------------------

Developers using macOS or Linux may forget that the default encoding
is not always UTF-8.

For example, ``long_description = open("README.md").read()`` in
``setup.py`` is a common mistake. Many Windows users cannot install
the package if the UTF-8 encoded ``README.md`` file contains at least
one non-ASCII character (e.g. an emoji).

For example, 489 of the 4000 most downloaded packages from PyPI used
non-ASCII characters in their README, and 82 of those packages cannot
be installed from the source package when the locale encoding is
ASCII. [1]_ They used the default encoding to read a README or TOML
file.
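The failure mode can be reproduced without touching the filesystem.
The snippet below is a minimal sketch, not taken from any of the
surveyed packages; the README content is made up for illustration.
It decodes the same UTF-8 bytes once with an explicit encoding and
once with ASCII, which is roughly what happens when the file is read
with an ASCII locale encoding.

```python
# Hypothetical README content containing one non-ASCII character.
readme_bytes = "Fast parser \N{ROCKET}\n".encode("utf-8")

# Decoding with an explicit UTF-8 encoding works on every platform.
text = readme_bytes.decode("utf-8")

# Decoding with ASCII, as an ASCII locale encoding would, fails.
try:
    readme_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    failure = exc
else:
    failure = None
```

Passing ``encoding="utf-8"`` to ``open()`` in ``setup.py`` avoids the
error in the same way.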

Another example is ``logging.basicConfig(filename="log.txt")``.
Some users expect UTF-8 to be used by default, but the locale encoding
is actually used. [2]_

Even Python experts assume that the default encoding is UTF-8.
This assumption creates bugs that happen only on Windows. See [3]_
and [4]_.

Emitting a warning when the ``encoding`` option is omitted will help
to find such mistakes.


Prepare to change the default encoding to UTF-8
-----------------------------------------------

We chose to use the locale encoding as the default text encoding in
Python 3.0. But UTF-8 has been adopted very widely since then.

We might change the default text encoding to UTF-8 in the future.
But this change would affect many applications and libraries.
If we started emitting ``DeprecationWarning`` by default whenever the
``encoding`` option is omitted, a very large number of warnings would
be emitted. It would be too noisy.

Although this PEP doesn't propose changing the default encoding,
it will help to reduce the number of such warnings in the future if
we decide to change the default encoding.


Specification
=============

``EncodingWarning``
--------------------

Add a new ``EncodingWarning`` warning class, which is a subclass of
``Warning``. It is used to warn when the ``encoding`` option is
omitted and the default encoding is locale-specific.


Options to enable the warning
------------------------------

A ``-X warn_encoding`` option and a ``PYTHONWARNENCODING``
environment variable are added. They are used to enable
``EncodingWarning``.

``sys.flags.encoding_warning`` is also added. The flag indicates
whether ``EncodingWarning`` is enabled.

When the option is enabled, ``io.TextIOWrapper()``, ``open()``, and
other modules using them will emit ``EncodingWarning`` when
``encoding`` is omitted.
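Code that wants to know whether the warning is active could check the
flag defensively. The following is a sketch that assumes the flag name
proposed in this draft, ``sys.flags.encoding_warning``; on
interpreters that do not provide that attribute, ``getattr()`` falls
back to ``False``. The helper name ``encoding_warning_enabled`` is
hypothetical.

```python
import sys


def encoding_warning_enabled():
    """Return True if the proposed encoding_warning flag is set.

    The attribute name follows this draft; it is an assumption and may
    be missing on a given interpreter, in which case False is returned.
    """
    return bool(getattr(sys.flags, "encoding_warning", False))


enabled = encoding_warning_enabled()
```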


``encoding="locale"`` option
----------------------------

``io.TextIOWrapper`` accepts an ``encoding="locale"`` option. It means
the same as the current ``encoding=None``, except that
``io.TextIOWrapper`` doesn't emit ``EncodingWarning`` when
``encoding="locale"`` is specified.

Also add an ``io.LOCALE_ENCODING = "locale"`` constant. This constant
can be used to avoid a confusing ``LookupError: unknown encoding: locale``
error when the code is accidentally run on an old Python.

The constant can also be used to test whether the ``encoding="locale"``
option is supported. For example,

.. code-block::

   import io

   # Want to suppress an EncodingWarning but still need to support
   # old Python versions. On those versions the constant is missing,
   # so this falls back to encoding=None.
   locale_encoding = getattr(io, "LOCALE_ENCODING", None)
   with open(filename, encoding=locale_encoding) as f:
       ...


``io.text_encoding()``
-----------------------

``io.text_encoding()`` is a helper function for functions that have
an ``encoding=None`` option and pass it to ``io.TextIOWrapper()`` or
``open()``.

A pure Python implementation would look like this::

   def text_encoding(encoding, stacklevel=1):
       """Helper function to choose the text encoding.

       When *encoding* is not None, just return it.
       Otherwise, return the default text encoding (i.e., "locale").

       This function emits EncodingWarning if *encoding* is None and
       sys.flags.encoding_warning is true.

       This function can be used in APIs that have an encoding=None
       option and pass it to TextIOWrapper or open.
       But please consider using encoding="utf-8" for new APIs.
       """
       if encoding is None:
           if sys.flags.encoding_warning:
               import warnings
               warnings.warn("'encoding' option is omitted",
                            EncodingWarning, stacklevel + 2)
           encoding = LOCALE_ENCODING
       return encoding

For example, ``pathlib.Path.read_text()`` can use the function like this:

.. code-block::

   def read_text(self, encoding=None, errors=None):
       encoding = io.text_encoding(encoding)
       with self.open(mode='r', encoding=encoding, errors=errors) as f:
           return f.read()

By using ``io.text_encoding()``, ``EncodingWarning`` is attributed to
the caller of ``read_text()`` instead of to ``read_text()`` itself.
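The effect of the ``stacklevel`` arithmetic can be checked with a
self-contained sketch. Everything prefixed with ``Demo`` or ``demo_``
is a hypothetical stand-in, since the proposed ``EncodingWarning`` and
``sys.flags.encoding_warning`` may not exist on the running
interpreter; the ``enabled`` parameter replaces the flag. The logic
otherwise mirrors the helper shown above.

```python
import warnings


class DemoEncodingWarning(Warning):
    """Hypothetical stand-in for the proposed EncodingWarning."""


def demo_text_encoding(encoding, stacklevel=1, enabled=True):
    # Same shape as the draft helper; `enabled` stands in for
    # sys.flags.encoding_warning.
    if encoding is None:
        if enabled:
            warnings.warn("'encoding' option is omitted",
                          DemoEncodingWarning, stacklevel + 2)
        encoding = "locale"
    return encoding


def read_text(encoding=None):
    # A library function that forwards its encoding argument.
    return demo_text_encoding(encoding)


def application():
    return read_text()  # the warning is attributed to this line


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    chosen = application()
```

With ``stacklevel + 2`` equal to 3, the recorded warning points at the
call inside ``application()``, not at the body of ``read_text()``.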


Affected stdlib modules
-----------------------

Many stdlib modules will be affected by this change.

Most APIs accepting ``encoding=None`` will use ``io.text_encoding()``
as described in the previous section.

Where using the locale encoding as the default encoding is reasonable,
``encoding=io.LOCALE_ENCODING`` will be used instead. For example,
the ``subprocess`` module will use the locale encoding as the default
encoding of the pipes.

Many tests use ``open()`` without specifying ``encoding`` to read
ASCII text files. They should be rewritten with ``encoding="ascii"``.


Rationale
=========

Opt-in warning
---------------

Although ``DeprecationWarning`` is suppressed by default, always
emitting ``DeprecationWarning`` when the ``encoding`` option is omitted
would be too noisy.

Noisy warnings may lead developers to dismiss
``DeprecationWarning`` altogether.


"locale" is not a codec alias
-----------------------------

We don't add "locale" as a codec alias because the locale can be
changed at runtime.

Additionally, ``TextIOWrapper`` checks ``os.device_encoding()``
when ``encoding=None``. This behavior cannot be implemented in
a codec.
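The distinction can be seen with existing stdlib calls: the locale
encoding is queried at run time, whereas codec names are resolved
through the codec registry. The sketch below assumes an interpreter
on which no codec named "locale" has been registered, which is the
situation this section describes.

```python
import codecs
import locale

# The locale encoding is queried at run time and can differ between runs.
current_locale_encoding = locale.getpreferredencoding(False)

# "locale" itself is not a codec name, so the registry cannot resolve it.
try:
    codecs.lookup("locale")
except LookupError:
    locale_is_codec = False
else:
    locale_is_codec = True
```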


Reference Implementation
========================

https://github.com/python/cpython/pull/19481


References
==========

.. [1] "Packages can't be installed when encoding is not UTF-8"
       (https://github.com/methane/pep597-pypi-ascii)

.. [2] "Logging - Inconsistent behaviour when handling unicode"
       (https://bugs.python.org/issue37111)

.. [3] Packaging tutorial in packaging.python.org didn't specify
       encoding to read a ``README.md``
       (https://github.com/pypa/packaging.python.org/pull/682)

.. [4] ``json.tool`` used the locale encoding to read JSON files
       (https://bugs.python.org/issue33684)


Copyright
=========

This document has been placed in the public domain.