On 10Aug2017 20:40, boB Stepp <robertvst...@gmail.com> wrote:
(By the way, it is nearly 14 years later, and PHP still believes that
the world is ASCII.)

I thought you must surely be engaging in hyperbole, but at
http://php.net/manual/en/xml.encoding.php I found:

"The default source encoding used by PHP is ISO-8859-1."

This kind of amounts to Python 2's situation in some ways: a PHP string or Python 2 str is effectively just an array of bytes, treated like a lexical stringy thing.

If you're working only in ASCII or _universally_ in some fixed 8-bit character set (eg ISO8859-1 in Western Europe) you mostly get by if you don't look closely. PHP's "default source encoding" means that the various _character_ based routines in PHP (things that know about characters as letters, punctuation etc) treat these strings as using ISO8859-1 encoding. You can load UTF-8 into these strings and work that way too (there's a PHP global setting for the encoding).

Python 2 has a "unicode" type for proper Unicode strings.

In Python 3 str is Unicode text, and you use bytes for bytes. It is hugely better, because you don't need to concern yourself about what text encoding a str is - it doesn't have one - it is Unicode. You only need to care when reading and writing data.
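A minimal sketch of that split, assuming Python 3. The sample word is arbitrary; the point is that an encoding only shows up at the moment you convert between str and bytes:

```python
# A str holds Unicode text; it has no encoding of its own.
text = "caf\u00e9"                     # "café": 4 code points

# Encoding is an explicit step that produces bytes.
as_utf8 = text.encode("utf-8")         # the e-acute takes 2 bytes in UTF-8
as_latin1 = text.encode("iso-8859-1")  # and 1 byte in ISO8859-1

print(type(text).__name__, len(text))        # str 4
print(type(as_utf8).__name__, len(as_utf8))  # bytes 5
print(len(as_latin1))                        # 4

# Decoding with the matching encoding gives back the same text.
assert as_utf8.decode("utf-8") == text
assert as_latin1.decode("iso-8859-1") == text
```

The same str produces different bytes depending on the encoding you choose, which is why you only need to care at the reading and writing boundary.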

So long as your editor knows to save the file in UTF-8, it will Just

So Python 3's default behavior for strings is to store them as UTF-8
encodings in both RAM and files?

Not quite.

In memory Python 3 strings are sequences of Unicode code points. The CPython internals pick an 8 or 16 or 32 bit storage mode for these based on the highest code point value in the string as a space optimisation decision, but that is concealed at the language level. UTF-8 as a storage format is nearly as compact, but has the disadvantage that you can't directly index the string (i.e. go to character "n") because UTF-8 uses variable length encodings for the various code points.
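A small sketch of the indexing difference, assuming Python 3. The sample string is made up to contain code points that need 1, 2, 3 and 4 bytes in UTF-8:

```python
# Four code points: "a", e-acute, the euro sign, and an emoji.
s = "a\u00e9\u20ac\U0001F600"
encoded = s.encode("utf-8")

print(len(s))        # 4 code points
print(len(encoded))  # 10 bytes: 1 + 2 + 3 + 4

# Indexing the str goes straight to the n-th code point.
assert s[2] == "\u20ac"

# Indexing the bytes gives the n-th byte, which here is the
# second byte of the e-acute, not a whole character.
assert encoded[2] == 0xA9

# To find character n in UTF-8 data you have to walk from the start.
print([hex(ord(c)) for c in s])
```

How CPython lays the code points out internally is not visible from any of this; the str behaves the same whichever storage width was picked.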

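In files however, the default encoding for text files is usually 'utf-8': Python will read the file's bytes as UTF-8 data and will write Python string characters in UTF-8 encoding when writing. Strictly speaking the default comes from your environment (it is what locale.getpreferredencoding() reports), which is UTF-8 on most modern UNIX systems but needn't be elsewhere.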

If you open a file in "binary" mode there's no encoding: you get bytes. But if you open in text mode (no "b" in the open mode string) you get text, and you can define the character encoding used as an optional parameter to the open() function call.
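A rough sketch of the two modes, assuming Python 3. The scratch file name is made up for the example, and newline="" is passed only so the bytes come out the same on every platform:

```python
import os
import tempfile

text = "caf\u00e9\n"
# Hypothetical scratch file, just for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Text mode: str in, encoded to bytes on the way to disc.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(text)

# Binary mode: no encoding involved, you get the raw bytes.
with open(path, "rb") as f:
    raw = f.read()
print(raw)  # b'caf\xc3\xa9\n'

# Text mode with the matching encoding recovers the original str.
with open(path, "r", encoding="utf-8", newline="") as f:
    decoded = f.read()
assert decoded == text

# Text mode with the wrong encoding "works" but gives mojibake.
with open(path, "r", encoding="iso-8859-1", newline="") as f:
    wrong = f.read()
assert wrong == "caf\u00c3\u00a9\n"

os.remove(path)
```

Passing encoding= explicitly means the result doesn't depend on whatever default the environment happens to supply.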

No funny business anywhere?  Except
perhaps in my Windows 7 cmd.exe and PowerShell, but that's not
Python's fault.  Which makes me wonder, what is my editor's default
encoding/decoding?  I will have to investigate!

On most UNIX platforms most situations expect and use UTF-8. There are some complications because this needn't be the case, but most modern environments provide UTF-8 by default.
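One way to see what a particular environment provides is to ask Python itself. This is only a sketch: report_encodings is a made-up helper name, and the values printed will differ from machine to machine, so no expected output is shown:

```python
import locale
import sys

def report_encodings():
    """Collect the encodings this Python process uses by default."""
    return {
        # What open() normally uses for text files when no encoding is given.
        "text file default": locale.getpreferredencoding(False),
        # What str.encode() and bytes.decode() use when no encoding is given.
        "str.encode default": sys.getdefaultencoding(),
        # What is used to turn file names into bytes for the OS.
        "filesystem": sys.getfilesystemencoding(),
        # The terminal or pipe; may be None if stdout has been replaced.
        "stdout": getattr(sys.stdout, "encoding", None),
    }

for name, value in report_encodings().items():
    print(f"{name:20} {value}")
```

If the first line isn't UTF-8 on your system, that is a good reason to pass encoding= to open() explicitly.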

The situation in Windows is more complex for historic reasons. I believe Eryk Sun is the go-to guy for precise technical descriptions of the Windows situation. I'm not a Windows guy, but I gather modern Windows generally gives you a pretty clean Unicode environment in most situations.

Cameron Simpson <c...@cskk.id.au> (formerly c...@zip.com.au)
Tutor maillist  -  Tutor@python.org