Re: [Tutor] How does len() compute length of a string in UTF-8, 16, and 32?

Cameron Simpson Thu, 10 Aug 2017 20:07:03 -0700

On 10Aug2017 20:40, boB Stepp <[email protected]> wrote:

(By the way, it is nearly 14 years later, and PHP still believes that
the world is ASCII.)


I thought you must surely be engaging in hyperbole, but at
http://php.net/manual/en/xml.encoding.php I found:

"The default source encoding used by PHP is ISO-8859-1."

This kind of amounts to Python 2's situation in some ways: a PHP string or Python 2 str is effectively just an array of bytes, treated like a lexical stringy thing.

If you're working only in ASCII or _universally_ in some fixed 8-bit character set (eg ISO8859-1 in Western Europe) you mostly get by if you don't look closely. PHP's "default source encoding" means that the various _character_ based routines in PHP (things that know about characters as letters, punctuation etc) treat these strings as using ISO8859-1 encoding. You can load UTF-8 into these strings and work that way too (there's a PHP global setting for the encoding).
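To make the "array of bytes" point concrete, here is a minimal sketch in Python 3 (my own illustration, not PHP code, and the variable names are made up for the example). The same two bytes come out as one character or two depending on which encoding you assume:

```python
# The UTF-8 encoding of U+00E9 (e with an acute accent) is two bytes.
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")          # one character: U+00E9
as_latin1 = raw.decode("iso-8859-1")   # two characters: U+00C3 U+00A9

print(len(raw), len(as_utf8), len(as_latin1))   # 2 1 2
```

The bytes never change; only the interpretation does, which is why a byte-array "string" is fine until something character-aware looks at it.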


Python 2 has a "unicode" type for proper Unicode strings.

In Python 3 str is Unicode text, and you use bytes for bytes. It is hugely better, because you don't need to concern yourself about what text encoding a str is - it doesn't have one - it is Unicode. You only need to care when reading and writing data.
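A small sketch of that separation, assuming nothing beyond the standard str and bytes types (the sample word and variable names are just for illustration):

```python
text = "na\u00efve"            # str: 5 Unicode code points, no encoding attached
data = text.encode("utf-8")    # bytes: an encoding is chosen only at this boundary

print(type(text).__name__, len(text))   # str 5
print(type(data).__name__, len(data))   # bytes 6

# Decoding with the same encoding gets the original text back.
round_trip = data.decode("utf-8")

# Python 3 refuses to mix the two types silently.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)   # False
```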

So long as your editor knows to save the file in UTF-8, it will Just
Work.


So Python 3's default behavior for strings is to store them as UTF-8
encodings in both RAM and files?


Not quite.

In memory Python 3 strings are sequences of Unicode code points. The CPython internals pick an 8 or 16 or 32 bit storage mode for these based on the highest code point value in the string as a space optimisation decision, but that is concealed at the language level. UTF-8 as a storage format is nearly as compact, but has the disadvantage that you can't directly index the string (i.e. go to character "n") because UTF-8 uses variable length encodings for the various code points.
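Here is a rough sketch of the difference between what len() reports and what the encodings need. The storage_width() helper is my own invention for this example; it leans on sys.getsizeof, so treat it as a CPython-specific probe rather than anything the language guarantees:

```python
import sys

s = "a\u00e9\u20ac\U0001F600"   # 'a', e-acute, euro sign, an emoji

# len() counts code points, whatever CPython uses internally.
n_code_points = len(s)

# Encoded sizes differ; the "-le" codecs are used so that no BOM is prepended.
encoded_sizes = {
    enc: len(s.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")
}

def storage_width(ch, n=1000):
    """Estimate the bytes per code point CPython uses for a string of ch.

    Comparing strings of n and 2n copies cancels out the fixed object header.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

widths = [storage_width(ch) for ch in s]

print(n_code_points)    # 4
print(encoded_sizes)    # {'utf-8': 10, 'utf-16-le': 10, 'utf-32-le': 16}
print(widths)           # on CPython 3.3+: [1, 1, 2, 4]
```

Note that the plain "utf-16" and "utf-32" codecs would each add a byte order mark, so their byte counts come out 2 and 4 bytes larger respectively.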

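In files however, the default encoding for text files is usually 'utf-8': Python will read the file's bytes as UTF-8 data and will write Python string characters in UTF-8 encoding when writing. Strictly, open() takes its default from the platform's locale settings, which means UTF-8 on most modern UNIX systems but often a legacy code page on Windows; it is Python source files that default to UTF-8 everywhere.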

If you open a file in "binary" mode there's no encoding: you get bytes. But if you open in text mode (no "b" in the open mode string) you get text, and you can define the character encoding used as an optional parameter to the open() function call.
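A minimal sketch of the two modes, writing to a throwaway temporary directory (the file name demo.txt is hypothetical, and I pass encoding= explicitly rather than relying on the default):

```python
import os
import tempfile

text = "caf\u00e9"   # 4 code points; the last one needs 2 bytes in UTF-8

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.txt")   # hypothetical file name

    # Text mode, with the encoding stated explicitly.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: no decoding happens, we get the raw bytes.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode again: the bytes are decoded back into a str.
    with open(path, "r", encoding="utf-8") as f:
        decoded = f.read()

print(raw)                      # b'caf\xc3\xa9'
print(len(raw), len(decoded))   # 5 4
```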

No funny business anywhere?  Except
perhaps in my Windows 7 cmd.exe and PowerShell, but that's not
Python's fault.  Which makes me wonder, what is my editor's default
encoding/decoding?  I will have to investigate!

On most UNIX platforms most situations expect and use UTF-8. There are some complications because this needn't be the case, but most modern environments provide UTF-8 by default.
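If you want to see what Python itself assumes on a particular machine, something like the following sketch will report it. The labels in the dictionary are my own wording; the values will differ between systems, so I make no claim about what you will see beyond the first entry, which is always 'utf-8' in Python 3:

```python
import locale
import sys

report = {
    "str <-> bytes default": sys.getdefaultencoding(),
    "open() text default": locale.getpreferredencoding(False),
    "filesystem names": sys.getfilesystemencoding(),
    # stdout may have been replaced by an object without this attribute.
    "stdout": getattr(sys.stdout, "encoding", None),
}

for what, enc in report.items():
    print(f"{what}: {enc}")
```

This says nothing about your editor, of course; that has its own setting which you will need to look up separately.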

The situation in Windows is more complex for historic reasons. I believe Eryk Sun is the go-to guy for precise technical descriptions of the Windows situation. I'm not a Windows guy, but I gather modern Windows generally gives you a pretty clean UTF-8 environment in most situations.


Cheers,
Cameron Simpson <[email protected]> (formerly [email protected])
