Re: Question on Strings
On Feb 6, 9:24 pm, Chris Rebert c...@rebertia.com wrote:
> On Fri, Feb 6, 2009 at 1:49 AM, Kalyankumar Ramaseshan soft_sm...@yahoo.com wrote:
>> Hi, Excuse me if this is a repeat question!
>> I just wanted to know how are strings represented in python?
>> I need to know in terms of:
>> a) Strings are stored as UTF-16 (LE/BE) or UTF-32 characters?

Neither.

> IIRC, Depends on what the build settings were when CPython was compiled. UTF-16 is the default.

Unicode strings are held as arrays of 16-bit numbers or 32-bit numbers [of which only 21 are used]. If you must use an acronym, use UCS-2 or UCS-4.

The UTF-n siblings are *external* representations.
2.x: a_unicode_object.decode('UTF-16') -> an_str_object
3.x: an_str_object.decode('UTF-16') -> a_bytes_object

By the way, has anyone come up with a name for the shifting effect observed above on str, and also with repr, range, and the iter* family? If not, I suggest that the language's association with the best of English humour be widened so that it be dubbed the Mad Hatter's Tea Party effect.

--
http://mail.python.org/mailman/listinfo/python-list
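A minimal sketch of how one might check which internal width a given interpreter uses, assuming only the standard `sys.maxunicode` attribute. The helper name `unicode_build_kind` is made up for illustration. Note that this goes slightly beyond the thread: on CPython 3.3 and later the narrow/wide build distinction no longer exists and `sys.maxunicode` is always 0x10FFFF.

```python
import sys


def unicode_build_kind():
    """Describe the code point range reported by this interpreter."""
    # sys.maxunicode is 0xFFFF on old narrow (UCS-2) builds and
    # 0x10FFFF on wide (UCS-4) builds and on CPython 3.3 or later.
    if sys.maxunicode == 0xFFFF:
        return "narrow (16-bit code units)"
    return "wide or flexible (full 21-bit code point range)"


# The largest code point, 0x10FFFF, needs exactly 21 bits.
print(hex(sys.maxunicode), unicode_build_kind())
```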
Re: Question on Strings
John Machin s...@le..n.net wrote:
> By the way, has anyone come up with a name for the shifting effect observed above on str, and also with repr, range, and the iter* family? If not, I suggest that the language's association with the best of English humour be widened so that it be dubbed the Mad Hatter's Tea Party effect.

The MHTP effect. Sounds educated, almost like a network protocol.

+1

- Hendrik
Re: Question on Strings
John Machin wrote:
> By the way, has anyone come up with a name for the shifting effect observed above on str, and also with repr, range, and the iter* family? If not, I suggest that the language's association with the best of English humour be widened so that it be dubbed the Mad Hatter's Tea Party effect.

Bitwise shifts and rotates are collectively referred to as skew operations. I therefore suggest the term skewing. :-)
Re: Question on Strings
On Fri, Feb 6, 2009 at 1:49 AM, Kalyankumar Ramaseshan soft_sm...@yahoo.com wrote:
> Hi, Excuse me if this is a repeat question!
> I just wanted to know how are strings represented in python?
> I need to know in terms of:
> a) Strings are stored as UTF-16 (LE/BE) or UTF-32 characters?

IIRC, it depends on what the build settings were when CPython was compiled. UTF-16 is the default.

> b) They are converted to utf-8 format when it is needed for e.g. when storing the string to disk or sending it through a socket (tcp/ip)?

No. They are implicitly converted to ASCII in such cases. To properly handle non-ASCII Unicode characters, you need to encode/decode the strings to/from bytes manually by specifying the encoding.

Cheers,
Chris
--
Follow the path of the Iguana...
http://rebertia.com
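The manual encode/decode step described above might look roughly like this. It is only a sketch, assuming a Python 3 interpreter; the sample text and the temporary file are illustrative and not taken from the thread.

```python
import os
import tempfile

text = "na\u00efve caf\u00e9"  # contains two non-ASCII characters

# The ASCII codec cannot represent these characters, so relying on an
# ASCII conversion fails for this string.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encode explicitly before writing; the file only ever sees bytes.
data = text.encode("utf-8")

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    # Read the raw bytes back and decode with the same encoding.
    with open(path, "rb") as f:
        restored = f.read().decode("utf-8")
finally:
    os.remove(path)

print(ascii_ok, restored == text)
```

The same pattern should apply to a socket: encode before sending, decode after receiving, and make sure both ends agree on the encoding.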
Re: Question on Strings
Hi,

Kalyankumar Ramaseshan wrote:
> Hi, Excuse me if this is a repeat question!
> I just wanted to know how are strings represented in python?

It depends on whether you mean Python 2.x or Python 3.x; the model changed. Python 2.x knows str and unicode: the former is a sequence of single-byte characters, while unicode uses either 16 or 32 bits per character, depending on configure options. str in Python 3.x replaces unicode, and what formerly used to be like str is now bytes (iirc).

> I need to know in terms of:
> a) Strings are stored as UTF-16 (LE/BE) or UTF-32 characters?

It uses an internal fixed-length encoding for unicode, not UTF.

> b) They are converted to utf-8 format when it is needed for e.g. when storing the string to disk or sending it through a socket (tcp/ip)?

Nope. You need to do this explicitly. The default encoding for implicit conversion in Python 2.x is ASCII. In Python 2.x you would use unicodestr.encode('utf-8') to get a UTF-8 encoded byte string, and simplestr.decode('utf-8') to convert a UTF-8 encoded string back to internal unicode. There are many encodings available to select from.

> Any help in this regard is appreciated.

Please see also Python's documentation, which is very good, and just try it out in the interactive interpreter.

Regards
Tino
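To illustrate the point that the encoding is a choice made at conversion time, here is a small sketch, assuming Python 3 (where str is the Unicode type). The character and the list of encodings are arbitrary picks for illustration.

```python
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, a single code point

# The same one-character string becomes different byte sequences
# depending on which encoding is selected.
encodings = ["utf-8", "utf-16-le", "utf-32-le", "latin-1"]
encoded = {name: text.encode(name) for name in encodings}

for name, raw in encoded.items():
    print(name, len(raw), raw)

# Decoding with the matching codec restores the original string.
round_trips = all(raw.decode(name) == text for name, raw in encoded.items())
print(round_trips)
```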
Re: Question on Strings
John Machin wrote:
> The UTF-n siblings are *external* representations.
> 2.x: a_unicode_object.decode('UTF-16') -> an_str_object
> 3.x: an_str_object.decode('UTF-16') -> a_bytes_object

That should be .encode() to bytes, which is the coded form. .decode is bytes => str/unicode
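The corrected direction can be checked in the interpreter. A short sketch, assuming Python 3; the sample string is illustrative.

```python
text = "h\u00e9llo"

encoded = text.encode("utf-16")     # str -> bytes (the coded form)
decoded = encoded.decode("utf-16")  # bytes -> str

print(type(encoded).__name__, type(decoded).__name__)
print(decoded == text)

# In Python 3 the methods only exist in the sensible direction:
# str has .encode() but no .decode(); bytes has .decode() but no .encode().
print(hasattr(text, "decode"), hasattr(encoded, "encode"))
```

The plain 'utf-16' codec writes a byte order mark when encoding and consumes it when decoding, which is why the round trip works without naming LE or BE.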
Re: Question on Strings
On Feb 7, 5:23 am, Terry Reedy tjre...@udel.edu wrote:
> John Machin wrote:
>> The UTF-n siblings are *external* representations.
>> 2.x: a_unicode_object.decode('UTF-16') -> an_str_object
>> 3.x: an_str_object.decode('UTF-16') -> a_bytes_object
>
> That should be .encode() to bytes, which is the coded form. .decode is bytes => str/unicode

True. I guess that makes me the Dohmouse :-)