Re: [sugar] Python Style Guide

Ian Bicking Tue, 14 Nov 2006 10:29:18 -0800

Dan Williams wrote:

Not directly your point, but I do think we need to be very careful about avoiding encoding strings, doing the decoding at the boundaries whenever possible (where dbus is a boundary).
I'd advocate mandating UTF-8 everywhere, but that's just me...  Is there
a way to make python's constant strings (ie, 'a = "something"') always
be Unicode objects?

Well, "inside" the Python process there are actual unicode strings, which aren't encoded in any form. And there do exist strings which can't be decoded, because they are not textual data.
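
A minimal sketch of decoding at the boundary, assuming Python 3, where bytes plays the role of the Python 2 str and str the role of the Python 2 unicode type. The helper name decode_at_boundary and the sample values are hypothetical, chosen only for illustration:

```python
def decode_at_boundary(raw, encoding="utf-8"):
    """Turn bytes that arrived from outside the process into text.

    Hypothetical helper: call it once, where the data enters, so that
    everything past this point only handles unencoded text.
    """
    return raw.decode(encoding)


# Textual data: these are the UTF-8 bytes for "tɛst".
text = decode_at_boundary(b"t\xc9\x9bst")

# Non-textual data: the first bytes of a PNG file are not valid UTF-8,
# so this value has to stay as bytes.
binary_blob = b"\x89PNG\r\n\x1a\n"
try:
    decode_at_boundary(binary_blob)
    blob_is_text = True
except UnicodeDecodeError:
    blob_is_text = False
```

Here text ends up as a four-character string, while the attempt to decode binary_blob fails and it remains plain binary data.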


Anyway, about encoding:

* A constant string is just data, it doesn't hold any encoding information. So there's no way to indicate what encoding it has. If you know its encoding, you should probably just decode it.

* A constant unicode string (u"") doesn't have encoding either, it's just unicode. The *source* is encoded in some way. If you include the UTF8 marker at the beginning of a Python source file, it will be assumed to be UTF8 content. This is what I advocate for OLPC (in the style guide here: http://wiki.laptop.org/go/Python_Style_Guide#Encodings_.28PEP_263.29). Mostly we just have to make sure the editing tools included with the laptop produce and preserve that UTF8 marker.

* I just tested it, and a non-unicode constant string also gets the encoded data. So if you include something like s = "tɛst" in your source (with the UTF8 marker), the result is s == 't\xc9\x9bst'. Though we really should be using unicode strings for our textual data. But if we have some non-unicode docstring and someone puts some unicode data in it (e.g., an author name), it at least won't break, and that's good.

* Some textual strings can't be unicode; specifically Python identifiers. So, for instance, "obj.__dict__[u'tɛst'] = 'foo'" is illegal, because objects can't have unicode attributes. In these cases the attributes should simply be ASCII.
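
The relationship between the source literal and its encoded bytes can be sketched as follows. This assumes Python 3, where a plain literal is already a unicode string and the explicit .encode() call yields the byte string that a Python 2 non-unicode literal would have held. The first line is the PEP 263 coding declaration, i.e. the "UTF8 marker" discussed in the list above:

```python
# -*- coding: utf-8 -*-

# The literal in the source file; the file itself is stored as UTF-8.
text = "tɛst"

# Encoding gives the raw bytes as they sit in the source file.
encoded = text.encode("utf-8")

# The single character ɛ (U+025B) occupies two bytes in UTF-8.
char_count = len(text)
byte_count = len(encoded)

# Decoding with the right codec restores the original text.
round_trip = encoded.decode("utf-8")
```

With this input, encoded equals b't\xc9\x9bst', matching the value reported in the test above, and the text is four characters long but five bytes when encoded.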

In Python 3 normal constant strings will all be unicode strings, and there will be a different syntax for binary literals (or maybe no syntax at all). But that definitely won't happen in the Python 2 line.
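
For reference, a short sketch of how this looks in Python 3 as eventually released, which did keep a literal syntax for binary data (the b"" prefix). The variable names are only illustrative:

```python
plain_literal = "something"     # a unicode (text) string
binary_literal = b"something"   # binary data, written with the b prefix

plain_is_text = isinstance(plain_literal, str)
binary_is_bytes = isinstance(binary_literal, bytes)

# The two types are distinct: text is not bytes, and bytes is not text.
text_is_not_bytes = not isinstance(plain_literal, bytes)
```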

There is a way to set the default encoding to utf8 (by default it is ascii), but that introduces lots of weird artifacts and is strongly discouraged. The default encoding matters when you do comparisons between strings and a few other situations -- generally either the str has to be decoded to a unicode object, or the unicode object encoded with the default encoding before they can be usefully compared. You can disable the default encoding entirely, but I suspect it would break far too many things to do so. (Though I suppose we could also make the default encoding something that will produce a warning, and at least get some indication of possible encoding errors.)
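
One way to avoid depending on the default encoding is to decode explicitly before comparing. The sketch below assumes Python 3, which performs no implicit conversion at all, so a bytes value never compares equal to a text value. The helper same_text is hypothetical, not part of any library:

```python
def same_text(a, b, encoding="utf-8"):
    """Compare two values as text, decoding any bytes explicitly.

    Hypothetical helper: the caller names the encoding instead of
    relying on an interpreter-wide default.
    """
    if isinstance(a, bytes):
        a = a.decode(encoding)
    if isinstance(b, bytes):
        b = b.decode(encoding)
    return a == b


raw = b"t\xc9\x9bst"   # encoded form, e.g. as read from a file or socket
text = "t\u025bst"     # the same word as unencoded text

naive_equal = (raw == text)            # no implicit decoding takes place
explicit_equal = same_text(raw, text)  # decoded first, then compared
```

The naive comparison is False even though both values represent the same word, while the explicit one is True.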

Generally if we test with non-ASCII data we'll see encoding problems fairly early. With ASCII data encoding problems can stay hidden quite easily. English speakers tend not to create non-ASCII test data, which is a problem.
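
A small illustration of how ASCII-only test data can hide an encoding bug, assuming Python 3. The function buggy_decode is a deliberately broken, hypothetical example: it receives UTF-8 bytes but decodes them as Latin-1.

```python
def buggy_decode(raw):
    # Deliberate bug: the input is UTF-8, but it is decoded as Latin-1.
    return raw.decode("latin-1")


ascii_sample = "test"
non_ascii_sample = "t\u025bst"   # "tɛst"

# ASCII bytes mean the same thing in both encodings, so the bug is invisible.
ascii_ok = buggy_decode(ascii_sample.encode("utf-8")) == ascii_sample

# Non-ASCII data exposes the bug right away: the result is mangled text.
non_ascii_ok = buggy_decode(non_ascii_sample.encode("utf-8")) == non_ascii_sample
```

The ASCII check passes and the non-ASCII check fails, which is why including non-ASCII samples in test data surfaces such bugs early.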


--
Ian Bicking | [EMAIL PROTECTED] | http://blog.ianbicking.org
_______________________________________________
Sugar mailing list
[email protected]
http://mailman.laptop.org/mailman/listinfo/sugar
