Dan Williams wrote:
Not directly your point, but I do think we need to be very careful about avoiding encoding strings, doing the decoding at the boundaries whenever possible (where dbus is a boundary).

I'd advocate mandating UTF-8 everywhere, but that's just me...  Is there
a way to make python's constant strings (ie, 'a = "something"') always
be Unicode objects?

Well, "inside" the Python process there are actual unicode strings, which aren't encoded in any form. And there do exist strings which can't be decoded, because they are not textual data.
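To make the "decode at the boundaries" idea concrete, here is a minimal sketch. The helper names from_boundary and to_boundary are made up for illustration (they are not part of dbus or Sugar), and the choice of UTF-8 on the wire is an assumption. It is written with b"" and u"" literals so it runs on Python 2.6+ as well as recent Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers: decode once on the way in, encode once on the way out.

def from_boundary(raw, encoding="utf-8"):
    """Turn bytes received from outside the process into a unicode string."""
    return raw.decode(encoding)

def to_boundary(text, encoding="utf-8"):
    """Turn a unicode string into bytes just before it leaves the process."""
    return text.encode(encoding)

# Textual data: five bytes on the wire, four characters inside the process.
text = from_boundary(b"t\xc9\x9bst")
round_trip = to_boundary(text)

# Non-textual data (here the first bytes of a JPEG file) is not valid UTF-8
# and should simply stay a byte string rather than be decoded.
try:
    from_boundary(b"\xff\xd8\xff\xe0")
    binary_decoded = True
except UnicodeDecodeError:
    binary_decoded = False
```

Everything between the two helpers then only ever sees unicode objects, and binary blobs are never passed through them.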

Anyway, about encoding:

* A constant string is just data, it doesn't hold any encoding information. So there's no way to indicate what encoding it has. If you know its encoding, you should probably just decode it.

* A constant unicode string (u"") doesn't have an encoding either, it's just unicode. The *source* is encoded in some way. If you include the UTF8 marker (the PEP 263 coding line, e.g. "# -*- coding: utf-8 -*-") at the beginning of a Python source file, it will be assumed to be UTF8 content. This is what I advocate for OLPC (in the style guide here: http://wiki.laptop.org/go/Python_Style_Guide#Encodings_.28PEP_263.29). Mostly we just have to make sure the editing tools included with the laptop produce and preserve that UTF8 marker.

* I just tested it, and a non-unicode constant string also gets the encoded data. So if you include something like s = "tɛst" in your source (with the UTF8 marker), the result is s == 't\xc9\x9bst'. Though we really should be using unicode strings for our textual data. But if we have some non-unicode docstring and someone puts some unicode data in it (e.g., an author name), it at least won't break, and that's good.

* Some textual strings can't be unicode; specifically Python identifiers. So, for instance, "obj.__dict__[u'tɛst'] = 'foo'" is illegal, because objects can't have unicode attributes. In these cases the attributes should simply be ASCII.
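A rough sketch of the source-encoding points above: the snippet builds a tiny source file as UTF-8 bytes, with the coding marker on its first line, compiles it, and checks what the unicode literal ends up as. The file name "<example>" and the variable names are placeholders of mine. It sticks to a u"" literal inside the compiled source so that it behaves the same on Python 2.6+ and recent Python 3 (a non-unicode literal with non-ASCII content, as in my test above, is only accepted by Python 2).

```python
# -*- coding: utf-8 -*-

# The bytes of a two-line source file, saved as UTF-8 with the marker on top.
source = u'# -*- coding: utf-8 -*-\ns = u"t\u025bst"\n'.encode("utf-8")

# compile() honours the marker when it is handed the raw bytes of the source.
namespace = {}
exec(compile(source, "<example>", "exec"), namespace)

s = namespace["s"]            # a unicode object: it has no encoding
encoded = s.encode("utf-8")   # the bytes a non-unicode literal would hold
```

Here s has four characters, while encoded has five bytes and is the same 't\xc9\x9bst' value as in the test I mentioned.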

In Python 3 normal constant strings will all be unicode strings, and there will be a different syntax for binary literals (or maybe no syntax at all). But that definitely won't happen in the Python 2 line.

There is a way to set the default encoding to utf8 (by default it is ascii), but that introduces lots of weird artifacts and is strongly discouraged. The default encoding matters when you do comparisons between strings and a few other situations -- generally either the str has to be decoded to a unicode object, or the unicode object encoded with the default encoding before they can be usefully compared. You can disable the default encoding entirely, but I suspect it would break far too many things to do so. (Though I suppose we could also make the default encoding something that will produce a warning, and at least get some indication of possible encoding errors.)
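As a sketch of the alternative to relying on the default encoding, the comparison can be made explicit: decode the byte string with a known encoding first, then compare unicode to unicode. The helper as_text is hypothetical, and UTF-8 as the known encoding is an assumption. Written to run on Python 2.6+ and recent Python 3.

```python
# -*- coding: utf-8 -*-
import sys

def as_text(value, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, pass unicode through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

raw = b"t\xc9\x9bst"    # encoded data, e.g. straight from a file or socket
text = u"t\u025bst"     # the same word as a unicode object

# Mixing the two types does not compare equal, even though they
# represent the same text.
mixed = (raw == text)

# Decoding explicitly first gives the comparison we actually meant.
explicit = (as_text(raw) == as_text(text))

# For reference: the interpreter-wide default that implicit conversions use.
default_encoding = sys.getdefaultencoding()
```

The point of the design is that the encoding is named at the one place where the bytes are turned into text, so nothing depends on an interpreter-wide setting.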

Generally if we test with non-ASCII data we'll see encoding problems fairly early. With ASCII data encoding problems can stay hidden quite easily. English speakers tend not to create non-ASCII test data, which is a problem.
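For example, a small set of non-ASCII samples in the test data is enough to flush out this kind of bug. The sketch below uses a deliberately buggy, made-up function (buggy_label) that assumes its input is ASCII; the sample strings and helper names are mine, not from any existing test suite.

```python
# -*- coding: utf-8 -*-

def buggy_label(raw):
    """Hypothetical code with a hidden bug: it assumes ASCII input."""
    return raw.decode("ascii").upper()

# ASCII-only data passes; the other two samples expose the problem.
SAMPLES = [u"test", u"t\u025bst", u"\u65e5\u672c\u8a9e"]

def failing_samples(func, samples):
    """Feed each sample to func as UTF-8 bytes; collect the ones that fail."""
    failures = []
    for sample in samples:
        try:
            func(sample.encode("utf-8"))
        except UnicodeError:
            failures.append(sample)
    return failures

failures = failing_samples(buggy_label, SAMPLES)
```

With only u"test" in the list, failures would be empty and the bug would stay hidden.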

--
Ian Bicking | [EMAIL PROTECTED] | http://blog.ianbicking.org
_______________________________________________
Sugar mailing list
[email protected]
http://mailman.laptop.org/mailman/listinfo/sugar
