Hmm, looking at this, it seems I'm not the only one with this sort of problem. http://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf
Maybe I will just build a wall around these objects and declare "none but
unicode shall pass."

On 10/3/05, Liam Clarke <[EMAIL PROTECTED]> wrote:
> OK, one last kick.
>
> So, using
>
> val = unicode(value)
> self._slaveMap[attr].setPayload(value.encode("UTF-16"))
>
> I can stick normal strings in happily. Of course, as you mentioned,
> Kent, this leaves me vulnerable if the string's encoding differs from
> sys.getdefaultencoding().
>
> Other than directly from the user, the most likely source of data will
> be from pyid3lib, which for the time being assumes all strings are
> ISO-8859-1.
>
> http://pyid3lib.sourceforge.net
>
> Erk. Talk about big design up front. Would you recommend a different
> method of dealing with this? Basically, most of the strings in the
> database are UTF-16, and I just need to make them readable, and make
> sure any of the strings going in are UTF-16 as well.
>
> Alternatively, I've thought about just cycling through the hundred or
> so codecs until I don't get any UnicodeDecodeErrors, but that's no
> guarantee that it'll be human readable...oh dear.
>
> Thanks for any assistance offered.
>
> Liam Clarke
>
> On 10/3/05, Liam Clarke <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > If I can just beat this horse one more time, can I just get
> > confirmation that I'm going about this the right way?
> >
> > I have a base object, which reads the unicode string as bytes like so,
> > this ignores all but important bits.
> >
> > class Mhod:
> >     def __init__(self, f):
> >         # struct.unpack() always returns a tuple, so take element 0
> >         self.payload = struct.unpack("36s", f.read(36))[0]
> >
> > Which in turn, is utilised in a Song object, which works like this -
> >
> > class Song:
> >     def __init__(self, mhod):
> >         self.location = unicode(mhod.payload, "UTF-16")
> >         self.mhod = mhod
> >     def gLoc(self):
> >         return self.location
> >     def sLoc(self, value):
> >         #Need to coerce data into UTF-16 here
> >         self.mhod.payload = value.encode("UTF-16")
> >
> >     location = property(gLoc, sLoc)
> >
> > If I were to do a
> >
> > >>>x = Mhod(open("test", "rb"))
> > >>>y = Song(x)
> >
> > I get
> >
> > >>>x.payload
> > ':\x00i\x00P\x00o\x00d\x00_\x00C\x00o\x00n\x00t\x00r\x00o\x00l
> > \x00:\x00M\x00u\x00s\x00i\x00c\x00:\x00F\x004\x004\x00:\x00L
> > \x00W\x00B\x00R\x00.\x00m\x00p\x003\x00' #Line breaks added.
> >
> > >>>y.location
> > u':iPod_Control:Music:F44:LWBR.mp3'
> >
> > Which is what I'm after. What I'm struggling with is coercing the
> > string that's being passed to sLoc() into UTF-16, and actually
> > creating any form of unicode string at all without using
> >
> > >>>foo = u'Monkies!'
> >
> > Which I'm sure is going to be in UTF-8, just to spite me.
> >
> > So far, the best I've come up with is -
> >
> > >>> foo = unicode("Hi Bob!".encode("UTF-16"), "UTF-16")
> >
> > Which, as you mention above, is likely to cause me errors. And
> > apparently "Hi Bob!" is an 8 bit string encoded in UTF-16...
> > *sigh* I suppose I could go the XP route and expect any further users
> > to just deal with it and pass in a UTF-16 string, but there's got to
> > be a simple way to handle it, and I'm not having too much luck with
> > this.
> >
> > I've been working from the below document, if anyone can recommend
> > something further, I'd much appreciate it.
> >
> > http://www.amk.ca/python/howto/unicode
> >
> > Regards,
> >
> > Liam Clarke
> > On 10/3/05, Liam Clarke <[EMAIL PROTECTED]> wrote:
> > > Thanks Kent,
> > >
> > > My first time dealing with Python and unicode vs 'normal' strings, I
> > > do look forward to Python 3.0... at the moment I'm just trying to
> > > understand how to use UTF-16.
> > >
> > > Basically, I have data which is coming straight from struct.unpack()
> > > and it's a UTF-16 string, and I'm just trying to get my head around
> > > dealing with the data coming in from struct, and putting my data out
> > > through struct.
> > >
> > > It doesn't help overly that struct considers all strings to consist of
> > > one byte per char, whereas UTF-16 is two. And I was having trouble as
> > > to how to write UTF-16 stuff out properly.
> > >
> > > But, if I understand it correctly, I could use
> > >
> > > j = u"Hi Bob!"  # some unicode string
> > > out = j.encode("UTF-16")
> > > pattern = "%ds" % len(out)
> > > struct.pack(pattern, out)
> > >
> > > without too much difficulty.
> > >
> > > Regards,
> > >
> > > Liam Clarke
> > >
> > > On 10/3/05, Kent Johnson <[EMAIL PROTECTED]> wrote:
> > > > Liam Clarke wrote:
> > > > > What's the difference between
> > > > >
> > > > > x = "Hi"
> > > > > y = x.encode("UTF-16")
> > > > >
> > > > > and
> > > > >
> > > > > y = unicode(x, "UTF-16")
> > > >
> > > > They are more-or-less opposite.
> > > >
> > > > encode() converts away from unicode. (Think of unicode as the 'normal'
> > > > format, anything else is 'encoded'.) Normally it is used on a unicode
> > > > string, not a byte string. It means, "interpret this string as unicode,
> > > > then convert it to an encoded byte string using the given encoding".
> > > >
> > > > When you encode a non-unicode string (like "Hi"), the string is first
> > > > converted to unicode (decoded) using sys.getdefaultencoding(), then
> > > > encoded using the supplied encoding.
> > > > So
> > > >   'Hi'.encode('utf-16')
> > > > is the same as
> > > >   'Hi'.decode(sys.getdefaultencoding()).encode('utf-16')
> > > >
> > > > In either case, the result is a string in UTF-16 encoding:
> > > > >>> 'Hi'.encode('UTF-16')
> > > > '\xff\xfeH\x00i\x00'
> > > > >>> 'Hi'.decode(sys.getdefaultencoding()).encode('utf-16')
> > > > '\xff\xfeH\x00i\x00'
> > > >
> > > > Note that the utf-16 codec puts a byte-order mark ('\xff\xfe') in the
> > > > output; then 'H' becomes 'H\x00' and 'i' becomes 'i\x00'.
> > > >
> > > > Because sys.getdefaultencoding() is used to convert to unicode, you
> > > > will get an error if the original string cannot be decoded with this
> > > > encoding:
> > > >
> > > > >>> '\xe3'.encode('utf-16')
> > > > Traceback (most recent call last):
> > > >   File "<stdin>", line 1, in ?
> > > > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0:
> > > > ordinal not in range(128)
> > > >
> > > >
> > > > What about unicode('Hi', 'utf-16')? This doesn't do anything useful:
> > > > >>> unicode('Hi', 'UTF-16')
> > > > u'\u6948'
> > > >
> > > > unicode('Hi', 'utf-16') means the same as 'Hi'.decode('utf-16'). In
> > > > this case we are saying, "Interpret this string as an encoded byte
> > > > string in the given encoding, and convert it to a unicode string."
> > > > Since 'Hi' is not, in fact, a byte string encoded in UTF-16, the
> > > > results are not very useful.
> > > >
> > > >
> > > > To summarize:
> > > > If you have an encoded byte string and you want a unicode string, use
> > > > str.decode() or unicode()
> > > >
> > > > If you have a unicode string and you want an encoded byte string, use
> > > > unicode.encode().
> > > >
> > > > If you are using str.encode() you probably haven't thought through your
> > > > problem completely and you will likely get UnicodeDecodeErrors when you
> > > > have non-ASCII data.
> > > >
> > > >
> > > > If you are writing a unicode-aware application, a good strategy is to
> > > > keep all strings internally as unicode and to convert to and from the
> > > > required encodings at the boundaries.
> > > >
> > > > Kent
> > > >
> > > > _______________________________________________
> > > > Tutor maillist - Tutor@python.org
> > > > http://mail.python.org/mailman/listinfo/tutor
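
For what it's worth, here is a minimal sketch of the "wall" idea combined with Kent's advice to convert only at the boundaries. It is written for Python 3 as an assumption (there, str is the unicode type and bytes is the encoded form; the thread itself is Python 2). The helper names to_text, unpack_text and pack_text are hypothetical, and so is the ISO-8859-1 fallback for incoming byte strings, which is borrowed from the pyid3lib remark in the thread. It uses "utf-16-le" rather than "utf-16" on the assumption that the payload is little-endian with no byte-order mark, as the x.payload dump suggests; the plain utf-16 codec prepends a BOM when encoding.

```python
import struct

# Assumption: payloads are little-endian UTF-16 without a byte-order mark.
PAYLOAD_CODEC = "utf-16-le"


def to_text(value, fallback="iso-8859-1"):
    """The wall: anything passing this point is text (unicode).

    Byte strings are decoded with a fallback codec (hypothetical choice).
    """
    if isinstance(value, bytes):
        return value.decode(fallback)
    return value


def unpack_text(raw):
    """Boundary in: raw bytes read from the file become text."""
    (payload,) = struct.unpack("%ds" % len(raw), raw)
    return payload.decode(PAYLOAD_CODEC)


def pack_text(value):
    """Boundary out: text becomes encoded bytes, sized for struct."""
    out = to_text(value).encode(PAYLOAD_CODEC)
    return struct.pack("%ds" % len(out), out)


raw = ":iPod_Control:Music".encode(PAYLOAD_CODEC)
print(unpack_text(raw))        # text on the inside
print(pack_text("Hi"))         # encoded bytes on the way out
print(pack_text(b"Caf\xe9"))   # byte string decoded as ISO-8859-1 first
```

Everything between unpack_text() and pack_text() only ever sees text, so the implicit sys.getdefaultencoding() step that Kent describes never gets a chance to run.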