Re: [Tutor] Struct and UTF-16

Liam Clarke Mon, 03 Oct 2005 02:51:44 -0700

Hi,

If I can just beat this horse one more time, can I just get
confirmation that I'm going about this the right way?


I have a base object, which reads the unicode string as bytes like so
(this leaves out all but the important bits):

import struct

class Mhod:
    def __init__(self, f):
        # unpack() returns a tuple, so take its only element
        self.payload = struct.unpack("36s", f.read(36))[0]

Which in turn, is utilised in a Song object, which works like this -

class Song(object):  # new-style class, so the property setter works
    def __init__(self, mhod):
        self.mhod = mhod
    def gLoc(self):
        # Decode the raw payload bytes on every read
        return unicode(self.mhod.payload, "UTF-16")
    def sLoc(self, value):
        #Need to coerce data into UTF-16 here
        self.mhod.payload = value.encode("UTF-16")

    location = property(gLoc, sLoc)

If I were to do a

>>> x = Mhod(open("test", "rb"))
>>> y = Song(x)

I get

>>> x.payload
':\x00i\x00P\x00o\x00d\x00_\x00C\x00o\x00n\x00t\x00r\x00o\x00l
\x00:\x00M\x00u\x00s\x00i\x00c\x00:\x00F\x004\x004\x00:\x00L
\x00W\x00B\x00R\x00.\x00m\x00p\x003\x00' #Line breaks added.

>>> y.location
u':iPod_Control:Music:F44:LWBR.mp3'

Which is what I'm after. What I'm struggling with is coercing the
string that's being passed to sLoc() into UTF-16, and actually
creating any form of unicode string at all without using

>>> foo = u'Monkies!'

Which I'm sure is going to be in UTF-8, just to spite me.

So far, the best I've come up with is -

>>> foo = unicode("Hi Bob!".encode("UTF-16"), "UTF-16")

Which, as you mention above, is likely to cause me errors. And
apparently "Hi Bob!" is an 8-bit string encoded in UTF-16...
*sigh* I suppose I could go the XP route and expect any further users
to just deal with it and pass in a UTF-16 string, but there's got to
be a simple way to handle it, and I'm not having too much luck with
this.
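
For what it's worth, a u'...' literal is not "in" UTF-8 or any other
encoding from the program's point of view: it is a unicode object, and
bytes only appear when .encode() is called. One way the setter's
coercion could look is sketched below. The helper name to_utf16 is
made up for illustration, and two things are assumed rather than
known: that any byte string passed in is plain ASCII, and that the
file wants little-endian UTF-16 with no byte-order mark (which is what
the payload dump above looks like). The u''/b'' spelling is used so
the sketch runs on Python 2.6 and up as well as Python 3.3 and up.

```python
def to_utf16(value):
    """Return value as UTF-16-LE bytes (hypothetical helper, not from the thread)."""
    if isinstance(value, bytes):
        # Assumption: byte strings handed in by callers are plain ASCII.
        value = value.decode("ascii")
    # Naming the byte order means no byte-order mark is written.
    return value.encode("utf-16-le")

encoded = to_utf16(u"Hi Bob!")
same_from_bytes = to_utf16(b"Hi Bob!")
print(repr(encoded))
```

With something like this, sLoc() could accept either kind of string
and always store two bytes per character, matching the existing
payload.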

I've been working from the below document, if anyone can recommend
something further, I'd much appreciate it.

http://www.amk.ca/python/howto/unicode

Regards,

Liam Clarke

On 10/3/05, Liam Clarke <[EMAIL PROTECTED]> wrote:
> Thanks Kent,
>
> My first time dealing with Python and unicode vs 'normal' strings, I
> do look forward to Python 3.0... at the moment I'm just trying to
> understand how to use UTF-16.
>
> Basically, I have data which is coming straight from struct.unpack()
> and it's an UTF-16 string, and I'm just trying to get my head around
> dealing with the data coming in from struct, and putting my data out
> through struct.
>
> It doesn't help overly that struct considers all strings to consist of
> one byte per char, whereas UTF-16 is two. And I was having trouble as
> to how to write UTF-16 stuff out properly.
>
> But, if I understand it correctly, I could use
>
> j = u"Hi Bob!"  # some unicode string
> out = j.encode("UTF-16")
> pattern = "%ds" % len(out)
> struct.pack(pattern, out)
>
> without too much difficulty.
>
> Regards,
>
> Liam Clarke
>
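
The pack pattern quoted above can be checked with a round trip. One
caveat: the plain "UTF-16" codec prepends a byte-order mark, which
adds two bytes to the packed length. The sketch below uses
"utf-16-le" instead, on the assumption that the file format stores
little-endian data with no mark; the sample text is arbitrary.

```python
import struct

text = u"Hi Bob!"
out = text.encode("utf-16-le")       # 7 characters -> 14 bytes, no BOM
pattern = "%ds" % len(out)           # builds the format string "14s"
packed = struct.pack(pattern, out)

# Reading it back needs the same pattern, i.e. the same byte length.
restored = struct.unpack(pattern, packed)[0].decode("utf-16-le")
print(repr(restored))
```

The length in the pattern is a byte count, not a character count,
which is why it is computed from the encoded string.
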
> On 10/3/05, Kent Johnson <[EMAIL PROTECTED]> wrote:
> > Liam Clarke wrote:
> > > What's the difference between
> > >
> > > x = "Hi"
> > > y = x.encode("UTF-16")
> > >
> > > and
> > >
> > > y = unicode(x, "UTF-16")
> >
> > They are more-or-less opposite.
> >
> > encode() converts away from unicode. (Think of unicode as the 'normal'
> > format, anything else is 'encoded'.) Normally it is used on a unicode
> > string, not a byte string. It means, "interpret this string as unicode,
> > then convert it to an encoded byte string using the given encoding".
> >
> > When you encode a non-unicode string (like "Hi"), the string is first 
> > converted to unicode (decoded) using sys.getdefaultencoding(), then encoded 
> > using the supplied encoding. So
> > 'Hi'.encode('utf-16')
> > is the same as
> > 'Hi'.decode(sys.getdefaultencoding()).encode('utf-16')
> >
> > In either case, the result is a string in UTF-16 encoding:
> >  >>> 'Hi'.encode('UTF-16')
> > '\xff\xfeH\x00i\x00'
> >  >>> 'Hi'.decode(sys.getdefaultencoding()).encode('utf-16')
> > '\xff\xfeH\x00i\x00'
> >
> > Note that the utf-16 codec puts a byte-order mark ('\xff\xfe') in the 
> > output; then 'H' becomes 'H\x00' and 'i' becomes 'i\x00'.
> >
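
To make the byte-order mark point concrete: the plain "utf-16" codec
writes the mark in the machine's native byte order, while the variants
that name a byte order write none. The '\xff\xfe' shown above is the
little-endian form, so a big-endian machine would print something
different for the first line. A small sketch:

```python
import codecs

with_bom = u"Hi".encode("utf-16")        # mark first, native byte order
without_bom = u"Hi".encode("utf-16-le")  # byte order named, so no mark

print(repr(with_bom))
print(repr(without_bom))
# codecs.BOM_UTF16 is the mark for this machine's native byte order.
print(with_bom.startswith(codecs.BOM_UTF16))
```
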
> > Because sys.getdefaultencoding() is used to convert to unicode, you will 
> > get an error if the original string cannot be decoded with this encoding:
> >
> >  >>> '\xe3'.encode('utf-16')
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in ?
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: 
> > ordinal not in range(128)
> >
> >
> > What about unicode('Hi', 'utf-16')? This doesn't do anything useful:
> >  >>> unicode('Hi', 'UTF-16')
> > u'\u6948'
> >
> > unicode('Hi', 'utf-16') means the same as 'Hi'.decode('utf-16'). In this 
> > case we are saying, "Interpret this string as an encoded byte string in the 
> > given encoding, and convert it to a unicode string." Since 'Hi' is not, in 
> > fact, a byte string encoded in UTF-16, the results are not very useful.
> >
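
Where u'\u6948' comes from: the two bytes 'H' (0x48) and 'i' (0x69)
get read as a single 16-bit unit. Read low byte first, that unit is
0x6948. The sketch names the byte order explicitly so the result does
not depend on the machine it runs on.

```python
raw = b"Hi"                        # two bytes: 0x48, 0x69
wrong = raw.decode("utf-16-le")    # treated as ONE 16-bit code unit
code_point = ord(wrong)

print(repr(wrong))
print(hex(code_point))
```
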
> >
> > To summarize:
> > If you have an encoded byte string and you want a unicode string, use 
> > str.decode() or unicode()
> >
> > If you have a unicode string and you want an encoded byte string, use 
> > unicode.encode().
> >
> > If you are using str.encode() you probably haven't thought through your
> > problem completely and you will likely get UnicodeDecodeErrors when you
> > have non-ASCII data.
> >
> >
> > If you are writing a unicode-aware application, a good strategy is to keep 
> > all strings internally as unicode and to convert to and from the required 
> > encodings at the boundaries.
> >
> > Kent
> >
> > _______________________________________________
> > Tutor maillist  -  [email protected]
> > http://mail.python.org/mailman/listinfo/tutor
> >
>