Re: unicode and hashlib
Scott David Daniels wrote: Bryan Olson wrote: ... I think that's good behavior, except that the error message is likely to send beginners to look up the obscure buffer interface before they find they just need mystring.decode('utf8') or bytes(mystring, 'utf8'). Oops, careful here (I made this mistake once in this thread as well). You _decode_ from unicode to bytes. The code you quoted doesn't run. Doh! I even tested it with .encode(), then wrote it wrong. Just in case anyone Googles the error message and lands here: If you are working with a Python str (string) object and get, TypeError: object supporting the buffer API required Then you probably want to encode the string to a bytes object, and UTF-8 is likely the encoding of choice, as in: mystring.encode('utf8') or bytes(mystring, 'utf8') Thanks for the correction. -- --Bryan -- http://mail.python.org/mailman/listinfo/python-list
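The fix described in the post can be sketched in Python 3 as follows (the exact TypeError message varies between Python 3 versions; 3.0 used the "buffer API" wording quoted above):

```python
import hashlib

mystring = "André"

# In Python 3, feeding a str straight to hashlib raises a TypeError
# (in 3.0 the message was "object supporting the buffer API required").
try:
    hashlib.md5(mystring)
    raised = False
except TypeError:
    raised = True

# Encode to bytes first; both spellings below produce the same bytes.
digest = hashlib.md5(mystring.encode('utf8')).hexdigest()
```

Note that `bytes(mystring, 'utf8')` and `mystring.encode('utf8')` are equivalent here, as the post says.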
Re: unicode and hashlib
Jeff H wrote: [...] So once I have character strings transformed internally to unicode objects, I should encode them in 'utf-8' before attempting to do things that guess at the proper way to encode them for further processing.(i.e. hashlib) It looks like hashlib in Python 3 will not even attempt to digest a unicode object. Trying to hash 'abcdefg' in Python 3.0rc3 I get: TypeError: object supporting the buffer API required I think that's good behavior, except that the error message is likely to send beginners to look up the obscure buffer interface before they find they just need mystring.decode('utf8') or bytes(mystring, 'utf8'). a='André' b=unicode(a,'cp1252') b u'Andr\xc3\xa9' hashlib.md5(b.encode('utf-8')).hexdigest() 'b4e5418a36bc4badfc47deb657a2b50c' Incidentally, MD5 has fallen and SHA-1 is falling. Python's hashlib also includes the stronger SHA-2 family. -- --Bryan -- http://mail.python.org/mailman/listinfo/python-list
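Bryan's SHA-2 suggestion uses the same hashlib interface as MD5; a minimal sketch (the string literal below is the mojibake unicode value from the session quoted above, written in Python 3 syntax):

```python
import hashlib

b = 'Andr\xc3\xa9'             # the unicode value from the session above
utf8_bytes = b.encode('utf-8')  # fix a byte representation before hashing

md5_digest = hashlib.md5(utf8_bytes).hexdigest()

# The stronger SHA-2 family is available under the same interface:
sha256_digest = hashlib.sha256(utf8_bytes).hexdigest()
```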
Re: unicode and hashlib
Bryan Olson wrote: ... I think that's good behavior, except that the error message is likely to send beginners to look up the obscure buffer interface before they find they just need mystring.decode('utf8') or bytes(mystring, 'utf8'). Oops, careful here (I made this mistake once in this thread as well). You _decode_ from unicode to bytes. The code you quoted doesn't run. This does: a = 'Andr\xe9' b = unicode(a, 'cp1252') b.encode('utf-8') 'Andr\xc3\xa9' b.decode('utf-8') Traceback (most recent call last): File "<pyshell#19>", line 1, in <module> b.decode('utf-8') File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode return codecs.utf_8_decode(input, errors, True) UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 4: ordinal not in range(128) hashlib.md5(b.encode('utf-8')).hexdigest() '45f1deffb45a5f6c2380a4cee9b3e452' hashlib.md5(b.decode('utf-8')).hexdigest() Traceback (most recent call last): File "<pyshell#21>", line 1, in <module> hashlib.md5(b.decode('utf-8')).hexdigest() File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode return codecs.utf_8_decode(input, errors, True) UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 4: ordinal not in range(128) Incidentally, MD5 has fallen and SHA-1 is falling. Python's hashlib also includes the stronger SHA-2 family. Well, the choice of hash always depends on the app. -Scott -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode and hashlib
Jeff H wrote: ... decode vs encode You decode from one character set to a unicode object You encode from a unicode object to a specified character set Pretty close: encode: Think of characters as conceptual -- you encode a character string into a bunch of bytes (unicode -> bytes) in order to send the characters along a wire, into an e-mail, or put in a database. decode: You got the bytes from the wire, database, Morse code, whatever. You decode the byte stream into characters, and now you really have characters. Thinking of it this way makes it clear which name is which, unless (as I did once in this thread) you switch opposite concepts carelessly. Characters are content (understood by humans), bytes are gibberish carried by hardware which likes that kind of thing. You encode a message into nonsense for your carrier to carry to your recipient, and the recipient decodes the nonsense back into the message. --Scott David Daniels [EMAIL PROTECTED] -- http://mail.python.org/mailman/listinfo/python-list
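Scott's wire metaphor can be put directly in code; a minimal Python 3 sketch:

```python
message = "café"                 # characters: content, understood by humans
wire = message.encode('utf-8')   # encode: characters -> bytes for the wire
# 'é' (U+00E9) becomes the two UTF-8 bytes 0xC3 0xA9
received = wire.decode('utf-8')  # decode: bytes back into characters
```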
Re: unicode and hashlib
On Nov 28, 1:24 pm, Scott David Daniels [EMAIL PROTECTED] wrote: Jeff H wrote: hashlib.md5 does not appear to like unicode, UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in position 1650: ordinal not in range(128) After googling, I've found BDFL and others on Py3K talking about the problems of hashing non-bytes (i.e. buffers) ... Unicode is characters, not a character encoding. You could hash on a utf-8 encoding of the Unicode. So what is the canonical way to hash unicode? * convert unicode to local * hash in current local ??? There is no _the_ way to hash Unicode, any more than there is _the_ way to hash vectors. You need to convert the abstract entity into something concrete with a well-defined representation in bytes, and hash that. Is this just a problem for md5 hashes that I would not encounter using a different method? i.e. Should I just use the built-in hash function? No, it is a definitional problem. Perhaps you could explain how you want to use the hash. If the internal hash is acceptable (e.g. for grouping in dictionaries within a single run), use that. If you intend to store and compare on the same system, say that. If you want cross- platform execution of your code to produce the same hashes, say that. A hash is a means to an end, and it is hard to give advice without knowing the goal. I am checking for changes to large text objects stored in a database against outside sources. So the hash needs to be reproducible/stable. --Scott David Daniels [EMAIL PROTECTED] -- http://mail.python.org/mailman/listinfo/python-list
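For Jeff's stated goal, a reproducible/stable digest for change detection, pinning the byte representation gives a cross-platform hash. The built-in hash() would not work for this: beyond being Python-specific, string hashing in modern Python 3 is randomized per process. A sketch:

```python
import hashlib

def stable_digest(text):
    # Fix the byte representation (UTF-8 here) so the digest is the
    # same across runs, platforms, and Python versions.
    return hashlib.md5(text.encode('utf-8')).hexdigest()
```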
Re: unicode and hashlib
On Nov 28, 2:03 pm, Terry Reedy [EMAIL PROTECTED] wrote: Jeff H wrote: hashlib.md5 does not appear to like unicode, UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in position 1650: ordinal not in range(128) It is the (default) ascii encoder that does not like non-ascii chars. I suspect that if you encode to bytes first with an encoder that does work (latin-???), md5 will be happy. Reports like this should include Python version. After googling, I've found BDFL and others on Py3K talking about the problems of hashing non-bytes (i.e. buffers) http://www.mail-archive.com/[EMAIL PROTECTED]/msg09824.html So what is the canonical way to hash unicode? * convert unicode to local * hash in current local ??? but what if local has ordinals outside of 128? Is this just a problem for md5 hashes that I would not encounter using a different method? i.e. Should I just use the built-in hash function? -- http://mail.python.org/mailman/listinfo/python-list Python v2.5.2 -- however, this is not really a bug report because your analysis is correct. I am converting cp1252 strings to unicode before I persist them in a database. I am looking for advice/direction/ wisdom on how to sling these strings <g> -Jeff -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode and hashlib
On Nov 29, 8:27 am, Jeff H [EMAIL PROTECTED] wrote: On Nov 28, 2:03 pm, Terry Reedy [EMAIL PROTECTED] wrote: Jeff H wrote: hashlib.md5 does not appear to like unicode, UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in position 1650: ordinal not in range(128) It is the (default) ascii encoder that does not like non-ascii chars. I suspect that if you encode to bytes first with an encoder that does work (latin-???), md5 will be happy. Reports like this should include Python version. After googling, I've found BDFL and others on Py3K talking about the problems of hashing non-bytes (i.e. buffers) http://www.mail-archive.com/[EMAIL PROTECTED]/msg09824.html So what is the canonical way to hash unicode? * convert unicode to local * hash in current local ??? but what if local has ordinals outside of 128? Is this just a problem for md5 hashes that I would not encounter using a different method? i.e. Should I just use the built-in hash function? -- http://mail.python.org/mailman/listinfo/python-list Python v2.5.2 -- however, this is not really a bug report because your analysis is correct. I am converting cp1252 strings to unicode before I persist them in a database. I am looking for advice/direction/ wisdom on how to sling these strings <g> -Jeff Actually, what I am surprised by, is the fact that hashlib cares at all about the encoding. An md5 hash can be produced for an .iso file, which means it can handle bytes, so why does it care what it is being fed, as long as there are bytes? I would have assumed that it would take whatever was fed to it, view it as a byte array, and then hash it. You can read a binary file and hash it print md5.new(file('foo.iso').read()).hexdigest() What do I need to do to tell hashlib not to try and decode, just treat the data as binary? -- http://mail.python.org/mailman/listinfo/python-list
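Jeff's one-liner uses the Python 2 md5 module; a Python 3 sketch of the same idea, reading in binary mode and chunking so large files need not fit in memory (the path is illustrative):

```python
import hashlib

def file_digest(path):
    # Open in binary mode ('rb') so hashlib sees the stored bytes
    # unchanged -- no decoding, no newline translation.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()
```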
Re: unicode and hashlib
On Sat, 29 Nov 2008 06:51:33 -0800, Jeff H wrote: Actually, what I am surprised by, is the fact that hashlib cares at all about the encoding. A md5 hash can be produced for an .iso file which means it can handle bytes, why does it care what it is being fed, as long as there are bytes. But you don't have bytes, you have a `unicode` object. The internal byte representation is implementation specific and not your business. I would have assumed that it would take whatever was feed to it and view it as a byte array and then hash it. How? There is no (sane) way to get at the internal byte representation. And that byte representation might contain things like pointers to memory locations that are different for two `unicode` objects which compare equal, so you would get different hash values for objects that otherwise look the same from the Python level. Not very useful. You can read a binary file and hash it print md5.new(file('foo.iso').read()).hexdigest() What do I need to do to tell hashlib not to try and decode, just treat the data as binary? It's not about *de*coding, it is about *en*coding your `unicode` object so you get bytes to feed to the MD5 algorithm. Ciao, Marc 'BlackJack' Rintsch -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode and hashlib
Jeff H wrote: ... Actually, what I am surprised by, is the fact that hashlib cares at all about the encoding. A md5 hash can be produced for an .iso file which means it can handle bytes, why does it care what it is being fed, as long as there are bytes. I would have assumed that it would take whatever was feed to it and view it as a byte array and then hash it. You can read a binary file and hash it print md5.new(file('foo.iso').read()).hexdigest() What do I need to do to tell hashlib not to try and decode, just treat the data as binary? If you do not care about portability or reproducibility, you can just go with whatever bytes you can get at most easily. To take your example: with open('foo.iso', 'r') as src: print hashlib.md5(src.read()).hexdigest() will print different things on Linux and windows. with open('foo.iso', 'rb') as src: print hashlib.md5(src.read()).hexdigest() should print the same thing on both; hashing does not magically allow you to stop thinking. If you now, and for all time, decide that the only source you will take is cp1252, perhaps you should decode to cp1252 before hashing. Even if you have Unicode, you can have alternative Unicode expressions of the same characters, so you may want to convert the Unicode to a Normalized Form of Unicode before decoding to bytes. The major candidates for that are NFC, NFD, NFKC, and NFKD, see: http://unicode.org/reports/tr15/ Again, once you have chosen your normalized form (or decided to skip the normalization step), I'd suggest going to UTF-8 (which is pretty unambiguous) and then hash the result. The problem with another choice is that UTF-16 comes in two flavors (UTF-16BE and UTF-16LE); UTF-32 also has two flavors (UTF-32BE and UTF-32LE), and whatever your current Python, you may well switch between UTF-16 and UTF-32 internally at some point as you do regular upgrades (or BE vs. LE if you switch CPUs). --Scott David Daniels [EMAIL PROTECTED] -- http://mail.python.org/mailman/listinfo/python-list
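The normalization step Scott describes is available in the standard library's unicodedata module; a sketch combining it with the encode-then-hash recipe (NFC chosen here as one of the forms he lists):

```python
import hashlib
import unicodedata

def normalized_digest(text, form='NFC'):
    # Normalize first so canonically-equivalent spellings hash alike,
    # then pin the byte representation with UTF-8.
    data = unicodedata.normalize(form, text).encode('utf-8')
    return hashlib.md5(data).hexdigest()

composed = '\u00e9'     # 'é' as a single code point
decomposed = 'e\u0301'  # 'e' plus a combining acute accent
```

Without the normalize() call, the two spellings above would produce different digests even though they display identically.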
Re: unicode and hashlib
Scott David Daniels wrote: ... If you now, and for all time, decide that the only source you will take is cp1252, perhaps you should decode to cp1252 before hashing. Of course my dyslexia sticks out here as I get encode and decode exactly backwards -- Marc 'BlackJack' Rintsch has it right. Characters (a concept) are encoded to a byte format (representation). Bytes (a precise representation) are decoded to characters (a format with semantics). --Scott David Daniels [EMAIL PROTECTED] -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode and hashlib
On Nov 29, 12:23 pm, Scott David Daniels [EMAIL PROTECTED] wrote: Scott David Daniels wrote: ... If you now, and for all time, decide that the only source you will take is cp1252, perhaps you should decode to cp1252 before hashing. Of course my dyslexia sticks out here as I get encode and decode exactly backwards -- Marc 'BlackJack' Rintsch has it right. Characters (a concept) are encoded to a byte format (representation). Bytes (a precise representation) are decoded to characters (a format with semantics). --Scott David Daniels [EMAIL PROTECTED] Ok, so the fog lifts, thanks to Scott and Marc, and I begin to realize that the hashlib was trying to encode (not decode) my unicode object as 'ascii' (my default encoding) and since that resulted in characters > 128 -- shhh'boom. So once I have character strings transformed internally to unicode objects, I should encode them in 'utf-8' before attempting to do things that guess at the proper way to encode them for further processing.(i.e. hashlib) a='André' b=unicode(a,'cp1252') b u'Andr\xc3\xa9' hashlib.md5(b.encode('utf-8')).hexdigest() 'b4e5418a36bc4badfc47deb657a2b50c' Scott then points out that utf-8 is probably superior (for use within the code I control) to utf-16 and utf-32, which both have two variants, and which one is used can depend on installed software and/or processors. utf-8, unlike -16/-32, stays reliable and reproducible irrespective of software or hardware. decode vs encode You decode from one character set to a unicode object You encode from a unicode object to a specified character set Please correct me if you see something wrong and thank you for your advice and direction. u'unicordial-ly yours. ;)' Jeff -- http://mail.python.org/mailman/listinfo/python-list
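Jeff's decode/encode summary, written out as a round trip (Python 3 syntax; cp1252 stands in for the outside source, as in his setup):

```python
raw = b'Andr\xe9'          # cp1252 bytes from an outside source
u = raw.decode('cp1252')   # decode: bytes in one character set -> unicode
out = u.encode('utf-8')    # encode: unicode -> a specified character set
# 0xE9 in cp1252 is 'é' (U+00E9), which UTF-8 stores as 0xC3 0xA9
```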
unicode and hashlib
hashlib.md5 does not appear to like unicode, UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in position 1650: ordinal not in range(128) After googling, I've found BDFL and others on Py3K talking about the problems of hashing non-bytes (i.e. buffers) http://www.mail-archive.com/[EMAIL PROTECTED]/msg09824.html So what is the canonical way to hash unicode? * convert unicode to local * hash in current local ??? but what if local has ordinals outside of 128? Is this just a problem for md5 hashes that I would not encounter using a different method? i.e. Should I just use the built-in hash function? -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode and hashlib
Jeff H wrote: hashlib.md5 does not appear to like unicode, UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in position 1650: ordinal not in range(128) After googling, I've found BDFL and others on Py3K talking about the problems of hashing non-bytes (i.e. buffers) ... Unicode is characters, not a character encoding. You could hash on a utf-8 encoding of the Unicode. So what is the canonical way to hash unicode? * convert unicode to local * hash in current local ??? There is no _the_ way to hash Unicode, any more than there is _the_ way to hash vectors. You need to convert the abstract entity into something concrete with a well-defined representation in bytes, and hash that. Is this just a problem for md5 hashes that I would not encounter using a different method? i.e. Should I just use the built-in hash function? No, it is a definitional problem. Perhaps you could explain how you want to use the hash. If the internal hash is acceptable (e.g. for grouping in dictionaries within a single run), use that. If you intend to store and compare on the same system, say that. If you want cross- platform execution of your code to produce the same hashes, say that. A hash is a means to an end, and it is hard to give advice without knowing the goal. --Scott David Daniels [EMAIL PROTECTED] -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode and hashlib
Jeff H wrote: hashlib.md5 does not appear to like unicode, UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in position 1650: ordinal not in range(128) After googling, I've found BDFL and others on Py3K talking about the problems of hashing non-bytes (i.e. buffers) http://www.mail-archive.com/[EMAIL PROTECTED]/msg09824.html So what is the canonical way to hash unicode? * convert unicode to local * hash in current local ??? but what if local has ordinals outside of 128? Is this just a problem for md5 hashes that I would not encounter using a different method? i.e. Should I just use the built-in hash function? It can handle bytestrings, but if you give it unicode it performs a default encoding to ASCII, but that fails if there's a codepoint >= U+0080. Personally, I'd recommend encoding unicode to UTF-8. -- http://mail.python.org/mailman/listinfo/python-list
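The failure mode described here, ASCII breaking on any code point at or above U+0080 while an explicit UTF-8 encode succeeds, can be sketched as:

```python
import hashlib

text = 'caf\u00e9'  # contains a code point >= U+0080

# The ASCII path fails for such text...
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# ...while an explicit UTF-8 encode gives hashlib the bytes it needs.
digest = hashlib.md5(text.encode('utf-8')).hexdigest()
```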
Re: unicode and hashlib
Jeff H wrote: hashlib.md5 does not appear to like unicode, UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in position 1650: ordinal not in range(128) It is the (default) ascii encoder that does not like non-ascii chars. I suspect that if you encode to bytes first with an encoder that does work (latin-???), md5 will be happy. Reports like this should include Python version. After googling, I've found BDFL and others on Py3K talking about the problems of hashing non-bytes (i.e. buffers) http://www.mail-archive.com/[EMAIL PROTECTED]/msg09824.html So what is the canonical way to hash unicode? * convert unicode to local * hash in current local ??? but what if local has ordinals outside of 128? Is this just a problem for md5 hashes that I would not encounter using a different method? i.e. Should I just use the built-in hash function? -- http://mail.python.org/mailman/listinfo/python-list -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode and hashlib
On 28 Nov, 21:03, Terry Reedy [EMAIL PROTECTED] wrote: It is the (default) ascii encoder that does not like non-ascii chars. I suspect that if you encode to bytes first with an encoder that does work (latin-???), md5 will be happy. I know that the Python roadmap answer to such questions might refer to Python 3.0 and its "strings are Unicode" feature, and having seen this mentioned a lot recently, I'm surprised that no-one has done so at the time of writing, but I do wonder whether good old Python 2.x wouldn't benefit from a more explicit error message in these situations. Since the introduction of Unicode in Python 1.6/2.0, I've always tried to make the distinction between what I call plain strings or byte strings and Unicode objects or character strings, and perhaps the UnicodeEncodeError message should be enhanced to say what is actually going on: that an attempt is being made to convert characters into byte values and that the chosen way of doing so (which often involves the default, ASCII encoding) cannot manage the job. Paul -- http://mail.python.org/mailman/listinfo/python-list