Re: unicode and hashlib

2008-12-02 Thread Bryan Olson

Scott David Daniels wrote:

Bryan Olson wrote:

... I think that's good behavior, except that the error message is likely
to send beginners to look up the obscure buffer interface before they 
find they just need mystring.decode('utf8') or bytes(mystring, 'utf8').


Oops, careful here (I made this mistake once in this thread as well). 
You _encode_ from unicode to bytes.  The code you quoted doesn't run.


Doh! I even tested it with .encode(), then wrote it wrong.

Just in case anyone Googles the error message and lands here: If you are 
working with a Python str (string) object and get:


  TypeError: object supporting the buffer API required

Then you probably want to encode the string to a bytes object, and 
UTF-8 is likely the encoding of choice, as in:


mystring.encode('utf8')

or

bytes(mystring, 'utf8')
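
For instance, a minimal sketch putting it together, assuming Python 3
(the string here is just illustrative):

    import hashlib

    mystring = 'Andr\xe9'                  # a str holding character data
    data = mystring.encode('utf8')         # encode: str -> bytes
    print(hashlib.md5(data).hexdigest())   # hashlib is happy with bytes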


Thanks for the correction.
--
--Bryan
--
http://mail.python.org/mailman/listinfo/python-list


Re: unicode and hashlib

2008-12-01 Thread Bryan Olson

Jeff H wrote:

[...] So once I have character strings transformed
internally to unicode objects, I should encode them in 'utf-8' before
attempting to do things that guess at the proper way to encode them
for further processing (i.e. hashlib).


It looks like hashlib in Python 3 will not even attempt to digest a 
unicode object. Trying to hash 'abcdefg' in Python 3.0rc3 I get:


  TypeError: object supporting the buffer API required

I think that's good behavior, except that the error message is likely to 
send beginners to look up the obscure buffer interface before they find 
they just need mystring.decode('utf8') or bytes(mystring, 'utf8').



>>> a='André'
>>> b=unicode(a,'cp1252')
>>> b
u'Andr\xc3\xa9'
>>> hashlib.md5(b.encode('utf-8')).hexdigest()
'b4e5418a36bc4badfc47deb657a2b50c'


Incidentally, MD5 has fallen and SHA-1 is falling. Python's hashlib also 
includes the stronger SHA-2 family.
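
A sketch of reaching for one of those instead, assuming Python 3:

    import hashlib

    # sha256 is used exactly like md5, only a stronger digest
    print(hashlib.sha256('Andr\xe9'.encode('utf-8')).hexdigest())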



--
--Bryan
--
http://mail.python.org/mailman/listinfo/python-list


Re: unicode and hashlib

2008-12-01 Thread Scott David Daniels

Bryan Olson wrote:

... I think that's good behavior, except that the error message is likely
to send beginners to look up the obscure buffer interface before they find 
they just need mystring.decode('utf8') or bytes(mystring, 'utf8').
Oops, careful here (I made this mistake once in this thread as well). 
You _encode_ from unicode to bytes.  The code you quoted doesn't run.

This does:

>>> a = 'Andr\xe9'
>>> b = unicode(a, 'cp1252')
>>> b.encode('utf-8')
'Andr\xc3\xa9'
>>> b.decode('utf-8')
Traceback (most recent call last):
  File "<pyshell#19>", line 1, in <module>
    b.decode('utf-8')
  File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in 
position 4: ordinal not in range(128)

>>> hashlib.md5(b.encode('utf-8')).hexdigest()
'45f1deffb45a5f6c2380a4cee9b3e452'
>>> hashlib.md5(b.decode('utf-8')).hexdigest()
Traceback (most recent call last):
  File "<pyshell#21>", line 1, in <module>
    hashlib.md5(b.decode('utf-8')).hexdigest()
  File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in 
position 4: ordinal not in range(128)



Incidentally, MD5 has fallen and SHA-1 is falling. Python's hashlib also 
includes the stronger SHA-2 family.


Well, the choice of hash always depends on the app.


-Scott
--
http://mail.python.org/mailman/listinfo/python-list


Re: unicode and hashlib

2008-11-30 Thread Scott David Daniels

Jeff H wrote:

...

 decode vs encode

You decode from one character set to a unicode object
You encode from a unicode object to a specified character set


Pretty close:

encode:
Think of characters as conceptual -- you encode a character
string into a bunch of bytes (unicode -> bytes) in order to send
the characters along a wire, into an e-mail, or put in a database.

decode:
You got the bytes from the wire, database, Morse code, whatever.
You decode the byte stream into characters, and now you really have
characters.  Thinking of it this way makes it clear which name is
which, unless (as I did once in this thread) you switch opposite
concepts carelessly.


Characters are content (understood by humans), bytes are gibberish
carried by hardware which likes that kind of thing.  You encode a
message into nonsense for your carrier to carry to your recipient,
and the recipient decodes the nonsense back into the message.
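
A quick round trip showing which direction each name goes (a sketch,
assuming Python 2):

    message = u'Andr\xe9'            # characters (a unicode object)
    wire = message.encode('utf-8')   # encode: characters -> bytes
    assert wire == 'Andr\xc3\xa9'    # gibberish for the hardware
    back = wire.decode('utf-8')      # decode: bytes -> characters
    assert back == message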

--Scott David Daniels
[EMAIL PROTECTED]
--
http://mail.python.org/mailman/listinfo/python-list


Re: unicode and hashlib

2008-11-29 Thread Jeff H
On Nov 28, 1:24 pm, Scott David Daniels [EMAIL PROTECTED] wrote:
 Jeff H wrote:
  hashlib.md5 does not appear to like unicode,
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
  position 1650: ordinal not in range(128)

  After googling, I've found BDFL and others on Py3K talking about the
  problems of hashing non-bytes (i.e. buffers) ...

 Unicode is characters, not a character encoding.
 You could hash on a utf-8 encoding of the Unicode.

  So what is the canonical way to hash unicode?
   * convert unicode to local
   * hash in current local
  ???

 There is no _the_ way to hash Unicode, any more than
 there is _the_ way to hash vectors.  You need to
 convert the abstract entity to something concrete with
 a well-defined representation in bytes, and hash that.

  Is this just a problem for md5 hashes that I would not encounter using
  a different method?  i.e. Should I just use the built-in hash function?

 No, it is a definitional problem.  Perhaps you could explain how you
 want to use the hash.  If the internal hash is acceptable (e.g. for
 grouping in dictionaries within a single run), use that.  If you intend
 to store and compare on the same system, say that.  If you want cross-
 platform execution of your code to produce the same hashes, say that.
 A hash is a means to an end, and it is hard to give advice without
 knowing the goal.

I am checking for changes to large text objects stored in a database
against outside sources. So the hash needs to be reproducible/stable.
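
Something like this sketch would stay stable across runs and platforms
(assuming Python 2 and that the text is already a unicode object):

    import hashlib

    def stable_digest(text):
        # fix on one unambiguous byte representation before hashing
        return hashlib.sha256(text.encode('utf-8')).hexdigest()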

 --Scott David Daniels
 [EMAIL PROTECTED]

--
http://mail.python.org/mailman/listinfo/python-list


Re: unicode and hashlib

2008-11-29 Thread Jeff H
On Nov 28, 2:03 pm, Terry Reedy [EMAIL PROTECTED] wrote:
 Jeff H wrote:
  hashlib.md5 does not appear to like unicode,
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
  position 1650: ordinal not in range(128)

 It is the (default) ascii encoder that does not like non-ascii chars.
 I suspect that if you encode to bytes first with an encoder that does
 work (latin-???), md5 will be happy.

 Reports like this should include Python version.

  After googling, I've found BDFL and others on Py3K talking about the
  problems of hashing non-bytes (i.e. buffers)
  http://www.mail-archive.com/[EMAIL PROTECTED]/msg09824.html

  So what is the canonical way to hash unicode?
   * convert unicode to local
   * hash in current local
  ???
  but what if local has ordinals outside of 128?

  Is this just a problem for md5 hashes that I would not encounter using
  a different method?  i.e. Should I just use the built-in hash function?
  --
 http://mail.python.org/mailman/listinfo/python-list



Python v2.5.2 -- however, this is not really a bug report because your
analysis is correct. I am converting cp1252 strings to unicode before
I persist them in a database.  I am looking for advice/direction/
wisdom on how to sling these strings <g>

-Jeff
--
http://mail.python.org/mailman/listinfo/python-list


Re: unicode and hashlib

2008-11-29 Thread Jeff H
On Nov 29, 8:27 am, Jeff H [EMAIL PROTECTED] wrote:
 On Nov 28, 2:03 pm, Terry Reedy [EMAIL PROTECTED] wrote:



  Jeff H wrote:
   hashlib.md5 does not appear to like unicode,
     UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
   position 1650: ordinal not in range(128)

  It is the (default) ascii encoder that does not like non-ascii chars.
  I suspect that if you encode to bytes first with an encoder that does
  work (latin-???), md5 will be happy.

  Reports like this should include Python version.

   After googling, I've found BDFL and others on Py3K talking about the
   problems of hashing non-bytes (i.e. buffers)
   http://www.mail-archive.com/[EMAIL PROTECTED]/msg09824.html

   So what is the canonical way to hash unicode?
    * convert unicode to local
    * hash in current local
   ???
   but what if local has ordinals outside of 128?

   Is this just a problem for md5 hashes that I would not encounter using
   a different method?  i.e. Should I just use the built-in hash function?
   --
  http://mail.python.org/mailman/listinfo/python-list

 Python v2.5.2 -- however, this is not really a bug report because your
 analysis is correct. I am converting cp1252 strings to unicode before
 I persist them in a database.  I am looking for advice/direction/
 wisdom on how to sling these strings <g>

 -Jeff

Actually, what I am surprised by, is the fact that hashlib cares at
all about the encoding.  A md5 hash can be produced for an .iso file
which means it can handle bytes, why does it care what it is being
fed, as long as there are bytes.  I would have assumed that it would
take whatever was fed to it and view it as a byte array and then hash
it.  You can read a binary file and hash it
  print md5.new(file('foo.iso').read()).hexdigest()
What do I need to do to tell hashlib not to try and decode, just treat
the data as binary?

--
http://mail.python.org/mailman/listinfo/python-list


Re: unicode and hashlib

2008-11-29 Thread Marc 'BlackJack' Rintsch
On Sat, 29 Nov 2008 06:51:33 -0800, Jeff H wrote:

 Actually, what I am surprised by, is the fact that hashlib cares at all
 about the encoding.  A md5 hash can be produced for an .iso file which
 means it can handle bytes, why does it care what it is being fed, as
 long as there are bytes.

But you don't have bytes, you have a `unicode` object.  The internal byte 
representation is implementation specific and not your business.

  I would have assumed that it would take
 whatever was fed to it and view it as a byte array and then hash it.

How?  There is no (sane) way to get at the internal byte representation.  
And that byte representation might contain things like pointers to memory 
locations that are different for two `unicode` objects which compare 
equal, so you would get different hash values for objects that otherwise 
look the same from the Python level.  Not very useful.

 You can read a binary file and hash it
   print md5.new(file('foo.iso').read()).hexdigest()
 What do I need to do to tell hashlib not to try and decode, just treat
 the data as binary?

It's not about *de*coding, it is about *en*coding your `unicode` object 
so you get bytes to feed to the MD5 algorithm.
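
In other words (a sketch, assuming Python 2):

    import hashlib

    text = u'Andr\xe9'
    # hashlib.md5(text) would implicitly try the ascii codec and blow up;
    # an explicit encode supplies the bytes the algorithm needs
    print hashlib.md5(text.encode('utf-8')).hexdigest()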

Ciao,
Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list


Re: unicode and hashlib

2008-11-29 Thread Scott David Daniels

Jeff H wrote:

...
Actually, what I am surprised by, is the fact that hashlib cares at
all about the encoding.  A md5 hash can be produced for an .iso file
which means it can handle bytes, why does it care what it is being
fed, as long as there are bytes.  I would have assumed that it would
take whatever was fed to it and view it as a byte array and then hash
it.  You can read a binary file and hash it
  print md5.new(file('foo.iso').read()).hexdigest()
What do I need to do to tell hashlib not to try and decode, just treat
the data as binary?


If you do not care about portability or reproducibility, you can just
go with the bytes you can get at most easily.

To take your example:

    with open('foo.iso', 'r') as src:
        print hashlib.md5(src.read()).hexdigest()

will print different things on Linux and Windows, since text mode
translates line endings on Windows.

    with open('foo.iso', 'rb') as src:
        print hashlib.md5(src.read()).hexdigest()

should print the same thing on both; hashing does not magically allow
you to stop thinking.

If you now, and for all time, decide that the only source you will take 
is cp1252, perhaps you should decode to cp1252 before hashing.


Even if you have Unicode, you can have alternative Unicode expressions
of the same characters, so you may want to convert the Unicode to a
Normalized Form of Unicode before decoding to bytes.  The major
candidates for that are NFC, NFD, NFKC, and NFKD, see:
http://unicode.org/reports/tr15/
Again, once you have chosen your normalized form (or decided to skip the
normalization step), I'd suggest going to UTF-8 (which is pretty
unambiguous) and then hashing the result.  The problem with another choice
is that UTF-16 comes in two flavors (UTF-16BE and UTF-16LE); UTF-32 also
has two flavors (UTF-32BE and UTF-32LE), and whatever your current
Python, you may well switch between UTF-16 and UTF-32 internally at some
point as you do regular upgrades (or BE vs. LE if you switch CPUs).
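
A sketch of that whole pipeline, assuming Python 2 and NFC as the
chosen form:

    import hashlib
    import unicodedata

    def hash_text(text):
        # collapse alternative expressions of the same characters first
        normalized = unicodedata.normalize('NFC', text)
        return hashlib.sha256(normalized.encode('utf-8')).hexdigest()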

--Scott David Daniels
[EMAIL PROTECTED]

--
http://mail.python.org/mailman/listinfo/python-list


Re: unicode and hashlib

2008-11-29 Thread Scott David Daniels

Scott David Daniels wrote:
...
If you now, and for all time, decide that the only source you will take 
is cp1252, perhaps you should decode to cp1252 before hashing.


Of course my dyslexia sticks out here as I get encode and decode exactly
backwards -- Marc 'BlackJack' Rintsch has it right.

Characters (a concept) are encoded to a byte format (representation).
Bytes (a precise representation) are decoded to characters (a format
with semantics).

--Scott David Daniels
[EMAIL PROTECTED]
--
http://mail.python.org/mailman/listinfo/python-list


Re: unicode and hashlib

2008-11-29 Thread Jeff H
On Nov 29, 12:23 pm, Scott David Daniels [EMAIL PROTECTED]
wrote:
 Scott David Daniels wrote:

 ...

  If you now, and for all time, decide that the only source you will take
  is cp1252, perhaps you should decode to cp1252 before hashing.

 Of course my dyslexia sticks out here as I get encode and decode exactly
 backwards -- Marc 'BlackJack' Rintsch has it right.

 Characters (a concept) are encoded to a byte format (representation).
 Bytes (a precise representation) are decoded to characters (a format
 with semantics).

 --Scott David Daniels
 [EMAIL PROTECTED]

Ok, so the fog lifts, thanks to Scott and Marc, and I begin to realize
that the hashlib was trying to encode (not decode) my unicode object
as 'ascii' (my default encoding), and since that resulted in
characters > 128 -- shhh'boom.  So once I have character strings
transformed internally to unicode objects, I should encode them in
'utf-8' before attempting to do things that guess at the proper way to
encode them for further processing (i.e. hashlib).

>>> a='André'
>>> b=unicode(a,'cp1252')
>>> b
u'Andr\xc3\xa9'
>>> hashlib.md5(b.encode('utf-8')).hexdigest()
'b4e5418a36bc4badfc47deb657a2b50c'

Scott then points out that utf-8 is probably superior (for use within
the code I control) to utf-16 and utf-32, which both have two variants,
and which one is used can depend on installed software and/or
processors.  utf-8, unlike -16/-32, stays reliable and reproducible
irrespective of software or hardware.
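
For example (a sketch, assuming Python 2):

    import hashlib

    b = u'Andr\xe9'
    # the two utf-16 flavors yield different bytes, hence different hashes
    print hashlib.md5(b.encode('utf-16-le')).hexdigest()
    print hashlib.md5(b.encode('utf-16-be')).hexdigest()
    print hashlib.md5(b.encode('utf-8')).hexdigest()  # only one flavor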

decode vs encode
You decode from one character set to a unicode object
You encode from a unicode object to a specified character set

Please correct me if you see something wrong and thank you for your
advice and direction.

u'unicordial-ly yours. ;)'
Jeff
--
http://mail.python.org/mailman/listinfo/python-list


Re: unicode and hashlib

2008-11-28 Thread Scott David Daniels

Jeff H wrote:

hashlib.md5 does not appear to like unicode,
  UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
position 1650: ordinal not in range(128)

After googling, I've found BDFL and others on Py3K talking about the
problems of hashing non-bytes (i.e. buffers) ...

Unicode is characters, not a character encoding.
You could hash on a utf-8 encoding of the Unicode.


So what is the canonical way to hash unicode?
 * convert unicode to local
 * hash in current local
???

There is no _the_ way to hash Unicode, any more than
there is _the_ way to hash vectors.  You need to
convert the abstract entity to something concrete with
a well-defined representation in bytes, and hash that.
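
For instance, different byte representations of the same characters
hash differently (a sketch, assuming Python 2):

    import hashlib

    text = u'Andr\xe9'
    print hashlib.md5(text.encode('utf-8')).hexdigest()    # one digest
    print hashlib.md5(text.encode('latin-1')).hexdigest()  # another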


Is this just a problem for md5 hashes that I would not encounter using
a different method?  i.e. Should I just use the built-in hash function?

No, it is a definitional problem.  Perhaps you could explain how you
want to use the hash.  If the internal hash is acceptable (e.g. for
grouping in dictionaries within a single run), use that.  If you intend
to store and compare on the same system, say that.  If you want cross-
platform execution of your code to produce the same hashes, say that.
A hash is a means to an end, and it is hard to give advice without
knowing the goal.

--Scott David Daniels
[EMAIL PROTECTED]
--
http://mail.python.org/mailman/listinfo/python-list


Re: unicode and hashlib

2008-11-28 Thread MRAB

Jeff H wrote:

hashlib.md5 does not appear to like unicode,
  UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
position 1650: ordinal not in range(128)

After googling, I've found BDFL and others on Py3K talking about the
problems of hashing non-bytes (i.e. buffers)
http://www.mail-archive.com/[EMAIL PROTECTED]/msg09824.html

So what is the canonical way to hash unicode?
 * convert unicode to local
 * hash in current local
???
but what if local has ordinals outside of 128?

Is this just a problem for md5 hashes that I would not encounter using
a different method?  i.e. Should I just use the built-in hash function?


It can handle bytestrings, but if you give it unicode it performs a 
default encoding to ASCII, but that fails if there's a codepoint >= 
U+0080. Personally, I'd recommend encoding unicode to UTF-8.
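
For instance (a sketch, assuming Python 2):

    import hashlib

    u = u'\u0080'                # the first codepoint ascii rejects
    try:
        hashlib.md5(u)           # implicit ascii encode
    except UnicodeEncodeError:
        print 'default ascii encoding failed'
    print hashlib.md5(u.encode('utf-8')).hexdigest()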

--
http://mail.python.org/mailman/listinfo/python-list


Re: unicode and hashlib

2008-11-28 Thread Terry Reedy

Jeff H wrote:

hashlib.md5 does not appear to like unicode,
  UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
position 1650: ordinal not in range(128)


It is the (default) ascii encoder that does not like non-ascii chars.
I suspect that if you encode to bytes first with an encoder that does 
work (latin-???), md5 will be happy.
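
For instance (a sketch, assuming Python 2, with the character from the
report):

    import hashlib

    text = u'\xa6'
    # latin-1 can represent u'\xa6', so this encode succeeds
    print hashlib.md5(text.encode('latin-1')).hexdigest()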


Reports like this should include Python version.


After googling, I've found BDFL and others on Py3K talking about the
problems of hashing non-bytes (i.e. buffers)
http://www.mail-archive.com/[EMAIL PROTECTED]/msg09824.html

So what is the canonical way to hash unicode?
 * convert unicode to local
 * hash in current local
???
but what if local has ordinals outside of 128?

Is this just a problem for md5 hashes that I would not encounter using
a different method?  i.e. Should I just use the built-in hash function?
--
http://mail.python.org/mailman/listinfo/python-list



--
http://mail.python.org/mailman/listinfo/python-list


Re: unicode and hashlib

2008-11-28 Thread Paul Boddie
On 28 Nov, 21:03, Terry Reedy [EMAIL PROTECTED] wrote:

 It is the (default) ascii encoder that does not like non-ascii chars.
 I suspect that is you encode to bytes first with an encoder that does
 work (latin-???), md5 will be happy.

I know that the Python roadmap answer to such questions might refer
to Python 3.0 and its "strings are Unicode" features, and having seen
this mentioned a lot recently, I'm surprised that no-one has done so
at the time of writing, but I do wonder whether good old Python 2.x
wouldn't benefit from a more explicit error message in these
situations.

Since the introduction of Unicode in Python 1.6/2.0, I've always tried
to make the distinction between what I call "plain strings" or "byte
strings" and "Unicode objects" or "character strings", and perhaps the
UnicodeEncodeError message should be enhanced to say what is actually
going on: that an attempt is being made to convert characters into
byte values and that the chosen way of doing so (which often involves
the default, ASCII encoding) cannot manage the job.

Paul
--
http://mail.python.org/mailman/listinfo/python-list