Re: [Python-Dev] bytes.from_hex()

2006-03-03 Thread Ron Adam
Greg Ewing wrote:
 Ron Adam wrote:
 
 This uses syntax to determine the direction of encoding.  It would be 
 easier and clearer to just require two arguments or a tuple.

   u = unicode(b, 'encode', 'base64')
   b = bytes(u, 'decode', 'base64')
 
 The point of the exercise was to avoid using the terms
 'encode' and 'decode' entirely, since some people claim
 to be confused by them.

Yes, that was what I was trying for with the tounicode, tostring 
(tobyte) suggestion, but the direction could become ambiguous as you 
pointed out.

The constructors above have 4 data items implied:
  1: The source object which includes the source type and data
  2: The codec to use
  3: The direction of the operation
  4: The destination type (determined by the constructor used)

There isn't any ambiguity other than when to use encode or decode, but 
in this case that really is a documentation problem, because there is no 
ambiguity in this form.  Everything is explicit.

Another version of the above was pointed out to me off line that might 
be preferable.

   u = unicode(b, encode='base64')
   b = bytes(u, decode='base64')

Which would also work with the tostring and tounicode methods.

   u = b.tounicode(decode='base64')
   b = u.tobytes(encode='base64')
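
Ron's keyword-driven constructors can be sketched in modern Python. The 
helper name to_unicode and its keyword handling below are hypothetical, 
not a real API; it leans on the codecs module, which does provide base64 
as a transform:

```python
import codecs

def to_unicode(data, encode=None, decode=None):
    """Hypothetical helper mirroring the proposed unicode(b, encode=...)
    and unicode(b, decode=...) constructors; not a real Python API."""
    if (encode is None) == (decode is None):
        raise TypeError("specify exactly one of encode= or decode=")
    if encode is not None:
        # bytes -> text by *encoding* with a transform such as base64
        return codecs.encode(data, encode).decode('ascii')
    # bytes -> text by *decoding* a character encoding such as utf-8
    return data.decode(decode)

print(to_unicode(b'hello', encode='base64'))  # 'aGVsbG8=\n' (codec adds a newline)
print(to_unicode(b'hello', decode='utf-8'))   # 'hello'
```

The keyword itself names the direction, so all four implied data items 
(source, codec, direction, destination type) are visible at the call site.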


 If we're going to continue to use 'encode' and 'decode',
 why not just make them functions:
 
b = encode(u, 'utf-8')
u = decode(b, 'utf-8')

  >>> import codecs
  >>> codecs.decode('abc', 'ascii')
  u'abc'

There's that time machine again. ;-)

 In the case of Unicode encodings, if you get them
 backwards you'll get a type error.
 
 The advantage of using functions over methods or
 constructor arguments is that they can be applied
 uniformly to any input and output types.

If codecs are to be more general, then there may be times when the 
returned type needs to be specified.  This would apply to codecs that 
could return either bytes or strings, or strings or unicode, or bytes or 
unicode.  Some inputs may equally work with more than one output type. 
Of course, the answer in these cases may be to just 'know' what you will 
get, and then convert it to what you want.

Cheers,
Ron


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] bytes.from_hex()

2006-03-03 Thread Greg Ewing
Stephen J. Turnbull wrote:

 Doesn't that make base64 non-text by analogy to other "look but don't
 touch" strings like a .gz or vmlinuz?

No, because I can take a piece of base64 encoded data
and use a text editor to manually paste it in with some
other text (e.g. a plain-text (not MIME) mail message).
Then I can mail it to someone, or send it by text-mode
ftp, or translate it from Unix to MSDOS line endings and
give it to a Windows user, or translate it into EBCDIC
and give it to someone who has an IBM mainframe, etc,
etc. And the person at the other end can use their text
editor to manually extract it and decode it and recover
the original data.

I can't do any of those directly with a .gz file or
vmlinuz.

I'm not just making those uses up, BTW. It's not very
long ago people used to do things like that all the
time with uuencode, binhex, etc -- because mail and
news at the time were strictly text channels. They
still are, really -- otherwise we wouldn't be using
anything as hairy as MIME, we'd just mail our binary
files as-is.
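
Greg's round-trip-through-a-text-channel argument is easy to check with 
the base64 module: a line-ending translation on the textual form does not 
hurt the recovered bytes, because the decoder discards characters outside 
the base64 alphabet:

```python
import base64

payload = bytes(range(8))  # some arbitrary binary data
text = base64.encodebytes(payload).decode('ascii')  # base64 as text

# Simulate a Unix -> MSDOS line-ending translation on the text channel
mangled = text.replace('\n', '\r\n')

# The recipient still recovers the original data: b64decode discards
# characters outside the base64 alphabet (such as \r and \n) by default.
recovered = base64.b64decode(mangled)
assert recovered == payload
```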

--
Greg


Re: [Python-Dev] bytes.from_hex()

2006-03-03 Thread Greg Ewing
Ron Adam wrote:

 This would apply to codecs that 
 could return either bytes or strings, or strings or unicode, or bytes or 
 unicode.

I'd need to see some concrete examples of such codecs
before being convinced that they exist, or that they
couldn't just as well return a fixed type that you
then transform to what you want.

I suspect that said transformation would involve some
further encoding or decoding, in which case you really
have more than one codec.

--
Greg


Re: [Python-Dev] bytes.from_hex()

2006-03-03 Thread Ron Adam
Greg Ewing wrote:
 Ron Adam wrote:
 
 This would apply to codecs that 
 could return either bytes or strings, or strings or unicode, or bytes or 
 unicode.
 
 I'd need to see some concrete examples of such codecs
 before being convinced that they exist, or that they
 couldn't just as well return a fixed type that you
 then transform to what you want.

I think some codecs that currently return 'ascii' encoded text 
would be candidates.  If you use u'abc'.encode('rot13') you get an ascii 
string back and not a unicode string. And if you use decode to get back, 
you don't get the original unicode back, but an ascii representation of 
the original, which you then need to decode to unicode.

 I suspect that said transformation would involve some
 further encoding or decoding, in which case you really
 have more than one codec.

Yes, I can see that.

So the following are probably better reasons to specify the type.

Codecs are very close to types, and they quite often result in a type 
change; having the change visible in the code adds to overall 
readability.  This is probably my main desire for this.

There is another reason for being explicit about types with codecs. If 
you store the codecs with a tuple of attributes as the keys, (name, 
in_type, out_type), then it makes it possible to look up the codec with 
the correct behavior and then just do it.

The alternative is to test the input, try it, then test the output.  The 
look-up doesn't add much overhead, but it does add safety.  Codecs don't 
seem to be the type of thing you'd want to pass a wide variety of 
objects into, so a narrow slot is probably preferable to a wide one here.

In cases where a codec might be useful in more than one combination of 
types, it could have an entry for each valid combination in the lookup 
table.  The codec lookup also validates the desired operation for nearly 
free.  Of course, the data will need to be valid as well. ;-)
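
Ron's (name, in_type, out_type) lookup table can be sketched directly. 
The table and the convert helper below are hypothetical illustrations, 
wired to transforms that exist in the codecs module:

```python
import codecs

# Hypothetical registry keyed by (codec name, input type, output type).
CODEC_TABLE = {
    ('base64', bytes, str): lambda b: codecs.encode(b, 'base64').decode('ascii'),
    ('base64', str, bytes): lambda s: codecs.decode(s.encode('ascii'), 'base64'),
    ('utf-8', bytes, str): lambda b: b.decode('utf-8'),
    ('utf-8', str, bytes): lambda s: s.encode('utf-8'),
}

def convert(data, name, out_type):
    # Look up by (name, actual input type, desired output type); a missing
    # entry means the requested operation is invalid for these types,
    # so validation comes nearly for free.
    try:
        func = CODEC_TABLE[(name, type(data), out_type)]
    except KeyError:
        raise LookupError("no %s codec from %s to %s" % (
            name, type(data).__name__, out_type.__name__))
    return func(data)

assert convert(b'hi', 'utf-8', str) == 'hi'
assert convert('hi', 'utf-8', bytes) == b'hi'
```

A codec valid for several type combinations simply gets one entry per 
combination, as the base64 rows show.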


Cheers,
   Ron





Re: [Python-Dev] bytes.from_hex()

2006-03-02 Thread Ron Adam
Josiah Carlson wrote:
 Greg Ewing [EMAIL PROTECTED] wrote:
u = unicode(b)
u = unicode(b, 'utf8')
b = bytes['utf8'](u)
u = unicode['base64'](b)   # encoding
b = bytes(u, 'base64') # decoding
u2 = unicode['piglatin'](u1)   # encoding
u1 = unicode(u2, 'piglatin')   # decoding
 
 Your provided semantics feel cumbersome and confusing to me, as compared
 with str/unicode.encode/decode() .
 
  - Josiah

This uses syntax to determine the direction of encoding.  It would be 
easier and clearer to just require two arguments or a tuple.

  u = unicode(b, 'encode', 'base64')
  b = bytes(u, 'decode', 'base64')

  b = bytes(u, 'encode', 'utf-8')
  u = unicode(b, 'decode', 'utf-8')

  u2 = unicode(u1, 'encode', 'piglatin')
  u1 = unicode(u2, 'decode', 'piglatin')



It looks somewhat cleaner if you combine them in a path style string.

  b = bytes(u, 'encode/utf-8')
  u = unicode(b, 'decode/utf-8')

Ron



Re: [Python-Dev] bytes.from_hex()

2006-03-02 Thread Just van Rossum
Ron Adam wrote:

 Josiah Carlson wrote:
  Greg Ewing [EMAIL PROTECTED] wrote:
 u = unicode(b)
 u = unicode(b, 'utf8')
 b = bytes['utf8'](u)
 u = unicode['base64'](b)   # encoding
 b = bytes(u, 'base64') # decoding
 u2 = unicode['piglatin'](u1)   # encoding
 u1 = unicode(u2, 'piglatin')   # decoding
  
  Your provided semantics feel cumbersome and confusing to me, as
  compared with str/unicode.encode/decode() .
  
   - Josiah
 
 This uses syntax to determine the direction of encoding.  It would be 
 easier and clearer to just require two arguments or a tuple.
 
   u = unicode(b, 'encode', 'base64')
   b = bytes(u, 'decode', 'base64')
 
   b = bytes(u, 'encode', 'utf-8')
   u = unicode(b, 'decode', 'utf-8')
 
   u2 = unicode(u1, 'encode', 'piglatin')
   u1 = unicode(u2, 'decode', 'piglatin')
 
 
 
 It looks somewhat cleaner if you combine them in a path style string.
 
   b = bytes(u, 'encode/utf-8')
   u = unicode(b, 'decode/utf-8')

It gets from bad to worse :(

I always liked the asymmetry between

u = unicode(s, 'utf8')

and

s = u.encode('utf8')

which I think was the original design of the unicode API. Kudos to
whoever came up with that.

When I saw

b = bytes(u, 'utf8')

mentioned for the first time, I thought: why on earth must the bytes
constructor be coupled to the unicode API?!?! It makes no sense to me
whatsoever. Bytes have so much more use besides encoded text.

I believe (please correct me if I'm wrong) that the encoding argument of
bytes() was invented to make it easier to write byte literals. Perhaps a
true bytes literal notation is in order after all?

My preference for a bytes -> unicode -> bytes API would be this:

u = unicode(b, 'utf8')  # just like we have now
b = u.tobytes('utf8')   # like u.encode(), but being explicit
                        # about the resulting type

As to base64, while it works as a codec (Why a base64 codec? Because we
can!), I don't find it a natural API at all for such conversions.

(I do however agree with Greg Ewing that base64 encoded data is text,
not ascii-encoded bytes ;-)

Just-my-2-cts


Re: [Python-Dev] bytes.from_hex()

2006-03-02 Thread Greg Ewing
Ron Adam wrote:

 This uses syntax to determine the direction of encoding.  It would be 
 easier and clearer to just require two arguments or a tuple.
 
   u = unicode(b, 'encode', 'base64')
   b = bytes(u, 'decode', 'base64')

The point of the exercise was to avoid using the terms
'encode' and 'decode' entirely, since some people claim
to be confused by them.

While I succeeded in that, I concede that the result
isn't particularly intuitive and is arguably even more
confusing.

If we're going to continue to use 'encode' and 'decode',
why not just make them functions:

   b = encode(u, 'utf-8')
   u = decode(b, 'utf-8')

In the case of Unicode encodings, if you get them
backwards you'll get a type error.

The advantage of using functions over methods or
constructor arguments is that they can be applied
uniformly to any input and output types.
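
For what it's worth, function-style spellings of exactly this kind exist 
in the codecs module of modern Python, and they do apply uniformly across 
input and output types:

```python
import codecs

# Module-level functions, applied uniformly regardless of the types involved
assert codecs.decode(b'hi', 'utf-8') == 'hi'        # bytes -> text
assert codecs.encode('hi', 'utf-8') == b'hi'        # text -> bytes
assert codecs.encode(b'hi', 'base64') == b'aGk=\n'  # bytes -> bytes transform
assert codecs.encode('abc', 'rot_13') == 'nop'      # text -> text transform
```

As Greg notes, a Unicode encoding used in the wrong direction fails with 
a type error, while non-Unicode transforms like base64 and rot_13 simply 
return whatever type the codec defines.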

--
Greg


Re: [Python-Dev] bytes.from_hex()

2006-03-02 Thread Greg Ewing
Stephen J. Turnbull wrote:

 What you presumably meant was what would you consider the proper type
 for (P)CDATA?

No, I mean the whole thing, including all the ... tags
etc. Like you see when you load an XML file into a text
editor. (BTW, doesn't the fact that you *can* load an
XML file into what we call a text editor say something?)

 nobody but authors of
 wire drivers[2] and introspective code will need to _explicitly_ call
 .encode('base64').

Even a wire driver writer will only need it if he's
trying to turn a text wire into a binary wire, as
far as I can see.

--
Greg


Re: [Python-Dev] bytes.from_hex()

2006-03-02 Thread Stephen J. Turnbull
 Greg == Greg Ewing [EMAIL PROTECTED] writes:

Greg (BTW, doesn't the fact that you *can* load an XML file into
Greg what we call a text editor say something?)

Why not answer that question for yourself, and then turn that answer
into a description of text semantics?

For me, it says that, just like a gzipped file or the Linux kernel, I
can load an XML file into a text editor.  But unlike the .gz or
vmlinuz, I can easily find many useful things to do to the XML string
in the text editor.

Doesn't that make base64 non-text by analogy to other "look but don't
touch" strings like a .gz or vmlinuz?

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba   Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Donovan Baarda
On Tue, 2006-02-28 at 15:23 -0800, Bill Janssen wrote:
 Greg Ewing wrote:
  Bill Janssen wrote:
  
   bytes -> base64 -> text
   text -> de-base64 -> bytes
  
  It's nice to hear I'm not out of step with
  the entire world on this. :-)
 
 Well, I can certainly understand the bytes -> base64 -> bytes side of
 things too.  The text produced is specified as using a 65-character
 subset of US-ASCII, so that's really bytes.

Huh... just joining here but surely you don't mean a text string that
doesn't use every character available in a particular encoding is
really bytes... it's still a text string...

If you base64 encode some bytes, you get a string. If you then want to
access that base64 string as if it was a bunch of bytes, cast it to
bytes.

Be careful not to confuse (type)cast with (type)convert... 

A convert transforms the data from one type/class to another,
modifying it to be a valid equivalent instance of the other type/class;
i.e. int -> float.

A cast does not modify the data in any way; it just changes its
type/class to be the other type, and assumes that the data is a valid
instance of the other type; e.g. int32 -> bytes[4]. Minor data munging
under the hood to cleanly switch the type/class is acceptable (i.e. adding
array length info etc.) provided you keep to the spirit of the cast.

Keep these two concepts separate and you should be right :-)
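
A rough Python analogue of Donovan's distinction (Python has no true 
casts, so struct.pack stands in here as the bits-preserving view of an 
int32):

```python
import struct

# A convert produces an equivalent value in another type:
f = float(3)                    # int -> float: new representation, same value
assert f == 3.0

# A cast-like view keeps the bits and only changes how they are typed,
# int32 -> bytes[4]:
raw = struct.pack('<i', 3)      # the four little-endian bytes of int32 3
assert raw == b'\x03\x00\x00\x00'
assert struct.unpack('<i', raw)[0] == 3   # the bits read back unchanged
```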

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Nick Coghlan
Bill Janssen wrote:
 Greg Ewing wrote:
 Bill Janssen wrote:

 bytes -> base64 -> text
 text -> de-base64 -> bytes
 It's nice to hear I'm not out of step with
 the entire world on this. :-)
 
 Well, I can certainly understand the bytes -> base64 -> bytes side of
 things too.  The text produced is specified as using a 65-character
 subset of US-ASCII, so that's really bytes.

If the base64 codec were a text-to-bytes codec, and bytes did not have an 
encode method, then if you wanted to convert your original bytes to ascii 
bytes, you would do:

   ascii_bytes = orig_bytes.decode('base64').encode('ascii')

"Use base64 to convert my byte sequence to characters, then give me the 
corresponding ascii byte sequence"

To reverse the process:

   orig_bytes = ascii_bytes.decode('ascii').encode('base64')

"Use ascii to convert my byte sequence to characters, then use base64 to 
convert those characters back to the original byte sequence"

The only slightly odd aspect is that this inverts the conventional meaning of 
base64 encoding and decoding, where you expect to encode from bytes to 
characters and decode from characters to bytes.

As strings currently have both methods, the existing codec is able to use the 
conventional sense for base64: encode goes from str-as-bytes to 
str-as-text (giving a longer string with characters that fit in the base64 
subset) and decode goes from str-as-text to str-as-bytes (giving back the 
original string)

All the unicode codecs, on the other hand, use encode to get from characters 
to bytes and decode to get from bytes to characters.

So if bytes objects *did* have an encode method, it should still result in a 
unicode object, just the same as a decode method does (because you are 
encoding bytes as characters), and unicode objects would acquire a 
corresponding decode method (that decodes from a character format such as 
base64 to the original byte sequence).

In the name of TOOWTDI, I'd suggest that we just eat the slight terminology 
glitch in the rare cases like base64, hex and oct (where the character format 
is technically the encoded format), and leave it so that there is a single 
method pair (bytes.decode to go from bytes to characters, and text.encode to 
go from characters to bytes).
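
Nick's hypothetical round trip can be approximated in today's Python, 
where bytes has no encode method and base64 lives in the codecs module 
as a bytes-to-bytes transform, with the text step made explicit:

```python
import codecs

orig_bytes = b'\x00\x01\xfe'

# bytes -> base64 text -> ascii bytes (Nick's ascii_bytes)
b64_text = codecs.encode(orig_bytes, 'base64').decode('ascii')
ascii_bytes = b64_text.encode('ascii')

# ...and back to the original byte sequence
assert codecs.decode(ascii_bytes, 'base64') == orig_bytes
```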

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---
 http://www.boredomandlaziness.org


Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Ron Adam
Nick Coghlan wrote:
 All the unicode codecs, on the other hand, use encode to get from characters 
 to bytes and decode to get from bytes to characters.
 
 So if bytes objects *did* have an encode method, it should still result in a 
 unicode object, just the same as a decode method does (because you are 
 encoding bytes as characters), and unicode objects would acquire a 
 corresponding decode method (that decodes from a character format such as 
 base64 to the original byte sequence).

 In the name of TOOWTDI, I'd suggest that we just eat the slight terminology 
 glitch in the rare cases like base64, hex and oct (where the character format 
 is technically the encoded format), and leave it so that there is a single 
 method pair (bytes.decode to go from bytes to characters, and text.encode to 
 go from characters to bytes).

I think you have it pretty straight here.


While playing around with the example bytes class I noticed code reads 
much better when I use methods called tounicode and tostring.

b64ustring = b.tounicode('base64')
b = bytes(b64ustring, 'base64')

The bytes could then *not* ignore the string decode codec but use it for 
string to string decoding.

b64string = b.tostring('base64')
b = bytes(b64string, 'base64')

b = bytes(hexstring, 'hex')
hexstring = b.tostring('hex')

hexstring = b.tounicode('hex')

An exception could be raised if the codec does not support input or 
output type depending on the situation.

This would allow for different types of codecs to live together without 
as much confusion, I think.

I'm not suggesting we start using to-type everywhere, just where it 
might make things clearer over decode and encode.
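
A minimal sketch of what Ron describes; the Bytes subclass and its 
tounicode/tostring methods are hypothetical, not part of any real Python 
type, and the codec dispatch is deliberately naive:

```python
import codecs

class Bytes(bytes):
    """Hypothetical sketch of a bytes type with to-type methods."""

    def tounicode(self, codec):
        # Transforms like base64/hex produce ascii text; character
        # codecs like utf-8 decode the bytes directly.
        if codec in ('base64', 'hex'):
            return codecs.encode(self, codec).decode('ascii')
        return self.decode(codec)

    def tostring(self, codec):
        # bytes-to-bytes transforms such as hex or base64
        return codecs.encode(self, codec)

b = Bytes(b'hello')
assert b.tounicode('hex') == '68656c6c6f'
assert b.tostring('hex') == b'68656c6c6f'
assert b.tounicode('utf-8') == 'hello'
```

The method name states the destination type, so the direction of the 
operation is visible at the call site, which is the readability gain Ron 
is after.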


Expecting it not to fly, but just maybe it could?
   Ron




Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Chermside, Michael
Ron Adam writes:
 While playing around with the example bytes class I noticed code reads

 much better when I use methods called tounicode and tostring.
[...]
 I'm not suggesting we start using to-type everywhere, just where it 
 might make things clearer over decode and encode.

+1

I always find myself slightly confused by encode() and decode()
despite the fact that I understand (I think) the reason for the
choice of those names and by rights ought to have no trouble.

I'm not arguing that it's worth the gratuitous code breakage (I
don't have enough code using encode() and decode() for my opinion
to count in that matter.) But I will say that if there were no
legacy I'd prefer the tounicode() and tostring() (but shouldn't it
be 'tobytes()' instead?) names for Python 3.0.

-- Michael Chermside








Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Bill Janssen
 Huh... just joining here but surely you don't mean a text string that
 doesn't use every character available in a particular encoding is
 really bytes... it's still a text string...

No, once it's in a particular encoding it's bytes, no longer text.

As you say,
 Keep these two concepts separate and you should be right :-)

Bill


Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Scott David Daniels
Chermside, Michael wrote:
 ... I will say that if there were no legacy I'd prefer the tounicode()
 and tostring() (but shouldn't it be 'tobytes()' instead?) names for Python 3.0.

Wouldn't 'tobytes' and 'totext' be better for 3.0 where text == unicode?

-- 
-- Scott David Daniels
[EMAIL PROTECTED]



Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Michael Chermside
I wrote:
 ... I will say that if there were no legacy I'd prefer the tounicode()
 and tostring() (but shouldn't it be 'tobytes()' instead?) names for Python 3.0.

Scott Daniels replied:
 Wouldn't 'tobytes' and 'totext' be better for 3.0 where text == unicode?

Um... yes. Sorry, I'm not completely used to 3.0 yet. I'll need to borrow
the time machine for a little longer before my fingers really pick up on
the 3.0 names and idioms.

-- Michael Chermside



Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Greg Ewing
Nick Coghlan wrote:

ascii_bytes = orig_bytes.decode('base64').encode('ascii')
 
orig_bytes = ascii_bytes.decode('ascii').encode('base64')
 
 The only slightly odd aspect is that this inverts the conventional meaning of 
 base64 encoding and decoding,

-1. Whatever we do, we shouldn't design things so
that it's necessary to write anything as
unintuitive as that.

We need to make up our minds whether the .encode()
and .decode() methods are only meant for Unicode
encodings, or whether they are for general
transformations between bytes and characters.

If they're only meant for Unicode, then bytes
should only have .decode(), unicode strings
should only have .encode(), and only Unicode
codecs should be available that way. Things
like base64 would need to have a different
interface.

If they're for general transformations, then
both types should have both methods, with the
return type depending on the codec you're
using, and it's the programmer's responsibility
to use codecs that make sense for what he's
doing.

But if they're for general transformations,
why limit them to just bytes and characters?
Following that through leads to giving *every*
object .encode() and .decode() methods. I
don't think we should go that far, but it's
hard to see where to draw the line. Are
bytes and strings special enough to justify
them having their own peculiar methods for
codec access?

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | Carpe post meridiam! |
Christchurch, New Zealand  | (I'm not a morning person.)  |
[EMAIL PROTECTED]  +--+


Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Greg Ewing
Bill Janssen wrote:

 No, once it's in a particular encoding it's bytes, no longer text.

The point at issue is whether the characters produced
by base64 are in a particular encoding. According to
my reading of the RFC, they're not.

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | Carpe post meridiam! |
Christchurch, New Zealand  | (I'm not a morning person.)  |
[EMAIL PROTECTED]  +--+


Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Greg Ewing
Ron Adam wrote:

 While playing around with the example bytes class I noticed code reads 
 much better when I use methods called tounicode and tostring.
 
 b64ustring = b.tounicode('base64')
 b = bytes(b64ustring, 'base64')

I don't like that, because it creates a dependency
(conceptually, at least) between the bytes type and
the unicode type. And why unicode in particular?
Why should it have a tounicode() method, but not
a toint() or tofloat() or tolist() etc.?

 I'm not suggesting we start using to-type everywhere, just where it 
 might make things clearer over decode and encode.

Another thing is that it only works if the codec
transforms between two different types. If you
have a bytes-to-bytes transformation, for example,
then

   b2 = b1.tobytes('some-weird-encoding')

is ambiguous.

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | Carpe post meridiam! |
Christchurch, New Zealand  | (I'm not a morning person.)  |
[EMAIL PROTECTED]  +--+


Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Michael Urman
[My apologies Greg; I meant to send this to the whole list. I really
need a list-reply button in GMail. ]

On 3/1/06, Greg Ewing [EMAIL PROTECTED] wrote:
 I don't like that, because it creates a dependency
 (conceptually, at least) between the bytes type and
 the unicode type.

I only find half of this bothersome. The unicode type has a pretty
clear dependency on the bytestring type: all I/O needs to be done in
bytes. Various APIs may mask this by accepting unicode values and
transparently doing the right thing, but from the theoretical
standpoint we pretend there is no simple serialization of unicode
values. But the reverse is not true: the bytestring type has no
dependency on unicode.

As a practicality vs purity, however, I think it's a good choice to
let the bytestring type have a tie to unicode, much like the str type
implicitly does now. But you're absolutely right that adding a
.tounicode begs the question why not a .tointeger?

To try to step back and summarize the viewpoints I've seen so far,
there are three main requirements.

  1) We want things that are conceptually text to be stored in memory
as unicode values.
  2) We want there to be some unambiguous conversion via codecs
between bytestrings and unicode values. This should help teaching,
learning, and remembering unicode.
  3) We want a way to apply and reverse compressions, encodings,
encryptions, etc., which are not only between bytestrings and unicode
values; they may be between any two arbitrary types. This allows
writing practical programs.

There seems to be little disagreement over 1, provided sufficiently
efficient implementation, or sufficient string powers in the
bytestring type. To satisfy both 2 and 3, there seem to be a couple
options. What other requirements do we have?

For (2):
  a) Restrict the existing helpers to be only bytestring.decode and
unicode.encode, possibly enforcing output types of the opposite kind,
and removing bytestring.encode
  b) Add new methods with these semantics, e.g. bytestring.udecode and
unicode.uencode

For (3):
  c) Create new helpers codecs.encode(obj, encoding, errors) and
codecs.decode(obj, encoding, errors)
  d) [Keep existing bytestring and unicode helper methods as is, and]
require use of codecs.getencoder() and codecs.getdecoder() for
arbitrary starting object types

Obviously 2a and 3d do not work together, but 2b and 3c work with
either complementary option. What other options do we have?
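
Option (d)'s machinery already exists in the codecs module: getencoder 
and getdecoder return the raw codec functions, which can be applied to 
whatever types the codec itself supports:

```python
import codecs

# Fetch the raw codec functions rather than going through str/bytes methods
encode = codecs.getencoder('utf-8')
decode = codecs.getdecoder('utf-8')

# Each returns an (output, length consumed) tuple
assert encode('abc') == (b'abc', 3)
assert decode(b'abc') == ('abc', 3)
```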

Michael
--
Michael Urman  http://www.tortall.net/mu/blog


Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Ron Adam
Greg Ewing wrote:
 Ron Adam wrote:
 
 While playing around with the example bytes class I noticed code reads 
 much better when I use methods called tounicode and tostring.

 b64ustring = b.tounicode('base64')
 b = bytes(b64ustring, 'base64')
 
 I don't like that, because it creates a dependency
 (conceptually, at least) between the bytes type and
 the unicode type. And why unicode in particular?
 Why should it have a tounicode() method, but not
 a toint() or tofloat() or tolist() etc.?

I don't think it creates a dependency between the types, but it does 
create a stronger relationship between them when a method that returns a 
fixed type is used.

No reason not to, other than avoiding methods that really aren't 
needed.  But if it makes sense to have them, sure.  If a codec isn't 
needed, a regular constructor should probably be used instead.


 I'm not suggesting we start using to-type everywhere, just where it 
 might make things clearer over decode and encode.
 
 Another thing is that it only works if the codec
 transforms between two different types. If you
 have a bytes-to-bytes transformation, for example,
 then
 
   b2 = b1.tobytes('some-weird-encoding')
 
 is ambiguous.

Are you asking if it's decoding or encoding?

   bytes to unicode -  decoding
   unicode to bytes -  encoding

   bytes to bytes - ?

Good point, I think this defines part of the difficulty.

1. We can specify the operation and not be sure of the resulting type.

   *or*

2. We can specify the type and not always be sure of the operation.

maybe there's a way to specify both so it's unambiguous?
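
The bytes-to-bytes case Ron and Greg identify is real in today's codecs 
module: with a transform like hex, the types alone cannot tell you the 
direction, so the operation has to be named explicitly:

```python
import codecs

b1 = b'abc'

# Same input and output type, so only the named operation disambiguates:
b2 = codecs.encode(b1, 'hex')
assert b2 == b'616263'
assert type(b1) is type(b2) is bytes

# Reversing it requires explicitly asking for the other operation
assert codecs.decode(b2, 'hex') == b1
```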


Ron








Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Josiah Carlson

Greg Ewing [EMAIL PROTECTED] wrote:
u = unicode(b)
u = unicode(b, 'utf8')
b = bytes['utf8'](u)
u = unicode['base64'](b)   # encoding
b = bytes(u, 'base64') # decoding
u2 = unicode['piglatin'](u1)   # encoding
u1 = unicode(u2, 'piglatin')   # decoding

Your provided semantics feel cumbersome and confusing to me, as compared
with str/unicode.encode/decode() .

 - Josiah



Re: [Python-Dev] bytes.from_hex()

2006-02-28 Thread Greg Ewing
Bill Janssen wrote:

 Well, I can certainly understand the bytes -> base64 -> bytes side of
 things too.  The text produced is specified as using a 65-character
 subset of US-ASCII, so that's really bytes.

But it then goes on to say that these same characters
are also a subset of EBCDIC. So it seems to be
talking about characters as abstract entities here,
not as bit patterns.

Greg


Re: [Python-Dev] bytes.from_hex()

2006-02-27 Thread Greg Ewing
Bill Janssen wrote:

 I use it quite a bit for image processing (converting to and from the
 data: URL form), and various checksum applications (converting SHA
 into a string).

Aha! We have a customer!

For those cases, would you find it more convenient
for the result to be text or bytes in Py3k?

Greg


Re: [Python-Dev] bytes.from_hex()

2006-02-25 Thread Stephen J. Turnbull
 Ron == Ron Adam [EMAIL PROTECTED] writes:

Ron So, let's consider a codec and a coding as being two
Ron different things, where a codec is a character subset of
Ron unicode characters expressed in a native format, and a
Ron coding is *not* a subset of the unicode character set, but an
Ron _operation_ performed on text.

Ron codec ->  text is always in *one_codec* at any time.

No, a codec is an operation, not a state.

And text qua text has no need of state; the whole point of defining
text (as in the unicode type) is to abstract from such
representational issues.

Ron Pure codecs such as latin-1 can be envoked over and over and
Ron you can always get back what you put in in a single step.

Maybe you'd like to define them that way, but it doesn't work in
general.  Given that str and unicode currently don't carry state with
them, it's not possible for "to ASCII" and "to EBCDIC" to be
idempotent at the same time.  And for the languages spoken by 75% of
the world's population, "to latin-1" cannot be successfully invoked
even once, let alone be idempotent.  You really need to think about
how your examples apply to codecs like KOI8-R for Russian and Shift
JIS for Japanese.
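Stephen's latin-1 point is easy to check directly in modern Python; the snippet below is just an illustration using Russian text:

```python
# Latin-1 covers only U+0000..U+00FF, so Russian (or Japanese) text
# cannot pass through "to latin-1" even once.
text = '\u041f\u0440\u0438\u0432\u0435\u0442'   # "Privet" in Cyrillic
try:
    text.encode('latin-1')
except UnicodeEncodeError as exc:
    print('latin-1 failed:', exc.reason)

# KOI8-R was designed for this repertoire, so it round-trips.
assert text.encode('koi8_r').decode('koi8_r') == text
```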

In practice, I just don't think you can distinguish codecs from
coding using the kind of mathematical properties you have described
here.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba      Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-25 Thread Greg Ewing
Stephen J. Turnbull wrote:

 The reason that Python source code is text is that the primary
 producers/consumers of Python source code are human beings, not
 compilers

I disagree with primary -- I think human and computer
use of source code have equal importance. Because of the
fact that Python source code must be acceptable to the
Python compiler, a great many transformations that would
be harmless to English text (upper casing, paragraph
wrapping, etc.) would cause disaster if applied to a
Python program. I don't see how base64 is any different.

 Yes, which implies that you assume he has control of the data all the
 way to the channel that actually requires base64.

Yes. If he doesn't, he can't safely use base64 at all.
That's true regardless of how the base64-encoded data
is represented. It's true of any data of any kind.

 Use case: the Gnus MUA supports the RFC that allows non-ASCII names in
 MIME headers that take file names...

I'm not familiar with all the details you're alluding
to here, but if there's a bug here, I'd say it's due
to somebody not thinking something through properly.
It shouldn't matter if something gets encoded four
times as long as it gets decoded four times at the
other end. If it's not possible to do that, someone
made an assumption about the channel that wasn't
true.

 It's "what is the Python compiler/interpreter going
  to think?"  AFAICS, it's going to think that base64 is
  a unicode codec.

Only if it's designed that way, and I specifically
think it shouldn't -- i.e. it should be an error
to attempt the likes of a_unicode_string.encode('base64')
or unicode(something, 'base64'). The interface for
doing base64 encoding should be something else.

 I don't believe that "takes a character string as
 input" has any intrinsic meaning.

I'm using that phrase in the context of Python, where
it means a function that takes a Python character
string as input.

In the particular case of base64, it has the added
restriction that it must preserve the particular
65 characters used.

  In practice, I think it's a loaded gun
 aimed at my foot.  And yours.

Whereas it seems quite the opposite to me, i.e.
*failing* to clearly distinguish between text and
binary data here is what will lead to confusion and
foot-shooting.

I think we need some concrete use cases to talk
about if we're to get any further with this. Do
you have any such use cases in mind?

Greg




Re: [Python-Dev] bytes.from_hex()

2006-02-24 Thread Stephen J. Turnbull
 Ron == Ron Adam [EMAIL PROTECTED] writes:

Ron We could call it transform or translate if needed.

You're still losing the directionality, which is my primary objection
to "recode".  The absence of directionality is precisely why "recode"
is used in that sense for i18n work.

There really isn't a good reason that I can see to use anything other
than the pair encode and decode.  In monolingual environments,
once _all_ human-readable text (specifically including Python programs
and console I/O) is automatically mapped to a Python (unicode) string,
most programmers will never need to think about it as long as Python
(the project) very very strongly encourages that all Python programs
be written in UTF-8 if there's any chance the program will be reused
in a locale other than the one where it was written.  (Alternatively
you can depend on PEP 263 coding cookies.)  Then the user (or the
Python interpreter) just changes console and file I/O codecs to the
encoding in use in that locale, and everything just works.

So the remaining uses of encode and decode are for advanced users
and specialists: people using stuff like base64 or gzip, and those who
need to use unicode codecs explicitly.

I could be wrong about the possibility to get rid of explicit unicode
codec use in monolingual environments, but I hope that we can at least
try to achieve that.

 Unlikely.  Errors like "A
 string.".encode('base64').encode('base64') are all too easy to
 commit in practice.

Ron Yes,... and wouldn't the above just result in a copy so it
Ron wouldn't be an out right error.

No, you either get the following:

"A string." -> "QSBzdHJpbmcu" -> "UVNCemRISnBibWN1"

or you might get an error if base64 is defined as bytes->unicode.
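The double-encoding chain Stephen shows can be reproduced with the `base64` module in current Python:

```python
import base64

s = b'A string.'
once = base64.b64encode(s)    # b'QSBzdHJpbmcu'
twice = base64.b64encode(once)
print(twice)                  # b'UVNCemRISnBibWN1'
# Undoing the accident takes exactly two decodes.
assert base64.b64decode(base64.b64decode(twice)) == s
```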

Ron * Given that the string type gains a __codec__ attribute
Ron to handle automatic decoding when needed.  (is there a reason
Ron not to?)

Ronstr(object[,codec][,error]) - string coded with codec

Ronunicode(object[,error]) - unicode

Ronbytes(object) - bytes

str == unicode in Py3k, so this is a non-starter.  What do you want to
say?

Ron  * a recode() method is used for transformations that
Ron *do_not* change the current codec.

I'm not sure what you mean by the current codec.  If it's attached
to an encoded object, it should be the codec needed to decode the
object.  And it should be allowed to be a codec stack.  So suppose
you start with a unicode object obj.  Then

>>> bytes = bytes(obj, 'utf-8')    # implicit .encode()
>>> print bytes.codec
['utf-8']
>>> wire = bytes.encode('base64')  # with apologies to Greg E.
>>> print wire.codec
['base64', 'utf-8']
>>> obj2 = wire.decode('gzip')
CodecMatchException
>>> obj2 = wire.decode(wire.codec)
>>> print obj == obj2
True
>>> print obj2.codec
[]

or maybe None for the last.  I think this would be very nice as a
basis for improving the email module (for one), but I don't really
think it belongs in Python core.
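A minimal, purely hypothetical sketch of the codec-stack idea discussed above, in modern Python: `CodedBytes`, `from_text`, `encode_layer`, and `decode_all` are invented names, and only base64 is wired up as a stackable layer.

```python
import base64

class CodedBytes(bytes):
    """Hypothetical bytes that remember the codecs applied to them."""
    codec = ()  # stack of codec names, outermost first

    @classmethod
    def from_text(cls, text, encoding):
        obj = cls(text.encode(encoding))
        obj.codec = (encoding,)
        return obj

    def encode_layer(self, name):
        # Only base64 is supported in this sketch.
        if name != 'base64':
            raise LookupError(name)
        obj = CodedBytes(base64.b64encode(self))
        obj.codec = ('base64',) + self.codec
        return obj

    def decode_all(self):
        # Peel off layers outermost-first until text is recovered.
        data = bytes(self)
        for name in self.codec:
            if name == 'base64':
                data = base64.b64decode(data)
            else:
                data = data.decode(name)
        return data

wire = CodedBytes.from_text('caf\xe9', 'utf-8').encode_layer('base64')
print(wire.codec)          # ('base64', 'utf-8')
print(wire.decode_all())   # café
```

As Stephen suggests, this sort of thing fits a library like email better than the core types, since it attaches state that plain bytes deliberately avoid.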

Ron That may be why it wasn't done this way to start.  (?)

I suspect the real reason is that Marc-Andre had the generalized codec
in mind from Day 0, and your proposal only works with duck-typing if
codecs always have a well-defined signature with two different types
for the argument and return of the constructor.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba      Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-24 Thread Stephen J. Turnbull
 Greg == Greg Ewing [EMAIL PROTECTED] writes:

Greg Stephen J. Turnbull wrote:

 No, base64 isn't a wire protocol.  It's a family[...].

Greg Yes, and it's up to the programmer to choose those code
Greg units (i.e. pick an encoding for the characters) that will,
Greg in fact, pass through the channel he is using without
Greg corruption. I don't see how any of this is inconsistent with
Greg what I've said.

It's not.  It just shows that there are other correct ways to think
about the issue.

 Only if you do no transformations that will harm the
 base64-encoding.  ...  It doesn't allow any of the usual
 transformations on characters that might be applied globally to
 a mail composition buffer, for example.

Greg I don't understand that. Obviously if you rot13 your mail
Greg message or turn it into pig latin or something, it's going
Greg to mess up any base64 it might contain.  But that would be a
Greg silly thing to do to a message containing base64.

What message containing base64?  Any base64 in there?  Nope,
nobody here but us Unicode characters!  I certainly hope that in Py3k
bytes objects will have neither ROT13 nor case-changing methods, but
str objects certainly will.  Why give up the safety of that
distinction?

Greg Given any piece of text, there are things it makes sense to
Greg do with it and things it doesn't, depending entirely on the
Greg use to which the text will eventually be put.  I don't see
Greg how base64 is any different in this regard.

If you're going to be binary about it, it's not different.  However
the kind of text for which Unicode was designed is normally produced
and consumed by people, who wll pt up w/ ll knds f nnsns.  Base64
decoders will not put up with the same kinds of nonsense that people
will.

You're basically assuming that the person who implements the code that
processes a Unicode string is the same person who implemented the code
that converts a binary object into base64 and inserts it into a
string.  I think that's a dangerous (and certainly invalid) assumption.

I know I've lost time and data to applications that make assumptions
like that.  In fact, that's why MULE is a four-letter word in Emacs
channels. <wink>

 So then you bring it right back in with base64.  Now they need
 to know about bytes->unicode codecs.

Greg No, they need to know about the characteristics of the
Greg channel over which they're sending the data.

I meant it in a trivial sense: How do you use a bytes->unicode codec
properly without knowing that it's a bytes->unicode codec?

In most environments, it should be possible to hide bytes->unicode
codecs almost all the time, and I think that's a very good thing.  I
don't think it's a good idea to gratuitously introduce wire protocols
as unicode codecs, even if a class of bit patterns which represent the
integer 65 are denoted A in various sources.  Practicality beats
purity (especially when you're talking about the purity of a pregnant
virgin).

Greg It might be appropriate to use base64 followed by some
Greg encoding, but the programmer needs to be aware of that and
Greg choose the encoding wisely. It's not possible to shield him
Greg from having to know about encodings in that situation, even
Greg if the encoding is just ascii.

What do you think the email module does?  Assuming conforming MIME
messages and receivers capable of handling UTF-8, the user of the
email module does not need to know anything about any encodings at
all.  With a little more smarts, the email module could even make a
good choice of output encoding based on the _language_ of the text,
removing the restriction to UTF-8 on the output side, too.  With the
aid of file(1), it can make excellent guesses about attachments.

Sure, the email module programmer needs to know, but the email module
programmer needs to know an awful lot about codecs anyway, since mail
at that level is a binary channel, while users will be throwing a
mixed bag of binary and textual objects at it.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba      Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-24 Thread Ron Adam

* The following reply is a rather longer explanation than I intended of 
why codings like 'rot' aren't the same thing as pure unicode codecs 
(and how they differ), and probably should be treated differently.
If you already understand that, then I suggest skipping this.  But if 
you like detailed logical analysis, it might be of some interest, even 
if it's reviewing the obvious to those who already know.

(And hopefully I didn't make any really obvious errors myself.)


Stephen J. Turnbull wrote:
 Ron == Ron Adam [EMAIL PROTECTED] writes:
 
 Ron We could call it transform or translate if needed.
 
 You're still losing the directionality, which is my primary objection
 to "recode".  The absence of directionality is precisely why "recode"
 is used in that sense for i18n work.

I think you're not understanding what I suggested.  It might help if we 
could agree on some points and then go from there.

So, let's consider a codec and a coding as being two different things, 
where a codec is a character subset of unicode characters expressed in 
a native format, and a coding is *not* a subset of the unicode 
character set, but an _operation_ performed on text.  So you would have 
the following properties.

codec  ->  text is always in *one_codec* at any time.

coding ->  an operation performed on text.

Let's add a special default coding called 'none' to represent a 
do-nothing coding. (figuratively, for explanation purposes)

'none' ->  return the input as is, i.e. the uncoded text


Given the above relationships we have the following possible 
transformations.

   1. codec to like codec:   'ascii' to 'ascii'
   2. codec to unlike codec:   'ascii' to 'latin1'

And we have coding relationships of:

   a. coding to like coding  # Unchanged, do nothing
   b. coding to unlike coding


Then we can express all the possible combinations as...

[1.a, 1.b, 2.a, 2.b]


1.a - coding in codec to like coding in like codec:

'none' in 'ascii' to 'none' in 'ascii'

1.b - coding in codec to diff coding in like codec:

'none' in 'ascii' to 'base64' in 'ascii'

2.a - coding in codec to same coding in diff codec:

'none' in 'ascii' to 'none' in 'latin1'

2.b - coding in codec to diff coding in diff codec:

'none' in 'latin1' to 'base64' in 'ascii'

This last one is a problem, as some codecs combine coding with 
character set encoding and return text in a different encoding than 
they received.  The line is also blurred between types and encodings. 
Is unicode an encoding?  Will bytes also be an encoding?


Using the above combinations:

(1.a) is just creating a new copy of a object.

s = str(s)


(1.b) is recoding an object; it returns a copy of the object in the same 
encoding.

s = s.encode('hex-codec')  # ascii str -> ascii str coded in hex
s = s.decode('hex-codec')  # ascii str coded in hex -> ascii str

* These are really two different operations, and encoding repeatedly 
results in nested codings.  Codecs (as a pure subset of unicode) don't 
have that property.

* The hex-codec also fits the 2.b pattern below if the source string is 
of a different type than ascii (or the default string?).
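The nesting property is easy to see with Python 3's bytes-to-bytes hex codec, reached through `codecs` (shown here in modern syntax purely as an illustration of the point about repeated encoding):

```python
import codecs

b = b'abc'
once = codecs.encode(b, 'hex_codec')      # b'616263'
twice = codecs.encode(once, 'hex_codec')  # nested: b'363136323633'
# Each encode adds another layer, which must be peeled off by
# decoding the same number of times.
assert codecs.decode(codecs.decode(twice, 'hex_codec'), 'hex_codec') == b
```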


(2.a) creates a copy encoded in a new codec.

s = s.encode('latin1')

* I believe string constructors should have an encoding argument for use 
with unicode strings.

s = str(u, 'latin1')   # This would match the bytes constructor.


(2.b) are combinations of the above.

   s = u.encode('base64')
  # unicode to ascii string as base64 coded characters

   u = unicode(s.decode('base64'))
  # ascii string coded in base64 to unicode characters

or

  >>> u = unicode(s, 'base64')
  Traceback (most recent call last):
    File "<stdin>", line 1, in ?
  TypeError: decoder did not return an unicode object (type=str)

Ooops...  ;)
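As it happens, Python 3 closed this particular trap outright: `bytes.decode()` rejects non-text codecs, while the general-purpose `codecs` route keeps base64 bytes-to-bytes. A quick illustration in modern syntax:

```python
import codecs

# In Python 3, .decode() only accepts *text* encodings.
try:
    b'QSBzdHJpbmcu'.decode('base64')
except LookupError as exc:
    print(exc)  # 'base64' is not a text encoding; use codecs.decode() ...

# The general-purpose route still works, and stays bytes-to-bytes:
assert codecs.decode(b'QSBzdHJpbmcu', 'base64') == b'A string.'
```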

So is coding the same as a codec?  I think they have different 
properties and should be treated differently except when the 
practicality over purity rule is needed.  And in those cases maybe the 
names could clearly state the result.

u.decode('base64ascii')  # name indicates coding to codec


 "A string." -> "QSBzdHJpbmcu" -> "UVNCemRISnBibWN1"

Looks like the underlying sequence is:

  native string -> unicode -> unicode coded base64 -> coded ascii str

And the decode operation would be...

  coded ascii str -> unicode coded base64 -> unicode -> ascii str

Except it may combine some of these steps to speed it up.

Since it's a hybrid codec including a coding operation, we have to treat 
it as a codec.


 Ron * Given that the string type gains a __codec__ attribute
 Ron to handle automatic decoding when needed.  (is there a reason
 Ron not to?)
 
 Ronstr(object[,codec][,error]) - string coded with codec
 
 Ronunicode(object[,error]) - unicode
 
 Ronbytes(object) - bytes
 
 str == unicode in Py3k, so this is a non-starter.  What do you want to
 say?
 
 Ron  * a recode() method is used for 

Re: [Python-Dev] bytes.from_hex()

2006-02-24 Thread Greg Ewing
Stephen J. Turnbull wrote:

 the kind of text for which Unicode was designed is normally produced
 and consumed by people, who wll pt up w/ ll knds f nnsns.  Base64
 decoders will not put up with the same kinds of nonsense that people
 will.

The Python compiler won't put up with that sort of
nonsense either. Would you consider that makes Python
source code binary data rather than text, and that
it's inappropriate to represent it using a unicode
string?

 You're basically assuming that the person who implements the code that
 processes a Unicode string is the same person who implemented the code
 that converts a binary object into base64 and inserts it into a
 string.

No, I'm assuming the user of base64 knows the
characteristics of the channel he's using. You
can only use base64 if you know the channel
promises not to munge the particular characters
that base64 uses. If you don't know that, you
shouldn't be trying to send base64 through that
channel.

 In most environments, it should be possible to hide bytes-unicode
 codecs almost all the time,

But it *is* hidden in the situation I'm talking
about, because all the Unicode encoding/decoding
takes place inside the implementation of the
text channel, which I'm taking as a given.

 I don't think it's a good idea to gratuitously introduce
  wire protocols as unicode codecs,

I am *not* saying that base64 is a unicode codec!
If that's what you thought I was saying, it's no
wonder we're confusing each other.

It's just a transformation from bytes to
text. I'm only calling it unicode because all
text will be unicode in Py3k. In py2.x it could
just as well be a str -- but a str interpreted
as text, not binary.

 What do you think the email module does?
 Assuming conforming MIME messages

But I'm not assuming mime in the first place. If I
have a mail interface that will accept chunks of
binary data and encode them as a mime message for
me, then I don't need to use base64 in the first
place.

The only time I need to use something like base64
is when I have something that will only accept
text. In Py3k, "accepts text" is going to mean
"takes a character string as input", where
"character string" is a distinct type from
binary data. So having base64 produce anything
other than a character string would be awkward
and inconvenient.

I phrased that paragraph carefully to avoid using
the word unicode anywhere. Does that make it
clearer what I'm getting at?

--
Greg


Re: [Python-Dev] bytes.from_hex()

2006-02-23 Thread Greg Ewing
Stephen J. Turnbull wrote:

 Please define character, and explain how its semantics map to
 Python's unicode objects.

One of the 65 abstract entities referred to in the RFC
and represented in that RFC by certain visual glyphs.
There is a subset of the Unicode code points that
are conventionally associated with very similar glyphs,
so that there is an obvious one-to-one mapping between
these entities and those Unicode code points. These
entities therefore have a natural and obvious
representation using Python unicode strings.

 No, base64 isn't a wire protocol.  Rather, it's a schema for a family
 of wire protocols, whose alphabets are heuristically chosen on the
 assumption that code units which happen to correspond to alpha-numeric
 code points in a commonly-used coded character set are more likely to
 pass through a communication channel without corruption.

Yes, and it's up to the programmer to choose those code
units (i.e. pick an encoding for the characters) that
will, in fact, pass through the channel he is using
without corruption. I don't see how any of this is
inconsistent with what I've said.

 Only if you do no transformations that will harm the base64-encoding.
 ...  It doesn't allow any of the
 usual transformations on characters that might be applied globally to
 a mail composition buffer, for example.

I don't understand that. Obviously if you rot13 your
mail message or turn it into pig latin or something,
it's going to mess up any base64 it might contain.
But that would be a silly thing to do to a message
containing base64.

Given any piece of text, there are things it makes
sense to do with it and things it doesn't, depending
entirely on the use to which the text will eventually
be put. I don't see how base64 is any different in
this regard.

 So then you bring it right back in with base64.  Now they need to know
 about bytes-unicode codecs.

No, they need to know about the characteristics of
the channel over which they're sending the data.

Base64 is designed for situations in which you
have a *text* channel that you know is capable of
transmitting at least a certain subset of characters,
where character means whatever is used as input
to that channel.

In Py3k, text will be represented by unicode strings.
So a Py3k text channel should take unicode as its
input, not bytes.

I think we've got a bit sidetracked by talking about
mime. I wasn't actually thinking about mime, but
just a plain text message into which some base64
data was being inserted. That's the way we used to
do things in the old days with uuencode etc, before
mime was invented.

Here, the channel is NOT the socket or whatever
that the ultimate transmission takes place over --
it's the interface to your mail sending software
that takes a piece of plain text and sends it off
as a mail message somehow.

In Py3k, if a channel doesn't take unicode as input,
then it's not a text channel, and it's not appropriate
to be using base64 with it directly. It might be
appropriate to use base64 followed by some encoding,
but the programmer needs to be aware of that and
choose the encoding wisely. It's not possible to
shield him from having to know about encodings in
that situation, even if the encoding is just ascii.
Trying to do so will just lead to more confusion,
in my opinion.

Greg


Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Stephen J. Turnbull
 Greg == Greg Ewing [EMAIL PROTECTED] writes:

Greg Stephen J. Turnbull wrote:

 What I advocate for Python is to require that the standard
 base64 codec be defined only on bytes, and always produce
 bytes.

Greg I don't understand that. It seems quite clear to me that
Greg base64 encoding (in the general sense of encoding, not the
Greg unicode sense) takes binary data (bytes) and produces
Greg characters.

Base64 is a (family of) wire protocol(s).  It's not clear to me that
it makes sense to say that the alphabets used by baseNN encodings
are composed of characters, but suppose we stipulate that.

Greg So in Py3k the correct usage would be [bytes->unicode].

IMHO, as a wire protocol, base64 simply doesn't care what Python's
internal representation of characters is.  I don't see any case for
correctness here, only for convenience, both for programmers on the
job and students in the classroom.  We can choose the character set
that works best for us.  I think that's 8-bit US ASCII.

My belief is that bytes->bytes is going to be the dominant use case,
although I don't use binary representation in XML.  However, AFAIK for
on-the-wire use UTF-8 is strongly recommended for XML, and in that
case it's also efficient to use bytes->bytes for XML, since
conversion of base64 bytes to UTF-8 characters is simply a matter of
Simon says, be UTF-8!

And in the classroom, you're just going to confuse students by telling
them that UTF-8 --[Unicode codec]--> Python string is decoding but
UTF-8 --[base64 codec]--> Python string is encoding, when MAL is
telling them that --> Python string is always decoding.

Sure, it all makes sense if you already know what's going on.  But I
have trouble remembering, especially in cases like UTF-8 vs UTF-16
where Perl and Python have opposite internal representations, and
glibc has a third which isn't either.  If base64 (and gzip, etc.) are
all considered bytes->bytes, there just isn't an issue any more.  The
simple rule wins: to Python string is always decoding.

Why fight it when we can run away with efficiency gains? <wink>

(In the above, Python string means the unicode type, not str.)
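For what it's worth, this is the rule Python 3 eventually adopted: the base64 codec is bytes-to-bytes, reachable only through `codecs` rather than the string methods. Shown here in modern syntax as a check, not as something available in 2006:

```python
import codecs

data = b'spam'
wire = codecs.encode(data, 'base64')   # bytes in, bytes out
print(type(wire).__name__)             # bytes
assert codecs.decode(wire, 'base64') == data
```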

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba      Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Greg Ewing
Stephen J. Turnbull wrote:

 Base64 is a (family of) wire protocol(s).  It's not clear to me that
 it makes sense to say that the alphabets used by baseNN encodings
 are composed of characters,

Take a look at

   http://en.wikipedia.org/wiki/Base64

where it says

   ...base64 is a binary to text encoding scheme whereby an
   arbitrary sequence of bytes is converted to a sequence of
   printable ASCII characters.

Also see RFC 2045 (http://www.ietf.org/rfc/rfc2045.txt) which
defines base64 in terms of an encoding from octets to characters,
and also says

   A 65-character subset of US-ASCII is used ... This subset has
   the important property that it is represented identically in
   all versions of ISO 646 ... and all characters in the subset
   are also represented identically in all versions of EBCDIC.

Which seems to make it perfectly clear that the result
of the encoding is to be considered as characters, which
are not necessarily going to be encoded using ascii.

So base64 on its own is *not* a wire protocol. Only after
encoding the characters do you have a wire protocol.
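Greg's distinction can be illustrated concretely: the base64 *characters* are one thing, and their byte representation depends on a second, separate encoding step. Here cp500 serves as a stand-in EBCDIC codec:

```python
import base64

chars = base64.b64encode(b'hi').decode('ascii')  # the *characters* 'aGk='
print(chars)                   # aGk=
print(chars.encode('ascii'))   # one possible wire form
print(chars.encode('cp500'))   # same characters in EBCDIC: different bytes
```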

 I don't see any case for
 correctness here, only for convenience,

I'm thinking of convenience, too. Keep in mind that in Py3k,
'unicode' will be called 'str' (or something equally neutral
like 'text') and you will rarely have to deal explicitly with
unicode codings, this being done mostly for you by the I/O
objects. So most of the time, using base64 will be just as
convenient as it is today: base64_encode(my_bytes) and write
the result out somewhere.

The reason I say it's *correct* is that if you go straight
from bytes to bytes, you're *assuming* the eventual encoding
is going to be an ascii superset. The programmer is going to
have to know about this assumption and understand all its
consequences and decide whether it's right, and if not, do
something to change it.

Whereas if the result is text, the right thing happens
automatically whatever the ultimate encoding turns out to
be. You can take the text from your base64 encoding, combine
it with other text from any other source to form a complete
mail message or xml document or whatever, and write it out
through a file object that's using any unicode encoding
at all, and the result will be correct.
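A sketch of the workflow Greg describes, in current Python; the explicit `.decode('ascii')` step stands in for base64 returning text directly, and an in-memory stream plays the role of the text channel:

```python
import base64
import io

payload = bytes(range(4))
b64_text = base64.b64encode(payload).decode('ascii')  # text, in Greg's model

# Combine with other text and write through a channel using *any*
# unicode encoding -- UTF-16 here, to make the point.
message = 'Attachment follows:\n' + b64_text + '\n'
out = io.TextIOWrapper(io.BytesIO(), encoding='utf-16')
out.write(message)
out.flush()

raw = out.buffer.getvalue()   # the actual wire bytes are UTF-16
# The base64 text survives whatever encoding the channel uses.
assert base64.b64decode(raw.decode('utf-16').splitlines()[1]) == payload
```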

  it's also efficient to use bytes->bytes for XML, since
 conversion of base64 bytes to UTF-8 characters is simply a matter of
 Simon says, be UTF-8!

Efficiency is an implementation concern. In Py3k, strings
which contain only ascii or latin-1 might be stored as
1 byte per character, in which case this would not be an
issue.

 And in the classroom, you're just going to confuse students by telling
 them that UTF-8 --[Unicode codec]--> Python string is decoding but
 UTF-8 --[base64 codec]--> Python string is encoding, when MAL is
 telling them that --> Python string is always decoding.

Which is why I think that only *unicode* codings should be
available through the .encode and .decode interface. Or
alternatively there should be something more explicit like
.unicode_encode and .unicode_decode that is thus restricted.

Also, if most unicode coding is done in the I/O objects, there
will be far less need for programmers to do explicit unicode
coding in the first place, so likely it will become more of
an advanced topic, rather than something you need to come to
grips with on day one of using unicode, like it is now.

--
Greg


Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread James Y Knight

On Feb 22, 2006, at 6:35 AM, Greg Ewing wrote:

 I'm thinking of convenience, too. Keep in mind that in Py3k,
 'unicode' will be called 'str' (or something equally neutral
 like 'text') and you will rarely have to deal explicitly with
 unicode codings, this being done mostly for you by the I/O
 objects. So most of the time, using base64 will be just as
 convenient as it is today: base64_encode(my_bytes) and write
 the result out somewhere.

 The reason I say it's *correct* is that if you go straight
 from bytes to bytes, you're *assuming* the eventual encoding
 is going to be an ascii superset. The programmer is going to
 have to know about this assumption and understand all its
 consequences and decide whether it's right, and if not, do
 something to change it.

 Whereas if the result is text, the right thing happens
 automatically whatever the ultimate encoding turns out to
 be. You can take the text from your base64 encoding, combine
 it with other text from any other source to form a complete
 mail message or xml document or whatever, and write it out
 through a file object that's using any unicode encoding
 at all, and the result will be correct.

This makes little sense for mail. You combine *bytes*, in various and
possibly different encodings, to form a mail message. Some MIME
sections might have a base64 Content-Transfer-Encoding, others might
be 8bit encoded, others might be 7bit encoded, others might be
quoted-printable encoded. Before the C-T-E encoding, you will have had
to do the Content-Type encoding, converting your text into bytes with
the desired character encoding: utf-8, iso-8859-1, etc. Having the
final mail message be made up of characters right before transmission
to the socket would be crazy.

James


Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Terry Reedy

Greg Ewing [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
 Efficiency is an implementation concern.

It is also a user concern, especially if inefficiency overruns memory 
limits.

 In Py3k, strings
 which contain only ascii or latin-1 might be stored as
 1 byte per character, in which case this would not be an
 issue.

If 'might' becomes 'will', I and I suspect others will be happier with the 
change.  And I would be happy if the choice of physical storage was pretty 
much handled behind the scenes, as with the direction int/long unification 
is going.

 Which is why I think that only *unicode* codings should be
 available through the .encode and .decode interface. Or
 alternatively there should be something more explicit like
 .unicode_encode and .unicode_decode that is thus restricted.

I prefer the shorter names and using recode, for instance, for bytes to 
bytes.

Terry Jan Reedy





Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Ron Adam
Terry Reedy wrote:

 Greg Ewing [EMAIL PROTECTED] wrote in message 

 Which is why I think that only *unicode* codings should be
 available through the .encode and .decode interface. Or
 alternatively there should be something more explicit like
 .unicode_encode and .unicode_decode that is thus restricted.
 
 I prefer the shorter names and using recode, for instance, for bytes to 
 bytes.

While I prefer constructors with an explicit encode argument, and use a 
recode() method for 'like to like' coding.  Then the whole encode/decode 
confusion goes away.




Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Greg Ewing
Terry Reedy wrote:
 Greg Ewing [EMAIL PROTECTED] wrote in message 

Efficiency is an implementation concern.

 It is also a user concern, especially if inefficiency overruns memory 
 limits.

Sure, but what I mean is that it's better to find what's
conceptually right and then look for an efficient way
of implementing it, rather than letting the implementation
drive the design.

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | Carpe post meridiam! |
Christchurch, New Zealand  | (I'm not a morning person.)  |
[EMAIL PROTECTED]  +--+


Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Greg Ewing
Ron Adam wrote:

 While I prefer constructors with an explicit encode argument, and use a 
 recode() method for 'like to like' coding.  Then the whole encode/decode 
 confusion goes away.

I'd be happy with that, too.

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | Carpe post meridiam! |
Christchurch, New Zealand  | (I'm not a morning person.)  |
[EMAIL PROTECTED]  +--+


Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Greg Ewing
James Y Knight wrote:

 Some MIME sections might have a base64 Content-Transfer-Encoding,
 others might be 8bit encoded, others might be 7bit encoded, others
 might be quoted-printable encoded.

I stand corrected -- in that situation you would have to encode
the characters before combining them with other material.

However, this doesn't change my view that the result of base64
encoding by itself is characters, not bytes. To go straight
to bytes would require assuming an encoding, and that would
make it *harder* to use in cases where you wanted a different
encoding, because you'd first have to undo the default
encoding and then re-encode it using the one you wanted.

It may be reasonable to provide an easy way to go straight
from raw bytes to ascii-encoded-base64 bytes, but that should
be a different codec. The plain base64 codec should produce
text.
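(A sketch of the two codecs Greg distinguishes here, using today's stdlib base64 module; the function names are hypothetical:)

```python
import base64

def base64_encode(data: bytes) -> str:
    # The "plain" base64 codec Greg proposes: bytes in, text out.
    return base64.b64encode(data).decode('ascii')

def base64_to_ascii_bytes(data: bytes) -> bytes:
    # The separate codec for going straight from raw bytes
    # to ascii-encoded-base64 bytes.
    return base64_encode(data).encode('ascii')
```

For example, base64_encode(b'hi') produces the text 'aGk=', while base64_to_ascii_bytes(b'hi') produces the bytes b'aGk='.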

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | Carpe post meridiam! |
Christchurch, New Zealand  | (I'm not a morning person.)  |
[EMAIL PROTECTED]  +--+


Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Stephen J. Turnbull
 Greg == Greg Ewing [EMAIL PROTECTED] writes:

Greg Stephen J. Turnbull wrote:

 Base64 is a (family of) wire protocol(s).  It's not clear to me
 that it makes sense to say that the alphabets used by baseNN
 encodings are composed of characters,

Greg Take a look at [this that the other]

Those references use "character" in an ambiguous and ill-defined way.
Trying to impose Python unicode object semantics on vague "characters"
is a bad idea IMO.

Greg Which seems to make it perfectly clear that the result of
Greg the encoding is to be considered as characters, which are
Greg not necessarily going to be encoded using ascii.

Please define "character", and explain how its semantics map to
Python's unicode objects.

Greg So base64 on its own is *not* a wire protocol. Only after
Greg encoding the characters do you have a wire protocol.

No, base64 isn't a wire protocol.  Rather, it's a schema for a family
of wire protocols, whose alphabets are heuristically chosen on the
assumption that code units which happen to correspond to alpha-numeric
code points in a commonly-used coded character set are more likely to
pass through a communication channel without corruption.

Note that I have _precisely_ defined what I mean.  You still have the
problem that you haven't defined character, and that is a real
problem, see below.

 I don't see any case for correctness here, only for
 convenience,

Greg I'm thinking of convenience, too. Keep in mind that in Py3k,
Greg 'unicode' will be called 'str' (or something equally neutral
Greg like 'text') and you will rarely have to deal explicitly
Greg with unicode codings, this being done mostly for you by the
Greg I/O objects. So most of the time, using base64 will be just
Greg as convenient as it is today: base64_encode(my_bytes) and
Greg write the result out somewhere.

Convenient, yes, but incorrect.  Once you mix those bytes with the
Python string type, they become subject to all the usual operations on
characters, and there's no way for Python to tell you that you didn't
want to do that.  Ie,

Greg Whereas if the result is text, the right thing happens
Greg automatically whatever the ultimate encoding turns out to
Greg be. You can take the text from your base64 encoding, combine
Greg it with other text from any other source to form a complete
Greg mail message or xml document or whatever, and write it out
Greg through a file object that's using any unicode encoding at
Greg all, and the result will be correct.

Only if you do no transformations that will harm the base64-encoding.
This is why I say base64 is _not_ based on characters, at least not in
the way they are used in Python strings.  It doesn't allow any of the
usual transformations on characters that might be applied globally to
a mail composition buffer, for example.

In other words, you don't escape from the programmer having to know
what he's doing.  EIBTI, and the setup I advocate forces the
programmer to explicitly decide where to convert base64 objects to a
textual representation.  This reminds him that he'd better not touch
that text.

Greg The reason I say it's *correct* is that if you go straight
Greg from bytes to bytes, you're *assuming* the eventual encoding
Greg is going to be an ascii superset.  The programmer is going
Greg to have to know about this assumption and understand all its
Greg consequences and decide whether it's right, and if not, do
Greg something to change it.

I'm not assuming any such thing, except in the context of analysis of
implementation efficiency.  And the programmer needs to know about the
semantics of text that is actually a base64-encoded object, and that
they are different from string semantics.

This is something that programmers are used to dealing with in the
case of Python 2.x str and C char[]; the whole point of the unicode
type is to allow the programmer to abstract from that when dealing
with human-readable text.  Why confuse the issue?

 And in the classroom, you're just going to confuse students by
 telling them that UTF-8 --[Unicode codec]--> Python string is
 decoding but UTF-8 --[base64 codec]--> Python string is
 encoding, when MAL is telling them that --> Python string is
 always decoding.

Greg Which is why I think that only *unicode* codings should be
Greg available through the .encode and .decode interface. Or
Greg alternatively there should be something more explicit like
Greg .unicode_encode and .unicode_decode that is thus restricted.

Greg Also, if most unicode coding is done in the I/O objects,
Greg there will be far less need for programmers to do explicit
Greg unicode coding in the first place, so likely it will become
Greg more of an advanced topic, rather than something you need to
Greg come to grips with on day one of using unicode, like it is
Greg now.

So then you bring it 

Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Stephen J. Turnbull
 Ron == Ron Adam [EMAIL PROTECTED] writes:

Ron Terry Reedy wrote:

 I prefer the shorter names and using recode, for instance, for
 bytes to bytes.

Ron While I prefer constructors with an explicit encode argument,
Ron and use a recode() method for 'like to like' coding. 

'Recode' is a great name for the conceptual process, but the methods
are directional.  Also, in internationalization work, recode
strongly connotes encodingA -> original -> encodingB, as in iconv.

I do prefer constructors, as it's generally not a good idea to do
encoding/decoding in-place for human-readable text, since the codecs
are often lossy.

Ron Then the whole encode/decode confusion goes away.

Unlikely.  Errors like "A string".encode("base64").encode("base64")
are all too easy to commit in practice.
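(The pitfall is easy to reproduce with the codecs module in Python 3 spelling, where the base64 codec maps bytes to bytes:)

```python
import codecs

data = b'abc'
once = codecs.encode(data, 'base64')   # b'YWJj\n'
twice = codecs.encode(once, 'base64')  # accidental double encoding

# One decode no longer recovers the original...
assert codecs.decode(twice, 'base64') != data
# ...only decoding twice undoes the mistake.
assert codecs.decode(codecs.decode(twice, 'base64'), 'base64') == data
```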

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of TsukubaTennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-21 Thread Greg Ewing
Stephen J. Turnbull wrote:

 What I advocate for Python is to require that the standard base64
 codec be defined only on bytes, and always produce bytes.

I don't understand that. It seems quite clear to me that
base64 encoding (in the general sense of encoding, not the
unicode sense) takes binary data (bytes) and produces characters.
That's the whole point of base64 -- so you can send arbitrary
data over a channel that is only capable of dealing with
characters.

So in Py3k the correct usage would be

                    base64              unicode
                    encode              encode(x)
    original bytes --------> unicode ---------> bytes for transmission
                   <--------         <---------
                    base64              unicode
                    decode              decode(x)

where x is whatever unicode encoding the transmission
channel uses for characters (probably ascii or an ascii
superset, but not necessarily).

So, however it's spelled, the typing is such that

base64_encode(bytes) --> unicode

and

base64_decode(unicode) --> bytes
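Spelled with the Python 3 stdlib (where str plays the role of unicode), the two-stage pipeline above might look like:

```python
import base64

raw = b'\x00\x01\xfe'                         # original bytes
text = base64.b64encode(raw).decode('ascii')  # base64 encode: bytes -> text
wire = text.encode('utf-8')                   # unicode encode(x): text -> bytes for transmission

# Decoding runs the same two stages in reverse:
back = base64.b64decode(wire.decode('utf-8'))
assert back == raw
```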

--
Greg





Re: [Python-Dev] bytes.from_hex()

2006-02-21 Thread Barry Warsaw
On Sun, 2006-02-19 at 23:30 +0900, Stephen J. Turnbull wrote:
  M == M.-A. Lemburg [EMAIL PROTECTED] writes:

 M * for Unicode codecs the original form is Unicode, the derived
 M form is, in most cases, a string
 
 First of all, that's Martin's point!
 
 Second, almost all Americans, a large majority of Japanese, and I
 would bet most Western Europeans would say you have that backwards.
 That's the problem, and it's the Unicode advocates' problem (ie,
 ours), not the users'.  Even if we're right: education will require
 lots of effort.  Rather, we should just make it as easy as possible to
 do it right, and hard to do it wrong.

I think you've hit the nail squarely on the head.  Even though I /know/
what the intended semantics are, the originality of the string form is
deeply embedded in my nearly 30 years of programming experience, almost
all of it completely American English-centric.  

I always have to stop and think about which direction .encode()
and .decode() go in because it simply doesn't feel natural.  Or more
simply put, my brain knows what's right, but my heart doesn't and that's
why converting from one to the other is always a hiccup in the smooth
flow of coding.  And while I'm sympathetic to MAL's design decisions,
the overlaying of the generalizations doesn't help.

-Barry





Re: [Python-Dev] bytes.from_hex()

2006-02-21 Thread Greg Ewing
Josiah Carlson wrote:

 It doesn't seem strange to you to need to encode data twice to be able
 to have a usable sequence of characters which can be embedded in an
 effectively 7-bit email;

I'm talking about a 3.0 world where all strings are unicode
and the unicode - external coding is for the most part
done automatically by the I/O objects. So you'd be building
up your whole email as a string (aka unicode) which happens
to only contain code points in the range 0..127, and then
writing it to your socket or whatever. You wouldn't need
to do the second encoding step explicitly very often.

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | Carpe post meridiam! |
Christchurch, New Zealand  | (I'm not a morning person.)  |
[EMAIL PROTECTED]  +--+


Re: [Python-Dev] bytes.from_hex()

2006-02-20 Thread Stephen J. Turnbull
 Martin == Martin v Löwis [EMAIL PROTECTED] writes:

Martin Stephen J. Turnbull wrote:

Bengt The characters in b could be encoded in plain ascii, or
Bengt utf16le, you have to know.

 Which base64 are you thinking about?  Both RFC 3548 and RFC
 2045 (MIME) specify subsets of US-ASCII explicitly.

Martin Unfortunately, it is ambiguous as to whether they refer to
Martin US-ASCII, the character set, or US-ASCII, the encoding.

True for RFC 3548, but the authors of RFC 2045 clearly had the
encoding in mind, since they depend on RFC 822.

Martin It appears that RFC 3548 talks about the character set
Martin only:

OK, although RFC 3548 cites RFC 20 (!) as its source for US-ASCII,
which clearly has bytes (though not necessarily octets) in mind, it
doesn't actually restrict base encoding to be a subset of US-ASCII.

On the other hand, RFC 3548 doesn't define base64 (or any other base
encoding), it simply provides a set of requirements that a conforming
implementation must satisfy.  Python can therefore choose to define
its base64 as a bytes-bytes codec, with the alphabet drawn from
US-ASCII interpreted as encoding.

I would definitely prefer that, as png_image = unicode.encode('base64')
violates MAL's intuitive schema for the method.

Martin For an example where base64 is *not* necessarily
Martin ASCII-encoded, see the binary data type in XML
Martin Schema. There, base64 is embedded into an XML document,
Martin and uses the encoding of the entire XML document. As a
Martin result, you may get base64 data in utf16le.

I'll have to take a look.  It depends on whether base64 is specified
as an octet-stream to Unicode stream transformation or as an embedding
of an intermediate representation into Unicode.  Granted, defining the
base64 alphabet as a subset of Unicode seems like the logical way to
do it in the context of XML.

P.S.  My apologies for munging your name in the To: header.  I'm
having problems with my MUA.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of TsukubaTennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-20 Thread Bengt Richter
On Sat, 18 Feb 2006 23:33:15 +0100, Thomas Wouters [EMAIL PROTECTED] wrote:

On Sat, Feb 18, 2006 at 01:21:18PM +0100, M.-A. Lemburg wrote:

[...]
   - The return value for the non-unicode encodings depends on the value of
 the encoding argument.

 Not really: you'll always get a basestring instance.

But actually basestring is a weird graft of semantic apples and empty
bags IMO. unicode is essentially an abstract character vector type, and
str is an abstract binary octet vector type having nothing to do with
characters except by inferential association with an encoding.

Which is not a particularly useful distinction, since in any real world
application, you have to be careful not to mix unicode with (non-ascii)
bytestrings. The only way to reliably deal with unicode is to have it
well-contained (when migrating an application from using bytestrings to
using unicode) or to use unicode everywhere, decoding/encoding at
entrypoints. Containment is hard to achieve.

 Still, I believe that this is an educational problem. There are
 a couple of gotchas users will have to be aware of (and this is
 unrelated to the methods in question):
 
 * encoding always refers to transforming original data into
   a derived form
ISTM encoding separates type information from the source and sets it aside
as the identity of the encoding, and renders the data in a composite of
more primitive types, octets being the most primitive short of bits.

 
 * decoding always refers to transforming a derived form of
   data back into its original form
Decoding of a composite of primitives requires additional separate information
(namely identification of the encoding) to create a higher composite type.
 
 * for Unicode codecs the original form is Unicode, the derived
   form is, in most cases, a string
You mean a str instance, right? Where the original type as character vector
is gone. That's not a string in the sense of character string.
 
 As a result, if you want to use a Unicode codec such as utf-8,
 you encode Unicode into a utf-8 string and decode a utf-8 string
 into Unicode.
s/string/str instance/
 
 Encoding a string is only possible if the string itself is
 original data, e.g. some data that is supposed to be transformed
 into a base64 encoded form.
note what base64 really is for. Its essence is to create a _character_
sequence which can succeed in being encoded as ascii. The concept of
base64 going str -> str is really a mental shortcut for
s_str.decode('base64').encode('ascii'), where 3 octets are decoded as
code for 4 characters modulo padding logic.

 
 Decoding Unicode is only possible if the Unicode string itself
 represents a derived form, e.g. a sequence of hex literals.
Again, it's an abbreviation, e.g. 
print u'4cf6776973'.encode('hex_chars_to_octets').decode('latin-1')
Should print Löwis
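(The 'hex_chars_to_octets' codec here is hypothetical; modern Python spells the same round trip with bytes.fromhex:)

```python
# '4cf6776973' is hex for the latin-1 encoding of 'Löwis'
octets = bytes.fromhex('4cf6776973')
assert octets == b'L\xf6wis'
print(octets.decode('latin-1'))  # prints: Löwis
```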


Most of these gotchas would not have been gotchas had encode/decode only
been usable for unicode encodings.

  That is why I disagree with the hypergeneralization of the encode/decode
  methods
[..]
 That's because you only look at one specific task.

 Codecs also unify the various interfaces to common encodings
 such as base64, uu or zip which are not Unicode related.
I think the trouble is that these view the transformations as
octets -> octets, whereas IMO decoding should always result in a
container type that knows what it is semantically, without association
with external "use this codec" information. IOW,

octets.decode('zip') -> archive
archive.encode('bzip') -> octets

You could even subclass octet to make an archive type that knows it's an
octet vector representing a decoded zip, so it can have an encode method
that could (specifying 'zip' again) encode itself back to the original
zip, or an alternate method to encode itself as something else, which you
couldn't do from plain octets without specifying both transformations at
once (hence the .recode idea, but I don't think that is as pure). The
constructor for the container type could also be used, like
Archive(octets, 'zip'), analogous to unicode('abc', 'ascii').

IOW
octets + decoding info -> container type instance
container type instance + encoding info -> octets
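('zip' is not an actual codec name; the closest existing bytes-to-bytes example in the codecs registry is 'zlib':)

```python
import codecs

payload = b'data' * 100
packed = codecs.encode(payload, 'zlib')  # bytes -> compressed bytes
assert codecs.decode(packed, 'zlib') == payload
assert len(packed) < len(payload)        # repetitive input compresses well
```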

No, I think you misunderstand. I object to the hypergeneralization of the
*encode/decode methods*, not the codec system. I would have been fine with
another set of methods for non-unicode transformations. Although I would
have been even more fine if they got their encoding not as a string, but as,
say, a module object, or something imported from a module.

Not that I think any of this matters; we have what we have and I'll have to
live with it ;)
Probably.
BTW, you may notice I'm saying octet instead of bytes. I have another
post on that, arguing that the basic binary information type should be
octet, since binary files are made of octets that have no intrinsic
numerical or character significance.
See other post if interested ;-)

Regards,
Bengt Richter


Re: [Python-Dev] bytes.from_hex()

2006-02-20 Thread Stephen J. Turnbull
 Josiah == Josiah Carlson [EMAIL PROTECTED] writes:

Josiah I try to internalize it by not thinking of strings as
Josiah encoded data, but as binary data, and unicode as text.  I
Josiah then remind myself that unicode isn't native on-disk or
Josiah cross-network (which stores and transports bytes, not
Josiah characters), so one needs to encode it as binary data.
Josiah It's a subtle difference, but it has worked so far for me.

Seems like a lot of work for something that for monolingual usage
should Just Work almost all of the time.

Josiah I notice that you seem to be in Japan, so teaching unicode
Josiah is a must.

Yes.  Japan is more complicated than that, but in Python unicode is a
must.

Josiah If you are using the unicode is text and strings are
Josiah data, and they aren't getting it; then I don't know.

Well, I can tell you that they don't get it.  One problem is PEP 263.
It makes it very easy to write programs that do line-oriented I/O with
input() and print, and the students come to think it should always be
that easy.  Since Japan has at least 6 common encodings that students
encounter on a daily basis while browsing the web, plus a couple more
that live inside of MSFT Word and Java, they're used to huge amounts
of magic.  The normal response of novice programmers is to mandate
that users of their programs use the encoding of choice and put it in
ordinary strings so that it just works.

Ie, the average student just eats the F on the codecs assignment,
and writes the rest of her programs without them.

 simple, and the exceptions for using a nonexistent method
 mean I don't have to reinforce---the students will be able to
 teach each other.  The exceptions also directly help reinforce
 the notion that text == Unicode.

Josiah Are you sure that they would help?  If .encode() and
Josiah .decode() drop from strings and unicode (respectively),
Josiah they get an AttributeError.  That's almost useless.

Well, I'm not _sure_, but this is the kind of thing that you can learn
by rote.  And it will happen on a sufficiently regular basis that a
large fraction of students will experience it.  They'll ask each
other, and usually they'll find a classmate who knows what happened.

I haven't tried this with codecs, but that's been my experience with
statistical packages where some routines understand non-linear
equations but others insist on linear equations.[1] The error messages
("Equation is non-linear!  Aaugh!") are not much more specific than
AttributeError.

Josiah Raising a better exception (with more information) would
Josiah be better in that case, but losing the functionality that
Josiah either would offer seems unnecessary;

Well, the point is that for the usual suspects (ie, Unicode codecs)
there is no functionality that would be lost.  As MAL pointed out, for
these codecs the original text is always Unicode; that's the role
Unicode is designed for, and by and large it fits the bill very well.
With few exceptions (such as rot13) the derived text will be bytes
that peripherals such as keyboards and terminals can generate and
display.

Josiah "You are trying to encode/decode to/from incompatible
Josiah types." "expected: a->b, got: x->y" is better.  Some of those
Josiah can be done *very soon*, given the capabilities of the
Josiah encodings module,

That's probably the way to go.

If we can have a derived Unicode codec class that does this, that
would pretty much entirely serve the need I perceive.  Beginning
students could learn to write iconv.py, more advanced students could
learn to create codec stacks to generate MIME bodies, which could
include base64 or quoted-printable bytes -> bytes codecs.
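A minimal sketch of such a restricted interface with the directional error messages Josiah suggests (all names here are hypothetical):

```python
def unicode_encode(text, codec):
    # Only text -> bytes; anything else gets a directional error message.
    if not isinstance(text, str):
        raise TypeError("expected: str -> bytes, got: %s" % type(text).__name__)
    return text.encode(codec)

def unicode_decode(data, codec):
    # Only bytes -> text.
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("expected: bytes -> str, got: %s" % type(data).__name__)
    return data.decode(codec)
```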



Footnotes: 
[1]  If you're not familiar with regression analysis, the problem is
that the equation z = a*log(x) + b*log(y) where a and b are to be
estimated is _linear_ in the sense that x, y, and z are data series,
and X = log(x) and Y = log(y) can be precomputed so that the equation
actually computed is z = a*X + b*Y.  On the other hand z = a*(x +
b*y) is _nonlinear_ because of the coefficient on y being a*b.
Students find this hard to grasp in the classroom, but they learn
quickly in the lab.

I believe the parameter/variable inversion that my students have
trouble with in statistics is similar to the original/derived
inversion that happens with text you can see (derived, string) and
abstract text inside the program (original, Unicode).

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of TsukubaTennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.

Re: [Python-Dev] bytes.from_hex()

2006-02-20 Thread Martin v. Löwis
Stephen J. Turnbull wrote:
 Martin For an example where base64 is *not* necessarily
 Martin ASCII-encoded, see the binary data type in XML
 Martin Schema. There, base64 is embedded into an XML document,
 Martin and uses the encoding of the entire XML document. As a
 Martin result, you may get base64 data in utf16le.
 
 I'll have to take a look.  It depends on whether base64 is specified
 as an octet-stream to Unicode stream transformation or as an embedding
 of an intermediate representation into Unicode.  Granted, defining the
 base64 alphabet as a subset of Unicode seems like the logical way to
 do it in the context of XML.

Please do take a look. It is the only way: If you were to embed base64
*bytes* into character data content of an XML element, the resulting
XML file might not be well-formed anymore (if the encoding of the XML
file is not an ASCII superencoding).
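A short illustration of Martin's point (the document below is a made-up example):

```python
import base64

payload = base64.b64encode(b'\x00\x01\x02').decode('ascii')  # 'AAEC'
doc = '<?xml version="1.0" encoding="utf-16"?><data>%s</data>' % payload
wire = doc.encode('utf-16')

# The base64 characters get re-encoded like all other character data:
assert payload.encode('utf-16-le') in wire
# Splicing the raw ASCII bytes b'AAEC' into this UTF-16 byte stream
# instead would not produce a well-formed document.
```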

Regards,
Martin


Re: [Python-Dev] bytes.from_hex()

2006-02-20 Thread Stephen J. Turnbull
 Martin == Martin v Löwis [EMAIL PROTECTED] writes:

Martin Please do take a look. It is the only way: If you were to
Martin embed base64 *bytes* into character data content of an XML
Martin element, the resulting XML file might not be well-formed
Martin anymore (if the encoding of the XML file is not an ASCII
Martin superencoding).

Excuse me, I've been doing category theory recently.  By "embedding" I
mean a map from an intermediate object which is a stream of bytes to
the corresponding stream of characters.  In the case of UTF-16-coded
characters, this would necessarily imply a representation change, as
you say.

What I advocate for Python is to require that the standard base64
codec be defined only on bytes, and always produce bytes.  Any
representation change should be done explicitly.  This is surely
conformant with RFC 2045's definition and with RFC 3548.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of TsukubaTennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-20 Thread Bob Ippolito

On Feb 20, 2006, at 7:25 PM, Stephen J. Turnbull wrote:

 Martin == Martin v Löwis [EMAIL PROTECTED] writes:

 Martin Please do take a look. It is the only way: If you were to
 Martin embed base64 *bytes* into character data content of an XML
 Martin element, the resulting XML file might not be well-formed
 Martin anymore (if the encoding of the XML file is not an ASCII
 Martin superencoding).

 Excuse me, I've been doing category theory recently.  By embedding I
 mean a map from an intermediate object which is a stream of bytes to
 the corresponding stream of characters.  In the case of UTF-16-coded
 characters, this would necessarily imply a representation change, as
 you say.

 What I advocate for Python is to require that the standard base64
 codec be defined only on bytes, and always produce bytes.  Any
 representation change should be done explicitly.  This is surely
 conformant with RFC 2045's definition and with RFC 3548.

+1

-bob



Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Michael Hudson
M.-A. Lemburg [EMAIL PROTECTED] writes:

 Martin v. Löwis wrote:
 M.-A. Lemburg wrote:
 True. However, note that the .encode()/.decode() methods on
 strings and Unicode narrow down the possible return types.
 The corresponding .bytes methods should only allow bytes and
 Unicode.
 I forgot that: what is the rationale for that restriction?

 To assure that only those types can be returned from those
 methods, ie. instances of basestring, which in return permits
 type inference for those methods.
 
  Hmm. So it is for type inference?
  Where is that documented?

 Somewhere in the python-dev mailing list archives ;-)

 Seriously, we should probably add this to the documentation.

Err.. I don't think this is a good argument, for quite
a few reasons.  There certainly aren't many other features in Python
designed to aid type inference and the knowledge that something
returns unicode or str isn't especially useful...

Cheers,
mwh

-- 
  ROOSTA:  Ever since you arrived on this planet last night you've
   been going round telling people that you're Zaphod
   Beeblebrox, but that they're not to tell anyone else.
-- The Hitch-Hikers Guide to the Galaxy, Episode 7


Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Stephen J. Turnbull
 >>>>> "Ian" == Ian Bicking [EMAIL PROTECTED] writes:

Ian> Encodings cover up eclectic interfaces, where those
Ian> interfaces fit a basic pattern -- data in, data out.

Isn't filter the word you're looking for?

I think you've just made a very strong case that this is a slippery
slope that we should avoid.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba    Tennodai 1-1-1, Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Stephen J. Turnbull
 >>>>> "M" == M.-A. Lemburg [EMAIL PROTECTED] writes:

M> Martin v. Löwis wrote:

 No. The reason to ban string.decode and bytes.encode is that it
 confuses users.

M> Instead of starting to ban everything that can potentially
M> confuse a few users, we should educate those users and tell
M> them what these methods mean and how they should be used.

ISTM it's neither "potential" nor "a few".

As Aahz pointed out, for the common use of text I/O it requires only a
single clue (Unicode is The One True Plain Text; everything else must
be decoded to Unicode before use), and you don't need any education
about how to use codecs under Martin's restrictions; you just need
to know which ones to use.

This is not a benefit to be given up lightly.

Would it be reasonable to put those restrictions in the codecs?  Ie,
so that bytes().encode('gzip') is allowed for the generic codec
'gzip', but bytes().encode('utf-8') is an error for the charset
codec 'utf-8'?
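The restriction Stephen asks about can be sketched in today's terms. A minimal, hypothetical sketch: the `CHARSET_CODECS` set and the `encode_bytes()` helper are invented names, and Python 3's `zlib_codec` stands in for a generic 'gzip'-style codec.

```python
import codecs

# Hypothetical registry of charset codecs that must not be applied
# directly to byte strings; a real implementation would ask the codec
# itself whether it is a character-set codec.
CHARSET_CODECS = {"ascii", "utf-8", "utf-16", "latin-1"}

def encode_bytes(data: bytes, codec: str) -> bytes:
    """Allow generic bytes-to-bytes codecs, reject charset codecs."""
    if codec.lower().replace("_", "-") in CHARSET_CODECS:
        raise TypeError("charset codec %r is not allowed on bytes" % codec)
    return codecs.encode(data, codec)

compressed = encode_bytes(b"spam" * 10, "zlib_codec")
assert codecs.decode(compressed, "zlib_codec") == b"spam" * 10
```

With this in place, `encode_bytes(b"...", "utf-8")` raises a TypeError while generic transforms still work, which is roughly the split Stephen proposes.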

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba    Tennodai 1-1-1, Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Stephen J. Turnbull
 >>>>> "M" == M.-A. Lemburg [EMAIL PROTECTED] writes:

M> The main reason is symmetry and the fact that strings and
M> Unicode should be as similar as possible in order to simplify
M> the task of moving from one to the other.

Those are perfectly compatible with Martin's suggestion.

M> Still, I believe that this is an educational problem. There are
M> a couple of gotchas users will have to be aware of (and this is
M> unrelated to the methods in question):

But IMO that's wrong, both in attitude and in fact.  As for attitude,
users should not have to be aware of these gotchas.  Codec writers, on
the other hand, should be required to avoid presenting users with
those gotchas.  Martin's draconian restriction is in the right
direction, but you can argue it goes way too far.

In fact, of course it's related to the methods in question.
Original vs derived data can only be defined in terms of some
notion of the usual semantics of the streams, and that is going to
be strongly reflected in the semantics of the methods.

M> * encoding always refers to transforming original data into a
M> derived form

M> * decoding always refers to transforming a derived form of
M> data back into its original form

Users *already* know that; it's a very strong connotation of the
English words.  The problem is that users typically have their own
concept of what's original and what's derived.  For example:

M> * for Unicode codecs the original form is Unicode, the derived
M> form is, in most cases, a string

First of all, that's Martin's point!

Second, almost all Americans, a large majority of Japanese, and I
would bet most Western Europeans would say you have that backwards.
That's the problem, and it's the Unicode advocates' problem (ie,
ours), not the users'.  Even if we're right: education will require
lots of effort.  Rather, we should just make it as easy as possible to
do it right, and hard to do it wrong.

BTW, what use cases do you have in mind for Unicode -> Unicode
decoding?  Maximally decomposed forms and/or eliminating compatibility
characters etc?  Very specialized.

M> Codecs also unify the various interfaces to common encodings
M> such as base64, uu or zip which are not Unicode related.

Now this is useful and has use cases I've run into, for example in
email, where you would like to use the same interface for base64 as
for shift_jis and you'd like to be able to write

def encode-mime-body (string, codec-list):
    if codec-list[0] not in charset-codec-list:
        raise NotCharsetCodecException
    if len (codec-list) > 1 and codec-list[-1] not in transfer-codec-list:
        raise NotTransferCodecException
    for codec in codec-list:
        string = string.encode (codec)
    return string

mime-body = encode-mime-body (This is a pen.,
  [ 'shift_jis', 'zip', 'base64' ])


I guess I have to admit I'm backtracking from my earlier hardline
support for Martin's position, but I'm still sympathetic: (a) that's
the direct way to make it easy to do it right, and (b) I still think
the use cases for non-Unicode codecs are YAGNI very often.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba    Tennodai 1-1-1, Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Stephen J. Turnbull
 >>>>> "Josiah" == Josiah Carlson [EMAIL PROTECTED] writes:

Josiah> The question remains: is str.decode() returning a string
Josiah> or unicode depending on the argument passed, when the
Josiah> argument quite literally names the codec involved,
Josiah> difficult to understand?  I don't believe so; am I the
Josiah> only one?

Do you do any of the user education *about codec use* that you
recommend?  The people I try to teach about coding invariably find it
difficult to understand.  The problem is that the near-universal
intuition is that for human-usable text pretty much anything *but
Unicode* will do.  This is a really hard block to get them past.
There is very good reason why Unicode is plain text ("original" in
MAL's terms) and everything else is encoded ("derived"), but students
new to the concept often take a while to get it.

Maybe it's just me, but whether it's the teacher or the students, I am
*not* excited about the education route.  Martin's simple rule *is*
simple, and the exceptions for using a nonexistent method mean I
don't have to reinforce---the students will be able to teach each
other.  The exceptions also directly help reinforce the notion that
text == Unicode.

I grant the point that .decode('base64') is useful, but I also believe
that education is a lot more easily said than done in this case.


-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba    Tennodai 1-1-1, Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Stephen J. Turnbull
 >>>>> "Bob" == Bob Ippolito [EMAIL PROTECTED] writes:

Bob> On Feb 17, 2006, at 8:33 PM, Josiah Carlson wrote:

 But you aren't always getting *unicode* text from the decoding
 of bytes, and you may be encoding bytes *to* bytes:

Please note that I presumed that you can indeed assume that decoding
of bytes always results in unicode, and encoding of unicode always
results in bytes.  I believe Guido made the proposal relying on that
assumption too.  The constructor notation makes no sense for making an
object of the same type as the original unless it's a copy constructor.

You could argue that the base64 language is indeed a different
language from the bytes language, and I'd agree.  But since there's no
way in Python to determine whether a string that conforms to base64 is
supposed to be base64 or bytes, it would be a very bad idea to
interpret the distinction as one of type.
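Stephen's working assumption (bytes decode to unicode, unicode encodes to bytes) is exactly the rule Python 3 later adopted; a minimal sketch of the resulting split, with base64 kept outside the str/bytes codec pair entirely:

```python
import base64

raw = "café".encode("utf-8")      # str -> bytes: encoding
text = raw.decode("utf-8")        # bytes -> str: decoding
assert text == "café"

# base64 maps bytes -> bytes via its own module, so there is no
# encode/decode direction left to confuse.
wire = base64.b64encode(raw)
assert base64.b64decode(wire) == raw
```

Because each method has exactly one input and one output type, the "which direction are we going" question from the quoted example cannot arise.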

 b2 = bytes(b, "base64")

 b3 = bytes(b2, "base64")

 Which direction are we going again?

Bob> This is *exactly* why the current set of codecs are INSANE.
Bob> unicode.encode and str.decode should be used *only* for
Bob> unicode codecs.  Byte transforms are entirely different
Bob> semantically and should be some other method pair.

General filters are semantically different, I agree.  But encode and
decode in English are certainly far more general than character
coding conversion.  The use of those methods for any stream conversion
that is invertible (eg, compression or encryption) is not insane.
It's just pedagogically inconvenient given the existing confusion
(outside of python-dev, of course <wink>) about character coding
issues.

I'd like to rephrase your statement as "*only* unicode.encode and
str.decode should be used for unicode codecs".  Ie, str.encode(codec)
and unicode.decode(codec) should raise errors if codec is a unicode
codec.  The question in my mind is whether we should allow other
kinds of codecs or not.

I could live with "not" <wink>, but if we're going to have other kinds
of codecs, I think they should have concrete signatures.  Ie,
basestring -> basestring shouldn't be allowed.  Content transfer
encodings like BASE64 and quoted-printable, compression, encryption,
etc IMO should be bytes -> bytes.  Overloading to unicode -> unicode
is sorta plausible for BASE64 or QP, but YAGNI.  OTOH, the Unicode
standard does define a number of unicode -> unicode transformations,
and it might make sense to generalize to case conversions etc.  (Note
that these conversions are pseudo-invertible, so you can think of them
as generalized .encode/.decode pairs.  The inverse is usually the
identity, which seems weird, but from the pedagogical standpoint you
could handle that weirdness by raising an error if the .encode method
were invoked.)

To be concrete, I could imagine writing

    s2 = s1.decode('upcase')
    if s2 == s1:
        print "Why are you shouting at me?"
    else:
        print "I like calm, well-spoken snakes."

    s3 = s2.encode('upcase')
    if s3 == s2:
        print "Never fails!"
    else:
        print "See a vet; your Python is *very* sick."

I chose the decode method to do the non-trivial transformation because
.decode()'s value is supposed to be "original" text in MAL's terms.
And that's true of uppercase-only text; you're still supposed to be
able to read it, so I guess it's not "encoded".  That's pretty
pedantic; I think it's better to raise on .encode('upcase').
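No 'upcase' codec actually exists in the standard library; the pseudo-invertible pair Stephen imagines can be sketched as plain functions:

```python
def upcase_decode(s: str) -> str:
    # "Decode" to the readable, original form: all uppercase.
    return s.upper()

def upcase_encode(s: str) -> str:
    # Pseudo-inverse: uppercase text is already in the "encoded"
    # form, so the inverse is the identity (or, per the pedantic
    # reading, an error).
    return s

s1 = "see a vet"
s2 = upcase_decode(s1)
assert s2 == "SEE A VET"
assert upcase_encode(s2) == s2   # the "Never fails!" branch
```

This illustrates why the pair is only pseudo-invertible: decode-then-encode is idempotent rather than a round trip back to `s1`.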


-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba    Tennodai 1-1-1, Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Stephen J. Turnbull
 >>>>> "Bengt" == Bengt Richter [EMAIL PROTECTED] writes:

Bengt> The characters in b could be encoded in plain ascii, or
Bengt> utf16le, you have to know.

Which base64 are you thinking about?  Both RFC 3548 and RFC 2045
(MIME) specify subsets of US-ASCII explicitly.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba    Tennodai 1-1-1, Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Martin v. Löwis
Stephen J. Turnbull wrote:
 BTW, what use cases do you have in mind for Unicode -> Unicode
 decoding?

I think rot13 falls into that category: it is a transformation
on text, not on bytes.

For other odd cases: base64 goes Unicode -> bytes in the *decode*
direction, not in the encode direction. Some may argue that base64
is bytes, not text, but in many applications, you can combine base64
(or uuencode) with arbitrary other text in a single stream. Of course,
it could be required that you go u.encode("ascii").decode("base64").
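In Python 3 terms that required two-step route looks like the following sketch, where `base64_codec` is the bytes-to-bytes spelling of the old 'base64' codec (an assumption about naming, since this thread predates Python 3):

```python
import codecs

u = "hello"
# Step 1: charset encode, text -> bytes.
raw = u.encode("ascii")
# Step 2: content-transfer encode, bytes -> bytes.
b64 = codecs.encode(raw, "base64_codec")

assert b64.strip() == b"aGVsbG8="
assert codecs.decode(b64, "base64_codec") == b"hello"
```

Making the ASCII step explicit is exactly the "representation change should be done explicitly" position from earlier in the thread.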

 def encode-mime-body (string, codec-list):
 if codec-list[0] not in charset-codec-list:
 raise NotCharsetCodecException
 if len (codec-list) > 1 and codec-list[-1] not in transfer-codec-list:
 raise NotTransferCodecException
 for codec in codec-list:
 string = string.encode (codec)
 return string
 
 mime-body = encode-mime-body (This is a pen.,
   [ 'shift_jis', 'zip', 'base64' ])

I think this is an example where you *should* use the codec API,
as designed. As that apparently requires streams for stacking (ie.
no support for codec stacking), you would have to write

def encode_mime_body(string, codec_list):
    stack = output = cStringIO.StringIO()
    for codec in reversed(codec_list):
        stack = codecs.getwriter(codec)(stack)
    stack.write(string)
    stack.reset()
    return output.getvalue()

Notice that you have to start the stacking with the last codec,
and you have to keep a reference to the StringIO object where
the actual bytes end up.
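Under the assumption that Python 3's codec names apply (`shift_jis` for the charset step, `zlib_codec` standing in for 'zip' and `base64_codec` for 'base64'), the same pipeline can be sketched without streams by folding `codecs.encode` over the codec list:

```python
import codecs
from functools import reduce

def encode_mime_body(text, charset, transfer_codecs):
    # Charset codec first (text -> bytes), then each bytes -> bytes
    # transform in order; no writer stacking needed.
    data = text.encode(charset)
    return reduce(codecs.encode, transfer_codecs, data)

body = encode_mime_body("This is a pen.", "shift_jis",
                        ["zlib_codec", "base64_codec"])
```

Decoding reverses the list, which makes the ordering point explicit: the last codec applied must be the first one undone.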

Regards,
Martin

P.S. Some LISP shows through in your Python code :-)


Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Martin v. Löwis
Stephen J. Turnbull wrote:
 Do you do any of the user education *about codec use* that you
 recommend?  The people I try to teach about coding invariably find it
 difficult to understand.  The problem is that the near-universal
 intuition is that for human-usable text pretty much anything *but
 Unicode* will do.

It really is a matter of education. For the first time in my career,
I have been teaching the first-semester programming course, and I
was happy to see that the text book already has a section on text
and Unicode (actually, I selected the text book also based on whether
there was good discussion of that aspect). So I spent quite some
time with data representation (integrals, floats, characters), and
I hope that the students now got it.

If they didn't learn it that way in the first semester (or already
got mis-educated in highschool), it will be very hard for them to
relearn. So I expect that it will take a decade or two until this
all is common knowledge.

Regards,
Martin


Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Martin v. Löwis
Stephen J. Turnbull wrote:
 Bengt> The characters in b could be encoded in plain ascii, or
 Bengt> utf16le, you have to know.
 
 Which base64 are you thinking about?  Both RFC 3548 and RFC 2045
 (MIME) specify subsets of US-ASCII explicitly.

Unfortunately, it is ambiguous as to whether they refer to US-ASCII,
the character set, or US-ASCII, the encoding. It appears that
RFC 3548 talks about the character set only:

- section 2.4 talks about choosing an alphabet, and how it should
  be possible for humans to handle such data.
- section 2.3 talks about non-alphabet characters

So it appears that RFC 3548 defines a conversion bytes -> text.
To transmit this, you then also need encoding. MIME appears
to also use the US-ASCII *encoding* (charset, in IETF speak),
for the base64 Content-Transfer-Encoding.

For an example where base64 is *not* necessarily ASCII-encoded,
see the binary data type in XML Schema. There, base64 is embedded
into an XML document, and uses the encoding of the entire XML
document. As a result, you may get base64 data in utf16le.

Regards,
Martin


Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Bob Ippolito
On Feb 19, 2006, at 10:55 AM, Martin v. Löwis wrote:

 Stephen J. Turnbull wrote:
 BTW, what use cases do you have in mind for Unicode -> Unicode
 decoding?

 I think rot13 falls into that category: it is a transformation
 on text, not on bytes.

The current implementation is a transformation on bytes, not text.
Conceptually though, it's a text -> text transform.

 For other odd cases: base64 goes Unicode -> bytes in the *decode*
 direction, not in the encode direction. Some may argue that base64
 is bytes, not text, but in many applications, you can combine base64
 (or uuencode) with arbitrary other text in a single stream. Of course,
 it could be required that you go u.encode("ascii").decode("base64").

I would say that base64 is bytes -> bytes.  Just because those bytes
happen to be in a subset of ASCII, it's still a serialization meant
for wire transmission.  Sometimes it ends up in unicode (e.g. in
XML), but that's the exception, not the rule.

-bob



Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Josiah Carlson

Stephen J. Turnbull [EMAIL PROTECTED] wrote:
 
 >>>>> "Josiah" == Josiah Carlson [EMAIL PROTECTED] writes:
 
 Josiah> The question remains: is str.decode() returning a string
 Josiah> or unicode depending on the argument passed, when the
 Josiah> argument quite literally names the codec involved,
 Josiah> difficult to understand?  I don't believe so; am I the
 Josiah> only one?
 
 Do you do any of the user education *about codec use* that you
 recommend?  The people I try to teach about coding invariably find it
 difficult to understand.  The problem is that the near-universal
 intuition is that for human-usable text pretty much anything *but
 Unicode* will do.  This is a really hard block to get them past.
 There is very good reason why Unicode is plain text ("original" in
 MAL's terms) and everything else is encoded ("derived"), but students
 new to the concept often take a while to get it.

I've not been teaching Python; when I was still a TA, it was strictly
algorithms and data structures.  Of those people who I have had the
opportunity to entice into Python, I've not followed up on their
progress to know if they had any issues.

I try to internalize it by not thinking of strings as encoded data, but
as binary data, and unicode as text.  I then remind myself that unicode
isn't native on-disk or cross-network (which stores and transports bytes,
not characters), so one needs to encode it as binary data.  It's a
subtle difference, but it has worked so far for me.

In my experience, at least for only-English speaking users, most people
don't even get to unicode.  I didn't even touch it until I had been well
versed with the encoding and decoding of all different kinds of binary
data, when a half-dozen international users (China, Japan, Russia, ...)
requested its support in my source editor; so I added it.  Supporting it
properly hasn't been very difficult, and the only real nit I have
experienced is supporting the encoding line just after the #! line for
arbitrary codecs (sometimes saving a file in a particular encoding dies).

I notice that you seem to be in Japan, so teaching unicode is a must.
If you are using the "unicode is text and strings are data" explanation,
and they aren't getting it, then I don't know.


 Maybe it's just me, but whether it's the teacher or the students, I am
 *not* excited about the education route.  Martin's simple rule *is*
 simple, and the exceptions for using a nonexistent method mean I
 don't have to reinforce---the students will be able to teach each
 other.  The exceptions also directly help reinforce the notion that
 text == Unicode.

Are you sure that they would help?  If .encode() and .decode() drop from
strings and unicode (respectively), they get an AttributeError.  That's
almost useless.  Raising a better exception (with more information)
would be better in that case, but losing the functionality that either
would offer seems unnecessary; which is why I had suggested some of the
other method names.  Perhaps: "This method was removed because it
confused users.  Use help(str.encode) (or unicode.decode) to find out
how you can do the equivalent, or do what you *really* wanted to do."


 I grant the point that .decode('base64') is useful, but I also believe
 that education is a lot more easily said than done in this case.

What I meant by education is 'better documentation' and 'better
exception messages'.  I didn't learn Python by sitting in a class; I
learned it by going through the tutorial over a weekend as a 2nd year
undergrad and writing software which could do what I wanted/needed.
Compared to the compiler messages I'd been seeing from Codewarrior and
MSVC 6, Python exceptions were like an oracle.  I can understand how
first-time programmers can have issues with *some* Python exception
messages, which is why I think that we could use better ones.  There is
also the other issue that sometimes people fail to actually read the
messages.

Again, I don't believe that an AttributeError is any better than an
"ordinal not in range(128)", but "You are trying to encode/decode
to/from incompatible types. expected: a->b got: x->y" is better.  Some
of those can be done *very soon*, given the capabilities of the
encodings module, and they could likely be easily migrated, regardless
of the decisions with .encode()/.decode() .
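A sketch of the friendlier failure Josiah describes, wrapping today's codec machinery; `friendly_decode` is a hypothetical name, not an existing API:

```python
import codecs

def friendly_decode(data, codec):
    """Decode, but re-raise failures with the types involved spelled out."""
    try:
        return codecs.decode(data, codec)
    except UnicodeDecodeError as exc:
        # Report what was being converted instead of a bare
        # "ordinal not in range(128)"-style message.
        raise ValueError(
            "cannot decode %s with %r: expected bytes valid in that "
            "encoding, got %r (%s)"
            % (type(data).__name__, codec, data[:16], exc))

assert friendly_decode(b"abc", "ascii") == "abc"
```

The point is that such messages can be layered on top of the existing encodings machinery without settling the .encode()/.decode() design question first.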

 - Josiah



Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Ron Adam
Josiah Carlson wrote:
 Bob Ippolito [EMAIL PROTECTED] wrote:

 On Feb 17, 2006, at 8:33 PM, Josiah Carlson wrote:

 Greg Ewing [EMAIL PROTECTED] wrote:
 Stephen J. Turnbull wrote:
 >>>>> "Guido" == Guido van Rossum [EMAIL PROTECTED] writes:
 Guido> - b = bytes(t, enc); t = text(b, enc)

 +1  The coding conversion operation has always felt like a  
 constructor
 to me, and in this particular usage that's exactly what it is.  I
 prefer the nomenclature to reflect that.
 This also has the advantage that it competely
 avoids using the verbs encode and decode
 and the attendant confusion about which direction
 they go in.

 e.g.

    s = text(b, "base64")

 makes it obvious that you're going from the
 binary side to the text side of the base64
 conversion.
 But you aren't always getting *unicode* text from the decoding of  
 bytes,
 and you may be encoding bytes *to* bytes:

 b2 = bytes(b, "base64")
 b3 = bytes(b2, "base64")

 Which direction are we going again?
 This is *exactly* why the current set of codecs are INSANE.   
 unicode.encode and str.decode should be used *only* for unicode  
 codecs.  Byte transforms are entirely different semantically and  
 should be some other method pair.
 
 The problem is that we are overloading data types.  Strings (and bytes)
 can contain both encoded text as well as data, or even encoded data.

Right

 Educate the users.  Raise better exceptions telling people why their
 encoding or decoding failed, as Ian Bicking already pointed out.  If
 bytes.encode() and the equivalent of text.decode() is going to disappear,

+1 on better documentation all around with regards to encodings and
Unicode.  The best explanation I've found so far is in PEP 100.
The Python docs and built-in help hardly explain more than the minimal
argument list for the encoding and decoding methods, and the str and
unicode type constructor arguments aren't explained any better.

 Bengt Richter had a good idea with bytes.recode() for strictly bytes
 transformations (and the equivalent for text), though it is ambiguous as
 to the direction; are we encoding or decoding with bytes.recode()?  In
 my opinion, this is why .encode() and .decode() makes sense to keep on
 both bytes and text, the direction is unambiguous, and if one has even a
 remote idea of what the heck the codec is, they know their result.
 
  - Josiah

I like the bytes.recode() idea a lot. +1

It seems to me it's a far more useful idea than encoding and decoding by 
overloading and could do both and more.  It has a lot of potential to be 
an intermediate step for encoding as well as being used for many other 
translations to byte data.

I think I would prefer that encode and decode be just functions with 
well defined names and arguments instead of being methods or arguments 
to string and Unicode types.

I'm not sure on exactly how this would work. Maybe it would need two
sets of encodings, i.e. decoders and encoders.  An exception would be
raised if the codec wasn't found for the direction one was going in.

Roughly... something or other like:

 import encodings

 def tostr(obj, encoding):
     if encoding not in encoders:
         raise LookupError('encoding not found in encoders')
     # check if obj works with encoding to string
     # ...
     b = bytes(obj).recode(encoding)
     return str(b)

 def tounicode(obj, decoding):
     if decoding not in decoders:
         raise LookupError('decoding not found in decoders')
     # check if obj works with decoding to unicode
     # ...
     b = bytes(obj).recode(decoding)
     return unicode(b)

Anyway... food for thought.
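Ron's sketch relies on the hypothetical bytes.recode() and the pre-3.0 str/unicode split; a rough, runnable approximation using today's codecs module (where the 2006 "str" is what Python 3 calls bytes):

```python
import codecs

def tostr(obj, encoding):
    # text -> encoded byte string (2006 "str" == Python 3 bytes),
    # standing in for bytes(obj).recode(encoding) followed by str().
    return codecs.encode(obj, encoding)

def tounicode(obj, decoding):
    # encoded byte string -> text, the opposite direction.
    return codecs.decode(obj, decoding)

assert tounicode(tostr("abc", "utf-8"), "utf-8") == "abc"
```

Because each helper names its destination type, the direction question the thread keeps circling never comes up at the call site.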

Cheers,
Ronald Adam









Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-18 Thread Martin v. Löwis
Aahz wrote:
 The problem is that they don't understand that "Martin v. Löwis" is not
 Unicode -- once all strings are Unicode, this is guaranteed to work.

This specific call, yes. I don't think the problem will go away as long
as both encode and decode are available for both strings and byte
arrays.

 While it's not absolutely true, my experience of watching Unicode
 confusion is that the simplest approach for newbies is: encode FROM
 Unicode, decode TO Unicode.

I think this is what should be in-grained into the library, also. It
shouldn't try to give additional meaning to these terms.

Regards,
Martin


Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Josiah Carlson

Ron Adam [EMAIL PROTECTED] wrote:
 Josiah Carlson wrote:
  Bengt Richter had a good idea with bytes.recode() for strictly bytes
  transformations (and the equivalent for text), though it is ambiguous as
  to the direction; are we encoding or decoding with bytes.recode()?  In
  my opinion, this is why .encode() and .decode() makes sense to keep on
  both bytes and text, the direction is unambiguous, and if one has even a
  remote idea of what the heck the codec is, they know their result.
  
   - Josiah
 
 I like the bytes.recode() idea a lot. +1
 
 It seems to me it's a far more useful idea than encoding and decoding by 
 overloading and could do both and more.  It has a lot of potential to be 
 an intermediate step for encoding as well as being used for many other 
 translations to byte data.

Indeed it does.

 I think I would prefer that encode and decode be just functions with 
 well defined names and arguments instead of being methods or arguments 
 to string and Unicode types.

Attaching it to string and unicode objects is a useful convenience. 
Just like x.replace(y, z) is a convenience for string.replace(x, y, z) . 
Tossing the encode/decode somewhere else, like encodings, or even string,
I see as a backwards step.

 I'm not sure on exactly how this would work. Maybe it would need two
 sets of encodings, i.e. decoders and encoders.  An exception would be
 raised if the codec wasn't found for the direction one was going in.
 
 Roughly... something or other like:
 
  import encodings
 
  def tostr(obj, encoding):
      if encoding not in encoders:
          raise LookupError('encoding not found in encoders')
      # check if obj works with encoding to string
      # ...
      b = bytes(obj).recode(encoding)
      return str(b)
 
  def tounicode(obj, decoding):
      if decoding not in decoders:
          raise LookupError('decoding not found in decoders')
      # check if obj works with decoding to unicode
      # ...
      b = bytes(obj).recode(decoding)
      return unicode(b)
 
 Anyway... food for thought.

Again, the problem is ambiguity; what does bytes.recode(something) mean?
Are we encoding _to_ something, or are we decoding _from_ something?
Are we going to need to embed the direction in the encoding/decoding
name (to_base64, from_base64, etc.)?  That isn't any better than
binascii.b2a_base64.  What about .reencode and .redecode?  It seems as
though the 're' prefix added to .encode and .decode makes it
clearer that you get the same type back as you put in, and it is also
unambiguous as to direction.

The question remains: is str.decode() returning a string or unicode
depending on the argument passed, when the argument quite literally
names the codec involved, difficult to understand?  I don't believe so;
am I the only one?

 - Josiah



Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-18 Thread M.-A. Lemburg
Martin v. Löwis wrote:
 How are users confused?
 
 Users do
 
 py> "Martin v. Löwis".encode("utf-8")
 Traceback (most recent call last):
   File stdin, line 1, in ?
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 11:
 ordinal not in range(128)
 
 because they want to convert the string to Unicode, and they have
 found a text telling them that .encode(utf-8) is a reasonable
 method.
 
 What it *should* tell them is
 
 py> "Martin v. Löwis".encode("utf-8")
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 AttributeError: 'str' object has no attribute 'encode'

I've already explained why we have .encode() and .decode()
methods on strings and Unicode many times. I've also
explained the misunderstanding that codecs can only do
Unicode-string conversions. And I've explained that
the .encode() and .decode() methods *do* check the return
types of the codecs and only allow strings or Unicode
on return (no lists, instances, tuples or anything else).

You seem to ignore this fact.

If we were to follow your idea, we should remove .encode()
and .decode() altogether and refer users to the codecs.encode()
and codecs.decode() function. However, I doubt that users
will like this idea.

 bytes.encode CAN only produce bytes.
 
 I don't understand MAL's design, but I believe in that design,
 bytes.encode could produce anything (say, a list). A codec
 can convert anything to anything else.

True. However, note that the .encode()/.decode() methods on
strings and Unicode narrow down the possible return types.
The corresponding .bytes methods should only allow bytes and
Unicode.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Feb 18 2006)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 


Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-18 Thread Thomas Wouters
On Sat, Feb 18, 2006 at 12:06:37PM +0100, M.-A. Lemburg wrote:

 I've already explained why we have .encode() and .decode()
 methods on strings and Unicode many times. I've also
 explained the misunderstanding that codecs can only do
 Unicode-string conversions. And I've explained that
 the .encode() and .decode() methods *do* check the return
 types of the codecs and only allow strings or Unicode
 on return (no lists, instances, tuples or anything else).
 
 You seem to ignore this fact.

Actually, I think the problem is that while we all agree the
bytestring/unicode methods are a useful way to convert from bytestring to
unicode and back again, we disagree on their *general* usefulness. Sure, the
codecs mechanism is powerful, and even more so because they can determine
their own return type. But it still smells and feels like a Perl attitude,
for the reasons already explained numerous times, as well:

 - The return value for the non-unicode encodings depends on the value of
   the encoding argument.

 - The general case, by and large, especially among non-power users, is to
   encode unicode to bytestrings and to decode bytestrings to unicode. And
   that is a hard enough task for many non-power users. Being able to
   use the encode/decode methods for other tasks isn't helping them.

That is why I disagree with the hypergeneralization of the encode/decode
methods, regardless of the fact that it is a natural expansion of the
implementation of codecs. Sure, it looks 'right' and 'natural' when you look
at the implementation. It sure doesn't look natural, to me and to many
others, when you look at the task of encoding and decoding
bytestrings/unicode.

-- 
Thomas Wouters [EMAIL PROTECTED]

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!


Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-18 Thread M.-A. Lemburg
Martin v. Löwis wrote:
 M.-A. Lemburg wrote:
 Just because some codecs don't fit into the string.decode()
 or bytes.encode() scenario doesn't mean that these codecs are
 useless or that the methods should be banned.
 
 No. The reason to ban string.decode and bytes.encode is that
 it confuses users.

Instead of starting to ban everything that can potentially
confuse a few users, we should educate those users and tell
them what these methods mean and how they should be used.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Michael Hudson
This posting is entirely tangential.  Be warned.

Martin v. Löwis [EMAIL PROTECTED] writes:

 It's worse than that. The return *type* depends on the *value* of
 the argument. I think there is little precedence for that:

There's one extremely significant example where the *value* of
something impacts on the type of something else: functions.  The types
of everything involved in str([1]) and len([1]) are the same but the
results are different.  This shows up in PyPy's type annotation; most
of the time we just track types indeed, but when something is called
we need to have a pretty good idea of the potential values, too.

Relevant to the point at hand?  No.  Apologies for wasting your time
:)

Cheers,
mwh

-- 
  The ultimate laziness is not using Perl.  That saves you so much
  work you wouldn't believe it if you had never tried it.
-- Erik Naggum, comp.lang.lisp


Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Ron Adam
Josiah Carlson wrote:
 Ron Adam [EMAIL PROTECTED] wrote:
 Josiah Carlson wrote:
 Bengt Richter had a good idea with bytes.recode() for strictly bytes
 transformations (and the equivalent for text), though it is ambiguous as
 to the direction; are we encoding or decoding with bytes.recode()?  In
 my opinion, this is why .encode() and .decode() makes sense to keep on
 both bytes and text, the direction is unambiguous, and if one has even a
 remote idea of what the heck the codec is, they know their result.

  - Josiah
 I like the bytes.recode() idea a lot. +1

 It seems to me it's a far more useful idea than encoding and decoding by 
 overloading and could do both and more.  It has a lot of potential to be 
 an intermediate step for encoding as well as being used for many other 
 translations to byte data.
 
 Indeed it does.
 
 I think I would prefer that encode and decode be just functions with 
 well defined names and arguments instead of being methods or arguments 
 to string and Unicode types.
 
 Attaching it to string and unicode objects is a useful convenience. 
 Just like x.replace(y, z) is a convenience for string.replace(x, y, z) . 
 Tossing the encode/decode somewhere else, like encodings, or even string,
 I see as a backwards step.
 
 I'm not sure exactly how this would work. Maybe it would need two 
 sets of encodings, i.e. decoders and encoders.  An exception would be
 given if it wasn't found for the direction one was going in.

 Roughly... something or other like:

  import encodings

   def tostr(obj, encoding):
       if encoding not in encoders:
           raise LookupError('encoding not found in encoders')
       # check if obj works with encoding to string
       # ...
       b = bytes(obj).recode(encoding)
       return str(b)

   def tounicode(obj, decoding):
       if decoding not in decoders:
           raise LookupError('decoding not found in decoders')
       # check if obj works with decoding to unicode
       # ...
       b = bytes(obj).recode(decoding)
       return unicode(b)

 Anyway... food for thought.
 
 Again, the problem is ambiguity; what does bytes.recode(something) mean?
 Are we encoding _to_ something, or are we decoding _from_ something? 

This was just an example of one way that might work, but here are my 
thoughts on why I think it might be good.


In this case, the ambiguity is reduced as far as the encoding and 
decoding operations are concerned.

  somestring = encodings.tostr( someunicodestr, 'latin-1')

It's pretty clear to me what is happening.

 It will encode the object named someunicodestr to a string, using 
the 'latin-1' encoder.

And it would also result in clear errors if the specified encoding is 
unavailable, or, if it is available, not compatible with the given 
*someunicodestr* obj type.

Further hints could be gained by:

 help(encodings.tostr)

Which could result in... something like...
 
 encodings.tostr( string|unicode, encoder ) -> string

 Encode a unicode string to a non-unicode string using an
 encoder codec, or transform a non-unicode string into
 another non-unicode string using an encoder codec.

And if that's not enough, then help(encodings) could give more clues. 
These steps would be what I would do. And then the next thing would be 
to find the python docs entry on encodings.

Placing them in encodings seems like a fairly good place to look for 
these functions if you are working with encodings.  So I find that just 
as convenient as having them be string methods.

There is no intermediate default encoding involved above, (the bytes 
object is used instead), so you wouldn't get some of the messages the 
present system results in when ascii is the default.

(Yes, I know it won't when P3K is here also)

 Are we going to need to embed the direction in the encoding/decoding
   name (to_base64, from_base64, etc.)?  That isn't any better than
   binascii.b2a_base64.  

No, that's why I suggested two separate lists (or dictionaries might be 
better).  They can contain the same names, but the lists they are in 
determine the context and point to the needed codec.  And that step is 
abstracted out by putting it inside the encodings.tostr() and 
encodings.tounicode() functions.

So either function would call 'base64' from the correct codec list and 
get the correct encoding or decoding codec it needs.


What about .reencode and .redecode?  It seems as
 though the 're' added as a prefix to .encode and .decode makes it
 clearer that you get the same type back as you put in, and it is also
 unambiguous as to direction.

But then wouldn't we end up with a multitude of ways to do things?

 s.encode(codec) == s.redecode(codec)
 s.decode(codec) == s.reencode(codec)
 unicode(s, codec) == s.decode(codec)
 str(u, codec) == u.encode(codec)
 str(s, codec) == s.encode(codec)
 unicode(s, codec) == s.reencode(codec)
 str(u, codec) == s.redecode(codec)
 str(s, 

Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-18 Thread M.-A. Lemburg
Thomas Wouters wrote:
 On Sat, Feb 18, 2006 at 12:06:37PM +0100, M.-A. Lemburg wrote:
 
 I've already explained why we have .encode() and .decode()
 methods on strings and Unicode many times. I've also
 explained the misunderstanding that codecs can only do
 Unicode-string conversions. And I've explained that
 the .encode() and .decode() methods *do* check the return
 types of the codecs and only allow strings or Unicode
 on return (no lists, instances, tuples or anything else).

 You seem to ignore this fact.
 
 Actually, I think the problem is that while we all agree the
 bytestring/unicode methods are a useful way to convert from bytestring to
 unicode and back again, we disagree on their *general* usefulness. Sure, the
 codecs mechanism is powerful, and even more so because they can determine
 their own return type. But it still smells and feels like a Perl attitude,
 for the reasons already explained numerous times, as well:

It's by no means a Perl attitude.

The main reason is symmetry and the fact that strings and Unicode
should be as similar as possible in order to simplify the task of
moving from one to the other.

  - The return value for the non-unicode encodings depends on the value of
the encoding argument.

Not really: you'll always get a basestring instance.

  - The general case, by and large, especially among non-power users, is to
encode unicode to bytestrings and to decode bytestrings to unicode. And
that is a hard enough task for many non-power users. Being able to
use the encode/decode methods for other tasks isn't helping them.

Agreed.

Still, I believe that this is an educational problem. There are
a couple of gotchas users will have to be aware of (and this is
unrelated to the methods in question):

* encoding always refers to transforming original data into
  a derived form

* decoding always refers to transforming a derived form of
  data back into its original form

* for Unicode codecs the original form is Unicode, the derived
  form is, in most cases, a string

As a result, if you want to use a Unicode codec such as utf-8,
you encode Unicode into a utf-8 string and decode a utf-8 string
into Unicode.

Encoding a string is only possible if the string itself is
original data, e.g. some data that is supposed to be transformed
into a base64 encoded form.

Decoding Unicode is only possible if the Unicode string itself
represents a derived form, e.g. a sequence of hex literals.
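[Editor's note: these rules can be illustrated with current Python's codecs module. A hedged sketch; the 2.x spelling s.encode('base64') no longer exists, so codecs.encode/decode stand in for it:]

```python
import codecs

# Unicode codec: the original form is text, the derived form is bytes.
derived = "café".encode("utf-8")
assert derived == b"caf\xc3\xa9"
assert derived.decode("utf-8") == "café"

# base64 codec: the original form is bytes, the derived form is ASCII bytes.
encoded = codecs.encode(b"raw data", "base64")
assert codecs.decode(encoded, "base64") == b"raw data"
```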

 That is why I disagree with the hypergeneralization of the encode/decode
 methods, regardless of the fact that it is a natural expansion of the
 implementation of codecs. Sure, it looks 'right' and 'natural' when you look
 at the implementation. It sure doesn't look natural, to me and to many
 others, when you look at the task of encoding and decoding
 bytestrings/unicode.

That's because you only look at one specific task.

Codecs also unify the various interfaces to common encodings
such as base64, uu or zip which are not Unicode related.

-- 
Marc-Andre Lemburg
eGenix.com

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Adam Olsen
On 2/18/06, Josiah Carlson [EMAIL PROTECTED] wrote:
 Look at what we've currently got going for data transformations in the
 standard library to see what these removals will do: base64 module,
 binascii module, binhex module, uu module, ...  Do we want or need to
 add another top-level module for every future encoding/codec that comes
 out (or does everyone think that we're done seeing codecs)?  Do we want
 to keep monkey-patching binascii with names like 'a2b_hqx'?  While there
 is currently one text->text transform (rot13), do we add another module
 for text->text transforms? Would it start having names like t2e_rot13()
 and e2t_rot13()?

If top-level modules are the problem then why not make codecs into a package?

from codecs import utf8, base64

utf8.encode(u) -> b
utf8.decode(b) -> u
base64.encode(b) -> b
base64.decode(b) -> b
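[Editor's note: such a package could be built as thin wrappers over the existing codec registry. The layout below is purely hypothetical; the names _Codec, utf8 and base64 are illustrative, not an existing API:]

```python
import codecs

class _Codec:
    """Hypothetical per-codec namespace wrapping codecs.encode/decode."""
    def __init__(self, name):
        self._name = name
    def encode(self, data):
        return codecs.encode(data, self._name)
    def decode(self, data):
        return codecs.decode(data, self._name)

utf8 = _Codec("utf-8")     # text <-> bytes
base64 = _Codec("base64")  # bytes <-> bytes

assert utf8.encode("u") == b"u"
assert utf8.decode(b"u") == "u"
assert base64.decode(base64.encode(b"b")) == b"b"
```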

--
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Aahz
On Sat, Feb 18, 2006, Ron Adam wrote:

 I like the bytes.recode() idea a lot. +1
 
 It seems to me it's a far more useful idea than encoding and decoding by 
 overloading and could do both and more.  It has a lot of potential to be 
 an intermediate step for encoding as well as being used for many other 
 translations to byte data.
 
 I think I would prefer that encode and decode be just functions with 
 well defined names and arguments instead of being methods or arguments 
 to string and Unicode types.
 
 I'm not sure exactly how this would work. Maybe it would need two 
 sets of encodings, i.e. decoders and encoders.  An exception would be
 given if it wasn't found for the direction one was going in.

Here's an idea I don't think I've seen before:

bytes.recode(b, src_encoding, dest_encoding)

This requires the user to state up-front what the source encoding is.
One of the big problems that I see with the whole encoding mess is that
so much of it contains implicit assumptions about the source encoding;
this gets away from that.
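[Editor's note: in terms of the existing machinery, that signature would amount to decode-then-encode. A minimal sketch, assuming text encodings on both sides:]

```python
import codecs

def recode(b, src_encoding, dest_encoding):
    """Transcode bytes: decode from src_encoding, re-encode to dest_encoding."""
    return codecs.decode(b, src_encoding).encode(dest_encoding)

# 'é' is 0xE9 in latin-1 and 0xC3 0xA9 in utf-8:
assert recode(b"caf\xe9", "latin-1", "utf-8") == b"caf\xc3\xa9"
```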
-- 
Aahz ([EMAIL PROTECTED])   * http://www.pythoncraft.com/

19. A language that doesn't affect the way you think about programming,
is not worth knowing.  --Alan Perlis


Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread M.-A. Lemburg
Aahz wrote:
 On Sat, Feb 18, 2006, Ron Adam wrote:
 I like the bytes.recode() idea a lot. +1

 It seems to me it's a far more useful idea than encoding and decoding by 
 overloading and could do both and more.  It has a lot of potential to be 
 an intermediate step for encoding as well as being used for many other 
 translations to byte data.

 I think I would prefer that encode and decode be just functions with 
 well defined names and arguments instead of being methods or arguments 
 to string and Unicode types.

 I'm not sure exactly how this would work. Maybe it would need two 
 sets of encodings, i.e. decoders and encoders.  An exception would be
 given if it wasn't found for the direction one was going in.
 
 Here's an idea I don't think I've seen before:
 
 bytes.recode(b, src_encoding, dest_encoding)
 
 This requires the user to state up-front what the source encoding is.
 One of the big problems that I see with the whole encoding mess is that
 so much of it contains implicit assumptions about the source encoding;
 this gets away from that.

You might want to look at the codecs.py module: it has all these
things and a lot more.

http://docs.python.org/lib/module-codecs.html
http://svn.python.org/view/python/trunk/Lib/codecs.py?view=markup

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-18 Thread Martin v. Löwis
M.-A. Lemburg wrote:
 I've already explained why we have .encode() and .decode()
 methods on strings and Unicode many times. I've also
 explained the misunderstanding that codecs can only do
 Unicode-string conversions. And I've explained that
 the .encode() and .decode() methods *do* check the return
 types of the codecs and only allow strings or Unicode
 on return (no lists, instances, tuples or anything else).
 
 You seem to ignore this fact.

I'm not ignoring the fact that you have explained this
many times. I just fail to understand your explanations.

For example, you said at some point that codecs are not
restricted to Unicode. However, I don't recall any
explanation what the restriction *is*, if any restriction
exists. No such restriction seems to be documented.

 True. However, note that the .encode()/.decode() methods on
 strings and Unicode narrow down the possible return types.
 The corresponding .bytes methods should only allow bytes and
 Unicode.

I forgot that: what is the rationale for that restriction?

Regards,
Martin


Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Martin v. Löwis
Michael Hudson wrote:
 There's one extremely significant example where the *value* of
 something impacts on the type of something else: functions.  The types
 of everything involved in str([1]) and len([1]) are the same but the
 results are different.  This shows up in PyPy's type annotation; most
 of the time we just track types indeed, but when something is called
 we need to have a pretty good idea of the potential values, too.
 
 Relevant to the point at hand?  No.  Apologies for wasting your time
 :)

Actually, I think it is relevant. I never thought about it this way,
but now that you mention it, you are right.

This demonstrates that the string argument to .encode is actually
a function name, at least the way it is implemented now. So
.encode("uu") and .encode("rot13") are *two* different methods,
instead of being a single method.

This brings me back to my original point: rot13 should be a function,
not a parameter to some function. In essence, .encode reimplements
apply(), with the added feature of not having to pass the function
itself, but just its name.
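[Editor's note: the name-to-function dispatch described here is visible directly in the codecs API, which hands back the per-codec function for a given name. A sketch in modern Python:]

```python
import codecs

# Looking up a codec by its string name is essentially function lookup:
rot13_encoder = codecs.getencoder("rot13")
encoded, consumed = rot13_encoder("abc")
assert (encoded, consumed) == ("nop", 3)
```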

Maybe this design results from a really deep understanding of

Namespaces are one honking great idea -- let's do more of those!

Regards,
Martin


Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-18 Thread M.-A. Lemburg
Martin v. Löwis wrote:
 M.-A. Lemburg wrote:
 I've already explained why we have .encode() and .decode()
 methods on strings and Unicode many times. I've also
 explained the misunderstanding that codecs can only do
 Unicode-string conversions. And I've explained that
 the .encode() and .decode() methods *do* check the return
 types of the codecs and only allow strings or Unicode
 on return (no lists, instances, tuples or anything else).

 You seem to ignore this fact.
 
 I'm not ignoring the fact that you have explained this
 many times. I just fail to understand your explanations.

Feel free to ask questions.

 For example, you said at some point that codecs are not
 restricted to Unicode. However, I don't recall any
 explanation what the restriction *is*, if any restriction
 exists. No such restriction seems to be documented.

The codecs are not restricted w/r to the data types
they work on. It's up to the codecs to define which
data types are valid and which they take on input and
return.

 True. However, note that the .encode()/.decode() methods on
 strings and Unicode narrow down the possible return types.
 The corresponding .bytes methods should only allow bytes and
 Unicode.
 
 I forgot that: what is the rationale for that restriction?

To assure that only those types can be returned from those
methods, ie. instances of basestring, which in return permits
type inference for those methods.

The codecs functions encode() and decode() don't have these
restrictions, and thus provide a generic interface to the
codec's encode and decode functions. It's up to the caller
to restrict the allowed encodings and as result the
possible input/output types.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-18 Thread Martin v. Löwis
M.-A. Lemburg wrote:
True. However, note that the .encode()/.decode() methods on
strings and Unicode narrow down the possible return types.
The corresponding .bytes methods should only allow bytes and
Unicode.

I forgot that: what is the rationale for that restriction?
 
 
 To assure that only those types can be returned from those
 methods, ie. instances of basestring, which in return permits
 type inference for those methods.

Hmm. So it is for type inference.
Where is that documented?

This looks pretty inconsistent. Either codecs can give arbitrary
return types, then .encode/.decode should also be allowed to
give arbitrary return types, or codecs should be restricted.
What's the point of first allowing a wide interface, and then
narrowing it?

Also, if type inference is the goal, what is the point in allowing
two result types?

Regards,
Martin


Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-18 Thread M.-A. Lemburg
Martin v. Löwis wrote:
 M.-A. Lemburg wrote:
 True. However, note that the .encode()/.decode() methods on
 strings and Unicode narrow down the possible return types.
 The corresponding .bytes methods should only allow bytes and
 Unicode.
 I forgot that: what is the rationale for that restriction?

 To assure that only those types can be returned from those
 methods, ie. instances of basestring, which in return permits
 type inference for those methods.
 
 Hmm. So it is for type inference.
 Where is that documented?

Somewhere in the python-dev mailing list archives ;-)

Seriously, we should probably add this to the documentation.

 This looks pretty inconsistent. Either codecs can give arbitrary
 return types, then .encode/.decode should also be allowed to
 give arbitrary return types, or codecs should be restricted.

No.

As I've said before: the .encode() and .decode() methods
are convenience methods to interface to codecs which take
string/Unicode on input and create string/Unicode output.

 What's the point of first allowing a wide interface, and then
 narrowing it?

The codec interface is an abstract interface. It is flexible
enough to allow codecs to define possible input and output
types while being strict about the method names and signatures.

Much like the file interface in Python, the copy protocol
or the pickle interface.

 Also, if type inference is the goal, what is the point in allowing
 two result types?

I'm not sure I understand the question: type inference is about
being able to infer the types of (among other things) function
return objects. This is what the restriction guarantees - much
like int() guarantees that you get either an integer or a long.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Josiah Carlson

Ron Adam [EMAIL PROTECTED] wrote:
 Josiah Carlson wrote:
[snip]
  Again, the problem is ambiguity; what does bytes.recode(something) mean?
  Are we encoding _to_ something, or are we decoding _from_ something? 
 
 This was just an example of one way that might work, but here are my 
 thoughts on why I think it might be good.
 
 In this case, the ambiguity is reduced as far as the encoding and 
 decoding operations are concerned.
 
   somestring = encodings.tostr( someunicodestr, 'latin-1')
 
 It's pretty clear to me what is happening.
 
  It will encode the object named someunicodestr to a string, using 
 the 'latin-1' encoder.

But now how do you get it back?  encodings.tounicode(..., 'latin-1')?,
unicode(..., 'latin-1')?

What about string transformations:
somestring = encodings.tostr(somestr, 'base64')

How do we get that back?  encodings.tostr() again is completely
ambiguous, str(somestring, 'base64') seems a bit awkward (switching
namespaces)?


 And it would also result in clear errors if the specified encoding is 
 unavailable, or, if it is available, not compatible with the given 
 *someunicodestr* obj type.
 
 Further hints could be gained by.
 
  help(encodings.tostr)
 
 Which could result in... something like...
  
  encodings.tostr( string|unicode, encoder ) -> string

  Encode a unicode string to a non-unicode string using an
  encoder codec, or transform a non-unicode string into
  another non-unicode string using an encoder codec.
 
 And if that's not enough, then help(encodings) could give more clues. 
 These steps would be what I would do. And then the next thing would be 
 to find the python docs entry on encodings.
 
 Placing them in encodings seems like a fairly good place to look for 
 these functions if you are working with encodings.  So I find that just 
 as convenient as having them be string methods.
 
 There is no intermediate default encoding involved above, (the bytes 
 object is used instead), so you wouldn't get some of the messages the 
 present system results in when ascii is the default.
 
 (Yes, I know it won't when P3K is here also)
 
  Are we going to need to embed the direction in the encoding/decoding
   name (to_base64, from_base64, etc.)?  That isn't any better than
   binascii.b2a_base64.  
 
 No, that's why I suggested two separate lists (or dictionaries might be 
 better).  They can contain the same names, but the lists they are in 
 determine the context and point to the needed codec.  And that step is 
 abstracted out by putting it inside the encodings.tostr() and 
 encodings.tounicode() functions.
 
 So either function would call 'base64' from the correct codec list and 
 get the correct encoding or decoding codec it needs.

Either the API you have described is incomplete, you haven't noticed the
directional ambiguity you are describing, or I have completely lost it.


  What about .reencode and .redecode?  It seems as
  though the 're' added as a prefix to .encode and .decode makes it
  clearer that you get the same type back as you put in, and it is also
  unambiguous as to direction.
 
  But then wouldn't we end up with a multitude of ways to do things?
 
  s.encode(codec) == s.redecode(codec)
  s.decode(codec) == s.reencode(codec)
  unicode(s, codec) == s.decode(codec)
  str(u, codec) == u.encode(codec)
  str(s, codec) == s.encode(codec)
  unicode(s, codec) == s.reencode(codec)
  str(u, codec) == s.redecode(codec)
  str(s, codec) == s.redecode(codec)
 
 Umm .. did I miss any?  Which ones would you remove?
 
 Which ones of those will succeed with which codecs?

I must not be expressing myself very well.

Right now:
s.encode() -> s
s.decode() -> s, u
u.encode() -> s, u
u.decode() -> u

Martin et al's desired change to encode/decode:
s.decode() -> u
u.encode() -> s

No others.

What my thoughts on .reencode() and .redecode() would get you given
Martin et al's desired change:
s.reencode() -> s (you get encoded strings as strings)
s.redecode() -> s (you get decoded strings as strings)
u.reencode() -> u (you get encoded unicode as unicode)
u.redecode() -> u (you get decoded unicode as unicode)

If one wants to go from unicode to string, one uses .encode(). If one
wants to go from string to unicode, one uses .decode().  If one wants to
keep their type unchanged, but encode or decode the data/text, one would
use .reencode() and .redecode(), depending on whether their source is an
encoded block of data, or the original data they want to encode.

The other bonus is that if given .reencode() and .redecode(), one can
quite easily verify that the source is possible as a source, and that
you would get back the proper type.  How this would occur behind the
scenes is beyond the scope of this discussion, but it seems to me to be
easy, given what I've read about the current mechanism.
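To make the proposal concrete, here is a minimal sketch of what .reencode()/.redecode() could behave like, written as free functions over bytes using the stdlib codecs module.  The names and same-type-back semantics are this thread's proposal, not an existing API:

```python
import codecs

def reencode(data: bytes, codec: str) -> bytes:
    """Encode with a bytes-to-bytes codec; bytes in, bytes out."""
    return codecs.encode(data, codec)

def redecode(data: bytes, codec: str) -> bytes:
    """Decode with a bytes-to-bytes codec; bytes in, bytes out."""
    return codecs.decode(data, codec)

# The type never changes, so round-tripping is direction-unambiguous.
assert redecode(reencode(b'payload', 'base64'), 'base64') == b'payload'
```

Verifying that the source is acceptable then reduces to a type check plus whatever validation the named codec itself performs.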

Whether the constructors for the str and unicode do their own codec
transformations is beside the 

Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Ron Adam
Aahz wrote:
 On Sat, Feb 18, 2006, Ron Adam wrote:
 I like the bytes.recode() idea a lot. +1

 It seems to me it's a far more useful idea than encoding and decoding by 
 overloading and could do both and more.  It has a lot of potential to be 
 an intermediate step for encoding as well as being used for many other 
 translations to byte data.

 I think I would prefer that encode and decode be just functions with 
 well defined names and arguments instead of being methods or arguments 
 to string and Unicode types.

 I'm not sure exactly how this would work. Maybe it would need two 
 sets of encodings, i.e. decoders and encoders.  An exception would be
 given if it wasn't found for the direction one was going in.
 
 Here's an idea I don't think I've seen before:
 
 bytes.recode(b, src_encoding, dest_encoding)
 
 This requires the user to state up-front what the source encoding is.
 One of the big problems that I see with the whole encoding mess is that
 so much of it contains implicit assumptions about the source encoding;
 this gets away from that.

Yes, but it's not just the encodings that are implicit, it is also the 
types.

s.encode(codec)  # explicit source type, ? dest type
s.decode(codec)  # explicit source type, ? dest type

encodings.tostr(obj, codec) # implicit *known* source type
# explicit dest type

encodings.tounicode(obj, codec) # implicit *known* source type
# explicit dest type

In this case the source is implicit, but there can be a well defined 
check to validate the source type against the codec being used.  It's my 
feeling the user *knows* what he already has, and so it's more important 
that the resulting object type is explicit.

In your suggestion...

bytes.recode(b, src_encoding, dest_encoding)

Here the encodings are both explicit, but the source and destination 
types of the bytes are not.  Since it's working on bytes, they could 
have come from anywhere, and after the translation they would be cast 
to the type the user *thinks* it should result in.  A source of errors 
that would likely pass silently.

The way I see it is the bytes type should be a lower level object that 
doesn't care what byte transformation it does. Ie.. they are all one way 
byte to byte transformations determined by context.  And it should have 
the capability to read from and write to types without translating in 
the same step.  Keep it simple.
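A minimal sketch of such a context-driven, byte-to-byte recode(), assuming a simple name-to-transform registry (the registry and the free function are hypothetical; binascii provides the actual transforms):

```python
import binascii

# Hypothetical registry mapping transform names to one-way byte-to-byte
# functions; the direction is part of the name, so recode() itself does
# not care which way the transformation goes.
_transforms = {
    'to_base64': binascii.b2a_base64,
    'from_base64': binascii.a2b_base64,
    'to_hex': binascii.b2a_hex,
    'from_hex': binascii.a2b_hex,
}

def recode(data, transform):
    """Apply a one-way byte-to-byte transformation chosen by name."""
    return _transforms[transform](data)

# Round-trip: hex out, hex back in.
assert recode(recode(b'abc', 'to_hex'), 'from_hex') == b'abc'
```

Higher-level encoding functions would then pick the right name from context, as the tostr()/tounicode() pair described earlier in the thread does.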

Then it could be used as a lower level byte translator to implement 
encodings and other translations in encoding methods or functions 
instead of trying to make it replace the higher level functionality.

Cheers,
Ron

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-18 Thread Thomas Wouters
On Sat, Feb 18, 2006 at 01:21:18PM +0100, M.-A. Lemburg wrote:

 It's by no means a Perl attitude.

In your eyes, perhaps. It certainly feels that way to me (or I wouldn't have
said it :). Perl happens to be full of general constructs that were added
because they were easy to add, or they were useful in edgecases. The
encode/decode methods remind me of that, even though I fully understand the
reasoning behind it, and the elegance of the implementation.

 The main reason is symmetry and the fact that strings and Unicode
 should be as similar as possible in order to simplify the task of
 moving from one to the other.

Yes, and this is a design choice I don't agree with. They're different
types. They do everything similarly, except when they are mixed together
(unicode takes precedence, in general, encoding the bytestring from the
default encoding.) Going from one to the other isn't symmetric, though. I
understand that you disagree; the disagreement is on the fundamental choice
of allowing 'encode' and 'decode' to do *more* than going from and to
unicode. I regret that decision, not the decision to make encode and decode
symmetric (which makes sense, after the decision to overgeneralize
encode/decode is made.)

   - The return value for the non-unicode encodings depends on the value of
 the encoding argument.

 Not really: you'll always get a basestring instance.

Which is not a particularly useful distinction, since in any real world
application, you have to be careful not to mix unicode with (non-ascii)
bytestrings. The only way to reliably deal with unicode is to have it
well-contained (when migrating an application from using bytestrings to
using unicode) or to use unicode everywhere, decoding/encoding at
entrypoints. Containment is hard to achieve.

 Still, I believe that this is an educational problem. There are
 a couple of gotchas users will have to be aware of (and this is
 unrelated to the methods in question):
 
 * encoding always refers to transforming original data into
   a derived form
 
 * decoding always refers to transforming a derived form of
   data back into its original form
 
 * for Unicode codecs the original form is Unicode, the derived
   form is, in most cases, a string
 
 As a result, if you want to use a Unicode codec such as utf-8,
 you encode Unicode into a utf-8 string and decode a utf-8 string
 into Unicode.
 
 Encoding a string is only possible if the string itself is
 original data, e.g. some data that is supposed to be transformed
 into a base64 encoded form.
 
 Decoding Unicode is only possible if the Unicode string itself
 represents a derived form, e.g. a sequence of hex literals.

Most of these gotchas would not have been gotchas had encode/decode only
been usable for unicode encodings.
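The rules Lemburg lists can be illustrated with a short sketch (using the modern codecs module for the bytes-to-bytes case; the data values are just examples):

```python
import codecs

# Unicode codec: the original form is text, the derived form is bytes.
raw = 'h\xe9llo'.encode('utf-8')          # encode: original -> derived
assert raw.decode('utf-8') == 'h\xe9llo'  # decode: derived -> original

# Data codec such as base64: the original form is raw bytes, the
# derived form is the base64 representation.
derived = codecs.encode(b'raw data', 'base64')
assert codecs.decode(derived, 'base64') == b'raw data'
```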

  That is why I disagree with the hypergeneralization of the encode/decode
  methods
[..]
 That's because you only look at one specific task.

 Codecs also unify the various interfaces to common encodings
 such as base64, uu or zip which are not Unicode related.

No, I think you misunderstand. I object to the hypergeneralization of the
*encode/decode methods*, not the codec system. I would have been fine with
another set of methods for non-unicode transformations. Although I would
have been even more fine if they got their encoding not as a string, but as,
say, a module object, or something imported from a module.

Not that I think any of this matters; we have what we have and I'll have to
live with it ;)

-- 
Thomas Wouters [EMAIL PROTECTED]

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!


Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Terry Reedy

Josiah Carlson [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]

 Again, the problem is ambiguity; what does bytes.recode(something) mean?
 Are we encoding _to_ something, or are we decoding _from_ something?
 Are we going to need to embed the direction in the encoding/decoding
 name (to_base64, from_base64, etc.)?

To me, that seems simple and clear.  b.recode('from_base64') obviously 
requires that b meet the restrictions of base64.  Similarly for 'from_hex'.

 That doesn't seem any better than binascii.b2a_base64

I think 'from_base64' is *much* better.  I think there are now 4 
string-to-string transform modules that do similar things.  Not optimal to 
me.

 What about .reencode and .redecode?  It seems as
 though the 're' added as a prefix to .encode and .decode makes it
 clearer that you get the same type back as you put in, and it is also
 unambiguous to direction.

To me, the 're' prefix is awkward, confusing, and misleading.

Terry J. Reedy





Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Ron Adam

Josiah Carlson wrote:
 Ron Adam [EMAIL PROTECTED] wrote:
 Josiah Carlson wrote:
 [snip]
 Again, the problem is ambiguity; what does bytes.recode(something) mean?
 Are we encoding _to_ something, or are we decoding _from_ something? 
 This was just an example of one way that might work, but here are my 
 thoughts on why I think it might be good.

 In this case, the ambiguity is reduced as far as the encoding and 
  decoding operations are concerned.

   somestring = encodings.tostr( someunicodestr, 'latin-1')

 It's pretty clear what is happening to me.

   It will encode the object named someunicodestr to a string, using 
  the 'latin-1' encoder.
 
 But now how do you get it back?  encodings.tounicode(..., 'latin-1')?,
 unicode(..., 'latin-1')?

Yes, just do:

  someunicodestr = encoding.tounicode( somestring, 'latin-1')



 What about string transformations:
 somestring = encodings.tostr(somestr, 'base64')
 
 How do we get that back?  encodings.tostr() again is completely
 ambiguous, str(somestring, 'base64') seems a bit awkward (switching
 namespaces)?

In the case where a string is converted to another string, it would 
probably be best to have a requirement that they all get converted to 
unicode as an intermediate step.  By doing that it becomes an explicit 
two-step operation.

 # string to string encoding
 u_string = encodings.tounicode(s_string, 'base64')
 s2_string = encodings.tostr(u_string, 'base64')

Or you could have a convenience function to do it in the encodings 
module also.

def strtostr(s, sourcecodec, destcodec):
u = tounicode(s, sourcecodec)
return tostr(u, destcodec)

Then...

s2 = encodings.strtostr(s, 'base64', 'base64')

Which would be kind of pointless in this example, but it would be a good 
way to test a codec.

assert s == s2
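Ron's strtostr() convenience can be sketched runnably with the stdlib codecs module standing in for the proposed tounicode()/tostr() pair (the function name is his; the codecs calls are a substitution for illustration):

```python
import codecs

def strtostr(s, sourcecodec, destcodec):
    # Decode from the source transform, then re-encode with the dest one.
    intermediate = codecs.decode(s, sourcecodec)
    return codecs.encode(intermediate, destcodec)

# Using the same codec both ways is exactly the round-trip test of that
# codec which the email describes.
enc = codecs.encode(b'hello world', 'base64')
assert strtostr(enc, 'base64', 'base64') == enc
```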


 Are we going to need to embed the direction in the encoding/decoding
  name (to_base64, from_base64, etc.)?  That doesn't seem any better than
 binascii.b2a_base64 .  
 No, that's why I suggested two separate lists (or dictionaries might be 
 better).  They can contain the same names, but the lists they are in 
 determine the context and point to the needed codec.  And that step is 
 abstracted out by putting it inside the encodings.tostr() and 
 encodings.tounicode() functions.

 So either function would call 'base64' from the correct codec list and 
 get the correct encoding or decoding codec it needs.
 
 Either the API you have described is incomplete, you haven't noticed the
 directional ambiguity you are describing, or I have completely lost it.

Most likely I gave an incomplete description of the API in this case 
because there are probably several ways to implement it.



 What about .reencode and .redecode?  It seems as
 though the 're' added as a prefix to .encode and .decode makes it
 clearer that you get the same type back as you put in, and it is also
 unambiguous to direction.

...

  I must not be expressing myself very well.
 
 Right now:
 s.encode() -> s
 s.decode() -> s, u
 u.encode() -> s, u
 u.decode() -> u
 
 Martin et al's desired change to encode/decode:
 s.decode() -> u
 u.encode() -> s
 
  No others.

Which would be similar to the functions I suggested.  The main 
difference is only whether it would be better to have them as methods or 
separate factory functions and the spelling of the names.  Both have 
their advantages I think.


 The method bytes.recode(), always does a byte transformation which can 
 be almost anything.  It's the context bytes.recode() is used in that 
 determines what's happening.  In the above cases, it's using an encoding 
 transformation, so what it's doing is precisely what you would expect by 
its context.
 
 Indeed, there is a translation going on, but it is not clear as to
 whether you are encoding _to_ something or _from_ something.  What does
 s.recode('base64') mean?  Are you encoding _to_ base64 or _from_ base64? 
 That's where the ambiguity lies.

Bengt didn't propose adding .recode() to the string types, but only the 
bytes type.  The byte type would recode the bytes using a specific 
transformation.  I believe his view is it's a lower level API than 
strings that can be used to implement the higher level encoding API 
with, not replace the encoding API.  Or that is the way I interpreted 
the suggestion.


 There isn't a bytes.decode(), since that's just another transformation. 
 So only the one method is needed.  Which makes it easer to learn.
 
 But ambiguous.

What's ambiguous about it?  It's no more ambiguous than any math 
operation where you can do it one way with one operation and get your 
original value back with the same operation by using an inverse value.

n2 = n + 1; n3 = n2 + (-1); n == n3
n2 = n * 2; n3 = n2 * (.5); n == n3


 Learning how the current system works comes awfully close to reverse 
 engineering.  Maybe I'm overstating it a bit, but I suspect many end up 
 doing exactly that in order to learn how Python does 

Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Josiah Carlson

Ron Adam [EMAIL PROTECTED] wrote:
 Josiah Carlson wrote:
  Ron Adam [EMAIL PROTECTED] wrote:
  Josiah Carlson wrote:
  [snip]
  Again, the problem is ambiguity; what does bytes.recode(something) mean?
  Are we encoding _to_ something, or are we decoding _from_ something? 
  This was just an example of one way that might work, but here are my 
  thoughts on why I think it might be good.
 
  In this case, the ambiguity is reduced as far as the encoding and 
   decoding operations are concerned.
 
somestring = encodings.tostr( someunicodestr, 'latin-1')
 
  It's pretty clear what is happening to me.
 
    It will encode the object named someunicodestr to a string, using 
   the 'latin-1' encoder.
  
  But now how do you get it back?  encodings.tounicode(..., 'latin-1')?,
  unicode(..., 'latin-1')?
 
  Yes, just do:
 
   someunicodestr = encoding.tounicode( somestring, 'latin-1')
 
  What about string transformations:
  somestring = encodings.tostr(somestr, 'base64')
  
  How do we get that back?  encodings.tostr() again is completely
  ambiguous, str(somestring, 'base64') seems a bit awkward (switching
  namespaces)?
 
 In the case where a string is converted to another string, it would 
 probably be best to have a requirement that they all get converted to 
 unicode as an intermediate step.  By doing that it becomes an explicit 
 two-step operation.
 
  # string to string encoding
  u_string = encodings.tounicode(s_string, 'base64')
  s2_string = encodings.tostr(u_string, 'base64')

Except that makes it even more ambiguous.

Is encodings.tounicode() encoding, or decoding?  According to everything
you have said so far, it would be decoding.  But if I am decoding binary
data, why should it be spending any time as a unicode string?  What do I
mean?

x = f.read() #x contains base-64 encoded binary data
y = encodings.to_unicode(x, 'base64')

y now contains BINARY DATA, except that it is a unicode string

z = encodings.to_str(y, 'latin-1')

Later you define a str_to_str function, which I (or someone else) would
use like:

z = str_to_str(x, 'base64', 'latin-1')

But the trick is that I don't want some unicode string encoded in
latin-1, I want my binary data unencoded.  They may happen to be the
same in this particular example, but that doesn't mean that it makes any
sense to the user.
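Josiah's point is that decoded binary data should never need to pass through unicode at all; a bytes-to-bytes API hands it back directly (a sketch with the base64 module; the payload is arbitrary):

```python
import base64

x = base64.b64encode(b'\x00\x01\xffbinary')  # base-64 data, as read from a file
data = base64.b64decode(x)                   # the unencoded bytes -- not text
assert data == b'\x00\x01\xffbinary'
```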

[...]

  What about .reencode and .redecode?  It seems as
  though the 're' added as a prefix to .encode and .decode makes it
  clearer that you get the same type back as you put in, and it is also
  unambiguous to direction.
 
 ...
 
   I must not be expressing myself very well.
  
  Right now:
  s.encode() -> s
  s.decode() -> s, u
  u.encode() -> s, u
  u.decode() -> u
  
  Martin et al's desired change to encode/decode:
  s.decode() -> u
  u.encode() -> s
  
   No others.
 
 Which would be similar to the functions I suggested.  The main 
  difference is only whether it would be better to have them as methods or 
 separate factory functions and the spelling of the names.  Both have 
 their advantages I think.

While others would disagree, I personally am not a fan of to* or from*
style namings, for either function names (especially in the encodings
module) or methods.  Just a personal preference.

Of course, I don't find the current situation regarding
str/unicode.encode/decode to be confusing either, but maybe it's because
my unicode experience is strictly within the realm of GUI widgets, where
compartmentalization can be easier.


  The method bytes.recode(), always does a byte transformation which can 
  be almost anything.  It's the context bytes.recode() is used in that 
  determines what's happening.  In the above cases, it's using an encoding 
  transformation, so what it's doing is precisely what you would expect by 
   its context.

[THIS IS THE AMBIGUITY]
  Indeed, there is a translation going on, but it is not clear as to
  whether you are encoding _to_ something or _from_ something.  What does
  s.recode('base64') mean?  Are you encoding _to_ base64 or _from_ base64? 
  That's where the ambiguity lies.
 
 Bengt didn't propose adding .recode() to the string types, but only the 
 bytes type.  The byte type would recode the bytes using a specific 
 transformation.  I believe his view is it's a lower level API than 
 strings that can be used to implement the higher level encoding API 
   with, not replace the encoding API.  Or that is the way I interpreted 
 the suggestion.

But again, what would the transformation be?  To something?  From
something?  'to_base64', 'from_base64', 'to_rot13' (which happens to be
identical to) 'from_rot13', ...  Saying "it would recode ... using a
specific transformation" is a cop-out; what would the translation be? 
How would it work?  How would it be spelled?

That smells quite a bit like .encode() and .decode(), just spelled
differently, and without quite a clear path.  That is why I was offering...

   s.reencode() -> s 

Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Ron Adam
Josiah Carlson wrote:
 Ron Adam [EMAIL PROTECTED] wrote:


 Except that makes it even more ambiguous.

 Is encodings.tounicode() encoding, or decoding?  According to everything
 you have said so far, it would be decoding.  But if I am decoding binary
 data, why should it be spending any time as a unicode string?  What do I
 mean?

Encoding and decoding are relative concepts.  It's all encoding from one
thing to another.  Whether it's decoding or encoding depends on the
relationship of the current encoding to a standard encoding.

The confusion introduced by decode is when the 'default_encoding'
changes, will change, or is unknown.


 x = f.read() #x contains base-64 encoded binary data
 y = encodings.to_unicode(x, 'base64')
 
 y now contains BINARY DATA, except that it is a unicode string

No, that wasn't what I was describing.  You get a Unicode string object
as the result, not a bytes object with binary data.  See the toy example
at the bottom.


 z = encodings.to_str(y, 'latin-1')
 
 Later you define a str_to_str function, which I (or someone else) would
 use like:
 
 z = str_to_str(x, 'base64', 'latin-1')
 
 But the trick is that I don't want some unicode string encoded in
 latin-1, I want my binary data unencoded.  They may happen to be the
 same in this particular example, but that doesn't mean that it makes any
 sense to the user.

If you want bytes then you would use the bytes() type to get bytes
directly.  Not encode or decode.

 binary_unicode = bytes(unicode_string)

The exact byte order and representation would need to be decided by the
Python developers in this case.  The internal representation,
'unicode-internal', is UCS-2, I believe.



 It's no more ambiguous than any math 
 operation where you can do it one way with one operations and get your 
 original value back with the same operation by using an inverse value.

 n2 = n + 1; n3 = n2 + (-1); n == n3
 n2 = n * 2; n3 = n2 * (.5); n == n3
 
 Ahh, so you are saying 'to_base64' and 'from_base64'.  There is one
 major reason why I don't like that kind of a system: I can't just say
 encoding='base64' and use str.encode(encoding) and str.decode(encoding),
 I necessarily have to use, str.recode('to_'+encoding) and
 str.recode('from_'+encoding) .  Seems a bit awkward.

Yes, but the encodings API could abstract out the 'to_base64' and
'from_base64' so you can just say 'base64' and have it work either way.

Maybe a toy incomplete example might help.



# in module bytes.py or someplace else.
class bytes(list):
    """
    bytes methods defined here
    """



# in module encodings.py

# using a dict of lists, but other solutions would
# work just as well.
unicode_codecs = {
   'base64': ('from_base64', 'to_base64'),
   }

def tounicode(obj, from_codec):
b = bytes(obj)
b = b.recode(unicode_codecs[from_codec][0])
return unicode(b)

def tostr(obj, to_codec):
b = bytes(obj)
b = b.recode(unicode_codecs[to_codec][1])
return str(b)



# in your application

import encodings

... a bunch of code ...

u = encodings.tounicode(s, 'base64')

# or if going the other way

s = encodings.tostr(u, 'base64')



Does this help?  Is the relationship between the bytes object and the
encodings API clearer here?  If not maybe we should discuss it further
off line.


Cheers,
Ronald Adam




Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-17 Thread Martin v. Löwis
Josiah Carlson wrote:
 I would agree that zip is questionable, but 'uu', 'rot13', perhaps 'hex',
 and likely a few others that the two of you may be arguing against
 should stay as encodings, because strictly speaking, they are defined as
 encodings of data.  They may not be encodings of _unicode_ data, but
 that doesn't mean that they aren't useful encodings for other kinds of
 data, some text, some binary, ...

To support them, the bytes type would have to gain a .encode method,
and I'm -1 on supporting bytes.encode, or string.decode.

Why is

s.encode("uu")

any better than

binascii.b2a_uu(s)

Regards,
Martin


Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-17 Thread Jason Orendorff
On 2/15/06, Guido van Rossum [EMAIL PROTECTED] wrote:
 Actually users trying to figure out Unicode would probably be better
 served if bytes.encode() and text.decode() did not exist. [...] It
 would be better if the signature of text.encode() always returned a
 bytes object.  But why deny the bytes object a decode() method if text
 objects have an encode() method?

I agree, text.encode() and bytes.decode() are both swell.  It's the
other two that bother me.

 I'd say there are two symmetric API flavors possible (t and b are
 text and bytes objects, respectively, where text is a string type,
 either str or unicode; enc is an encoding name):
 - b.decode(enc) -> t; t.encode(enc) -> b
 - b = bytes(t, enc); t = text(b, enc)

 I'm not sure why one flavor would be preferred over the other,
 although having both would probably be a mistake.

I prefer constructor flavor; the word bytes feels more concrete than
encode.  But I worry about constructors being too overloaded.

 text(b, enc) # decode
 text(mydict) # repr
 text(b) # uh... decode with default encoding?

-j



Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-17 Thread Bob Ippolito

On Feb 16, 2006, at 9:20 PM, Josiah Carlson wrote:


 Greg Ewing [EMAIL PROTECTED] wrote:

 Josiah Carlson wrote:

 They may not be encodings of _unicode_ data,

 But if they're not encodings of unicode data, what
 business do they have being available through
 someunicodestring.encode(...)?

 I had always presumed that bytes objects are going to be able to be a
 source for encode AND decode, like current non-unicode strings are able
 to be today.  In that sense, if I have a bytes object which is an
 encoding of rot13, hex, uu, etc., or I have a bytes object which I
 would like to be in one of those encodings, I should be able to do
 b.encode(...) or b.decode(...), given that 'b' is a bytes object.

 Are 'encodings' going to become a mechanism to encode and decode
 _unicode_ strings, rather than a mechanism to encode and decode _text
 and data_ strings?  That would seem like a backwards step to me, as the
 email package would need to package their own base-64 encode/decode API
 and implementation, and similarly for any other package which uses any
 one of the encodings already available.

It would be VERY useful to separate the two concepts.  bytes<->bytes 
transforms should be one function pair, and bytes<->text transforms 
should be another.  The current situation is totally insane:

str.decode(codec) -> str or unicode or UnicodeDecodeError or 
ZlibError or TypeError... who knows what else
str.encode(codec) -> str or unicode or UnicodeDecodeError or 
TypeError... probably other exceptions

Granted, unicode.encode(codec) and unicode.decode(codec) are actually  
somewhat sane in that the return type is always a str and the  
exceptions are either UnicodeEncodeError or UnicodeDecodeError.

I think that rot13 is the only conceptually text<->text transform 
(though the current implementation is really bytes<->bytes), 
everything else is either bytes<->text or bytes<->bytes.
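Bob's separation can be sketched as two distinct pairs with predictable return types, using today's stdlib for illustration:

```python
import codecs

# bytes<->bytes pair: data transforms, bytes in and bytes out.
packed = codecs.encode(b'data', 'zlib')
assert codecs.decode(packed, 'zlib') == b'data'

# bytes<->text pair: character encodings, with fixed directions.
text = b'caf\xc3\xa9'.decode('utf-8')          # bytes -> text
assert text.encode('utf-8') == b'caf\xc3\xa9'  # text -> bytes
```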

-bob



Re: [Python-Dev] bytes.from_hex()

2006-02-17 Thread Stephen J. Turnbull
 Guido == Guido van Rossum [EMAIL PROTECTED] writes:

Guido I'd say there are two symmetric API flavors possible (t
Guido and b are text and bytes objects, respectively, where text
Guido is a string type, either str or unicode; enc is an encoding
Guido name):

Guido - b.decode(enc) - t; t.encode(enc) - b

-0  When taking a binary file and attaching it to the text of a mail
message using BASE64, the tendency to say you're encoding the file in
BASE64 is very strong.  I just don't see how such usages can be
avoided in discussion, which makes the types of decode and encode hard
to remember, and easy to mistake in some contexts.

Guido - b = bytes(t, enc); t = text(b, enc)

+1  The coding conversion operation has always felt like a constructor
to me, and in this particular usage that's exactly what it is.  I
prefer the nomenclature to reflect that.


-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of TsukubaTennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.


Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-17 Thread M.-A. Lemburg
Martin v. Löwis wrote:
 Josiah Carlson wrote:
 I would agree that zip is questionable, but 'uu', 'rot13', perhaps 'hex',
 and likely a few others that the two of you may be arguing against
 should stay as encodings, because strictly speaking, they are defined as
 encodings of data.  They may not be encodings of _unicode_ data, but
 that doesn't mean that they aren't useful encodings for other kinds of
 data, some text, some binary, ...
 
 To support them, the bytes type would have to gain a .encode method,
 and I'm -1 on supporting bytes.encode, or string.decode.
 
 Why is
 
 s.encode("uu")
 
 any better than
 
 binascii.b2a_uu(s)

The .encode() and .decode() methods are merely convenience
interfaces to the registered codecs (with some extra logic to
make sure that only a pre-defined set of return types are allowed).
It's up to the user to use them for e.g. UU-encoding or not.

The reason we have codecs for UU, zip and the others is that
you can use their StreamWriters/Readers in stackable streams.

Just because some codecs don't fit into the string.decode()
or bytes.encode() scenario doesn't mean that these codecs are
useless or that the methods should be banned.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Feb 17 2006)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 


Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-17 Thread Bengt Richter
On Fri, 17 Feb 2006 00:33:49 +0100, "Martin v. Löwis" 
[EMAIL PROTECTED] wrote:

Josiah Carlson wrote:
 I would agree that zip is questionable, but 'uu', 'rot13', perhaps 'hex',
 and likely a few others that the two of you may be arguing against
 should stay as encodings, because strictly speaking, they are defined as
 encodings of data.  They may not be encodings of _unicode_ data, but
 that doesn't mean that they aren't useful encodings for other kinds of
 data, some text, some binary, ...

To support them, the bytes type would have to gain a .encode method,
and I'm -1 on supporting bytes.encode, or string.decode.

Why is

s.encode("uu")

any better than

binascii.b2a_uu(s)

One aspect is that dotted notation method calling is serially composable,
whereas function calls nest, and you have to find and read from the innermost,
which gets hard quickly unless you use multiline formatting. But even then
you can't read top to bottom as processing order.

If we had a general serial composition syntax for function calls
something like unix piping (which is a big part of the power of unix shells IMO)
we could make the choice of appropriate composition semantics better.

Decorators already compose functions in a limited way, but processing
order would read like forth horizontally.  Maybe '->' ?  How about

foo(x, y) -> bar() -> baz(z)

as sugar for

baz.__get__(bar.__get__(foo(x, y))())(z)

? (Hope I got that right ;-)

I.e., you'd have self-like args to receive results from upstream. E.g.,

  >>> def foo(x, y): return 'foo(%s, %s)'%(x,y)
 ...
  >>> def bar(stream): return 'bar(%s)'%stream
 ...
  >>> def baz(stream, z): return 'baz(%s, %s)'%(stream,z)
 ...
  >>> x = 'ex'; y='wye'; z='zed'
  >>> baz.__get__(bar.__get__(foo(x, y))())(z)
 'baz(bar(foo(ex, wye)), zed)'
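A plain helper function can get similar left-to-right readability without new syntax; a minimal sketch (the `pipe` helper and its tuple convention are invented here for illustration, not part of any proposal):

```python
def pipe(value, *steps):
    # Apply each step left to right.  A step is either a callable
    # taking the upstream result, or a (callable, extra_args...)
    # tuple whose callable takes the upstream result as its first
    # argument -- mirroring the self-like args described above.
    for step in steps:
        if isinstance(step, tuple):
            func, *args = step
            value = func(value, *args)
        else:
            value = step(value)
    return value

def foo(x, y): return 'foo(%s, %s)' % (x, y)
def bar(stream): return 'bar(%s)' % stream
def baz(stream, z): return 'baz(%s, %s)' % (stream, z)

result = pipe(foo('ex', 'wye'), bar, (baz, 'zed'))
# result == 'baz(bar(foo(ex, wye)), zed)'
```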

Regards,
Bengt Richter



Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-17 Thread Martin v. Löwis
M.-A. Lemburg wrote:
 Just because some codecs don't fit into the string.decode()
 or bytes.encode() scenario doesn't mean that these codecs are
 useless or that the methods should be banned.

No. The reason to ban string.decode and bytes.encode is that
it confuses users.

Regards,
Martin


Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-17 Thread Josiah Carlson

Martin v. Löwis [EMAIL PROTECTED] wrote:
 M.-A. Lemburg wrote:
  Just because some codecs don't fit into the string.decode()
  or bytes.encode() scenario doesn't mean that these codecs are
  useless or that the methods should be banned.
 
 No. The reason to ban string.decode and bytes.encode is that
 it confuses users.

How are users confused?  bytes.encode CAN only produce bytes.  Though
string.decode (or bytes.decode) MAY produce strings (or bytes) or
unicode, depending on the codec, I think it is quite reasonable to
expect that users will understand that string.decode('utf-8') is
different than string.decode('base-64'), and that they may produce
different output.  In a similar fashion, dict.get(1) may produce
different results than dict.get(2) for some dictionaries.  If some users
can't understand this (passing different arguments to a function may
produce different output), then I think that some users are broken
beyond repair.
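For later readers: this value-dependent return type is easy to see with the `codecs` module functions that eventually became the Python 3 spelling of these operations (codec aliases 'ascii' and 'base64' assumed available, as in CPython):

```python
import codecs

# Same function, same argument types -- the *value* of the codec
# name decides whether the result is str or bytes.
text = codecs.decode(b'abc', 'ascii')    # bytes -> str
raw = codecs.decode(b'YWJj\n', 'base64') # bytes -> bytes
```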

 - Josiah



Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-17 Thread Bengt Richter
On Fri, 17 Feb 2006 21:35:25 +0100, "Martin v. Löwis"
[EMAIL PROTECTED] wrote:

M.-A. Lemburg wrote:
 Just because some codecs don't fit into the string.decode()
 or bytes.encode() scenario doesn't mean that these codecs are
 useless or that the methods should be banned.

No. The reason to ban string.decode and bytes.encode is that
it confuses users.
Well, that's because of semantic overloading. Assuming you mean
string as characters and bytes as binary bytes.

The trouble is encoding and decoding have to have bytes to represent
the coded info, whichever direction. Characters per se aren't coded
info, so string.decode doesn't make sense without faking it with
string.encode().decode() and bytes.encode() likewise first has to
have a hidden .decode to become a string that makes sense to encode.
And the hidden stuff restricts to ascii, for further grief :-(

So yes, please ban string.decode and bytes.encode.

And maybe introduce bytes.recode for bytes-bytes transforms?
(strings don't have any codes to recode).
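A `bytes.recode` of this kind never landed; a rough sketch of the idea as a free function on top of the codec machinery (the name `recode` and the bytes-only checks are hypothetical, not an actual API):

```python
import codecs

def recode(data, codec):
    # Hypothetical bytes-to-bytes transform: reject non-bytes input
    # and reject codecs that do not return bytes.
    if not isinstance(data, bytes):
        raise TypeError('recode() requires bytes input')
    result = codecs.encode(data, codec)
    if not isinstance(result, bytes):
        raise TypeError('codec %r is not a bytes-to-bytes transform'
                        % codec)
    return result

hexed = recode(b'hi', 'hex')  # b'6869'
```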

Regards,
Bengt Richter



Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-17 Thread Martin v. Löwis
Josiah Carlson wrote:
 How are users confused?

Users do

py> "Martin v. Löwis".encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 11:
ordinal not in range(128)

because they want to convert the string to Unicode, and they have
found a text telling them that .encode("utf-8") is a reasonable
method.

What it *should* tell them is

py> "Martin v. Löwis".encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'str' object has no attribute 'encode'

 bytes.encode CAN only produce bytes.

I don't understand MAL's design, but I believe in that design,
bytes.encode could produce anything (say, a list). A codec
can convert anything to anything else.

 If some users
 can't understand this (passing different arguments to a function may
 produce different output),

It's worse than that. The return *type* depends on the *value* of
the argument. I think there is little precedent for that: normally,
the return values depend on the argument values, and, in a polymorphic
function, the return type might depend on the argument types (e.g.
the arithmetic operations). Also, the return type may depend on the
number of arguments (e.g. by requesting a return type in a keyword
argument).

 then I think that some users are broken beyond repair.

Hmm. I'm speechless.

Regards,
Martin


Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-17 Thread Ian Bicking
Martin v. Löwis wrote:
 Users do
 
 py> "Martin v. Löwis".encode("utf-8")
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 11:
 ordinal not in range(128)
 
 because they want to convert the string to Unicode, and they have
 found a text telling them that .encode("utf-8") is a reasonable
 method.
 
 What it *should* tell them is
 
 py> "Martin v. Löwis".encode("utf-8")
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 AttributeError: 'str' object has no attribute 'encode'

I think it would be even better if they got "ValueError: utf8 can only 
encode unicode objects".  AttributeError is not much clearer than the 
UnicodeDecodeError.

That str.encode(unicode_encoding) implicitly decodes strings seems like 
a flaw in the unicode encodings, quite separate from the existence of 
str.encode.  I for one really like s.encode('zlib').encode('base64') -- 
and if the zlib encoding raised an error when it was passed a unicode 
object (instead of implicitly encoding the string with the ascii 
encoding) that would be fine.

The pipe-like nature of .encode and .decode works very nicely for 
certain transformations, applicable to both unicode and byte objects. 
Let's not throw the baby out with the bath water.
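In Python 3, where `str.encode('zlib')` is gone, the same chained transform survives on bytes via `codecs.encode`/`codecs.decode` (the 'zlib' and 'base64' aliases are assumed available, as they are in CPython):

```python
import codecs

data = b'hello world' * 10

# Compress, then base64-armor -- the pipe-like chaining from the
# discussion, spelled with module functions instead of methods.
packed = codecs.encode(codecs.encode(data, 'zlib'), 'base64')

# Reverse the pipeline to recover the original bytes.
unpacked = codecs.decode(codecs.decode(packed, 'base64'), 'zlib')
```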


-- 
Ian Bicking  /  [EMAIL PROTECTED]  /  http://blog.ianbicking.org


Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-17 Thread Josiah Carlson

Martin v. Löwis [EMAIL PROTECTED] wrote:
 
 Josiah Carlson wrote:
  How are users confused?
 
 Users do
 
 py> "Martin v. Löwis".encode("utf-8")
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 11:
 ordinal not in range(128)
 
 because they want to convert the string to Unicode, and they have
 found a text telling them that .encode("utf-8") is a reasonable
 method.

Removing functionality because some users read bad instructions
somewhere is a bit like kicking your kitten because your puppy peed on
the floor.  You are punishing the wrong group, for something that
shouldn't result in punishment: it should result in education.

Users are always going to get bad instructions, and removing utility
because some users fail to think before they act, or complain when their
lack of thinking doesn't work, will give us a language where we are
removing features because *new* users have no idea what they are doing.


 What it *should* tell them is
 
 py> "Martin v. Löwis".encode("utf-8")
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 AttributeError: 'str' object has no attribute 'encode'

I disagree.  I think the original error was correct, and we should be
educating users to prefix their literals with a 'u' if they want unicode,
or they should get their data from a unicode source (wxPython with
unicode, StreamReader, etc.)


  bytes.encode CAN only produce bytes.
 
 I don't understand MAL's design, but I believe in that design,
 bytes.encode could produce anything (say, a list). A codec
 can convert anything to anything else.

That seems to me to be a little overkill...

In any case, I personally find data.encode('base-64') and
edata.decode('base-64') more convenient than binascii.b2a_base64
(data) and binascii.a2b_base64(edata).  Ditto for hexlify/unhexlify, etc.


  If some users
  can't understand this (passing different arguments to a function may
  produce different output),
 
 It's worse than that. The return *type* depends on the *value* of
 the argument. I think there is little precedent for that: normally,
 the return values depend on the argument values, and, in a polymorphic
 function, the return type might depend on the argument types (e.g.
 the arithmetic operations). Also, the return type may depend on the
 number of arguments (e.g. by requesting a return type in a keyword
 argument).

You only need to look to dictionaries where different values passed into
a function call may very well return results of different types, yet
there have been no restrictions on mapping to and from single types per
dictionary.

Many dict-like interfaces for configuration files do this, things like
config.get('remote_host') and config.get('autoconnect') not being
uncommon.
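A minimal illustration of the dict analogy: one accessor, and the result type depends entirely on the key's value (the config keys here are the hypothetical ones from the text):

```python
# A config mapping where values legitimately have different types.
config = {'remote_host': 'example.org', 'autoconnect': True, 'retries': 3}

host = config.get('remote_host')  # str
auto = config.get('autoconnect')  # bool
```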


 - Josiah



Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-17 Thread Martin v. Löwis
Ian Bicking wrote:
 That str.encode(unicode_encoding) implicitly decodes strings seems like
 a flaw in the unicode encodings, quite separate from the existence of
 str.encode.  I for one really like s.encode('zlib').encode('base64') --
 and if the zlib encoding raised an error when it was passed a unicode
 object (instead of implicitly encoding the string with the ascii
 encoding) that would be fine.
 
 The pipe-like nature of .encode and .decode works very nicely for
 certain transformations, applicable to both unicode and byte objects.
 Let's not throw the baby out with the bath water.

The way you use it, it's a matter of notation only: why
is

zlib(base64(s))

any worse? I think it's better: it doesn't use string literals to
denote function names.

If there is a point to this overgeneralized codec idea, it is
the streaming aspect: that you don't need to process all data
at once, but can feed data sequentially. Of course, you are
not using this in your example.
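The streaming aspect Martin refers to is what incremental codec objects provide; a sketch with the standard incremental UTF-8 decoder fed a multi-byte character split across two chunks:

```python
import codecs

decoder = codecs.getincrementaldecoder('utf-8')()

# 'Löwis' in UTF-8 is b'L\xc3\xb6wis'; split it mid-character.
# The decoder buffers the incomplete \xc3 until the next chunk.
part1 = decoder.decode(b'L\xc3')
part2 = decoder.decode(b'\xb6wis')
text = part1 + part2  # 'Löwis'
```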

Regards,
Martin


Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-17 Thread Ian Bicking
Josiah Carlson wrote:
If some users
can't understand this (passing different arguments to a function may
produce different output),

It's worse than that. The return *type* depends on the *value* of
the argument. I think there is little precedent for that: normally,
the return values depend on the argument values, and, in a polymorphic
function, the return type might depend on the argument types (e.g.
the arithmetic operations). Also, the return type may depend on the
number of arguments (e.g. by requesting a return type in a keyword
argument).
 
 
 You only need to look to dictionaries where different values passed into
 a function call may very well return results of different types, yet
 there have been no restrictions on mapping to and from single types per
 dictionary.
 
 Many dict-like interfaces for configuration files do this, things like
 config.get('remote_host') and config.get('autoconnect') not being
 uncommon.

I think there is *some* justification, if you don't understand up front 
that the codec you refer to (using a string) is just a way of avoiding 
an import (thankfully -- dynamically importing unicode codecs is 
obviously infeasible).  Now, if you understand the argument refers to 
some algorithm, it's not so bad.

The other aspect is that there should be something consistent about the 
return types -- the Python type is not what we generally rely on, 
though.  In this case they are all data.  Unicode and bytes are both 
data, and you could probably argue lists of ints is data too (but an 
arbitrary list definitely isn't data).  On the outer end of data might 
be an ElementTree structure (but that's getting fishy).  An open file 
object is not data.  A tuple probably isn't data.

-- 
Ian Bicking  /  [EMAIL PROTECTED]  /  http://blog.ianbicking.org

