Re: [Python-Dev] bytes.from_hex()
Greg Ewing wrote:
> Ron Adam wrote:
>> This uses syntax to determine the direction of encoding. It would be easier and clearer to just require two arguments or a tuple.
>>
>>     u = unicode(b, 'encode', 'base64')
>>     b = bytes(u, 'decode', 'base64')
>
> The point of the exercise was to avoid using the terms 'encode' and 'decode' entirely, since some people claim to be confused by them.

Yes, that was what I was trying for with the tounicode, tostring (tobytes) suggestion, but the direction could become ambiguous as you pointed out.

The constructors above have 4 data items implied:

1. The source object, which includes the source type and data
2. The codec to use
3. The direction of the operation
4. The destination type (determined by the constructor used)

There isn't any ambiguity other than when to use encode or decode, but in this case that really is a documentation problem, because there are no ambiguities in this form. Everything is explicit.

Another version of the above was pointed out to me off line that might be preferable:

    u = unicode(b, encode='base64')
    b = bytes(u, decode='base64')

Which would also work with the tostring and tounicode methods:

    u = b.tounicode(decode='base64')
    b = u.tobytes(incode='base64')

> If we're going to continue to use 'encode' and 'decode', why not just make them functions:
>
>     b = encode(u, 'utf-8')
>     u = decode(b, 'utf-8')

    >>> import codecs
    >>> codecs.decode('abc', 'ascii')
    u'abc'

There's that time machine again. ;-)

> In the case of Unicode encodings, if you get them backwards you'll get a type error. The advantage of using functions over methods or constructor arguments is that they can be applied uniformly to any input and output types.

If codecs are to be more general, then there may be times when the returned type needs to be specified. This would apply to codecs that could return either bytes or strings, or strings or unicode, or bytes or unicode. Some inputs may equally work with more than one output type.
Of course, the answer in these cases may be to just 'know' what you will get, and then convert it to what you want.

Cheers, Ron

___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
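The free functions Greg suggests did eventually materialize: modern Python 3's codecs module exposes module-level encode() and decode(), where the codec rather than the host type determines the input and output types. A small sketch:

```python
import codecs

# Module-level helpers: the codec, not the method's host type,
# determines the input and output types.
u = codecs.decode(b"abc", "ascii")   # bytes -> str
b = codecs.encode("abc", "ascii")    # str -> bytes
```

This is exactly the "time machine" being alluded to above: the functions already existed in the codecs module.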
Re: [Python-Dev] bytes.from_hex()
Stephen J. Turnbull wrote:
> Doesn't that make base64 non-text by analogy to other look but don't touch strings like a .gz or vmlinuz?

No, because I can take a piece of base64 encoded data and use a text editor to manually paste it in with some other text (e.g. a plain-text (not MIME) mail message). Then I can mail it to someone, or send it by text-mode ftp, or translate it from Unix to MSDOS line endings and give it to a Windows user, or translate it into EBCDIC and give it to someone who has an IBM mainframe, etc. And the person at the other end can use their text editor to manually extract it and decode it and recover the original data. I can't do any of those directly with a .gz file or vmlinuz.

I'm not just making those uses up, BTW. It's not very long ago that people used to do things like that all the time with uuencode, binhex, etc -- because mail and news at the time were strictly text channels. They still are, really -- otherwise we wouldn't be using anything as hairy as MIME; we'd just mail our binary files as-is.

-- Greg
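Greg's paste-it-into-a-mail scenario is easy to sketch with the base64 module: the encoded form is printable ASCII text that survives being embedded in a larger text document and extracted again. (The message contents here are invented for illustration.)

```python
import base64

payload = bytes([0, 255, 10, 13, 26])                # bytes that would mangle a text channel
encoded = base64.b64encode(payload).decode("ascii")  # safe, printable text

# Paste the encoded line into an ordinary plain-text mail body.
message = "Here is the file:\n" + encoded + "\nRegards, Greg"

# The recipient extracts the line and recovers the original bytes.
recovered = base64.b64decode(message.splitlines()[1])
```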
Re: [Python-Dev] bytes.from_hex()
Ron Adam wrote:
> This would apply to codecs that could return either bytes or strings, or strings or unicode, or bytes or unicode.

I'd need to see some concrete examples of such codecs before being convinced that they exist, or that they couldn't just as well return a fixed type that you then transform to what you want. I suspect that said transformation would involve some further encoding or decoding, in which case you really have more than one codec.

-- Greg
Re: [Python-Dev] bytes.from_hex()
Greg Ewing wrote:
> Ron Adam wrote:
>> This would apply to codecs that could return either bytes or strings, or strings or unicode, or bytes or unicode.
>
> I'd need to see some concrete examples of such codecs before being convinced that they exist, or that they couldn't just as well return a fixed type that you then transform to what you want.

I think some codecs that currently return 'ascii' encoded text would be candidates. If you use u'abc'.encode('rot13') you get an ascii string back and not a unicode string. And if you use decode to get back, you don't get the original unicode back, but an ascii representation of the original, which you then need to decode to unicode.

> I suspect that said transformation would involve some further encoding or decoding, in which case you really have more than one codec.

Yes, I can see that. So the following are probably better reasons to specify the type.

Codecs are very close to types, and they quite often result in a type change; having the change visible in the code adds to overall readability. This is probably my main desire for this.

There is another reason for being explicit about types with codecs. If you store the codecs with a tuple of attributes as the keys, (name, in_type, out_type), then it makes it possible to look up the codec with the correct behavior and then just do it. The alternative is to test the input, try it, then test the output. The lookup doesn't add much overhead, but does add safety. Codecs don't seem to be the type of thing you will want to pass a wide variety of objects into, so a narrow slot is probably preferable to a wide one here. In cases where a codec might be useful in more than one combination of types, it could have an entry for each valid combination in the lookup table.

The codec lookup also validates the desired operation for nearly free. Of course, the data will need to be valid as well.
;-)

Cheers, Ron
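Ron's (name, in_type, out_type) lookup table can be sketched as a small registry. All the names here (CODECS, convert) are hypothetical illustrations, not a real Python API:

```python
import base64

# A toy registry in the spirit of Ron's (name, in_type, out_type) keys.
CODECS = {
    ("base64", bytes, str): lambda b: base64.b64encode(b).decode("ascii"),
    ("base64", str, bytes): lambda s: base64.b64decode(s),
}

def convert(data, name, out_type):
    """Look up the codec by name, input type, and desired output type."""
    try:
        codec = CODECS[(name, type(data), out_type)]
    except KeyError:
        raise TypeError("no %r codec from %s to %s"
                        % (name, type(data).__name__, out_type.__name__))
    return codec(data)

text = convert(b"spam", "base64", str)   # bytes -> str, unambiguous
data = convert(text, "base64", bytes)    # str -> bytes, unambiguous
```

Because the key encodes both types, direction never needs to be spelled as "encode" or "decode", and an unsupported combination fails at lookup time, which is the "validates the desired operation for nearly free" point.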
Re: [Python-Dev] bytes.from_hex()
Josiah Carlson wrote:
> Greg Ewing [EMAIL PROTECTED] wrote:
>>     u = unicode(b)
>>     u = unicode(b, 'utf8')
>>     b = bytes['utf8'](u)
>>     u = unicode['base64'](b)   # encoding
>>     b = bytes(u, 'base64')     # decoding
>>     u2 = unicode['piglatin'](u1)  # encoding
>>     u1 = unicode(u2, 'piglatin')  # decoding
>
> Your provided semantics feel cumbersome and confusing to me, as compared with str/unicode.encode/decode().
>
> - Josiah

This uses syntax to determine the direction of encoding. It would be easier and clearer to just require two arguments or a tuple.

    u = unicode(b, 'encode', 'base64')
    b = bytes(u, 'decode', 'base64')

    b = bytes(u, 'encode', 'utf-8')
    u = unicode(b, 'decode', 'utf-8')

    u2 = unicode(u1, 'encode', 'piglatin')
    u1 = unicode(u2, 'decode', 'piglatin')

It looks somewhat cleaner if you combine them in a path style string.

    b = bytes(u, 'encode/utf-8')
    u = unicode(b, 'decode/utf-8')

Ron
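Ron's path-style spec can be sketched as a small helper. The function name recode and the "direction/codec" spec format are hypothetical, mirroring the proposal rather than any real API:

```python
# Hypothetical helper parsing Ron's "direction/codec" spec strings.
def recode(data, spec):
    direction, _, codec = spec.partition("/")
    if direction == "encode":
        return data.encode(codec)   # str -> bytes for unicode codecs
    if direction == "decode":
        return data.decode(codec)   # bytes -> str for unicode codecs
    raise ValueError("bad spec: %r" % spec)

b = recode("h\u00e9llo", "encode/utf-8")  # str -> bytes
u = recode(b, "decode/utf-8")             # bytes -> str
```

The spec string makes both the operation and the codec explicit at the call site, which is the readability point being argued for.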
Re: [Python-Dev] bytes.from_hex()
Ron Adam wrote:
> Josiah Carlson wrote:
>> Greg Ewing [EMAIL PROTECTED] wrote:
>>>     u = unicode(b)
>>>     u = unicode(b, 'utf8')
>>>     b = bytes['utf8'](u)
>>>     u = unicode['base64'](b)   # encoding
>>>     b = bytes(u, 'base64')     # decoding
>>>     u2 = unicode['piglatin'](u1)  # encoding
>>>     u1 = unicode(u2, 'piglatin')  # decoding
>>
>> Your provided semantics feel cumbersome and confusing to me, as compared with str/unicode.encode/decode().
>
> This uses syntax to determine the direction of encoding. It would be easier and clearer to just require two arguments or a tuple.
>
>     u = unicode(b, 'encode', 'base64')
>     b = bytes(u, 'decode', 'base64')
>     b = bytes(u, 'encode', 'utf-8')
>     u = unicode(b, 'decode', 'utf-8')
>     u2 = unicode(u1, 'encode', 'piglatin')
>     u1 = unicode(u2, 'decode', 'piglatin')
>
> It looks somewhat cleaner if you combine them in a path style string.
>
>     b = bytes(u, 'encode/utf-8')
>     u = unicode(b, 'decode/utf-8')

It gets from bad to worse :( I always liked the asymmetry between u = unicode(s, utf8) and s = u.encode(utf8), which I think was the original design of the unicode API. Kudos to whoever came up with that.

When I saw b = bytes(u, utf8) mentioned for the first time, I thought: why on earth must the bytes constructor be coupled to the unicode API?!?! It makes no sense to me whatsoever. Bytes have so much more use besides encoded text. I believe (please correct me if I'm wrong) that the encoding argument of bytes() was invented to make it easier to write byte literals. Perhaps a true bytes literal notation is in order after all?

My preference for a bytes -> unicode -> bytes API would be this:

    u = unicode(b, utf8)  # just like we have now
    b = u.tobytes(utf8)   # like u.encode(), but being explicit
                          # about the resulting type

As to base64: while it works as a codec (Why a base64 codec? Because we can!), I don't find it a natural API at all for such conversions.
(I do however agree with Greg Ewing that base64 encoded data is text, not ascii-encoded bytes ;-)

Just-my-2-cts
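For reference, the constructor coupling being debated here is essentially what Python 3 eventually shipped, alongside the asymmetric method API; a sketch in modern Python 3:

```python
# The design Python 3 shipped keeps the asymmetry praised above:
u = "h\u00e9llo"
b1 = u.encode("utf-8")    # unicode -> bytes, codec method on the unicode side
b2 = bytes(u, "utf-8")    # the bytes constructor also accepts an encoding
u2 = str(b1, "utf-8")     # bytes -> unicode via the str constructor
```

So both the u.encode() style and the bytes(u, encoding) constructor coexist; the proposed u.tobytes() spelling is the one that did not survive.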
Re: [Python-Dev] bytes.from_hex()
Ron Adam wrote:
> This uses syntax to determine the direction of encoding. It would be easier and clearer to just require two arguments or a tuple.
>
>     u = unicode(b, 'encode', 'base64')
>     b = bytes(u, 'decode', 'base64')

The point of the exercise was to avoid using the terms 'encode' and 'decode' entirely, since some people claim to be confused by them. While I succeeded in that, I concede that the result isn't particularly intuitive and is arguably even more confusing.

If we're going to continue to use 'encode' and 'decode', why not just make them functions:

    b = encode(u, 'utf-8')
    u = decode(b, 'utf-8')

In the case of Unicode encodings, if you get them backwards you'll get a type error. The advantage of using functions over methods or constructor arguments is that they can be applied uniformly to any input and output types.

-- Greg
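Greg's "get them backwards and you'll get a type error" claim can be checked against the module-level functions modern Python 3 provides in the codecs module (behavior as observed in CPython 3; a sketch, not a guarantee for every codec):

```python
import codecs

# Getting the direction backwards fails loudly with a TypeError
# for the Unicode codecs, as predicted.
try:
    codecs.encode(b"abc", "utf-8")   # the utf-8 encoder wants str, not bytes
    backwards_encode_ok = True
except TypeError:
    backwards_encode_ok = False

try:
    codecs.decode("abc", "utf-8")    # the utf-8 decoder wants bytes, not str
    backwards_decode_ok = True
except TypeError:
    backwards_decode_ok = False
```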
Re: [Python-Dev] bytes.from_hex()
Stephen J. Turnbull wrote:
> What you presumably meant was what would you consider the proper type for (P)CDATA?

No, I mean the whole thing, including all the ... tags etc. Like you see when you load an XML file into a text editor. (BTW, doesn't the fact that you *can* load an XML file into what we call a text editor say something?)

> nobody but authors of wire drivers[2] and introspective code will need to _explicitly_ call .encode('base64').

Even a wire driver writer will only need it if he's trying to turn a text wire into a binary wire, as far as I can see.

-- Greg
Re: [Python-Dev] bytes.from_hex()
Greg == Greg Ewing [EMAIL PROTECTED] writes:

Greg> (BTW, doesn't the fact that you *can* load an XML file into what we call a text editor say something?)

Why not answer that question for yourself, and then turn that answer into a description of text semantics? For me, it says that, just like a gzipped file or the Linux kernel, I can load an XML file into a text editor. But unlike the .gz or vmlinuz, I can easily find many useful things to do to the XML string in the text editor. Doesn't that make base64 non-text, by analogy to other look-but-don't-touch strings like a .gz or vmlinuz?

-- School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp University of Tsukuba, Tennodai 1-1-1, Tsukuba 305-8573 JAPAN
Ask not how you can do free software business; ask what your business can do for free software.
Re: [Python-Dev] bytes.from_hex()
On Tue, 2006-02-28 at 15:23 -0800, Bill Janssen wrote:
> Greg Ewing wrote:
>> Bill Janssen wrote:
>>> bytes -> base64 -> text
>>> text -> de-base64 -> bytes
>> It's nice to hear I'm not out of step with the entire world on this. :-)
> Well, I can certainly understand the bytes-base64-bytes side of things too. The text produced is specified as using a 65-character subset of US-ASCII, so that's really bytes.

Huh... just joining here, but surely you don't mean a text string that doesn't use every character available in a particular encoding is really bytes... it's still a text string...

> If you base64 encode some bytes, you get a string. If you then want to access that base64 string as if it was a bunch of bytes, cast it to bytes.

Be careful not to confuse (type)cast with (type)convert... A convert transforms the data from one type/class to another, modifying it to be a valid equivalent instance of the other type/class; e.g. int -> float. A cast does not modify the data in any way; it just changes its type/class to be the other type, and assumes that the data is a valid instance of the other type; e.g. int32 -> bytes[4]. Minor data munging under the hood to cleanly switch the type/class is acceptable (e.g. adding array length info etc) provided you keep to the spirit of the cast.

Keep these two concepts separate and you should be right :-)

-- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
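Donovan's cast/convert distinction can be illustrated with the struct module (a sketch: Python has no true in-place cast for ints, so packing the raw bits is the closest analogue of int32 -> bytes[4]):

```python
import struct

n = 1
f = float(n)                 # a convert: new type, equivalent value
raw = struct.pack("<i", n)   # a "cast"-like view: the same 32 bits, reinterpreted
# The bit pattern is unchanged; only the type through which we view it differs.
back = struct.unpack("<i", raw)[0]
```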
Re: [Python-Dev] bytes.from_hex()
Bill Janssen wrote:
> Greg Ewing wrote:
>> Bill Janssen wrote:
>>> bytes -> base64 -> text
>>> text -> de-base64 -> bytes
>> It's nice to hear I'm not out of step with the entire world on this. :-)
> Well, I can certainly understand the bytes-base64-bytes side of things too. The text produced is specified as using a 65-character subset of US-ASCII, so that's really bytes.

If the base64 codec was a text <-> bytes codec, and bytes did not have an encode method, then if you want to convert your original bytes to ascii bytes, you would do:

    ascii_bytes = orig_bytes.decode("base64").encode("ascii")

("Use base64 to convert my byte sequence to characters, then give me the corresponding ascii byte sequence.")

To reverse the process:

    orig_bytes = ascii_bytes.decode("ascii").encode("base64")

("Use ascii to convert my byte sequence to characters, then use base64 to convert those characters back to the original byte sequence.")

The only slightly odd aspect is that this inverts the conventional meaning of base64 encoding and decoding, where you expect to encode from bytes to characters and decode from characters to bytes. As strings currently have both methods, the existing codec is able to use the conventional sense for base64: encode goes from str-as-bytes to str-as-text (giving a longer string with characters that fit in the base64 subset), and decode goes from str-as-text to str-as-bytes (giving back the original string).

All the unicode codecs, on the other hand, use encode to get from characters to bytes and decode to get from bytes to characters. So if bytes objects *did* have an encode method, it should still result in a unicode object, just the same as a decode method does (because you are encoding bytes as characters), and unicode objects would acquire a corresponding decode method (that decodes from a character format such as base64 to the original byte sequence).
In the name of TOOWTDI, I'd suggest that we just eat the slight terminology glitch in the rare cases like base64, hex and oct (where the character format is technically the encoded format), and leave it so that there is a single method pair (bytes.decode to go from bytes to characters, and text.encode to go from characters to bytes).

Cheers, Nick.

-- Nick Coghlan | [EMAIL PROTECTED] | Brisbane, Australia --- http://www.boredomandlaziness.org
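For comparison, the resolution Python 3 eventually adopted sidesteps the method-pair question entirely: base64 became a bytes-to-bytes transform reachable only through the codecs module functions, not through bytes.decode or str.encode. A sketch in modern Python 3:

```python
import codecs

orig_bytes = b"some binary \x00 data"
# base64_codec is a bytes -> bytes transform in Python 3,
# invoked via the module-level functions rather than methods.
ascii_bytes = codecs.encode(orig_bytes, "base64")
round_trip = codecs.decode(ascii_bytes, "base64")
```

This keeps the conventional sense (encode toward base64, decode away from it) without giving bytes an encode method.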
Re: [Python-Dev] bytes.from_hex()
Nick Coghlan wrote:
> All the unicode codecs, on the other hand, use encode to get from characters to bytes and decode to get from bytes to characters. So if bytes objects *did* have an encode method, it should still result in a unicode object, just the same as a decode method does (because you are encoding bytes as characters), and unicode objects would acquire a corresponding decode method (that decodes from a character format such as base64 to the original byte sequence).
>
> In the name of TOOWTDI, I'd suggest that we just eat the slight terminology glitch in the rare cases like base64, hex and oct (where the character format is technically the encoded format), and leave it so that there is a single method pair (bytes.decode to go from bytes to characters, and text.encode to go from characters to bytes).

I think you have it pretty straight here.

While playing around with the example bytes class I noticed code reads much better when I use methods called tounicode and tostring.

    b64ustring = b.tounicode('base64')
    b = bytes(b64ustring, 'base64')

The bytes could then *not* ignore the string decode codec but use it for string to string decoding.

    b64string = b.tostring('base64')
    b = bytes(b64string, 'base64')

    b = bytes(hexstring, 'hex')
    hexstring = b.tostring('hex')
    hexstring = b.tounicode('hex')

An exception could be raised if the codec does not support the input or output type, depending on the situation. This would allow different types of codecs to live together without as much confusion, I think.

I'm not suggesting we start using to-type everywhere, just where it might make things clearer over decode and encode. Expecting it not to fly, but just maybe it could?

Ron
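Ron's example bytes class isn't shown in the thread, but the tounicode idea can be sketched with a toy subclass. The class and method names here are hypothetical, mirroring the proposal:

```python
import base64

# A toy sketch of Ron's proposal; mybytes and tounicode are hypothetical names.
class mybytes(bytes):
    def tounicode(self, codec):
        """Return a unicode string; the destination type is fixed by the name."""
        if codec == "base64":
            return base64.b64encode(self).decode("ascii")
        return self.decode(codec)

b = mybytes(b"abc")
b64ustring = b.tounicode("base64")
```

The destination type is spelled out in the method name, so the reader never has to remember which direction "encode" runs for an unfamiliar codec.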
Re: [Python-Dev] bytes.from_hex()
Ron Adam writes:
> While playing around with the example bytes class I noticed code reads much better when I use methods called tounicode and tostring. [...] I'm not suggesting we start using to-type everywhere, just where it might make things clearer over decode and encode.

+1

I always find myself slightly confused by encode() and decode(), despite the fact that I understand (I think) the reason for the choice of those names and by rights ought to have no trouble. I'm not arguing that it's worth the gratuitous code breakage (I don't have enough code using encode() and decode() for my opinion to count in that matter). But I will say that if there were no legacy I'd prefer the tounicode() and tostring() (but shouldn't it be 'tobytes()' instead?) names for Python 3.0.

-- Michael Chermside
Re: [Python-Dev] bytes.from_hex()
> Huh... just joining here, but surely you don't mean a text string that doesn't use every character available in a particular encoding is really bytes... it's still a text string...

No, once it's in a particular encoding it's bytes, no longer text. As you say, keep these two concepts separate and you should be right :-)

Bill
Re: [Python-Dev] bytes.from_hex()
Chermside, Michael wrote:
> ... I will say that if there were no legacy I'd prefer the tounicode() and tostring() (but shouldn't it be 'tobytes()' instead?) names for Python 3.0.

Wouldn't 'tobytes' and 'totext' be better for 3.0, where text == unicode?

-- Scott David Daniels [EMAIL PROTECTED]
Re: [Python-Dev] bytes.from_hex()
I wrote:
> ... I will say that if there were no legacy I'd prefer the tounicode() and tostring() (but shouldn't it be 'tobytes()' instead?) names for Python 3.0.

Scott Daniels replied:
> Wouldn't 'tobytes' and 'totext' be better for 3.0, where text == unicode?

Um... yes. Sorry, I'm not completely used to 3.0 yet. I'll need to borrow the time machine for a little longer before my fingers really pick up on the 3.0 names and idioms.

-- Michael Chermside
Re: [Python-Dev] bytes.from_hex()
Nick Coghlan wrote:
>     ascii_bytes = orig_bytes.decode(base64).encode(ascii)
>     orig_bytes = ascii_bytes.decode(ascii).encode(base64)
>
> The only slightly odd aspect is that this inverts the conventional meaning of base64 encoding and decoding,

-1. Whatever we do, we shouldn't design things so that it's necessary to write anything as unintuitive as that.

We need to make up our minds whether the .encode() and .decode() methods are only meant for Unicode encodings, or whether they are for general transformations between bytes and characters.

If they're only meant for Unicode, then bytes should only have .decode(), unicode strings should only have .encode(), and only Unicode codecs should be available that way. Things like base64 would need to have a different interface.

If they're for general transformations, then both types should have both methods, with the return type depending on the codec you're using, and it's the programmer's responsibility to use codecs that make sense for what he's doing.

But if they're for general transformations, why limit them to just bytes and characters? Following that through leads to giving *every* object .encode() and .decode() methods. I don't think we should go that far, but it's hard to see where to draw the line. Are bytes and strings special enough to justify them having their own peculiar methods for codec access?

-- Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand | Carpe post meridiam! (I'm not a morning person.) | [EMAIL PROTECTED]
Re: [Python-Dev] bytes.from_hex()
Bill Janssen wrote:
> No, once it's in a particular encoding it's bytes, no longer text.

The point at issue is whether the characters produced by base64 are in a particular encoding. According to my reading of the RFC, they're not.

-- Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand
Re: [Python-Dev] bytes.from_hex()
Ron Adam wrote:
> While playing around with the example bytes class I noticed code reads much better when I use methods called tounicode and tostring.
>
>     b64ustring = b.tounicode('base64')
>     b = bytes(b64ustring, 'base64')

I don't like that, because it creates a dependency (conceptually, at least) between the bytes type and the unicode type. And why unicode in particular? Why should it have a tounicode() method, but not a toint() or tofloat() or tolist() etc.?

> I'm not suggesting we start using to-type everywhere, just where it might make things clearer over decode and encode.

Another thing is that it only works if the codec transforms between two different types. If you have a bytes-to-bytes transformation, for example, then

    b2 = b1.tobytes('some-weird-encoding')

is ambiguous.

-- Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand
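Bytes-to-bytes codecs are not hypothetical: zlib compression, for example, is exposed as exactly such a codec in modern Python 3, and a method named tobytes genuinely could not say which direction is meant. A sketch:

```python
import codecs

# zlib_codec is a bytes -> bytes transform: neither endpoint is unicode,
# so a destination-type method name carries no direction information.
data = b"hello " * 10
packed = codecs.encode(data, "zlib")    # direction must be named explicitly
unpacked = codecs.decode(packed, "zlib")
```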
Re: [Python-Dev] bytes.from_hex()
[My apologies Greg; I meant to send this to the whole list. I really need a list-reply button in GMail.]

On 3/1/06, Greg Ewing [EMAIL PROTECTED] wrote:
> I don't like that, because it creates a dependency (conceptually, at least) between the bytes type and the unicode type.

I only find half of this bothersome. The unicode type has a pretty clear dependency on the bytestring type: all I/O needs to be done in bytes. Various APIs may mask this by accepting unicode values and transparently doing the right thing, but from the theoretical standpoint we pretend there is no simple serialization of unicode values. But the reverse is not true: the bytestring type has no dependency on unicode. As practicality vs. purity goes, however, I think it's a good choice to let the bytestring type have a tie to unicode, much like the str type implicitly does now. But you're absolutely right that adding a .tounicode begs the question: why not a .tointeger?

To try to step back and summarize the viewpoints I've seen so far, there are three main requirements.

1) We want things that are conceptually text to be stored in memory as unicode values.
2) We want there to be some unambiguous conversion via codecs between bytestrings and unicode values. This should help teaching, learning, and remembering unicode.
3) We want a way to apply and reverse compressions, encodings, encryptions, etc., which are not only between bytestrings and unicode values; they may be between any two arbitrary types. This allows writing practical programs.

There seems to be little disagreement over 1, provided sufficiently efficient implementation, or sufficient string powers in the bytestring type. To satisfy both 2 and 3, there seem to be a couple of options. What other requirements do we have?

For (2):
a) Restrict the existing helpers to be only bytestring.decode and unicode.encode, possibly enforcing output types of the opposite kind, and removing bytestring.encode
b) Add new methods with these semantics, e.g.
bytestring.udecode and unicode.uencode

For (3):
c) Create new helpers codecs.encode(obj, encoding, errors) and codecs.decode(obj, encoding, errors)
d) [Keep existing bytestring and unicode helper methods as is, and] require use of codecs.getencoder() and codecs.getdecoder() for arbitrary starting object types

Obviously 2a and 3d do not work together, but 2b and 3c work with either complementary option. What other options do we have?

Michael

-- Michael Urman http://www.tortall.net/mu/blog
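Option (3d) relies on codecs.getencoder() and codecs.getdecoder(), which already exist: each returns the raw codec function, and each of those returns a (result, length_consumed) pair. A sketch of their use:

```python
import codecs

# Fetch the raw codec functions; each returns (result, length consumed).
enc = codecs.getencoder("utf-8")
dec = codecs.getdecoder("utf-8")

data, consumed = enc("h\u00e9llo")   # str -> (bytes, int)
text, consumed2 = dec(data)          # bytes -> (str, int)
```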
Re: [Python-Dev] bytes.from_hex()
Greg Ewing wrote:
> Ron Adam wrote:
>> While playing around with the example bytes class I noticed code reads much better when I use methods called tounicode and tostring.
>>
>>     b64ustring = b.tounicode('base64')
>>     b = bytes(b64ustring, 'base64')
>
> I don't like that, because it creates a dependency (conceptually, at least) between the bytes type and the unicode type. And why unicode in particular? Why should it have a tounicode() method, but not a toint() or tofloat() or tolist() etc.?

I don't think it creates a dependency between the types, but it does create a stronger relationship between them when a method that returns a fixed type is used.

No reason not to have the others, other than avoiding methods that really aren't needed. But if it makes sense to have them, sure. If a codec isn't needed, a regular constructor should probably be used instead.

>> I'm not suggesting we start using to-type everywhere, just where it might make things clearer over decode and encode.
>
> Another thing is that it only works if the codec transforms between two different types. If you have a bytes-to-bytes transformation, for example, then b2 = b1.tobytes('some-weird-encoding') is ambiguous.

Are you asking if it's decoding or encoding?

    bytes to unicode -> decoding
    unicode to bytes -> encoding
    bytes to bytes   -> ?

Good point. I think this defines part of the difficulty:

1. We can specify the operation and not be sure of the resulting type, *or*
2. We can specify the type and not always be sure of the operation.

Maybe there's a way to specify both so it's unambiguous?

Ron
Re: [Python-Dev] bytes.from_hex()
Greg Ewing [EMAIL PROTECTED] wrote:
>     u = unicode(b)
>     u = unicode(b, 'utf8')
>     b = bytes['utf8'](u)
>     u = unicode['base64'](b)   # encoding
>     b = bytes(u, 'base64')     # decoding
>     u2 = unicode['piglatin'](u1)  # encoding
>     u1 = unicode(u2, 'piglatin')  # decoding

Your provided semantics feel cumbersome and confusing to me, as compared with str/unicode.encode/decode().

- Josiah
Re: [Python-Dev] bytes.from_hex()
Bill Janssen wrote:
> Well, I can certainly understand the bytes-base64-bytes side of things too. The text produced is specified as using a 65-character subset of US-ASCII, so that's really bytes.

But it then goes on to say that these same characters are also a subset of EBCDIC. So it seems to be talking about characters as abstract entities here, not as bit patterns.

Greg
Re: [Python-Dev] bytes.from_hex()
Bill Janssen wrote:
> I use it quite a bit for image processing (converting to and from the data: URL form), and various checksum applications (converting SHA into a string).

Aha! We have a customer! For those cases, would you find it more convenient for the result to be text or bytes in Py3k?

Greg
Re: [Python-Dev] bytes.from_hex()
Ron == Ron Adam [EMAIL PROTECTED] writes:

Ron> So, let's consider a codec and a coding as being two different things, where a codec is a character subset of unicode characters expressed in a native format. And a coding is *not* a subset of the unicode character set, but an _operation_ performed on text.

Ron> codec - text is always in *one_codec* at any time.

No, a codec is an operation, not a state. And text qua text has no need of state; the whole point of defining text (as in the unicode type) is to abstract from such representational issues.

Ron> Pure codecs such as latin-1 can be invoked over and over and you can always get back what you put in in a single step.

Maybe you'd like to define them that way, but it doesn't work in general. Given that str and unicode currently don't carry state with them, it's not possible for "to ASCII" and "to EBCDIC" to be idempotent at the same time. And for the languages spoken by 75% of the world's population, "to latin-1" cannot be successfully invoked even once, let alone be idempotent. You really need to think about how your examples apply to codecs like KOI8-R for Russian and Shift JIS for Japanese.

In practice, I just don't think you can distinguish codecs from coding using the kind of mathematical properties you have described here.

-- School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp University of Tsukuba, Tennodai 1-1-1, Tsukuba 305-8573 JAPAN
Ask not how you can do free software business; ask what your business can do for free software.
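Stephen's KOI8-R point is easy to demonstrate: "to latin-1" cannot be invoked at all for Russian text, while a Cyrillic-capable codec round-trips it cleanly. A sketch in modern Python 3:

```python
# "to latin-1" cannot be invoked even once for most of the world's text,
# while a Cyrillic-capable codec handles it fine.
text = "\u041f\u0440\u0438\u0432\u0435\u0442"  # Russian, outside latin-1's repertoire
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

koi8_round_trip = text.encode("koi8-r").decode("koi8-r")
```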
Re: [Python-Dev] bytes.from_hex()
Stephen J. Turnbull wrote: The reason that Python source code is text is that the primary producers/consumers of Python source code are human beings, not compilers I disagree with "primary" -- I think human and computer use of source code have equal importance. Because of the fact that Python source code must be acceptable to the Python compiler, a great many transformations that would be harmless to English text (upper casing, paragraph wrapping, etc.) would cause disaster if applied to a Python program. I don't see how base64 is any different. Yes, which implies that you assume he has control of the data all the way to the channel that actually requires base64. Yes. If he doesn't, he can't safely use base64 at all. That's true regardless of how the base64-encoded data is represented. It's true of any data of any kind. Use case: the Gnus MUA supports the RFC that allows non-ASCII names in MIME headers that take file names... I'm not familiar with all the details you're alluding to here, but if there's a bug here, I'd say it's due to somebody not thinking something through properly. It shouldn't matter if something gets encoded four times as long as it gets decoded four times at the other end. If it's not possible to do that, someone made an assumption about the channel that wasn't true. It's "what is the Python compiler/interpreter going to think?" AFAICS, it's going to think that base64 is a unicode codec. Only if it's designed that way, and I specifically think it shouldn't -- i.e. it should be an error to attempt the likes of a_unicode_string.encode('base64') or unicode(something, 'base64'). The interface for doing base64 encoding should be something else. I don't believe that "takes a character string as input" has any intrinsic meaning. I'm using that phrase in the context of Python, where it means a function that takes a Python character string as input.
In the particular case of base64, it has the added restriction that it must preserve the particular 65 characters used. In practice, I think it's a loaded gun aimed at my foot. And yours. Whereas it seems quite the opposite to me, i.e. *failing* to clearly distinguish between text and binary data here is what will lead to confusion and foot-shooting. I think we need some concrete use cases to talk about if we're to get any further with this. Do you have any such use cases in mind? Greg
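Greg's analogy between base64 and Python source (transformations harmless to English text cause disaster) can be demonstrated with base64 itself; a sketch in modern Python:

```python
import base64

payload = b"\x00\x01\xfe\xff some binary data"
encoded = base64.b64encode(payload).decode("ascii")

# Left alone, the 65-character alphabet round-trips exactly.
roundtrip_ok = base64.b64decode(encoded) == payload

# Upper-casing is harmless to English prose, but base64's alphabet is
# case-sensitive, so the "same" text now decodes to different bytes.
corrupted = base64.b64decode(encoded.upper()) != payload
```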
Re: [Python-Dev] bytes.from_hex()
Ron == Ron Adam [EMAIL PROTECTED] writes: Ron We could call it transform or translate if needed. You're still losing the directionality, which is my primary objection to recode. The absence of directionality is precisely why recode is used in that sense for i18n work. There really isn't a good reason that I can see to use anything other than the pair encode and decode. In monolingual environments, once _all_ human-readable text (specifically including Python programs and console I/O) is automatically mapped to a Python (unicode) string, most programmers will never need to think about it as long as Python (the project) very very strongly encourages that all Python programs be written in UTF-8 if there's any chance the program will be reused in a locale other than the one where it was written. (Alternatively you can depend on PEP 263 coding cookies.) Then the user (or the Python interpreter) just changes console and file I/O codecs to the encoding in use in that locale, and everything just works. So the remaining uses of encode and decode are for advanced users and specialists: people using stuff like base64 or gzip, and those who need to use unicode codecs explicitly. I could be wrong about the possibility to get rid of explicit unicode codec use in monolingual environments, but I hope that we can at least try to achieve that. Unlikely. Errors like "A string.".encode('base64').encode('base64') are all too easy to commit in practice. Ron Yes,... and wouldn't the above just result in a copy so it Ron wouldn't be an outright error. No, you either get the following: "A string." -> "QSBzdHJpbmcu" -> "UVNCemRISnBibWN1" or you might get an error if base64 is defined as bytes->unicode. Ron * Given that the string type gains a __codec__ attribute Ron to handle automatic decoding when needed. (is there a reason Ron not to?) Ron str(object[,codec][,error]) -> string coded with codec Ron unicode(object[,error]) -> unicode Ron bytes(object) -> bytes str == unicode in Py3k, so this is a non-starter.
What do you want to say? Ron * a recode() method is used for transformations that Ron *do_not* change the current codec. I'm not sure what you mean by "the current codec". If it's attached to an encoded object, it should be the codec needed to decode the object. And it should be allowed to be a codec stack. So suppose you start with a unicode object obj. Then

    bytes = bytes(obj, 'utf-8')     # implicit .encode()
    print bytes.codec
    ['utf-8']
    wire = bytes.encode('base64')   # with apologies to Greg E.
    print wire.codec
    ['base64', 'utf-8']
    obj2 = wire.decode('gzip')
    CodecMatchException
    obj2 = wire.decode(wire.codec)
    print obj == obj2
    True
    print obj2.codec
    []

or maybe None for the last. I think this would be very nice as a basis for improving the email module (for one), but I don't really think it belongs in Python core. Ron That may be why it wasn't done this way to start. (?) I suspect the real reason is that Marc-Andre had the generalized codec in mind from Day 0, and your proposal only works with duck-typing if codecs always have a well-defined signature with two different types for the argument and return of the constructor.
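Stephen's hypothetical codec stack can be sketched in a few lines of modern Python. Everything here (CodedBytes, CodecMatchException, the method names) is invented for illustration; nothing like it exists in the stdlib:

```python
import codecs

class CodecMatchException(Exception):
    pass

class CodedBytes(bytes):
    """Bytes that remember the codec stack needed to decode them."""
    codec = ()  # codec names, outermost first

    @classmethod
    def from_text(cls, text, charset):
        obj = cls(text.encode(charset))
        obj.codec = (charset,)
        return obj

    def encode_layer(self, name):
        out = CodedBytes(codecs.encode(bytes(self), name))
        out.codec = (name,) + self.codec
        return out

    def decode_stack(self, stack):
        if tuple(stack) != self.codec:
            raise CodecMatchException(stack)
        data = bytes(self)
        names = list(self.codec)
        while len(names) > 1:            # peel off bytes->bytes layers
            data = codecs.decode(data, names.pop(0))
        return data.decode(names[0])     # final charset layer

obj = "héllo"
wire = CodedBytes.from_text(obj, "utf-8").encode_layer("base64_codec")
```

Decoding with the wrong stack raises the (invented) CodecMatchException, which is the safety property Stephen is after.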
Re: [Python-Dev] bytes.from_hex()
Greg == Greg Ewing [EMAIL PROTECTED] writes: Greg Stephen J. Turnbull wrote: No, base64 isn't a wire protocol. It's a family[...]. Greg Yes, and it's up to the programmer to choose those code Greg units (i.e. pick an encoding for the characters) that will, Greg in fact, pass through the channel he is using without Greg corruption. I don't see how any of this is inconsistent with Greg what I've said. It's not. It just shows that there are other correct ways to think about the issue. Only if you do no transformations that will harm the base64-encoding. ... It doesn't allow any of the usual transformations on characters that might be applied globally to a mail composition buffer, for example. Greg I don't understand that. Obviously if you rot13 your mail Greg message or turn it into pig latin or something, it's going Greg to mess up any base64 it might contain. But that would be a Greg silly thing to do to a message containing base64. What message containing base64? Any base64 in there? Nope, nobody here but us Unicode characters! I certainly hope that in Py3k bytes objects will have neither ROT13 nor case-changing methods, but str objects certainly will. Why give up the safety of that distinction? Greg Given any piece of text, there are things it makes sense to Greg do with it and things it doesn't, depending entirely on the Greg use to which the text will eventually be put. I don't see Greg how base64 is any different in this regard. If you're going to be binary about it, it's not different. However the kind of text for which Unicode was designed is normally produced and consumed by people, who wll pt up w/ ll knds f nnsns. Base64 decoders will not put up with the same kinds of nonsense that people will. You're basically assuming that the person who implements the code that processes a Unicode string is the same person who implemented the code that converts a binary object into base64 and inserts it into a string. 
I think that's a dangerous (and certainly invalid) assumption. I know I've lost time and data to applications that make assumptions like that. In fact, that's why MULE is a four-letter word in Emacs channels. <wink> So then you bring it right back in with base64. Now they need to know about bytes->unicode codecs. Greg No, they need to know about the characteristics of the Greg channel over which they're sending the data. I meant it in a trivial sense: How do you use a bytes->unicode codec properly without knowing that it's a bytes->unicode codec? In most environments, it should be possible to hide bytes->unicode codecs almost all the time, and I think that's a very good thing. I don't think it's a good idea to gratuitously introduce wire protocols as unicode codecs, even if a class of bit patterns which represent the integer 65 are denoted "A" in various sources. Practicality beats purity (especially when you're talking about the purity of a pregnant virgin). Greg It might be appropriate to use base64 followed by some Greg encoding, but the programmer needs to be aware of that and Greg choose the encoding wisely. It's not possible to shield him Greg from having to know about encodings in that situation, even Greg if the encoding is just ascii. What do you think the email module does? Assuming conforming MIME messages and receivers capable of handling UTF-8, the user of the email module does not need to know anything about any encodings at all. With a little more smarts, the email module could even make a good choice of output encoding based on the _language_ of the text, removing the restriction to UTF-8 on the output side, too. With the aid of file(1), it can make excellent guesses about attachments. Sure, the email module programmer needs to know, but the email module programmer needs to know an awful lot about codecs anyway, since mail at that level is a binary channel, while users will be throwing a mixed bag of binary and textual objects at it.
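Stephen's claim that the email module can hide codecs from its users is essentially what the modern email API (which arrived well after this thread) ended up doing; a sketch, with an illustrative address:

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["To"] = "someone@example.org"      # address is illustrative
msg["Subject"] = "codec-free mail"
msg.set_content("Résumé attached.")    # text part: charset chosen for us
msg.add_attachment(b"\x00\x01\xfe\xff",            # binary part: base64
                   maintype="application",         # transfer encoding
                   subtype="octet-stream")         # chosen for us

wire = msg.as_bytes()   # mail at this level is a binary channel
```

The user supplies text and bytes; every charset and transfer-encoding decision happens inside the module.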
Re: [Python-Dev] bytes.from_hex()
* The following reply is a rather longer than I intended explanation of why codings like 'rot' aren't the same thing as pure unicode codecs (and how they differ) and probably should be treated differently. If you already understand that, then I suggest skipping this. But if you like detailed logical analysis, it might be of some interest even if it's reviewing the obvious to those who already know. (And hopefully I didn't make any really obvious errors myself.) Stephen J. Turnbull wrote: Ron == Ron Adam [EMAIL PROTECTED] writes: Ron We could call it transform or translate if needed. You're still losing the directionality, which is my primary objection to recode. The absence of directionality is precisely why recode is used in that sense for i18n work. I think you're not understanding what I suggested. It might help if we could agree on some points and then go from there. So, let's consider a codec and a coding as being two different things, where a codec is a character subset of unicode characters expressed in a native format, and a coding is *not* a subset of the unicode character set, but an _operation_ performed on text. So you would have the following properties. codec - text is always in *one_codec* at any time. coding - operation performed on text. Let's add a special default coding called 'none' to represent a do-nothing coding. (figuratively, for explanation purposes) 'none' - return the input as is, or the uncoded text. Given the above relationships we have the following possible transformations. 1. codec to like codec: 'ascii' to 'ascii' 2. codec to unlike codec: 'ascii' to 'latin1' And we have coding relationships of: a. coding to like coding # Unchanged, do nothing b. coding to unlike coding Then we can express all the possible combinations as...
[1.a, 1.b, 2.a, 2.b] 1.a - coding in codec to like coding in like codec: 'none' in 'ascii' to 'none' in 'ascii' 1.b - coding in codec to diff coding in like codec: 'none' in 'ascii' to 'base64' in 'ascii' 2.a - coding in codec to same coding in diff codec: 'none' in 'ascii' to 'none' in 'latin1' 2.b - coding in codec to diff coding in diff codec: 'none' in 'latin1' to 'base64' in 'ascii' This last one is a problem, as some codecs combine coding with character set encoding and return text in a different encoding than they received. The line is also blurred between types and encodings. Is unicode an encoding? Will bytes also be an encoding? Using the above combinations: (1.a) is just creating a new copy of an object. s = str(s) (1.b) is recoding an object; it returns a copy of the object in the same encoding. s = s.encode('hex-codec') # ascii str -> ascii str coded in hex s = s.decode('hex-codec') # ascii str coded in hex -> ascii str * these are really two different operations. And encoding repeatedly results in nested codings. Codecs (as a pure subset of unicode) don't have that property. * the hex-codec also fits the 2.b pattern below if the source string is of a different type than ascii. (or the default string?) (2.a) creates a copy encoded in a new codec. s = s.encode('latin1') * I believe string constructors should have an encoding argument for use with unicode strings. s = str(u, 'latin1') # This would match the bytes constructor. (2.b) are combinations of the above. s = u.encode('base64') # unicode to ascii string as base64 coded characters u = unicode(s.decode('base64')) # ascii string coded in base64 to unicode characters or

    u = unicode(s, 'base64')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: decoder did not return an unicode object (type=str)

Ooops... ;) So is a coding the same as a codec? I think they have different properties and should be treated differently, except when the practicality-over-purity rule is needed.
And in those cases maybe the names could clearly state the result. u.decode('base64ascii') # name indicates coding to codec "A string." -> "QSBzdHJpbmcu" -> "UVNCemRISnBibWN1" Looks like the underlying sequence is: native string -> unicode -> unicode coded base64 -> coded ascii str And the decode operation would be... coded ascii str -> unicode coded base64 -> unicode -> ascii str Except it may combine some of these steps to speed it up. Since it's a hybrid codec including a coding operation, we have to treat it as a codec.
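Ron's distinction (codings nest, pure charset codecs don't) is observable with the hex-codec he names. In modern Python the bytes->bytes transforms live behind codecs.encode/decode:

```python
import codecs

data = b"abc"

# A "coding" in Ron's sense: applying hex twice wraps twice, and each
# layer has to be peeled off separately -- the operation nests.
once = codecs.encode(data, "hex_codec")     # b'616263'
twice = codecs.encode(once, "hex_codec")    # b'363136323633'
peeled = codecs.decode(codecs.decode(twice, "hex_codec"), "hex_codec")

# A charset codec: decoding and re-encoding latin-1 is a single-step
# round trip that reproduces the same bytes -- nothing nests.
text = bytes(range(256)).decode("latin-1")
same_bytes = text.encode("latin-1")
```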
Re: [Python-Dev] bytes.from_hex()
Stephen J. Turnbull wrote: the kind of text for which Unicode was designed is normally produced and consumed by people, who wll pt up w/ ll knds f nnsns. Base64 decoders will not put up with the same kinds of nonsense that people will. The Python compiler won't put up with that sort of nonsense either. Would you consider that makes Python source code binary data rather than text, and that it's inappropriate to represent it using a unicode string? You're basically assuming that the person who implements the code that processes a Unicode string is the same person who implemented the code that converts a binary object into base64 and inserts it into a string. No, I'm assuming the user of base64 knows the characteristics of the channel he's using. You can only use base64 if you know the channel promises not to munge the particular characters that base64 uses. If you don't know that, you shouldn't be trying to send base64 through that channel. In most environments, it should be possible to hide bytes->unicode codecs almost all the time, But it *is* hidden in the situation I'm talking about, because all the Unicode encoding/decoding takes place inside the implementation of the text channel, which I'm taking as a given. I don't think it's a good idea to gratuitously introduce wire protocols as unicode codecs, I am *not* saying that base64 is a unicode codec! If that's what you thought I was saying, it's no wonder we're confusing each other. It's just a transformation from bytes to text. I'm only calling it unicode because all text will be unicode in Py3k. In py2.x it could just as well be a str -- but a str interpreted as text, not binary. What do you think the email module does? Assuming conforming MIME messages But I'm not assuming MIME in the first place. If I have a mail interface that will accept chunks of binary data and encode them as a MIME message for me, then I don't need to use base64 in the first place.
The only time I need to use something like base64 is when I have something that will only accept text. In Py3k, "accepts text" is going to mean "takes a character string as input", where "character string" is a distinct type from binary data. So having base64 produce anything other than a character string would be awkward and inconvenient. I phrased that paragraph carefully to avoid using the word "unicode" anywhere. Does that make it clearer what I'm getting at? -- Greg
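Greg's "base64 produces characters" position can be set against what Python 3 eventually shipped (a bytes->bytes b64encode); since the alphabet is pure ASCII, the two views are one trivial decode apart:

```python
import base64

payload = b"\x00\xff binary payload"

raw = base64.b64encode(payload)    # the bytes->bytes view Python 3 took
as_text = raw.decode("ascii")      # Greg's view: the result is characters

# Because the alphabet is ASCII-only, the text form survives any
# ASCII-superset channel encoding, and decoding accepts either form.
back_from_text = base64.b64decode(as_text)
back_from_bytes = base64.b64decode(raw)
```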
Re: [Python-Dev] bytes.from_hex()
Stephen J. Turnbull wrote: Please define "character", and explain how its semantics map to Python's unicode objects. One of the 65 abstract entities referred to in the RFC and represented in that RFC by certain visual glyphs. There is a subset of the Unicode code points that are conventionally associated with very similar glyphs, so that there is an obvious one-to-one mapping between these entities and those Unicode code points. These entities therefore have a natural and obvious representation using Python unicode strings. No, base64 isn't a wire protocol. Rather, it's a schema for a family of wire protocols, whose alphabets are heuristically chosen on the assumption that code units which happen to correspond to alpha-numeric code points in a commonly-used coded character set are more likely to pass through a communication channel without corruption. Yes, and it's up to the programmer to choose those code units (i.e. pick an encoding for the characters) that will, in fact, pass through the channel he is using without corruption. I don't see how any of this is inconsistent with what I've said. Only if you do no transformations that will harm the base64-encoding. ... It doesn't allow any of the usual transformations on characters that might be applied globally to a mail composition buffer, for example. I don't understand that. Obviously if you rot13 your mail message or turn it into pig latin or something, it's going to mess up any base64 it might contain. But that would be a silly thing to do to a message containing base64. Given any piece of text, there are things it makes sense to do with it and things it doesn't, depending entirely on the use to which the text will eventually be put. I don't see how base64 is any different in this regard. So then you bring it right back in with base64. Now they need to know about bytes->unicode codecs. No, they need to know about the characteristics of the channel over which they're sending the data.
Base64 is designed for situations in which you have a *text* channel that you know is capable of transmitting at least a certain subset of characters, where "character" means whatever is used as input to that channel. In Py3k, text will be represented by unicode strings. So a Py3k text channel should take unicode as its input, not bytes. I think we've got a bit sidetracked by talking about MIME. I wasn't actually thinking about MIME, but just a plain text message into which some base64 data was being inserted. That's the way we used to do things in the old days with uuencode etc, before MIME was invented. Here, the channel is NOT the socket or whatever that the ultimate transmission takes place over -- it's the interface to your mail sending software that takes a piece of plain text and sends it off as a mail message somehow. In Py3k, if a channel doesn't take unicode as input, then it's not a text channel, and it's not appropriate to be using base64 with it directly. It might be appropriate to use base64 followed by some encoding, but the programmer needs to be aware of that and choose the encoding wisely. It's not possible to shield him from having to know about encodings in that situation, even if the encoding is just ascii. Trying to do so will just lead to more confusion, in my opinion. Greg
Re: [Python-Dev] bytes.from_hex()
Greg == Greg Ewing [EMAIL PROTECTED] writes: Greg Stephen J. Turnbull wrote: What I advocate for Python is to require that the standard base64 codec be defined only on bytes, and always produce bytes. Greg I don't understand that. It seems quite clear to me that Greg base64 encoding (in the general sense of encoding, not the Greg unicode sense) takes binary data (bytes) and produces Greg characters. Base64 is a (family of) wire protocol(s). It's not clear to me that it makes sense to say that the alphabets used by baseNN encodings are composed of characters, but suppose we stipulate that. Greg So in Py3k the correct usage would be [bytes->unicode]. IMHO, as a wire protocol, base64 simply doesn't care what Python's internal representation of characters is. I don't see any case for correctness here, only for convenience, both for programmers on the job and students in the classroom. We can choose the character set that works best for us. I think that's 8-bit US ASCII. My belief is that bytes->bytes is going to be the dominant use case, although I don't use binary representation in XML. However, AFAIK for on the wire use UTF-8 is strongly recommended for XML, and in that case it's also efficient to use bytes->bytes for XML, since conversion of base64 bytes to UTF-8 characters is simply a matter of "Simon says, be UTF-8!" And in the classroom, you're just going to confuse students by telling them that UTF-8 --[Unicode codec]--> Python string is decoding but UTF-8 --[base64 codec]--> Python string is encoding, when MAL is telling them that "--> Python string" is always decoding. Sure, it all makes sense if you already know what's going on. But I have trouble remembering, especially in cases like UTF-8 vs UTF-16 where Perl and Python have opposite internal representations, and glibc has a third which isn't either. If base64 (and gzip, etc) are all considered bytes->bytes, there just isn't an issue any more. The simple rule wins: "to Python string" is always decoding.
Why fight it when we can run away with efficiency gains? <wink> (In the above, "Python string" means the unicode type, not str.)
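Stephen's "simple rule" is roughly where Python 3 landed: gzip/base64-style transforms are bytes->bytes via codecs.encode/decode, and producing a str is always a charset decode. A sketch:

```python
import codecs

blob = b"hello world " * 10

# Wire-protocol layers are bytes->bytes...
wire = codecs.encode(codecs.encode(blob, "zlib_codec"), "base64_codec")
back = codecs.decode(codecs.decode(wire, "base64_codec"), "zlib_codec")

# ...and "-> Python string" is always a charset decode.  Base64 output
# is pure ASCII, so this last step is the trivial relabelling Stephen
# describes as "Simon says, be UTF-8!".
text = wire.decode("ascii")
```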
Re: [Python-Dev] bytes.from_hex()
Stephen J. Turnbull wrote: Base64 is a (family of) wire protocol(s). It's not clear to me that it makes sense to say that the alphabets used by baseNN encodings are composed of characters, Take a look at http://en.wikipedia.org/wiki/Base64 where it says "...base64 is a binary to text encoding scheme whereby an arbitrary sequence of bytes is converted to a sequence of printable ASCII characters." Also see RFC 2045 (http://www.ietf.org/rfc/rfc2045.txt) which defines base64 in terms of an encoding from octets to characters, and also says "A 65-character subset of US-ASCII is used ... This subset has the important property that it is represented identically in all versions of ISO 646 ... and all characters in the subset are also represented identically in all versions of EBCDIC." Which seems to make it perfectly clear that the result of the encoding is to be considered as characters, which are not necessarily going to be encoded using ascii. So base64 on its own is *not* a wire protocol. Only after encoding the characters do you have a wire protocol. I don't see any case for correctness here, only for convenience, I'm thinking of convenience, too. Keep in mind that in Py3k, 'unicode' will be called 'str' (or something equally neutral like 'text') and you will rarely have to deal explicitly with unicode codings, this being done mostly for you by the I/O objects. So most of the time, using base64 will be just as convenient as it is today: base64_encode(my_bytes) and write the result out somewhere. The reason I say it's *correct* is that if you go straight from bytes to bytes, you're *assuming* the eventual encoding is going to be an ascii superset. The programmer is going to have to know about this assumption and understand all its consequences and decide whether it's right, and if not, do something to change it. Whereas if the result is text, the right thing happens automatically whatever the ultimate encoding turns out to be.
You can take the text from your base64 encoding, combine it with other text from any other source to form a complete mail message or xml document or whatever, and write it out through a file object that's using any unicode encoding at all, and the result will be correct. it's also efficient to use bytes->bytes for XML, since conversion of base64 bytes to UTF-8 characters is simply a matter of "Simon says, be UTF-8!" Efficiency is an implementation concern. In Py3k, strings which contain only ascii or latin-1 might be stored as 1 byte per character, in which case this would not be an issue. And in the classroom, you're just going to confuse students by telling them that UTF-8 --[Unicode codec]--> Python string is decoding but UTF-8 --[base64 codec]--> Python string is encoding, when MAL is telling them that "--> Python string" is always decoding. Which is why I think that only *unicode* codings should be available through the .encode and .decode interface. Or alternatively there should be something more explicit like .unicode_encode and .unicode_decode that is thus restricted. Also, if most unicode coding is done in the I/O objects, there will be far less need for programmers to do explicit unicode coding in the first place, so likely it will become more of an advanced topic, rather than something you need to come to grips with on day one of using unicode, like it is now. -- Greg
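Greg's proposal (only unicode codings through .encode/.decode) is very close to what Python 3 actually did; reaching for base64 through the string method fails loudly:

```python
import codecs

# str.encode is restricted to character codecs in Python 3...
try:
    "abc".encode("base64")
    via_method = "allowed"
except LookupError:        # "'base64' is not a text encoding"
    via_method = "rejected"

# ...while the bytes->bytes transform remains available through an
# explicitly different door.
via_codecs = codecs.encode(b"abc", "base64_codec")
```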
Re: [Python-Dev] bytes.from_hex()
On Feb 22, 2006, at 6:35 AM, Greg Ewing wrote: I'm thinking of convenience, too. Keep in mind that in Py3k, 'unicode' will be called 'str' (or something equally neutral like 'text') and you will rarely have to deal explicitly with unicode codings, this being done mostly for you by the I/O objects. So most of the time, using base64 will be just as convenient as it is today: base64_encode(my_bytes) and write the result out somewhere. The reason I say it's *correct* is that if you go straight from bytes to bytes, you're *assuming* the eventual encoding is going to be an ascii superset. The programmer is going to have to know about this assumption and understand all its consequences and decide whether it's right, and if not, do something to change it. Whereas if the result is text, the right thing happens automatically whatever the ultimate encoding turns out to be. You can take the text from your base64 encoding, combine it with other text from any other source to form a complete mail message or xml document or whatever, and write it out through a file object that's using any unicode encoding at all, and the result will be correct. This makes little sense for mail. You combine *bytes*, in various and possibly different encodings to form a mail message. Some MIME sections might have a base64 Content-Transfer-Encoding, others might be 8bit encoded, others might be 7bit encoded, others might be quoted-printable encoded. Before the C-T-E encoding, you will have had to do the Content-Type encoding, converting your text into bytes with the desired character encoding: utf-8, iso-8859-1, etc. Having the final mail message be made up of characters, right before transmission to the socket would be crazy. James
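James's two-stage pipeline (Content-Type charset first, Content-Transfer-Encoding second, ending in bytes) can be sketched with stdlib pieces:

```python
import base64
import quopri

# Stage 1: the Content-Type encoding turns text into bytes.
body = "naïve café".encode("utf-8")

# Stage 2: a per-part Content-Transfer-Encoding, bytes -> bytes.
qp_part = quopri.encodestring(body)              # quoted-printable part
b64_part = base64.encodebytes(b"\x00\x01\xfe")   # base64 part

# What goes to the socket is the assembled *bytes*, never characters.
```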
Re: [Python-Dev] bytes.from_hex()
Greg Ewing [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] Efficiency is an implementation concern. It is also a user concern, especially if inefficiency overruns memory limits. In Py3k, strings which contain only ascii or latin-1 might be stored as 1 byte per character, in which case this would not be an issue. If 'might' becomes 'will', I and I suspect others will be happier with the change. And I would be happy if the choice of physical storage was pretty much handled behind the scenes, as with the direction int/long unification is going. Which is why I think that only *unicode* codings should be available through the .encode and .decode interface. Or alternatively there should be something more explicit like .unicode_encode and .unicode_decode that is thus restricted. I prefer the shorter names and using recode, for instance, for bytes to bytes. Terry Jan Reedy
Re: [Python-Dev] bytes.from_hex()
Terry Reedy wrote: Greg Ewing [EMAIL PROTECTED] wrote in message Which is why I think that only *unicode* codings should be available through the .encode and .decode interface. Or alternatively there should be something more explicit like .unicode_encode and .unicode_decode that is thus restricted. I prefer the shorter names and using recode, for instance, for bytes to bytes. While I prefer constructors with an explicit encode argument, and a recode() method for 'like to like' coding. Then the whole encode/decode confusion goes away.
Re: [Python-Dev] bytes.from_hex()
Terry Reedy wrote: Greg Ewing [EMAIL PROTECTED] wrote in message Efficiency is an implementation concern. It is also a user concern, especially if inefficiency overruns memory limits. Sure, but what I mean is that it's better to find what's conceptually right and then look for an efficient way of implementing it, rather than letting the implementation drive the design. -- Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand. Carpe post meridiam! (I'm not a morning person.)
Re: [Python-Dev] bytes.from_hex()
Ron Adam wrote: While I prefer constructors with an explicit encode argument, and use a recode() method for 'like to like' coding. Then the whole encode/decode confusion goes away.

I'd be happy with that, too.

-- Greg
Re: [Python-Dev] bytes.from_hex()
James Y Knight wrote: Some MIME sections might have a base64 Content-Transfer-Encoding, others might be 8bit encoded, others might be 7bit encoded, others might be quoted-printable encoded.

I stand corrected -- in that situation you would have to encode the characters before combining them with other material. However, this doesn't change my view that the result of base64 encoding by itself is characters, not bytes. To go straight to bytes would require assuming an encoding, and that would make it *harder* to use in cases where you wanted a different encoding, because you'd first have to undo the default encoding and then re-encode it using the one you wanted. It may be reasonable to provide an easy way to go straight from raw bytes to ascii-encoded-base64 bytes, but that should be a different codec. The plain base64 codec should produce text.

-- Greg
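Greg's two-codec split -- a plain base64 codec that yields text, and a separate shortcut codec for ascii-encoded-base64 bytes -- might look like this; the helper names are ours, and the stdlib base64 module only supplies the arithmetic:

```python
import base64

def base64_encode(raw: bytes) -> str:
    """The 'plain' codec: binary data in, characters out."""
    return base64.b64encode(raw).decode("ascii")

def base64_encode_ascii(raw: bytes) -> bytes:
    """The shortcut codec: straight to ascii-encoded-base64 bytes."""
    return base64.b64encode(raw)

text = base64_encode(b"\x00\xff")
assert isinstance(text, str)
# The text result can still be encoded any way the channel requires:
utf16_form = text.encode("utf-16-le")
ascii_form = base64_encode_ascii(b"\x00\xff")
assert utf16_form != ascii_form
```

The point of the split is that base64_encode() leaves the final byte representation open, while base64_encode_ascii() commits to one up front.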
Re: [Python-Dev] bytes.from_hex()
Greg == Greg Ewing [EMAIL PROTECTED] writes: Greg Stephen J. Turnbull wrote: Base64 is a (family of) wire protocol(s). It's not clear to me that it makes sense to say that the alphabets used by baseNN encodings are composed of characters, Greg Take a look at [this that the other] Those references use character in an ambiguous and ill-defined way. Trying to impose Python unicode object semantics on vague characters is a bad idea IMO. Greg Which seems to make it perfectly clear that the result of Greg the encoding is to be considered as characters, which are Greg not necessarily going to be encoded using ascii. Please define character, and explain how its semantics map to Python's unicode objects. Greg So base64 on its own is *not* a wire protocol. Only after Greg encoding the characters do you have a wire protocol. No, base64 isn't a wire protocol. Rather, it's a schema for a family of wire protocols, whose alphabets are heuristically chosen on the assumption that code units which happen to correspond to alpha-numeric code points in a commonly-used coded character set are more likely to pass through a communication channel without corruption. Note that I have _precisely_ defined what I mean. You still have the problem that you haven't defined character, and that is a real problem, see below. I don't see any case for correctness here, only for convenience, Greg I'm thinking of convenience, too. Keep in mind that in Py3k, Greg 'unicode' will be called 'str' (or something equally neutral Greg like 'text') and you will rarely have to deal explicitly Greg with unicode codings, this being done mostly for you by the Greg I/O objects. So most of the time, using base64 will be just Greg as convenient as it is today: base64_encode(my_bytes) and Greg write the result out somewhere. Convenient, yes, but incorrect. 
Once you mix those bytes with the Python string type, they become subject to all the usual operations on characters, and there's no way for Python to tell you that you didn't want to do that. Ie,

Greg Whereas if the result is text, the right thing happens Greg automatically whatever the ultimate encoding turns out to Greg be. You can take the text from your base64 encoding, combine Greg it with other text from any other source to form a complete Greg mail message or xml document or whatever, and write it out Greg through a file object that's using any unicode encoding at Greg all, and the result will be correct.

Only if you do no transformations that will harm the base64-encoding. This is why I say base64 is _not_ based on characters, at least not in the way they are used in Python strings. It doesn't allow any of the usual transformations on characters that might be applied globally to a mail composition buffer, for example. In other words, you don't escape from the programmer having to know what he's doing. EIBTI, and the setup I advocate forces the programmer to explicitly decide where to convert base64 objects to a textual representation. This reminds him that he'd better not touch that text.

Greg The reason I say it's *correct* is that if you go straight Greg from bytes to bytes, you're *assuming* the eventual encoding Greg is going to be an ascii superset. The programmer is going Greg to have to know about this assumption and understand all its Greg consequences and decide whether it's right, and if not, do Greg something to change it.

I'm not assuming any such thing, except in the context of analysis of implementation efficiency. And the programmer needs to know about the semantics of text that is actually a base64-encoded object, and that they are different from string semantics.
This is something that programmers are used to dealing with in the case of Python 2.x str and C char[]; the whole point of the unicode type is to allow the programmer to abstract from that when dealing with human-readable text. Why confuse the issue? And in the classroom, you're just going to confuse students by telling them that UTF-8 --[Unicode codec]--> Python string is decoding but UTF-8 --[base64 codec]--> Python string is encoding, when MAL is telling them that --> Python string is always decoding.

Greg Which is why I think that only *unicode* codings should be Greg available through the .encode and .decode interface. Or Greg alternatively there should be something more explicit like Greg .unicode_encode and .unicode_decode that is thus restricted. Greg Also, if most unicode coding is done in the I/O objects, Greg there will be far less need for programmers to do explicit Greg unicode coding in the first place, so likely it will become Greg more of an advanced topic, rather than something you need to Greg come to grips with on day one of using unicode, like it is Greg now.

So then you bring it
Re: [Python-Dev] bytes.from_hex()
Ron == Ron Adam [EMAIL PROTECTED] writes:

Ron Terry Reedy wrote: I prefer the shorter names and using recode, for instance, for bytes to bytes.

Ron While I prefer constructors with an explicit encode argument, Ron and use a recode() method for 'like to like' coding.

'Recode' is a great name for the conceptual process, but the methods are directional. Also, in internationalization work, recode strongly connotes encodingA -> original -> encodingB, as in iconv. I do prefer constructors, as it's generally not a good idea to do encoding/decoding in-place for human-readable text, since the codecs are often lossy.

Ron Then the whole encode/decode confusion goes away.

Unlikely. Errors like a_string.encode('base64').encode('base64') are all too easy to commit in practice.

--
School of Systems and Information Engineering  http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba, Tennodai 1-1-1, Tsukuba 305-8573 JAPAN
Ask not how you can do free software business; ask what your business can do for free software.
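The double-encoding slip Stephen describes is easy to reproduce with a bytes-to-bytes base64 (here, the stdlib base64 module); nothing in the types catches the second call:

```python
import base64

data = b"hello"
once = base64.b64encode(data)    # intended
twice = base64.b64encode(once)   # the slip: base64-encoding the encoding

# No type error flags the mistake -- decoding once quietly yields the
# intermediate base64 text instead of the payload.
assert base64.b64decode(twice) == once
assert base64.b64decode(base64.b64decode(twice)) == data
```

With a typed scheme where base64 encoding takes bytes and produces text, the second encode() would simply not exist on the result, and the slip would be a visible error.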
Re: [Python-Dev] bytes.from_hex()
Stephen J. Turnbull wrote: What I advocate for Python is to require that the standard base64 codec be defined only on bytes, and always produce bytes.

I don't understand that. It seems quite clear to me that base64 encoding (in the general sense of encoding, not the unicode sense) takes binary data (bytes) and produces characters. That's the whole point of base64 -- so you can send arbitrary data over a channel that is only capable of dealing with characters. So in Py3k the correct usage would be

  original bytes --[base64 encode]--> unicode --[unicode encode(x)]--> bytes for transmission
  bytes for transmission --[unicode decode(x)]--> unicode --[base64 decode]--> original bytes

where x is whatever unicode encoding the transmission channel uses for characters (probably ascii or an ascii superset, but not necessarily). So, however it's spelled, the typing is such that

  base64_encode(bytes) --> unicode
  base64_decode(unicode) --> bytes

-- Greg
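Greg's typing can be exercised end to end with the stdlib base64 module standing in for the codec; the utf-16-le channel below is purely an illustrative assumption:

```python
import base64

payload = b"\x89PNG\r\n\x1a\n"   # some arbitrary binary data

# base64_encode(bytes) --> unicode
text = base64.b64encode(payload).decode("ascii")

# unicode --> bytes for transmission, using the channel's encoding x
wire = text.encode("utf-16-le")

# Receiver: decode(x) back to characters, then base64_decode(unicode) --> bytes
recovered = base64.b64decode(wire.decode("utf-16-le"))
assert recovered == payload
```

Nothing here assumes the channel is ascii-compatible: swap "utf-16-le" for any unicode encoding and the round trip still works, which is exactly the property Greg is arguing for.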
Re: [Python-Dev] bytes.from_hex()
On Sun, 2006-02-19 at 23:30 +0900, Stephen J. Turnbull wrote: M == M.-A. Lemburg [EMAIL PROTECTED] writes: M * for Unicode codecs the original form is Unicode, the derived M form is, in most cases, a string First of all, that's Martin's point! Second, almost all Americans, a large majority of Japanese, and I would bet most Western Europeans would say you have that backwards. That's the problem, and it's the Unicode advocates' problem (ie, ours), not the users'. Even if we're right: education will require lots of effort. Rather, we should just make it as easy as possible to do it right, and hard to do it wrong.

I think you've hit the nail squarely on the head. Even though I /know/ what the intended semantics are, the originality of the string form is deeply embedded in my nearly 30 years of programming experience, almost all of it completely American English-centric. I always have to stop and think about which direction .encode() and .decode() go in because it simply doesn't feel natural. Or more simply put, my brain knows what's right, but my heart doesn't and that's why converting from one to the other is always a hiccup in the smooth flow of coding. And while I'm sympathetic to MAL's design decisions, the overlaying of the generalizations doesn't help.

-Barry
Re: [Python-Dev] bytes.from_hex()
Josiah Carlson wrote: It doesn't seem strange to you to need to encode data twice to be able to have a usable sequence of characters which can be embedded in an effectively 7-bit email;

I'm talking about a 3.0 world where all strings are unicode and the unicode -> external coding is for the most part done automatically by the I/O objects. So you'd be building up your whole email as a string (aka unicode) which happens to only contain code points in the range 0..127, and then writing it to your socket or whatever. You wouldn't need to do the second encoding step explicitly very often.

-- Greg
Re: [Python-Dev] bytes.from_hex()
Martin == Martin v Löwis [EMAIL PROTECTED] writes: Martin Stephen J. Turnbull wrote: Bengt The characters in b could be encoded in plain ascii, or Bengt utf16le, you have to know. Which base64 are you thinking about? Both RFC 3548 and RFC 2045 (MIME) specify subsets of US-ASCII explicitly. Martin Unfortunately, it is ambiguous as to whether they refer to Martin US-ASCII, the character set, or US-ASCII, the encoding. True for RFC 3548, but the authors of RFC 2045 clearly had the encoding in mind, since they depend on RFC 822. Martin It appears that RFC 3548 talks about the character set Martin only: OK, although RFC 3548 cites RFC 20 (!) as its source for US-ASCII, which clearly has bytes (though not necessarily octets) in mind, it doesn't actually restrict base encoding to be a subset of US-ASCII. On the other hand, RFC 3548 doesn't define base64 (or any other base encoding), it simply provides a set of requirements that a conforming implementation must satisfy. Python can therefore choose to define its base64 as a bytes-bytes codec, with the alphabet drawn from US-ASCII interpreted as encoding. I would definitely prefer that, as png_image = unicode.encode('base64') violates MAL's intuitive schema for the method. Martin For an example where base64 is *not* necessarily Martin ASCII-encoded, see the binary data type in XML Martin Schema. There, base64 is embedded into an XML document, Martin and uses the encoding of the entire XML document. As a Martin result, you may get base64 data in utf16le. I'll have to take a look. It depends on whether base64 is specified as an octet-stream to Unicode stream transformation or as an embedding of an intermediate representation into Unicode. Granted, defining the base64 alphabet as a subset of Unicode seems like the logical way to do it in the context of XML. P.S. My apologies for munging your name in the To: header. I'm having problems with my MUA. 
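Martin's XML point is easy to demonstrate: embedded in a utf-16 document, the base64 alphabet is no longer a run of ASCII bytes on the wire. A stdlib-only sketch:

```python
import base64

b64_text = base64.b64encode(b"\x01\x02").decode("ascii")  # "AQI="
xml = "<data>{}</data>".format(b64_text)
wire = xml.encode("utf-16-le")

# Each base64 character now occupies two bytes in the document,
# so the ASCII byte sequence b"AQI=" never appears contiguously:
assert b"A\x00Q\x00I\x00=\x00" in wire
assert b"AQI=" not in wire
```

This is why defining base64 as strictly bytes-to-ASCII-bytes would force an extra recode step for any non-ASCII-superset document encoding.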
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
On Sat, 18 Feb 2006 23:33:15 +0100, Thomas Wouters [EMAIL PROTECTED] wrote: On Sat, Feb 18, 2006 at 01:21:18PM +0100, M.-A. Lemburg wrote: [...] - The return value for the non-unicode encodings depends on the value of the encoding argument. Not really: you'll always get a basestring instance.

But actually basestring is a weird graft of semantic apples and empty bags IMO. unicode is essentially an abstract character vector type, and str is an abstract binary octet vector type having nothing to do with characters except by inferential association with an encoding.

Which is not a particularly useful distinction, since in any real world application, you have to be careful not to mix unicode with (non-ascii) bytestrings. The only way to reliably deal with unicode is to have it well-contained (when migrating an application from using bytestrings to using unicode) or to use unicode everywhere, decoding/encoding at entrypoints. Containment is hard to achieve.

Still, I believe that this is an educational problem. There are a couple of gotchas users will have to be aware of (and this is unrelated to the methods in question):

* encoding always refers to transforming original data into a derived form

ISTM encoding separates type information from the source and sets it aside as the identity of the encoding, and renders the data in a composite of more primitive types, octets being the most primitive short of bits.

* decoding always refers to transforming a derived form of data back into its original form

Decoding of a composite of primitives requires additional separate information (namely identification of the encoding) to create a higher composite type.

* for Unicode codecs the original form is Unicode, the derived form is, in most cases, a string

You mean a str instance, right? Where the original type as character vector is gone. That's not a string in the sense of character string.
As a result, if you want to use a Unicode codec such as utf-8, you encode Unicode into a utf-8 string and decode a utf-8 string into Unicode. s/string/str instance/

Encoding a string is only possible if the string itself is original data, e.g. some data that is supposed to be transformed into a base64 encoded form.

Note what base64 really is for. Its essence is to create a _character_ sequence which can succeed in being encoded as ascii. The concept of base64 going str -> str is really a mental shortcut for s_str.decode('base64').encode('ascii'), where 3 octets are decoded as code for 4 characters modulo padding logic.

Decoding Unicode is only possible if the Unicode string itself represents a derived form, e.g. a sequence of hex literals.

Again, it's an abbreviation, e.g. print u'4cf6776973'.encode('hex_chars_to_octets').decode('latin-1') should print Löwis.

Most of these gotchas would not have been gotchas had encode/decode only been usable for unicode encodings. That is why I disagree with the hypergeneralization of the encode/decode methods [..] That's because you only look at one specific task. Codecs also unify the various interfaces to common encodings such as base64, uu or zip which are not Unicode related.

I think the trouble is that these view the transformations as octets -> octets, whereas IMO decoding should always result in a container type that knows what it is semantically, without association with external use-this-codec information. IOW, octets.decode('zip') -> archive and archive.encode('bzip') -> octets. You could even subclass octets to make an archive that knows it's an octet vector representing a decoded zip, so it can have an encode method that could (specifying 'zip' again) encode itself back to the original zip, or an alternate method to encode itself as something else, which you couldn't do from plain octets without specifying both transformations at once (hence the .recode idea, but I don't think that is as pure).
The constructor for the container type could also be used, like Archive(octets, 'zip'), analogous to unicode('abc', 'ascii'). IOW, octets + decoding info -> container type instance, and container type instance + encoding info -> octets.

No, I think you misunderstand. I object to the hypergeneralization of the *encode/decode methods*, not the codec system. I would have been fine with another set of methods for non-unicode transformations. Although I would have been even more fine if they got their encoding not as a string, but as, say, a module object, or something imported from a module. Not that I think any of this matters; we have what we have and I'll have to live with it ;)

Probably. BTW, you may notice I'm saying octet instead of bytes. I have another post on that, arguing that the basic binary information type should be octet, since binary files are made of octets that have no intrinsic numerical or character significance. See other post if interested ;-)

Regards, Bengt Richter
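Bengt's "container type that knows what it is" can be sketched as a bytes subclass; the Archive class, the decode_octets helper, and the use of zlib to stand in for 'zip' are all our assumptions, not his specification:

```python
import zlib

class Archive(bytes):
    """Octets that remember their decoded semantics (Bengt's idea)."""

    def encode_as(self, codec: str) -> bytes:
        # Only transformations that make sense for this type are offered,
        # so you never need to re-specify what the data already is.
        if codec == "zlib":
            return zlib.compress(self)
        raise LookupError("unknown encoding: %r" % codec)

def decode_octets(octets: bytes, codec: str) -> Archive:
    """octets + decoding info -> container type instance"""
    if codec == "zlib":
        return Archive(zlib.decompress(octets))
    raise LookupError("unknown encoding: %r" % codec)

blob = zlib.compress(b"payload")
arc = decode_octets(blob, "zlib")      # like Archive(octets, 'zip')
assert bytes(arc) == b"payload"
assert zlib.decompress(arc.encode_as("zlib")) == b"payload"
```

The design point is that the codec name is supplied once, at construction; afterwards the object itself carries the semantic information that plain octets would lose.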
Re: [Python-Dev] bytes.from_hex()
Josiah == Josiah Carlson [EMAIL PROTECTED] writes:

Josiah I try to internalize it by not thinking of strings as Josiah encoded data, but as binary data, and unicode as text. I Josiah then remind myself that unicode isn't native on-disk or Josiah cross-network (which stores and transports bytes, not Josiah characters), so one needs to encode it as binary data. Josiah It's a subtle difference, but it has worked so far for me.

Seems like a lot of work for something that for monolingual usage should Just Work almost all of the time.

Josiah I notice that you seem to be in Japan, so teaching unicode Josiah is a must.

Yes. Japan is more complicated than that, but in Python unicode is a must.

Josiah If you are using the unicode is text and strings are Josiah data, and they aren't getting it; then I don't know.

Well, I can tell you that they don't get it. One problem is PEP 263. It makes it very easy to write programs that do line-oriented I/O with input() and print, and the students come to think it should always be that easy. Since Japan has at least 6 common encodings that students encounter on a daily basis while browsing the web, plus a couple more that live inside of MSFT Word and Java, they're used to huge amounts of magic. The normal response of novice programmers is to mandate that users of their programs use the encoding of choice and put it in ordinary strings so that it just works. Ie, the average student just eats the F on the codecs assignment, and writes the rest of her programs without them.

Martin's simple rule *is* simple, and the exceptions for using a nonexistent method mean I don't have to reinforce---the students will be able to teach each other. The exceptions also directly help reinforce the notion that text == Unicode.

Josiah Are you sure that they would help? If .encode() and Josiah .decode() drop from strings and unicode (respectively), Josiah they get an AttributeError. That's almost useless.

Well, I'm not _sure_, but this is the kind of thing that you can learn by rote.
And it will happen on a sufficiently regular basis that a large fraction of students will experience it. They'll ask each other, and usually they'll find a classmate who knows what happened. I haven't tried this with codecs, but that's been my experience with statistical packages where some routines understand non-linear equations but others insist on linear equations.[1] The error messages (Equation is non-linear! Aaugh!) are not much more specific than AttributeError.

Josiah Raising a better exception (with more information) would Josiah be better in that case, but losing the functionality that Josiah either would offer seems unnecessary;

Well, the point is that for the usual suspects (ie, Unicode codecs) there is no functionality that would be lost. As MAL pointed out, for these codecs the original text is always Unicode; that's the role Unicode is designed for, and by and large it fits the bill very well. With few exceptions (such as rot13) the derived text will be bytes that peripherals such as keyboards and terminals can generate and display.

Josiah You are trying to encode/decode to/from incompatible Josiah types. expected: a->b got: x->y is better. Some of those Josiah can be done *very soon*, given the capabilities of the Josiah encodings module,

That's probably the way to go. If we can have a derived Unicode codec class that does this, that would pretty much entirely serve the need I perceive. Beginning students could learn to write iconv.py, more advanced students could learn to create codec stacks to generate MIME bodies, which could include base64 or quoted-printable bytes -> bytes codecs.

Footnotes:
[1] If you're not familiar with regression analysis, the problem is that the equation z = a*log(x) + b*log(y) where a and b are to be estimated is _linear_ in the sense that x, y, and z are data series, and X = log(x) and Y = log(y) can be precomputed so that the equation actually computed is z = a*X + b*Y.
On the other hand z = a*(x + b*y) is _nonlinear_ because of the coefficient on y being a*b. Students find this hard to grasp in the classroom, but they learn quickly in the lab. I believe the parameter/variable inversion that my students have trouble with in statistics is similar to the original/derived inversion that happens with text you can see (derived, string) and abstract text inside the program (original, Unicode).
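The iconv.py exercise Stephen mentions reduces to a one-line core once "text == Unicode" is internalized; this minimal sketch (function name and all) is ours:

```python
def iconv(data: bytes, from_charset: str, to_charset: str) -> bytes:
    # Decode to the one true plain text, then encode for the new channel.
    return data.decode(from_charset).encode(to_charset)

name = "L\u00f6wis"
assert iconv(name.encode("latin-1"), "latin-1", "utf-8") == name.encode("utf-8")
# ASCII survives any ASCII-superset charset unchanged:
assert iconv(b"plain ascii", "ascii", "shift_jis") == b"plain ascii"
```

The pedagogical point is the shape of the pipeline: bytes only ever decode, unicode only ever encodes, so neither direction can be gotten backwards.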
Re: [Python-Dev] bytes.from_hex()
Stephen J. Turnbull wrote: Martin For an example where base64 is *not* necessarily Martin ASCII-encoded, see the binary data type in XML Martin Schema. There, base64 is embedded into an XML document, Martin and uses the encoding of the entire XML document. As a Martin result, you may get base64 data in utf16le. I'll have to take a look. It depends on whether base64 is specified as an octet-stream to Unicode stream transformation or as an embedding of an intermediate representation into Unicode. Granted, defining the base64 alphabet as a subset of Unicode seems like the logical way to do it in the context of XML.

Please do take a look. It is the only way: If you were to embed base64 *bytes* into character data content of an XML element, the resulting XML file might not be well-formed anymore (if the encoding of the XML file is not an ASCII superencoding).

Regards, Martin
Re: [Python-Dev] bytes.from_hex()
Martin == Martin v Löwis [EMAIL PROTECTED] writes: Martin Please do take a look. It is the only way: If you were to Martin embed base64 *bytes* into character data content of an XML Martin element, the resulting XML file might not be well-formed Martin anymore (if the encoding of the XML file is not an ASCII Martin superencoding).

Excuse me, I've been doing category theory recently. By embedding I mean a map from an intermediate object which is a stream of bytes to the corresponding stream of characters. In the case of UTF-16-coded characters, this would necessarily imply a representation change, as you say. What I advocate for Python is to require that the standard base64 codec be defined only on bytes, and always produce bytes. Any representation change should be done explicitly. This is surely conformant with RFC 2045's definition and with RFC 3548.
Re: [Python-Dev] bytes.from_hex()
On Feb 20, 2006, at 7:25 PM, Stephen J. Turnbull wrote: Martin == Martin v Löwis [EMAIL PROTECTED] writes: Martin Please do take a look. It is the only way: If you were to Martin embed base64 *bytes* into character data content of an XML Martin element, the resulting XML file might not be well-formed Martin anymore (if the encoding of the XML file is not an ASCII Martin superencoding). Excuse me, I've been doing category theory recently. By embedding I mean a map from an intermediate object which is a stream of bytes to the corresponding stream of characters. In the case of UTF-16-coded characters, this would necessarily imply a representation change, as you say. What I advocate for Python is to require that the standard base64 codec be defined only on bytes, and always produce bytes. Any representation change should be done explicitly. This is surely conformant with RFC 2045's definition and with RFC 3548.

+1

-bob
Re: [Python-Dev] bytes.from_hex()
M.-A. Lemburg [EMAIL PROTECTED] writes: Martin v. Löwis wrote: M.-A. Lemburg wrote: True. However, note that the .encode()/.decode() methods on strings and Unicode narrow down the possible return types. The corresponding .bytes methods should only allow bytes and Unicode. I forgot that: what is the rationale for that restriction? To assure that only those types can be returned from those methods, ie. instances of basestring, which in return permits type inference for those methods.

Hmm. So it is for type inference. Where is that documented?

Somewhere in the python-dev mailing list archives ;-) Seriously, we should probably add this to the documentation.

Err.. I don't think this is a good argument, for quite a few reasons. There certainly aren't many other features in Python designed to aid type inference, and the knowledge that something returns unicode or str isn't especially useful...

Cheers, mwh

-- ROOSTA: Ever since you arrived on this planet last night you've been going round telling people that you're Zaphod Beeblebrox, but that they're not to tell anyone else. -- The Hitch-Hikers Guide to the Galaxy, Episode 7
Re: [Python-Dev] bytes.from_hex()
Ian == Ian Bicking [EMAIL PROTECTED] writes: Ian Encodings cover up eclectic interfaces, where those Ian interfaces fit a basic pattern -- data in, data out.

Isn't filter the word you're looking for? I think you've just made a very strong case that this is a slippery slope that we should avoid.
Re: [Python-Dev] bytes.from_hex()
M == M.-A. Lemburg [EMAIL PROTECTED] writes: M Martin v. Löwis wrote: No. The reason to ban string.decode and bytes.encode is that it confuses users. M Instead of starting to ban everything that can potentially M confuse a few users, we should educate those users and tell M them what these methods mean and how they should be used.

ISTM it's neither potential nor a few. As Aahz pointed out, for the common use of text I/O it requires only a single clue (Unicode is The One True Plain Text, everything else must be decoded to Unicode before use.) and you don't need any education about how to use codecs under Martin's restrictions; you just need to know which ones to use. This is not a benefit to be given up lightly.

Would it be reasonable to put those restrictions in the codecs? Ie, so that bytes().encode('gzip') is allowed for the generic codec 'gzip', but bytes().encode('utf-8') is an error for the charset codec 'utf-8'?
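Stephen's question -- put the restriction in the codec itself -- could be prototyped with a small dispatcher; the CHARSET_CODECS table, the helper name, and the use of zlib to stand in for 'gzip' are all hypothetical:

```python
import zlib

CHARSET_CODECS = {"ascii", "latin-1", "utf-8", "utf-16"}

def bytes_encode(data: bytes, codec: str) -> bytes:
    """Allow generic bytes-to-bytes codecs; reject charset codecs."""
    if codec in CHARSET_CODECS:
        raise TypeError("charset codec %r is not defined on bytes" % codec)
    if codec == "zlib":            # standing in for 'gzip' in the example
        return zlib.compress(data)
    raise LookupError("unknown codec: %r" % codec)

assert zlib.decompress(bytes_encode(b"x" * 100, "zlib")) == b"x" * 100
try:
    bytes_encode(b"x", "utf-8")
except TypeError:
    pass  # exactly the error the restriction is meant to raise
else:
    raise AssertionError("charset codec should have been rejected")
```

Classifying codecs this way keeps the single clue intact: charset codecs only ever decode bytes and encode text, while generic codecs remain available on bytes.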
Re: [Python-Dev] bytes.from_hex()
M == M.-A. Lemburg [EMAIL PROTECTED] writes: M The main reason is symmetry and the fact that strings and M Unicode should be as similar as possible in order to simplify M the task of moving from one to the other. Those are perfectly compatible with Martin's suggestion. M Still, I believe that this is an educational problem. There are M a couple of gotchas users will have to be aware of (and this is M unrelated to the methods in question): But IMO that's wrong, both in attitude and in fact. As for attitude, users should not have to be aware of these gotchas. Codec writers, on the other hand, should be required to avoid presenting users with those gotchas. Martin's draconian restriction is in the right direction, but you can argue it goes way too far. In fact, of course it's related to the methods in question. Original vs derived data can only be defined in terms of some notion of the usual semantics of the streams, and that is going to be strongly reflected in the semantics of the methods. M * encoding always refers to transforming original data into a M derived form M * decoding always refers to transforming a derived form of M data back into its original form Users *already* know that; it's a very strong connotation of the English words. The problem is that users typically have their own concept of what's original and what's derived. For example: M * for Unicode codecs the original form is Unicode, the derived M form is, in most cases, a string First of all, that's Martin's point! Second, almost all Americans, a large majority of Japanese, and I would bet most Western Europeans would say you have that backwards. That's the problem, and it's the Unicode advocates' problem (ie, ours), not the users'. Even if we're right: education will require lots of effort. Rather, we should just make it as easy as possible to do it right, and hard to do it wrong. BTW, what use cases do you have in mind for Unicode - Unicode decoding? 
Maximally decomposed forms and/or eliminating compatibility characters etc? Very specialized. M Codecs also unify the various interfaces to common encodings M such as base64, uu or zip which are not Unicode related. Now this is useful and has use cases I've run into, for example in email, where you would like to use the same interface for base64 as for shift_jis and you'd like to be able to write

def encode-mime-body (string, codec-list):
    if codec-list[0] not in charset-codec-list:
        raise NotCharsetCodecException
    if len (codec-list) > 1 and codec-list[-1] not in transfer-codec-list:
        raise NotTransferCodecException
    for codec in codec-list:
        string = string.encode (codec)
    return string

mime-body = encode-mime-body ("This is a pen.", [ 'shift_jis', 'zip', 'base64' ])

I guess I have to admit I'm backtracking from my earlier hardline support for Martin's position, but I'm still sympathetic: (a) that's the direct way to make it easy to do it right, and (b) I still think the use cases for non-Unicode codecs are YAGNI very often.
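The stacking idea above can be written as a runnable sketch in today's Python 3, where 'zip' is spelled `zlib_codec` and 'base64' is `base64_codec`; the two whitelists are hypothetical stand-ins for `charset-codec-list` and `transfer-codec-list`:

```python
import codecs

CHARSET_CODECS = {"shift_jis", "utf-8", "ascii"}     # hypothetical whitelist
TRANSFER_CODECS = {"base64_codec", "quopri_codec"}   # hypothetical whitelist

def encode_mime_body(text, codec_list):
    if codec_list[0] not in CHARSET_CODECS:
        raise ValueError("first codec must be a charset codec")
    if len(codec_list) > 1 and codec_list[-1] not in TRANSFER_CODECS:
        raise ValueError("last codec must be a transfer codec")
    data = text.encode(codec_list[0])      # text -> bytes
    for name in codec_list[1:]:            # then bytes -> bytes transforms
        data = codecs.encode(data, name)
    return data

mime_body = encode_mime_body("This is a pen.",
                             ["shift_jis", "zlib_codec", "base64_codec"])
```

Decoding simply applies `codecs.decode` in the reverse order, ending with a charset decode back to text.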
Re: [Python-Dev] bytes.from_hex()
Josiah == Josiah Carlson [EMAIL PROTECTED] writes: Josiah The question remains: is str.decode() returning a string Josiah or unicode depending on the argument passed, when the Josiah argument quite literally names the codec involved, Josiah difficult to understand? I don't believe so; am I the Josiah only one? Do you do any of the user education *about codec use* that you recommend? The people I try to teach about coding invariably find it difficult to understand. The problem is that the near-universal intuition is that for human-usable text, pretty much anything *but Unicode* will do. This is a really hard block to get them past. There is very good reason why Unicode is plain text ("original" in MAL's terms) and everything else is encoded ("derived"), but students new to the concept often take a while to get it. Maybe it's just me, but whether it's the teacher or the students, I am *not* excited about the education route. Martin's simple rule *is* simple, and the exceptions for using a nonexistent method mean I don't have to reinforce---the students will be able to teach each other. The exceptions also directly help reinforce the notion that text == Unicode. I grant the point that .decode('base64') is useful, but I also believe that education is a lot more easily said than done in this case.
Re: [Python-Dev] bytes.from_hex()
Bob == Bob Ippolito [EMAIL PROTECTED] writes: Bob On Feb 17, 2006, at 8:33 PM, Josiah Carlson wrote: But you aren't always getting *unicode* text from the decoding of bytes, and you may be encoding bytes *to* bytes: Please note that I presumed that you can indeed assume that decoding of bytes always results in unicode, and encoding of unicode always results in bytes. I believe Guido made the proposal relying on that assumption too. The constructor notation makes no sense for making an object of the same type as the original unless it's a copy constructor. You could argue that the base64 language is indeed a different language from the bytes language, and I'd agree. But since there's no way in Python to determine whether a string that conforms to base64 is supposed to be base64 or bytes, it would be a very bad idea to interpret the distinction as one of type. b2 = bytes(b, "base64") b3 = bytes(b2, "base64") Which direction are we going again? Bob This is *exactly* why the current set of codecs are INSANE. Bob unicode.encode and str.decode should be used *only* for Bob unicode codecs. Byte transforms are entirely different Bob semantically and should be some other method pair. General filters are semantically different, I agree. But encode and decode in English are certainly far more general than character coding conversion. The use of those methods for any stream conversion that is invertible (eg, compression or encryption) is not insane. It's just pedagogically inconvenient given the existing confusion (outside of python-dev, of course <wink>) about character coding issues. I'd like to rephrase your statement as "*only* unicode.encode and str.decode should be used for unicode codecs." Ie, str.encode(codec) and unicode.decode(codec) should raise errors if codec is a unicode codec. The question in my mind is whether we should allow other kinds of codecs or not.
I could live with not <wink>, but if we're going to have other kinds of codecs, I think they should have concrete signatures. Ie, basestring -> basestring shouldn't be allowed. Content transfer encodings like BASE64 and quoted-printable, compression, encryption, etc IMO should be bytes -> bytes. Overloading to unicode -> unicode is sorta plausible for BASE64 or QP, but YAGNI. OTOH, the Unicode standard does define a number of unicode -> unicode transformations, and it might make sense to generalize to case conversions etc. (Note that these conversions are pseudo-invertible, so you can think of them as generalized .encode/.decode pairs. The inverse is usually the identity, which seems weird, but from the pedagogical standpoint you could handle that weirdness by raising an error if the .encode method were invoked.) To be concrete, I could imagine writing

s2 = s1.decode('upcase')
if s2 == s1:
    print "Why are you shouting at me?"
else:
    print "I like calm, well-spoken snakes."
s3 = s2.encode('upcase')
if s3 == s2:
    print "Never fails!"
else:
    print "See a vet; your Python is *very* sick."

I chose the decode method to do the non-trivial transformation because .decode()'s value is supposed to be original text in MAL's terms. And that's true of uppercase-only text; you're still supposed to be able to read it, so I guess it's not encoded. That's pretty pedantic; I think it's better to raise on .encode('upcase').
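The hypothetical 'upcase' codec above can actually be registered with the stock `codecs` machinery; the codec itself is invented for illustration, and a str -> str codec has to be driven through `codecs.encode`/`codecs.decode` in modern Python, since the `str.encode` method insists on bytes output. Following the text, decode does the non-trivial transformation and encode is the identity:

```python
import codecs

def _upcase_encode(s, errors="strict"):
    # the trivial direction: leave the text alone
    return s, len(s)

def _upcase_decode(s, errors="strict"):
    # the non-trivial direction: "original" text is uppercase
    return s.upper(), len(s)

def _search(name):
    if name == "upcase":
        return codecs.CodecInfo(_upcase_encode, _upcase_decode, name="upcase")
    return None

codecs.register(_search)

s2 = codecs.decode("why are you shouting at me?", "upcase")
```

`codecs.encode(s2, "upcase")` then returns `s2` unchanged, matching the "Never fails!" branch of the example.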
Re: [Python-Dev] bytes.from_hex()
Bengt == Bengt Richter [EMAIL PROTECTED] writes: Bengt The characters in b could be encoded in plain ascii, or Bengt utf16le, you have to know. Which base64 are you thinking about? Both RFC 3548 and RFC 2045 (MIME) specify subsets of US-ASCII explicitly.
Re: [Python-Dev] bytes.from_hex()
Stephen J. Turnbull wrote: BTW, what use cases do you have in mind for Unicode -> Unicode decoding? I think rot13 falls into that category: it is a transformation on text, not on bytes. For other odd cases: base64 goes Unicode -> bytes in the *decode* direction, not in the encode direction. Some may argue that base64 is bytes, not text, but in many applications, you can combine base64 (or uuencode) with arbitrary other text in a single stream. Of course, it could be required that you go u.encode("ascii").decode("base64").

def encode-mime-body (string, codec-list):
    if codec-list[0] not in charset-codec-list:
        raise NotCharsetCodecException
    if len (codec-list) > 1 and codec-list[-1] not in transfer-codec-list:
        raise NotTransferCodecException
    for codec in codec-list:
        string = string.encode (codec)
    return string

mime-body = encode-mime-body ("This is a pen.", [ 'shift_jis', 'zip', 'base64' ])

I think this is an example where you *should* use the codec API, as designed. As that apparently requires streams for stacking (ie. no support for codec stacking), you would have to write

def encode_mime_body(string, codec_list):
    stack = output = cStringIO.StringIO()
    for codec in reversed(codec_list):
        stack = codecs.getwriter(codec)(stack)
    stack.write(string)
    stack.reset()
    return output.getvalue()

Notice that you have to start the stacking with the last codec, and you have to keep a reference to the StringIO object where the actual bytes end up. Regards, Martin P.S. some LISP shows through in your Python code :-)
Re: [Python-Dev] bytes.from_hex()
Stephen J. Turnbull wrote: Do you do any of the user education *about codec use* that you recommend? The people I try to teach about coding invariably find it difficult to understand. The problem is that the near-universal intuition is that for human-usable text, pretty much anything *but Unicode* will do. It really is a matter of education. For the first time in my career, I have been teaching the first-semester programming course, and I was happy to see that the textbook already has a section on text and Unicode (actually, I selected the textbook also based on whether there was good discussion of that aspect). So I spent quite some time on data representation (integrals, floats, characters), and I hope that the students now got it. If they didn't learn it that way in the first semester (or already got mis-educated in high school), it will be very hard for them to relearn. So I expect that it will take a decade or two until this all is common knowledge. Regards, Martin
Re: [Python-Dev] bytes.from_hex()
Stephen J. Turnbull wrote: Bengt The characters in b could be encoded in plain ascii, or Bengt utf16le, you have to know. Which base64 are you thinking about? Both RFC 3548 and RFC 2045 (MIME) specify subsets of US-ASCII explicitly. Unfortunately, it is ambiguous as to whether they refer to US-ASCII, the character set, or US-ASCII, the encoding. It appears that RFC 3548 talks about the character set only: - section 2.4 talks about choosing an alphabet, and how it should be possible for humans to handle such data. - section 2.3 talks about non-alphabet characters So it appears that RFC 3548 defines a conversion bytes -> text. To transmit this, you then also need an encoding. MIME appears to also use the US-ASCII *encoding* (charset, in IETF speak), for the base64 Content-Transfer-Encoding. For an example where base64 is *not* necessarily ASCII-encoded, see the binary data type in XML Schema. There, base64 is embedded into an XML document, and uses the encoding of the entire XML document. As a result, you may get base64 data in utf16le. Regards, Martin
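Martin's XML Schema point is easy to demonstrate: the base64 *characters* are independent of the byte encoding that carries them. An illustrative snippet (not tied to any XML library; the payload is arbitrary):

```python
import base64

payload = b"\x00\xff\x10"

# base64 as *text*: a string of characters from the base64 alphabet
b64_text = base64.b64encode(payload).decode("ascii")

# The same characters carried as UTF-16-LE bytes, as could happen inside
# a UTF-16-encoded XML document -- no longer ASCII bytes at all.
utf16le_bytes = b64_text.encode("utf-16-le")
```

Recovering the payload reverses both layers: decode the carrier encoding to get the base64 text, then base64-decode the text to get the original bytes.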
Re: [Python-Dev] bytes.from_hex()
On Feb 19, 2006, at 10:55 AM, Martin v. Löwis wrote: Stephen J. Turnbull wrote: BTW, what use cases do you have in mind for Unicode -> Unicode decoding? I think rot13 falls into that category: it is a transformation on text, not on bytes. The current implementation is a transformation on bytes, not text. Conceptually though, it's a text -> text transform. For other odd cases: base64 goes Unicode -> bytes in the *decode* direction, not in the encode direction. Some may argue that base64 is bytes, not text, but in many applications, you can combine base64 (or uuencode) with arbitrary other text in a single stream. Of course, it could be required that you go u.encode("ascii").decode("base64"). I would say that base64 is bytes -> bytes. Just because those bytes happen to be in a subset of ASCII, it's still a serialization meant for wire transmission. Sometimes it ends up in unicode (e.g. in XML), but that's the exception, not the rule. -bob
Re: [Python-Dev] bytes.from_hex()
Stephen J. Turnbull [EMAIL PROTECTED] wrote: Josiah == Josiah Carlson [EMAIL PROTECTED] writes: Josiah The question remains: is str.decode() returning a string Josiah or unicode depending on the argument passed, when the Josiah argument quite literally names the codec involved, Josiah difficult to understand? I don't believe so; am I the Josiah only one? Do you do any of the user education *about codec use* that you recommend? The people I try to teach about coding invariably find it difficult to understand. The problem is that the near-universal intuition is that for human-usable text is pretty much anything *but Unicode* will do. This is a really hard block to get them past. There is very good reason why Unicode is plain text (original in MAL's terms) and everything else is encoded (derived), but students new to the concept often take a while to get it. I've not been teaching Python; when I was still a TA, it was strictly algorithms and data structures. Of those people who I have had the opportunity to entice into Python, I've not followed up on their progress to know if they had any issues. I try to internalize it by not thinking of strings as encoded data, but as binary data, and unicode as text. I then remind myself that unicode isn't native on-disk or cross-network (which stores and transports bytes, not characters), so one needs to encode it as binary data. It's a subtle difference, but it has worked so far for me. In my experience, at least for only-English speaking users, most people don't even get to unicode. I didn't even touch it until I had been well versed with the encoding and decoding of all different kinds of binary data, when a half-dozen international users (China, Japan, Russia, ...) requested its support in my source editor; so I added it. Supporting it properly hasn't been very difficult, and the only real nit I have experienced is supporting the encoding line just after the #! 
line for arbitrary codecs (sometimes saving a file in a particular encoding dies). I notice that you seem to be in Japan, so teaching unicode is a must. If you are using "unicode is text and strings are data", and they aren't getting it, then I don't know. Maybe it's just me, but whether it's the teacher or the students, I am *not* excited about the education route. Martin's simple rule *is* simple, and the exceptions for using a nonexistent method mean I don't have to reinforce---the students will be able to teach each other. The exceptions also directly help reinforce the notion that text == Unicode. Are you sure that they would help? If .encode() and .decode() drop from strings and unicode (respectively), they get an AttributeError. That's almost useless. Raising a better exception (with more information) would be better in that case, but losing the functionality that either would offer seems unnecessary; which is why I had suggested some of the other method names. Perhaps a "This method was removed because it confused users. Use help(str.encode) (or unicode.decode) to find out how you can do the equivalent, or do what you *really* wanted to do." message. I grant the point that .decode('base64') is useful, but I also believe that education is a lot more easily said than done in this case. What I meant by education is 'better documentation' and 'better exception messages'. I didn't learn Python by sitting in a class; I learned it by going through the tutorial over a weekend as a 2nd-year undergrad and writing software which could do what I wanted/needed. Compared to the compiler messages I'd been seeing from Codewarrior and MSVC 6, Python exceptions were like an oracle. I can understand how first-time programmers can have issues with *some* Python exception messages, which is why I think that we could use better ones. There is also the other issue that sometimes people fail to actually read the messages.
Again, I don't believe that an AttributeError is any better than an "ordinal not in range(128)", but "You are trying to encode/decode to/from incompatible types. expected: a -> b, got: x -> y" is better. Some of those can be done *very soon*, given the capabilities of the encodings module, and they could likely be easily migrated, regardless of the decisions with .encode()/.decode() . - Josiah
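For what it's worth, this is close to where Python 3 eventually landed: bytes-to-bytes and text-to-text codecs remained reachable through the module-level `codecs.encode`/`codecs.decode`, while the *method* forms became restricted to real text encodings and raise a `LookupError` whose message points at the right spelling:

```python
import codecs

# The method form is restricted to text encodings...
try:
    "abc".encode("base64_codec")
except LookupError as exc:
    # e.g. "'base64_codec' is not a text encoding; use codecs.encode() ..."
    message = str(exc)

# ...while the module-level functions accept any registered codec.
encoded = codecs.encode(b"abc", "base64_codec")   # bytes -> bytes
decoded = codecs.decode(encoded, "base64_codec")  # bytes -> bytes
```

So the direction question was ultimately answered with a better error, much as suggested here.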
Re: [Python-Dev] bytes.from_hex()
Josiah Carlson wrote: Bob Ippolito [EMAIL PROTECTED] wrote: On Feb 17, 2006, at 8:33 PM, Josiah Carlson wrote: Greg Ewing [EMAIL PROTECTED] wrote: Stephen J. Turnbull wrote: Guido == Guido van Rossum [EMAIL PROTECTED] writes: Guido - b = bytes(t, enc); t = text(b, enc) +1 The coding conversion operation has always felt like a constructor to me, and in this particular usage that's exactly what it is. I prefer the nomenclature to reflect that. This also has the advantage that it completely avoids using the verbs encode and decode and the attendant confusion about which direction they go in. e.g. s = text(b, "base64") makes it obvious that you're going from the binary side to the text side of the base64 conversion. But you aren't always getting *unicode* text from the decoding of bytes, and you may be encoding bytes *to* bytes: b2 = bytes(b, "base64") b3 = bytes(b2, "base64") Which direction are we going again? This is *exactly* why the current set of codecs are INSANE. unicode.encode and str.decode should be used *only* for unicode codecs. Byte transforms are entirely different semantically and should be some other method pair. The problem is that we are overloading data types. Strings (and bytes) can contain both encoded text as well as data, or even encoded data. Right. Educate the users. Raise better exceptions telling people why their encoding or decoding failed, as Ian Bicking already pointed out. If bytes.encode() and the equivalent of text.decode() is going to disappear, +1 on better documentation all around with regards to encodings and Unicode. So far the best explanation I've found is in PEP 100. The Python docs and built-in help hardly explain more than the minimal argument list for the encoding and decoding methods, and the str and unicode type constructor arguments aren't explained any better.
Bengt Richter had a good idea with bytes.recode() for strictly bytes transformations (and the equivalent for text), though it is ambiguous as to the direction; are we encoding or decoding with bytes.recode()? In my opinion, this is why .encode() and .decode() make sense to keep on both bytes and text: the direction is unambiguous, and if one has even a remote idea of what the heck the codec is, they know their result. - Josiah I like the bytes.recode() idea a lot. +1 It seems to me it's a far more useful idea than encoding and decoding by overloading and could do both and more. It has a lot of potential to be an intermediate step for encoding as well as being used for many other translations to byte data. I think I would prefer that encode and decode be just functions with well defined names and arguments instead of being methods or arguments to string and Unicode types. I'm not sure on exactly how this would work. Maybe it would need two sets of encodings, ie.. decoders, and encoders. An exception would be given if it wasn't found for the direction one was going in. Roughly... something or other like:

import encodings

def tostr(obj, encoding):
    if encoding not in encoders:
        raise LookupError('encoding not found in encoders')
    # check if obj works with encoding to string
    # ...
    b = bytes(obj).recode(encoding)
    return str(b)

def tounicode(obj, decoding):
    if decoding not in decoders:
        raise LookupError('decoding not found in decoders')
    # check if obj works with decoding to unicode
    # ...
    b = bytes(obj).recode(decoding)
    return unicode(b)

Anyway... food for thought. Cheers, Ronald Adam
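A minimal working version of this sketch is possible today if the two hypothetical registries collapse to whatever the codec machinery already knows about; the helper names mirror Ron's, but the bodies are invented and lean on `codecs` for the lookup-or-raise behavior:

```python
import codecs

def tostr(obj, encoding):
    # bytes-producing direction: text -> bytes (or bytes -> bytes transforms);
    # an unknown codec name raises LookupError, as in the sketch above
    return codecs.encode(obj, encoding)

def tounicode(obj, decoding):
    # text-producing direction: bytes -> text
    return codecs.decode(obj, decoding)
```

The point of the exercise survives: the function *name* carries the destination type, so no direction argument is needed.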
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
Aahz wrote: The problem is that they don't understand that "Martin v. Löwis" is not Unicode -- once all strings are Unicode, this is guaranteed to work. This specific call, yes. I don't think the problem will go away as long as both encode and decode are available for both strings and byte arrays. While it's not absolutely true, my experience of watching Unicode confusion is that the simplest approach for newbies is: encode FROM Unicode, decode TO Unicode. I think this is what should be ingrained into the library, also. It shouldn't try to give additional meaning to these terms. Regards, Martin
Re: [Python-Dev] bytes.from_hex()
Ron Adam [EMAIL PROTECTED] wrote: Josiah Carlson wrote: Bengt Richter had a good idea with bytes.recode() for strictly bytes transformations (and the equivalent for text), though it is ambiguous as to the direction; are we encoding or decoding with bytes.recode()? In my opinion, this is why .encode() and .decode() make sense to keep on both bytes and text: the direction is unambiguous, and if one has even a remote idea of what the heck the codec is, they know their result. - Josiah I like the bytes.recode() idea a lot. +1 It seems to me it's a far more useful idea than encoding and decoding by overloading and could do both and more. It has a lot of potential to be an intermediate step for encoding as well as being used for many other translations to byte data. Indeed it does. I think I would prefer that encode and decode be just functions with well defined names and arguments instead of being methods or arguments to string and Unicode types. Attaching it to string and unicode objects is a useful convenience. Just like x.replace(y, z) is a convenience for string.replace(x, y, z) . Tossing the encode/decode somewhere else, like encodings, or even string, I see as a backwards step. I'm not sure on exactly how this would work. Maybe it would need two sets of encodings, ie.. decoders, and encoders. An exception would be given if it wasn't found for the direction one was going in. Roughly... something or other like:

import encodings

def tostr(obj, encoding):
    if encoding not in encoders:
        raise LookupError('encoding not found in encoders')
    # check if obj works with encoding to string
    # ...
    b = bytes(obj).recode(encoding)
    return str(b)

def tounicode(obj, decoding):
    if decoding not in decoders:
        raise LookupError('decoding not found in decoders')
    # check if obj works with decoding to unicode
    # ...
    b = bytes(obj).recode(decoding)
    return unicode(b)

Anyway... food for thought. Again, the problem is ambiguity; what does bytes.recode(something) mean?
Are we encoding _to_ something, or are we decoding _from_ something? Are we going to need to embed the direction in the encoding/decoding name (to_base64, from_base64, etc.)? That doesn't seem any better than binascii.b2a_base64 . What about .reencode and .redecode? It seems as though the 're' added as a prefix to .encode and .decode makes it clearer that you get the same type back as you put in, and it is also unambiguous as to direction. The question remains: is str.decode() returning a string or unicode depending on the argument passed, when the argument quite literally names the codec involved, difficult to understand? I don't believe so; am I the only one? - Josiah
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
Martin v. Löwis wrote: How are users confused? Users do

>>> "Martin v. Löwis".encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 11: ordinal not in range(128)

because they want to convert the string to Unicode, and they have found a text telling them that .encode("utf-8") is a reasonable method. What it *should* tell them is

>>> "Martin v. Löwis".encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'str' object has no attribute 'encode'

I've already explained why we have .encode() and .decode() methods on strings and Unicode many times. I've also explained the misunderstanding that codecs can only do Unicode-string conversions. And I've explained that the .encode() and .decode() methods *do* check the return types of the codecs and only allow strings or Unicode on return (no lists, instances, tuples or anything else). You seem to ignore this fact. If we were to follow your idea, we should remove .encode() and .decode() altogether and refer users to the codecs.encode() and codecs.decode() functions. However, I doubt that users will like this idea. bytes.encode CAN only produce bytes. I don't understand MAL's design, but I believe in that design, bytes.encode could produce anything (say, a list). A codec can convert anything to anything else. True. However, note that the .encode()/.decode() methods on strings and Unicode narrow down the possible return types. The corresponding bytes methods should only allow bytes and Unicode. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Feb 18 2006) Python/Zope Consulting and Support ...http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/ ::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free !
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
On Sat, Feb 18, 2006 at 12:06:37PM +0100, M.-A. Lemburg wrote: I've already explained why we have .encode() and .decode() methods on strings and Unicode many times. I've also explained the misunderstanding that codecs can only do Unicode-string conversions. And I've explained that the .encode() and .decode() methods *do* check the return types of the codecs and only allow strings or Unicode on return (no lists, instances, tuples or anything else). You seem to ignore this fact. Actually, I think the problem is that while we all agree the bytestring/unicode methods are a useful way to convert from bytestring to unicode and back again, we disagree on their *general* usefulness. Sure, the codecs mechanism is powerful, and even more so because they can determine their own return type. But it still smells and feels like a Perl attitude, for the reasons already explained numerous times, as well: - The return value for the non-unicode encodings depends on the value of the encoding argument. - The general case, by and large, especially in non-powerusers, is to encode unicode to bytestrings and to decode bytestrings to unicode. And that is a hard enough task for many of the non-powerusers. Being able to use the encode/decode methods for other tasks isn't helping them. That is why I disagree with the hypergeneralization of the encode/decode methods, regardless of the fact that it is a natural expansion of the implementation of codecs. Sure, it looks 'right' and 'natural' when you look at the implementation. It sure doesn't look natural, to me and to many others, when you look at the task of encoding and decoding bytestrings/unicode. -- Thomas Wouters [EMAIL PROTECTED] Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
Martin v. Löwis wrote: M.-A. Lemburg wrote: Just because some codecs don't fit into the string.decode() or bytes.encode() scenario doesn't mean that these codecs are useless or that the methods should be banned. No. The reason to ban string.decode and bytes.encode is that it confuses users. Instead of starting to ban everything that can potentially confuse a few users, we should educate those users and tell them what these methods mean and how they should be used.
Re: [Python-Dev] bytes.from_hex()
This posting is entirely tangential. Be warned. Martin v. Löwis [EMAIL PROTECTED] writes: It's worse than that. The return *type* depends on the *value* of the argument. I think there is little precedent for that: There's one extremely significant example where the *value* of something impacts on the type of something else: functions. The types of everything involved in str([1]) and len([1]) are the same but the results are different. This shows up in PyPy's type annotation; most of the time we just track types indeed, but when something is called we need to have a pretty good idea of the potential values, too. Relevant to the point at hand? No. Apologies for wasting your time :) Cheers, mwh -- The ultimate laziness is not using Perl. That saves you so much work you wouldn't believe it if you had never tried it. -- Erik Naggum, comp.lang.lisp
Re: [Python-Dev] bytes.from_hex()
Josiah Carlson wrote: Ron Adam [EMAIL PROTECTED] wrote: Josiah Carlson wrote: Bengt Richter had a good idea with bytes.recode() for strictly bytes transformations (and the equivalent for text), though it is ambiguous as to the direction; are we encoding or decoding with bytes.recode()? In my opinion, this is why .encode() and .decode() make sense to keep on both bytes and text: the direction is unambiguous, and if one has even a remote idea of what the heck the codec is, they know their result. - Josiah I like the bytes.recode() idea a lot. +1 It seems to me it's a far more useful idea than encoding and decoding by overloading and could do both and more. It has a lot of potential to be an intermediate step for encoding as well as being used for many other translations to byte data. Indeed it does. I think I would prefer that encode and decode be just functions with well defined names and arguments instead of being methods or arguments to string and Unicode types. Attaching it to string and unicode objects is a useful convenience. Just like x.replace(y, z) is a convenience for string.replace(x, y, z) . Tossing the encode/decode somewhere else, like encodings, or even string, I see as a backwards step. I'm not sure on exactly how this would work. Maybe it would need two sets of encodings, ie.. decoders, and encoders. An exception would be given if it wasn't found for the direction one was going in. Roughly... something or other like:

import encodings

def tostr(obj, encoding):
    if encoding not in encoders:
        raise LookupError('encoding not found in encoders')
    # check if obj works with encoding to string
    # ...
    b = bytes(obj).recode(encoding)
    return str(b)

def tounicode(obj, decoding):
    if decoding not in decoders:
        raise LookupError('decoding not found in decoders')
    # check if obj works with decoding to unicode
    # ...
    b = bytes(obj).recode(decoding)
    return unicode(b)

Anyway... food for thought.
Again, the problem is ambiguity; what does bytes.recode(something) mean? Are we encoding _to_ something, or are we decoding _from_ something? This was just an example of one way that might work, but here are my thoughts on why I think it might be good. In this case, the ambiguity is reduced as far as the encoding and decoding operations are concerned: somestring = encodings.tostr(someunicodestr, 'latin-1') It's pretty clear to me what is happening. It will encode the object named someunicodestr to a string, using the 'latin-1' encoder. It would also result in clear errors if the specified encoding is unavailable, and if it is available, if it's not compatible with the given *someunicodestr* obj type. Further hints could be gained by help(encodings.tostr), which could result in something like: encodings.tostr(string|unicode, encoder) -> string Encode a unicode string using an encoder codec to a non-unicode string, or transform a non-unicode string to another non-unicode string using an encoder codec. And if that's not enough, then help(encodings) could give more clues. These steps would be what I would do. And then the next thing would be to find the python docs entry on encodings. Placing them in encodings seems like a fairly good place to look for these functions if you are working with encodings. So I find that just as convenient as having them be string methods. There is no intermediate default encoding involved above (the bytes object is used instead), so you wouldn't get some of the messages the present system results in when ascii is the default. (Yes, I know ascii won't be the default when P3K is here.) Are we going to need to embed the direction in the encoding/decoding name (to_base64, from_base64, etc.)? That doesn't seem any better than binascii.b2a_base64. No, that's why I suggested two separate lists (or dictionaries might be better). They can contain the same names, but the lists they are in determine the context and point to the needed codec.
And that step is abstracted out by putting it inside the encodings.tostr() and encodings.tounicode() functions. So either function would call 'base64' from the correct codec list and get the correct encoding or decoding codec it needs. What about .reencode and .redecode? It seems as though the 're' added as a prefix to .encode and .decode makes it clearer that you get the same type back as you put in, and it is also unambiguous as to direction. But then wouldn't we end up with a multitude of ways to do things? s.encode(codec) == s.redecode(codec) s.decode(codec) == s.reencode(codec) unicode(s, codec) == s.decode(codec) str(u, codec) == u.encode(codec) str(s, codec) == s.encode(codec) unicode(s, codec) == s.reencode(codec) str(u, codec) == s.redecode(codec) str(s,
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
Thomas Wouters wrote: On Sat, Feb 18, 2006 at 12:06:37PM +0100, M.-A. Lemburg wrote: I've already explained why we have .encode() and .decode() methods on strings and Unicode many times. I've also explained the misunderstanding that codecs can only do Unicode-string conversions. And I've explained that the .encode() and .decode() methods *do* check the return types of the codecs and only allow strings or Unicode on return (no lists, instances, tuples or anything else). You seem to ignore this fact. Actually, I think the problem is that while we all agree the bytestring/unicode methods are a useful way to convert from bytestring to unicode and back again, we disagree on their *general* usefulness. Sure, the codecs mechanism is powerful, and even more so because they can determine their own return type. But it still smells and feels like a Perl attitude, for the reasons already explained numerous times, as well: It's by no means a Perl attitude. The main reason is symmetry and the fact that strings and Unicode should be as similar as possible in order to simplify the task of moving from one to the other. - The return value for the non-unicode encodings depends on the value of the encoding argument. Not really: you'll always get a basestring instance. - The general case, by and large, especially for non-power users, is to encode unicode to bytestrings and to decode bytestrings to unicode. And that is a hard enough task for many of the non-power users. Being able to use the encode/decode methods for other tasks isn't helping them. Agreed. Still, I believe that this is an educational problem.
There are a couple of gotchas users will have to be aware of (and this is unrelated to the methods in question): * encoding always refers to transforming original data into a derived form * decoding always refers to transforming a derived form of data back into its original form * for Unicode codecs the original form is Unicode, the derived form is, in most cases, a string As a result, if you want to use a Unicode codec such as utf-8, you encode Unicode into a utf-8 string and decode a utf-8 string into Unicode. Encoding a string is only possible if the string itself is original data, e.g. some data that is supposed to be transformed into a base64 encoded form. Decoding Unicode is only possible if the Unicode string itself represents a derived form, e.g. a sequence of hex literals. That is why I disagree with the hypergeneralization of the encode/decode methods, regardless of the fact that it is a natural expansion of the implementation of codecs. Sure, it looks 'right' and 'natural' when you look at the implementation. It sure doesn't look natural, to me and to many others, when you look at the task of encoding and decoding bytestrings/unicode. That's because you only look at one specific task. Codecs also unify the various interfaces to common encodings such as base64, uu or zip which are not Unicode related. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Feb 18 2006) Python/Zope Consulting and Support ...http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/ ::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free !
Re: [Python-Dev] bytes.from_hex()
On 2/18/06, Josiah Carlson [EMAIL PROTECTED] wrote: Look at what we've currently got going for data transformations in the standard library to see what these removals will do: base64 module, binascii module, binhex module, uu module, ... Do we want or need to add another top-level module for every future encoding/codec that comes out (or does everyone think that we're done seeing codecs)? Do we want to keep monkey-patching binascii with names like 'a2b_hqx'? While there is currently one text-text transform (rot13), do we add another module for text-text transforms? Would it start having names like t2e_rot13() and e2t_rot13()? If top-level modules are the problem then why not make codecs into a package? from codecs import utf8, base64 utf8.encode(u) -> b utf8.decode(b) -> u base64.encode(b) -> b base64.decode(b) -> b -- Adam Olsen, aka Rhamphoryncus
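The package layout Adam sketches could be prototyped today with plain namespace objects. This is only an illustration of the call shapes he proposes (utf8.encode(u) -> b, base64.encode(b) -> b), not the submodule-based implementation he had in mind; the SimpleNamespace wrappers are an assumption.

```python
import base64 as _b64
from types import SimpleNamespace

# Each "codec module" exposes exactly two names, encode and decode,
# with the direction fixed by the codec's nature.
utf8 = SimpleNamespace(
    encode=lambda u: u.encode('utf-8'),   # text -> bytes
    decode=lambda b: b.decode('utf-8'),   # bytes -> text
)
base64 = SimpleNamespace(
    encode=_b64.b64encode,                # bytes -> bytes
    decode=_b64.b64decode,                # bytes -> bytes
)
```

Note how the ambiguity Josiah worries about disappears: each namespace documents its own input and output types, so there is no need for direction-embedded names like t2e_rot13().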
Re: [Python-Dev] bytes.from_hex()
On Sat, Feb 18, 2006, Ron Adam wrote: I like the bytes.recode() idea a lot. +1 It seems to me it's a far more useful idea than encoding and decoding by overloading and could do both and more. It has a lot of potential to be an intermediate step for encoding as well as being used for many other translations to byte data. I think I would prefer that encode and decode be just functions with well defined names and arguments instead of being methods or arguments to string and Unicode types. I'm not sure on exactly how this would work. Maybe it would need two sets of encodings, ie.. decoders, and encoders. An exception would be given if it wasn't found for the direction one was going in. Here's an idea I don't think I've seen before: bytes.recode(b, src_encoding, dest_encoding) This requires the user to state up-front what the source encoding is. One of the big problems that I see with the whole encoding mess is that so much of it contains implicit assumptions about the source encoding; this gets away from that. -- Aahz ([EMAIL PROTECTED]) * http://www.pythoncraft.com/ 19. A language that doesn't affect the way you think about programming, is not worth knowing. --Alan Perlis
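Aahz's two-encoding signature can be sketched as a free function over today's codecs machinery: decode from the stated source encoding, then encode to the destination. The name recode and the argument order follow his post; everything else is an assumption for illustration.

```python
import codecs

def recode(b, src_encoding, dest_encoding):
    """Recode bytes from one named encoding to another.

    The source encoding is stated explicitly, which is Aahz's point:
    no implicit assumption about what the input bytes currently are.
    """
    text = codecs.decode(b, src_encoding)      # bytes -> str via source codec
    return codecs.encode(text, dest_encoding)  # str -> bytes via dest codec
```

For character-set conversions this reads naturally: recode(data, 'latin-1', 'utf-8') says both what the data is and what you want it to become.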
Re: [Python-Dev] bytes.from_hex()
Aahz wrote: On Sat, Feb 18, 2006, Ron Adam wrote: I like the bytes.recode() idea a lot. +1 It seems to me it's a far more useful idea than encoding and decoding by overloading and could do both and more. It has a lot of potential to be an intermediate step for encoding as well as being used for many other translations to byte data. I think I would prefer that encode and decode be just functions with well defined names and arguments instead of being methods or arguments to string and Unicode types. I'm not sure on exactly how this would work. Maybe it would need two sets of encodings, ie.. decoders, and encoders. An exception would be given if it wasn't found for the direction one was going in. Here's an idea I don't think I've seen before: bytes.recode(b, src_encoding, dest_encoding) This requires the user to state up-front what the source encoding is. One of the big problems that I see with the whole encoding mess is that so much of it contains implicit assumptions about the source encoding; this gets away from that. You might want to look at the codecs.py module: it has all these things and a lot more. http://docs.python.org/lib/module-codecs.html http://svn.python.org/view/python/trunk/Lib/codecs.py?view=markup -- Marc-Andre Lemburg, eGenix.com
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
M.-A. Lemburg wrote: I've already explained why we have .encode() and .decode() methods on strings and Unicode many times. I've also explained the misunderstanding that codecs can only do Unicode-string conversions. And I've explained that the .encode() and .decode() methods *do* check the return types of the codecs and only allow strings or Unicode on return (no lists, instances, tuples or anything else). You seem to ignore this fact. I'm not ignoring the fact that you have explained this many times. I just fail to understand your explanations. For example, you said at some point that codecs are not restricted to Unicode. However, I don't recall any explanation of what the restriction *is*, if any restriction exists. No such restriction seems to be documented. True. However, note that the .encode()/.decode() methods on strings and Unicode narrow down the possible return types. The corresponding .bytes methods should only allow bytes and Unicode. I forgot that: what is the rationale for that restriction? Regards, Martin
Re: [Python-Dev] bytes.from_hex()
Michael Hudson wrote: There's one extremely significant example where the *value* of something impacts on the type of something else: functions. The types of everything involved in str([1]) and len([1]) are the same but the results are different. This shows up in PyPy's type annotation; most of the time we just track types indeed, but when something is called we need to have a pretty good idea of the potential values, too. Relevant to the point at hand? No. Apologies for wasting your time :) Actually, I think it is relevant. I never thought about it this way, but now that you mention it, you are right. This demonstrates that the string argument to .encode is actually a function name, at least the way it is implemented now. So .encode('uu') and .encode('rot13') are *two* different methods, instead of being a single method. This brings me back to my original point: rot13 should be a function, not a parameter to some function. In essence, .encode reimplements apply(), with the added feature of not having to pass the function itself, but just its name. Maybe this design results from a really deep understanding of "Namespaces are one honking great idea -- let's do more of those!" Regards, Martin
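Martin's observation, that the codec-name string effectively names a function, can be made explicit with codecs.lookup(): the string selects a codec object, whose encode entry point is then applied to the data. The helper name encode_by_name is invented for illustration.

```python
import codecs

def encode_by_name(data, name):
    # The string argument is really a function selector: look the codec
    # up by name, then apply the function it names. This is the
    # "apply() by name" structure Martin describes.
    codec = codecs.lookup(name)
    result, _consumed = codec.encode(data)
    return result
```

Seen this way, .encode('rot_13') and .encode('utf-8') really are two different operations reached through one dispatching method, which is the heart of his objection.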
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
Martin v. Löwis wrote: M.-A. Lemburg wrote: I've already explained why we have .encode() and .decode() methods on strings and Unicode many times. I've also explained the misunderstanding that codecs can only do Unicode-string conversions. And I've explained that the .encode() and .decode() methods *do* check the return types of the codecs and only allow strings or Unicode on return (no lists, instances, tuples or anything else). You seem to ignore this fact. I'm not ignoring the fact that you have explained this many times. I just fail to understand your explanations. Feel free to ask questions. For example, you said at some point that codecs are not restricted to Unicode. However, I don't recall any explanation of what the restriction *is*, if any restriction exists. No such restriction seems to be documented. The codecs are not restricted with respect to the data types they work on. It's up to the codecs to define which data types they accept on input and return as output. True. However, note that the .encode()/.decode() methods on strings and Unicode narrow down the possible return types. The corresponding .bytes methods should only allow bytes and Unicode. I forgot that: what is the rationale for that restriction? To assure that only those types can be returned from those methods, i.e. instances of basestring, which in return permits type inference for those methods. The codecs functions encode() and decode() don't have these restrictions, and thus provide a generic interface to the codec's encode and decode functions. It's up to the caller to restrict the allowed encodings and as a result the possible input/output types. -- Marc-Andre Lemburg, eGenix.com
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
M.-A. Lemburg wrote: True. However, note that the .encode()/.decode() methods on strings and Unicode narrow down the possible return types. The corresponding .bytes methods should only allow bytes and Unicode. I forgot that: what is the rationale for that restriction? To assure that only those types can be returned from those methods, i.e. instances of basestring, which in return permits type inference for those methods. Hmm. So it is for type inference. Where is that documented? This looks pretty inconsistent. Either codecs can give arbitrary return types, and then .encode/.decode should also be allowed to give arbitrary return types, or codecs should be restricted. What's the point of first allowing a wide interface, and then narrowing it? Also, if type inference is the goal, what is the point in allowing two result types? Regards, Martin
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
Martin v. Löwis wrote: M.-A. Lemburg wrote: True. However, note that the .encode()/.decode() methods on strings and Unicode narrow down the possible return types. The corresponding .bytes methods should only allow bytes and Unicode. I forgot that: what is the rationale for that restriction? To assure that only those types can be returned from those methods, i.e. instances of basestring, which in return permits type inference for those methods. Hmm. So it is for type inference. Where is that documented? Somewhere in the python-dev mailing list archives ;-) Seriously, we should probably add this to the documentation. This looks pretty inconsistent. Either codecs can give arbitrary return types, and then .encode/.decode should also be allowed to give arbitrary return types, or codecs should be restricted. No. As I've said before: the .encode() and .decode() methods are convenience methods to interface to codecs which take string/Unicode on input and create string/Unicode output. What's the point of first allowing a wide interface, and then narrowing it? The codec interface is an abstract interface. It is flexible enough to allow codecs to define possible input and output types while being strict about the method names and signatures. Much like the file interface in Python, the copy protocol or the pickle interface. Also, if type inference is the goal, what is the point in allowing two result types? I'm not sure I understand the question: type inference is about being able to infer the types of (among other things) function return objects. This is what the restriction guarantees - much like int() guarantees that you get either an integer or a long. -- Marc-Andre Lemburg, eGenix.com
Re: [Python-Dev] bytes.from_hex()
Ron Adam [EMAIL PROTECTED] wrote: Josiah Carlson wrote: [snip] Again, the problem is ambiguity; what does bytes.recode(something) mean? Are we encoding _to_ something, or are we decoding _from_ something? This was just an example of one way that might work, but here are my thoughts on why I think it might be good. In this case, the ambiguity is reduced as far as the encoding and decoding operations are concerned: somestring = encodings.tostr( someunicodestr, 'latin-1') It's pretty clear to me what is happening. It will encode the object named someunicodestr to a string, using the 'latin-1' encoder. But now how do you get it back? encodings.tounicode(..., 'latin-1')?, unicode(..., 'latin-1')? What about string transformations: somestring = encodings.tostr(somestr, 'base64') How do we get that back? encodings.tostr() again is completely ambiguous, str(somestring, 'base64') seems a bit awkward (switching namespaces)? It would also result in clear errors if the specified encoding is unavailable, and if it is available, if it's not compatible with the given *someunicodestr* obj type. Further hints could be gained by help(encodings.tostr), which could result in something like: encodings.tostr(string|unicode, encoder) -> string Encode a unicode string using an encoder codec to a non-unicode string, or transform a non-unicode string to another non-unicode string using an encoder codec. And if that's not enough, then help(encodings) could give more clues. These steps would be what I would do. And then the next thing would be to find the python docs entry on encodings. Placing them in encodings seems like a fairly good place to look for these functions if you are working with encodings. So I find that just as convenient as having them be string methods. There is no intermediate default encoding involved above (the bytes object is used instead), so you wouldn't get some of the messages the present system results in when ascii is the default.
(Yes, I know ascii won't be the default when P3K is here.) Are we going to need to embed the direction in the encoding/decoding name (to_base64, from_base64, etc.)? That doesn't seem any better than binascii.b2a_base64. No, that's why I suggested two separate lists (or dictionaries might be better). They can contain the same names, but the lists they are in determine the context and point to the needed codec. And that step is abstracted out by putting it inside the encodings.tostr() and encodings.tounicode() functions. So either function would call 'base64' from the correct codec list and get the correct encoding or decoding codec it needs. Either the API you have described is incomplete, you haven't noticed the directional ambiguity you are describing, or I have completely lost it. What about .reencode and .redecode? It seems as though the 're' added as a prefix to .encode and .decode makes it clearer that you get the same type back as you put in, and it is also unambiguous as to direction. But then wouldn't we end up with a multitude of ways to do things? s.encode(codec) == s.redecode(codec) s.decode(codec) == s.reencode(codec) unicode(s, codec) == s.decode(codec) str(u, codec) == u.encode(codec) str(s, codec) == s.encode(codec) unicode(s, codec) == s.reencode(codec) str(u, codec) == s.redecode(codec) str(s, codec) == s.redecode(codec) Umm .. did I miss any? Which ones would you remove? Which ones of those will succeed with which codecs? I must not be expressing myself very well. Right now: s.encode() -> s s.decode() -> s, u u.encode() -> s, u u.decode() -> u Martin et al's desired change to encode/decode: s.decode() -> u u.encode() -> s No others.
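For the record, the narrowed interface described as "Martin et al's desired change" is exactly what Python 3 eventually shipped: bytes.decode() returns str, str.encode() returns bytes, and neither type has the other two arrows at all.

```python
# Python 3 behavior: only the two unambiguous directions exist.
b = 'caf\u00e9'.encode('utf-8')   # str -> bytes
s = b.decode('utf-8')             # bytes -> str
# str has no .decode() and bytes has no .encode(), so the four-way
# table above collapses to these two lines.
```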
What my thoughts on .reencode() and .redecode() would get you given Martin et al's desired change: s.reencode() -> s (you get encoded strings as strings) s.redecode() -> s (you get decoded strings as strings) u.reencode() -> u (you get encoded unicode as unicode) u.redecode() -> u (you get decoded unicode as unicode) If one wants to go from unicode to string, one uses .encode(). If one wants to go from string to unicode, one uses .decode(). If one wants to keep their type unchanged, but encode or decode the data/text, one would use .reencode() and .redecode(), depending on whether their source is an encoded block of data, or the original data they want to encode. The other bonus is that if given .reencode() and .redecode(), one can quite easily verify that the source is possible as a source, and that you would get back the proper type. How this would occur behind the scenes is beyond the scope of this discussion, but it seems to me to be easy, given what I've read about the current mechanism. Whether the constructors for the str and unicode do their own codec transformations is beside the
Re: [Python-Dev] bytes.from_hex()
Aahz wrote: On Sat, Feb 18, 2006, Ron Adam wrote: I like the bytes.recode() idea a lot. +1 It seems to me it's a far more useful idea than encoding and decoding by overloading and could do both and more. It has a lot of potential to be an intermediate step for encoding as well as being used for many other translations to byte data. I think I would prefer that encode and decode be just functions with well defined names and arguments instead of being methods or arguments to string and Unicode types. I'm not sure on exactly how this would work. Maybe it would need two sets of encodings, ie.. decoders, and encoders. An exception would be given if it wasn't found for the direction one was going in. Here's an idea I don't think I've seen before: bytes.recode(b, src_encoding, dest_encoding) This requires the user to state up-front what the source encoding is. One of the big problems that I see with the whole encoding mess is that so much of it contains implicit assumptions about the source encoding; this gets away from that. Yes, but it's not just the encodings that are implicit, it is also the types. s.encode(codec) # explicit source type, ? dest type s.decode(codec) # explicit source type, ? dest type encodings.tostr(obj, codec) # implicit *known* source type # explicit dest type encodings.tounicode(obj, codec) # implicit *known* source type # explicit dest type In this case the source is implicit, but there can be a well defined check to validate the source type against the codec being used. It's my feeling the user *knows* what he already has, and so it's more important that the resulting object type is explicit. In your suggestion... bytes.recode(b, src_encoding, dest_encoding) Here the encodings are both explicit, but the source and destination types of the bytes are not. Since it's working on bytes, they could have come from anywhere, and after the translation they would then be cast to the type the user *thinks* it should result in.
A source of errors that would likely pass silently. The way I see it is the bytes type should be a lower level object that doesn't care what byte transformation it does. I.e., they are all one-way byte-to-byte transformations determined by context. And it should have the capability to read from and write to types without translating in the same step. Keep it simple. Then it could be used as a lower level byte translator to implement encodings and other translations in encoding methods or functions instead of trying to make it replace the higher level functionality. Cheers, Ron
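Ron's layering idea, a low-level byte-to-byte translator underneath thin typed wrappers, might be sketched like this. Every name here (recode_bytes, to_text, the transform keys) is invented for illustration; only base64 is registered in this toy table.

```python
import base64

# Low-level layer: one-way byte-to-byte transformations selected by name.
# This plays the role of bytes.recode() in Ron's description.
_BYTE_TRANSFORMS = {
    'base64_encode': base64.b64encode,
    'base64_decode': base64.b64decode,
}

def recode_bytes(b, transform):
    """Bytes in, bytes out; the transform name supplies the context."""
    return _BYTE_TRANSFORMS[transform](b)

def to_text(b, transform):
    """Higher-level wrapper that fixes the result type (str here)."""
    return recode_bytes(b, transform).decode('ascii')
```

The point of the split is that the byte layer never guesses types: only the wrappers decide what the caller gets back, which addresses the silent-cast worry above.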
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
On Sat, Feb 18, 2006 at 01:21:18PM +0100, M.-A. Lemburg wrote: It's by no means a Perl attitude. In your eyes, perhaps. It certainly feels that way to me (or I wouldn't have said it :). Perl happens to be full of general constructs that were added because they were easy to add, or they were useful in edge cases. The encode/decode methods remind me of that, even though I fully understand the reasoning behind it, and the elegance of the implementation. The main reason is symmetry and the fact that strings and Unicode should be as similar as possible in order to simplify the task of moving from one to the other. Yes, and this is a design choice I don't agree with. They're different types. They do everything similarly, except when they are mixed together (unicode takes precedence, in general, encoding the bytestring from the default encoding.) Going from one to the other isn't symmetric, though. I understand that you disagree; the disagreement is on the fundamental choice of allowing 'encode' and 'decode' to do *more* than going from and to unicode. I regret that decision, not the decision to make encode and decode symmetric (which makes sense, after the decision to overgeneralize encode/decode is made.) - The return value for the non-unicode encodings depends on the value of the encoding argument. Not really: you'll always get a basestring instance. Which is not a particularly useful distinction, since in any real world application, you have to be careful not to mix unicode with (non-ascii) bytestrings. The only way to reliably deal with unicode is to have it well-contained (when migrating an application from using bytestrings to using unicode) or to use unicode everywhere, decoding/encoding at entry points. Containment is hard to achieve. Still, I believe that this is an educational problem.
There are a couple of gotchas users will have to be aware of (and this is unrelated to the methods in question): * encoding always refers to transforming original data into a derived form * decoding always refers to transforming a derived form of data back into its original form * for Unicode codecs the original form is Unicode, the derived form is, in most cases, a string As a result, if you want to use a Unicode codec such as utf-8, you encode Unicode into a utf-8 string and decode a utf-8 string into Unicode. Encoding a string is only possible if the string itself is original data, e.g. some data that is supposed to be transformed into a base64 encoded form. Decoding Unicode is only possible if the Unicode string itself represents a derived form, e.g. a sequence of hex literals. Most of these gotchas would not have been gotchas had encode/decode only been usable for unicode encodings. That is why I disagree with the hypergeneralization of the encode/decode methods [..] That's because you only look at one specific task. Codecs also unify the various interfaces to common encodings such as base64, uu or zip which are not Unicode related. No, I think you misunderstand. I object to the hypergeneralization of the *encode/decode methods*, not the codec system. I would have been fine with another set of methods for non-unicode transformations. Although I would have been even more fine if they got their encoding not as a string, but as, say, a module object, or something imported from a module. Not that I think any of this matters; we have what we have and I'll have to live with it ;) -- Thomas Wouters [EMAIL PROTECTED] Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
Re: [Python-Dev] bytes.from_hex()
Josiah Carlson [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] Again, the problem is ambiguity; what does bytes.recode(something) mean? Are we encoding _to_ something, or are we decoding _from_ something? Are we going to need to embed the direction in the encoding/decoding name (to_base64, from_base64, etc.)? To me, that seems simple and clear. b.recode('from_base64') obviously requires that b meet the restrictions of base64. Similarly for 'from_hex'. That doesn't seem any better than binascii.b2a_base64. I think 'from_base64' is *much* better. I think there are now 4 string-to-string transform modules that do similar things. Not optimal to me. What about .reencode and .redecode? It seems as though the 're' added as a prefix to .encode and .decode makes it clearer that you get the same type back as you put in, and it is also unambiguous as to direction. To me, the 're' prefix is awkward, confusing, and misleading. Terry J. Reedy
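Terry's direction-in-the-name spelling can be mocked up with a simple table. The 'to_base64'/'from_base64' keys are hypothetical and stand in for whatever registered set of directional codec names such a design would carry.

```python
import base64

# Direction embedded in the transform name, as Terry suggests:
# the name alone tells you whether you are encoding or decoding.
_NAMED_TRANSFORMS = {
    'to_base64': base64.b64encode,
    'from_base64': base64.b64decode,
}

def recode(b, name):
    """Apply the directional transform named by 'name' to bytes b."""
    return _NAMED_TRANSFORMS[name](b)
```

With this spelling, recode(b, 'from_base64') can only mean one thing, which is exactly the ambiguity fix Terry argues for.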
Re: [Python-Dev] bytes.from_hex()
Josiah Carlson wrote: Ron Adam [EMAIL PROTECTED] wrote: Josiah Carlson wrote: [snip] Again, the problem is ambiguity; what does bytes.recode(something) mean? Are we encoding _to_ something, or are we decoding _from_ something? This was just an example of one way that might work, but here are my thoughts on why I think it might be good. In this case, the ambiguity is reduced as far as the encoding and decoding operations are concerned: somestring = encodings.tostr( someunicodestr, 'latin-1') It's pretty clear to me what is happening. It will encode the object named someunicodestr to a string, using the 'latin-1' encoder. But now how do you get it back? encodings.tounicode(..., 'latin-1')?, unicode(..., 'latin-1')? Yes, just do: someunicodestr = encodings.tounicode(somestring, 'latin-1') What about string transformations: somestring = encodings.tostr(somestr, 'base64') How do we get that back? encodings.tostr() again is completely ambiguous, str(somestring, 'base64') seems a bit awkward (switching namespaces)? In the case where a string is converted to another string, it would probably be best to have a requirement that they all get converted to unicode as an intermediate step. By doing that it becomes an explicit two-step operation. # string to string encoding u_string = encodings.tounicode(s_string, 'base64') s2_string = encodings.tostr(u_string, 'base64') Or you could have a convenience function to do it in the encodings module also. def strtostr(s, sourcecodec, destcodec): u = tounicode(s, sourcecodec) return tostr(u, destcodec) Then... s2 = encodings.strtostr(s, 'base64', 'base64') Which would be kind of pointless in this example, but it would be a good way to test a codec. assert s == s2 Are we going to need to embed the direction in the encoding/decoding name (to_base64, from_base64, etc.)? That doesn't seem any better than binascii.b2a_base64. No, that's why I suggested two separate lists (or dictionaries might be better).
They can contain the same names, but the lists they are in determine the context and point to the needed codec. And that step is abstracted out by putting it inside the encodings.tostr() and encodings.tounicode() functions. So either function would call 'base64' from the correct codec list and get the correct encoding or decoding codec it needs. Either the API you have described is incomplete, you haven't noticed the directional ambiguity you are describing, or I have completely lost it. Most likely I gave an incomplete description of the API in this case because there are probably several ways to implement it. What about .reencode and .redecode? It seems as though the 're' added as a prefix to .encode and .decode makes it clearer that you get the same type back as you put in, and it is also unambiguous to direction. ... I must not be expressing myself very well. Right now: s.encode() - s s.decode() - s, u u.encode() - s, u u.decode() - u Martin et al's desired change to encode/decode: s.decode() - u u.encode() - s No others. Which would be similar to the functions I suggested. The main difference is only weather it would be better to have them as methods or separate factory functions and the spelling of the names. Both have their advantages I think. The method bytes.recode(), always does a byte transformation which can be almost anything. It's the context bytes.recode() is used in that determines what's happening. In the above cases, it's using an encoding transformation, so what it's doing is precisely what you would expect by it's context. Indeed, there is a translation going on, but it is not clear as to whether you are encoding _to_ something or _from_ something. What does s.recode('base64') mean? Are you encoding _to_ base64 or _from_ base64? That's where the ambiguity lies. Bengt didn't propose adding .recode() to the string types, but only the bytes type. The byte type would recode the bytes using a specific transformation. 
I believe his view is it's a lower level API than strings that can be used to implement the higher level encoding API with, not replace the encoding API. Or that is they way I interpreted the suggestion. There isn't a bytes.decode(), since that's just another transformation. So only the one method is needed. Which makes it easer to learn. But ambiguous. What's ambiguous about it? It's no more ambiguous than any math operation where you can do it one way with one operations and get your original value back with the same operation by using an inverse value. n2=n+1; n3=n+(-1); n==n3 n2=n*2; n3=n*(.5); n==n3 Learning how the current system works comes awfully close to reverse engineering. Maybe I'm overstating it a bit, but I suspect many end up doing exactly that in order to learn how Python does
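Ron's single-method recode() idea, with direction carried by the transformation name, can be sketched as a plain function over a name-keyed table (the registry and the 'to_base64'/'from_base64' names are hypothetical stand-ins for whatever bytes.recode would look up):

```python
import base64

# hypothetical transform registry: direction is part of the name
_transforms = {
    'to_base64': base64.b64encode,
    'from_base64': base64.b64decode,
}

def recode(data, transform):
    """Apply a named bytes-to-bytes transformation."""
    return _transforms[transform](data)

# the inverse-value analogy: the same operation, fed an inverse name,
# gives the original value back
round_trip = recode(recode(b'some bytes', 'to_base64'), 'from_base64')
```

This mirrors the n2 = n + 1; n3 = n2 + (-1) analogy: one operation, with the inverse expressed in the argument.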
Re: [Python-Dev] bytes.from_hex()
Ron Adam [EMAIL PROTECTED] wrote:
> Josiah Carlson wrote:
>> Ron Adam [EMAIL PROTECTED] wrote:
>>> Josiah Carlson wrote:
>>> [snip]
>>>> Again, the problem is ambiguity; what does bytes.recode(something)
>>>> mean? Are we encoding _to_ something, or are we decoding _from_
>>>> something?
>>>
>>> This was just an example of one way that might work, but here are
>>> my thoughts on why I think it might be good. In this case, the
>>> ambiguity is reduced as far as the encoding and decoding operations
>>> are concerned.
>>>
>>>     somestring = encodings.tostr(someunicodestr, 'latin-1')
>>>
>>> It's pretty clear to me what is happening. It will encode to a
>>> string an object, named someunicodestr, with the 'latin-1' encoder.
>>
>> But now how do you get it back? encodings.tounicode(..., 'latin-1')?
>> unicode(..., 'latin-1')?
>
> Yes, just do:
>
>     someunicodestr = encodings.tounicode(somestring, 'latin-1')
>
>> What about string transformations:
>>
>>     somestring = encodings.tostr(somestr, 'base64')
>>
>> How do we get that back? encodings.tostr() again is completely
>> ambiguous, and str(somestring, 'base64') seems a bit awkward
>> (switching namespaces)?
>
> In the case where a string is converted to another string, it would
> probably be best to have a requirement that they all get converted to
> unicode as an intermediate step. By doing that it becomes an explicit
> two-step operation:
>
>     # string to string encoding
>     u_string = encodings.tounicode(s_string, 'base64')
>     s2_string = encodings.tostr(u_string, 'base64')

Except that ambiguates it even further. Is encodings.tounicode() encoding, or decoding? According to everything you have said so far, it would be decoding. But if I am decoding binary data, why should it be spending any time as a unicode string? What do I mean?

    x = f.read()  # x contains base-64 encoded binary data
    y = encodings.to_unicode(x, 'base64')
    # y now contains BINARY DATA, except that it is a unicode string
    z = encodings.to_str(y, 'latin-1')

Later you define a str_to_str function, which I (or someone else) would use like:

    z = str_to_str(x, 'base64', 'latin-1')

But the trick is that I don't want some unicode string encoded in latin-1, I want my binary data unencoded. They may happen to be the same in this particular example, but that doesn't mean that it makes any sense to the user.

[...]

> > What about .reencode and .redecode? It seems as though the 're'
> > added as a prefix to .encode and .decode makes it clearer that you
> > get the same type back as you put in, and it is also unambiguous as
> > to direction.
> > ...
> > I must not be expressing myself very well. Right now:
> >
> >     s.encode() -> s
> >     s.decode() -> s, u
> >     u.encode() -> s, u
> >     u.decode() -> u
> >
> > Martin et al's desired change to encode/decode:
> >
> >     s.decode() -> u
> >     u.encode() -> s
> >
> > No others.
>
> Which would be similar to the functions I suggested. The main
> difference is only whether it would be better to have them as methods
> or separate factory functions, and the spelling of the names. Both
> have their advantages, I think.

While others would disagree, I personally am not a fan of to* or from* style namings, for either function names (especially in the encodings module) or methods. Just a personal preference. Of course, I don't find the current situation regarding str/unicode.encode/decode to be confusing either, but maybe it's because my unicode experience is strictly within the realm of GUI widgets, where compartmentalization can be easier.

> The method bytes.recode() always does a byte transformation, which
> can be almost anything. It's the context bytes.recode() is used in
> that determines what's happening. In the above cases it's an encoding
> transformation, so what it's doing is precisely what you would expect
> from its context.

[THIS IS THE AMBIGUITY]
> > Indeed, there is a translation going on, but it is not clear as to
> > whether you are encoding _to_ something or _from_ something. What
> > does s.recode('base64') mean? Are you encoding _to_ base64 or
> > _from_ base64? That's where the ambiguity lies.
>
> Bengt didn't propose adding .recode() to the string types, but only
> the bytes type. The bytes type would recode the bytes using a
> specific transformation. I believe his view is that it's a
> lower-level API than strings, which can be used to implement the
> higher-level encoding API, not replace it. Or that is the way I
> interpreted the suggestion.

But again, what would the transformation be? To something? From something? 'to_base64', 'from_base64', 'to_rot13' (which happens to be identical to) 'from_rot13', ... Saying it would recode ... using a specific transformation is a cop-out; what would the translation be? How would it work? How would it be spelled? That smells quite a bit like .encode() and .decode(), just spelled differently, and without quite a clear path. That is why I was offering...

    s.reencode() -> s
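Josiah's detour complaint can be shown with the stdlib base64 module standing in for the hypothetical encodings functions: the round trip through latin-1 "text" only works because latin-1 happens to map all 256 byte values one-to-one.

```python
import base64

raw = bytes(range(256))              # genuinely binary data
transported = base64.b64encode(raw)  # what the base-64 file would contain

# the direct route: undo base64, recover the bytes, no text type involved
direct = base64.b64decode(transported)

# the detour: bytes -> unicode 'text' -> bytes again
as_text = base64.b64decode(transported).decode('latin-1')
detour = as_text.encode('latin-1')
```

Both routes recover the data here, but the intermediate unicode string in the detour is binary data wearing a text costume, which is exactly the objection.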
Re: [Python-Dev] bytes.from_hex()
Josiah Carlson wrote:
> Ron Adam [EMAIL PROTECTED] wrote:
>
> Except that ambiguates it even further. Is encodings.tounicode()
> encoding, or decoding? According to everything you have said so far,
> it would be decoding. But if I am decoding binary data, why should it
> be spending any time as a unicode string? What do I mean?

Encoding and decoding are relative concepts. It's all encoding from one thing to another. Whether it's decoding or encoding depends on the relationship of the current encoding to a standard encoding. The confusion introduced by decode is when the 'default_encoding' changes, will change, or is unknown.

>     x = f.read()  # x contains base-64 encoded binary data
>     y = encodings.to_unicode(x, 'base64')
>
> y now contains BINARY DATA, except that it is a unicode string

No, that wasn't what I was describing. You get a Unicode string object as the result, not a bytes object with binary data. See the toy example at the bottom.

>     z = encodings.to_str(y, 'latin-1')
>
> Later you define a str_to_str function, which I (or someone else)
> would use like:
>
>     z = str_to_str(x, 'base64', 'latin-1')
>
> But the trick is that I don't want some unicode string encoded in
> latin-1, I want my binary data unencoded. They may happen to be the
> same in this particular example, but that doesn't mean that it makes
> any sense to the user.

If you want bytes, then you would use the bytes() type to get bytes directly, not encode or decode:

    binary_unicode = bytes(unicode_string)

The exact byte order and representation would need to be decided by the python developers in this case. The internal representation, 'unicode-internal', is UCS-2 I believe.

>> It's no more ambiguous than any math operation where you can do it
>> one way with one operation and get your original value back with the
>> same operation by using an inverse value.
>>
>>     n2 = n + 1;  n3 = n2 + (-1);  n == n3
>>     n2 = n * 2;  n3 = n2 * (.5);  n == n3
>
> Ahh, so you are saying 'to_base64' and 'from_base64'. There is one
> major reason why I don't like that kind of a system: I can't just say
> encoding='base64' and use str.encode(encoding) and
> str.decode(encoding); I necessarily have to use
> str.recode('to_'+encoding) and str.recode('from_'+encoding). Seems a
> bit awkward.

Yes, but the encodings API could abstract out the 'to_base64' and 'from_base64' so you can just say 'base64' and have it work either way. Maybe a toy, incomplete example might help:

    # in module bytes.py or someplace else.
    class bytes(list):
        # bytes methods defined here
        ...

    # in module encodings.py
    # using a dict of tuples, but other solutions would
    # work just as well.
    unicode_codecs = {
        'base64': ('from_base64', 'to_base64'),
    }

    def tounicode(obj, from_codec):
        b = bytes(obj)
        b = b.recode(unicode_codecs[from_codec][0])
        return unicode(b)

    def tostr(obj, to_codec):
        b = bytes(obj)
        b = b.recode(unicode_codecs[to_codec][1])
        return str(b)

    # in your application
    import encodings

    ... a bunch of code ...

    u = encodings.tounicode(s, 'base64')
    # or if going the other way
    s = encodings.tostr(u, 'base64')

Does this help? Is the relationship between the bytes object and the encodings API clearer here? If not, maybe we should discuss it further off line.

Cheers,
   Ronald Adam
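A runnable approximation of Ron's toy example, with the stdlib base64 module standing in for the hypothetical bytes.recode machinery (the codec table and function names follow the sketch in the message; the latin-1 bridging is an added assumption):

```python
import base64

# one codec name, two directions, chosen by the API rather than the caller
unicode_codecs = {
    'base64': (base64.b64decode, base64.b64encode),
}

def tounicode(obj, from_codec):
    """Undo the named transfer encoding and hand back a text object."""
    from_fn, _ = unicode_codecs[from_codec]
    return from_fn(obj).decode('latin-1')

def tostr(obj, to_codec):
    """Apply the named transfer encoding to a text object, yielding bytes."""
    _, to_fn = unicode_codecs[to_codec]
    return to_fn(obj.encode('latin-1'))

s = base64.b64encode(b'hello')  # a base64-encoded 'string'
u = tounicode(s, 'base64')      # caller says 'base64', API picks the direction
s2 = tostr(u, 'base64')         # same name, other direction
```

The point being illustrated: the caller writes 'base64' both ways, and which half of the pair runs is implied by which function was called.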
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
Josiah Carlson wrote:
> I would agree that zip is questionable, but 'uu', 'rot13', perhaps
> 'hex', and likely a few others that the two of you may be arguing
> against should stay as encodings, because strictly speaking, they are
> defined as encodings of data. They may not be encodings of _unicode_
> data, but that doesn't mean that they aren't useful encodings for
> other kinds of data, some text, some binary, ...

To support them, the bytes type would have to gain a .encode method, and I'm -1 on supporting bytes.encode, or string.decode. Why is s.encode("uu") any better than binascii.b2a_uu(s)?

Regards,
Martin
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
On 2/15/06, Guido van Rossum [EMAIL PROTECTED] wrote:
> Actually users trying to figure out Unicode would probably be better
> served if bytes.encode() and text.decode() did not exist. [...] It
> would be better if the signature of text.encode() always returned a
> bytes object.
>
> > But why deny the bytes object a decode() method if text objects
> > have an encode() method?
>
> I agree, text.encode() and bytes.decode() are both swell. It's the
> other two that bother me.
>
> I'd say there are two symmetric API flavors possible (t and b are
> text and bytes objects, respectively, where text is a string type,
> either str or unicode; enc is an encoding name):
>
> - b.decode(enc) -> t; t.encode(enc) -> b
> - b = bytes(t, enc); t = text(b, enc)
>
> I'm not sure why one flavor would be preferred over the other,
> although having both would probably be a mistake.

I prefer the constructor flavor; the word bytes feels more concrete than encode. But I worry about constructors being too overloaded.

    text(b, enc)  # decode
    text(mydict)  # repr
    text(b)       # uh... decode with default encoding?

-j
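Both of Guido's flavors can be exercised side by side in Python 3 as it eventually shipped, where the constructor spellings bytes(t, enc) and str(b, enc) coexist with the method pair:

```python
t = "café"

# method flavor
b1 = t.encode('utf-8')
t1 = b1.decode('utf-8')

# constructor flavor
b2 = bytes(t, 'utf-8')
t2 = str(b2, 'utf-8')
```

Note that the overloading worry is real: str(b2) without an encoding does not decode, it produces the repr of the bytes object.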
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
On Feb 16, 2006, at 9:20 PM, Josiah Carlson wrote:
> Greg Ewing [EMAIL PROTECTED] wrote:
>> Josiah Carlson wrote:
>>> They may not be encodings of _unicode_ data,
>>
>> But if they're not encodings of unicode data, what business do they
>> have being available through someunicodestring.encode(...)?
>
> I had always presumed that bytes objects are going to be able to be a
> source for encode AND decode, like current non-unicode strings are
> able to be today. In that sense, if I have a bytes object which is an
> encoding of rot13, hex, uu, etc., or I have a bytes object which I
> would like to be in one of those encodings, I should be able to do
> b.encode(...) or b.decode(...), given that 'b' is a bytes object.
>
> Are 'encodings' going to become a mechanism to encode and decode
> _unicode_ strings, rather than a mechanism to encode and decode _text
> and data_ strings? That would seem like a backwards step to me, as
> the email package would need to package its own base-64 encode/decode
> API and implementation, and similarly for any other package which
> uses any one of the encodings already available.

It would be VERY useful to separate the two concepts. bytes-to-bytes transforms should be one function pair, and bytes-to-text transforms should be another. The current situation is totally insane:

    str.decode(codec) -> str or unicode, or UnicodeDecodeError or
        ZlibError or TypeError... who knows what else
    str.encode(codec) -> str or unicode, or UnicodeDecodeError or
        TypeError... probably other exceptions

Granted, unicode.encode(codec) and unicode.decode(codec) are actually somewhat sane in that the return type is always a str and the exceptions are either UnicodeEncodeError or UnicodeDecodeError.

I think that rot13 is the only conceptually text-to-text transform (though the current implementation is really bytes-to-bytes); everything else is either bytes-to-text or bytes-to-bytes.

-bob
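The separation Bob asks for is roughly what later Pythons settled on: str.encode/bytes.decode do only text-to-bytes conversion with fixed return types, while the bytes-to-bytes transforms are reached through codecs.encode/codecs.decode (the '_codec'-suffixed names below are the spellings Python 3 uses for them):

```python
import codecs

data = b"binary \x00 payload"

# bytes -> bytes transform, kept off the str/bytes methods entirely
wrapped = codecs.encode(data, 'base64_codec')
unwrapped = codecs.decode(wrapped, 'base64_codec')

# bytes <-> text stays on the method pair, with fixed types
text = "caf\u00e9"
round_tripped = text.encode('utf-8').decode('utf-8')
```

Two function pairs, two jobs, and the "who knows what else" exception soup is gone.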
Re: [Python-Dev] bytes.from_hex()
>>>>> "Guido" == Guido van Rossum [EMAIL PROTECTED] writes:

    Guido> I'd say there are two symmetric API flavors possible (t and
    Guido> b are text and bytes objects, respectively, where text is a
    Guido> string type, either str or unicode; enc is an encoding
    Guido> name):

    Guido> - b.decode(enc) -> t; t.encode(enc) -> b

-0  When taking a binary file and attaching it to the text of a mail message using BASE64, the tendency to say you're encoding the file in BASE64 is very strong. I just don't see how such usages can be avoided in discussion, which makes the types of decode and encode hard to remember, and easy to mistake in some contexts.

    Guido> - b = bytes(t, enc); t = text(b, enc)

+1  The coding conversion operation has always felt like a constructor to me, and in this particular usage that's exactly what it is. I prefer the nomenclature to reflect that.

--
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can do free software business; ask what your business can do for free software.
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
Martin v. Löwis wrote:
> Josiah Carlson wrote:
>> I would agree that zip is questionable, but 'uu', 'rot13', perhaps
>> 'hex', and likely a few others that the two of you may be arguing
>> against should stay as encodings, because strictly speaking, they
>> are defined as encodings of data. They may not be encodings of
>> _unicode_ data, but that doesn't mean that they aren't useful
>> encodings for other kinds of data, some text, some binary, ...
>
> To support them, the bytes type would have to gain a .encode method,
> and I'm -1 on supporting bytes.encode, or string.decode. Why is
> s.encode("uu") any better than binascii.b2a_uu(s)?

The .encode() and .decode() methods are merely convenience interfaces to the registered codecs (with some extra logic to make sure that only a pre-defined set of return types are allowed). It's up to the user to use them for e.g. UU-encoding or not.

The reason we have codecs for UU, zip and the others is that you can use their StreamWriters/Readers in stackable streams. Just because some codecs don't fit into the string.decode() or bytes.encode() scenario doesn't mean that these codecs are useless or that the methods should be banned.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Feb 17 2006)
Python/Zope Consulting and Support ...        http://www.egenix.com/
mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free !
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
On Fri, 17 Feb 2006 00:33:49 +0100, "Martin v. Löwis" [EMAIL PROTECTED] wrote:
> Josiah Carlson wrote:
>> I would agree that zip is questionable, but 'uu', 'rot13', perhaps
>> 'hex', and likely a few others that the two of you may be arguing
>> against should stay as encodings, because strictly speaking, they
>> are defined as encodings of data. They may not be encodings of
>> _unicode_ data, but that doesn't mean that they aren't useful
>> encodings for other kinds of data, some text, some binary, ...
>
> To support them, the bytes type would have to gain a .encode method,
> and I'm -1 on supporting bytes.encode, or string.decode. Why is
> s.encode("uu") any better than binascii.b2a_uu(s)?

One aspect is that dotted notation method calling is serially composable, whereas function calls nest, and you have to find and read from the innermost, which gets hard quickly unless you use multiline formatting. But even then you can't read top to bottom as processing order.

If we had a general serial composition syntax for function calls, something like unix piping (which is a big part of the power of unix shells IMO), we could make the choice of appropriate composition semantics better. Decorators already compose functions in a limited way, but processing order would read like Forth, horizontally. Maybe '->'? How about

    foo(x, y) -> bar() -> baz(z)

as sugar for

    baz.__get__(bar.__get__(foo(x, y))())(z)

? (Hope I got that right ;-) I.e., you'd have self-like args to receive results from upstream. E.g.,

    >>> def foo(x, y): return 'foo(%s, %s)' % (x, y)
    ...
    >>> def bar(stream): return 'bar(%s)' % stream
    ...
    >>> def baz(stream, z): return 'baz(%s, %s)' % (stream, z)
    ...
    >>> x = 'ex'; y = 'wye'; z = 'zed'
    >>> baz.__get__(bar.__get__(foo(x, y))())(z)
    'baz(bar(foo(ex, wye)), zed)'

Regards,
Bengt Richter
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
M.-A. Lemburg wrote:
> Just because some codecs don't fit into the string.decode() or
> bytes.encode() scenario doesn't mean that these codecs are useless or
> that the methods should be banned.

No. The reason to ban string.decode and bytes.encode is that it confuses users.

Regards,
Martin
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
Martin v. Löwis [EMAIL PROTECTED] wrote:
> M.-A. Lemburg wrote:
>> Just because some codecs don't fit into the string.decode() or
>> bytes.encode() scenario doesn't mean that these codecs are useless
>> or that the methods should be banned.
>
> No. The reason to ban string.decode and bytes.encode is that it
> confuses users.

How are users confused? bytes.encode CAN only produce bytes. Though string.decode (or bytes.decode) MAY produce strings (or bytes) or unicode, depending on the codec, I think it is quite reasonable to expect that users will understand that string.decode('utf-8') is different than string.decode('base-64'), and that they may produce different output. In a similar fashion, dict.get(1) may produce different results than dict.get(2) for some dictionaries.

If some users can't understand this (passing different arguments to a function may produce different output), then I think that some users are broken beyond repair.

- Josiah
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
On Fri, 17 Feb 2006 21:35:25 +0100, "Martin v. Löwis" [EMAIL PROTECTED] wrote:
> M.-A. Lemburg wrote:
>> Just because some codecs don't fit into the string.decode() or
>> bytes.encode() scenario doesn't mean that these codecs are useless
>> or that the methods should be banned.
>
> No. The reason to ban string.decode and bytes.encode is that it
> confuses users.

Well, that's because of semantic overloading, assuming you mean string as characters and bytes as binary bytes. The trouble is that encoding and decoding have to have bytes to represent the coded info, whichever direction. Characters per se aren't coded info, so string.decode doesn't make sense without faking it with string.encode().decode(), and bytes.encode() likewise first has to have a hidden .decode to become a string that makes sense to encode. And the hidden stuff restricts to ascii, for further grief :-(

So yes, please ban string.decode and bytes.encode. And maybe introduce bytes.recode for bytes-to-bytes transforms? (Strings don't have any codes to recode.)

Regards,
Bengt Richter
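The split Bengt advocates here is checkable against Python 3 as it later shipped, which did drop str.decode and bytes.encode and routes bytes-to-bytes recoding through the codecs module (much in the spirit of the proposed bytes.recode):

```python
import codecs

# str never decodes; bytes never encode
has_str_decode = hasattr("text", 'decode')
has_bytes_encode = hasattr(b"data", 'encode')

# a bytes -> bytes 'recode' round trip via a named transform
recoded = codecs.decode(codecs.encode(b"data", 'zlib_codec'), 'zlib_codec')
```

The hidden ascii step Bengt complains about is gone with the methods themselves: there is no implicit text/bytes conversion left to smuggle in.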
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
Josiah Carlson wrote:
> How are users confused?

Users do

    py> "Martin v. Löwis".encode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in
    position 11: ordinal not in range(128)

because they want to convert the string to Unicode, and they have found a text telling them that .encode("utf-8") is a reasonable method. What it *should* tell them is

    py> "Martin v. Löwis".encode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    AttributeError: 'str' object has no attribute 'encode'

> bytes.encode CAN only produce bytes.

I don't understand MAL's design, but I believe in that design, bytes.encode could produce anything (say, a list). A codec can convert anything to anything else.

> If some users can't understand this (passing different arguments to a
> function may produce different output),

It's worse than that. The return *type* depends on the *value* of the argument. I think there is little precedent for that: normally, the return values depend on the argument values, and, in a polymorphic function, the return type might depend on the argument types (e.g. the arithmetic operations). Also, the return type may depend on the number of arguments (e.g. by requesting a return type in a keyword argument).

> then I think that some users are broken beyond repair.

Hmm. I'm speechless.

Regards,
Martin
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
Martin v. Löwis wrote:
> Users do
>
>     py> "Martin v. Löwis".encode("utf-8")
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in ?
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in
>     position 11: ordinal not in range(128)
>
> because they want to convert the string to Unicode, and they have
> found a text telling them that .encode("utf-8") is a reasonable
> method. What it *should* tell them is
>
>     py> "Martin v. Löwis".encode("utf-8")
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in ?
>     AttributeError: 'str' object has no attribute 'encode'

I think it would be even better if they got

    ValueError: utf8 can only encode unicode objects

AttributeError is not much more clear than the UnicodeDecodeError.

That str.encode(unicode_encoding) implicitly decodes strings seems like a flaw in the unicode encodings, quite separate from the existence of str.encode. I for one really like s.encode('zlib').encode('base64') -- and if the zlib encoding raised an error when it was passed a unicode object (instead of implicitly encoding the string with the ascii encoding) that would be fine.

The pipe-like nature of .encode and .decode works very nicely for certain transformations, applicable to both unicode and byte objects. Let's not throw the baby out with the bath water.

--
Ian Bicking / [EMAIL PROTECTED] / http://blog.ianbicking.org
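Ian's pipeline survives in spirit: the same two hops can be written with codecs.encode, reading left to right in processing order, even though the transforms no longer live on str (the 'zlib_codec'/'base64_codec' names are the spellings Python 3 later settled on):

```python
import codecs

payload = b"hello world " * 10

# compress, then wrap in an ASCII-safe transfer encoding
wrapped = codecs.encode(codecs.encode(payload, 'zlib_codec'), 'base64_codec')

# and back: unwrap the outer hop first, then decompress
restored = codecs.decode(codecs.decode(wrapped, 'base64_codec'), 'zlib_codec')
```

Passing a str to the zlib hop raises TypeError outright, which is exactly the behavior Ian asks for in place of the implicit ascii encode.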
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
Martin v. Löwis [EMAIL PROTECTED] wrote:
> Josiah Carlson wrote:
>> How are users confused?
>
> Users do
>
>     py> "Martin v. Löwis".encode("utf-8")
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in ?
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in
>     position 11: ordinal not in range(128)
>
> because they want to convert the string to Unicode, and they have
> found a text telling them that .encode("utf-8") is a reasonable
> method.

Removing functionality because some users read bad instructions somewhere is a bit like kicking your kitten because your puppy peed on the floor. You are punishing the wrong group, for something that shouldn't result in punishment: it should result in education. Users are always going to get bad instructions, and removing utility because some users fail to think before they act, or complain when their lack of thinking doesn't work, will give us a language where we are removing features because *new* users have no idea what they are doing.

> What it *should* tell them is
>
>     py> "Martin v. Löwis".encode("utf-8")
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in ?
>     AttributeError: 'str' object has no attribute 'encode'

I disagree. I think the original error was correct, and we should be educating users to prefix their literals with a 'u' if they want unicode, or they should get their data from a unicode source (wxPython with unicode, StreamReader, etc.)

>> bytes.encode CAN only produce bytes.
>
> I don't understand MAL's design, but I believe in that design,
> bytes.encode could produce anything (say, a list). A codec can
> convert anything to anything else.

That seems to me to be a little overkill... In any case, I personally find that data.encode('base-64') and edata.decode('base-64') are more convenient than binascii.b2a_base64(data) and binascii.a2b_base64(edata). Ditto for hexlify/unhexlify, etc.

>> If some users can't understand this (passing different arguments to
>> a function may produce different output),
>
> It's worse than that. The return *type* depends on the *value* of the
> argument. I think there is little precedent for that: normally, the
> return values depend on the argument values, and, in a polymorphic
> function, the return type might depend on the argument types (e.g.
> the arithmetic operations). Also, the return type may depend on the
> number of arguments (e.g. by requesting a return type in a keyword
> argument).

You only need to look to dictionaries, where different values passed into a function call may very well return results of different types, yet there have been no restrictions on mapping to and from single types per dictionary. Many dict-like interfaces for configuration files do this, things like config.get('remote_host') and config.get('autoconnect') not being uncommon.

- Josiah
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
Ian Bicking wrote:
> That str.encode(unicode_encoding) implicitly decodes strings seems
> like a flaw in the unicode encodings, quite separate from the
> existence of str.encode. I for one really like
> s.encode('zlib').encode('base64') -- and if the zlib encoding raised
> an error when it was passed a unicode object (instead of implicitly
> encoding the string with the ascii encoding) that would be fine. The
> pipe-like nature of .encode and .decode works very nicely for
> certain transformations, applicable to both unicode and byte
> objects. Let's not throw the baby out with the bath water.

The way you use it, it's a matter of notation only: why is zlib(base64(s)) any worse? I think it's better: it doesn't use string literals to denote function names.

If there is a point to this overgeneralized codec idea, it is the streaming aspect: that you don't need to process all data at once, but can feed data sequentially. Of course, you are not using this in your example.

Regards,
Martin
Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]
Josiah Carlson wrote:
>>> If some users can't understand this (passing different arguments
>>> to a function may produce different output),
>>
>> It's worse than that. The return *type* depends on the *value* of
>> the argument. I think there is little precedent for that: normally,
>> the return values depend on the argument values, and, in a
>> polymorphic function, the return type might depend on the argument
>> types (e.g. the arithmetic operations). Also, the return type may
>> depend on the number of arguments (e.g. by requesting a return type
>> in a keyword argument).
>
> You only need to look to dictionaries, where different values passed
> into a function call may very well return results of different
> types, yet there have been no restrictions on mapping to and from
> single types per dictionary. Many dict-like interfaces for
> configuration files do this, things like config.get('remote_host')
> and config.get('autoconnect') not being uncommon.

I think there is *some* justification, if you don't understand up front that the codec you refer to (using a string) is just a way of avoiding an import (thankfully -- dynamically importing unicode codecs is obviously infeasible). Now, if you understand the argument refers to some algorithm, it's not so bad.

The other aspect is that there should be something consistent about the return types -- the Python type is not what we generally rely on, though. In this case they are all data. Unicode and bytes are both data, and you could probably argue lists of ints is data too (but an arbitrary list definitely isn't data). On the outer end of data might be an ElementTree structure (but that's getting fishy). An open file object is not data. A tuple probably isn't data.

--
Ian Bicking / [EMAIL PROTECTED] / http://blog.ianbicking.org