Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Stephen J. Turnbull
Steven D'Aprano writes: [long example] > Am I right so far? > > So the email package uses the surrogate-escape error handler and ends up > with this Unicode string: > > 'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”' > > which can be encoded back to the bytes we
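
The round trip described in the preview above can be sketched in a few lines. This is only an illustration under stated assumptions: the preview shows the decoded string but not the original bytes, so `RAW_HEADER` below is reconstructed to match it, and decoding as UTF-8 is assumed rather than taken from the email package. The helper name `smuggle_round_trip` is hypothetical.

```python
# Assumed input: bytes chosen so that decoding reproduces the string in the preview.
RAW_HEADER = b"Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d"


def smuggle_round_trip(raw: bytes):
    """Decode with surrogateescape, then encode back with the same handler."""
    # Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
    text = raw.decode("utf-8", errors="surrogateescape")
    # The same handler turns U+DC80..U+DCFF back into the original bytes.
    restored = text.encode("utf-8", errors="surrogateescape")
    return text, restored


text, restored = smuggle_round_trip(RAW_HEADER)

# A strict encode refuses the lone surrogates, which is how a string with
# smuggled bytes can be told apart from ordinary text.
try:
    text.encode("utf-8")
except UnicodeEncodeError:
    strict_ok = False
else:
    strict_ok = True
```

Here `restored` equals `RAW_HEADER`, while the strict encode fails, so the string can be parsed as a `str` but cannot leave the program without the surrogateescape handler.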

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Akira Li
Steven D'Aprano writes: > On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote: >> On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray >> wrote: > >> > Basically, we are pretending that the each smuggled >> > byte is single character for string parsing purposes...but they don't >> > matc

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Steven D'Aprano
On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote: > On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray > wrote: > > Basically, we are pretending that the each smuggled > > byte is single character for string parsing purposes...but they don't > > match any of our parsing constants. T

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Stephen J. Turnbull
Glenn Linderman writes: > Some bytes may decode into characters without needing to be > smuggled... maybe not in text-protocols like email, but in the > general case. So then some of the bytes that should be interpreted > as binary data are not in a disjoint set from characters. True, but irr

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Glenn Linderman
On 9/16/2014 5:21 PM, Stephen J. Turnbull wrote: It isn't, because the bytes/str problem was that given a str object out of context you could not tell whether it was a binary blob or text, and if text, you couldn't tell if it was external encoded text or internal abstract text. That is not true

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Chris Angelico
On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray wrote: > Yes. I thought you were saying that one could not treat the string with > smuggled bytes as if it were a string. (It's a string that can't be > encoded unless you use the surrogateescape error handler, but it is > still a string from Pyth

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread R. David Murray
On Wed, 17 Sep 2014 08:57:21 +0900, "Stephen J. Turnbull" wrote: > As long as the Java string manipulation functions don't check for > surrogates, you should be fine with this representation. Of course I > suppose your matching functions (etc) don't check for them either, so > you will be somewh

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Stephen J. Turnbull
R. David Murray writes: > > Do what, exactly? As I understand you, you treat the unknown bytes as > > completely opaque, not representing any characters at all. Which is > > what I'm saying: those are not characters. > > Yes. I thought you were saying that one could not treat the string wit

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Stephen J. Turnbull
Jim Baker writes: > Given that Jython uses UTF-16 as its representation, it is possible to > frequently smuggle isolated surrogates in it. A surrogate pair must be a > low surrogate in range (D800, DC00), then a high surrogate in range(DC00, > E000). > > Of course, if you do actually have a
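
For reference, the Unicode Standard names the ranges the other way around from the quoted text: high (leading) surrogates are U+D800..U+DBFF and low (trailing) surrogates are U+DC00..U+DFFF, and a pair is a high surrogate followed by a low one. A minimal sketch of the UTF-16 arithmetic follows; the function names are hypothetical and not taken from Jython or CPython.

```python
def is_high_surrogate(cp: int) -> bool:
    # Leading half of a UTF-16 surrogate pair.
    return 0xD800 <= cp <= 0xDBFF


def is_low_surrogate(cp: int) -> bool:
    # Trailing half of a UTF-16 surrogate pair.
    return 0xDC00 <= cp <= 0xDFFF


def combine_pair(high: int, low: int) -> int:
    """Map a (high, low) surrogate pair to its supplementary code point."""
    if not (is_high_surrogate(high) and is_low_surrogate(low)):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp: int):
    """Map a supplementary code point to its (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)
```

The surrogateescape handler only produces U+DC80..U+DCFF, which all fall in the low-surrogate range, so `is_low_surrogate(0xDC80)` is true and two smuggled bytes in a row never pass the `combine_pair` check on their own.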

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread R. David Murray
On Wed, 17 Sep 2014 04:02:11 +1000, Chris Angelico wrote: > On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray > wrote: > >> You can't treat them as characters, so while you have them in your > >> string, you can't treat it as a pure Unicode string - it's a Unicode > >> string with smuggled bytes

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Chris Angelico
On Wed, Sep 17, 2014 at 3:55 AM, Jim Baker wrote: > Of course, if you do actually have a smuggled isolated low surrogate > FOLLOWED by a smuggled isolated high surrogate - guess what, the only > interpretation is a codepoint. Or perhaps more likely garbage. Of course it > doesn't happen so often,

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Jim Baker
Great points here - I especially like the concluding statement "you can't treat it as a pure Unicode string - it's a Unicode string with smuggled bytes" Given that Jython uses UTF-16 as its representation, it is possible to frequently smuggle isolated surrogates in it. A surrogate pair must be a l

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Chris Angelico
On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray wrote: >> You can't treat them as characters, so while you have them in your >> string, you can't treat it as a pure Unicode string - it's a Unicode >> string with smuggled bytes. > > Well, except that I do. The email header parsing algorithms all

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread R. David Murray
On Wed, 17 Sep 2014 01:27:44 +1000, Chris Angelico wrote: > On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray > wrote: > > That isn't the case in the email package. The smuggled bytes are not > > errors[*], they are literally smuggled bytes. > > But they're not characters, which is what Stephen

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Chris Angelico
On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray wrote: > That isn't the case in the email package. The smuggled bytes are not > errors[*], they are literally smuggled bytes. But they're not characters, which is what Stephen and I were saying - and contrary to what Jim said about treating them a

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread R. David Murray
On Tue, 16 Sep 2014 13:51:23 +1000, Chris Angelico wrote: > On Tue, Sep 16, 2014 at 1:34 PM, Stephen J. Turnbull > wrote: > > Jim J. Jewett writes: > > > > > In terms of best-effort, it is reasonable to treat the smuggled bytes > > > as representing a character outside of your unicode repertoire