Steven D'Aprano writes:
[long example]
> Am I right so far?
>
> So the email package uses the surrogate-escape error handler and ends up
> with this Unicode string:
>
> 'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”'
>
> which can be encoded back to the bytes we
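The round trip being described can be sketched in a few lines of Python. This is only an illustration of the surrogateescape error handler (PEP 383). The sample bytes below are an assumption chosen for brevity, not the bytes from the example in the thread.

```python
# Illustrative header bytes (an assumption, not taken from the thread):
# ASCII text followed by one non-ASCII byte, 0xE9.
raw = b"Subject: caf\xe9"

# Decoding as ASCII with surrogateescape maps each undecodable byte 0xNN
# to the lone surrogate U+DCNN instead of raising an error.
text = raw.decode("ascii", errors="surrogateescape")
smuggled = text[-1]

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("ascii", errors="surrogateescape")

# Without the handler, the smuggled surrogate cannot be encoded.
try:
    text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
```

Here `smuggled` is the single code point U+DCE9, and `round_tripped` compares equal to `raw`.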
Steven D'Aprano writes:
> On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
>> On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray
>> wrote:
>
>> > Basically, we are pretending that each smuggled
>> > byte is a single character for string parsing purposes...but they don't
>> > matc
On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
> On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray
> wrote:
> > Basically, we are pretending that each smuggled
> > byte is a single character for string parsing purposes...but they don't
> > match any of our parsing constants. T
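The point about smuggled bytes never matching a parsing constant can be shown with a small sketch. The helper `split_header` is hypothetical and is not the email package's actual parser. It only assumes that the delimiter is an ASCII character and that the raw bytes were decoded with surrogateescape.

```python
def split_header(line):
    """Hypothetical helper: split raw header bytes into (name, value bytes)."""
    text = line.decode("ascii", errors="surrogateescape")
    # Smuggled bytes become code points in U+DC80..U+DCFF, so they can
    # never compare equal to an ASCII delimiter such as ":".
    name, _, value = text.partition(":")
    # Re-encode with the same handler to recover the original value bytes.
    return name, value.strip().encode("ascii", errors="surrogateescape")


name, value = split_header(b"Subject: \xff\xfe: odd bytes")
```

String operations such as `partition` and `strip` treat each smuggled byte as one opaque code point, so the undecodable bytes pass through the parse unchanged.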
Glenn Linderman writes:
> Some bytes may decode into characters without needing to be
> smuggled... maybe not in text-protocols like email, but in the
> general case. So then some of the bytes that should be interpreted
> as binary data are not in a disjoint set from characters.
True, but irr
On 9/16/2014 5:21 PM, Stephen J. Turnbull wrote:
> It isn't, because the bytes/str problem was that given a str object
> out of context you could not tell whether it was a binary blob or
> text, and if text, you couldn't tell if it was external encoded text
> or internal abstract text.
That is not true
On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray wrote:
> Yes. I thought you were saying that one could not treat the string with
> smuggled bytes as if it were a string. (It's a string that can't be
> encoded unless you use the surrogateescape error handler, but it is
> still a string from Pyth
On Wed, 17 Sep 2014 08:57:21 +0900, "Stephen J. Turnbull"
wrote:
> As long as the Java string manipulation functions don't check for
> surrogates, you should be fine with this representation. Of course I
> suppose your matching functions (etc) don't check for them either, so
> you will be somewh
R. David Murray writes:
> > Do what, exactly? As I understand you, you treat the unknown bytes as
> > completely opaque, not representing any characters at all. Which is
> > what I'm saying: those are not characters.
>
> Yes. I thought you were saying that one could not treat the string wit
Jim Baker writes:
> Given that Jython uses UTF-16 as its representation, it is possible to
> frequently smuggle isolated surrogates in it. A surrogate pair must be a
> low surrogate in range (D800, DC00), then a high surrogate in range(DC00,
> E000).
>
> Of course, if you do actually have a
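The surrogate ranges under discussion can be checked directly in Python. Note that the Unicode Standard calls the leading unit (U+D800..U+DBFF) the high surrogate and the trailing unit (U+DC00..U+DFFF) the low surrogate. The sketch below uses CPython's codecs to stand in for a UTF-16 string representation. That is an assumption for illustration and says nothing about how Jython is actually implemented.

```python
def is_high_surrogate(ch):
    """True for a leading (high) surrogate, U+D800..U+DBFF."""
    return 0xD800 <= ord(ch) <= 0xDBFF


def is_low_surrogate(ch):
    """True for a trailing (low) surrogate, U+DC00..U+DFFF."""
    return 0xDC00 <= ord(ch) <= 0xDFFF


# surrogateescape only ever produces U+DC80..U+DCFF, which are all
# trailing (low) surrogates.
smuggled = b"\x80\xff".decode("ascii", errors="surrogateescape")
all_low = all(is_low_surrogate(ch) for ch in smuggled)

# In UTF-16, a lone high surrogate immediately followed by a lone low
# surrogate is indistinguishable from a surrogate pair.
units = "\ud83d\ude00".encode("utf-16-le", errors="surrogatepass")
combined = units.decode("utf-16-le")
```

After the UTF-16 round trip, `combined` is the single code point U+1F600 rather than two isolated surrogates. That is the ambiguity a UTF-16 representation has to live with.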
On Wed, 17 Sep 2014 04:02:11 +1000, Chris Angelico wrote:
> On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray
> wrote:
> >> You can't treat them as characters, so while you have them in your
> >> string, you can't treat it as a pure Unicode string - it's a Unicode
> >> string with smuggled bytes
On Wed, Sep 17, 2014 at 3:55 AM, Jim Baker wrote:
> Of course, if you do actually have a smuggled isolated low surrogate
> FOLLOWED by a smuggled isolated high surrogate - guess what, the only
> interpretation is a codepoint. Or perhaps more likely garbage. Of course it
> doesn't happen so often,
Great points here - I especially like the concluding statement "you can't
treat it as a pure Unicode string - it's a Unicode string with smuggled
bytes"
Given that Jython uses UTF-16 as its representation, it is possible to
frequently smuggle isolated surrogates in it. A surrogate pair must be a
l
On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray wrote:
>> You can't treat them as characters, so while you have them in your
>> string, you can't treat it as a pure Unicode string - it's a Unicode
>> string with smuggled bytes.
>
> Well, except that I do. The email header parsing algorithms all
On Wed, 17 Sep 2014 01:27:44 +1000, Chris Angelico wrote:
> On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray
> wrote:
> > That isn't the case in the email package. The smuggled bytes are not
> > errors[*], they are literally smuggled bytes.
>
> But they're not characters, which is what Stephen
On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray wrote:
> That isn't the case in the email package. The smuggled bytes are not
> errors[*], they are literally smuggled bytes.
But they're not characters, which is what Stephen and I were saying -
and contrary to what Jim said about treating them a
On Tue, 16 Sep 2014 13:51:23 +1000, Chris Angelico wrote:
> On Tue, Sep 16, 2014 at 1:34 PM, Stephen J. Turnbull
> wrote:
> > Jim J. Jewett writes:
> >
> > > In terms of best-effort, it is reasonable to treat the smuggled bytes
> > > as representing a character outside of your unicode repertoire