Re: mailbox misbehavior with non-ASCII

2022-07-30 Thread Peter J. Holzer
On 2022-07-29 23:24:57 +, Peter Pearson wrote:
> The following code produces a nonsense result with the input 
> described below:
> 
> import mailbox
> box = mailbox.Maildir("/home/peter/Temp/temp",create=False)
> x = box.values()[0]
> h = x.get("X-DSPAM-Factors")
> print(type(h))
> # 
> 
> The output is the desired "str" when the message file contains this:
> 
> To: recipi...@example.com
> Message-ID: <123>
> Date: Sun, 24 Jul 2022 15:31:19 +
> Subject: Blah blah
> From: f...@from.com
> X-DSPAM-Factors: a'b
> 
> xxx
> 
> ... but if the apostrophe in "a'b" is replaced with a
> RIGHT SINGLE QUOTATION MARK, the returned h is of type 
> "email.header.Header", and seems to contain inscrutable garbage.

It's not inscrutable to me, but then I remember when RFC 1522 was the
relevant RFC.

Calling h.encode() returns

=?unknown-8bit?b?YeKAmWI=?=

which is about the best result you can get. The character set is unknown
and the content (when decoded) is the bytes

61 e2 80 99 62

which is what your file contained (assuming you used UTF-8).
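
To check those bytes by hand, here is a minimal sketch that takes the encoded word above apart using only the standard base64 module. The variable names are mine, and the last step assumes the file really was UTF-8:

```python
import base64

encoded_word = "=?unknown-8bit?b?YeKAmWI=?="

# RFC 2047 encoded-word layout: =?charset?encoding?payload?=
_, charset, encoding, payload, _ = encoded_word.split("?")
assert encoding.lower() == "b"  # "b" means the payload is base64

raw = base64.b64decode(payload)
hex_dump = " ".join(f"{byte:02x}" for byte in raw)

print(charset)   # unknown-8bit
print(hex_dump)  # 61 e2 80 99 62

# Assumption: the original file was UTF-8. If so, this recovers
# a, RIGHT SINGLE QUOTATION MARK (U+2019), b.
text = raw.decode("utf-8")
```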

What would be nice is if you could get at that content directly. There
doesn't seem to be a documented method to do that. You can use h._chunks,
but as the _ in the name implies, that's an implementation detail which
might change in future versions (and it's not quite straightforward
either, although consistent with other parts of Python, I think).
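
One route that avoids h._chunks, offered as a sketch rather than a recommendation, is to feed the result of h.encode() back through email.header.decode_header(). The message here is built in memory with email.message_from_bytes() as a stand-in for the Maildir file, and the sample bytes assume UTF-8 input:

```python
import email
from email.header import Header, decode_header

# Stand-in for the Maildir message file (assumed UTF-8 in the header).
raw_message = (
    b"Subject: Blah blah\n"
    b"X-DSPAM-Factors: a\xe2\x80\x99b\n"
    b"\n"
    b"xxx\n"
)

msg = email.message_from_bytes(raw_message)
h = msg.get("X-DSPAM-Factors")
is_header = isinstance(h, Header)  # True: a Header object, not a str

# encode() gives the RFC 2047 form; decode_header() turns that back
# into (bytes, charset) pairs without converting the character set.
encoded = h.encode()
pairs = decode_header(encoded)
header_bytes = b"".join(chunk for chunk, _charset in pairs)

print(encoded)
print(header_bytes)
```

Whether to decode header_bytes as UTF-8, cp1252 or something else is still a guess the caller has to make.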

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: mailbox misbehavior with non-ASCII

2022-07-30 Thread Barry


> On 30 Jul 2022, at 00:30, Peter Pearson  wrote:
> 
> The following code produces a nonsense result with the input 
> described below:
> 
> import mailbox
> box = mailbox.Maildir("/home/peter/Temp/temp",create=False)
> x = box.values()[0]
> h = x.get("X-DSPAM-Factors")
> print(type(h))
> # 
> 
> The output is the desired "str" when the message file contains this:
> 
> To: recipi...@example.com
> Message-ID: <123>
> Date: Sun, 24 Jul 2022 15:31:19 +
> Subject: Blah blah
> From: f...@from.com
> X-DSPAM-Factors: a'b
> 
> xxx
> 
> ... but if the apostrophe in "a'b" is replaced with a
> RIGHT SINGLE QUOTATION MARK, the returned h is of type 
> "email.header.Header", and seems to contain inscrutable garbage.

Include in any bug report the exact bytes that are in the header.
It may not be UTF-8 encoded; it may be Windows cp1252, etc.
The repr of the header bytes will show this.
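
A rough sketch of how to get that repr without going through the email package at all. raw_header_value() is a made-up helper, not a library function, and it ignores folded (multi-line) headers. Both sample messages are invented to show how UTF-8 and cp1252 would differ:

```python
def raw_header_value(raw_message, name):
    """Return the undecoded bytes of the first header called `name`, or None.

    Hypothetical helper: handles single-line headers only.
    """
    prefix = name.lower() + b":"
    for line in raw_message.splitlines():
        if not line:  # a blank line ends the header block
            break
        if line.lower().startswith(prefix):
            return line[len(prefix):].strip()
    return None


# The same visible text, "a", RIGHT SINGLE QUOTATION MARK, "b",
# stored in two different encodings.
utf8_message = b"Message-ID: <123>\nX-DSPAM-Factors: a\xe2\x80\x99b\n\nxxx\n"
cp1252_message = b"Message-ID: <123>\nX-DSPAM-Factors: a\x92b\n\nxxx\n"

utf8_repr = repr(raw_header_value(utf8_message, b"X-DSPAM-Factors"))
cp1252_repr = repr(raw_header_value(cp1252_message, b"X-DSPAM-Factors"))

print(utf8_repr)    # b'a\xe2\x80\x99b'
print(cp1252_repr)  # b'a\x92b'
```

For a real report, read the message file in binary mode and pass its contents in place of the sample bytes.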

Barry

> 
> I realize that one should not put non-ASCII characters in
> message headers, but of course I didn't put it there, it
> just showed up, pretty much beyond my control.  And I realize
> that when software is given input that breaks the rules, one
> cannot expect optimal results, but I'd think an exception
> would be the right answer.
> 
> Is this worth a bug report?
> 
> -- 
> To email me, substitute nowhere->runbox, invalid->com.



Re: mailbox misbehavior with non-ASCII

2022-07-29 Thread 2QdxY4RzWzUUiLuE
On 2022-07-29 at 23:24:57 +,
Peter Pearson  wrote:

> The following code produces a nonsense result with the input 
> described below:
> 
> import mailbox
> box = mailbox.Maildir("/home/peter/Temp/temp",create=False)
> x = box.values()[0]
> h = x.get("X-DSPAM-Factors")
> print(type(h))
> # 
> 
> The output is the desired "str" when the message file contains this:
> 
> To: recipi...@example.com
> Message-ID: <123>
> Date: Sun, 24 Jul 2022 15:31:19 +
> Subject: Blah blah
> From: f...@from.com
> X-DSPAM-Factors: a'b
> 
> xxx
> 
> ... but if the apostrophe in "a'b" is replaced with a
> RIGHT SINGLE QUOTATION MARK, the returned h is of type 
> "email.header.Header", and seems to contain inscrutable garbage.
> 
> I realize that one should not put non-ASCII characters in
> message headers, but of course I didn't put it there, it
> just showed up, pretty much beyond my control.  And I realize
> that when software is given input that breaks the rules, one
> cannot expect optimal results, but I'd think an exception
> would be the right answer.

Be strict in what you send, but generous in what you receive.

I agree that email headers are supposed to be ASCII (RFC 822, 2822, and
now 5322 all say that), but always throwing an exception seems a little
harsh, and arguably (I'm not arguing for or against) breaks backwards
compatibility.  At least let the exception contain, in its own
attribute, the inscrutable garbage after the space after the colon and
before the next CR/LF pair.
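
For illustration only, a sketch of what such an exception could look like if done in user code today. NonAsciiHeaderError and get_header_strict() are invented names, not part of the email or mailbox modules, and the message is parsed with email.message_from_bytes() rather than read from a Maildir:

```python
import email
from email.header import Header, decode_header


class NonAsciiHeaderError(ValueError):
    """Hypothetical exception carrying the offending header bytes."""

    def __init__(self, name, raw_value):
        super().__init__(f"header {name!r} contains non-ASCII bytes")
        self.name = name
        self.raw_value = raw_value  # the undecoded bytes of the header value


def get_header_strict(msg, name):
    """Like msg.get(name), but raise instead of returning a Header object."""
    value = msg.get(name)
    if isinstance(value, Header):
        raw = b"".join(
            chunk if isinstance(chunk, bytes) else chunk.encode("ascii")
            for chunk, _charset in decode_header(value.encode())
        )
        raise NonAsciiHeaderError(name, raw)
    return value


good = email.message_from_bytes(b"X-DSPAM-Factors: a'b\n\nxxx\n")
bad = email.message_from_bytes(b"X-DSPAM-Factors: a\xe2\x80\x99b\n\nxxx\n")

good_value = get_header_strict(good, "X-DSPAM-Factors")  # plain str
try:
    get_header_strict(bad, "X-DSPAM-Factors")
    caught = None
except NonAsciiHeaderError as exc:
    caught = exc.raw_value  # the bytes are kept for inspection
```

Callers that want the lenient behavior can keep using msg.get() unchanged, so nothing existing breaks.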

> Is this worth a bug report?

If nothing else, the documentation could specify or disclaim the
existing behavior.


Re: mailbox misbehavior with non-ASCII

2022-07-29 Thread Ethan Furman

On 7/29/22 16:24, Peter Pearson wrote:


> ... but if the apostrophe in "a'b" is replaced with a
> RIGHT SINGLE QUOTATION MARK, the returned h is of type
> "email.header.Header", and seems to contain inscrutable garbage.
>
> I'd think an exception would be the right answer.
>
> Is this worth a bug report?

I would say yes.

--
~Ethan~


mailbox misbehavior with non-ASCII

2022-07-29 Thread Peter Pearson
The following code produces a nonsense result with the input 
described below:

import mailbox
box = mailbox.Maildir("/home/peter/Temp/temp",create=False)
x = box.values()[0]
h = x.get("X-DSPAM-Factors")
print(type(h))
# 

The output is the desired "str" when the message file contains this:

To: recipi...@example.com
Message-ID: <123>
Date: Sun, 24 Jul 2022 15:31:19 +
Subject: Blah blah
From: f...@from.com
X-DSPAM-Factors: a'b

xxx

... but if the apostrophe in "a'b" is replaced with a
RIGHT SINGLE QUOTATION MARK, the returned h is of type 
"email.header.Header", and seems to contain inscrutable garbage.

I realize that one should not put non-ASCII characters in
message headers, but of course I didn't put it there, it
just showed up, pretty much beyond my control.  And I realize
that when software is given input that breaks the rules, one
cannot expect optimal results, but I'd think an exception
would be the right answer.

Is this worth a bug report?

-- 
To email me, substitute nowhere->runbox, invalid->com.