Re: mailbox misbehavior with non-ASCII
On 2022-07-29 23:24:57 +, Peter Pearson wrote:
> The following code produces a nonsense result with the input
> described below:
>
> import mailbox
> box = mailbox.Maildir("/home/peter/Temp/temp",create=False)
> x = box.values()[0]
> h = x.get("X-DSPAM-Factors")
> print(type(h))
> #
>
> The output is the desired "str" when the message file contains this:
>
> To: recipi...@example.com
> Message-ID: <123>
> Date: Sun, 24 Jul 2022 15:31:19 +
> Subject: Blah blah
> From: f...@from.com
> X-DSPAM-Factors: a'b
>
> xxx
>
> ... but if the apostrophe in "a'b" is replaced with a
> RIGHT SINGLE QUOTATION MARK, the returned h is of type
> "email.header.Header", and seems to contain inscrutable garbage.

It's not inscrutable to me, but then I remember when RFC 1522 was the
relevant RFC.

Calling h.encode() returns

    =?unknown-8bit?b?YeKAmWI=?=

which is about the best result you can get. The character set is
unknown and the content (when decoded) is the bytes

    61 e2 80 99 62

which is what your file contained (assuming you used UTF-8).

What would be nice is if you could get at that content directly. There
doesn't seem to be a documented method to do that. You can use
h._chunks, but as the _ in the name implies, that's an implementation
detail which might change in future versions (and it's not quite
straightforward either, although consistent with other parts of
Python, I think).

        hp

--
Peter J. Holzer | h...@hjp.at | http://www.hjp.at/
"Story must make more sense than reality."
    -- Charles Stross, "Creative writing challenge!"

--
https://mail.python.org/mailman/listinfo/python-list
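The encoded word mentioned above can be unpacked without touching h._chunks. The following is only a sketch: it assumes the default compat32 policy (which is what produces the Header object here), it uses a made-up two-header message rather than the original file, and the final UTF-8 decoding is a guess about the sender's encoding, exactly as in the message above. It relies on email.header.decode_header, which returns the payload of an encoded word as bytes.

```python
import email
from email.header import Header, decode_header

# Hypothetical message bytes: U+2019 stored as raw UTF-8 (e2 80 99)
# in the header value, with no charset declared anywhere.
raw_message = b"Subject: Blah blah\nX-DSPAM-Factors: a\xe2\x80\x99b\n\nxxx\n"

# message_from_bytes uses the compat32 policy by default.
msg = email.message_from_bytes(raw_message)
h = msg.get("X-DSPAM-Factors")

# The Header object renders itself as an RFC 2047 encoded word.
encoded = h.encode()

# decode_header undoes the base64 step and hands back the original bytes
# together with the (unknown) charset label.
[(raw_value, charset)] = decode_header(encoded)

# Assumption: the bytes are UTF-8. Nothing in the message says so.
text = raw_value.decode("utf-8")

print(type(h).__name__)    # Header
print(encoded)             # =?unknown-8bit?b?YeKAmWI=?=
print(raw_value.hex(" "))  # 61 e2 80 99 62
print(ascii(text))         # 'a\u2019b'
```

If the bytes turn out not to be UTF-8, the last decode step is the only part that needs to change.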
Re: mailbox misbehavior with non-ASCII
> On 30 Jul 2022, at 00:30, Peter Pearson wrote:
>
> The following code produces a nonsense result with the input
> described below:
>
> import mailbox
> box = mailbox.Maildir("/home/peter/Temp/temp",create=False)
> x = box.values()[0]
> h = x.get("X-DSPAM-Factors")
> print(type(h))
> #
>
> The output is the desired "str" when the message file contains this:
>
> To: recipi...@example.com
> Message-ID: <123>
> Date: Sun, 24 Jul 2022 15:31:19 +
> Subject: Blah blah
> From: f...@from.com
> X-DSPAM-Factors: a'b
>
> xxx
>
> ... but if the apostrophe in "a'b" is replaced with a
> RIGHT SINGLE QUOTATION MARK, the returned h is of type
> "email.header.Header", and seems to contain inscrutable garbage.

Include in any bug report the exact bytes that are in the header.
It may not be UTF-8 encoded; it may be Windows cp1252, etc.
The repr of the header bytes will show this.

Barry

> I realize that one should not put non-ASCII characters in
> message headers, but of course I didn't put it there, it
> just showed up, pretty much beyond my control. And I realize
> that when software is given input that breaks the rules, one
> cannot expect optimal results, but I'd think an exception
> would be the right answer.
>
> Is this worth a bug report?
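One way to capture the exact header bytes for a bug report is to skip the email package entirely and look at the file contents as bytes. The sketch below is an illustration, not part of any library: raw_header_lines is a hypothetical helper, the two sample messages are invented, and folded (continuation) header lines are deliberately ignored to keep it short.

```python
import re


def raw_header_lines(message_bytes, name):
    """Return the raw bytes of each header line called `name`.

    Hypothetical helper for illustration; it does not handle folded
    (multi-line) headers.
    """
    # The header block ends at the first blank line (LF or CRLF endings).
    header_block = re.split(rb"\r?\n\r?\n", message_bytes, maxsplit=1)[0]
    prefix = name.lower().encode("ascii") + b":"
    return [
        line
        for line in header_block.splitlines()
        if line.lower().startswith(prefix)
    ]


# Two invented samples: the same character written by a UTF-8 sender
# and by a Windows cp1252 sender.
utf8_sample = b"Subject: Blah blah\nX-DSPAM-Factors: a\xe2\x80\x99b\n\nxxx\n"
cp1252_sample = b"Subject: Blah blah\nX-DSPAM-Factors: a\x92b\n\nxxx\n"

print(repr(raw_header_lines(utf8_sample, "X-DSPAM-Factors")[0]))
# b'X-DSPAM-Factors: a\xe2\x80\x99b'
print(repr(raw_header_lines(cp1252_sample, "X-DSPAM-Factors")[0]))
# b'X-DSPAM-Factors: a\x92b'
```

For a real Maildir message, message_bytes would come from reading the message file in binary mode. The repr makes the difference visible at a glance: three bytes (e2 80 99) for UTF-8, a single byte (92) for cp1252.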
Re: mailbox misbehavior with non-ASCII
On 2022-07-29 at 23:24:57 +, Peter Pearson wrote:

> The following code produces a nonsense result with the input
> described below:
>
> import mailbox
> box = mailbox.Maildir("/home/peter/Temp/temp",create=False)
> x = box.values()[0]
> h = x.get("X-DSPAM-Factors")
> print(type(h))
> #
>
> The output is the desired "str" when the message file contains this:
>
> To: recipi...@example.com
> Message-ID: <123>
> Date: Sun, 24 Jul 2022 15:31:19 +
> Subject: Blah blah
> From: f...@from.com
> X-DSPAM-Factors: a'b
>
> xxx
>
> ... but if the apostrophe in "a'b" is replaced with a
> RIGHT SINGLE QUOTATION MARK, the returned h is of type
> "email.header.Header", and seems to contain inscrutable garbage.
>
> I realize that one should not put non-ASCII characters in
> message headers, but of course I didn't put it there, it
> just showed up, pretty much beyond my control. And I realize
> that when software is given input that breaks the rules, one
> cannot expect optimal results, but I'd think an exception
> would be the right answer.

Be strict in what you send, but generous in what you receive.

I agree that email headers are supposed to be ASCII (RFC 822, 2822, and
now 5322 all say that), but always throwing an exception seems a little
harsh, and arguably (I'm not arguing for or against) breaks backwards
compatibility. At least let the exception contain, in its own
attribute, the inscrutable garbage after the space after the colon and
before the next CR/LF pair.

> Is this worth a bug report?

If nothing else, the documentation could specify or disclaim the
existing behavior.
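The "exception that carries the raw value" idea can be tried out in user code without changing the library. This is a sketch under assumptions, not a proposed patch: NonAsciiHeaderError and get_header_strict are hypothetical names, and the check relies on the default compat32 policy returning an email.header.Header for a header with undeclared 8-bit bytes.

```python
import email
from email.header import Header, decode_header


class NonAsciiHeaderError(ValueError):
    """Hypothetical exception that keeps the offending header bytes."""

    def __init__(self, name, raw_value):
        super().__init__(f"header {name!r} contains undeclared non-ASCII bytes")
        self.name = name
        self.raw_value = raw_value  # header value bytes as found in the message


def get_header_strict(msg, name):
    """Return the header as str, or raise if compat32 gave back a Header."""
    value = msg.get(name)
    if isinstance(value, Header):
        # Recover the original bytes from the encoded-word form.
        raw = b"".join(
            part if isinstance(part, bytes) else part.encode("ascii")
            for part, _charset in decode_header(value.encode())
        )
        raise NonAsciiHeaderError(name, raw)
    return value


# Invented sample messages for the two cases discussed in the thread.
ascii_msg = email.message_from_bytes(b"X-DSPAM-Factors: a'b\n\nxxx\n")
utf8_msg = email.message_from_bytes(b"X-DSPAM-Factors: a\xe2\x80\x99b\n\nxxx\n")

print(get_header_strict(ascii_msg, "X-DSPAM-Factors"))
try:
    get_header_strict(utf8_msg, "X-DSPAM-Factors")
except NonAsciiHeaderError as exc:
    print(exc, exc.raw_value)
```

Callers who prefer the generous behavior can keep calling msg.get() directly, so nothing existing would break.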
Re: mailbox misbehavior with non-ASCII
On 7/29/22 16:24, Peter Pearson wrote:

> ... but if the apostrophe in "a'b" is replaced with a
> RIGHT SINGLE QUOTATION MARK, the returned h is of type
> "email.header.Header", and seems to contain inscrutable garbage.
>
> I'd think an exception would be the right answer.
>
> Is this worth a bug report?

I would say yes.

--
~Ethan~
mailbox misbehavior with non-ASCII
The following code produces a nonsense result with the input
described below:

import mailbox
box = mailbox.Maildir("/home/peter/Temp/temp",create=False)
x = box.values()[0]
h = x.get("X-DSPAM-Factors")
print(type(h))
#

The output is the desired "str" when the message file contains this:

To: recipi...@example.com
Message-ID: <123>
Date: Sun, 24 Jul 2022 15:31:19 +
Subject: Blah blah
From: f...@from.com
X-DSPAM-Factors: a'b

xxx

... but if the apostrophe in "a'b" is replaced with a
RIGHT SINGLE QUOTATION MARK, the returned h is of type
"email.header.Header", and seems to contain inscrutable garbage.

I realize that one should not put non-ASCII characters in
message headers, but of course I didn't put it there, it
just showed up, pretty much beyond my control. And I realize
that when software is given input that breaks the rules, one
cannot expect optimal results, but I'd think an exception
would be the right answer.

Is this worth a bug report?

--
To email me, substitute nowhere->runbox, invalid->com.
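For anyone who wants to reproduce the report without the original mailbox, here is a self-contained sketch. It builds a throwaway Maildir in a temporary directory instead of using the path above, and the message is cut down to two invented headers; header_type_for and TEMPLATE are hypothetical names introduced for the example.

```python
import mailbox
import os
import tempfile

# Minimal invented message; %s is replaced by the quote character bytes.
TEMPLATE = b"Subject: Blah blah\nX-DSPAM-Factors: a%sb\n\nxxx\n"


def header_type_for(quote_bytes):
    """Store one message in a scratch Maildir and report the header's type."""
    with tempfile.TemporaryDirectory() as tmp:
        # create=True only builds the Maildir layout if the path is new,
        # so use a subdirectory that does not exist yet.
        box = mailbox.Maildir(os.path.join(tmp, "box"), create=True)
        box.add(TEMPLATE % quote_bytes)
        message = box.values()[0]
        return type(message.get("X-DSPAM-Factors"))


# Plain ASCII apostrophe: the header comes back as str.
print(header_type_for(b"'"))
# RIGHT SINGLE QUOTATION MARK as raw UTF-8: email.header.Header instead.
print(header_type_for("\u2019".encode("utf-8")))
```

Attaching a script like this to a bug report removes any doubt about what bytes the message file contained.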