Re: Bug reported regarding Unicode handling in email address

2021-06-15 Thread Steffen Nurpmeso
Steffen Nurpmeso wrote in <20210615212220.5bc88%stef...@sdaoden.eu>: ... |a minority by very far, and in the 60s an african bishop said "by |the year 2700 the white man will have destroyed live on earth", |and i believed him already when i was young. Actually he said By the year 2700 the

Re: Bug reported regarding Unicode handling in email address

2021-06-15 Thread Steffen Nurpmeso
Good evening. Ken Hornstein wrote in <20210615025348.83822125...@pb-smtp20.pobox.com>: |>kre was coming from a "per draft source character set" i think. |>But of course, application dependent. It is more general than "i |>really need this now to get nmh (or mailx) going". When i went

Re: Bug reported regarding Unicode handling in email address

2021-06-14 Thread Ken Hornstein
>kre was coming from a "per draft source character set" i think. >But of course, application dependent. It is more general than "i >really need this now to get nmh (or mailx) going". When i went >online around 2010 there was a Python member (Murray, who did the >rewrite of the Python mail

Re: Bug reported regarding Unicode handling in email address

2021-06-14 Thread Ken Hornstein
>Out of interest, how about > >pick -from klētnieks -and -search £42 -or -search ₿ I guess I was thinking "convert messages to native character set" while searching. I realize that doesn't cover complete Unicode equivalence; we'd really need ICU or something like that. I can live without

Re: Bug reported regarding Unicode handling in email address

2021-06-14 Thread Steffen Nurpmeso
Ralph Corderoy wrote in <20210614205202.034db21...@orac.inputplus.co.uk>: |Hi Steffen, | |> It is still hard to do with POSIX let alone ISO. You need an UTF-8 |> locale you can actively select, POSIX/ISO functions do not support |> graphemes, and __STDC_ISO_10646__ is an option, so that you

Re: Bug reported regarding Unicode handling in email address

2021-06-14 Thread Ralph Corderoy
Hi Steffen, > It is still hard to do with POSIX let alone ISO. You need an UTF-8 > locale you can actively select, POSIX/ISO functions do not support > graphemes, and __STDC_ISO_10646__ is an option, so that you cannot > simply code some tables on your own to fill the gaps, because looking > at

Re: Bug reported regarding Unicode handling in email address

2021-06-14 Thread Ralph Corderoy
Hi Ken, > I realized back when I was originally looking at i18n issues in nmh we > don't need to perform THAT much work on characters internally. Out of interest, how about pick -from klētnieks -and -search £42 -or -search ₿ -- Cheers, Ralph.

Re: Bug reported regarding Unicode handling in email address

2021-06-14 Thread Steffen Nurpmeso
Ken Hornstein wrote in <20210614165452.a056f120...@pb-smtp20.pobox.com>: |>Sure, convert to Unicode, work in Unicode, convert back, that is |>the way to go. | |I know that this is application dependent, but what "work" do you |need to perform on the characters? | |I realized back when I

Re: Bug reported regarding Unicode handling in email address

2021-06-14 Thread Ken Hornstein
>Sure, convert to Unicode, work in Unicode, convert back, that is >the way to go. I know that this is application dependent, but what "work" do you need to perform on the characters? I realized back when I was originally looking at i18n issues in nmh we don't need to perform THAT much work on

Re: Bug reported regarding Unicode handling in email address

2021-06-14 Thread Steffen Nurpmeso
Steffen Nurpmeso wrote in <20210614162626.vfjxt%stef...@sdaoden.eu>: ... | <20210614121214.84c1621...@orac.inputplus.co.uk>: ... ||Why not iconv(3) the input from the user's locale, the MIME part's ||charset, etc., to UTF-8, work internally, and then iconv() again on the ... |functions do

Re: Bug reported regarding Unicode handling in email address

2021-06-14 Thread Ken Hornstein
>> Just read in and DON'T convert on input; just convert ONCE on output. > >So then the internal strings are varying encodings, including ones with >NUL bytes? Yes. Although it seems like in practice nobody uses encodings that contain NUL bytes. Like I said, fixing that would be tough. --Ken

Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Ken Hornstein
>Last i looked they use a gigantic chunk of memory in mbstate_t or >so (128 byte?). 128 bytes is considered 'gigantic'? :-) While I am not a huge fan of the POSIX locale functions, thankfully we can mostly get by without them. Basically we use iconv() to convert from the source character set to
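
The iconv() conversion step mentioned in this message can be sketched as below. This is a minimal illustration only, not nmh's code: the helper name convert_charset and its parameters are hypothetical, and it assumes the whole input fits in one call and that the caller supplies a large enough output buffer.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not from nmh): convert n bytes of `in` from charset
 * `from` to charset `to`, writing at most outcap bytes to `out`.
 * Returns the number of bytes written, or (size_t) -1 on any failure. */
static size_t convert_charset(const char *from, const char *to,
                              const char *in, size_t n,
                              char *out, size_t outcap)
{
    /* Note the argument order: destination charset first, then source. */
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t) -1)
        return (size_t) -1;

    char *inp = (char *) in;   /* iconv() takes char **, but only reads the input */
    char *outp = out;
    size_t inleft = n;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t) -1)
        /* Flush any pending shift sequence for stateful encodings. */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);

    iconv_close(cd);
    if (rc == (size_t) -1)
        return (size_t) -1;    /* invalid input, truncated input, or out too small */
    return (size_t) (outp - out);
}
```

For example, calling it with from "UTF-8" and to "UCS-4LE" on the three bytes e3 83 84 should yield the four bytes of U+30C4 in little-endian order. Which charset names are accepted depends on the local iconv implementation.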

Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Ken Hornstein
>> What sorry excuse for an MUA are you using over there? :-) > >That would be exmh. Hey, don't drag us fellow exmh users into YOUR mix-up! :-) I'm puzzled as to the process you use to compose the reply. Because if it was being run through mhbuild, there is NO way it should have ever encoded a

Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Valdis Klētnieks
On Sat, 12 Jun 2021 10:04:36 +0100, Ralph Corderoy said: > What sorry excuse for an MUA are you using over there? :-) That would be exmh. > And why doesn't it complain at you when it spots the attempt to send > these transgressions onto the wire? That's a very good question - I *thought* I

Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Steffen Nurpmeso
Ralph Corderoy wrote in <20210612103715.a572c21...@orac.inputplus.co.uk>: |>> I am aware that some people, for reasons I cannot comprehend, want |>> to run in the "C" locale |> |> I do that, not so much because I want to, but because that's what |> happens when no LC_* env variables (nor

Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Ralph Corderoy
Hi kre, > If the draft contained Content-Type, right from the beginning (either > auto set as part of repl or comp processing, or manually inserted), > then we wouldn't need to be guessing what charset it was using, would > we? Yes, we would need to guess because the Content-Type only describes

Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Ralph Corderoy
Hi Ken, > Probably the best way to do that is using mhbuild directives. > That is, you can today do stuff like: > > # [... utf-8 text here ...] > # [... iso-8859-1 text here ...] > # [... HTML text here ...] The input to mhbuild can be that, it's true, though a text editor might only handle it

Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Ralph Corderoy
Hi Valdis, Your email was interesting. Ken wrote ¯\_(ツ)_/¯ which in UTF-8 is $ hd <<<'¯\_(ツ)_/¯' c2 af 5c 5f 28 e3 83 84 29 5f 2f c2 af 0a |..\_(...)_/...| 000e $ and in Unicode is $ iconv -f utf-8 -t ucs-4le <<<'¯\_(ツ)_/¯' | > hexdump -ve '8/4

Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Ralph Corderoy
Hi Ken, > > Complain precisely > > Well ... I am not sure this feeling is universal: > > https://lists.nongnu.org/archive/html/nmh-workers/2014-04/msg00213.html > https://lists.nongnu.org/archive/html/nmh-workers/2015-03/msg00045.html They're about emails which were faulty before they reached

Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Valdis Klētnieks
On Fri, 11 Jun 2021 14:04:36 -0400, Ken Hornstein said: > character. This obviously works best if your local character set is > UTF-8. I am aware that some people, for reasons I cannot comprehend, > want to run in the "C" locale but PRETEND that their character set > is UTF-8 and this approach

Re: Bug reported regarding Unicode handling in email address

2021-06-11 Thread Ken Hornstein
>> And then, to get back to my original point ... if we see an 8-bit >> character that is not valid in the current character set, what, >> exactly, should we do about it? > >Complain precisely, e.g. pathname, line number, column, encoding >expected, byte(s) seen. I'd expect an nmh user to want to
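
A "complain precisely" diagnostic of the kind quoted above (pathname, line number, column, encoding expected, byte seen) might look roughly like this. It is a sketch under a simplifying assumption: the expected charset is US-ASCII, so any byte with the high bit set counts as invalid. The function names and the message format are hypothetical, not nmh's.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: locate the first byte >= 0x80 in buf.
 * On success stores the 1-based line and byte column and the byte itself,
 * and returns 1. Returns 0 if the buffer is pure 7-bit. */
static int find_first_8bit(const char *buf, size_t n,
                           size_t *line, size_t *col, unsigned char *byte)
{
    size_t ln = 1, cl = 1;

    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char) buf[i];
        if (c >= 0x80) {
            *line = ln;
            *col = cl;
            *byte = c;
            return 1;
        }
        if (c == '\n') {
            ln++;
            cl = 1;
        } else {
            cl++;
        }
    }
    return 0;
}

/* Hypothetical helper: format a diagnostic into msg.
 * Returns 1 if a complaint was written, 0 if there was nothing to report. */
static int describe_8bit(char *msg, size_t msgcap, const char *path,
                         const char *expected, const char *buf, size_t n)
{
    size_t line, col;
    unsigned char byte;

    if (!find_first_8bit(buf, n, &line, &col, &byte))
        return 0;
    snprintf(msg, msgcap, "%s:%zu:%zu: byte 0x%02X is not valid %s",
             path, line, col, (unsigned) byte, expected);
    return 1;
}
```

The column here counts bytes, not characters or graphemes, which sidesteps the question of how to count columns in text whose encoding is the very thing in doubt.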

Re: Bug reported regarding Unicode handling in email address

2021-06-11 Thread Ken Hornstein
> | to automatically run "mhbuild" on all drafts because nmh users had > >I recall that happening I think - I suspect it never was an issue for >me, as I have (still have, and have had for a LONG time) the -mime >switch for send (and push) in mh profile. Ah, I had to look that up. I don't think

Re: Bug reported regarding Unicode handling in email address

2021-06-11 Thread Robert Elz
Date: Fri, 11 Jun 2021 14:04:36 -0400 From: Ken Hornstein Message-ID: <20210611180437.3b854c2...@pb-smtp1.pobox.com> | As I understand your question ... no, that is not true (with a few caveats). I believe you understood the question correctly! Thanks. | We

Re: Bug reported regarding Unicode handling in email address

2021-06-11 Thread Ken Hornstein
>That actually brings up one point I have wondered about, and which might >help here - my recollection (it has been a long time since I tested this, >so things might have changed) is that nmh doesn't like receiving drafts >with MIME fields in the header (including particularly for right now) a

Re: Bug reported regarding Unicode handling in email address

2021-06-11 Thread Robert Elz
Date: Thu, 10 Jun 2021 18:16:42 -0400 From: Ken Hornstein Message-ID: <20210610221648.a1cd0c9...@pb-smtp2.pobox.com> | I feel compelled to point out that when we find 8-bit characters we use | the user's locale to find the character set to construct the appropriate

Re: Bug reported regarding Unicode handling in email address

2021-06-11 Thread Ralph Corderoy
Hi kre, I've reordered the quotes... > - /var/spool/$LOGNAME is in UTF-8. > > Says who? I think for me it is in whatever mixture of char encodings > that were used by the various senders of the messages that are there. To be clear, we're talking about the use of UTF-8 in fields after SMTPUTF8

Re: Bug reported regarding Unicode handling in email address

2021-06-11 Thread Ralph Corderoy
Hi Ken, > > But my point stands. nmh should know from the context where the > > email address appears what encoding the bytes use when trying to > > parse it. > > > > - mail/inbox/42 was written by us; it's our choice. > > - mail/draft is the process's locale. > > - /var/spool/$LOGNAME is in

Re: Bug reported regarding Unicode handling in email address

2021-06-10 Thread Ken Hornstein
>Probably, but which process? How do we know what created it? There's >no requirement that it be sent any time soon after it was composed - with >just the draft file there's not a lot of leeway, but we support drafts in >a folder, and there there can be lots waiting to be sent. My drafts/1

Re: Bug reported regarding Unicode handling in email address

2021-06-10 Thread Robert Elz
Date: Thu, 10 Jun 2021 11:31:10 +0100 From: Ralph Corderoy Message-ID: <20210610103110.c017721...@orac.inputplus.co.uk> | - mail/inbox/42 was written by us; it's our choice. For me, it would be written by procmail (mostly) and it will be unaltered from what was in

Re: Bug reported regarding Unicode handling in email address

2021-06-10 Thread Ken Hornstein
>But my point stands. nmh should know from the context where the email >address appears what encoding the bytes use when trying to parse it. > >- mail/inbox/42 was written by us; it's our choice. >- mail/draft is the process's locale. >- /var/spool/$LOGNAME is in UTF-8. Right, but ... reality

Re: Bug reported regarding Unicode handling in email address

2021-06-07 Thread Ken Hornstein
>> The address parser code is used for a lot of things. The specific bug >> report was about a draft message that contained Cyrillic characters. >> We know what that character set was in THAT case, because it's a draft >> message and we can derive the locale from the environment or the nmh >>

Re: Bug reported regarding Unicode handling in email address

2021-06-07 Thread Tom Lane
Ralph Corderoy writes: > U+0081 as 0x81 is ‘is a character representable as an unsigned char’ for > it's a character, U+0081, and unsigned char holds [0, 0x100) so it > suffers no loss of representation as an unsigned char. Sure, but then what you are feeding the function is *not* UTF8. UTF8

Re: Bug reported regarding Unicode handling in email address

2021-06-07 Thread Ralph Corderoy
Hi Ken, > I am wondering if the simplest solution is to put in isascii() in > front of those tests in that function. We only really care about > those tests returning "true" for ASCII characters. Thoughts? Just some tests, really, on Linux of different locales. One multibyte, the other two

Re: Bug reported regarding Unicode handling in email address

2021-06-07 Thread Ralph Corderoy
Hi Tom, > Anyway, interpreting the input as a Unicode code point, for values > above U+7F (or, if you stretch it unreasonably, U+FF) is very clearly > outside the spec. I'm not sure it is. An unwise design choice by 4.4BSD, yes. U+0081 as 0x81 is ‘is a character representable as an unsigned

Re: Bug reported regarding Unicode handling in email address

2021-06-03 Thread Bob Carragher
On Wed, 02 Jun 2021 17:47:42 -0400 Ken Hornstein sez: > >It's early morning for me, and I'm still at least a liter of Diet Mountain > >Dew > >away from being sufficiently caffeinated to be positive, but that looks like > >"not totally correct, but a lot closer than what we have now". > > > >In

Re: Bug reported regarding Unicode handling in email address

2021-06-02 Thread Ken Hornstein
>It's early morning for me, and I'm still at least a liter of Diet Mountain Dew >away from being sufficiently caffeinated to be positive, but that looks like >"not totally correct, but a lot closer than what we have now". > >In particular, that will accept overlong and illegal utf-8 codepoints,
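
For reference, a stricter check that rejects the overlong and illegal sequences mentioned in this message could be sketched as follows. This is not the patch under discussion in the thread. The helper name utf8_valid_len is hypothetical, and the byte ranges follow the well-formed UTF-8 table in the Unicode Standard and RFC 3629.

```c
#include <stddef.h>

/* Hypothetical helper: return the length in bytes (1..4) of the well-formed
 * UTF-8 sequence starting at s, or 0 if the first bytes of s (at most n)
 * do not form one. */
static size_t utf8_valid_len(const unsigned char *s, size_t n)
{
    if (n == 0)
        return 0;

    unsigned char b0 = s[0];

    if (b0 < 0x80)                      /* plain ASCII */
        return 1;

    if (b0 >= 0xC2 && b0 <= 0xDF) {     /* 2 bytes; 0xC0 and 0xC1 are always overlong */
        if (n < 2 || (s[1] & 0xC0) != 0x80)
            return 0;
        return 2;
    }

    if (b0 >= 0xE0 && b0 <= 0xEF) {     /* 3 bytes */
        if (n < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
            return 0;
        if (b0 == 0xE0 && s[1] < 0xA0)  /* overlong encoding of U+0000..U+07FF */
            return 0;
        if (b0 == 0xED && s[1] > 0x9F)  /* UTF-16 surrogates U+D800..U+DFFF */
            return 0;
        return 3;
    }

    if (b0 >= 0xF0 && b0 <= 0xF4) {     /* 4 bytes */
        if (n < 4 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80 ||
            (s[3] & 0xC0) != 0x80)
            return 0;
        if (b0 == 0xF0 && s[1] < 0x90)  /* overlong encoding of U+0000..U+FFFF */
            return 0;
        if (b0 == 0xF4 && s[1] > 0x8F)  /* beyond U+10FFFF */
            return 0;
        return 4;
    }

    return 0;                           /* stray continuation byte, or 0xF5..0xFF */
}
```

A caller can walk a buffer by advancing by the returned length and treating 0 as the position of the first malformed byte.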

Re: Bug reported regarding Unicode handling in email address

2021-06-02 Thread Valdis Klētnieks
On Wed, 02 Jun 2021 00:13:51 -0400, Ken Hornstein said: > So this bug was reported yesterday: > > https://savannah.nongnu.org/bugs/?60713 > I am wondering if the simplest solution is to put in isascii() in front > of those tests in that function. We only really care about those tests >

Re: Bug reported regarding Unicode handling in email address

2021-06-02 Thread Ken Hornstein
>You need to read a bit further down, where POSIX says > >The c argument is an int, the value of which the application shall >ensure is representable as an unsigned char or equal to the value of >the macro EOF. If the argument has any other value, the behavior is >undefined. Oof,

Re: Bug reported regarding Unicode handling in email address

2021-06-02 Thread Tom Lane
Ken Hornstein writes: >> The macros are just fundamentally broken in any locale that >> has multibyte characters: you cannot squeeze a multibyte character >> into an input that is supposed to be either an "unsigned char" or EOF. >> Vendors can choose either to violate the spec (say, by

Re: Bug reported regarding Unicode handling in email address

2021-06-02 Thread David Levine
Ken wrote: > But it sounds like to me that everyone is on board with sprinkling in > some isascii() calls there where it makes sense. +1 David
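
The "sprinkle in isascii()" idea agreed on here can be illustrated with a pair of guarded wrappers. This is a sketch, not the change made to sbr/mf.c: the wrapper names are hypothetical, and an explicit range test stands in for isascii(), which is not part of ISO C.

```c
#include <ctype.h>

/* Hypothetical wrappers (not from sbr/mf.c). Each classifies one byte and
 * only consults the locale-dependent <ctype.h> function when the byte is
 * 7-bit ASCII, so the bytes of a multibyte UTF-8 sequence (all >= 0x80)
 * are never reported as control or space characters. */
static int ascii_iscntrl(int c)
{
    unsigned char uc = (unsigned char) c;
    return uc < 0x80 && iscntrl(uc) != 0;
}

static int ascii_isspace(int c)
{
    unsigned char uc = (unsigned char) c;
    return uc < 0x80 && isspace(uc) != 0;
}
```

With these, a byte such as 0x81 inside a UTF-8 sequence is never classified as a control character, whatever the current locale's iscntrl() would say about it.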

Re: Bug reported regarding Unicode handling in email address

2021-06-02 Thread Ken Hornstein
>The macros are just fundamentally broken in any locale that >has multibyte characters: you cannot squeeze a multibyte character >into an input that is supposed to be either an "unsigned char" or EOF. >Vendors can choose either to violate the spec (say, by interpreting >the "int" input as a

Re: Bug reported regarding Unicode handling in email address

2021-06-01 Thread Tom Lane
Ken Hornstein writes: > So, it seems like the behavior of iscntrl() and isspace() if the value > is > 127 is undefined. If you're in the UTF-8 locale MacOS X treats that > as a Unicode codepoint. But we are NOT treating it like that in this case; > we're processing it on a

Bug reported regarding Unicode handling in email address

2021-06-01 Thread Ken Hornstein
So this bug was reported yesterday: https://savannah.nongnu.org/bugs/?60713 And I kind of thought we got this mostly right! So I dug into it a bit. It turns out the problem is WAY down in the address parser. Specifically it is here, in sbr/mf.c:my_lex() if (iscntrl ((unsigned