Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Ken Hornstein
>Last I looked they use a gigantic chunk of memory in mbstate_t or
>so (128 bytes?).

128 bytes is considered 'gigantic'? :-)

While I am not a huge fan of the POSIX locale functions, thankfully we can
mostly get by without them.  Basically we use iconv() to convert from the
source character set to the native character set, and we have a small
amount of mbtowc() and wcwidth() to handle multibyte character sets and
figure out column width (and really, we only do UTF-8 well).
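
As a rough illustration of that two-step approach (convert to the native
character set, then count columns), here is a minimal Python sketch.  It
is not nmh's code: to_native and display_width are hypothetical names,
Python's codec machinery stands in for iconv(), and
unicodedata.east_asian_width only approximates what wcwidth() reports.

```python
import unicodedata


def to_native(raw: bytes, source_charset: str) -> str:
    """Decode bytes from the source charset (the role iconv() plays)."""
    # Undecodable bytes become U+FFFD instead of raising an exception.
    return raw.decode(source_charset, errors="replace")


def display_width(text: str) -> int:
    """Approximate terminal columns, roughly what wcwidth() would sum to."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks occupy no column of their own
        # Wide and fullwidth characters take two columns, the rest one.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


if __name__ == "__main__":
    shrug = to_native(b"\xc2\xaf\\_(\xe3\x83\x84)_/\xc2\xaf", "utf-8")
    print(shrug, display_width(shrug))
```

Control characters, for which wcwidth() returns -1, are simply counted as
one column here, so treat the result as an estimate.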

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Ken Hornstein
>> What sorry excuse for an MUA are you using over there? :-)
>
>That would be exmh.

Hey, don't drag us fellow exmh users into YOUR mix-up! :-)

I'm puzzled as to the process you use to compose the reply.  Because
if it was being run through mhbuild, there is NO way it should have
ever encoded a '\' as =5C and missed encoding U+30C4.

The ツ in your reply shows up correctly here (because it's being
interpreted as UTF-8) but the leading and trailing macrons get the
"invalid character" glyph.

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Valdis Klētnieks
On Sat, 12 Jun 2021 10:04:36 +0100, Ralph Corderoy said:

> What sorry excuse for an MUA are you using over there?  :-)

That would be exmh.

> And why doesn't it complain at you when it spots the attempt to send
> these transgressions onto the wire?

That's a very good question - I *thought* I fixed that, but obviously
there's still some unicode/utf-8 confusion. It displayed correctly in Ken's
mail, while composing the reply, and in your mail, which is why I didn't
notice it was still broken.

But wait.. there's more..  The =AF=5C screw-up is in the outbound file.

17:38:01 0 [~] grep "can only say" Mail/outbox/41541 | hx
 3E206361 6E206F6E 6C792073 6179203D  41463D35 435F28E3 8384295F 
2F3D4146  *> can only say =AF=5C_(...)_/=AF*
0020 2E0A

But linemode 'show' displays it correctly as well. Why did *that* work here
but you report 

> it doesn't display correctly here when decoded, e.g. the un-QP'd =AF
> isn't valid UTF-8.


Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Steffen Nurpmeso
Ralph Corderoy wrote in
 <20210612103715.a572c21...@orac.inputplus.co.uk>:
 |>> I am aware that some people, for reasons I cannot comprehend, want
 |>> to run in the "C" locale
 |>
 |> I do that, not so much because I want to, but because that's what
 |> happens when no LC_* env variables (nor LANG) exist at all.   That's
 |> me.   I believe you understand that locales aren't exactly first class
 |> objects in NetBSD...  (Or not yet anyway).
 |
 |https://wiki.netbsd.org/tutorials/unicode/ suggests Unicode through
 |UTF-8 is well supported as long as the user sets the appropriate
 |environment variables.  Isn't it just that you choose not to set them?

Last I looked they use a gigantic chunk of memory in mbstate_t or
so (128 bytes?).  Other than that the Citrus project was ..the
first to support locales in (free) Unix?  I think so.  What was
totally missing was support for collation.  Understandable here,
especially strxfrm(3), which uses a terrible algorithm that drives
me up the wall in order to turn some A into a B that can be matched
via strcmp(3).  /ME shivers.  Other than that the w*() interface
is a terrible mess, it does not know about graphemes,
normalization, de-/composing, etc.  Just my one cent.
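
To make the normalization point concrete, a small Python sketch (Python
only because its standard library ships unicodedata; nothing here comes
from NetBSD or Citrus, and same_text is a hypothetical helper).  It shows
that a code-point-at-a-time view, which is all the wide-character
interface offers, sees one user-perceived character as one or two units
depending on the form it arrives in.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(len(composed), len(decomposed))    # code point counts differ
    print(composed == decomposed)            # naive comparison
    print(same_text(composed, decomposed))   # comparison after NFC
```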

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Ralph Corderoy
Hi kre,

> If the draft contained Content-Type, right from the beginning (either
> auto set as part of repl or comp processing, or manually inserted),
> then we wouldn't need to be guessing what charset it was using, would
> we?

Yes, we would need to guess because the Content-Type only describes the
content part following the fields but we need to know the encoding to
read the fields themselves, e.g. non-ASCII runes in email addresses.

But we shouldn't guess, we should use the user's locale to build a
draft, e.g. for repl(1), so the user's locale-aware editor is happy, and
to parse the draft, whether nmh built it or something else.
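
A minimal sketch of that idea in Python, assuming the draft is simply
"fields, blank line, body".  split_draft is a hypothetical helper, not
anything in nmh; the point is only that the charset used to read the
fields is taken from the locale rather than from Content-Type.

```python
import locale


def split_draft(raw: bytes, charset: str):
    """Split a draft into decoded header fields and the undecoded body."""
    head, _, body = raw.partition(b"\n\n")
    fields = []
    for line in head.decode(charset).split("\n"):
        if line[:1] in (" ", "\t") and fields:
            # Continuation line: fold it into the previous field.
            name, value = fields[-1]
            fields[-1] = (name, value + " " + line.strip())
        else:
            name, _, value = line.partition(":")
            fields.append((name, value.strip()))
    return fields, body


if __name__ == "__main__":
    # The charset comes from the user's locale, not from Content-Type.
    charset = locale.getpreferredencoding(False)
    draft = b"To: someone@example.org\nSubject: test\n\nHi.\n"
    print(split_draft(draft, charset))
```

The body is left as bytes on purpose: its charset is what Content-Type
describes, and that may differ from the locale's.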

> > So my thinking is the spool-file's writer will either be something
> > like Postfix which declares support for SMTPUTF8, is handed UTF-8,
> > and AFAICS stores it verbatim,
>
> On my system the spool file is written by /usr/libexec/mail.local
> (which would be invoked by postfix if I used that) but can also be run
> from anything.
>
> While the most common practice is to get to it via the system's MTA,
> it also gets invoked by other things, and will write whatever messages
> they hand it (doing no processing other than inserting the mail spool
> "From ..." separator line and '>' quoting a leading "From " on any
> other line).

Then I think my point stands and it's just that the responsibility
passes upstream of the near-transparent, file-locking mail.local,
whether that's Postfix or not.
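
For reference, the only processing kre describes (a "From ..." separator
line and '>' quoting of a leading "From ") can be sketched in a few lines
of Python.  This is an assumption-laden illustration, not mail.local's
source: spool_entry and its arguments are hypothetical, and locking is
left out.  Every other byte passes through untouched, which is why the
charset question lands on whatever fed it the message.

```python
def spool_entry(sender: bytes, date: bytes, message: bytes) -> bytes:
    """Format one message roughly as a minimal mbox writer might."""
    quoted = [
        # Only a line starting with "From " (note the space) is quoted.
        b">" + line if line.startswith(b"From ") else line
        for line in message.split(b"\n")
    ]
    return b"From " + sender + b" " + date + b"\n" + b"\n".join(quoted)


if __name__ == "__main__":
    msg = "Subject: caf\u00e9\n\nFrom here on it is UTF-8.\n".encode("utf-8")
    print(spool_entry(b"a@example.org", b"Sat Jun 12 10:00:00 2021", msg))
```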

-- 
Cheers, Ralph.



Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Ralph Corderoy
Hi Ken,

> Probably the best way to do that is using mhbuild directives.
> That is, you can today do stuff like:
>
> # [... utf-8 text here ...]
> # [... iso-8859-1 text here ...]
> # [... HTML text here ...]

The input to mhbuild can be that, it's true, though a text editor might
only handle it in the C locale.  And then nmh treats a NUL byte as end
of string, e.g. charset=ucs-2le doesn't work.  Worse than just
truncating the UCS-2LE input, it causes corruption in earlier parts in
this experiment.

$ cat build
#! /bin/bash

(
printf '%s\n' \
'subject: Test.' \
'' \
'Disappears.' \
'#draft
sed -n l draft
echo

cp draft mimed
mhbuild -list -realsize -headers -verbose mimed
echo

sed -n l mimed
$
$ ./build
subject: Test.$
$
Disappears.$
#$
Content-Transfer-Encoding: 8bit$
$
--- =_aa0$
Content-Type: text/plain; charset="UTF-8"$
Content-ID: <21398.162349278...@orac.inputplus.co.uk>$
Content-Transfer-Encoding: 8bit$
$
 ²  ain; charset=iso-8859-1$
Fiat: $ \243$
$
--- =_aa0$
Content-Type: text/plain; charset="ucs-2le"$
Content-ID: <21398.162349278...@orac.inputplus.co.uk>$
$
 ³ $
$
--- =_aa0--$
$ 

1. sed happily displays the NUL bytes in the draft.

2. The ‘Disappears’ part in the draft has vanished.  The Fiat part
starts with part of the preceding directive.  Altering the length of the
UCS-2LE part changes how far back this part erroneously starts;
I suspect some pointer subtraction.

3. All that makes it into the UCS-2LE part is the three spaces which
represent the first three-quarters of the U+2020 dagger and its
following U+0020 space.
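
The third observation can be reproduced outside nmh with a short Python
sketch.  It assumes nothing about mhbuild's internals beyond "a NUL ends
the string"; c_string_view is a hypothetical helper, and Python's
utf-16-le codec is used because it produces the same bytes as UCS-2LE
for these two characters.

```python
def c_string_view(buf: bytes) -> bytes:
    """Return what code treating buf as a NUL-terminated string would see."""
    return buf.split(b"\x00", 1)[0]


if __name__ == "__main__":
    encoded = "\u2020 ".encode("utf-16-le")   # U+2020 dagger, then U+0020
    print(encoded.hex(" "))                   # the four bytes on the wire
    print(c_string_view(encoded))             # what survives the first NUL
```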

This isn't a complaint, just passing on the observation having made the
effort.

-- 
Cheers, Ralph.



Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Ralph Corderoy
Hi Valdis,

Your email was interesting.  Ken wrote

¯\_(ツ)_/¯

which in UTF-8 is

$ hd <<<'¯\_(ツ)_/¯'
  c2 af 5c 5f 28 e3 83 84  29 5f 2f c2 af 0a  |..\_(...)_/...|
000e
$ 

and in Unicode is

$ iconv -f utf-8 -t ucs-4le <<<'¯\_(ツ)_/¯' |
> hexdump -ve '8/4 "% 8x" "\n"'
      af      5c      5f      28    30c4      29      5f      2f
      af       a
$

Your MIME email which quoted it arrived here containing

Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

=AF=5C_(ツ)_/=AF

I think that's faulty.  The initial U+00AF has been QP'd as =AF when it
should be the UTF-8 =C2=AF.  The U+30C4 has been put in as the UTF-8 ツ
without being QP'd at all.

It doesn't display correctly here when decoded, e.g. the un-QP'd =AF
isn't valid UTF-8.
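
For comparison, here is a small Python sketch using the standard quopri
module.  It is not how exmh or mhbuild do it; qp_utf8 and
decodes_as_utf8 are hypothetical helpers.  It shows what a
quoted-printable encoder produces when handed the UTF-8 bytes, and checks
whether a QP body decodes to valid UTF-8.

```python
import quopri


def qp_utf8(text: str) -> bytes:
    """Quoted-printable encode the UTF-8 bytes of text."""
    return quopri.encodestring(text.encode("utf-8"))


def decodes_as_utf8(qp_body: bytes) -> bool:
    """True if the un-QP'd body is valid UTF-8."""
    try:
        quopri.decodestring(qp_body).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(qp_utf8("\u00af\\_(\u30c4)_/\u00af"))
    # The body as it arrived: bare =AF, and U+30C4 left as raw UTF-8.
    print(decodes_as_utf8(b"=AF=5C_(\xe3\x83\x84)_/=AF"))
```

A check like decodes_as_utf8 on the outbound file is the sort of test
that could complain before the message reaches the wire.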

What sorry excuse for an MUA are you using over there?  :-)
And why doesn't it complain at you when it spots the attempt to send
these transgressions onto the wire?

-- 
Cheers, Ralph.



Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Ralph Corderoy
Hi Ken,

> > Complain precisely
>
> Well ... I am not sure this feeling is universal:
>
> https://lists.nongnu.org/archive/html/nmh-workers/2014-04/msg00213.html
> https://lists.nongnu.org/archive/html/nmh-workers/2015-03/msg00045.html

They're about emails which were faulty before they reached the local
system.  I'm talking about the local system's poor configuration causing
corruption and it not going unnoticed because it would flow downstream.

> And ... well, reality, again, rears its ugly head.

As I said in a bit which was snipped:

   ‘I quite agree the current code greatly hinders doing anything about
this.’

-- 
Cheers, Ralph.



Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Valdis Klētnieks
On Fri, 11 Jun 2021 14:04:36 -0400, Ken Hornstein said:

> character.  This obviously works best if your local character set is
> UTF-8.  I am aware that some people, for reasons I cannot comprehend,
> want to run in the "C" locale but PRETEND that their character set
> is UTF-8 and this approach does not work for them.  To these people I
> can only say �\_(ツ)_/�.

I discovered that using LANG=en_US.utf8 but LC_COLLATE=C was the proper
solution, as /bin/ls then outputs files in the order that God intended, not the
creeping bletcherous horror that UTF-8 collation creates. :)
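
What LC_COLLATE=C buys is plain byte-value ordering, which a short Python
sketch can imitate without touching the process locale.  c_collate is a
hypothetical helper; it assumes the file names are UTF-8, and it makes no
claim about how ls is implemented.

```python
def c_collate(names):
    """Sort the way the C locale collates: by raw byte value."""
    return sorted(names, key=lambda name: name.encode("utf-8"))


if __name__ == "__main__":
    for name in c_collate(["b", "A", ".profile", "a", "B", "Mail"]):
        print(name)
```

Dot files and capitals come first, then lowercase.  Passing
locale.strxfrm as the key instead would give the locale's own order,
which depends on which locales are installed, so it is not shown here.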