Re: Bug reported regarding Unicode handling in email address

2021-06-15 Thread Steffen Nurpmeso
Steffen Nurpmeso wrote in
 <20210615212220.5bc88%stef...@sdaoden.eu>:
 ...
 |a minority by very far, and in the 60s an african bishop said "by
 |the year 2700 the white man will have destroyed life on earth",
 |and i believed him already when i was young.

Actually he said

  By the year 2700 the white man will have destroyed life on
  earth, and then the time of the Africans begins.

To clarify that.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Bug reported regarding Unicode handling in email address

2021-06-15 Thread Steffen Nurpmeso
Good evening.

Ken Hornstein wrote in
 <20210615025348.83822125...@pb-smtp20.pobox.com>:
 |>kre was coming from a "per draft source character set" i think.
 |>But of course, application dependent.  It is more general than "i
 |>really need this now to get nmh (or mailx) going".  When i went
 |>online around 2010 there was a Python member (Murray, who did the
 |>rewrite of the Python mail engine) who was (or is) an nmh user, as
 |>he said.  I looked at nmh but i think it could not even do MIME by then?
 |
 |It depends on what you mean.  nmh, back when it was MH, could do MIME,
 |but the support was ... not wonderful, if you wanted to deal with modern
 |messages.  It is better now.
 |
 |>But even with only columns there are problems, like bidi.
 |
 |In one sense, we kinda don't have to deal with this because we just feed
 |our output into a pager.  Probably in theory we could do better, but we
 |have a ways to go until our MIME support is good enough to deal.
 |
 |>For serialization you are surely right.  This imposes a conversion
 |>back and forth to wchar_t with POSIX interface, then.  And you
 |>have already lost the performance battle.
 |
 |Are you running on super slow machines?  I can't really imagine that
 |really impacting performance in any measurable way.

We are going around in circles, and this is obviously off-topic for the nmh list.

 |>So UTF-8 pretties up stuff for an all-american or all-english view.
 |
 |Well, guilty on both counts, but I think if you're concerned about size
 |then it probably is more compact for most European languages as well.

Well, it is cool; it would have been nice if it had been adopted
more widely shortly after it was invented.  C90 Amendment 1 would
have been great, as Plan9 shows.  Despite that, the number is
a minority by very far, and in the 60s an african bishop said "by
the year 2700 the white man will have destroyed life on earth",
and i believed him already when i was young.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Bug reported regarding Unicode handling in email address

2021-06-14 Thread Ken Hornstein
>kre was coming from a "per draft source character set" i think.
>But of course, application dependent.  It is more general than "i
>really need this now to get nmh (or mailx) going".  When i went
>online around 2010 there was a Python member (Murray, who did the
>rewrite of the Python mail engine) who was (or is) an nmh user, as
>he said.  I looked at nmh but i think it could not even do MIME by then?

It depends on what you mean.  nmh, back when it was MH, could do MIME,
but the support was ... not wonderful, if you wanted to deal with modern
messages.  It is better now.

>But even with only columns there are problems, like bidi.

In one sense, we kinda don't have to deal with this because we just feed
our output into a pager.  Probably in theory we could do better, but we
have a ways to go until our MIME support is good enough to deal.

>For serialization you are surely right.  This imposes a conversion
>back and forth to wchar_t with POSIX interface, then.  And you
>have already lost the performance battle.

Are you running on super slow machines?  I can't really imagine that
really impacting performance in any measurable way.

>So UTF-8 pretties up stuff for an all-american or all-english view.

Well, guilty on both counts, but I think if you're concerned about size
then it probably is more compact for most European languages as well.

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-14 Thread Ken Hornstein
>Out of interest, how about
>
>pick -from klētnieks -and -search £42 -or -search ₿

I guess I was thinking "convert messages to native character set" while
searching.  I realize that doesn't cover complete Unicode equivalence;
we'd really need ICU or something like that.  I can live without that
at least for now.

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-14 Thread Steffen Nurpmeso
Ralph Corderoy wrote in
 <20210614205202.034db21...@orac.inputplus.co.uk>:
 |Hi Steffen,
 |
 |> It is still hard to do with POSIX let alone ISO.  You need an UTF-8
 |> locale you can actively select, POSIX/ISO functions do not support
 |> graphemes, and __STDC_ISO_10646__ is an option, so that you cannot
 |> simply code some tables on your own to fill the gaps, because looking
 |> at the wchar_t codepoints may not give you a Unicode "codepoint"
 |> (though maybe all do it like that so in practice you could make this a
 |> precondition).
 |
 |Thanks for the detail, I see your point now.

Oh, i was almost a hundred percent sure you saw it before already ;)

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Bug reported regarding Unicode handling in email address

2021-06-14 Thread Ralph Corderoy
Hi Steffen,

> It is still hard to do with POSIX let alone ISO.  You need an UTF-8
> locale you can actively select, POSIX/ISO functions do not support
> graphemes, and __STDC_ISO_10646__ is an option, so that you cannot
> simply code some tables on your own to fill the gaps, because looking
> at the wchar_t codepoints may not give you a Unicode "codepoint"
> (though maybe all do it like that so in practice you could make this a
> precondition).

Thanks for the detail, I see your point now.

-- 
Cheers, Ralph.



Re: Bug reported regarding Unicode handling in email address

2021-06-14 Thread Ralph Corderoy
Hi Ken,

> I realized back when I was originally looking at i18n issues in nmh we
> don't need to perform THAT much work on characters internally.

Out of interest, how about

pick -from klētnieks -and -search £42 -or -search ₿

-- 
Cheers, Ralph.



Re: Bug reported regarding Unicode handling in email address

2021-06-14 Thread Steffen Nurpmeso
Ken Hornstein wrote in
 <20210614165452.a056f120...@pb-smtp20.pobox.com>:
 |>Sure, convert to Unicode, work in Unicode, convert back, that is
 |>the way to go.
 |
 |I know that this is application dependent, but what "work" do you
 |need to perform on the characters?
 |
 |I realized back when I was originally looking at i18n issues in nmh we
 |don't need to perform THAT much work on characters internally.  We DO
 |do some work when it comes to calculating character width in the format
 |engine, but that's all in the native character set.  So I realized that
 |at least for nmh, there's no advantage to converting to Unicode/UTF-8
 |internally, and a number of disadvantages; like you say, the xlocale
 |functions are non-portable and you can't really get there with the
 |existing POSIX APIs.  Converting internally to Unicode would force you
 |to depend on something like ICU.

kre was coming from a "per draft source character set" i think.
But of course, application dependent.  It is more general than "i
really need this now to get nmh (or mailx) going".  When i went
online around 2010 there was a Python member (Murray, who did the
rewrite of the Python mail engine) who was (or is) an nmh user, as
he said.  I looked at nmh but i think it could not even do MIME by then?
Granted i do not know much of nmh.  You definitely need is-space
for line break detection, and if you do the visualization yourself
you need is-print or is-control etc.  Etc etc, you know that at
least as well as i do.  'Just saying.

But even with only columns there are problems, like bidi.
I added "headline-bidi" "support"

  In general setting this variable will cause Mailx to encapsulate
  text fields that may occur when displaying headline[435] (and some
  other fields, like dynamic expansions in prompt[517]) with special
  Unicode control sequences; it is possible to fine-tune the terminal
  support level by assigning a value: no value (or any value other
  than ‘1’, ‘2’ and ‘3’) will make Mailx assume that the terminal is
  capable to properly deal with Unicode version 6.3, in which case
  text is embedded in a pair of U+2068 (FIRST STRONG ISOLATE) and
  U+2069 (POP DIRECTIONAL ISOLATE) characters.  In addition no space
  on the line is reserved for these characters.

  Weaker support is chosen by using the value ‘1’ (Unicode 6.3, but
  reserve the room of two spaces for writing the control sequences
  onto the line).  The values ‘2’ and ‘3’ select Unicode 1.1 support
  (U+200E, LEFT-TO-RIGHT MARK); the latter again reserves room for
  two spaces in addition.
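
As a rough illustration of what the strongest mode amounts to, with the
isolate characters hard-coded as UTF-8 byte sequences (a sketch only,
not the actual Mailx code):

#include <stdio.h>

/* Sketch only: wrap a UTF-8 text field in FIRST STRONG ISOLATE /
 * POP DIRECTIONAL ISOLATE so embedded right-to-left text cannot
 * disturb the surrounding headline layout (Unicode 6.3 terminals). */
static void
print_isolated(const char *field)
{
    fputs("\xE2\x81\xA8", stdout);   /* U+2068 FIRST STRONG ISOLATE */
    fputs(field, stdout);
    fputs("\xE2\x81\xA9", stdout);   /* U+2069 POP DIRECTIONAL ISOLATE */
}

int
main(void)
{
    print_isolated("أحمد المحمودي");
    putchar('\n');
    return 0;
}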

but it is no good here (st(1)).  Best is without it :)

  From steffen Tue Jun  3 13:42:08 2014
  Date: Tue, 03 Jun 2014 13:42:08 +0200
  From: =?utf-8?B?2KPYrdmF2K8g2KfZhNmF2K3ZhdmI2K/Zig==?= 
  To: =?utf-8?B?2KPYrdmF2K8g2KfZhNmF2K3ZhdmI2K/Zig==?=
  Subject: =?utf-8?B?2KPYrdmF2K8g2KfZhNmF2K3ZhdmI2K/Zig==?=
  MIME-Version: 1.0
  Content-Type: multipart/mixed;
   boundary="=_01401795729=-WIIWUCvp3AwFMhX+fbN+aN6QsACHfW=_"
  Status: R

  This is a multi-part message in MIME format.

  --=_01401795729=-WIIWUCvp3AwFMhX+fbN+aN6QsACHfW=_
  Content-Type: text/plain; charset=UTF-8
  Content-Transfer-Encoding: 8bit
  Content-Disposition: inline

  أحمد المحمودي.

  --=_01401795729=-WIIWUCvp3AwFMhX+fbN+aN6QsACHfW=_
  Content-Type: text/plain; charset=UTF-8
  Content-Transfer-Encoding: 8bit
  Content-Disposition: attachment;
   filename="أحمد المحمودي.txt"

  أحمد المحمودي.

  --=_01401795729=-WIIWUCvp3AwFMhX+fbN+aN6QsACHfW=_--

Nah, *really* proper internationalization is a very complicated
thing, but it seems you can get away with only slightly touching
this in an email program unless you display or edit actual text.
As i have no right-to-left capabilities, i cannot test anyway.

 |>Really, the older i get the more i think that UTF-16 is not the
 |>worst decision regarding Unicode.  Surrogate pairs have to be
 |>handled, but for UTF-8 you always have to live with multibyte
 |>anyway.
 |
 |I guess I think out of all of the possible worlds, UTF-8 is probably
 |the best compromise.

For serialization you are surely right.  This imposes a conversion
back and forth to wchar_t with POSIX interface, then.  And you
have already lost the performance battle.
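
Roughly this kind of round trip, as a minimal sketch with all error
handling elided (assumes a UTF-8 locale):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Sketch of the round trip: multibyte (e.g. UTF-8) in, wchar_t for
 * the actual work, multibyte back out.  Error handling elided. */
int
main(void)
{
    wchar_t wbuf[64];
    char out[64];
    const char *in = "Grüße";               /* UTF-8 input */

    setlocale(LC_ALL, "");

    size_t n = mbstowcs(wbuf, in, 64);      /* decode to wchar_t */
    /* ... work on wbuf[0..n) with the w*()/isw*() interfaces ... */
    wcstombs(out, wbuf, sizeof out);        /* encode back */

    printf("%zu wide characters, round-tripped: %s\n", n, out);
    return 0;
}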

You know, that is _really_ weird.  Remembering NetBSD pimping
their vis(3) (i was subscribed to their source-changes for years),
vis(3) goes through this:

    /* Allocate space for the wide char strings */
    psrc = pdst = extra = NULL;
    mdst = NULL;
    if ((psrc = calloc(mbslength + 1, sizeof(*psrc))) == NULL)
        return -1;
    if ((pdst = calloc((16 * mbslength) + 1, sizeof(*pdst))) == NULL)
        goto out;
    if (*mbdstp == NULL) {
        if ((mdst = calloc((16 * mbslength) + 1, sizeof(*mdst))) == NULL)
            goto out;
        *mbdstp = mdst;
    }

And you do not want to look at the rest.  I mean, wow, that
entirely hammers you off the map!  Not to mention longjmp(3) or
other signal mess.

The good thing 

Re: Bug reported regarding Unicode handling in email address

2021-06-14 Thread Ken Hornstein
>Sure, convert to Unicode, work in Unicode, convert back, that is
>the way to go.

I know that this is application dependent, but what "work" do you
need to perform on the characters?

I realized back when I was originally looking at i18n issues in nmh we
don't need to perform THAT much work on characters internally.  We DO
do some work when it comes to calculating character width in the format
engine, but that's all in the native character set.  So I realized that
at least for nmh, there's no advantage to converting to Unicode/UTF-8
internally, and a number of disadvantages; like you say, the xlocale
functions are non-portable and you can't really get there with the
existing POSIX APIs.  Converting internally to Unicode would force you
to depend on something like ICU.


>Really, the older i get the more i think that UTF-16 is not the
>worst decision regarding Unicode.  Surrogate pairs have to be
>handled, but for UTF-8 you always have to live with multibyte
>anyway.

I guess I think out of all of the possible worlds, UTF-8 is probably
the best compromise.

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-14 Thread Steffen Nurpmeso
Steffen Nurpmeso wrote in
 <20210614162626.vfjxt%stef...@sdaoden.eu>:
 ...
 | <20210614121214.84c1621...@orac.inputplus.co.uk>:
 ...
 ||Why not iconv(3) the input from the user's locale, the MIME part's
 ||charset, etc., to UTF-8, work internally, and then iconv() again on the
 ...
 |functions do not support graphemes, and __STDC_ISO_10646__ is an
 |option, so that you cannot simply code some tables on your own to
 |fill the gaps, because looking at the wchar_t codepoints may not
 |give you a Unicode "codepoint" (though maybe all do it like that
 |so in practice you could make this a precondition).  I had to

To add that, if i recall correctly, Citrus for example does this,
using the upper bits of wchar_t for state info, but i have
forgotten whether that was done in an UTF-8 locale, or rather in
CJK or Shift-JIS or whatever (my gut says the latter).

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Bug reported regarding Unicode handling in email address

2021-06-14 Thread Ken Hornstein
>> Just read in and DON'T convert on input; just convert ONCE on output.
>
>So then the internal strings are varying encodings, including ones with
>NUL bytes?

Yes.  Although it seems like in practice nobody uses encodings that contain
NUL bytes.  Like I said, fixing that would be tough.

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Ken Hornstein
>Last i looked they use a gigantic chunk of memory in mbstate_t or
>so (128 byte?).

128 bytes is considered 'gigantic'? :-)

While I am not a huge fan of the POSIX locale functions, thankfully we can
mostly get by without them.  Basically we use iconv() to convert from the
source character set to the native character set, and we have a small
amount of mbtowc() and wcwidth() to handle multibyte character sets and
figure out column width (and really, we only do UTF-8 well).
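
To give a feel for that, a rough sketch of the mbtowc()/wcwidth() style
of column counting (not the actual nmh code):

#define _XOPEN_SOURCE 700
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Rough sketch (not the nmh code): count display columns of a
 * multibyte string with mbtowc(3)/wcwidth(3), falling back to one
 * column per byte when a sequence cannot be decoded or printed. */
static int
cols(const char *s)
{
    wchar_t wc;
    int n, w, total = 0;

    mbtowc(NULL, NULL, 0);                  /* reset conversion state */
    while (*s != '\0') {
        n = mbtowc(&wc, s, MB_CUR_MAX);
        if (n <= 0 || (w = wcwidth(wc)) < 0) {
            total += 1;                     /* undecodable/unprintable: guess 1 */
            s += 1;
            mbtowc(NULL, NULL, 0);
        } else {
            total += w;
            s += n;
        }
    }
    return total;
}

int
main(void)
{
    setlocale(LC_ALL, "");
    printf("%d\n", cols("Ken \xE3\x83\x84"));   /* ASCII plus U+30C4 */
    return 0;
}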

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Ken Hornstein
>> What sorry excuse for an MUA are you using over there? :-)
>
>That would be exmh.

Hey, don't drag us fellow exmh users into YOUR mix-up! :-)

I'm puzzled as to the process you use to compose the reply.  Because
if it was being run through mhbuild, there is NO way it should have
ever encoded a '\' as =5C and missed encoding U+30C4.

The ツ in your reply shows up correctly here (because it's being
interpreted as UTF-8) but the leading and trailing macrons get the
"invalid character" glyph.

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Valdis Klētnieks
On Sat, 12 Jun 2021 10:04:36 +0100, Ralph Corderoy said:

> What sorry excuse for an MUA are you using over there?  :-)

That would be exmh.

> And why doesn't it complain at you when it spots the attempt to send
> these transgressions onto the wire?

That's a very good question - I *thought* I fixed that, but obviously
there's still some unicode/utf-8 confusion. It displayed correctly in Ken's
mail, while composing the reply, and in your mail, which is why I didn't
notice it was still broken.

But wait.. there's more..  The =AF=5C screw-up is in the outbound file.

17:38:01 0 [~] grep "can only say" Mail/outbox/41541 | hx
 3E206361 6E206F6E 6C792073 6179203D  41463D35 435F28E3 8384295F 
2F3D4146  *> can only say =AF=5C_(...)_/=AF*
0020 2E0A

But linemode 'show' displays it correctly as well. Why did *that* work here
but you report 

> it doesn't display correctly here when decoded, e.g. the un-QP'd =AF
>isn't valid UTF-8.


Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Steffen Nurpmeso
Ralph Corderoy wrote in
 <20210612103715.a572c21...@orac.inputplus.co.uk>:
 |>> I am aware that some people, for reasons I cannot comprehend, want
 |>> to run in the "C" locale
 |>
 |> I do that, not so much because I want to, but because that's what
 |> happens when no LC_* env variables (nor LANG) exist at all.   That's
 |> me.   I believe you understand that locales aren't exactly first class
 |> objects in NetBSD...  (Or not yet anyway).
 |
 |https://wiki.netbsd.org/tutorials/unicode/ suggests Unicode through
 |UTF-8 is well supported as long as the user sets the appropriate
 |environment variables.  Isn't just that you choose not to set them?

Last i looked they use a gigantic chunk of memory in mbstate_t or
so (128 byte?).  Other than that, the Citrus project was ... the
first to support locales in (free) Unix?  I think so.  What was
totally missing was support for collation.  Understandable here,
especially strxfrm(3), which uses a terrible algorithm that drives
me up the wall in order to turn some A into a B that can be matched
via strcmp(3).  /ME shivers.  Other than that, the w*() interface
is a terrible mess; it does not know about graphemes,
normalization, de-/composing, etc.  Just my one cent.
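
The strxfrm(3) dance in question, as a tiny sketch (buffers simply
assumed large enough):

#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Tiny sketch of the strxfrm(3) idea: transform each key once, then a
 * plain strcmp(3) on the transformed keys orders like strcoll(3) does
 * on the originals. */
int
main(void)
{
    const char *a = "straße", *b = "strasse";
    char ka[256], kb[256];

    setlocale(LC_COLLATE, "");

    strxfrm(ka, a, sizeof ka);              /* turn A into B ... */
    strxfrm(kb, b, sizeof kb);
    printf("strcoll: %d, strcmp on transformed keys: %d\n",
           strcoll(a, b), strcmp(ka, kb));  /* ... matched via strcmp(3) */
    return 0;
}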

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Ralph Corderoy
Hi kre,

> If the draft contained Content-Type, right from the beginning (either
> auto set as part of repl or comp processing, or manually inserted),
> then we wouldn't need to be guessing what charset it was using, would
> we?

Yes, we would need to guess because the Content-Type only describes the
content part following the fields but we need to know the encoding to
read the fields themselves, e.g. non-ASCII runes in email addresses.

But we shouldn't guess, we should use the user's locale to build a
draft, e.g. for repl(1), so the user's locale-aware editor is happy, and
to parse the draft, whether nmh built it or something else.

> > So my thinking is the spool-file's writer will either be something
> > like Postfix which declares support for SMTPUTF8, is handed UTF-8,
> > and AFAICS stores it verbatim,
>
> On my system the spool file is written by /usr/libexec/mail.local
> (which would be invoked by postfix if I used that) but can also be run
> from anything.
>
> While the most common practice is to get to it via the system's MTA,
> it also gets invoked by other things, and will write whatever messages
> they hand it (doing no processing other than inserting the mail spool
> "From ..." separator line and '>' quoting a leading "From " on any
> other line).

Then I think my point stands, and it's just that the responsibility passes
upstream of the near-transparent, file-locking mail.local, whether that's
Postfix or not.

-- 
Cheers, Ralph.



Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Ralph Corderoy
Hi Ken,

> Probably the best way to do that is using mhbuild directives.
> That is, you can today do stuff like:
>
> # [... utf-8 text here ...]
> # [... iso-8859-1 text here ...]
> # [... HTML text here ...]

The input to mhbuild can be that, it's true, though a text editor might
only handle it in the C locale.  And then nmh treats a NUL byte as end
of string, e.g. charset=ucs-2le doesn't work.  Worse than just
truncating the UCS-2LE input, it causes corruption in earlier parts of
this experiment.

$ cat build
#! /bin/bash

(
printf '%s\n' \
'subject: Test.' \
'' \
'Disappears.' \
'#draft
sed -n l draft
echo

cp draft mimed
mhbuild -list -realsize -headers -verbose mimed
echo

sed -n l mimed
$
$ ./build
subject: Test.$
$
Disappears.$
#$
Content-Transfer-Encoding: 8bit$
$
--- =_aa0$
Content-Type: text/plain; charset="UTF-8"$
Content-ID: <21398.162349278...@orac.inputplus.co.uk>$
Content-Transfer-Encoding: 8bit$
$
 ²  ain; charset=iso-8859-1$
Fiat: $ \243$
$
--- =_aa0$
Content-Type: text/plain; charset="ucs-2le"$
Content-ID: <21398.162349278...@orac.inputplus.co.uk>$
$
 ³ $
$
--- =_aa0--$
$ 

1. sed happily displays the NUL bytes in the draft.

2. The ‘Disappears’ part in the draft has vanished.  The Fiat part
starts with part of the preceding directive.  Altering the length of the
UCS-2LE part changes how far back this part erroneously starts;
I suspect some pointer subtraction.

3. All that makes it into the UCS-2LE part is the three spaces which
represent the first three-quarters of the U+2020 dagger and its
following U+0020 space.

This isn't a complaint, just passing on the observation having made the
effort.

-- 
Cheers, Ralph.



Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Ralph Corderoy
Hi Valdis,

Your email was interesting.  Ken wrote

¯\_(ツ)_/¯

which in UTF-8 is

$ hd <<<'¯\_(ツ)_/¯'
  c2 af 5c 5f 28 e3 83 84  29 5f 2f c2 af 0a  |..\_(...)_/...|
000e
$ 

and in Unicode is

$ iconv -f utf-8 -t ucs-4le <<<'¯\_(ツ)_/¯' |
> hexdump -ve '8/4 "% 8x" "\n"'
  af  5c  5f  2830c4  29  5f  2f
  af   a
$

Your MIME email which quoted it arrived here containing

Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

=AF=5C_(ツ)_/=AF

I think that's faulty.  The initial U+00AF has been QP'd as =AF when it
should be the UTF-8 =C2=AF.  The U+30C4 has been put in as the UTF-8 ツ
without being QP'd at all.

It doesn't display correctly here when decoded, e.g. the un-QP'd =AF
isn't valid UTF-8.

What sorry excuse for an MUA are you using over there?  :-)
And why doesn't it complain at you when it spots the attempt to send
these transgressions onto the wire?

-- 
Cheers, Ralph.



Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Ralph Corderoy
Hi Ken,

> > Complain precisely
>
> Well ... I am not sure this feeling is universal:
>
> https://lists.nongnu.org/archive/html/nmh-workers/2014-04/msg00213.html
> https://lists.nongnu.org/archive/html/nmh-workers/2015-03/msg00045.html

They're about emails which were faulty before they reached the local
system.  I'm talking about the local system's poor configuration causing
corruption and it not going unnoticed because it would flow downstream.

> And ... well, reality, again, rears its ugly head.

As I said in a bit which was snipped:

   ‘I quite agree the current code greatly hinders doing anything about
this.’

-- 
Cheers, Ralph.



Re: Bug reported regarding Unicode handling in email address

2021-06-12 Thread Valdis Klētnieks
On Fri, 11 Jun 2021 14:04:36 -0400, Ken Hornstein said:

> character.  This obviously works best if your local character set is
> UTF-8.  I am aware that some people, for reasons I cannot comprehend,
> want to run in the "C" locale but PRETEND that their character set
> is UTF-8 and this approach does not work for them.  To these people I
> can only say �\_(ツ)_/�.

I discovered that using LANG=en_US.utf8 but LC_COLLATE=C was the proper
solution, as /bin/ls then outputs files in the order that God intended, not the
creeping bletcherous horror that UTF-8 collation creates. :)




Re: Bug reported regarding Unicode handling in email address

2021-06-11 Thread Ken Hornstein
>> And then, to get back to my original point ... if we see an 8-bit
>> character that is not valid in the current character set, what,
>> exactly, should we do about it?
>
>Complain precisely, e.g. pathname, line number, column, encoding
>expected, byte(s) seen.  I'd expect an nmh user to want to understand
>how the parts of their system work and where something has gone wrong
>and a good error message will help diagnose problems rather than just
>passing duff data on so it causes problems further away from the origin.

Well ... I am not sure this feeling is universal:

https://lists.nongnu.org/archive/html/nmh-workers/2014-04/msg00213.html
https://lists.nongnu.org/archive/html/nmh-workers/2015-03/msg00045.html

I'm kind of close to this camp; I don't think we should really emit that
many warnings.

And ... well, reality, again, rears its ugly head.  For one, down at the
address parser we don't really get a notification of WHERE we are.
So emitting a warning isn't really practical.  Also if we are dealing with
it in the format engine, well ... who knows where it is coming from,
exactly?  As far as I can tell, this would result in potentially a lot
of confusing warnings that would drive most people nuts.

I think, mostly, we kind of get this right ... we try to make sure
that the source characters are converted using iconv() to the native
character set at display time (we probably don't get the case right
where raw UTF-8 appears in headers and you're in the C locale, because
we are probably assuming that's ASCII).  In that case the substitution
character should appear.  I'm open to adding code to emit more warnings,
but not turned on by default, and I honestly think it would be more
trouble than it is worth.

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-11 Thread Ken Hornstein
>  | to automatically run "mhbuild" on all drafts because nmh users had
>
>I recall that happening I think - I suspect it never was an issue for
>me, as I have (still have, and have had for a LONG time) the -mime
>switch for send (and push) in mh profile.

Ah, I had to look that up.

I don't think the -mime switch does what you think it does.  What it
does is IF you are sending a Bcc to someone, it will encapsulate the
Bcc as a multipart/digest.  It doesn't do anything else and definitely
does NOT automatically run mhbuild/mhn.

>I need to ponder all this more (05:xx in the morning isn't the best
>time for clear thinking - and I am *not* an early riser!) but I think
>this might be exactly what I don't want.  I don't want to manually (or
>via calling mhbuild manually, via one path or another) have a fully
>constructed MIME message in the draft (or not usually), what I want is
>to be able to provide explicit info about things I know to be true about
>the draft, and have nmh (mhbuild or whatever) use that info rather than
>guessing, which most probably just means the content-type field.

Right, well ... we don't QUITE have that now the way you want it.  But
you can get close.

Probably the best way to do that is using mhbuild directives.  That is,
you can today do stuff like:


# [... utf-8 text here ...]
# [... iso-8859-1 text here ...]
# [... HTML text here ...]

>I do that, not so much because I want to, but because that's what
>happens when no LC_* env variables (nor LANG) exist at all.  That's me.
>I believe you understand that locales aren't exactly first class objects
>in NetBSD...  (Or not yet anyway).

Well, I guess I don't know what you mean by "first class objects"!
I was under the impression locale support exists on NetBSD.

>The usual problem I encounter is people who insist on using "fancy"
>quote characters, rather than the ascii ones, in an otherwise ascii
>message.  When I include that as a quote in my reply, the UTF-8
>encodings of those things appear in the draft - with my C locale.

Right, I mean ... this is kind of our (nmh's) fault, and kind of your
fault.

We have some kind of bolt-on tools included in contrib (replyfilter)
that help with this.  The general idea is in your reply draft you're
supposed to convert any text you want to include in your reply to
your native locale; it's your fault you don't do this.  It's nmh's fault
that we don't really do it either, or we don't make it easy.

>All something of a mess (but if "repl" noticed it was including the body
>of the message in the reply - and clearly it does - and could then look
>at the Content-type of that message (or what that was converted into)
>and add that to the draft (to be used as above) I suspect there would be
>far less problems.

That approach seems a bit fraught, especially when dealing with multibyte
character sets like UTF-8.  At least the common iterations of 'vi' know
how to deal with UTF-8, but if you get sent a message using iso8859-1
but your editor is expecting UTF-8, your approach would potentially
end up with ISO8859-1 and UTF-8 in the same message which wouldn't work
at all.  I HAVE seen replied-to messages like that, sadly.

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-11 Thread Robert Elz
Date:Fri, 11 Jun 2021 14:04:36 -0400
From:Ken Hornstein 
Message-ID:  <20210611180437.3b854c2...@pb-smtp1.pobox.com>

  | As I understand your question ... no, that is not true (with a few caveats).

I believe you understood the question correctly!   Thanks.

  | We finally decided, I think around nmh 1.5 (or 1.6),

It could easily be that far back, or further, that, for reasons I no
longer recall, I used to frequently end up attempting to send messages
with MIME fields in the header (MIME-version in particular, though I
know of no good reason for anyone to ever want to manually insert that
field ... but usually at least one more) and message sending would fail
(so I'd have to delete the things).

Somewhere along the way I have simply stopped getting myself into the
situation where that occurred (whatever I was doing, wherever that draft
was coming from, clearly wasn't working well, so I changed the way I work).

  | to automatically run "mhbuild" on all drafts because nmh users had

I recall that happening I think - I suspect it never was an issue for me,
as I have (still have, and have had for a LONG time) the -mime switch for
send (and push) in mh profile.

  | Now PRIOR to that, if you had the AUTOMHNPROC environment variable set

That I don't recall ever having, in fact, I don't recall ever knowing it
even existed.

  | (I think), send would also run mhbuild (back then, mhn).  _If_ you did
  | that AND your draft contained a MIME header like Content-Type, you'd get
  | an error.

The -mime switch would do the same?   In any case, I saw plenty of the
error (for a while).

  | AND if mhbuild sees a MIME header, it silently exits without error.
  | The assumption there is if the outgoing draft already
  | has MIME headers then either the user knew what he/she was doing

I need to ponder all this more (05:xx in the morning isn't the best time
for clear thinking - and I am *not* an early riser!) but I think this
might be exactly what I don't want.   I don't want to manually (or
via calling mhbuild manually, via one path or another) have a fully
constructed MIME message in the draft (or not usually), what I want
is to be able to provide explicit info about things I know to be true
about the draft, and have nmh (mhbuild or whatever) use that info rather
than guessing, which most probably just means the content-type field.

That is, should I for some bizarre reason have a need to send an HTML
format message, I could use an HTML editor to compose the body, in some
appropriate character set, and save that in a file.   Then I do "comp"
and add, or change, the Content-Type field to text/html; charset=whatever
(whatever isn't necessarily anything related to my locale settings, if I
had any) and read the file after the  line in the draft, to provide
the body of the message.   Then if I need it I can go add some Attach
(pseudo-)fields in the header to add some photos, PDF files, or whatever
I need to also include.   Then nmh takes the draft, knows what form it
is from the Content-type field (no guessing it is HTML based upon <html>
type constructs in the body, etc) and what charset that body was encoded
using (again no guessing) and builds a multipart-mixed with that initial
text/html plus the image/jpeg (or whatever) that is to be attached.

  | and then your favorite editor will understand the characters in the
  | reply message.

I suppose it might...

  | This obviously works best if your local character set is UTF-8.

Yes.

  | I am aware that some people, for reasons I cannot comprehend,
  | want to run in the "C" locale

I do that, not so much because I want to, but because that's what happens
when no LC_* env variables (nor LANG) exist at all.   That's me.   I believe
you understand that locales aren't exactly first class objects in NetBSD...
(Or not yet anyway).

The usual problem I encounter is people who insist on using "fancy" quote
characters, rather than the ascii ones, in an otherwise ascii message.  When
I include that as a quote in my reply, the UTF-8 encodings of those things
appear in the draft - with my C locale.   That's how I suspect the appearance
of "pretending to be UTF8 with a C locale" comes about.   But it isn't 
something I want to happen - it is all just imposed upon me (if the
draft contained '?' instead of the quote, it would be easier to fix, one
simple global substitute (and then put back any real question marks, but
usually there are none)).   I however don't have an input method to type
the UTF8 chars, so I can't do a substitute command for them, and instead
have to move to each one and replace it.   Tedious.   But if I don't I end
up sending a message containing UTF8 without a content-type field that
indicates that.

All something of a mess (but if "repl" noticed it was including the body
of the message in the reply - and clearly it does - and could then look at
the Content-type of that message (or what that was converted into) and 
add that to the draft (to be used as above) I suspect there would be
far less problems.

Re: Bug reported regarding Unicode handling in email address

2021-06-11 Thread Ken Hornstein
>That actually brings up one point I have wondered about, and which might
>help here - my recollection (it has been a long time since I tested this,
>so things might have changed) is that nmh doesn't like receiving drafts
>with MIME fields in the header (including particularly for right now) a
>Content-Type field - is that still true?   If so, does it need to be?

As I understand your question ... no, that is not true (with a few caveats).

We finally decided, I think around nmh 1.5 (or 1.6), to automatically run
"mhbuild" on all drafts because nmh users had the unpleasant habit of
doing things like sending out unencoded UTF-8 because that was very easy
to do unless you explicitly configured it otherwise.

Now PRIOR to that, if you had the AUTOMHNPROC environment variable set
(I think), send would also run mhbuild (back then, mhn).  _If_ you did
that AND your draft contained a MIME header like Content-Type, you'd get
an error.

What we did was add a new flag to mhbuild, -auto, and send would run
mhbuild on the draft with the -auto flag.  The two changes -auto makes are that
it disables mhbuild directives AND if mhbuild sees a MIME header, it silently
exits without error.  The assumption there is if the outgoing draft already
has MIME headers then either the user knew what he/she was doing and we
shouldn't mess with it, or you already ran mhbuild once on the draft
explicitly.  So if you provide send(1) with a draft with the proper
MIME headers then everything should work just fine.

>That in a sense raises another question, what do we do when replying to
>a message which is in some (perhaps exotic, like TIS-674) charset, and
>quoting parts of that message, when my locale is "C" (or something
>else different) ?   Clearly converting everything to UTF8 would
>allow it all to work, but whose responsibility is it to do that, and
>when does it happen?

Sigh.  We haven't QUITE covered all of the combinations yet.  There is
some kind of add-on tooling that makes this easier but not perfect.
The short answer is the general trend is to call iconv() to convert
the source characters to the native character set (based on the locale),
and then your favorite editor will understand the characters in the
reply message.  If iconv() fails on a character we insert a substitution
character.  This obviously works best if your local character set is
UTF-8.  I am aware that some people, for reasons I cannot comprehend,
want to run in the "C" locale but PRETEND that their character set
is UTF-8 and this approach does not work for them.  To these people I
can only say ¯\_(ツ)_/¯.
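
Roughly this shape, as a sketch only and not the actual nmh code:

#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Sketch only (not nmh's code): convert `src` from `fromcode` to
 * `tocode`, writing at most `dstlen`-1 bytes plus a NUL into `dst`,
 * substituting '?' for any byte sequence iconv(3) rejects.
 * (Final shift-state flush omitted.) */
static int
convert(const char *fromcode, const char *tocode,
        const char *src, char *dst, size_t dstlen)
{
    iconv_t cd;
    char *in = (char *) src, *out = dst;
    size_t inleft = strlen(src), outleft = dstlen - 1;

    if ((cd = iconv_open(tocode, fromcode)) == (iconv_t) -1)
        return -1;

    while (inleft > 0) {
        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t) -1) {
            if (errno == EILSEQ && outleft > 0) {
                *out++ = '?';       /* substitution character */
                outleft--;
                in++;               /* skip the offending byte and resync */
                inleft--;
                continue;
            }
            iconv_close(cd);
            return -1;              /* output full, truncated input, ... */
        }
    }
    *out = '\0';
    iconv_close(cd);
    return 0;
}

int
main(void)
{
    char buf[128];

    if (convert("ISO-8859-1", "UTF-8", "caf\xE9", buf, sizeof buf) == 0)
        printf("%s\n", buf);        /* "café" when the conversion succeeds */
    return 0;
}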

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-11 Thread Robert Elz
Date:Thu, 10 Jun 2021 18:16:42 -0400
From:Ken Hornstein 
Message-ID:  <20210610221648.a1cd0c9...@pb-smtp2.pobox.com>

  | I feel compelled to point out that when we find 8-bit characters we use
  | the user's locale to find the character set to construct the appropriate
  | MIME headers.

That's all fine, my previous message wasn't so much about nmh, as about
the suggestion (which I have seen before) that nmh is, or could become,
some kind of stand-alone system (I was going to say closed, but I don't
mean it in the not open source sense) where it can control its environment.
It isn't, and we (or I at least) don't want it to be.

  | So if your 24 year old draft (really?)

Yes.

It is (was really, nothing will ever happen to that one, except perhaps rm)
a reply to an FTP Extensions IETF working group mailing list message, on
what must have been a fairly early draft of what became RFC3659.

The message I was replying to (according to a quote that is in my unsent
reply) contained text like "While nobody but a drug-crazed lunatic would
consider such an approach, ..." which might have been why I hesitated to send
the reply (I will leave it for you to imagine what the reply might have
contained) or it might have been that I simply paused to read later messages
on the list before sending the reply, and discovered that someone else had
already said everything I was planning on saying.   Or who knows ... it is
all far too long ago for me to remember anything at all about it!

  | was edited using ISO8859-1 because it pre-dates UTF-8

This was me... it would have been edited using ASCII - the issue isn't
anything related to my ancient message(s), just to the assumption that we
can ever know anything about any files in the ${HOME}/$(mhparam path) tree,
aside from what we can deduce from the content of the files themselves.

[Aside: Occasionally when I have an unsent draft, particularly intended
for a mailing list like this one was, and I end up deciding not to send it,
I will refile it to the mailing list's nmh folder - so one should also not
assume that messages that aren't drafts have ever been seen by any nmh process,
except refile, which is just "ln" (or "mv") on steroids (and sometimes if
I can be bothered to work out what the message number should be, I would
just use mv.)]

That actually brings up one point I have wondered about, and which might
help here - my recollection (it has been a long time since I tested this,
so things might have changed) is that nmh doesn't like receiving drafts
with MIME fields in the header (including particularly for right now) a
Content-Type field - is that still true?   If so, does it need to be?

If the draft contained Content-Type, right from the beginning (either auto set
as part of repl or comp processing, or manually inserted), then we wouldn't
need to be guessing what charset it was using, would we?   It can be updated
when appropriate, either in an editor, to switch charset, or by nmh processing
when handling attachments, etc.

For a while in the intervening period (and so possibly for some of my hundred
or so intervening unsent drafts) I might have been using 8859-15 (I think 15,
but I might be confusing that with 10646-15 ... TIS-674 anyway) chars, as
I used to need to reply to messages sent that way (or whatever wintrash
calls its equivalent).

I certainly wouldn't expect nmh to guess that (how could it?) but it would
be nice if there was a convenient way to tell it, aside from what my current
locale happens to be (my current laptop is newer than my need to deal with
those old work related messages, so I have never bothered to set it up to
handle any of that properly).

That in a sense raises another question, what do we do when replying to
a message which is in some (perhaps exotic, like TIS-674) charset, and
quoting parts of that message, when my locale is "C" (or something
else different) ?   Clearly converting everything to UTF8 would
allow it all to work, but whose responsibility is it to do that, and
when does it happen?


In a slightly later reply to my message, ra...@inputplus.co.uk said:

  | So my thinking is the spool-file's writer will either be something like
  | Postfix which declares support for SMTPUTF8, is handed UTF-8, and AFAICS
  | stores it verbatim,

On my system the spool file is written by /usr/libexec/mail.local (which would
be invoked by postfix if I used that) but can also be run from anything.

While the most common practice is to get to it via the system's MTA, it
also gets invoked by other things, and will write whatever messages they
hand it (doing no processing other than inserting the mail spool "From ..."
separator line and '>' quoting a leading "From " on any other line).

There is no expectation (by it) that messages are necessarily in 822 (or
its  successors) format, though at least some semblance of a relationship
to that is usually maintained.

mail.local's main purpose is locking the spool file so only 

Re: Bug reported regarding Unicode handling in email address

2021-06-11 Thread Ralph Corderoy
Hi kre,

I've reordered the quotes...

> - /var/spool/$LOGNAME is in UTF-8.
>
> Says who?   I think for me it is in whatever mixture of char encodings
> that were used by the various senders of the messages that are there.

To be clear, we're talking about the use of UTF-8 in fields after
SMTPUTF8 has been seen in the SMTP EHLO reply, not any ‘charset’s in
Content-Type fields, or similar.  So my thinking is the spool-file's
writer will either be something like Postfix which declares support for
SMTPUTF8, is handed UTF-8, and AFAICS stores it verbatim, or a program
with no support which will be writing ASCII, a UTF-8 subset.

> > Date:Thu, 10 Jun 2021 11:31:10 +0100
> > From:Ralph Corderoy 
> > Message-ID:  <20210610103110.c017721...@orac.inputplus.co.uk>
> >
> > - mail/inbox/42 was written by us; it's our choice.
>
> For me, it would be written by procmail (mostly) and it will be
> unaltered from what was in the relevant message from /var/spool

If I'm right above then it will be UTF-8 if copied from /var/spool,
and as long as nmh also arranges it to be UTF-8, ignoring the user's
locale, then external and internal writers marry up.

> > - mail/draft is the process's locale.
>
> Probably, but which process?   How do we know what created it?
> There's no requirement that it be sent any time soon after it was
> composed - with just the draft file there's not a lot of leeway, but
> we support drafts in a folder, and there there can be lots waiting to
> be sent.   My drafts/1 file is from 1997

1999 here.

> One day I might send some of those messages...

If your locale today is incompatible with what it was then (for many
ASCII→UTF-8 users it won't be; ISO 8859-1→UTF-8 is the problem), then
you'll have to run iconv(1) or similar to convert the encoding before nmh
would stop griping.  Unless you're unfortunate and it's ISO 8859-1 which
happens to also be a valid UTF-8 rune.

-- 
Cheers, Ralph.



Re: Bug reported regarding Unicode handling in email address

2021-06-11 Thread Ralph Corderoy
Hi Ken,

> > But my point stands.  nmh should know from the context where the
> > email address appears what encoding the bytes use when trying to
> > parse it.
> > 
> > - mail/inbox/42 was written by us; it's our choice.
> > - mail/draft is the process's locale.
> > - /var/spool/$LOGNAME is in UTF-8.
>
> Right, but ... reality rears its ugly head.
>
> The address parser is a bunch of layers down.  And it's used for a lot
> of things.  For example, stuff from .mh_profile can end up being
> parsed by it.  We'd have to change the internal API to indicate where
> an address is coming from and I think we'd have to change it almost
> everywhere.

I quite agree the current code greatly hinders doing anything about
this.

> And then, to get back to my original point ... if we see an 8-bit
> character that is not valid in the current character set, what,
> exactly, should we do about it?

Complain precisely, e.g. pathname, line number, column, encoding
expected, byte(s) seen.  I'd expect an nmh user to want to understand
how the parts of their system work and where something has gone wrong
and a good error message will help diagnose problems rather than just
passing duff data on so it causes problems further away from the origin.

-- 
Cheers, Ralph.



Re: Bug reported regarding Unicode handling in email address

2021-06-10 Thread Ken Hornstein
>Probably, but which process?   How do we know what created it?  There's
>no requirement that it be sent any time soon after it was composed - with
>just the draft file there's not a lot of leeway, but we support drafts in
>a folder, and there there can be lots waiting to be sent.   My drafts/1
>file is from 1997 ... drafts/10 is from 2002, and drafts/100 from early April
>(this year).   One day I might send some of those messages...

I feel compelled to point out that when we find 8-bit characters we use
the user's locale to find the character set to construct the appropriate
MIME headers.  So if your 24 year old draft (really?) was edited using
ISO8859-1 because it pre-dates UTF-8 (fine, UTF-8 was invented in 1993,
but RFC 2277 didn't come out until 1998) but you've changed your locale
to use UTF-8, then nmh has no way of knowing that.  In theory I suppose
we could look at the draft and try to take a guess ... but I think that
way lies madness.
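
E.g. the locale-derived charset name comes from something along these
lines (a sketch, not nmh's actual code):

#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* Sketch: where the charset name for the MIME headers can come from
 * when 8-bit bytes are found in a locale-encoded draft. */
int
main(void)
{
    setlocale(LC_ALL, "");                  /* honour LANG / LC_* */
    printf("charset=%s\n", nl_langinfo(CODESET));
    /* e.g. "UTF-8", "ISO-8859-1", or "ANSI_X3.4-1968" in the C locale */
    return 0;
}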

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-10 Thread Robert Elz
Date:Thu, 10 Jun 2021 11:31:10 +0100
From:Ralph Corderoy 
Message-ID:  <20210610103110.c017721...@orac.inputplus.co.uk>

  | - mail/inbox/42 was written by us; it's our choice.

For me, it would be written by procmail (mostly) and it will be
unaltered from what was in the relevant message from /var/spool

  | - mail/draft is the process's locale.

Probably, but which process?   How do we know what created it?  There's
no requirement that it be sent any time soon after it was composed - with
just the draft file there's not a lot of leeway, but we support drafts in
a folder, and there there can be lots waiting to be sent.   My drafts/1
file is from 1997 ... drafts/10 is from 2002, and drafts/100 from early April
(this year).   One day I might send some of those messages...

  | - /var/spool/$LOGNAME is in UTF-8.

Says who?   I think for me it is in whatever mixture of char encodings
that were used by the various senders of the messages that are there.

kre




Re: Bug reported regarding Unicode handling in email address

2021-06-10 Thread Ken Hornstein
>But my point stands.  nmh should know from the context where the email
>address appears what encoding the bytes use when trying to parse it.
>
>- mail/inbox/42 was written by us; it's our choice.
>- mail/draft is the process's locale.
>- /var/spool/$LOGNAME is in UTF-8.

Right, but ... reality rears its ugly head.

The address parser is a bunch of layers down.  And it's used for a lot of
things.  For example, stuff from .mh_profile can end up being parsed by
it.  We'd have to change the internal API to indicate where an address is
coming from and I think we'd have to change it almost everywhere.

And then, to get back to my original point ... if we see an 8-bit
character that is not valid in the current character set, what, exactly,
should we do about it?

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-07 Thread Ken Hornstein
>> The address parser code is used for a lot of things.  The specific bug
>> report was about a draft message that contained Cyrillic characters.
>> We know what that character set was in THAT case, because it's a draft
>> message and we can derive the locale from the environment or the nmh
>> locale setting.  But if we are processing an email message then we
>> don't easily know the character set.  In theory it should either be
>> us-ascii or utf-8, but reality sometimes intrudes and it could be
>> anything.
>
>If it's an email then won't it be ASCII?

Boy, you're out of the loop!  Check out RFC 6532.

>> I think really instead of using ctype macros, we should be using a
>> specific set of macros tailored for email addresses.
>
>Isn't the problem that one routine is being used to parse emails which
>should comply with the RFCs and also draft emails where it's up to nmh
>to decide the allowable format?  We should be parsing ASCII-encoded
>fields for display in the user's locale with one routine and
>locale-encoded fields for transmission as ASCII with a second routine.

I mean ... yes?  Like many things there's a lot of overloading (see:
using email header parsing routines for config files).  But I think
in practice as long as we don't interpret non-ASCII bytes as "spaces"
we'll be fine.  Like I said, really, for parsing an email header we really
shouldn't be using ctype macros AT ALL but email-specific macros.

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-07 Thread Tom Lane
Ralph Corderoy  writes:
> U+0081 as 0x81 is ‘is a character representable as an unsigned char’ for
> it's a character, U+0081, and unsigned char holds [0, 0x100) so it
> suffers no loss of representation as an unsigned char.

Sure, but then what you are feeding the function is *not* UTF8.
UTF8 would require two bytes to represent that code point.  What
you're describing is ISO 8859-1, which is a perfectly fine
single-byte encoding, as long as you don't need any characters
outside the common western-European languages.

Or to put it another way: yes, you can claim that only code points
up to U+FF can be passed to these functions, but that hobbles things
to the point where you really shouldn't claim to be Unicode-aware
at all.

I think it's more sensible to consider that per spec, the <ctype.h>
functions can only deal with single-byte encodings; if you want
something more flexible, you have to go to <wctype.h>.

regards, tom lane



Re: Bug reported regarding Unicode handling in email address

2021-06-07 Thread Ralph Corderoy
Hi Ken,

> I am wondering if the simplest solution is to put in isascii() in
> front of those tests in that function.  We only really care about
> those tests returning "true" for ASCII characters.  Thoughts?

Just some tests, really, on Linux of different locales.  One multibyte,
the other two single, and one of those two with bytes which have the
same values as Unicode control ones.

$ iconv -f utf-8 -t utf-8 <<<'From: ÁßÇÐË ' | LC_ALL=en_GB.utf8 
uip/mhbuild -
From: =?UTF-8?B?w4HDn8OHw5DDiw==?= 
MIME-Version: 1.0
Content-Type: text/plain
$ iconv -f utf-8 -t utf-8 <<<ÁßÇÐË | hd
  c3 81 c3 9f c3 87 c3 90  c3 8b 0a |...|
000b
$ base64 -d <<' | 
LC_ALL=en_GB.iso-8859-1 uip/mhbuild -
From: =?ISO-8859-1?B?wd/H0Ms=?= 
MIME-Version: 1.0
Content-Type: text/plain
$ iconv -f utf-8 -t iso-8859-1 <<<ÁßÇÐË | hd
  c1 df c7 d0 cb 0a |..|
0006
$ base64 -d <<' | LC_ALL=en_GB.cp1252 
uip/mhbuild -
From: =?WINDOWS-1252?B?hoeVnA==?= 
MIME-Version: 1.0
Content-Type: text/plain
$ iconv -f utf-8 -t cp1252 <<<†‡•œ | hd
  86 87 95 9c 0a|.|
0005
#define _XOPEN_SOURCE
#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    if (!setlocale(LC_ALL, ""))
        return 1;

#define P(f) \
    puts(#f); \
    for (int i = -1; i < 0x100; i++) { \
        if (!(i & 0x3f)) putchar('\n'); \
        fputc(f(i) ? (i & 0x80 ? '#' : 'x') : '.', stdout); \
        if ((i & 0xf) == 0xf) putchar(' '); \
    } \
    putchar('\n')

    P(iscntrl);
    P(isspace);
    P(isblank);
    P(isdigit);
    P(isxdigit);
    P(islower);
    P(isupper);
    P(isalpha);
    P(isalnum);
    P(ispunct);
    P(isgraph);
    P(isprint);
    P(isascii);

    putchar('\n');

#undef P
#define P(f, c) fputc(f(i) ? c : '.', stdout)
    for (int i = -1; i < 0x100; i++) {
        printf(" %3d  %4x  %c  ", i, (unsigned short)i, isprint(i) ? i : ' ');
        P(iscntrl,  'c');
        P(isspace,  's');
        P(isblank,  'b');
        P(isdigit,  'd');
        P(isxdigit, 'x');
        P(islower,  'L');
        P(isupper,  'U');
        P(isalpha,  'a');
        P(isalnum,  'l');
        P(ispunct,  'u');
        P(isgraph,  'g');
        P(isprint,  'p');
        P(isascii,  'a');
        printf("  %4x\n", (unsigned short)btowc(i));
    }
}


Re: Bug reported regarding Unicode handling in email address

2021-06-07 Thread Ralph Corderoy
Hi Tom,

> Anyway, interpreting the input as a Unicode code point, for values
> above U+7F (or, if you stretch it unreasonably, U+FF) is very clearly
> outside the spec.

I'm not sure it is.  An unwise design choice by 4.4BSD, yes.

U+0081 as 0x81 is ‘is a character representable as an unsigned char’ for
it's a character, U+0081, and unsigned char holds [0, 0x100) so it
suffers no loss of representation as an unsigned char.

Though following that argument, every implementation should be doing it.
:-)

-- 
Cheers, Ralph.



Re: Bug reported regarding Unicode handling in email address

2021-06-03 Thread Bob Carragher
On Wed, 02 Jun 2021 17:47:42 -0400 Ken Hornstein  sez:

> >It's early morning for me, and I'm still at least a liter of Diet Mountain 
> >Dew
> >away from being sufficiently caffeinated to be positive, but that looks like
> >"not totally correct, but a lot closer than what we have now".
> >
> >In particular, that will accept overlong and illegal utf-8 codepoints, and
> >probably misbehaves in strange and unusual non-ascii/non-utf-8 things
> >like iso2022-jp.
>
> So, the DETAILS are complicated.

Which is why I have nothing to add to the main thread topic,
other than "sounds good to me!"  B-)

[snip]
>iso2022-jp is
> SO complicated, I don't think we should even try and I get the sense
> everyone is migrating to UTF-8 for email anyway.

I can add one data point here, though:  I, and the folks I
correspond with in Japanese, made that switch by mid-2017.  Prior
to that, we used iso-2022-jp since it was all-ASCII in the years
before MIME-encoding became widely available.

And I know I was late to this party, and the people I correspond
with in Japanese are more likely to use a major "even for
dummies" type mailer (e.g. Gmail) than NMH.  (Sorry.  B-)
Actually _they_ had made that switch before me, and I kept
"complicating" things (a.k.a. "screwing up the email") by
continuing to use iso-2022-jp.  ^_^;;;

Fortunately, we've always used MIME-encoding in the "real name"
portions of our addresses!

Bob



Re: Bug reported regarding Unicode handling in email address

2021-06-02 Thread Ken Hornstein
>It's early morning for me, and I'm still at least a liter of Diet Mountain Dew
>away from being sufficiently caffeinated to be positive, but that looks like
>"not totally correct, but a lot closer than what we have now".
>
>In particular, that will accept overlong and illegal utf-8 codepoints, and
>probably misbehaves in strange and unusual non-ascii/non-utf-8 things
>like iso2022-jp.

So, the DETAILS are complicated.

The address parser code is used for a lot of things.  The specific bug
report was about a draft message that contained Cyrillic characters.
We know what that character set was in THAT case, because it's a draft
message and we can derive the locale from the environment or the nmh
locale setting.  But if we are processing an email message then we don't
easily know the character set.  In theory it should either be us-ascii
or utf-8, but reality sometimes intrudes and it could be anything.

I think really instead of using ctype macros, we should be using a
specific set of macros tailored for email addresses.  Or a flex
lexer designed to process those things.  I kind of think that we
should simply pass the input along as we are given rather than trying
to validate that it is valid UTF-8 (for example).  iso2022-jp is
SO complicated, I don't think we should even try and I get the sense
everyone is migrating to UTF-8 for email anyway.
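
For instance, such email-tailored tests might look something like this
sketch (hypothetical names, not existing nmh macros):

#include <ctype.h>

/* Hypothetical sketch (these are not existing nmh macros): tests
 * tailored for e-mail parsing, where only ASCII bytes can ever count
 * as whitespace or controls and bytes >= 0x80 stay opaque. */
#define EM_ISASCII(c)  (((unsigned char)(c)) < 0x80)
#define EM_ISSPACE(c)  (EM_ISASCII(c) && isspace((unsigned char)(c)))
#define EM_ISCNTRL(c)  (EM_ISASCII(c) && iscntrl((unsigned char)(c)))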

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-02 Thread Valdis Klētnieks
On Wed, 02 Jun 2021 00:13:51 -0400, Ken Hornstein said:
> So this bug was reported yesterday:
>
>   https://savannah.nongnu.org/bugs/?60713

> I am wondering if the simplest solution is to put in isascii() in front
> of those tests in that function.  We only really care about those tests
> returning "true" for ASCII characters.  Thoughts?

It's early morning for me, and I'm still at least a liter of Diet Mountain Dew
away from being sufficiently caffeinated to be positive, but that looks like
"not totally correct, but a lot closer than what we have now".

In particular, that will accept overlong and illegal utf-8 codepoints, and
probably misbehaves in strange and unusual non-ascii/non-utf-8 things
like iso2022-jp.

Personally, I'd just stick the isascii() in there and wait for a bug report
regarding the previous paragraph. :)




Re: Bug reported regarding Unicode handling in email address

2021-06-02 Thread Ken Hornstein
>You need to read a bit further down, where POSIX says
>
>The c argument is an int, the value of which the application shall
>ensure is representable as an unsigned char or equal to the value of
>the macro EOF. If the argument has any other value, the behavior is
>undefined.

Oof, fair enough; I stand corrected!

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-02 Thread Tom Lane
Ken Hornstein  writes:
>> The  macros are just fundamentally broken in any locale that
>> has multibyte characters: you cannot squeeze a multibyte character
>> into an input that is supposed to be either an "unsigned char" or EOF.
>> Vendors can choose either to violate the spec (say, by interpreting
>> the "int" input as a Unicode codepoint) or to produce useless results.

> It's worth pointing out that the official prototype for the ctype macros
> all say they take "int" as an argument, and POSIX says they take as
> an argument a "character".  So interpreting that argument as a Unicode
> codepoint (assuming you're currently in a Unicode locale) is, from my
> reading, within the spec.

You need to read a bit further down, where POSIX says

The c argument is an int, the value of which the application shall
ensure is representable as an unsigned char or equal to the value of
the macro EOF. If the argument has any other value, the behavior is
undefined.

(C99 has identical verbiage.)

The reason to declare the argument as int is so that these can take EOF,
which I suppose is meant to allow them to be applied directly to the
result of getc() ... though why anyone would write code that way is
not clear to me.  Anyway, interpreting the input as a Unicode code point,
for values above U+7F (or, if you stretch it unreasonably, U+FF) is
very clearly outside the spec.

regards, tom lane



Re: Bug reported regarding Unicode handling in email address

2021-06-02 Thread David Levine
Ken wrote:

> But it sounds like to me that everyone is on board with sprinkling in
> some isascii() calls there where it makes sense.

+1

David



Re: Bug reported regarding Unicode handling in email address

2021-06-02 Thread Ken Hornstein
>The <ctype.h> macros are just fundamentally broken in any locale that
>has multibyte characters: you cannot squeeze a multibyte character
>into an input that is supposed to be either an "unsigned char" or EOF.
>Vendors can choose either to violate the spec (say, by interpreting
>the "int" input as a Unicode codepoint) or to produce useless results.

It's worth pointing out that the official prototype for the ctype macros
all say they take "int" as an argument, and POSIX says they take as
an argument a "character".  So interpreting that argument as a Unicode
codepoint (assuming you're currently in a Unicode locale) is, from my
reading, within the spec.

But it sounds like to me that everyone is on board with sprinkling in
some isascii() calls there where it makes sense.

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-01 Thread Tom Lane
Ken Hornstein  writes:
> So, it seems like the behavior of iscntrl() and isspace() if the value
> is > 127 is undefined.  If you're in the UTF-8 locale MacOS X treats that
> as a Unicode codepoint.  But we are NOT treating it like that in this case;
> we're processing it on a character-by-character basis.

The <ctype.h> macros are just fundamentally broken in any locale that
has multibyte characters: you cannot squeeze a multibyte character
into an input that is supposed to be either an "unsigned char" or EOF.
Vendors can choose either to violate the spec (say, by interpreting
the "int" input as a Unicode codepoint) or to produce useless results.

(As I recall, the MacOS UTF8 locales are badly broken in some other
ways, but this problem is not Apple's fault.)

> I am wondering if the simplest solution is to put in isascii() in front
> of those tests in that function.  We only really care about those tests
> returning "true" for ASCII characters.  Thoughts?

Yeah, that seems like a reasonable fix in this context.

regards, tom lane



Bug reported regarding Unicode handling in email address

2021-06-01 Thread Ken Hornstein
So this bug was reported yesterday:

https://savannah.nongnu.org/bugs/?60713

And I kind of thought we got this mostly right!  So I dug into it a bit.

It turns out the problem is WAY down in the address parser.  Specifically
it is here, in sbr/mf.c:my_lex()

    if (iscntrl ((unsigned char) c) || isspace ((unsigned char) c))
        break;

This LOOKS ok.  But ... if you look at the test message, it contains the
character 'с', which is U+0441 "Cyrillic Small Letter ES".  And the UTF-8
encoding of that is 0xd1 0x81.  So we end up calling iscntrl() on
0xd1 (which is false) AND then we end up calling iscntrl() on 0x81 ...
which returns true (because that's a Unicode "control" character).  Note
this only happens IF you are in a UTF-8 locale AND you call
setlocale() at the beginning of your program (the latter drove me nuts
because my original test program didn't work because I didn't do that).
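
A minimal version of that kind of test program (the result is locale
and platform dependent):

#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/* Minimal check of what iscntrl() says about the byte 0x81 before and
 * after setlocale(); the answer is locale and platform dependent. */
int
main(void)
{
    printf("C locale:    iscntrl(0x81) = %d\n", iscntrl(0x81));
    setlocale(LC_ALL, "");                  /* e.g. a UTF-8 locale */
    printf("user locale: iscntrl(0x81) = %d\n", iscntrl(0x81));
    return 0;
}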

So, it seems like the behavior of iscntrl() and isspace() if the value
is > 127 is undefined.  If you're in the UTF-8 locale MacOS X treats that
as a Unicode codepoint.  But we are NOT treating it like that in this case;
we're processing it on a character-by-character basis.

I am wondering if the simplest solution is to put in isascii() in front
of those tests in that function.  We only really care about those tests
returning "true" for ASCII characters.  Thoughts?

--Ken