Steffen Nurpmeso wrote in
<20210615212220.5bc88%stef...@sdaoden.eu>:
...
|a minority by very far, and in the 60s an african bishop said "by
|the year 2700 the white man will have destroyed life on earth",
|and i believed him already when i was young.
Actually he said
By the year 2700 the
Good evening.
Ken Hornstein wrote in
<20210615025348.83822125...@pb-smtp20.pobox.com>:
|>kre was coming from a "per draft source character set" i think.
|>But of course, application dependent. It is more general than "i
|>really need this now to get nmh (or mailx) going". When i went
>kre was coming from a "per draft source character set" i think.
>But of course, application dependent. It is more general than "i
>really need this now to get nmh (or mailx) going". When i went
>online around 2010 there was a Python member (Murray, who did the
>rewrite of the Python mail
>Out of interest, how about
>
>pick -from klētnieks -and -search £42 -or -search ₿
I guess I was thinking "convert messages to native character set" while
searching. I realize that doesn't cover complete Unicode equivalence;
we'd really need ICU or something like that. I can live without
Ralph Corderoy wrote in
<20210614205202.034db21...@orac.inputplus.co.uk>:
|Hi Steffen,
|
|> It is still hard to do with POSIX let alone ISO. You need an UTF-8
|> locale you can actively select, POSIX/ISO functions do not support
|> graphemes, and __STDC_ISO_10646__ is an option, so that you
Hi Steffen,
> It is still hard to do with POSIX let alone ISO. You need an UTF-8
> locale you can actively select, POSIX/ISO functions do not support
> graphemes, and __STDC_ISO_10646__ is an option, so that you cannot
> simply code some tables on your own to fill the gaps, because looking
> at
Hi Ken,
> I realized back when I was originally looking at i18n issues in nmh we
> don't need to perform THAT much work on characters internally.
Out of interest, how about
pick -from klētnieks -and -search £42 -or -search ₿
--
Cheers, Ralph.
Ken Hornstein wrote in
<20210614165452.a056f120...@pb-smtp20.pobox.com>:
|>Sure, convert to Unicode, work in Unicode, convert back, that is
|>the way to go.
|
|I know that this is application dependent, but what "work" do you
|need to perform on the characters?
|
|I realized back when I
>Sure, convert to Unicode, work in Unicode, convert back, that is
>the way to go.
I know that this is application dependent, but what "work" do you
need to perform on the characters?
I realized back when I was originally looking at i18n issues in nmh we
don't need to perform THAT much work on
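
For readers following the "convert to Unicode, work in Unicode, convert back" idea quoted above, here is a minimal sketch of the inbound half using iconv(3). It is not nmh code: the helper name to_utf8 and its buffer handling are assumptions made for illustration, and it treats any conversion failure (unknown charset, invalid input, output buffer too small) the same way.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of nmh): convert the NUL-terminated
 * string src, encoded in charset `from`, to UTF-8 in dst.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or (size_t)-1 on any failure.
 */
static size_t
to_utf8(const char *from, const char *src, char *dst, size_t dstlen)
{
    iconv_t cd;
    char *in = (char *)src;      /* iconv() takes char **, but does not modify the input bytes */
    char *out = dst;
    size_t inleft = strlen(src);
    size_t outleft;
    size_t rc;

    if (dstlen == 0)
        return (size_t)-1;
    outleft = dstlen - 1;        /* keep one byte for the terminating NUL */

    cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;       /* charset pair not supported */

    rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;       /* EILSEQ, EINVAL or E2BIG */

    *out = '\0';
    return (size_t)(out - dst);
}
```

A caller would pass the MIME part's charset or the locale's codeset as `from`; converting back on output is the same call with the two charset names swapped in iconv_open().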
Steffen Nurpmeso wrote in
<20210614162626.vfjxt%stef...@sdaoden.eu>:
...
| <20210614121214.84c1621...@orac.inputplus.co.uk>:
...
||Why not iconv(3) the input from the user's locale, the MIME part's
||charset, etc., to UTF-8, work internally, and then iconv() again on the
...
|functions do
>> Just read in and DON'T convert on input; just convert ONCE on output.
>
>So then the internal strings are varying encodings, including ones with
>NUL bytes?
Yes. Although it seems like in practice nobody uses encodings that contain
NUL bytes. Like I said, fixing that would be tough.
--Ken
>Last i looked they use a gigantic chunk of memory in mbstate_t or
>so (128 byte?).
128 bytes is considered 'gigantic'? :-)
While I am not a huge fan of the POSIX locale functions, thankfully we can
mostly get by without them. Basically we use iconv() to convert from the
source character set to
>> What sorry excuse for an MUA are you using over there? :-)
>
>That would be exmh.
Hey, don't drag us fellow exmh users into YOUR mix-up! :-)
I'm puzzled as to the process you use to compose the reply. Because
if it was being run through mhbuild, there is NO way it should have
ever encoded a
On Sat, 12 Jun 2021 10:04:36 +0100, Ralph Corderoy said:
> What sorry excuse for an MUA are you using over there? :-)
That would be exmh.
> And why doesn't it complain at you when it spots the attempt to send
> these transgressions onto the wire?
That's a very good question - I *thought* I
Ralph Corderoy wrote in
<20210612103715.a572c21...@orac.inputplus.co.uk>:
|>> I am aware that some people, for reasons I cannot comprehend, want
|>> to run in the "C" locale
|>
|> I do that, not so much because I want to, but because that's what
|> happens when no LC_* env variables (nor
Hi kre,
> If the draft contained Content-Type, right from the beginning (either
> auto set as part of repl or comp processing, or manually inserted),
> then we wouldn't need to be guessing what charset it was using, would
> we?
Yes, we would need to guess because the Content-Type only describes
Hi Ken,
> Probably the best way to do that is using mhbuild directives.
> That is, you can today do stuff like:
>
> # [... utf-8 text here ...]
> # [... iso-8859-1 text here ...]
> # [... HTML text here ...]
The input to mhbuild can be that, it's true, though a text editor might
only handle it
Hi Valdis,
Your email was interesting. Ken wrote
¯\_(ツ)_/¯
which in UTF-8 is
$ hd <<<'¯\_(ツ)_/¯'
c2 af 5c 5f 28 e3 83 84 29 5f 2f c2 af 0a |..\_(...)_/...|
000e
$
and in Unicode is
$ iconv -f utf-8 -t ucs-4le <<<'¯\_(ツ)_/¯' |
> hexdump -ve '8/4
Hi Ken,
> > Complain precisely
>
> Well ... I am not sure this feeling is universal:
>
> https://lists.nongnu.org/archive/html/nmh-workers/2014-04/msg00213.html
> https://lists.nongnu.org/archive/html/nmh-workers/2015-03/msg00045.html
They're about emails which were faulty before they reached
On Fri, 11 Jun 2021 14:04:36 -0400, Ken Hornstein said:
> character. This obviously works best if your local character set is
> UTF-8. I am aware that some people, for reasons I cannot comprehend,
> want to run in the "C" locale but PRETEND that their character set
> is UTF-8 and this approach
>> And then, to get back to my original point ... if we see an 8-bit
>> character that is not valid in the current character set, what,
>> exactly, should we do about it?
>
>Complain precisely, e.g. pathname, line number, column, encoding
>expected, byte(s) seen. I'd expect an nmh user to want to
> | to automatically run "mhbuild" on all drafts because nmh users had
>
>I recall that happening I think - I suspect it never was an issue for
>me, as I have (still have, and have had for a LONG time) the -mime
>switch for send (and push) in mh profile.
Ah, I had to look that up.
I don't think
Date:Fri, 11 Jun 2021 14:04:36 -0400
From:Ken Hornstein
Message-ID: <20210611180437.3b854c2...@pb-smtp1.pobox.com>
| As I understand your question ... no, that is not true (with a few caveats).
I believe you understood the question correctly! Thanks.
| We
>That actually brings up one point I have wondered about, and which might
>help here - my recollection (it has been a long time since I tested this,
>so things might have changed) is that nmh doesn't like receiving drafts
>with MIME fields in the header (including particularly for right now) a
Date:Thu, 10 Jun 2021 18:16:42 -0400
From:Ken Hornstein
Message-ID: <20210610221648.a1cd0c9...@pb-smtp2.pobox.com>
| I feel compelled to point out that when we find 8-bit characters we use
| the user's locale to find the character set to construct the appropriate
Hi kre,
I've reordered the quotes...
> - /var/spool/$LOGNAME is in UTF-8.
>
> Says who? I think for me it is in whatever mixture of char encodings
> that were used by the various senders of the messages that are there.
To be clear, we're talking about the use of UTF-8 in fields after
SMTPUTF8
Hi Ken,
> > But my point stands. nmh should know from the context where the
> > email address appears what encoding the bytes use when trying to
> > parse it.
> >
> > - mail/inbox/42 was written by us; it's our choice.
> > - mail/draft is the process's locale.
> > - /var/spool/$LOGNAME is in
>Probably, but which process? How do we know what created it? There's
>no requirement that it be sent any time soon after it was composed - with
>just the draft file there's not a lot of leeway, but we support drafts in
>a folder, and there there can be lots waiting to be sent. My drafts/1
Date:Thu, 10 Jun 2021 11:31:10 +0100
From:Ralph Corderoy
Message-ID: <20210610103110.c017721...@orac.inputplus.co.uk>
| - mail/inbox/42 was written by us; it's our choice.
For me, it would be written by procmail (mostly) and it will be
unaltered from what was in
>But my point stands. nmh should know from the context where the email
>address appears what encoding the bytes use when trying to parse it.
>
>- mail/inbox/42 was written by us; it's our choice.
>- mail/draft is the process's locale.
>- /var/spool/$LOGNAME is in UTF-8.
Right, but ... reality
>> The address parser code is used for a lot of things. The specific bug
>> report was about a draft message that contained Cyrillic characters.
>> We know what that character set was in THAT case, because it's a draft
>> message and we can derive the locale from the environment or the nmh
>>
Ralph Corderoy writes:
> U+0081 as 0x81 is ‘is a character representable as an unsigned char’ for
> it's a character, U+0081, and unsigned char holds [0, 0x100) so it
> suffers no loss of representation as an unsigned char.
Sure, but then what you are feeding the function is *not* UTF8.
UTF8
Hi Ken,
> I am wondering if the simplest solution is to put in isascii() in
> front of those tests in that function. We only really care about
> those tests returning "true" for ASCII characters. Thoughts?
Just some tests, really, on Linux of different locales. One multibyte,
the other two
Hi Tom,
> Anyway, interpreting the input as a Unicode code point, for values
> above U+7F (or, if you stretch it unreasonably, U+FF) is very clearly
> outside the spec.
I'm not sure it is. An unwise design choice by 4.4BSD, yes.
U+0081 as 0x81 is ‘is a character representable as an unsigned
On Wed, 02 Jun 2021 17:47:42 -0400 Ken Hornstein sez:
> >It's early morning for me, and I'm still at least a liter of Diet Mountain Dew
> >away from being sufficiently caffeinated to be positive, but that looks like
> >"not totally correct, but a lot closer than what we have now".
> >
> >In
>It's early morning for me, and I'm still at least a liter of Diet Mountain Dew
>away from being sufficiently caffeinated to be positive, but that looks like
>"not totally correct, but a lot closer than what we have now".
>
>In particular, that will accept overlong and illegal utf-8 codepoints,
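
To make the "overlong and illegal" concern concrete, here is a rough sketch of a stricter check. It is not the code under discussion in the thread: the function name utf8_first_invalid is an assumption for illustration. It rejects overlong forms, surrogates, values above U+10FFFF, stray continuation bytes and truncated sequences, and returns an offset so a caller could report precisely where the bad byte sits.

```c
#include <stddef.h>

/*
 * Hypothetical helper: return the offset of the first byte that starts
 * an invalid UTF-8 sequence, or len if the whole buffer is valid.
 */
static size_t
utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp;    /* decoded code point */
        unsigned long min;   /* smallest code point allowed for this length */
        size_t n;            /* number of continuation bytes expected */
        size_t k;

        if (c < 0x80) {      /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            n = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            n = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            n = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return i;        /* stray continuation byte or invalid lead byte */
        }

        if (len - i <= n)
            return i;        /* sequence truncated by end of buffer */

        for (k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return i;    /* expected a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        /* Reject overlong encodings, surrogates and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return i;

        i += n + 1;
    }
    return len;
}
```

For example, the two bytes C0 AF (an overlong encoding of "/") are rejected at offset 0, while C2 A3 (the pound sign) passes.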
On Wed, 02 Jun 2021 00:13:51 -0400, Ken Hornstein said:
> So this bug was reported yesterday:
>
> https://savannah.nongnu.org/bugs/?60713
> I am wondering if the simplest solution is to put in isascii() in front
> of those tests in that function. We only really care about those tests
>
>You need to read a bit further down, where POSIX says
>
>The c argument is an int, the value of which the application shall
>ensure is representable as an unsigned char or equal to the value of
>the macro EOF. If the argument has any other value, the behavior is
>undefined.
Oof,
Ken Hornstein writes:
>> The macros are just fundamentally broken in any locale that
>> has multibyte characters: you cannot squeeze a multibyte character
>> into an input that is supposed to be either an "unsigned char" or EOF.
>> Vendors can choose either to violate the spec (say, by
Ken wrote:
> But it sounds like to me that everyone is on board with sprinkling in
> some isascii() calls there where it makes sense.
+1
David
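
As an illustration of the "sprinkle in some isascii() calls" proposal, a guard of roughly this shape would keep bytes above 127 away from the ctype functions entirely. This is only a sketch, not the actual change to sbr/mf.c: the helper names ascii_iscntrl and ascii_isspace are made up here, and the explicit `c < 0x80` comparison stands in for isascii() so the block does not depend on that function being declared.

```c
#include <ctype.h>

/*
 * Hypothetical helpers: classify a single byte, but only ever hand
 * ASCII values to the <ctype.h> functions. Bytes >= 0x80 (for example
 * the pieces of a multibyte UTF-8 character) are never treated as
 * control or space characters, whatever the locale says.
 */
static int
ascii_iscntrl(unsigned char c)
{
    /* c < 0x80 is the same test isascii() performs */
    return c < 0x80 && iscntrl(c) != 0;
}

static int
ascii_isspace(unsigned char c)
{
    return c < 0x80 && isspace(c) != 0;
}
```

Taking an unsigned char parameter also keeps the argument within the range POSIX requires for the ctype functions, so a plain char holding a negative value cannot slip through.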
>The macros are just fundamentally broken in any locale that
>has multibyte characters: you cannot squeeze a multibyte character
>into an input that is supposed to be either an "unsigned char" or EOF.
>Vendors can choose either to violate the spec (say, by interpreting
>the "int" input as a
Ken Hornstein writes:
> So, it seems like the behavior of iscntrl() and isspace() if the value
> is > 127 is undefined. If you're in the UTF-8 locale MacOS X treats that
> as a Unicode codepoint. But we are NOT treating it like that in this case;
> we're processing it on a
So this bug was reported yesterday:
https://savannah.nongnu.org/bugs/?60713
And I kind of thought we got this mostly right! So I dug into it a bit.
It turns out the problem is WAY down in the address parser. Specifically
it is here, in sbr/mf.c:my_lex()
if (iscntrl ((unsigned
43 matches