Re: Best practice of using regex on identify none-ASCII email address

Mark Davis ☕ Fri, 01 Nov 2013 06:23:43 -0700

I'm not saying that what is sent to the server has to be those bytes; I'm
saying that if we use the convention that punctuation, whitespace, etc. get
escaped, it would allow us to recognize the boundaries of the local part in
plain text.

I think what you mention is part of a more general problem. Let's suppose
that I have an email address where the bytes that the server recognizes for
the local part are <61 B3>@foo.com. I convert that using ISO 8859-14 to
aġ@foo.com. I send it in an email to you, and you receive it as UTF-8. You see
aġ@foo.com, but underneath the covers it is bytes <61 C4 A1>. But then you
send to the server <61 C4 A1>@foo.com, and it fails. Or, worse yet, it reaches
someone whose email is aÄ¡@foo.com. (OK, I could have poked around and
found a more compelling example, but you see the point.)
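
The round trip above can be reproduced in a few lines of Python. This is
only a sketch: it assumes the charset meant is ISO 8859-14 (Python codec
"iso8859_14"), where byte B3 is ġ, and that the misreading server treats the
bytes as Latin-1. The variable names are purely illustrative.

```python
# Bytes the server recognizes for the local part (taken from the example).
server_bytes = bytes([0x61, 0xB3])

# The sender decodes them with ISO 8859-14 (assumed charset) for display.
displayed = server_bytes.decode("iso8859_14")

# The recipient's client holds the text as UTF-8, so the bytes change.
resent_bytes = displayed.encode("utf-8")

# A server reading those bytes as Latin-1 (assumption) sees another address.
misread = resent_bytes.decode("latin-1")

print(displayed)                      # aġ
print(resent_bytes.hex())             # 61c4a1
print(resent_bytes == server_bytes)   # False
print(misread)                        # aÄ¡
```

Nothing in the visible text changes between the first and second step, which
is why the mismatch goes unnoticed until the server is contacted.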

If I really wanted to be absolutely certain that my email wouldn't be
munged by a conversion, I'd never convert from bytes: we'd never see
"[email protected]"; we'd always see the equivalent of %6d%61%[email protected].
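
As a rough illustration of that "never convert from bytes" form, the sketch
below percent-escapes every byte of a local part, ASCII letters included.
The helper name escape_all_bytes is hypothetical and not part of any
specification; unquote_to_bytes is the standard-library function from
urllib.parse.

```python
from urllib.parse import unquote_to_bytes


def escape_all_bytes(local_part: bytes) -> str:
    """Percent-escape every byte of the local part (hypothetical helper)."""
    return "".join(f"%{b:02x}" for b in local_part)


# The local part from the earlier example: bytes <61 B3>.
raw = bytes([0x61, 0xB3])
escaped = escape_all_bytes(raw)
print(escaped)  # %61%b3

# Unescaping returns exactly the original bytes; no charset is involved.
recovered = unquote_to_bytes(escaped)
print(recovered == raw)  # True
```

Because the escaped form is pure ASCII, it survives any charset conversion
along the way, at the cost of being unreadable to people.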

Mark <https://google.com/+MarkDavis>
*— Il meglio è l’inimico del bene —*

On Fri, Nov 1, 2013 at 1:36 PM, Philippe Verdy <[email protected]> wrote:

>
>
> 2013/11/1 Mark Davis ☕ <[email protected]>
>
>> These are two well-known serious flaws in EAI and URLs; there is no
>> useful syntactic limit on what is in the query part of a URL or on the
>> local part of an email address that would allow their boundaries to be
>> detected in plaintext.
>>
>> No use complaining about them, because people are concerned with
>> backwards compatibility, and wouldn't change the underlying specs.
>>
>> That being true, I wish that industry could come to consensus about
>> requiring everything outside of a well-defined, backwards-compatible set of
>> characters to be expressed as UTF-8 percent-escaped characters in these
>> fields when they are expressed as plaintext. (Something like XID_Continue ±
>> exceptions.) That would allow for unambiguous parsing in plaintext.
>>
>
> Why "UTF-8" only? There already exist email accounts created with
> various ISO 8859-* or Windows code pages, or KOI8-R (or KOI8-U). And none of
> these addresses is aliased with a UTF-8 encoded account name reaching the same
> mailbox (creating these aliases would help users who have such accounts
> to protect their privacy; however, there may be rare cases where these
> aliases would conflict with distinct mail accounts).
>
