RE: Best practice of using regex on identify none-ASCII email address

Shawn Steele Wed, 30 Oct 2013 16:30:07 -0700

For EAI (the question being asked), the entire address, local part and domain, 
is encoded in UTF-8.

-Shawn

From: [email protected] [mailto:[email protected]] On Behalf 
Of Philippe Verdy
Sent: Wednesday, October 30, 2013 4:08 PM
To: James Lin
Cc: Paweł Dyda; [email protected]; [email protected]
Subject: Re: Best practice of using regex on identify none-ASCII email address

You should not attempt to detect scripts, or even assume that the username 
part is encoded in Unicode; all you can do is break at the first "@" to split 
the address into the user name part and the domain name, then use the IDN 
specs to validate the domain name part.
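
The split-then-validate step above can be sketched in Python as follows. This 
is only an illustration: the function names are hypothetical, and the domain 
check relies on Python's built-in "idna" codec, which implements the older 
IDNA 2003 rules rather than IDNA 2008. Note also that a quoted user name part 
may itself contain "@", which this simple split does not handle.

```python
def split_address(address: str) -> tuple[str, str]:
    """Split at the first "@" into (user name part, domain name part)."""
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain:
        raise ValueError("expected user@domain")
    return local, domain


def domain_to_ascii(domain: str) -> str:
    """Convert the domain to its ASCII (Punycode) form.

    Uses Python's built-in IDNA 2003 codec; it raises UnicodeError when a
    label is empty or too long.
    """
    return domain.encode("idna").decode("ascii")


# The user name part is returned untouched; only the domain is converted.
local, domain = split_address("用户@bücher.example")
print(local, domain_to_ascii(domain))
```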

* 1. Domain name part:

You may want to accept only internet domains (which must contain a dot before 
the TLD) and validate the TLD label against a list that excludes TLDs reserved 
for local usage (such as .local or .localnet) or for your own domain. However, 
I suggest that you validate all these domains simply by performing an MX 
request on your DNS server. This could take time to reply, unless you just 
check the TLD part, which should be cached most often, or use the DNS request 
only for domains not in a well-known list of gTLDs plus all 2-letter ccTLDs 
that are not in the private-use range of ISO 3166-1.
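
A minimal sketch of that TLD pre-check, in Python. The names and the gTLD set 
are hypothetical (the set is deliberately truncated), and any 2-letter label 
outside the ISO 3166-1 user-assigned codes (AA, QM to QZ, XA to XZ, ZZ) is 
treated as a plausible ccTLD, which is an assumption, not a lookup in the real 
registry. The MX request itself is left out because the Python standard 
library has no MX resolver.

```python
LOCAL_ONLY_TLDS = {"local", "localnet"}   # examples taken from the text above
KNOWN_GTLDS = {"com", "net", "org"}       # hypothetical, truncated list


def is_private_use_alpha2(code: str) -> bool:
    """True for the user-assigned ISO 3166-1 alpha-2 codes."""
    code = code.upper()
    return (
        code in {"AA", "ZZ"}
        or (code[0] == "Q" and "M" <= code[1] <= "Z")
        or code[0] == "X"
    )


def needs_dns_check(domain: str) -> bool:
    """Return True if the domain should be confirmed with an MX request.

    Raises ValueError for domains that are rejected outright.
    """
    labels = domain.rstrip(".").lower().split(".")
    if len(labels) < 2 or any(not label for label in labels):
        raise ValueError("not an internet domain (no dot before the TLD)")
    tld = labels[-1]
    if tld in LOCAL_ONLY_TLDS:
        raise ValueError("TLD is reserved for local use")
    if tld in KNOWN_GTLDS:
        return False
    looks_like_cctld = len(tld) == 2 and tld.isascii() and tld.isalpha()
    if looks_like_cctld and not is_private_use_alpha2(tld):
        return False
    return True
```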

Note that to send a mail, you need an MX resolution on DNS to get the address 
of a mail server, but that does not mean the server will be immediately and 
constantly reachable: the IP you get may be temporarily unreachable (due to 
your ISP or local routing problems, or because the remote mail server is 
temporarily offline or overloaded). Performing an MX request, however, is much 
faster than trying to send a mail, because MX resolution will use your local 
DNS server cache and the caches of the upstream DNS servers of your ISP. You 
normally don't need to perform authoritative MX requests, which require a 
recursive search from the root, bypassing all caches and the scalability of 
the DNS system, so it's not a good policy to do that by default.

If you need security, authoritative DNS queries should be replaced by secure 
emails based on direct authentication with the mail server at the start of the 
SMTP session. Authoritative DNS queries should be performed only if this 
authentication fails (in order to bypass incorrect data in DNS caches), but 
not automatically (the failure could be caused by problems on your own site), 
so keep these unchecked email addresses pending in your database (the problem 
may be solved without doing anything when your server retries several minutes 
or hours later and succeeds in sending the validation email to your 
subscribers).

Do not insert into your database any email address coming from a source that 
you don't trust to have obtained the approval of the address owner, or that 
does not obey the same explicit approval policy seen by that user, or that is 
not in a domain under your own control; otherwise you risk being flagged as a 
spammer and having your site blocked on various mail servers. You need to send 
the validation email without any other kind of advertising, apart from your 
own identity.

Note that instead of a domain, you *may* accept a host given as an IPv4 
address (in dotted decimal format), or as an IPv6 address (within [brackets], 
in hexadecimal with colons), or some other host name formats for specific 
mail/messaging transport protocols you accept, for example 
"username@[irc:ircservername:port:channelname]", or "username@{uuid}" using 
other punctuation not valid in domain names.
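
As a rough sketch of how the IP-address forms could be recognised, the 
standard ipaddress module can do the parsing. The function name is 
hypothetical; accepting the "IPv6:" tag inside the brackets follows the 
RFC 5321 address-literal syntax and is an addition to what is described 
above. The protocol-specific forms (irc, uuid) are not handled here.

```python
import ipaddress


def classify_host(host: str) -> str:
    """Return "ipv4", "ipv6", or "domain" for the part after the "@"."""
    if host.startswith("[") and host.endswith("]"):
        inner = host[1:-1]
        # RFC 5321 writes IPv6 literals as [IPv6:...]; accept both spellings.
        if inner.lower().startswith("ipv6:"):
            inner = inner[5:]
        try:
            addr = ipaddress.ip_address(inner)
        except ValueError:
            raise ValueError("unsupported bracketed host: " + host) from None
        return "ipv6" if addr.version == 6 else "ipv4"
    try:
        ipaddress.IPv4Address(host)
        return "ipv4"
    except ValueError:
        # Not a dotted-decimal address: treat it as a domain name, to be
        # validated separately with the IDN rules.
        return "domain"
```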

* 2. User name part:

There's no standard encoding there.

- Do not assume any encoding (unless you know the encoding used on each 
specific domain!). This part never obeys IDNA.
- Every unrestricted byte in the printable 7-bit ASCII range, and all bytes in 
0x80..0xFF, are valid in any sequence.
- Only a few punctuation characters in the ASCII range need to be checked 
according to the RFCs.
- Never "canonicalise" user names by forcing the capitalisation (not even for 
the basic Latin letters: user names could be encoded with Base-64, for example, 
where letter case is significant), even if you can do it for the domain name 
part.
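
The rules above could be sketched like this in Python. Both function names 
are hypothetical. The byte check treats the space (0x20) as restricted, which 
is an assumption, and it does not implement the RFC punctuation checks at all.

```python
def normalize_address(address: str) -> str:
    """Lowercase the domain only; leave the user name part untouched."""
    local, sep, domain = address.partition("@")
    if not sep:
        raise ValueError("missing '@'")
    return local + "@" + domain.lower()


def local_part_bytes_ok(local: bytes) -> bool:
    """Accept printable ASCII (0x21..0x7E) and any byte in 0x80..0xFF.

    No encoding is assumed: the user name part is handled as raw bytes.
    """
    return bool(local) and all(0x21 <= b <= 0x7E or b >= 0x80 for b in local)
```

For example, normalize_address("John.Doe@Example.COM") keeps "John.Doe" as 
written and only changes the domain to "example.com".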

2013/10/30 James Lin <[email protected]<mailto:[email protected]>>
Hi
I am not expecting a single regular expression to solve all possible 
combinations of scripts.  What I am looking for (which may not be possible 
because of script combinations and mixed scripts) is something along the lines 
of validating individual scripts with regular expressions.  I am still 
wondering whether it is possible to have regular expressions for individual 
scripts only, without mixing (for the time being), such as (I am being very 
high level here):

  *    Phags-pa scripts

     *   Chinese: Traditional/Simplified
     *   Mongolian
     *   Sanskrit
     *   ...

  *   Kana scripts

     *   Japanese: Hiragana/Katakana
     *   ...

  *   Hebrew scripts

     *   Yiddish
     *   Hebrew
     *   Bukhori
     *   …

  *   Latin scripts

     *   English
     *   Italian
     *   …

  *   Hangul scripts

     *   Korean

  *   Cyrillic Scripts

     *   Russian
     *   Bulgarian
     *   Ukrainian
     *   ...
By focusing on each script to derive a regular expression, I was wondering 
whether such validation can be accomplished.

Of course, RFC 3696 standardizes the email formatting rules, and we can use 
those rules to validate the format before checking the scripts for validity.

Warm Regards,
-James Lin

From: Paweł Dyda <[email protected]<mailto:[email protected]>>
Date: Wednesday, October 30, 2013 at 2:19 PM
To: James Lin <[email protected]<mailto:[email protected]>>
Cc: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>, Unicode List 
<[email protected]<mailto:[email protected]>>

Subject: Re: Best practice of using regex on identify none-ASCII email address

Hi James,
I am not sure if you have seen my email, but... I believe Regular Expressions 
are not a valid tool for that job (that is validating Int'l email address 
format).

In the internal email I gave one specific example where, to my knowledge, it 
is (nearly) impossible to use a Regular Expression to validate an email 
address.

The reason I gave was the mixed-script scenario.

How can we ensure that we allow a mixture of Hiragana, Katakana and Latin, 
while basically disallowing any other combinations with Latin (especially 
Latin + Cyrillic or Latin + Greek)?
I am really curious to know...
And of course there are several single-script attacks (homographs and the 
like) that we might want to prevent. I don't think it is even remotely 
possible with Regular Expressions. Please correct me if I am wrong.
Cheers,
Paweł.

2013/10/30 James Lin <[email protected]<mailto:[email protected]>>
Let me include the Unicode alias as well for a wider audience, since this 
topic has come up a few times in the past.

From: James Lin <[email protected]<mailto:[email protected]>>
Date: Wednesday, October 30, 2013 at 1:11 PM
To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: Best practice of using regex on identify none-ASCII email address

Hi
Does anyone have a best practice or guideline on how to validate non-ASCII 
email addresses using regular expressions?

I looked through RFC 6531 and the CLDR repository, and nothing has a solid 
example of how to validate a non-ASCII email address.

thanks everyone.
-James
