Re: Best practice of using regex on identify none-ASCII email address

James Lin Wed, 30 Oct 2013 15:01:02 -0700

Hi
I am not expecting a single regular expression to handle every possible 
combination of scripts.  What I am looking for (which may not be possible, 
given the number of scripts and mixed-script cases) is something along the 
lines of one regular expression per script, each validating only that 
script.  I am still considering whether it is possible to have regular 
expressions for individual scripts only, with no mixing for the time being, 
such as (I am being very high level here):


 *   Phags-pa scripts
    *   Chinese: Traditional/Simplified
    *   Mongolian
    *   Sanskrit
    *   ...
 *   Kana scripts
    *   Japanese: Hiragana/Katakana
    *   ...
 *   Hebrew scripts
    *   Yiddish
    *   Hebrew
    *   Bukhori
    *   ...
 *   Latin scripts
    *   English
    *   Italian
    *   ...
 *   Hangul scripts
    *   Korean
 *   Cyrillic scripts
    *   Russian
    *   Bulgarian
    *   Ukrainian
    *   ...

By deriving a separate regular expression for each script, I wonder whether 
such validation can be accomplished.
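
A minimal sketch of the per-script idea in Python, assuming one character 
class per script built by hand from Unicode block ranges. The ranges, the 
SCRIPT_CLASSES table, and the single_script_of helper are illustrative 
assumptions, not complete script definitions; the standard re module has no 
\p{Script=...} property, so the ranges are spelled out explicitly.

```python
import re
from typing import Optional

# Illustrative character classes per script (assumed, incomplete ranges).
SCRIPT_CLASSES = {
    "Latin": r"A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F",
    "Cyrillic": r"\u0400-\u04FF",
    "Hebrew": r"\u05D0-\u05EA",
    "Hiragana": r"\u3041-\u3096",
    "Katakana": r"\u30A1-\u30FA\u30FC",
    "Hangul": r"\uAC00-\uD7A3",
}

# One compiled pattern per script: a non-empty run of that script's letters.
SINGLE_SCRIPT = {
    name: re.compile(f"[{cls}]+") for name, cls in SCRIPT_CLASSES.items()
}


def single_script_of(label: str) -> Optional[str]:
    """Return the name of the one script that covers `label`, or None."""
    for name, pattern in SINGLE_SCRIPT.items():
        if pattern.fullmatch(label):
            return name
    return None
```

A label that mixes scripts matches none of the patterns, so it is rejected 
by construction. Digits and punctuation are left out of the sketch and would 
need their own handling.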

Of course, RFC 3696 describes the email formatting rules, and we can use 
those rules to validate the format before checking the scripts for validity.
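
A rough sketch of that first, format-only step, assuming it is limited to 
splitting at the last "@" and applying the length limits given in RFC 3696 
(64 octets for the local part, 255 for the domain). The split_address helper 
is hypothetical, and counting UTF-8 octets for non-ASCII input is an 
assumption of this sketch.

```python
from typing import Optional, Tuple

MAX_LOCAL_OCTETS = 64    # local-part limit given in RFC 3696
MAX_DOMAIN_OCTETS = 255  # domain limit given in RFC 3696


def split_address(address: str) -> Optional[Tuple[str, str]]:
    """Split at the last '@' and apply basic length checks.

    Returns (local_part, domain), or None if the shape is clearly wrong.
    Quoted local parts and other finer rules are deliberately not handled.
    """
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        return None
    # Assumption: for non-ASCII addresses, count the UTF-8 encoded length.
    if len(local.encode("utf-8")) > MAX_LOCAL_OCTETS:
        return None
    if len(domain.encode("utf-8")) > MAX_DOMAIN_OCTETS:
        return None
    return local, domain
```

The two parts returned here could then be handed to whatever script check 
is chosen.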

Warm Regards,
-James Lin



From: Paweł Dyda <[email protected]>
Date: Wednesday, October 30, 2013 at 2:19 PM
To: James Lin <[email protected]>
Cc: "[email protected]" <[email protected]>, Unicode List <[email protected]>
Subject: Re: Best practice of using regex on identify none-ASCII email address

Hi James,

I am not sure whether you have seen my email, but I believe regular 
expressions are not a valid tool for this job (that is, validating the format 
of international email addresses).

In the internal email I gave one specific example where, to my knowledge, it 
is (nearly) impossible to use a regular expression to validate an email 
address.

The reason I gave was the mixed-script scenario.

How can we ensure that we allow a mixture of Hiragana, Katakana and Latin, 
while basically disallowing any other combination with Latin (especially 
Latin + Cyrillic or Latin + Greek)?
I am really curious to know...
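
To make the mixed-script requirement concrete, here is a procedural sketch 
in Python rather than a single regular expression. It assumes the script of 
a letter can be approximated by the first word of its Unicode character 
name, which is only a heuristic; the ALLOWED_COMBINATIONS policy and both 
helper names are hypothetical.

```python
import unicodedata

# Hypothetical policy: each entry is a set of scripts that may appear together.
ALLOWED_COMBINATIONS = [
    frozenset({"LATIN"}),
    frozenset({"CYRILLIC"}),
    frozenset({"GREEK"}),
    frozenset({"LATIN", "HIRAGANA", "KATAKANA"}),
]


def scripts_in(text: str) -> set:
    """Collect an approximate script name for every letter in `text`."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue  # digits, dots and hyphens carry no script in this sketch
        name = unicodedata.name(ch, "UNKNOWN")
        # Heuristic: the first word of the character name stands in for the script.
        scripts.add(name.replace("-", " ").split()[0])
    return scripts


def is_allowed_mix(text: str) -> bool:
    """True if the letters in `text` fit inside one allowed combination."""
    found = scripts_in(text)
    return bool(found) and any(found <= allowed for allowed in ALLOWED_COMBINATIONS)
```

Under this policy Latin + Hiragana + Katakana passes, while Latin + Cyrillic 
and Latin + Greek fail. Text with no letters at all is rejected here and 
would need separate handling. Nothing in this sketch addresses 
single-script homographs.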

And of course there are several single-script attacks (homographs and the 
like) that we might want to prevent. I don't think that is even remotely 
possible with regular expressions. Please correct me if I am wrong.

Cheers,
Paweł.


2013/10/30 James Lin <[email protected]>
Let me include the Unicode alias as well for a wider audience, since this 
topic has come up a few times in the past.

From: James Lin <[email protected]>
Date: Wednesday, October 30, 2013 at 1:11 PM
To: "[email protected]" <[email protected]>
Subject: Best practice of using regex on identify none-ASCII email address

Hi
Does anyone have a best practice or guideline on how to validate non-ASCII 
email addresses using regular expressions?

I looked through RFC 6531 and the CLDR repository, and neither has a solid 
example of how to validate a non-ASCII email address.

Thanks, everyone.
-James
