Hi
I am not expecting a single regular expression to handle every possible
combination of scripts. What I am looking for (which may not be possible,
given mixed scripts and combinations of scripts) is something along the
lines of validating each script individually with its own regular expression. I
am still considering whether it is possible to have regular expressions for
individual scripts only, with no mixing and matching for the time being, such
as (I am being very high level here):
* Phags-pa scripts
  * Chinese: Traditional/Simplified
  * Mongolian
  * Sanskrit
  * ...
* Kana scripts
  * Japanese: Hiragana/Katakana
  * ...
* Hebrew scripts
  * Yiddish
  * Hebrew
  * Bukhori
  * ...
* Latin scripts
  * English
  * Italian
  * ...
* Hangul scripts
  * Korean
* Cyrillic scripts
  * Russian
  * Bulgarian
  * Ukrainian
  * ...
By deriving a regular expression for each script separately, I was wondering
whether such validation can be accomplished.
Of course, RFC3696 standardizes the email formatting rules, and we can use those
rules to validate the format before checking the scripts for validity.
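To make the two-step idea concrete, here is a minimal Python sketch: a very loose format check first, then one regular expression per script applied to the local part. The names `SCRIPT_PATTERNS`, `split_address` and `single_script_of` are hypothetical, the character ranges are approximate Unicode block ranges rather than complete script definitions, and the format check is only a placeholder for the real formatting rules.

```python
import re

# Characters allowed alongside any script in this sketch (an assumption).
COMMON = r"0-9._\-"

# Approximate per-script classes built from Unicode block ranges.
# These ranges are illustrative only; they are not full script definitions.
SCRIPT_PATTERNS = {
    "latin": re.compile(rf"[{COMMON}A-Za-z\u00C0-\u024F]+"),
    "cyrillic": re.compile(rf"[{COMMON}\u0400-\u04FF]+"),
    "hebrew": re.compile(rf"[{COMMON}\u0590-\u05FF]+"),
    "hangul": re.compile(rf"[{COMMON}\u1100-\u11FF\uAC00-\uD7A3]+"),
    "kana": re.compile(rf"[{COMMON}\u3040-\u30FF]+"),
    "han": re.compile(rf"[{COMMON}\u4E00-\u9FFF]+"),
}


def split_address(address):
    """Placeholder format check: split on the last '@', require both halves."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return None
    return local, domain


def single_script_of(local_part):
    """Return the first script whose pattern matches the whole local part."""
    for name, pattern in SCRIPT_PATTERNS.items():
        if pattern.fullmatch(local_part):
            return name
    return None  # mixed scripts, or a script this sketch does not cover
```

With this sketch, a local part that mixes scripts (for example Latin letters with one Cyrillic letter) matches none of the patterns and is rejected, which is the "no mix-match for the time being" behaviour described above.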
Warm Regards,
-James Lin
From: Paweł Dyda <[email protected]>
Date: Wednesday, October 30, 2013 at 2:19 PM
To: James Lin <[email protected]>
Cc: "[email protected]" <[email protected]>, Unicode List <[email protected]>
Subject: Re: Best practice of using regex on identify none-ASCII email address
Hi James,
I am not sure if you have seen my email, but... I believe regular expressions
are not a valid tool for that job (that is, validating Int'l email address
format).
In the internal email I gave one specific example where, to my knowledge, it
is (nearly) impossible to use a regular expression to validate an email
address.
The reason I gave was the mixed-script scenario.
How can we ensure that we allow a mixture of Hiragana, Katakana and Latin, while
disallowing any other combination with Latin (especially Latin +
Cyrillic or Latin + Greek)?
I am really curious to know...
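One way to frame the mixed-script question in code, not as a regular expression but as a check on the set of scripts found in a string, is sketched below in Python. This is only an illustration under stated assumptions: `ALLOWED_COMBINATIONS`, `scripts_in` and `is_allowed_mix` are hypothetical names, the policy list is made up for the example, and the script label is a rough heuristic taken from the first word of each character's Unicode name.

```python
import unicodedata

# Hypothetical policy: each set lists scripts that may appear together.
ALLOWED_COMBINATIONS = [
    {"LATIN"},
    {"CYRILLIC"},
    {"GREEK"},
    {"LATIN", "HIRAGANA", "KATAKANA"},
]


def scripts_in(text):
    """Collect a rough script label for every letter in the text."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # digits and punctuation carry no script in this sketch
        # Heuristic: first word of the character name, e.g. "CYRILLIC".
        found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


def is_allowed_mix(text):
    """True if every script present fits inside one allowed combination."""
    found = scripts_in(text)
    return any(found <= allowed for allowed in ALLOWED_COMBINATIONS)
```

Under this hypothetical policy, a string mixing Latin with Hiragana and Katakana passes, while Latin with a single Cyrillic or Greek letter does not. The name-based heuristic is crude (some characters have names that do not start with a script name), so a real implementation would need proper script data.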
And of course there are several single-script attacks (homographs and the
like) that we might want to prevent. I don't think it is even remotely possible
with regular expressions. Please correct me if I am wrong.
Cheers,
Paweł.
2013/10/30 James Lin <[email protected]>
Let me include the Unicode alias as well for a wider audience, since this topic
has come up a few times in the past.
From: James Lin <[email protected]>
Date: Wednesday, October 30, 2013 at 1:11 PM
To: "[email protected]" <[email protected]>
Subject: Best practice of using regex on identify none-ASCII email address
Hi
Does anyone have a best practice or guideline on how to validate non-ASCII
email addresses using regular expressions?
I looked through RFC6531 and the CLDR repository, and neither has a solid
example of how to validate a non-ASCII email address.
Thanks, everyone.
-James