Re: Aw: Re: Re: Do you know a tool to decode UTF-8 twice
On 29/10/2013 17:15, Jörg Knappen wrote:

> After running this script, a few more things were there: non-normalised accents and some really strange encodings I could not really explain but rather guess their meanings, like s/Ãœ/Ü/g s/É/É/g s/AÌ€/À/g s/aÌ€/à/g s/EÌ€/È/g s/eÌ€/è/g s/„/„/g s/“/“/g s/ß/ß/g s/’/’/g s/Ä/Æ/g

It was probably not UTF-8 read as Latin-1 and re-encoded in UTF-8, but UTF-8 read as Windows-1252 ( http://en.wikipedia.org/wiki/Windows-1252 ) and re-encoded as UTF-8. Each of the combinations above contains a character absent from Latin-1 (œ‰€žŸ™„), and some of them are only present in Windows-1252 (‰™„) and not in ISO 8859-15 (Latin-9), the other possible mistake. I've checked that this is consistent with Ü, É and ß, but not with your Æ. This double encoding would give Ä: Ä = Win1252(C3 84) = 110.00011 10.000100 = UTF8(00011 000100) = Unicode 00C4 = Ä (and not Æ).

Frédéric
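Frédéric's bit arithmetic can be checked mechanically. The sketch below (Python 3, my restatement, not code from the thread) round-trips Ä through the suspected mis-decoding:

```python
# Verify the double-encoding hypothesis: UTF-8 bytes misread as
# Windows-1252 and re-encoded produce the mojibake seen in the data.
mojibake = "Ä".encode("utf-8").decode("cp1252")
print(mojibake)  # Ã„ -- the garbled form, matching the s/Ä/.../g pattern above

# Undoing the mistake recovers Ä (U+00C4), not Æ, as argued above.
recovered = mojibake.encode("cp1252").decode("utf-8")
print(recovered)  # Ä
```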
Aw: Re: Re: Re: Do you know a tool to decode UTF-8 twice
Thanks again! My updated sed pattern generator now looks like:

r = range(0xa0, 0x170)
file = open("fixu8.sed", "w")
for i in r:
    pat1 = "s/" + unichr(i).encode("utf-8").decode("latin-1").encode("utf-8") + "/" + unichr(i).encode("utf-8") + "/g"
    print >>file, pat1
    try:
        pat2 = "s/" + unichr(i).encode("utf-8").decode("windows-1252").encode("utf-8") + "/" + unichr(i).encode("utf-8") + "/g"
    except UnicodeDecodeError:
        pat2 = pat1
    if pat1 != pat2:
        print >>file, pat2

doing both Latin-1 and Windows-1252 mangled double UTF-8. This is probably enough for now; the rate of errors is low enough for practical purposes (i.e., lower than the natural error rate introduced by typing errors).

--Jörg Knappen
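For reference, here is a Python 3 rendering of the same generator (a sketch: the thread's script is Python 2 and its quote marks were stripped in transit, so the structure here is my reconstruction). It emits one sed substitution per mangled character, adding a second pattern only when the Windows-1252 mangling differs from the Latin-1 one:

```python
def double_utf8(ch: str, legacy: str) -> bytes:
    """Bytes produced when ch's UTF-8 encoding is misread as `legacy`
    and re-encoded as UTF-8."""
    return ch.encode("utf-8").decode(legacy).encode("utf-8")

def sed_patterns(lo: int = 0xA0, hi: int = 0x170):
    for i in range(lo, hi):
        good = chr(i).encode("utf-8")
        pat1 = b"s/" + double_utf8(chr(i), "latin-1") + b"/" + good + b"/g"
        yield pat1
        try:
            pat2 = b"s/" + double_utf8(chr(i), "cp1252") + b"/" + good + b"/g"
        except UnicodeDecodeError:
            # bytes 81, 8D, 8F, 90, 9D are undefined in Windows-1252
            pat2 = pat1
        if pat2 != pat1:
            yield pat2

# Write the patterns as raw bytes, since they contain non-ASCII sequences.
with open("fixu8.sed", "wb") as f:
    for pat in sed_patterns():
        f.write(pat + b"\n")
```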
Re: Aw: Re: Re: Re: Do you know a tool to decode UTF-8 twice
On 30/10/2013 16:13, Jörg Knappen wrote:

> Thanks again! My updated sed pattern generator now looks like: [...] doing both latin-1 and windows-1252 mangled double utf-8. This is probably enough for now, the rate of errors is low enough for practical purposes (i.e., lower than the natural error rate introduced by typing errors)

Why do you do both Latin-1 and Windows-1252? Windows-1252 is supposed to be a superset of Latin-1, so it should be enough. Or is there a problem with the few undefined bytes of Windows-1252 (81, 8D, 8F, 90, 9D)?

Frédéric
Aw: Re: Re: Re: Re: Do you know a tool to decode UTF-8 twice
The data did not only contain Latin-1-type mangling for the non-existent Windows characters, but also sequences with the raw C1 control characters for all of Latin-1. So I had to do them, too. The data weren't consistent at all, not even in their errors.

--Jörg Knappen
Re: Aw: Re: Re: Re: Re: Do you know a tool to decode UTF-8 twice
On 30/10/2013 17:32, Jörg Knappen wrote:

> [...] The data weren't consistent at all, not even in their errors.

Your question helped me dust off and repair a non-working Python snippet I wrote for a similar problem. I was stuck with the mixing of Windows-1252 and Latin-1 controls (mixed with Chinese characters). I write it below for reference. The Python snippet below does not need sed; it defines a function (unscramble(S)) which works on strings. The extension to files should be easy.

Frédéric Grosshans

def Step1Filter(S):
    for c in S:  # works character by character because of the cp1252/latin1 ambiguity
        try:
            yield c.encode('cp1252')
        except UnicodeEncodeError:
            yield c.encode('latin1')  # useful where cp1252 is undefined (81, 8D, 8F, 90, 9D)

def unscramble(S):
    return b''.join(c for c in Step1Filter(S)).decode('utf8')

PS: If anyone is interested in a licence, I consider this simple enough to be in the public domain and uncopyrightable.
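Restated as a self-contained block with a small demo appended (the sample strings are mine, picked from the mojibake earlier in the thread), the snippet behaves like this:

```python
def Step1Filter(S):
    # Works character by character because of the cp1252/latin1 ambiguity.
    for c in S:
        try:
            yield c.encode('cp1252')
        except UnicodeEncodeError:
            # Needed where cp1252 is undefined (81, 8D, 8F, 90, 9D):
            # fall back to latin1, i.e. the raw C1 control byte.
            yield c.encode('latin1')

def unscramble(S):
    return b''.join(Step1Filter(S)).decode('utf8')

print(unscramble("Ãœ"))      # Ü
print(unscramble("JÃ¶rg"))   # Jörg
print(unscramble("Ä\x81"))   # ā (U+0101; exercises the latin1 fallback)
```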
Re: Aw: Re: Re: Re: Re: Do you know a tool to decode UTF-8 twice
On Wed, Oct 30, 2013 at 9:56 AM, Frédéric Grosshans frederic.grossh...@gmail.com wrote:

> [...]

The encoding you've implemented is known as windows-1252 by the WHATWG and all browsers [1][2]. The implementation of cp1252 in Python is instead a direct consequence of the unicode.org definition [3].

[1] http://encoding.spec.whatwg.org/index-windows-1252.txt
[2] http://bukzor.github.io/encodings/cp1252.html
[3] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
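The difference can be demonstrated directly (a sketch; the function name is mine). Python's cp1252 codec rejects the five undefined bytes, while the WHATWG windows-1252 maps them to the corresponding C1 controls, which is exactly what the latin1 fallback in the snippet reproduces:

```python
def whatwg_windows1252_decode(data: bytes) -> str:
    out = []
    for b in data:
        try:
            out.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90, 0x9D: undefined in the unicode.org
            # CP1252 table, but C1 controls in the WHATWG encoding spec.
            out.append(chr(b))
    return "".join(out)

print(whatwg_windows1252_decode(b"\x80"))   # € (both definitions agree here)
print(whatwg_windows1252_decode(b"\x81"))   # U+0081 (WHATWG-only mapping)
```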
Re: Best practice of using regex on identify none-ASCII email address
Let me include the Unicode alias as well for a wider audience, since this topic came up a few times in the past.

From: James Lin james_...@symantec.com
Date: Wednesday, October 30, 2013 at 1:11 PM
To: cldr-us...@unicode.org
Subject: Best practice of using regex on identify none-ASCII email address

Hi, does anyone have a best practice or guideline on how to validate non-ASCII email addresses by using regular expressions? I looked through RFC 6531 and the CLDR repository, and nothing has a solid example of how to validate a non-ASCII email address. Thanks, everyone.

-James
RE: Best practice of using regex on identify none-ASCII email address
EAI doesn't really specify anything more than the older SMTP about validating email addresses. Everything in the local part >= U+0080 is permissible, and it is up to the server to sort out what characters it wants to allow, how it wants to map things like Turkish I, etc. Some code points are clearly really unhelpful in an email local part, but the EAI RFCs leave it up to the servers how they want to assign mailboxes. Obviously you could check the domain name to make sure it's a valid domain name, the ASCII range of the local part to make sure it respects the earlier RFCs, and the lengths, but you won't really know if it's a legal name until the mail does/doesn't get accepted by the server. AFAIK there isn't a published regex for doing the limited validation that is possible.

-Shawn
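A sketch of the "limited validation" described above (the regex, function name, and limits are illustrative, not a published standard): split at the last @, sanity-check the domain labels, bound the local-part length, and leave character policy to the receiving server:

```python
import re

# Permissive label check: ASCII letters/digits/hyphens or any non-ASCII
# character; real IDNA validation is stricter than this sketch.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9\u0080-\U0010FFFF-]+(?<!-)$")

def plausible_address(addr: str) -> bool:
    local, sep, domain = addr.rpartition("@")
    if not sep or not local or not domain:
        return False
    if len(local.encode("utf-8")) > 64:   # RFC 5321 local-part octet limit
        return False
    labels = domain.split(".")
    return len(labels) >= 2 and all(_LABEL.match(l) for l in labels)
```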
Re: Best practice of using regex on identify none-ASCII email address
Hi, I am not expecting a single regular expression to solve all possible combinations of scripts. What I am looking for (which may not be possible due to combinations of scripts and mixed scripts) is somewhere along the line of having individual scripts validated by regular expressions. I am still thinking whether it is possible to have regular expressions for individual scripts only, and not mix-and-match (for the time being), such as (I am being very high level here):

* Phags-pa scripts
  * Chinese: Traditional/Simplified
  * Mongolian
  * Sanskrit
  * ...
* Kana scripts
  * Japanese: Hiragana/Katakana
  * ...
* Hebrew scripts
  * Yiddish
  * Hebrew
  * Bukhori
  * ...
* Latin scripts
  * English
  * Italian
  * ...
* Hangul scripts
  * Korean
* Cyrillic scripts
  * Russian
  * Bulgarian
  * Ukrainian
  * ...

By focusing on each script to derive a regular expression, I was wondering if such validation can be accomplished here. Of course, RFC 3696 standardizes all email formatting rules, and we can use such rules to validate the format before checking the scripts for validity.

Warm Regards,
-James Lin

From: Paweł Dyda pawel.d...@gmail.com
Date: Wednesday, October 30, 2013 at 2:19 PM
To: James Lin james_...@symantec.com
Cc: cldr-us...@unicode.org, Unicode List unicode@unicode.org
Subject: Re: Best practice of using regex on identify none-ASCII email address

Hi James, I am not sure if you have seen my email, but... I believe regular expressions are not a valid tool for that job (that is, validating int'l email address format). In the internal email I gave one specific example where, to my knowledge, it is (nearly) impossible to use a regular expression to validate an email address. The reason I gave was the mixed-script scenario. How can we ensure that we allow a mixture of Hiragana, Katakana and Latin, while basically disallowing any other combinations with Latin (especially Latin + Cyrillic or Latin + Greek)? I am really curious to know... And of course there are several single-script (homograph and alike) attacks that we might want to prevent. I don't think it is even remotely possible with regular expressions. Please correct me if I am wrong.

Cheers, Paweł.
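For what the per-script idea could look like in code: the stdlib re module lacks Unicode script properties, but one can prototype with code-point ranges (the ranges and policy below are illustrative and cover only the Japanese example of Hiragana + Katakana + Latin; the third-party regex module's \p{Script=...} would be cleaner):

```python
# Allowed ranges for the Japanese case: Latin letters plus Hiragana and
# Katakana may mix; any other character is rejected.
ALLOWED_RANGES = [
    (0x0041, 0x005A), (0x0061, 0x007A),  # basic Latin letters
    (0x3040, 0x309F),                    # Hiragana
    (0x30A0, 0x30FF),                    # Katakana
]

def local_part_ok(s: str) -> bool:
    return bool(s) and all(
        any(lo <= ord(c) <= hi for lo, hi in ALLOWED_RANGES) for c in s
    )
```

Note this addresses only the whitelisting half of the problem; it does nothing about the homograph attacks Paweł mentions, which need confusable-detection data rather than range checks.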
RE: Best practice of using regex on identify none-ASCII email address
Mixed-script considerations are all supposed to be handled by the mailbox administrator. It's perfectly valid for a domain to assign Latin addresses and also Cyrillic ones. Indeed, for Cyrillic EAI, one would almost certainly require ASCII (e.g., Latin) aliases during whatever the transition period is. A German mailbox admin may only allow German letters and no other Latin characters in their mailbox names. Other admins may want to allow Latin characters with other scripts (CJK locales come to mind). And a Russian admin may provide all-Cyrillic mailboxes with all-Latin aliases to those names. (Hopefully that admin is being careful about homographs, but the standards still let the admin make the decisions.) The PUA isn't even forbidden (I'm hoping for a pIqaD alias some day).

-Shawn
Re: Best practice of using regex on identify none-ASCII email address
You should not attempt to detect scripts, or even assume that they are encoded based on Unicode, in the username part; all you can do is break at the first @ to split the address between the user name part and the domain name part, then use the IDN specs to validate the domain name part.

1. Domain name part:

You may want to restrict only to internet domains (which must contain a dot before the TLD), and validate the TLD label against a list that you do not restrict to local usage only (such as .local or .localnet), or only to your own domain. But I suggest that you validate all these domains only by performing an MX request on your DNS server (this could take time to reply, unless you just check the TLD part, which should be cached most often, or use the DNS request only for domains not in a well-known list of gTLDs, plus all 2-letter ccTLDs which are not in the private-use range of ISO 3166-1). Note that to send a mail you need an MX resolution in DNS to get the address of a mail server, but that does not mean it will be immediately and constantly reachable: the IP you get may be temporarily unreachable (due to your ISP or local routing problems, or because the remote mail server is temporarily offline or overloaded). Performing an MX request, however, is much faster than trying to send a mail to it, because MX resolution will use your local DNS server cache and the caches of upstream DNS servers of your ISP (you normally don't need to perform authoritative MX requests, which require recursive search from the root, bypassing all caches and the scalability of the DNS system, so it's not a good policy to do that by default). If you need security, authoritative DNS queries should be replaced by secure emails based on direct authentication with the mail server at the start of the SMTP session.

Authoritative DNS queries should be performed only if this authentication fails (in order to bypass incorrect data in DNS caches), but not automatically (the failure could be caused by problems on your own site), so delay these unchecked email addresses in your database (the problem may be solved without doing anything when your server retries several minutes or hours later, when it will have succeeded in sending the validation email to your subscribers). Do not insert into your database any email address coming from a source you don't trust to have received the approval of the mail address owner, or not obeying the same explicit approval policy seen by that user, or that is not in a domain under your own control; otherwise you risk being flagged as spamming and having your site blocked on various mail servers: you need to send the validation email without sending any other kind of advertising, except your own identity. Note that instead of a domain, you *may* accept a host name with an IPv4 address (in decimal dotted format), or an IPv6 address (within [brackets], in hexadecimal with colons), or some other host name formats for specific mail/messaging transport protocols you accept, for example username@[irc:ircservername:port:channelname], or username@{uuid}, using other punctuation not valid in domain names.

2. User name part:

There's no standard encoding there.
- Do not assume any encoding (unless you know the encoding used on each specific domain!). This part never obeys IDNA.
- Every unrestricted byte in the printable 7-bit ASCII range, and all bytes in 0x80..0xFF, are valid in any sequence.
- Only a few punctuation characters in the ASCII range need to be checked according to the RFCs.
- Never canonicalise user names by forcing the capitalisation (not even for the basic Latin letters: user names could be encoded with Base-64, for example, where letter case is significant), even if you can do it for the domain name part.
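The split-then-check approach for the domain part can be sketched with Python's stdlib IDNA codec (an assumption: the codec implements IDNA 2003; the MX-lookup step is omitted here since it needs a DNS library):

```python
def split_and_check_domain(addr: str):
    # Break at the FIRST @, as suggested above; the user-name part is
    # kept as an opaque string and not validated at all.
    user, sep, domain = addr.partition("@")
    if not sep or not user or not domain:
        raise ValueError("expected user@domain")
    # IDNA-encode the domain; raises UnicodeError on labels that are
    # not valid IDN labels (e.g. empty or over-long labels).
    return user, domain.encode("idna").decode("ascii")

print(split_and_check_domain("müller@bücher.example"))
# ('müller', 'xn--bcher-kva.example')
```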
RE: Best practice of using regex on identify none-ASCII email address
For EAI (the question being asked), the entire address, local part and domain, is encoded in UTF-8.

-Shawn
Re: Best practice of using regex on identify none-ASCII email address
2013/10/31 Shawn Steele shawn.ste...@microsoft.com:

> For EAI (the question being asked), the entire address, local part and domain, is encoded in UTF-8.

No, the question being asked (by James Lin) did NOT include this restriction: "does anyone have the best practice or guideline on how to validate non-ASCII email addresses by using regular expressions?" In his two replies, he did not add this restriction to EAI only (which is just a possible option on the Internet, not mandatory and frequently not followed in many domains).