Re: Aw: Re: Re: Do you know a tool to decode UTF-8 twice

2013-10-30 Thread Frédéric Grosshans

On 29/10/2013 17:15, Jörg Knappen wrote:
After running this script, a few more things were left over: non-normalised accents and some really strange
encodings I could not really explain, only guess at their meanings, like
s/Ãœ/Ü/g
s/É/É/g
s/AÌ€/À/g
s/aÌ€/à/g
s/EÌ€/È/g
s/eÌ€/è/g
s/„/„/g
s/“/“/g
s/ß/ß/g
s/’/’/g
s/Ä/Æ/g


It was probably not utf8 read as latin1 and re-encoded in utf8, but 
utf-8 read as Windows-1252 ( 
http://en.wikipedia.org/wiki/Windows-1252 ) and re-encoded as utf-8. Each 
of the combinations above contains a character absent from latin-1 
(œ‰€žŸ™„), and some of them are present only in Windows-1252 (‰™„) and 
not in ISO 8859-15 (Latin-9), the other possible mistake.


I've checked that this is consistent with Ü, É and ß, but not with your Æ. 
This double encoding would give Ä:
Ä = Win1252(C3 84) = 110.00011 10.000100 = UTF8(00011 000100) = Unicode 00C4 
= Ä (and not Æ)
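The same round trip can be checked in a couple of lines (a sketch in Python 3, where the cp1252 codec follows the unicode.org mapping):

```python
# U+00C4 Ä encodes to C3 84 in UTF-8; reading those bytes back as
# Windows-1252 gives "Ä„" (0x84 is U+201E „, absent from latin-1).
s = "Ä"
mojibake = s.encode("utf-8").decode("cp1252")
assert mojibake == "\u00c3\u201e"   # "Ä„", not "Æ"
# Reversing the mistake recovers the original character:
assert mojibake.encode("cp1252").decode("utf-8") == s
```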


   Frédéric




Aw: Re: Re: Re: Do you know a tool to decode UTF-8 twice

2013-10-30 Thread Jörg Knappen

Thanks again!



My updated sed pattern generator now looks like:






r = range(0xa0, 0x170)
file = open("fixu8.sed", "w")
for i in r:
    # mojibake pattern: the character's UTF-8 bytes misread as latin-1, re-encoded
    pat1 = "s/" + unichr(i).encode("utf-8").decode("latin-1").encode("utf-8") + "/" + unichr(i).encode("utf-8") + "/g"
    print >>file, pat1
    try:
        # same, misread as windows-1252 (raises for the 5 undefined bytes)
        pat2 = "s/" + unichr(i).encode("utf-8").decode("windows-1252").encode("utf-8") + "/" + unichr(i).encode("utf-8") + "/g"
    except UnicodeDecodeError:
        pat2 = pat1
    if pat1 != pat2:
        print >>file, pat2



It does both latin-1- and windows-1252-mangled double UTF-8. This is probably enough for now; the rate of errors is low
enough for practical purposes (i.e., lower than the natural error rate introduced by typing errors).
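For readers on Python 3, where unichr and the print statement are gone, an equivalent sketch of the generator might look like this (the filename fixu8.sed is taken from the post above):

```python
# Emit sed substitutions mapping double-encoded UTF-8 back to the
# correct characters, for both latin-1 and windows-1252 misreadings.
with open("fixu8.sed", "w", encoding="utf-8") as f:
    for i in range(0xA0, 0x170):
        good = chr(i)
        bad1 = good.encode("utf-8").decode("latin-1")
        print("s/%s/%s/g" % (bad1, good), file=f)
        try:
            bad2 = good.encode("utf-8").decode("windows-1252")
        except UnicodeDecodeError:
            # cp1252 leaves bytes 81, 8D, 8F, 90, 9D undefined
            bad2 = bad1
        if bad2 != bad1:
            print("s/%s/%s/g" % (bad2, good), file=f)
```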



--Jörg Knappen




Sent: Wednesday, 30 October 2013, 15:34
From: Frédéric Grosshans frederic.grossh...@gmail.com
To: unicode@unicode.org
Subject: Re: Aw: Re: Re: Do you know a tool to decode UTF-8 twice









Re: Aw: Re: Re: Re: Do you know a tool to decode UTF-8 twice

2013-10-30 Thread Frédéric Grosshans

On 30/10/2013 16:13, Jörg Knappen wrote:



Why do you do both latin1 and windows-1252? Windows-1252 is supposed to 
be a superset of latin1, so it should be enough. Or is there a problem 
with the few undefined bytes of windows-1252 (81, 8D, 8F, 90, 9D)?



Frédéric



Aw: Re: Re: Re: Re: Do you know a tool to decode UTF-8 twice

2013-10-30 Thread Jörg Knappen

The data did not only contain latin-1-type mangling for the characters that do not exist in Windows-1252, but also sequences with the raw
C1 control characters for all of latin-1. So I had to do them, too.

The data weren't consistent at all, not even in their errors.



--Jörg Knappen



Sent: Wednesday, 30 October 2013, 16:58
From: Frédéric Grosshans frederic.grossh...@gmail.com
To: Jörg Knappen jknap...@web.de
Cc: unicode@unicode.org
Subject: Re: Aw: Re: Re: Re: Do you know a tool to decode UTF-8 twice








Re: Aw: Re: Re: Re: Re: Do you know a tool to decode UTF-8 twice

2013-10-30 Thread Frédéric Grosshans

On 30/10/2013 17:32, Jörg Knappen wrote:
Your question helped me dust off and repair a non-working Python snippet 
I wrote for a similar problem. I was stuck on the mixing of 
windows-1252 and latin1 controls (linked with Chinese characters). I 
give it below for reference.


The Python snippet below does not need sed; it defines a function, 
unscramble(S), which works on strings. The extension to files should be 
easy.


Frédéric Grosshans


def Step1Filter(S):
    # Works character by character because of the cp1252/latin1 ambiguity
    for c in S:
        try:
            yield c.encode('cp1252')
        except UnicodeEncodeError:
            # Useful where cp1252 is undefined (81, 8D, 8F, 90, 9D)
            yield c.encode('latin1')

def unscramble(S):
    return b''.join(c for c in Step1Filter(S)).decode('utf8')

PS: If anyone is interested in a licence, I consider this simple enough 
to be in the public domain and uncopyrightable.
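As a quick illustration (a self-contained restatement with example inputs, Python 3 assumed), the mixed case it handles is a mojibake string where one byte only decodes via cp1252 and the next only via latin1:

```python
def unscramble(s):
    # Re-encode each character via cp1252, falling back to latin1 for
    # the five code points cp1252 cannot encode, then decode as UTF-8.
    def step1(s):
        for c in s:
            try:
                yield c.encode('cp1252')
            except UnicodeEncodeError:
                yield c.encode('latin1')
    return b''.join(step1(s)).decode('utf8')

assert unscramble('\u00c3\u0153') == '\u00dc'  # 'Ãœ' -> 'Ü' (pure cp1252 case)
assert unscramble('\u00c4\x90') == '\u0110'    # 'Ä' + raw C1 control -> 'Đ' (mixed case)
```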




Re: Aw: Re: Re: Re: Re: Do you know a tool to decode UTF-8 twice

2013-10-30 Thread Buck Golemon
On Wed, Oct 30, 2013 at 9:56 AM, Frédéric Grosshans frederic.grossh...@gmail.com wrote:



The encoding you've implemented above is known as windows-1252 by the 
WHATWG and all browsers [1][2].
The implementation of cp1252 in Python is instead a direct consequence of 
the unicode.org definition [3].

 [1] http://encoding.spec.whatwg.org/index-windows-1252.txt
 [2] http://bukzor.github.io/encodings/cp1252.html
 [3] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
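The difference is easy to demonstrate (a Python 3 sketch; the whatwg_1252_decode helper is illustrative, not a real library function):

```python
# Python's cp1252 follows the unicode.org table: 81 8D 8F 90 9D are undefined.
try:
    b"\x81".decode("cp1252")
    undefined_raises = False
except UnicodeDecodeError:
    undefined_raises = True
assert undefined_raises

# The WHATWG "windows-1252" used by browsers maps those bytes to the
# corresponding C1 controls instead; that behaviour can be emulated:
def whatwg_1252_decode(data):
    return "".join(
        chr(b) if b in (0x81, 0x8D, 0x8F, 0x90, 0x9D)
        else bytes([b]).decode("cp1252")
        for b in data
    )

assert whatwg_1252_decode(b"\x81\xc4") == "\u0081\u00c4"
```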


Re: Best practice of using regex to identify non-ASCII email addresses

2013-10-30 Thread James Lin
Let me include the Unicode alias as well for a wider audience, since this topic 
came up a few times in the past.

From: James Lin james_...@symantec.com
Date: Wednesday, October 30, 2013 at 1:11 PM
To: cldr-us...@unicode.org
Subject: Best practice of using regex to identify non-ASCII email addresses

Hi
Does anyone have a best practice or guideline on how to validate non-ASCII 
email addresses using regular expressions?

I looked through RFC 6531 and the CLDR repository, and nothing has a solid 
example of how to validate non-ASCII email addresses.

thanks everyone.
-James


RE: Best practice of using regex to identify non-ASCII email addresses

2013-10-30 Thread Shawn Steele
EAI doesn't really specify anything more than the older SMTP about validating 
email addresses.  Everything in the local part >= U+0080 is permissible, and it is 
up to the server to sort out what characters it wants to allow, how it wants to 
map things like Turkish I, etc.  Some code points are clearly really unhelpful 
in an email local part, but the EAI RFCs leave it up to the servers how they 
want to assign mailboxes.

Obviously you could check the domain name to make sure it's a valid domain 
name, and the ASCII range of the local part to make sure it respects the 
earlier RFCs, and the lengths, but you won't really know if it's a legal address 
until the mail does/doesn't get accepted by the server.  AFAIK there isn't a 
published regex for doing the limited validation that is possible.

-Shawn
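To make the "limited validation" above concrete, here is one possible sketch (the pattern is purely illustrative and is an assumption on my part, not a published standard; it checks only the syntax Shawn describes and deliberately says nothing about mixed scripts):

```python
import re

# RFC 5321 atext in the ASCII range, plus any character >= U+0080
# (EAI/SMTPUTF8 leaves those to the receiving server), as dot-separated
# atoms, followed by a dotted domain of plausible labels.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~\-\u0080-\U0010FFFF]"
LOCAL = rf"{ATEXT}+(?:\.{ATEXT}+)*"
LABEL = r"[A-Za-z0-9\u0080-\U0010FFFF](?:[A-Za-z0-9\-\u0080-\U0010FFFF]*[A-Za-z0-9\u0080-\U0010FFFF])?"
DOMAIN = rf"{LABEL}(?:\.{LABEL})+"
EAI_RE = re.compile(rf"^{LOCAL}@{DOMAIN}$")

assert EAI_RE.match("jörg.knappen@example.de")
assert EAI_RE.match("用户@例え.jp")
assert not EAI_RE.match("a@b")              # no dot in the domain
assert not EAI_RE.match("two@@example.com")
```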



Re: Best practice of using regex to identify non-ASCII email addresses

2013-10-30 Thread James Lin
Hi
I am not expecting a single regular expression to solve all possible 
combinations of scripts.  What I am looking for (which may not be possible, 
due to combinations of scripts and mixed scripts) is something along the 
lines of having individual scripts validated by regular expression.  I am 
still wondering whether it is possible to have regular expressions for 
individual scripts only, without mixing and matching (for the time being), 
such as (I am being very high level here):

 *   Phags-pa scripts
    *   Chinese: Traditional/Simplified
    *   Mongolian
    *   Sanskrit
    *   ...
 *   Kana scripts
    *   Japanese: Hiragana/Katakana
    *   ...
 *   Hebrew scripts
    *   Yiddish
    *   Hebrew
    *   Bukhori
    *   ...
 *   Latin scripts
    *   English
    *   Italian
    *   ...
 *   Hangul scripts
    *   Korean
 *   Cyrillic scripts
    *   Russian
    *   Bulgarian
    *   Ukrainian
    *   ...

By focusing on each script to derive a regular expression, I was wondering if 
such validation could be accomplished.

Of course, RFC 3696 standardizes the email formatting rules, and we can use 
those rules to validate the format before checking the scripts for validity.

Warm Regards,
-James Lin



From: Paweł Dyda pawel.d...@gmail.com
Date: Wednesday, October 30, 2013 at 2:19 PM
To: James Lin james_...@symantec.com
Cc: cldr-us...@unicode.org, Unicode List unicode@unicode.org
Subject: Re: Best practice of using regex to identify non-ASCII email addresses

Hi James,

I am not sure if you have seen my email, but... I believe regular expressions 
are not a valid tool for that job (that is, validating international email 
address format).

In the internal email I gave one specific example where, to my knowledge, it 
is (nearly) impossible to use a regular expression to validate an email 
address.

The reason I gave was the mixed-script scenario.

How can we ensure that we allow a mixture of Hiragana, Katakana and Latin, while 
basically disallowing any other combinations with Latin (especially Latin + 
Cyrillic or Latin + Greek)?
I am really curious to know...

And of course there are several single-script attacks (homographs and the like) 
that we might want to prevent. I don't think it is even remotely possible with 
regular expressions. Please correct me if I am wrong.

Cheers,
Paweł.
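A crude way to probe the mixed-script question in code (a heuristic sketch only: it keys off the first word of each character's Unicode name, whereas a proper check would use the Script property of UAX #24, which plain regular expressions indeed cannot express; the allow-list below is illustrative):

```python
import unicodedata

def name_prefixes(s):
    # First word of each character's Unicode name, as a rough script tag.
    return {unicodedata.name(c).split()[0] for c in s if unicodedata.name(c, "")}

# Assumed policy: Latin may mix with kana/CJK/digits, but nothing else.
allowed_with_latin = {"HIRAGANA", "KATAKANA", "CJK", "LATIN", "DIGIT"}

def plausible_mix(local_part):
    tags = name_prefixes(local_part)
    if "LATIN" in tags:
        return tags <= allowed_with_latin   # rejects Latin+Cyrillic, Latin+Greek, ...
    return True

assert plausible_mix("taro")                 # Latin only
assert plausible_mix("\u30bf\u30edtaro")     # Katakana + Latin
assert not plausible_mix("p\u0430yp\u0430l") # Latin + Cyrillic homograph
```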





RE: Best practice of using regex to identify non-ASCII email addresses

2013-10-30 Thread Shawn Steele
Mixed-script considerations are all supposed to be handled by the mailbox 
administrator.  It's perfectly valid for a domain to assign Latin addresses and 
also Cyrillic ones.  Indeed, for Cyrillic EAI one would almost certainly 
require ASCII (e.g. Latin) aliases during whatever the transition period is.

A German mailbox admin may allow only German letters and no other Latin 
characters in their mailbox names.  Other admins may want to allow Latin 
characters with other scripts (CJK locales come to mind).  And a Russian admin 
may provide all-Cyrillic mailboxes with all-Latin aliases to those names.  
(Hopefully that admin is being careful about homographs, but the standards still 
let the admin make the decisions.)

The PUA isn't even forbidden (I'm hoping for a pIqaD alias some day).

-Shawn




Re: Best practice of using regex to identify non-ASCII email addresses

2013-10-30 Thread Philippe Verdy
You should not attempt to detect scripts, or even assume that they are
encoded based on Unicode, in the username part; all you can do is
break at the first @ to split the address into the user name part and the
domain name, then use the IDN specs to validate the domain name part.

* 1. Domain name part:

You may want to restrict yourself to internet domains (which must contain a dot
before the TLD), and validate the TLD label against a list that you do not
restrict to local usage only (such as .local or .localnet) or to your own
domain; but I suggest that you validate all these domains only by
performing an MX request on your DNS server (this could take time to reply,
unless you just check the TLD part, which should be cached most often, or
use the DNS request only for domains not in a well-known list of gTLDs, plus
all 2-letter ccTLDs which are not in the private-use range of ISO 3166-1).

Note that to send a mail you need MX resolution on DNS to get
the address of a mail server, but that does not mean it will be immediately
and constantly reachable: the IP you get may be temporarily unreachable
(due to your ISP or local routing problems, or because the remote mail
server is temporarily offline or overloaded). Performing an MX request,
however, is much faster than trying to send a mail, because MX
resolution will use your local DNS server cache and the caches of upstream DNS
servers at your ISP (you normally don't need to perform authoritative MX
requests, which require a recursive search from the root, bypassing all
caches and the scalability of the DNS system, so it's not a good policy to
do that by default).

If you need security, authoritative DNS queries should be replaced by
secure emails based on direct authentication with the mail server at the start
of the SMTP session. Authoritative DNS queries should be performed only if
this authentication fails (in order to bypass incorrect data in DNS
caches), but not automatically (the failure could be caused by problems on your
own site); so delay these unchecked email addresses in your database (the
problem may be solved without doing anything when your server retries
several minutes or hours later, by which time it may have succeeded in sending
the validation email to your subscribers).

Do not insert into your database any email address coming from a source
you don't trust to have received the approval of the mail address owner,
or one not obeying the same explicit approval policy seen by that user, or
one that is not in a domain under your own control; otherwise you risk being
flagged as spamming and having your site blocked on various mail servers. You
need to send the validation email without sending any other kind of
advertising, except your own identity.

Note that instead of a domain, you *may* accept a host name with an IPv4
address (in dotted decimal format), or an IPv6 address (within [brackets],
in hexadecimal with colons), or some other host name formats for
specific mail/messaging transport protocols you accept, for example
username@[irc:ircservername:port:channelname], or username@{uuid},
using other punctuation not valid in domain names.


* 2. User name part:

There's no standard encoding there.

- Do not assume any encoding (unless you know the encoding used on each
specific domain!). This part never obeys IDNA.
- Every unrestricted byte in the printable 7-bit ASCII range, and all bytes
in 0x80..0xFF, are valid in any sequence.
- Only a few punctuation characters in the ASCII range need to be checked
according to the RFCs.
- Never canonicalise user names by forcing capitalisation (not even
for the basic Latin letters: user names could be encoded with Base64
for example, where letter case is significant), even if you can do it for
the domain name part.
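The last point, never case-folding the local part, can be illustrated in a few lines (the addresses are hypothetical examples):

```python
# Split at the first @, per the advice above; domain names compare
# case-insensitively, but the local part must be left untouched.
a_local, a_domain = "John.Doe@Example.COM".split("@", 1)
b_local, b_domain = "john.doe@example.com".split("@", 1)
assert a_domain.lower() == b_domain.lower()  # same mail domain
assert a_local != b_local                    # possibly different mailboxes
```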






RE: Best practice of using regex to identify non-ASCII email addresses

2013-10-30 Thread Shawn Steele
For EAI (the question being asked), the entire address, local part and domain, 
are encoded in UTF-8.

-Shawn


Re: Best practice of using regex to identify non-ASCII email addresses

2013-10-30 Thread Philippe Verdy
2013/10/31 Shawn Steele shawn.ste...@microsoft.com

  For EAI (the question being asked), the entire address, local part and
 domain, are encoded in UTF-8.


No, the question being asked (by James Lin) did NOT include this
restriction:

 does anyone have a best practice or guideline on how to validate
non-ASCII email addresses using regular expressions?

In his two replies, he did not add this restriction to EAI only (which is
just a possible option on the Internet, not mandatory and frequently not
followed in many domains).
followed in many domains).