[whatwg] Comments on the definition of a valid e-mail address

2009-08-23 Thread Aryeh Gregor
Section 4.10.4.1.5 defines a valid e-mail address as follows:

A valid e-mail address is a string that matches the production
dot-atom-text @ dot-atom-text where dot-atom-text is defined in RFC
5322 section 3.2.3. [RFC5322]

This is much more restrictive than the full range of e-mail addresses
allowed by RFC 5322 et al.  I've been considering whether to use
input type=email in MediaWiki, and whether to change our server-side
e-mail address validation to match.  Historically, MediaWiki has
mostly just required that an @ symbol be present in the address.
Originally we used a simplistic regex, but when users complained, we
looked into the RFCs and decided it was too complicated to bother with
validation beyond checking for an @ sign.

So before switching us over, I decided to do some research on how many
users' addresses would be invalidated.  I used the database for the
English Wikipedia.  Over all registered users, I found 3,088,880
confirmed addresses, not necessarily all distinct.  (Confirmed here
means that in theory, modulo bugs, the user followed a confirmation
link in the e-mail they received, so the address probably works in
practice.)  Of those, 3,255 (~0.1%) failed HTML 5 validation, as
determined using the following regex-based database query:

r...@rosemary:enwiki SELECT COUNT(*) FROM user WHERE
user_email_authenticated IS NOT NULL AND user_email NOT REGEXP
'^[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~]+(\.[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~]+)*...@[-a-za-z0-9!#$%\'*+/=?^_`{|}~]+(\.[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~]+)*$'
AND user_email != '';
+----------+
| COUNT(*) |
+----------+
|     3255 |
+----------+
1 row in set (16 min 10.80 sec)

(Someone please tell me if my regex doesn't match HTML 5 here.)
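
For comparison, here is a minimal Python sketch of the same check, built from the atext characters listed in RFC 5322 section 3.2.3. The names ATEXT, DOT_ATOM_TEXT and is_valid_email are illustrative, not from the spec. Note that RFC 5322's atext includes "&", which does not appear in the character class of the query as archived above; whether that is an omission or archive damage is not clear from the text.

```python
import re

# atext per RFC 5322 section 3.2.3: ASCII letters, digits, and
# ! # $ % & ' * + - / = ? ^ _ ` { | } ~
# (the hyphen is placed last so it is a literal inside the class)
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"

# dot-atom-text = 1*atext *("." 1*atext)
DOT_ATOM_TEXT = rf"{ATEXT}+(?:\.{ATEXT}+)*"

# The draft definition quoted above: dot-atom-text "@" dot-atom-text
VALID_EMAIL = re.compile(rf"{DOT_ATOM_TEXT}@{DOT_ATOM_TEXT}")


def is_valid_email(address: str) -> bool:
    """Return True if the whole string matches dot-atom-text@dot-atom-text."""
    return VALID_EMAIL.fullmatch(address) is not None


if __name__ == "__main__":
    for candidate in ["user+tag@example.com", "user@example.com ", "a..b@example.com"]:
        print(repr(candidate), is_valid_email(candidate))
```

fullmatch is used instead of ^...$ so that a trailing newline cannot slip through.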

Inspection showed that the overwhelming majority of the failures were
due to the presence of excess whitespace, often a single trailing
space, or a space inserted before or after the @ sign.  When I
adjusted the regex to ignore those failures, I got a smaller list, 202
(about 0.007% of the total):

r...@rosemary:enwiki SELECT CONCAT('', user_email, ''),
user_email_authenticated FROM user WHERE user_email_authenticated IS
NOT NULL AND user_email NOT REGEXP '^[
\t\n]*[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~
\t\n]+(\.[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~
\n\t]+)*...@[-a-za-z0-9!#$%\'*+/=?^_`{|}~
\n\t]+(\.[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~ \n\t]+)*[ \n\t]*$' AND
user_email NOT REGEXP
'^[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~]+(\.[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~]+)*...@[-a-za-z0-9!#$%\'*+/=?^_`{|}~]+(\.[-a-zA-Z0-9!#$%\'*+/=?^_`{|}~]+)*$'
AND user_email != '' LIMIT 500;
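+----------------------------+--------------------------+
| CONCAT('', user_email, '') | user_email_authenticated |
+----------------------------+--------------------------+
...snip...
+----------------------------+--------------------------+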
202 rows in set (13 min 1.71 sec)

Some of these were clearly wrong, and shouldn't have been confirmed to
begin with.  Some didn't even have an @ sign, so they were probably
submitted in some window when we did no validation at all (and I have
no idea how they got confirmed).  Of the ones that possibly work, I
identified two major categories:

1) Addresses in the form foo <b...@baz.example>, or similar.  These
mostly match RFC 5322's name-addr production instead of addr-spec
(some have trailing semicolons, or are missing the initial <, etc.).
I assume these were copy-pasted from a mail application.

2) Addresses with dots in incorrect places, in either the local part
or the domain name part.  For instance, multiple consecutive dots, or
leading/trailing dots.  These don't match RFC 5322 at all AFAICT, but
I asked one of the users with an invalid address of the form
f...@example.com, and he said it worked fine for him.  GNU mail gave
a syntax error when I tried to send mail to that address, but Gmail
sent it without complaint, and the user received it successfully.
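
The manual sorting described above could be roughed out in Python along the following lines. This is only a sketch under assumptions: the category names and the heuristics (angle brackets or a trailing semicolon for the name-addr-like case, "valid characters plus dots in any position" for the dot case) are hypothetical and were not part of the original analysis, which was done by inspection.

```python
import re

# Strict check: dot-atom-text "@" dot-atom-text (atext per RFC 5322 3.2.3).
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
DOT_ATOM = rf"{ATEXT}+(?:\.{ATEXT}+)*"
STRICT = re.compile(rf"{DOT_ATOM}@{DOT_ATOM}")

# Same characters, but dots allowed anywhere (leading, trailing, doubled).
LOOSE_DOTS = re.compile(rf"(?:{ATEXT}|\.)+@(?:{ATEXT}|\.)+")


def classify(address: str) -> str:
    """Assign an address to one rough bucket (hypothetical category names)."""
    if STRICT.fullmatch(address):
        return "valid"
    # Remove space, tab and newline wherever they occur, then retry.
    stripped = re.sub(r"[ \t\n]", "", address)
    if stripped != address and STRICT.fullmatch(stripped):
        return "whitespace"
    if "@" not in address:
        return "no-at-sign"
    # Heuristic for pasted name-addr forms such as: Name <user@host>;
    if "<" in address or ">" in address or address.rstrip().endswith(";"):
        return "name-addr-like"
    if LOOSE_DOTS.fullmatch(stripped):
        return "misplaced-dots"
    return "other"


if __name__ == "__main__":
    for sample in ["user @example.com", "Foo Bar <bar@baz.example>", "a..b@example.com"]:
        print(repr(sample), "->", classify(sample))
```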

There were other types of addresses that didn't meet HTML 5's
specification after whitespace was stripped, but none with more than a
single-digit number of addresses occurring in the sample of three
million or so that I looked at.  It's notable that not a single one
used the quoted-string or domain-literal productions, as far as I
could tell from manual inspection.

I should also note that this was only the English Wikipedia, and it
might be that speakers of other languages are more prone to use other
types of addresses that don't meet HTML 5's specification.  When
looking at the Swedish and German databases, for instance, I found one
or two addresses that had apparently been confirmed but contained
non-ASCII characters.  I didn't know the users with those addresses,
and I didn't want to send them unsolicited mail, so I wasn't able to
establish whether those addresses actually worked or the confirmation
was bogus.


Conclusions: At a minimum, I suggest that HTML 5 

Re: [whatwg] Comments on the definition of a valid e-mail address

2009-08-23 Thread David Gerard
2009/8/23 Aryeh Gregor simetrical+...@gmail.com:

   Or just don't ban anything
 at all, like with type=tel.  type=email differs from most of the other
 types with validity constraints (like month, number, etc.) in that the
 difference between valid and invalid values is a purely pragmatic
 question (what will actually work?) that the user can often answer
 better than the application.  It doesn't seem like a good idea for the
 standard to tell users that the e-mail addresses they've actually been
 using are invalid.


+1

The quoted portion above strikes to the heart of the matter. I suppose
the spec wants to obviate defective email validation JavaScript, but
any restriction will (a) break stuff the user thinks should work (b)
not stop bad web coders for a second.



- d.


Re: [whatwg] Comments on the definition of a valid e-mail address

2009-08-23 Thread Aryeh Gregor
On Sun, Aug 23, 2009 at 3:41 PM, Aryeh Gregor simetrical+...@gmail.com wrote:
 Alternatively, you could just loosen the restrictions even further,
 and only ban input that doesn't contain an @ sign.  (Or that doesn't
 match ^...@]+@[...@]+\.[^@]+$, or whatever.)  Or just don't ban anything
 at all, like with type=tel.  type=email differs from most of the other
 types with validity constraints (like month, number, etc.) in that the
 difference between valid and invalid values is a purely pragmatic
 question (what will actually work?) that the user can often answer
 better than the application.  It doesn't seem like a good idea for the
 standard to tell users that the e-mail addresses they've actually been
 using are invalid.

. . . and I should add that I think it might be useful to have a note
recommending that application authors not do any validation beyond
what the spec ends up mandating as required (preferably almost
nothing).  I've had a lot of problems with sites that think + isn't
valid in e-mail addresses, including pretty major sites that should
know better.  You don't really know if it will work anyway until you
try actually sending mail to it -- maybe the local part was mistyped
or invented -- so why not just do that?
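
To make the two looser checks from the quoted text concrete, here is a small Python sketch. The pattern in the quote is garbled in the archive, so the form used below, [^@]+@[^@]+\.[^@]+, is an assumed reconstruction, and the function names are illustrative only.

```python
import re

# Assumed reconstruction of the garbled pattern in the quoted text:
# something without "@", an "@", then something without "@" that
# contains at least one dot.
LOOSE = re.compile(r"[^@]+@[^@]+\.[^@]+")


def has_at_sign(address: str) -> bool:
    """The loosest check discussed: only require an @ somewhere."""
    return "@" in address


def matches_loose_pattern(address: str) -> bool:
    """Slightly stricter: exactly one @ and a dot somewhere after it."""
    # fullmatch anchors both ends, playing the role of ^ and $.
    return LOOSE.fullmatch(address) is not None


if __name__ == "__main__":
    for candidate in ["user+tag@example.com", "foo.@example.com", "a@b", "a@b@c.d"]:
        print(repr(candidate), has_at_sign(candidate), matches_loose_pattern(candidate))
```

Both checks accept addresses containing +, and the second also accepts a trailing dot in the local part, which are the cases complained about in this thread.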


Re: [whatwg] Comments on the definition of a valid e-mail address

2009-08-23 Thread Tab Atkins Jr.
Thanks for doing this work, Aryeh!  It's really awesome!

On Sun, Aug 23, 2009 at 2:41 PM, Aryeh Gregor simetrical+...@gmail.com wrote:
 Beyond that, although it's safe to say that quoted-string or
 domain-literal or even entirely invalid addresses are extraordinarily
 rare, there are *some* real people who do use them.  Unless something
 is so completely invalid that it's obviously impossible that any mail
 server would even try to send it anywhere, you're probably going to be
 cutting out some small number of users.

Unless you avoid validating *entirely*, there's virtually always going
to be some subset of theoretically valid addresses that you'll flag as
invalid, though.

 So why not have the spec say that in the case of e-mail addresses, the
 browser may warn the user, but should permit them to submit the
 address anyway?  If the user is willing to override the warning, then
 it's likely that they personally know that the e-mail address works,
 e.g., because they use it.

I'd be okay with this.

 Alternatively, you could just loosen the restrictions even further,
 and only ban input that doesn't contain an @ sign.  (Or that doesn't
 match ^...@]+@[...@]+\.[^@]+$, or whatever.)  Or just don't ban anything
 at all, like with type=tel.  type=email differs from most of the other
 types with validity constraints (like month, number, etc.) in that the
 difference between valid and invalid values is a purely pragmatic
 question (what will actually work?) that the user can often answer
 better than the application.  It doesn't seem like a good idea for the
 standard to tell users that the e-mail addresses they've actually been
 using are invalid.

Unlike type=tel, emails have a relatively simple format which *very
nearly everyone* uses.  I agree that if an email works but is one of
those crazy formats it's probably not a good idea to bar them from
using it, but in practice that's exactly what happens right now with
email validation scripts.  If type=email doesn't validate at all
people will still just continue to use their broken homebrew
validators both on client-side and server-side.

It's possible that a token validation step would be sufficient, but I
suspect not.  Probably just a slight loosening of the allowed format,
informed by actual data such as what you gathered, would work fine,
possibly augmented by your suggestion of making type=email flag
'invalid' addresses but not actually prevent them from being
submitted.

Would you mind sharing these 200 or so that don't validate?  Obviously
there are privacy concerns, but I think it would be sufficient to just
replace every alpha character with 'x' and every numeric with '0', or
some similar information-removing transformation.  None of them fail
validation because of the letters or numbers used, so that would still
give us the information we need without revealing stuff we don't.
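
A minimal Python sketch of the masking described above, assuming that only ASCII letters and digits are replaced so that punctuation, whitespace and any non-ASCII characters stay visible (that last choice is an assumption, not something stated above; mask_address is a hypothetical name):

```python
import string


def mask_address(address: str) -> str:
    """Replace ASCII letters with 'x' and ASCII digits with '0'.

    Everything else (dots, @, whitespace, angle brackets, non-ASCII
    characters) is left as-is, since those are what make an address
    fail validation.
    """
    masked = []
    for ch in address:
        if ch in string.ascii_letters:
            masked.append("x")
        elif ch in string.digits:
            masked.append("0")
        else:
            masked.append(ch)
    return "".join(masked)


if __name__ == "__main__":
    print(mask_address("Foo Bar <bar42@baz.example>;"))
```

Lengths are preserved by this transformation, which may or may not be acceptable from a privacy standpoint.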

~TJ


Re: [whatwg] Comments on the definition of a valid e-mail address

2009-08-23 Thread Aryeh Gregor
On Sun, Aug 23, 2009 at 4:00 PM, Tab Atkins Jr. jackalm...@gmail.com wrote:
 Unless you avoid validating *entirely*, there's virtually always going
 to be some subset of theoretically valid addresses that you'll flag as
 invalid, though.

There shouldn't be, IMO, if the browser is forbidden to submit them.

 Unlike type=tel, emails have a relatively simple format which *very
 nearly everyone* uses.  I agree that if an email works but is one of
 those crazy formats it's probably not a good idea to bar them from
 using it, but in practice that's exactly what happens right now with
 email validation scripts.  If type=email doesn't validate at all
 people will still just continue to use their broken homebrew
 validators both on client-side and server-side.

They'll probably do that anyway.  HTML 5 doesn't have to mandate it.

 Would you mind sharing these 200 or so that don't validate?  Obviously
 there are privacy concerns, but I think it would be sufficient to just
 replace every alpha character with 'x' and every numeric with '0', or
 some similar information-removing transformation.  None of them fail
 validation because of the letters or numbers used, so that would still
 give us the information we need without revealing stuff we don't.

I doubt it would be useful.  I summarized all the interesting points,
and remember that these are only 0.007% of the total.  Also, note that
it was 3255 of them that didn't validate.  It was 202 that didn't
validate even after the regex was adjusted to allow whitespace
everywhere (should be equivalent to stripping 0x9, 0xA, 0x20 from
email input).
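
As a rough illustration, stripping those three characters and rechecking the quoted proportions could look like this in Python. The names are hypothetical, and the sketch does not attempt to prove that stripping is exactly equivalent to the adjusted regex.

```python
# Delete tab (0x9), newline (0xA) and space (0x20) wherever they occur.
STRIP_TABLE = str.maketrans("", "", "\t\n ")


def strip_whitespace(address: str) -> str:
    """Remove 0x9, 0xA and 0x20 from anywhere in the input."""
    return address.translate(STRIP_TABLE)


def percent(part: int, total: int) -> float:
    """Share of the total, as a percentage."""
    return 100.0 * part / total


# Figures quoted in this thread.
TOTAL_CONFIRMED = 3_088_880
FAILED_STRICT = 3_255
FAILED_AFTER_WHITESPACE = 202

if __name__ == "__main__":
    print(repr(strip_whitespace(" user @example.com\n")))
    print(f"{percent(FAILED_STRICT, TOTAL_CONFIRMED):.3f}%")
    print(f"{percent(FAILED_AFTER_WHITESPACE, TOTAL_CONFIRMED):.4f}%")
```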