Re: Where to find DETAIL for spamassassin default RULES

Bill Cole Sun, 12 Jun 2016 12:15:06 -0700

On 11 Jun 2016, at 4:21, Groach wrote:

On 11/06/2016 05:09, Bill Cole wrote:
So, you thought validating email addresses was a problem demanding a solution? And you "solved" it with a regular expression?
Congratulations on now having 2 problems. They should be very happy together.
The regex I quoted was out of context to the problem and completely unrelated (sorry if you feel so confused with that).

I was not at all confused, but sometimes when people are Wrong On The Internet in special ways I cannot resist the urge to respond with a paraphrased geek meme...

Look up Jamie Zawinski's famous "2 problems" quote regarding regular expressions. It is a perfect fit for the application of regular expressions to address validation.

It is actually for another software project (a mail server)

Please don't take this as derogatory, because I DO NOT mean it to be, but can you explain why the world needs yet another new mail server implementation?

As an example of why I ask this, consider that Microsoft rewrote the SMTP implementation in Exchange 2013 and did it wrong, breaking multi-recipient message handling. I guess they had some reason, but the point is that new code means new bugs, even when you have an elaborate QA organization in place to prevent that.

that, being a mail server, must ensure email addresses are valid.

Not really. It needs to make sure that it never generates invalid addresses and it probably should check addresses in its inputs for types of invalidity that your later code will assume not to be present, but those are both far from a need to validate addresses perfectly (or even near-perfectly) to the RFC specification. Having a logical set of addresses that you'd never generate but will still blindly and harmlessly work with, some of which may not fit the RFC specs, is a NON-PROBLEM.

Even if you wanted to draw an RFC-perfect boundary between valid and invalid addresses, complex regular expressions are a poor tool for that because the logic of REs doesn't align with that of the ABNF used in RFCs. A single regular expression CANNOT precisely match the whole RFC822/2822/5322 address space. The closest approximation in Perl RE is huge, indecipherable, and machine-generated. It also cannot deal with nested comments, a valid albeit pathological address structure under the ABNF definition. In POSIX RE the problems are MUCH worse.

On the other hand, you COULD use very simple REs to serially and recursively decompose addresses into the constructs defined by the ABNF spec, using the same logic as the spec to validate addresses. This is not as interesting a "problem" as writing the One True RFC822 RE, but it is a fairly trivial coding exercise and would run more efficiently than a single RE with the benefit of being more readable and debuggable.
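
To make that concrete, here is a rough sketch in Python of what such a decomposition might look like. It is only an illustration under stated assumptions: it covers a simplified subset of the RFC 5322 addr-spec grammar (no obsolete forms, no folded whitespace, no check of what is inside a domain literal), and every name in it (parse_addr_spec, skip_cfws, skip_comment and the pattern constants) is made up for this example rather than taken from any existing library. Each RE is trivial on its own; the nesting of comments is handled by ordinary recursion instead of by the RE.

```python
import re

# Small, readable patterns, each covering one construct (simplified subset).
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+"
DOT_ATOM = re.compile(ATEXT + r"(?:\." + ATEXT + r")*")
QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')   # quoted-string
LITERAL = re.compile(r"\[[^\[\]\\]*\]")     # domain-literal, contents unchecked
WSP = re.compile(r"[ \t]+")
CTEXT = re.compile(r"[^()\\]+")             # comment text without parens
QPAIR = re.compile(r"\\.")                  # backslash-escaped character


def skip_comment(s, i):
    """s[i] is '('. Return the index just past the matching ')'."""
    i += 1
    while i < len(s):
        if s[i] == ")":
            return i + 1
        if s[i] == "(":
            i = skip_comment(s, i)          # nested comment: recurse
            continue
        m = CTEXT.match(s, i) or QPAIR.match(s, i)
        if not m:
            break
        i = m.end()
    raise ValueError("unterminated comment")


def skip_cfws(s, i):
    """Skip any run of whitespace and (possibly nested) comments."""
    while i < len(s):
        m = WSP.match(s, i)
        if m:
            i = m.end()
        elif s[i] == "(":
            i = skip_comment(s, i)
        else:
            break
    return i


def parse_addr_spec(s):
    """Split 'local@domain' into (local, domain) or raise ValueError."""
    i = skip_cfws(s, 0)
    m = QUOTED.match(s, i) or DOT_ATOM.match(s, i)
    if not m:
        raise ValueError("bad local-part")
    local = m.group()
    i = skip_cfws(s, m.end())
    if i >= len(s) or s[i] != "@":
        raise ValueError("expected '@'")
    i = skip_cfws(s, i + 1)
    m = LITERAL.match(s, i) or DOT_ATOM.match(s, i)
    if not m:
        raise ValueError("bad domain")
    domain = m.group()
    if skip_cfws(s, m.end()) != len(s):
        raise ValueError("unexpected trailing text")
    return local, domain
```

With this sketch, parse_addr_spec("user(a (nested) comment)@example.com") yields the local part and domain with the comment discarded, while something like "a..b@example.com" raises ValueError with a message that says which construct failed, which is the debuggability point.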

I quoted the regexp in context of showing my point about how 'squiggly' they can be and that I am able to read them.....to a point. (I was proud because 'googling' around for a regex email address validator string shows some VERY suspicious and extortionately, seemingly unnecessarily, long offerings. So I had a go myself).

And just like a hilariously long list of predecessors, came up with a RE which fails to precisely reproduce the ABNF definition of a valid address for message headers. This is why you now have 2 problems:

1. The one you invented of needing to precisely validate email addresses to an RFC specification that is not a perfect match for the addressing supported by any coherent package of production-grade mail software.

2. A regular expression that is absurdly complex which you incorrectly believe solves (1) while in fact it does not. It is maybe good enough, but maybe not. It's an untestable approximation of its design goal, which is an intrinsic problem for software.
