Re: Patch: Unicode email support (RFC 6531, 6532, 6533)
On Friday, June 6, 2014 1:55:37 AM CEST, Wietse Venema wrote: Postfix is meant to be used by human operators anywhere on the Internet. Therefore, the postqueue/postmap/etc. tools will have to accept non-ASCII domain names from a human operator in either UTF-8 form or xn--mumble form, and they will have to convert those forms into their stored form. Those tools will also have to render non-ASCII domain names in their stored form, or convert them into UTF-8 or xn--mumble form on request by the human operator. Makes sense; patch coming. Will take a few days. Arnt
Re: Patch: Unicode email support (RFC 6531, 6532, 6533)
On Wednesday, June 4, 2014 11:16:51 PM CEST, Wietse Venema wrote: * Postfix table queries are case-insensitive. I don't see any attempt to implement that for UTF8 addresses. This leaves an ambiguity. I looked at this now. As I read the code, tables mostly map to lower case and then do a binary comparison. The mysql and pgsql tables may additionally use the database server's ilike operation. Finally, lowercase() maps U to u, but leaves 0xC0 as 0xC0, even if the Postfix server runs in a locale where the lowercase form of that is 0xE0. Is that correct? I can provide a supplementary patch that provides case insensitivity for unicode. It's easy, but there are several ways to do it, and I don't know which you prefer. 1. Toupper/tolower in Postfix, with the usual table. This adds the bulk of a table and is language-independent but imperfect. The well-known problem is i/ı. (The lowercase(I) equivalent is ı in Turkish and a handful of other locales.) 2a. Toupper/tolower that call out to ICU if EAI is enabled and there's any non-ASCII in the argument. This slows down toupper()/tolower() but Postfix escapes having the table and ICU devotes considerable effort to correctness. It's easy to compose the string, too (composition means to use å instead of a+ring above). 2b. Ditto, but calling a language-sensitive function in ICU, so that i is equal to İ if the Postfix server runs in one of those locales. I'm unhappy about this alternative: a Swiss service provider may well service both Kazakh and Korean users and how should the service provider's Postfix be configured? 3. Switching to titlecase. A bigger change. Titlecase is a form in which case differences are erased and in principle it's neither equal to uppercase nor to lowercase. It's only usable for implementing case-insensitive comparison/lookup using fast binary comparison. In my opinion the change to titlecase isn't worth it. There aren't enough problems with lowercase() to justify such a sweeping change. 
Also keeping lower case allows compiled tables to survive upgrades/downgrades. I'm neutral regarding 1 and 2a. If you'll tell me what you prefer I'll write a patch and test that it matches another implementation. Arnt
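To illustrate the difference between ASCII-only lowercasing and a language-independent Unicode case mapping as discussed above, here is a minimal Python sketch. It is not Postfix code: ascii_lowercase() is a hypothetical stand-in that mimics the behaviour described for lowercase(), and Python's built-in str.lower() is assumed here as a rough equivalent of a locale-independent mapping such as ICU would provide.

```python
# Hypothetical stand-in for the ASCII-only behaviour described for lowercase():
# only A-Z are mapped, every other character is passed through unchanged.
def ascii_lowercase(text: str) -> str:
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in text
    )


# Language-independent Unicode mapping (roughly alternatives 1 and 2a).
def unicode_lowercase(text: str) -> str:
    return text.lower()


name = "J\u00d8RAN"  # "JØRAN"
print(ascii_lowercase(name))    # the Ø is left alone
print(unicode_lowercase(name))  # the Ø is lowercased too

# The i/ı problem: without locale tailoring "I" lowercases to "i",
# while a Turkish reader expects the dotless "ı" (U+0131).
print(unicode_lowercase("I") == "i")
print(unicode_lowercase("I") == "\u0131")
```

With the ASCII-only function, two spellings of a non-ASCII localpart that differ only in case remain distinct lookup keys; with the Unicode mapping they collapse to one key, at the cost of the i/ı mismatch for some locales.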
Re: Patch: Unicode email support (RFC 6531, 6532, 6533)
On Thursday, June 5, 2014 10:36:49 AM CEST, Arnt Gulbrandsen wrote: In my opinion the change to titlecase isn't worth it. There aren't enough problems with lowercase() to justify such a sweeping change. Also keeping lower case allows compiled tables to survive upgrades/downgrades. Worse: There are likely user-supplied tables that depend on lowercase input. Both the mysql and pgsql tables make it easy to configure case-sensitive queries, which switching to titlecase would break. I think titlecase is definitely out. (Btw, I wrote tolower() instead of lowercase() once or twice in the previous message. Sorry.) Arnt
Re: Patch: Unicode email support (RFC 6531, 6532, 6533)
Arnt Gulbrandsen: On Wednesday, June 4, 2014 11:16:51 PM CEST, Wietse Venema wrote: * Postfix table queries are case-insensitive. I don't see any attempt to implement that for UTF8 addresses. This leaves an ambiguity. I looked at this now. As I read the code, tables mostly map to lower case and then do a binary comparison. The mysql and pgsql tables may additionally use the database server's ilike operation. Finally, lowercase() maps U to u, but leaves 0xC0 as 0xC0, even if the Postfix server runs in a locale where the lowercase form of that is 0xE0. Is that correct? That question is not applicable. Postfix locale is C, and lowercase() only translates ASCII characters. I can provide a supplementary patch that provides case insensitivity for unicode. It's easy, but there are several ways to do it, and I don't know which you prefer. 1. Toupper/tolower in Postfix, with the usual table. This adds the bulk of a table and is language-independent but imperfect. The well-known problem is i/ı. (The lowercase(I) equivalent is ı in Turkish and a handful of other locales.) 2a. Toupper/tolower that call out to ICU if EAI is enabled and there's any non-ASCII in the argument. This slows down toupper()/tolower() but Postfix escapes having the table and ICU devotes considerable effort to correctness. It's easy to compose the string, too (composition means to use å instead of a+ring above). 2b. Ditto, but calling a language-sensitive function in ICU, so that i is equal to İ if the Postfix server runs in one of those locales. I'm unhappy about this alternative: a Swiss service provider may well service both Kazakh and Korean users and how should the service provider's Postfix be configured? 3. Switching to titlecase. A bigger change. Titlecase is a form in which case differences are erased and in principle it's neither equal to uppercase nor to lowercase. It's only usable for implementing case-insensitive comparison/lookup using fast binary comparison. 
In my opinion the change to titlecase isn't worth it. There aren't enough problems with lowercase() to justify such a sweeping change. Also keeping lower case allows compiled tables to survive upgrades/downgrades. I'm neutral regarding 1 and 2a. If you'll tell me what you prefer I'll write a patch and test that it matches another implementation. This will require further research. If case canonicalization is as complex as you describe then the correct result is likely to differ from what real people expect. That is a security hole. Wietse
Re: Patch: Unicode email support (RFC 6531, 6532, 6533)
This will require further research. If case canonicalization is as complex as you describe then the correct result is likely to differ from what real people expect. That is a security hole. That was the case in the nineties, but by now the case folding algorithms in unicode have won. They've been used so much that people have come to expect that they're right. There are problems, but lowercase() escapes all but ı/i. But ı is nasty. I have even found two domains that differ only in ı/i, so Postfix cannot treat them as equal. Composition (the other part of canonicalization) is a worse matter. You're right, that might lead to security problems. It can lead to table lookup misses, and I'm sure that table misses can lead to several kinds of security problems. For example forgetting mandatory TLS. The safest alternative is to fully compose table lookup keys. (Or fully decompose, but fully compose is usually faster.) I'll provide a patch to do the 2a alternative. It'll take a few days. Arnt
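The table-miss risk from uncomposed keys mentioned above can be sketched in a few lines of Python. This is an illustration only, not part of the patch: the table dict and the policy value "tls-required" are hypothetical, and unicodedata.normalize() from the standard library stands in for the composition step that ICU would perform.

```python
import unicodedata

composed = "bl\u00e5b\u00e6r"     # å as one code point (U+00E5)
decomposed = "bla\u030ab\u00e6r"  # a followed by COMBINING RING ABOVE (U+030A)

# Hypothetical policy table, keyed by the fully composed (NFC) form.
table = {composed: "tls-required"}


def lookup_raw(key: str):
    # Binary comparison on whatever bytes arrived.
    return table.get(key)


def lookup_composed(key: str):
    # Fully compose the key before the lookup.
    return table.get(unicodedata.normalize("NFC", key))


print(lookup_raw(decomposed))       # None: a lookup miss, the policy is silently skipped
print(lookup_composed(decomposed))  # tls-required
```

Both strings render identically, so an operator inspecting the table would not see why the raw lookup fails.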
Re: Patch: Unicode email support (RFC 6531, 6532, 6533)
On Thu, Jun 05, 2014 at 02:24:38PM +0200, Arnt Gulbrandsen wrote: But ı is nasty. I have even found two domains that differ only in ı/i, so Postfix cannot treat them as equal. Domains passed to lookup tables and match lists need to be in a-label form. The remaining surprises with domains and case-insensitive comparisons vs. unicode will be with header/body checks, likely OK. -- Viktor.
Re: Patch: Unicode email support (RFC 6531, 6532, 6533)
On Thursday, June 5, 2014 4:32:52 PM CEST, Viktor Dukhovni wrote: Domains passed to lookup tables and match lists need to be in a-label form. That would make pcre almost impossible and mysql and pgsql lookups rather inconvenient. The a-label form of blåbærsyltetøy is xn--blbrsyltety-y8ao3x. Matching the PCRE /.*syltetøy.*/ in a-label form would be inconvenient, perhaps impossible. Postgres and Mysql have builtin support for UTF8 strings so mysql/pgsql tables can use e.g. the ilike operator, but they do not support strings composed from a-labels. Here's a pgsql concoction to match usernames, optionally with subaddresses: select id from addresses where localpart='%u' or localpart ilike '%u-%' I cannot imagine any way to implement that if %u is in a-label form. Arnt
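The regular-expression problem described above can be reproduced with a short Python sketch. It assumes Python's built-in "idna" codec (IDNA 2003) as a stand-in for the ICU conversion used in the patch, and the standard re module as a stand-in for PCRE; neither is what Postfix itself uses.

```python
import re

name = "bl\u00e5b\u00e6rsyltet\u00f8y"         # blåbærsyltetøy
a_label = name.encode("idna").decode("ascii")  # ASCII-only xn-- form

pattern = re.compile("syltet\u00f8y")          # the substring from the example

print(a_label)                         # prints the xn-- form of the name
print(bool(pattern.search(name)))      # True: matches the UTF-8 form
print(bool(pattern.search(a_label)))   # False: the a-label contains no ø at all
```

Punycode moves the non-ASCII characters out of their positions and into an encoded suffix, so a substring of the readable name generally does not survive as a substring of the a-label.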
Re: Patch: Unicode email support (RFC 6531, 6532, 6533)
On Thu, Jun 05, 2014 at 05:18:48PM +0200, Arnt Gulbrandsen wrote: On Thursday, June 5, 2014 4:32:52 PM CEST, Viktor Dukhovni wrote: Domains passed to lookup tables and match lists need to be in a-label form. That would make pcre almost impossible and mysql and pgsql lookups rather inconvenient. What's the problem with the canonical representation of the domain exactly as it appears on the wire in DNS, in certificate DNS altnames, ... The a-label form of blåbærsyltetøy is xn--blbrsyltety-y8ao3x. Matching the PCRE /.*syltetøy.*/ in a-label form would be inconvenient, perhaps impossible. Regular expressions on partial DNS labels are not that useful anyway. Generally one just wants all the sub-domains of a particular domain. Sometimes one wants to filter cable-modem/DSL PTR records, otherwise I'm not losing sleep over partial DNS label regexps. Postgres and Mysql have builtin support for UTF8 strings so mysql/pgsql tables can use e.g. the ilike operator, but they do not support strings composed from a-labels. Here's a pgsql concoction to match usernames, optionally with subaddresses: Nothing lost when the domain name is in a-label form. The localpart remains unicode, and one still needs some sort of UTF-8 -> utf-8 lower-case operator that operates correctly on ASCII. Frankly applying lowercase() to just the ASCII octets works fine in this situation, provided the domain is in a-label form already. Unicode email address localparts would be case-sensitive in their non-ASCII octets, not the end of the world. -- Viktor.
Re: Patch: Unicode email support (RFC 6531, 6532, 6533)
Viktor Dukhovni: Not too many people in Russia read Hebrew (right to left) or can even cut and paste it reliably into a left to right context. Postfix is meant to be used by human operators anywhere on the Internet. Therefore, the postqueue/postmap/etc. tools will have to accept non-ASCII domain names from a human operator in either UTF-8 form or xn--mumble form, and they will have to convert those forms into their stored form. Those tools will also have to render non-ASCII domain names in their stored form, or convert them into UTF-8 or xn--mumble form on request by the human operator. This way, human operators can manage domain names that are in the operator's native script, but they can fall back to ASCII when the domain is in some alien script. So it does not matter what the stored form is (and in the case of the mail queue, the stored form is controlled by the sender anyway). What matters is that Postfix management tools allow humans to use the stored form effectively. In other words, the tools must allow the human operator to choose how to enter a non-ASCII domain name, and how to render it. See also my longer, previous, post in this thread. Wietse
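The requirement above (accept either form on input, render either form on request) could look roughly like the following Python sketch. The function names are hypothetical and Python's built-in "idna" codec (IDNA 2003) is assumed in place of whatever library the Postfix tools would actually use.

```python
def to_a_label(domain: str) -> str:
    """Return the xn-- (ASCII) form, whichever form the operator typed."""
    return domain.encode("idna").decode("ascii")


def to_utf8(domain: str) -> str:
    """Return the UTF-8 form, whichever form the operator typed."""
    return domain.encode("idna").decode("idna")


# The operator may type either form; both normalise to the same stored form.
typed_utf8 = "b\u00fccher.com"       # bücher.com
typed_ascii = "xn--bcher-kva.com"

print(to_a_label(typed_utf8), to_a_label(typed_ascii))
print(to_utf8(typed_utf8), to_utf8(typed_ascii))
```

Because both helpers are idempotent on their own output, it does not matter which of the two forms is chosen as the stored form; a tool only needs to apply one helper on input and the other when the operator asks for the alternative rendering.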
Re: Patch: Unicode email support (RFC 6531, 6532, 6533)
On Wednesday, June 4, 2014 3:23:24 PM CEST, Wietse Venema wrote: Arnt Gulbrandsen: On Wednesday, June 4, 2014 12:55:18 PM CEST, Wietse Venema wrote: ... Yes. We must maintain compatibility with existing practice. Postfix has always passed 8-bit headers and envelopes (localparts) for the past 15 years. It would be an unacceptable compatibility break if, for example, a corporate perimeter MTA were to start bouncing inbound mail just because 1) some up-stream client is changed to flag that email as SMTPUTF8, but 2) some down-stream internal server doesn't announce SMTPUTF8. I think you're right. The two code blocks that return 5.6.7 should perhaps be included later, but definitely not included now. Thus, the SMTP client, cleanup daemon, and other daemon programs MUST NOT engage into any EAI-related stuff unless a message is flagged as EAI-enabled. I will add a guard around that code. The smtputf8 flag in the queue file acts as such a guard. No it doesn't. OK: It's meant to act as such a guard. Example: ORCPT handling in the cleanup Milter client and in the SMTP client is unconditional on the smtputf8 flag. However, given that UTF8 addresses use a special encoding, I suspect that it is better to decode them properly (the alternative would be to not decode them at all and just pass them on, but that requires some extra code to handle existing queue files that contain decoded attributes). You'll see some other code like that in the DSN generation, when it chooses quoting format. I didn't find an alternative I really liked. It's not clear to me that UTF8 addresses always use that special encoding. They probably should, but I found 6533 rather confusing. The niceties of UTF8 addresses in SMTPUTF8 messages vs. UTF8 addresses in other settings aren't as simple as I wish they were. The ORCPT code in Milter/SMTP expects that all 8-bit addresses are SMTPUTF8 addresses that have somehow escaped into ASCIIland, so they should be encoded as RFC6533 says in ORCPT. 
That's based on my reading of RFC6533. I don't entirely like it, but I don't see any real alternative either. If you see localpart jøran and don't know whether it's just-send-8 or escaped EAI, should you follow EAI's quoting rules or extrapolate from RFC1984? And what should you do if you receive an ORCPT using EAI-style quoting even though the MAIL FROM did not declare SMTPUTF8? Should that ORCPT be reencoded using 1984 encoding or keep its EAI encoding? Icky. Have you given any thought of what happens when a company installs Postfix-EAI on the perimeter, and WANTS TO FORWARD THE MAIL TO THEIR INTERNAL SYSTEMS that may or may not have EAI support? Yes. ... Outgoing mail from that company to unicode addresses may begin to work, depending on whether the internal origin server supports EAI. Incorrect. This does not require any EAI support in the SMTP client. The SMTP client simply hands the mail to the gateway without any transformation of the recipient domain. If the best MX for the unicode recipient obeys RFC6531 section 3.4, then the SMTP client on the gateway has to use the SMTPUTF8 MAIL FROM parameter, i.e. support EAI. By extension the origin server has to do the same. Incoming mail to that company from unicode addresses still doesn't work. This has worked for 15 years, at least with UTF8 localparts. Sorry about the sloppy writing. I meant unicode domains. You're right, it has worked with 8-bit localparts in ASCII domains. We must maintain compatibility with existing practice. It would be an unacceptable compatibility break if Postfix were to suddenly start rejecting such mail. OK. Is there a possibility that the same domain name may exist as a UTF8 string in some contexts and as xn--mumble elsewhere? If this is a problem then it will affect many database lookups. As far as I can tell the xn--mumble is never used outside the DNS lookups, neither in the RFCs nor in practice. 
The EAI RFCs say to use the xn-- form for MX lookups, to use an ASCII domain name for the EHLO argument, and otherwise don't discuss xn--. In particular they don't say that the email address foo@xn--bar is equivalent to foo@bär. They also don't say it's different. I chose to make them essentially different. If a site admin chooses to add xn--bar to mydestination, that user has to configure the rest so it works. I chose that mostly because I think xn-- is a phisher's dream. People won't recognize their own domains. But the choice also makes life simpler for table/database lookups. How do UTF8 domain names interact with DNS RHSBL lists? Do they expect the UTF8 form or the xn--mumble form? Unknown as yet. I expect it'll have to be xn--mumble, but that's really just my guesswork. As far as I could tell none of the RHSBL operators have considered that matter yet. How do UTF8 domain names interact with reject_unknown_sender_domain, reject_unknown_recipient_domain, etc.? It looks like you are passing the
Re: Patch: Unicode email support (RFC 6531, 6532, 6533)
On Wednesday, June 4, 2014 6:58:43 PM CEST, Viktor Dukhovni wrote: My impression is that UTF-8 domain names are an MUA display format issue. There was tremendously tedious discussion of the approach you suggest, and of many others. There was even a set of experimental RFCs issued. In the end the experimental RFCs were discarded. Here's how RFC 6532 sums up the final message format changes: The preceding changes mean that the following constructs now allow UTF-8: 1. Unstructured text, used in header fields like Subject: or Content-description:. 2. Any construct that uses atoms, including but not limited to the local parts of addresses and Message-IDs. This includes addresses in the for clauses of Received: header fields. 3. Quoted strings. 4. Domains. 6531 references 6532, and the MAIL FROM/RCPT TO syntax allows UTF8. 6855 makes corresponding changes to IMAP (no more mUTF7, hurray), 6856 to POP, etc. Arnt
Re: Patch: Unicode email support (RFC 6531, 6532, 6533)
On Wednesday, June 4, 2014 8:38:49 PM CEST, Wietse Venema wrote: I'll read the RFCs carefully and see where it allows UTF8 in SMTP command parameters and replies. You'll do that, but I'll tell you anyway: The client may use it once the server has issued an EHLO response containing SMTPUTF8, and the server may use it once the client has issued a MAIL FROM, VRFY or EXPN command with the SMTPUTF8 parameter. Postfix (both with and without my patch) violates that. If a client tells Postfix: MAIL FROM:æ@æ.æ then Postfix may conceivably answer that æ@æ.æ is not a legal sender address, since æ.æ isn't a valid domain. 6531 says that that response should be ASCII-only, since the client hasn't given permission to use UTF8 in responses. My viewpoint is that no matter what RFC6531 says, the client must accept hearing its own arguments in the SMTP reply. Postfix is right and 6531 is wrongish, so I followed Postfix' reply style rather than comply with 6531. However even without reading those RFCs it is clear that UTF8 cannot be used in 220 server greetings or in EHLO commands or replies, because at that time the server/client have not agreed to use UTF8. Right. Thus, myhostname (or equivalent) must be ASCII, as it always must have been. There is no need to use valid_mail_domain() in reject_non_fqdn_hostname etc. Right. I made some mistakes. I wish I were perfect, but know I am not. Arnt
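The client-side half of the negotiation rule stated above (UTF8 only after the server's EHLO response contains SMTPUTF8) can be sketched as follows. This is a simplified illustration, not the patch: both function names are hypothetical, and treating "the sender address is non-ASCII" as the trigger is an assumption made for brevity, whereas the thread's position is that the message's own smtputf8 flag should decide.

```python
def server_offers_smtputf8(ehlo_reply_lines):
    """Scan EHLO reply lines such as '250-8BITMIME' or '250 SMTPUTF8'."""
    # The first line carries the server's greeting name, not a keyword.
    for line in ehlo_reply_lines[1:]:
        words = line[4:].split()
        if words and words[0].upper() == "SMTPUTF8":
            return True
    return False


def mail_from_command(sender, ehlo_reply_lines):
    """Build the MAIL FROM command, adding SMTPUTF8 only when permitted."""
    needs_utf8 = not sender.isascii()
    if needs_utf8 and not server_offers_smtputf8(ehlo_reply_lines):
        # Without the server's consent the address must not be sent at all.
        raise ValueError("server did not offer SMTPUTF8")
    suffix = " SMTPUTF8" if needs_utf8 else ""
    return f"MAIL FROM:<{sender}>{suffix}"


reply = ["250-mx.example.com", "250-8BITMIME", "250 SMTPUTF8"]
print(mail_from_command("j\u00f8ran@example.com", reply))
print(mail_from_command("postmaster@example.com", reply))
```

The server-side half is symmetric: a server that has not yet seen a command carrying the SMTPUTF8 parameter is, per the reading of RFC 6531 given above, expected to keep its replies ASCII-only.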
Re: Patch: Unicode email support (RFC 6531, 6532, 6533)
Am 04.06.2014 19:48, schrieb Arnt Gulbrandsen: Compliant SMTP servers only accept mail to/from EAI addresses if the SMTP client uses the SMTPUTF8 form of the MAIL FROM command. The SMTP client, in turn, only uses that form if the origin too used it. The purpose of this feature is to guarantee that EAI messages don't land in the mailboxes of incompatible recipients. The relevant effect of this feature is that in order to send mail to a unicode address, the _sender_ must declare that the message uses EAI. Having 8-bit clean relays on the way is not enough. Thus an EAI domain name may show up as xn--mumble in HELO commands. Yes. I think it's a bad idea to do that. The chance that some SMTP server's gethostbyname() will return the UTF8 form and the SMTP server then complain about EHLO/PTR mismatch is too great. But it can happen. There will be more. I'll just document them and fix them, so I don't have to spend a lot of time reviewing another version. I'm late to the game, haven't checked the relevant RFCs or Arnt's patch, but a few thoughts on this -- perhaps you can answer "all dealt with" -- but here we go: * It reminds me a bit of the 8BITMIME feature that was in discussion in the late 1990's/early 2000's. I think The World™ never reached consensus on how to deal with all that depending on how radical a certain software implemented its policies. Meaning: do we need this? Is Microsoft going to implement it? IBM's Lotus Domino/Notes suites on the client end? * My bigger concern is that UNICODE opens up ambiguities at various levels, for instance when doing table lookups (especially for policies, such as access control): + IDN punycode (xn--blech-rassel), as mentioned above. + Unicode normalization forms, are these handled consistently? http://www.unicode.org/reports/tr15/ I searched the patch for the word fragment normal, no hits. I find that worrisome. 
+ Characters that are different but use similar-looking glyphs, (homoglyphs), for instance, between Greek/Cyrillic/Latin scripts. Latin A, Cyrillic A, Greek A are three code points for an indistinguishable character. A А Α - in what order are these? Hint: 000: 4120 d090 20ce 910a A .. ... or U+0041 U+0020 U+0410 U+0020 U+0391 Is there a consistent policy for treating them that does not open up loop- and ratholes and pitfalls and barndoors and all other sorts of unfortunate openings for unaware/malicious parties? + How does the patch make Postfix deal with table lookups for tables that don't go through postmap and cannot be normalized? I don't want to create artificial adoption obstacles here, but I think there is some room for nasty surprises, and that space needs exploration and solutions. That's not just security discussion, but also reliability. (Perhaps Unicode requires - or I missed - homoglyph tables, and case mapping tables...) I think Wietse's expectation on how not to change established behaviour of release versions is clear, and I've always known I can rely on Postfix's compatibility. (Not to say that Postfix's compatibility is exemplary, as in good example, but I digress.)
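The homoglyph hint above can be checked with a short Python sketch using only the standard library. It also shows, as an observation rather than anything taken from the patch, that Unicode normalization does not merge the three letters, so normalization alone does not address this concern.

```python
import unicodedata

# Latin A, Cyrillic A, Greek Alpha, separated by spaces, newline-terminated.
sample = "A \u0410 \u0391\n"

# Same bytes as the hex dump quoted in the message.
print(sample.encode("utf-8").hex())  # 4120d09020ce910a

for ch in sample:
    if not ch.isspace():
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Normalization (even the compatibility form NFKC) keeps them distinct.
print(unicodedata.normalize("NFKC", sample) == sample)  # True
```

Treating such characters as equivalent would need a separate confusables mapping; whether a mail server should apply one at all is exactly the policy question raised above.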
Re: Patch: Unicode email support (RFC 6531, 6532, 6533)
Arnt Gulbrandsen: De/composition are pushed to the DNS. The SMTP part just says: Convert to IDNA a-labels in order to do the MX lookup, and otherwise don't mess with the bytes you received. (My patch uses ICU to convert to a-labels.) That is a mis-conception. DNS is not the only interface that requires xn--mumble names. Like a cancer, EAI has the potential to infect many aspects of address handling and policy lookup. This is why I estimated that SMTPUTF8 would be a major project. * The form xn--mumble will also be required in server greetings and EHLO commands, when an MTA host- or domain name contains non-ASCII characters. This means that Postfix must convert myhostname into xn--mumble form in those contexts that require ASCII text. * With multiple forms for the same domain name, xn--mumble in HELO/EHLO (and perhaps other SMTP commands) and UTF8 in MAIL/RCPT/ETRN/VRFY, Postfix lookup tables must either contain multiple lookup keys for the same domain name, or Postfix must convert all domain/email-address lookup keys into one canonical form. That is, either convert all UTF8 domain names into xn--mumble, or convert all xn--mumble domain names into UTF8. Having only one lookup key per domain in Postfix lookup tables will be more secure but it will be a royal pain to implement (and there is no way to do that with header/body_checks). * I am not sure that we can rely on the postmap table query or create map commands to normalize domain names in lookup keys. Also, LDAP/*SQL*/etc. databases aren't created with postmap commands. All this could be another argument to use only xn--mumble or to use only UTF8 forms in databases. Again, more secure but a royal pain to implement, because postmap doesn't really know if a lookup key is a user, a domain, or something else. 
* If xn--mumble were to become the canonical form for table lookup, then Postfix parent-domain matching will not be broken: where bücher.com becomes xn--bcher-kva.com, foo.bücher.com becomes foo.xn--bcher-kva.com. Other things: * Postfix table queries are case-insensitive. I don't see any attempt to implement that for UTF8 addresses. This leaves an ambiguity. Wietse
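The point about parent-domain matching can be illustrated with a small Python sketch. It is a simplification under stated assumptions: parent_domain_keys() is a hypothetical helper, Python's "idna" codec (IDNA 2003) replaces the ICU conversion, and the real Postfix lookup order (including the ".domain" key style) is not modelled.

```python
def parent_domain_keys(domain: str):
    """Return the a-label domain followed by each of its parent domains."""
    a_label = domain.encode("idna").decode("ascii")
    labels = a_label.split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


# Each label is converted on its own, so stripping labels from the left
# still yields the a-label form of the parent domain.
print(parent_domain_keys("foo.b\u00fccher.com"))
# ['foo.xn--bcher-kva.com', 'xn--bcher-kva.com', 'com']
```

This works because the conversion operates label by label: a subdomain's key always ends with its parent's key, which is the property parent-domain matching depends on.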