Re: Add UTF8 support in PostgreSQL lookup table interface
John Fawcett:
> On 25/08/18 23:59, Wietse Venema wrote:
> > Wietse:
> >>     /*
> >>      * Don't frustrate future attempts to make Postfix UTF-8 transparent.
> >>      */
> >>     if ((dict->flags & DICT_FLAG_UTF8_ACTIVE) == 0
> >>         && !valid_utf8_string(name, strlen(name))) {
> >>         if (msg_verbose)
> >>             msg_info("%s: %s: Skipping lookup of non-UTF-8 key '%s'",
> >>                      myname, dict_ldap->parser->name, name);
> >>         return (0);
> >>     }
> >>
> >> This code has been [in dict_ldap.c and dict_sqlite.c] for four
> >> years already. Never heard a peep.
> >
> > John Fawcett:
> >> is the above check done in the single dict modules redundant?
> >
> > No.
> >
> > - The above filter executes only if UTF8 mode is OFF.
> >
> > - The dict_utf8 filter that you refer to executes only if UTF8 mode
> >   is ON.
> >
> > 	Wietse
>
> ok, got it, I need to put smtputf8_enable = no to go through the code
> path above.
>
> The following trivially equivalent patch for the mysql client seems to
> be ok in my testing (i.e. it gives the same behaviour as now, except
> for non-valid UTF-8 characters in the lookup key):
>
> --- src/global/dict_mysql.c.orig	2018-08-26 10:14:29.085703480 +0200
> +++ src/global/dict_mysql.c.new	2018-08-26 14:58:30.695898300 +0200
> @@ -326,6 +326,18 @@
>      dict->error = 0;
>
>      /*
> +     * Don't frustrate future attempts to make Postfix UTF-8 transparent.
> +     */
> +    if ((dict->flags & DICT_FLAG_UTF8_ACTIVE) == 0
> +        && !valid_utf8_string(name, strlen(name))) {
> +        if (msg_verbose)
> +            msg_info("%s: %s: Skipping lookup of non-UTF-8 key '%s'",
> +                     myname, dict_mysql->parser->name, name);
> +        return (0);
> +    }
> +
> +
> +    /*
>       * Optionally fold the key.
>       */
>      if (dict->flags & DICT_FLAG_FOLD_FIX) {
>
> Maybe it would be better to put this code into dict_lookup() so it
> gets used for all lookup tables, though that is more invasive and
> requires testing across more table types.

In non-UTF8 mode, that would change Postfix behavior with memcache,
Berkeley DB, LMDB, and other map types that currently don't care about
encodings.

Also, dict_lookup() would be the wrong place. It would miss all the
lookups that call dict->get() directly.

> On the original issue about Postgres, as you have stated, it would
> make sense to take out the hard-coded LATIN1 encoding. The
> configuration could then be specified in configuration files
> (https://www.postgresql.org/docs/9.3/static/libpq-pgservice.html),
> similar to the way the client character set encoding can be configured
> for dict_mysql. Alternatively, the character set encoding could be
> read from a new variable.

In UTF8 mode, Postfix can only ask well-formed UTF8 queries. That is
the longer-term future; the vast majority of the web is already 90+%
UTF8 (including ASCII).

In non-UTF8 mode, there are no valid non-ASCII queries, so all we can
do is to limit the damage while not breaking existing sites
unnecessarily.

	Wietse
Re: Add UTF8 support in PostgreSQL lookup table interface
On 25/08/18 23:59, Wietse Venema wrote:
> Wietse:
> >     /*
> >      * Don't frustrate future attempts to make Postfix UTF-8 transparent.
> >      */
> >     if ((dict->flags & DICT_FLAG_UTF8_ACTIVE) == 0
> >         && !valid_utf8_string(name, strlen(name))) {
> >         if (msg_verbose)
> >             msg_info("%s: %s: Skipping lookup of non-UTF-8 key '%s'",
> >                      myname, dict_ldap->parser->name, name);
> >         return (0);
> >     }
> >
> > This code has been [in dict_ldap.c and dict_sqlite.c] for four
> > years already. Never heard a peep.
>
> John Fawcett:
> > is the above check done in the single dict modules redundant?
>
> No.
>
> - The above filter executes only if UTF8 mode is OFF.
>
> - The dict_utf8 filter that you refer to executes only if UTF8 mode
>   is ON.
>
> 	Wietse

ok, got it, I need to put smtputf8_enable = no to go through the code
path above.

The following trivially equivalent patch for the mysql client seems to
be ok in my testing (i.e. it gives the same behaviour as now, except
for non-valid UTF-8 characters in the lookup key):

--- src/global/dict_mysql.c.orig	2018-08-26 10:14:29.085703480 +0200
+++ src/global/dict_mysql.c.new	2018-08-26 14:58:30.695898300 +0200
@@ -326,6 +326,18 @@
     dict->error = 0;

     /*
+     * Don't frustrate future attempts to make Postfix UTF-8 transparent.
+     */
+    if ((dict->flags & DICT_FLAG_UTF8_ACTIVE) == 0
+        && !valid_utf8_string(name, strlen(name))) {
+        if (msg_verbose)
+            msg_info("%s: %s: Skipping lookup of non-UTF-8 key '%s'",
+                     myname, dict_mysql->parser->name, name);
+        return (0);
+    }
+
+
+    /*
      * Optionally fold the key.
      */
     if (dict->flags & DICT_FLAG_FOLD_FIX) {

Maybe it would be better to put this code into dict_lookup() so it
gets used for all lookup tables, though that is more invasive and
requires testing across more table types.

On the original issue about Postgres, as you have stated, it would
make sense to take out the hard-coded LATIN1 encoding. The
configuration could then be specified in configuration files
(https://www.postgresql.org/docs/9.3/static/libpq-pgservice.html),
similar to the way the client character set encoding can be configured
for dict_mysql. Alternatively, the character set encoding could be
read from a new variable.

	John
Re: Add UTF8 support in PostgreSQL lookup table interface
Wietse:
>     /*
>      * Don't frustrate future attempts to make Postfix UTF-8 transparent.
>      */
>     if ((dict->flags & DICT_FLAG_UTF8_ACTIVE) == 0
>         && !valid_utf8_string(name, strlen(name))) {
>         if (msg_verbose)
>             msg_info("%s: %s: Skipping lookup of non-UTF-8 key '%s'",
>                      myname, dict_ldap->parser->name, name);
>         return (0);
>     }
>
> This code has been [in dict_ldap.c and dict_sqlite.c] for four
> years already. Never heard a peep.

John Fawcett:
> is the above check done in the single dict modules redundant?

No.

- The above filter executes only if UTF8 mode is OFF.

- The dict_utf8 filter that you refer to executes only if UTF8 mode
  is ON.

	Wietse
Re: Add UTF8 support in PostgreSQL lookup table interface
On 24/08/18 22:54, Wietse Venema wrote:
> Viktor Dukhovni:
> > Yes, but that'd have to be done by the dictionary lookup layer,
> > possibly in proxymap, based on a suitable signal from the lookup
> > client, but the low-level API (dict_get()) does not presently
> > support any per-lookup flags. So we'd need dict_get_ex() that
> > takes a new utf8 flag and supporting changes throughout the
> > code. This is a major change.
> >
> > Perhaps there's a clever way to avoid this, but I am not seeing it
> > yet.
>
> I don't think that it is a good idea to signal the query string
> encoding through the Postfix dict(3) API. Primarily because there
> is no legitimate use for non-UTF8 in envelopes and headers, so why
> bother adding complexity now for a dying cause?
>
> Relevant for the $Subject: in UTF8 mode, the Postfix dict API will
> never look up a non-UTF-8 string, because it cannot exist, and it
> will never return a non-UTF-8 lookup result, for the same reasons.
> If non-UTF8 content were allowed in UTF-8 mode, then every string
> would be a potential bomb ready to explode.
>
> Even more relevant for the $Subject: several Postfix lookup tables
> already block non-UTF8 queries when smtputf8 mode is turned off
> (dict_ldap.c, dict_sqlite.c):
>
>     /*
>      * Don't frustrate future attempts to make Postfix UTF-8 transparent.
>      */
>     if ((dict->flags & DICT_FLAG_UTF8_ACTIVE) == 0
>         && !valid_utf8_string(name, strlen(name))) {
>         if (msg_verbose)
>             msg_info("%s: %s: Skipping lookup of non-UTF-8 key '%s'",
>                      myname, dict_ldap->parser->name, name);
>         return (0);
>     }
>
> This code has been there for four years already. Never heard a peep.
>
> I think that we should by default extend this to all lookup tables
> and do away with that LATIN1 hack. That would also address the
> problem that underlies the original request. No need for signaling
> the request encoding!
>
> 	Wietse

Wietse

is the above check done in the single dict modules redundant?

If DICT_FLAG_UTF8_ACTIVE is set, it means that the lookup method has
already been overridden by dict_utf8_lookup (see util/dict_utf8.c):

    /*
     * Interpose on the lookup/update/delete methods. It is a conscious
     * decision not to tinker with the iterator or destructor.
     */
    backup->lookup = dict->lookup;
    backup->update = dict->update;
    backup->delete = dict->delete;
    dict->lookup = dict_utf8_lookup;
    dict->update = dict_utf8_update;
    dict->delete = dict_utf8_delete;

and dict_utf8_lookup does this check:

    /*
     * Validate and optionally fold the key, and if invalid skip the
     * request.
     */
    if ((fold_res = dict_utf8_check_fold(dict, key, &utf8_err)) == 0) {
        msg_warn("%s:%s: non-UTF-8 key \"%s\": %s",
                 dict->type, dict->name, key, utf8_err);
        dict->error = DICT_ERR_NONE;
        return (0);
    }

	John
Re: Add UTF8 support in PostgreSQL lookup table interface
Viktor Dukhovni:
> On Fri, Aug 24, 2018 at 04:54:32PM -0400, Wietse Venema wrote:
> > Even more relevant for the $Subject: several Postfix lookup tables
> > already block non-UTF8 queries when smtputf8 mode is turned off
> > (dict_ldap.c, dict_sqlite.c):
> >
> >     /*
> >      * Don't frustrate future attempts to make Postfix UTF-8 transparent.
> >      */
> >     if ((dict->flags & DICT_FLAG_UTF8_ACTIVE) == 0
> >         && !valid_utf8_string(name, strlen(name))) {
> >         if (msg_verbose)
> >             msg_info("%s: %s: Skipping lookup of non-UTF-8 key '%s'",
> >                      myname, dict_ldap->parser->name, name);
> >         return (0);
> >     }
> > ...
> > I think that we should by default extend this to all lookup tables
> > and do away with that LATIN1 hack. That would also address the
> > problem that underlies the original request. No need for signaling
> > the request encoding!
>
> If the Postgres dictionary client-encoding changes from LATIN1 to
> UTF8, at what layer do you want to convert "unspecified" non-ASCII
> lookup keys to UTF8 (by encoding each non-ASCII byte to 2 UTF-8
> bytes)?

There is NO such conversion. For the past 3-4 years, Postfix has
enforced that a query can only be well-formed UTF-8 for LDAP and
SQLite tables, EVEN IF SMTPUTF8 mode is turned off. And no-one has
complained.

I am arguing to extend this filter to the other tables, possibly with
backwards compatibility.

There never was legitimate use of non-UTF8 in envelopes and headers.
It is a dying cause, therefore let's not extend its miserable life by
adding complexity to Postfix.

	Wietse
Re: Add UTF8 support in PostgreSQL lookup table interface
On Fri, Aug 24, 2018 at 05:14:46PM -0400, Viktor Dukhovni wrote:
> On Fri, Aug 24, 2018 at 04:54:32PM -0400, Wietse Venema wrote:
>
> > I think that we should by default extend this to all lookup tables
> > and do away with that LATIN1 hack. That would also address the
> > problem that underlies the original request. No need for signaling
> > the request encoding!
>
> If the Postgres dictionary client-encoding changes from LATIN1 to
> UTF8, at what layer do you want to convert "unspecified" non-ASCII
> lookup keys to UTF8 (by encoding each non-ASCII byte to 2 UTF-8
> bytes)?
>
> If we don't do that, we'll get dictionary lookup errors for invalid
> input strings, rather than no-match...

It seems you just want to skip the lookup instead... I guess that'll
work too.

-- 
	Viktor.
Re: Add UTF8 support in PostgreSQL lookup table interface
On Fri, Aug 24, 2018 at 04:54:32PM -0400, Wietse Venema wrote:
> I think that we should by default extend this to all lookup tables
> and do away with that LATIN1 hack. That would also address the
> problem that underlies the original request. No need for signaling
> the request encoding!

If the Postgres dictionary client-encoding changes from LATIN1 to
UTF8, at what layer do you want to convert "unspecified" non-ASCII
lookup keys to UTF8 (by encoding each non-ASCII byte to 2 UTF-8
bytes)?

If we don't do that, we'll get dictionary lookup errors for invalid
input strings, rather than no-match...

-- 
	Viktor.
Re: Add UTF8 support in PostgreSQL lookup table interface
On Fri, Aug 24, 2018 at 02:55:19PM -0400, Viktor Dukhovni wrote:
> SMTP protocol. The something else unspecified will be some valid
> (intended or otherwise) LATIN1 string. Its use in database queries
> with a LATIN1 client encoding will not throw perplexing errors.

Ok, I get it. I will note that this is true only for some values of
"perplexing". For instance, to a user, the strings
a...@anvilwalrusden.com and aj[ZWNJ]s...@anvilwalrusden.com look
identical, so the fact that the second one doesn't match will be
perplexing. This is no criticism: perplexity in i18n is pretty normal.
But I get it that you won't get some sort of back-end encoding error.

> support any per-lookup flags. So we'd need dict_get_ex() that
> takes a new utf8 flag and supporting changes throughout the
> code. This is a major change.

Sounds like, yes.

A
-- 
Andrew Sullivan
a...@anvilwalrusden.com
Re: Add UTF8 support in PostgreSQL lookup table interface
Viktor Dukhovni:
> Yes, but that'd have to be done by the dictionary lookup layer,
> possibly in proxymap, based on a suitable signal from the lookup
> client, but the low-level API (dict_get()) does not presently
> support any per-lookup flags. So we'd need dict_get_ex() that
> takes a new utf8 flag and supporting changes throughout the
> code. This is a major change.
>
> Perhaps there's a clever way to avoid this, but I am not seeing it
> yet.

I don't think that it is a good idea to signal the query string
encoding through the Postfix dict(3) API. Primarily because there
is no legitimate use for non-UTF8 in envelopes and headers, so why
bother adding complexity now for a dying cause?

Relevant for the $Subject: in UTF8 mode, the Postfix dict API will
never look up a non-UTF-8 string, because it cannot exist, and it
will never return a non-UTF-8 lookup result, for the same reasons.
If non-UTF8 content were allowed in UTF-8 mode, then every string
would be a potential bomb ready to explode.

Even more relevant for the $Subject: several Postfix lookup tables
already block non-UTF8 queries when smtputf8 mode is turned off
(dict_ldap.c, dict_sqlite.c):

    /*
     * Don't frustrate future attempts to make Postfix UTF-8 transparent.
     */
    if ((dict->flags & DICT_FLAG_UTF8_ACTIVE) == 0
        && !valid_utf8_string(name, strlen(name))) {
        if (msg_verbose)
            msg_info("%s: %s: Skipping lookup of non-UTF-8 key '%s'",
                     myname, dict_ldap->parser->name, name);
        return (0);
    }

This code has been there for four years already. Never heard a peep.

I think that we should by default extend this to all lookup tables
and do away with that LATIN1 hack. That would also address the
problem that underlies the original request. No need for signaling
the request encoding!

	Wietse
Re: Add UTF8 support in PostgreSQL lookup table interface
> On Aug 24, 2018, at 2:33 PM, Andrew Sullivan wrote:
>
> On Thu, Aug 23, 2018 at 07:02:27PM -0400, Viktor Dukhovni wrote:
>
> > Absent any indication of character set from the client, there was
> > no way to know what encoding any particular non-ASCII octet
> > string may be using, so the code was optimized to avoid spurious
> > database string conversion errors, by using an encoding that
> > would accept any octet-string, garbage-in -> garbage-out.
>
> Unless I misunderstand you (and I might well be doing so) I don't
> think LATIN1 works that way in Postgres.

The point isn't really about Postgres per se, but rather that all
octet strings are valid encodings of some LATIN1 string, since every
octet is a valid LATIN1 code-point. The same cannot be said of UTF8,
since random octet strings are generally invalid.

> If you're going
> to get multibyte strings, I don't see how LATIN1 is any less likely
> to throw errors than UTF8

We don't get "multi-byte" strings; absent SMTPUTF8 we get octet
strings that are either ASCII, or something else unspecified that
violates the SMTP protocol. The something else unspecified will be
some valid (intended or otherwise) LATIN1 string. Its use in database
queries with a LATIN1 client encoding will not throw perplexing
errors.

> It's
> true that the characters in that case will probably map, but they'll
> fail anyway since the match could as easily be wrong as right).

The sending system violated the SMTP protocol, garbage-in ->
garbage-out. The folks who tend to be lazy about encodings have
generally been the ones with some flavour of a single-byte ISO-8859-X
encoding, so we'll continue to treat unspecified 8-bit input as
LATIN1.

> > This means that we'd need a way to dynamically update the client
> > encoding of the database connection to UTF8 when appropriate
> > and revert it to LATIN1 when the client encoding is unspecified.
>
> You can do this in libpq and also in commands passed on a regular
> connection:
>
>     SET CLIENT_ENCODING TO 'value';

Yes, but that'd have to be done by the dictionary lookup layer,
possibly in proxymap, based on a suitable signal from the lookup
client, but the low-level API (dict_get()) does not presently
support any per-lookup flags. So we'd need dict_get_ex() that
takes a new utf8 flag and supporting changes throughout the
code. This is a major change.

Perhaps there's a clever way to avoid this, but I am not seeing it
yet.

-- 
	Viktor.
Re: Add UTF8 support in PostgreSQL lookup table interface
Hi,

On Thu, Aug 23, 2018 at 07:02:27PM -0400, Viktor Dukhovni wrote:
> Absent any indication of character set from the client, there was
> no way to know what encoding any particular non-ASCII octet
> string may be using, so the code was optimized to avoid spurious
> database string conversion errors, by using an encoding that
> would accept any octet-string, garbage-in -> garbage-out.

Unless I misunderstand you (and I might well be doing so) I don't
think LATIN1 works that way in Postgres. The backend's encoding and
the front end's need to be translation-compatible, but that would be
true of a SQL_ASCII back end with literally any front-end encoding (no
conversion is done to a back end in SQL_ASCII). If you're going to get
multibyte strings, I don't see how LATIN1 is any less likely to throw
errors than UTF8 (and it is slightly more likely, if you have a UTF-8
multi-octet code point that maps into ISO 8859-1 space in the wrong
way, or if the target environment fails to use ISO 8859-1. It's true
that the characters in that case will probably map, but they'll fail
anyway since the match could as easily be wrong as right).

> This means that we'd need a way to dynamically update the client
> encoding of the database connection to UTF8 when appropriate
> and revert it to LATIN1 when the client encoding is unspecified.

You can do this in libpq and also in commands passed on a regular
connection:

    SET CLIENT_ENCODING TO 'value';

(see https://www.postgresql.org/docs/10/static/multibyte.html). I
forget whether this can be done inside a transaction, but it seems a
fabulously bad idea to change encodings mid-transaction anyway.

> And this needs to work across the proxymap protocol.

The problem could well be here; I'm insufficiently familiar with its
internals to comment.

Best regards,

A
-- 
Andrew Sullivan
a...@anvilwalrusden.com
Re: Add UTF8 support in PostgreSQL lookup table interface
> On Aug 23, 2018, at 6:39 PM, Geir Pedersen wrote:
>
> The dictionary interface to PostgreSQL found in
> src/global/dict_pgsql.c does not support UTF8. It explicitly tells
> the database that Postfix will send LATIN1.

Absent any indication of character set from the client, there was no
way to know what encoding any particular non-ASCII octet string may be
using, so the code was optimized to avoid spurious database string
conversion errors, by using an encoding that would accept any
octet-string, garbage-in -> garbage-out.

> With SMTPUTF8 support now in place, Postfix may try to look up
> addresses with UTF8 in the local part in PostgreSQL virtual mailbox
> maps.

While Postfix now supports UTF8, it is not always enabled, and even
when enabled not always used by the client. So using UTF-8 on the
database connection may not always be appropriate. Ideally we'd only
use UTF-8 when the client indicates that it is using SMTPUTF8:

    MAIL FROM:<address> BODY=8BITMIME SMTPUTF8

> Such lookups now fail as the UTF8 sent by Postfix is taken as LATIN1
> by PostgreSQL. Error message from Postfix:
> "Recipient address rejected: User unknown in virtual mailbox table"

This means that we'd need a way to dynamically update the client
encoding of the database connection to UTF8 when appropriate and
revert it to LATIN1 when the client encoding is unspecified. And this
needs to work across the proxymap protocol.

So while the change you're proposing is well motivated, I am not sure
that the solution is as simple as you propose. We'd need to add a
query-time UTF8 flag to the low-level dictionary lookup methods,
implement the higher-level lookups on top of a default octet-string
(LATIN1 if you like) encoding, and add new functions that perform
similar lookups on UTF8 data. The PostgreSQL driver would then export
a function to switch the client connection to UTF8 (assuming the
encoding can be changed on the fly between queries).

-- 
	Viktor.