Re: [HACKERS] Why do we let CREATE DATABASE reassign encoding?
I wrote: Peter Eisentraut pete...@gmx.net writes: AFAIR, the only reason that we haven't disallowed this sort of stuff years and years ago is that people use it; the Japanese in particular. I don't see what is different now. What's different now is that 8.4 has already established the principle that you have to clone template0 if you want to change the locale of a database. I think this is a good time to establish the same principle for encodings. (Or in other words, if we don't fix it now, when will be a better time?) Attached is a proposed patch (without documentation changes as yet) for this. Since the code is already enforcing exact locale match when cloning a non-template0 database, I just made it act the same for encoding, without any strange exceptions for SQL_ASCII. I found that mbregress.sh was already broken by the existing restrictions, if you try to use it in a database whose default locale isn't C. The patch adds switches to fix that. The patch also incidentally fixes a few ereport's that were missing errcode values. Last chance for objections ... regards, tom lane Index: src/backend/commands/dbcommands.c === RCS file: /cvsroot/pgsql/src/backend/commands/dbcommands.c,v retrieving revision 1.223 diff -c -r1.223 dbcommands.c *** src/backend/commands/dbcommands.c 5 May 2009 23:39:55 - 1.223 --- src/backend/commands/dbcommands.c 6 May 2009 00:30:59 - *** *** 361,367 #endif (encoding == PG_SQL_ASCII superuser( ereport(ERROR, ! (errmsg(encoding %s does not match locale %s, pg_encoding_to_char(encoding), dbctype), errdetail(The chosen LC_CTYPE setting requires encoding %s., --- 361,368 #endif (encoding == PG_SQL_ASCII superuser( ereport(ERROR, ! (errcode(ERRCODE_INVALID_PARAMETER_VALUE), !errmsg(encoding %s does not match locale %s, pg_encoding_to_char(encoding), dbctype), errdetail(The chosen LC_CTYPE setting requires encoding %s., *** *** 374,402 #endif (encoding == PG_SQL_ASCII superuser( ereport(ERROR, ! (errmsg(encoding %s does not match locale %s, pg_encoding_to_char(encoding), dbcollate), errdetail(The chosen LC_COLLATE setting requires encoding %s., pg_encoding_to_char(collate_encoding; /* !* Check that the new locale is compatible with the source database. * !* We know that template0 doesn't contain any indexes that depend on !* collation or ctype, so template0 can be used as template for !* any locale. */ if (strcmp(dbtemplate, template0) != 0) { if (strcmp(dbcollate, src_collate) != 0) ereport(ERROR, ! (errmsg(new collation is incompatible with the collation of the template database (%s), src_collate), errhint(Use the same collation as in the template database, or use template0 as template.))); if (strcmp(dbctype, src_ctype) != 0) ereport(ERROR, ! (errmsg(new LC_CTYPE is incompatible with LC_CTYPE of the template database (%s), src_ctype), errhint(Use the same LC_CTYPE as in the template database, or use template0 as template.))); } --- 375,419 #endif (encoding == PG_SQL_ASCII superuser( ereport(ERROR, ! (errcode(ERRCODE_INVALID_PARAMETER_VALUE), !errmsg(encoding %s does not match locale %s, pg_encoding_to_char(encoding), dbcollate), errdetail(The chosen LC_COLLATE setting requires encoding %s., pg_encoding_to_char(collate_encoding; /* !* Check that the new encoding and locale settings match the source !* database. We insist on this because we simply copy the source data --- !* any non-ASCII data would be wrongly encoded, and any indexes sorted !* according to the source locale would be wrong. * !* However, we assume
[HACKERS] Why do we let CREATE DATABASE reassign encoding?
If I have locale set to C, I can do this: regression=# create database u8 encoding 'utf8'; CREATE DATABASE regression=# create database l1 encoding 'latin1' template u8; CREATE DATABASE Had I had any actual utf8 data in u8, l1 would now contain encoding-corrupt information. Given that we've tried to clamp down on encoding violations in recent releases, I wonder why this case is still allowed. (In non-C locales, this will typically fail because the two different encodings can't both match the locale. But I don't believe it's our policy to enforce encoding validity only for non-C locales.) We should presumably let the encoding be changed when cloning from template0, and probably it's reasonable to trust the user if either source or destination DB encoding is SQL_ASCII. In other cases I'm thinking it should fail. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why do we let CREATE DATABASE reassign encoding?
Tom Lane wrote: If I have locale set to C, I can do this: regression=# create database u8 encoding 'utf8'; CREATE DATABASE regression=# create database l1 encoding 'latin1' template u8; CREATE DATABASE Had I had any actual utf8 data in u8, l1 would now contain encoding-corrupt information. Given that we've tried to clamp down on encoding violations in recent releases, I wonder why this case is still allowed. Wow, I'm surprised we allow that. Never occurred to me to try. We should presumably let the encoding be changed when cloning from template0, and probably it's reasonable to trust the user if either source or destination DB encoding is SQL_ASCII. In other cases I'm thinking it should fail. Agreed, that's exactly what we did with per-database collation. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why do we let CREATE DATABASE reassign encoding?
In response to Tom Lane t...@sss.pgh.pa.us: If I have locale set to C, I can do this: regression=# create database u8 encoding 'utf8'; CREATE DATABASE regression=# create database l1 encoding 'latin1' template u8; CREATE DATABASE Had I had any actual utf8 data in u8, l1 would now contain encoding-corrupt information. Given that we've tried to clamp down on encoding violations in recent releases, I wonder why this case is still allowed. (In non-C locales, this will typically fail because the two different encodings can't both match the locale. But I don't believe it's our policy to enforce encoding validity only for non-C locales.) We should presumably let the encoding be changed when cloning from template0, and probably it's reasonable to trust the user if either source or destination DB encoding is SQL_ASCII. In other cases I'm thinking it should fail. On a pedantic level, doesn't this remove the ability to have databases on a single cluster that are different encodings? I mean, if template1 is utf8, and I can't change that using CREATE DATABASE, then I'm stuck with utf8 for all databases on that cluster ... unless I'm missing something. Granted, there's the potential for special cases with databases used only for templates, but as I see it, this should be allowed, it should just fail if any data in the template can't be converted to the desired encoding. I mean, I can always alter template1 by inserting non-utf8 data, and then try to use it to create a utf8 encoded database ... -- Bill Moran http://www.potentialtech.com http://people.collaborativefusion.com/~wmoran/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why do we let CREATE DATABASE reassign encoding?
Tom Lane wrote: If I have locale set to C, I can do this: regression=# create database u8 encoding 'utf8'; CREATE DATABASE regression=# create database l1 encoding 'latin1' template u8; CREATE DATABASE Had I had any actual utf8 data in u8, l1 would now contain encoding-corrupt information. Given that we've tried to clamp down on encoding violations in recent releases, I wonder why this case is still allowed. (In non-C locales, this will typically fail because the two different encodings can't both match the locale. But I don't believe it's our policy to enforce encoding validity only for non-C locales.) We should presumably let the encoding be changed when cloning from template0, and probably it's reasonable to trust the user if either source or destination DB encoding is SQL_ASCII. In other cases I'm thinking it should fail. Really? You want to forbid selecting an encoding when the source is template1, which is the default, and template1 is not SQL_ASCII? So the following sequence woiuld be illegal: initdb -E latin1 createdb -E utf8 I think we have a bit more thinking to do on this - I don't have a reasonable solution immediately in my head. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why do we let CREATE DATABASE reassign encoding?
Bill Moran wmo...@potentialtech.com writes: In response to Tom Lane t...@sss.pgh.pa.us: We should presumably let the encoding be changed when cloning from template0, and probably it's reasonable to trust the user if either source or destination DB encoding is SQL_ASCII. In other cases I'm thinking it should fail. On a pedantic level, doesn't this remove the ability to have databases on a single cluster that are different encodings? I mean, if template1 is utf8, and I can't change that using CREATE DATABASE, then I'm stuck with utf8 for all databases on that cluster ... unless I'm missing something. You're supposed to clone from template0, not template1, when creating DBs that are different in either encoding or locale from the installation default. We already enforce this except for having missed the special case of C locale. (There might be some corner cases involving UTF8 on Windows, too; not sure about that.) The reason is that template0 is expected to contain only ASCII data, but template1 might not. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why do we let CREATE DATABASE reassign encoding?
Bill Moran wrote: In response to Tom Lane t...@sss.pgh.pa.us: We should presumably let the encoding be changed when cloning from template0, and probably it's reasonable to trust the user if either source or destination DB encoding is SQL_ASCII. In other cases I'm thinking it should fail. On a pedantic level, doesn't this remove the ability to have databases on a single cluster that are different encodings? I mean, if template1 is utf8, and I can't change that using CREATE DATABASE, then I'm stuck with utf8 for all databases on that cluster ... unless I'm missing something. You could still use template0 as template for a database with any encoding, like: CREATE DATABASE .. TEMPLATE = template0; We can special case template0 because we know its contents are pure 7-bit ascii which is compatible with any server encoding. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why do we let CREATE DATABASE reassign encoding?
Andrew Dunstan and...@dunslane.net writes: So the following sequence woiuld be illegal: initdb -E latin1 createdb -E utf8 Yes, that's rather the point. Note that it already *is* illegal unless you happen to have selected C locale; AFAICS that is an oversight and not intentional. For instance, going in the other direction in en_US locale, I get $ createdb -E latin1 l1 createdb: database creation failed: ERROR: encoding LATIN1 does not match locale en_US.utf8 DETAIL: The chosen LC_CTYPE setting requires encoding UTF8. You can get around this by cloning template0 instead of template1 (we assume template0 contains nothing that's encoding-specific). Possibly the docs will need to be improved to emphasize that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why do we let CREATE DATABASE reassign encoding?
Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: So the following sequence woiuld be illegal: initdb -E latin1 createdb -E utf8 Yes, that's rather the point. Note that it already *is* illegal unless you happen to have selected C locale; AFAICS that is an oversight and not intentional. Ok, then I agree that we need a big warning on that in the docs. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why do we let CREATE DATABASE reassign encoding?
In response to Tom Lane t...@sss.pgh.pa.us: Bill Moran wmo...@potentialtech.com writes: In response to Tom Lane t...@sss.pgh.pa.us: We should presumably let the encoding be changed when cloning from template0, and probably it's reasonable to trust the user if either source or destination DB encoding is SQL_ASCII. In other cases I'm thinking it should fail. On a pedantic level, doesn't this remove the ability to have databases on a single cluster that are different encodings? I mean, if template1 is utf8, and I can't change that using CREATE DATABASE, then I'm stuck with utf8 for all databases on that cluster ... unless I'm missing something. You're supposed to clone from template0, not template1, when creating DBs that are different in either encoding or locale from the installation default. We already enforce this except for having missed the special case of C locale. Ah ... was not aware of that. It hasn't come up in my usage. (There might be some corner cases involving UTF8 on Windows, too; not sure about that.) The reason is that template0 is expected to contain only ASCII data, but template1 might not. Makes sense, with that explanation. -- Bill Moran http://www.potentialtech.com http://people.collaborativefusion.com/~wmoran/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why do we let CREATE DATABASE reassign encoding?
So it would still be possible to byass this check by cloning a database into SQL_ASCII and then cloning it into the desired encoding? Doesn't sound like it really accomplishes much. I do seem to recall some discussion about this way back. I don't recall the conclusion but I remember some talk about detecting an empty template database and the ned to reindex. -- Greg On 23 Apr 2009, at 19:46, Bill Moran wmo...@potentialtech.com wrote: In response to Tom Lane t...@sss.pgh.pa.us: If I have locale set to C, I can do this: regression=# create database u8 encoding 'utf8'; CREATE DATABASE regression=# create database l1 encoding 'latin1' template u8; CREATE DATABASE Had I had any actual utf8 data in u8, l1 would now contain encoding-corrupt information. Given that we've tried to clamp down on encoding violations in recent releases, I wonder why this case is still allowed. (In non-C locales, this will typically fail because the two different encodings can't both match the locale. But I don't believe it's our policy to enforce encoding validity only for non-C locales.) We should presumably let the encoding be changed when cloning from template0, and probably it's reasonable to trust the user if either source or destination DB encoding is SQL_ASCII. In other cases I'm thinking it should fail. On a pedantic level, doesn't this remove the ability to have databases on a single cluster that are different encodings? I mean, if template1 is utf8, and I can't change that using CREATE DATABASE, then I'm stuck with utf8 for all databases on that cluster ... unless I'm missing something. Granted, there's the potential for special cases with databases used only for templates, but as I see it, this should be allowed, it should just fail if any data in the template can't be converted to the desired encoding. I mean, I can always alter template1 by inserting non-utf8 data, and then try to use it to create a utf8 encoded database ... -- Bill Moran http://www.potentialtech.com http://people.collaborativefusion.com/~wmoran/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why do we let CREATE DATABASE reassign encoding?
Greg Stark greg.st...@enterprisedb.com writes: So it would still be possible to byass this check by cloning a database into SQL_ASCII and then cloning it into the desired encoding? Doesn't sound like it really accomplishes much. Well, it accomplishes preventing stupid encoding violations. The point came to mind when I saw a post a little bit ago on -general in which the poster seemed to imagine that CREATE DATABASE would convert encodings for him. Since that is not happening in the foreseeable future, I think we need to prevent the system from acting like it would work. If we wanted to be entirely anal about this, we could allow SQL_ASCII destination with a different source encoding, but not the reverse. However, we currently consider that you're on your own to ensure sanity when using SQL_ASCII as far as locale goes, so I'm not sure why the policy would be different for encoding. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why do we let CREATE DATABASE reassign encoding?
Tom Lane wrote: If we wanted to be entirely anal about this, we could allow SQL_ASCII destination with a different source encoding, but not the reverse. However, we currently consider that you're on your own to ensure sanity when using SQL_ASCII as far as locale goes, so I'm not sure why the policy would be different for encoding. The trouble is that people won't know the provenance of their database. I think we should try to guarantee as far as possible that if a database has encoding X then all the data in it is valid in that encoding. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why do we let CREATE DATABASE reassign encoding?
On Apr 23, 2009, at 12:00 PM, Tom Lane wrote: You can get around this by cloning template0 instead of template1 (we assume template0 contains nothing that's encoding-specific). Possibly the docs will need to be improved to emphasize that. I was just about to suggest that. With this change, template0 is suddenly going to be a lot more important for people to know about and make use of. Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why do we let CREATE DATABASE reassign encoding?
On Thursday 23 April 2009 22:00:25 Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: So the following sequence woiuld be illegal: initdb -E latin1 createdb -E utf8 Yes, that's rather the point. Note that it already is illegal unless you happen to have selected C locale; AFAIR, the only reason that we haven't disallowed this sort of stuff years and years ago is that people use it; the Japanese in particular. I don't see what is different now. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why do we let CREATE DATABASE reassign encoding?
Peter Eisentraut pete...@gmx.net writes: AFAIR, the only reason that we haven't disallowed this sort of stuff years and years ago is that people use it; the Japanese in particular. I don't see what is different now. What's different now is that 8.4 has already established the principle that you have to clone template0 if you want to change the locale of a database. I think this is a good time to establish the same principle for encodings. (Or in other words, if we don't fix it now, when will be a better time?) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers