Re: [HACKERS] initdb initalization failure for collation "ja_JP"

2017-06-23 Thread Tom Lane
I wrote:
>> One question that I've got is why the ICU portion refuses to load
>> any entries unless is_encoding_supported_by_icu(GetDatabaseEncoding()).
>> Surely this is completely wrong?  I should think that what we load into
>> pg_collation ought to be independent of template1's encoding, the same
>> as it is for libc collations, and the right place to be making a test
>> like that is where somebody attempts to use an ICU collation.

Pursuant to the second part of that: I checked on what happens if you
try to use an ICU collation in a database with a not-supported-by-ICU
encoding.  We have to cope with that scenario even with the current
(broken IMO) initdb behavior, because even if template1 has a supported
encoding, it's possible to create another database that doesn't.
It does fail more or less cleanly; you get an "encoding "foo" not
supported by ICU" message at runtime (out of get_encoding_name_for_icu).
But that's quite a bit unlike the behavior for libc collations: with
those, you get an error in collation name lookup, along the lines of

collation "en_DK.utf8" for encoding "SQL_ASCII" does not exist

The attached proposed patch makes the behavior for ICU collations the
same, by dint of injecting the is_encoding_supported_by_icu() check
into collation name lookup.
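
With the patch in place, an attempt to use an ICU collation in, say, a
SQL_ASCII database should fail at collation name lookup in the same
style, e.g. (taking "und-x-icu" as a representative ICU collation name)

collation "und-x-icu" for encoding "SQL_ASCII" does not exist

rather than getting as far as the ICU-specific runtime error.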

regards, tom lane

diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index 64f6fee..029a132 100644
*** a/src/backend/catalog/namespace.c
--- b/src/backend/catalog/namespace.c
*** OpfamilyIsVisible(Oid opfid)
*** 1915,1923 
--- 1915,1974 
  }
  
  /*
+  * lookup_collation
+  *		If there's a collation of the given name/namespace, and it works
+  *		with the given encoding, return its OID.  Else return InvalidOid.
+  */
+ static Oid
+ lookup_collation(const char *collname, Oid collnamespace, int32 encoding)
+ {
+ 	Oid			collid;
+ 	HeapTuple	colltup;
+ 	Form_pg_collation collform;
+ 
+ 	/* Check for encoding-specific entry (exact match) */
+ 	collid = GetSysCacheOid3(COLLNAMEENCNSP,
+ 			 PointerGetDatum(collname),
+ 			 Int32GetDatum(encoding),
+ 			 ObjectIdGetDatum(collnamespace));
+ 	if (OidIsValid(collid))
+ 		return collid;
+ 
+ 	/*
+ 	 * Check for any-encoding entry.  This takes a bit more work: while libc
+ 	 * collations with collencoding = -1 do work with all encodings, ICU
+ 	 * collations only work with certain encodings, so we have to check that
+ 	 * aspect before deciding it's a match.
+ 	 */
+ 	colltup = SearchSysCache3(COLLNAMEENCNSP,
+ 			  PointerGetDatum(collname),
+ 			  Int32GetDatum(-1),
+ 			  ObjectIdGetDatum(collnamespace));
+ 	if (!HeapTupleIsValid(colltup))
+ 		return InvalidOid;
+ 	collform = (Form_pg_collation) GETSTRUCT(colltup);
+ 	if (collform->collprovider == COLLPROVIDER_ICU)
+ 	{
+ 		if (is_encoding_supported_by_icu(encoding))
+ 			collid = HeapTupleGetOid(colltup);
+ 		else
+ 			collid = InvalidOid;
+ 	}
+ 	else
+ 	{
+ 		collid = HeapTupleGetOid(colltup);
+ 	}
+ 	ReleaseSysCache(colltup);
+ 	return collid;
+ }
+ 
+ /*
   * CollationGetCollid
   *		Try to resolve an unqualified collation name.
   *		Returns OID if collation found in search path, else InvalidOid.
+  *
+  * Note that this will only find collations that work with the current
+  * database's encoding.
   */
  Oid
  CollationGetCollid(const char *collname)
*** CollationGetCollid(const char *collname)
*** 1935,1953 
  		if (namespaceId == myTempNamespace)
  			continue;			/* do not look in temp namespace */
  
! 		/* Check for database-encoding-specific entry */
! 		collid = GetSysCacheOid3(COLLNAMEENCNSP,
! 								 PointerGetDatum(collname),
! 								 Int32GetDatum(dbencoding),
! 								 ObjectIdGetDatum(namespaceId));
! 		if (OidIsValid(collid))
! 			return collid;
! 
! 		/* Check for any-encoding entry */
! 		collid = GetSysCacheOid3(COLLNAMEENCNSP,
! 								 PointerGetDatum(collname),
! 								 Int32GetDatum(-1),
! 								 ObjectIdGetDatum(namespaceId));
  		if (OidIsValid(collid))
  			return collid;
  	}
--- 1986,1992 
  		if (namespaceId == myTempNamespace)
  			continue;			/* do not look in temp namespace */
  
! 		collid = lookup_collation(collname, namespaceId, dbencoding);
  		if (OidIsValid(collid))
  			return collid;
  	}
*** CollationGetCollid(const char *collname)
*** 1961,1966 
--- 2000,2008 
   *		Determine whether a collation (identified by OID) is visible in the
   *		current search path.  Visible means "would be found by searching
   *		for the unqualified collation name".
+  *
+  * Note that only collations that work with the current database's encoding
+  * will be considered visible.
   */
  bool
  CollationIsVisible(Oid collid)
*** CollationIsVisible(Oid collid)
*** 1990,1998 
  	{
  		/*
  		 * If it is in the path, it might still not be visible; it could be
! 		 * hidden by another conversion of the same name earlier in the path.
! 		 * So we must do a slow 

Re: [HACKERS] initdb initalization failure for collation "ja_JP"

2017-06-23 Thread Tom Lane
I wrote:
> One question that I've got is why the ICU portion refuses to load
> any entries unless is_encoding_supported_by_icu(GetDatabaseEncoding()).
> Surely this is completely wrong?  I should think that what we load into
> pg_collation ought to be independent of template1's encoding, the same
> as it is for libc collations, and the right place to be making a test
> like that is where somebody attempts to use an ICU collation.  But
> I've not tried to test it.

So I did test that, and found out the presumable reason why that's there:
icu_from_uchar() falls over if the database encoding is unsupported, and
we use that to convert ICU "display names" for use as comments for the
ICU collations.  But that's not very much less wrongheaded, because it will
allow non-ASCII characters into the initial database contents, which is
absolutely not acceptable.  We assume we can bit-copy the contents of
template0 and it will be valid in any encoding.

Therefore, I think the right thing to do is remove that test and change
get_icu_locale_comment() so that it rejects non-ASCII text, making the
encoding conversion trivial, as in the attached patch.

On my Fedora 25 laptop, the only collations that go without a comment
in this approach are the "nb" ones (Norwegian Bokmål).  As I recall,
that locale is a second-class citizen for other reasons already,
precisely because of its loony insistence on a non-ASCII name even
when we're asking for an Anglicized version.

I'm inclined to add a test to reject non-ASCII in the ICU locale names as
well as the comments.  We've had to do that for libc locale names, and
this experience shows that the ICU locale maintainers don't have their
heads screwed on any straighter.  But this patch doesn't do that.
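
For the record, such a check could be as simple as the sketch below (not
part of the attached patch; "is_all_ascii" is a hypothetical helper name,
and IS_HIGHBIT_SET comes from c.h).  The ICU loop in
pg_import_system_collations() would then just skip any locale whose name
fails the test:

static bool
is_all_ascii(const char *str)
{
	/* reject any byte with the high bit set, i.e. any non-ASCII byte */
	for (; *str; str++)
	{
		if (IS_HIGHBIT_SET(*str))
			return false;
	}
	return true;
}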

regards, tom lane

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 1c43f0b..6febe63 100644
*** a/src/backend/commands/collationcmds.c
--- b/src/backend/commands/collationcmds.c
*** get_icu_language_tag(const char *localen
*** 431,437 
  
  /*
   * Get a comment (specifically, the display name) for an ICU locale.
!  * The result is a palloc'd string.
   */
  static char *
  get_icu_locale_comment(const char *localename)
--- 431,439 
  
  /*
   * Get a comment (specifically, the display name) for an ICU locale.
!  * The result is a palloc'd string, or NULL if we can't get a comment
!  * or find that it's not all ASCII.  (We can *not* accept non-ASCII
!  * comments, because the contents of template0 must be encoding-agnostic.)
   */
  static char *
  get_icu_locale_comment(const char *localename)
*** get_icu_locale_comment(const char *local
*** 439,444 
--- 441,447 
  	UErrorCode	status;
  	UChar		displayname[128];
  	int32		len_uchar;
+ 	int32		i;
  	char	   *result;
  
  	status = U_ZERO_ERROR;
*** get_icu_locale_comment(const char *local
*** 446,456 
  									displayname, lengthof(displayname),
  									&status);
  	if (U_FAILURE(status))
! 		ereport(ERROR,
! 				(errmsg("could not get display name for locale \"%s\": %s",
! 						localename, u_errorName(status))));
  
! 	icu_from_uchar(&result, displayname, len_uchar);
  
  	return result;
  }
--- 449,468 
  									displayname, lengthof(displayname),
  									&status);
  	if (U_FAILURE(status))
! 		return NULL;			/* no good reason to raise an error */
  
! 	/* Check for non-ASCII comment */
! 	for (i = 0; i < len_uchar; i++)
! 	{
! 		if (displayname[i] > 127)
! 			return NULL;
! 	}
! 
! 	/* OK, transcribe */
! 	result = palloc(len_uchar + 1);
! 	for (i = 0; i < len_uchar; i++)
! 		result[i] = displayname[i];
! 	result[len_uchar] = '\0';
  
  	return result;
  }
*** pg_import_system_collations(PG_FUNCTION_
*** 642,655 
  
  	/* Load collations known to ICU */
  #ifdef USE_ICU
- 	if (!is_encoding_supported_by_icu(GetDatabaseEncoding()))
- 	{
- 		ereport(NOTICE,
- 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- 				 errmsg("encoding \"%s\" not supported by ICU",
- 						pg_encoding_to_char(GetDatabaseEncoding()))));
- 	}
- 	else
  	{
  		int			i;
  
--- 654,659 
*** pg_import_system_collations(PG_FUNCTION_
*** 661,666 
--- 665,671 
  		{
  			const char *name;
  			char	   *langtag;
+ 			char	   *icucomment;
  			const char *collcollate;
  			UEnumeration *en;
  			UErrorCode	status;
*** pg_import_system_collations(PG_FUNCTION_
*** 686,693 
  
  				CommandCounterIncrement();
  
! 				CreateComments(collid, CollationRelationId, 0,
! 							   get_icu_locale_comment(name));
  			}
  
  			/*
--- 691,700 
  
  				CommandCounterIncrement();
  
! 				icucomment = get_icu_locale_comment(name);
! 				if (icucomment)
! 					CreateComments(collid, CollationRelationId, 0,
! 								   icucomment);
  			}
  
  			/*
*** pg_import_system_collations(PG_FUNCTION_
*** 720,727 
  
  	CommandCounterIncrement();
  
! 	CreateComments(collid, 

Re: [HACKERS] initdb initalization failure for collation "ja_JP"

2017-06-22 Thread Tom Lane
I wrote:
> Now the hard part of that is that because pg_import_system_collations
> isn't using a temporary staging table, but is just inserting directly
> into pg_collation, there isn't any way for it to eliminate duplicates
> unless it uses if_not_exists behavior all the time.  So there seem to
> be two ways to proceed:
> 1. Drop pg_import_system_collations' if_not_exists argument and just
> define it as adding any collations not already known in pg_collation.
> 2. Significantly rewrite it so that it de-dups the collation set by
> hand before trying to insert into pg_collation.

After further thought, I realized that there's another argument for
making pg_import_system_collations() always do if-not-exists, which
is that as it stands it's inconsistent anyway because it silently
uses if-not-exists behavior for aliases.

So attached is a draft patch that drops if_not_exists and tweaks the
alias insertion code to guarantee deterministic behavior.  I also
corrected failure to use CommandCounterIncrement() in the ICU part
of the code, which would cause problems if ICU can ever report
duplicate names; something I don't especially care to assume is
impossible.  Also, I fixed things so that pg_import_system_collations()
doesn't emit any chatter about duplicate existing names, because it
didn't take long to realize that that behavior was useless and
irritating.  (Perhaps we should change the function to return the
number of entries successfully added?  But I didn't do that here.)
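
In other words, the intended semantics amount to roughly this sketch,
applied before each insertion (the variable names here are placeholders,
and the patch may get the same effect through CollationCreate's
if-not-exists path rather than an explicit probe):

	/* skip this locale if an equivalent pg_collation entry already exists */
	if (SearchSysCacheExists3(COLLNAMEENCNSP,
							  PointerGetDatum(localename),
							  Int32GetDatum(enc),
							  ObjectIdGetDatum(nspid)))
		continue;

No error, no notice; just move on to the next candidate.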

I've tested this with a faked version of "locale -a" that generates
duplicated entries, and it does what I think it should.

One question that I've got is why the ICU portion refuses to load
any entries unless is_encoding_supported_by_icu(GetDatabaseEncoding()).
Surely this is completely wrong?  I should think that what we load into
pg_collation ought to be independent of template1's encoding, the same
as it is for libc collations, and the right place to be making a test
like that is where somebody attempts to use an ICU collation.  But
I've not tried to test it.

Also, I'm a bit tempted to remove setup_collation()'s manual insertion of
'ucs_basic' in favor of adding a DATA() line for that to pg_collation.h.
As things stand, if for some reason "locale -a" were to report a locale
named that, initdb would fail.  In the old code, it would not have failed
--- it's uncertain whether you would have gotten the forced UTF8/C
definition or libc's definition, but you would have gotten only one.
However, that approach would result in 'ucs_basic' being treated as
pinned, which perhaps we don't want.  If we don't want it to be pinned,
another idea is just to make setup_collation() insert it first not last,
thereby ensuring that the SQL definition wins over any libc entries.
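
For concreteness, the DATA() approach would mean a line in pg_collation.h
along these lines (assuming the column order currently there; the OID is
only a placeholder to be taken from the unused range, and 6 is PG_UTF8's
numeric value):

DATA(insert OID = 952 ( ucs_basic	PGNSP PGUID c 6 "C" "C" _null_ ));
DESCR("SQL standard UCS_BASIC collation");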

Comments?

regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e073f7b..bd2eaaa 100644
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*** postgres=# SELECT * FROM pg_walfile_name
*** 19711,19717 

 
  pg_import_system_collations
! pg_import_system_collations(if_not_exists boolean, schema regnamespace)
 
 void
 Import operating system collations
--- 19711,19717 

 
  pg_import_system_collations
! pg_import_system_collations(schema regnamespace)
 
 void
 Import operating system collations
*** postgres=# SELECT * FROM pg_walfile_name
*** 19730,19747 
 
  
 
! pg_import_system_collations populates the system
! catalog pg_collation with collations based on all the
! locales it finds on the operating system.  This is
  what initdb uses;
  see  for more details.  If additional
  locales are installed into the operating system later on, this function
! can be run again to add collations for the new locales.  In that case, the
! parameter if_not_exists should be set to true to
! skip over existing collations.  The schema
! parameter would typically be pg_catalog, but that is
! not a requirement.  (Collation objects based on locales that are no longer
! present on the operating system are never removed by this function.)
 
  

--- 19730,19748 
 
  
 
! pg_import_system_collations adds collations to the system
! catalog pg_collation based on all the
! locales it finds in the operating system.  This is
  what initdb uses;
  see  for more details.  If additional
  locales are installed into the operating system later on, this function
! can be run again to add collations for the new locales.  Locales that
! match existing entries in pg_collation will be skipped.
! (But collation objects based on locales that are no longer
! present in the operating system are not removed by this function.)
! The schema parameter would typically
! be pg_catalog, but that is 

Re: [HACKERS] initdb initalization failure for collation "ja_JP"

2017-06-21 Thread Marco Atzeri

On 20/06/2017 17:37, Tom Lane wrote:

I wrote:

Marco Atzeri  writes:

Building on Cygwin, latest 10 beta1 or head source,
make check fails as:
...
performing post-bootstrap initialization ... 2017-05-31 23:23:22.214
CEST [16860] FATAL:  collation "ja_JP" for encoding "EUC_JP" already exists



Hmph.  Could we see the results of "locale -a | grep ja_JP" ?


Despite the lack of followup from the OP, I'm pretty troubled by this
report.  It shows that the reimplementation of OS collation data import
as pg_import_system_collations() is a whole lot more fragile than the
original coding.  We have never before trusted "locale -a" to not produce
duplicate outputs, not since the very beginning in 414c5a2e.  AFAICS,
the current coding has also lost the protections we added very shortly
after that in 853c1750f; and it has also lost the admittedly rather
arbitrary, but at least deterministic, preference order for conflicting
short aliases that was in the original initdb code.


Hi Tom,
I raised the duplication issue on the Cygwin mailing list,
and one of the core developers reports that
they have seen the same issue on Linux in the past.

https://cygwin.com/ml/cygwin/2017-06/msg00253.html



regards, tom lane


Regards
Marco






Re: [HACKERS] initdb initalization failure for collation "ja_JP"

2017-06-20 Thread Marco Atzeri

On 18/06/2017 16:48, Tom Lane wrote:

Marco Atzeri  writes:

Building on Cygwin, latest 10 beta1 or head source,
make check fails as:
...
performing post-bootstrap initialization ... 2017-05-31 23:23:22.214
CEST [16860] FATAL:  collation "ja_JP" for encoding "EUC_JP" already exists


Hmph.  Could we see the results of "locale -a | grep ja_JP" ?

regards, tom lane



$ locale -a |grep -i ja
ja_JP
ja_JP
ja_JP.utf8
ja_JP.ujis
ja_JP@cjknarrow
ja_JP.utf8@cjknarrow
japanese
japanese.euc
japanese.sjis

$ locale -a |grep -i "euc"
japanese.euc
korean.euc

Regards
Marco




Re: [HACKERS] initdb initalization failure for collation "ja_JP"

2017-06-20 Thread Tom Lane
I wrote:
> Marco Atzeri  writes:
>> Building on Cygwin, latest 10 beta1 or head source,
>> make check fails as:
>> ...
>> performing post-bootstrap initialization ... 2017-05-31 23:23:22.214 
>> CEST [16860] FATAL:  collation "ja_JP" for encoding "EUC_JP" already exists

> Hmph.  Could we see the results of "locale -a | grep ja_JP" ?

Despite the lack of followup from the OP, I'm pretty troubled by this
report.  It shows that the reimplementation of OS collation data import
as pg_import_system_collations() is a whole lot more fragile than the
original coding.  We have never before trusted "locale -a" to not produce
duplicate outputs, not since the very beginning in 414c5a2e.  AFAICS,
the current coding has also lost the protections we added very shortly
after that in 853c1750f; and it has also lost the admittedly rather
arbitrary, but at least deterministic, preference order for conflicting
short aliases that was in the original initdb code.

I suppose the idea was to see whether we actually needed those defenses,
but since we have here a failure report after less than a month of beta,
it seems clear to me that we do.  I think we need to upgrade
pg_import_system_collations to have all the same logic that was there
before.

Now the hard part of that is that because pg_import_system_collations
isn't using a temporary staging table, but is just inserting directly
into pg_collation, there isn't any way for it to eliminate duplicates
unless it uses if_not_exists behavior all the time.  So there seem to
be two ways to proceed:

1. Drop pg_import_system_collations' if_not_exists argument and just
define it as adding any collations not already known in pg_collation.

2. Significantly rewrite it so that it de-dups the collation set by
hand before trying to insert into pg_collation.

#2 seems like a lot more work, but on the other hand, we might need
most of that logic anyway to get back deterministic alias handling.
However, since I cannot see any real-world use case at all for
if_not_exists = false, I figure we might as well do #1 and take
whatever simplification we can get that way.
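
(Just to make #2 concrete: the de-duplication itself would be easy enough,
something like the sketch below run over an array of names collected from
"locale -a" before any catalog insertions; the variable names are
illustrative.  The real work would be restructuring the function to
collect everything up front, plus re-creating the old alias-preference
logic.)

static int
cmp_locale_name(const void *a, const void *b)
{
	return strcmp(*(const char *const *) a, *(const char *const *) b);
}

	/* sort the collected names, then keep only the first of each run */
	qsort(names, nnames, sizeof(char *), cmp_locale_name);
	for (i = 0, j = 0; i < nnames; i++)
	{
		if (j == 0 || strcmp(names[i], names[j - 1]) != 0)
			names[j++] = names[i];
	}
	nnames = j;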

I'm willing to do the legwork on this, but before I start, does
anyone have any ideas or objections?

regards, tom lane




Re: [HACKERS] initdb initalization failure for collation "ja_JP"

2017-06-18 Thread Tom Lane
Marco Atzeri  writes:
> Building on Cygwin, latest 10 beta1 or head source,
> make check fails as:
> ...
> performing post-bootstrap initialization ... 2017-05-31 23:23:22.214 
> CEST [16860] FATAL:  collation "ja_JP" for encoding "EUC_JP" already exists

Hmph.  Could we see the results of "locale -a | grep ja_JP" ?

regards, tom lane




[HACKERS] initdb initalization failure for collation "ja_JP"

2017-06-17 Thread Marco Atzeri

Building on Cygwin, latest 10 beta1 or head source,
make check fails as:

-initdb.log -
The database cluster will be initialized with locales
  COLLATE:  en_US.UTF-8
  CTYPE:    en_US.UTF-8
  MESSAGES: C
  MONETARY: en_US.UTF-8
  NUMERIC:  en_US.UTF-8
  TIME: en_US.UTF-8
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory 
/cygdrive/e/cyg_pub/devel/postgresql/prova/postgresql-10.0-0.1.x86_64/build/src/test/regress/./tmp_check/data 
... ok

creating subdirectories ... ok
selecting default max_connections ... 30
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... sysv
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... 2017-05-31 23:23:22.214 
CEST [16860] FATAL:  collation "ja_JP" for encoding "EUC_JP" already exists

-

This does not happen on 9.6.x.
Any idea what to look for?

Regards
Marco

