Re: [HACKERS] Dealing with collation and strcoll/strxfrm/etc

2016-03-29 Thread Oleg Bartunov
On Mon, Mar 28, 2016 at 5:57 PM, Stephen Frost wrote:

> All,
>
> Changed the thread name (we're no longer talking about release
> notes...).
>
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > Oleg Bartunov writes:
> > > Should we start thinking about ICU ?
> >
> > Isn't it still true that ICU fails to meet our minimum requirements?
> > That would include (a) working with the full Unicode character range
> > (not only UTF16) and (b) working with non-Unicode encodings.  No doubt
> > we could deal with (b) by inserting a conversion, but that would take
> > a lot of shine off the performance numbers you mention.
> >
> > I'm also not exactly convinced by your implicit assumption that ICU is
> > bug-free.
>
> We have a wiki page about ICU.  I'm not sure that it's current, but if
> it isn't and people are interested then perhaps we should update it:
>
> https://wiki.postgresql.org/wiki/Todo:ICU
>
>
Good point, I forgot about this page.



> If we're going to talk about minimum requirements, I'd like to argue
> that we require whatever system we're using to have versioning (which
> glibc currently lacks, as I understand it...) to avoid the risk that
> indexes will become corrupt when whatever we're using for collation
> changes.  I'm pretty sure that's already bitten us on at least some
> RHEL6 -> RHEL7 migrations in some locales, even forgetting the issues
> with strcoll vs. strxfrm.
>

Agreed.


>
> Regarding key abbreviation and performance, if we are confident that
> strcoll and strxfrm are at least independently internally consistent
> then we could consider offering an option to choose between them.
> We'd need to identify what each index was built with to do so, however,
> as they would need to be rebuilt if the choice changes, at least
> until/unless they're made to reliably agree.  Even using only one or the
> other doesn't address the versioning problem though, which is a problem
> for all currently released versions of PG and is just going to continue
> to be an issue.
>

Ideally, we should benchmark all locales on all platforms for all kinds
of indexes. But that's a big project.


>
> Thanks!
>
> Stephen
>


Re: [HACKERS] Dealing with collation and strcoll/strxfrm/etc

2016-03-28 Thread Peter Geoghegan
On Mon, Mar 28, 2016 at 12:36 PM, Stephen Frost wrote:
> Having to figure out how each and every stdlib does versioning doesn't
> sound fun, I certainly agree with you there, but it hardly seems
> impossible.  What we need, even if we look to move to ICU, is a place to
> remember that version information and a way to do something when we
> discover that we're now using a different version.

I think that the versioning situation is all over the place. It isn't
in the C standard. And there are many different versions of many
different stdlibs to support. Most importantly, where support
nominally exists, a strong incentive to get it exactly right may not.
We've seen that already.

> I'm not quite sure what the best way to do that is, but I imagine it
> involves changes to existing catalogs or perhaps even a new one.  I
> don't have any particularly great ideas for existing releases (maybe
> stash information in the index somewhere when it's rebuilt and then
> check it and throw an ERROR if they don't match?)

I think we'd need to introduce an abstraction like a "collation
provider", of which ICU would theoretically be just one. The OS would
be a baked-in collation provider. Everything that works today would
continue to work. We'd then largely just be grandfathering out systems
that rely on OS locales across major version upgrades, since the vast
majority of users are happy with Unicode, and have no cultural or
technical reason to prefer the OS locales that I can think of.
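
A minimal sketch of what such a "collation provider" abstraction might
look like, written in Python for brevity rather than C. Every name here
(CollationProvider, OS_PROVIDER, CODEPOINT_PROVIDER, sort_with) is
hypothetical and nothing like it exists in Postgres today; an ICU
provider would presumably be registered the same way, but is left out
because it needs a non-stdlib binding.

```python
import locale
import platform
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CollationProvider:
    """Hypothetical provider record; none of these names exist in Postgres."""
    name: str
    version: Callable[[], str]          # identifies the collation rules in use
    compare: Callable[[str, str], int]  # three-way comparison, like strcoll()
    sort_key: Callable[[str], object]   # key that orders the same way as compare()


def _libc_version() -> str:
    # platform.libc_ver() reports the C library, not the locale data itself,
    # so this is only a rough stand-in for a real collation version.
    lib, ver = platform.libc_ver()
    return f"{lib}-{ver}" if lib else "unknown"


def _sign(n: int) -> int:
    return (n > 0) - (n < 0)


# The baked-in provider: whatever the OS locale does.
OS_PROVIDER = CollationProvider(
    name="libc",
    version=_libc_version,
    compare=lambda a, b: _sign(locale.strcoll(a, b)),
    sort_key=locale.strxfrm,
)

# Plain code point order, similar in spirit to the "C" collation.
CODEPOINT_PROVIDER = CollationProvider(
    name="codepoint",
    version=lambda: "1",
    compare=lambda a, b: (a > b) - (a < b),
    sort_key=lambda s: s,
)

PROVIDERS = {p.name: p for p in (OS_PROVIDER, CODEPOINT_PROVIDER)}


def sort_with(provider_name, values):
    """Sort values using the named provider's sort keys."""
    return sorted(values, key=PROVIDERS[provider_name].sort_key)
```

The point of the indirection is that callers only ever name a provider,
so existing OS-locale behavior keeps working unchanged while other
providers can be added alongside it.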

I am unconvinced by the idea that it especially matters that sort(1)
might not be in agreement with Postgres. Neither is any Java app, or
any .Net app, or the user's web browser in the case of Safari or
Google Chrome (maybe others). I want Postgres to be consistent with
Postgres, across different nodes on the network, in environments where
I may have little knowledge of the underlying OS. Think "sort pushdown
in postgres_fdw".

Users from certain East Asian user communities might prefer to stick
with regional encodings, perhaps due to specific concerns about the
Han Unification controversy. But I'm pretty sure that these users have
very low expectations about collations in Postgres today. I was
recently told that collating Japanese is starting to get a bit better,
due to various new initiatives, but that most experienced Japanese
Postgres DBAs tend to use the "C" collation.

I don't want to impose a Unicode monoculture on anyone. But I do think
there are clear benefits for the large majority of users that always
use Unicode. Nothing needs to break that works today to make this
happen. Abbreviated keys provide an immediate incentive for users to
adopt ICU; users that might otherwise be on the fence about it.

>> The question is only how we deal with this when it happens. One thing
>> that's attractive about ICU is that it makes this explicit, both for
>> the logical behavior of a collation, as well as the stability of
>> binary sort keys (Glibc's versioning seemingly just does the former).
>> So the equivalent of strxfrm() output has license to change for
>> technical reasons that are orthogonal to the practical concerns of
>> end-users about how text sorts in their locale. ICU is clear on what
>> it takes to make binary sort keys in indexes work. And various major
>> database systems rely on this being right.
>
> There seems to be some disagreement about whether ICU provides the
> information we'd need to make a decision or not.  It seems like it
> would, given its usage in other database systems, but if so, we need to
> very clearly understand exactly how it works and how we can depend on
> it.

It seems likely that it exposes the information required to make what
we need to do practical.

Certainly, adopting ICU is a big project that we should proceed
cautiously with, but there is a reason why every other major database
system uses either ICU, or a library based on UCA [1] that allows the
system to centrally control versioned collations (SQLite just makes
this optional).

I think that ICU *could* still tie us to the available collations on
an OS (those collations that are available with their ICU packages).
What I haven't figured out yet is if it's practical to install
versions that are available from some central location, like the CLDR
[2]. I don't think we'd want to have Postgres ship "supported
collations" in each major version, in roughly the style of the IANA
timezone stuff, but it's far too early to rule that out. It would have
upsides.

[1] https://en.wikipedia.org/wiki/Unicode_collation_algorithm
[2] http://cldr.unicode.org/
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Dealing with collation and strcoll/strxfrm/etc

2016-03-28 Thread Stephen Frost
* Peter Geoghegan (p...@heroku.com) wrote:
> On Mon, Mar 28, 2016 at 7:57 AM, Stephen Frost wrote:
> > If we're going to talk about minimum requirements, I'd like to argue
> > that we require whatever system we're using to have versioning (which
> > glibc currently lacks, as I understand it...) to avoid the risk that
> > indexes will become corrupt when whatever we're using for collation
> > changes.  I'm pretty sure that's already bitten us on at least some
> > RHEL6 -> RHEL7 migrations in some locales, even forgetting the issues
> > with strcoll vs. strxfrm.
> 
> I totally agree that anything we adopt should support versioning.
> Glibc does have a non-standard versioning scheme, but we don't use it.
> Other stdlibs may do versioning another way, or not at all. In a world
> in which ICU is the de facto standard for Postgres (i.e. the actual
> standard on all major platforms), we mostly just have one thing to
> target, which seems like something to aim for.

Having to figure out how each and every stdlib does versioning doesn't
sound fun, I certainly agree with you there, but it hardly seems
impossible.  What we need, even if we look to move to ICU, is a place to
remember that version information and a way to do something when we
discover that we're now using a different version.

I'm not quite sure what the best way to do that is, but I imagine it
involves changes to existing catalogs or perhaps even a new one.  I
don't have any particularly great ideas for existing releases (maybe
stash information in the index somewhere when it's rebuilt and then
check it and throw an ERROR if they don't match?)
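
A rough sketch of the "stash it, then check it" idea, in Python for
brevity. The catalog here is just a dict, and the function and exception
names are made up for illustration; where the information would really
live (an existing catalog, a new one, or the index itself) is exactly
the open question.

```python
class CollationVersionMismatch(Exception):
    """Raised when an index was built under a different collation version."""


def record_collation_version(catalog, index_name, provider, version):
    """Remember which provider and version an index was (re)built with."""
    catalog[index_name] = (provider, version)


def check_collation_version(catalog, index_name, provider, current_version):
    """Raise if the recorded version differs from the one now in use."""
    recorded = catalog.get(index_name)
    if recorded is None:
        return  # nothing recorded (e.g. a pre-existing index): cannot verify
    if recorded != (provider, current_version):
        raise CollationVersionMismatch(
            f"index {index_name!r} was built with {recorded[0]} {recorded[1]}, "
            f"but {provider} {current_version} is in use; REINDEX required"
        )
```

Recording at (re)build time and checking before first use means a
changed library version is reported as an error instead of silently
returning wrong results from an index ordered under the old rules.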

> The question is only how we deal with this when it happens. One thing
> that's attractive about ICU is that it makes this explicit, both for
> the logical behavior of a collation, as well as the stability of
> binary sort keys (Glibc's versioning seemingly just does the former).
> So the equivalent of strxfrm() output has license to change for
> technical reasons that are orthogonal to the practical concerns of
> end-users about how text sorts in their locale. ICU is clear on what
> it takes to make binary sort keys in indexes work. And various major
> database systems rely on this being right.

There seems to be some disagreement about whether ICU provides the
information we'd need to make a decision or not.  It seems like it
would, given its usage in other database systems, but if so, we need to
very clearly understand exactly how it works and how we can depend on
it.

> > Regarding key abbreviation and performance, if we are confident that
> > strcoll and strxfrm are at least independently internally consistent
> > then we could consider offering an option to choose between them.
> 
> I think they just need to match, per the standard. After all,
> abbreviation will sometimes require strcoll() tie-breakers.

Ok, I didn't see that in the man-pages.  If that's the case then it
seems like there isn't much hope of just using strxfrm().

Thanks!

Stephen




Re: [HACKERS] Dealing with collation and strcoll/strxfrm/etc

2016-03-28 Thread Peter Geoghegan
On Mon, Mar 28, 2016 at 7:57 AM, Stephen Frost wrote:
> If we're going to talk about minimum requirements, I'd like to argue
> that we require whatever system we're using to have versioning (which
> glibc currently lacks, as I understand it...) to avoid the risk that
> indexes will become corrupt when whatever we're using for collation
> changes.  I'm pretty sure that's already bitten us on at least some
> RHEL6 -> RHEL7 migrations in some locales, even forgetting the issues
> with strcoll vs. strxfrm.

I totally agree that anything we adopt should support versioning.
Glibc does have a non-standard versioning scheme, but we don't use it.
Other stdlibs may do versioning another way, or not at all. In a world
in which ICU is the de facto standard for Postgres (i.e. the actual
standard on all major platforms), we mostly just have one thing to
target, which seems like something to aim for.

Collations change from time to time, legitimately. Read from
"Collation order is not fixed", here:

http://unicode.org/reports/tr10/#Stability

The question is only how we deal with this when it happens. One thing
that's attractive about ICU is that it makes this explicit, both for
the logical behavior of a collation, as well as the stability of
binary sort keys (Glibc's versioning seemingly just does the former).
So the equivalent of strxfrm() output has license to change for
technical reasons that are orthogonal to the practical concerns of
end-users about how text sorts in their locale. ICU is clear on what
it takes to make binary sort keys in indexes work. And various major
database systems rely on this being right.

> Regarding key abbreviation and performance, if we are confident that
> strcoll and strxfrm are at least independently internally consistent
> then we could consider offering an option to choose between them.

I think they just need to match, per the standard. After all,
abbreviation will sometimes require strcoll() tie-breakers.
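
To make the tie-breaker point concrete, here is a simplified sketch in
Python (not the actual Postgres code). ABBREV_LEN and both function
names are assumptions for illustration. The abbreviated key is only a
prefix of the strxfrm() output, so equal prefixes say nothing about the
full strings and the comparison has to fall back to strcoll().

```python
import locale

# Assumed prefix length for illustration only.
ABBREV_LEN = 8


def abbreviated_key(s):
    """Return a fixed-length prefix of the strxfrm() sort key."""
    return locale.strxfrm(s)[:ABBREV_LEN]


def compare_abbreviated(a, b):
    """Three-way comparison: cheap prefix first, strcoll() on a tie."""
    ka, kb = abbreviated_key(a), abbreviated_key(b)
    if ka != kb:
        # The prefixes differ, so they already decide the order.
        return -1 if ka < kb else 1
    # Prefixes tie: only the full comparison is authoritative.
    r = locale.strcoll(a, b)
    return (r > 0) - (r < 0)
```

This only gives a consistent order if strxfrm() and strcoll() agree
with each other; if they disagree, the result depends on whether a
given pair happens to be resolved by the prefix or by the tie-breaker.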

Clearly it would be very naive to imagine that ICU is bug-free.
However, I surmise that there is a large difference in how ICU and
glibc think about things like strxfrm() or strcoll() stability and
consistency. Tom was able to demonstrate that strxfrm() and strcoll()
behaved inconsistently without too much effort, contrary to POSIX, and
in many common cases. I doubt that the Glibc maintainers are all that
concerned about it. Certainly, less concerned than they are about the
latest security bug. Whereas if this happened in ICU, it would be a
total failure of the project to fulfill its most basic goals. Our
disaster would also be a disaster for several other major database
systems. ICU carefully and explicitly considers multiple forms of
stability, "deterministic" sort ordering, etc. That *is* a big
difference, and it makes me optimistic that there'd be far fewer
problems.
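
For anyone who wants to look for this kind of inconsistency on their own
platform, a small sketch in Python follows; find_inconsistencies is a
hypothetical helper, and this is not the test Tom used. It relies on the
documented property of Python's locale module that
strxfrm(a) < strxfrm(b) should be equivalent to strcoll(a, b) < 0, and
reports every pair where the two disagree in the current LC_COLLATE
locale.

```python
import locale
from itertools import combinations


def _sign(n):
    return (n > 0) - (n < 0)


def find_inconsistencies(words):
    """Return the pairs on which strcoll() and strxfrm() disagree."""
    bad = []
    for a, b in combinations(words, 2):
        by_coll = _sign(locale.strcoll(a, b))
        ka, kb = locale.strxfrm(a), locale.strxfrm(b)
        by_xfrm = (ka > kb) - (ka < kb)
        if by_coll != by_xfrm:
            bad.append((a, b))
    return bad
```

Call locale.setlocale(locale.LC_COLLATE, ...) with the locale of
interest before running it. An empty result over a word list proves
nothing about other inputs; a non-empty one is a concrete
counterexample.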

I also think that ICU could be a reasonable basis for case-insensitive
collations, which would let us kill citext, a module that I consider
to be a total kludge. And, we might also be able to lock down WAL
compatibility, which would be generally useful.

-- 
Peter Geoghegan

