[GENERAL] how is text-equality handled in postgresql?

2014-01-15 Thread Gábor Farkas
hi,

when i create an unique-constraint on a varchar field, how exactly
does postgresql compare the texts? i'm asking because in UNICODE there
are a lot of complexities about this..

or in other words, when are two varchars equal in postgres? when their
bytes are? or some algorithm is applied?

thanks,
gabor


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] how is text-equality handled in postgresql?

2014-01-15 Thread Ivan Voras
On 15/01/2014 10:10, Gábor Farkas wrote:
 hi,
 
 when i create an unique-constraint on a varchar field, how exactly
 does postgresql compare the texts? i'm asking because in UNICODE there
 are a lot of complexities about this..
 
 or in other words, when are two varchars equal in postgres? when their
 bytes are? or some algorithm is applied?

By default, it is whatever the operating system thinks it's right.
PostgreSQL doesn't have its own collation code, it uses the OS's locale
support for this.

(which breaks on certain systems which don't have complete UTF-8 support
- I'm in favour of importing ICU at least as an optional dependancy,
similar to what the FreeBSD's patch does:
http://people.freebsd.org/~girgen/postgresql-icu/).



signature.asc
Description: OpenPGP digital signature


Re: [GENERAL] how is text-equality handled in postgresql?

2014-01-15 Thread Amit Langote
On Wed, Jan 15, 2014 at 7:39 PM, Ivan Voras ivo...@freebsd.org wrote:
 On 15/01/2014 10:10, Gábor Farkas wrote:
 hi,

 when i create an unique-constraint on a varchar field, how exactly
 does postgresql compare the texts? i'm asking because in UNICODE there
 are a lot of complexities about this..

 or in other words, when are two varchars equal in postgres? when their
 bytes are? or some algorithm is applied?

 By default, it is whatever the operating system thinks it's right.
 PostgreSQL doesn't have its own collation code, it uses the OS's locale
 support for this.


Just to add to this, whenever strcoll() (a locale aware comparator)
says two strings are equal, postgres re-compares them using strcmp().
See following code snippet off
src/backend/utils/adt/varlena.c:varstr_cmp() -

#ifdef HAVE_LOCALE_T
if (mylocale)
result = strcoll_l(a1p, a2p, mylocale);
else
#endif
result = strcoll(a1p, a2p);

/*
 * In some locales strcoll() can claim that
nonidentical strings are
 * equal.  Believing that would be bad news for a
number of reasons,
 * so we follow Perl's lead and sort equal strings
according to
 * strcmp().
 */
if (result == 0)
result = strcmp(a1p, a2p);

--
Amit Langote


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] how is text-equality handled in postgresql?

2014-01-15 Thread Ivan Voras
On 15/01/2014 12:36, Amit Langote wrote:
 On Wed, Jan 15, 2014 at 7:39 PM, Ivan Voras ivo...@freebsd.org wrote:
 On 15/01/2014 10:10, Gábor Farkas wrote:
 hi,

 when i create an unique-constraint on a varchar field, how exactly
 does postgresql compare the texts? i'm asking because in UNICODE there
 are a lot of complexities about this..

 or in other words, when are two varchars equal in postgres? when their
 bytes are? or some algorithm is applied?

 By default, it is whatever the operating system thinks it's right.
 PostgreSQL doesn't have its own collation code, it uses the OS's locale
 support for this.

 
 Just to add to this, whenever strcoll() (a locale aware comparator)
 says two strings are equal, postgres re-compares them using strcmp().
 See following code snippet off
 src/backend/utils/adt/varlena.c:varstr_cmp() -

 /*
  * In some locales strcoll() can claim that
 nonidentical strings are
  * equal.  Believing that would be bad news for a
 number of reasons,
  * so we follow Perl's lead and sort equal strings
 according to
  * strcmp().
  */
 if (result == 0)
 result = strcmp(a1p, a2p);

That seems odd and inefficient. Why would it be necessary? I would think
indexing (and other collation-sensitive operations) don't care what the
actual collation result is for arbitrary blobs of strings, as long as
they are stable?



signature.asc
Description: OpenPGP digital signature


Re: [GENERAL] how is text-equality handled in postgresql?

2014-01-15 Thread Amit Langote
On Wed, Jan 15, 2014 at 9:02 PM, Ivan Voras ivo...@freebsd.org wrote:
 On 15/01/2014 12:36, Amit Langote wrote:
 On Wed, Jan 15, 2014 at 7:39 PM, Ivan Voras ivo...@freebsd.org wrote:
 On 15/01/2014 10:10, Gábor Farkas wrote:
 hi,

 when i create an unique-constraint on a varchar field, how exactly
 does postgresql compare the texts? i'm asking because in UNICODE there
 are a lot of complexities about this..

 or in other words, when are two varchars equal in postgres? when their
 bytes are? or some algorithm is applied?

 By default, it is whatever the operating system thinks it's right.
 PostgreSQL doesn't have its own collation code, it uses the OS's locale
 support for this.


 Just to add to this, whenever strcoll() (a locale aware comparator)
 says two strings are equal, postgres re-compares them using strcmp().
 See following code snippet off
 src/backend/utils/adt/varlena.c:varstr_cmp() -

 /*
  * In some locales strcoll() can claim that
 nonidentical strings are
  * equal.  Believing that would be bad news for a
 number of reasons,
  * so we follow Perl's lead and sort equal strings
 according to
  * strcmp().
  */
 if (result == 0)
 result = strcmp(a1p, a2p);

 That seems odd and inefficient. Why would it be necessary? I would think
 indexing (and other collation-sensitive operations) don't care what the
 actual collation result is for arbitrary blobs of strings, as long as
 they are stable?


This is the behavior since quite some time introduced by this commit

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad

--
Amit Langote


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] how is text-equality handled in postgresql?

2014-01-15 Thread Ivan Voras
On 15/01/2014 13:29, Amit Langote wrote:
 On Wed, Jan 15, 2014 at 9:02 PM, Ivan Voras ivo...@freebsd.org wrote:
 On 15/01/2014 12:36, Amit Langote wrote:

  * In some locales strcoll() can claim that
 nonidentical strings are
  * equal.  Believing that would be bad news for a
 number of reasons,
  * so we follow Perl's lead and sort equal strings
 according to
  * strcmp().
  */
 if (result == 0)
 result = strcmp(a1p, a2p);

 That seems odd and inefficient. Why would it be necessary? I would think
 indexing (and other collation-sensitive operations) don't care what the
 actual collation result is for arbitrary blobs of strings, as long as
 they are stable?

 
 This is the behavior since quite some time introduced by this commit
 
 http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad

Ok, the commit comment is:

Adjust string comparison so that only bitwise-equal strings are considered
equal: if strcoll claims two strings are equal, check it with strcmp, and
sort according to strcmp if not identical.  This fixes inconsistent
behavior under glibc's hu_HU locale, and probably under some other locales
as well.  Also, take advantage of the now-well-defined behavior to speed up
texteq, textne, bpchareq, bpcharne: they may as well just do a bitwise
comparison and not bother with strcoll at all.

... so it's just another workaround for OS specific locale issues - to
me it looks like just another reason to use ICU.



signature.asc
Description: OpenPGP digital signature


Re: [GENERAL] how is text-equality handled in postgresql?

2014-01-15 Thread Tom Lane
Ivan Voras ivo...@freebsd.org writes:
 On 15/01/2014 12:36, Amit Langote wrote:
 Just to add to this, whenever strcoll() (a locale aware comparator)
 says two strings are equal, postgres re-compares them using strcmp().

 That seems odd and inefficient. Why would it be necessary? I would think
 indexing (and other collation-sensitive operations) don't care what the
 actual collation result is for arbitrary blobs of strings, as long as
 they are stable?

If we didn't do it like this, we could not use hashing techniques for
text --- at least not unless we could find a hash function guaranteed
to yield the same values for any two strings that strcoll() claims are
equal.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] how is text-equality handled in postgresql?

2014-01-15 Thread Vick Khera
On Wed, Jan 15, 2014 at 4:10 AM, Gábor Farkas gabor.far...@gmail.comwrote:

 or in other words, when are two varchars equal in postgres? when their
 bytes are? or some algorithm is applied?


On this topic, when I write my strings to the DB and search from the DB,
should I canonicalize them first as NKFC (or some other), or just let the
DB figure it out? In my specific case I use perl DBI with place holders to
submit my queries.