Re: Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-11 Thread Peter Geoghegan
On Thu, Aug 11, 2016 at 4:22 AM, Palle Girgensohn  wrote:
> But in your strxfrm code in PostgreSQL, the keys are cached, and represented
> as int64:s if I remember correctly, so perhaps there is still a benefit
> using the abbreviated keys? More testing is required, I guess...

ICU's ucol_nextSortKeyPart() interface is faster for this, I believe,
and works because you only need the first sizeof(Datum) bytes (4 or 8
bytes). So, you have the right idea about using it (at least for the
abbreviated key stuff), but the second last argument needs to be
sizeof(Datum).

The whole point of the abbreviated key optimization is that you can
avoid pointer chasing during each and every comparison. Your test here
is invalid because it doesn't involved the reuse of the keys. Often,
during a sort, each item has to be compared about 20 times.

I've experimented with ICU, and know it works well with this. You
really need to create a scenario with a real sort, and all the
conditions I describe.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-11 Thread Palle Girgensohn

> 11 aug. 2016 kl. 11:15 skrev Palle Girgensohn :
> 
>> 
>> 11 aug. 2016 kl. 03:05 skrev Peter Geoghegan :
>> 
>> On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn  
>> wrote:
>>> They've been used for the FreeBSD ports since 2005, and have served us 
>>> well. I have of course updated them regularly. In this latest version, I've 
>>> removed support for other encodings beside UTF-8, mostly since I don't know 
>>> how to test them, but also, I see little point in supporting anything else 
>>> using ICU.
>> 
>> Looks like you're not using the ICU equivalent of strxfrm(). While 9.5
>> is not the release that introduced its use, it did expand it
>> significantly. I think you need to fix this, even though it isn't
>> actually used to sort text at present, since presumably FreeBSD builds
>> of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could
>> reasonably trust the ICU equivalent of strxfrm(), so that's a missed
>> opportunity. (See the wiki page on the abbreviated keys issue [1] if
>> you don't know what I'm talking about.)
> 
> My plan was to get it working without TRUST_STRXFRM first, and then add that 
> functinality. I've made some preliminary tests using ICU:s ucol_getSortKey 
> but I will have to test it a bit more. For now, I just expect not to trust 
> strxfrm. It is the first iteration wrt strxfrm, the plan is to use that code 
> base.

Here are some preliminary results running 1 times comparing the same two 
strings in a tight loop.

  ucol_strcollUTF8: -1  0.002448
   strcoll: 1   0.060711
  ucol_strcollIter: -1  0.009221
 direct memcmp: 1   0.000457
 memcpy memcmp: 1   0.001706
memcpy strcoll: 1   0.068425
   nextSortKeyPart: -1  0.041011
ucnv_toUChars + getSortKey: -1  0.050379


correct answer is -1, but since we compare åasdf and äasdf with a Swedish 
locale, memcmp and strcoll fails of course, as espected. Direct memcmp is 5 
times faster than ucol_strcollUTF8 (used in my patch), but sadly the best 
implementation using sort keys with ICU, nextSortKeyPart, is way slower.



startTime = getRealTime();
for ( int i = 0; i < loop; i++) {
result = ucol_strcollUTF8(collator, arg1, len1, arg2, len2, 
);
}
endTime = getRealTime();
printf("%30s: %d\t%lf\n", "ucol_strcollUTF8", result, endTime - 
startTime);



vs


int sortkeysize=8;

startTime = getRealTime();
uint8_t key1[sortkeysize], key2[sortkeysize];
uint32_t sState[2], tState[2];
UCharIterator sIter, tIter;

for ( int i = 0; i < loop; i++) {
uiter_setUTF8(, arg1, len1);
uiter_setUTF8(, arg2, len2);
sState[0] = 0; sState[1] = 0;
tState[0] = 0; tState[1] = 0;
ucol_nextSortKeyPart(collator, , sState, key1, 
sortkeysize, );
ucol_nextSortKeyPart(collator, , tState, key2, 
sortkeysize, );
result = memcmp (key1, key2, sortkeysize);
}
endTime = getRealTime();
printf("%30s: %d\t%lf\n", "nextSortKeyPart", result, endTime - 
startTime);



But in your strxfrm code in PostgreSQL, the keys are cached, and represented as 
int64:s if I remember correctly, so perhaps there is still a benefit using the 
abbreviated keys? More testing is required, I guess...

Palle





signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-11 Thread Palle Girgensohn

> 11 aug. 2016 kl. 03:05 skrev Peter Geoghegan :
> 
> On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn  wrote:
>> They've been used for the FreeBSD ports since 2005, and have served us well. 
>> I have of course updated them regularly. In this latest version, I've 
>> removed support for other encodings beside UTF-8, mostly since I don't know 
>> how to test them, but also, I see little point in supporting anything else 
>> using ICU.
> 
> Looks like you're not using the ICU equivalent of strxfrm(). While 9.5
> is not the release that introduced its use, it did expand it
> significantly. I think you need to fix this, even though it isn't
> actually used to sort text at present, since presumably FreeBSD builds
> of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could
> reasonably trust the ICU equivalent of strxfrm(), so that's a missed
> opportunity. (See the wiki page on the abbreviated keys issue [1] if
> you don't know what I'm talking about.)

My plan was to get it working without TRUST_STRXFRM first, and then add that 
functinality. I've made some preliminary tests using ICU:s ucol_getSortKey but 
I will have to test it a bit more. For now, I just expect not to trust strxfrm. 
It is the first iteration wrt strxfrm, the plan is to use that code base.

> 
> Shouldn't you really have a strxfrm() wrapper, used across the board,
> including for callers outside of varlena.c? convert_string_datum() has
> been calling strxfrm() for many releases now. These calls are still
> used in FreeBSD builds, I would think, which seems like a bug that is
> not dodged by simply not defining TRUST_STRXFRM. Isn't its assumption
> that that matching the ordering used elsewhere not really hold on
> FreeBSD builds?

I was not aware of convert_string_datum, I will check that, thanks! Using a 
wrapper across the board seems like a good idea for refactoring.

> 
> [1] https://wiki.postgresql.org/wiki/Abbreviated_keys_glibc_issue
> --
> Peter Geoghegan



signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-10 Thread Peter Geoghegan
On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn  wrote:
> They've been used for the FreeBSD ports since 2005, and have served us well. 
> I have of course updated them regularly. In this latest version, I've removed 
> support for other encodings beside UTF-8, mostly since I don't know how to 
> test them, but also, I see little point in supporting anything else using ICU.

Looks like you're not using the ICU equivalent of strxfrm(). While 9.5
is not the release that introduced its use, it did expand it
significantly. I think you need to fix this, even though it isn't
actually used to sort text at present, since presumably FreeBSD builds
of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could
reasonably trust the ICU equivalent of strxfrm(), so that's a missed
opportunity. (See the wiki page on the abbreviated keys issue [1] if
you don't know what I'm talking about.)

Shouldn't you really have a strxfrm() wrapper, used across the board,
including for callers outside of varlena.c? convert_string_datum() has
been calling strxfrm() for many releases now. These calls are still
used in FreeBSD builds, I would think, which seems like a bug that is
not dodged by simply not defining TRUST_STRXFRM. Isn't its assumption
that that matching the ordering used elsewhere not really hold on
FreeBSD builds?

[1] https://wiki.postgresql.org/wiki/Abbreviated_keys_glibc_issue
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-10 Thread Palle Girgensohn

> 4 aug. 2016 kl. 02:40 skrev Bruce Momjian :
> 
> On Thu, Aug  4, 2016 at 08:22:25AM +0800, Craig Ringer wrote:
>> Yep, it does. But we've made little to no progress on integration of ICU
>> support and AFAIK nobody's working on it right now.
> 
> Uh, this email from July says Peter Eisentraut will submit it in
> September  :-)
> 
>   
> https://www.postgresql.org/message-id/2b833706-1133-1e11-39d9-4fa228892...@2ndquadrant.com

Cool.

I have brushed up my decade+ old patches [1] for ICU, so they now have support 
for COLLATE on columns.


https://github.com/girgen/postgres/


in branches icu/XXX where XXX is master or REL9_X_STABLE.

They've been used for the FreeBSD ports since 2005, and have served us well. I 
have of course updated them regularly. In this latest version, I've removed 
support for other encodings beside UTF-8, mostly since I don't know how to test 
them, but also, I see little point in supporting anything else using ICU.



I have one question for someone with knowledge about Turkish (Devrim?). This is 
the diff from regression tests, when running

$ gmake check EXTRA_TESTS=collate.linux.utf8 LANG=sv_SE.UTF-8

$ cat "/Users/girgen/postgresql/obj/src/test/regress/regression.diffs"
*** 
/Users/girgen/postgresql/postgres/src/test/regress/expected/collate.linux.utf8.out
  2016-08-10 21:09:03.0 +0200
--- 
/Users/girgen/postgresql/obj/src/test/regress/results/collate.linux.utf8.out
2016-08-10 21:12:53.0 +0200
***
*** 373,379 
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  ---
!  f
  (1 row)

  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
--- 373,379 
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  ---
!  t
  (1 row)

  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
***
*** 385,391 
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  --
!  t
  (1 row)

  -- The following actually exercises the selectivity estimation for ~*.
--- 385,391 
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  --
!  f
  (1 row)

  -- The following actually exercises the selectivity estimation for ~*.

==

The Linux locale behaves differently from ICU for the above (corner ?) cases. 
Any ideas if one is more correct than the other? I seems unclear to me. Perhaps 
it depends on whether the case-insensitive match is done using lower(both) or 
upper(both)? I haven't investigated this yet. @Devrim, is one more correct than 
the other?


As Thomas points out, using ucoll_strcoll it is quick, since no copying is 
needed. I will get some benchmarks soon.

Palle



[1] https://people.freebsd.org/~girgen/postgresql-icu/README.html



signature.asc
Description: Message signed with OpenPGP using GPGMail