Re: Inconsistent results with libc sorting on Windows
On Fri, Aug 11, 2023 at 11:48:18AM +0200, Juan José Santamaría Flecha wrote: > On Fri, Aug 11, 2023 at 7:29 AM Noah Misch wrote: > > > > > The LCMapStringEx() solution is elegant. I do see > > > > https://learn.microsoft.com/en-us/windows/win32/intl/handling-sorting-in-your-applications > > says, "If an application calls the function to create a sort key for a > > string > > containing an Arabic kashida, the function creates no sort key value." > > That's > > aggravating. > > > > I think the problem there is that it is just poorly explained. > Take for example the attached program that compares "normal" and "kashida" > 'Raħīm' taken from [1], you'll get: > > c1 = c2 > c2 = c1 > > meaning that "normal" and "kashida" are the same string. So, probably that > phrase should read something like: "If an application calls the function to > create a sort key for a string containing an Arabic kashida character, the > function will ignore that character and no sort key value will be generated > for it." Good. That sounds fine. Thanks for clarifying.
Re: Inconsistent results with libc sorting on Windows
On Fri, Aug 11, 2023 at 7:29 AM Noah Misch wrote: > > The LCMapStringEx() solution is elegant. I do see > > https://learn.microsoft.com/en-us/windows/win32/intl/handling-sorting-in-your-applications > says, "If an application calls the function to create a sort key for a > string > containing an Arabic kashida, the function creates no sort key value." > That's > aggravating. > I think the problem there is that it is just poorly explained. Take for example the attached program that compares "normal" and "kashida" 'Raħīm' taken from [1], you'll get: c1 = c2 c2 = c1 meaning that "normal" and "kashida" are the same string. So, probably that phrase should read something like: "If an application calls the function to create a sort key for a string containing an Arabic kashida character, the function will ignore that character and no sort key value will be generated for it." [1] https://en.wikipedia.org/wiki/Kashida Regards, Juan José Santamaría Flecha #include "windows.h" #include "stdio.h" static char comparison(int r) { switch (r) { case 1: return '>'; case -1: return '<'; case 0: return '='; default: return '!'; } } int wmain(int argc, wchar_t* argv[]) { const wchar_t *s1 = L"\u0631\u062d\u064a\u0645"; const wchar_t *s2 = L"\u0631\u062d\u0640\u0640\u0640\u0640\u0640\u0640\u064a\u0645"; int len; wchar_t c1[13]; wchar_t c2[13]; int result; len = LCMapStringEx(L"es-US", LCMAP_SORTKEY, s1, -1, c1, 13, NULL, NULL, 0); len = LCMapStringEx(L"es-US", LCMAP_SORTKEY, s2, -1, c2, 13, NULL, NULL, 0); result = memcmp(c1, c2, 13); printf("c1 %c c2\n", comparison(result)); result = memcmp(c2, c1, 13); printf("c2 %c c1\n", comparison(result)); return 0; }
Re: Inconsistent results with libc sorting on Windows
On Wed, Jun 14, 2023 at 12:50:28PM +0200, Juan José Santamaría Flecha wrote: > On Wed, Jun 14, 2023 at 4:13 AM Peter Geoghegan wrote: > > > On Tue, Jun 13, 2023 at 5:59 PM Thomas Munro > > wrote: > > > Trying to follow along here... you're doing the moral equivalent of > > > strxfrm(), so sort keys have the transitive property but direct string > > > comparisons don't? Or is this because LCIDs reach a different > > > algorithm somehow (or otherwise why do you need to use LCIDs for this, > > > when there is a non-LCID version of that function, with a warning not > > > to use the older LCID version[1]?) > > > > I'm reminded of the fact that the abbreviated keys strxfrm() debacle > > (back when 9.5 was released) was caused by a bug in strcoll() -- not a > > bug in strxfrm() itself. From our point of view the problem was that > > strxfrm() failed to be bug compatible with strcoll() due to a buggy > > strcoll() optimization. > > > > I believe that strxfrm() is generally less likely to have bugs than > > strcoll(). There are far fewer opportunities to dodge unnecessary work > > in the case of strxfrm()-like algorithms (offering something like > > ICU's pg_strnxfrm_prefix_icu() prefix optimization is the only one). > > On the other hand, collation library implementers are likely to > > heavily optimize strcoll() for typical use-cases such as sorting and > > binary search. Using strxfrm() for everything is discouraged [1]. > > > > Yes, I think the situation is quite similar to what you describe, with its > WIN32 peculiarities. Take for example the attached program, it'll output: > > s1 = s2 > s2 = s3 > s1 > s3 > c1 > c2 > c2 > c3 > c1 > c3 > > As you can see the test for CompareStringEx() is broken, but we get a sane > answer with LCMapStringEx(). The LCMapStringEx() solution is elegant. I do see https://learn.microsoft.com/en-us/windows/win32/intl/handling-sorting-in-your-applications says, "If an application calls the function to create a sort key for a string containing an Arabic kashida, the function creates no sort key value." That's aggravating.
Re: Inconsistent results with libc sorting on Windows
El lun, 3 jul 2023, 17:42, Alvaro Herrera escribió: > On 2023-Jun-19, Juan José Santamaría Flecha wrote: > > > Ok, let's see where the report goes: > > > > > https://developercommunity.visualstudio.com/t/CompareStringEx-non-transitive/10393003?q=comparestringex > > Hm, so this appears to have been marked as solved by Microsoft. Can you > recheck? Also, what does the resolution mean for Postgres, in practical > terms? > It's not really solved, they have just pushed the issue to another team that's only reachable through the Windows feedback hub. I've already provided the feedback, but it's only visible through the proprietary app. I would say there haven't been any real progress on that front so far. Regards, Juan José Santamaría Flecha >
Re: Inconsistent results with libc sorting on Windows
On 2023-Jun-19, Juan José Santamaría Flecha wrote: > Ok, let's see where the report goes: > > https://developercommunity.visualstudio.com/t/CompareStringEx-non-transitive/10393003?q=comparestringex Hm, so this appears to have been marked as solved by Microsoft. Can you recheck? Also, what does the resolution mean for Postgres, in practical terms? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "If you have nothing to say, maybe you need just the right tool to help you not say it." (New York Times, about Microsoft PowerPoint)
Re: Inconsistent results with libc sorting on Windows
On Thu, Jun 15, 2023 at 1:57 AM Thomas Munro wrote: > > Given that the documented behaviour is that ".. the sort key produces > the same order as when the source string is used in CompareString or > CompareStringEx"[1], this seems like a reportable bug, unless perhaps > your test program is hiding an error with that default case you have. > > [1] > https://learn.microsoft.com/en-us/windows/win32/intl/handling-sorting-in-your-applications Ok, let's see where the report goes: https://developercommunity.visualstudio.com/t/CompareStringEx-non-transitive/10393003?q=comparestringex Regards, Juan José Santamaría Flecha
Re: Inconsistent results with libc sorting on Windows
On Wed, Jun 14, 2023 at 10:50 PM Juan José Santamaría Flecha wrote: > Yes, I think the situation is quite similar to what you describe, with its > WIN32 peculiarities. Take for example the attached program, it'll output: > > s1 = s2 > s2 = s3 > s1 > s3 > c1 > c2 > c2 > c3 > c1 > c3 > > As you can see the test for CompareStringEx() is broken, but we get a sane > answer with LCMapStringEx(). Given that the documented behaviour is that ".. the sort key produces the same order as when the source string is used in CompareString or CompareStringEx"[1], this seems like a reportable bug, unless perhaps your test program is hiding an error with that default case you have. [1] https://learn.microsoft.com/en-us/windows/win32/intl/handling-sorting-in-your-applications
Re: Inconsistent results with libc sorting on Windows
On Wed, Jun 14, 2023 at 4:13 AM Peter Geoghegan wrote: > On Tue, Jun 13, 2023 at 5:59 PM Thomas Munro > wrote: > > Trying to follow along here... you're doing the moral equivalent of > > strxfrm(), so sort keys have the transitive property but direct string > > comparisons don't? Or is this because LCIDs reach a different > > algorithm somehow (or otherwise why do you need to use LCIDs for this, > > when there is a non-LCID version of that function, with a warning not > > to use the older LCID version[1]?) > > I'm reminded of the fact that the abbreviated keys strxfrm() debacle > (back when 9.5 was released) was caused by a bug in strcoll() -- not a > bug in strxfrm() itself. From our point of view the problem was that > strxfrm() failed to be bug compatible with strcoll() due to a buggy > strcoll() optimization. > > I believe that strxfrm() is generally less likely to have bugs than > strcoll(). There are far fewer opportunities to dodge unnecessary work > in the case of strxfrm()-like algorithms (offering something like > ICU's pg_strnxfrm_prefix_icu() prefix optimization is the only one). > On the other hand, collation library implementers are likely to > heavily optimize strcoll() for typical use-cases such as sorting and > binary search. Using strxfrm() for everything is discouraged [1]. > Yes, I think the situation is quite similar to what you describe, with its WIN32 peculiarities. Take for example the attached program, it'll output: s1 = s2 s2 = s3 s1 > s3 c1 > c2 c2 > c3 c1 > c3 As you can see the test for CompareStringEx() is broken, but we get a sane answer with LCMapStringEx(). Regards, Juan José Santamaría Flecha compare_transitivity.c Description: Binary data
Re: Inconsistent results with libc sorting on Windows
Trying to follow along here... you're doing the moral equivalent of strxfrm(), so sort keys have the transitive property but direct string comparisons don't? Or is this because LCIDs reach a different algorithm somehow (or otherwise why do you need to use LCIDs for this, when there is a non-LCID version of that function, with a warning not to use the older LCID version[1]?) [1] https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-lcmapstringw
Re: Inconsistent results with libc sorting on Windows
On Fri, Jun 9, 2023 at 11:18 AM Daniel Verite wrote: > Juan José Santamaría Flecha wrote: > > > Just to make sure we are all seeing the same problem, does the attached > > patch fix your test? > > The problem of the random changes in sorting disappears for all libc > locales in pg_collation, so this is very promising. > Great to hear, I'll try to push the patch to something reviewable as per attached. However it persists for the default collation. > We can apply a similar approach there. Regards, Juan José Santamaría Flecha 0001-WIN32-Inconsistent-results-with-libc-utf8-sorting.patch Description: Binary data
Re: Inconsistent results with libc sorting on Windows
Juan José Santamaría Flecha wrote: > Just to make sure we are all seeing the same problem, does the attached > patch fix your test? The problem of the random changes in sorting disappears for all libc locales in pg_collation, so this is very promising. However it persists for the default collation. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Inconsistent results with libc sorting on Windows
> On 6/7/23 07:58, Daniel Verite wrote: > > Thomas Munro wrote: > > > >> > > Also, it does not occur at all if parallel scan is disabled. > >> > > >> > Could this be a clue that it is failing to be transitive? > >> > >> That vaguely rang a bell for me... and then I remembered this thread: > >> > >> > https://www.postgresql.org/message-id/flat/20191206063401.GB1629883%40rfd.leadboat.com > > > > Thanks for the pointer, non-transitive comparisons seem a likely cause > > indeed. > Just to make sure we are all seeing the same problem, does the attached patch fix your test? Regards, Juan José Santamaría Flecha 0001-POC-Inconsistent-results-with-libc-sorting-on-Window.patch Description: Binary data
Re: Inconsistent results with libc sorting on Windows
On 6/7/23 07:58, Daniel Verite wrote: Thomas Munro wrote: > > Also, it does not occur at all if parallel scan is disabled. > > Could this be a clue that it is failing to be transitive? That vaguely rang a bell for me... and then I remembered this thread: https://www.postgresql.org/message-id/flat/20191206063401.GB1629883%40rfd.leadboat.com Thanks for the pointer, non-transitive comparisons seem a likely cause indeed. The parallel scan appears to imply some randomness in the sequence of comparisons, which makes the problem more visible. After changing the test to shuffle the rows before each sort, non-parallel scans also produce outputs that differ, proving that parallelism is not a root cause. Running the test with all the libc collations with collencoding in (-1,6) shows that the only ones not affected are C/POSIX/ucs_basic. Otherwise the 569 other pre-created libc collations that can be used with UTF-8 are affected, plus the default collation (French_France.1252 in my case). Wow, that sounds pretty horrid -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Inconsistent results with libc sorting on Windows
Thomas Munro wrote: > > > Also, it does not occur at all if parallel scan is disabled. > > > > Could this be a clue that it is failing to be transitive? > > That vaguely rang a bell for me... and then I remembered this thread: > > https://www.postgresql.org/message-id/flat/20191206063401.GB1629883%40rfd.leadboat.com Thanks for the pointer, non-transitive comparisons seem a likely cause indeed. The parallel scan appears to imply some randomness in the sequence of comparisons, which makes the problem more visible. After changing the test to shuffle the rows before each sort, non-parallel scans also produce outputs that differ, proving that parallelism is not a root cause. Running the test with all the libc collations with collencoding in (-1,6) shows that the only ones not affected are C/POSIX/ucs_basic. Otherwise the 569 other pre-created libc collations that can be used with UTF-8 are affected, plus the default collation (French_France.1252 in my case). Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Inconsistent results with libc sorting on Windows
On Tue, Jun 6, 2023 at 1:30 PM Thomas Munro wrote: > On Tue, Jun 6, 2023 at 10:08 AM Daniel Verite wrote: > > Also, it does not occur at all if parallel scan is disabled. > > Could this be a clue that it is failing to be transitive? That vaguely rang a bell for me... and then I remembered this thread: https://www.postgresql.org/message-id/flat/20191206063401.GB1629883%40rfd.leadboat.com
Re: Inconsistent results with libc sorting on Windows
On Tue, Jun 6, 2023 at 10:08 AM Daniel Verite wrote: > Also, it does not occur at all if parallel scan is disabled. Could this be a clue that it is failing to be transitive?
Re: Inconsistent results with libc sorting on Windows
On 6/5/23 18:07, Daniel Verite wrote: While trying pg16beta1 libc collations on Windows, I noticed that UTF-8 text sorts sometimes differently across invocations with the same locales, which is wrong since these collations are deterministic. Also, it does not occur at all if parallel scan is disabled. This is a wild shot in the dark, but I wonder if somehow the locale is being initialized (i.e. setlocale) differently in the parallel workers than the backend due to some Windows specific behavior differences? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Inconsistent results with libc sorting on Windows
Hi, While trying pg16beta1 libc collations on Windows, I noticed that UTF-8 text sorts sometimes differently across invocations with the same locales, which is wrong since these collations are deterministic. The OS is Windows 10 Home, version 10.0.19045 Build 19045, self-built 16beta1 with VS Community 2022, without ICU, default configuration in postgresql.conf. It seems to occur more or less randomly with all libc locales except C/POSIX, with the probability of getting differences being seemingly much higher when the data gets larger in number of rows and uses higher codepoints (like if all character are in [U+0001,U+0400] the sorts never differ with 40k rows, but they do if there are much more rows or if the range is [U+0001,U+2000]). Also, it does not occur at all if parallel scan is disabled. I've come up with a self-contained script that generates random words and repeatedly sorts and feed them to md5sum. It takes the number of rows and the highest Unicode codepoint as arguments, and shows when the checksums differ across consecutive invocations. Here's a typical run showing how it goes wrong after the 14th sort: $ bash repro-coll-windows.sh 4 16383 NOTICE: relation "random_words" already exists, skipping CREATE TABLE TRUNCATE TABLE CREATE FUNCTION DROP COLLATION CREATE COLLATION INSERT 0 4 ANALYZE 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 35050d858f4c590788132627e74f62c8 -> e746b626fcc848cbbc67570a7dde03bb (iter=15) 16 e746b626fcc848cbbc67570a7dde03bb -> 35050d858f4c590788132627e74f62c8 (iter=16) 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 35050d858f4c590788132627e74f62c8 -> 6bf38563d1267339122154bd7d4fbfce (iter=38) 39 6bf38563d1267339122154bd7d4fbfce -> 35050d858f4c590788132627e74f62c8 (iter=39) 40 41 42 43 44 45 46 47 48 49 50 51 35050d858f4c590788132627e74f62c8 -> 3d2072698054d0bd57beefea0248b7e6 (iter=51) 52 3d2072698054d0bd57beefea0248b7e6 -> 35050d858f4c590788132627e74f62c8 (iter=52) 53 54 55 56 57 58 59 ^C Would anyone be able to reproduce this? That might be a local problem although there's nothing special installed AFAICS. Initially I saw this with a larger dataset that I can't share, and the diffs between outputs showed that only a few lines out of 2 million lines were getting displaced across sorts. It also happens on the same OS with Pg15.3 (EDB build) and the default libc collation, so I would not immediately suspect new code in Pg16. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite repro-coll-windows.sh Description: Binary data