Re: Reorganize collation lookup time and place
On Fri, Feb 1, 2019 at 7:27 AM Peter Eisentraut wrote: > This patch will probably make a reappearance when we consider making ICU > available as the database-wide default collation. Somebody recently reported that an ICU locale was over 2000x faster than the equivalent glibc locale with their query due to supporting the abbreviated keys optimization: https://postgr.es/m/CACd=f9cO5OcmAeBK1M5a3+t8BO7E91Ki0WaLAxFm=fbbon_...@mail.gmail.com We really should get around to supporting ICU for database-wide collations sooner rather than later. I knew that there were cases where glibc performs abysmally, but I didn't think it could ever be a difference of more than two orders of magnitude until recently. -- Peter Geoghegan
Re: Reorganize collation lookup time and place
On 31/01/2019 13:45, Andres Freund wrote: > So, what's the plan here? Should the CF entry be closed? Set to RwF. This patch will probably make a reappearance when we consider making ICU available as the database-wide default collation. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Reorganize collation lookup time and place
Hi, On 2018-12-13 14:50:53 -0500, Tom Lane wrote: > Andres Freund writes: > > On 2018-12-13 15:05:40 +0100, Peter Eisentraut wrote: > >> On 12/12/2018 18:56, Andres Freund wrote: > >>> Why isn't this integrated into fmgr_info()? > > >> fmgr_info() looks up stuff in pg_proc. pg_newlocale_...() looks up > >> stuff in pg_collation. There is no overlap between them. > > > It looks up stuff necessary for calling a function, that doesn't fit > > looking up the collation necessary to do so too badly. A lot of the the > > changes you made are rote changes to each caller, taking the collation > > oid and expanding it with pg_newlocale_from_collation(). > > I'm not very thrilled with the idea of changing every single caller of > InitFunctionCallInfoData and related functions, especially when exactly > zero functional change ensues. We should work harder on avoiding that > code churn; extension developers will thank us. > > >> There is also not necessarily a one-to-one correspondence between > >> function and collation lookup calls. For example, in some cases you > >> need to look up a sorting and a hashing functions, but only one > >> collation for both. > > > That seems rare enough not to matter, performancewise. > > I think it potentially does matter, but I also agree that it's not > the common case. Could we perhaps keep the API for the existing > functions the same, and introduce new functions alongside them > to be used by the small number of places where it matters? > > (I've not looked at Peter's patch yet, so maybe I'm talking through > my hat. But we should set a pretty high value on avoiding code > churn in stuff as widely used as the fmgr interfaces.) So, what's the plan here? Should the CF entry be closed? Greetings, Andres Freund
Re: Reorganize collation lookup time and place
On 2018-12-13 14:50:53 -0500, Tom Lane wrote: > Andres Freund writes: > > On 2018-12-13 15:05:40 +0100, Peter Eisentraut wrote: > >> On 12/12/2018 18:56, Andres Freund wrote: > >>> Why isn't this integrated into fmgr_info()? > > >> fmgr_info() looks up stuff in pg_proc. pg_newlocale_...() looks up > >> stuff in pg_collation. There is no overlap between them. > > > It looks up stuff necessary for calling a function, that doesn't fit > > looking up the collation necessary to do so too badly. A lot of the the > > changes you made are rote changes to each caller, taking the collation > > oid and expanding it with pg_newlocale_from_collation(). > > I'm not very thrilled with the idea of changing every single caller of > InitFunctionCallInfoData and related functions, especially when exactly > zero functional change ensues. We should work harder on avoiding that > code churn; extension developers will thank us. I am thinking of moving to commit 0001 (but not 0002) of https://www.postgresql.org/message-id/20181009191802.ppt6lqcvkpjvkm76%40alap3.anarazel.de which'd break some (but fewer) of the same place that'd be affected by this. The amount of memory saved for plpgsql and other places with lots of expressoins is quite noticable, and it's a prerequisite for better code generation and caching for JIT. As that patch doesn't affect DirectFunctionCall* etc, I'm not sure it's a sufficient argument to go full breakage here, however. Independent of that it still seems like putting stuff onto every function caller that they don't need to deal with. Greetings, Andres Freund
Re: Reorganize collation lookup time and place
Andres Freund writes: > On 2018-12-13 15:05:40 +0100, Peter Eisentraut wrote: >> On 12/12/2018 18:56, Andres Freund wrote: >>> Why isn't this integrated into fmgr_info()? >> fmgr_info() looks up stuff in pg_proc. pg_newlocale_...() looks up >> stuff in pg_collation. There is no overlap between them. > It looks up stuff necessary for calling a function, that doesn't fit > looking up the collation necessary to do so too badly. A lot of the the > changes you made are rote changes to each caller, taking the collation > oid and expanding it with pg_newlocale_from_collation(). I'm not very thrilled with the idea of changing every single caller of InitFunctionCallInfoData and related functions, especially when exactly zero functional change ensues. We should work harder on avoiding that code churn; extension developers will thank us. >> There is also not necessarily a one-to-one correspondence between >> function and collation lookup calls. For example, in some cases you >> need to look up a sorting and a hashing functions, but only one >> collation for both. > That seems rare enough not to matter, performancewise. I think it potentially does matter, but I also agree that it's not the common case. Could we perhaps keep the API for the existing functions the same, and introduce new functions alongside them to be used by the small number of places where it matters? (I've not looked at Peter's patch yet, so maybe I'm talking through my hat. But we should set a pretty high value on avoiding code churn in stuff as widely used as the fmgr interfaces.) regards, tom lane
Re: Reorganize collation lookup time and place
Hi, On 2018-12-13 15:05:40 +0100, Peter Eisentraut wrote: > On 12/12/2018 18:56, Andres Freund wrote: > >> This makes the collation lookup more similar to the function lookup. In > >> many cases they now happen right next to each other. > >> pg_newlocale_from_collation() becomes analogous to fmgr_info() and > >> pg_locale_t to FmgrInfo *. > > Why isn't this integrated into fmgr_info()? > > fmgr_info() looks up stuff in pg_proc. pg_newlocale_...() looks up > stuff in pg_collation. There is no overlap between them. It looks up stuff necessary for calling a function, that doesn't fit looking up the collation necessary to do so too badly. A lot of the the changes you made are rote changes to each caller, taking the collation oid and expanding it with pg_newlocale_from_collation(). It seems not necessary to force every external user to do rote changes like: fmgr_info(opexpr->opfuncid, finfo); fmgr_info_set_expr((Node *) node, finfo); InitFunctionCallInfoData(*fcinfo, finfo, 2, - opexpr->inputcollid, NULL, NULL); + pg_newlocale_from_collation(opexpr->inputcollid), NULL, NULL); ... - h = FunctionCall1Coll(array_extra_data->hash, DEFAULT_COLLATION_OID, d); + h = FunctionCall1Coll(array_extra_data->hash, pg_newlocale_from_collation(DEFAULT_COLLATION_OID), d); A lot of the new pg_newlocale_from_collation() calls go a fair bit over the customary line length limit. As it stands I have another patch that wants to change the function call interface, including for extensions: https://www.postgresql.org/message-id/20180605172952.x34m5uz6ju6enaem%40alap3.anarazel.de so perhaps we can just bite the bullet and do both. But I'm somewhat doubtful it's useful to expose having to lookup pg_newlocale_from_collation() at the callsites to every single caller of functions in the codebase. You say that's a separate conern of initializing function calls, for me it's unnecessarily exposing details that seem likely to change. > There is also not necessarily a one-to-one correspondence between > function and collation lookup calls. For example, in some cases you > need to look up a sorting and a hashing functions, but only one > collation for both. That seems rare enough not to matter, performancewise. Greetings, Andres Freund
Re: Reorganize collation lookup time and place
On 12/12/2018 18:56, Andres Freund wrote: >> This makes the collation lookup more similar to the function lookup. In >> many cases they now happen right next to each other. >> pg_newlocale_from_collation() becomes analogous to fmgr_info() and >> pg_locale_t to FmgrInfo *. > Why isn't this integrated into fmgr_info()? fmgr_info() looks up stuff in pg_proc. pg_newlocale_...() looks up stuff in pg_collation. There is no overlap between them. There is also not necessarily a one-to-one correspondence between function and collation lookup calls. For example, in some cases you need to look up a sorting and a hashing functions, but only one collation for both. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Reorganize collation lookup time and place
Hi, On 2018-12-10 16:12:54 +0100, Peter Eisentraut wrote: > Attached is a patch that reorganizes how and where collations are looked up. > > Until now, a fmgr C function that makes use of collation information > (for example varstr_cmp(), str_toupper()) gets passed the collation OID, > looks up the collation with pg_newlocale_from_collation(), and then does > something with it. > > With this change, the lookup is moved earlier, typically during executor > initialization. The fmgr functions receive the locale pointer that is > the result of that lookup. > > This makes the collation lookup more similar to the function lookup. In > many cases they now happen right next to each other. > pg_newlocale_from_collation() becomes analogous to fmgr_info() and > pg_locale_t to FmgrInfo *. Why isn't this integrated into fmgr_info()? Greetings, Andres Freund
Re: Reorganize collation lookup time and place
On 12/10/18 4:12 PM, Peter Eisentraut wrote: Attached is a patch that reorganizes how and where collations are looked up. Until now, a fmgr C function that makes use of collation information (for example varstr_cmp(), str_toupper()) gets passed the collation OID, looks up the collation with pg_newlocale_from_collation(), and then does something with it. With this change, the lookup is moved earlier, typically during executor initialization. The fmgr functions receive the locale pointer that is the result of that lookup. Sounds like a great plan. I too feel that having the look ups there makes more logical sense. But when taking a look at your patch I got a segfault when running initdb. See the stack trace below. My compiler is "gcc (Debian 8.2.0-9) 8.2.0" and the locale when running initdb is glibc's "en_US.utf8". #0 __GI___strcoll_l (s1=0x55d64555d998 "({CONST :consttype 1009 :consttypmod -1 :constcollid 100 :constlen -1 :constbyval false :constisnull false :location 160 :constvalue 16 [ 64 0 0 0 0 0 0 0 0 0 0 0 25 0 0 0 ]})", s2=0x55d64555ddb0 "({CONST :consttype 1009 :consttypmod -1 :constcollid 100 :constlen -1 :constbyval false :constisnull false :location 161 :constvalue 16 [ 64 0 0 0 0 0 0 0 0 0 0 0 25 0 0 0 ]})", l=0x0) at strcoll_l.c:259 #1 0x55d644c4d607 in varstrfastcmp_locale (x=94378777346072, y=94378777347120, ssup=0x7ffea6a64990) at varlena.c:2185 #2 0x55d644958dad in ApplySortComparator (ssup=0x7ffea6a64990, isNull2=false, datum2=, isNull1=false, datum1=out>) at ../../../src/include/utils/sortsupport.h:224 #3 compare_scalars (a=, b=, arg=0x7ffea6a64980) at analyze.c:2784 #4 0x55d644cba494 in qsort_arg (a=a@entry=0x55d6456ee5f0, n=n@entry=14, es=es@entry=16, cmp=cmp@entry=0x55d644958d82 , arg=arg@entry=0x7ffea6a64980) at qsort_arg.c:140 #5 0x55d64495b833 in compute_scalar_stats (stats=0x55d6456c6c88, fetchfunc=0x55d6449597f1 , samplerows=2904, totalrows=2904) at analyze.c:2360 #6 0x55d64495cca3 in do_analyze_rel (onerel=onerel@entry=0x7f3424bdbd50, options=options@entry=2, params=params@entry=0x7ffea6a64d90, va_cols=va_cols@entry=0x0, acquirefunc=0x55d64495901b , relpages=77, inh=false, in_outer_xact=false, elevel=13) at analyze.c:527 #7 0x55d64495d2ec in analyze_rel (relid=, relation=0x0, options=options@entry=2, params=params@entry=0x7ffea6a64d90, va_cols=0x0, in_outer_xact=, bstrategy=0x55d6455413a8) at analyze.c:258 #8 0x55d6449d61b1 in vacuum (options=2, relations=0x55d645541520, params=params@entry=0x7ffea6a64d90, bstrategy=, bstrategy@entry=0x0, isTopLevel=) at vacuum.c:357 #9 0x55d6449d644c in ExecVacuum (vacstmt=vacstmt@entry=0x55d645442e50, isTopLevel=isTopLevel@entry=true) at vacuum.c:141 #10 0x55d644b5ec9d in standard_ProcessUtility (pstmt=0x55d6454431a0, queryString=0x55d645442468 "ANALYZE;\n", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x55d644f50b00 , completionTag=0x7ffea6a650d0 "") at utility.c:670 #11 0x55d644b5f1cc in ProcessUtility (pstmt=pstmt@entry=0x55d6454431a0, queryString=, context=, params=, queryEnv=out>, dest=dest@entry=0x55d644f50b00 , completionTag=0x7ffea6a650d0 "") at utility.c:360 #12 0x55d644b5b582 in PortalRunUtility (portal=portal@entry=0x55d645434428, pstmt=pstmt@entry=0x55d6454431a0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x55d644f50b00 , completionTag=completionTag@entry=0x7ffea6a650d0 "") at pquery.c:1175 #13 0x55d644b5c1fd in PortalRunMulti (portal=portal@entry=0x55d645434428, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x55d644f50b00 , altdest=altdest@entry=0x55d644f50b00 , completionTag=completionTag@entry=0x7ffea6a650d0 "") at pquery.c:1321 #14 0x55d644b5cfbb in PortalRun (portal=portal@entry=0x55d645434428, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x55d644f50b00 , altdest=altdest@entry=0x55d644f50b00 , completionTag=0x7ffea6a650d0 "") at pquery.c:796 #15 0x55d644b591b8 in exec_simple_query (query_string=query_string@entry=0x55d645442468 "ANALYZE;\n") at postgres.c:1215 #16 0x55d644b5b0dd in PostgresMain (argc=, argv=, dbname=, username=) at postgres.c:4256 #17 0x55d644a3abf9 in main (argc=10, argv=0x55d6453c3490) at main.c:224