Re: optimize lookups in snapshot [sub]xip arrays
On Sun, Aug 7, 2022 at 4:25 AM Nathan Bossart wrote: > > [v8] Okay, I think it's basically in good shape. Since it should be a bit faster than a couple versions ago, would you be up for retesting with the original test having 8 to 512 writers? And also add the const markers we discussed upthread? Aside from that, I plan to commit this week unless there is further bikeshedding. -- John Naylor EDB: http://www.enterprisedb.com
Re: optimize lookups in snapshot [sub]xip arrays
On Mon, Aug 8, 2022 at 2:26 PM Bharath Rupireddy wrote:
> 1) pg_lfind32 - why just uint32? If it's not possible to define
> functions for char, unsigned char, int16, uint16, int32, int64, uint64
> and so on, can we add a few comments around that? Also, the comments

Future work, as far as I'm concerned. I'm interested in using a char version for json strings.

> 3) Can pg_lfind32 return the index of the key found, for instance to
> use it for setting/resetting the found element in the array?

That was just discussed. It's slightly faster not to return an index.

-- John Naylor EDB: http://www.enterprisedb.com
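To make the API question above concrete, here is a scalar sketch of the two contracts being compared — a boolean pg_lfind32-style search versus a hypothetical index-returning variant. These are illustrative reference loops, not the committed (vectorized) implementation; the names are mine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Scalar reference for the pg_lfind32 contract: return true if 'key'
 * appears anywhere in 'base'.  The committed version vectorizes this
 * loop, but the observable behavior is the same.
 */
static bool
lfind32_scalar(uint32_t key, const uint32_t *base, uint32_t nelem)
{
	for (uint32_t i = 0; i < nelem; i++)
	{
		if (base[i] == key)
			return true;
	}
	return false;
}

/*
 * Hypothetical index-returning variant (returns nelem when not found).
 * As noted in the thread, returning an index is slightly slower in the
 * vectorized case, because the match position must be recovered from
 * the vector comparison rather than just "any lane matched".
 */
static uint32_t
lfind32_index_scalar(uint32_t key, const uint32_t *base, uint32_t nelem)
{
	for (uint32_t i = 0; i < nelem; i++)
	{
		if (base[i] == key)
			return i;
	}
	return nelem;
}
```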
Re: NAMEDATALEN increase because of non-latin languages
I wrote: > The syscache use of GETSTRUCT still uses a simple cast of the tuple (for > pg_cast those calls live in parse_coerce.c, which is unchanged from master in > v3). Next step I think is to see about the syscache piece -- teaching a > syscache miss to deform the entire tuple into a struct and store it in the > syscache. v4 is a hackish attempt at getting the syscache to deform the tuple and store the struct. Using only pg_cast again for ease, it seems to work for that. It's now about as far as I can get without thinking about byref types. 0001 just adds copies of some syscache / catcache functions for private use by pg_cast, as scaffolding. 0002 teaches the temporary CatalogCacheCreateEntry_STRUCT() to call heap_deform_tuple(). This then calls a for-now handwritten function (similar to the one generated in v3) which palloc's space for the struct and copies the fields over. Only this function knows the catalog struct type, so a future design could call the per-cache function via a pointer in the catcache control array. Since we already have the isnull/values array, it's also trivial to copy the datums for the cache keys. WIP: CatCTup->tuple is still declared a tuple, but the t_data contents are now bogus, so there is a temporary GETSTRUCT_NEW that knows it's looking directly at the struct. Getting to varlen attributes: For one, I think it was mentioned above that we will need a way to perform a deep copy of structs that contain pointers to varlen fields. I imagine keeping track of the attributes length will come up for some types and might be tricky. And before that, the catalog machinery will need some preprocessor tricks to declare pointers in the structs. 
-- John Naylor EDB: http://www.enterprisedb.com From 20fba44412e8ef1bb4cd5b051b9d7e82618a6d93 Mon Sep 17 00:00:00 2001 From: John Naylor Date: Wed, 10 Aug 2022 17:19:24 +0700 Subject: [PATCH v4 2/2] Teach catcache to store structs for pg_cast --- src/backend/parser/parse_coerce.c | 6 +-- src/backend/utils/cache/catcache.c | 73 -- src/include/access/htup_details.h | 2 + 3 files changed, 55 insertions(+), 26 deletions(-) diff --git a/src/backend/parser/parse_coerce.c b/src/backend/parser/parse_coerce.c index 39b7e5707b..07a1b047e3 100644 --- a/src/backend/parser/parse_coerce.c +++ b/src/backend/parser/parse_coerce.c @@ -3056,7 +3056,7 @@ IsBinaryCoercible(Oid srctype, Oid targettype) ObjectIdGetDatum(targettype)); if (!HeapTupleIsValid(tuple)) return false; /* no cast */ - castForm = (Form_pg_cast) GETSTRUCT(tuple); + castForm = GETSTRUCT_NEW(pg_cast, tuple); result = (castForm->castmethod == COERCION_METHOD_BINARY && castForm->castcontext == COERCION_CODE_IMPLICIT); @@ -3120,7 +3120,7 @@ find_coercion_pathway(Oid targetTypeId, Oid sourceTypeId, if (HeapTupleIsValid(tuple)) { - Form_pg_cast castForm = (Form_pg_cast) GETSTRUCT(tuple); + Form_pg_cast castForm = GETSTRUCT_NEW(pg_cast, tuple); CoercionContext castcontext; /* convert char value for castcontext to CoercionContext enum */ @@ -3287,7 +3287,7 @@ find_typmod_coercion_function(Oid typeId, if (HeapTupleIsValid(tuple)) { - Form_pg_cast castForm = (Form_pg_cast) GETSTRUCT(tuple); + Form_pg_cast castForm = GETSTRUCT_NEW(pg_cast, tuple); *funcid = castForm->castfunc; ReleaseSysCache(tuple); diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index b1287bb6a0..8ddc109052 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -21,6 +21,7 @@ #include "access/table.h" #include "access/valid.h" #include "access/xact.h" +#include "catalog/pg_cast.h" // fixme #include "catalog/pg_collation.h" #include "catalog/pg_operator.h" #include "catalog/pg_type.h" 
@@ -2158,6 +2159,42 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, return ct; } +// WIP: generated functions would look like this and be called through a pointer +// FIXME: ct->tuple is no longer a real tuple +// XXX: for now assume the caller has switched to the right memory context +static CatCTup * +CatCArrayGetStruct_pg_cast(HeapTuple pg_cast_tuple, CatCTup *ct, Datum *values, bool *isnull) +{ + Form_pg_cast pg_cast_struct; + + /* Allocate memory for CatCTup and the cached struct in one go */ + ct = (CatCTup *) palloc(sizeof(CatCTup) + + MAXIMUM_ALIGNOF + sizeof(FormData_pg_cast)); + + /* copy the identification info */ + // WIP: for caches we only need t_self, can we just have that as a + // separate field in CatCTup? + ct->tuple.t_len = pg_cast_tuple->t_len; + ct->tuple.t_self = pg_cast_tuple->t_self; + ct->tuple.t_tableOid = pg_cast_tuple->t_tableOid; + + // WIP: treat t_data as a pointer to the struct + ct->tuple.t_data = (HeapTupleHeader) + MAXALIGN(((char *) ct) + sizeof(CatCTup)); + pg_cast
Re: [RFC] building postgres with meson
On Thu, Aug 11, 2022 at 12:19 AM Andres Freund wrote: > I tried to ignore various generated files in the source tree, but I don't > think it's doable for all of them. Consider > e.g. src/backend/utils/misc/guc-file.c which is gets built via #include > "guc-file.c" from gram.c With a bit of work, we could probably get rid of those includes. See 27199058d98ef7f for one example. -- John Naylor EDB: http://www.enterprisedb.com
Re: use SSE2 for is_valid_ascii
On Thu, Aug 11, 2022 at 5:31 AM Nathan Bossart wrote: > > This is a neat patch. I don't know that we need an entirely separate code > block for the USE_SSE2 path, but I do think that a little bit of extra > commentary would improve the readability. IMO the existing comment for the > zero accumulator has the right amount of detail. > > + /* > +* Set all bits in each lane of the error accumulator where > input > +* bytes are zero. > +*/ > + error_cum = _mm_or_si128(error_cum, > + > _mm_cmpeq_epi8(chunk, _mm_setzero_si128())); Okay, I will think about the comments, thanks for looking. > I wonder if reusing a zero vector (instead of creating a new one every > time) has any noticeable effect on performance. Creating a zeroed register is just FOO PXOR FOO, which should get hoisted out of the (unrolled in this case) loop, and which a recent CPU will just map to a hard-coded zero in the register file, in which case the execution latency is 0 cycles. :-) -- John Naylor EDB: http://www.enterprisedb.com
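The error-accumulator idiom quoted above can be sketched in isolation: OR the result of a per-byte compare-against-zero into an accumulator, then test the accumulator once. Function and variable names here are illustrative, not taken from the patch, and the SSE2 path assumes an x86 build (a portable scalar fallback is included).

```c
#include <assert.h>
#include <stdbool.h>

#ifdef __SSE2__
#include <emmintrin.h>
#endif

/*
 * Return true if any of the 16 bytes at 'p' is zero, using the
 * accumulate-then-test pattern discussed above.
 */
static bool
chunk_has_zero_byte(const unsigned char *p)
{
#ifdef __SSE2__
	__m128i		chunk = _mm_loadu_si128((const __m128i *) p);
	__m128i		error_cum = _mm_setzero_si128();

	/*
	 * Set all bits in each lane of the error accumulator where input
	 * bytes are zero; _mm_cmpeq_epi8 yields 0xFF in matching lanes.
	 */
	error_cum = _mm_or_si128(error_cum,
							 _mm_cmpeq_epi8(chunk, _mm_setzero_si128()));

	/* collapse one bit per lane and test them all at once */
	return _mm_movemask_epi8(error_cum) != 0;
#else
	/* scalar fallback with identical behavior */
	for (int i = 0; i < 16; i++)
	{
		if (p[i] == 0)
			return true;
	}
	return false;
#endif
}
```

In real validation code several such comparisons would be OR'd into the same accumulator across an unrolled loop, with the movemask test done only once at the end.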
Re: [RFC] building postgres with meson
On Thu, Aug 11, 2022 at 10:37 AM Tom Lane wrote: > > John Naylor writes: > > With a bit of work, we could probably get rid of those includes. See > > 27199058d98ef7f for one example. > > Yeah --- it would mean creating gram.h files for all the bison grammars > not just a few of them, but it's certainly do-able if there's motivation > to make the changes. Most of the files that are done that way date > from before we knew about flex's %top. I'll volunteer to work on this unless an easier solution happens to come along in the next couple days. (aside: guc-file.l doesn't have a grammar, so not yet sure if that makes the issue easier or harder...) -- John Naylor EDB: http://www.enterprisedb.com
Re: optimize lookups in snapshot [sub]xip arrays
On Thu, Aug 11, 2022 at 4:46 AM Nathan Bossart wrote: > > On Wed, Aug 10, 2022 at 10:50:02AM +0700, John Naylor wrote: > > LGTM, let's see what the buildfarm thinks of 0001. > > Thanks! I haven't noticed any related buildfarm failures yet. I was waiting for all the Windows animals to report in, and it looks like they have, so I've pushed 0002. Thanks for picking this topic up again! -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Aug 15, 2022 at 12:39 PM Masahiko Sawada wrote: > > On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada > wrote: > > > > On Tue, Jul 19, 2022 at 1:30 PM John Naylor > > wrote: > > > > > > > > > > > > On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada > > > wrote: > > > > > > > I’d like to keep the first version simple. We can improve it and add > > > > more optimizations later. Using radix tree for vacuum TID storage > > > > would still be a big win comparing to using a flat array, even without > > > > all these optimizations. In terms of single-value leaves method, I'm > > > > also concerned about an extra pointer traversal and extra memory > > > > allocation. It's most flexible but multi-value leaves method is also > > > > flexible enough for many use cases. Using the single-value method > > > > seems to be too much as the first step for me. > > > > > > > > Overall, using 64-bit keys and 64-bit values would be a reasonable > > > > choice for me as the first step . It can cover wider use cases > > > > including vacuum TID use cases. And possibly it can cover use cases by > > > > combining a hash table or using tree of tree, for example. > > > > > > These two aspects would also bring it closer to Andres' prototype, which > > > 1) makes review easier and 2) easier to preserve optimization work > > > already done, so +1 from me. > > > > Thanks. > > > > I've updated the patch. It now implements 64-bit keys, 64-bit values, > > and the multi-value leaves method. I've tried to remove duplicated > > codes but we might find a better way to do that. > > > > With the recent changes related to simd, I'm going to split the patch > into at least two parts: introduce other simd optimized functions used > by the radix tree and the radix tree implementation. Particularly we > need two functions for radix tree: a function like pg_lfind32 but for > 8 bits integers and return the index, and a function that returns the > index of the first element that is >= key. 
I recommend looking at https://www.postgresql.org/message-id/CAFBsxsESLUyJ5spfOSyPrOvKUEYYNqsBosue9SV1j8ecgNXSKA%40mail.gmail.com since I did the work just now for searching bytes and returning a bool, both = and <=. Should be pretty close. Also, I believe if you left this for last as a possible refactoring, it might save some work. In any case, I'll take a look at the latest patch next month. -- John Naylor EDB: http://www.enterprisedb.com
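For reference, the second helper Masahiko describes — returning the index of the first element that is >= key, as needed for descending through sorted radix-tree node keys — has this contract in scalar form. This is only a sketch of the intended semantics under my reading of the thread; a SIMD version would compare a whole vector of node keys at once.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the index of the first element in the sorted array 'base'
 * that is >= 'key', or 'nelem' if there is none (a lower-bound scan).
 */
static uint32_t
first_ge_index(const uint8_t *base, uint32_t nelem, uint8_t key)
{
	uint32_t	i;

	for (i = 0; i < nelem; i++)
	{
		if (base[i] >= key)
			break;
	}
	return i;
}
```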
Re: [PATCH] Optimize json_lex_string by batching character copying
I wrote
> On Mon, Jul 11, 2022 at 11:07 PM Andres Freund wrote:
> > I wonder if we can add a somewhat more general function for scanning until
> > some characters are found using SIMD? There's plenty other places that could
> > be useful.
>
> In simple cases, we could possibly abstract the entire loop. With this
> particular case, I imagine the most approachable way to write the loop would
> be a bit more low-level:
>
> while (p < end - VECTOR_WIDTH &&
>        !vector_has_byte(p, '\\') &&
>        !vector_has_byte(p, '"') &&
>        vector_min_byte(p, 0x20))
>     p += VECTOR_WIDTH
>
> I wonder if we'd lose a bit of efficiency here by not accumulating set bits
> from the three conditions, but it's worth trying.

The attached implements the above, more or less, using new pg_lfind8() and pg_lfind8_le(), which in turn are based on helper functions that act on a single vector. The pg_lfind* functions have regression tests, but I haven't done the same for json yet. I went the extra step to use bit-twiddling for non-SSE builds using uint64 as a "vector", which still gives a pretty good boost (test below, min of 3):

master: 356ms
v5: 259ms
v5 disable SSE: 288ms

It still needs a bit of polishing and testing, but I think it's a good workout for abstracting SIMD out of the way.
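The uint64 "vector" fallback mentioned above can be sketched with the well-known SWAR trick for byte detection; this is a standalone illustration of the technique (names are mine, not the patch's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Return true if any of the 8 bytes at 'p' equals 'target'.
 *
 * XORing with the target repeated in every byte turns matching bytes
 * into zero; then (x - 0x01..01) & ~x & 0x80..80 is nonzero iff some
 * byte of x is zero (the classic zero-byte detection formula).
 */
static bool
swar_has_byte(const unsigned char *p, unsigned char target)
{
	uint64_t	v;
	uint64_t	t = (uint64_t) target * UINT64_C(0x0101010101010101);
	uint64_t	x;

	memcpy(&v, p, sizeof(v));	/* alignment-safe 8-byte load */
	x = v ^ t;					/* matching bytes become zero */
	return ((x - UINT64_C(0x0101010101010101)) & ~x &
			UINT64_C(0x8080808080808080)) != 0;
}
```

A pg_lfind8-style loop over this function covers 8 bytes per iteration without any SIMD intrinsics, which is where the "disable SSE" numbers above come from.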
- test: DROP TABLE IF EXISTS long_json_as_text; CREATE TABLE long_json_as_text AS with long as ( select repeat(description, 11) from pg_description ) select (select json_agg(row_to_json(long))::text as t from long) from generate_series(1, 100); VACUUM FREEZE long_json_as_text; select 1 from long_json_as_text where t::json is null; -- from Andrew upthread -- John Naylor EDB: http://www.enterprisedb.com diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c index fefd1d24d9..1f9eb134e8 100644 --- a/src/common/jsonapi.c +++ b/src/common/jsonapi.c @@ -19,6 +19,7 @@ #include "common/jsonapi.h" #include "mb/pg_wchar.h" +#include "port/pg_lfind.h" #ifndef FRONTEND #include "miscadmin.h" @@ -844,7 +845,7 @@ json_lex_string(JsonLexContext *lex) } else { - char *p; + char *p = s; if (hi_surrogate != -1) return JSON_UNICODE_LOW_SURROGATE; @@ -853,7 +854,13 @@ json_lex_string(JsonLexContext *lex) * Skip to the first byte that requires special handling, so we * can batch calls to appendBinaryStringInfo. */ - for (p = s; p < end; p++) + while (p < end - SIZEOF_VECTOR && + !pg_lfind8('\\', (unsigned char*) p, SIZEOF_VECTOR) && + !pg_lfind8('"', (unsigned char*) p, SIZEOF_VECTOR) && + !pg_lfind8_le(0x1F, (unsigned char*) p, SIZEOF_VECTOR)) +p += SIZEOF_VECTOR; + + for (; p < end; p++) { if (*p == '\\' || *p == '"') break; diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h index fb125977b2..e090ea6ac3 100644 --- a/src/include/port/pg_lfind.h +++ b/src/include/port/pg_lfind.h @@ -1,7 +1,7 @@ /*- * * pg_lfind.h - * Optimized linear search routines. + * Optimized linear search routines using SIMD intrinsics where available * * Copyright (c) 2022, PostgreSQL Global Development Group * @@ -15,6 +15,76 @@ #include "port/simd.h" +/* + * pg_lfind8 + * + * Return true if there is an element in 'base' that equals 'key', otherwise + * return false. 
+ */ +static inline bool +pg_lfind8(uint8 key, uint8 *base, uint32 nelem) +{ + uint32 i; + /* round down to multiple of vector length */ + uint32 iterations = nelem & ~(SIZEOF_VECTOR - 1); + TYPEOF_VECTOR chunk; + + for (i = 0; i < iterations; i += SIZEOF_VECTOR) + { +#ifdef USE_SSE2 + chunk = _mm_loadu_si128((const __m128i *) &base[i]); +#else + memcpy(&chunk, &base[i], sizeof(chunk)); +#endif /* USE_SSE2 */ + if (vector_eq_byte(chunk, key)) + return true; + } + + /* Process the remaining elements one at a time. */ + for (; i < nelem; i++) + { + if (key == base[i]) + return true; + } + + return false; +} + +/* + * pg_lfind8_le + * + * Return true if there is an element in 'base' that is less than or equal to 'key', otherwise + * return false. + */ +static inline bool +pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem) +{ + uint32 i; + /* round down to multiple of vector length */ + uint32 iterations = nelem & ~(SIZEOF_VECTOR - 1); + TYPEOF_VECTOR chunk; + + for (i = 0; i < iterations; i += SIZEOF_VECTOR) + { +#ifdef USE_SSE2 + chunk = _mm_loadu_si128((const __m128i *) &base[i]); +#else + memcpy(&chunk, &base[i], sizeof(chunk)); +#endif /* USE_SSE2 */ + if (vector_le_byte(chunk, key)) + return true; + } + + /* Process the remaining elements one at a time. */ + for (; i < nelem; i++) + { + if (base[i] <= key) + return true; + } + + return false; +} + /* + * pg_l
Re: fix typos
On Fri, Aug 12, 2022 at 8:55 PM Tom Lane wrote: > > John Naylor writes: > > This is really a straw-man proposal, since I'm not volunteering to do > > the work, or suggest anybody else should do the same. That being the > > case, it seems we should just go ahead with Justin's patch for > > consistency. Possibly we could also change the messages to say "ID"? > > I'd be content if we change the user-facing messages (and documentation > if any) to say "ID" not "OID". The documentation has both, so it makes sense to standardize on "ID". The messages all had oid/OID, which I changed in the attached. I think I got all the places. I'm thinking it's not wrong enough to confuse people, but consistency is good, so backpatch to v15 and no further. Does anyone want to make a case otherwise? -- John Naylor EDB: http://www.enterprisedb.com diff --git a/doc/src/sgml/replication-origins.sgml b/doc/src/sgml/replication-origins.sgml index 7e02c4605b..bb0fb624d2 100644 --- a/doc/src/sgml/replication-origins.sgml +++ b/doc/src/sgml/replication-origins.sgml @@ -27,12 +27,12 @@ - Replication origins have just two properties, a name and an OID. The name, + Replication origins have just two properties, a name and an ID. The name, which is what should be used to refer to the origin across systems, is free-form text. It should be used in a way that makes conflicts between replication origins created by different replication solutions unlikely; e.g., by prefixing the replication solution's name to it. - The OID is used only to avoid having to store the long version + The ID is used only to avoid having to store the long version in situations where space efficiency is important. It should never be shared across systems. 
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c index c72ad6b93d..9eb97fac58 100644 --- a/src/backend/replication/logical/origin.c +++ b/src/backend/replication/logical/origin.c @@ -328,7 +328,7 @@ replorigin_create(const char *roname) if (tuple == NULL) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), - errmsg("could not find free replication origin OID"))); + errmsg("could not find free replication origin ID"))); heap_freetuple(tuple); return roident; @@ -364,7 +364,7 @@ restart: if (nowait) ereport(ERROR, (errcode(ERRCODE_OBJECT_IN_USE), - errmsg("could not drop replication origin with OID %d, in use by PID %d", + errmsg("could not drop replication origin with ID %u, in use by PID %d", state->roident, state->acquired_by))); @@ -408,7 +408,7 @@ restart: */ tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident)); if (!HeapTupleIsValid(tuple)) - elog(ERROR, "cache lookup failed for replication origin with oid %u", + elog(ERROR, "cache lookup failed for replication origin with ID %u", roident); CatalogTupleDelete(rel, &tuple->t_self); @@ -485,7 +485,7 @@ replorigin_by_oid(RepOriginId roident, bool missing_ok, char **roname) if (!missing_ok) ereport(ERROR, (errcode(ERRCODE_UNDEFINED_OBJECT), - errmsg("replication origin with OID %u does not exist", + errmsg("replication origin with ID %u does not exist", roident))); return false; @@ -937,7 +937,7 @@ replorigin_advance(RepOriginId node, { ereport(ERROR, (errcode(ERRCODE_OBJECT_IN_USE), - errmsg("replication origin with OID %d is already active for PID %d", + errmsg("replication origin with ID %u is already active for PID %d", replication_state->roident, replication_state->acquired_by))); } @@ -948,7 +948,7 @@ replorigin_advance(RepOriginId node, if (replication_state == NULL && free_state == NULL) ereport(ERROR, (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED), - errmsg("could not find free replication state slot for replication origin with OID %u", + 
errmsg("could not find free replication state slot for replication origin with ID %u", node), errhint("Increase max_replication_slots and try again."))); @@ -1126,7 +1126,7 @@ replorigin_session_setup(RepOriginId node) { ereport(ERROR, (errcode(ERRCODE_OBJECT_IN_USE), - errmsg("replication origin with OID %d is already active for PID %d", + errmsg("replication origin with ID %u is already active for PID %d", curstate->roident, curstate->acquired_by))); } @@ -1138,7 +1138,7 @@ replorigin_session_setup(RepOriginId node) if (session_replication_state == NULL && free_slot == -1) ereport(ERROR, (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED), - errmsg("could not find free replication state slot for replication origin with OID %u", + errmsg("could not find free replication state slot for replication origin with ID %u", node), errhint("Increase max_replication_slots and try again."))); else if (session_replication_state == NULL)
Re: build remaining Flex files standalone
For v3, I addressed some comments and added .h files to the headerscheck exceptions. On Tue, Aug 16, 2022 at 1:11 AM Andres Freund wrote: > On 2022-08-13 15:39:06 +0700, John Naylor wrote: > > Here are the rest. Most of it was pretty straightforward, with the > > main exception of jsonpath_scan.c, which is not quite finished. That > > one passes tests but still has one compiler warning. I'm unsure how > > much of what is there already is really necessary or was cargo-culted > > from elsewhere without explanation. For starters, I'm not sure why the > > grammar has a forward declaration of "union YYSTYPE". It's noteworthy > > that it used to compile standalone, but with a bit more stuff, and > > that was reverted in 550b9d26f80fa30. I can hack on it some more later > > but I ran out of steam today. I've got it in half-way decent shape now, with an *internal.h header and some cleanups. > > - Include order seems to matter for the grammar's .h file. I didn't > > test if that was the case every time, and after a few miscompiles just > > always made it the last inclusion, but I'm wondering if we should keep > > those inclusions outside %top{} and put it at the start of the next > > %{} ? > > I think we have a few of those dependencies already, see e.g. > /* > * NB: include gram.h only AFTER including scanner.h, because scanner.h > * is what #defines YYLTYPE. > */ Went with something like this in all cases: /* * NB: include bootparse.h only AFTER including bootstrap.h, because bootstrap.h * includes node definitions needed for YYSTYPE. */ Future cleanup: I see this in headerscheck: # We can't make these Bison output files compilable standalone # without using "%code require", which old Bison versions lack. # parser/gram.h will be included by parser/gramparse.h anyway. That directive has been supported in Bison since 2.4.2. 
> > From d723ba14acf56fd432e9e263db937fcc13fc0355 Mon Sep 17 00:00:00 2001 > > From: John Naylor > > Date: Thu, 11 Aug 2022 19:38:37 +0700 > > Subject: [PATCH v201 1/9] Build guc-file.c standalone > > Might be worth doing some of the moving around here separately from the > parser/scanner specific bits. Done in 0001/0003. > > +/* functions shared between guc.c and guc-file.l */ > > [...] > I think I prefer your suggestion of a guc_internal.h upthread. Started in 0002, but left open the headerscheck failure. Also, if such a thing is meant to be #include'd only by two generated files, maybe it should just live in the directory where they live, and not in the src/include dir? > > From 7d4ecfcb3e91f3b45e94b9e64c7c40f1bbd22aa8 Mon Sep 17 00:00:00 2001 > > From: John Naylor > > Date: Fri, 12 Aug 2022 15:45:24 +0700 > > Subject: [PATCH v201 2/9] Build booscanner.c standalone > > > -# bootscanner is compiled as part of bootparse > > -bootparse.o: bootscanner.c > > +# See notes in src/backend/parser/Makefile about the following two rules > > +bootparse.h: bootparse.c > > + touch $@ > > + > > +bootparse.c: BISONFLAGS += -d > > + > > +# Force these dependencies to be known even without dependency info built: > > +bootparse.o bootscan.o: bootparse.h > > Wonder if we could / should wrap this is something common. It's somewhat > annoying to repeat this stuff everywhere. I haven't looked at the Meson effort recently, but if the build rule is less annoying there, I'm inclined to leave this as a wart until autotools are retired. > > diff --git a/src/test/isolation/specscanner.l > > b/src/test/isolation/specscanner.l > > index aa6e89268e..2dc292c21d 100644 > > --- a/src/test/isolation/specscanner.l > > +++ b/src/test/isolation/specscanner.l > > @@ -1,4 +1,4 @@ > > -%{ > > +%top{ > > /*- > > * > > * specscanner.l > > @@ -9,7 +9,14 @@ > > * > > *- > > */ > > +#include "postgres_fe.h" > > Miniscule nitpick: I think we typically leave an empty line between header and > first include. 
In a small unscientific sample it seems like the opposite is true actually, but I'll at least try to be consistent within the patch set. > > diff --git a/contrib/cube/cubedata.h b/contrib/cube/cubedata.h > > index dbe7d4f742..0b373048b5 100644 > > --- a/contrib/cube/cubedata.h > > +++ b/contrib/cube/cubedata.h > > @@ -67,3 +67,7 @@ extern void cube_scanner_finish(void); > > > > /* in cubeparse.y */ > > extern int cube_yyparse(NDBOX **result); > > + > > +/* All grammar constructs return strings */ > > +#de
Re: Cleaning up historical portability baggage
On Thu, Aug 4, 2022 at 11:09 AM Tom Lane wrote: > > clock_gettime is required by SUSv2 (1997), so I have to admit that > macOS 10.4 doesn't have a lot of excuse not to have it. In any case, > prairiedog is just sitting there doing its thing until I find cycles > to install a newer OS. If you want to move ahead with this, don't > let prairiedog block you. Now that prairiedog is offline, it seems the oldest Bison in the buildfarm is 2.3, as the vendor-supplied version for MacOS. Since earlier versions are now untested, it seems we should now make that the minimum, as in the attached. I took a quick look to see if that would enable anything useful, but nothing stuck out. Aside: The MSVC builds don't report the Bison version that I can see, but otherwise it seems now the only non-Apple pre-3.0 animals are prion (2.7) and the three Sparc animals on Debian 7 (2.5). -- John Naylor EDB: http://www.enterprisedb.com From 6e1022bae4a8d45783fa0d678d51eec03383d0de Mon Sep 17 00:00:00 2001 From: John Naylor Date: Sun, 7 Aug 2022 18:32:57 +0700 Subject: [PATCH v1] Require at least Bison 2.3 to match the buildfarm --- config/programs.m4 | 6 +++--- configure | 6 +++--- contrib/cube/cubeparse.y | 5 + contrib/seg/segparse.y | 5 + doc/src/sgml/install-windows.sgml | 2 +- doc/src/sgml/installation.sgml | 2 +- src/backend/bootstrap/bootparse.y | 5 + src/backend/parser/gram.y | 5 + src/backend/replication/repl_gram.y| 5 + src/backend/replication/syncrep_gram.y | 5 + src/backend/utils/adt/jsonpath_gram.y | 5 + src/pl/plpgsql/src/pl_gram.y | 5 + src/tools/msvc/pgbison.pl | 2 +- 13 files changed, 17 insertions(+), 41 deletions(-) diff --git a/config/programs.m4 b/config/programs.m4 index e7908d8793..cf432cb62c 100644 --- a/config/programs.m4 +++ b/config/programs.m4 @@ -22,7 +22,7 @@ fi # PGAC_PATH_BISON # --- # Look for Bison, set the output variable BISON to its path if found. -# Reject versions before 1.875 (they have bugs or capacity limits). +# Reject versions before 2.3. 
# Note we do not accept other implementations of yacc. AC_DEFUN([PGAC_PATH_BISON], @@ -31,11 +31,11 @@ AC_DEFUN([PGAC_PATH_BISON], if test "$BISON"; then pgac_bison_version=`$BISON --version 2>/dev/null | sed q` AC_MSG_NOTICE([using $pgac_bison_version]) - if echo "$pgac_bison_version" | $AWK '{ if ([$]4 < 1.875) exit 0; else exit 1;}' + if echo "$pgac_bison_version" | $AWK '{ if ([$]4 < 2.3) exit 0; else exit 1;}' then AC_MSG_WARN([ *** The installed version of Bison, $BISON, is too old to use with PostgreSQL. -*** Bison version 1.875 or later is required, but this is $pgac_bison_version.]) +*** Bison version 2.3 or later is required, but this is $pgac_bison_version.]) BISON="" fi # Bison >=3.0 issues warnings about %name-prefix="base_yy", instead diff --git a/configure b/configure index cf2c4b85fe..c3c7c1b39b 100755 --- a/configure +++ b/configure @@ -10219,14 +10219,14 @@ if test "$BISON"; then pgac_bison_version=`$BISON --version 2>/dev/null | sed q` { $as_echo "$as_me:${as_lineno-$LINENO}: using $pgac_bison_version" >&5 $as_echo "$as_me: using $pgac_bison_version" >&6;} - if echo "$pgac_bison_version" | $AWK '{ if ($4 < 1.875) exit 0; else exit 1;}' + if echo "$pgac_bison_version" | $AWK '{ if ($4 < 2.3) exit 0; else exit 1;}' then { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: *** The installed version of Bison, $BISON, is too old to use with PostgreSQL. -*** Bison version 1.875 or later is required, but this is $pgac_bison_version." >&5 +*** Bison version 2.3 or later is required, but this is $pgac_bison_version." >&5 $as_echo "$as_me: WARNING: *** The installed version of Bison, $BISON, is too old to use with PostgreSQL. -*** Bison version 1.875 or later is required, but this is $pgac_bison_version." >&2;} +*** Bison version 2.3 or later is required, but this is $pgac_bison_version." 
>&2;} BISON="" fi # Bison >=3.0 issues warnings about %name-prefix="base_yy", instead diff --git a/contrib/cube/cubeparse.y b/contrib/cube/cubeparse.y index 7577c4515c..9bfd7a0df6 100644 --- a/contrib/cube/cubeparse.y +++ b/contrib/cube/cubeparse.y @@ -15,10 +15,7 @@ /* * Bison doesn't allocate anything that needs to live across parser calls, * so we can easily have it use palloc instead of malloc. This prevents - * memory leaks if we error out during parsing. Note this only works with - * bison >= 2.0. However, in bison 1.875 the default is to use alloca() - * if possible, so there's not really much problem anyhow, at least if - * you're building with gc
Re: fix typos
I wrote: > On Thu, Aug 4, 2022 at 8:41 PM Tom Lane wrote: > > > > John Naylor writes: > > > > RepOriginId is a typedef for uint16, so this can't print the wrong answer, > > > but it is inconsistent with other uses. So it seems we don't need to > > > backpatch this one? > > > > Um ... if it's int16, then it can't be an OID, so I'd say this message has > > far worse problems than %d vs %u. It should not use that terminology. > > The catalog has the following. Since it's not a real oid, maybe this column > should be rethought? This is really a straw-man proposal, since I'm not volunteering to do the work, or suggest anybody else should do the same. That being the case, it seems we should just go ahead with Justin's patch for consistency. Possibly we could also change the messages to say "ID"? > CATALOG(pg_replication_origin,6000,ReplicationOriginRelationId) > BKI_SHARED_RELATION > { > /* > * Locally known id that get included into WAL. > * > * This should never leave the system. > * > * Needs to fit into an uint16, so we don't waste too much space in WAL > * records. For this reason we don't use a normal Oid column here, since > * we need to handle allocation of new values manually. > */ > Oid roident; -- John Naylor EDB: http://www.enterprisedb.com
build remaining Flex files standalone
Starting a new thread to control clutter. [was: Re: [RFC] building postgres with meson] motivation: https://www.postgresql.org/message-id/20220810171935.7k5zgnjwqzalzmtm%40awork3.anarazel.de On Thu, Aug 11, 2022 at 11:07 AM Andres Freund wrote: > I think we should consider compiling it separately from guc.c. guc.c already > compiles quite slowly (iirc beat only by ecpg and main grammar), and it's a > relatively commonly changed source file. Done in the attached, and will do the rest in time. It seemed most straightforward to put ProcessConfigFileInternal() in guc.c since that's where most of its callers are, and it relies on some vars and types declared there. There are a couple new extern declarations in guc.h that are only for guc.c and guc-file.c: +/* functions shared between guc.c and guc-file.l */ +extern int guc_name_compare(const char *namea, const char *nameb); +extern ConfigVariable *ProcessConfigFileInternal(GucContext context, + bool applySettings, int elevel); +extern void record_config_file_error(const char *errmsg, + const char *config_file, + int lineno, + ConfigVariable **head_p, + ConfigVariable **tail_p); These might be better placed in a new guc_internal.h. Thoughts? > It might even be a good idea to split guc.c so it only contains the settings > arrays + direct dependencies... Perhaps this can be a TODO item, one which falls under "[E] marks items that are easier to implement". I've been slacking on removing the old/intractable cruft from the TODO list, but we should also be sticking small nice-but-not-necessary things in there. That said, if this idea has any bearing on the guc_internal.h idea, it might be better dealt with now. 
-- John Naylor EDB: http://www.enterprisedb.com From d723ba14acf56fd432e9e263db937fcc13fc0355 Mon Sep 17 00:00:00 2001 From: John Naylor Date: Thu, 11 Aug 2022 19:38:37 +0700 Subject: [PATCH v1] Build guc-file.c standalone The proposed Meson build system will need a way to ignore certain generated files in order to coexist with the autoconf build system, and #include'd C files generated by Flex make this more difficult. Build guc-file.c separately from guc.c, as was done in 72b1e3a21. TODO: other Flex-generated files Discussion: https://www.postgresql.org/message-id/20220810171935.7k5zgnjwqzalzmtm%40awork3.anarazel.de --- src/backend/utils/misc/Makefile | 5 +- src/backend/utils/misc/guc-file.l | 367 +- src/backend/utils/misc/guc.c | 362 - src/include/utils/guc.h | 9 + 4 files changed, 370 insertions(+), 373 deletions(-) diff --git a/src/backend/utils/misc/Makefile b/src/backend/utils/misc/Makefile index 1d5327cf64..cf7ce9bc83 100644 --- a/src/backend/utils/misc/Makefile +++ b/src/backend/utils/misc/Makefile @@ -16,6 +16,7 @@ override CPPFLAGS := -I. -I$(srcdir) $(CPPFLAGS) OBJS = \ guc.o \ + guc-file.o \ help_config.o \ pg_config.o \ pg_controldata.o \ @@ -37,10 +38,6 @@ endif include $(top_srcdir)/src/backend/common.mk -# guc-file is compiled as part of guc -guc.o: guc-file.c - # Note: guc-file.c is not deleted by 'make clean', # since we want to ship it in distribution tarballs. 
clean: - @rm -f lex.yy.c diff --git a/src/backend/utils/misc/guc-file.l b/src/backend/utils/misc/guc-file.l index ce5633844c..08adb454de 100644 --- a/src/backend/utils/misc/guc-file.l +++ b/src/backend/utils/misc/guc-file.l @@ -7,7 +7,7 @@ * src/backend/utils/misc/guc-file.l */ -%{ +%top{ #include "postgres.h" @@ -17,9 +17,12 @@ #include "mb/pg_wchar.h" #include "miscadmin.h" #include "storage/fd.h" +#include #include "utils/guc.h" +#include "utils/memutils.h" +} - +%{ /* * flex emits a yy_fatal_error() function that it calls in response to * critical errors like malloc failure, file I/O errors, and detection of @@ -48,12 +51,6 @@ static sigjmp_buf *GUC_flex_fatal_jmp; static void FreeConfigVariable(ConfigVariable *item); -static void record_config_file_error(const char *errmsg, - const char *config_file, - int lineno, - ConfigVariable **head_p, - ConfigVariable **tail_p); - static int GUC_flex_fatal(const char *msg); /* LCOV_EXCL_START */ @@ -159,358 +156,6 @@ ProcessConfigFile(GucContext context) MemoryContextDelete(config_cxt); } -/* - * This function handles both actual config file (re)loads and execution of - * show_all_file_settings() (i.e., the pg_file_settings view). In the latter - * case we don't apply any of the settings, but we make all the usual validity - * checks, and we return the ConfigVariable list so that it can be printed out - * by show_all_file_settings(). - */ -static ConfigVariable * -ProcessConfigFileInternal(GucContext context, bool applySettings, int elevel) -{ - bool error = false; - bool applying = false; - const char *ConfFileWithError; - ConfigVariable *item, - *head, -
Re: build remaining Flex files standalone
Here are the rest. Most of it was pretty straightforward, with the main exception of jsonpath_scan.c, which is not quite finished. That one passes tests but still has one compiler warning. I'm unsure how much of what is there already is really necessary or was cargo-culted from elsewhere without explanation. For starters, I'm not sure why the grammar has a forward declaration of "union YYSTYPE". It's noteworthy that it used to compile standalone, but with a bit more stuff, and that was reverted in 550b9d26f80fa30. I can hack on it some more later but I ran out of steam today. Other questions thus far: - "BISONFLAGS += -d" is now in every make file with a .y file -- can we just force that everywhere? - Include order seems to matter for the grammar's .h file. I didn't test if that was the case every time, and after a few miscompiles just always made it the last inclusion, but I'm wondering if we should keep those inclusions outside %top{} and put it at the start of the next %{} ? - contrib/cubeparse.y now has a global variable -- not terrific, but I wanted to get something working first. - I'm actually okay with guc-file.c now, but I'll still welcome comments on that. -- John Naylor EDB: http://www.enterprisedb.com From da2b610b8608e6759f5ed9cc32b487ea8e750ce4 Mon Sep 17 00:00:00 2001 From: John Naylor Date: Fri, 12 Aug 2022 17:09:45 +0700 Subject: [PATCH v201 3/9] Build repl_scanner.c standalone --- src/backend/replication/Makefile | 11 +-- src/backend/replication/repl_gram.y| 2 -- src/backend/replication/repl_scanner.l | 27 +++--- 3 files changed, 25 insertions(+), 15 deletions(-) diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile index 2bffac58c0..bc8170418f 100644 --- a/src/backend/replication/Makefile +++ b/src/backend/replication/Makefile @@ -16,6 +16,7 @@ override CPPFLAGS := -I. 
-I$(srcdir) $(CPPFLAGS) OBJS = \ repl_gram.o \ + repl_scanner.o \ slot.o \ slotfuncs.o \ syncrep.o \ @@ -28,8 +29,14 @@ SUBDIRS = logical include $(top_srcdir)/src/backend/common.mk -# repl_scanner is compiled as part of repl_gram -repl_gram.o: repl_scanner.c +# See notes in src/backend/parser/Makefile about the following two rules +repl_gram.h: repl_gram.c + touch $@ + +repl_gram.c: BISONFLAGS += -d + +# Force these dependencies to be known even without dependency info built: +repl_gram.o repl_scanner.o: repl_gram.h # syncrep_scanner is compiled as part of syncrep_gram syncrep_gram.o: syncrep_scanner.c diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y index 4cf087e602..b343f108d3 100644 --- a/src/backend/replication/repl_gram.y +++ b/src/backend/replication/repl_gram.y @@ -416,5 +416,3 @@ ident_or_keyword: ; %% - -#include "repl_scanner.c" diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l index 586f0d3a5c..95a933c9a8 100644 --- a/src/backend/replication/repl_scanner.l +++ b/src/backend/replication/repl_scanner.l @@ -1,4 +1,4 @@ -%{ +%top{ /*- * * repl_scanner.l @@ -17,7 +17,12 @@ #include "utils/builtins.h" #include "parser/scansup.h" +#include "replication/walsender_private.h" + +#include "repl_gram.h" +} +%{ /* Avoid exit() on fatal scanner errors (a bit ugly -- see yy_fatal_error) */ #undef fprintf #define fprintf(file, fmt, msg) fprintf_to_ereport(fmt, msg) @@ -130,7 +135,7 @@ WAIT{ return K_WAIT; } {space}+ { /* do nothing */ } {digit}+ { - yylval.uintval = strtoul(yytext, NULL, 10); + replication_yylval.uintval = strtoul(yytext, NULL, 10); return UCONST; } @@ -138,8 +143,8 @@ WAIT{ return K_WAIT; } uint32 hi, lo; if (sscanf(yytext, "%X/%X", &hi, &lo) != 2) - yyerror("invalid streaming start location"); - yylval.recptr = ((uint64) hi) << 32 | lo; + replication_yyerror("invalid streaming start location"); + replication_yylval.recptr = ((uint64) hi) << 32 | lo; return RECPTR; } @@ -151,7 
+156,7 @@ WAIT{ return K_WAIT; } {quotestop} { yyless(1); BEGIN(INITIAL); - yylval.str = litbufdup(); + replication_yylval.str = litbufdup(); return SCONST; } @@ -173,9 +178,9 @@ WAIT{ return K_WAIT; } yyless(1); BEGIN(INITIAL); - yylval.str = litbufdup(); - len = strlen(yylval.str); - truncate_identifier(yylval.str, len, true); + replication_yylval.str = litbufdup(); + len = strlen(replication_yylval.str); + truncate_identifier(replication_yylval.str, len, true); return IDENT; } @@ -186,7 +191,7 @@ WAIT{ return K_WAIT; } {identifier} { int len = strlen(yytext); - yylval.str = downcase_truncate_identifier(yytext, len, true); + replication_yylval.str = downcase_
Re: support for SSE2 intrinsics
On Thu, Aug 4, 2022 at 12:38 PM Masahiko Sawada wrote: > > I also think it's a good start. There is a typo in the commit message: > > s/hepler/helper/ > > The rest looks good to me. Fixed, and pushed, thanks to you both! I'll polish a small patch I have that actually uses this. -- John Naylor EDB: http://www.enterprisedb.com
Re: fix typos
On Wed, Aug 3, 2022 at 11:41 PM Robert Haas wrote: > > I think that it's talking about this (documented) syntax: > > ALTER ROUTINE name [ ( [ [ argmode ] [ argname ] argtype [, ...] ] ) ] > [ NO ] DEPENDS ON EXTENSION extension_name > > So the change from "depends" to "depend" here is incorrect. Maybe we > can say something like: > > the DEPENDS ON EXTENSION > extension_name action > > (I haven't tested whether this markup works.) Makes sense, I'll go make it happen. -- John Naylor EDB: http://www.enterprisedb.com
Re: optimize lookups in snapshot [sub]xip arrays
On Thu, Aug 4, 2022 at 3:25 AM Nathan Bossart wrote: > Here is a new patch set without the annotation. Were you considering adding the new function to simd.h now that that's committed? It's a bit up in the air what should go in there, but this new function is low-level and generic enough to be a candidate... I wonder if the "pg_" prefix is appropriate here, as that is most often used for things that hide specific details *and* where the base name would clash, like OS calls or C library functions. I'm not quite sure where the line is drawn, but I mean that "linearsearch" is a generic algorithm and not a specific API we are implementing, if that makes sense. The suffix "_uint32" might be more succinct as "32" (cf pg_bswap32(), pg_popcount32, etc). We'll likely want to search bytes sometime, so something to keep in mind as far as naming ("_int" vs "_byte"?). I'm not a fan of "its" as a variable name, and I'm curious what it's intended to convey. All the __m128i vars could technically be declared const, although I think it doesn't matter -- it's just a hint to the reader. Out of curiosity do we know how much we get by loading four registers rather than two? -- John Naylor EDB: http://www.enterprisedb.com
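To make the idiom under discussion concrete, here is a minimal sketch (not the patch itself, and the function name is mine) of an SSE2 linear search processing one 4-lane register per iteration; the patch widens this to four registers per block for better instruction-level parallelism. It assumes an x86-64 target, where SSE2 is always available.

```c
#include <emmintrin.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if 'key' occurs in the first 'nelem' elements of 'base'.
 * Compares four elements per loop iteration using SSE2, then falls back
 * to a scalar loop for the remainder.
 */
static bool
lfind32_sketch(uint32_t key, const uint32_t *base, uint32_t nelem)
{
	uint32_t	i = 0;
	const __m128i keys = _mm_set1_epi32(key);	/* 4 copies of the key */
	uint32_t	iterations = nelem & ~0x3;	/* round down to multiple of 4 */

	for (; i < iterations; i += 4)
	{
		/* load the next 4 values and compare each lane to the key */
		__m128i		vals = _mm_loadu_si128((const __m128i *) &base[i]);
		__m128i		result = _mm_cmpeq_epi32(keys, vals);

		/* movemask is nonzero iff at least one lane compared equal */
		if (_mm_movemask_epi8(result) != 0)
			return true;
	}

	/* process remaining elements one at a time */
	for (; i < nelem; i++)
	{
		if (key == base[i])
			return true;
	}

	return false;
}
```

The separate scalar tail is what lets the vector loop round nelem down to a register-multiple without missing trailing elements.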
Re: fix typos
On Tue, Aug 2, 2022 at 1:11 AM Justin Pryzby wrote: > > On Mon, Aug 01, 2022 at 08:04:54PM +0200, Erik Rijkers wrote: > > Recent typos... > > LGTM, thanks. > > Here are some others I've been sitting on, mostly in .c files. I pushed Robert's suggestion, then pushed the rest of Erik's changes and two of Justin's. For Justin's 0004: --- a/src/backend/replication/logical/origin.c +++ b/src/backend/replication/logical/origin.c @@ -364,7 +364,7 @@ restart: if (nowait) ereport(ERROR, (errcode(ERRCODE_OBJECT_IN_USE), - errmsg("could not drop replication origin with OID %d, in use by PID %d", + errmsg("could not drop replication origin with OID %u, in use by PID %d", RepOriginId is a typedef for uint16, so this can't print the wrong answer, but it is inconsistent with other uses. So it seems we don't need to backpatch this one? For patch 0002, the whitespace issue in the top comment in inval.c, I'm inclined to just change all the out-of-place tabs in a single commit, so we can add that to the list of whitespace commits. -- John Naylor EDB: http://www.enterprisedb.com
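A standalone illustration (not PostgreSQL code; the function name is mine) of why a uint16 "can't print the wrong answer" with %d: the value is promoted to int through varargs, an int can represent every uint16 value, so it is never negative and %d and %u render the same digits. Only a full 32-bit unsigned type can go wrong under %d.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the same uint16 value with %d and %u and report whether the
 * two renderings are identical.  The default argument promotions turn
 * the uint16_t into an int whose value is always non-negative.
 */
static bool
same_rendering_uint16(uint16_t id)
{
	char		as_d[32];
	char		as_u[32];

	snprintf(as_d, sizeof(as_d), "%d", id);		/* promoted to int, >= 0 */
	snprintf(as_u, sizeof(as_u), "%u", id);		/* same value read as unsigned */
	return strcmp(as_d, as_u) == 0;
}
```

This holds for every possible value, including the maximum 65535, which is still far below INT_MAX.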
Re: build remaining Flex files standalone
On Wed, Aug 17, 2022 at 8:14 AM Andres Freund wrote: > > > > +/* functions shared between guc.c and guc-file.l */ > > > > [...] > > > I think I prefer your suggestion of a guc_internal.h upthread. > > > > Started in 0002, but left open the headerscheck failure. > > > > Also, if such a thing is meant to be #include'd only by two generated > > files, maybe it should just live in the directory where they live, and > > not in the src/include dir? > > It's not something we've done for the backend afaics, but I don't see a reason > not to start at some point. BTW, I forgot to mention I did this for the json path parser, which makes the makefile code simpler than what was there before 550b9d26f80fa30. AFAICS, we could also do the same for gramparse.h, which is internal to parser.c. If I'm not mistaken, the only reason we symlink gram.h to src/include/* is so that gramparse.h can include it. So keeping gramparse.h in the backend could allow removing some gram.h makefile incantations. > > > Why does this need to be defined in a semi-public header? If we do this in > > > multiple files we'll end up with the danger of macro redefinition > > > warnings. > > > > I tried to put all the Flex/Bison stuff in another *_internal header, > > but that breaks the build. Putting just this one symbol in a header is > > silly, but done that way for now. Maybe two copies of the symbol? > > The problem is that if it's in a header you can't include another header with > such a define. That's fine if it's a .h that's just intended to be included by > a limited set of files, but for something like a header for a datatype that > might need to be included to e.g. define a PL transform or a new operator or > ... This would be solved by the %code requires thing, right? I believe it would. -- John Naylor EDB: http://www.enterprisedb.com
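For reference, the %code requires construct mentioned above looks like this in a .y file (a hypothetical fragment, not taken from any patch in this thread): Bison copies a %code requires block into the generated header as well as the .c file, so anything YYSTYPE depends on, even a YYSTYPE macro itself, travels with the header instead of living in a hand-written semi-public one.

```yacc
/* Hypothetical grammar fragment.  The contents of %code requires are
 * emitted into the generated gram.h before the YYSTYPE declaration,
 * so any file including that header sees the same definition without
 * risking macro-redefinition warnings from a shared .h file. */
%code requires
{
	/* All grammar constructs return strings */
	#define YYSTYPE char *
}
```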
Re: [PATCH] Clarify the comments about varlena header encoding
On Thu, Aug 18, 2022 at 1:06 AM Aleksander Alekseev wrote: > > Hi hackers, > > I noticed that the comments regarding bit layouts for varlena headers > in postgres.h are somewhat misleading. For instance, when reading: I agree it's confusing, but I don't think this patch is the right direction. > ``` > 00xxxxxx 4-byte length word, aligned, uncompressed data (up to 1G) > ``` > > ... one can assume this is a 00xxxxxx byte followed by another 4 bytes > (which is wrong). Also one can read this as "aligned, uncompressed > data" (which again is wrong). - * 00xxxxxx 4-byte length word, aligned, uncompressed data (up to 1G) + * 00xxxxxx xxxxxxxx xxxxxxxx xxxxxxxx, uncompressed data (up to 1G) Maybe "00xxxxxx 4-byte length word (aligned)," is more clear about what is aligned. Also, adding all those xxx's obscures the point that we only need to examine one byte to figure out what to do next. > ``` > 10000000 1-byte length word, unaligned, TOAST pointer > ``` > > This is misleading too. The comments above this line say that `struct > varatt_external` is a TOAST pointer. sizeof(varatt_external) = 16, > plus 1 byte equals 17, right? However the documentation [1] claims the > result should be 18: The patch has: + * In the third case the va_tag field (see varattrib_1b_e) is used to discern + * the specific type and length of the pointer datum. On disk the "xxx" bits + * currently always store sizeof(varatt_external) + 2. ...so not sure where 17 came from. - * 10000000 1-byte length word, unaligned, TOAST pointer + * 10000000 xxxxxxxx, TOAST pointer (struct varatt_external) This implies that the header is two bytes, which is not accurate. That next byte is a type tag: /* TOAST pointers are a subset of varattrib_1b with an identifying tag byte */ typedef struct { uint8 va_header; /* Always 0x80 or 0x01 */ uint8 va_tag; /* Type of datum */ char va_data[FLEXIBLE_ARRAY_MEMBER]; /* Type-specific data */ } varattrib_1b_e; ...and does not always represent the on-disk length: /* * Type tag for the various sorts of "TOAST pointer" datums. 
The peculiar * value for VARTAG_ONDISK comes from a requirement for on-disk compatibility * with a previous notion that the tag field was the pointer datum's length. */ typedef enum vartag_external { VARTAG_INDIRECT = 1, VARTAG_EXPANDED_RO = 2, VARTAG_EXPANDED_RW = 3, VARTAG_ONDISK = 18 } vartag_external; And I don't think the new comments referring to "third case", "first two cases", etc make it easier to follow. -- John Naylor EDB: http://www.enterprisedb.com
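The point that only the first byte needs to be examined can be sketched as a small classifier (names are mine, not PostgreSQL's). This follows the big-endian bit patterns quoted from postgres.h above; the little-endian layout differs.

```c
#include <stdint.h>

/* Categories a varlena datum's first header byte can indicate. */
enum va_kind
{
	VA_4B_UNCOMPRESSED,		/* 00xxxxxx: 4-byte length word, aligned */
	VA_4B_COMPRESSED,		/* 01xxxxxx: 4-byte length word, compressed */
	VA_TOAST_POINTER,		/* 10000000: 1-byte header, type tag byte follows */
	VA_1B_SHORT			/* 1xxxxxxx: 1-byte length word, unaligned */
};

/*
 * Classify a varlena datum from its first header byte alone, per the
 * big-endian layout: the exact value 0x80 marks a TOAST pointer, any
 * other byte with the high bit set is a short 1-byte-header datum, and
 * otherwise the second-highest bit distinguishes compressed from
 * uncompressed 4-byte-header data.
 */
static enum va_kind
classify_first_byte(uint8_t hdr)
{
	if (hdr == 0x80)
		return VA_TOAST_POINTER;
	if (hdr & 0x80)
		return VA_1B_SHORT;
	if (hdr & 0x40)
		return VA_4B_COMPRESSED;
	return VA_4B_UNCOMPRESSED;
}
```

Length and type-tag decoding then proceed per-category, which is exactly why collapsing the per-byte patterns into one long run of x's loses information.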
Re: fix typos
On Tue, Aug 16, 2022 at 8:48 AM John Naylor wrote: > > On Fri, Aug 12, 2022 at 8:55 PM Tom Lane wrote: > > > > John Naylor writes: > > > This is really a straw-man proposal, since I'm not volunteering to do > > > the work, or suggest anybody else should do the same. That being the > > > case, it seems we should just go ahead with Justin's patch for > > > consistency. Possibly we could also change the messages to say "ID"? > > > > I'd be content if we change the user-facing messages (and documentation > > if any) to say "ID" not "OID". > > The documentation has both, so it makes sense to standardize on "ID". > The messages all had oid/OID, which I changed in the attached. I think > I got all the places. > > I'm thinking it's not wrong enough to confuse people, but consistency > is good, so backpatch to v15 and no further. Does anyone want to make > a case otherwise? This is done. -- John Naylor EDB: http://www.enterprisedb.com
Re: build remaining Flex files standalone
On Wed, Aug 17, 2022 at 8:14 AM Andres Freund wrote: > > Hi, > > On 2022-08-16 17:41:43 +0700, John Naylor wrote: > > For v3, I addressed some comments and added .h files to the > > headerscheck exceptions. > > Thanks! > > > > /* > > * NB: include bootparse.h only AFTER including bootstrap.h, because > > bootstrap.h > > * includes node definitions needed for YYSTYPE. > > */ > > > > Future cleanup: I see this in headerscheck: > > > > # We can't make these Bison output files compilable standalone > > # without using "%code require", which old Bison versions lack. > > # parser/gram.h will be included by parser/gramparse.h anyway. > > > > That directive has been supported in Bison since 2.4.2. > > 2.4.2 is from 2010. So I think we could just start relying on it? > > > > > > +/* functions shared between guc.c and guc-file.l */ > > > > [...] > > > I think I prefer your suggestion of a guc_internal.h upthread. > > > > Started in 0002, but left open the headerscheck failure. > > > > Also, if such a thing is meant to be #include'd only by two generated > > files, maybe it should just live in the directory where they live, and > > not in the src/include dir? > > It's not something we've done for the backend afaics, but I don't see a reason > not to start at some point. 
> > > > > > From 7d4ecfcb3e91f3b45e94b9e64c7c40f1bbd22aa8 Mon Sep 17 00:00:00 2001 > > > > From: John Naylor > > > > Date: Fri, 12 Aug 2022 15:45:24 +0700 > > > > Subject: [PATCH v201 2/9] Build booscanner.c standalone > > > > > > > -# bootscanner is compiled as part of bootparse > > > > -bootparse.o: bootscanner.c > > > > +# See notes in src/backend/parser/Makefile about the following two > > > > rules > > > > +bootparse.h: bootparse.c > > > > + touch $@ > > > > + > > > > +bootparse.c: BISONFLAGS += -d > > > > + > > > > +# Force these dependencies to be known even without dependency info > > > > built: > > > > +bootparse.o bootscan.o: bootparse.h > > > > > > Wonder if we could / should wrap this is something common. It's somewhat > > > annoying to repeat this stuff everywhere. > > > > I haven't looked at the Meson effort recently, but if the build rule > > is less annoying there, I'm inclined to leave this as a wart until > > autotools are retired. > > The only complicating thing in the rules there is the dependencies from one .c > file to another .c file. > > > > > > diff --git a/contrib/cube/cubedata.h b/contrib/cube/cubedata.h > > > > index dbe7d4f742..0b373048b5 100644 > > > > --- a/contrib/cube/cubedata.h > > > > +++ b/contrib/cube/cubedata.h > > > > @@ -67,3 +67,7 @@ extern void cube_scanner_finish(void); > > > > > > > > /* in cubeparse.y */ > > > > extern int cube_yyparse(NDBOX **result); > > > > + > > > > +/* All grammar constructs return strings */ > > > > +#define YYSTYPE char * > > > > > > Why does this need to be defined in a semi-public header? If we do this in > > > multiple files we'll end up with the danger of macro redefinition > > > warnings. For v4, I #defined YYSTYPE -- John Naylor EDB: http://www.enterprisedb.com
Re: build remaining Flex files standalone
> > > > > index dbe7d4f742..0b373048b5 100644 > > > > > --- a/contrib/cube/cubedata.h > > > > > +++ b/contrib/cube/cubedata.h > > > > > @@ -67,3 +67,7 @@ extern void cube_scanner_finish(void); > > > > > > > > > > /* in cubeparse.y */ > > > > > extern int cube_yyparse(NDBOX **result); > > > > > + > > > > > +/* All grammar constructs return strings */ > > > > > +#define YYSTYPE char * > > > > > > > > Why does this need to be defined in a semi-public header? If we do this > > > > in > > > > multiple files we'll end up with the danger of macro redefinition > > > > warnings. > > For v4, I #defined YYSTYPE Sorry for the misfire. Continuing on, I #defined YYSTYPE in cubescan.l before #including cubeparse.h. I also added scanbuflen to the %parse-param to prevent resorting to a global variable. The rest of the patches are unchanged. -- John Naylor EDB: http://www.enterprisedb.com From 2b95401d925bed67b2cb1eb9e8cdb1f1dd3bcc8e Mon Sep 17 00:00:00 2001 From: John Naylor Date: Tue, 16 Aug 2022 12:01:41 +0700 Subject: [PATCH v4 02/11] Move private declarations shared between guc.c and guc-file.l to new header FIXME: fails headerscheck --- src/backend/utils/misc/guc-file.l | 1 + src/backend/utils/misc/guc.c | 1 + src/include/utils/guc.h | 10 -- src/include/utils/guc_internal.h | 24 4 files changed, 26 insertions(+), 10 deletions(-) create mode 100644 src/include/utils/guc_internal.h diff --git a/src/backend/utils/misc/guc-file.l b/src/backend/utils/misc/guc-file.l index b4fa09749b..843838b1df 100644 --- a/src/backend/utils/misc/guc-file.l +++ b/src/backend/utils/misc/guc-file.l @@ -18,6 +18,7 @@ #include "miscadmin.h" #include "storage/fd.h" #include "utils/guc.h" +#include "utils/guc_internal.h" /* diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 66ab3912a0..293834fc13 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -100,6 +100,7 @@ #include "utils/builtins.h" #include "utils/bytea.h" #include "utils/float.h" 
+#include "utils/guc_internal.h" #include "utils/guc_tables.h" #include "utils/memutils.h" #include "utils/pg_locale.h" diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h index aae071cd82..45ae1b537f 100644 --- a/src/include/utils/guc.h +++ b/src/include/utils/guc.h @@ -442,16 +442,6 @@ extern void GUC_check_errcode(int sqlerrcode); pre_format_elog_string(errno, TEXTDOMAIN), \ GUC_check_errhint_string = format_elog_string -/* functions shared between guc.c and guc-file.l */ -extern int guc_name_compare(const char *namea, const char *nameb); -extern ConfigVariable *ProcessConfigFileInternal(GucContext context, - bool applySettings, int elevel); -extern void record_config_file_error(const char *errmsg, - const char *config_file, - int lineno, - ConfigVariable **head_p, - ConfigVariable **tail_p); - /* * The following functions are not in guc.c, but are declared here to avoid * having to include guc.h in some widely used headers that it really doesn't diff --git a/src/include/utils/guc_internal.h b/src/include/utils/guc_internal.h new file mode 100644 index 00..5d5db6bdce --- /dev/null +++ b/src/include/utils/guc_internal.h @@ -0,0 +1,24 @@ +/* + * guc_internals.h + * + * Declarations shared between backend/utils/misc/guc.c and + * backend/utils/misc/guc-file.l + * + * Copyright (c) 2000-2022, PostgreSQL Global Development Group + * + * src/include/utils/guc_internals.h + * + */ +#ifndef GUC_INTERNALS_H +#define GUC_INTERNALS_H + +extern int guc_name_compare(const char *namea, const char *nameb); +extern ConfigVariable *ProcessConfigFileInternal(GucContext context, + bool applySettings, int elevel); +extern void record_config_file_error(const char *errmsg, + const char *config_file, + int lineno, + ConfigVariable **head_p, + ConfigVariable **tail_p); + +#endif /* GUC_INTERNALS_H */ -- 2.36.1 From c066efea2193be8e7b21bb44c067383f34f37ec8 Mon Sep 17 00:00:00 2001 From: John Naylor Date: Tue, 16 Aug 2022 10:42:19 +0700 Subject: [PATCH v4 01/11] Preparatory 
refactoring for compiling guc-file.c standalone Mostly this involves moving ProcessConfigFileInternal() to guc.c and fixing the shared API to match. --- src/backend/utils/misc/guc-file.l | 360 +- src/backend/utils/misc/guc.c | 360 +- src/include/utils/guc.h |
Re: build remaining Flex files standalone
I wrote > [v4] This piece is a leftover from the last version, and forgot to remove it, will fix: diff --git a/contrib/cube/cubeparse.y b/contrib/cube/cubeparse.y index 7577c4515c..e3b750b695 100644 --- a/contrib/cube/cubeparse.y +++ b/contrib/cube/cubeparse.y @@ -7,6 +7,7 @@ #include "postgres.h" #include "cubedata.h" +#include "cube_internal.h" #include "utils/float.h" -- John Naylor EDB: http://www.enterprisedb.com
Re: Typo in pg_db_role_setting.h
On Mon, Aug 1, 2022 at 4:18 PM Japin Li wrote: > > > Hi, hackers > > I think there is a typo in pg_db_role_setting.h, should we fix it? > > diff --git a/src/include/catalog/pg_db_role_setting.h b/src/include/catalog/pg_db_role_setting.h > index 45d478e9e7..f92e867df4 100644 > /* > - * prototypes for functions in pg_db_role_setting.h > + * prototypes for functions in pg_db_role_setting.c > */ You are correct, but I wonder if it'd be better to just drop the comment entirely. I checked a couple other random headers with function declarations and they didn't have such a comment, and it's kind of obvious what they're for. -- John Naylor EDB: http://www.enterprisedb.com
Re: support for SSE2 intrinsics
On Tue, Aug 2, 2022 at 11:53 PM Nathan Bossart wrote: > I did a bit of cross-checking, and AFAICT this is a reasonable starting > point. emmintrin.h appears to be sufficient for one of my patches that > makes use of SSE2 instructions. That being said, I imagine it'll be > especially important to keep an eye on the buildfarm when this change is > committed. Thanks for checking! Here's a concrete patch for testing. -- John Naylor EDB: http://www.enterprisedb.com From 0e01cc8f5f1c2c27631953091a1d657c5e5486be Mon Sep 17 00:00:00 2001 From: John Naylor Date: Wed, 3 Aug 2022 11:07:40 +0700 Subject: [PATCH v1] Support SSE2 intrinsics where available SSE2 vector instructions are part of the spec for the 64-bit x86 architecture. Until now we have relied on the compiler to autovectorize in some limited situations, but some useful coding idioms can only be expressed explicitly via compiler intrinsics. To this end, add a header that defines USE_SSE2 when available. While x86-only for now, we can add other architectures in the future. This will also be the intended place for low-level hepler functions that use vector operations. Reviewed by Nathan Bossart Discussion: https://www.postgresql.org/message-id/CAFBsxsE2G_H_5Wbw%2BNOPm70-BK4xxKf86-mRzY%3DL2sLoQqM%2B-Q%40mail.gmail.com --- src/include/port/simd.h | 30 ++ 1 file changed, 30 insertions(+) create mode 100644 src/include/port/simd.h diff --git a/src/include/port/simd.h b/src/include/port/simd.h new file mode 100644 index 00..a571e79f57 --- /dev/null +++ b/src/include/port/simd.h @@ -0,0 +1,30 @@ +/*- + * + * simd.h + * Support for platform-specific vector operations. + * + * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/port/simd.h + * + *- + */ +#ifndef SIMD_H +#define SIMD_H + +/* + * SSE2 instructions are part of the spec for the 64-bit x86 ISA. 
We assume + * that compilers targeting this architecture understand SSE2 intrinsics. + * + * We use emmintrin.h rather than the comprehensive header immintrin.h in + * order to exclude extensions beyond SSE2. This is because MSVC, at least, + * will allow the use of intrinsics that haven't been enabled at compile + * time. + */ +#if (defined(__x86_64__) || defined(_M_AMD64)) +#include <emmintrin.h> +#define USE_SSE2 +#endif + +#endif /* SIMD_H */ -- 2.36.1
Re: optimize lookups in snapshot [sub]xip arrays
On Wed, Aug 3, 2022 at 6:43 AM Nathan Bossart wrote: > Just under half of the callers in 0002 require the offset, but I don't know > if any of those are worth optimizing in the first place. I'll change it > for now. It's easy enough to add it back in the future if required. Yeah, some of those callers will rarely have more than several elements to search in the first place, or aren't performance-sensitive. -- John Naylor EDB: http://www.enterprisedb.com
Re: Fix unmatched file identifications
On Tue, Aug 9, 2022 at 8:58 AM Masahiko Sawada wrote: > I found that there are two .c and .h files whose identification in the > header comment doesn't match its actual path. > The attached small patch fixes them. Pushed, thanks! -- John Naylor EDB: http://www.enterprisedb.com
Re: optimize lookups in snapshot [sub]xip arrays
On Tue, Aug 9, 2022 at 10:51 AM Nathan Bossart wrote: > Fixed in v10. I decided I wasn't quite comfortable changing snapshot handling without further guarantees. To this end, 0002 in the attached v11 is an addendum that adds assert checking (also pgindent and some comment-smithing). As I suspected, make check-world passes even with purposefully screwed-up coding. 0003 uses pg_lfind32 in syscache.c and I verified that sticking in the wrong answer will lead to a crash in assert-enabled builds in short order. I'd kind of like to throw this (or something else suitable) at the build farm first for that reason. It's simpler than the qsort/qunique/binary search that was there before, so that's nice, but I've not tried to test performance. -- John Naylor EDB: http://www.enterprisedb.com From 1c89b9b7d3c71bb1a703caaf7c01c93bc9e2515f Mon Sep 17 00:00:00 2001 From: John Naylor Date: Tue, 9 Aug 2022 11:51:52 +0700 Subject: [PATCH v11 2/4] Add assert checking, run pgindent, comment changes --- src/include/port/pg_lfind.h | 78 ++--- 1 file changed, 56 insertions(+), 22 deletions(-) diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h index 24de544f63..fb125977b2 100644 --- a/src/include/port/pg_lfind.h +++ b/src/include/port/pg_lfind.h @@ -18,51 +18,85 @@ /* * pg_lfind32 * - * Returns true if there is an element in 'base' that equals 'key'. Otherwise, - * returns false. + * Return true if there is an element in 'base' that equals 'key', otherwise + * return false. */ static inline bool pg_lfind32(uint32 key, uint32 *base, uint32 nelem) { uint32 i = 0; - /* If possible, use SSE2 intrinsics to speed up the search. */ + /* Use SIMD intrinsics where available. */ #ifdef USE_SSE2 - const __m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */ - uint32 iterations = nelem & ~0xF;/* round down to multiple of 16 */ - for (; i < iterations; i += 16) + /* + * A 16-byte register only has four 4-byte lanes. 
For better + * instruction-level parallelism, each loop iteration operates on a block + * of four registers. Testing has showed this is ~40% faster than using a + * block of two registers. + */ + const __m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */ + uint32 iterations = nelem & ~0xF; /* round down to multiple of 16 */ + +#if defined(USE_ASSERT_CHECKING) + bool assert_result = false; + + /* pre-compute the result for assert checking */ + for (i = 0; i < nelem; i++) { - /* load the next 16 values into __m128i variables */ - const __m128i vals1 = _mm_loadu_si128((__m128i *) [i]); - const __m128i vals2 = _mm_loadu_si128((__m128i *) [i + 4]); - const __m128i vals3 = _mm_loadu_si128((__m128i *) [i + 8]); - const __m128i vals4 = _mm_loadu_si128((__m128i *) [i + 12]); + if (key == base[i]) + { + assert_result = true; + break; + } + } +#endif - /* perform the comparisons */ - const __m128i result1 = _mm_cmpeq_epi32(keys, vals1); - const __m128i result2 = _mm_cmpeq_epi32(keys, vals2); - const __m128i result3 = _mm_cmpeq_epi32(keys, vals3); - const __m128i result4 = _mm_cmpeq_epi32(keys, vals4); + for (i = 0; i < iterations; i += 16) + { + /* load the next block into 4 registers holding 4 values each */ + const __m128i vals1 = _mm_loadu_si128((__m128i *) & base[i]); + const __m128i vals2 = _mm_loadu_si128((__m128i *) & base[i + 4]); + const __m128i vals3 = _mm_loadu_si128((__m128i *) & base[i + 8]); + const __m128i vals4 = _mm_loadu_si128((__m128i *) & base[i + 12]); - /* shrink the results into a single variable */ - const __m128i tmp1 = _mm_or_si128(result1, result2); - const __m128i tmp2 = _mm_or_si128(result3, result4); - const __m128i result = _mm_or_si128(tmp1, tmp2); + /* compare each value to the key */ + const __m128i result1 = _mm_cmpeq_epi32(keys, vals1); + const __m128i result2 = _mm_cmpeq_epi32(keys, vals2); + const __m128i result3 = _mm_cmpeq_epi32(keys, vals3); + const __m128i result4 = _mm_cmpeq_epi32(keys, vals4); + + /* combine the results 
into a single variable */ + const __m128i tmp1 = _mm_or_si128(result1, result2); + const __m128i tmp2 = _mm_or_si128(result3, result4); + const __m128i result = _mm_or_si128(tmp1, tmp2); /* see if there was a match */ if (_mm_movemask_epi8(result) != 0) + { +#if defined(USE_ASSERT_CHECKING) + Assert(assert_result == true); +#endif return true; + } } -#endif +#endif /* USE_SSE2 */ - /* Process the remaining elements the slow way. */ + /* Process the remaining elements one at a time. */ for (; i < nelem; i++) { if (key == base[i]) + { +#if defined(USE_SSE2) && defined(USE_ASSERT_CHECKING) + Assert(assert_result == true); +#endif return true; + } } +#if defined(USE_SSE2) && defined(USE_ASSERT_CHECKING) + Assert(assert_result == false); +#endif return false; } -- 2.36.1 From ff77224f9227bcff88a68e63f39754296810351c Mon
Re: failing to build preproc.c on solaris with sun studio
On Sun, Aug 7, 2022 at 7:05 AM Tom Lane wrote:

> Even on a modern Linux:
>
> $ size src/backend/parser/gram.o
>    text    data     bss     dec     hex filename
>  656568       0       0  656568   a04b8 src/backend/parser/gram.o
> $ size src/interfaces/ecpg/preproc/preproc.o
>    text    data     bss     dec     hex filename
>  912005 1887348 919541 e07f5 src/interfaces/ecpg/preproc/preproc.o
>
> So there's something pretty bloated there. It doesn't seem like
> ecpg's additional productions should justify a nigh 50% code
> size increase.

Comparing gram.o with preproc.o:

$ objdump -t src/backend/parser/gram.o | grep yy | grep -v UND | awk '{print $5, $6}' | sort -r | head -n3
0003a24a yytable
0003a24a yycheck
00013672 base_yyparse

$ objdump -t src/interfaces/ecpg/preproc/preproc.o | grep yy | grep -v UND | awk '{print $5, $6}' | sort -r | head -n3
0004d8e2 yytable
0004d8e2 yycheck
0002841e base_yyparse

The largest lookup tables are ~25% bigger (other tables are trivial in comparison), and the function base_yyparse is about double the size, most of which is a giant switch statement with 2510 / 3912 cases, respectively. That difference does seem excessive.

I've long wondered whether it would be possible or feasible to have stricter separation between C, ECPG commands, and SQL. That sounds like a huge amount of work, though.

Playing around with the compiler flags on preproc.c, I get these compile times, gcc memory usage as reported by /usr/bin/time -v, and symbol sizes (non-debug build):

-O2: time 8.0s
Maximum resident set size (kbytes): 255884

-O1: time 6.3s
Maximum resident set size (kbytes): 170636
0004d8e2 yytable
0004d8e2 yycheck
000292de base_yyparse

-O0: time 2.9s
Maximum resident set size (kbytes): 153148
0004d8e2 yytable
0004d8e2 yycheck
0003585e base_yyparse

Note that -O0 bloats the binary, probably because it no longer uses a jump table. -O1 might be worth it just to reduce build times for slower animals, even though Noah reported upthread that this didn't help the issue.
I suspect it wouldn't slow down production use much since the output needs to be compiled anyway. -- John Naylor EDB: http://www.enterprisedb.com
remap the .text segment into huge pages at run time
feedback on whether the entire approach in this thread is sound enough to justify working further on.

[1] https://www.cs.rochester.edu/u/sandhya/papers/ispass19.pdf (paper: "On the Impact of Instruction Address Translation Overhead")
[2] https://twitter.com/AndresFreundTec/status/1214305610172289024
[3] https://github.com/intel/iodlr

-- John Naylor EDB: http://www.enterprisedb.com

From 9cde401f87937c1982f2355c8f81449514166376 Mon Sep 17 00:00:00 2001
From: John Naylor
Date: Mon, 31 Oct 2022 13:59:30 +0700
Subject: [PATCH v1 2/2] Put all non-cold .text in huge pages

Tell linker to align addresses on 2MB boundaries. The .init section
will be so aligned, with the .text section soon after that. Therefore,
the start address of .text should always be aligned up to nearly 2MB
ahead of the actual start. The first nearly 2MB of .text will not map
to huge pages.

We count on cold sections linking to the front of the .text segment:
Since the cold sections total about 600kB in size, we need ~1.4MB of
additional padding to keep non-cold pages mappable to huge pages. Since
PG has about 5.0MB of .text, we also need an additional 1MB to push the
.text end just past an aligned boundary, so that when we align the end
down, only a small number of pages will remain un-remapped at their
original 4kB size.
---
 meson.build                  |  3 +++
 src/backend/port/filler.c    | 29 +
 src/backend/port/meson.build |  3 +++
 3 files changed, 35 insertions(+)
 create mode 100644 src/backend/port/filler.c

diff --git a/meson.build b/meson.build
index bfacbdc0af..450946370c 100644
--- a/meson.build
+++ b/meson.build
@@ -239,6 +239,9 @@ elif host_system == 'freebsd'
 elif host_system == 'linux'
   sema_kind = 'unnamed_posix'
   cppflags += '-D_GNU_SOURCE'
+  # WIP: debug builds are huge
+  # TODO: add portability check
+  ldflags += ['-Wl,-zcommon-page-size=2097152', '-Wl,-zmax-page-size=2097152']
 elif host_system == 'netbsd'
   # We must resolve all dynamic linking in the core server at program start.
diff --git a/src/backend/port/filler.c b/src/backend/port/filler.c
new file mode 100644
index 00..de4e33bb05
--- /dev/null
+++ b/src/backend/port/filler.c
@@ -0,0 +1,29 @@
+/*
+ * Add enough padding to .text segment to bring the end just
+ * past a 2MB alignment boundary. In practice, this means .text needs
+ * to be at least 8MB. It shouldn't be much larger than this,
+ * because then more hot pages will remain in 4kB pages.
+ *
+ * FIXME: With this filler added, if explicit huge pages are turned off
+ * in the kernel, attempting mmap() with MAP_HUGETLB causes a crash
+ * instead of reporting failure if the .text segment is larger than 8MB.
+ *
+ * See MapStaticCodeToLargePages() in large_page.c
+ *
+ * XXX: The exact amount of filler must be determined experimentally
+ * on platforms of interest, in non-assert builds.
+ *
+ */
+static void
+__attribute__((used))
+__attribute__((cold))
+fill_function(int x)
+{
+	/* TODO: More architectures */
+#ifdef __x86_64__
+	__asm__(
+		".fill 3251000"
+	);
+#endif
+	(void) x;
+}
\ No newline at end of file
diff --git a/src/backend/port/meson.build b/src/backend/port/meson.build
index 5ab65115e9..d876712e0c 100644
--- a/src/backend/port/meson.build
+++ b/src/backend/port/meson.build
@@ -16,6 +16,9 @@ if cdata.has('USE_WIN32_SEMAPHORES')
 endif

 if cdata.has('USE_SYSV_SHARED_MEMORY')
+  if host_system == 'linux'
+    backend_sources += files('filler.c')
+  endif
   backend_sources += files('large_page.c')
   backend_sources += files('sysv_shmem.c')
 endif
-- 
2.37.3

From 0012baab70779f5fc06c8717392dc76e8f156270 Mon Sep 17 00:00:00 2001
From: John Naylor
Date: Mon, 31 Oct 2022 15:24:29 +0700
Subject: [PATCH v1 1/2] Partly remap the .text segment into huge pages at postmaster start

Based on the MIT-licensed library at https://github.com/intel/iodlr

The basic steps are:

- read ELF info to get the start/end addresses of the .text segment
- calculate addresses therein aligned at huge page boundaries
- mmap temporary region and memcpy aligned portion of
.text segment
- mmap start address to new region with huge pages and MAP_FIXED
- memcpy over and revoke the PROT_WRITE bit

The Postgres .text segment is ~5.0MB in a non-assert build, so this method can put 2-4MB into huge pages.
---
 src/backend/port/large_page.c       | 348 
 src/backend/port/meson.build        |   1 +
 src/backend/postmaster/postmaster.c |   7 +
 src/include/port/large_page.h       |  18 ++
 4 files changed, 374 insertions(+)
 create mode 100644 src/backend/port/large_page.c
 create mode 100644 src/include/port/large_page.h

diff --git a/src/backend/port/large_page.c b/src/backend/port/large_page.c
new file mode 100644
index 00..66a584f785
--- /dev/null
+++ b/src/backend/port/large_page.c
@@ -0,0 +1,348 @@
+/*-
+ *
+ * large_page.c
+ * Map .text segment of binary to hu
Re: Incorrect include file order in guc-file.l
On Wed, Nov 2, 2022 at 1:01 PM Julien Rouhaud wrote: > > Hi, > > On Wed, Nov 02, 2022 at 02:29:50PM +0900, Michael Paquier wrote: > > > > While reviewing a different patch, I have noticed that guc-file.l > > includes sys/stat.h in the middle of the PG internal headers. The > > usual practice is to have first postgres[_fe].h, followed by the > > system headers and finally the internal headers. That's a nit, but > > all the other files do that. > > > > {be,fe}-secure-openssl.c include some exceptions though, as documented > > there. > > Agreed, it's apparently an oversight in dac048f71eb. +1 for the patch. Yeah. +1, thanks. -- John Naylor EDB: http://www.enterprisedb.com
Re: appendBinaryStringInfo stuff
On Thu, Dec 22, 2022 at 4:19 PM David Rowley wrote:
>
> Test 1 (from earlier)
>
> master + escape_json using appendStringInfoCharMacro
> $ pgbench -n -T 60 -f bench.sql -M prepared postgres | grep latency
> latency average = 1.807 ms
> latency average = 1.800 ms
> latency average = 1.812 ms (~4.8% faster than master)

> 23.05% postgres [.] pg_utf_mblen

I get about 20% improvement by adding an ascii fast path in pg_mbstrlen_with_len, which I think would work with all encodings we support:

@@ -1064,7 +1064,12 @@ pg_mbstrlen_with_len(const char *mbstr, int limit)
 	while (limit > 0 && *mbstr)
 	{
-		int			l = pg_mblen(mbstr);
+		int			l;
+
+		if (!IS_HIGHBIT_SET(*mbstr))
+			l = 1;
+		else
+			l = pg_mblen(mbstr);

-- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Dec 21, 2022 at 3:09 PM Masahiko Sawada wrote: > > On Tue, Dec 20, 2022 at 3:09 PM John Naylor > wrote: > > https://www.postgresql.org/message-id/20220704211822.kfxtzpcdmslzm2dy%40awork3.anarazel.de > > > > I'm guessing the hash join case can afford to be precise about memory because it must spill to disk when exceeding workmem. We don't have that design constraint. > > You mean that the memory used by the radix tree should be limited not > by the amount of memory actually used, but by the amount of memory > allocated? In other words, it checks by MomoryContextMemAllocated() in > the local cases and by dsa_get_total_size() in the shared case. I mean, if this patch set uses 10x less memory than v15 (not always, but easy to find cases where it does), and if it's also expensive to track memory use precisely, then we don't have an incentive to track memory precisely. Even if we did, we don't want to assume that every future caller of radix tree is willing to incur that cost. > The idea of using up to half of maintenance_work_mem might be a good > idea compared to the current flat-array solution. But since it only > uses half, I'm concerned that there will be users who double their > maintenace_work_mem. When it is improved, the user needs to restore > maintenance_work_mem again. I find it useful to step back and look at the usage patterns: Autovacuum: Limiting the memory allocated by vacuum is important, since there are multiple workers and they can run at any time (possibly most of the time). This case will not use parallel index vacuum, so will use slab, where the quick estimation of memory taken by the context is not terribly far off, so we can afford to be more optimistic here. Manual vacuum: The default configuration assumes we want to finish as soon as possible (vacuum_cost_delay is zero). Parallel index vacuum can be used. 
My experience leads me to believe users are willing to use a lot of memory to make manual vacuum finish as quickly as possible, and are disappointed to learn that even if maintenance_work_mem is 10GB, vacuum can only use 1GB.

So I don't believe anyone will have to double maintenance_work_mem after upgrading (even with pessimistic accounting) because we'll be both
- much more efficient with memory on average
- free from the 1GB cap

That said, it's possible 50% is too pessimistic -- a 75% threshold will bring us very close to powers of two, for example:

2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> keep going
766 + 256 = 1022MB -> stop

I'm not sure if that calculation could cause going over the limit, or how common that would be.

-- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Dec 22, 2022 at 10:00 PM Masahiko Sawada wrote: > If the value is a power of 2, it seems to work perfectly fine. But for > example if it's 700MB, the total memory exceeds the limit: > > 2*(1+2+4+8+16+32+64+128) = 510MB (72.8% of 700MB) -> keep going > 510 + 256 = 766MB -> stop but it exceeds the limit. > > In a more bigger case, if it's 11000MB, > > 2*(1+2+...+2048) = 8190MB (74.4%) > 8190 + 4096 = 12286MB > > That being said, I don't think they are not common cases. So the 75% > threshold seems to work fine in most cases. Thinking some more, I agree this doesn't have large practical risk, but thinking from the point of view of the community, being loose with memory limits by up to 10% is not a good precedent. Perhaps we can be clever and use 75% when the limit is a power of two and 50% otherwise. I'm skeptical of trying to be clever, and I just thought of an additional concern: We're assuming behavior of the growth in size of new DSA segments, which could possibly change. Given how allocators are typically coded, though, it seems safe to assume that they'll at most double in size. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Dec 27, 2022 at 12:14 AM Masahiko Sawada wrote: > > On Fri, Dec 23, 2022 at 8:47 PM John Naylor > wrote: > These 4 patches make sense to me.We can merge them into 0002 patch Okay, then I'll squash them when I post my next patch. > and I'll do similar changes for functions for leaf nodes as well. I assume you meant something else? -- some of the differences between inner and leaf are already abstracted away. In any case, some things are still half-baked, so please wait until my next patch before doing work on these files. Also, CI found a bug on 32-bit -- I know what I missed and will fix next week. > > 0010 and 0011 template a common implementation for both leaf and inner nodes for searching and inserting. > > > > 0012: While at it, I couldn't resist using this technique to separate out delete from search, which makes sense and might give a small performance boost (at least on less capable hardware). I haven't got to the iteration functions, but they should be straightforward. Two things came to mind since I posted this, which I'll make clear next patch: - A good compiler will get rid of branches when inlining, so maybe no difference in code generation, but it still looks nicer this way. - Delete should really use its own template, because it only _accidentally_ looks like search because we don't yet shrink nodes. > What do you > think about how we can expand this template method to deal with DSA > memory? I imagined that we load say radixtree_template.h with some > macros to use the radix tree like we do for simplehash.h. And > radixtree_template.h further loads xxx_impl.h files for some internal > functions. Right, I was thinking the same. I wanted to start small and look for opportunities to shrink the code footprint. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada wrote: > It looks no problem in terms of vacuum integration, although I've not > fully tested yet. TID store uses the radix tree as the main storage, > and with the template radix tree, the data types for shared and > non-shared will be different. TID store can have an union for the > radix tree and the structure would be like follows: > /* Storage for Tids */ > union tree > { > local_radix_tree*local; > shared_radix_tree *shared; > }; We could possibly go back to using a common data type for this, but with unused fields in each setting, as before. We would have to be more careful of things like the 32-bit crash from a few weeks ago. > In the functions of TID store, we need to call either local or shared > radix tree functions depending on whether TID store is shared or not. > We need if-branch for each key-value pair insertion, but I think it > would not be a big performance problem in TID store use cases, since > vacuum is an I/O intensive operation in many cases. Also, the branch will be easily predicted. That was still true in earlier patches, but with many more branches and fatter code paths. > Overall, I think > there is no problem and I'll investigate it in depth. Okay, great. If the separate-functions approach turns out to be ugly, we can always go back to the branching approach for shared memory. I think we'll want to keep this as a template overall, at least to allow different value types and to ease adding variable-length keys if someone finds a need. > Apart from that, I've been considering the lock support for shared > radix tree. As we discussed before, the current usage (i.e, only > parallel index vacuum) doesn't require locking support at all, so it > would be enough to have a single lock for simplicity. Right, that should be enough for PG16. 
> If we want to > use the shared radix tree for other use cases such as the parallel > heap vacuum or the replacement of the hash table for shared buffers, > we would need better lock support. For future parallel pruning, I still think a global lock is "probably" fine if the workers buffer in local arrays. Highly concurrent applications will need additional work, of course. > For example, if we want to support > Optimistic Lock Coupling[1], Interesting, from the same authors! > we would need to change not only the node > structure but also the logic. Which probably leads to widen the gap > between the code for non-shared and shared radix tree. In this case, > once we have a better radix tree optimized for shared case, perhaps we > can replace the templated shared radix tree with it. I'd like to hear > your opinion on this line. I'm not in a position to speculate on how best to do scalable concurrency, much less how it should coexist with the local implementation. It's interesting that their "ROWEX" scheme gives up maintaining order in the linear nodes. > > One review point I'll mention: Somehow I didn't notice there is no use for the "chunk" field in the rt_node type -- it's only set to zero and copied when growing. What is the purpose? Removing it would allow the smallest node to take up only 32 bytes with a fanout of 3, by eliminating padding. > > Oh, I didn't notice that. The chunk field was originally used when > redirecting the child pointer in the parent node from old to new > (grown) node. When redirecting the pointer, since the corresponding > chunk surely exists on the parent we can skip existence checks. > Currently we use RT_NODE_UPDATE_INNER() for that (see > RT_REPLACE_NODE()) but having a dedicated function to update the > existing chunk and child pointer might improve the performance. Or > reducing the node size by getting rid of the chunk field might be > better. I see. 
IIUC from a brief re-reading of the code, saving that chunk would only save us from re-loading "parent->shift" from L1 cache and shifting the key. The cycles spent doing that seem small compared to the rest of the work involved in growing a node. Expressions like "if (idx < 0) return false;" refer to an asserts-only variable, so in production builds, I would hope that branch gets elided (I haven't checked). I'm quite keen on making the smallest node padding-free (since we don't yet have path compression or lazy path expansion), and this seems the way to get there. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PATCH] Simple code cleanup in tuplesort.c.
On Mon, Jan 9, 2023 at 7:29 PM Xing Guo wrote: > > Thank you John. This is my first patch, I'll keep it in mind that > adding a version number next time I sending the patch. Welcome to the community! You may also consider reviewing a patch from the current commitfest, since we can always use additional help there. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PATCH] Improve ability to display optimizer analysis using OPTIMIZER_DEBUG
On Tue, Jan 3, 2023 at 1:59 PM Ankit Kumar Pandey wrote: > > > On 03/01/23 08:38, David Rowley wrote: > > > > Do you actually have a need for this or are you just trying to tick > > off some TODO items? > > > I would say latter but reason I picked it up was more on side of > learning optimizer better. Note that the TODO list has accumulated some cruft over the years. Some time ago I started an effort to remove outdated/undesirable entries, and I should get back to that, but for the present, please take the warning at the top to heart: "WARNING for Developers: Unfortunately this list does not contain all the information necessary for someone to start coding a feature. Some of these items might have become unnecessary since they were added --- others might be desirable but the implementation might be unclear. When selecting items listed below, be prepared to first discuss the value of the feature. Do not assume that you can select one, code it and then expect it to be committed. " -- John Naylor EDB: http://www.enterprisedb.com
Re: cutting down the TODO list thread
So, I had intended to spend some time on this at least three times a year. I've clearly failed at that, but now is as good a time as any to pick it back up again. Over in [1], Tom opined: > John Naylor writes: > > > "WARNING for Developers: Unfortunately this list does not contain all the > > information necessary for someone to start coding a feature. Some of these > > items might have become unnecessary since they were added --- others might > > be desirable but the implementation might be unclear. When selecting items > > listed below, be prepared to first discuss the value of the feature. Do not > > assume that you can select one, code it and then expect it to be committed. > > " > > I think we could make that even stronger: there's basically nothing on > the TODO list that isn't problematic in some way. Otherwise it would > have been done already. The entries involve large amounts of work, > or things that are subtler than they might appear, or cases where the > desirable semantics aren't clear, or tradeoffs that there's not > consensus about, or combinations of those. > > IME it's typically a lot more productive to approach things via > "scratch your own itch". If a problem is biting you directly, then > at least you have some clear idea of what it is that needs to be fixed. > You might have to work up to an understanding of how to fix it, but > you have a clear goal. I've come up with some revised language, including s/15/16/ and removing the category of "[E]" (easier to implement), since it wouldn't be here if it were actually easy: -- WARNING for Developers: This list contains some known PostgreSQL bugs, some feature requests, and some things we are not even sure we want. This is not meant to be a resource for beginning developers to get ideas for things to work on. All of these items are hard, and some are perhaps impossible. Some of these items might have become unnecessary since they were added. 
Others might be desirable but:
- a large amount of work is required
- the problems are subtler than they might appear
- the desirable semantics aren't clear
- there are tradeoffs that there's not consensus about
- some combination of the above

If you really need a feature that is listed below, it will be worth reading the linked email thread if there is one, since it will often show the difficulties, or perhaps contain previous failed attempts to get a patch committed. If after that you still want to work on it, be prepared to first discuss the value of the feature. Do not assume that you can start coding and expect it to be committed. Always discuss design on the Hackers list before starting to code.

Over time, it may become clear that a TODO item has become outdated or otherwise determined to be either too controversial or not worth the development effort. Such items should be retired to the Not Worth Doing page.

[D] marks changes that are done, and will appear in the PostgreSQL 16 release.
--

We could also revise the developer FAQ:
- remove the phrase "Outstanding features are detailed in Todo."
- add a suggestion to check the Todo or Not_worth_doing pages to see if the desired feature is undesirable or problematic
- rephrase "Working in isolation is not advisable because others might be working on the same TODO item, or you might have misunderstood the TODO item." so it doesn't mention 'TODO' at all.

[1] https://www.postgresql.org/message-id/415636.1673411259%40sss.pgh.pa.us

-- John Naylor EDB: http://www.enterprisedb.com
Re: SQL/JSON revisited
On Wed, Jan 11, 2023 at 2:02 PM Elena Indrupskaya < e.indrupsk...@postgrespro.ru> wrote: > > Sorry for upsetting your bot. :( What I do in these cases is save the incremental patch as a .txt file -- that way people can read it, but the cf bot doesn't try to launch a CI run. And if I forget that detail, well it's not a big deal, it happens sometimes. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PATCH] Simple code cleanup in tuplesort.c.
On Fri, Sep 16, 2022 at 1:43 PM Richard Guo wrote: > > > On Wed, Jul 27, 2022 at 5:10 PM Xing Guo wrote: >> >> The bounded heap sorting status flag is set twice in sort_bounded_heap() and tuplesort_performsort(). This patch helps remove one of them. > > > Revisiting this patch I think maybe it's better to remove the setting of > Tuplesort status from tuplesort_performsort() for the TSS_BOUNDED case. > Thus we keep the heap manipulation routines, make_bounded_heap and > sort_bounded_heap, consistent in that they update their status > accordingly inside the function. The label TSS_BUILDRUNS has a similar style and also the following comment, so I will push this patch with a similar comment added unless someone wants to make a case for doing otherwise. * Note that mergeruns sets the correct state->status. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
I wrote: > I see. IIUC from a brief re-reading of the code, saving that chunk would only save us from re-loading "parent->shift" from L1 cache and shifting the key. The cycles spent doing that seem small compared to the rest of the work involved in growing a node. Expressions like "if (idx < 0) return false;" refer to an asserts-only variable, so in production builds, I would hope that branch gets elided (I haven't checked). On further reflection, this is completely false, and I'm not sure what I was thinking. However, for the update-inner case maybe we can assert that we found a valid slot. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PATCH] Improve ability to display optimizer analysis using OPTIMIZER_DEBUG
On Wed, Jan 11, 2023 at 1:38 PM Ankit Kumar Pandey wrote: > > > > On 11/01/23 09:57, Tom Lane wrote: > > IME it's typically a lot more productive to approach things via > > "scratch your own itch". If a problem is biting you directly, then > > at least you have some clear idea of what it is that needs to be fixed. > > You might have to work up to an understanding of how to fix it, but > > you have a clear goal. > > > Question is, how newcomers should start contribution if they are not > coming with a problem in their hand? I would say find something that gets you excited. Worked for me, at least. > Todo list is possibly first thing anyone, who is willing to contribute > is going to read and for a new Yeah, that's a problem we need to address. > That being said, I think this is part of learning process and okay to > come up with ideas and fail. Of course it is! A key skill in engineering is to fail as quickly as possible, preferably before doing any actual work. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PATCH] Simple code cleanup in tuplesort.c.
On Thu, Jan 5, 2023 at 8:18 AM John Naylor wrote: > > The label TSS_BUILDRUNS has a similar style and also the following comment, so I will push this patch with a similar comment added unless someone wants to make a case for doing otherwise. > > * Note that mergeruns sets the correct state->status. This has been pushed, thanks. Note that both patches in this thread have the same name. Adding a version number to the name is a good way to distinguish them. -- John Naylor EDB: http://www.enterprisedb.com
Re: Small miscellaneus fixes (Part II)
On Fri, Dec 23, 2022 at 8:08 AM Justin Pryzby wrote:

> Makes sense now (in your first message, you said that the problem was
> with "sign", and the patch didn't address the actual problem in
> IS_PLUS()).
>
> One can look and find that the unreachable code was introduced at
> 7a3e7b64a.
>
> With your proposed change, the unreachable line is hit by regression
> tests, which is an improvment. As is the change to pg_dump.c.

But that now-reachable line just unsets a flag that we previously found unset, right? And if that line was unreachable, then surely the previous flag-clearing operation is too?

5669  994426 :     if (IS_MINUS(Np->Num))   // <- also always false
5670       0 :         Np->Num->flag &= ~NUM_F_MINUS;
5671           : }
5672     524 :     else if (Np->sign != '+' && IS_PLUS(Np->Num))
5673       0 :         Np->Num->flag &= ~NUM_F_PLUS;

https://coverage.postgresql.org/src/backend/utils/adt/formatting.c.gcov.html

I'm inclined to turn the dead unsets into asserts.

-- John Naylor EDB: http://www.enterprisedb.com
Re: Small miscellaneus fixes (Part II)
On Thu, Jan 12, 2023 at 12:34 PM Justin Pryzby wrote: > > On Thu, Jan 12, 2023 at 12:15:24PM +0700, John Naylor wrote: > > On Fri, Dec 23, 2022 at 8:08 AM Justin Pryzby wrote: > > > > > Makes sense now (in your first message, you said that the problem was > > > with "sign", and the patch didn't address the actual problem in > > > IS_PLUS()). > > > > > > One can look and find that the unreachable code was introduced at > > > 7a3e7b64a. > > > > > > With your proposed change, the unreachable line is hit by regression > > > tests, which is an improvment. As is the change to pg_dump.c. > > > > But that now reachable line just unsets a flag that we previously found > > unset, right? > > Good point. > > > And if that line was unreachable, then surely the previous flag-clearing > > operation is too? > > > > 5669 994426 : if (IS_MINUS(Np->Num)) // <- also always > > false > > 5670 0 : Np->Num->flag &= ~NUM_F_MINUS; > > 5671 : } > > 5672 524 : else if (Np->sign != '+' && IS_PLUS(Np->Num)) > > 5673 0 : Np->Num->flag &= ~NUM_F_PLUS; > > > > https://coverage.postgresql.org/src/backend/utils/adt/formatting.c.gcov.html > > > > I'm inclined to turn the dead unsets into asserts. > > To be clear - did you mean like this ? 
> > diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
> > index a4b524ea3ac..848956879f5 100644
> > --- a/src/backend/utils/adt/formatting.c
> > +++ b/src/backend/utils/adt/formatting.c
> > @@ -5662,15 +5662,13 @@ NUM_processor(FormatNode *node, NUMDesc *Num, char *inout,
> >  	}
> >  	else
> >  	{
> > +		Assert(!IS_MINUS(Np->Num));
> > +		Assert(!IS_PLUS(Np->Num));
> >  		if (Np->sign != '-')
> >  		{
> >  			if (IS_BRACKET(Np->Num) && IS_FILLMODE(Np->Num))
> >  				Np->Num->flag &= ~NUM_F_BRACKET;
> > -			if (IS_MINUS(Np->Num))
> > -				Np->Num->flag &= ~NUM_F_MINUS;
> >  		}
> > -		else if (Np->sign != '+' && IS_PLUS(Np->Num))
> > -			Np->Num->flag &= ~NUM_F_PLUS;
> >
> >  		if (Np->sign == '+' && IS_FILLMODE(Np->Num) && IS_LSIGN(Np->Num) == false)
> >  			Np->sign_wrote = true; /* needn't sign */

I was originally thinking of something more localized:

--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -5666,11 +5666,10 @@ NUM_processor(FormatNode *node, NUMDesc *Num, char *inout,
 		{
 			if (IS_BRACKET(Np->Num) && IS_FILLMODE(Np->Num))
 				Np->Num->flag &= ~NUM_F_BRACKET;
-			if (IS_MINUS(Np->Num))
-				Np->Num->flag &= ~NUM_F_MINUS;
+			Assert(!IS_MINUS(Np->Num));
 		}
-		else if (Np->sign != '+' && IS_PLUS(Np->Num))
-			Np->Num->flag &= ~NUM_F_PLUS;
+		else if (Np->sign != '+')
+			Assert(!IS_PLUS(Np->Num));

...but arguably the earlier check is close enough that it's silly to assert in the "else" branch, and I'd be okay with just nuking those lines. Another thing that caught my attention is the assumption that unsetting a bit is so expensive that we have to first check if it's set, so we may as well remove "IS_BRACKET(Np->Num)" as well.

-- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Jan 12, 2023 at 12:44 PM Masahiko Sawada wrote: > > On Wed, Jan 11, 2023 at 12:13 PM John Naylor > wrote: > > > > On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada wrote: > I agree to keep this as a template. Okay, I'll squash the previous patch and work on cleaning up the internals. I'll keep the external APIs the same so that your work on vacuum integration can be easily rebased on top of that, and we can work independently. > From the vacuum integration > perspective, it would be better if we can use a common data type for > shared and local. It makes sense to have different data types if the > radix trees have different values types. I agree it would be better, all else being equal. I have some further thoughts below. > > > It looks no problem in terms of vacuum integration, although I've not > > > fully tested yet. TID store uses the radix tree as the main storage, > > > and with the template radix tree, the data types for shared and > > > non-shared will be different. TID store can have an union for the > > > radix tree and the structure would be like follows: > > > > > /* Storage for Tids */ > > > union tree > > > { > > > local_radix_tree*local; > > > shared_radix_tree *shared; > > > }; > > > > We could possibly go back to using a common data type for this, but with unused fields in each setting, as before. We would have to be more careful of things like the 32-bit crash from a few weeks ago. > > One idea to have a common data type without unused fields is to use > radix_tree a base class. We cast it to radix_tree_shared or > radix_tree_local depending on the flag is_shared in radix_tree. For > instance we have like (based on non-template version), > struct radix_tree > { > boolis_shared; > MemoryContext context; > }; That could work in principle. My first impression is, just a memory context is not much of a base class. Also, casts can creep into a large number of places. 
Another thought came to mind: I'm guessing the TID store is unusual -- meaning most uses of radix tree will only need one kind of memory (local/shared). I could be wrong about that, and it _is_ a guess about the future. If true, then it makes more sense that only code that needs both memory kinds should be responsible for keeping them separate.

The template might be easier for future use cases if shared memory were all-or-nothing, meaning either:

- completely different functions and types depending on RT_SHMEM
- use branches (like v8)

The union sounds like a good thing to try, but do whatever seems right.

-- John Naylor EDB: http://www.enterprisedb.com
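To make the quoted union idea concrete, here is a minimal sketch of a tagged wrapper the TID store could use. All type and field names here are hypothetical stand-ins for what the template would generate; the NULL assignments mark where the real create functions would go.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-ins for the two template-generated radix tree types */
typedef struct local_radix_tree  { int dummy; } local_radix_tree;
typedef struct shared_radix_tree { int dummy; } shared_radix_tree;

/* TID store holding either kind, tagged so callers can dispatch */
typedef struct TidStore
{
	bool		is_shared;
	union
	{
		local_radix_tree  *local;
		shared_radix_tree *shared;
	}			tree;
} TidStore;

static TidStore *
tidstore_create(bool is_shared)
{
	TidStore   *ts = calloc(1, sizeof(TidStore));

	ts->is_shared = is_shared;
	if (is_shared)
		ts->tree.shared = NULL;	/* shared_rt_create(dsa_area) in real code */
	else
		ts->tree.local = NULL;	/* local_rt_create(mem_context) in real code */
	return ts;
}
```

The point is that only the TID store carries the tag and the branch; the template-generated trees themselves stay unaware of each other.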
Re: static assert cleanup
On Fri, Dec 9, 2022 at 2:47 PM Peter Eisentraut < peter.eisentr...@enterprisedb.com> wrote: > > 0003-Move-some-static-assertions-to-better-places.patch > > This moves some that I thought were suboptimally placed but it could be > debated or refined. + * We really want ItemPointerData to be exactly 6 bytes. This is rather a + * random place to check, but there is no better place. Since the assert is no longer in a random function body, it seems we can remove the second sentence. -- John Naylor EDB: http://www.enterprisedb.com
Re: slab allocator performance issues
On Sat, Dec 10, 2022 at 11:02 AM David Rowley wrote:
> [v4]

Thanks for working on this! I ran an in-situ benchmark using the v13 radix tree patchset ([1] WIP but should be useful enough for testing allocation speed), only applying the first five, which are local-memory only. The benchmark is not meant to represent a realistic workload, and primarily stresses traversal and allocation of the smallest node type.

Minimum of five runs, with turbo-boost off, on recent Intel laptop hardware:

v13-0001 to 0005:

# select * from bench_load_random_int(500 * 1000);
 mem_allocated | load_ms
---------------+---------
     151123432 |     222

  47.06%  postgres  postgres   [.] rt_set
  22.89%  postgres  postgres   [.] SlabAlloc
   9.65%  postgres  postgres   [.] rt_node_insert_inner.isra.0
   5.94%  postgres  [unknown]  [k] 0xb5e011b7
   3.62%  postgres  postgres   [.] MemoryContextAlloc
   2.70%  postgres  libc.so.6  [.] __memmove_avx_unaligned_erms
   2.60%  postgres  postgres   [.] SlabFree

+ v4 slab:

# select * from bench_load_random_int(500 * 1000);
 mem_allocated | load_ms
---------------+---------
     152463112 |     213

  52.42%  postgres  postgres   [.] rt_set
  12.80%  postgres  postgres   [.] SlabAlloc
   9.38%  postgres  postgres   [.] rt_node_insert_inner.isra.0
   7.87%  postgres  [unknown]  [k] 0xb5e011b7
   4.98%  postgres  postgres   [.] SlabFree

While allocation is markedly improved, freeing looks worse here. The proportion is surprising because only about 2% of nodes are freed during the load, but doing that takes up 10-40% of the time compared to allocating.

num_keys = 50, height = 7
n4 = 2501016, n15 = 56932, n32 = 270, n125 = 0, n256 = 257

Sidenote: I don't recall ever seeing vsyscall (I think that's what the 0xb5e011b7 address is referring to) in a profile, so not sure what is happening there.

[1] https://www.postgresql.org/message-id/CAFBsxsHNE621mGuPhd7kxaGc22vMkoSu7R4JW9Zan1jjorGy3g%40mail.gmail.com

-- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada wrote: > > On Fri, Dec 9, 2022 at 5:53 PM John Naylor wrote: > > > > I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail). > > The minimum requirement is 2MB. In PoC patch, TIDStore checks how big > the radix tree is using dsa_get_total_size(). If the size returned by > dsa_get_total_size() (+ some memory used by TIDStore meta information) > exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum > and heap vacuum. However, when allocating DSA memory for > radix_tree_control at creation, we allocate 1MB > (DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for > radix_tree_control from it. das_get_total_size() returns 1MB even if > there is no TID collected. 2MB makes sense. If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be driven by blocks allocated. I have an idea on that below. > > Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing. > > Right, I think it's no problem in slab cases. In DSA cases, the new > segment size follows a geometric series that approximately doubles the > total storage each time we create a new segment. This behavior comes > from the fact that the underlying DSM system isn't designed for large > numbers of segments. And taking a look, the size of a new segment can get quite large. 
It seems we could test if the total DSA area allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >= 8MB, then the area will contain just under a power of two the last time it passes the test. The next segment will bring it to about 3/4 full, like this:

maintenance_work_mem = 256MB, so stop if we go over 128MB:
2*(1+2+4+8+16+32) = 126MB -> keep going
126MB + 64MB = 190MB -> stop

That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now (even with Peter G.'s VM snapshot informing the allocation size, I imagine).

And as for the minimum possible maintenance_work_mem, I think this would work with 2MB, if the community is okay with technically going over the limit by a few bytes of overhead in case a buildfarm animal is set to that value. I imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit.

After this feature is complete, I think we should consider a follow-on patch to get rid of autovacuum_work_mem, since it would no longer be needed.

-- John Naylor EDB: http://www.enterprisedb.com
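A sketch of that stopping rule, under the simplifying assumption that segments arrive as pairs of doubling sizes, matching the arithmetic above (the real DSA growth policy differs in detail, and all names here are made up):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MB ((uint64_t) 1024 * 1024)

/* Stop once the total DSA area exceeds half of maintenance_work_mem */
static bool
should_stop(uint64_t total_allocated, uint64_t maintenance_work_mem)
{
	return total_allocated > maintenance_work_mem / 2;
}

/*
 * Simulate segment growth: two segments of each size, doubling, which
 * approximately doubles total storage each round.  This is a toy model
 * of the real policy, used only to check the arithmetic in the text.
 */
static uint64_t
total_after_segments(int nsegments)
{
	uint64_t	total = 0;
	uint64_t	segsize = 1 * MB;

	for (int i = 0; i < nsegments; i++)
	{
		total += segsize;
		if (i % 2 == 1)			/* double the size after each pair */
			segsize *= 2;
	}
	return total;
}
```

With maintenance_work_mem = 256MB, twelve segments (1+1+2+2+...+32+32) total 126MB, which is under the 128MB threshold, and the thirteenth (64MB) segment pushes the total to 190MB and trips the check, about 3/4 of the limit.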
Re: slab allocator performance issues
On Mon, Dec 5, 2022 at 3:02 PM David Rowley wrote: > > On Fri, 11 Nov 2022 at 22:20, John Naylor wrote: > > #define SLAB_FREELIST_COUNT ((1<<3) + 1) > > index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0); > > Doesn't this create a sort of round-robin use of the free list? What > we want is a sort of "histogram" bucket set of free lists so we can > group together blocks that have a close-enough free number of chunks. > Unless I'm mistaken, I think what you have doesn't do that. The intent must have slipped my mind along the way. > I wondered if simply: > > index = -(-freecount >> slab->freelist_shift); > > would be faster than Andres' version. I tried it out and on my AMD > machine, it's about the same speed. Same on a Raspberry Pi 4. > > Going by [2], the instructions are very different with each method, so > other machines with different latencies on those instructions might > show something different. I attached what I used to test if anyone > else wants a go. I get about 0.1% difference on my machine. Both ways boil down to (on gcc) 3 instructions with low latency. The later ones need the prior results to execute, which I think is what the XXX comment "isn't great" was referring to. The new coding is more mysterious (does it do the right thing on all platforms?), so I guess the original is still the way to go unless we get a better idea. -- John Naylor EDB: http://www.enterprisedb.com
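For the archives, here is a standalone check that the quoted branchless expression really is a ceiling division by the bucket size, which is what groups blocks with a close-enough free count into the same list. The function names are invented, and this assumes an arithmetic right shift of negative values, which is also what the patch relies on:

```c
#include <assert.h>

/* The proposed branchless form: -(-freecount >> shift) */
static int
freelist_index_shift(int freecount, int freelist_shift)
{
	return -(-freecount >> freelist_shift);
}

/* The same thing written the obvious way: ceil(freecount / 2^shift) */
static int
freelist_index_ceil(int freecount, int freelist_shift)
{
	int			bucket = 1 << freelist_shift;

	return (freecount + bucket - 1) / bucket;
}
```

So a freecount of zero lands on list 0, and each subsequent run of 2^shift counts maps to the next list, which gives the "histogram" bucketing rather than round-robin.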
move some bitmapset.c macros to bitmapset.h
Over in [1], Masahiko and I found that using some bitmapset logic yields a useful speedup in one part of the proposed radix tree patch. In addition to what's in bitmapset.h now, we'd need WORDNUM, BITNUM, RIGHTMOST_ONE and bmw_rightmost_one_pos() from bitmapset.c. The file tidbitmap.c has its own copies of the first two, and we could adapt that strategy again. But instead of three files defining these, it seems like it's time to consider moving them somewhere more central. Attached is a simple form of this concept, including moving HAS_MULTIPLE_ONES just for consistency. One possible objection is the possibility of namespace clash. Thoughts? I also considered putting the macros and typedefs in pg_bitutils.h. One organizational advantage is that pg_bitutils.h already offers convenience function symbols where the parameter depends on SIZEOF_SIZE_T, so putting the BITS_PER_BITMAPWORD versions there makes sense. But that way is not a clear win, so I didn't go that far. [1] https://www.postgresql.org/message-id/CAFBsxsHgP5LP9q%3DRq_3Q2S6Oyut67z%2BVpemux%2BKseSPXhJF7sg%40mail.gmail.com -- John Naylor EDB: http://www.enterprisedb.com src/backend/nodes/bitmapset.c | 24 src/backend/nodes/tidbitmap.c | 5 - src/include/nodes/bitmapset.h | 24 3 files changed, 24 insertions(+), 29 deletions(-) diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c index b7b274aeff..3204b49738 100644 --- a/src/backend/nodes/bitmapset.c +++ b/src/backend/nodes/bitmapset.c @@ -26,33 +26,9 @@ #include "port/pg_bitutils.h" -#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD) -#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD) - #define BITMAPSET_SIZE(nwords) \ (offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword)) -/*-- - * This is a well-known cute trick for isolating the rightmost one-bit - * in a word. It assumes two's complement arithmetic. Consider any - * nonzero value, and focus attention on the rightmost one. 
The value is - * then something like - *xx1 - * where x's are unspecified bits. The two's complement negative is formed - * by inverting all the bits and adding one. Inversion gives - *yy0 - * where each y is the inverse of the corresponding x. Incrementing gives - *yy1 - * and then ANDing with the original value gives - *001 - * This works for all cases except original value = zero, where of course - * we get zero. - *-- - */ -#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x))) - -#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x)) - /* Select appropriate bit-twiddling functions for bitmap word size */ #if BITS_PER_BITMAPWORD == 32 #define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w) diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c index a7a6b26668..8a1fd1d205 100644 --- a/src/backend/nodes/tidbitmap.c +++ b/src/backend/nodes/tidbitmap.c @@ -72,11 +72,6 @@ */ #define PAGES_PER_CHUNK (BLCKSZ / 32) -/* We use BITS_PER_BITMAPWORD and typedef bitmapword from nodes/bitmapset.h */ - -#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD) -#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD) - /* number of active words for an exact page: */ #define WORDS_PER_PAGE ((MAX_TUPLES_PER_PAGE - 1) / BITS_PER_BITMAPWORD + 1) /* number of active words for a lossy chunk: */ diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h index 2792281658..8462282b9e 100644 --- a/src/include/nodes/bitmapset.h +++ b/src/include/nodes/bitmapset.h @@ -48,6 +48,30 @@ typedef int32 signedbitmapword; /* must be the matching signed type */ #endif +#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD) +#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD) + +/*-- + * This is a well-known cute trick for isolating the rightmost one-bit + * in a word. It assumes two's complement arithmetic. Consider any + * nonzero value, and focus attention on the rightmost one. The value is + * then something like + *xx1 + * where x's are unspecified bits. 
The two's complement negative is formed + * by inverting all the bits and adding one. Inversion gives + *yy0 + * where each y is the inverse of the corresponding x. Incrementing gives + *yy1 + * and then ANDing with the original value gives + *001 + * This works for all cases except original value = zero, where of course + * we get zero. + *-- + */ +#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x))) + +#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x)) + typedef struct Bitmapset { pg_node_attr(custom_copy_equal, special_read_write)
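For readers skimming the patch, the trick in the moved comment is easy to exercise standalone. This is a direct restatement of the two macros with local typedefs (64-bit word, two's complement assumed), nothing more:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t bitmapword;	/* must be an unsigned type */
typedef int64_t signedbitmapword;	/* must be the matching signed type */

/* isolate the rightmost one-bit: x & -x */
#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))

/* more than one bit set iff the rightmost one is not the whole value */
#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
```

As the comment says, zero is the one input where the "isolated bit" is zero, and HAS_MULTIPLE_ONES correspondingly reports false for it.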
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Dec 14, 2022 at 6:07 AM Peter Geoghegan wrote: > > At the suggestion of Jeff, I wrote a Wiki page that shows motivating > examples for the patch series: > > https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples > > These are all cases where VACUUM currently doesn't do the right thing > around freezing, in a way that is greatly ameliorated by the patch. > Perhaps this will help other hackers to understand the motivation > behind some of these mechanisms. There are plenty of details that only > make sense in the context of a certain kind of table, with certain > performance characteristics that the design is sensitive to, and seeks > to take advantage of in one way or another. Thanks for this. This is the kind of concrete, data-based evidence that I find much more convincing, or at least easy to reason about. I'd actually recommend in the future to open discussion with this kind of analysis -- even before coding, it's possible to indicate what a design is *intended* to achieve. And reviewers can likewise bring up cases of their own in a concrete fashion. On Wed, Dec 14, 2022 at 12:16 AM Peter Geoghegan wrote: > At the very least, a given VACUUM operation has to choose its freezing > strategy based on how it expects the table will look when it's done > vacuuming the table, and how that will impact the next VACUUM against > the same table. Without that, then vacuuming an append-only table will > fall into a pattern of setting pages all-visible in one vacuum, and > then freezing those same pages all-frozen in the very next vacuum > because there are too many. Which makes little sense; we're far better > off freezing the pages at the earliest opportunity instead. That makes sense, but I wonder if we can actually be more specific: One motivating example mentioned is the append-only table. 
If we detected that case, which I assume we can because autovacuum_vacuum_insert_* GUCs exist, we could use that information as one way to drive eager freezing independently of size. At least in theory -- it's very possible size will be a necessary part of the decision, but it's less clear that it's as useful as a user-tunable knob. If we then ignored the append-only case when evaluating a freezing policy, maybe other ideas will fall out. I don't have a well-thought out idea about policy or knobs, but it's worth thinking about. Aside from that, I've only given the patches a brief reading. Having seen the VM snapshot in practice (under "Scanned pages, visibility map snapshot" in the wiki page), it's neat to see fewer pages being scanned. Prefetching not only seems superior to SKIP_PAGES_THRESHOLD, but anticipates asynchronous IO. Keeping only one VM snapshot page in memory makes perfect sense. I do have a cosmetic, but broad-reaching, nitpick about terms regarding "skipping strategy". That's phrased as a kind of negative -- what we're *not* doing. Many times I had to pause and compute in my head what we're *doing*, i.e. the "scanning strategy". For example, I wonder if the VM strategies would be easier to read as: VMSNAP_SKIP_ALL_VISIBLE -> VMSNAP_SCAN_LAZY VMSNAP_SKIP_ALL_FROZEN -> VMSNAP_SCAN_EAGER VMSNAP_SKIP_NONE -> VMSNAP_SCAN_ALL Notice here they're listed in order of increasing eagerness. -- John Naylor EDB: http://www.enterprisedb.com
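For illustration only, the renaming could look like this (a hypothetical enum, not the patch's actual declaration); the point is that the declaration order then tracks increasing eagerness:

```c
typedef enum vmstrategy
{
	VMSNAP_SCAN_LAZY,			/* skip both all-visible and all-frozen pages */
	VMSNAP_SCAN_EAGER,			/* skip only all-frozen pages */
	VMSNAP_SCAN_ALL				/* skip nothing */
} vmstrategy;
```

A side benefit of this ordering is that "at least this eager" becomes a simple comparison on the enum value.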
Re: slab allocator performance issues
On Tue, Dec 20, 2022 at 10:36 AM David Rowley wrote: > > I'm planning on pushing the attached v3 patch shortly. I've spent > several days reading over this and testing it in detail along with > adding additional features to the SlabCheck code to find more > inconsistencies. FWIW, I reran my test from last week and got similar results. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Dec 19, 2022 at 2:14 PM Masahiko Sawada wrote:
>
> On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada wrote:
> > Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
> > seems that they look at only memory that are actually dsa_allocate'd.
> > To be exact, we estimate the number of hash buckets based on work_mem
> > (and hash_mem_multiplier) and use it as the upper limit. So I've
> > confirmed that the result of dsa_get_total_size() could exceed the
> > limit. I'm not sure it's a known and legitimate usage. If we can
> > follow such usage, we can probably track how much dsa_allocate'd
> > memory is used in the radix tree.
>
> I've experimented with this idea. The newly added 0008 patch changes
> the radix tree so that it counts the memory usage for both local and
> shared cases. As shown below, there is an overhead for that:
>
> w/o 0008 patch
> 298453544 | 282
>
> w/ 0008 patch
> 293603184 | 297

This adds about as much overhead as the improvement I measured in the v4 slab allocator patch. That's not acceptable, and is exactly what Andres warned about in https://www.postgresql.org/message-id/20220704211822.kfxtzpcdmslzm2dy%40awork3.anarazel.de

I'm guessing the hash join case can afford to be precise about memory because it must spill to disk when exceeding work_mem. We don't have that design constraint.

-- John Naylor EDB: http://www.enterprisedb.com
Re: code cleanups
On Thu, Nov 24, 2022 at 12:53 AM Tom Lane wrote: > > Justin Pryzby writes: > > Some modest cleanups I've accumulated. > 0004: Right, somebody injected code in a poorly chosen place > (yet another victim of the "add at the end" anti-pattern). I've pushed 0004. -- John Naylor EDB: http://www.enterprisedb.com
Re: Rework of collation code, extensibility
On Sun, Dec 18, 2022 at 10:28 AM Ted Yu wrote: > It seems the `else` is not needed (since when the if branch is taken, we return from the func). By that same logic, this review comment is not needed, since compiler vendors don't charge license fees by the number of keywords. ;-) Joking aside, we don't really have a project style preference for this case. > nul-terminate -> null-terminate NUL is a common abbreviation for the zero byte (but not for zero pointers). See the ascii manpage. -- John Naylor EDB: http://www.enterprisedb.com
Re: move some bitmapset.c macros to bitmapset.h
On Mon, Dec 5, 2022 at 9:33 PM Tom Lane wrote:
>
> Alvaro Herrera writes:
> > On 2022-Dec-05, John Naylor wrote:
> >> -#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
> >> -#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
>
> > In this location, nobody can complain about the naming of these macros,
> > since they're just used to implement other bitmapset.c code. However,
> > if you move them to the .h file, ISTM you should give them more
> > meaningful names.
>
> IMV these are absolutely private to bitmapset.c. I reject the idea
> that they should be exposed publicly, under these names or any others.

Well, they've already escaped to tidbitmap.c as a copy. How do you feel about going that route?

> Maybe we need some more bitmapset primitive functions? What is it
> you actually want to accomplish in the end?

An inserter into one type of node in a tree structure must quickly find a free position in an array. We have a bitmap of 128 bits to indicate whether the corresponding array position is free. The proposed coding is:

/* get the first word with at least one bit not set */
for (idx = 0; idx < WORDNUM(128); idx++)
{
	if (isset[idx] < ~((bitmapword) 0))
		break;
}

/* To get the first unset bit in X, get the first set bit in ~X */
inverse = ~(isset[idx]);
slotpos = idx * BITS_PER_BITMAPWORD;
slotpos += bmw_rightmost_one_pos(inverse);

/* mark the slot used */
isset[idx] |= RIGHTMOST_ONE(inverse);

return slotpos;

...which, if it were reversed so that a set bit meant "available", would essentially be bms_first_member(), so a more primitive version of that might work.

-- John Naylor EDB: http://www.enterprisedb.com
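Here is a self-contained version of that proposed coding, with a GCC/Clang builtin standing in for bmw_rightmost_one_pos() and the RIGHTMOST_ONE() macro expanded inline. The function name is invented, and as in the snippet, the caller must guarantee a free slot exists:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t bitmapword;
#define BITS_PER_BITMAPWORD 64
#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)

/*
 * Find the first zero bit in a 128-bit "isset" map, mark it set, and
 * return its position.  Behavior is undefined if all bits are set.
 */
static int
claim_free_slot(bitmapword isset[WORDNUM(128)])
{
	int			idx;
	bitmapword	inverse;
	int			slotpos;

	/* get the first word with at least one bit not set */
	for (idx = 0; idx < WORDNUM(128); idx++)
	{
		if (isset[idx] < ~((bitmapword) 0))
			break;
	}

	/* To get the first unset bit in X, get the first set bit in ~X */
	inverse = ~(isset[idx]);
	slotpos = idx * BITS_PER_BITMAPWORD;
	slotpos += __builtin_ctzll(inverse);	/* bmw_rightmost_one_pos() */

	/* mark the slot used: isolate the rightmost one in the inverse */
	isset[idx] |= inverse & (-inverse);		/* RIGHTMOST_ONE() */
	return slotpos;
}
```

Repeated calls hand out slots 0, 1, 2, ... in order, skipping any positions already marked set.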
Re: move some bitmapset.c macros to bitmapset.h
On Tue, Dec 6, 2022 at 12:57 PM Tom Lane wrote: > > Well, they've already escaped to tidbitmap.c as a copy. How do you feel > > about going that route? > > Not terribly pleased with that either, I must admit. Okay, I won't pursue that further. > If we do put RIGHTMOST_ONE functionality into pg_bitutils.h, > I'd envision it as pg_bitutils.h exporting both 32-bit and > 64-bit versions of that, and then bitmapset.c choosing the > appropriate one just like it chooses bmw_rightmost_one_pos. Here's a quick go at that. I've not attempted to use it for what I need, but it looks like it fits the bill. -- John Naylor EDB: http://www.enterprisedb.com src/backend/nodes/bitmapset.c| 36 ++-- src/include/nodes/bitmapset.h| 16 ++-- src/include/port/pg_bitutils.h | 31 +++ src/tools/pgindent/typedefs.list | 1 - 4 files changed, 47 insertions(+), 37 deletions(-) diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c index b7b274aeff..4384ff591d 100644 --- a/src/backend/nodes/bitmapset.c +++ b/src/backend/nodes/bitmapset.c @@ -32,39 +32,7 @@ #define BITMAPSET_SIZE(nwords) \ (offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword)) -/*-- - * This is a well-known cute trick for isolating the rightmost one-bit - * in a word. It assumes two's complement arithmetic. Consider any - * nonzero value, and focus attention on the rightmost one. The value is - * then something like - *xx1 - * where x's are unspecified bits. The two's complement negative is formed - * by inverting all the bits and adding one. Inversion gives - *yy0 - * where each y is the inverse of the corresponding x. Incrementing gives - *yy1 - * and then ANDing with the original value gives - *001 - * This works for all cases except original value = zero, where of course - * we get zero. 
- *-- - */ -#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x))) - -#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x)) - -/* Select appropriate bit-twiddling functions for bitmap word size */ -#if BITS_PER_BITMAPWORD == 32 -#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w) -#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w) -#define bmw_popcount(w)pg_popcount32(w) -#elif BITS_PER_BITMAPWORD == 64 -#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w) -#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w) -#define bmw_popcount(w)pg_popcount64(w) -#else -#error "invalid BITS_PER_BITMAPWORD" -#endif +#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x)) /* @@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a) { int result; - w = RIGHTMOST_ONE(w); + w = bmw_rightmost_one(w); a->words[wordnum] &= ~w; result = wordnum * BITS_PER_BITMAPWORD; diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h index 2792281658..fdc504596b 100644 --- a/src/include/nodes/bitmapset.h +++ b/src/include/nodes/bitmapset.h @@ -38,13 +38,11 @@ struct List; #define BITS_PER_BITMAPWORD 64 typedef uint64 bitmapword; /* must be an unsigned type */ -typedef int64 signedbitmapword; /* must be the matching signed type */ #else #define BITS_PER_BITMAPWORD 32 typedef uint32 bitmapword; /* must be an unsigned type */ -typedef int32 signedbitmapword; /* must be the matching signed type */ #endif @@ -75,6 +73,20 @@ typedef enum BMS_MULTIPLE/* >1 member */ } BMS_Membership; +/* Select appropriate bit-twiddling functions for bitmap word size */ +#if BITS_PER_BITMAPWORD == 32 +#define bmw_rightmost_one(w) pg_rightmost_one32(w) +#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w) +#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w) +#define bmw_popcount(w)pg_popcount32(w) +#elif BITS_PER_BITMAPWORD == 64 +#define bmw_rightmost_one(w) pg_rightmost_one64(w) +#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w) 
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w) +#define bmw_popcount(w)pg_popcount64(w) +#else +#error "invalid BITS_PER_BITMAPWORD" +#endif /* * function prototypes in nodes/bitmapset.c diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h index 814e0b2dba..f95b6afd86 100644 --- a/src/include/port/pg_bitutils.h +++ b/src/include/port/pg_bitutils.h @@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256]; extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256]; extern PGDLLIMPORT const uint8 pg_number_of_ones[256]; +/*-- + * This is a well-known cute trick for isolating the rightmost one-bit + * in a word. It assumes two's complement arithmetic. Consider any + * nonzero value, and focus attention on the rightmost one. The value is + * then something like + *xx1 + * where x's are unspecified bits. The t
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Dec 9, 2022 at 8:20 AM Masahiko Sawada wrote: > In the meanwhile, I've been working on vacuum integration. There are > two things I'd like to discuss some time: > > The first is the minimum of maintenance_work_mem, 1 MB. Since the > initial DSA segment size is 1MB (DSA_INITIAL_SEGMENT_SIZE), parallel > vacuum with radix tree cannot work with the minimum > maintenance_work_mem. It will need to increase it to 4MB or so. Maybe > we can start a new thread for that. I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail). > The second is how to limit the size of the radix tree to > maintenance_work_mem. I think that it's tricky to estimate the maximum > number of keys in the radix tree that fit in maintenance_work_mem. The > radix tree size varies depending on the key distribution. The next > idea I considered was how to limit the size when inserting a key. In > order to strictly limit the radix tree size, probably we have to > change the rt_set so that it breaks off and returns false if the radix > tree size is about to exceed the memory limit when we allocate a new > node or grow a node kind/class. That seems complex, fragile, and wrong scope. > Ideally, I'd like to control the size > outside of radix tree (e.g. TIDStore) since it could introduce > overhead to rt_set() but probably we need to add such logic in radix > tree. Does the TIDStore have the ability to ask the DSA (or slab context) to see how big it is? If a new segment has been allocated that brings us to the limit, we can stop when we discover that fact. In the local case with slab blocks, it won't be on nice neat boundaries, but we could check if we're within the largest block size (~64kB) of overflow. Remember when we discussed how we might approach parallel pruning? 
I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing. -- John Naylor EDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
On Tue, Dec 13, 2022 at 8:00 AM Peter Geoghegan wrote: > > On Mon, Dec 12, 2022 at 3:47 PM Jeff Davis wrote: > > But the heuristic also seems off to me. What if you have lots of partitions > > in an append-only range-partitioned table? That would tend to use the > > lazy freezing strategy (because each partition is small), but that's > > not what you want. I understand heuristics aren't perfect, but it feels > > like we could do something better. > > It is at least vastly superior to vacuum_freeze_min_age in cases like > this. Not that that's hard -- vacuum_freeze_min_age just doesn't ever > trigger freezing in any autovacuum given a table like pgbench_history > (barring during aggressive mode), due to how it interacts with the > visibility map. So we're practically guaranteed to do literally all > freezing for an append-only table in an aggressive mode VACUUM. > > Worst of all, that happens on a timeline that has nothing to do with > the physical characteristics of the table itself (like the number of > unfrozen heap pages or something). If the number of unfrozen heap pages is the thing we care about, perhaps that, and not the total size of the table, should be the parameter that drives freezing strategy? > That said, I agree that the system-level picture of debt (the system > level view of the number of unfrozen heap pages) is relevant, and that > it isn't directly considered by the patch. I think that that can be > treated as work for a future release. In fact, I think that there is a > great deal that we could teach autovacuum.c about the system level > view of things -- this is only one. It seems an easier path to considering system-level of debt (as measured by unfrozen heap pages) would be to start with considering table-level debt measured the same way. -- John Naylor EDB: http://www.enterprisedb.com
Re: slab allocator performance issues
On Tue, Dec 13, 2022 at 7:50 AM David Rowley wrote:
>
> Thanks for testing the patch.
>
> On Mon, 12 Dec 2022 at 20:14, John Naylor wrote:
> > While allocation is markedly improved, freeing looks worse here. The proportion is surprising because only about 2% of nodes are freed during the load, but doing that takes up 10-40% of the time compared to allocating.
>
> I've tried to reproduce this with the v13 patches applied and I'm not
> really getting the same as you are. To run the function 100 times I
> used:
>
> select x, a.* from generate_series(1,100) x(x), lateral (select * from
> bench_load_random_int(500 * 1000 * (1+x-x))) a;

Simply running over a longer period of time like this makes the SlabFree difference much closer to your results, so it doesn't seem out of line anymore. Here SlabAlloc seems to take maybe 2/3 of the time of current slab, with a 5% reduction in total time:

500k ints:

v13-0001-0005, average of 30: 217ms

  47.61%  postgres  postgres   [.] rt_set
  20.99%  postgres  postgres   [.] SlabAlloc
  10.00%  postgres  postgres   [.] rt_node_insert_inner.isra.0
   6.87%  postgres  [unknown]  [k] 0xbce011b7
   3.53%  postgres  postgres   [.] MemoryContextAlloc
   2.82%  postgres  postgres   [.] SlabFree

+slab v4, average of 30: 206ms

  51.13%  postgres  postgres   [.] rt_set
  14.08%  postgres  postgres   [.] SlabAlloc
  11.41%  postgres  postgres   [.] rt_node_insert_inner.isra.0
   7.44%  postgres  [unknown]  [k] 0xbce011b7
   3.89%  postgres  postgres   [.] MemoryContextAlloc
   3.39%  postgres  postgres   [.] SlabFree

It doesn't look mysterious anymore, but I went ahead and took some more perf measurements, including for cache misses. My naive impression is that we're spending a bit more time waiting for data, but having to do less work with it once we get it, which is consistent with your earlier comments:

perf stat -p $pid sleep 2

v13:

       2,001.55 msec task-clock:u        #    1.000 CPUs utilized
              0      context-switches:u  #    0.000 /sec
              0      cpu-migrations:u    #    0.000 /sec
        311,690      page-faults:u       #  155.724 K/sec
  3,128,740,701      cycles:u            #    1.563 GHz
  4,739,333,861      instructions:u      #    1.51  insn per cycle
    820,014,588      branches:u          #  409.690 M/sec
      7,385,923      branch-misses:u     #    0.90% of all branches

+slab v4:

       2,001.09 msec task-clock:u        #    1.000 CPUs utilized
              0      context-switches:u  #    0.000 /sec
              0      cpu-migrations:u    #    0.000 /sec
        326,017      page-faults:u       #  162.920 K/sec
  3,016,668,818      cycles:u            #    1.508 GHz
  4,324,863,908      instructions:u      #    1.43  insn per cycle
    761,839,927      branches:u          #  380.712 M/sec
      7,718,366      branch-misses:u     #    1.01% of all branches

perf stat -e LLC-loads,LLC-loads-misses -p $pid sleep 2

min/max of 3 runs:

v13:      LL cache misses: 25.08% - 25.41%
+slab v4: LL cache misses: 25.74% - 26.01%

-- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Sep 28, 2022 at 1:18 PM I wrote:

> Along those lines, one thing I've been thinking about is the number of size classes. There is a tradeoff between memory efficiency and number of branches when searching/inserting. My current thinking is there is too much coupling between size class and data type. Each size class currently uses a different data type and a different algorithm to search and set it, which in turn requires another branch. We've found that a larger number of size classes leads to poor branch prediction [1] and (I imagine) code density.
>
> I'm thinking we can use "flexible array members" for the values/pointers, and keep the rest of the control data in the struct the same. That way, we never have more than 4 actual "kinds" to code and branch on. As a bonus, when migrating a node to a larger size class of the same kind, we can simply repalloc() to the next size.

While the most important challenge right now is how to best represent and organize the shared memory case, I wanted to get the above idea working and out of the way, to be saved for a future time. I've attached a rough implementation (applies on top of v9 0003) that splits node32 into 2 size classes. They both share the exact same base data type and hence the same search/set code, so the number of "kind"s is still four, but here there are five "size classes", so a new case in the "unlikely" node-growing path. The smaller instance of node32 is a "node15", because that's currently 160 bytes, corresponding to one of the DSA size classes. This idea can be applied to any other node except the max size, as we see fit. (Adding a singleton size class would bring it back in line with the prototype, at least as far as memory consumption.)

One issue with this patch: The "fanout" member is a uint8, so it can't hold 256 for the largest node kind. That's not an issue in practice, since we never need to grow it, and we only compare that value with the count in an Assert(), so I just set it to zero.
That does break an invariant, so it's not great. We could use 2 bytes to be strictly correct in all cases, but that limits what we can do with the smallest node kind.

In the course of working on this, I encountered a pain point. Since it's impossible to repalloc in slab, we have to do alloc/copy/free ourselves. That's fine, but the current coding makes too many assumptions about the use cases: rt_alloc_node and rt_copy_node are too entangled with each other and do too much work unrelated to what the names imply. I seem to remember an earlier version had something like rt_node_copy_common that did only...copying. That was much easier to reason about. In 0002 I resorted to doing my own allocation to show what I really want to do, because the new use case doesn't need zeroing and setting values. It only needs to...allocate (and increase the stats counter if built that way).

Future optimization work while I'm thinking of it: rt_alloc_node should be always-inlined and the memset done separately (i.e. not *AllocZero). That way the compiler should be able to generate more efficient zeroing code for smaller nodes. I'll test the numbers on this sometime in the future.

-- John Naylor EDB: http://www.enterprisedb.com

From 6fcc970ae7e31f44fa6b6aface983cadb023cc50 Mon Sep 17 00:00:00 2001
From: John Naylor
Date: Thu, 17 Nov 2022 16:10:44 +0700
Subject: [PATCH v901 2/2] Make node32 variable sized

Add a size class for 15 elements, which corresponds to 160 bytes, an
allocation size used by DSA. When a 16th element is to be inserted,
allocate a larger area and memcpy the entire old node to it.

NB: Zeroing the new area is only necessary if it's for an inner
node128, since insert logic must check for null child pointers.

This technique allows us to limit the node kinds to 4, which
1. limits the number of cases in switch statements
2.
allows a possible future optimization to encode the node kind in a
pointer tag
---
 src/backend/lib/radixtree.c | 141 +++-
 1 file changed, 108 insertions(+), 33 deletions(-)

diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bef1a438ab..f368e750d5 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -130,6 +130,7 @@ typedef enum
 typedef enum rt_size_class
 {
 	RT_CLASS_4_FULL = 0,
+	RT_CLASS_32_PARTIAL,
 	RT_CLASS_32_FULL,
 	RT_CLASS_128_FULL,
 	RT_CLASS_256
@@ -147,6 +148,8 @@ typedef struct rt_node
 	uint16		count;

 	/* Max number of children. We can use uint8 because we never need to store 256 */
+	/* WIP: if we don't have a variable sized node4, this should instead be in the base
+	   types as needed, since saving every byte is crucial for the smallest node kind */
 	uint8		fanout;

 	/*
@@ -166,6 +169,8 @@ typedef struct rt_node
 ((node)->base.n.count < (node)->ba
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada wrote:
>
> On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada wrote:
> >
> > On Wed, Nov 16, 2022 at 4:39 PM John Naylor
> > wrote:

That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case.

Digging a bit deeper, I see a flaw in my benchmark: Even though the total distribution of node kinds is decently even, the pattern that the benchmark sees is not terribly random:

      3,343,352      branch-misses:u           #    0.85% of all branches
    393,204,959      branches:u

Recall a previous benchmark [1] where the leaf node was about half node16 and half node32. Randomizing the leaf node between the two caused branch misses to go from 1% to 2%, causing a noticeable slowdown. Maybe in this new benchmark, each level has a skewed distribution of nodes, giving a smart branch predictor something to work with. We will need a way to efficiently generate keys that lead to a relatively unpredictable distribution of node kinds, as seen by a searcher. Especially in the leaves (or just above the leaves), since those are less likely to be cached.

> > I'll also run the test on my environment and do the investigation tomorrow.
>
> FYI I've not tested the patch you shared today but here are the
> benchmark results I did with the v9 patch in my environment (I used
> the second filter). I splitted 0004 patch into two patches: a patch
> for pure refactoring patch to introduce rt_node_ptr and a patch to do
> pointer tagging.

Would you be able to share the refactoring patch? And a fix for the failing tests? I'm thinking I want to try the templating approach fairly soon.
[1] https://www.postgresql.org/message-id/CAFBsxsFEVckVzsBsfgGzGR4Yz%3DJp%3DUxOtjYvTjOz6fOoLXtOig%40mail.gmail.com -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada wrote:
>
> FYI I've not tested the patch you shared today but here are the
> benchmark results I did with the v9 patch in my environment (I used
> the second filter). I splitted 0004 patch into two patches: a patch
> for pure refactoring patch to introduce rt_node_ptr and a patch to do
> pointer tagging.
>
> v9 0003 patch:         1113 1114 1114
> introduce rt_node_ptr: 1127 1128 1128
> pointer tagging:       1085 1087 1086 (equivalent to 0004 patch)
>
> In my environment, rt_node_ptr seemed to lead some overhead but
> pointer tagging had performance benefits. I'm not sure the reason why
> the results are different from yours. The radix tree stats shows the
> same as your tests.

There is less than 2% difference from the median set of results, so it's hard to distinguish from noise. I did a fresh rebuild and retested with the same results: about 15% slowdown in v9 0004. That's strange.

On Wed, Nov 16, 2022 at 10:24 PM Masahiko Sawada wrote:
>
> > filter = (((uint64) 1<<32) | (0xFF<<24));
> > LOG: num_keys = 944, height = 7, n4 = 47515559, n32 = 6209, n128 = 62632, n256 = 3161
> >
> > 1) Any idea why the tree height would be reported as 7 here? I didn't expect that.
>
> In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
> It seems the filter should be (((uint64) 1<<32) | ((uint64)
> 0xFF<<24)).

Ugh, sign extension, brain fade on my part. Thanks, I'm glad there was a straightforward explanation.

-- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Nov 21, 2022 at 3:43 PM Masahiko Sawada wrote: > > On Mon, Nov 21, 2022 at 4:20 PM John Naylor > wrote: > > Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of variable-sized nodes), 3 has a nice property: no wasted padding space: > > > > node4: 5 + 4+(7) + 4*8 = 48 bytes > > node3: 5 + 3 + 3*8 = 32 > > IIUC if we store the fanout member only in variable-sized nodes, > rt_node has only count, shift, and chunk, so 4 bytes in total. If so, > the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The > size doesn't change but there is 1 byte padding space. I forgot to mention I'm assuming no pointer-tagging for this exercise. You've demonstrated it can be done in a small amount of code, and I hope we can demonstrate a speedup in search. Just in case there is some issue with portability, valgrind, or some other obstacle, I'm being pessimistic in my calculations. > Also, even if we have the node3 a variable-sized node, size class 1 > for node3 could be a good choice since it also doesn't need padding > space and could be a good alternative to path compression. > > node3 : 5 + 3 + 3*8 = 32 bytes > size class 1 : 5 + 3 + 1*8 = 16 bytes Precisely! I have that scenario in my notes as well -- it's quite compelling. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Nov 18, 2022 at 2:48 PM I wrote: > One issue with this patch: The "fanout" member is a uint8, so it can't hold 256 for the largest node kind. That's not an issue in practice, since we never need to grow it, and we only compare that value with the count in an Assert(), so I just set it to zero. That does break an invariant, so it's not great. We could use 2 bytes to be strictly correct in all cases, but that limits what we can do with the smallest node kind. Thinking about this part, there's an easy resolution -- use a different macro for fixed- and variable-sized node kinds to determine if there is a free slot. Also, I wanted to share some results of adjusting the boundary between the two smallest node kinds. In the hackish attached patch, I modified the fixed height search benchmark to search a small (within L1 cache) tree thousands of times. For the first set I modified node4's maximum fanout and filled it up. For the second, I set node4's fanout to 1, which causes 2+ to spill to node32 (actually the partially-filled node15 size class as demoed earlier). 
node4:

NOTICE: num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      2 |    16 |            16520 |          0 |            3

NOTICE: num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      3 |    81 |            16456 |          0 |           17

NOTICE: num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      4 |   256 |            16456 |          0 |           89

NOTICE: num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      5 |   625 |            16488 |          0 |          327

node32:

NOTICE: num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      2 |    16 |            16488 |          0 |            5
(1 row)

NOTICE: num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      3 |    81 |            16520 |          0 |           28

NOTICE: num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      4 |   256 |            16408 |          0 |           79

NOTICE: num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      5 |   625 |            24616 |          0 |          199

In this test, node32 seems slightly faster than node4 with 4 elements, at the cost of more memory.

Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of variable-sized nodes), 3 has a nice property: no wasted padding space:

node4: 5 + 4+(7) + 4*8 = 48 bytes
node3: 5 + 3 + 3*8 = 32

-- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada wrote: > > [v11] There is one more thing that just now occurred to me: In expanding the use of size classes, that makes rebasing and reworking the shared memory piece more work than it should be. That's important because there are still some open questions about the design around shared memory. To keep unnecessary churn to a minimum, perhaps we should limit size class expansion to just one (or 5 total size classes) for the near future? -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
While creating a benchmark for inserting into node128-inner, I found a bug. If a caller deletes from a node128, the slot index is set to invalid, but the child pointer is still valid. Do that a few times, and every child pointer is valid, even if no slot index points to it. When the next inserter comes along, something surprising happens. This function:

/* Return an unused slot in node-128 */
static int
node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
{
	int			slotpos = 0;

	Assert(!NODE_IS_LEAF(node));
	while (node_inner_128_is_slot_used(node, slotpos))
		slotpos++;

	return slotpos;
}

...passes an integer to this function, whose parameter is a uint8:

/* Is the slot in the node used? */
static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
	Assert(!NODE_IS_LEAF(node));
	return (node->children[slot] != NULL);
}

...so instead of growing the node unnecessarily or segfaulting, it enters an infinite loop doing this:

	add	eax, 1
	movzx	ecx, al
	cmp	QWORD PTR [rbx+264+rcx*8], 0
	jne	.L147

The fix is easy enough -- set the child pointer to null upon deletion, but I'm somewhat astonished that the regression tests didn't hit this. I do still intend to replace this code with something faster, but before I do so the tests should probably exercise the deletion paths more. Since VACUUM

-- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
> The fix is easy enough -- set the child pointer to null upon deletion, but I'm somewhat astonished that the regression tests didn't hit this. I do still intend to replace this code with something faster, but before I do so the tests should probably exercise the deletion paths more. Since VACUUM Oops. I meant to finish with "Since VACUUM doesn't perform deletion we didn't have an opportunity to detect this during that operation." -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
There are a few things up in the air, so I'm coming back to this list to summarize and add a recent update:

On Mon, Nov 14, 2022 at 7:59 PM John Naylor wrote:
>
> - See how much performance we actually gain from tagging the node kind.

Needs a benchmark that has enough branch mispredicts and L2/3 misses to show a benefit. Otherwise either neutral or worse in its current form, depending on compiler(?). Put off for later.

> - Try additional size classes while keeping the node kinds to only four.

This is relatively simple and effective. If only one additional size class (total 5) is coded as a placeholder, I imagine it will be easier to rebase shared memory logic than using this technique everywhere possible.

> - Optimize node128 insert.

I've attached a rough start at this. The basic idea is borrowed from our bitmapset nodes, so we can iterate over and operate on word-sized (32- or 64-bit) types at a time, rather than bytes. To make this easier, I've moved some of the lower-level macros and types from bitmapset.h/.c to pg_bitutils.h. That's probably going to need a separate email thread to resolve the coding style clash this causes, so that can be put off for later. This is not meant to be included in the next patchset.

For demonstration purposes, I get these results with a function that repeatedly deletes the last value from a mostly-full node128 leaf and re-inserts it:

select * from bench_node128_load(120);

v11

NOTICE: num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0, n61 = 0, n128 = 121, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
    120 | 14400 |           208304 |               56

v11 + 0006 addendum

NOTICE: num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0, n61 = 0, n128 = 121, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
    120 | 14400 |           208816 |               34

I didn't test inner nodes, but I imagine the difference is bigger.
This bitmap style should also be used for the node256-leaf isset array simply to be consistent and avoid needing single-use macros, but that has not been done yet. It won't make a difference for performance because there is no iteration there.

> - Try templating out the differences between local and shared memory.

I hope to start this sometime after the crashes on 32-bit are resolved.

-- John Naylor EDB: http://www.enterprisedb.com

diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 67ba568531..2fd689aa91 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -63,3 +63,14 @@ OUT rt_search_ms int8
 returns record
 as 'MODULE_PATHNAME'
 LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index e69be48448..b035b3a747 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -31,6 +31,7 @@ PG_FUNCTION_INFO_V1(bench_shuffle_search);
 PG_FUNCTION_INFO_V1(bench_load_random_int);
 PG_FUNCTION_INFO_V1(bench_fixed_height_search);
 PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);

 static uint64
 tid_to_key_off(ItemPointer tid, uint32 *off)
@@ -552,3 +553,85 @@ finish_search:
 	rt_free(rt);
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+	int			fanout = PG_GETARG_INT32(0);
+	radix_tree *rt;
+	TupleDesc	tupdesc;
+	TimestampTz start_time,
+				end_time;
+	long		secs;
+	int			usecs;
+	int64		rt_sparseload_ms;
+	Datum		values[5];
+	bool		nulls[5];
+
+	uint64		r,
+				h;
+	int
+				key_id;
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	rt = rt_create(CurrentMemoryContext);
+
+	key_id = 0;
+
+	for (r = 1; r <= fanout; r++)
+	{
+		for (h = 1; h <= fanout; h++)
+		{
+			uint64		key;
+
+			key = (r << 8) | (h);
+
+			key_id++;
+			rt_set(rt, key, key_id);
+		}
+	}
+
+	rt_stats(rt);
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Nov 30, 2022 at 11:09 PM Masahiko Sawada wrote: > > I've investigated this issue and have a question about using atomic > variables on palloc'ed memory. In non-parallel vacuum cases, > radix_tree_control is allocated via aset.c. IIUC in 32-bit machines, > the memory allocated by aset.c is 4-bytes aligned so these atomic > variables are not always 8-bytes aligned. Is there any way to enforce > 8-bytes aligned memory allocations in 32-bit machines? The bigger question in my mind is: Why is there an atomic variable in backend-local memory? -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Dec 1, 2022 at 3:03 PM Masahiko Sawada wrote: > > On Thu, Dec 1, 2022 at 4:00 PM John Naylor wrote: > > > > The bigger question in my mind is: Why is there an atomic variable in backend-local memory? > > Because I use the same radix_tree and radix_tree_control structs for > non-parallel and parallel vacuum. Therefore, radix_tree_control is > allocated in DSM for parallel-vacuum cases or in backend-local memory > for non-parallel vacuum cases. Ok, that could be yet another reason to compile local- and shared-memory functionality separately, but now I'm wondering why there are atomic variables at all, since there isn't yet any locking support. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Nov 30, 2022 at 2:28 PM Masahiko Sawada wrote:
>
> On Fri, Nov 25, 2022 at 5:00 PM John Naylor
> wrote:
> > These hacks were my work, but I think we can improve that by having two versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized nodes. For that to work, in "init-node" we'd need a branch to set fanout to zero for node256. That should be fine -- it already has to branch for memset'ing node128's indexes to 0xFF.
>
> Since the node has fanout regardless of fixed-sized and
> variable-sized

As currently coded, yes. But that's not strictly necessary, I think.

> , only node256 is the special case where the fanout in
> the node doesn't match the actual fanout of the node. I think if we
> want to have two versions of NODE_HAS_FREE_SLOT, we can have one for
> node256 and one for other classes. Thoughts? In your idea, for
> NODE_HAS_FREE_SLOT for fixed-sized nodes, you meant like the
> following?
>
> #define FIXED_NODE_HAS_FREE_SLOT(node, class) \
>     (node->base.n.count < rt_size_class_info[class].fanout)

Right, and the other one could be VAR_NODE_...

-- John Naylor EDB: http://www.enterprisedb.com
Re: Prefetch the next tuple's memory during seqscans
On Wed, Nov 23, 2022 at 4:58 AM David Rowley wrote:

> My current thoughts are that it might be best to go with 0005 to start
> with.

+1

> I know Melanie is working on making some changes in this area,
> so perhaps it's best to leave 0002 until that work is complete.

There seem to be some open questions about that one as well.

I reran the same test in [1] (except I don't have the ability to lock clock speed or affect huge pages) on an older CPU from 2014 (Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz, kernel 3.10 gcc 4.8) with good results:

HEAD:

Testing a1
latency average = 965.462 ms
Testing a2
latency average = 1054.608 ms
Testing a3
latency average = 1078.263 ms
Testing a4
latency average = 1120.933 ms
Testing a5
latency average = 1162.753 ms
Testing a6
latency average = 1298.876 ms
Testing a7
latency average = 1228.775 ms
Testing a8
latency average = 1293.535 ms

0001+0005:

Testing a1
latency average = 791.224 ms
Testing a2
latency average = 876.421 ms
Testing a3
latency average = 911.039 ms
Testing a4
latency average = 981.693 ms
Testing a5
latency average = 998.176 ms
Testing a6
latency average = 979.954 ms
Testing a7
latency average = 1066.523 ms
Testing a8
latency average = 1030.235 ms

I then tested a Power8 machine (also kernel 3.10 gcc 4.8). Configure reports "checking for __builtin_prefetch... yes", but I don't think it does anything here, as the results are within noise level. A quick search didn't turn up anything informative on this platform, and I'm not motivated to dig deeper. In any case, it doesn't make things worse.
HEAD:

Testing a1
latency average = 1402.163 ms
Testing a2
latency average = 1442.971 ms
Testing a3
latency average = 1599.188 ms
Testing a4
latency average = 1664.397 ms
Testing a5
latency average = 1782.091 ms
Testing a6
latency average = 1860.655 ms
Testing a7
latency average = 1929.120 ms
Testing a8
latency average = 2021.100 ms

0001+0005:

Testing a1
latency average = 1433.080 ms
Testing a2
latency average = 1428.369 ms
Testing a3
latency average = 1542.406 ms
Testing a4
latency average = 1642.452 ms
Testing a5
latency average = 1737.173 ms
Testing a6
latency average = 1828.239 ms
Testing a7
latency average = 1920.909 ms
Testing a8
latency average = 2036.922 ms

[1] https://www.postgresql.org/message-id/CAFBsxsHqmH_S%3D4apc5agKsJsF6xZ9f6NaH0Z83jUYv3EgySHfw%40mail.gmail.com

-- John Naylor EDB: http://www.enterprisedb.com
Re: Improve performance of pg_strtointNN functions
On Thu, Dec 1, 2022 at 6:42 AM David Rowley wrote: > > I was thinking that we should likely apply this before doing the hex > literals, which is the main focus of [1]. The reason being is so that > that patch can immediately have faster conversions by allowing the > compiler to use bit shifting instead of other means of multiplying by > a power-of-2 number. I'm hoping this removes a barrier for Peter from > the small gripe I raised on that thread about the patch having slower > than required hex, octal and binary string parsing. I don't see why the non-decimal literal patch needs to be "immediately" faster? If doing this first leads to less code churn, that's another consideration, but you haven't made that argument. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PATCH] Add native windows on arm64 support
On Fri, Dec 2, 2022 at 12:20 AM Niyas Sait wrote:
>
> On 07/11/2022 06:58, Michael Paquier wrote:
> > Seems so. Hmm, where does _ARM64_BARRIER_SY come from? Perhaps it
> > would be better to have a comment referring to it from a different
> > place than the forums of arm, like some actual docs?
>
> _ARM64_BARRIER_SY is defined in Microsoft Arm64 intrinsic documentation
> -
> https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics?view=msvc-170#BarrierRestrictions

In particular, at the bottom of that section:

"For the __isb intrinsic, the only restriction that is currently valid is _ARM64_BARRIER_SY; all other values are reserved by the architecture."

This corresponds to

https://developer.arm.com/documentation/ddi0596/2021-06/Base-Instructions/ISB--Instruction-Synchronization-Barrier-

which says "SY Full system barrier operation, encoded as CRm = 0b1111. Can be omitted."

> I couldn't find something more official for the sse2neon library part.

Not quite sure what this is referring to, but it seems we can just point to the __aarch64__ section in the same file, which uses the same instruction:

spin_delay(void)
{
	__asm__ __volatile__(
		" isb; \n");
}

...and which already explains the choice with a comment.

About v4:

+ * Use _mm_pause (x64) or __isb(arm64) intrinsic instead of rep nop.

Need a space here after __isb.

+  if cc.get_id() == 'msvc'
+    cdata.set('USE_ARMV8_CRC32C', false)
+    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
+    have_optimized_crc = true
+  else

That seems like a heavy-handed way to force it. Could we just use the same gating in the test program that the patch puts in the code of interest? Namely:

+#ifndef _MSC_VER
 #include
+#endif

-- John Naylor EDB: http://www.enterprisedb.com
Re: initdb: Refactor PG_CMD_PUTS loops
On Thu, Dec 1, 2022 at 5:02 PM Peter Eisentraut < peter.eisentr...@enterprisedb.com> wrote: > > Keeping the SQL commands that initdb runs in string arrays before > feeding them to PG_CMD_PUTS() seems unnecessarily verbose and > inflexible. In some cases, the array only has one member. In other > cases, one might want to use PG_CMD_PRINTF() instead, to parametrize a > command, but that would require breaking up the loop or using > workarounds like replace_token(). This patch unwinds all that; it's > much simpler that way. +1, I can't think of a reason to keep the current coding -- John Naylor EDB: http://www.enterprisedb.com
Re: Optimizing Node Files Support
On Thu, Dec 1, 2022 at 8:02 PM Ranier Vilela wrote: > > Hi, > > I believe that has room for improving generation node files. > > The patch attached reduced the size of generated files by 27 kbytes. > From 891 kbytes to 864 kbytes. > > About the patch: > 1. Avoid useless attribution when from->field is NULL, once that > the new node is palloc0. > > 2. Avoid useless declaration variable Size, when it is unnecessary. Not useless -- it prevents a multiple evaluation hazard, which this patch introduces. > 3. Optimize comparison functions like memcmp and strcmp, using > a short-cut comparison of the first element. Not sure if the juice is worth the squeeze. Profiling would tell. > 4. Switch several copy attributions like COPY_SCALAR_FIELD or COPY_LOCATION_FIELD > by one memcpy call. My first thought is, it would cause code churn. > 5. Avoid useless attribution, ignoring the result of pg_strtok when it is unnecessary. Looks worse. -- John Naylor EDB: http://www.enterprisedb.com
Re: Prefetch the next tuple's memory during seqscans
On Wed, Nov 23, 2022 at 5:00 AM David Rowley wrote: > > On Thu, 3 Nov 2022 at 22:09, John Naylor wrote: > > I tried a similar test, but with text fields of random length, and there is improvement here: > > Thank you for testing that. Can you share which CPU this was on? That was an Intel Core i7-10750H -- John Naylor EDB: http://www.enterprisedb.com
Re: Non-decimal integer literals
On Tue, Nov 22, 2022 at 8:36 PM Peter Eisentraut < peter.eisentr...@enterprisedb.com> wrote: > > On 15.11.22 11:31, Peter Eisentraut wrote: > > On 14.11.22 08:25, John Naylor wrote: > >> Regarding the patch, it looks good overall. My only suggestion would > >> be to add a regression test for just below and just above overflow, at > >> least for int2. > > > > ok > > This was a valuable suggestion, because this found some breakage. In > particular, the handling of grammar-level literals that overflow to > "Float" was not correct. (The radix prefix was simply stripped and > forgotten.) So I added a bunch more tests for this. Here is a new patch. Looks good to me. -- John Naylor EDB: http://www.enterprisedb.com
Re: Non-decimal integer literals
On Wed, Nov 23, 2022 at 3:54 PM David Rowley wrote:
>
> Going by [1], clang will actually use multiplication by 16 to
> implement the former. gcc is better and shifts left by 4, so likely
> won't improve things for gcc. It seems worth doing it this way for
> anything that does not have HAVE__BUILTIN_OP_OVERFLOW anyway.

FWIW, gcc 12.2 generates an imul on my system when compiling in situ. I've found it useful to run godbolt locally* and load the entire PG file (nicer to read than plain objdump) -- compilers can make different decisions when going from isolated snippets to within full functions.

* clone from https://github.com/compiler-explorer/compiler-explorer
  install npm 16
  run "make" and when finished will show the localhost url
  add the right flags, which in this case was

-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -O2 -I/path/to/srcdir/src/include -I/path/to/builddir/src/include -D_GNU_SOURCE

-- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Nov 16, 2022 at 11:46 AM John Naylor wrote: > > > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada wrote: > > Thanks! Please let me know if there is something I can help with. > > I didn't get very far because the tests fail on 0004 in rt_verify_node: > > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242 Actually I do want to offer some general advice. Upthread I recommended a purely refactoring patch that added the node-pointer struct but did nothing else, so that the DSA changes would be smaller. 0004 attempted pointer tagging in the same commit, which makes it no longer a purely refactoring patch, so that 1) makes it harder to tell what part caused the bug and 2) obscures what is necessary for DSA pointers and what was additionally necessary for pointer tagging. Shared memory support is a prerequisite for a shippable feature, but pointer tagging is (hopefully) a performance optimization. Let's keep them separate. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada wrote: > Thanks! Please let me know if there is something I can help with. I didn't get very far because the tests fail on 0004 in rt_verify_node: TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242 -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada wrote: > > On Wed, Nov 16, 2022 at 1:46 PM John Naylor > wrote: > > > > > > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada wrote: > > > Thanks! Please let me know if there is something I can help with. > > > > I didn't get very far because the tests fail on 0004 in rt_verify_node: > > > > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242 > > Which tests do you use to get this assertion failure? I've confirmed > there is a bug in 0005 patch but without it, "make check-world" > passed. Hmm, I started over and rebuilt and it didn't reproduce. Not sure what happened, sorry for the noise. I'm attaching a test I wrote to stress test branch prediction in search, and while trying it out I found two possible issues. It's based on the random int load test, but tests search speed. Run like this: select * from bench_search_random_nodes(10 * 1000 * 1000) It also takes some care to include all the different node kinds, restricting the possible keys by AND-ing with a filter. Here's a simple demo: filter = ((uint64)1<<40)-1; LOG: num_keys = 967, height = 4, n4 = 17513814, n32 = 6320, n128 = 62663, n256 = 3130 Just using random integers leads to >99% using the smallest node. I wanted to get close to having the same number of each, but that's difficult while still using random inputs. I ended up using filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF) which gives LOG: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024 Which seems okay for the task. One puzzling thing I found while trying various filters is that sometimes the reported tree height would change. For example: filter = (((uint64) 1<<32) | (0xFF<<24)); LOG: num_keys = 944, height = 7, n4 = 47515559, n32 = 6209, n128 = 62632, n256 = 3161 1) Any idea why the tree height would be reported as 7 here? I didn't expect that. 
2) It seems that 0004 actually causes a significant slowdown in this test (as in the attached, using the second filter above and with turboboost disabled): v9 0003: 2062 2051 2050 v9 0004: 2346 2316 2321 That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case. -- John Naylor EDB: http://www.enterprisedb.com diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql index 0874201d7e..e0205b364e 100644 --- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql +++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql @@ -43,6 +43,14 @@ returns record as 'MODULE_PATHNAME' LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE; +create function bench_search_random_nodes( +cnt int8, +OUT mem_allocated int8, +OUT search_ms int8) +returns record +as 'MODULE_PATHNAME' +LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE; + create function bench_fixed_height_search( fanout int4, OUT fanout int4, diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c index 7abb237e96..a43fc61c2d 100644 --- a/contrib/bench_radix_tree/bench_radix_tree.c +++ b/contrib/bench_radix_tree/bench_radix_tree.c @@ -29,6 +29,7 @@ PG_FUNCTION_INFO_V1(bench_seq_search); PG_FUNCTION_INFO_V1(bench_shuffle_search); PG_FUNCTION_INFO_V1(bench_load_random_int); PG_FUNCTION_INFO_V1(bench_fixed_height_search); +PG_FUNCTION_INFO_V1(bench_search_random_nodes); static uint64 tid_to_key_off(ItemPointer tid, uint32 *off) @@ -347,6 +348,77 @@ bench_load_random_int(PG_FUNCTION_ARGS) PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls))); } +/* copy of splitmix64() */ +static uint64 +hash64(uint64 x) +{ + x ^= x >> 30; + x *= 
UINT64CONST(0xbf58476d1ce4e5b9); + x ^= x >> 27; + x *= UINT64CONST(0x94d049bb133111eb); + x ^= x >> 31; + return x; +} + +/* attempts to have a relatively even population of node kinds */ +Datum +bench_search_random_nodes(PG_FUNCTION_ARGS) +{ + uint64 cnt = (uint64) PG_GETARG_INT64(0); + radix_tree *rt; + TupleDesc tupdesc; + TimestampTz start_time, + end_time; + long secs; + int usecs; + int64 search_time_ms; + Datum values[2] = {0}; + bool nulls[2] = {0}; + /* from trial and error */ + const uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF); +
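The interplay between the hash and the filter above can be sketched outside the benchmark module. This is a hedged illustration in plain C (make_key is an invented helper, not part of the patch): masking the hashed key zeroes whole bytes, and each zeroed byte collapses the corresponding radix-tree level to a single chunk, which is how the filter shapes the node-kind population.

```c
#include <assert.h>
#include <stdint.h>

/* Same mixing steps as the hash64() copied into the benchmark above
 * (splitmix64's finalizer); deterministic, so runs are repeatable. */
static uint64_t
hash64(uint64_t x)
{
	x ^= x >> 30;
	x *= UINT64_C(0xbf58476d1ce4e5b9);
	x ^= x >> 27;
	x *= UINT64_C(0x94d049bb133111eb);
	x ^= x >> 31;
	return x;
}

/* AND-ing with the filter restricts which bits can ever be set: bytes
 * that are fully masked out produce exactly one chunk at their tree
 * level, while unmasked bytes stay well-populated. */
static uint64_t
make_key(uint64_t i, uint64_t filter)
{
	return hash64(i) & filter;
}
```

For example, with filter = ((uint64) 1 << 40) - 1, every key has its top three bytes zero, so the levels addressed by those bytes hold one child each.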
Re: [PoC] Improve dead tuple storage for lazy vacuum
ASS(n32, RT_CLASS_32_PARTIAL)) This macro doesn't really improve readability -- it obscures what is being tested, and the name implies the "else" branch means "node doesn't need to grow class", which is false. If we want to simplify expressions in this block, I think it'd be more effective to improve the lines that follow: + memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size); + new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout; Maybe we can have const variables old_size and new_fanout to break out the array lookup? While I'm thinking of it, these arrays should be const so the compiler can avoid runtime lookups. Speaking of... +/* Copy both chunks and children/values arrays */ +static inline void +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children, + uint8 *dst_chunks, rt_node **dst_children, int count) +{ + /* For better code generation */ + if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout) + pg_unreachable(); + + memcpy(dst_chunks, src_chunks, sizeof(uint8) * count); + memcpy(dst_children, src_children, sizeof(rt_node *) * count); +} When I looked at this earlier, I somehow didn't go far enough -- why are we passing the runtime count in the first place? This function can only be called if count == rt_size_class_info[RT_CLASS_4_FULL].fanout. The last parameter to memcpy should evaluate to a compile-time constant, right? Even when we add node shrinking in the future, the constant should be correct, IIUC? - .fanout = 256, + /* technically it's 256, but we can't store that in a uint8, + and this is the max size class so it will never grow */ + .fanout = 0, - Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256)); + Assert(((rt_node *) n256)->fanout == 0); + Assert(chunk_exists || ((rt_node *) n256)->count < 256); These hacks were my work, but I think we can improve that by having two versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized nodes. 
For that to work, in "init-node" we'd need a branch to set fanout to zero for node256. That should be fine -- it already has to branch for memset'ing node128's indexes to 0xFF. -- John Naylor EDB: http://www.enterprisedb.com
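The point about the memcpy length being a compile-time constant can be shown in isolation. This is a hedged sketch with stand-in types (uint8_t chunks, void * children, a hard-coded FANOUT_4) rather than the real rt_node definitions: with the count fixed at compile time, the compiler is free to emit straight-line loads and stores instead of a variable-length memcpy call.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-in for rt_size_class_info[RT_CLASS_4_FULL].fanout */
#define FANOUT_4 4

/* Copy both chunks and children arrays with a constant element count,
 * in the spirit of chunk_children_array_copy() discussed above. */
static inline void
chunk_children_array_copy4(uint8_t *src_chunks, void **src_children,
						   uint8_t *dst_chunks, void **dst_children)
{
	memcpy(dst_chunks, src_chunks, sizeof(uint8_t) * FANOUT_4);
	memcpy(dst_children, src_children, sizeof(void *) * FANOUT_4);
}
```

Since the function can only be called when the source node is full, the constant stays correct even if node shrinking is added later, as noted above.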
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada wrote: > > On Mon, Jan 16, 2023 at 2:02 PM John Naylor > wrote: > > > Thanks! cfbot complaints about some warnings but these are expected > (due to unused delete routines etc). But one reported error[1] might > be relevant with 0002 patch? > [05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved > external symbol pg_popcount64 > [05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll : > fatal error LNK1120: 1 unresolved externals Yeah, I'm not sure what's causing that. Since that comes from a debugging function, we could work around it, but it would be nice to understand why, so I'll probably have to experiment on my CI repo. > --- > +#ifndef RT_COMMON > +#define RT_COMMON > > What are we using this macro RT_COMMON for? It was a quick way to define some things only once, so they probably all showed up in the list of things you found not undefined. It's different from the style of simplehash.h, which is to have a local name and #undef for every single thing. simplehash.h is a precedent, so I'll change it to match. I'll take a look at your list, too. > > + * Add Tids on a block to TidStore. The caller must ensure the offset numbers > > + * in 'offsets' are ordered in ascending order. > > > > Must? What happens otherwise? > > It ends up missing TIDs by overwriting the same key with different > values. Is it better to have a bool argument, say need_sort, to sort > the given array if the caller wants? > Since it seems you're working on another cleanup, I can address the > above comments after your work is completed. But I'm also fine with > including them into your cleanup work. I think we can work mostly simultaneously, if you work on tid store and vacuum, and I work on the template. We can always submit a full patchset including each other's latest work. That will catch rebase issues sooner. -- John Naylor EDB: http://www.enterprisedb.com
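For reference, the simplehash.h convention being adopted here can be sketched in a few lines. RT_PREFIX and RT_MAKE_NAME are illustrative names, not the actual template parameters: every generated symbol is pasted from a caller-supplied prefix, and every parameter is #undef'd afterward so the header can be included again under a different name.

```c
#include <assert.h>

/* Two-level paste so RT_PREFIX is macro-expanded before concatenation */
#define RT_PASTE_(a, b) a##b
#define RT_PASTE(a, b) RT_PASTE_(a, b)
#define RT_MAKE_NAME(name) RT_PASTE(RT_PREFIX, name)

/* first "inclusion" of the template */
#define RT_PREFIX intset_
static int RT_MAKE_NAME(magic)(void) { return 1; }	/* intset_magic */
#undef RT_PREFIX

/* second "inclusion" with a different prefix; no name clash */
#define RT_PREFIX tidstore_
static int RT_MAKE_NAME(magic)(void) { return 2; }	/* tidstore_magic */
#undef RT_PREFIX

/* simplehash.h-style cleanup: nothing leaks out of the header */
#undef RT_MAKE_NAME
#undef RT_PASTE
#undef RT_PASTE_
```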
Re: cutting down the TODO list thread
I wrote: > We could also revise the developer FAQ: > - remove phrase "Outstanding features are detailed in Todo." > - add suggestion to to check the Todo or Not_worth_doing pages to see if the desired feature is undesirable or problematic > - rephrase "Working in isolation is not advisable because others might be working on the same TODO item, or you might have misunderstood the TODO item." so it doesn't mention 'TODO' at all. There is also https://wiki.postgresql.org/wiki/So,_you_want_to_be_a_developer%3F#TODOs Changing the description of what links to the todo list will probably do more to reduce confusion than language in the todo list itself. -- John Naylor EDB: http://www.enterprisedb.com
Re: Small miscellaneus fixes (Part II)
On Mon, Jan 16, 2023 at 1:28 PM John Naylor wrote: > > > I wrote: > > ...but arguably the earlier check is close enough that it's silly to assert in the "else" branch, and I'd be okay with just nuking those lines. Another thing that caught my attention is the assumption that unsetting a bit is so expensive that we have to first check if it's set, so we may as well remove "IS_BRACKET(Np->Num)" as well. > > The attached is what I mean. I'll commit this this week unless there are objections. I've pushed this and the cosmetic fix in pg_dump. Those were the only things I saw that had some interest, so I closed the CF entry. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada wrote: > > On Mon, Jan 16, 2023 at 2:02 PM John Naylor > wrote: > > + * Add Tids on a block to TidStore. The caller must ensure the offset numbers > > + * in 'offsets' are ordered in ascending order. > > > > Must? What happens otherwise? > > It ends up missing TIDs by overwriting the same key with different > values. Is it better to have a bool argument, say need_sort, to sort > the given array if the caller wants? Now that I've studied it some more, I see what's happening: We need all bits set in the "value" before we insert it, since it would be too expensive to retrieve the current value, add one bit, and put it back. Also, as a consequence of the encoding, part of the tid is in the key, and part in the value. It makes more sense now, but it needs more than zero comments. As for the order, I don't think it's the responsibility of the caller to guess if it needs sorting -- if unordered offsets lead to data loss, this function needs to take care of it. > > + uint64 last_key = PG_UINT64_MAX; > > > > I'm having some difficulty understanding this sentinel and how it's used. > > Will improve the logic. Part of the problem is the English language: "last" can mean "previous" or "at the end", so maybe some name changes would help. -- John Naylor EDB: http://www.enterprisedb.com
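The key/value split described above can be sketched concretely. This is a hedged illustration, not the actual tid_to_key_off() definition -- the bit layout and helper names are invented, but the shape is the same: the upper bits of a (block, offset) pair form the radix-tree key, and the low bits select one bit in the 64-bit value.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_VALUE 64

/* Pack a tid into one integer; the 16-bit offset field is illustrative */
static uint64_t
tid_uint64(uint32_t blockno, uint16_t offset)
{
	return ((uint64_t) blockno << 16) | offset;
}

/* Upper bits become the key... */
static uint64_t
encode_key(uint32_t blockno, uint16_t offset)
{
	return tid_uint64(blockno, offset) / BITS_PER_VALUE;
}

/* ...and the low bits pick a bit within the value word */
static uint64_t
offset_bit(uint32_t blockno, uint16_t offset)
{
	return UINT64_C(1) << (tid_uint64(blockno, offset) % BITS_PER_VALUE);
}
```

Since several offsets on the same block map to the same key, their bits must be OR-ed into one value before a single insert; a later store with the same key replaces, rather than merges with, the earlier one -- which is exactly why unordered offsets lose TIDs.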
Re: cutting down the TODO list thread
On Wed, Jan 18, 2023 at 3:13 AM Bruce Momjian wrote: > I think we risk overloading people with too many words above, and they > will not read it fully. Can it be simplified? I wonder if some of this > belongs in the developer's FAQ and linked to that from the TODO list. I think you're right. That further drives home what I mentioned a few days ago: Maybe we don't need to change much in the actual list itself, but rephrase references to it from elsewhere. Here's a first draft to see what that would look like: -- https://wiki.postgresql.org/wiki/So,_you_want_to_be_a_developer%3F#TODOs from: "PostgreSQL maintains a TODO list on our wiki: http://wiki.postgresql.org/wiki/TODO We have attempted to organize issues and link in relevant discussions from our mailing list archives. Please read background information if it is available before attempting to resolve a TODO. If no background information is available, it is appropriate to post a question to pgsql-hack...@postgresql.org and request additional information and inquire about the status of any ongoing work on the problem." to: "It's worth checking if the feature of interest is found in the TODO list on our wiki: http://wiki.postgresql.org/wiki/TODO. That list contains some known PostgreSQL bugs, some feature requests, and some things we are not even sure we want. Many entries have a link to an email thread containing prior discussion, or perhaps attempts that for whatever reason didn't make it as far as getting committed." ...which might make more sense if moved below the "brand new features" section. -- https://wiki.postgresql.org/wiki/Developer_FAQ 1) from: "What areas need work? Outstanding features are detailed in Todo. You can learn more about these features by consulting the archives, the SQL standards and the recommended texts (see books for developers)." to: ??? -> For "what areas need work?", we need to have a different answer, but I'm not sure what it is. 
2) from: "What do I do after choosing an item to work on? Send an email to pgsql-hackers with a proposal for what you want to do (assuming your contribution is not trivial). Working in isolation is not advisable because others might be working on the same TODO item, or you might have misunderstood the TODO item. In the email, discuss both the internal implementation method you plan to use, and any user-visible changes (new syntax, etc)." to: "What do I do after choosing an area to work on? Send an email to pgsql-hackers with a proposal for what you want to do (assuming your contribution is not trivial). Working in isolation is not advisable because experience has shown that there are often requirements that are not obvious, and if those are not agreed on beforehand it leads to wasted effort. In the email, discuss both the internal implementation method you plan to use, and any user-visible changes (new syntax, etc)." > On Wed, Jan 11, 2023 at 02:09:56PM +0700, John Naylor wrote: > > I've come up with some revised language, including s/15/16/ and removing the > > category of "[E]" (easier to implement), since it wouldn't be here if it were > > actually easy: > > I think it is still possible for a simple item to be identified as > wanted and easy, but not completed and put on the TODO list. Theoretically it's possible, but in practice no one puts any easy items here. Currently, there are three marked as easy: pg_dump: - Dump security labels and comments on databases in a way that allows to load a dump into a differently named database - Add full object name to the tag field. eg. for operators we need '=(integer, integer)', instead of just '='. Dump and restore is critical to get right, and the code base is pretty large and hairy, so I don't actually believe for a second that these items are easy. ECPG: - sqlwarn[6] should be 'W' if the PRECISION or SCALE value specified In one sentence there are four uses of jargon that only someone with experience would understand. 
I have hacked around ECPG multiple times and have no idea what this means. The last two also don't have any motivation spelled out, much less an email thread. So my inclination is, the [E] marker here is unjustified. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16
On Sun, Jan 22, 2023 at 10:42 PM Joel Jacobson wrote: > I did write the code like you suggest first, but changed it, > since I realised the extra "else if" needed could be eliminated, > and thought div_var_int64() wouldn't be slower than div_var_int() since > I thought 64-bit instructions in general are as fast as 32-bit instructions, > on 64-bit platforms. According to Agner's instruction tables [1], integer division on Skylake (for example) has a latency of 26 cycles for 32-bit operands, and 42-95 cycles for 64-bit. [1] https://www.agner.org/optimize/instruction_tables.pdf -- John Naylor EDB: http://www.enterprisedb.com
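That latency gap is the whole argument for keeping a separate 32-bit path. This hedged sketch (div_dispatch is an invented name, and the real patch operates on numeric digit arrays, not bare integers) shows the dispatch idea: fall back to 64-bit division only when the operands actually need it.

```c
#include <assert.h>
#include <stdint.h>

/* When both operands fit in 32 bits, use the cheaper 32-bit divide;
 * per Agner's tables that is ~26 cycles on Skylake versus 42-95 for
 * the 64-bit form. */
static uint64_t
div_dispatch(uint64_t dividend, uint64_t divisor)
{
	if (dividend <= UINT32_MAX && divisor <= UINT32_MAX)
		return (uint32_t) dividend / (uint32_t) divisor;
	return dividend / divisor;
}
```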
bitscan forward/reverse on Windows
Attached is a quick-and-dirty attempt to add MSVC support for the rightmost/leftmost-one-pos functions. 0001 adds asserts to the existing coding. 0002 adds MSVC support. Tests pass on CI, but it's of course possible that there is some bug that prevents hitting the fastpath. I've mostly used the platform specific types, so some further cleanup might be needed. 0003 tries one way to reduce the duplication that arose in 0002. Maybe there is a better way. -- John Naylor EDB: http://www.enterprisedb.com From 9390a74f1712d58c88ac513afff8d878071c040c Mon Sep 17 00:00:00 2001 From: John Naylor Date: Tue, 24 Jan 2023 15:57:53 +0700 Subject: [PATCH v3 2/3] Add MSVC support for bitscan reverse/forward --- src/include/port/pg_bitutils.h | 38 ++ 1 file changed, 38 insertions(+) diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h index 9a53f53d11..ce04f4ffad 100644 --- a/src/include/port/pg_bitutils.h +++ b/src/include/port/pg_bitutils.h @@ -13,6 +13,10 @@ #ifndef PG_BITUTILS_H #define PG_BITUTILS_H +#ifdef _MSC_VER +#include <intrin.h> +#endif + extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256]; extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256]; extern PGDLLIMPORT const uint8 pg_number_of_ones[256]; @@ -27,6 +31,9 @@ pg_leftmost_one_pos32(uint32 word) { #ifdef HAVE__BUILTIN_CLZ int bitscan_result; +#elif defined(_MSC_VER) + unsigned long bitscan_result; + unsigned char non_zero; // XXX can this be bool? 
#endif #if !defined(HAVE__BUILTIN_CLZ) || defined(USE_ASSERT_CHECKING) @@ -45,6 +52,11 @@ pg_leftmost_one_pos32(uint32 word) bitscan_result = 31 - __builtin_clz(word); Assert(bitscan_result == result); return bitscan_result; +#elif defined(_MSC_VER) + non_zero = _BitScanReverse(&bitscan_result, word); + Assert(non_zero); + Assert(bitscan_result == result); + return bitscan_result; #else return result; #endif /* HAVE__BUILTIN_CLZ */ @@ -59,6 +71,9 @@ pg_leftmost_one_pos64(uint64 word) { #ifdef HAVE__BUILTIN_CLZ int bitscan_result; +#elif defined(_MSC_VER) + unsigned long bitscan_result; + unsigned char non_zero; // XXX can this be bool? #endif #if !defined(HAVE__BUILTIN_CLZ) || defined(USE_ASSERT_CHECKING) @@ -83,6 +98,11 @@ #endif /* HAVE_LONG_... */ Assert(bitscan_result == result); return bitscan_result; +#elif defined(_MSC_VER) + non_zero = _BitScanReverse64(&bitscan_result, word); + Assert(non_zero); + Assert(bitscan_result == result); + return bitscan_result; #else return result; #endif /* HAVE__BUILTIN_CLZ */ @@ -99,6 +119,10 @@ pg_rightmost_one_pos32(uint32 word) #ifdef HAVE__BUILTIN_CTZ const uint32 orig_word = word; int bitscan_result; +#elif defined(_MSC_VER) + const unsigned long orig_word = word; + unsigned long bitscan_result; + unsigned char non_zero; // XXX can this be bool? 
#endif #if !defined(HAVE__BUILTIN_CTZ) || defined(USE_ASSERT_CHECKING) @@ -118,6 +142,11 @@ bitscan_result = __builtin_ctz(orig_word); Assert(bitscan_result == result); return bitscan_result; +#elif defined(_MSC_VER) + non_zero = _BitScanForward(&bitscan_result, orig_word); + Assert(non_zero); + Assert(bitscan_result == result); + return bitscan_result; #else return result; #endif /* HAVE__BUILTIN_CTZ */ @@ -133,6 +162,10 @@ pg_rightmost_one_pos64(uint64 word) #ifdef HAVE__BUILTIN_CTZ const uint64 orig_word = word; int bitscan_result; +#elif defined(_MSC_VER) + const unsigned __int64 orig_word = word; + unsigned long bitscan_result; + unsigned char non_zero; // XXX can this be bool? #endif #if !defined(HAVE__BUILTIN_CTZ) || defined(USE_ASSERT_CHECKING) @@ -158,6 +191,11 @@ #endif /* HAVE_LONG_... */ Assert(bitscan_result == result); return bitscan_result; +#elif defined(_MSC_VER) + non_zero = _BitScanForward64(&bitscan_result, orig_word); + Assert(non_zero); + Assert(bitscan_result == result); + return bitscan_result; #else return result; #endif /* HAVE__BUILTIN_CTZ */ -- 2.39.0 From 88b98a3fe8d9b5f9a10a839e65b34c43c03e33b2 Mon Sep 17 00:00:00 2001 From: John Naylor Date: Tue, 24 Jan 2023 14:39:26 +0700 Subject: [PATCH v3 1/3] Add asserts to verify bitscan intrinsics --- src/include/port/pg_bitutils.h | 86 -- 1 file changed, 60 insertions(+), 26 deletions(-) diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h index a49a9c03d9..9a53f53d11 100644 --- a/src/include/port/pg_bitutils.h +++ b/src/include/port/pg_bitutils.h @@ -26,10 +26,11 @@ static inline int pg_leftmost_one_pos32(uint32 word) { #ifdef HAVE__BUILTIN_CLZ - Assert(word != 0); + int bitscan_result; +#endif - return 31 - __builtin_clz(word); -#else +#if !defined(HAVE__BUILTIN_CLZ) || defined(USE_ASSERT_CHECKING) + int result; int shift = 32 - 8; Assert(word != 0); @@ -37,7 +38,15 @@ pg_leftmost_one_pos32(uint32 word) while
((word
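The contract being asserted in the patches is compact enough to state standalone. This is a hedged sketch of the semantics only, not the pg_bitutils.h code: the portable fallback here is a simple shift loop rather than the byte-table lookup the real file uses, and both functions share the real functions' precondition that word != 0.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the most significant set bit; undefined for word == 0 */
static int
leftmost_one_pos32(uint32_t word)
{
#ifdef __GNUC__
	return 31 - __builtin_clz(word);
#else
	int result = 0;

	while (word >>= 1)
		result++;
	return result;
#endif
}

/* Index of the least significant set bit; undefined for word == 0 */
static int
rightmost_one_pos32(uint32_t word)
{
#ifdef __GNUC__
	return __builtin_ctz(word);
#else
	int result = 0;

	while ((word & 1) == 0)
	{
		word >>= 1;
		result++;
	}
	return result;
#endif
}
```

MSVC's _BitScanReverse/_BitScanForward fill in the same indexes, which is why the patch can assert the intrinsic result against the table-based computation.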
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 10:11 AM Andres Freund wrote: > I am. Just not every tradeoff. I just don't see any useful tradeoffs purely > based on the relation size. I expressed reservations about relation size six weeks ago: On Wed, Dec 14, 2022 at 12:16 AM Peter Geoghegan wrote: > > On Tue, Dec 13, 2022 at 12:29 AM John Naylor > wrote: > > If the number of unfrozen heap pages is the thing we care about, perhaps that, and not the total size of the table, should be the parameter that drives freezing strategy? > > That's not the only thing we care about, though. That was followed by several paragraphs that never got around to explaining why table size should drive freezing strategy. Review is a feedback mechanism alerting the patch author to possible problems. Listening to feedback is like vacuum, in a way: If it hurts, you're not doing it enough. -- John Naylor EDB: http://www.enterprisedb.com
Re: Considering additional sort specialisation functions for PG16
On Thu, Jan 26, 2023 at 6:14 PM Pavel Borisov wrote: > > - Only if there is more than one sort key, qsort the null array. Ideally at some point we would have a method of ignoring the first sortkey (this is an existing opportunity that applies elsewhere as well). > Should we need to sort by the second sort key provided the first one > is NULL by standard or by some part of the code relying on this? I'm not sure I quite understand the question. If there is more than one sort key, and the specialized comparison on the first key gives a definitive zero result, it falls back to comparing all keys from the full tuple. (The sorttuple struct only contains the first sortkey, which might actually be an abbreviated key.) A possible optimization, relevant here and also elsewhere, is to compare only using keys starting from key2. But note: if the first key is abbreviated, a zero result is not definitive, and we must check the first key's full value from the tuple. > I suppose NULL values in the first sort key mean attribute values are > undefined and there is no preferred order between these tuples, even > if their second sort keys are different. There is in fact a preferred order between these tuples -- the second key is the tie breaker in this case. -- John Naylor EDB: http://www.enterprisedb.com
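The fallback behavior described above can be stated as a stand-in comparator. Types and names here are illustrative, not tuplesort's: the first key decides, and only on a definitive tie does the second key break it.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct two_key
{
	int			key1;
	int			key2;
} two_key;

/* First key decides; on a tie, fall back to the second key */
static int
cmp_two_keys(const void *pa, const void *pb)
{
	const two_key *a = (const two_key *) pa;
	const two_key *b = (const two_key *) pb;

	if (a->key1 != b->key1)
		return (a->key1 > b->key1) - (a->key1 < b->key1);
	return (a->key2 > b->key2) - (a->key2 < b->key2);
}
```

With an abbreviated first key, the first comparison above would instead be a cheap prefix check whose zero result forces re-checking the full first-key value before moving on to key2.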
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Jan 25, 2023 at 8:42 AM Masahiko Sawada wrote: > > On Mon, Jan 23, 2023 at 8:20 PM John Naylor > wrote: > > > > On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada wrote: > > > > > > On Mon, Jan 16, 2023 at 2:02 PM John Naylor > > > wrote: > > > > In v21, all of your v20 improvements to the radix tree template and test have been squashed into 0003, with one exception: v20-0010 (recursive freeing of shared mem), which I've attached separately (for flexibility) as v21-0006. I believe one of your earlier patches had a new DSA function for freeing memory more quickly -- was there a problem with that approach? I don't recall where that discussion went. > > Hmm, I don't remember I proposed such a patch, either. I went looking, and it turns out I remembered wrong, sorry. > One idea to address it would be that we pass a shared memory to > RT_CREATE() and we create a DSA area dedicated to the radix tree in > place. We should return the created DSA area along with the radix tree > so that the caller can use it (e.g., for dsa_get_handle(), dsa_pin(), > and dsa_pin_mapping() etc). In RT_FREE(), we just detach from the DSA > area. A downside of this idea would be that one DSA area only for a > radix tree is always required. > > Another idea would be that we allocate a big enough DSA area and > quarry small memory for nodes from there. But it would need to > introduce another complexity so I prefer to avoid it. > > FYI the current design is inspired by dshash.c. In dshash_destroy(), > we dsa_free() each element allocated by dshash.c Okay, thanks for the info. > > 0007 makes the value type configurable. Some debug functionality still assumes integer type, but I think the rest is agnostic. 
> > radixtree_search_impl.h still assumes that the value type is an > integer type as follows: > > #ifdef RT_NODE_LEVEL_LEAF > RT_VALUE_TYPE value = 0; > > Assert(RT_NODE_IS_LEAF(node)); > #else > > Also, I think if we make the value type configurable, it's better to > pass the pointer of the value to RT_SET() instead of copying the > values since the value size could be large. Thanks, I will remove the assignment and look into pass-by-reference. > Oops, the fix is missed in the patch for some reason. I'll fix it. > > > There is also this, in the template, which I'm not sure has been addressed: > > > > * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter > > * has the local pointers to nodes, rather than RT_PTR_ALLOC. > > * We need either a safeguard to disallow other processes to begin the iteration > > * while one process is doing or to allow multiple processes to do the iteration. > > It's not addressed yet. I think adding a safeguard is better for the > first version. A simple solution is to add a flag, say iter_active, to > allow only one process to enable the iteration. What do you think? I don't quite have enough info to offer an opinion, but this sounds like a different form of locking. I'm sure it's come up before, but could you describe why iteration is different from other operations, regarding concurrency? > > Would it be worth it (or possible) to calculate constants based on compile-time block size? And/or have a fallback for other table AMs? Since this file is in access/common, the intention is to allow general-purpose, I imagine. > > I think we can pass the maximum offset numbers to tidstore_create() > and calculate these values. That would work easily for vacuumlazy.c, since it's in the "heap" subdir so we know the max possible offset. I haven't looked at vacuumparallel.c, but I can tell it is not in a heap-specific directory, so I don't know how easy that would be to pass along the right value. 
> > About shared memory: I have some mild reservations about the naming of the "control object", which may be in shared memory. Is that an established term? (If so, disregard the rest): It seems backwards -- the thing in shared memory is the actual tree itself. The thing in backend-local memory has the "handle", and that's how we control the tree. I don't have a better naming scheme, though, and might not be that important. (Added a WIP comment) > > That seems a valid concern. I borrowed the "control object" from > dshash.c but it supports only shared cases. The fact that the radix > tree supports both local and shared seems to introduce this confusion. > I came up with other names such as RT_RADIX_TREE_CORE or > RT_RADIX_TREE_ROOT but not sure these are better than the current > one. Okay, if dshash uses it, we have some precedent. > > Now might be a good time to look at earlier XXX comments and come up with a plan to address them. > > Agreed. > > Other
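The proposed safeguard could look roughly like this single-process sketch. The names (rt_control, iter_active, and the begin/end functions) are illustrative, and in the shared-memory case the test-and-set would of course need an atomic operation or a lock rather than a plain bool.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal control-object sketch: at most one active iterator */
typedef struct rt_control
{
	bool		iter_active;
} rt_control;

/* Returns false if another iteration is already in progress */
static bool
rt_begin_iterate(rt_control *ctl)
{
	if (ctl->iter_active)
		return false;
	ctl->iter_active = true;
	return true;
}

static void
rt_end_iterate(rt_control *ctl)
{
	ctl->iter_active = false;
}
```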
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Jan 24, 2023 at 1:17 PM Dilip Kumar wrote: > > On Mon, Jan 23, 2023 at 6:00 PM John Naylor > wrote: > > > > Attached is a rebase to fix conflicts from recent commits. > > I have reviewed v22-0022* patch and I have some comments. > > 1. > >It also changes to the column names max_dead_tuples and num_dead_tuples and to > >show the progress information in bytes. > > I think this statement needs to be rephrased. Could you be more specific? > 3. > > We are changing the min value of 'maintenance_work_mem' to 2MB. Should > we do the same for the 'autovacuum_work_mem'? Yes, we should change that, too. We've discussed previously that autovacuum_work_mem is possibly rendered unnecessary by this work, but we agreed that that should be a separate thread. And needs additional testing to verify. I agree with your other comments. -- John Naylor EDB: http://www.enterprisedb.com
Re: Considering additional sort specialisation functions for PG16
On Tue, Aug 23, 2022 at 1:13 PM John Naylor wrote: > > On Tue, Aug 23, 2022 at 11:24 AM David Rowley wrote: > > > > On Tue, 23 Aug 2022 at 15:22, John Naylor wrote: > > > Did you happen to see > > > > > > https://www.postgresql.org/message-id/CAFBsxsFhq8VUSkUL5YO17cFXbCPwtbbxBu%2Bd9MFrrsssfDXm3Q%40mail.gmail.com > > > > I missed that. It looks like a much more promising idea than what I > > came up with. I've not looked at your code yet, but I'm interested and > > will aim to look soon. > > Note that I haven't actually implemented this idea yet, just tried to > model the effects by lobotomizing the current comparators. I think > it's worth pursuing and will try to come back to it this cycle, but if > you or anyone else wants to try, that's fine of course. Coming back to this, I wanted to sketch out this idea in a bit more detail. Have two memtuple arrays, one for first sortkey null and one for first sortkey non-null:

- Qsort the non-null array, including whatever specialization is available. Existing specialized comparators could ignore nulls (and their ordering) taking less space in the binary.

- Only if there is more than one sort key, qsort the null array. Ideally at some point we would have a method of ignoring the first sortkey (this is an existing opportunity that applies elsewhere as well).

- To handle two arrays, grow_memtuples() would need some adjustment, as would any callers that read the final result of an in-memory sort -- they would need to retrieve the tuples starting with the appropriate array depending on NULLS FIRST/LAST behavior. I believe external merges wouldn't have to do anything different, since when writing out the tapes, we read from the arrays in the right order.

(One could extend this idea further and have two pools of tapes for null and non-null first sortkey, that are merged separately, in the right order. That sounds like quite a bit more complexity than it's worth, however.) -- John Naylor EDB: http://www.enterprisedb.com
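The steps above can be sketched with simplified types (an int key plus an isnull flag rather than the real SortTuple): split tuples by first-key nullness, run only the non-null array through the (specialized) comparator, and let the reader walk the two arrays in NULLS FIRST/LAST order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct sorttuple
{
	int			key;
	bool		isnull;
} sorttuple;

static int
cmp_key(const void *pa, const void *pb)
{
	int			a = ((const sorttuple *) pa)->key;
	int			b = ((const sorttuple *) pb)->key;

	return (a > b) - (a < b);
}

/* Split into non-null and null arrays; sort only the former. The null
 * array would need its own qsort only when there is more than one sort
 * key. Returns the number of non-null tuples. */
static size_t
partition_and_sort(const sorttuple *in, size_t n,
				   sorttuple *nonnulls, sorttuple *nulls)
{
	size_t		n_nonnull = 0;
	size_t		n_null = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (in[i].isnull)
			nulls[n_null++] = in[i];
		else
			nonnulls[n_nonnull++] = in[i];
	}
	qsort(nonnulls, n_nonnull, sizeof(sorttuple), cmp_key);
	return n_nonnull;
}
```

Note the payoff claimed above: cmp_key never sees a null, so the specialized comparator needs no null-ordering branches at all.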