Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread John Naylor
On Sun, Aug 7, 2022 at 4:25 AM Nathan Bossart  wrote:
>
> [v8]

Okay, I think it's basically in good shape. Since it should be a bit
faster than a couple versions ago, would you be up for retesting with
the original test having 8 to 512 writers? And also add the const
markers we discussed upthread? Aside from that, I plan to commit this
week unless there is further bikeshedding.
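(Concretely, that means declaring the read-only local vector variables
const as a hint to the reader, e.g. something like the hypothetical

	const __m128i keys = _mm_set1_epi32(key);

per the earlier review comments.)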

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread John Naylor
On Mon, Aug 8, 2022 at 2:26 PM Bharath Rupireddy
 wrote:
>
>
> 1) pg_lfind32 - why just uint32? If it's not possible to define
> functions for char, unsigned char, int16, uint16, int32, int64, uint64
> and so on, can we add a few comments around that? Also, the comments

Future work, as far as I'm concerned. I'm interested in using a char
version for json strings.

> 3) Can pg_lfind32 return the index of the key found, for instance to
> use it for setting/resetting the found element in the array?

That was just discussed. It's slightly faster not to return an index.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: NAMEDATALEN increase because of non-latin languages

2022-08-10 Thread John Naylor
I wrote:

> The syscache use of GETSTRUCT still uses a simple cast of the tuple (for 
> pg_cast those calls live in parse_coerce.c, which is unchanged from master in 
> v3). Next step I think is to see about the syscache piece -- teaching a 
> syscache miss to deform the entire tuple into a struct and store it in the 
> syscache.

v4 is a hackish attempt at getting the syscache to deform the tuple
and store the struct. Using only pg_cast again for ease, it seems to
work for that. It's now about as far as I can get without thinking
about byref types.

0001 just adds copies of some syscache / catcache functions for
private use by pg_cast, as scaffolding.
0002 teaches the temporary CatalogCacheCreateEntry_STRUCT() to call
heap_deform_tuple(). This then calls a for-now handwritten function
(similar to the one generated in v3) which palloc's space for the
struct and copies the fields over. Only this function knows the
catalog struct type, so a future design could call the per-cache
function via a pointer in the catcache control array. Since we already
have the isnull/values array, it's also trivial to copy the datums for
the cache keys. WIP: CatCTup->tuple is still declared a tuple, but the
t_data contents are now bogus, so there is a temporary GETSTRUCT_NEW
that knows it's looking directly at the struct.
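For illustration, since t_data now points straight at the struct in
this scheme, such a macro could be as simple as the following sketch
(the actual htup_details.h hunk isn't shown below):

#define GETSTRUCT_NEW(cat, TUP) ((Form_##cat) ((TUP)->t_data))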

Getting to varlen attributes: For one, I think it was mentioned above
that we will need a way to perform a deep copy of structs that contain
pointers to varlen fields. I imagine keeping track of the attributes'
lengths will come up for some types and might be tricky. And before
that, the catalog machinery will need some preprocessor tricks to
declare pointers in the structs.

-- 
John Naylor
EDB: http://www.enterprisedb.com
From 20fba44412e8ef1bb4cd5b051b9d7e82618a6d93 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Wed, 10 Aug 2022 17:19:24 +0700
Subject: [PATCH v4 2/2] Teach catcache to store structs for pg_cast

---
 src/backend/parser/parse_coerce.c  |  6 +--
 src/backend/utils/cache/catcache.c | 73 --
 src/include/access/htup_details.h  |  2 +
 3 files changed, 55 insertions(+), 26 deletions(-)

diff --git a/src/backend/parser/parse_coerce.c b/src/backend/parser/parse_coerce.c
index 39b7e5707b..07a1b047e3 100644
--- a/src/backend/parser/parse_coerce.c
+++ b/src/backend/parser/parse_coerce.c
@@ -3056,7 +3056,7 @@ IsBinaryCoercible(Oid srctype, Oid targettype)
 			ObjectIdGetDatum(targettype));
 	if (!HeapTupleIsValid(tuple))
 		return false;			/* no cast */
-	castForm = (Form_pg_cast) GETSTRUCT(tuple);
+	castForm = GETSTRUCT_NEW(pg_cast, tuple);
 
 	result = (castForm->castmethod == COERCION_METHOD_BINARY &&
 			  castForm->castcontext == COERCION_CODE_IMPLICIT);
@@ -3120,7 +3120,7 @@ find_coercion_pathway(Oid targetTypeId, Oid sourceTypeId,
 
 	if (HeapTupleIsValid(tuple))
 	{
-		Form_pg_cast castForm = (Form_pg_cast) GETSTRUCT(tuple);
+		Form_pg_cast castForm = GETSTRUCT_NEW(pg_cast, tuple);
 		CoercionContext castcontext;
 
 		/* convert char value for castcontext to CoercionContext enum */
@@ -3287,7 +3287,7 @@ find_typmod_coercion_function(Oid typeId,
 
 	if (HeapTupleIsValid(tuple))
 	{
-		Form_pg_cast castForm = (Form_pg_cast) GETSTRUCT(tuple);
+		Form_pg_cast castForm = GETSTRUCT_NEW(pg_cast, tuple);
 
 		*funcid = castForm->castfunc;
 		ReleaseSysCache(tuple);
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index b1287bb6a0..8ddc109052 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -21,6 +21,7 @@
 #include "access/table.h"
 #include "access/valid.h"
 #include "access/xact.h"
+#include "catalog/pg_cast.h" // fixme
 #include "catalog/pg_collation.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_type.h"
@@ -2158,6 +2159,42 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
 	return ct;
 }
 
+// WIP: generated functions would look like this and be called through a pointer
+// FIXME: ct->tuple is no longer a real tuple
+// XXX: for now assume the caller has switched to the right memory context
+static CatCTup *
+CatCArrayGetStruct_pg_cast(HeapTuple pg_cast_tuple, CatCTup *ct, Datum *values, bool *isnull)
+{
+	Form_pg_cast pg_cast_struct;
+
+	/* Allocate memory for CatCTup and the cached struct in one go */
+	ct = (CatCTup *) palloc(sizeof(CatCTup) +
+			MAXIMUM_ALIGNOF + sizeof(FormData_pg_cast));
+
+	/* copy the identification info */
+	// WIP: for caches we only need t_self, can we just have that as a
+	// separate field in CatCTup?
+	ct->tuple.t_len = pg_cast_tuple->t_len;
+	ct->tuple.t_self = pg_cast_tuple->t_self;
+	ct->tuple.t_tableOid = pg_cast_tuple->t_tableOid;
+
+	// WIP: treat t_data as a pointer to the struct
+	ct->tuple.t_data = (HeapTupleHeader)
+		MAXALIGN(((char *) ct) + sizeof(CatCTup));
+	pg_cast

Re: [RFC] building postgres with meson

2022-08-10 Thread John Naylor
On Thu, Aug 11, 2022 at 12:19 AM Andres Freund  wrote:
> I tried to ignore various generated files in the source tree, but I don't
> think it's doable for all of them. Consider
> e.g. src/backend/utils/misc/guc-file.c which gets built via #include
> "guc-file.c" from gram.c

With a bit of work, we could probably get rid of those includes. See
27199058d98ef7f for one example.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: use SSE2 for is_valid_ascii

2022-08-10 Thread John Naylor
On Thu, Aug 11, 2022 at 5:31 AM Nathan Bossart  wrote:
>
> This is a neat patch.  I don't know that we need an entirely separate code
> block for the USE_SSE2 path, but I do think that a little bit of extra
> commentary would improve the readability.  IMO the existing comment for the
> zero accumulator has the right amount of detail.
>
> +   /*
> +* Set all bits in each lane of the error accumulator where 
> input
> +* bytes are zero.
> +*/
> +   error_cum = _mm_or_si128(error_cum,
> +
> _mm_cmpeq_epi8(chunk, _mm_setzero_si128()));

Okay, I will think about the comments, thanks for looking.

> I wonder if reusing a zero vector (instead of creating a new one every
> time) has any noticeable effect on performance.

Creating a zeroed register is just "PXOR FOO, FOO" (xor-ing a register
with itself), which should get hoisted out of the (unrolled, in this
case) loop, and which a recent CPU will simply map to a hard-coded zero
in the register file, in which case the execution latency is 0 cycles. :-)
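To illustrate (a sketch only -- has_zero_byte and its missing tail
handling are hypothetical), hoisting the zero manually should generate
the same code as calling _mm_setzero_si128() inside the loop:

#include <emmintrin.h>
#include <stdbool.h>
#include <stddef.h>

/* sketch: accumulate "saw a zero byte" over 16-byte chunks */
static bool
has_zero_byte(const unsigned char *p, size_t len)
{
	const __m128i zero = _mm_setzero_si128();	/* PXOR FOO, FOO */
	__m128i		accum = zero;
	size_t		i;

	for (i = 0; i + sizeof(__m128i) <= len; i += sizeof(__m128i))
	{
		__m128i		chunk = _mm_loadu_si128((const __m128i *) (p + i));

		accum = _mm_or_si128(accum, _mm_cmpeq_epi8(chunk, zero));
	}
	/* NB: a real version must also handle any remaining tail bytes */
	return _mm_movemask_epi8(accum) != 0;
}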

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: [RFC] building postgres with meson

2022-08-10 Thread John Naylor
On Thu, Aug 11, 2022 at 10:37 AM Tom Lane  wrote:
>
> John Naylor  writes:
> > With a bit of work, we could probably get rid of those includes. See
> > 27199058d98ef7f for one example.
>
> Yeah --- it would mean creating gram.h files for all the bison grammars
> not just a few of them, but it's certainly do-able if there's motivation
> to make the changes.  Most of the files that are done that way date
> from before we knew about flex's %top.

I'll volunteer to work on this unless an easier solution happens to
come along in the next couple days. (aside: guc-file.l doesn't have a
grammar, so not yet sure if that makes the issue easier or harder...)

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-10 Thread John Naylor
On Thu, Aug 11, 2022 at 4:46 AM Nathan Bossart  wrote:
>
> On Wed, Aug 10, 2022 at 10:50:02AM +0700, John Naylor wrote:
> > LGTM, let's see what the buildfarm thinks of 0001.
>
> Thanks!  I haven't noticed any related buildfarm failures yet.

I was waiting for all the Windows animals to report in, and it looks
like they have, so I've pushed 0002. Thanks for picking this topic up
again!

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-08-15 Thread John Naylor
On Mon, Aug 15, 2022 at 12:39 PM Masahiko Sawada  wrote:
>
> On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada  
> wrote:
> >
> > On Tue, Jul 19, 2022 at 1:30 PM John Naylor
> >  wrote:
> > >
> > >
> > >
> > > On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada  
> > > wrote:
> > >
> > > > I’d like to keep the first version simple. We can improve it and add
> > > > more optimizations later. Using radix tree for vacuum TID storage
> > > > would still be a big win comparing to using a flat array, even without
> > > > all these optimizations. In terms of single-value leaves method, I'm
> > > > also concerned about an extra pointer traversal and extra memory
> > > > allocation. It's most flexible but multi-value leaves method is also
> > > > flexible enough for many use cases. Using the single-value method
> > > > seems to be too much as the first step for me.
> > > >
> > > > Overall, using 64-bit keys and 64-bit values would be a reasonable
> > > > choice for me as the first step . It can cover wider use cases
> > > > including vacuum TID use cases. And possibly it can cover use cases by
> > > > combining a hash table or using tree of tree, for example.
> > >
> > > These two aspects would also bring it closer to Andres' prototype, which 
> > > 1) makes review easier and 2) easier to preserve optimization work 
> > > already done, so +1 from me.
> >
> > Thanks.
> >
> > I've updated the patch. It now implements 64-bit keys, 64-bit values,
> > and the multi-value leaves method. I've tried to remove duplicated
> > codes but we might find a better way to do that.
> >
>
> With the recent changes related to simd, I'm going to split the patch
> into at least two parts: introduce other simd optimized functions used
> by the radix tree and the radix tree implementation. Particularly we
> need two functions for radix tree: a function like pg_lfind32 but for
> 8 bits integers and return the index, and a function that returns the
> index of the first element that is >= key.

I recommend looking at

https://www.postgresql.org/message-id/CAFBsxsESLUyJ5spfOSyPrOvKUEYYNqsBosue9SV1j8ecgNXSKA%40mail.gmail.com

since I did the work just now for searching bytes and returning a
bool, both for = and <=. That should be pretty close to what you need.
Also, I believe that if you leave this for last as a possible
refactoring, it might save some work. In any case, I'll take a look at
the latest patch next month.
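To be concrete, the scalar fallback of the index-returning variant
might look like this sketch (name and signature hypothetical):

/* return the index of the first element equal to key, or -1 */
static inline int
pg_lfind8_idx(uint8 key, const uint8 *base, uint32 nelem)
{
	uint32		i;

	for (i = 0; i < nelem; i++)
	{
		if (base[i] == key)
			return i;
	}
	return -1;
}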

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: [PATCH] Optimize json_lex_string by batching character copying

2022-08-15 Thread John Naylor
I wrote:

> On Mon, Jul 11, 2022 at 11:07 PM Andres Freund  wrote:
>
> > I wonder if we can add a somewhat more general function for scanning until
> > some characters are found using SIMD? There's plenty other places that could
> > be useful.
>
> In simple cases, we could possibly abstract the entire loop. With this 
> particular case, I imagine the most approachable way to write the loop would 
> be a bit more low-level:
>
> while (p < end - VECTOR_WIDTH &&
>!vector_has_byte(p, '\\') &&
>!vector_has_byte(p, '"') &&
>vector_min_byte(p, 0x20))
> p += VECTOR_WIDTH
>
> I wonder if we'd lose a bit of efficiency here by not accumulating set bits 
> from the three conditions, but it's worth trying.

The attached implements the above, more or less, using new pg_lfind8()
and pg_lfind8_le(), which in turn are based on helper functions that
act on a single vector. The pg_lfind* functions have regression tests,
but I haven't done the same for json yet. I went the extra step to use
bit-twiddling for non-SSE builds using uint64 as a "vector", which
still gives a pretty good boost (test below, min of 3):

master:
356ms

v5:
259ms

v5 disable SSE:
288ms

It still needs a bit of polishing and testing, but I think it's a good
workout for abstracting SIMD out of the way.
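For reference, the uint64 fallback can detect a matching byte with the
standard zero-in-word bit-twiddling idiom, roughly like this sketch
(the helper names here are illustrative, not necessarily what's in the
patch):

/* true if any byte in v is zero */
static inline bool
vector_has_zero(uint64 v)
{
	return ((v - UINT64CONST(0x0101010101010101)) &
			~v & UINT64CONST(0x8080808080808080)) != 0;
}

/* true if any byte in v equals c: XOR zeroes out the matching bytes */
static inline bool
vector_has_byte(uint64 v, uint8 c)
{
	return vector_has_zero(v ^ (UINT64CONST(0x0101010101010101) * c));
}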

-
test:

DROP TABLE IF EXISTS long_json_as_text;
CREATE TABLE long_json_as_text AS
with long as (
select repeat(description, 11)
from pg_description
)
select (select json_agg(row_to_json(long))::text as t from long) from
generate_series(1, 100);
VACUUM FREEZE long_json_as_text;

select 1 from long_json_as_text where t::json is null; -- from Andrew upthread

--
John Naylor
EDB: http://www.enterprisedb.com
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index fefd1d24d9..1f9eb134e8 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -19,6 +19,7 @@
 
 #include "common/jsonapi.h"
 #include "mb/pg_wchar.h"
+#include "port/pg_lfind.h"
 
 #ifndef FRONTEND
 #include "miscadmin.h"
@@ -844,7 +845,7 @@ json_lex_string(JsonLexContext *lex)
 		}
 		else
 		{
-			char	   *p;
+			char	   *p = s;
 
 			if (hi_surrogate != -1)
 return JSON_UNICODE_LOW_SURROGATE;
@@ -853,7 +854,13 @@ json_lex_string(JsonLexContext *lex)
 			 * Skip to the first byte that requires special handling, so we
 			 * can batch calls to appendBinaryStringInfo.
 			 */
-			for (p = s; p < end; p++)
+			while (p < end - SIZEOF_VECTOR &&
+	!pg_lfind8('\\', (unsigned char*) p, SIZEOF_VECTOR) &&
+	!pg_lfind8('"', (unsigned char*) p, SIZEOF_VECTOR) &&
+	!pg_lfind8_le(0x1F, (unsigned char*) p, SIZEOF_VECTOR))
+p += SIZEOF_VECTOR;
+
+			for (; p < end; p++)
 			{
 if (*p == '\\' || *p == '"')
 	break;
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index fb125977b2..e090ea6ac3 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -1,7 +1,7 @@
 /*-
  *
  * pg_lfind.h
- *	  Optimized linear search routines.
+ *	  Optimized linear search routines using SIMD intrinsics where available
  *
  * Copyright (c) 2022, PostgreSQL Global Development Group
  *
@@ -15,6 +15,76 @@
 
 #include "port/simd.h"
 
+/*
+ * pg_lfind8
+ *
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
+ */
+static inline bool
+pg_lfind8(uint8 key, uint8 *base, uint32 nelem)
+{
+	uint32		i;
+	/* round down to multiple of vector length */
+	uint32		iterations = nelem & ~(SIZEOF_VECTOR - 1);
+	TYPEOF_VECTOR		chunk;
+
+	for (i = 0; i < iterations; i += SIZEOF_VECTOR)
+	{
+#ifdef USE_SSE2
+		chunk = _mm_loadu_si128((const __m128i *) &base[i]);
+#else
+		memcpy(&chunk, &base[i], sizeof(chunk));
+#endif			/* USE_SSE2 */
+		if (vector_eq_byte(chunk, key))
+			return true;
+	}
+
+	/* Process the remaining elements one at a time. */
+	for (; i < nelem; i++)
+	{
+		if (key == base[i])
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * pg_lfind8_le
+ *
+ * Return true if there is an element in 'base' that is less than or equal
+ * to 'key', otherwise return false.
+ */
+static inline bool
+pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
+{
+	uint32		i;
+	/* round down to multiple of vector length */
+	uint32		iterations = nelem & ~(SIZEOF_VECTOR - 1);
+	TYPEOF_VECTOR		chunk;
+
+	for (i = 0; i < iterations; i += SIZEOF_VECTOR)
+	{
+#ifdef USE_SSE2
+		chunk = _mm_loadu_si128((const __m128i *) &base[i]);
+#else
+		memcpy(&chunk, &base[i], sizeof(chunk));
+#endif			/* USE_SSE2 */
+		if (vector_le_byte(chunk, key))
+			return true;
+	}
+
+	/* Process the remaining elements one at a time. */
+	for (; i < nelem; i++)
+	{
+		if (base[i] <= key)
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * pg_l

Re: fix typos

2022-08-15 Thread John Naylor
On Fri, Aug 12, 2022 at 8:55 PM Tom Lane  wrote:
>
> John Naylor  writes:
> > This is really a straw-man proposal, since I'm not volunteering to do
> > the work, or suggest anybody else should do the same. That being the
> > case, it seems we should just go ahead with Justin's patch for
> > consistency. Possibly we could also change the messages to say "ID"?
>
> I'd be content if we change the user-facing messages (and documentation
> if any) to say "ID" not "OID".

The documentation has both, so it makes sense to standardize on "ID".
The messages all had oid/OID, which I changed in the attached. I think
I got all the places.

I'm thinking it's not wrong enough to confuse people, but consistency
is good, so backpatch to v15 and no further. Does anyone want to make
a case otherwise?

-- 
John Naylor
EDB: http://www.enterprisedb.com
diff --git a/doc/src/sgml/replication-origins.sgml b/doc/src/sgml/replication-origins.sgml
index 7e02c4605b..bb0fb624d2 100644
--- a/doc/src/sgml/replication-origins.sgml
+++ b/doc/src/sgml/replication-origins.sgml
@@ -27,12 +27,12 @@
  
 
  
-  Replication origins have just two properties, a name and an OID. The name,
+  Replication origins have just two properties, a name and an ID. The name,
   which is what should be used to refer to the origin across systems, is
   free-form text. It should be used in a way that makes conflicts
   between replication origins created by different replication solutions
   unlikely; e.g., by prefixing the replication solution's name to it.
-  The OID is used only to avoid having to store the long version
+  The ID is used only to avoid having to store the long version
   in situations where space efficiency is important. It should never be shared
   across systems.
  
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index c72ad6b93d..9eb97fac58 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -328,7 +328,7 @@ replorigin_create(const char *roname)
 	if (tuple == NULL)
 		ereport(ERROR,
 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("could not find free replication origin OID")));
+ errmsg("could not find free replication origin ID")));
 
 	heap_freetuple(tuple);
 	return roident;
@@ -364,7 +364,7 @@ restart:
 if (nowait)
 	ereport(ERROR,
 			(errcode(ERRCODE_OBJECT_IN_USE),
-			 errmsg("could not drop replication origin with OID %d, in use by PID %d",
+			 errmsg("could not drop replication origin with ID %u, in use by PID %d",
 	state->roident,
 	state->acquired_by)));
 
@@ -408,7 +408,7 @@ restart:
 	 */
 	tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
 	if (!HeapTupleIsValid(tuple))
-		elog(ERROR, "cache lookup failed for replication origin with oid %u",
+		elog(ERROR, "cache lookup failed for replication origin with ID %u",
 			 roident);
 
 	CatalogTupleDelete(rel, &tuple->t_self);
@@ -485,7 +485,7 @@ replorigin_by_oid(RepOriginId roident, bool missing_ok, char **roname)
 		if (!missing_ok)
 			ereport(ERROR,
 	(errcode(ERRCODE_UNDEFINED_OBJECT),
-	 errmsg("replication origin with OID %u does not exist",
+	 errmsg("replication origin with ID %u does not exist",
 			roident)));
 
 		return false;
@@ -937,7 +937,7 @@ replorigin_advance(RepOriginId node,
 		{
 			ereport(ERROR,
 	(errcode(ERRCODE_OBJECT_IN_USE),
-	 errmsg("replication origin with OID %d is already active for PID %d",
+	 errmsg("replication origin with ID %u is already active for PID %d",
 			replication_state->roident,
 			replication_state->acquired_by)));
 		}
@@ -948,7 +948,7 @@ replorigin_advance(RepOriginId node,
 	if (replication_state == NULL && free_state == NULL)
 		ereport(ERROR,
 (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
- errmsg("could not find free replication state slot for replication origin with OID %u",
+ errmsg("could not find free replication state slot for replication origin with ID %u",
 		node),
  errhint("Increase max_replication_slots and try again.")));
 
@@ -1126,7 +1126,7 @@ replorigin_session_setup(RepOriginId node)
 		{
 			ereport(ERROR,
 	(errcode(ERRCODE_OBJECT_IN_USE),
-	 errmsg("replication origin with OID %d is already active for PID %d",
+	 errmsg("replication origin with ID %u is already active for PID %d",
 			curstate->roident, curstate->acquired_by)));
 		}
 
@@ -1138,7 +1138,7 @@ replorigin_session_setup(RepOriginId node)
 	if (session_replication_state == NULL && free_slot == -1)
 		ereport(ERROR,
 (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
- errmsg("could not find free replication state slot for replication origin with OID %u",
+ errmsg("could not find free replication state slot for replication origin with ID %u",
 		node),
  errhint("Increase max_replication_slots and try again.")));
 	else if (session_replication_state == NULL)


Re: build remaining Flex files standalone

2022-08-16 Thread John Naylor
For v3, I addressed some comments and added .h files to the
headerscheck exceptions.

On Tue, Aug 16, 2022 at 1:11 AM Andres Freund  wrote:

> On 2022-08-13 15:39:06 +0700, John Naylor wrote:
> > Here are the rest. Most of it was pretty straightforward, with the
> > main exception of jsonpath_scan.c, which is not quite finished. That
> > one passes tests but still has one compiler warning. I'm unsure how
> > much of what is there already is really necessary or was cargo-culted
> > from elsewhere without explanation. For starters, I'm not sure why the
> > grammar has a forward declaration of "union YYSTYPE". It's noteworthy
> > that it used to compile standalone, but with a bit more stuff, and
> > that was reverted in 550b9d26f80fa30. I can hack on it some more later
> > but I ran out of steam today.

I've got it in half-way decent shape now, with an *internal.h header
and some cleanups.

> > - Include order seems to matter for the grammar's .h file. I didn't
> > test if that was the case every time, and after a few miscompiles just
> > always made it the last inclusion, but I'm wondering if we should keep
> > those inclusions outside %top{} and put it at the start of the next
> > %{} ?
>
> I think we have a few of those dependencies already, see e.g.
> /*
>  * NB: include gram.h only AFTER including scanner.h, because scanner.h
>  * is what #defines YYLTYPE.
>  */

Went with something like this in all cases:

/*
 * NB: include bootparse.h only AFTER including bootstrap.h, because bootstrap.h
 * includes node definitions needed for YYSTYPE.
 */

Future cleanup: I see this in headerscheck:

# We can't make these Bison output files compilable standalone
# without using "%code require", which old Bison versions lack.
# parser/gram.h will be included by parser/gramparse.h anyway.

That directive has been supported in Bison since 2.4.2.
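For reference, using it would look something like this in a .y file (a
sketch -- which header the declarations need is grammar-specific):

%code requires {
	/* copied into both the .c file and the generated .h, ahead of YYSTYPE */
	#include "nodes/parsenodes.h"
}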

> > From d723ba14acf56fd432e9e263db937fcc13fc0355 Mon Sep 17 00:00:00 2001
> > From: John Naylor 
> > Date: Thu, 11 Aug 2022 19:38:37 +0700
> > Subject: [PATCH v201 1/9] Build guc-file.c standalone
>
> Might be worth doing some of the moving around here separately from the
> parser/scanner specific bits.

Done in 0001/0003.

> > +/* functions shared between guc.c and guc-file.l */
> > [...]
> I think I prefer your suggestion of a guc_internal.h upthread.

Started in 0002, but left open the headerscheck failure.

Also, if such a thing is meant to be #include'd only by two generated
files, maybe it should just live in the directory where they live, and
not in the src/include dir?

> > From 7d4ecfcb3e91f3b45e94b9e64c7c40f1bbd22aa8 Mon Sep 17 00:00:00 2001
> > From: John Naylor 
> > Date: Fri, 12 Aug 2022 15:45:24 +0700
> > Subject: [PATCH v201 2/9] Build bootscanner.c standalone
>
> > -# bootscanner is compiled as part of bootparse
> > -bootparse.o: bootscanner.c
> > +# See notes in src/backend/parser/Makefile about the following two rules
> > +bootparse.h: bootparse.c
> > + touch $@
> > +
> > +bootparse.c: BISONFLAGS += -d
> > +
> > +# Force these dependencies to be known even without dependency info built:
> > +bootparse.o bootscanner.o: bootparse.h
>
> Wonder if we could / should wrap this is something common. It's somewhat
> annoying to repeat this stuff everywhere.

I haven't looked at the Meson effort recently, but if the build rule
is less annoying there, I'm inclined to leave this as a wart until
autotools are retired.

> > diff --git a/src/test/isolation/specscanner.l 
> > b/src/test/isolation/specscanner.l
> > index aa6e89268e..2dc292c21d 100644
> > --- a/src/test/isolation/specscanner.l
> > +++ b/src/test/isolation/specscanner.l
> > @@ -1,4 +1,4 @@
> > -%{
> > +%top{
> >  /*-
> >   *
> >   * specscanner.l
> > @@ -9,7 +9,14 @@
> >   *
> >   *-
> >   */
> > +#include "postgres_fe.h"
>
> Miniscule nitpick: I think we typically leave an empty line between header and
> first include.

In a small unscientific sample it seems like the opposite is true
actually, but I'll at least try to be consistent within the patch set.

> > diff --git a/contrib/cube/cubedata.h b/contrib/cube/cubedata.h
> > index dbe7d4f742..0b373048b5 100644
> > --- a/contrib/cube/cubedata.h
> > +++ b/contrib/cube/cubedata.h
> > @@ -67,3 +67,7 @@ extern void cube_scanner_finish(void);
> >
> >  /* in cubeparse.y */
> >  extern int   cube_yyparse(NDBOX **result);
> > +
> > +/* All grammar constructs return strings */
> > +#de

Re: Cleaning up historical portability baggage

2022-08-12 Thread John Naylor
On Thu, Aug 4, 2022 at 11:09 AM Tom Lane  wrote:
>
> clock_gettime is required by SUSv2 (1997), so I have to admit that
> macOS 10.4 doesn't have a lot of excuse not to have it.  In any case,
> prairiedog is just sitting there doing its thing until I find cycles
> to install a newer OS.  If you want to move ahead with this, don't
> let prairiedog block you.

Now that prairiedog is offline, it seems the oldest Bison in the
buildfarm is 2.3, as the vendor-supplied version for MacOS. Since
earlier versions are now untested, it seems we should now make that
the minimum, as in the attached. I took a quick look to see if that
would enable anything useful, but nothing stuck out.

Aside: The MSVC builds don't report the Bison version that I can see,
but otherwise it seems now the only non-Apple pre-3.0 animals are
prion (2.7) and the three Sparc animals on Debian 7 (2.5).

--
John Naylor
EDB: http://www.enterprisedb.com
From 6e1022bae4a8d45783fa0d678d51eec03383d0de Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Sun, 7 Aug 2022 18:32:57 +0700
Subject: [PATCH v1] Require at least Bison 2.3 to match the buildfarm

---
 config/programs.m4 | 6 +++---
 configure  | 6 +++---
 contrib/cube/cubeparse.y   | 5 +
 contrib/seg/segparse.y | 5 +
 doc/src/sgml/install-windows.sgml  | 2 +-
 doc/src/sgml/installation.sgml | 2 +-
 src/backend/bootstrap/bootparse.y  | 5 +
 src/backend/parser/gram.y  | 5 +
 src/backend/replication/repl_gram.y| 5 +
 src/backend/replication/syncrep_gram.y | 5 +
 src/backend/utils/adt/jsonpath_gram.y  | 5 +
 src/pl/plpgsql/src/pl_gram.y   | 5 +
 src/tools/msvc/pgbison.pl  | 2 +-
 13 files changed, 17 insertions(+), 41 deletions(-)

diff --git a/config/programs.m4 b/config/programs.m4
index e7908d8793..cf432cb62c 100644
--- a/config/programs.m4
+++ b/config/programs.m4
@@ -22,7 +22,7 @@ fi
 # PGAC_PATH_BISON
 # ---
 # Look for Bison, set the output variable BISON to its path if found.
-# Reject versions before 1.875 (they have bugs or capacity limits).
+# Reject versions before 2.3.
 # Note we do not accept other implementations of yacc.
 
 AC_DEFUN([PGAC_PATH_BISON],
@@ -31,11 +31,11 @@ AC_DEFUN([PGAC_PATH_BISON],
 if test "$BISON"; then
   pgac_bison_version=`$BISON --version 2>/dev/null | sed q`
   AC_MSG_NOTICE([using $pgac_bison_version])
-  if echo "$pgac_bison_version" | $AWK '{ if ([$]4 < 1.875) exit 0; else exit 1;}'
+  if echo "$pgac_bison_version" | $AWK '{ if ([$]4 < 2.3) exit 0; else exit 1;}'
   then
 AC_MSG_WARN([
 *** The installed version of Bison, $BISON, is too old to use with PostgreSQL.
-*** Bison version 1.875 or later is required, but this is $pgac_bison_version.])
+*** Bison version 2.3 or later is required, but this is $pgac_bison_version.])
 BISON=""
   fi
   # Bison >=3.0 issues warnings about %name-prefix="base_yy", instead
diff --git a/configure b/configure
index cf2c4b85fe..c3c7c1b39b 100755
--- a/configure
+++ b/configure
@@ -10219,14 +10219,14 @@ if test "$BISON"; then
   pgac_bison_version=`$BISON --version 2>/dev/null | sed q`
   { $as_echo "$as_me:${as_lineno-$LINENO}: using $pgac_bison_version" >&5
 $as_echo "$as_me: using $pgac_bison_version" >&6;}
-  if echo "$pgac_bison_version" | $AWK '{ if ($4 < 1.875) exit 0; else exit 1;}'
+  if echo "$pgac_bison_version" | $AWK '{ if ($4 < 2.3) exit 0; else exit 1;}'
   then
 { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING:
 *** The installed version of Bison, $BISON, is too old to use with PostgreSQL.
-*** Bison version 1.875 or later is required, but this is $pgac_bison_version." >&5
+*** Bison version 2.3 or later is required, but this is $pgac_bison_version." >&5
 $as_echo "$as_me: WARNING:
 *** The installed version of Bison, $BISON, is too old to use with PostgreSQL.
-*** Bison version 1.875 or later is required, but this is $pgac_bison_version." >&2;}
+*** Bison version 2.3 or later is required, but this is $pgac_bison_version." >&2;}
 BISON=""
   fi
   # Bison >=3.0 issues warnings about %name-prefix="base_yy", instead
diff --git a/contrib/cube/cubeparse.y b/contrib/cube/cubeparse.y
index 7577c4515c..9bfd7a0df6 100644
--- a/contrib/cube/cubeparse.y
+++ b/contrib/cube/cubeparse.y
@@ -15,10 +15,7 @@
 /*
  * Bison doesn't allocate anything that needs to live across parser calls,
  * so we can easily have it use palloc instead of malloc.  This prevents
- * memory leaks if we error out during parsing.  Note this only works with
- * bison >= 2.0.  However, in bison 1.875 the default is to use alloca()
- * if possible, so there's not really much problem anyhow, at least if
- * you're building with gc

Re: fix typos

2022-08-12 Thread John Naylor
I wrote:

> On Thu, Aug 4, 2022 at 8:41 PM Tom Lane  wrote:
> >
> > John Naylor  writes:
>
> > > RepOriginId is a typedef for uint16, so this can't print the wrong answer,
> > > but it is inconsistent with other uses. So it seems we don't need to
> > > backpatch this one?
> >
> > Um ... if it's int16, then it can't be an OID, so I'd say this message has
> > far worse problems than %d vs %u.  It should not use that terminology.
>
> The catalog has the following. Since it's not a real oid, maybe this column 
> should be rethought?

This is really a straw-man proposal, since I'm not volunteering to do
the work, or suggest anybody else should do the same. That being the
case, it seems we should just go ahead with Justin's patch for
consistency. Possibly we could also change the messages to say "ID"?

> CATALOG(pg_replication_origin,6000,ReplicationOriginRelationId) 
> BKI_SHARED_RELATION
> {
> /*
> * Locally known id that get included into WAL.
> *
> * This should never leave the system.
> *
> * Needs to fit into an uint16, so we don't waste too much space in WAL
> * records. For this reason we don't use a normal Oid column here, since
> * we need to handle allocation of new values manually.
> */
> Oid roident;

-- 
John Naylor
EDB: http://www.enterprisedb.com




build remaining Flex files standalone

2022-08-12 Thread John Naylor
Starting a new thread to control clutter. [was: Re: [RFC] building
postgres with meson]

motivation: 
https://www.postgresql.org/message-id/20220810171935.7k5zgnjwqzalzmtm%40awork3.anarazel.de

On Thu, Aug 11, 2022 at 11:07 AM Andres Freund  wrote:
> I think we should consider compiling it separately from guc.c. guc.c already
> compiles quite slowly (iirc beat only by ecpg and main grammar), and it's a
> relatively commonly changed source file.

Done in the attached, and will do the rest in time. It seemed most
straightforward to put ProcessConfigFileInternal() in guc.c since
that's where most of its callers are, and it relies on some vars and
types declared there. There are a couple new extern declarations in
guc.h that are only for guc.c and guc-file.c:

+/* functions shared between guc.c and guc-file.l */
+extern int guc_name_compare(const char *namea, const char *nameb);
+extern ConfigVariable *ProcessConfigFileInternal(GucContext context,
+ bool applySettings, int elevel);
+extern void record_config_file_error(const char *errmsg,
+ const char *config_file,
+ int lineno,
+ ConfigVariable **head_p,
+ ConfigVariable **tail_p);

These might be better placed in a new guc_internal.h. Thoughts?

> It might even be a good idea to split guc.c so it only contains the settings
> arrays + direct dependencies...

Perhaps this can be a TODO item, one which falls under "[E] marks
items that are easier to implement". I've been slacking on removing
the old/intractable cruft from the TODO list, but we should also be
sticking small nice-but-not-necessary things in there. That said, if
this idea has any bearing on the guc_internal.h idea, it might be
better dealt with now.

-- 
John Naylor
EDB: http://www.enterprisedb.com
From d723ba14acf56fd432e9e263db937fcc13fc0355 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Thu, 11 Aug 2022 19:38:37 +0700
Subject: [PATCH v1] Build guc-file.c standalone

The proposed Meson build system will need a way to ignore certain
generated files in order to coexist with the autoconf build system,
and #include'd C files generated by Flex make this more difficult.
Build guc-file.c separately from guc.c, as was done in 72b1e3a21.

TODO: other Flex-generated files

Discussion: https://www.postgresql.org/message-id/20220810171935.7k5zgnjwqzalzmtm%40awork3.anarazel.de
---
 src/backend/utils/misc/Makefile   |   5 +-
 src/backend/utils/misc/guc-file.l | 367 +-
 src/backend/utils/misc/guc.c  | 362 -
 src/include/utils/guc.h   |   9 +
 4 files changed, 370 insertions(+), 373 deletions(-)

diff --git a/src/backend/utils/misc/Makefile b/src/backend/utils/misc/Makefile
index 1d5327cf64..cf7ce9bc83 100644
--- a/src/backend/utils/misc/Makefile
+++ b/src/backend/utils/misc/Makefile
@@ -16,6 +16,7 @@ override CPPFLAGS := -I. -I$(srcdir) $(CPPFLAGS)
 
 OBJS = \
 	guc.o \
+	guc-file.o \
 	help_config.o \
 	pg_config.o \
 	pg_controldata.o \
@@ -37,10 +38,6 @@ endif
 
 include $(top_srcdir)/src/backend/common.mk
 
-# guc-file is compiled as part of guc
-guc.o: guc-file.c
-
 # Note: guc-file.c is not deleted by 'make clean',
 # since we want to ship it in distribution tarballs.
 clean:
-	@rm -f lex.yy.c
diff --git a/src/backend/utils/misc/guc-file.l b/src/backend/utils/misc/guc-file.l
index ce5633844c..08adb454de 100644
--- a/src/backend/utils/misc/guc-file.l
+++ b/src/backend/utils/misc/guc-file.l
@@ -7,7 +7,7 @@
  * src/backend/utils/misc/guc-file.l
  */
 
-%{
+%top{
 
 #include "postgres.h"
 
@@ -17,9 +17,12 @@
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
+#include 
 #include "utils/guc.h"
+#include "utils/memutils.h"
+}
 
-
+%{
 /*
  * flex emits a yy_fatal_error() function that it calls in response to
  * critical errors like malloc failure, file I/O errors, and detection of
@@ -48,12 +51,6 @@ static sigjmp_buf *GUC_flex_fatal_jmp;
 
 static void FreeConfigVariable(ConfigVariable *item);
 
-static void record_config_file_error(const char *errmsg,
-	 const char *config_file,
-	 int lineno,
-	 ConfigVariable **head_p,
-	 ConfigVariable **tail_p);
-
 static int	GUC_flex_fatal(const char *msg);
 
 /* LCOV_EXCL_START */
@@ -159,358 +156,6 @@ ProcessConfigFile(GucContext context)
 	MemoryContextDelete(config_cxt);
 }
 
-/*
- * This function handles both actual config file (re)loads and execution of
- * show_all_file_settings() (i.e., the pg_file_settings view).  In the latter
- * case we don't apply any of the settings, but we make all the usual validity
- * checks, and we return the ConfigVariable list so that it can be printed out
- * by show_all_file_settings().
- */
-static ConfigVariable *
-ProcessConfigFileInternal(GucContext context, bool applySettings, int elevel)
-{
-	bool		error = false;
-	bool		applying = false;
-	const char *ConfFileWithError;
-	ConfigVariable *item,
-			   *head,
-

Re: build remaining Flex files standalone

2022-08-13 Thread John Naylor
Here are the rest. Most of it was pretty straightforward, with the
main exception of jsonpath_scan.c, which is not quite finished. That
one passes tests but still has one compiler warning. I'm unsure how
much of what is there already is really necessary or was cargo-culted
from elsewhere without explanation. For starters, I'm not sure why the
grammar has a forward declaration of "union YYSTYPE". It's noteworthy
that it used to compile standalone, but with a bit more stuff, and
that was reverted in 550b9d26f80fa30. I can hack on it some more later
but I ran out of steam today.

Other questions thus far:

- "BISONFLAGS += -d" is now in every make file with a .y file -- can
we just force that everywhere?

- Include order seems to matter for the grammar's .h file. I didn't
test if that was the case every time, and after a few miscompiles just
always made it the last inclusion, but I'm wondering if we should keep
those inclusions outside %top{} and put it at the start of the next
%{} ?

- contrib/cubeparse.y now has a global variable -- not terrific, but I
wanted to get something working first.

- I'm actually okay with guc-file.c now, but I'll still welcome
comments on that.

-- 
John Naylor
EDB: http://www.enterprisedb.com
From da2b610b8608e6759f5ed9cc32b487ea8e750ce4 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Fri, 12 Aug 2022 17:09:45 +0700
Subject: [PATCH v201 3/9] Build repl_scanner.c standalone

---
 src/backend/replication/Makefile   | 11 +--
 src/backend/replication/repl_gram.y|  2 --
 src/backend/replication/repl_scanner.l | 27 +++---
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index 2bffac58c0..bc8170418f 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -16,6 +16,7 @@ override CPPFLAGS := -I. -I$(srcdir) $(CPPFLAGS)
 
 OBJS = \
 	repl_gram.o \
+	repl_scanner.o \
 	slot.o \
 	slotfuncs.o \
 	syncrep.o \
@@ -28,8 +29,14 @@ SUBDIRS = logical
 
 include $(top_srcdir)/src/backend/common.mk
 
-# repl_scanner is compiled as part of repl_gram
-repl_gram.o: repl_scanner.c
+# See notes in src/backend/parser/Makefile about the following two rules
+repl_gram.h: repl_gram.c
+	touch $@
+
+repl_gram.c: BISONFLAGS += -d
+
+# Force these dependencies to be known even without dependency info built:
+repl_gram.o repl_scanner.o: repl_gram.h
 
 # syncrep_scanner is compiled as part of syncrep_gram
 syncrep_gram.o: syncrep_scanner.c
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 4cf087e602..b343f108d3 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -416,5 +416,3 @@ ident_or_keyword:
 		;
 
 %%
-
-#include "repl_scanner.c"
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 586f0d3a5c..95a933c9a8 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -1,4 +1,4 @@
-%{
+%top{
 /*-
  *
  * repl_scanner.l
@@ -17,7 +17,12 @@
 
 #include "utils/builtins.h"
 #include "parser/scansup.h"
+#include "replication/walsender_private.h"
+
+#include "repl_gram.h"
+}
 
+%{
 /* Avoid exit() on fatal scanner errors (a bit ugly -- see yy_fatal_error) */
 #undef fprintf
 #define fprintf(file, fmt, msg)  fprintf_to_ereport(fmt, msg)
@@ -130,7 +135,7 @@ WAIT{ return K_WAIT; }
 {space}+		{ /* do nothing */ }
 
 {digit}+		{
-	yylval.uintval = strtoul(yytext, NULL, 10);
+	replication_yylval.uintval = strtoul(yytext, NULL, 10);
 	return UCONST;
 }
 
@@ -138,8 +143,8 @@ WAIT{ return K_WAIT; }
 	uint32	hi,
 			lo;
 	if (sscanf(yytext, "%X/%X", &hi, &lo) != 2)
-		yyerror("invalid streaming start location");
-	yylval.recptr = ((uint64) hi) << 32 | lo;
+		replication_yyerror("invalid streaming start location");
+	replication_yylval.recptr = ((uint64) hi) << 32 | lo;
 	return RECPTR;
 }
 
@@ -151,7 +156,7 @@ WAIT{ return K_WAIT; }
 {quotestop}	{
 	yyless(1);
 	BEGIN(INITIAL);
-	yylval.str = litbufdup();
+	replication_yylval.str = litbufdup();
 	return SCONST;
 }
 
@@ -173,9 +178,9 @@ WAIT{ return K_WAIT; }
 
 	yyless(1);
 	BEGIN(INITIAL);
-	yylval.str = litbufdup();
-	len = strlen(yylval.str);
-	truncate_identifier(yylval.str, len, true);
+	replication_yylval.str = litbufdup();
+	len = strlen(replication_yylval.str);
+	truncate_identifier(replication_yylval.str, len, true);
 	return IDENT;
 }
 
@@ -186,7 +191,7 @@ WAIT{ return K_WAIT; }
 {identifier}	{
 	int			len = strlen(yytext);
 
-	yylval.str = downcase_truncate_identifier(yytext, len, true);
+	replication_yylval.str = downcase_

Re: support for SSE2 intrinsics

2022-08-04 Thread John Naylor
On Thu, Aug 4, 2022 at 12:38 PM Masahiko Sawada 
wrote:
>
> I also think it's a good start. There is a typo in the commit message:
>
> s/hepler/helper/
>
> The rest looks good to me.

Fixed, and pushed, thanks to you both! I'll polish a small patch I have
that actually uses this.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: fix typos

2022-08-04 Thread John Naylor
On Wed, Aug 3, 2022 at 11:41 PM Robert Haas  wrote:
>
> I think that it's talking about this (documented) syntax:
>
> ALTER ROUTINE name [ ( [ [ argmode ] [ argname ] argtype [, ...] ] ) ]
> [ NO ] DEPENDS ON EXTENSION extension_name
>
> So the change from "depends" to "depend" here is incorrect. Maybe we
> can say something like:
>
> the DEPENDS ON EXTENSION
> extension_name action
>
> (I haven't tested whether this markup works.)

Makes sense, I'll go make it happen.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: optimize lookups in snapshot [sub]xip arrays

2022-08-04 Thread John Naylor
On Thu, Aug 4, 2022 at 3:25 AM Nathan Bossart 
wrote:
> Here is a new patch set without the annotation.

Were you considering adding the new function to simd.h now that that's
committed? It's a bit up in the air what should go in there, but this new
function is low-level and generic enough to be a candidate...

I wonder if the "pg_" prefix is appropriate here, as that is most often
used for things that hide specific details *and* where the base name would
clash, like OS calls or C library functions. I'm not quite sure where the
line is drawn, but I mean that "linearsearch" is a generic algorithm and
not a specific API we are implementing, if that makes sense.

The suffix "_uint32" might be more succinct as "32" (cf pg_bswap32(),
pg_popcount32, etc). We'll likely want to search bytes sometime, so
something to keep in mind as far as naming ("_int" vs "_byte"?).

I'm not a fan of "its" as a variable name, and I'm curious what it's
intended to convey.

All the __m128i vars could technically be declared const, although I think
it doesn't matter -- it's just a hint to the reader.

Out of curiosity do we know how much we get by loading four registers
rather than two?
--
John Naylor
EDB: http://www.enterprisedb.com


Re: fix typos

2022-08-04 Thread John Naylor
On Tue, Aug 2, 2022 at 1:11 AM Justin Pryzby  wrote:
>
> On Mon, Aug 01, 2022 at 08:04:54PM +0200, Erik Rijkers wrote:
> > Recent typos...
>
> LGTM, thanks.
>
> Here are some others I've been sitting on, mostly in .c files.

I pushed Robert's suggestion, then pushed the rest of Erik's changes and
two of Justin's. For Justin's 0004:

--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -364,7 +364,7 @@ restart:
  if (nowait)
  ereport(ERROR,
  (errcode(ERRCODE_OBJECT_IN_USE),
- errmsg("could not drop replication origin with OID %d, in use by PID %d",
+ errmsg("could not drop replication origin with OID %u, in use by PID %d",

RepOriginId is a typedef for uint16, so this can't print the wrong answer,
but it is inconsistent with other uses. So it seems we don't need to
backpatch this one?

For patch 0002, the whitespace issue in the top comment in inval.c, I'm
inclined to just change all the out-of-place tabs in a single commit, so we
can add that to the list of whitespace commits.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: build remaining Flex files standalone

2022-08-16 Thread John Naylor
On Wed, Aug 17, 2022 at 8:14 AM Andres Freund  wrote:

> > > > +/* functions shared between guc.c and guc-file.l */
> > > > [...]
> > > I think I prefer your suggestion of a guc_internal.h upthread.
> >
> > Started in 0002, but left open the headerscheck failure.
> >
> > Also, if such a thing is meant to be #include'd only by two generated
> > files, maybe it should just live in the directory where they live, and
> > not in the src/include dir?
>
> It's not something we've done for the backend afaics, but I don't see a reason
> not to start at some point.

BTW, I forgot to mention I did this for the json path parser, which
makes the makefile code simpler than what was there before
550b9d26f80fa30. AFAICS, we could also do the same for gramparse.h,
which is internal to parser.c. If I'm not mistaken, the only reason we
symlink gram.h to src/include/* is so that gramparse.h can include it.
So keeping gramparse.h in the backend could allow removing some gram.h
makefile incantations.

> > > Why does this need to be defined in a semi-public header? If we do this in
> > > multiple files we'll end up with the danger of macro redefinition 
> > > warnings.
> >
> > I tried to put all the Flex/Bison stuff in another *_internal header,
> > but that breaks the build. Putting just this one symbol in a header is
> > silly, but done that way for now. Maybe two copies of the symbol?
>
> The problem is that if it's in a header you can't include another header with
> such a define. That's fine if it's a .h that's just intended to be included by
> a limited set of files, but for something like a header for a datatype that
> might need to be included to e.g. define a PL transform or a new operator or
> ...  This would be solved by the %code requires thing, right?

I believe it would.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: [PATCH] Clarify the comments about varlena header encoding

2022-08-17 Thread John Naylor
On Thu, Aug 18, 2022 at 1:06 AM Aleksander Alekseev
 wrote:
>
> Hi hackers,
>
> I noticed that the comments regarding bit layouts for varlena headers
> in postgres.h are somewhat misleading. For instance, when reading:

I agree it's confusing, but I don't think this patch is the right direction.

> ```
> 00xxxxxx 4-byte length word, aligned, uncompressed data (up to 1G)
> ```
>
> ... one can assume this is a 00xx byte followed by another 4 bytes
> (which is wrong). Also one can read this as "aligned, uncompressed
> data" (which again is wrong).

- * 00xxxxxx 4-byte length word, aligned, uncompressed data (up to 1G)
+ * 00xxxxxx xxxxxxxx xxxxxxxx xxxxxxxx, uncompressed data (up to 1G)

Maybe "00xx 4-byte length word (aligned)," is more clear about
what is aligned. Also, adding all those xxx's obscures the point that
we only need to examine one byte to figure out what to do next.
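To that point, the dispatch is on the low-order bits of the first byte.
Quoting the little-endian macros from postgres.h roughly from memory --
see the real header for the byte-order-dependent definitions:

#define VARATT_IS_4B(PTR) \
	((((varattrib_1b *) (PTR))->va_header & 0x01) == 0x00)
#define VARATT_IS_4B_U(PTR) \
	((((varattrib_1b *) (PTR))->va_header & 0x03) == 0x00)
#define VARATT_IS_4B_C(PTR) \
	((((varattrib_1b *) (PTR))->va_header & 0x03) == 0x02)
#define VARATT_IS_1B(PTR) \
	((((varattrib_1b *) (PTR))->va_header & 0x01) == 0x01)
#define VARATT_IS_1B_E(PTR) \
	((((varattrib_1b *) (PTR))->va_header) == 0x01)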

> ```
> 10000000 1-byte length word, unaligned, TOAST pointer
> ```
>
> This is misleading too. The comments above this line say that `struct
> varatt_external` is a TOAST pointer. sizeof(varatt_external) = 16,
> plus 1 byte equals 17, right? However the documentation [1] claims the
> result should be 18:

The patch has:

+ * In the third case the va_tag field (see varattrib_1b_e) is used to discern
+ * the specific type and length of the pointer datum. On disk the "xxx" bits
+ * currently always store sizeof(varatt_external) + 2.

...so not sure where 17 came from.

- * 10000000 1-byte length word, unaligned, TOAST pointer
+ * 10000000 xxxxxxxx, TOAST pointer (struct varatt_external)

This implies that the header is two bytes, which is not accurate. That
next byte is a type tag:

/* TOAST pointers are a subset of varattrib_1b with an identifying tag byte */
typedef struct
{
uint8 va_header; /* Always 0x80 or 0x01 */
uint8 va_tag; /* Type of datum */
char va_data[FLEXIBLE_ARRAY_MEMBER]; /* Type-specific data */
} varattrib_1b_e;

...and does not always represent the on-disk length:

/*
 * Type tag for the various sorts of "TOAST pointer" datums.  The peculiar
 * value for VARTAG_ONDISK comes from a requirement for on-disk compatibility
 * with a previous notion that the tag field was the pointer datum's length.
 */
typedef enum vartag_external
{
VARTAG_INDIRECT = 1,
VARTAG_EXPANDED_RO = 2,
VARTAG_EXPANDED_RW = 3,
VARTAG_ONDISK = 18
} vartag_external;
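Instead, the pointer datum's size is recovered from the tag --
postgres.h maps it back with something like the following (again
quoting from memory):

#define VARTAG_SIZE(tag) \
	((tag) == VARTAG_INDIRECT ? sizeof(varatt_indirect) : \
	 VARTAG_IS_EXPANDED(tag) ? sizeof(varatt_expanded) : \
	 (tag) == VARTAG_ONDISK ? sizeof(varatt_external) : \
	 TrapMacro(true, "unrecognized TOAST vartag"))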

And I don't think the new comments referring to "third case", "first
two cases", etc make it easier to follow.
-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: fix typos

2022-08-17 Thread John Naylor
On Tue, Aug 16, 2022 at 8:48 AM John Naylor
 wrote:
>
> On Fri, Aug 12, 2022 at 8:55 PM Tom Lane  wrote:
> >
> > John Naylor  writes:
> > > This is really a straw-man proposal, since I'm not volunteering to do
> > > the work, or suggest anybody else should do the same. That being the
> > > case, it seems we should just go ahead with Justin's patch for
> > > consistency. Possibly we could also change the messages to say "ID"?
> >
> > I'd be content if we change the user-facing messages (and documentation
> > if any) to say "ID" not "OID".
>
> The documentation has both, so it makes sense to standardize on "ID".
> The messages all had oid/OID, which I changed in the attached. I think
> I got all the places.
>
> I'm thinking it's not wrong enough to confuse people, but consistency
> is good, so backpatch to v15 and no further. Does anyone want to make
> a case otherwise?

This is done.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: build remaining Flex files standalone

2022-08-18 Thread John Naylor
On Wed, Aug 17, 2022 at 8:14 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-08-16 17:41:43 +0700, John Naylor wrote:
> > For v3, I addressed some comments and added .h files to the
> > headerscheck exceptions.
>
> Thanks!
>
>
> > /*
> >  * NB: include bootparse.h only AFTER including bootstrap.h, because 
> > bootstrap.h
> >  * includes node definitions needed for YYSTYPE.
> >  */
> >
> > Future cleanup: I see this in headerscheck:
> >
> > # We can't make these Bison output files compilable standalone
> > # without using "%code require", which old Bison versions lack.
> > # parser/gram.h will be included by parser/gramparse.h anyway.
> >
> > That directive has been supported in Bison since 2.4.2.
>
> 2.4.2 is from 2010. So I think we could just start relying on it?
>
>
> > > > +/* functions shared between guc.c and guc-file.l */
> > > > [...]
> > > I think I prefer your suggestion of a guc_internal.h upthread.
> >
> > Started in 0002, but left open the headerscheck failure.
> >
> > Also, if such a thing is meant to be #include'd only by two generated
> > files, maybe it should just live in the directory where they live, and
> > not in the src/include dir?
>
> It's not something we've done for the backend afaics, but I don't see a reason
> not to start at some point.
>
>
> > > > From 7d4ecfcb3e91f3b45e94b9e64c7c40f1bbd22aa8 Mon Sep 17 00:00:00 2001
> > > > From: John Naylor 
> > > > Date: Fri, 12 Aug 2022 15:45:24 +0700
> > > > Subject: [PATCH v201 2/9] Build bootscanner.c standalone
> > >
> > > > -# bootscanner is compiled as part of bootparse
> > > > -bootparse.o: bootscanner.c
> > > > +# See notes in src/backend/parser/Makefile about the following two 
> > > > rules
> > > > +bootparse.h: bootparse.c
> > > > + touch $@
> > > > +
> > > > +bootparse.c: BISONFLAGS += -d
> > > > +
> > > > +# Force these dependencies to be known even without dependency info 
> > > > built:
> > > > +bootparse.o bootscanner.o: bootparse.h
> > >
> > > Wonder if we could / should wrap this is something common. It's somewhat
> > > annoying to repeat this stuff everywhere.
> >
> > I haven't looked at the Meson effort recently, but if the build rule
> > is less annoying there, I'm inclined to leave this as a wart until
> > autotools are retired.
>
> The only complicating thing in the rules there is the dependencies from one .c
> file to another .c file.
>
>
> > > > diff --git a/contrib/cube/cubedata.h b/contrib/cube/cubedata.h
> > > > index dbe7d4f742..0b373048b5 100644
> > > > --- a/contrib/cube/cubedata.h
> > > > +++ b/contrib/cube/cubedata.h
> > > > @@ -67,3 +67,7 @@ extern void cube_scanner_finish(void);
> > > >
> > > >  /* in cubeparse.y */
> > > >  extern int   cube_yyparse(NDBOX **result);
> > > > +
> > > > +/* All grammar constructs return strings */
> > > > +#define YYSTYPE char *
> > >
> > > Why does this need to be defined in a semi-public header? If we do this in
> > > multiple files we'll end up with the danger of macro redefinition 
> > > warnings.

For v4, I #defined YYSTYPE

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: build remaining Flex files standalone

2022-08-18 Thread John Naylor
> > > > > index dbe7d4f742..0b373048b5 100644
> > > > > --- a/contrib/cube/cubedata.h
> > > > > +++ b/contrib/cube/cubedata.h
> > > > > @@ -67,3 +67,7 @@ extern void cube_scanner_finish(void);
> > > > >
> > > > >  /* in cubeparse.y */
> > > > >  extern int   cube_yyparse(NDBOX **result);
> > > > > +
> > > > > +/* All grammar constructs return strings */
> > > > > +#define YYSTYPE char *
> > > >
> > > > Why does this need to be defined in a semi-public header? If we do this 
> > > > in
> > > > multiple files we'll end up with the danger of macro redefinition 
> > > > warnings.
>
> For v4, I #defined YYSTYPE

Sorry for the misfire. Continuing on, I #defined YYSTYPE in cubescan.l
before #including cubeparse.h.

I also added scanbuflen to the %parse-param to prevent resorting to a
global variable. The rest of the patches are unchanged.
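Roughly, the relevant pieces now look like this (sketching from the
description above, not quoting the patch verbatim):

In cubescan.l:

/* All grammar constructs return strings */
#define YYSTYPE char *
#include "cubeparse.h"

and in cubeparse.y:

%parse-param {NDBOX **result}
%parse-param {Size scanbuflen}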

-- 
John Naylor
EDB: http://www.enterprisedb.com
From 2b95401d925bed67b2cb1eb9e8cdb1f1dd3bcc8e Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 16 Aug 2022 12:01:41 +0700
Subject: [PATCH v4 02/11] Move private declarations shared between guc.c and
 guc-file.l to new header

FIXME: fails headerscheck
---
 src/backend/utils/misc/guc-file.l |  1 +
 src/backend/utils/misc/guc.c  |  1 +
 src/include/utils/guc.h   | 10 --
 src/include/utils/guc_internal.h  | 24 
 4 files changed, 26 insertions(+), 10 deletions(-)
 create mode 100644 src/include/utils/guc_internal.h

diff --git a/src/backend/utils/misc/guc-file.l b/src/backend/utils/misc/guc-file.l
index b4fa09749b..843838b1df 100644
--- a/src/backend/utils/misc/guc-file.l
+++ b/src/backend/utils/misc/guc-file.l
@@ -18,6 +18,7 @@
 #include "miscadmin.h"
 #include "storage/fd.h"
 #include "utils/guc.h"
+#include "utils/guc_internal.h"
 
 
 /*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 66ab3912a0..293834fc13 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -100,6 +100,7 @@
 #include "utils/builtins.h"
 #include "utils/bytea.h"
 #include "utils/float.h"
+#include "utils/guc_internal.h"
 #include "utils/guc_tables.h"
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index aae071cd82..45ae1b537f 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -442,16 +442,6 @@ extern void GUC_check_errcode(int sqlerrcode);
 	pre_format_elog_string(errno, TEXTDOMAIN), \
 	GUC_check_errhint_string = format_elog_string
 
-/* functions shared between guc.c and guc-file.l */
-extern int	guc_name_compare(const char *namea, const char *nameb);
-extern ConfigVariable *ProcessConfigFileInternal(GucContext context,
- bool applySettings, int elevel);
-extern void record_config_file_error(const char *errmsg,
-	 const char *config_file,
-	 int lineno,
-	 ConfigVariable **head_p,
-	 ConfigVariable **tail_p);
-
 /*
  * The following functions are not in guc.c, but are declared here to avoid
  * having to include guc.h in some widely used headers that it really doesn't
diff --git a/src/include/utils/guc_internal.h b/src/include/utils/guc_internal.h
new file mode 100644
index 00..5d5db6bdce
--- /dev/null
+++ b/src/include/utils/guc_internal.h
@@ -0,0 +1,24 @@
+/*
+ * guc_internal.h
+ *
+ * Declarations shared between backend/utils/misc/guc.c and
+ * backend/utils/misc/guc-file.l
+ *
+ * Copyright (c) 2000-2022, PostgreSQL Global Development Group
+ *
+ * src/include/utils/guc_internal.h
+ *
+ */
+#ifndef GUC_INTERNALS_H
+#define GUC_INTERNALS_H
+
+extern int	guc_name_compare(const char *namea, const char *nameb);
+extern ConfigVariable *ProcessConfigFileInternal(GucContext context,
+ bool applySettings, int elevel);
+extern void record_config_file_error(const char *errmsg,
+	 const char *config_file,
+	 int lineno,
+	 ConfigVariable **head_p,
+	 ConfigVariable **tail_p);
+
+#endif			/* GUC_INTERNALS_H */
-- 
2.36.1

From c066efea2193be8e7b21bb44c067383f34f37ec8 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 16 Aug 2022 10:42:19 +0700
Subject: [PATCH v4 01/11] Preparatory refactoring for compiling guc-file.c
 standalone

Mostly this involves moving ProcessConfigFileInternal() to guc.c
and fixing the shared API to match.
---
 src/backend/utils/misc/guc-file.l | 360 +-
 src/backend/utils/misc/guc.c  | 360 +-
 src/include/utils/guc.h   |  

Re: build remaining Flex files standalone

2022-08-18 Thread John Naylor
I wrote
> [v4]

This piece is a leftover from the last version, and forgot to remove
it, will fix:

diff --git a/contrib/cube/cubeparse.y b/contrib/cube/cubeparse.y
index 7577c4515c..e3b750b695 100644
--- a/contrib/cube/cubeparse.y
+++ b/contrib/cube/cubeparse.y
@@ -7,6 +7,7 @@
 #include "postgres.h"

 #include "cubedata.h"
+#include "cube_internal.h"
 #include "utils/float.h"

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Typo in pg_db_role_setting.h

2022-08-01 Thread John Naylor
On Mon, Aug 1, 2022 at 4:18 PM Japin Li  wrote:
>
>
> Hi, hackers
>
> I think there is a typo in pg_db_role_setting.h, should we fix it?
>
> diff --git a/src/include/catalog/pg_db_role_setting.h
b/src/include/catalog/pg_db_role_setting.h
> index 45d478e9e7..f92e867df4 100644
>  /*
> - * prototypes for functions in pg_db_role_setting.h
> + * prototypes for functions in pg_db_role_setting.c
>   */

You are correct, but I wonder if it'd be better to just drop the comment
entirely. I checked a couple other random headers with function
declarations and they didn't have such a comment, and it's kind of obvious
what they're for.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: support for SSE2 intrinsics

2022-08-02 Thread John Naylor
On Tue, Aug 2, 2022 at 11:53 PM Nathan Bossart 
wrote:
> I did a bit of cross-checking, and AFAICT this is a reasonable starting
> point.  emmintrin.h appears to be sufficient for one of my patches that
> makes use of SSE2 instructions.  That being said, I imagine it'll be
> especially important to keep an eye on the buildfarm when this change is
> committed.

Thanks for checking! Here's a concrete patch for testing.

--
John Naylor
EDB: http://www.enterprisedb.com
From 0e01cc8f5f1c2c27631953091a1d657c5e5486be Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Wed, 3 Aug 2022 11:07:40 +0700
Subject: [PATCH v1] Support SSE2 intrinsics where available

SSE2 vector instructions are part of the spec for the 64-bit x86
architecture. Until now we have relied on the compiler to autovectorize
in some limited situations, but some useful coding idioms can only be
expressed explicitly via compiler intrinsics. To this end, add a header
that defines USE_SSE2 when available. While x86-only for now, we can
add other architectures in the future. This will also be the intended
place for low-level helper functions that use vector operations.

Reviewed by Nathan Bossart

Discussion: https://www.postgresql.org/message-id/CAFBsxsE2G_H_5Wbw%2BNOPm70-BK4xxKf86-mRzY%3DL2sLoQqM%2B-Q%40mail.gmail.com
---
 src/include/port/simd.h | 30 ++
 1 file changed, 30 insertions(+)
 create mode 100644 src/include/port/simd.h

diff --git a/src/include/port/simd.h b/src/include/port/simd.h
new file mode 100644
index 00..a571e79f57
--- /dev/null
+++ b/src/include/port/simd.h
@@ -0,0 +1,30 @@
+/*-
+ *
+ * simd.h
+ *	  Support for platform-specific vector operations.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/port/simd.h
+ *
+ *-
+ */
+#ifndef SIMD_H
+#define SIMD_H
+
+/*
+ * SSE2 instructions are part of the spec for the 64-bit x86 ISA. We assume
+ * that compilers targeting this architecture understand SSE2 intrinsics.
+ *
+ * We use emmintrin.h rather than the comprehensive header immintrin.h in
+ * order to exclude extensions beyond SSE2. This is because MSVC, at least,
+ * will allow the use of intrinsics that haven't been enabled at compile
+ * time.
+ */
+#if (defined(__x86_64__) || defined(_M_AMD64))
+#include 
+#define USE_SSE2
+#endif
+
+#endif			/* SIMD_H */
-- 
2.36.1
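
As a usage sketch (not part of the patch), a caller would gate its
vector path on the new symbol and keep a scalar fallback. The function
below is hypothetical:

#include "port/simd.h"

static bool
contains_zero_byte(const char *s, int len)
{
	int			i = 0;

#ifdef USE_SSE2
	const __m128i zero = _mm_setzero_si128();

	/* check 16 bytes at a time */
	for (; i + 16 <= len; i += 16)
	{
		const __m128i chunk = _mm_loadu_si128((const __m128i *) (s + i));

		if (_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero)) != 0)
			return true;
	}
#endif

	/* scalar path; also handles the tail on SSE2 builds */
	for (; i < len; i++)
	{
		if (s[i] == '\0')
			return true;
	}
	return false;
}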



Re: optimize lookups in snapshot [sub]xip arrays

2022-08-02 Thread John Naylor
On Wed, Aug 3, 2022 at 6:43 AM Nathan Bossart 
wrote:
> Just under half of the callers in 0002 require the offset, but I don't
know
> if any of those are worth optimizing in the first place.  I'll change it
> for now.  It's easy enough to add it back in the future if required.

Yeah, some of those callers will rarely have more than several elements to
search in the first place, or aren't performance-sensitive.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Fix unmatched file identifications

2022-08-08 Thread John Naylor
On Tue, Aug 9, 2022 at 8:58 AM Masahiko Sawada  wrote:
> I found that there are two .c and .h files whose identification in the
> header comment doesn't match its actual path.

> The attached small patch fixes them.

Pushed, thanks!

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-09 Thread John Naylor
On Tue, Aug 9, 2022 at 10:51 AM Nathan Bossart  wrote:
> Fixed in v10.

I decided I wasn't quite comfortable changing snapshot handling
without further guarantees.  To this end, 0002 in the attached v11 is
an addendum that adds assert checking (also pgindent and some
comment-smithing). As I suspected, make check-world passes even with
purposefully screwed-up coding. 0003 uses pg_lfind32 in syscache.c and
I verified that sticking in the wrong answer will lead to a crash in
assert-enabled builds in short order. I'd kind of like to throw this
(or something else suitable) at the build farm first for that reason.
It's simpler than the qsort/qunique/binary search that was there
before, so that's nice, but I've not tried to test performance.
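
For reference, a minimal usage sketch of the function under test (the
array contents are made up):

	uint32		xip[] = {100, 101, 205, 209, 210, 211};

	/* linear search, done 16 elements at a time where SSE2 is available */
	if (pg_lfind32(205, xip, lengthof(xip)))
		elog(DEBUG1, "xid found");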

-- 
John Naylor
EDB: http://www.enterprisedb.com
From 1c89b9b7d3c71bb1a703caaf7c01c93bc9e2515f Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 9 Aug 2022 11:51:52 +0700
Subject: [PATCH v11 2/4] Add assert checking, run pgindent, comment changes

---
 src/include/port/pg_lfind.h | 78 ++---
 1 file changed, 56 insertions(+), 22 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index 24de544f63..fb125977b2 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -18,51 +18,85 @@
 /*
  * pg_lfind32
  *
- * Returns true if there is an element in 'base' that equals 'key'.  Otherwise,
- * returns false.
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
  */
 static inline bool
 pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 {
 	uint32		i = 0;
 
-	/* If possible, use SSE2 intrinsics to speed up the search. */
+	/* Use SIMD intrinsics where available. */
 #ifdef USE_SSE2
-	const __m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */
-	uint32		iterations = nelem & ~0xF;/* round down to multiple of 16 */
 
-	for (; i < iterations; i += 16)
+	/*
+	 * A 16-byte register only has four 4-byte lanes. For better
+	 * instruction-level parallelism, each loop iteration operates on a block
+	 * of four registers. Testing has shown this is ~40% faster than using a
+	 * block of two registers.
+	 */
+	const		__m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */
+	uint32		iterations = nelem & ~0xF;	/* round down to multiple of 16 */
+
+#if defined(USE_ASSERT_CHECKING)
+	bool		assert_result = false;
+
+	/* pre-compute the result for assert checking */
+	for (i = 0; i < nelem; i++)
 	{
-		/* load the next 16 values into __m128i variables */
-		const __m128i vals1 = _mm_loadu_si128((__m128i *) [i]);
-		const __m128i vals2 = _mm_loadu_si128((__m128i *) [i + 4]);
-		const __m128i vals3 = _mm_loadu_si128((__m128i *) [i + 8]);
-		const __m128i vals4 = _mm_loadu_si128((__m128i *) [i + 12]);
+		if (key == base[i])
+		{
+			assert_result = true;
+			break;
+		}
+	}
+#endif
 
-		/* perform the comparisons */
-		const __m128i result1 = _mm_cmpeq_epi32(keys, vals1);
-		const __m128i result2 = _mm_cmpeq_epi32(keys, vals2);
-		const __m128i result3 = _mm_cmpeq_epi32(keys, vals3);
-		const __m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+	for (i = 0; i < iterations; i += 16)
+	{
+		/* load the next block into 4 registers holding 4 values each */
+		const		__m128i vals1 = _mm_loadu_si128((__m128i *) & base[i]);
+		const		__m128i vals2 = _mm_loadu_si128((__m128i *) & base[i + 4]);
+		const		__m128i vals3 = _mm_loadu_si128((__m128i *) & base[i + 8]);
+		const		__m128i vals4 = _mm_loadu_si128((__m128i *) & base[i + 12]);
 
-		/* shrink the results into a single variable */
-		const __m128i tmp1 = _mm_or_si128(result1, result2);
-		const __m128i tmp2 = _mm_or_si128(result3, result4);
-		const __m128i result = _mm_or_si128(tmp1, tmp2);
+		/* compare each value to the key */
+		const		__m128i result1 = _mm_cmpeq_epi32(keys, vals1);
+		const		__m128i result2 = _mm_cmpeq_epi32(keys, vals2);
+		const		__m128i result3 = _mm_cmpeq_epi32(keys, vals3);
+		const		__m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+
+		/* combine the results into a single variable */
+		const		__m128i tmp1 = _mm_or_si128(result1, result2);
+		const		__m128i tmp2 = _mm_or_si128(result3, result4);
+		const		__m128i result = _mm_or_si128(tmp1, tmp2);
 
 		/* see if there was a match */
 		if (_mm_movemask_epi8(result) != 0)
+		{
+#if defined(USE_ASSERT_CHECKING)
+			Assert(assert_result == true);
+#endif
 			return true;
+		}
 	}
-#endif
+#endif			/* USE_SSE2 */
 
-	/* Process the remaining elements the slow way. */
+	/* Process the remaining elements one at a time. */
 	for (; i < nelem; i++)
 	{
 		if (key == base[i])
+		{
+#if defined(USE_SSE2) && defined(USE_ASSERT_CHECKING)
+			Assert(assert_result == true);
+#endif
 			return true;
+		}
 	}
 
+#if defined(USE_SSE2) && defined(USE_ASSERT_CHECKING)
+	Assert(assert_result == false);
+#endif
 	return false;
 }
 
-- 
2.36.1

From ff77224f9227bcff88a68e63f39754296810351c Mon 

Re: failing to build preproc.c on solaris with sun studio

2022-08-07 Thread John Naylor
On Sun, Aug 7, 2022 at 7:05 AM Tom Lane  wrote:
> Even on a modern Linux:
>
> $ size src/backend/parser/gram.o
>textdata bss dec hex filename
>  656568   0   0  656568   a04b8 src/backend/parser/gram.o
> $ size src/interfaces/ecpg/preproc/preproc.o
>textdata bss dec hex filename
>  912005 1887348  919541   e07f5 src/interfaces/ecpg/preproc/preproc.o
>
> So there's something pretty bloated there.  It doesn't seem like
> ecpg's additional productions should justify a nigh 50% code
> size increase.

Comparing gram.o with preproc.o:

$ objdump -t src/backend/parser/gram.o | grep yy | grep -v
UND | awk '{print $5, $6}' | sort -r | head -n3
0003a24a yytable
0003a24a yycheck
00013672 base_yyparse

$ objdump -t src/interfaces/ecpg/preproc/preproc.o | grep yy | grep -v
UND | awk '{print $5, $6}' | sort -r | head -n3
0004d8e2 yytable
0004d8e2 yycheck
0002841e base_yyparse

The largest lookup tables are ~25% bigger (other tables are trivial in
comparison), and the function base_yyparse is about double the size,
most of which is a giant switch statement with 2510 / 3912 cases,
respectively. That difference does seem excessive. I've long wondered
whether it would be possible / feasible to have stricter separation
between C, ECPG commands, and SQL. That sounds like a huge amount of
work, though.

Playing around with the compiler flags on preproc.c, I get these
compile times, gcc memory usage as reported by /usr/bin/time -v , and
symbol sizes (non-debug build):

-O2:
time 8.0s
Maximum resident set size (kbytes): 255884

-O1:
time 6.3s
Maximum resident set size (kbytes): 170636
0004d8e2 yytable
0004d8e2 yycheck
000292de base_yyparse

-O0:
time 2.9s
Maximum resident set size (kbytes): 153148
0004d8e2 yytable
0004d8e2 yycheck
0003585e base_yyparse

Note that -O0 bloats the binary probably because it's not using a jump
table anymore. -O1 might be worth it just to reduce build times for
slower animals, even if Noah reported this didn't help the issue
upthread. I suspect it wouldn't slow down production use much since
the output needs to be compiled anyway.

-- 
John Naylor
EDB: http://www.enterprisedb.com




remap the .text segment into huge pages at run time

2022-11-02 Thread John Naylor
I'd like feedback on whether the entire approach
in this thread is sound enough to justify working further on.

[1] https://www.cs.rochester.edu/u/sandhya/papers/ispass19.pdf
(paper: "On the Impact of Instruction Address Translation Overhead")
[2] https://twitter.com/AndresFreundTec/status/1214305610172289024
[3] https://github.com/intel/iodlr

-- 
John Naylor
EDB: http://www.enterprisedb.com
From 9cde401f87937c1982f2355c8f81449514166376 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Mon, 31 Oct 2022 13:59:30 +0700
Subject: [PATCH v1 2/2] Put all non-cold .text in huge pages

Tell the linker to align addresses on 2MB boundaries. The .init
section will be so aligned, with the .text section soon after that.
Therefore, aligning the start address of .text up to the next boundary
will land nearly 2MB past the actual start, so the first nearly 2MB of
.text will not map to huge pages.

We count on cold sections linking to the front of the .text segment:
Since the cold sections total about 600kB in size, we need ~1.4MB of
additional padding to keep non-cold pages mappable to huge pages. Since
PG has about 5.0MB of .text, we also need an additional 1MB to push
the .text end just past an aligned boundary, so when we align the end
down, only a small number of pages will remain un-remapped at their
original 4kB size.
---
 meson.build  |  3 +++
 src/backend/port/filler.c| 29 +
 src/backend/port/meson.build |  3 +++
 3 files changed, 35 insertions(+)
 create mode 100644 src/backend/port/filler.c

diff --git a/meson.build b/meson.build
index bfacbdc0af..450946370c 100644
--- a/meson.build
+++ b/meson.build
@@ -239,6 +239,9 @@ elif host_system == 'freebsd'
 elif host_system == 'linux'
   sema_kind = 'unnamed_posix'
   cppflags += '-D_GNU_SOURCE'
+  # WIP: debug builds are huge
+  # TODO: add portability check
+  ldflags += ['-Wl,-zcommon-page-size=2097152', '-Wl,-zmax-page-size=2097152']
 
 elif host_system == 'netbsd'
   # We must resolve all dynamic linking in the core server at program start.
diff --git a/src/backend/port/filler.c b/src/backend/port/filler.c
new file mode 100644
index 00..de4e33bb05
--- /dev/null
+++ b/src/backend/port/filler.c
@@ -0,0 +1,29 @@
+/*
+ * Add enough padding to .text segment to bring the end just
+ * past a 2MB alignment boundary. In practice, this means .text needs
+ * to be at least 8MB. It shouldn't be much larger than this,
+ * because then more hot pages will remain in 4kB pages.
+ *
+ * FIXME: With this filler added, if explicit huge pages are turned off
+ * in the kernel, attempting mmap() with MAP_HUGETLB causes a crash
+ * instead of reporting failure if the .text segment is larger than 8MB.
+ *
+ * See MapStaticCodeToLargePages() in large_page.c
+ *
+ * XXX: The exact amount of filler must be determined experimentally
+ * on platforms of interest, in non-assert builds.
+ *
+ */
+static void
+__attribute__((used))
+__attribute__((cold))
+fill_function(int x)
+{
+	/* TODO: More architectures */
+#ifdef __x86_64__
+__asm__(
+	".fill 3251000"
+);
+#endif
+	(void) x;
+}
\ No newline at end of file
diff --git a/src/backend/port/meson.build b/src/backend/port/meson.build
index 5ab65115e9..d876712e0c 100644
--- a/src/backend/port/meson.build
+++ b/src/backend/port/meson.build
@@ -16,6 +16,9 @@ if cdata.has('USE_WIN32_SEMAPHORES')
 endif
 
 if cdata.has('USE_SYSV_SHARED_MEMORY')
+  if host_system == 'linux'
+backend_sources += files('filler.c')
+  endif
   backend_sources += files('large_page.c')
   backend_sources += files('sysv_shmem.c')
 endif
-- 
2.37.3

From 0012baab70779f5fc06c8717392dc76e8f156270 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Mon, 31 Oct 2022 15:24:29 +0700
Subject: [PATCH v1 1/2] Partly remap the .text segment into huge pages at
 postmaster start

Based on the MIT-licensed library at https://github.com/intel/iodlr

The basic steps are:

- read ELF info to get the start/end addresses of the .text segment
- calculate addresses therein aligned at huge page boundaries
- mmap temporary region and memcpy aligned portion of .text segment
- mmap start address to new region with huge pages and MAP_FIXED
- memcpy over and revoke the PROT_WRITE bit

The Postgres .text segment is ~5.0MB in a non-assert build, so this
method can put 2-4MB into huge pages.
---
 src/backend/port/large_page.c   | 348 
 src/backend/port/meson.build|   1 +
 src/backend/postmaster/postmaster.c |   7 +
 src/include/port/large_page.h   |  18 ++
 4 files changed, 374 insertions(+)
 create mode 100644 src/backend/port/large_page.c
 create mode 100644 src/include/port/large_page.h

diff --git a/src/backend/port/large_page.c b/src/backend/port/large_page.c
new file mode 100644
index 00..66a584f785
--- /dev/null
+++ b/src/backend/port/large_page.c
@@ -0,0 +1,348 @@
+/*-
+ *
+ * large_page.c
+ *	  Map .text segment of binary to huge pages
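
To make the steps above concrete, here is a minimal sketch of the
remapping sequence (error handling and the ELF lookup of the .text
bounds are omitted, and all names are illustrative rather than the
patch's actual code):

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE	(2 * 1024 * 1024UL)
#define ALIGN_UP(x)		(((x) + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1))
#define ALIGN_DOWN(x)	((x) & ~(HUGE_PAGE_SIZE - 1))

static void
remap_text_sketch(uintptr_t text_start, uintptr_t text_end)
{
	uintptr_t	from = ALIGN_UP(text_start);
	uintptr_t	to = ALIGN_DOWN(text_end);
	size_t		len = to - from;
	void	   *tmp;

	if (from >= to)
		return;					/* nothing mappable to huge pages */

	/* stash the aligned portion of .text in a temporary mapping */
	tmp = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memcpy(tmp, (void *) from, len);

	/* replace the original range with a huge page mapping, in place */
	mmap((void *) from, len, PROT_READ | PROT_WRITE | PROT_EXEC,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_FIXED, -1, 0);

	/* copy the code back and revoke the PROT_WRITE bit */
	memcpy((void *) from, tmp, len);
	mprotect((void *) from, len, PROT_READ | PROT_EXEC);
	munmap(tmp, len);
}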

Re: Incorrect include file order in guc-file.l

2022-11-02 Thread John Naylor
On Wed, Nov 2, 2022 at 1:01 PM Julien Rouhaud  wrote:
>
> Hi,
>
> On Wed, Nov 02, 2022 at 02:29:50PM +0900, Michael Paquier wrote:
> >
> > While reviewing a different patch, I have noticed that guc-file.l
> > includes sys/stat.h in the middle of the PG internal headers.  The
> > usual practice is to have first postgres[_fe].h, followed by the
> > system headers and finally the internal headers.  That's a nit, but
> > all the other files do that.
> >
> > {be,fe}-secure-openssl.c include some exceptions though, as documented
> > there.
>
> Agreed, it's apparently an oversight in dac048f71eb.  +1 for the patch.

Yeah. +1, thanks.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: appendBinaryStringInfo stuff

2022-12-22 Thread John Naylor
On Thu, Dec 22, 2022 at 4:19 PM David Rowley  wrote:
>
> Test 1 (from earlier)
>
> master + escape_json using appendStringInfoCharMacro
> $ pgbench -n -T 60 -f bench.sql -M prepared postgres | grep latency
> latency average = 1.807 ms
> latency average = 1.800 ms
> latency average = 1.812 ms (~4.8% faster than master)

> 23.05%  postgres  [.] pg_utf_mblen

I get about 20% improvement by adding an ascii fast path in
pg_mbstrlen_with_len, which I think would work with all encodings we
support:

@@ -1064,7 +1064,12 @@ pg_mbstrlen_with_len(const char *mbstr, int limit)

 	while (limit > 0 && *mbstr)
 	{
-		int			l = pg_mblen(mbstr);
+		int			l;
+
+		if (!IS_HIGHBIT_SET(*mbstr))
+			l = 1;
+		else
+			l = pg_mblen(mbstr);
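
Spelled out, the full loop with the fast path would look like this
sketch (assuming the tail of the loop body is unchanged from the
current function):

	while (limit > 0 && *mbstr)
	{
		int			l;

		/*
		 * An ASCII byte never starts a multibyte sequence in any supported
		 * server encoding, so skip the per-encoding dispatch in pg_mblen().
		 */
		if (!IS_HIGHBIT_SET(*mbstr))
			l = 1;
		else
			l = pg_mblen(mbstr);

		limit -= l;
		mbstr += l;
		len++;
	}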

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-22 Thread John Naylor
On Wed, Dec 21, 2022 at 3:09 PM Masahiko Sawada 
wrote:
>
> On Tue, Dec 20, 2022 at 3:09 PM John Naylor
>  wrote:

> >
https://www.postgresql.org/message-id/20220704211822.kfxtzpcdmslzm2dy%40awork3.anarazel.de
> >
> > I'm guessing the hash join case can afford to be precise about memory
because it must spill to disk when exceeding workmem. We don't have that
design constraint.
>
> You mean that the memory used by the radix tree should be limited not
> by the amount of memory actually used, but by the amount of memory
> allocated? In other words, it checks by MomoryContextMemAllocated() in
> the local cases and by dsa_get_total_size() in the shared case.

I mean, if this patch set uses 10x less memory than v15 (not always, but
easy to find cases where it does), and if it's also expensive to track
memory use precisely, then we don't have an incentive to track memory
precisely. Even if we did, we don't want to assume that every future caller
of radix tree is willing to incur that cost.

> The idea of using up to half of maintenance_work_mem might be a good
> idea compared to the current flat-array solution. But since it only
> uses half, I'm concerned that there will be users who double their
> maintenace_work_mem. When it is improved, the user needs to restore
> maintenance_work_mem again.

I find it useful to step back and look at the usage patterns:

Autovacuum: Limiting the memory allocated by vacuum is important, since
there are multiple workers and they can run at any time (possibly most of
the time). This case will not use parallel index vacuum, so will use slab,
where the quick estimation of memory taken by the context is not terribly
far off, so we can afford to be more optimistic here.

Manual vacuum: The default configuration assumes we want to finish as soon
as possible (vacuum_cost_delay is zero). Parallel index vacuum can be used.
My experience leads me to believe users are willing to use a lot of memory
to make manual vacuum finish as quickly as possible, and are disappointed
to learn that even if maintenance work mem is 10GB, vacuum can only use 1GB.

So I don't believe anyone will have to double maintenance work mem after
upgrading (even with pessimistic accounting) because we'll be both
- much more efficient with memory on average
- free from the 1GB cap

That said, it's possible 50% is too pessimistic -- a 75% threshold will
bring us very close to powers of two for example:

2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> keep going
766 + 256 = 1022MB -> stop

I'm not sure if that calculation could cause going over the limit, or how
common that would be.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-22 Thread John Naylor
On Thu, Dec 22, 2022 at 10:00 PM Masahiko Sawada 
wrote:

> If the value is a power of 2, it seems to work perfectly fine. But for
> example if it's 700MB, the total memory exceeds the limit:
>
> 2*(1+2+4+8+16+32+64+128) = 510MB (72.8% of 700MB) -> keep going
> 510 + 256 = 766MB -> stop but it exceeds the limit.
>
> In a more bigger case, if it's 11000MB,
>
> 2*(1+2+...+2048) = 8190MB (74.4%)
> 8190 + 4096 = 12286MB
>
> That being said, I don't think they are common cases. So the 75%
> threshold seems to work fine in most cases.

Thinking some more, I agree this doesn't have large practical risk, but
thinking from the point of view of the community, being loose with memory
limits by up to 10% is not a good precedent.

Perhaps we can be clever and use 75% when the limit is a power of two and
50% otherwise. I'm skeptical of trying to be clever, and I just thought of
an additional concern: We're assuming behavior of the growth in size of new
DSA segments, which could possibly change. Given how allocators are
typically coded, though, it seems safe to assume that they'll at most
double in size.
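
For reference, the test itself is cheap; a sketch, assuming the limit
is tracked in bytes:

	/* x is a power of two iff it has no bits in common with x - 1 */
	bool		limit_is_pow2 = (limit & (limit - 1)) == 0;
	Size		bytes_allowed = limit_is_pow2 ? limit / 4 * 3 : limit / 2;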

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-26 Thread John Naylor
On Tue, Dec 27, 2022 at 12:14 AM Masahiko Sawada 
wrote:
>
> On Fri, Dec 23, 2022 at 8:47 PM John Naylor
>  wrote:

> These 4 patches make sense to me. We can merge them into the 0002 patch

Okay, then I'll squash them when I post my next patch.

> and I'll do similar changes for functions for leaf nodes as well.

I assume you meant something else? -- some of the differences between inner
and leaf are already abstracted away.

In any case, some things are still half-baked, so please wait until my next
patch before doing work on these files.

Also, CI found a bug on 32-bit -- I know what I missed and will fix next
week.

> > 0010 and 0011 template a common implementation for both leaf and inner
nodes for searching and inserting.
> >
> > 0012: While at it, I couldn't resist using this technique to separate
out delete from search, which makes sense and might give a small
performance boost (at least on less capable hardware). I haven't got to the
iteration functions, but they should be straightforward.

Two things came to mind since I posted this, which I'll make clear next
patch:
- A good compiler will get rid of branches when inlining, so maybe no
difference in code generation, but it still looks nicer this way.
- Delete should really use its own template, because it only _accidentally_
looks like search because we don't yet shrink nodes.

> What do you
> think about how we can expand this template method to deal with DSA
> memory? I imagined that we load say radixtree_template.h with some
> macros to use the radix tree like we do for simplehash.h. And
> radixtree_template.h further loads xxx_impl.h files for some internal
> functions.

Right, I was thinking the same. I wanted to start small and look for
opportunities to shrink the code footprint.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-01-10 Thread John Naylor
On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada 
wrote:

> It looks no problem in terms of vacuum integration, although I've not
> fully tested yet. TID store uses the radix tree as the main storage,
> and with the template radix tree, the data types for shared and
> non-shared will be different. TID store can have a union for the
> radix tree, and the structure would be as follows:

> /* Storage for Tids */
> union tree
> {
> local_radix_tree   *local;
> shared_radix_tree   *shared;
> };

We could possibly go back to using a common data type for this, but with
unused fields in each setting, as before. We would have to be more careful
of things like the 32-bit crash from a few weeks ago.

> In the functions of TID store, we need to call either local or shared
> radix tree functions depending on whether TID store is shared or not.
> We need if-branch for each key-value pair insertion, but I think it
> would not be a big performance problem in TID store use cases, since
> vacuum is an I/O intensive operation in many cases.

Also, the branch will be easily predicted. That was still true in earlier
patches, but with many more branches and fatter code paths.

> Overall, I think
> there is no problem and I'll investigate it in depth.

Okay, great. If the separate-functions approach turns out to be ugly, we
can always go back to the branching approach for shared memory. I think
we'll want to keep this as a template overall, at least to allow different
value types and to ease adding variable-length keys if someone finds a need.

> Apart from that, I've been considering the lock support for shared
> radix tree. As we discussed before, the current usage (i.e, only
> parallel index vacuum) doesn't require locking support at all, so it
> would be enough to have a single lock for simplicity.

Right, that should be enough for PG16.

> If we want to
> use the shared radix tree for other use cases such as the parallel
> heap vacuum or the replacement of the hash table for shared buffers,
> we would need better lock support.

For future parallel pruning, I still think a global lock is "probably" fine
if the workers buffer in local arrays. Highly concurrent applications will
need additional work, of course.

> For example, if we want to support
> Optimistic Lock Coupling[1],

Interesting, from the same authors!

> we would need to change not only the node
> structure but also the logic. Which probably leads to widen the gap
> between the code for non-shared and shared radix tree. In this case,
> once we have a better radix tree optimized for shared case, perhaps we
> can replace the templated shared radix tree with it. I'd like to hear
> your opinion on this line.

I'm not in a position to speculate on how best to do scalable concurrency,
much less how it should coexist with the local implementation. It's
interesting that their "ROWEX" scheme gives up maintaining order in the
linear nodes.

> > One review point I'll mention: Somehow I didn't notice there is no use
for the "chunk" field in the rt_node type -- it's only set to zero and
copied when growing. What is the purpose? Removing it would allow the
smallest node to take up only 32 bytes with a fanout of 3, by eliminating
padding.
>
> Oh, I didn't notice that. The chunk field was originally used when
> redirecting the child pointer in the parent node from old to new
> (grown) node. When redirecting the pointer, since the corresponding
> chunk surely exists on the parent we can skip existence checks.
> Currently we use RT_NODE_UPDATE_INNER() for that (see
> RT_REPLACE_NODE()) but having a dedicated function to update the
> existing chunk and child pointer might improve the performance. Or
> reducing the node size by getting rid of the chunk field might be
> better.

I see. IIUC from a brief re-reading of the code, saving that chunk would
only save us from re-loading "parent->shift" from L1 cache and shifting the
key. The cycles spent doing that seem small compared to the rest of the
work involved in growing a node. Expressions like "if (idx < 0) return
false;" refer to an asserts-only variable, so in production builds, I
would hope that branch gets elided (I haven't checked).

I'm quite keen on making the smallest node padding-free (since we don't
yet have path compression or lazy path expansion), and this seems the way
to get there.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PATCH] Simple code cleanup in tuplesort.c.

2023-01-10 Thread John Naylor
On Mon, Jan 9, 2023 at 7:29 PM Xing Guo  wrote:
>
> Thank you John. This is my first patch; I'll keep in mind to add
> a version number next time I send the patch.

Welcome to the community! You may also consider reviewing a patch from the
current commitfest, since we can always use additional help there.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PATCH] Improve ability to display optimizer analysis using OPTIMIZER_DEBUG

2023-01-10 Thread John Naylor
On Tue, Jan 3, 2023 at 1:59 PM Ankit Kumar Pandey 
wrote:
>
>
> On 03/01/23 08:38, David Rowley wrote:
> >
> > Do you actually have a need for this or are you just trying to tick
> > off some TODO items?
> >
> I would say the latter, but the reason I picked it up was more on the
> side of learning the optimizer better.

Note that the TODO list has accumulated some cruft over the years. Some
time ago I started an effort to remove outdated/undesirable entries, and I
should get back to that, but for the present, please take the warning at
the top to heart:

"WARNING for Developers: Unfortunately this list does not contain all the
information necessary for someone to start coding a feature. Some of these
items might have become unnecessary since they were added --- others might
be desirable but the implementation might be unclear. When selecting items
listed below, be prepared to first discuss the value of the feature. Do not
assume that you can select one, code it and then expect it to be committed.
"

--
John Naylor
EDB: http://www.enterprisedb.com


Re: cutting down the TODO list thread

2023-01-10 Thread John Naylor
So, I had intended to spend some time on this at least three times a year.
I've clearly failed at that, but now is as good a time as any to pick it
back up again.

Over in [1], Tom opined:

> John Naylor  writes:
>
> > "WARNING for Developers: Unfortunately this list does not contain all
the
> > information necessary for someone to start coding a feature. Some of
these
> > items might have become unnecessary since they were added --- others
might
> > be desirable but the implementation might be unclear. When selecting
items
> > listed below, be prepared to first discuss the value of the feature. Do
not
> > assume that you can select one, code it and then expect it to be
committed.
> > "
>
> I think we could make that even stronger: there's basically nothing on
> the TODO list that isn't problematic in some way.  Otherwise it would
> have been done already.  The entries involve large amounts of work,
> or things that are subtler than they might appear, or cases where the
> desirable semantics aren't clear, or tradeoffs that there's not
> consensus about, or combinations of those.
>
> IME it's typically a lot more productive to approach things via
> "scratch your own itch".  If a problem is biting you directly, then
> at least you have some clear idea of what it is that needs to be fixed.
> You might have to work up to an understanding of how to fix it, but
> you have a clear goal.

I've come up with some revised language, including s/15/16/ and removing
the category of "[E]" (easier to implement), since it wouldn't be here if
it were actually easy:

--
WARNING for Developers: This list contains some known PostgreSQL bugs, some
feature requests, and some things we are not even sure we want. This is not
meant to be a resource for beginning developers to get ideas for things to
work on. 

All of these items are hard, and some are perhaps impossible. Some of these
items might have become unnecessary since they were added. Others might be
desirable but:

- a large amount of work is required
- the problems are subtler than they might appear
- the desirable semantics aren't clear
- there are tradeoffs that there's not consensus about
- some combinations of the above

If you really need a feature that is listed below, it will be worth reading
the linked email thread if there is one, since it will often show the
difficulties, or perhaps contain previous failed attempts to get a patch
committed. If after that you still want to work on it, be prepared to first
discuss the value of the feature. Do not assume that you can start coding
and expect it to be committed. Always discuss design on the Hackers list
before starting to code.

Over time, it may become clear that a TODO item has become outdated or
otherwise determined to be either too controversial or not worth the
development effort. Such items should be retired to the Not Worth Doing
page.

[D] marks changes that are done, and will appear in the PostgreSQL 16
release.
--

We could also revise the developer FAQ:
- remove phrase "Outstanding features are detailed in Todo."
- add suggestion to check the Todo or Not_worth_doing pages to see if
the desired feature is undesirable or problematic
- rephrase "Working in isolation is not advisable because others might be
working on the same TODO item, or you might have misunderstood the TODO
item." so it doesn't mention 'TODO' at all.

[1] https://www.postgresql.org/message-id/415636.1673411259%40sss.pgh.pa.us

--
John Naylor
EDB: http://www.enterprisedb.com


Re: SQL/JSON revisited

2023-01-10 Thread John Naylor
On Wed, Jan 11, 2023 at 2:02 PM Elena Indrupskaya <
e.indrupsk...@postgrespro.ru> wrote:
>
> Sorry for upsetting your bot. :(

What I do in these cases is save the incremental patch as a .txt file --
that way people can read it, but the cf bot doesn't try to launch a CI run.
And if I forget that detail, well it's not a big deal, it happens sometimes.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PATCH] Simple code cleanup in tuplesort.c.

2023-01-04 Thread John Naylor
On Fri, Sep 16, 2022 at 1:43 PM Richard Guo  wrote:
>
>
> On Wed, Jul 27, 2022 at 5:10 PM Xing Guo  wrote:
>>
>> The bounded heap sorting status flag is set twice in sort_bounded_heap()
and tuplesort_performsort(). This patch helps remove one of them.
>
>
> Revisiting this patch I think maybe it's better to remove the setting of
> Tuplesort status from tuplesort_performsort() for the TSS_BOUNDED case.
> Thus we keep the heap manipulation routines, make_bounded_heap and
> sort_bounded_heap, consistent in that they update their status
> accordingly inside the function.

The label TSS_BUILDRUNS has a similar style and also the following comment,
so I will push this patch with a similar comment added unless someone wants
to make a case for doing otherwise.

* Note that mergeruns sets the correct state->status.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-01-11 Thread John Naylor
I wrote:

> I see. IIUC from a brief re-reading of the code, saving that chunk would
only save us from re-loading "parent->shift" from L1 cache and shifting the
key. The cycles spent doing that seem small compared to the rest of the
work involved in growing a node. Expressions like "if (idx < 0) return
false;" refer to an asserts-only variable, so in production builds, I
would hope that branch gets elided (I haven't checked).

On further reflection, this is completely false and I'm not sure what I was
thinking. However, for the update-inner case maybe we can assert that we
found a valid slot.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PATCH] Improve ability to display optimizer analysis using OPTIMIZER_DEBUG

2023-01-11 Thread John Naylor
On Wed, Jan 11, 2023 at 1:38 PM Ankit Kumar Pandey 
wrote:
>
>
> > On 11/01/23 09:57, Tom Lane wrote:
> > IME it's typically a lot more productive to approach things via
> > "scratch your own itch".  If a problem is biting you directly, then
> > at least you have some clear idea of what it is that needs to be fixed.
> > You might have to work up to an understanding of how to fix it, but
> > you have a clear goal.
>
>
> Question is, how newcomers should start contribution if they are not
> coming with a problem in their hand?

I would say find something that gets you excited. Worked for me, at least.

> Todo list is possibly first thing anyone, who is willing to contribute
> is going to read and for a new

Yeah, that's a problem we need to address.

> That being said, I think this is part of learning process and okay to
> come up with ideas and fail.

Of course it is! A key skill in engineering is to fail as quickly as
possible, preferably before doing any actual work.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PATCH] Simple code cleanup in tuplesort.c.

2023-01-09 Thread John Naylor
On Thu, Jan 5, 2023 at 8:18 AM John Naylor 
wrote:
>
> The label TSS_BUILDRUNS has a similar style and also the following
comment, so I will push this patch with a similar comment added unless
someone wants to make a case for doing otherwise.
>
> * Note that mergeruns sets the correct state->status.

This has been pushed, thanks. Note that both patches in this thread have
the same name. Adding a version number to the name is a good way to
distinguish them.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Small miscellaneus fixes (Part II)

2023-01-11 Thread John Naylor
On Fri, Dec 23, 2022 at 8:08 AM Justin Pryzby  wrote:

> Makes sense now (in your first message, you said that the problem was
> with "sign", and the patch didn't address the actual problem in
> IS_PLUS()).
>
> One can look and find that the unreachable code was introduced at
> 7a3e7b64a.
>
> With your proposed change, the unreachable line is hit by regression
> tests, which is an improvment.  As is the change to pg_dump.c.

But that now reachable line just unsets a flag that we previously found
unset, right?
And if that line was unreachable, then surely the previous flag-clearing
operation is too?

5669  994426 : if (IS_MINUS(Np->Num)) // <- also always
false
5670   0 : Np->Num->flag &= ~NUM_F_MINUS;
5671 : }
5672 524 : else if (Np->sign != '+' && IS_PLUS(Np->Num))
5673   0 : Np->Num->flag &= ~NUM_F_PLUS;

https://coverage.postgresql.org/src/backend/utils/adt/formatting.c.gcov.html

I'm inclined to turn the dead unsets into asserts.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Small miscellaneus fixes (Part II)

2023-01-11 Thread John Naylor
On Thu, Jan 12, 2023 at 12:34 PM Justin Pryzby  wrote:
>
> On Thu, Jan 12, 2023 at 12:15:24PM +0700, John Naylor wrote:
> > On Fri, Dec 23, 2022 at 8:08 AM Justin Pryzby 
wrote:
> >
> > > Makes sense now (in your first message, you said that the problem was
> > > with "sign", and the patch didn't address the actual problem in
> > > IS_PLUS()).
> > >
> > > One can look and find that the unreachable code was introduced at
> > > 7a3e7b64a.
> > >
> > > With your proposed change, the unreachable line is hit by regression
> > > tests, which is an improvment.  As is the change to pg_dump.c.
> >
> > But that now reachable line just unsets a flag that we previously found
> > unset, right?
>
> Good point.
>
> > And if that line was unreachable, then surely the previous flag-clearing
> > operation is too?
> >
> > 5669  994426 : if (IS_MINUS(Np->Num)) // <- also
always
> > false
> > 5670   0 : Np->Num->flag &= ~NUM_F_MINUS;
> > 5671 : }
> > 5672 524 : else if (Np->sign != '+' &&
IS_PLUS(Np->Num))
> > 5673   0 : Np->Num->flag &= ~NUM_F_PLUS;
> >
> >
https://coverage.postgresql.org/src/backend/utils/adt/formatting.c.gcov.html
> >
> > I'm inclined to turn the dead unsets into asserts.
>
> To be clear - did you mean like this ?
>
> diff --git a/src/backend/utils/adt/formatting.c
b/src/backend/utils/adt/formatting.c
> index a4b524ea3ac..848956879f5 100644
> --- a/src/backend/utils/adt/formatting.c
> +++ b/src/backend/utils/adt/formatting.c
> @@ -5662,15 +5662,13 @@ NUM_processor(FormatNode *node, NUMDesc *Num,
char *inout,
> }
> else
> {
> +   Assert(!IS_MINUS(Np->Num));
> +   Assert(!IS_PLUS(Np->Num));
> if (Np->sign != '-')
> {
> if (IS_BRACKET(Np->Num) &&
IS_FILLMODE(Np->Num))
> Np->Num->flag &= ~NUM_F_BRACKET;
> -   if (IS_MINUS(Np->Num))
> -   Np->Num->flag &= ~NUM_F_MINUS;
> }
> -   else if (Np->sign != '+' && IS_PLUS(Np->Num))
> -   Np->Num->flag &= ~NUM_F_PLUS;
>
> if (Np->sign == '+' && IS_FILLMODE(Np->Num) &&
IS_LSIGN(Np->Num) == false)
> Np->sign_wrote = true;  /* needn't sign */

I was originally thinking of something more localized:

--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -5666,11 +5666,10 @@ NUM_processor(FormatNode *node, NUMDesc *Num, char
*inout,
 {
 if (IS_BRACKET(Np->Num) && IS_FILLMODE(Np->Num))
 Np->Num->flag &= ~NUM_F_BRACKET;
-if (IS_MINUS(Np->Num))
-Np->Num->flag &= ~NUM_F_MINUS;
+Assert(!IS_MINUS(Np->Num));
 }
-else if (Np->sign != '+' && IS_PLUS(Np->Num))
-Np->Num->flag &= ~NUM_F_PLUS;
+else if (Np->sign != '+')
+Assert(!IS_PLUS(Np->Num));

...but arguably the earlier check is close enough that it's silly to assert
in the "else" branch, and I'd be okay with just nuking those lines. Another
thing that caught my attention is the assumption that unsetting a bit is so
expensive that we have to first check if it's set, so we may as well remove
"IS_BRACKET(Np->Num)" as well.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-01-12 Thread John Naylor
On Thu, Jan 12, 2023 at 12:44 PM Masahiko Sawada 
wrote:
>
> On Wed, Jan 11, 2023 at 12:13 PM John Naylor
>  wrote:
> >
> > On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada 
wrote:

> I agree to keep this as a template.

Okay, I'll squash the previous patch and work on cleaning up the internals.
I'll keep the external APIs the same so that your work on vacuum
integration can be easily rebased on top of that, and we can work
independently.

> From the vacuum integration
> perspective, it would be better if we can use a common data type for
> shared and local. It makes sense to have different data types if the
> radix trees have different values types.

I agree it would be better, all else being equal. I have some further
thoughts below.

> > > It looks no problem in terms of vacuum integration, although I've not
> > > fully tested yet. TID store uses the radix tree as the main storage,
> > > and with the template radix tree, the data types for shared and
> > > non-shared will be different. TID store can have a union for the
> > > radix tree, and the structure would be as follows:
> >
> > > /* Storage for Tids */
> > > union tree
> > > {
> > > local_radix_tree   *local;
> > > shared_radix_tree   *shared;
> > > };
> >
> > We could possibly go back to using a common data type for this, but
with unused fields in each setting, as before. We would have to be more
careful of things like the 32-bit crash from a few weeks ago.
>
> One idea to have a common data type without unused fields is to use
> radix_tree a base class. We cast it to radix_tree_shared or
> radix_tree_local depending on the flag is_shared in radix_tree. For
> instance we have like (based on non-template version),

> struct radix_tree
> {
> boolis_shared;
> MemoryContext context;
> };

That could work in principle. My first impression is that just a memory
context is not much of a base class. Also, casts can creep into a large
number of
places.

Another thought came to mind: I'm guessing the TID store is unusual --
meaning most uses of radix tree will only need one kind of memory
(local/shared). I could be wrong about that, and it _is_ a guess about the
future. If true, then it makes more sense that only code that needs both
memory kinds should be responsible for keeping them separate.

The template might be easier for future use cases if shared memory were
all-or-nothing, meaning either

- completely different functions and types depending on RT_SHMEM
- use branches (like v8)

The union sounds like a good thing to try, but do whatever seems right.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: static assert cleanup

2022-12-09 Thread John Naylor
On Fri, Dec 9, 2022 at 2:47 PM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:
>
> 0003-Move-some-static-assertions-to-better-places.patch
>
> This moves some that I thought were suboptimally placed but it could be
> debated or refined.

+ * We really want ItemPointerData to be exactly 6 bytes.  This is rather a
+ * random place to check, but there is no better place.

Since the assert is no longer in a random function body, it seems we can
remove the second sentence.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: slab allocator performance issues

2022-12-11 Thread John Naylor
On Sat, Dec 10, 2022 at 11:02 AM David Rowley  wrote:
> [v4]

Thanks for working on this!

I ran an in-situ benchmark using the v13 radix tree patchset ([1] WIP but
should be useful enough for testing allocation speed), only applying the
first five, which are local-memory only. The benchmark is not meant to
represent a realistic workload, and primarily stresses traversal and
allocation of the smallest node type. Minimum of five, with turbo-boost
off, on recent Intel laptop hardware:

v13-0001 to 0005:

# select * from bench_load_random_int(500 * 1000);
 mem_allocated | load_ms
---+-
 151123432 | 222

47.06%  postgres  postgres [.] rt_set
22.89%  postgres  postgres [.] SlabAlloc
 9.65%  postgres  postgres [.] rt_node_insert_inner.isra.0
 5.94%  postgres  [unknown][k] 0xb5e011b7
 3.62%  postgres  postgres [.] MemoryContextAlloc
 2.70%  postgres  libc.so.6[.] __memmove_avx_unaligned_erms
 2.60%  postgres  postgres [.] SlabFree

+ v4 slab:

# select * from bench_load_random_int(500 * 1000);
 mem_allocated | load_ms
---+-
 152463112 | 213

  52.42%  postgres  postgres  [.] rt_set
  12.80%  postgres  postgres  [.] SlabAlloc
   9.38%  postgres  postgres  [.] rt_node_insert_inner.isra.0
   7.87%  postgres  [unknown] [k] 0xb5e011b7
   4.98%  postgres  postgres  [.] SlabFree

While allocation is markedly improved, freeing looks worse here. The
proportion is surprising because only about 2% of nodes are freed during
the load, but doing that takes up 10-40% of the time compared to allocating.

num_keys = 500000, height = 7
n4 = 2501016, n15 = 56932, n32 = 270, n125 = 0, n256 = 257

Sidenote: I don't recall ever seeing vsyscall (I think that's what the
0xb5e011b7 address is referring to) in a profile, so not sure what
is happening there.

[1]
https://www.postgresql.org/message-id/CAFBsxsHNE621mGuPhd7kxaGc22vMkoSu7R4JW9Zan1jjorGy3g%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-12 Thread John Naylor
On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada 
wrote:
>
> On Fri, Dec 9, 2022 at 5:53 PM John Naylor 
wrote:
> >

> > I don't think that'd be very controversial, but I'm also not sure why
we'd need 4MB -- can you explain in more detail what exactly we'd need so
that the feature would work? (The minimum doesn't have to work *well* IIUC,
just do some useful work and not fail).
>
> The minimum requirement is 2MB. In PoC patch, TIDStore checks how big
> the radix tree is using dsa_get_total_size(). If the size returned by
> dsa_get_total_size() (+ some memory used by TIDStore meta information)
> exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
> and heap vacuum. However, when allocating DSA memory for
> radix_tree_control at creation, we allocate 1MB
> (DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for
> radix_tree_control from it. dsa_get_total_size() returns 1MB even if
> there is no TID collected.

2MB makes sense.

If the metadata is small, it seems counterproductive to count it towards
the total. We want the decision to be driven by blocks allocated. I have an
idea on that below.

> > Remember when we discussed how we might approach parallel pruning? I
envisioned a local array of a few dozen kilobytes to reduce contention on
the tidstore. We could use such an array even for a single worker (always
doing the same thing is simpler anyway). When the array fills up enough so
that the next heap page *could* overflow it: Stop, insert into the store,
and check the store's memory usage before continuing.
>
> Right, I think it's no problem in slab cases. In DSA cases, the new
> segment size follows a geometric series that approximately doubles the
> total storage each time we create a new segment. This behavior comes
> from the fact that the underlying DSM system isn't designed for large
> numbers of segments.

And taking a look, the size of a new segment can get quite large. It seems
we could test if the total DSA area allocated is greater than half of
maintenance_work_mem. If that parameter is a power of two (common) and
>=8MB, then the area will contain just under a power of two the last time
it passes the test. The next segment will bring it to about 3/4 full, like
this:

maintenance work mem = 256MB, so stop if we go over 128MB:

2*(1+2+4+8+16+32) = 126MB -> keep going
126MB + 64 = 190MB -> stop

That would be a simple way to be conservative with the memory limit. The
unfortunate aspect is that the last segment would be mostly wasted, but
it's paradise compared to the pessimistically-sized single array we have
now (even with Peter G.'s VM snapshot informing the allocation size, I
imagine).
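
A sketch of that conservative test, assuming dsa_get_total_size()
reports bytes and recalling that maintenance_work_mem is in kilobytes
(the area name and the stop action are hypothetical):

	Size		limit = (Size) maintenance_work_mem * 1024 / 2;

	if (dsa_get_total_size(dead_items_area) > limit)
	{
		/* stop collecting TIDs; do a round of index and heap vacuuming */
		perform_index_and_heap_vacuuming();
	}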

And as for minimum possible maintenance work mem, I think this would work
with 2MB, if the community is okay with technically going over the limit by
a few bytes of overhead if a buildfarm animal set to that value. I imagine
it would never go over the limit for realistic (and even most unrealistic)
values. Even with a VM snapshot page in memory and small local arrays of
TIDs, I think with this scheme we'll be well under the limit.

After this feature is complete, I think we should consider a follow-on
patch to get rid of autovacuum_work_mem, since it would no longer be needed.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: slab allocator performance issues

2022-12-05 Thread John Naylor
On Mon, Dec 5, 2022 at 3:02 PM David Rowley  wrote:
>
> On Fri, 11 Nov 2022 at 22:20, John Naylor 
wrote:

> > #define SLAB_FREELIST_COUNT ((1<<3) + 1)
> > index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);
>
> Doesn't this create a sort of round-robin use of the free list?  What
> we want is a sort of "histogram" bucket set of free lists so we can
> group together blocks that have a close-enough free number of chunks.
> Unless I'm mistaken, I think what you have doesn't do that.

The intent must have slipped my mind along the way.

> I wondered if simply:
>
> index = -(-freecount >> slab->freelist_shift);
>
> would be faster than Andres' version.  I tried it out and on my AMD
> machine, it's about the same speed. Same on a Raspberry Pi 4.
>
> Going by [2], the instructions are very different with each method, so
> other machines with different latencies on those instructions might
> show something different. I attached what I used to test if anyone
> else wants a go.

I get about 0.1% difference on my machine.  Both ways boil down to (on gcc)
3 instructions with low latency. The later ones need the prior results to
execute, which I think is what the XXX comment "isn't great" was referring
to. The new coding is more mysterious (does it do the right thing on all
platforms?), so I guess the original is still the way to go unless we get a
better idea.
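
To make the comparison concrete, here are the two computations again;
both lines are from upthread, only the comments are mine:

/* earlier idea: freecount mod (SLAB_FREELIST_COUNT - 1), plus keeping
 * list 0 for full blocks -- a round-robin assignment, so blocks with
 * similar free counts do not share a list */
index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);

/* David's version: ceil(freecount / 2^freelist_shift), a histogram
 * bucket; it relies on arithmetic right shift of a negative value,
 * which is implementation-defined in C though true on all platforms
 * we support */
index = -(-freecount >> slab->freelist_shift);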

--
John Naylor
EDB: http://www.enterprisedb.com


move some bitmapset.c macros to bitmapset.h

2022-12-05 Thread John Naylor
Over in [1], Masahiko and I found that using some bitmapset logic yields a
useful speedup in one part of the proposed radix tree patch. In addition to
what's in bitmapset.h now, we'd need WORDNUM, BITNUM, RIGHTMOST_ONE and
bmw_rightmost_one_pos() from bitmapset.c. The file tidbitmap.c has its own
copies of the first two, and we could adapt that strategy again. But
instead of three files defining these, it seems like it's time to consider
moving them somewhere more central.

Attached is a simple form of this concept, including moving
HAS_MULTIPLE_ONES just for consistency. One possible objection is the
possibility of namespace clash. Thoughts?

I also considered putting the macros and typedefs in pg_bitutils.h. One
organizational advantage is that pg_bitutils.h already offers convenience
function symbols where the parameter depends on SIZEOF_SIZE_T, so putting
the BITS_PER_BITMAPWORD versions there makes sense. But that way is not a
clear win, so I didn't go that far.
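
For illustration, the typical consumer of these macros and wrappers
walks the set bits of a word like this sketch (process_member is
hypothetical):

	bitmapword	w = a->words[wordnum];

	while (w != 0)
	{
		int			bit = bmw_rightmost_one_pos(w);

		process_member(wordnum * BITS_PER_BITMAPWORD + bit);
		w &= w - 1;				/* clear the rightmost one-bit */
	}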

[1]
https://www.postgresql.org/message-id/CAFBsxsHgP5LP9q%3DRq_3Q2S6Oyut67z%2BVpemux%2BKseSPXhJF7sg%40mail.gmail.com

-- 
John Naylor
EDB: http://www.enterprisedb.com
 src/backend/nodes/bitmapset.c | 24 
 src/backend/nodes/tidbitmap.c |  5 -
 src/include/nodes/bitmapset.h | 24 
 3 files changed, 24 insertions(+), 29 deletions(-)

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b7b274aeff..3204b49738 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -26,33 +26,9 @@
 #include "port/pg_bitutils.h"
 
 
-#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)
-
 #define BITMAPSET_SIZE(nwords)	\
 	(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
 
-/*--
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word.  It assumes two's complement arithmetic.  Consider any
- * nonzero value, and focus attention on the rightmost one.  The value is
- * then something like
- *xx1
- * where x's are unspecified bits.  The two's complement negative is formed
- * by inverting all the bits and adding one.  Inversion gives
- *yy0
- * where each y is the inverse of the corresponding x.  Incrementing gives
- *yy1
- * and then ANDing with the original value gives
- *001
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *--
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x)	((bitmapword) RIGHTMOST_ONE(x) != (x))
-
 /* Select appropriate bit-twiddling functions for bitmap word size */
 #if BITS_PER_BITMAPWORD == 32
 #define bmw_leftmost_one_pos(w)		pg_leftmost_one_pos32(w)
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index a7a6b26668..8a1fd1d205 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -72,11 +72,6 @@
  */
 #define PAGES_PER_CHUNK  (BLCKSZ / 32)
 
-/* We use BITS_PER_BITMAPWORD and typedef bitmapword from nodes/bitmapset.h */
-
-#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)
-
 /* number of active words for an exact page: */
 #define WORDS_PER_PAGE	((MAX_TUPLES_PER_PAGE - 1) / BITS_PER_BITMAPWORD + 1)
 /* number of active words for a lossy chunk: */
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 2792281658..8462282b9e 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -48,6 +48,30 @@ typedef int32 signedbitmapword; /* must be the matching signed type */
 
 #endif
 
+#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x)  ((x) % BITS_PER_BITMAPWORD)
+
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word.  It assumes two's complement arithmetic.  Consider any
+ * nonzero value, and focus attention on the rightmost one.  The value is
+ * then something like
+ *		xxxxxx10000
+ * where x's are unspecified bits.  The two's complement negative is formed
+ * by inverting all the bits and adding one.  Inversion gives
+ *		yyyyyy01111
+ * where each y is the inverse of the corresponding x.  Incrementing gives
+ *		yyyyyy10000
+ * and then ANDing with the original value gives
+ *		00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
+
+#define HAS_MULTIPLE_ONES(x)	((bitmapword) RIGHTMOST_ONE(x) != (x))
+
 typedef struct Bitmapset
 {
 	pg_node_attr(custom_copy_equal, special_read_write)


Re: New strategies for freezing, advancing relfrozenxid early

2022-12-15 Thread John Naylor
On Wed, Dec 14, 2022 at 6:07 AM Peter Geoghegan  wrote:
>
> At the suggestion of Jeff, I wrote a Wiki page that shows motivating
> examples for the patch series:
>
>
https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples
>
> These are all cases where VACUUM currently doesn't do the right thing
> around freezing, in a way that is greatly ameliorated by the patch.
> Perhaps this will help other hackers to understand the motivation
> behind some of these mechanisms. There are plenty of details that only
> make sense in the context of a certain kind of table, with certain
> performance characteristics that the design is sensitive to, and seeks
> to take advantage of in one way or another.

Thanks for this. This is the kind of concrete, data-based evidence that I
find much more convincing, or at least easy to reason about. I'd actually
recommend in the future to open discussion with this kind of analysis --
even before coding, it's possible to indicate what a design is *intended*
to achieve. And reviewers can likewise bring up cases of their own in a
concrete fashion.

On Wed, Dec 14, 2022 at 12:16 AM Peter Geoghegan  wrote:

> At the very least, a given VACUUM operation has to choose its freezing
> strategy based on how it expects the table will look when it's done
> vacuuming the table, and how that will impact the next VACUUM against
> the same table. Without that, then vacuuming an append-only table will
> fall into a pattern of setting pages all-visible in one vacuum, and
> then freezing those same pages all-frozen in the very next vacuum
> because there are too many. Which makes little sense; we're far better
> off freezing the pages at the earliest opportunity instead.

That makes sense, but I wonder if we can actually be more specific: One
motivating example mentioned is the append-only table. If we detected that
case, which I assume we can because autovacuum_vacuum_insert_* GUCs exist,
we could use that information as one way to drive eager freezing
independently of size. At least in theory -- it's very possible size will
be a necessary part of the decision, but it's less clear that it's as
useful as a user-tunable knob.

If we then ignored the append-only case when evaluating a freezing policy,
maybe other ideas will fall out. I don't have a well-thought out idea about
policy or knobs, but it's worth thinking about.

Aside from that, I've only given the patches a brief reading. Having seen
the VM snapshot in practice (under "Scanned pages, visibility map snapshot"
in the wiki page), it's neat to see fewer pages being scanned. Prefetching
not only seems superior to SKIP_PAGES_THRESHOLD, but anticipates
asynchronous IO. Keeping only one VM snapshot page in memory makes perfect
sense.

I do have a cosmetic, but broad-reaching, nitpick about terms regarding
"skipping strategy". That's phrased as a kind of negative -- what we're
*not* doing. Many times I had to pause and compute in my head what we're
*doing*, i.e. the "scanning strategy". For example, I wonder if the VM
strategies would be easier to read as:

VMSNAP_SKIP_ALL_VISIBLE -> VMSNAP_SCAN_LAZY
VMSNAP_SKIP_ALL_FROZEN -> VMSNAP_SCAN_EAGER
VMSNAP_SKIP_NONE -> VMSNAP_SCAN_ALL

Notice here they're listed in order of increasing eagerness.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: slab allocator performance issues

2022-12-20 Thread John Naylor
On Tue, Dec 20, 2022 at 10:36 AM David Rowley  wrote:
>
> I'm planning on pushing the attached v3 patch shortly. I've spent
> several days reading over this and testing it in detail along with
> adding additional features to the SlabCheck code to find more
> inconsistencies.

FWIW, I reran my test from last week and got similar results.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-19 Thread John Naylor
On Mon, Dec 19, 2022 at 2:14 PM Masahiko Sawada 
wrote:
>
> On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada 
wrote:

> > Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
> > seems that they look at only memory that are actually dsa_allocate'd.
> > To be exact, we estimate the number of hash buckets based on work_mem
> > (and hash_mem_multiplier) and use it as the upper limit. So I've
> > confirmed that the result of dsa_get_total_size() could exceed the
> > limit. I'm not sure it's a known and legitimate usage. If we can
> > follow such usage, we can probably track how much dsa_allocate'd
> > memory is used in the radix tree.
>
> I've experimented with this idea. The newly added 0008 patch changes
> the radix tree so that it counts the memory usage for both local and
> shared cases. As shown below, there is an overhead for that:
>
> w/o 0008 patch
>  298453544 | 282

> w/0 0008 patch
>  293603184 | 297

This adds about as much overhead as the improvement I measured in the v4
slab allocator patch. That's not acceptable, and is exactly what Andres
warned about in

https://www.postgresql.org/message-id/20220704211822.kfxtzpcdmslzm2dy%40awork3.anarazel.de

I'm guessing the hash join case can afford to be precise about memory
because it must spill to disk when exceeding work_mem. We don't have that
design constraint.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: code cleanups

2022-12-19 Thread John Naylor
On Thu, Nov 24, 2022 at 12:53 AM Tom Lane  wrote:
>
> Justin Pryzby  writes:
> > Some modest cleanups I've accumulated.

> 0004: Right, somebody injected code in a poorly chosen place
> (yet another victim of the "add at the end" anti-pattern).

I've pushed 0004.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Rework of collation code, extensibility

2022-12-17 Thread John Naylor
On Sun, Dec 18, 2022 at 10:28 AM Ted Yu  wrote:

> It seems the `else` is not needed (since when the if branch is taken, we
return from the func).

By that same logic, this review comment is not needed, since compiler
vendors don't charge license fees by the number of keywords. ;-)
Joking aside, we don't really have a project style preference for this case.

> nul-terminate -> null-terminate

NUL is a common abbreviation for the zero byte (but not for zero pointers).
See the ascii manpage.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: move some bitmapset.c macros to bitmapset.h

2022-12-05 Thread John Naylor
On Mon, Dec 5, 2022 at 9:33 PM Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > On 2022-Dec-05, John Naylor wrote:
> >> -#define WORDNUM(x)  ((x) / BITS_PER_BITMAPWORD)
> >> -#define BITNUM(x)   ((x) % BITS_PER_BITMAPWORD)
>
> > In this location, nobody can complain about the naming of these macros,
> > since they're just used to implement other bitmapset.c code.  However,
> > if you move them to the .h file, ISTM you should give them more
> > meaningful names.
>
> IMV these are absolutely private to bitmapset.c.  I reject the idea
> that they should be exposed publicly, under these names or any others.

Well, they've already escaped to tidbitmap.c as a copy. How do you feel
about going that route?

> Maybe we need some more bitmapset primitive functions?  What is it
> you actually want to accomplish in the end?

 An inserter into one type of node in a tree structure must quickly find a
free position in an array. We have a bitmap of 128 bits to indicate whether
the corresponding array position is free. The proposed coding is:

 /* get the first word with at least one bit not set */
for (idx = 0; idx < WORDNUM(128); idx++)
{
  if (isset[idx] < ~((bitmapword) 0))
break;
}

/* To get the first unset bit in X, get the first set bit in ~X */
inverse = ~(isset[idx]);
slotpos = idx * BITS_PER_BITMAPWORD;
slotpos += bmw_rightmost_one_pos(inverse);

/* mark the slot used */
isset[idx] |= RIGHTMOST_ONE(inverse);

return slotpos;

...which, if it were reversed so that a set bit meant "available", would
essentially be bms_first_member(), so a more primitive version of that
might work.
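
For comparison, here is a self-contained sketch of the same loop using
compiler builtins instead of the bitmapset macros (illustrative only; like
the coding above, it assumes at least one free slot exists):

#include <stdint.h>

static int
find_first_unset(uint64_t isset[2])	/* 128 bits in two words */
{
	int			idx = 0;
	uint64_t	inverse;

	/* get the first word with at least one bit not set */
	while (isset[idx] == ~UINT64_C(0))
		idx++;

	/* the first unset bit in X is the first set bit in ~X */
	inverse = ~isset[idx];

	/* mark the slot used by isolating the rightmost one-bit of the inverse */
	isset[idx] |= inverse & (-inverse);

	return idx * 64 + __builtin_ctzll(inverse);
}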

--
John Naylor
EDB: http://www.enterprisedb.com


Re: move some bitmapset.c macros to bitmapset.h

2022-12-05 Thread John Naylor
On Tue, Dec 6, 2022 at 12:57 PM Tom Lane  wrote:

> > Well, they've already escaped to tidbitmap.c as a copy. How do you feel
> > about going that route?
>
> Not terribly pleased with that either, I must admit.

Okay, I won't pursue that further.

> If we do put RIGHTMOST_ONE functionality into pg_bitutils.h,
> I'd envision it as pg_bitutils.h exporting both 32-bit and
> 64-bit versions of that, and then bitmapset.c choosing the
> appropriate one just like it chooses bmw_rightmost_one_pos.

Here's a quick go at that. I've not attempted to use it for what I need,
but it looks like it fits the bill.

--
John Naylor
EDB: http://www.enterprisedb.com
 src/backend/nodes/bitmapset.c| 36 ++--
 src/include/nodes/bitmapset.h| 16 ++--
 src/include/port/pg_bitutils.h   | 31 +++
 src/tools/pgindent/typedefs.list |  1 -
 4 files changed, 47 insertions(+), 37 deletions(-)

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b7b274aeff..4384ff591d 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
 #define BITMAPSET_SIZE(nwords)	\
 	(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
 
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word.  It assumes two's complement arithmetic.  Consider any
- * nonzero value, and focus attention on the rightmost one.  The value is
- * then something like
- *		xxxxxx10000
- * where x's are unspecified bits.  The two's complement negative is formed
- * by inverting all the bits and adding one.  Inversion gives
- *		yyyyyy01111
- * where each y is the inverse of the corresponding x.  Incrementing gives
- *		yyyyyy10000
- * and then ANDing with the original value gives
- *		00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x)	((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w)		pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w)	pg_rightmost_one_pos32(w)
-#define bmw_popcount(w)pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w)		pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w)	pg_rightmost_one_pos64(w)
-#define bmw_popcount(w)pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x)	(bmw_rightmost_one(x) != (x))
 
 
 /*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
 		{
 			int			result;
 
-			w = RIGHTMOST_ONE(w);
+			w = bmw_rightmost_one(w);
 			a->words[wordnum] &= ~w;
 
 			result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 2792281658..fdc504596b 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
 
 #define BITS_PER_BITMAPWORD 64
 typedef uint64 bitmapword;		/* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
 
 #else
 
 #define BITS_PER_BITMAPWORD 32
 typedef uint32 bitmapword;		/* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
 
 #endif
 
@@ -75,6 +73,20 @@ typedef enum
 	BMS_MULTIPLE/* >1 member */
 } BMS_Membership;
 
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w)		pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w)		pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w)	pg_rightmost_one_pos32(w)
+#define bmw_popcount(w)pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w)		pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w)		pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w)	pg_rightmost_one_pos64(w)
+#define bmw_popcount(w)pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
 
 /*
  * function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 814e0b2dba..f95b6afd86 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
 extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
 extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
 
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word.  It assumes two's complement arithmetic.  Consider any
+ * nonzero value, and focus attention on the rightmost one.  The value is
+ * then something like
+ *		xxxxxx10000
+ * where x's are unspecified bits.  The t

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-09 Thread John Naylor
On Fri, Dec 9, 2022 at 8:20 AM Masahiko Sawada 
wrote:

> In the meanwhile, I've been working on vacuum integration. There are
> two things I'd like to discuss some time:
>
> The first is the minimum of maintenance_work_mem, 1 MB. Since the
> initial DSA segment size is 1MB (DSA_INITIAL_SEGMENT_SIZE), parallel
> vacuum with radix tree cannot work with the minimum
> maintenance_work_mem. It will need to increase it to 4MB or so. Maybe
> we can start a new thread for that.

I don't think that'd be very controversial, but I'm also not sure why we'd
need 4MB -- can you explain in more detail what exactly we'd need so that
the feature would work? (The minimum doesn't have to work *well* IIUC, just
do some useful work and not fail).

> The second is how to limit the size of the radix tree to
> maintenance_work_mem. I think that it's tricky to estimate the maximum
> number of keys in the radix tree that fit in maintenance_work_mem. The
> radix tree size varies depending on the key distribution. The next
> idea I considered was how to limit the size when inserting a key. In
> order to strictly limit the radix tree size, probably we have to
> change the rt_set so that it breaks off and returns false if the radix
> tree size is about to exceed the memory limit when we allocate a new
> node or grow a node kind/class.

That seems complex, fragile, and wrong scope.

> Ideally, I'd like to control the size
> outside of radix tree (e.g. TIDStore) since it could introduce
> overhead to rt_set() but probably we need to add such logic in radix
> tree.

Does the TIDStore have the ability to ask the DSA (or slab context) to see
how big it is? If a new segment has been allocated that brings us to the
limit, we can stop when we discover that fact. In the local case with slab
blocks, it won't be on nice neat boundaries, but we could check if we're
within the largest block size (~64kB) of overflow.

Remember when we discussed how we might approach parallel pruning? I
envisioned a local array of a few dozen kilobytes to reduce contention on
the tidstore. We could use such an array even for a single worker (always
doing the same thing is simpler anyway). When the array fills up enough so
that the next heap page *could* overflow it: Stop, insert into the store,
and check the store's memory usage before continuing.
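
A sketch of that control flow, for concreteness ("store" is the tidstore,
"limit" its memory bound; tidstore_add, tidstore_memory_usage,
prune_one_page, and more_pages_to_scan are hypothetical names for
illustration, not existing APIs):

#define LOCAL_TID_CAP	4096	/* "a few dozen kilobytes" */

	ItemPointerData dead[LOCAL_TID_CAP];
	int			ndead = 0;

	while (more_pages_to_scan())
	{
		/* flush when the next heap page *could* overflow the local array */
		if (ndead + MaxHeapTuplesPerPage > LOCAL_TID_CAP)
		{
			tidstore_add(store, dead, ndead);
			ndead = 0;
			if (tidstore_memory_usage(store) >= limit)
				break;			/* stop and switch to index vacuuming */
		}
		ndead += prune_one_page(&dead[ndead]);
	}
	if (ndead > 0)
		tidstore_add(store, dead, ndead);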

--
John Naylor
EDB: http://www.enterprisedb.com


Re: New strategies for freezing, advancing relfrozenxid early

2022-12-13 Thread John Naylor
On Tue, Dec 13, 2022 at 8:00 AM Peter Geoghegan  wrote:
>
> On Mon, Dec 12, 2022 at 3:47 PM Jeff Davis  wrote:
> > But the heuristic also seems off to me. What if you have lots of
partitions
> > in an append-only range-partitioned table? That would tend to use the
> > lazy freezing strategy (because each partition is small), but that's
> > not what you want. I understand heuristics aren't perfect, but it feels
> > like we could do something better.
>
> It is at least vastly superior to vacuum_freeze_min_age in cases like
> this. Not that that's hard -- vacuum_freeze_min_age just doesn't ever
> trigger freezing in any autovacuum given a table like pgbench_history
> (barring during aggressive mode), due to how it interacts with the
> visibility map. So we're practically guaranteed to do literally all
> freezing for an append-only table in an aggressive mode VACUUM.
>
> Worst of all, that happens on a timeline that has nothing to do with
> the physical characteristics of the table itself (like the number of
> unfrozen heap pages or something).

If the number of unfrozen heap pages is the thing we care about, perhaps
that, and not the total size of the table, should be the parameter that
drives freezing strategy?

> That said, I agree that the system-level picture of debt (the system
> level view of the number of unfrozen heap pages) is relevant, and that
> it isn't directly considered by the patch. I think that that can be
> treated as work for a future release. In fact, I think that there is a
> great deal that we could teach autovacuum.c about the system level
> view of things -- this is only one.

It seems an easier path to considering system-level of debt (as measured by
unfrozen heap pages) would be to start with considering table-level debt
measured the same way.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: slab allocator performance issues

2022-12-14 Thread John Naylor
On Tue, Dec 13, 2022 at 7:50 AM David Rowley  wrote:
>
> Thanks for testing the patch.
>
> On Mon, 12 Dec 2022 at 20:14, John Naylor 
wrote:

> > While allocation is markedly improved, freeing looks worse here. The
proportion is surprising because only about 2% of nodes are freed during
the load, but doing that takes up 10-40% of the time compared to allocating.
>
> I've tried to reproduce this with the v13 patches applied and I'm not
> really getting the same as you are. To run the function 100 times I
> used:
>
> select x, a.* from generate_series(1,100) x(x), lateral (select * from
> bench_load_random_int(500 * 1000 * (1+x-x))) a;

Simply running over a longer period of time like this makes the SlabFree
difference much closer to your results, so it doesn't seem out of line
anymore. Here SlabAlloc seems to take maybe 2/3 of the time of current
slab, with a 5% reduction in total time:

500k ints:

v13-0001-0005
average of 30: 217ms

  47.61%  postgres  postgres [.] rt_set
  20.99%  postgres  postgres [.] SlabAlloc
  10.00%  postgres  postgres [.] rt_node_insert_inner.isra.0
   6.87%  postgres  [unknown][k] 0xbce011b7
   3.53%  postgres  postgres [.] MemoryContextAlloc
   2.82%  postgres  postgres [.] SlabFree

+slab v4
average of 30: 206ms

  51.13%  postgres  postgres [.] rt_set
  14.08%  postgres  postgres [.] SlabAlloc
  11.41%  postgres  postgres [.] rt_node_insert_inner.isra.0
   7.44%  postgres  [unknown][k] 0xbce011b7
   3.89%  postgres  postgres [.] MemoryContextAlloc
   3.39%  postgres  postgres [.] SlabFree

It doesn't look mysterious anymore, but I went ahead and took some more
perf measurements, including for cache misses. My naive impression is that
we're spending a bit more time waiting for data, but having to do less work
with it once we get it, which is consistent with your earlier comments:

perf stat -p $pid sleep 2
v13:
  2,001.55 msec task-clock:u              #    1.000 CPUs utilized
         0      context-switches:u        #    0.000 /sec
         0      cpu-migrations:u          #    0.000 /sec
   311,690      page-faults:u             #  155.724 K/sec
 3,128,740,701  cycles:u                  #    1.563 GHz
 4,739,333,861  instructions:u            #    1.51  insn per cycle
   820,014,588  branches:u                #  409.690 M/sec
     7,385,923  branch-misses:u           #    0.90% of all branches

+slab v4:
  2,001.09 msec task-clock:u              #    1.000 CPUs utilized
         0      context-switches:u        #    0.000 /sec
         0      cpu-migrations:u          #    0.000 /sec
   326,017      page-faults:u             #  162.920 K/sec
 3,016,668,818  cycles:u                  #    1.508 GHz
 4,324,863,908  instructions:u            #    1.43  insn per cycle
   761,839,927  branches:u                #  380.712 M/sec
     7,718,366  branch-misses:u           #    1.01% of all branches


perf stat -e LLC-loads,LLC-loads-misses -p $pid sleep 2
min/max of 3 runs:
v13:  LL cache misses: 25.08% - 25.41%
+slab v4: LL cache misses: 25.74% - 26.01%

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-17 Thread John Naylor
On Wed, Sep 28, 2022 at 1:18 PM I wrote:

> Along those lines, one thing I've been thinking about is the number of
size classes. There is a tradeoff between memory efficiency and number of
branches when searching/inserting. My current thinking is there is too much
coupling between size class and data type. Each size class currently uses a
different data type and a different algorithm to search and set it, which
in turn requires another branch. We've found that a larger number of size
classes leads to poor branch prediction [1] and (I imagine) code density.
>
> I'm thinking we can use "flexible array members" for the values/pointers,
and keep the rest of the control data in the struct the same. That way, we
never have more than 4 actual "kinds" to code and branch on. As a bonus,
when migrating a node to a larger size class of the same kind, we can
simply repalloc() to the next size.

While the most important challenge right now is how to best represent and
organize the shared memory case, I wanted to get the above idea working and
out of the way, to be saved for a future time. I've attached a rough
implementation (applies on top of v9 0003) that splits node32 into 2 size
classes. They both share the exact same base data type and hence the same
search/set code, so the number of "kind"s is still four, but here there are
five "size classes", so a new case in the "unlikely" node-growing path. The
smaller instance of node32 is a "node15", because that's currently 160
bytes, corresponding to one of the DSA size classes. This idea can be
applied to any other node except the max size, as we see fit. (Adding a
singleton size class would bring it back in line with the prototype, at
least as far as memory consumption.)

One issue with this patch: The "fanout" member is a uint8, so it can't hold
256 for the largest node kind. That's not an issue in practice, since we
never need to grow it, and we only compare that value with the count in an
Assert(), so I just set it to zero. That does break an invariant, so it's
not great. We could use 2 bytes to be strictly correct in all cases, but
that limits what we can do with the smallest node kind.

In the course of working on this, I encountered a pain point. Since it's
impossible to repalloc in slab, we have to do alloc/copy/free ourselves.
That's fine, but the current coding makes too many assumptions about the
use cases: rt_alloc_node and rt_copy_node are too entangled with each other
and do too much work unrelated to what the names imply. I seem to remember
an earlier version had something like rt_node_copy_common that did
only...copying. That was much easier to reason about. In 0002 I resorted to
doing my own allocation to show what I really want to do, because the new
use case doesn't need zeroing and setting values. It only needs
to...allocate (and increase the stats counter if built that way).

Future optimization work while I'm thinking of it: rt_alloc_node should be
always-inlined and the memset done separately (i.e. not *AllocZero). That
way the compiler should be able to generate more efficient zeroing code for
smaller nodes. I'll test the numbers on this sometime in the future.
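
To illustrate the intent (a sketch only -- the per-class slab array and the
rt_node_inner_4 struct name are assumptions here, not the current coding):

static inline rt_node *
rt_alloc_node(radix_tree *tree, rt_size_class size_class)
{
	return (rt_node *) MemoryContextAlloc(tree->node_slabs[size_class],
								rt_size_class_info[size_class].inner_size);
}

	/* at a call site, the constant size lets the compiler inline the zeroing */
	n4 = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL);
	memset(n4, 0, sizeof(rt_node_inner_4));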

--
John Naylor
EDB: http://www.enterprisedb.com
From 6fcc970ae7e31f44fa6b6aface983cadb023cc50 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Thu, 17 Nov 2022 16:10:44 +0700
Subject: [PATCH v901 2/2] Make node32 variable sized

Add a size class for 15 elements, which corresponds to 160 bytes,
an allocation size used by DSA. When a 16th element is to be
inserted, allocate a larger area and memcpy the entire old node
to it.

NB: Zeroing the new area is only necessary if it's for an
inner node128, since insert logic must check for null child
pointers.

This technique allows us to limit the node kinds to 4, which
1. limits the number of cases in switch statements
2. allows a possible future optimization to encode the node kind
in a pointer tag
---
 src/backend/lib/radixtree.c | 141 +++-
 1 file changed, 108 insertions(+), 33 deletions(-)

diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bef1a438ab..f368e750d5 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -130,6 +130,7 @@ typedef enum
 typedef enum rt_size_class
 {
RT_CLASS_4_FULL = 0,
+   RT_CLASS_32_PARTIAL,
RT_CLASS_32_FULL,
RT_CLASS_128_FULL,
RT_CLASS_256
@@ -147,6 +148,8 @@ typedef struct rt_node
uint16  count;
 
/* Max number of children. We can use uint8 because we never need to 
store 256 */
+   /* WIP: if we don't have a variable sized node4, this should instead be 
in the base
+   types as needed, since saving every byte is crucial for the smallest 
node kind */
uint8   fanout;
 
/*
@@ -166,6 +169,8 @@ typedef struct rt_node
((node)->base.n.count < (node)->ba

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-20 Thread John Naylor
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada 
wrote:
>
> On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada 
wrote:
> >
> > On Wed, Nov 16, 2022 at 4:39 PM John Naylor
> >  wrote:

> > > That means my idea for the pointer struct might have some problems,
at least as currently implemented. Maybe in the course of separating out
and polishing that piece, an inefficiency will fall out. Or, it might be
another reason to template local and shared separately. Not sure yet. I
also haven't tried to adjust this test for the shared memory case.

Digging a bit deeper, I see a flaw in my benchmark: Even though the total
distribution of node kinds is decently even, the pattern that the benchmark
sees is not terribly random:

     3,343,352      branch-misses:u           #    0.85% of all branches
   393,204,959      branches:u

Recall a previous benchmark [1] where the leaf node was about half node16
and half node32. Randomizing the leaf node between the two caused branch
misses to go from 1% to 2%, causing a noticeable slowdown. Maybe in this
new benchmark, each level has a skewed distribution of nodes, giving a
smart branch predictor something to work with. We will need a way to
efficiently generate keys that lead to a relatively unpredictable
distribution of node kinds, as seen by a searcher. Especially in the leaves
(or just above the leaves), since those are less likely to be cached.

> > I'll also run the test on my environment and do the investigation
tomorrow.
> >
>
> FYI I've not tested the patch you shared today but here are the
> benchmark results I did with the v9 patch in my environment (I used
> the second filter). I splitted 0004 patch into two patches: a patch
> for pure refactoring patch to introduce rt_node_ptr and a patch to do
> pointer tagging.

Would you be able to share the refactoring patch? And a fix for the failing
tests? I'm thinking I want to try the templating approach fairly soon.

[1]
https://www.postgresql.org/message-id/CAFBsxsFEVckVzsBsfgGzGR4Yz%3DJp%3DUxOtjYvTjOz6fOoLXtOig%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-18 Thread John Naylor
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada 
wrote:
>
> FYI I've not tested the patch you shared today but here are the
> benchmark results I did with the v9 patch in my environment (I used
> the second filter). I splitted 0004 patch into two patches: a patch
> for pure refactoring patch to introduce rt_node_ptr and a patch to do
> pointer tagging.
>
> v9 0003 patch: 1113 1114 1114
> introduce rt_node_ptr: 1127 1128 1128
> pointer tagging  : 1085 1087 1086 (equivalent to 0004 patch)
>
> In my environment, rt_node_ptr seemed to lead some overhead but
> pointer tagging had performance benefits. I'm not sure the reason why
> the results are different from yours. The radix tree stats shows the
> same as your tests.

There is less than 2% difference from the medial set of results, so it's
hard to distinguish from noise. I did a fresh rebuild and retested with the
same results: about 15% slowdown in v9 0004. That's strange.

On Wed, Nov 16, 2022 at 10:24 PM Masahiko Sawada 
wrote:

> > filter = (((uint64) 1<<32) | (0xFF<<24));
> > LOG:  num_keys = 944, height = 7, n4 = 47515559, n32 = 6209, n128 =
62632, n256 = 3161
> >
> > 1) Any idea why the tree height would be reported as 7 here? I didn't
expect that.
>
> In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
> It seems the filter should be (((uint64) 1<<32) | ((uint64)
> 0xFF<<24)).

Ugh, sign extension, brain fade on my part. Thanks, I'm glad there was a
straightforward explanation.
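
Spelled out, since it's an easy trap (plain C, nothing radix-tree-specific):

	uint64		bad = (((uint64) 1 << 32) | (0xFF << 24));
	/* 0xFF << 24 is a negative int, widened with sign extension, so
	 * bad == 0xFFFFFFFFFF000000 -- the AND keeps the top 40 bits of the
	 * key, which also explains the surprising tree height seen earlier */
	uint64		good = (((uint64) 1 << 32) | ((uint64) 0xFF << 24));
	/* good == 0x00000001FF000000 as intended */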

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-21 Thread John Naylor
On Mon, Nov 21, 2022 at 3:43 PM Masahiko Sawada 
wrote:
>
> On Mon, Nov 21, 2022 at 4:20 PM John Naylor
>  wrote:

> > Assuming the smallest node is fixed size (i.e. fanout/capacity member
not part of the common set, so only part of variable-sized nodes), 3 has a
nice property: no wasted padding space:
> >
> > node4: 5 + 4+(7) + 4*8 = 48 bytes
> > node3: 5 + 3 + 3*8 = 32
>
> IIUC if we store the fanout member only in variable-sized nodes,
> rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
> the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The
> size doesn't change but there is 1 byte padding space.

I forgot to mention I'm assuming no pointer-tagging for this exercise.
You've demonstrated it can be done in a small amount of code, and I hope we
can demonstrate a speedup in search. Just in case there is some issue with
portability, valgrind, or some other obstacle, I'm being pessimistic in my
calculations.

> Also, even if we have the node3 a variable-sized node, size class 1
> for node3 could be a good choice since it also doesn't need padding
> space and could be a good alternative to path compression.
>
> node3 :  5 + 3 + 3*8 = 32 bytes
> size class 1 : 5 + 3 + 1*8 = 16 bytes

Precisely! I have that scenario in my notes as well -- it's quite
compelling.
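
To spell out the layout behind those numbers (field names are assumptions
based on this thread; offsets are for a 64-bit target):

typedef struct rt_node_base3
{
	uint16		count;			/* offset 0: 2 bytes */
	uint8		shift;			/* offset 2 */
	uint8		chunk;			/* offset 3 */
	uint8		fanout;			/* offset 4: 5-byte common header */
	uint8		chunks[3];		/* offsets 5-7: fills the gap before... */
} rt_node_base3;

typedef struct rt_node_inner_3
{
	rt_node_base3 base;				/* 8 bytes, no padding */
	struct rt_node *children[3];	/* ...the pointers: 8 + 3*8 = 32 total */
} rt_node_inner_3;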

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-20 Thread John Naylor
On Fri, Nov 18, 2022 at 2:48 PM I wrote:
> One issue with this patch: The "fanout" member is a uint8, so it can't
hold 256 for the largest node kind. That's not an issue in practice, since
we never need to grow it, and we only compare that value with the count in
an Assert(), so I just set it to zero. That does break an invariant, so
it's not great. We could use 2 bytes to be strictly correct in all cases,
but that limits what we can do with the smallest node kind.

Thinking about this part, there's an easy resolution -- use a different
macro for fixed- and variable-sized node kinds to determine if there is a
free slot.

Also, I wanted to share some results of adjusting the boundary between the
two smallest node kinds. In the hackish attached patch, I modified the
fixed height search benchmark to search a small (within L1 cache) tree
thousands of times. For the first set I modified node4's maximum fanout and
filled it up. For the second, I set node4's fanout to 1, which causes 2+ to
spill to node32 (actually the partially-filled node15 size class
as demoed earlier).

node4:

NOTICE:  num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0,
n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
+---+--++--
  2 |16 |16520 |  0 |3

NOTICE:  num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0,
n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
+---+--++--
  3 |81 |16456 |  0 |   17

NOTICE:  num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0,
n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
+---+--++--
  4 |   256 |16456 |  0 |   89

NOTICE:  num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0,
n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
+---+--++--
  5 |   625 |16488 |  0 |  327


node32:

NOTICE:  num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0,
n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
+---+--++--
  2 |16 |16488 |  0 |5
(1 row)

NOTICE:  num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0,
n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
+---+--++--
  3 |81 |16520 |  0 |   28

NOTICE:  num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0,
n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
+---+--++--
  4 |   256 |16408 |  0 |   79

NOTICE:  num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0,
n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
+---+--++--
  5 |   625 |24616 |  0 |  199

In this test, node32 seems slightly faster than node4 with 4 elements, at
the cost of more memory.

Assuming the smallest node is fixed size (i.e. fanout/capacity member not
part of the common set, so only part of variable-sized nodes), 3 has a nice
property: no wasted padding space:

node4: 5 + 4+(7) + 4*8 = 48 bytes
node3: 5 + 3 + 3*8 = 32

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-25 Thread John Naylor
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada 
wrote:
>
> [v11]

There is one more thing that just now occurred to me: In expanding the use
of size classes, that makes rebasing and reworking the shared memory piece
more work than it should be. That's important because there are still some
open questions about the design around shared memory. To keep unnecessary
churn to a minimum, perhaps we should limit size class expansion to just
one (or 5 total size classes) for the near future?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-28 Thread John Naylor
While creating a benchmark for inserting into node128-inner, I found a bug.
If a caller deletes from a node128, the slot index is set to invalid, but
the child pointer is still valid. Do that a few times, and every child
pointer is valid, even if no slot index points to it. When the next
inserter comes along, something surprising happens. This function:

/* Return an unused slot in node-128 */
static int
node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
{
  int slotpos = 0;

  Assert(!NODE_IS_LEAF(node));
  while (node_inner_128_is_slot_used(node, slotpos))
  slotpos++;

  return slotpos;
}

...passes an integer to this function, whose parameter is a uint8:

/* Is the slot in the node used? */
static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
  Assert(!NODE_IS_LEAF(node));
  return (node->children[slot] != NULL);
}

...so instead of growing the node unnecessarily or segfaulting, it enters
an infinite loop doing this:

add eax, 1
movzx   ecx, al
cmp QWORD PTR [rbx+264+rcx*8], 0
jne .L147

The fix is easy enough -- set the child pointer to null upon deletion, but
I'm somewhat astonished that the regression tests didn't hit this. I do
still intend to replace this code with something faster, but before I do so
the tests should probably exercise the deletion paths more. Since VACUUM

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-28 Thread John Naylor
> The fix is easy enough -- set the child pointer to null upon deletion,
but I'm somewhat astonished that the regression tests didn't hit this. I do
still intend to replace this code with something faster, but before I do so
the tests should probably exercise the deletion paths more. Since VACUUM

Oops. I meant to finish with "Since VACUUM doesn't perform deletion we
didn't have an opportunity to detect this during that operation."
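
For concreteness, a sketch of the kind of fix described above (slot_idxs
and RT_INVALID_SLOT_IDX are names assumed for illustration):

static inline void
node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
{
	int			slotpos = node->slot_idxs[chunk];

	Assert(!NODE_IS_LEAF(node));
	node->children[slotpos] = NULL; /* the missing step: clear the pointer */
	node->slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
	node->base.n.count--;
}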

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-29 Thread John Naylor
There are a few things up in the air, so I'm coming back to this list to
summarize and add a recent update:

On Mon, Nov 14, 2022 at 7:59 PM John Naylor 
wrote:
>
> - See how much performance we actually gain from tagging the node kind.

Needs a benchmark that has enough branch mispredicts and L2/3 misses to
show a benefit. Otherwise either neutral or worse in its current form,
depending on compiler(?). Put off for later.

> - Try additional size classes while keeping the node kinds to only four.

This is relatively simple and effective. If only one additional size class
(total 5) is coded as a placeholder, I imagine it will be easier to rebase
shared memory logic than using this technique everywhere possible.

> - Optimize node128 insert.

I've attached a rough start at this. The basic idea is borrowed from our
bitmapset nodes, so we can iterate over and operate on word-sized (32- or
64-bit) types at a time, rather than bytes. To make this easier, I've moved
some of the lower-level macros and types from bitmapset.h/.c to
pg_bitutils.h. That's probably going to need a separate email thread to
resolve the coding style clash this causes, so that can be put off for
later. This is not meant to be included in the next patchset.  For
demonstration purposes, I get these results with a function that repeatedly
deletes the last value from a mostly-full node128 leaf and re-inserts it:

select * from bench_node128_load(120);

v11

NOTICE:  num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0,
n61 = 0, n128 = 121, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
+---+--+--
120 | 14400 |   208304 |   56

v11 + 0006 addendum

NOTICE:  num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0,
n61 = 0, n128 = 121, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
+---+--+--
120 | 14400 |   208816 |   34

I didn't test inner nodes, but I imagine the difference is bigger. This
bitmap style should also be used for the node256-leaf isset array simply to
be consistent and avoid needing single-use macros, but that has not been
done yet. It won't make a difference for performance because there is no
iteration there.

> - Try templating out the differences between local and shared memory.

I hope to start this sometime after the crashes on 32-bit are resolved.

--
John Naylor
EDB: http://www.enterprisedb.com
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql 
b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 67ba568531..2fd689aa91 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -63,3 +63,14 @@ OUT rt_search_ms int8
 returns record
 as 'MODULE_PATHNAME'
 LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c 
b/contrib/bench_radix_tree/bench_radix_tree.c
index e69be48448..b035b3a747 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -31,6 +31,7 @@ PG_FUNCTION_INFO_V1(bench_shuffle_search);
 PG_FUNCTION_INFO_V1(bench_load_random_int);
 PG_FUNCTION_INFO_V1(bench_fixed_height_search);
 PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
 
 static uint64
 tid_to_key_off(ItemPointer tid, uint32 *off)
@@ -552,3 +553,85 @@ finish_search:
rt_free(rt);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, 
nulls)));
 }
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+   int fanout = PG_GETARG_INT32(0);
+   radix_tree *rt;
+   TupleDesc   tupdesc;
+   TimestampTz start_time,
+   end_time;
+   longsecs;
+   int usecs;
+   int64   rt_sparseload_ms;
+   Datum   values[5];
+   boolnulls[5];
+
+   uint64  r,
+   h;
+   int key_id;
+
+   /* Build a tuple descriptor for our result type */
+   if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+   elog(ERROR, "return type must be a row type");
+
+   rt = rt_create(CurrentMemoryContext);
+
+   key_id = 0;
+
+   for (r = 1; r <= fanout; r++)
+   {
+   for (h = 1; h <= fanout; h++)
+   {
+   uint64  key;
+
+   key = (r << 8) | (h);
+
+   key_id++;
+   rt_set(rt, key, key_id);
+   }
+   }
+
+   rt_stats(rt);

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-30 Thread John Naylor
On Wed, Nov 30, 2022 at 11:09 PM Masahiko Sawada 
wrote:
>
> I've investigated this issue and have a question about using atomic
> variables on palloc'ed memory. In non-parallel vacuum cases,
> radix_tree_control is allocated via aset.c. IIUC in 32-bit machines,
> the memory allocated by aset.c is 4-bytes aligned so these atomic
> variables are not always 8-bytes aligned. Is there any way to enforce
> 8-bytes aligned memory allocations in 32-bit machines?

The bigger question in my mind is: Why is there an atomic variable in
backend-local memory?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-01 Thread John Naylor
On Thu, Dec 1, 2022 at 3:03 PM Masahiko Sawada 
wrote:
>
> On Thu, Dec 1, 2022 at 4:00 PM John Naylor 
wrote:
> >
> > The bigger question in my mind is: Why is there an atomic variable in
backend-local memory?
>
> Because I use the same radix_tree and radix_tree_control structs for
> non-parallel and parallel vacuum. Therefore, radix_tree_control is
> allocated in DSM for parallel-vacuum cases or in backend-local memory
> for non-parallel vacuum cases.

Ok, that could be yet another reason to compile local- and shared-memory
functionality separately, but now I'm wondering why there are atomic
variables at all, since there isn't yet any locking support.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-30 Thread John Naylor
On Wed, Nov 30, 2022 at 2:28 PM Masahiko Sawada 
wrote:
>
> On Fri, Nov 25, 2022 at 5:00 PM John Naylor
>  wrote:

> > These hacks were my work, but I think we can improve that by having two
versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized
nodes. For that to work, in "init-node" we'd need a branch to set fanout to
zero for node256. That should be fine -- it already has to branch for
memset'ing node128's indexes to 0xFF.
>
> Since the node has fanout regardless of fixed-sized and
> variable-sized

As currently coded, yes. But that's not strictly necessary, I think.

>, only node256 is the special case where the fanout in
> the node doesn't match the actual fanout of the node. I think if we
> want to have two versions of NODE_HAS_FREE_SLOT, we can have one for
> node256 and one for other classes. Thoughts? In your idea, for
> NODE_HAS_FREE_SLOT for fixed-sized nodes, you meant like the
> following?
>
> #define FIXED_NODE_HAS_FREE_SLOT(node, class)
>   (node->base.n.count < rt_size_class_info[class].fanout)

Right, and the other one could be VAR_NODE_...
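
That is, something like this pair (the VAR_NODE_ spelling is just a
suggestion):

#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
	((node)->base.n.count < rt_size_class_info[class].fanout)

#define VAR_NODE_HAS_FREE_SLOT(node) \
	((node)->base.n.count < (node)->base.n.fanout)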

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Prefetch the next tuple's memory during seqscans

2022-11-30 Thread John Naylor
On Wed, Nov 23, 2022 at 4:58 AM David Rowley  wrote:

> My current thoughts are that it might be best to go with 0005 to start
> with.

+1

> I know Melanie is working on making some changes in this area,
> so perhaps it's best to leave 0002 until that work is complete.

There seem to be some open questions about that one as well.

I reran the same test in [1] (except I don't have the ability to lock clock
speed or affect huge pages) on an older CPU from 2014 (Intel(R) Xeon(R) CPU
E5-2695 v3 @ 2.30GHz, kernel 3.10 gcc 4.8) with good results:

HEAD:

Testing a1
latency average = 965.462 ms
Testing a2
latency average = 1054.608 ms
Testing a3
latency average = 1078.263 ms
Testing a4
latency average = 1120.933 ms
Testing a5
latency average = 1162.753 ms
Testing a6
latency average = 1298.876 ms
Testing a7
latency average = 1228.775 ms
Testing a8
latency average = 1293.535 ms

0001+0005:

Testing a1
latency average = 791.224 ms
Testing a2
latency average = 876.421 ms
Testing a3
latency average = 911.039 ms
Testing a4
latency average = 981.693 ms
Testing a5
latency average = 998.176 ms
Testing a6
latency average = 979.954 ms
Testing a7
latency average = 1066.523 ms
Testing a8
latency average = 1030.235 ms

I then tested a Power8 machine (also kernel 3.10 gcc 4.8). Configure
reports "checking for __builtin_prefetch... yes", but I don't think it does
anything here, as the results are within noise level. A quick search didn't
turn up anything informative on this platform, and I'm not motivated to dig
deeper. In any case, it doesn't make things worse.

HEAD:

Testing a1
latency average = 1402.163 ms
Testing a2
latency average = 1442.971 ms
Testing a3
latency average = 1599.188 ms
Testing a4
latency average = 1664.397 ms
Testing a5
latency average = 1782.091 ms
Testing a6
latency average = 1860.655 ms
Testing a7
latency average = 1929.120 ms
Testing a8
latency average = 2021.100 ms

0001+0005:

Testing a1
latency average = 1433.080 ms
Testing a2
latency average = 1428.369 ms
Testing a3
latency average = 1542.406 ms
Testing a4
latency average = 1642.452 ms
Testing a5
latency average = 1737.173 ms
Testing a6
latency average = 1828.239 ms
Testing a7
latency average = 1920.909 ms
Testing a8
latency average = 2036.922 ms

[1]
https://www.postgresql.org/message-id/CAFBsxsHqmH_S%3D4apc5agKsJsF6xZ9f6NaH0Z83jUYv3EgySHfw%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Improve performance of pg_strtointNN functions

2022-11-30 Thread John Naylor
On Thu, Dec 1, 2022 at 6:42 AM David Rowley  wrote:
>
> I was thinking that we should likely apply this before doing the hex
> literals, which is the main focus of [1].  The reason being is so that
> that patch can immediately have faster conversions by allowing the
> compiler to use bit shifting instead of other means of multiplying by
> a power-of-2 number. I'm hoping this removes a barrier for Peter from
> the small gripe I raised on that thread about the patch having slower
> than required hex, octal and binary string parsing.

I don't see why the non-decimal literal patch needs to be "immediately"
faster? If doing this first leads to less code churn, that's another
consideration, but you haven't made that argument.
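
For reference, the codegen point at stake, in isolation (a sketch, not the
actual pg_strtoint code; "ptr" is the input cursor and "hexlookup" stands
in for a digit-value table):

	/*
	 * On an unsigned accumulator a compiler may turn the multiply by 16
	 * into a shift left by 4; behind __builtin_mul_overflow it typically
	 * cannot.
	 */
	uint32		acc = 0;

	while (isxdigit((unsigned char) *ptr))
		acc = acc * 16 + hexlookup[(unsigned char) *ptr++];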

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PATCH] Add native windows on arm64 support

2022-12-01 Thread John Naylor
On Fri, Dec 2, 2022 at 12:20 AM Niyas Sait  wrote:
>
>
> On 07/11/2022 06:58, Michael Paquier wrote:
> > Seems so.  Hmm, where does _ARM64_BARRIER_SY come from?  Perhaps it
> > would be better to have a comment referring to it from a different
> > place than the forums of arm, like some actual docs?
>
>
> _ARM64_BARRIER_SY is defined in Microsoft Arm64 intrinsic documentation
> -
>
https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics?view=msvc-170#BarrierRestrictions

In particular, at the bottom of that section:

"For the __isb intrinsic, the only restriction that is currently valid is
_ARM64_BARRIER_SY; all other values are reserved by the architecture."

This corresponds to

https://developer.arm.com/documentation/ddi0596/2021-06/Base-Instructions/ISB--Instruction-Synchronization-Barrier-

which says

"SY Full system barrier operation, encoded as CRm = 0b. Can be omitted."

> I couldn't find something more official for the sse2neon library part.

Not quite sure what this is referring to, but it seems we can just point to
the __aarch64__ section in the same file, which uses the same instruction:

spin_delay(void)
{
  __asm__ __volatile__(
  " isb; \n");
}

...and which already explains the choice with a comment.

About v4:

+ * Use _mm_pause (x64) or __isb(arm64) intrinsic instead of rep nop.

Need a space here after __isb.

+  if cc.get_id() == 'msvc'
+cdata.set('USE_ARMV8_CRC32C', false)
+cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
+have_optimized_crc = true
+  else

That seems like a heavy-handed way to force it. Could we just use the same
gating in the test program that the patch puts in the code of interest?
Namely:

+#ifndef _MSC_VER
 #include 
+#endif

--
John Naylor
EDB: http://www.enterprisedb.com


Re: initdb: Refactor PG_CMD_PUTS loops

2022-12-01 Thread John Naylor
On Thu, Dec 1, 2022 at 5:02 PM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:
>
> Keeping the SQL commands that initdb runs in string arrays before
> feeding them to PG_CMD_PUTS() seems unnecessarily verbose and
> inflexible.  In some cases, the array only has one member.  In other
> cases, one might want to use PG_CMD_PRINTF() instead, to parametrize a
> command, but that would require breaking up the loop or using
> workarounds like replace_token().  This patch unwinds all that; it's
> much simpler that way.

+1, I can't think of a reason to keep the current coding.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Optimizing Node Files Support

2022-12-02 Thread John Naylor
On Thu, Dec 1, 2022 at 8:02 PM Ranier Vilela  wrote:
>
> Hi,
>
> I believe that has room for improving generation node files.
>
> The patch attached reduced the size of generated files by 27 kbytes.
> From 891 kbytes to 864 kbytes.
>
> About the patch:
> 1. Avoid useless attribution when from->field is NULL, once that
> the new node is palloc0.
>
> 2. Avoid useless declaration variable Size, when it is unnecessary.

Not useless -- it prevents a multiple evaluation hazard, which this patch
introduces.
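
For reference, the existing copyfuncs.c macro avoids that hazard by
stashing the argument in a local, so it is evaluated exactly once:

#define COPY_POINTER_FIELD(fldname, sz) \
	do { \
		Size	_size = (sz); \
		newnode->fldname = palloc(_size); \
		memcpy(newnode->fldname, from->fldname, _size); \
	} while (0)

Pasting (sz) directly into both the palloc() and memcpy() calls would
compute the size expression twice.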

> 3. Optimize comparison functions like memcmp and strcmp, using
>  a short-cut comparison of the first element.

Not sure if the juice is worth the squeeze. Profiling would tell.

> 4. Switch several copy attributions like COPY_SCALAR_FIELD or
COPY_LOCATION_FIELD
> by one memcpy call.

My first thought is, it would cause code churn.

> 5. Avoid useless attribution, ignoring the result of pg_strtok when it is
unnecessary.

Looks worse.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Prefetch the next tuple's memory during seqscans

2022-11-22 Thread John Naylor
On Wed, Nov 23, 2022 at 5:00 AM David Rowley  wrote:
>
> On Thu, 3 Nov 2022 at 22:09, John Naylor 
wrote:
> > I tried a similar test, but with text fields of random length, and
there is improvement here:
>
> Thank you for testing that. Can you share which CPU this was on?

That was an Intel Core i7-10750H

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Non-decimal integer literals

2022-11-22 Thread John Naylor
On Tue, Nov 22, 2022 at 8:36 PM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:
>
> On 15.11.22 11:31, Peter Eisentraut wrote:
> > On 14.11.22 08:25, John Naylor wrote:
> >> Regarding the patch, it looks good overall. My only suggestion would
> >> be to add a regression test for just below and just above overflow, at
> >> least for int2.
> >
> > ok
>
> This was a valuable suggestion, because this found some breakage.  In
> particular, the handling of grammar-level literals that overflow to
> "Float" was not correct.  (The radix prefix was simply stripped and
> forgotten.)  So I added a bunch more tests for this.  Here is a new patch.

Looks good to me.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Non-decimal integer literals

2022-11-23 Thread John Naylor
On Wed, Nov 23, 2022 at 3:54 PM David Rowley  wrote:
>
> Going by [1], clang will actually use multiplication by 16 to
> implement the former. gcc is better and shifts left by 4, so likely
> won't improve things for gcc.  It seems worth doing it this way for
> anything that does not have HAVE__BUILTIN_OP_OVERFLOW anyway.

FWIW, gcc 12.2 generates an imul on my system when compiling in situ. I've
found it useful to run godbolt locally* and load the entire PG file (nicer
to read than plain objdump) -- compilers can make different decisions when
going from isolated snippets to within full functions.

* clone from https://github.com/compiler-explorer/compiler-explorer
install npm 16
run "make" and when finished will show the localhost url
add the right flags, which in this case was

-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement
-Werror=vla -Wendif-labels -Wmissing-format-attribute
-Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security
-fno-strict-aliasing -fwrapv -fexcess-precision=standard
-Wno-format-truncation -Wno-stringop-truncation -O2
-I/path/to/srcdir/src/include -I/path/to/builddir/src/include  -D_GNU_SOURCE

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread John Naylor
On Wed, Nov 16, 2022 at 11:46 AM John Naylor 
wrote:
>
>
> On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada 
wrote:
> > Thanks! Please let me know if there is something I can help with.
>
> I didn't get very far because the tests fail on 0004 in rt_verify_node:
>
> TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File:
"../src/backend/lib/radixtree.c", Line: 2186, PID: 18242

Actually I do want to offer some general advice. Upthread I recommended a
purely refactoring patch that added the node-pointer struct but did nothing
else, so that the DSA changes would be smaller. 0004 attempted pointer
tagging in the same commit, which makes it no longer a purely refactoring
patch, so that 1) makes it harder to tell what part caused the bug and 2)
obscures what is necessary for DSA pointers and what was additionally
necessary for pointer tagging. Shared memory support is a prerequisite for
a shippable feature, but pointer tagging is (hopefully) a performance
optimization. Let's keep them separate.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread John Naylor
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada 
wrote:
> Thanks! Please let me know if there is something I can help with.

I didn't get very far because the tests fail on 0004 in rt_verify_node:

TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File:
"../src/backend/lib/radixtree.c", Line: 2186, PID: 18242

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread John Naylor
On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada 
wrote:
>
> On Wed, Nov 16, 2022 at 1:46 PM John Naylor
>  wrote:
> >
> >
> > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada 
wrote:
> > > Thanks! Please let me know if there is something I can help with.
> >
> > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> >
> > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File:
"../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
>
> Which tests do you use to get this assertion failure? I've confirmed
> there is a bug in 0005 patch but without it, "make check-world"
> passed.

Hmm, I started over and rebuilt and it didn't reproduce. Not sure what
happened, sorry for the noise.

I'm attaching a test I wrote to stress test branch prediction in search,
and while trying it out I found two possible issues.

It's based on the random int load test, but tests search speed. Run like
this:

select * from bench_search_random_nodes(10 * 1000 * 1000)

It also takes some care to include all the different node kinds,
restricting the possible keys by AND-ing with a filter. Here's a simple
demo:

filter = ((uint64)1<<40)-1;
LOG:  num_keys = 967, height = 4, n4 = 17513814, n32 = 6320, n128 =
62663, n256 = 3130

Just using random integers leads to >99% using the smallest node. I wanted
to get close to having the same number of each, but that's difficult while
still using random inputs. I ended up using

filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)

which gives

LOG:  num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 =
182670, n256 = 1024

Which seems okay for the task. One puzzling thing I found while trying
various filters is that sometimes the reported tree height would change.
For example:

filter = (((uint64) 1<<32) | (0xFF<<24));
LOG:  num_keys = 944, height = 7, n4 = 47515559, n32 = 6209, n128 =
62632, n256 = 3161

1) Any idea why the tree height would be reported as 7 here? I didn't
expect that.

2) It seems that 0004 actually causes a significant slowdown in this test
(as in the attached, using the second filter above and with turboboost
disabled):

v9 0003: 2062 2051 2050
v9 0004: 2346 2316 2321

That means my idea for the pointer struct might have some problems, at
least as currently implemented. Maybe in the course of separating out and
polishing that piece, an inefficiency will fall out. Or, it might be
another reason to template local and shared separately. Not sure yet. I
also haven't tried to adjust this test for the shared memory case.

--
John Naylor
EDB: http://www.enterprisedb.com
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 0874201d7e..e0205b364e 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -43,6 +43,14 @@ returns record
 as 'MODULE_PATHNAME'
 LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
 
+create function bench_search_random_nodes(
+cnt int8,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
 create function bench_fixed_height_search(
 fanout int4,
 OUT fanout int4,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 7abb237e96..a43fc61c2d 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -29,6 +29,7 @@ PG_FUNCTION_INFO_V1(bench_seq_search);
 PG_FUNCTION_INFO_V1(bench_shuffle_search);
 PG_FUNCTION_INFO_V1(bench_load_random_int);
 PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
 
 static uint64
 tid_to_key_off(ItemPointer tid, uint32 *off)
@@ -347,6 +348,77 @@ bench_load_random_int(PG_FUNCTION_ARGS)
	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
 
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+   x ^= x >> 30;
+   x *= UINT64CONST(0xbf58476d1ce4e5b9);
+   x ^= x >> 27;
+   x *= UINT64CONST(0x94d049bb133111eb);
+   x ^= x >> 31;
+   return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+   uint64  cnt = (uint64) PG_GETARG_INT64(0);
+   radix_tree *rt;
+   TupleDesc   tupdesc;
+   TimestampTz start_time,
+   end_time;
+   longsecs;
+   int usecs;
+   int64   search_time_ms;
+   Datum   values[2] = {0};
+   boolnulls[2] = {0};
+   /* from trial and error */
+   const uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+   

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-25 Thread John Naylor
ASS(n32, RT_CLASS_32_PARTIAL))

This macro doesn't really improve readability -- it obscures what is being
tested, and the name implies the "else" branch means "node doesn't need to
grow class", which is false. If we want to simplify expressions in this
block, I think it'd be more effective to improve the lines that follow:

+ memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
+ new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;

Maybe we can have const variables old_size and new_fanout to break out the
array lookup, as sketched below? While I'm thinking of it, these arrays
should be const so the compiler can avoid runtime lookups.
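
An illustrative sketch of that first suggestion (same names as above, not
actual patch text):

	const Size	old_size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
	const int	new_fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;

	memcpy(new32, n32, old_size);
	new32->base.n.fanout = new_fanout;

Speaking of...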

+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+  uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}

When I looked at this earlier, I somehow didn't go far enough -- why are we
passing the runtime count in the first place? This function can only be
called if count == rt_size_class_info[RT_CLASS_4_FULL].fanout. The last
parameter to memcpy should evaluate to a compile-time constant, right? Even
when we add node shrinking in the future, the constant should be correct,
IIUC?
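
In other words, something like this (a sketch, assuming the lookup table is
const so the count folds to a compile-time constant -- not a finished
implementation):

/* Copy both chunks and children/values arrays, always at full node4 capacity */
static inline void
chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
						  uint8 *dst_chunks, rt_node **dst_children)
{
	const int	count = rt_size_class_info[RT_CLASS_4_FULL].fanout;

	memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
	memcpy(dst_children, src_children, sizeof(rt_node *) * count);
}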

- .fanout = 256,
+ /* technically it's 256, but we can't store that in a uint8,
+  and this is the max size class so it will never grow */
+ .fanout = 0,

- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(((rt_node *) n256)->fanout == 0);
+ Assert(chunk_exists || ((rt_node *) n256)->count < 256);

These hacks were my work, but I think we can improve that by having two
versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized
nodes. For that to work, in "init-node" we'd need a branch to set fanout to
zero for node256. That should be fine -- it already has to branch for
memset'ing node128's indexes to 0xFF.
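
One possible shape for the two macros (hypothetical names, and assuming the
class-info fanout member is wide enough to hold 256):

/* for size classes whose capacity is a compile-time constant */
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
	((node)->base.n.count < rt_size_class_info[class].fanout)

/* for classes that must consult the fanout stored in the node header */
#define VAR_NODE_HAS_FREE_SLOT(node) \
	((node)->base.n.count < (node)->base.n.fanout)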

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-01-16 Thread John Naylor
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada 
wrote:
>
> On Mon, Jan 16, 2023 at 2:02 PM John Naylor
>  wrote:
> >

> Thanks! cfbot complaints about some warnings but these are expected
> (due to unused delete routines etc). But one reported error[1] might
> be relevant with 0002 patch?

> [05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
> external symbol pg_popcount64
> [05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
> fatal error LNK1120: 1 unresolved externals

Yeah, I'm not sure what's causing that. Since that comes from a debugging
function, we could work around it, but it would be nice to understand why,
so I'll probably have to experiment on my CI repo.

> ---
> +#ifndef RT_COMMON
> +#define RT_COMMON
>
> What are we using this macro RT_COMMON for?

It was a quick way to define some things only once, so they probably all
showed up in your list of symbols that are never #undef'd. It's different
from the style of simplehash.h, which is to have a local name and #undef
for every single thing. simplehash.h is a precedent, so I'll change it to
match. I'll take a look at your list, too.

> > + * Add Tids on a block to TidStore. The caller must ensure the offset
numbers
> > + * in 'offsets' are ordered in ascending order.
> >
> > Must? What happens otherwise?
>
> It ends up missing TIDs by overwriting the same key with different
> values. Is it better to have a bool argument, say need_sort, to sort
> the given array if the caller wants?

> Since it seems you're working on another cleanup, I can address the
> above comments after your work is completed. But I'm also fine with
> including them into your cleanup work.

I think we can work mostly simultaneously, if you work on tid store and
vacuum, and I work on the template. We can always submit a full patchset
including each other's latest work. That will catch rebase issues sooner.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: cutting down the TODO list thread

2023-01-16 Thread John Naylor
I wrote:

> We could also revise the developer FAQ:
> - remove phrase "Outstanding features are detailed in Todo."
> - add suggestion to check the Todo or Not_worth_doing pages to see if
the desired feature is undesirable or problematic
> - rephrase "Working in isolation is not advisable because others might be
working on the same TODO item, or you might have misunderstood the TODO
item." so it doesn't mention 'TODO' at all.

There is also

https://wiki.postgresql.org/wiki/So,_you_want_to_be_a_developer%3F#TODOs

Changing the descriptions in the pages that link to the todo list will
probably do more to reduce confusion than changing language in the todo list
itself.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Small miscellaneus fixes (Part II)

2023-01-16 Thread John Naylor
On Mon, Jan 16, 2023 at 1:28 PM John Naylor 
wrote:
>
>
> I wrote:
> > ...but arguably the earlier check is close enough that it's silly to
assert in the "else" branch, and I'd be okay with just nuking those lines.
Another thing that caught my attention is the assumption that unsetting a
bit is so expensive that we have to first check if it's set, so we may as
well remove "IS_BRACKET(Np->Num)" as well.
>
> The attached is what I mean. I'll commit this this week unless there are
objections.

I've pushed this and the cosmetic fix in pg_dump. Those were the only
things I saw that had some interest, so I closed the CF entry.
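
For the record, the simplification amounts to the difference between these
two forms (generic flag names, not the actual formatting.c code):

#define FLAG_BRACKET	(1 << 3)	/* generic stand-in */

	/* before: test the bit, then clear it */
	if (flags & FLAG_BRACKET)
		flags &= ~FLAG_BRACKET;

	/* after: clearing an already-clear bit is harmless and branch-free */
	flags &= ~FLAG_BRACKET;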

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-01-17 Thread John Naylor
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada 
wrote:
>
> On Mon, Jan 16, 2023 at 2:02 PM John Naylor
>  wrote:

> > + * Add Tids on a block to TidStore. The caller must ensure the offset
numbers
> > + * in 'offsets' are ordered in ascending order.
> >
> > Must? What happens otherwise?
>
> It ends up missing TIDs by overwriting the same key with different
> values. Is it better to have a bool argument, say need_sort, to sort
> the given array if the caller wants?

Now that I've studied it some more, I see what's happening: We need all
bits set in the "value" before we insert it, since it would be too
expensive to retrieve the current value, add one bit, and put it back.
Also, as a consequence of the encoding, part of the tid is in the key, and
part in the value. It makes more sense now, but it needs more than zero
comments.
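
My understanding in sketch form (bit widths illustrative, not necessarily
what the patch uses): the block number and the upper offset bits form the
radix tree key, and the low offset bits select one bit in the uint64 value,
which is why all offsets for a given key must be OR'd together before the
single insert:

	/* illustrative only: 64 offsets per key, one bit each */
	key = ((uint64) blockno << key_shift) | (offset / 64);	/* key_shift leaves room for offset / 64 */
	value |= UINT64CONST(1) << (offset % 64);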

As for the order, I don't think it's the responsibility of the caller to
guess if it needs sorting -- if unordered offsets lead to data loss, this
function needs to take care of it.

> > + uint64 last_key = PG_UINT64_MAX;
> >
> > I'm having some difficulty understanding this sentinel and how it's
used.
>
> Will improve the logic.

Part of the problem is the English language: "last" can mean "previous" or
"at the end", so maybe some name changes would help.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: cutting down the TODO list thread

2023-01-23 Thread John Naylor
On Wed, Jan 18, 2023 at 3:13 AM Bruce Momjian  wrote:

> I think we risk overloading people with too many words above, and they
> will not read it fully.  Can it be simplified?  I wonder if some of this
> belongs in the developer's FAQ and linked to that from the TODO list.

I think you're right. That further drives home what I mentioned a few days
ago: Maybe we don't need to change much in the actual list itself, but
rephrase references to it from elsewhere. Here's a first draft to see what
that would look like:
--
https://wiki.postgresql.org/wiki/So,_you_want_to_be_a_developer%3F#TODOs

from:
"PostgreSQL maintains a TODO list on our wiki:
http://wiki.postgresql.org/wiki/TODO

We have attempted to organize issues and link in relevant discussions from
our mailing list archives. Please read background information if it is
available before attempting to resolve a TODO. If no background information
is available, it is appropriate to post a question to
pgsql-hack...@postgresql.org and request additional information and inquire
about the status of any ongoing work on the problem."

to:
"It's worth checking if the feature of interest is found in the TODO list
on our wiki: http://wiki.postgresql.org/wiki/TODO. That list contains some
known PostgreSQL bugs, some feature requests, and some things we are not
even sure we want. Many entries have a link to an email thread containing
prior discussion, or perhaps attempts that for whatever reason didn't make
it as far as getting committed."

...which might make more sense if moved below the "brand new features"
section.

--
https://wiki.postgresql.org/wiki/Developer_FAQ

1)
from:
"What areas need work?
Outstanding features are detailed in Todo.

You can learn more about these features by consulting the archives, the SQL
standards and the recommended texts (see books for developers)."

to:
??? -> For "what areas need work?", we need to have a different answer, but
I'm not sure what it is.

2)
from:
"What do I do after choosing an item to work on?

Send an email to pgsql-hackers with a proposal for what you want to do
(assuming your contribution is not trivial). Working in isolation is not
advisable because others might be working on the same TODO item, or you
might have misunderstood the TODO item. In the email, discuss both the
internal implementation method you plan to use, and any user-visible
changes (new syntax, etc)."

to:
"What do I do after choosing an area to work on?

Send an email to pgsql-hackers with a proposal for what you want to do
(assuming your contribution is not trivial). Working in isolation is not
advisable because experience has shown that there are often requirements
that are not obvious, and if those are not agreed on beforehand it leads to
wasted effort. In the email, discuss both the internal implementation
method you plan to use, and any user-visible changes (new syntax, etc)."

> On Wed, Jan 11, 2023 at 02:09:56PM +0700, John Naylor wrote:
> > I've come up with some revised language, including s/15/16/ and
removing the
> > category of "[E]" (easier to implement), since it wouldn't be here if
it were
> > actually easy:
>
> I think it is still possible for a simple item to be identified as
> wanted and easy, but not completed and put on the TODO list.

Theoretically it's possible, but in practice no one puts any easy items
here. Currently, there are three marked as easy:

pg_dump:
- Dump security labels and comments on databases in a way that allows to
load a dump into a differently named database
- Add full object name to the tag field. eg. for operators we need
'=(integer, integer)', instead of just '='.

Dump and restore is critical to get right, and the code base is pretty
large and hairy, so I don't actually believe for a second that these items
are easy.

ECPG:
- sqlwarn[6] should be 'W' if the PRECISION or SCALE value specified

In one sentence there are four uses of jargon that only someone with
experience would understand. I have hacked around ECPG multiple times and
have no idea what this means.

The last two also don't have any motivation spelled out, much less an email
thread.

So my inclination is, the [E] marker here is unjustified.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16

2023-01-22 Thread John Naylor
On Sun, Jan 22, 2023 at 10:42 PM Joel Jacobson  wrote:

> I did write the code like you suggest first, but changed it,
> since I realised the extra "else if" needed could be eliminated,
> and thought div_var_int64() wouldn't be slower than div_var_int() since
> I thought 64-bit instructions in general are as fast as 32-bit
instructions,
> on 64-bit platforms.

According to Agner's instruction tables [1], integer division on Skylake
(for example) has a latency of 26 cycles for 32-bit operands, and 42-95
cycles for 64-bit.

[1] https://www.agner.org/optimize/instruction_tables.pdf

--
John Naylor
EDB: http://www.enterprisedb.com


bitscan forward/reverse on Windows

2023-01-24 Thread John Naylor
Attached is a quick-and-dirty attempt to add MSVC support for the
rightmost/leftmost-one-pos functions.

0001 adds asserts to the existing coding.
0002 adds MSVC support. Tests pass on CI, but it's of course possible that
there is some bug that prevents hitting the fastpath. I've mostly used
the platform specific types, so some further cleanup might be needed.
0003 tries one way to reduce the duplication that arose in 0002. Maybe
there is a better way.

-- 
John Naylor
EDB: http://www.enterprisedb.com
From 9390a74f1712d58c88ac513afff8d878071c040c Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 24 Jan 2023 15:57:53 +0700
Subject: [PATCH v3 2/3] Add MSVC support for bitscan reverse/forward

---
 src/include/port/pg_bitutils.h | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 9a53f53d11..ce04f4ffad 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -13,6 +13,10 @@
 #ifndef PG_BITUTILS_H
 #define PG_BITUTILS_H
 
+#ifdef _MSC_VER
+#include <intrin.h>
+#endif
+
 extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
 extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
 extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
@@ -27,6 +31,9 @@ pg_leftmost_one_pos32(uint32 word)
 {
 #ifdef HAVE__BUILTIN_CLZ
 	int			bitscan_result;
+#elif defined(_MSC_VER)
+	unsigned long bitscan_result;
+	unsigned char non_zero; // XXX can this be bool?
 #endif
 
 #if !defined(HAVE__BUILTIN_CLZ) || defined(USE_ASSERT_CHECKING)
@@ -45,6 +52,11 @@ pg_leftmost_one_pos32(uint32 word)
 	bitscan_result = 31 - __builtin_clz(word);
 	Assert(bitscan_result == result);
 	return bitscan_result;
+#elif defined(_MSC_VER)
+	non_zero = _BitScanReverse(&bitscan_result, word);
+	Assert(non_zero);
+	Assert(bitscan_result == result);
+	return bitscan_result;
 #else
 	return result;
 #endif			/* HAVE__BUILTIN_CLZ */
@@ -59,6 +71,9 @@ pg_leftmost_one_pos64(uint64 word)
 {
 #ifdef HAVE__BUILTIN_CLZ
 	int			bitscan_result;
+#elif defined(_MSC_VER)
+	unsigned long bitscan_result;
+	unsigned char non_zero; // XXX can this be bool?
 #endif
 
 #if !defined(HAVE__BUILTIN_CLZ) || defined(USE_ASSERT_CHECKING)
@@ -83,6 +98,11 @@ pg_leftmost_one_pos64(uint64 word)
 #endif			/* HAVE_LONG_... */
 	Assert(bitscan_result == result);
 	return bitscan_result;
+#elif defined(_MSC_VER)
+	non_zero = _BitScanReverse64(&bitscan_result, word);
+	Assert(non_zero);
+	Assert(bitscan_result == result);
+	return bitscan_result;
 #else
 	return result;
 #endif			/* HAVE__BUILTIN_CLZ */
@@ -99,6 +119,10 @@ pg_rightmost_one_pos32(uint32 word)
 #ifdef HAVE__BUILTIN_CTZ
 	const uint32 orig_word = word;
 	int			bitscan_result;
+#elif defined(_MSC_VER)
+	const unsigned long orig_word = word;
+	unsigned long bitscan_result;
+	unsigned char non_zero; // XXX can this be bool?
 #endif
 
 #if !defined(HAVE__BUILTIN_CTZ) || defined(USE_ASSERT_CHECKING)
@@ -118,6 +142,11 @@ pg_rightmost_one_pos32(uint32 word)
 	bitscan_result = __builtin_ctz(orig_word);
 	Assert(bitscan_result == result);
 	return bitscan_result;
+#elif defined(_MSC_VER)
+	non_zero = _BitScanForward(&bitscan_result, orig_word);
+	Assert(non_zero);
+	Assert(bitscan_result == result);
+	return bitscan_result;
 #else
 	return result;
 #endif			/* HAVE__BUILTIN_CTZ */
@@ -133,6 +162,10 @@ pg_rightmost_one_pos64(uint64 word)
 #ifdef HAVE__BUILTIN_CTZ
 	const uint64 orig_word = word;
 	int			bitscan_result;
+#elif defined(_MSC_VER)
+	const unsigned __int64 orig_word = word;
+	unsigned long bitscan_result;
+	unsigned char non_zero; // XXX can this be bool?
 #endif
 
 #if !defined(HAVE__BUILTIN_CTZ) || defined(USE_ASSERT_CHECKING)
@@ -158,6 +191,11 @@ pg_rightmost_one_pos64(uint64 word)
 #endif			/* HAVE_LONG_... */
 	Assert(bitscan_result == result);
 	return bitscan_result;
+#elif defined(_MSC_VER)
+	non_zero = _BitScanForward64(&bitscan_result, orig_word);
+	Assert(non_zero);
+	Assert(bitscan_result == result);
+	return bitscan_result;
 #else
 	return result;
 #endif			/* HAVE__BUILTIN_CTZ */
-- 
2.39.0

From 88b98a3fe8d9b5f9a10a839e65b34c43c03e33b2 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 24 Jan 2023 14:39:26 +0700
Subject: [PATCH v3 1/3] Add asserts to verify bitscan intrinsics

---
 src/include/port/pg_bitutils.h | 86 --
 1 file changed, 60 insertions(+), 26 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..9a53f53d11 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -26,10 +26,11 @@ static inline int
 pg_leftmost_one_pos32(uint32 word)
 {
 #ifdef HAVE__BUILTIN_CLZ
-	Assert(word != 0);
+	int			bitscan_result;
+#endif
 
-	return 31 - __builtin_clz(word);
-#else
+#if !defined(HAVE__BUILTIN_CLZ) || defined(USE_ASSERT_CHECKING)
+	int			result;
 	int			shift = 32 - 8;
 
 	Assert(word != 0);
@@ -37,7 +38,15 @@ pg_leftmost_one_pos32(uint32 word)
 	while ((word

Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread John Naylor
On Thu, Jan 26, 2023 at 10:11 AM Andres Freund  wrote:

> I am. Just not every tradeoff. I just don't see any useful tradeoffs
purely
> based on the relation size.

I expressed reservations about relation size six weeks ago:

On Wed, Dec 14, 2022 at 12:16 AM Peter Geoghegan  wrote:
>
> On Tue, Dec 13, 2022 at 12:29 AM John Naylor
>  wrote:
> > If the number of unfrozen heap pages is the thing we care about,
perhaps that, and not the total size of the table, should be the parameter
that drives freezing strategy?
>
> That's not the only thing we care about, though.

That was followed by several paragraphs that never got around to explaining
why table size should drive freezing strategy. Review is a feedback
mechanism alerting the patch author to possible problems. Listening to
feedback is like vacuum, in a way: If it hurts, you're not doing it enough.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Considering additional sort specialisation functions for PG16

2023-01-26 Thread John Naylor
On Thu, Jan 26, 2023 at 6:14 PM Pavel Borisov 
wrote:

> > - Only if there is more than one sort key, qsort the null array.
Ideally at some point we would have a method of ignoring the first sortkey
(this is an existing opportunity that applies elsewhere as well).

> Should we need to sort by the second sort key provided the first one
> in NULL by standard or by some part of the code relying on this? I

I'm not sure I quite understand the question.

If there is more than one sort key, and the specialized comparison on the
first key gives a definitive zero result, it falls back to comparing all
keys from the full tuple. (The sorttuple struct only contains the first
sortkey, which might actually be an abbreviated key.) A possible
optimization, relevant here and also elsewhere, is to compare only using
keys starting from key2. But note: if the first key is abbreviated, a zero
result is not definitive, and we must check the first key's full value from
the tuple.
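
A self-contained schematic of that control flow (toy types standing in for
tuplesort's actual structures):

#include <stdint.h>
#include <string.h>

typedef struct
{
	uint32_t	abbrev;		/* abbreviated first key */
	const char *full;		/* stand-in for the full tuple */
} ToyTuple;

static int
compare_toytup(const ToyTuple *a, const ToyTuple *b)
{
	/* cheap comparison on the abbreviated first key */
	if (a->abbrev != b->abbrev)
		return (a->abbrev > b->abbrev) ? 1 : -1;

	/* zero is not definitive: distinct values can share an abbreviation,
	 * so fall back to the authoritative full comparison */
	return strcmp(a->full, b->full);
}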

> suppose NULL values in the first sort key mean attribute values are
> undefined and there is no preferred order between these tuples, even
> if their second sort keys are different.

There is in fact a preferred order between these tuples -- the second key
is the tie breaker in this case.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-01-25 Thread John Naylor
On Wed, Jan 25, 2023 at 8:42 AM Masahiko Sawada 
wrote:
>
> On Mon, Jan 23, 2023 at 8:20 PM John Naylor
>  wrote:
> >
> > On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada 
wrote:
> > >
> > > On Mon, Jan 16, 2023 at 2:02 PM John Naylor
> > >  wrote:
> >
> > In v21, all of your v20 improvements to the radix tree template and
test have been squashed into 0003, with one exception: v20-0010 (recursive
freeing of shared mem), which I've attached separately (for flexibility) as
v21-0006. I believe one of your earlier patches had a new DSA function for
freeing memory more quickly -- was there a problem with that approach? I
don't recall where that discussion went.
>
> Hmm, I don't remember I proposed such a patch, either.

I went looking, and it turns out I remembered wrong, sorry.

> One idea to address it would be that we pass a shared memory to
> RT_CREATE() and we create a DSA area dedicated to the radix tree in
> place. We should return the created DSA area along with the radix tree
> so that the caller can use it (e.g., for dsa_get_handle(), dsa_pin(),
> and dsa_pin_mapping() etc). In RT_FREE(), we just detach from the DSA
> area. A downside of this idea would be that one DSA area only for a
> radix tree is always required.
>
> Another idea would be that we allocate a big enough DSA area and
> quarry small memory for nodes from there. But it would need to
> introduce another complexity so I prefer to avoid it.
>
> FYI the current design is inspired by dshash.c. In dshash_destory(),
> we dsa_free() each elements allocated by dshash.c

Okay, thanks for the info.

> > 0007 makes the value type configurable. Some debug functionality still
assumes integer type, but I think the rest is agnostic.
>
> radixtree_search_impl.h still assumes that the value type is an
> integer type as follows:
>
> #ifdef RT_NODE_LEVEL_LEAF
> RT_VALUE_TYPE   value = 0;
>
> Assert(RT_NODE_IS_LEAF(node));
> #else
>
> Also, I think if we make the value type configurable, it's better to
> pass the pointer of the value to RT_SET() instead of copying the
> values since the value size could be large.

Thanks, I will remove the assignment and look into pass-by-reference.
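
That is, something like this hypothetical signature, instead of taking
RT_VALUE_TYPE by value:

	RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);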

> Oops, the fix is missed in the patch for some reason. I'll fix it.
>
> > There is also this, in the template, which I'm not sure has been
addressed:
> >
> >  * XXX: Currently we allow only one process to do iteration. Therefore,
rt_node_iter
> >  * has the local pointers to nodes, rather than RT_PTR_ALLOC.
> >  * We need either a safeguard to disallow other processes to begin the
iteration
> >  * while one process is doing or to allow multiple processes to do the
iteration.
>
> It's not addressed yet. I think adding a safeguard is better for the
> first version. A simple solution is to add a flag, say iter_active, to
> allow only one process to enable the iteration. What do you think?

I don't quite have enough info to offer an opinion, but this sounds like a
different form of locking. I'm sure it's come up before, but could you
describe why iteration is different from other operations, regarding
concurrency?

> > Would it be worth it (or possible) to calculate constants based on
compile-time block size? And/or have a fallback for other table AMs? Since
this file is in access/common, the intention is to allow general-purpose, I
imagine.
>
> I think we can pass the maximum offset numbers to tidstore_create()
> and calculate these values.

That would work easily for vacuumlazy.c, since it's in the "heap" subdir so
we know the max possible offset. I haven't looked at vacuumparallel.c, but
I can tell it is not in a heap-specific directory, so I don't know how easy
that would be to pass along the right value.

> > About shared memory: I have some mild reservations about the naming of
the "control object", which may be in shared memory. Is that an established
term? (If so, disregard the rest): It seems backwards -- the thing in
shared memory is the actual tree itself. The thing in backend-local memory
has the "handle", and that's how we control the tree. I don't have a better
naming scheme, though, and might not be that important. (Added a WIP
comment)
>
> That seems a valid concern. I borrowed the "control object" from
> dshash.c but it supports only shared cases. The fact that the radix
> tree supports both local and shared seems to introduce this confusion.
> I came up with other names such as RT_RADIX_TREE_CORE or
> RT_RADIX_TREE_ROOT  but not sure these are better than the current
> one.

Okay, if dshash uses it, we have some precedent.

> > Now might be a good time to look at earlier XXX comments and come up
with a plan to address them.
>
> Agreed.
>
> Other

Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-01-25 Thread John Naylor
On Tue, Jan 24, 2023 at 1:17 PM Dilip Kumar  wrote:
>
> On Mon, Jan 23, 2023 at 6:00 PM John Naylor
>  wrote:
> >
> > Attached is a rebase to fix conflicts from recent commits.
>
> I have reviewed v22-0022* patch and I have some comments.
>
> 1.
> >It also changes to the column names max_dead_tuples and num_dead_tuples
and to
> >show the progress information in bytes.
>
> I think this statement needs to be rephrased.

Could you be more specific?

> 3.
>
> We are changing the min value of 'maintenance_work_mem' to 2MB. Should
> we do the same for the 'autovacuum_work_mem'?

Yes, we should change that, too. We've discussed previously that
autovacuum_work_mem is possibly rendered unnecessary by this work, but we
agreed that that should be a separate thread. And needs additional testing
to verify.

I agree with your other comments.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Considering additional sort specialisation functions for PG16

2023-01-26 Thread John Naylor
On Tue, Aug 23, 2022 at 1:13 PM John Naylor 
wrote:
>
> On Tue, Aug 23, 2022 at 11:24 AM David Rowley 
wrote:
> >
> > On Tue, 23 Aug 2022 at 15:22, John Naylor 
wrote:
> > > Did you happen to see
> > >
> > >
https://www.postgresql.org/message-id/CAFBsxsFhq8VUSkUL5YO17cFXbCPwtbbxBu%2Bd9MFrrsssfDXm3Q%40mail.gmail.com
> >
> > I missed that.  It looks like a much more promising idea than what I
> > came up with. I've not looked at your code yet, but I'm interested and
> > will aim to look soon.
>
> Note that I haven't actually implemented this idea yet, just tried to
> model the effects by lobotomizing the current comparators. I think
> it's worth pursuing and will try to come back to it this cycle, but if
> you or anyone else wants to try, that's fine of course.

Coming back to this, I wanted to sketch out this idea in a bit more detail.

Have two memtuple arrays, one for first sortkey null and one for first
sortkey non-null:
- Qsort the non-null array, including whatever specialization is available.
Existing specialized comparators could ignore nulls (and their ordering),
taking less space in the binary.
- Only if there is more than one sort key, qsort the null array. Ideally at
some point we would have a method of ignoring the first sortkey (this is an
existing opportunity that applies elsewhere as well).
- To handle two arrays, grow_memtuples() would need some adjustment, as
would any callers that read the final result of an in-memory sort -- they
would need to retrieve the tuples starting with the appropriate array
depending on NULLS FIRST/LAST behavior.

I believe external merges wouldn't have to do anything different, since
when writing out the tapes, we read from the arrays in the right order.

(One could extend this idea further and have two pools of tapes for null
and non-null first sortkey, that are merged separately, in the right order.
That sounds like quite a bit more complexity than it's worth, however.)
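
To make the two-array shape concrete, here is a standalone sketch (toy
types; the real code would operate on tuplesort's SortTuple arrays):

#include <stdlib.h>
#include <stdbool.h>

typedef struct
{
	int		key1;		/* first sort key; meaningless if key1_null */
	bool	key1_null;
	int		key2;		/* tiebreaker key */
} ToyTuple;

/* comparator for the non-null array: no null checks needed at all */
static int
cmp_nonnull(const void *a, const void *b)
{
	const ToyTuple *ta = a, *tb = b;

	if (ta->key1 != tb->key1)
		return (ta->key1 > tb->key1) ? 1 : -1;
	return (ta->key2 > tb->key2) - (ta->key2 < tb->key2);
}

/* comparator for the null array: the first key is ignored entirely;
 * with only one sort key this array would need no sorting at all */
static int
cmp_firstkey_null(const void *a, const void *b)
{
	const ToyTuple *ta = a, *tb = b;

	return (ta->key2 > tb->key2) - (ta->key2 < tb->key2);
}

static void
sort_both_arrays(ToyTuple *nonnulls, size_t n_nonnull,
				 ToyTuple *nulls, size_t n_null)
{
	qsort(nonnulls, n_nonnull, sizeof(ToyTuple), cmp_nonnull);
	qsort(nulls, n_null, sizeof(ToyTuple), cmp_firstkey_null);
}

A reader would then return tuples from the null array before or after the
non-null array, depending on NULLS FIRST/LAST.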

--
John Naylor
EDB: http://www.enterprisedb.com

