Re: Doc limitation update proposal: include out-of-line OID usage per TOAST-ed columns

2024-08-20 Thread John Naylor
On Mon, Aug 12, 2024 at 11:01 AM John Naylor  wrote:
>
> Extra explicitness might be good. "Response time" seems like a
> networking concept, so possibly ", in turn slowing down INSERT/UPDATE
> statements." I'm inclined to commit that way in a couple days, barring
> further comments.

This is done.




Re: ECPG cleanup and fix for clang compile-time problem

2024-08-12 Thread John Naylor
On Fri, Jul 5, 2024 at 10:59 PM Tom Lane  wrote:
>
> The cfbot noticed that this patchset had a conflict with d35cd0619,
> so here's a rebase.  It's just a rebase of v1, no other changes.

Hi Tom,

I started looking at the earlier cleanup patches.

0001 seems straightforward. Note: It doesn't apply cleanly anymore,
but does with 'patch'.
0002 LGTM, just a couple minor comments:

--- a/src/interfaces/ecpg/preproc/parse.pl
+++ b/src/interfaces/ecpg/preproc/parse.pl
@@ -1,7 +1,13 @@
 #!/usr/bin/perl
 # src/interfaces/ecpg/preproc/parse.pl
 # parser generator for ecpg version 2
-# call with backend parser as stdin
+#
+# See README.parser for some explanation of what this does.

Doesn't this patch set put us up to version 3? ;-) Looking at the
history, a very long time ago a separate "parse2.pl" was committed for
some reason, but it was reconciled some time later. This patch doesn't
need to get rid of that meaningless version number, but I find it
distracting.

+ # There may be multiple ECPG: lines and then multiple lines of code.
+ # Each block of code needs to be added to all prior ECPG records.

This took me a while to parse at first. Some places in this script put
quotes around words-with-colons, and that seems good for readability.
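
Something like this, say:

+ # There may be multiple "ECPG:" lines and then multiple lines of code.
+ # Each block of code needs to be added to all prior "ECPG:" records.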

0003:

Looks a heck of a lot better, but I didn't try to understand
everything in the script, either before or after.

+ # Emit the target part of the rule.
+ # Note: the leading space is just to match
+ # the old, rather weird output logic.
+ $tstr = ' ' . $non_term_id . ':';
+ add_to_buffer('rules', $tstr);

Removing the leading space (or making it two spaces) has no effect on
the output -- does that get normalized elsewhere?

That's all I have for now.




Re: Linux likely() unlikely() for PostgreSQL

2024-08-12 Thread John Naylor
On Sun, Jun 30, 2024 at 10:24 PM Tom Lane  wrote:

> In general, we have a policy of using likely/unlikely very sparingly,
> and only in demonstrably hot code paths.  This hook call certainly
> doesn't qualify as hot.

I just remembered an article I read a while back by a compiler
engineer, who said that compilers have heuristics that treat a NULL
pointer test as "unlikely", since the most common reason to test is so
we can actually dereference the pointer and do something with it; in
that case a NULL pointer is an exceptional condition. He also said this
isn't documented, so you can only see it by looking at the compiler
source code. Instead of taking the time to verify that, I created some
toy examples, and it seems to be true:

https://godbolt.org/z/dv6M5ecY5

This works all the way back to clang 3.1 and (at least, because
Compiler Explorer doesn't go back any further) gcc 3.4.6, so it's an
ancient heuristic. If we can rely on this, we could remove the handful
of statements of the form "if (unlikely(foo_ptr == NULL))" and
document it in c.h. It may be iffy to rely on an undocumented
heuristic, but it also seems silly to have a use-sparingly policy if
some of our current uses have zero effect on the emitted binary (I
haven't checked yet).
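
For reference, the toy examples were along these lines (a from-memory
sketch; the godbolt link above has the real thing):

extern void handle_null(void);

int
get_value(int *p)
{
	if (p == NULL)
	{
		/* with no hint at all, compilers lay this out as the cold path */
		handle_null();
		return 0;
	}

	/* the dereference stays on the fall-through (hot) path */
	return *p;
}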




Re: ECPG cleanup and fix for clang compile-time problem

2024-08-11 Thread John Naylor
On Fri, Apr 19, 2024 at 10:21 PM Tom Lane  wrote:
>
> One other bit of randomness that I noticed: ecpg's parse.pl has
> this undocumented bit of logic:
>
> if ($a eq 'IDENT' && $prior eq '%nonassoc')
> {
>
> # add more tokens to the list
> $str = $str . "\n%nonassoc CSTRING";
> }

> preproc.c has
>
>  %nonassoc UNBOUNDED NESTED
>  %nonassoc IDENT
> %nonassoc CSTRING PARTITION RANGE ROWS GROUPS PRECEDING FOLLOWING CUBE ROLLUP
>  SET KEYS OBJECT_P SCALAR VALUE_P WITH WITHOUT PATH
>  %left Op OPERATOR
>
> If you don't find that scary as heck, I suggest reading the very long
> comment just in front of the cited lines of gram.y.  The argument why
> assigning these keywords a precedence at all is OK depends heavily
> on it being the same precedence as IDENT, yet here's ECPG randomly
> breaking that.

Before 7f380c59f (Reduce size of backend scanner's tables), it was
even more spread out:

# add two more tokens to the list
$str = $str . "\n%nonassoc CSTRING\n%nonassoc UIDENT";

...giving:
 %nonassoc UNBOUNDED
 %nonassoc IDENT
%nonassoc CSTRING
%nonassoc UIDENT GENERATED NULL_P PARTITION RANGE ROWS GROUPS
PRECEDING FOLLOWING CUBE ROLLUP

> We seem to have avoided problems though, because if I fix things
> by manually editing preproc.y to re-join the lines:
>
>  %nonassoc IDENT CSTRING PARTITION RANGE ROWS GROUPS PRECEDING FOLLOWING CUBE 
> ROLLUP
>
> the generated preproc.c doesn't change at all.

On a whim I tried rejoining on v12 and the .c doesn't change there, either.

> Actually, I can
> take CSTRING out of this list altogether and it still doesn't
> change the results ... although looking at how CSTRING is used,
> it looks safer to give it the same precedence as IDENT.

Doing that on v12 on top of rejoining results in a shift-reduce
conflict, so I imagine that's why it's there. Maybe it's outdated, but
this backs up your inclination that it's safer to keep.




Re: Doc limitation update proposal: include out-of-line OID usage per TOAST-ed columns

2024-08-11 Thread John Naylor
On Mon, May 20, 2024 at 5:43 PM Jakub Wartak
 wrote:
>
> On Tue, May 14, 2024 at 8:19 PM Robert Haas  wrote:
> >
> > I looked at your version and wrote something that is shorter and
> > doesn't touch any existing text. Here it is.
>
> Hi Robert, you are a real tactician here - thanks for whatever
> references the original problem! :)

I like this text as well.

> Maybe just slight hint nearby
> expensive (to me indicating a just a CPU problem?):
>
> finding an OID that is still free can become expensive ->
> finding an OID that is still free can become expensive, thus
> significantly increasing INSERT/UPDATE response time.
>
> ? (this potentially makes it easier in future to pinpoint the user's
> problem to this exact limitation; but feel free to ignore that too
> as I'm not attached to any of those versions)

Extra explicitness might be good. "Response time" seems like a
networking concept, so possibly ", in turn slowing down INSERT/UPDATE
statements." I'm inclined to commit that way in a couple days, barring
further comments.

PS: Sorry for the delay in looking at the latest messages




Re: 回复: An implementation of multi-key sort

2024-08-10 Thread John Naylor
On Fri, Jul 26, 2024 at 6:18 PM Yao Wang  wrote:

> 2. Should we update optimizer to take in account statistic
> (pg_statistic->stadistinct) and choose mksort/qsort accordingly?

> According to the result, the choosing mechanism in optimizer
> almost eliminated all regressions. Please note that even when mksort is
> disabled (i.e. qsort was actually performed twice), there are still
> "regressions", which are usually less than 5% and should be attributed
> to the error range.

This is kind of an understatement. To be clear, I see mostly ~1%,
which I take to be at the noise level. If there were commonly a 5%
regression, that'd be cause for concern. And if the actual noise were
5%, the testing methodology would not be strict enough.

> I am still hesitating about putting the code to final version because
> there are a number of concerns:
>
> a. for sort keys without a real table, stadistinct is unavailable.
> b. stadistinct may not be accurate as mentioned.
> c. the formula I used may not be able to cover all possible cases.
>
> On the other hand, the worst result may be just 5% regression. So the
> side effects may not be so serious?

FWIW, I share these concerns as well. I don't believe this proves a
bound on the worst result, since the test is using ideal well-behaved
data with no filter. The worst result is when the estimates are wrong,
so real world use could easily be back to 10-20% regression, which is
not acceptable. I believe this is what Robert and Tomas were warning
about.

It'd be good to understand what causes the differences, whether better
or worse. Some initial thoughts:

- If the first key is unique, then I would hope multikey would be no
different from a standard sort.

- If the first key commonly ties, and the following keys tie also,
those later comparisons are a waste. In that case, it's not hard to
imagine that partitioning on only one key at a time might be faster.

- If the first key commonly ties, but the second key is closer to
unique, I'm not sure which way is better. Have we tested this case?
(See the sketch after this list.)

- If we actually have only one sort key, a multi-key sort with a
single depth should ideally show no significant performance difference
from a standard sort. That seems like a good sanity check. Has this
been tried?

- For the biggest benefit/regression cases, it'd be good to know what
changed at the hardware level. # comparisons? # swaps? # cache
misses? # branch mispredicts?
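
For the "first key commonly ties, second key closer to unique" case
above, a test could be as simple as this (a hypothetical sketch, not
taken from the posted scripts):

drop table if exists t;
create table t (a int, b int);
-- first key: only 10 distinct values; second key: unique
insert into t select i % 10, i from generate_series(1, 1_000_000) i;
analyze t;
-- time this with mksort enabled and disabled:
select * from t order by a, b offset 1_000_000;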

Looking at the code a bit, I have some questions and see some
architectural issues. My thoughts below have a bunch of brainstorms so
should be taken with a large grain of salt:

1. The new single-use abstraction for the btree tid tiebreak seems
awkward. In standard sort, all that knowledge was confined to btree's
full comparetup function. Now it's spread out, and general code has to
worry about "duplicated" tuples. The passed-down isNull seems only
needed for this? (Quick idea: It seems we could pass a start-depth and
max-depth to some "comparetup_mk", which would be very similar to
current "comparetup + comparetup_tiebreak". The btree function would
have to know that a depth greater than the last sortkey is a signal to
do the tid comparison. And if start-depth and max-depth are the same,
that means comparing at a single depth. That might simplify the code
elsewhere because there is no need for a separate getDatum function.)

2. I don't understand why the pre-ordered check sometimes tolerates
duplicates and sometimes doesn't.

3. "tiebreak" paths and terminology is already somewhat awkward for
standard sort (my fault), but seems really out of place in multikey
sort. It already has a general concept of "depth", so that should be
used in fullest generality.

3A. Random thought: I wonder if the shortcut (and abbreviated?)
comparisons could be thought of as having their own depth < 0. If it's
worth it to postpone later keys, maybe it's worth it to postpone the
full comparison for the first key as well? I could be wrong, though.

3B. Side note: I've long wanted to try separating all NULL first keys
to a separate array, so we can remove all those branches for NULL
ordering and reduce SortTuple to 16 bytes. That might be easier to
code if we could simply specify "start_depth = 1" at the top level for
that.

4. Trying to stuff all our optimized comparators in the same path was
a heroic effort, but it's quite messy and seems pretty bad for the
instruction cache and branch predictor. I don't think we need yet
another template, but some of these branches should be taken as we
recurse into a partition to keep them out of the hot path.

5.
+ /*
+ * When the count < 16 and no need to handle duplicated tuples, use
+ * bubble sort.
+ *
+ * Use 16 instead of 7 which is used in standard qsort, because mk qsort
+ * need more cost to maintain more complex state.

Note: 7 isn't ideal for standard sort either, and should probably be
at least 10 (at least for single-key sorts). If one implementation's
parameter is more ideal than the other's, it obscures what the
comparison is really measuring.

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-08-10 Thread John Naylor
On Tue, Aug 6, 2024 at 9:58 PM John Naylor  wrote:
>
> It also turns out that to support 64kB memory settings, we actually
> wouldn't need to change radix tree to lazily create memory contexts --
> at least currently, SlabCreate doesn't allocate a keeper block, so a
> newly created slab context reports 0 for "mem_allocated". So I'm
> inclined to go ahead and change the minimum m_w_m on v17 and master to
> 64kB. It's the quickest and (I think) most future-proof way to make
> this test work. Any objections?

This is done. I also changed autovacuum_work_mem just for the sake of
consistency. I did some quick math and found that there shouldn't be a
difference between 32- and 64-bit platforms for when they exceed 64kB
in the tid store. That's because exceeding the limit is caused by
allocating the first block of one of the slab contexts. That
independence may not be stable, so I'm thinking of hard-coding the
block sizes in master only, but I've left that for another time.




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-08-06 Thread John Naylor
On Sat, Jul 27, 2024 at 4:04 AM Melanie Plageman
 wrote:
>
> On Wed, Jul 24, 2024 at 8:19 AM John Naylor  wrote:

> I ran my test with your patch (on my 64-bit system, non-assert build)
> and the result is great:
>
> master with my test (slightly modified to now use DELETE instead of
> UPDATE as mentioned upthread)
> 3.09s
>
> master with your patch applied, MWM set to 64kB and 9000 rows instead of 
> 80
> 1.06s

Glad to hear it!

> I took a look at the patch, but I can't say I know enough about the
> memory allocation subsystems and how TIDStore works to meaningfully
> review it -- nor enough about DSM to comment about the interactions.

I tried using parallel vacuum with 64kB and it succeeded, but needed
to perform an index scan for every heap page pruned. It's not hard to
imagine some code moving around so that it doesn't work anymore, but
since this is for testing only, it seems a warning comment is enough.

> I suspect 256kB would also be fast enough to avoid my test timing out
> on the buildfarm, but it is appealing to have a minimum for
> maintenance_work_mem that is the same as work_mem.

Agreed on both counts:

I came up with a simple ctid expression to make the bitmap arrays larger:

delete from test where ctid::text like '%,2__)';

(That pattern matches offset numbers 200-299, so only tuples near the
end of each page are deleted, which forces wider offset bitmaps.)

With that, it still takes between 250k and 300k tuples to force a
second index scan with 256kB m_w_m, default fillfactor, and without
asserts. (It may need a few more pages for 32-bit, but not many more.)
The table is around 1300 pages, whereas on v16 it's about 900. But with
fewer tuples deleted, the WAL for deletes should be lower. So it might
be comparable to v16's test.

It also turns out that to support 64kB memory settings, we actually
wouldn't need to change radix tree to lazily create memory contexts --
at least currently, SlabCreate doesn't allocate a keeper block, so a
newly created slab context reports 0 for "mem_allocated". So I'm
inclined to go ahead and change the minimum m_w_m on v17 and master to
64kB. It's the quickest and (I think) most future-proof way to make
this test work. Any objections?




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-27 Thread John Naylor
On Saturday, July 27, 2024, Melanie Plageman 
wrote:
>
>
> Yes, the only thing that is important is having two rounds of index
> vacuuming and having one tuple with a value matching my cursor
> condition before the first index vacuum and one after. What do you
> mean update only the last few tuples though?
>

I meant we could update tuples with the highest offsets on each page. That
would then lead to longer arrays of bitmaps to store offsets during vacuum.
Lowering the minimum memory setting is easier to code and reason about,
however.


Re: Make tuple deformation faster

2024-07-24 Thread John Naylor
On Mon, Jul 1, 2024 at 5:07 PM David Rowley  wrote:

> cycles idle
>         8505168  stalled-cycles-backend:u  #  0.02% backend cycles idle
>    165442142326  instructions:u            #  3.35  insn per cycle
>                                            #  0.00  stalled cycles per insn
>     39409877343  branches:u                #  3.945 G/sec
>       146350275  branch-misses:u           #  0.37% of all branches

> patched

> cycles idle
>        24259785  stalled-cycles-backend:u  #  0.05% backend cycles idle
>    213688149862  instructions:u            #  4.29  insn per cycle
>                                            #  0.00  stalled cycles per insn
>     44147675129  branches:u                #  4.420 G/sec
>        14282567  branch-misses:u           #  0.03% of all branches

> You can see the branch predictor has done a *much* better job in the
> patched code vs master with about 10x fewer misses.  This should have

Nice!

> helped contribute to the "insn per cycle" increase.  4.29 is quite
> good for postgres. I often see that around 0.5. According to [1]
> (relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
> micro-op cache". I'm unsure how micro-ops translate to "insn per
> cycle" that's shown in perf stat. I thought 4-5 was about the maximum
> pipeline size from today's era of CPUs.

"ins per cycle" is micro-ops retired (i.e. excludes those executed
speculatively on a mispredicted branch).

That article mentions that 6 micro-ops per cycle can enter the backend
from the frontend, but that can happen only with internally cached
ops, since only 4 instructions per cycle can be decoded. In specific
cases, CPUs can fuse multiple front-end instructions into a single
macro-op, which I think means a pair of micro-ops that can "travel
together" as one. The authors concluded further down that "Zen 4’s
reorder buffer is also special, because each entry can hold up to 4
NOPs. Pairs of NOPs are likely fused by the decoders, and pairs of
fused NOPs are fused again at the rename stage."




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-24 Thread John Naylor
On Wed, Jul 24, 2024 at 2:42 PM John Naylor  wrote:
> As for lowering the limit, we've experimented with 256kB here:
>
> https://www.postgresql.org/message-id/canwcazzutvz3lsypauyqvzcezxz7qe+9ntnhgyzdtwxpul+...@mail.gmail.com
>
> As I mention there, going lower than that would need a small amount of
> reorganization in the radix tree. Not difficult -- the thing I'm
> concerned about is that we'd likely need to document a separate
> minimum for DSA, since that behaves strangely with 256kB and might not
> work at all lower than that.

For experimentation, here's a rough patch (really two, squashed
together for now) that allows m_w_m to go down to 64kB.

drop table if exists test;
create table test (a int) with (autovacuum_enabled=false, fillfactor=10);
insert into test (a) select i from generate_series(1,2000) i;
create index on test (a);
update test set a = a + 1;

set maintenance_work_mem = '64kB';
vacuum (verbose) test;

INFO:  vacuuming "john.public.test"
INFO:  finished vacuuming "john.public.test": index scans: 3
pages: 0 removed, 91 remain, 91 scanned (100.00% of total)

The advantage with this is that we don't need to care about
MEMORY_CONTEXT_CHECKING or 32/64 bit-ness, since allocating a single
large node will immediately blow the limit, and that will happen
fairly quickly regardless. I suspect going this low will not work with
dynamic shared memory and if so would need a warning comment.
From 25ccfb9ed812c72194bd585906f9f11e18400422 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Wed, 24 Jul 2024 18:41:51 +0700
Subject: [PATCH v1] WIP: set minimum maintenance_work_mem to 64kB, matching
 work_mem

To allow this to work for vacuum, create radix tree node
contexts lazily.
---
 src/backend/utils/misc/guc_tables.c |  2 +-
 src/include/lib/radixtree.h | 44 +
 2 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f6fcdebb03..853ad8dba1 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2445,7 +2445,7 @@ struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_KB
 		},
 		&maintenance_work_mem,
-		65536, 1024, MAX_KILOBYTES,
+		65536, 64, MAX_KILOBYTES,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 88bf695e3f..bd83b670dd 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -834,12 +834,30 @@ RT_ALLOC_NODE(RT_RADIX_TREE * tree, const uint8 kind, const RT_SIZE_CLASS size_c
 	RT_CHILD_PTR allocnode;
 	RT_NODE*node;
 	size_t		allocsize;
+	RT_SIZE_CLASS_ELEM class_info;
 
-	allocsize = RT_SIZE_CLASS_INFO[size_class].allocsize;
+	class_info = RT_SIZE_CLASS_INFO[size_class];
+	allocsize = class_info.allocsize;
 
 #ifdef RT_SHMEM
 	allocnode.alloc = dsa_allocate(tree->dsa, allocsize);
 #else
+
+	/*
+	 * The context for RT_CLASS_4 should have been created when the tree was
+	 * initialized.
+	 */
+	Assert(tree->node_slabs[RT_CLASS_4] != NULL);
+		if (size_class > RT_CLASS_4 && tree->node_slabs[size_class] == NULL)
+	{
+		size_t		blocksize = RT_SLAB_BLOCK_SIZE(allocsize);
+
+		tree->node_slabs[size_class] = SlabContextCreate(tree->context,
+		 class_info.name,
+		 blocksize,
+		 allocsize);
+	}
+
 	allocnode.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->node_slabs[size_class],
 		allocsize);
 #endif
@@ -1827,6 +1845,9 @@ RT_CREATE(MemoryContext ctx)
 	RT_CHILD_PTR rootnode;
 #ifdef RT_SHMEM
 	dsa_pointer dp;
+#else
+	RT_SIZE_CLASS_ELEM class_info;
+	size_t		node_blocksize;
 #endif
 
 	old_ctx = MemoryContextSwitchTo(ctx);
@@ -1852,17 +1873,18 @@ RT_CREATE(MemoryContext ctx)
 #else
 	tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
 
-	/* Create a slab context for each size class */
-	for (int i = 0; i < RT_NUM_SIZE_CLASSES; i++)
-	{
-		RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
-		size_t		inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.allocsize);
+	/*
+	 * For now only create a slab context for the smallest size class. The
+	 * rest will be created on demand. This allows us to stay within memory
+	 * settings that are smaller than the size of the larger slab blocks.
+	 */
+	class_info = RT_SIZE_CLASS_INFO[RT_CLASS_4];
+	node_blocksize = RT_SLAB_BLOCK_SIZE(class_info.allocsize);
 
-		tree->node_slabs[i] = SlabContextCreate(ctx,
-size_class.name,
-inner_blocksize,
-size_class.allocsize);
-	}
+	tree->node_slabs[RT_CLASS_4] = SlabContextCreate(ctx,
+	 class_info.name,
+	 node_blocksize,
+	 class_info.allocsize);
 
 	/* By default we use the passed context for leaves. */
 	tree->leaf_context = tree->context;
-- 
2.45.2



Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-24 Thread John Naylor
On Wed, Jul 24, 2024 at 5:40 AM Masahiko Sawada  wrote:

> Without MEMORY_CONTEXT_CHECK, if size is 16 bytes, required_size is
> also 16 bytes as it's already 8-byte aligned and Bump_CHUNKHDRSZ is 0.
> On the other hand, with MEMORY_CONTEXT_CHECK, the required_size is
> bumped to 40 bytes as chunk_size is 24 bytes and Bump_CHUNKHDRSZ is 16
> bytes. Therefore, with MEMORY_CONTEXT_CHECK, we allocate more memory
> and use more Bump memory blocks, resulting in filling up TidStore in
> the test cases. We can easily reproduce this test failure with
> PostgreSQL server built without --enable-cassert. It seems that
> copperhead is the sole BF animal that doesn't use --enable-cassert but
> runs recovery-check.

It seems we could force the bitmaps to be larger, and also reduce the
number of updated tuples by updating only the last few tuples (say
5-10) by looking at the ctid's offset. This requires some trickery,
but I believe I've done it in the past by casting to text and
extracting with a regex. (I'm assuming the number of tuples updated is
more important than the number of tuples inserted on a newly created
table.)
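
For example (a hypothetical sketch -- I don't have the old incantation
at hand):

-- update only the tuples near the end of each page
update test set a = a + 1
where (regexp_match(ctid::text, ',(\d+)\)'))[1]::int > 220;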

As for lowering the limit, we've experimented with 256kB here:

https://www.postgresql.org/message-id/canwcazzutvz3lsypauyqvzcezxz7qe+9ntnhgyzdtwxpul+...@mail.gmail.com

As I mention there, going lower than that would need a small amount of
reorganization in the radix tree. Not difficult -- the thing I'm
concerned about is that we'd likely need to document a separate
minimum for DSA, since that behaves strangely with 256kB and might not
work at all lower than that.




Re: suspicious valgrind reports about radixtree/tidstore on arm64

2024-06-21 Thread John Naylor
On Thu, Jun 20, 2024 at 2:58 PM John Naylor  wrote:
> the struct. If that's agreeable I'll commit it that way tomorrow
> unless someone beats me to it.

Pushed. I'll clear the open item once all buildfarm members have reported in.




Re: Speed up collation cache

2024-06-20 Thread John Naylor
On Sat, Jun 15, 2024 at 6:46 AM Jeff Davis  wrote:
> Attached is a patch to use simplehash.h instead, which speeds things up
> enough to make them fairly close (from around 15% slower to around 8%).

+#define SH_HASH_KEY(tb, key)   hash_uint32((uint32) key)

For a static inline hash for speed reasons, we can use murmurhash32
here, which is also inline.
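
i.e. something like this (a sketch; murmurhash32 lives in
common/hashfn.h):

-#define SH_HASH_KEY(tb, key)   hash_uint32((uint32) key)
+#define SH_HASH_KEY(tb, key)   murmurhash32((uint32) key)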




Re: 回复: An implementation of multi-key sort

2024-06-20 Thread John Naylor
On Fri, Jun 14, 2024 at 6:20 PM Yao Wang  wrote:
>
> hi Tomas,
>
> So many thanks for your kind response and detailed report. I am working
> on locating issues based on your report/script and optimizing code, and
> will update later.

Hi,
This is an interesting proof-of-concept!

Given the above, I've set this CF entry to "waiting on author".

Also, I see you've added Heikki as a reviewer. I'm not sure how others
think, but I consider a "reviewer" in the CF app to be someone who has
volunteered to be responsible to help move this patch forward. If
there is a name in the reviewer column, it may discourage others from
doing review. It can also happen that people ping reviewers to ask
"There's been no review for X months -- are you planning on looking at
this?", and it's not great if that message is a surprise.

Note that we prefer not to top-post in emails since it makes our web
archive more difficult to read.

Thanks,
John




Re: suspicious valgrind reports about radixtree/tidstore on arm64

2024-06-20 Thread John Naylor
On Thu, Jun 20, 2024 at 8:12 AM Masahiko Sawada  wrote:

> On Thu, Jun 20, 2024 at 7:54 AM Tom Lane  wrote:
> >

> I agree with radixtree-fix-proposed.patch. Even if we zero more fields
> in the node it would not add noticeable overheads.

+1 in general, although I'm slightly concerned about this part:

> > (The RT_NODE_48 case could be optimized a bit if we cared
> > to swap the order of its slot_idxs[] and isset[] arrays;
> > then the initial zeroing could just go up to slot_idxs[].

- memset(n48->isset, 0, sizeof(n48->isset));
+ memset(n48, 0, offsetof(RT_NODE_48, children));
  memset(n48->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n48->slot_idxs));

I was a bit surprised that neither gcc 14 nor clang 18 can figure out
that they can skip zeroing the slot index array since it's later
filled in with "invalid index", so they actually zero out 272 bytes
before re-initializing 256 of those bytes. It may not matter in
practice, but it's also not free, and trivial to avoid.

> > I don't know if there's any reason why the current order
> > is preferable.)
>
> IIUC there is no particular reason for the current order in RT_NODE_48.

Yeah. I found that simply swapping them enables clang to avoid
double-initialization, but gcc still can't figure it out and must be
told to stop at slot_idxs[]. I'd prefer to do it that way and document
that slot_idxs is purposefully the last member of the fixed part of
the struct. If that's agreeable I'll commit it that way tomorrow
unless someone beats me to it.
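
For illustration, the swapped layout would look roughly like this
(member names from memory, so treat as approximate):

typedef struct RT_NODE_48
{
	RT_NODE		base;

	/* bitmap of used slots */
	bitmapword	isset[RT_BM_IDX(RT_FANOUT_48_MAX)];

	/*
	 * Lookup table for chunks. Purposefully the last member of the
	 * fixed part of the struct, so that initial zeroing can stop here.
	 */
	uint8		slot_idxs[RT_NODE_MAX_SLOTS];

	RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_48;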




Re: PostgreSQL 17 Beta 1 release announcement draft

2024-05-21 Thread John Naylor
On Mon, May 20, 2024 at 8:41 PM Masahiko Sawada  wrote:
>
> On Mon, May 20, 2024 at 8:47 PM Jonathan S. Katz  wrote:
> >
> > On 5/20/24 2:58 AM, John Naylor wrote:
> > > Hi Jon,
> > >
> > > Regarding vacuum "has shown up to a 6x improvement in overall time to
> > > complete its work" -- I believe I've seen reported numbers close to
> > > that only 1) when measuring the index phase in isolation or maybe 2)
> > > the entire vacuum of unlogged tables with one, perfectly-correlated
> > > index (testing has less variance with WAL out of the picture). I
> > > believe tables with many indexes would show a lot of improvement, but
> > > I'm not aware of testing that case specifically. Can you clarify where
> > > 6x came from?
> >
> > Sawada-san showed me the original context, but I can't rapidly find it
> > in the thread. Sawada-san, can you please share the numbers behind this?
> >
>
> I referenced the numbers that I measured during the development[1]
> (test scripts are here[2]). IIRC I used unlogged tables and indexes,
> and these numbers were the entire vacuum execution time including heap
> scanning, index vacuuming and heap vacuuming.

Thanks for confirming.

The wording "has a new internal data structure that reduces memory
usage and has shown up to a 6x improvement in overall time to complete
its work" is specific for runtime, and the memory use is less
specific. Unlogged tables are not the norm, so I'd be cautious of
reporting numbers specifically designed (for testing) to isolate the
thing that changed.

I'm wondering if it might be both more impressive-sounding and more
realistic for the average user experience to reverse that: specific on
memory, and less specific on speed. The best-case memory reduction
occurs for table update patterns that are highly localized, such as
the most recently inserted records, and I'd say those are a lot more
common than the use of unlogged tables.

Maybe something like "has a new internal data structure that reduces
overall time to complete its work and can use up to 20x less memory."

Now, it is true that when dead tuples are sparse and evenly spaced
(e.g. 1 every 100 pages), vacuum can now actually use more memory than
v16. However, the nature of that scenario also means that the number
of TIDs just can't get very big to begin with. In contrast, while the
runtime improvement for normal (logged) tables is likely not
earth-shattering, I believe it will always be at least somewhat
faster, and never slower.




Re: PostgreSQL 17 Beta 1 release announcement draft

2024-05-19 Thread John Naylor
Hi Jon,

Regarding vacuum "has shown up to a 6x improvement in overall time to
complete its work" -- I believe I've seen reported numbers close to
that only 1) when measuring the index phase in isolation or maybe 2)
the entire vacuum of unlogged tables with one, perfectly-correlated
index (testing has less variance with WAL out of the picture). I
believe tables with many indexes would show a lot of improvement, but
I'm not aware of testing that case specifically. Can you clarify where
6x came from?




Re: First draft of PG 17 release notes

2024-05-19 Thread John Naylor
Hi Bruce, thanks for doing this again!

I'm a bit late to this discussion -- there's been a bit of churn in
the vacuum items, and some streams got crossed along the way. I've
attached an attempt to rectify this.
diff --git a/doc/src/sgml/release-17.sgml b/doc/src/sgml/release-17.sgml
index 0e7ff51f4a..612eb100a7 100644
--- a/doc/src/sgml/release-17.sgml
+++ b/doc/src/sgml/release-17.sgml
@@ -492,23 +492,23 @@ Allow BRIN indexes to be created using parallel workers (Tomas Vondra, Matthias
  
 
 
 
 
 
-Allow vacuum to more efficiently remove and freeze tuples (Masahiko Sawada, John Naylor, Melanie Plageman)
+Allow vacuum to more efficiently remove and freeze tuples (Melanie Plageman)
+
+
+
+WAL traffic caused by vacuum is also more compact.
 
 
 
 

Re: Lowering the minimum value for maintenance_work_mem

2024-05-19 Thread John Naylor
On Mon, May 20, 2024 at 11:59 AM Masahiko Sawada  wrote:
>
> On Fri, May 17, 2024 at 5:55 AM Andres Freund  wrote:

> > I think we should consider lowering the minimum setting of
> > maintenance_work_mem to the minimum of work_mem.
>
> +1 for lowering the minimum value of maintenance_work_mem. I've faced
> the same situation.
>
> Even if a shared tidstore is empty, TidStoreMemoryUsage() returns
> 256kB because it's the minimum segment size of DSA, i.e.
> DSA_MIN_SEGMENT_SIZE. So we can lower the minimum maintenance_work_mem
> down to 256kB, from a vacuum perspective.

I've verified 256kB works with both local and shared memory with the
below commands, and 200k records are enough to cause a second round of
index cleanup. I don't think we can go much smaller than that without
changing how we size the blocks in the node slab contexts (or when
they're created), which is currently somewhat arbitrary. That'll need
some thought, at least when we get a use case with work_mem as the
limit.

set maintenance_work_mem = '256kB';

drop table if exists test;
create unlogged table test (a int) with (autovacuum_enabled=false);
insert into test (a) select i from generate_series(1,200_000) i;
create index on test (a);
--create index on test (a); -- toggle for parallel vacuum

delete from test;
vacuum (verbose) test;

Side note: I'm confused why shared memory works at all in this case,
since it failed for 1MB init segments until we allowed callers to
specify a smaller init size. The overhead for DSA seems to be
significant for small sizes, as evidenced from the amount of usable
memory:

shared:
INFO:  finished vacuuming "john.public.test": index scans: 56

local:
INFO:  finished vacuuming "john.public.test": index scans: 2




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-05-01 Thread John Naylor
On Thu, Apr 25, 2024 at 8:36 AM Masahiko Sawada  wrote:
>
> On Mon, Apr 15, 2024 at 6:12 PM John Naylor  wrote:

> > - RT_KEY_GET_SHIFT is not covered for key=0:
> >
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803
> >
> > That should be fairly simple to add to the tests.
>
> There are two paths to call RT_KEY_GET_SHIFT():
>
> 1. RT_SET() -> RT_KEY_GET_SHIFT()
> 2. RT_SET() -> RT_EXTEND_UP() -> RT_KEY_GET_SHIFT()
>
> In both cases, it's called when key > tree->ctl->max_val. Since the
> minimum value of max_val is 255, RT_KEY_GET_SHIFT() is never called
> when key=0.

Ah, right, so it is dead code. Nothing to worry about, but it does
point the way to some simplifications, which I've put together in the
attached.

> > - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered:
> >
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644
> >
> > That should be easy to add.
>
> Agreed. The patch is attached.

LGTM

> > - TidStoreCreate* has some memory clamps that are not covered:
> >
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234
> >
> > Maybe we could experiment with using 1MB for shared, and something
> > smaller for local.
>
> I've confirmed that the local and shared tidstore with small max sizes
> such as 4kB and 1MB worked. Currently the max size is hard-coded in
> test_tidstore.c but if we use work_mem as the max size, we can pass
> different max sizes for local and shared in the test script.

Seems okay, do you want to try that and see how it looks?
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 2896a6efc5..fdac103763 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -217,7 +217,6 @@
 #define RT_NODE_48_GET_CHILD RT_MAKE_NAME(node_48_get_child)
 #define RT_NODE_256_IS_CHUNK_USED RT_MAKE_NAME(node_256_is_chunk_used)
 #define RT_NODE_256_GET_CHILD RT_MAKE_NAME(node_256_get_child)
-#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
 #define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
 #define RT_NODE_SEARCH RT_MAKE_NAME(node_search)
 #define RT_NODE_DELETE RT_MAKE_NAME(node_delete)
@@ -320,9 +319,6 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE * tree);
 /* Mask for extracting a chunk from a key */
 #define RT_CHUNK_MASK ((1 << RT_SPAN) - 1)
 
-/* Maximum shift needed to extract a chunk from a key */
-#define RT_MAX_SHIFT	RT_KEY_GET_SHIFT(UINT64_MAX)
-
 /* Maximum level a tree can reach for a key */
 #define RT_MAX_LEVEL	((sizeof(uint64) * BITS_PER_BYTE) / RT_SPAN)
 
@@ -797,28 +793,15 @@ RT_NODE_256_GET_CHILD(RT_NODE_256 * node, uint8 chunk)
 	return &node->children[chunk];
 }
 
-/*
- * Return the smallest shift that will allowing storing the given key.
- */
-static inline int
-RT_KEY_GET_SHIFT(uint64 key)
-{
-	if (key == 0)
-		return 0;
-	else
-		return (pg_leftmost_one_pos64(key) / RT_SPAN) * RT_SPAN;
-}
-
 /*
  * Return the max value that can be stored in the tree with the given shift.
  */
 static uint64
 RT_SHIFT_GET_MAX_VAL(int shift)
 {
-	if (shift == RT_MAX_SHIFT)
-		return UINT64_MAX;
-	else
-		return (UINT64CONST(1) << (shift + RT_SPAN)) - 1;
+	int max_shift = (sizeof(uint64) - 1) * BITS_PER_BYTE;
+
+	return UINT64_MAX >> (max_shift - shift);
 }
 
 /*
@@ -1574,9 +1557,8 @@ RT_NODE_INSERT(RT_RADIX_TREE * tree, RT_PTR_ALLOC * parent_slot, RT_CHILD_PTR no
  * and move the old node below it.
  */
 static pg_noinline void
-RT_EXTEND_UP(RT_RADIX_TREE * tree, uint64 key)
+RT_EXTEND_UP(RT_RADIX_TREE * tree, uint64 key, int target_shift)
 {
-	int			target_shift = RT_KEY_GET_SHIFT(key);
 	int			shift = tree->ctl->start_shift;
 
 	Assert(shift < target_shift);
@@ -1713,11 +1695,15 @@ RT_SET(RT_RADIX_TREE * tree, uint64 key, RT_VALUE_TYPE * value_p)
 	/* Extend the tree if necessary */
 	if (unlikely(key > tree->ctl->max_val))
 	{
+		int			start_shift;
+
+		/* compute the smallest shift that will allowing storing the key */
+		start_shift = pg_leftmost_one_pos64(key) / RT_SPAN * RT_SPAN;
+
 		if (tree->ctl->num_keys == 0)
 		{
 			RT_CHILD_PTR node;
 			RT_NODE_4  *n4;
-			int			start_shift = RT_KEY_GET_SHIFT(key);
 
 			/*
 			 * With an empty root node, we don't extend the tree upwards,
@@ -1738,7 +1724,7 @@ RT_SET(RT_RADIX_TREE * tree, uint64 key, RT_VALUE_TYPE * value_p)
 			goto have_slot;
 		}
 		else
-			RT_EXTEND_UP(tree, key);
+			RT_EXTEND_UP(tree, key, start_shift);
 	}
 
 	slot = RT_GET_SLOT_RECURSIVE(tree, &tree->ctl->root,
@@ -2937,7 +2923,6 @@ RT_DUMP_NODE(RT_NODE * node)
 #undef RT_SPAN
 #un

Re: cpluspluscheck/headerscheck require build in REL_16_STABLE

2024-04-26 Thread John Naylor
On Wed, Apr 17, 2024 at 7:21 AM John Naylor  wrote:
>
> On Mon, Apr 15, 2024 at 9:20 PM Marina Polyakova
>  wrote:
> > Everything seems to work with this patch, thank you!
>
> Glad to hear it -- I'll push next week when I get back from vacation,
> unless there are objections.

Pushed, thanks for the report!




Re: New committers: Melanie Plageman, Richard Guo

2024-04-26 Thread John Naylor
On Fri, Apr 26, 2024 at 6:54 PM Jonathan S. Katz  wrote:
>
> The Core Team would like to extend our congratulations to Melanie
> Plageman and Richard Guo, who have accepted invitations to become our
> newest PostgreSQL committers.
>
> Please join us in wishing them much success and few reverts!

Congratulations to you both!




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-24 Thread John Naylor
On Thu, Apr 25, 2024 at 9:50 AM Masahiko Sawada  wrote:
>
> > I saw a SIGSEGV there when using tidstore to write a fix for something else.
> > Patch attached.
>
> Great find, thank you for the patch!

+1

(This occurred to me a few days ago, but I was far from my computer.)

With the purge function that Noah proposed, I believe we can also get
rid of the comment at the top of the .sql test file warning of a
maintenance hazard:
..."To avoid adding duplicates,
-- each call to do_set_block_offsets() should use different block
-- numbers."

I found that it doesn't add any measurable time to run the test.

> The fix looks good to me. I think we can improve the regression tests
> for better coverage. In TidStore on a 64-bit machine, we can store 3
> offsets in the header, and those values are embedded in the leaf page.
> With more than 3 offsets, the value size becomes more than 16 bytes,
> requiring a single-value leaf. Therefore, if we add a test with
> array[1,2,3,4,100], we can cover the case of replacing a single-value
> leaf with a new single-value leaf of a different size.

Good idea.

> Now we add 9 pairs
> of do_set_block_offsets() and check_set_block_offsets(). If these are
> annoying, we can remove the cases of array[1] and array[1,2].

Let's keep those -- 32-bit platforms should also exercise this path.
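
That would look roughly like this, following the existing test's
pattern (a sketch -- the int2[] cast is a guess on my part):

select do_set_block_offsets(1, array[1,2,3,4]::int2[]);
-- more than 3 offsets: replaces the above with a single-value leaf
select do_set_block_offsets(1, array[1,2,3,4,100]::int2[]);
select check_set_block_offsets();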




Re: cpluspluscheck/headerscheck require build in REL_16_STABLE

2024-04-16 Thread John Naylor
On Mon, Apr 15, 2024 at 9:20 PM Marina Polyakova
 wrote:
>
> On 2024-04-13 08:40, John Naylor wrote:
> > so that they build fine on PG16 as well. The problem is, not all the
> > required headers are generated when invoking `make headerscheck`. The
> > attached patch brings in some Makefile rules from master to make this
> > work. Does this fix it for you?
>
> Everything seems to work with this patch, thank you!

Glad to hear it -- I'll push next week when I get back from vacation,
unless there are objections.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-15 Thread John Naylor
I took a look at the coverage report from [1] and it seems pretty
good, but there are a couple more tests we could do.

- RT_KEY_GET_SHIFT is not covered for key=0:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803

That should be fairly simple to add to the tests.

- Some paths for single-value leaves are not covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904
https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954
https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606

However, these paths do get regression test coverage on 32-bit
machines. 64-bit builds only have leaves in the TID store, which
doesn't (currently) delete entries, and doesn't instantiate the tree
with the debug option.

- In RT_SET "if (found)" is not covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768

That's because we don't yet have code that replaces an existing value
with a value of a different length.

- RT_FREE_RECURSE isn't well covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768

The TID store test is pretty simple as far as distribution of block
keys, and focuses more on the offset bitmaps. We could try to cover
all branches here, but it would make the test less readable, and it's
kind of the wrong place to do that anyway. test_radixtree.c does have
a commented-out option to use shared memory, but that's for local
testing and won't be reflected in the coverage report. Maybe it's
enough.

- RT_DELETE: "if (key > tree->ctl->max_val)" is not covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644

That should be easy to add.
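
Perhaps something like this in test_radixtree.c (a sketch -- the
local-memory function names and test macros are from memory):

/* exercise RT_DELETE with a key above the tree's current max_val */
rt_radix_tree *tree = rt_create(CurrentMemoryContext);
TestValueType val = 1;

rt_set(tree, 1, &val);	/* with only small keys, max_val stays small */
EXPECT_FALSE(rt_delete(tree, UINT64_MAX));	/* key > max_val */
rt_free(tree);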

- RT_DUMP_NODE is not covered, and never called by default anyway:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2804

It seems we could just leave it alone since it's debug-only, but it's
also a lot of lines. One idea is to use elog with DEBUG5 instead of
commenting out the call sites, but that would cause a lot of noise.

- TidStoreCreate* has some memory clamps that are not covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179
https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234

Maybe we could experiment with using 1MB for shared, and something
smaller for local.

[1] 
https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de




Re: cpluspluscheck/headerscheck require build in REL_16_STABLE

2024-04-12 Thread John Naylor
On Fri, Apr 12, 2024 at 11:51 PM Marina Polyakova
 wrote:
>
> Hello, hackers!
>
> When running cpluspluscheck/headerscheck on REL_16_STABLE [1] I found
> that everything was ok only if it was preceded by a build and make
> maintainer-clean was not used:

I can reproduce this.

> In the other branches everything is fine: these problems begin with
> commits [2] (jsonpath_gram.h) and [3] (gram.h) and in the master branch
> there're no such problems after commit [4]. The attached diff.patch
> fixes this issue for me. (IIUC internal headers generated by Bison are
> usually excluded from such checks so I also excluded gramparse.h and
> jsonpath_internal.h...)

I'm not in favor of this patch because these files build fine on
master, so there is no good reason to exclude them. We should arrange
so that they build fine on PG16 as well. The problem is, not all the
required headers are generated when invoking `make headerscheck`. The
attached patch brings in some Makefile rules from master to make this
work. Does this fix it for you?
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 3c42003175..82cae98a44 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -160,7 +160,7 @@ submake-utils-headers:
 
 .PHONY: generated-headers
 
-generated-headers: $(top_builddir)/src/include/storage/lwlocknames.h submake-catalog-headers submake-nodes-headers submake-utils-headers
+generated-headers: $(top_builddir)/src/include/storage/lwlocknames.h submake-catalog-headers submake-nodes-headers submake-utils-headers parser/gram.h
 
 $(top_builddir)/src/include/storage/lwlocknames.h: storage/lmgr/lwlocknames.h
 	prereqdir=`cd '$(dir $<)' >/dev/null && pwd` && \
diff --git a/src/backend/utils/Makefile b/src/backend/utils/Makefile
index deb901609f..4299735cb6 100644
--- a/src/backend/utils/Makefile
+++ b/src/backend/utils/Makefile
@@ -38,9 +38,12 @@ all: distprep probes.h generated-header-symlinks
 
 distprep: fmgr-stamp errcodes.h
 
-.PHONY: generated-header-symlinks
+.PHONY: generated-header-symlinks submake-adt-headers
 
-generated-header-symlinks: $(top_builddir)/src/include/utils/header-stamp $(top_builddir)/src/include/utils/probes.h
+generated-header-symlinks: $(top_builddir)/src/include/utils/header-stamp $(top_builddir)/src/include/utils/probes.h submake-adt-headers
+
+submake-adt-headers:
+	$(MAKE) -C adt jsonpath_gram.h
 
 $(SUBDIRS:%=%-recursive): fmgr-stamp errcodes.h
 


Re: post-freeze damage control

2024-04-09 Thread John Naylor
On Tue, Apr 9, 2024 at 2:47 AM Robert Haas  wrote:
>
> - I'm slightly worried about the TID store work (ee1b30f12, 30e144287,
> 667e65aac35), perhaps for no reason. Actually, the results seem really
> impressive,

First, thanks for the compliment. I actually suspect that if we had this
years ago, it might never have occurred to anyone to go through the
trouble of adding parallel index cleanup.

In a funny way, it's almost too effective -- the claim is that m_w_m
over 1GB is perfectly usable, but I haven't been able to get anywhere
near that via vacuum (we have indirectly, via bespoke dev code,
but...). I don't have easy access to hardware that can hold a table
big enough to do so -- vacuuming 1 billion records stays under 400MB.

> and a very quick look at some of the commits seemed kind
> of reassuring. But, like the changes to pruning and freezing, this is
> making some really fundamental changes to critical code. In this case,
> it's code that has evolved very little over the years and seems to
> have now evolved quite a lot.

True. I'd say that at a high level, storage and retrieval of TIDs is a
lot simpler conceptually than other aspects of vacuuming. The
low-level guts are now much more complex, but I'm confident it won't
just output a wrong answer. That aspect has been working for a long
time, and when it has broken during development, it fails very quickly
and obviously.

The more challenging aspects are less cut-and-dried, like memory
management, delegation of responsibility, how to expose locking (which
vacuum doesn't even use), readability/maintainability. Those are more
subjective, but it seems to have finally clicked into place in a way
that feels like the right trade-offs have been made. That's hand-wavy,
I realize.

The more recent follow-up commits are pieces that were discussed and
planned for earlier, but have had less review and were left out from
the initial commits since they're not essential to the functionality.
I judged that test coverage was enough to have confidence in them.

On Tue, Apr 9, 2024 at 6:33 AM Michael Paquier  wrote:
>
> This worries me less than other patches like the ones around heap
> changes or the radix tree stuff for TID collection plugged into
> vacuum, which don't have explicit on/off switches AFAIK.

Yes, there is no switch. Interestingly enough, the previous item array
ended up with its own escape hatch to tamp down its eager allocation,
autovacuum_work_mem. Echoing what I said to Robert, if we had the new
storage years ago, I doubt this GUC would ever have been proposed.

I'll also mention the array is still in the code base, but only in a
test module as a standard to test against. Hopefully that offers some
reassurance.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-09 Thread John Naylor
On Mon, Apr 8, 2024 at 7:26 PM John Naylor  wrote:
>
> I pushed both of these and see that mylodon complains that anonymous
> unions are a C11 feature. I'm not actually sure that the union with
> uintptr_t is actually needed, though, since that's not accessed as
> such here. The simplest thing seems to be to get rid of the union and name
> the inner struct "header", as in the attached.

I pushed this with some comment adjustments.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-08 Thread John Naylor
On Mon, Apr 8, 2024 at 7:42 PM Pavel Borisov  wrote:
>
>> I pushed both of these and see that mylodon complains that anonymous
>> unions are a C11 feature. I'm not actually sure that the union with
>> uintptr_t is actually needed, though, since that's not accessed as
>> such here. The simplest thing seems to be to get rid of the union and name
>> the inner struct "header", as in the attached.
>
>
> Provided  uintptr_t is not accessed it might be good to get rid of it.
>
> Maybe this patch also need correction in this:
> +#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint8) - sizeof(int8)) 
> / sizeof(OffsetNumber))

For full context the diff was

-#define NUM_FULL_OFFSETS ((sizeof(bitmapword) - sizeof(uint16)) / sizeof(OffsetNumber))
+#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint8) - sizeof(int8)) / sizeof(OffsetNumber))

I wanted the former, from f35bd9bf35, to be independently useful (in
case the commit in question had some unresolvable issue), and its
intent is to fill struct padding when the array of bitmapword happens
to have length zero. Changing to uintptr_t for the size calculation
reflects the intent to fit in a (local) pointer, regardless of the
size of a bitmapword. (If a DSA pointer happens to be a different size
for some odd platform, it should still work, BTW.)

My thinking with the union was, for big-endian, to force the 'flags'
member to where it can be set, but thinking again, it should still
work if by happenstance the header was smaller than the child pointer:
A different bit would get tagged, but I believe that's irrelevant. The
'flags' member makes sure a byte is reserved for the tag, but it may
not be where the tag is actually located, if that makes sense.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-08 Thread John Naylor
On Sun, Apr 7, 2024 at 9:08 AM John Naylor  wrote:
>
> I've attached a mostly-polished update on runtime embeddable values,
> storing up to 3 offsets in the child pointer (1 on 32-bit platforms).
> As discussed, this includes a macro to cap max possible offset that
> can be stored in the bitmap, which I believe only reduces the valid
> offset range for 32kB pages on 32-bit platforms. Even there, it allows
> for more line pointers than can possibly be useful. It also splits
> into two parts for readability. It would be committed in two pieces as
> well, since they are independently useful.

I pushed both of these and see that mylodon complains that anonymous
unions are a C11 feature. I'm not actually sure that the union with
uintptr_t is actually needed, though, since that's not accessed as
such here. The simplest thing seems to be to get rid of the union and name
the inner struct "header", as in the attached.
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index cddbaf013b..78730797d6 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -43,35 +43,30 @@
  */
 typedef struct BlocktableEntry
 {
-	union
+	struct
 	{
-		struct
-		{
 #ifndef WORDS_BIGENDIAN
-			/*
-			 * We need to position this member so that the backing radix tree
-			 * can use the lowest bit for a pointer tag. In particular, it
-			 * must be placed within 'header' so that it corresponds to the
-			 * lowest byte in 'ptr'. We position 'nwords' along with it to
-			 * avoid struct padding.
-			 */
-			uint8		flags;
-
-			int8		nwords;
+		/*
+		 * We need to position this member so that the backing radix tree can
+		 * use the lowest bit for a pointer tag. In particular, it must be
+		 * placed within 'header' so that it corresponds to the lowest byte in
+		 * 'ptr'. We position 'nwords' along with it to avoid struct padding.
+		 */
+		uint8		flags;
+
+		int8		nwords;
 #endif
 
-			/*
-			 * We can store a small number of offsets here to avoid wasting
-			 * space with a sparse bitmap.
-			 */
-			OffsetNumber full_offsets[NUM_FULL_OFFSETS];
+		/*
+		 * We can store a small number of offsets here to avoid wasting space
+		 * with a sparse bitmap.
+		 */
+		OffsetNumber full_offsets[NUM_FULL_OFFSETS];
 
 #ifdef WORDS_BIGENDIAN
-			int8		nwords;
-			uint8		flags;
+		int8		nwords;
+		uint8		flags;
 #endif
-		};
-		uintptr_t	ptr;
 	}			header;
 
 	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-07 Thread John Naylor
On Mon, Apr 8, 2024 at 2:07 AM Andres Freund  wrote:
>
> Looking at the code, the failure isn't surprising anymore:
> char data[MaxBlocktableEntrySize];
> BlocktableEntry *page = (BlocktableEntry *) data;
>
> 'char' doesn't enforce any alignment, but you're storing a BlocktableEntry in
> a char[]. You can't just do that.  Look at how we do that for
> e.g. PGAlignedBlock.
>
>
> With the attached minimal fix, the tests pass again.

Thanks, will push this shortly!




Re: Add bump memory context type and use it for tuplesorts

2024-04-07 Thread John Naylor
On Sat, Apr 6, 2024 at 7:37 PM David Rowley  wrote:
>
> I'm planning on pushing these, pending a final look at 0002 and 0003
> on Sunday morning NZ time (UTC+12), likely in about 10 hours time.

+1

I haven't looked at v6, but I've tried using it in situ, and it seems
to work as well as hoped:

https://www.postgresql.org/message-id/CANWCAZZQFfxvzO8yZHFWtQV%2BZ2gAMv1ku16Vu7KWmb5kZQyd1w%40mail.gmail.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-07 Thread John Naylor
On Sun, Apr 7, 2024 at 9:08 AM John Naylor  wrote:
> I've attached a mostly-polished update on runtime embeddable values,
> storing up to 3 offsets in the child pointer (1 on 32-bit platforms).

And...since there's a new bump context patch, I wanted to anticipate
squeezing an update on top of that, if that gets committed. 0004/5 are
the v6 bump context, and 0006 uses it for vacuum. The rest are to show
it works -- the expected.out changes make possible problems in CI
easier to see. The allocation size is 16 bytes, so this difference is
entirely due to lack of chunk header:

aset: 6619136
bump: 5047296

(Note: assert builds still have the chunk header for sanity checking,
so this was done in a more optimized build)
From e2333bad47974a22120a4de3fad804a21757f631 Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Tue, 20 Feb 2024 21:49:57 +1300
Subject: [PATCH v84 5/8] Introduce a bump memory allocator

---
 src/backend/nodes/gen_node_support.pl |   2 +-
 src/backend/utils/mmgr/Makefile   |   1 +
 src/backend/utils/mmgr/bump.c | 818 ++
 src/backend/utils/mmgr/mcxt.c |  15 +-
 src/backend/utils/mmgr/meson.build|   1 +
 src/include/nodes/memnodes.h  |   3 +-
 src/include/utils/memutils.h  |   7 +
 src/include/utils/memutils_internal.h |  18 +-
 src/tools/pgindent/typedefs.list  |   2 +
 9 files changed, 862 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/utils/mmgr/bump.c

diff --git a/src/backend/nodes/gen_node_support.pl b/src/backend/nodes/gen_node_support.pl
index d4244facbb..81df3bdf95 100644
--- a/src/backend/nodes/gen_node_support.pl
+++ b/src/backend/nodes/gen_node_support.pl
@@ -149,7 +149,7 @@ my @abstract_types = qw(Node);
 # they otherwise don't participate in node support.
 my @extra_tags = qw(
   IntList OidList XidList
-  AllocSetContext GenerationContext SlabContext
+  AllocSetContext GenerationContext SlabContext BumpContext
   TIDBitmap
   WindowObjectData
 );
diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile
index dae3432c98..01a1fb8527 100644
--- a/src/backend/utils/mmgr/Makefile
+++ b/src/backend/utils/mmgr/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	alignedalloc.o \
 	aset.o \
+	bump.o \
 	dsa.o \
 	freepage.o \
 	generation.o \
diff --git a/src/backend/utils/mmgr/bump.c b/src/backend/utils/mmgr/bump.c
new file mode 100644
index 00..f98a203a0c
--- /dev/null
+++ b/src/backend/utils/mmgr/bump.c
@@ -0,0 +1,818 @@
+/*-
+ *
+ * bump.c
+ *	  Bump allocator definitions.
+ *
+ * Bump is a MemoryContext implementation designed for memory usages which
+ * require allocating a large number of chunks, none of which ever need to be
+ * pfree'd or realloc'd.  Chunks allocated by this context have no chunk header
+ * and operations which ordinarily require looking at the chunk header cannot
+ * be performed.  For example, pfree, realloc, GetMemoryChunkSpace and
+ * GetMemoryChunkContext are all not possible with bump allocated chunks.  The
+ * only way to release memory allocated by this context type is to reset or
+ * delete the context.
+ *
+ * Portions Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/mmgr/bump.c
+ *
+ *
+ *	Bump is best suited to cases which require a large number of short-lived
+ *	chunks where performance matters.  Because bump allocated chunks don't
+ *	have a chunk header, it can fit more chunks on each block.  This means we
+ *	can do more with less memory and fewer cache lines.  The reason it's best
+ *	suited for short-lived usages of memory is that ideally, pointers to bump
+ *	allocated chunks won't be visible to a large amount of code.  The more
+ *	code that operates on memory allocated by this allocator, the more chances
+ *	that some code will try to perform a pfree or one of the other operations
+ *	which are made impossible due to the lack of chunk header.  In order
+ *	to detect accidental usage of the various disallowed operations, we do add
+ *	a MemoryChunk chunk header in MEMORY_CONTEXT_CHECKING builds and have the
+ *	various disallowed functions raise an ERROR.
+ *
+ *	Allocations are MAXALIGNed.
+ *
+ *-
+ */
+
+#include "postgres.h"
+
+#include "lib/ilist.h"
+#include "port/pg_bitutils.h"
+#include "utils/memdebug.h"
+#include "utils/memutils.h"
+#include "utils/memutils_memorychunk.h"
+#include "utils/memutils_internal.h"
+
+#define Bump_BLOCKHDRSZ	MAXALIGN(sizeof(BumpBlock))
+
+/* No chunk header unless built with MEMORY_CONTEXT_CHECKING */
+#ifdef MEMORY_CONTEXT_CHECKING
+#define Bump_CHUNKHDRSZ	sizeof(MemoryChunk)
+#else
+#define Bump_CHUNKHDRSZ	0
+#endif
+
+#define Bump_CHUNK_FRACT

Re: Improve heapgetpage() performance, overhead from serializable

2024-04-06 Thread John Naylor
On Sun, Apr 7, 2024 at 11:49 AM Andres Freund  wrote:
>
> Hi,
>
> On 2024-01-22 13:01:31 +0700, John Naylor wrote:
> > On Mon, Jul 17, 2023 at 9:58 PM Andres Freund  wrote:
> > > And 397->320ms
> > > for something as core as this, is imo worth considering on its own.
> >
> > Hi Andres, this interesting work seems to have fallen off the radar --
> > are you still planning to move forward with this for v17?
>
> I had completely forgotten about this patch, but some discussion around
> streaming read reminded me of it.  Here's a rebased version, with conflicts
> resolved and very light comment polish and a commit message. Given that
> there's been no changes otherwise in the last months, I'm inclined to push in
> a few hours.

Just in time ;-) The commit message should also have "reviewed by
Zhang Mingli" and "tested by Quan Zongliang", who shared results in a
thread for a withdrawn CF entry with a similar idea but covering fewer
cases:

https://www.postgresql.org/message-id/2ef7ff1b-3d18-2283-61b1-bbd25fc6c7ce%40yeah.net




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-06 Thread John Naylor
On Mon, Apr 1, 2024 at 9:54 AM Masahiko Sawada  wrote:
>
> Thank you! I've attached the patch that I'm going to push tomorrow.

Excellent!

I've attached a mostly-polished update on runtime embeddable values,
storing up to 3 offsets in the child pointer (1 on 32-bit platforms).
As discussed, this includes a macro to cap max possible offset that
can be stored in the bitmap, which I believe only reduces the valid
offset range for 32kB pages on 32-bit platforms. Even there, it allows
for more line pointers than can possibly be useful. It also splits
into two parts for readability. It would be committed in two pieces as
well, since they are independently useful.
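
As a worked example of the capacity claim (3 offsets on 64-bit, 1 on
32-bit), here is the arithmetic behind the NUM_FULL_OFFSETS definition
in the patch below, checked with a trivial program:

#include <assert.h>
#include <stdint.h>

typedef uint16_t OffsetNumber;

/* as defined in the patch below */
#define NUM_FULL_OFFSETS \
	((sizeof(uintptr_t) - sizeof(uint16_t)) / sizeof(OffsetNumber))

int
main(void)
{
	/*
	 * The header reserves sizeof(uint16) for 'nwords' and packs offsets
	 * into the remainder of one pointer-sized word:
	 *   64-bit: (8 - 2) / 2 == 3 embedded offsets
	 *   32-bit: (4 - 2) / 2 == 1 embedded offset
	 */
	assert(NUM_FULL_OFFSETS == (sizeof(uintptr_t) == 8 ? 3 : 1));
	return 0;
}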
From 24bd672deb4a6fa14abfc8583b500d1cbc332032 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Fri, 5 Apr 2024 17:04:12 +0700
Subject: [PATCH v83 1/3] store offsets in the header

---
 src/backend/access/common/tidstore.c  | 52 +++
 .../test_tidstore/expected/test_tidstore.out  | 14 +
 .../test_tidstore/sql/test_tidstore.sql   |  5 ++
 3 files changed, 71 insertions(+)

diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index e1a7e82469..4eb5d46951 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -34,6 +34,9 @@
 /* number of active words for a page: */
 #define WORDS_PER_PAGE(n) ((n) / BITS_PER_BITMAPWORD + 1)
 
+/* number of offsets we can store in the header of a BlocktableEntry */
+#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint16)) / sizeof(OffsetNumber))
+
 /*
  * This is named similarly to PagetableEntry in tidbitmap.c
  * because the two have a similar function.
@@ -41,6 +44,10 @@
 typedef struct BlocktableEntry
 {
 	uint16		nwords;
+
+	/* We can store a small number of offsets here to avoid wasting space with a sparse bitmap. */
+	OffsetNumber full_offsets[NUM_FULL_OFFSETS];
+
 	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];
 } BlocktableEntry;
 #define MaxBlocktableEntrySize \
@@ -316,6 +323,25 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
 	for (int i = 1; i < num_offsets; i++)
 		Assert(offsets[i] > offsets[i - 1]);
 
+	memset(page, 0, offsetof(BlocktableEntry, words));
+
+	if (num_offsets <= NUM_FULL_OFFSETS)
+	{
+		for (int i = 0; i < num_offsets; i++)
+		{
+			OffsetNumber off = offsets[i];
+
+			/* safety check to ensure we don't overrun bit array bounds */
+			if (!OffsetNumberIsValid(off))
+				elog(ERROR, "tuple offset out of range: %u", off);
+
+			page->full_offsets[i] = off;
+		}
+
+		page->nwords = 0;
+	}
+	else
+	{
 	for (wordnum = 0, next_word_threshold = BITS_PER_BITMAPWORD;
 		 wordnum <= WORDNUM(offsets[num_offsets - 1]);
 		 wordnum++, next_word_threshold += BITS_PER_BITMAPWORD)
@@ -343,6 +369,7 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
 
 	page->nwords = wordnum;
 	Assert(page->nwords == WORDS_PER_PAGE(offsets[num_offsets - 1]));
+}
 
 	if (TidStoreIsShared(ts))
 		shared_ts_set(ts->tree.shared, blkno, page);
@@ -369,6 +396,18 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
 	if (page == NULL)
 		return false;
 
+	if (page->nwords == 0)
+	{
+		/* we have offsets in the header */
+		for (int i = 0; i < NUM_FULL_OFFSETS; i++)
+		{
+			if (page->full_offsets[i] == off)
+				return true;
+		}
+		return false;
+	}
+	else
+	{
 	wordnum = WORDNUM(off);
 	bitnum = BITNUM(off);
 
@@ -378,6 +417,7 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
 
 	return (page->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0;
 }
+}
 
 /*
  * Prepare to iterate through a TidStore.
@@ -496,6 +536,17 @@ tidstore_iter_extract_tids(TidStoreIter *iter, BlockNumber blkno,
 	result->num_offsets = 0;
 	result->blkno = blkno;
 
+	if (page->nwords == 0)
+	{
+		/* we have offsets in the header */
+		for (int i = 0; i < NUM_FULL_OFFSETS; i++)
+		{
+			if (page->full_offsets[i] != InvalidOffsetNumber)
+				result->offsets[result->num_offsets++] = page->full_offsets[i];
+		}
+	}
+	else
+	{
 	for (wordnum = 0; wordnum < page->nwords; wordnum++)
 	{
 		bitmapword	w = page->words[wordnum];
@@ -518,3 +569,4 @@ tidstore_iter_extract_tids(TidStoreIter *iter, BlockNumber blkno,
 		}
 	}
 }
+}
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
index 0ae2f970da..c40779859b 100644
--- a/src/test/modules/test_tidstore/expected/test_tidstore.out
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -36,6 +36,20 @@ SELECT do_set_block_offsets(blk, array_agg(off)::int2[])
 (VALUES (0), (1), (:maxblkno / 2), (:maxblkno - 1), (:maxblkno)) AS blocks(blk),
 (VALUES (1), (2), (:maxoffset / 2), (:maxoffset - 1), (:maxoffset)) AS offsets(off)
   GROUP BY blk;
+-- Test offsets embedded in the bitmap header.
+SELECT do_set_block_offsets(501, array[greatest((

Re: Change GUC hashtable to use simplehash?

2024-04-06 Thread John Naylor
On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma  wrote:
>
> But given that we know the data length and we have it in a register
> already, it's easy enough to just mask out data past the end with a
> shift. See patch 1. Performance benefit is about 1.5x Measured on a
> small test harness that just hashes and finalizes an array of strings,
> with a data dependency between consecutive hashes (next address
> depends on the previous hash output).

I pushed this with a couple cosmetic adjustments, after fixing the
endianness issue. I'm not sure why valgrind is fine with this way, and
the other ways I tried forming the (little-endian) mask raised errors.
In addition to "zero_byte_low | (zero_byte_low - 1)", I tried
"~zero_byte_low & (zero_byte_low - 1)" and "zero_byte_low ^
(zero_byte_low - 1)" to no avail.
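
For reference, a self-contained demonstration of the mask form that did
pass valgrind, using the usual SWAR zero-byte detector; the constants
below are worked out for the little-endian load of "ab" plus its
terminator:

#include <assert.h>
#include <stdint.h>

#define haszero64(v) \
	(((v) - UINT64_C(0x0101010101010101)) & ~(v) & \
	 UINT64_C(0x8080808080808080))

int
main(void)
{
	/* little-endian load of 'a', 'b', NUL, then trailing zero bytes */
	uint64_t	chunk = UINT64_C(0x0000000000006261);
	uint64_t	zero_byte_low = haszero64(chunk);
	uint64_t	mask = zero_byte_low | (zero_byte_low - 1);

	/* 0x80 is set in the terminator byte and, harmlessly, above it */
	assert(zero_byte_low == UINT64_C(0x8080808080800000));

	/* the mask keeps 'a', 'b', and the NUL; nothing real is lost */
	assert(mask == UINT64_C(0x8080808080FFFFFF));
	assert((chunk & mask) == chunk);
	return 0;
}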

On Thu, Mar 28, 2024 at 12:37 PM Jeff Davis  wrote:
> 0001 looks good to me, thank you.
>
> > v21-0003 adds a new file hashfn_unstable.c for convenience functions
> > and converts all the duplicate frontend uses of hash_string_pointer.
>
> Why not make hash_string() inline, too? I'm fine with it either way,
> I'm just curious why you went to the trouble to create a new .c file so
> it didn't have to be inlined.

Thanks for looking! I pushed these, with hash_string() inlined.

I've attached (not reindented for clarity) an update of something
mentioned a few times already -- removing strlen calls for dynahash
and dshash string keys. I'm not quite sure how the comments should be
updated about string_hash being deprecated to call directly. This
patch goes further and semi-deprecates calling it at all, so these
comments seem a bit awkward now.
From 2e41e683b2fe2bc76b808e58ca2fea9067bff4e1 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Fri, 5 Apr 2024 13:59:13 +0700
Subject: [PATCH v21] Use fasthash for string keys in dynahash and dshash

This avoids strlen calls. string_hash is kept around in case
extensions are using it.
---
 src/backend/lib/dshash.c |  5 +++--
 src/backend/utils/hash/dynahash.c| 25 -
 src/common/hashfn.c  |  4 +++-
 src/include/common/hashfn_unstable.h | 33 
 src/include/lib/dshash.h |  2 +-
 5 files changed, 59 insertions(+), 10 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 93a9e21ddd..8bebf92cb8 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -32,6 +32,7 @@
 #include "postgres.h"
 
 #include "common/hashfn.h"
+#include "common/hashfn_unstable.h"
 #include "lib/dshash.h"
 #include "storage/lwlock.h"
 #include "utils/dsa.h"
@@ -605,14 +606,14 @@ dshash_strcmp(const void *a, const void *b, size_t size, void *arg)
 }
 
 /*
- * A hash function that forwards to string_hash.
+ * A hash function that forwards to hash_string_with_len.
  */
 dshash_hash
 dshash_strhash(const void *v, size_t size, void *arg)
 {
 	Assert(strlen((const char *) v) < size);
 
-	return string_hash((const char *) v, size);
+	return hash_string_with_len((const char *) v, size - 1);
 }
 
 /*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 145e058fe6..9b85af2743 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -98,6 +98,7 @@
 
 #include "access/xact.h"
 #include "common/hashfn.h"
+#include "common/hashfn_unstable.h"
 #include "port/pg_bitutils.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
@@ -309,6 +310,18 @@ string_compare(const char *key1, const char *key2, Size keysize)
 	return strncmp(key1, key2, keysize - 1);
 }
 
+/*
+ * hash function used when HASH_STRINGS is set
+ *
+ * If the string exceeds keysize-1 bytes, we want to hash only that many,
+ * because when it is copied into the hash table it will be truncated at
+ * that length.
+ */
+static uint32
+default_string_hash(const void *key, Size keysize)
+{
+	return hash_string_with_len((const char *) key, keysize - 1);
+}
 
 /** CREATE ROUTINES **/
 
@@ -420,8 +433,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	else
 	{
 		/*
-		 * string_hash used to be considered the default hash method, and in a
-		 * non-assert build it effectively still is.  But we now consider it
+		 * string_hash used to be considered the default hash method, and
+		 * it effectively still was until version 17.  Since version 14 we consider it
 		 * an assertion error to not say HASH_STRINGS explicitly.  To help
 		 * catch mistaken usage of HASH_STRINGS, we also insist on a
 		 * reasonably long string length: if the keysize is only 4 or 8 bytes,
@@ -430,12 +443,12 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, in

Re: fasthash32() returning uint64?

2024-04-05 Thread John Naylor
On Sat, Apr 6, 2024 at 3:47 AM Jeff Davis  wrote:
> In hashfn_unstable.h, fasthash32() is declared as:
>
>   /* like fasthash64, but returns a 32-bit hashcode */
>   static inline uint64
>   fasthash32(const char *k, size_t len, uint64 seed)
>
> Is the return type of uint64 a typo?

Yes it is, will fix, thanks!




Re: Change GUC hashtable to use simplehash?

2024-04-01 Thread John Naylor
On Tue, Mar 5, 2024 at 5:30 PM John Naylor  wrote:
>
> On Tue, Jan 30, 2024 at 5:04 PM John Naylor  wrote:
> >
> > On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma  wrote:
> > > But given that we know the data length and we have it in a register
> > > already, it's easy enough to just mask out data past the end with a
> > > shift. See patch 1. Performance benefit is about 1.5x Measured on a
> > > small test harness that just hashes and finalizes an array of strings,
> > > with a data dependency between consecutive hashes (next address
> > > depends on the previous hash output).
> >
> > Interesting work! I've taken this idea and (I'm guessing, haven't
> > tested) improved it by re-using an intermediate step for the
> > conditional, simplifying the creation of the mask, and moving the
> > bitscan out of the longest dependency chain.
>
> This needed a rebase, and is now 0001. I plan to push this soon.

I pushed but had to revert -- my version (and I believe both) failed
to keep the invariant that the aligned and unaligned must result in
the same hash. It's clear to me how to fix, but I've injured my strong
hand and won't be typing much in for a cuople days. I'll prioritize
the removal of strlen calls for v17, since the optimization can wait
and there is also a valgrind issue I haven't looked into.




Re: Change GUC hashtable to use simplehash?

2024-03-30 Thread John Naylor
On Thu, Mar 28, 2024 at 12:37 PM Jeff Davis  wrote:
>
> > v21-0003 adds a new file hashfn_unstable.c for convenience functions
> > and converts all the duplicate frontend uses of hash_string_pointer.
>
> Why not make hash_string() inline, too? I'm fine with it either way,
> I'm just curious why you went to the trouble to create a new .c file so
> it didn't have to be inlined.

Yeah, it's a bit strange looking in isolation, and I'm not sure I'll
go that route. When I was thinking of this, I also had dynahash and
dshash in mind, which do indirect calls, even if the function is
defined in the same file. That would still work with an inline
definition in the header, just duplicated in the different translation
units. Maybe that's not worth worrying about, since I imagine use
cases with indirect calls will remain rare.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-29 Thread John Naylor
On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada  wrote:
> I think the patch is in good shape. Do you have other comments or
> suggestions, John?

--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1918,11 +1918,6 @@ include_dir 'conf.d'
 too high.  It may be useful to control for this by separately
 setting .

-   
-Note that for the collection of dead tuple identifiers,
-VACUUM is only able to utilize up to a maximum of
-1GB of memory.
-   
   
  

This is mentioned twice for two different GUCs -- need to remove the
other one, too. Other than that, I just have minor nits:

- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is TID store, a storage for dead TIDs

I think I've helped edit this sentence before, but I still don't quite
like it. I'm thinking now "is storage for the dead tuple IDs".

- * set upper bounds on the number of TIDs we can keep track of at once.
+ * set upper bounds on the maximum memory that can be used for keeping track
+ * of dead TIDs at once.

I think "maximum" is redundant with "upper bounds".

I also feel the commit message needs more "meat" -- we need to clearly
narrate the features and benefits. I've attached how I would write it,
but feel free to use what you like to match your taste.

I've marked it Ready for Committer.
Use TID Store for storage of dead tuple IDs during lazy vacuum

Previously, we used a simple array for storing dead tuple IDs during
lazy vacuum, which had a number of problems:

* The array used a single allocation and so was limited to 1GB.
* The allocation was pessimistically sized according to table size.
* Lookup with binary search was slow because of poor CPU cache and
  branch prediction behavior.

This commit replaces that array with the TID store from commit XX.

Since the backing radix tree makes small allocations as needed,
the 1GB limit is now gone. Further, the total memory used is now
often smaller by an order of magnitude or more, depending on the
distribution of blocks and offsets. These two features should make
multiple rounds of heap scanning and index cleanup an extremely rare
event. TID lookup during index cleanup is also several times faster,
even more so when index order is correlated with heap tuple order.

Since there is no longer a predictable relationship between the number
of dead tuples vacuumed and the space taken up by their TIDs, the number
of tuples no longer provides any meaningful insights for users, nor
is the maximum number predictable. For that reason this commit also
changes to byte-based progress reporting, with the relevant columns
of pg_stat_progress_vacuum renamed accordingly to max_dead_tuple_bytes
and dead_tuple_bytes.

For parallel vacuum, both the TID store and supplemental information
specific to vacuum are shared among the parallel vacuum workers. As
with the previous array, we don't take any locks on TidStore during
parallel vacuum since writes are still only done by the leader process.

XXX: bump catalog version

Reviewed-by: John Naylor, (in an earlier version) Dilip Kumar
Discussion: 
https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-28 Thread John Naylor
On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada  wrote:
>
> Pushed the refactoring patch.
>
> I've attached the rebased vacuum improvement patch for cfbot. I
> mentioned in the commit message that this patch eliminates the 1GB
> limitation.
>
> I think the patch is in good shape. Do you have other comments or
> suggestions, John?

I'll do another pass tomorrow, but first I wanted to get in another
slightly-challenging in-situ test. On my humble laptop, I can still
fit a table large enough to cause PG16 to choke on multiple rounds of
index cleanup:

drop table if exists test;
create unlogged table test (a int, b uuid) with (autovacuum_enabled=false);

insert into test (a,b) select i, gen_random_uuid() from
generate_series(1,1000*1000*1000) i;

create index on test (a);
create index on test (b);

delete from test;

vacuum (verbose, truncate off, parallel 2) test;

INFO:  vacuuming "john.public.test"
INFO:  launched 1 parallel vacuum worker for index vacuuming (planned: 1)
INFO:  finished vacuuming "john.public.test": index scans: 1
pages: 0 removed, 6369427 remain, 6369427 scanned (100.00% of total)
tuples: 999997174 removed, 2826 remain, 0 are dead but not yet removable
tuples missed: 2826 dead from 18 pages not removed due to cleanup lock
contention
removable cutoff: 771, which was 0 XIDs old when operation ended
new relfrozenxid: 767, which is 4 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan needed: 6369409 pages from table (100.00% of total) had
999997174 dead item identifiers removed
index "test_a_idx": pages: 2741898 in total, 2741825 newly deleted,
2741825 currently deleted, 0 reusable
index "test_b_idx": pages: 3850387 in total, 3842056 newly deleted,
3842056 currently deleted, 0 reusable
avg read rate: 159.740 MB/s, avg write rate: 161.726 MB/s
buffer usage: 26367981 hits, 14958634 misses, 15144601 dirtied
WAL usage: 3 records, 1 full page images, 2050 bytes
system usage: CPU: user: 151.89 s, system: 193.54 s, elapsed: 731.59 s

Watching pg_stat_progress_vacuum, dead_tuple_bytes got up to 398458880.

About the "tuples missed" -- I didn't expect contention during this
test. I believe that's completely unrelated behavior, but wanted to
mention it anyway, since I found it confusing.




Re: Change GUC hashtable to use simplehash?

2024-03-26 Thread John Naylor
On Wed, Mar 20, 2024 at 11:01 PM Jeff Davis  wrote:
>
> > I found that adding __attribute__((no_sanitize_address)) to
> > fasthash_accum_cstring_aligned() passes CI. While this kind of
> > exception is warned against (for good reason), I think it's fine here
> > given that glibc and NetBSD, and probably others, do something
> > similar
> > for optimized strlen(). Before I write the proper macro for that, are
> > there any objections? Better ideas?
>
> It appears that the spelling no_sanitize_address is deprecated in
> clang[1] in favor of 'no_sanitize("address")'. It doesn't appear to be
> deprecated in gcc[2].

Thanks for the pointers! In v20-0001, I've drafted checking this
spelling first, since pg_attribute_no_sanitize_alignment has a similar
version check. Then it checks for no_sanitize_address using
__has_attribute, which goes back to gcc 5. That's plenty for the
buildfarm and CI, and I'm not sure it's worth expending additional
effort to cover more cases. (A similar attribute exists for MSVC in
case it comes up.)

v21-0003 adds a new file hashfn_unstable.c for convenience functions
and converts all the duplicate frontend uses of hash_string_pointer.

This will be where a similar hash_string_with_len will live for
dynash/dshash, which I tested some time ago. I haven't decided whether
to merge that earlier work here or keep it in a separate patch, but
regardless of how 0003 ends up I'd like to push 0001/0002 shortly.
From 690061ff4a54e7baef213bb16e7cc4c4f4c79dbd Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 6 Feb 2024 13:11:33 +0700
Subject: [PATCH v20 2/3] Speed up tail processing when hashing aligned C
 strings

After encountering the NUL terminator, the word-at-a-time loop exits
and we must hash the remaining bytes. Previously we calculated the
terminator's position and re-loaded the remaining bytes from the input
string. We already have all the data we need in a register, so let's
just mask off the bytes we need and hash them immediately. The mask can
be cheaply computed without knowing the terminator's position. We still
need that position for the length calculation, but the CPU can now
do that in parallel with other work, shortening the dependency chain.

Ants Aasma and John Naylor

Discussion: https://postgr.es/m/CANwKhkP7pCiW_5fAswLhs71-JKGEz1c1%2BPC0a_w1fwY4iGMqUA%40mail.gmail.com
---
 src/include/common/hashfn_unstable.h | 44 +---
 1 file changed, 34 insertions(+), 10 deletions(-)

diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h
index 4227afd141..8998475ccf 100644
--- a/src/include/common/hashfn_unstable.h
+++ b/src/include/common/hashfn_unstable.h
@@ -222,8 +222,9 @@ static inline size_t
 fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 {
 	const char *const start = str;
-	size_t		remainder;
+	uint64		chunk;
 	uint64		zero_byte_low;
+	uint64		mask;
 
 	Assert(PointerIsAligned(start, uint64));
 
@@ -242,7 +243,7 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 	 */
 	for (;;)
 	{
-		uint64		chunk = *(uint64 *) str;
+		chunk = *(uint64 *) str;
 
 #ifdef WORDS_BIGENDIAN
 		zero_byte_low = haszero64(pg_bswap64(chunk));
@@ -257,14 +258,37 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 		str += FH_SIZEOF_ACCUM;
 	}
 
-	/*
-	 * The byte corresponding to the NUL will be 0x80, so the rightmost bit
-	 * position will be in the range 7, 15, ..., 63. Turn this into byte
-	 * position by dividing by 8.
-	 */
-	remainder = pg_rightmost_one_pos64(zero_byte_low) / BITS_PER_BYTE;
-	fasthash_accum(hs, str, remainder);
-	str += remainder;
+	if (zero_byte_low & 0xFF)
+	{
+		/*
+		 * The next byte in the input is the NUL terminator, so we have
+		 * nothing to do.
+		 */
+	}
+	else
+	{
+		/*
+		 * Create a mask for the remaining bytes so we can combine them into
+		 * the hash. The mask also covers the NUL terminator, but that's
+		 * harmless. The mask could contain 0x80 in bytes corresponding to the
+		 * input past the terminator, but only where the input byte is zero or
+		 * one, so also harmless.
+		 */
+		mask = zero_byte_low | (zero_byte_low - 1);
+#ifdef WORDS_BIGENDIAN
+		/* need to mask the upper bytes */
+		mask = pg_bswap64(mask);
+#endif
+		hs->accum = chunk & mask;
+		fasthash_combine(hs);
+
+		/*
+		 * The byte corresponding to the NUL will be 0x80, so the rightmost
+		 * bit position will be in the range 15, 23, ..., 63. Turn this into
+		 * byte position by dividing by 8.
+		 */
+		str += pg_rightmost_one_pos64(zero_byte_low) / BITS_PER_BYTE;
+	}
 
 	return str - start;
 }
-- 
2.44.0

From 97c2996c4e081c5612c06da6c80ae87a1d273090 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 5 Mar 2024 16:59:39 +0700
Subject: [PATCH v20 3/3] Convert some frontend hash functions to fasthash

Also go one step further

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-26 Thread John Naylor
On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada  wrote:
>
> On Mon, Mar 25, 2024 at 3:25 PM John Naylor  wrote:
> >
> > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada  
> > wrote:

> > - * remaining LP_DEAD line pointers on the page in the dead_items
> > - * array. These dead items include those pruned by lazy_scan_prune()
> > - * as well we line pointers previously marked LP_DEAD.
> > + * remaining LP_DEAD line pointers on the page in the dead_items.
> > + * These dead items include those pruned by lazy_scan_prune() as well
> > + * we line pointers previously marked LP_DEAD.
> >
> > Here maybe "into dead_items".

- * remaining LP_DEAD line pointers on the page in the dead_items.
+ * remaining LP_DEAD line pointers on the page into the dead_items.

Let me explain. It used to be "in the dead_items array." It is not an
array anymore, so it was changed to "in the dead_items". dead_items is
a variable name, and names don't take "the". "into dead_items" seems
most natural to me, but there are other possible phrasings.

> > > > Did you try it with 1MB m_w_m?
> > >
> > > I've incorporated the above comments and test results look good to me.
> >
> > Could you be more specific about what the test was?
> > Does it work with 1MB m_w_m?
>
> If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB.
>
> FYI other test cases I tested were:
>
> * m_w_m = 2199023254528 (maximum value)
> initial: 1MB
> max: 128GB
>
> * m_w_m = 64MB (default)
> initial: 1MB
> max: 8MB

If the test was a vacuum, how big a table was needed to hit 128GB?

> > The existing comment slipped past my radar, but max_bytes is not a
> > limit, it's a hint. Come to think of it, it never was a limit in the
> > normal sense, but in earlier patches it was the criteria for reporting
> > "I'm full" when asked.
>
> Updated the comment.

+ * max_bytes is not a limit; it's used to choose the memory block sizes of
+ * a memory context for TID storage in order for the total memory consumption
+ * not to be overshot a lot. The caller can use the max_bytes as the criteria
+ * for reporting whether it's full or not.

This is good information. I suggest this edit:

"max_bytes" is not an internally-enforced limit; it is used only as a
hint to cap the memory block size of the memory context for TID
storage. This reduces space wastage due to over-allocation. If the
caller wants to monitor memory usage, it must compare its limit with
the value reported by TidStoreMemoryUsage().

Other comments:

v79-0002 looks good to me.

v79-0003:

"With this commit, when creating a shared TidStore, a dedicated DSA
area is created for TID storage instead of using the provided DSA
area."

This is very subtle, but "the provided..." implies there still is one.
-> "a provided..."

+ * Similar to TidStoreCreateLocal() but create a shared TidStore on a
+ * DSA area. The TID storage will live in the DSA area, and a memory
+ * context rt_context will have only meta data of the radix tree.

-> "the memory context"

I think you can go ahead and commit 0002 and 0003/4.

v79-0005:

- bypass = (vacrel->lpdead_item_pages < threshold &&
-   vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L);

The parentheses look strange, and the first line shouldn't change
without a good reason.

- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ dead_items = TidStoreAttach(shared->dead_items_dsa_handle,
+ shared->dead_items_handle);

I feel ambivalent about this comment change. The original is not very
descriptive to begin with. If we need to change at all, maybe "find
dead_items in shared memory"?

v79-0005: As I said earlier, Dilip Kumar reviewed an earlier version.

v79-0006:

vac_work_mem should also go back to being an int.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-24 Thread John Naylor
On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada  wrote:
>
> On Thu, Mar 21, 2024 at 7:48 PM John Naylor  wrote:

> > v77-0001
> >
> > - dead_items = (VacDeadItems *) 
> > palloc(vac_max_items_to_alloc_size(max_items));
> > - dead_items->max_items = max_items;
> > - dead_items->num_items = 0;
> > + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0);
> > +
> > + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo));
> > + dead_items_info->max_bytes = vac_work_mem * 1024L;
> >
> > This is confusing enough that it looks like a bug:
> >
> > [inside TidStoreCreate()]
> > /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */
> > while (16 * maxBlockSize > max_bytes * 1024L)
> > maxBlockSize >>= 1;
> >
> > This was copied from CreateWorkExprContext, which operates directly on
> > work_mem -- if the parameter is actually bytes, we can't "* 1024"
> > here. If we're passing something measured in kilobytes, the parameter
> > is badly named. Let's convert once and use bytes everywhere.
>
> True. The attached 0001 patch fixes it.

v78-0001 and 02 are fine, but for 0003 there is a consequence that I
didn't see mentioned: vac_work_mem now refers to bytes, where before
it referred to kilobytes. It seems pretty confusing to use a different
convention from elsewhere, especially if it has the same name but
different meaning across versions. Worse, this change is buried inside
a moving-stuff-around diff, making it hard to see. Maybe "convert only
once" is still possible, but I was actually thinking of

+ dead_items_info->max_bytes = vac_work_mem * 1024L;
+ vacrel->dead_items = TidStoreCreate(dead_items_info->max_bytes, NULL, 0);

That way it's pretty obvious that it's correct. That may require a bit
of duplication and moving around for shmem, but there is some of that
already.

More on 0003:

- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is TidStore, a storage for dead TIDs

+ * autovacuum_work_mem) memory space to keep track of dead TIDs.  If the
+ * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum

I wonder if the comments here should refer to it using a more natural
spelling, like "TID store".

- * items in the dead_items array for later vacuuming, count live and
+ * items in the dead_items for later vacuuming, count live and

Maybe "the dead_items area", or "the dead_items store" or "in dead_items"?

- * remaining LP_DEAD line pointers on the page in the dead_items
- * array. These dead items include those pruned by lazy_scan_prune()
- * as well we line pointers previously marked LP_DEAD.
+ * remaining LP_DEAD line pointers on the page in the dead_items.
+ * These dead items include those pruned by lazy_scan_prune() as well
+ * we line pointers previously marked LP_DEAD.

Here maybe "into dead_items".

Also, "we line pointers" seems to be a pre-existing typo.

- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers
in %u pages",
+ vacrel->relname, vacrel->dead_items_info->num_items, vacuumed_pages)));

This is a translated message, so let's keep the message the same.
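
To spell that out: the translatable format string and the portable cast
stay as they were, and only the value expression changes. The corrected
hunk would look roughly like this (a sketch based on the quoted diff,
not the committed code):

	(errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
			vacrel->relname,
			(long long) vacrel->dead_items_info->num_items,
			vacuumed_pages)));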

/*
 * Allocate dead_items (either using palloc, or in dynamic shared memory).
 * Sets dead_items in vacrel for caller.
 *
 * Also handles parallel initialization as part of allocating dead_items in
 * DSM when required.
 */
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)

This comment didn't change at all. It's not wrong, but let's consider
updating the specifics.

v78-0004:

> > +#define dsa_create(tranch_id) \
> > + dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE)
> >
> > Since these macros are now referring to defaults, maybe their name
> > should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE
> > (*_MAX_*)
>
> It makes sense to rename DSA_INITIAL_SEGMENT_SIZE, but I think that
> the DSA_MAX_SEGMENT_SIZE is the theoretical maximum size, the current
> name also makes sense to me.

Right, that makes sense.

v78-0005:

"Although commit XXX
allowed specifying the initial and maximum DSA segment sizes, callers
still needed to clamp their own limits, which was not consistent and
user-friendly."

Perhaps s/still needed/would have needed/ ..., since we're preventing
that necessity.

> > Did you try it with 1MB m_w_m?
>
> I've incorporated the above comments and test results look good to me.

Re: add AVX2 support to simd.h

2024-03-24 Thread John Naylor
On Fri, Mar 22, 2024 at 12:09 AM Nathan Bossart
 wrote:
>
> On Thu, Mar 21, 2024 at 11:30:30AM +0700, John Naylor wrote:

> > If this were "<=" then for long arrays we could assume there is
> > always more than one block, and wouldn't need to check if any elements
> > remain -- first block, then a single loop and it's done.
> >
> > The loop could also then be a "do while" since it doesn't have to
> > check the exit condition up front.
>
> Good idea.  That causes us to re-check all of the tail elements when the
> number of elements is evenly divisible by nelem_per_iteration, but that
> might be worth the trade-off.

Yeah, if there's no easy way to avoid that it's probably fine. I
wonder if we can subtract one first to force even multiples to round
down, although I admit I haven't thought through the consequences of
that.

> [v8]

Seems pretty good. It'd be good to see the results of 2- vs.
4-register before committing, because that might lead to some
restructuring, but maybe it won't, and v8 is already an improvement
over HEAD.

/* Process the remaining elements one at a time. */

This now does all of them if that path is taken, so "remaining" can be removed.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-24 Thread John Naylor
On Mon, Mar 25, 2024 at 8:02 AM Masahiko Sawada  wrote:
>
> On Mon, Mar 25, 2024 at 1:53 AM Tom Lane  wrote:
> >
> > I'm not sure why it took a couple weeks for Coverity to notice
> > ee1b30f12, but it saw it today, and it's not happy:
>
> Hmm, I've also done Coverity Scan in development but I wasn't able to
> see this one for some reason...

Hmm, before 30e144287 this code only ran in a test module; is it
possible Coverity would not find it there?




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-21 Thread John Naylor
On Thu, Mar 21, 2024 at 4:03 PM Masahiko Sawada  wrote:
>
> I've looked into this idea further. Overall, it looks clean and I
> don't see any problem so far in terms of integration with lazy vacuum.
> I've attached three patches for discussion and tests.

Seems okay in the big picture, it's the details we need to be careful of.

v77-0001

- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
+ vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0);
+
+ dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo));
+ dead_items_info->max_bytes = vac_work_mem * 1024L;

This is confusing enough that it looks like a bug:

[inside TidStoreCreate()]
/* choose the maxBlockSize to be no larger than 1/16 of max_bytes */
while (16 * maxBlockSize > max_bytes * 1024L)
maxBlockSize >>= 1;

This was copied from CreateWorkExprContext, which operates directly on
work_mem -- if the parameter is actually bytes, we can't "* 1024"
here. If we're passing something measured in kilobytes, the parameter
is badly named. Let's convert once and use bytes everywhere.

Note: This was not another pass over the whole vacuum patch, just
looking an the issue at hand.
Also for later: Dilip Kumar reviewed an earlier version.

v77-0002:

+#define dsa_create(tranch_id) \
+ dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE)

Since these macros are now referring to defaults, maybe their name
should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE
(*_MAX_*)

+/* The minimum size of a DSM segment. */
+#define DSA_MIN_SEGMENT_SIZE ((size_t) 1024)

That's a *lot* smaller than it is now. Maybe 256kB? We just want 1MB
m_w_m to work correctly.

v77-0003:

+/* Public APIs to create local or shared TidStore */
+
+TidStore *
+TidStoreCreateLocal(size_t max_bytes)
+{
+ return tidstore_create_internal(max_bytes, false, 0);
+}
+
+TidStore *
+TidStoreCreateShared(size_t max_bytes, int tranche_id)
+{
+ return tidstore_create_internal(max_bytes, true, tranche_id);
+}

I don't think these operations have enough in common to justify
sharing even an internal implementation. Choosing aset block size is
done for both memory types, but it's pointless to do it for shared
memory, because the local context is then only used for small
metadata.

+ /*
+ * Choose the DSA initial and max segment sizes to be no longer than
+ * 1/16 and 1/8 of max_bytes, respectively.
+ */

I'm guessing the 1/8 here because the number of segments is limited? I
know these numbers are somewhat arbitrary, but readers will wonder why
one has 1/8 and the other has 1/16.

+ if (dsa_init_size < DSA_MIN_SEGMENT_SIZE)
+ dsa_init_size = DSA_MIN_SEGMENT_SIZE;
+ if (dsa_max_size < DSA_MAX_SEGMENT_SIZE)
+ dsa_max_size = DSA_MAX_SEGMENT_SIZE;

The second clamp seems against the whole point of this patch -- it
seems they should all be clamped bigger than the DSA_MIN_SEGMENT_SIZE?
Did you try it with 1MB m_w_m?




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-21 Thread John Naylor
On Thu, Mar 21, 2024 at 1:11 PM Masahiko Sawada  wrote:

> Or we can have a new function for dsa.c to set the initial and max
> segment size (or either one) to the existing DSA area so that
> TidStoreCreate() can specify them at creation.

I didn't like this very much, because it's splitting an operation
across an API boundary. The caller already has all the information it
needs when it creates the DSA. Straw man proposal: it could do the
same for local memory, then they'd be more similar. But if we made
local contexts the responsibility of the caller, that would cause
duplication between creating and resetting.

> In shared TidStore
> cases, since all memory required by shared radix tree is allocated in
> the passed-in DSA area and the memory usage is the total segment size
> allocated in the DSA area

...plus apparently some overhead, I just found out today, but that's
beside the point.

On Thu, Mar 21, 2024 at 2:02 PM Masahiko Sawada  wrote:
>
> Yet another idea is that TidStore creates its own DSA area in
> TidStoreCreate(). That is, In TidStoreCreate() we create a DSA area
> (using dsa_create()) and pass it to RT_CREATE(). Also, we need a new
> API to get the DSA area. The caller (e.g. parallel vacuum) gets the
> dsa_handle of the DSA and stores it in the shared memory (e.g. in
> PVShared). TidStoreAttach() will take two arguments: dsa_handle for
> the DSA area and dsa_pointer for the shared radix tree. This idea
> still requires controlling min/max segment sizes since dsa_create()
> uses the 1MB as the initial segment size. But the TidStoreCreate()
> would be more user friendly.

This seems like an overall simplification, aside from future size
configuration, so +1 to continue looking into this. If we go this
route, I'd like to avoid a boolean parameter and cleanly separate
TidStoreCreateLocal() and TidStoreCreateShared(). Every operation
after that can introspect, but it's a bit awkward to force these cases
into the same function. It always was a little bit, but this change
makes it more so.




Re: add AVX2 support to simd.h

2024-03-20 Thread John Naylor
On Thu, Mar 21, 2024 at 2:55 AM Nathan Bossart  wrote:
>
> On Wed, Mar 20, 2024 at 09:31:16AM -0500, Nathan Bossart wrote:

> > I don't mind removing the 2-register stuff if that's what you think we
> > should do.  I'm cautiously optimistic that it'd help more than the extra
> > branch prediction might hurt, and it'd at least help avoid regressing the
> > lower end for the larger AVX2 registers, but I probably won't be able to
> > prove that without constructing another benchmark.  And TBH I'm not sure
> > it'll significantly impact any real-world workload, anyway.
>
> Here's a new version of the patch set with the 2-register stuff removed,

I'm much happier about v5-0001. With a small tweak it would match what
I had in mind:

+ if (nelem < nelem_per_iteration)
+ goto one_by_one;

If this were "<=" then for long arrays we could assume there is
always more than one block, and wouldn't need to check if any elements
remain -- first block, then a single loop and it's done.

The loop could also then be a "do while" since it doesn't have to
check the exit condition up front.

> plus a fresh run of the benchmark.  The weird spike for AVX2 is what led me
> down the 2-register path earlier.

Yes, that spike is weird, because it seems super-linear. However, the
more interesting question for me is: AVX2 isn't really buying much for
the numbers covered in this test. Between 32 and 48 elements, and
between 64 and 80, it's indistinguishable from SSE2. The jumps to the
next shelf are postponed, but the jumps are just as high. From earlier
system benchmarks, I recall it eventually wins out with hundreds of
elements, right? Is that still true?

Further, now that the algorithm is more SIMD-appropriate, I wonder
what doing 4 registers at a time is actually buying us for either SSE2
or AVX2. It might just be a matter of scale, but that would be good to
understand.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-20 Thread John Naylor
On Thu, Mar 21, 2024 at 9:37 AM Masahiko Sawada  wrote:
>
> On Wed, Mar 20, 2024 at 11:19 PM John Naylor  wrote:
> > Are they (the blocks to be precise) really out of order? The VALUES
> > statement is ordered, but after inserting it does not output that way.
> > I wondered if this is platform independent, but CI and our dev
> > machines haven't failed this test, and I haven't looked into what
> > determines the order. It's easy enough to hide the blocks if we ever
> > need to, as we do elsewhere...
>
> It seems not necessary as such a test is already covered by
> test_radixtree. I've changed the query to hide the output blocks.

Okay.

> The buildfarm has been all-green so far.

Great!

> I've attached the latest vacuum improvement patch.
>
> I just remembered that the tidstore cannot still be used for parallel
> vacuum with minimum maintenance_work_mem. Even when the shared
> tidstore is empty, its memory usage reports 1056768 bytes, a bit above
> 1MB (1048576 bytes). We need something discussed on another thread[1]
> in order to make it work.

For exactly this reason, we used to have a clamp on max_bytes when it
was internal to tidstore, so that it never reported full when first
created, so I guess that got thrown away when we got rid of the
control object in shared memory. Forcing callers to clamp their own
limits seems pretty unfriendly, though.

The proposals in that thread are pretty simple. If those don't move
forward soon, a hackish workaround would be to round down the number
we get from dsa_get_total_size to the nearest megabyte. Then
controlling min/max segment size would be a nice-to-have for PG17, not
a prerequisite.
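
The hackish workaround mentioned would amount to something like the
following sketch, assuming dsa_get_total_size() as the usage source and
an illustrative helper name:

#include "postgres.h"
#include "utils/dsa.h"

/*
 * Report DSA usage rounded down to the nearest megabyte, so a freshly
 * created (empty, but already 1MB-plus) shared store never looks full
 * against a 1MB maintenance_work_mem.
 */
static size_t
rounded_memory_usage(dsa_area *area)
{
	size_t		total = dsa_get_total_size(area);

	return total - (total % (1024 * 1024));
}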




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-20 Thread John Naylor
On Wed, Mar 20, 2024 at 8:30 PM Masahiko Sawada  wrote:
> I forgot to report the results. Yes, I did some tests where I inserted
> many TIDs to make the tidstore use several GB memory. I did two cases:
>
> 1. insert 100M blocks of TIDs with an offset of 100.
> 2. insert 10M blocks of TIDs with an offset of 2048.
>
> The tidstore used about 4.8GB and 5.2GB, respectively, and all lookup
> and iteration results were expected.

Thanks for confirming!

> While reviewing the codes again, the following two things caught my eyes:
>
> in check_set_block_offset() function, we don't take a lock on the
> tidstore while checking all possible TIDs. I'll add
> TidStoreLockShare() and TidStoreUnlock() as follows:
>
> +   TidStoreLockShare(tidstore);
> if (TidStoreIsMember(tidstore, &tid))
> ItemPointerSet(&items.lookup_tids[num_lookup_tids++],
> blkno, offset);
> +   TidStoreUnlock(tidstore);

In one sense, all locking in the test module is useless since there is
only a single process. On the other hand, it seems good to at least
exercise the locking we have written, even if only trivially, and to
serve as an example of usage. We should probably be consistent, and
document at the top that the locks are pro-forma only.

It's both a blessing and a curse that vacuum only has a single writer.
It makes development less of a hassle, but also means that tidstore
locking is done for API-completeness reasons, not (yet) as a practical
necessity. Even tidbitmap.c's hash table currently has a single
writer, and while using tidstore for that is still an engineering
challenge for other reasons, it wouldn't exercise locking
meaningfully, either, at least at first.

> Regarding TidStoreMemoryUsage(), IIUC the caller doesn't need to take
> a lock on the shared tidstore since dsa_get_total_size() (called by
> RT_MEMORY_USAGE()) does appropriate locking. I think we can mention it
> in the comment as follows:
>
> -/* Return the memory usage of TidStore */
> +/*
> + * Return the memory usage of TidStore.
> + *
> + * In shared TidStore cases, since shared_ts_memory_usage() does appropriate
> + * locking, the caller doesn't need to take a lock.
> + */
>
> What do you think?

That duplicates the underlying comment on the radix tree function that
this calls, so I'm inclined to leave it out. At this level it's
probably best to document when a caller _does_ need to take an action.

One thing I forgot to ask about earlier:

+-- Add tids in out of order.

Are they (the blocks to be precise) really out of order? The VALUES
statement is ordered, but after inserting it does not output that way.
I wondered if this is platform independent, but CI and our dev
machines haven't failed this test, and I haven't looked into what
determines the order. It's easy enough to hide the blocks if we ever
need to, as we do elsewhere...




Re: Change GUC hashtable to use simplehash?

2024-03-20 Thread John Naylor
On Tue, Mar 5, 2024 at 5:30 PM John Naylor  wrote:
>
> On Tue, Jan 30, 2024 at 5:04 PM John Naylor  wrote:
> >
> > On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma  wrote:
> > > But given that we know the data length and we have it in a register
> > > already, it's easy enough to just mask out data past the end with a
> > > shift. See patch 1. Performance benefit is about 1.5x Measured on a
> > > small test harness that just hashes and finalizes an array of strings,
> > > with a data dependency between consecutive hashes (next address
> > > depends on the previous hash output).
> >
> > Interesting work! I've taken this idea and (I'm guessing, haven't
> > tested) improved it by re-using an intermediate step for the
> > conditional, simplifying the creation of the mask, and moving the
> > bitscan out of the longest dependency chain.
>
> This needed a rebase, and is now 0001. I plan to push this soon.

I held off on this because CI was failing, but it wasn't because of this.

> I also went and looked at the simplehash instances and found a few
> that would be easy to switch over. Rather than try to figure out which
> could benefit from shaving cycles, I changed all the string hashes,
> and one more, in 0002 so they can act as examples.

This was the culprit. The search path cache didn't trigger this when
it went in, but it seems that for frontend code a read past the end of
a malloc'd block fails under -fsanitize=address. By the same token, I'm
guessing the only
reason this didn't fail for backend is because almost all strings
you'd want to use as a hash key won't use a malloc'd external block.

I found that adding __attribute__((no_sanitize_address)) to
fasthash_accum_cstring_aligned() passes CI. While this kind of
exception is warned against (for good reason), I think it's fine here
given that glibc and NetBSD, and probably others, do something similar
for optimized strlen(). Before I write the proper macro for that, are
there any objections? Better ideas?

> Commit 42a1de3013 added a new use for string_hash, but I can't tell
> from a quick glance whether it uses the truncation, so I'm going to
> take a closer look before re-attaching the proposed dynahash change
> again.

After looking, I think the thing to do here is create a
hashfn_unstable.c file for global functions:
- hash_string() to replace all those duplicate definitions of
hash_string_pointer() in all the frontend code
- hash_string_with_limit() for dynahash and dshash.




Re: add AVX2 support to simd.h

2024-03-19 Thread John Naylor
On Tue, Mar 19, 2024 at 11:30 PM Nathan Bossart
 wrote:
> > Sounds similar in principle, but it looks really complicated. I don't
> > think the additional loops and branches are a good way to go, either
> > for readability or for branch prediction. My sketch has one branch for
> > which loop to do, and then performs only one loop. Let's do the
> > simplest thing that could work. (I think we might need a helper
> > function to do the block, but the rest should be easy)
>
> I tried to trim some of the branches, and came up with the attached patch.
> I don't think this is exactly what you were suggesting, but I think it's
> relatively close.  My testing showed decent benefits from using 2 vectors
> when there aren't enough elements for 4, so I've tried to keep that part
> intact.

I would caution against that if the benchmark is repeatedly running
against a static number of elements, because the branch predictor will
be right all the time (except maybe when it exits a loop, not sure).
We probably don't need to go to the trouble to construct a benchmark
with some added randomness, but we have to be careful not to overfit what
the test is actually measuring.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-19 Thread John Naylor
On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada  wrote:
>
> On Thu, Mar 14, 2024 at 1:29 PM John Naylor  wrote:
> > Locally (not CI), we should try big inputs to make sure we can
> > actually go up to many GB -- it's easier and faster this way than
> > having vacuum give us a large data set.
>
> I'll do these tests.

I just remembered this -- did any of this kind of testing happen? I
can do it as well.

> Thank you. I've incorporated all the comments above. I've attached the
> latest patches, and am going to push them (one by one) after
> self-review again.

One more cosmetic thing in 0001 that caught my eye:

diff --git a/src/backend/access/common/Makefile
b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
  syncscan.o \
  toast_compression.o \
  toast_internals.o \
+ tidstore.o \
  tupconvert.o \
  tupdesc.o

diff --git a/src/backend/access/common/meson.build
b/src/backend/access/common/meson.build
index 725041a4ce..a02397855e 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
   'syncscan.c',
   'toast_compression.c',
   'toast_internals.c',
+  'tidstore.c',
   'tupconvert.c',
   'tupdesc.c',
 )

These aren't in alphabetical order.




Re: add AVX2 support to simd.h

2024-03-19 Thread John Naylor
On Tue, Mar 19, 2024 at 10:16 AM Nathan Bossart
 wrote:
>
> On Tue, Mar 19, 2024 at 10:03:36AM +0700, John Naylor wrote:
> > I took a brief look, and 0001 isn't quite what I had in mind. I can't
> > quite tell what it's doing with the additional branches and "goto
> > retry", but I meant something pretty simple:
>
> Do you mean 0002?  0001 just adds a 2-register loop for remaining elements
> once we've exhausted what can be processed with the 4-register loop.

Sorry, I was looking at v2 at the time.

> > - if short, do one element at a time and return
>
> 0002 does this.

That part looks fine.

> > - if long, do one block unconditionally, then round the start pointer
> > up so that "end - start" is an exact multiple of blocks, and loop over
> > them
>
> 0002 does the opposite of this.  That is, after we've completed as many
> blocks as possible, we move the iterator variable back to "end -
> block_size" and do one final iteration to cover all the remaining elements.

Sounds similar in principle, but it looks really complicated. I don't
think the additional loops and branches are a good way to go, either
for readability or for branch prediction. My sketch has one branch for
which loop to do, and then performs only one loop. Let's do the
simplest thing that could work. (I think we might need a helper
function to do the block, but the rest should be easy)




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-19 Thread John Naylor
On Tue, Mar 19, 2024 at 10:24 AM Masahiko Sawada  wrote:
>
> On Tue, Mar 19, 2024 at 8:35 AM John Naylor  wrote:
> >
> > On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Sun, Mar 17, 2024 at 11:46 AM John Naylor  
> > > wrote:

> > It might also be worth reducing the number of blocks in the random
> > test -- multiple runs will have different offsets anyway.
>
> Yes. If we reduce the number of blocks from 1000 to 100, the
> regression test took on my environment:
>
> 1000 blocks : 516 ms
> 100 blocks   : 228 ms

Sounds good.

> Removed some unnecessary variables in 0002 patch.

Looks good.

> So the MaxBlocktableEntrySize calculation would be as follows?
>
> #define MaxBlocktableEntrySize \
> offsetof(BlocktableEntry, words) + \
> (sizeof(bitmapword) * \
> WORDS_PER_PAGE(Min(MaxOffsetNumber, \
>BITS_PER_BITMAPWORD * PG_INT8_MAX - 1
>
> I've made this change in the 0003 patch.

This is okay, but one side effect is that we have both an assert and
an elog, for different limits. I think we'll need a separate #define
to help. But for now, I don't want to hold up tidstore further with
this because I believe almost everything else in v74 is in pretty good
shape. I'll save this for later as a part of the optimization I
proposed.
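
To be concrete about the separate #define, something like this is what
I mean (a sketch only, not tested):

/* single source of truth for the highest offset the bitmap can hold */
#define MAX_OFFSET_IN_BITMAP \
	Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1)

#define MaxBlocktableEntrySize \
	(offsetof(BlocktableEntry, words) + \
	 sizeof(bitmapword) * WORDS_PER_PAGE(MAX_OFFSET_IN_BITMAP))

That way the assert, the elog, and the allocation size would all refer
to the same named limit.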

Remaining things I noticed:

+#define RT_PREFIX local_rt
+#define RT_PREFIX shared_rt

Prefixes for simplehash, for example, don't have "sh" -- maybe "local/shared_ts"

+ /* MemoryContext where the radix tree uses */

s/where/that/

+/*
+ * Lock support functions.
+ *
+ * We can use the radix tree's lock for shared TidStore as the data we
+ * need to protect is only the shared radix tree.
+ */
+void
+TidStoreLockExclusive(TidStore *ts)

Talking about multiple things, so maybe a blank line after the comment.

With those, I think you can go ahead and squash all the tidstore
patches except for 0003 and commit it.

> While reviewing the vacuum patch, I realized that we always pass
> LWTRANCHE_SHARED_TIDSTORE to RT_CREATE(), and the wait event related
> to the tidstore is therefore always the same. I think it would be
> better to make the caller of TidStoreCreate() specify the tranche_id
> and pass it to RT_CREATE(). That way, the caller can specify their own
> wait event for tidstore. The 0008 patch tried this idea. dshash.c does
> the same idea.

Sounds reasonable. I'll just note that src/include/storage/lwlock.h
still has an entry for LWTRANCHE_SHARED_TIDSTORE.
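
(For clarity, the API change under discussion would presumably look
something like the following -- hypothetical, since I'm going from the
description rather than the 0008 patch itself:

TidStore *
TidStoreCreate(size_t max_bytes, dsa_area *dsa, int tranche_id);

...with tranche_id simply forwarded to RT_CREATE(), the same way
dshash.c lets its caller choose.)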




Re: add AVX2 support to simd.h

2024-03-18 Thread John Naylor
On Tue, Mar 19, 2024 at 9:03 AM Nathan Bossart  wrote:
>
> On Sun, Mar 17, 2024 at 09:47:33AM +0700, John Naylor wrote:
> > I haven't looked at the patches, but the graphs look good.
>
> I spent some more time on these patches.  Specifically, I reordered them to
> demonstrate the effects on systems without AVX2 support.  I've also added a
> shortcut to jump to the one-by-one approach when there aren't many
> elements, as the overhead becomes quite noticeable otherwise.  Finally, I
> ran the same benchmarks again on x86 and Arm out to 128 elements.
>
> Overall, I think 0001 and 0002 are in decent shape, although I'm wondering
> if it's possible to improve the style a bit.

I took a brief look, and 0001 isn't quite what I had in mind. I can't
quite tell what it's doing with the additional branches and "goto
retry", but I meant something pretty simple:

- if short, do one element at a time and return
- if long, do one block unconditionally, then round the start pointer
up so that "end - start" is an exact multiple of blocks, and loop over
them




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-18 Thread John Naylor
On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada  wrote:
>
> On Sun, Mar 17, 2024 at 11:46 AM John Naylor  wrote:

> > Random offsets is what I was thinking of (if made distinct and
> > ordered), but even there the code is fairly trivial, so I don't have a
> > strong feeling about it.
>
> Agreed.

Looks good.

A related thing I should mention is that the tests which look up all
possible offsets are really expensive with the number of blocks we're
using now (assert build):

v70 0.33s
v72 1.15s
v73 1.32s

To trim that back, I think we should give up on using shared memory
for the is-full test: We can cause aset to malloc a new block with a
lot fewer entries. In the attached, this brings it back down to 0.43s.
It might also be worth reducing the number of blocks in the random
test -- multiple runs will have different offsets anyway.

> > I think we can stop including the debug-tid-store patch for CI now.
> > That would allow getting rid of some unnecessary variables.
>
> Agreed.

Okay, all that remains here is to get rid of those variables (might be
just one).

> > + * Scan the TidStore and return a pointer to TidStoreIterResult that has 
> > TIDs
> > + * in one block. We return the block numbers in ascending order and the 
> > offset
> > + * numbers in each result is also sorted in ascending order.
> > + */
> > +TidStoreIterResult *
> > +TidStoreIterateNext(TidStoreIter *iter)
> >
> > The wording is a bit awkward.
>
> Fixed.

- * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
- * in one block. We return the block numbers in ascending order and the offset
- * numbers in each result is also sorted in ascending order.
+ * Scan the TidStore and return the TIDs of the next block. The returned block
+ * numbers is sorted in ascending order, and the offset numbers in each result
+ * is also sorted in ascending order.

Better, but it's still not very clear. Maybe "The offsets in each
iteration result are ordered, as are the block numbers over all
iterations."

> > +/* Extract TIDs from the given key-value pair */
> > +static void
> > +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key,
> > BlocktableEntry *page)
> >
> > This is a leftover from the old encoding scheme. This should really
> > take a "BlockNumber blockno" not a "key", and the only call site
> > should probably cast the uint64 to BlockNumber.
>
> Fixed.

This part looks good. I didn't notice earlier, but this comment has a
similar issue

@@ -384,14 +391,15 @@ TidStoreIterateNext(TidStoreIter *iter)
  return NULL;

  /* Collect TIDs extracted from the key-value pair */
- tidstore_iter_extract_tids(iter, key, page);
+ tidstore_iter_extract_tids(iter, (BlockNumber) key, page);

..."extracted" was once a separate operation. I think just removing
that one word is enough to update it.

Some other review on code comments:

v73-0001:

+ /* Enlarge the TID array if necessary */

It's "arrays" now.

v73-0005:

+-- Random TIDs test. We insert TIDs for 1000 blocks. Each block has
+-- different randon 100 offset numbers each other.

The numbers are obvious from the query. Maybe just mention that the
offsets are randomized and must be unique and ordered.

+ * The caller is responsible for release any locks.

"releasing"

> > +typedef struct BlocktableEntry
> > +{
> > + uint16 nwords;
> > + bitmapword words[FLEXIBLE_ARRAY_MEMBER];
> > +} BlocktableEntry;
> >
> > In my WIP for runtime-embeddable offsets, nwords needs to be one byte.

I should be more clear here: nwords fitting into one byte allows 3
embedded offsets (1 on 32-bit platforms, which is good for testing at
least). With uint16 nwords that reduces to 2 (none on 32-bit
platforms). Further, after the current patch series is fully
committed, I plan to split the embedded-offset patch into two parts:
The first would store the offsets in the header, but would still need
a (smaller) allocation. The second would embed them in the child
pointer. Only the second patch will care about the size of nwords
because it needs to reserve a byte for the pointer tag.
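
To illustrate, a rough sketch of the eventual layout (not the current
patch; NUM_FULL_OFFSETS is invented here for the example):

/* number of offsets that fit in the header padding */
#define NUM_FULL_OFFSETS \
	((sizeof(uintptr_t) - sizeof(uint8) - sizeof(int8)) / sizeof(OffsetNumber))

typedef struct BlocktableEntry
{
	struct
	{
		uint8		flags;		/* reserves room for the pointer tag */
		int8		nwords;		/* number of bitmapwords, 0 if embedded */

		/* 3 offsets on 64-bit platforms, 1 on 32-bit */
		OffsetNumber full_offsets[NUM_FULL_OFFSETS];
	}			header;

	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];
} BlocktableEntry;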

> > That doesn't have any real-world effect on the largest offset
> > encountered, and only in 32-bit builds with 32kB block size would the
> > theoretical max change at all. To be precise, we could use in the
> > MaxBlocktableEntrySize calculation:
> >
> > Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1);
>
> I don't get this expression. Making the nwords one byte works well?
> With 8kB blocks, MaxOffsetNumber is 2048 and it requires 256
> bitmapword entries on 64-bit OS or 512 bitmapword entries on 32-bit
> OS, respectively. One byte nwords variable seems not to be sufficient.

Re: add AVX2 support to simd.h

2024-03-16 Thread John Naylor
On Sat, Mar 16, 2024 at 2:40 AM Nathan Bossart  wrote:
>
> On Fri, Mar 15, 2024 at 12:41:49PM -0500, Nathan Bossart wrote:
> > I've also attached the results of running this benchmark on my machine at
> > HEAD, after applying 0001, and after applying both 0001 and 0002.  0001
> > appears to work pretty well.  When there is a small "tail," it regresses a
> > small amount, but overall, it seems to improve more cases than it harms.
> > 0002 does regress searches on smaller arrays quite a bit, since it
> > postpones the SIMD optimizations until the arrays are longer.  It might be
> > possible to mitigate by using 2 registers when the "tail" is long enough,
> > but I have yet to try that.
>
> The attached 0003 is a sketch of what such mitigation might look like.  It
> appears to help with the regressions nicely.  I omitted the benchmarking
> patch in v3 to appease cfbot.

I haven't looked at the patches, but the graphs look good.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-16 Thread John Naylor
On Fri, Mar 15, 2024 at 9:17 PM Masahiko Sawada  wrote:
>
> On Fri, Mar 15, 2024 at 4:36 PM John Naylor  wrote:
> >
> > On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada  
> > wrote:

> > > Given TidStoreSetBlockOffsets() is designed to always set (i.e.
> > > overwrite) the value, I think we should not expect that found is
> > > always false.
> >
> > I find that a puzzling statement, since 1) it was designed for
> > insert-only workloads, not actual overwrite IIRC and 2) the tests will
> > now fail if the same block is set twice, since we just switched the
> > tests to use a remnant of vacuum's old array. Having said that, I
> > don't object to removing artificial barriers to using it for purposes
> > not yet imagined, as long as test_tidstore.sql warns against that.
>
> I think that if it supports only insert-only workload and expects the
> same block is set only once, it should raise an error rather than an
> assertion. It's odd to me that the function fails only in an
> assertion build even though it actually works fine even in
> that case.

After thinking some more, I think you're right -- it's too
heavy-handed to throw an error/assert and a public function shouldn't
make assumptions about the caller. It's probably just a matter of
documenting the function (and it's lack of generality), and the tests
(which are based on the thing we're replacing).

> As for test_tidstore you're right that the test code doesn't handle
> the case where setting the same block twice. I think that there is no
> problem in the fixed-TIDs tests, but we would need something for
> random-TIDs tests so that we don't set the same block twice. I guess
> it could be trivial since we can use SQL queries to generate TIDs. I'm
> not sure how the random-TIDs tests would be like, but I think we can
> use SELECT DISTINCT to eliminate the duplicates of block numbers to
> use.

Also, I don't think we need random blocks, since the radix tree tests
exercise that heavily already.

Random offsets is what I was thinking of (if made distinct and
ordered), but even there the code is fairly trivial, so I don't have a
strong feeling about it.

> > Given the above two things, I think this function's comment needs
> > stronger language about its limitations. Perhaps even mention that
> > it's intended for, and optimized for, vacuum. You and I have long
> > known that tidstore would need a separate, more complex, function to
> > add or remove individual tids from existing entries, but it might be
> > good to have that documented.
>
> Agreed.

How about this:

 /*
- * Set the given TIDs on the blkno to TidStore.
+ * Create or replace an entry for the given block and array of offsets
  *
- * NB: the offset numbers in offsets must be sorted in ascending order.
+ * NB: This function is designed and optimized for vacuum's heap scanning
+ * phase, so has some limitations:
+ * - The offset numbers in "offsets" must be sorted in ascending order.
+ * - If the block number already exists, the entry will be replaced --
+ *   there is no way to add or remove offsets from an entry.
  */
 void
 TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,

I think we can stop including the debug-tid-store patch for CI now.
That would allow getting rid of some unnecessary variables. More
comments:

+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, TidStoreEndIterate() needs to be called when finished.

+ * Concurrent updates during the iteration will be blocked when inserting a
+ * key-value to the radix tree.

This is outdated. Locking is optional. The remaining real reason now
is that TidStoreEndIterate needs to free memory. We probably need to
say something about locking, too, but not this.

+ * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
+ * in one block. We return the block numbers in ascending order and the offset
+ * numbers in each result is also sorted in ascending order.
+ */
+TidStoreIterResult *
+TidStoreIterateNext(TidStoreIter *iter)

The wording is a bit awkward.

+/*
+ * Finish an iteration over TidStore. This needs to be called after finishing
+ * or when existing an iteration.
+ */

s/existing/exiting/ ?

It seems to say we need to finish after finishing. Maybe more precise wording.

+/* Extract TIDs from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key,
BlocktableEntry *page)

This is a leftover from the old encoding scheme. This should really
take a "BlockNumber blockno" not a "key", and the only call site
should probably cast the uint64 to BlockNumber.

+ * tidstore.h
+ *   Tid storage.
+ *
+ *

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-15 Thread John Naylor
On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada  wrote:
>
> On Thu, Mar 14, 2024 at 6:55 PM John Naylor  wrote:
> >
> > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor  
> > > wrote:
> > > > Okay, here's an another idea: Change test_lookup_tids() to be more
> > > > general and put the validation down into C as well. First we save the
> > > > blocks from do_set_block_offsets() into a table, then with all those
> > > > blocks lookup a sufficiently-large range of possible offsets and save
> > > > found values in another array. So the static items structure would
> > > > have 3 arrays: inserts, successful lookups, and iteration (currently
> > > > the iteration output is private to check_set_block_offsets(). Then
> > > > sort as needed and check they are all the same.
> > >
> > > That's a promising idea. We can use the same mechanism for randomized
> > > tests too. If you're going to work on this, I'll do other tests on my
> > > environment in the meantime.
> >
> > Some progress on this in v72 -- I tried first without using SQL to
> > save the blocks, just using the unique blocks from the verification
> > array. It seems to work fine.
>
> Thanks!

Seems I forgot the attachment last time...there's more stuff now
anyway, based on discussion.

> > - Since there are now three arrays we should reduce max bytes to
> > something smaller.
>
> Agreed.

I went further than this, see below.

> > - Further on that, I'm not sure if the "is full" test is telling us
> > much. It seems we could make max bytes a static variable and set it to
> > the size of the empty store. I'm guessing it wouldn't take much to add
> > enough tids so that the contexts need to allocate some blocks, and
> > then it would appear full and we can test that. I've made it so all
> > arrays repalloc when needed, just in case.
>
> How about using work_mem as max_bytes instead of having it as a static
> variable? In test_tidstore.sql we set work_mem before creating the
> tidstore. It would make the tidstore more controllable by SQL queries.

My complaint is that the "is full" test is trivial, and also strange
in that max_bytes is used for two unrelated things:

- the initial size of the verification arrays, which was always larger
than necessary, and now there are three of them
- the hint to TidStoreCreate to calculate its max block size / the
threshold for being "full"

To make the "is_full" test slightly less trivial, my idea is to save
the empty store size and later add enough tids so that it has to
allocate new blocks/DSA segments, which is not that many, and then it
will appear full. I've done this and also separated the purpose of
various sizes in v72-0009/10.
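
In code form, roughly (the accessor name is invented for illustration):

static size_t empty_size;

/* at creation time, remember the size reported by the empty store */
empty_size = TidStoreMemoryUsage(tidstore);

...

/* after adding enough tids to force new block/DSA allocations */
if (TidStoreMemoryUsage(tidstore) > empty_size)
	elog(NOTICE, "tidstore is full");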

Using actual work_mem seems a bit more difficult to make this work.

> > - I'm not sure it's useful to keep test_lookup_tids() around. Since we
> > now have a separate lookup test, the only thing it can tell us is that
> > lookups fail on an empty store. I arranged it so that
> > check_set_block_offsets() works on an empty store. Although that's
> > even more trivial, it's just reusing what we already need.
>
> Agreed.

Removed in v72-0007

On Fri, Mar 15, 2024 at 9:49 AM Masahiko Sawada  wrote:
>
> I have two questions on tidstore.c:
>
> +/*
> + * Set the given TIDs on the blkno to TidStore.
> + *
> + * NB: the offset numbers in offsets must be sorted in ascending order.
> + */
>
> Do we need some assertions to check if the given offset numbers are
> sorted expectedly?

Done in v72-0008

> ---
> +   if (TidStoreIsShared(ts))
> +   found = shared_rt_set(ts->tree.shared, blkno, page);
> +   else
> +   found = local_rt_set(ts->tree.local, blkno, page);
> +
> +   Assert(!found);
>
> Given TidStoreSetBlockOffsets() is designed to always set (i.e.
> overwrite) the value, I think we should not expect that found is
> always false.

I find that a puzzling statement, since 1) it was designed for
insert-only workloads, not actual overwrite IIRC and 2) the tests will
now fail if the same block is set twice, since we just switched the
tests to use a remnant of vacuum's old array. Having said that, I
don't object to removing artificial barriers to using it for purposes
not yet imagined, as long as test_tidstore.sql warns against that.

Given the above two things, I think this function's comment needs
stronger language about its limitations. Perhaps even mention that
it's intended for, and optimized for, vacuum. You and I have long
known that tidstore would need a separate, more complex, function to
add or remove individual tids from existing entries, but it might be
good to have that documented.

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-14 Thread John Naylor
On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada  wrote:
>
> On Thu, Mar 14, 2024 at 1:29 PM John Naylor  wrote:
> > Okay, here's an another idea: Change test_lookup_tids() to be more
> > general and put the validation down into C as well. First we save the
> > blocks from do_set_block_offsets() into a table, then with all those
> > blocks lookup a sufficiently-large range of possible offsets and save
> > found values in another array. So the static items structure would
> > have 3 arrays: inserts, successful lookups, and iteration (currently
> > the iteration output is private to check_set_block_offsets(). Then
> > sort as needed and check they are all the same.
>
> That's a promising idea. We can use the same mechanism for randomized
> tests too. If you're going to work on this, I'll do other tests on my
> environment in the meantime.

Some progress on this in v72 -- I tried first without using SQL to
save the blocks, just using the unique blocks from the verification
array. It seems to work fine. Some open questions on the test module:

- Since there are now three arrays we should reduce max bytes to
something smaller.
- Further on that, I'm not sure if the "is full" test is telling us
much. It seems we could make max bytes a static variable and set it to
the size of the empty store. I'm guessing it wouldn't take much to add
enough tids so that the contexts need to allocate some blocks, and
then it would appear full and we can test that. I've made it so all
arrays repalloc when needed, just in case.
- Why are we switching to TopMemoryContext? It's not explained -- the
comment only tells what the code is doing (which is obvious), but not
why.
- I'm not sure it's useful to keep test_lookup_tids() around. Since we
now have a separate lookup test, the only thing it can tell us is that
lookups fail on an empty store. I arranged it so that
check_set_block_offsets() works on an empty store. Although that's
even more trivial, it's just reusing what we already need.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-13 Thread John Naylor
On Thu, Mar 14, 2024 at 8:53 AM Masahiko Sawada  wrote:
>
> On Thu, Mar 14, 2024 at 9:59 AM John Naylor  wrote:
> > > BTW do we still want to test the tidstore by using a combination of
> > > SQL functions? We might no longer need to input TIDs via a SQL
> > > function.
> >
> > I'm not sure. I stopped short of doing that to get feedback on this
> > much. One advantage with SQL functions is we can use generate_series
> > to easily input lists of blocks with different numbers and strides,
> > and array literals for offsets are a bit easier. What do you think?
>
> While I'm not a fan of the following part, I agree that it makes sense
> to use SQL functions for test data generation:
>
> -- Constant values used in the tests.
> \set maxblkno 4294967295
> -- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is 
> 291.
> -- We use a higher number to test tidstore.
> \set maxoffset 512

I'm not really a fan of these either; they could be removed at some
point if we've done everything else nicely.

> It would also be easier for developers to test the tidstore with their
> own data set. So I agreed with the current approach; use SQL functions
> for data generation and do the actual tests inside C functions.

Okay, here's an another idea: Change test_lookup_tids() to be more
general and put the validation down into C as well. First we save the
blocks from do_set_block_offsets() into a table, then with all those
blocks lookup a sufficiently-large range of possible offsets and save
found values in another array. So the static items structure would
have 3 arrays: inserts, successful lookups, and iteration (currently
the iteration output is private to check_set_block_offsets(). Then
sort as needed and check they are all the same.
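
Concretely, I'm thinking of something like this for the test module's
state (placeholder names, nothing settled):

/* accumulated TIDs, to be sorted and compared pairwise at the end */
typedef struct ItemArray
{
	ItemPointerData *insert_tids;	/* input to do_set_block_offsets() */
	ItemPointerData *lookup_tids;	/* TIDs the lookup pass found */
	ItemPointerData *iter_tids;		/* output of iteration, in order */
	int			num_tids;
	int			max_tids;
} ItemArray;

static ItemArray items;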

Further thought: We may not really need to test block numbers that
vigorously, since the radix tree tests should cover keys/values pretty
well. The difference here is using bitmaps of tids and that should be
well covered.

Locally (not CI), we should try big inputs to make sure we can
actually go up to many GB -- it's easier and faster this way than
having vacuum give us a large data set.

> Is it
> convenient for developers if we have functions like generate_tids()
> and generate_random_tids() to generate TIDs so that they can pass them
> to do_set_block_offsets()?

I guess I don't see the advantage of adding a layer of indirection at
this point, but it could be useful at a later time.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-13 Thread John Naylor
On Wed, Mar 13, 2024 at 9:29 PM Masahiko Sawada  wrote:
>
> On Wed, Mar 13, 2024 at 8:05 PM John Naylor  wrote:
> >
> > On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada  
> > wrote:
> >
> > > As I mentioned above, if we implement the test cases in C, we can use
> > > the debug-build array in the test code. And we won't use it in AND/OR
> > > operations tests in the future.
> >
> > That's a really interesting idea, so I went ahead and tried that for
> > v71. This seems like a good basis for testing larger, randomized
> > inputs, once we decide how best to hide that from the expected output.
> > The tests use SQL functions do_set_block_offsets() and
> > check_set_block_offsets(). The latter does two checks against a tid
> > array, and replaces test_dump_tids().
>
> Great! I think that's a very good starter.
>
> The lookup_test() (and test_lookup_tids()) do also test that the
> IsMember() function returns false as expected if the TID doesn't exist
> in it, and probably we can do these tests in a C function too.
>
> BTW do we still want to test the tidstore by using a combination of
> SQL functions? We might no longer need to input TIDs via a SQL
> function.

I'm not sure. I stopped short of doing that to get feedback on this
much. One advantage with SQL functions is we can use generate_series
to easily input lists of blocks with different numbers and strides,
and array literals for offsets are a bit easier. What do you think?




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-13 Thread John Naylor
 	if (wordnum >= page->nwords)
-		return false;
+	{
+		ret = false;
+		goto done;
+	}
 
 	ret = (page->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0;
 
+done:
+#ifdef TIDSTORE_DEBUG
+	if (!TidStoreIsShared(ts))
+		Assert(ret == ret_debug);
+#endif
+
 	return ret;
 }
 
@@ -360,6 +430,11 @@ TidStoreBeginIterate(TidStore *ts)
 	else
 		iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
 
+#ifdef TIDSTORE_DEBUG
+	if (!TidStoreIsShared(ts))
+		iter->tids_idx = 0;
+#endif
+
 	return iter;
 }
 
@@ -387,6 +462,11 @@ TidStoreIterateNext(TidStoreIter *iter)
 	/* Collect TIDs extracted from the key-value pair */
 	tidstore_iter_extract_tids(iter, key, page);
 
+#ifdef TIDSTORE_DEBUG
+	if (!TidStoreIsShared(iter->ts))
+		ts_debug_iter_check_tids(iter);
+#endif
+
 	return result;
 }
 
@@ -459,3 +539,122 @@ tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, BlocktableEntry *page
 		}
 	}
 }
+
+#ifdef TIDSTORE_DEBUG
+/* Comparator routines for ItemPointer */
+static int
+itemptr_cmp(const void *left, const void *right)
+{
+	BlockNumber lblk,
+		rblk;
+	OffsetNumber loff,
+		roff;
+
+	lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+	rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+	if (lblk < rblk)
+		return -1;
+	if (lblk > rblk)
+		return 1;
+
+	loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+	roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+	if (loff < roff)
+		return -1;
+	if (loff > roff)
+		return 1;
+
+	return 0;
+}
+
+/* Insert TIDs to the TID array for debugging */
+static void
+ts_debug_set_block_offsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+		   int num_offsets)
+{
+	if (ts->num_tids > 0 &&
+		blkno < ItemPointerGetBlockNumber(&(ts->tids[ts->num_tids - 1])))
+	{
+		/* The array will be sorted at ts_debug_is_member() */
+		ts->tids_unordered = true;
+	}
+
+	for (int i = 0; i < num_offsets; i++)
+	{
+		ItemPointer tid;
+		int idx = ts->num_tids + i;
+
+		/* Enlarge the TID array if necessary */
+		if (idx >= ts->max_tids)
+		{
+			ts->max_tids *= 2;
+			ts->tids = repalloc(ts->tids, sizeof(ItemPointerData) * ts->max_tids);
+		}
+
+		tid = &(ts->tids[idx]);
+
+		ItemPointerSetBlockNumber(tid, blkno);
+		ItemPointerSetOffsetNumber(tid, offsets[i]);
+	}
+
+	ts->num_tids += num_offsets;
+}
+
+/* Return true if the given TID is present in the TID array */
+static bool
+ts_debug_is_member(TidStore *ts, ItemPointer tid)
+{
+	int64	litem,
+		ritem,
+		item;
+	ItemPointer res;
+
+	if (ts->num_tids == 0)
+		return false;
+
+	/* Make sure the TID array is sorted */
+	if (ts->tids_unordered)
+	{
+		qsort(ts->tids, ts->num_tids, sizeof(ItemPointerData), itemptr_cmp);
+		ts->tids_unordered = false;
+	}
+
+	litem = itemptr_encode(&ts->tids[0]);
+	ritem = itemptr_encode(&ts->tids[ts->num_tids - 1]);
+	item = itemptr_encode(tid);
+
+	/*
+	 * Doing a simple bound check before bsearch() is useful to avoid the
+	 * extra cost of bsearch(), especially if dead items on the heap are
+	 * concentrated in a certain range.	Since this function is called for
+	 * every index tuple, it pays to be really fast.
+	 */
+	if (item < litem || item > ritem)
+		return false;
+
+	res = bsearch(tid, ts->tids, ts->num_tids, sizeof(ItemPointerData),
+  itemptr_cmp);
+
+	return (res != NULL);
+}
+
+/* Verify if the iterator output matches the TIDs in the array for debugging */
+static void
+ts_debug_iter_check_tids(TidStoreIter *iter)
+{
+	BlockNumber blkno = iter->output.blkno;
+
+	for (int i = 0; i < iter->output.num_offsets; i++)
+	{
+		ItemPointer tid = &(iter->ts->tids[iter->tids_idx + i]);
+
+		Assert((iter->tids_idx + i) < iter->ts->max_tids);
+		Assert(ItemPointerGetBlockNumber(tid) == blkno);
+		Assert(ItemPointerGetOffsetNumber(tid) == iter->output.offsets[i]);
+	}
+
+	iter->tids_idx += iter->output.num_offsets;
+}
+#endif
-- 
2.44.0

From 4ed9f578c2f8d20f4477c80703a085b3dd3bef01 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Wed, 13 Mar 2024 14:56:56 +0700
Subject: [PATCH v71 3/6] DEV: Fix failure to sort debug array for iteration

---
 src/backend/access/common/tidstore.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 33753d8ed2..4f5882818b 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -433,6 +433,13 @@ TidStoreBeginIterate(TidStore *ts)
 #ifdef TIDSTORE_DEBUG
 	if (!TidStoreIsShared(ts))
 		iter->tids_idx = 0;
+
+	/* Make sure the TID array is sorted */
+	if (ts->tids_unordered)
+	{
+		qsort(ts->tids, ts->num_tids, sizeof(ItemPointerData), itemptr_cmp);
+		ts->tids_unordered = false;
+	}
 #endif
 
 	return iter;
-- 
2.44.0

From 753c38866b5185ce399f214e05e1869abb8d2dc2 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Wed, 13 Ma

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-12 Thread John Naylor
On Mon, Mar 11, 2024 at 3:13 PM Masahiko Sawada  wrote:
>
> On Mon, Mar 11, 2024 at 12:20 PM John Naylor  wrote:
> >
> > On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada  
> > wrote:

> > + ts->context = CurrentMemoryContext;
> >
> > As far as I can tell, this member is never accessed again -- am I
> > missing something?
>
> You're right. It was used to re-create the tidstore in the same
> context again while resetting it, but we no longer support the reset
> API. Considering it again, would it be better to allocate the iterator
> struct in the same context as we store the tidstore struct?

That makes sense.

> > + /* DSA for tidstore will be detached at the end of session */
> >
> > No other test module pins the mapping, but that doesn't necessarily
> > mean it's wrong. Is there some advantage over explicitly detaching?
>
> One small benefit of not explicitly detaching dsa_area in
> tidstore_destroy() would be simplicity; IIUC if we want to do that, we
> need to remember the dsa_area using (for example) a static variable,
> and free it if it's non-NULL. I've implemented this idea in the
> attached patch.

Okay, I don't have a strong preference at this point.

> > +-- Add tids in random order.
> >
> > I don't see any randomization here. I do remember adding row_number to
> > remove whitespace in the output, but I don't remember a random order.
> > On that subject, the row_number was an easy trick to avoid extra
> > whitespace, but maybe we should just teach the setting function to
> > return blocknumber rather than null?
>
> Good idea, fixed.

+ test_set_block_offsets
+
+ 2147483647
+  0
+ 4294967294
+  1
+ 4294967295

Hmm, was the earlier comment about randomness referring to this? I'm
not sure what other regression tests do in these cases, or how
reliable this is. If this is a problem, we could simply insert this
result into a temp table so it's not output.

> > +Datum
> > +tidstore_create(PG_FUNCTION_ARGS)
> > +{
> > ...
> > + tidstore = TidStoreCreate(max_bytes, dsa);
> >
> > +Datum
> > +tidstore_set_block_offsets(PG_FUNCTION_ARGS)
> > +{
> > 
> > + TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs);
> >
> > These names are too similar. Maybe the test module should do
> > s/tidstore_/test_/ or similar.
>
> Agreed.

Mostly okay, although a couple look a bit generic now. I'll leave it
up to you if you want to tweak things.

> > In general, the .sql file is still very hard-coded. Functions are
> > created that contain a VALUES statement. Maybe it's okay for now, but
> > wanted to mention it. Ideally, we'd have some randomized tests,
> > without having to display it. That could be in addition to (not
> > replacing) the small tests we have that display input. (see below)
> >
>
> Agreed to add randomized tests in addition to the existing tests.

I'll try something tomorrow.

> Sounds a good idea. In fact, if there are some bugs in tidstore, it's
> likely that even initdb would fail in practice. However, it's a very
> good idea that we can test the tidstore anyway with such a check
> without a debug-build array.
>
> Or as another idea, I wonder if we could keep the debug-build array in
> some form. For example, we use the array with the particular build
> flag and set a BF animal for that. That way, we can test the tidstore
> in more real cases.

I think the purpose of a debug flag is to help developers catch
mistakes. I don't think it's quite useful enough for that. For one, it
has the same 1GB limitation as vacuum's current array. For another,
it'd be a terrible way to debug moving tidbitmap.c from its hash table
to use TID store -- AND/OR operations and lossy pages are pretty much
undoable with a copy of vacuum's array. Last year, when I insisted on
trying a long term realistic load that compares the result with the
array, the encoding scheme was much harder to understand in code. I
think it's now easier, and there are better tests.

> In the latest (v69) patch:
>
> - squashed v68-0005 and v68-0006 patches.
> - removed most of the changes in v68-0007 patch.
> - addressed above review comments in v69-0002 patch.
> - v69-0003, 0004, and 0005 are miscellaneous updates.
>
> As for renaming TidStore to TIDStore, I dropped the patch for now
> since it seems we're using "Tid" in some function names and variable
> names. If we want to update it, we can do that later.

I think we're not consistent across the codebase, and it's fine.

Re: Add bump memory context type and use it for tuplesorts

2024-03-12 Thread John Naylor
On Tue, Mar 12, 2024 at 6:41 AM David Rowley  wrote:
> Thanks for trying this out.  I didn't check if the performance was
> susceptible to the memory size before the reset.  It certainly would
> be once the allocation crosses some critical threshold of CPU cache
> size, but probably it will also be to some extent regarding the number
> of actual mallocs that are required underneath.

I neglected to mention it, but the numbers I chose did have the L2/L3
cache in mind, but the reset frequency didn't seem to make much
difference.

> I see there's some discussion of bump in [1].  Do you still have a
> valid use case for bump for performance/memory usage reasons?

Yeah, that was part of my motivation for helping test, although my
interest is in saving memory in cases of lots of small allocations. It
might help if I make this a little more concrete, so I wrote a
quick-and-dirty function to measure the bytes used by the proposed TID
store and the vacuum's current array.

Using bitmaps really shines with a high number of offsets per block,
e.g. with about a million sequential blocks, and 49 offsets per block
(last parameter is a bound):

select * from tidstore_memory(0,1*1001*1000, 1,50);
 array_mem |  ts_mem
---+--
 294294000 | 42008576

The break-even point with this scenario is around 7 offsets per block:

select * from tidstore_memory(0,1*1001*1000, 1,8);
 array_mem |  ts_mem
---+--
  42042000 | 42008576
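
(As a sanity check on array_mem, assuming 6 bytes per ItemPointerData:
1,001,000 blocks * 7 offsets * 6 bytes = 42,042,000, matching the
result here, and 1,001,000 * 49 * 6 = 294,294,000 for the first query.)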

Below that, the array gets smaller, but the bitmap just has more empty
space. Here, about 8 million bytes are used by chunk headers in the
bitmap allocations (roughly a million block entries, each paying an
8-byte header), so the bump context would help there (I haven't
actually tried).
have a draft patch to store up to 3 offsets in the last-level node's
pointer array, so for 2 or 3 offsets per block we're smaller than the
array again:

select * from tidstore_memory(0,1*1001*1000, 1,4);
 array_mem | ts_mem
---+-
  18018000 | 8462336

Sequential blocks are not the worst case scenario for memory use, but
this gives an idea of what's involved. So, with aset, on average I
still expect to use quite a bit less memory, with some corner cases
that use more. The bump context would be some extra insurance to
reduce those corner cases, where there are a large number of blocks in
play.




Re: Add bump memory context type and use it for tuplesorts

2024-03-11 Thread John Naylor
On Tue, Mar 5, 2024 at 9:42 AM David Rowley  wrote:
> performance against the recently optimised aset, generation and slab
> contexts.  The attached graph shows the time it took in seconds to
> allocate 1GB of memory performing a context reset after 1MB. The
> function I ran the test on is in the attached
> pg_allocate_memory_test.patch.txt file.
>
> The query I ran was:
>
> select chksz,mtype,pg_allocate_memory_test_reset(chksz,
> 1024*1024,1024*1024*1024, mtype) from (values(8),(16),(32),(64))
> sizes(chksz),(values('aset'),('generation'),('slab'),('bump'))
> cxt(mtype) order by mtype,chksz;

I ran the test function, but using 256kB and 3MB for the reset
frequency, and with 8,16,24,32 byte sizes (patched against a commit
after the recent hot/cold path separation). Images attached. I also
get a decent speedup with the bump context, but not quite as dramatic
as on your machine. It's worth noting that slab is the slowest for me.
This is an Intel i7-10750H.


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-11 Thread John Naylor
On Fri, Feb 16, 2024 at 10:05 AM Masahiko Sawada  wrote:
>
> On Thu, Feb 15, 2024 at 8:26 PM John Naylor  wrote:

> > v61-0007: Runtime-embeddable tids -- Optional for v17, but should
> > reduce memory regressions, so should be considered. Up to 3 tids can
> > be stored in the last level child pointer. It's not polished, but I'll
> > only proceed with that if we think we need this. "flags" iis called
> > that because it could hold tidbitmap.c booleans (recheck, lossy) in
> > the future, in addition to reserving space for the pointer tag. Note:
> > I hacked the tests to only have 2 offsets per block to demo, but of
> > course both paths should be tested.
>
> Interesting. I've run the same benchmark tests we did[1][2] (the
> median of 3 runs):
[found a big speed-up where we don't expect one]

I tried to reproduce this with a similar patch (attached), rebased on
top of a bug you recently fixed (possibly related?). It also shows one
way to address some lack of coverage in the debug build, for as long
as we test that with CI.

Fortunately I cannot see a difference, so I believe it's not affecting
the case in this test at all, as expected:

v68:

INFO:  finished vacuuming "john.public.test": index scans: 1
pages: 0 removed, 442478 remain, 88478 scanned (20.00% of total)
tuples: 19995999 removed, 80003979 remain, 0 are dead but not yet removable
removable cutoff: 770, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan needed: 88478 pages from table (20.00% of total) had
19995999 dead item identifiers removed
index "test_x_idx": pages: 274194 in total, 54822 newly deleted, 54822
currently deleted, 0 reusable
avg read rate: 620.356 MB/s, avg write rate: 124.105 MB/s
buffer usage: 758236 hits, 274196 misses, 54854 dirtied
WAL usage: 2 records, 0 full page images, 425 bytes

system usage: CPU: user: 3.74 s, system: 0.68 s, elapsed: 4.45 s
system usage: CPU: user: 3.02 s, system: 0.42 s, elapsed: 3.47 s
system usage: CPU: user: 3.09 s, system: 0.38 s, elapsed: 3.49 s
system usage: CPU: user: 3.00 s, system: 0.43 s, elapsed: 3.45 s

v68 + emb values (that cannot be used because > 3 tids per block):

INFO:  finished vacuuming "john.public.test": index scans: 1
pages: 0 removed, 442478 remain, 88478 scanned (20.00% of total)
tuples: 19995999 removed, 80003979 remain, 0 are dead but not yet removable
removable cutoff: 775, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan needed: 88478 pages from table (20.00% of total) had
19995999 dead item identifiers removed
index "test_x_idx": pages: 274194 in total, 54822 newly deleted, 54822
currently deleted, 0 reusable
avg read rate: 570.808 MB/s, avg write rate: 114.192 MB/s
buffer usage: 758236 hits, 274196 misses, 54854 dirtied
WAL usage: 2 records, 0 full page images, 425 bytes

system usage: CPU: user: 3.11 s, system: 0.62 s, elapsed: 3.75 s
system usage: CPU: user: 3.04 s, system: 0.41 s, elapsed: 3.46 s
system usage: CPU: user: 3.05 s, system: 0.41 s, elapsed: 3.47 s
system usage: CPU: user: 3.04 s, system: 0.43 s, elapsed: 3.49 s

I'll continue polishing the runtime-embeddable values patch as time
permits, for later consideration.


WIP-3-embeddable-TIDs.patch.nocfbot
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-10 Thread John Naylor
On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada  wrote:
>
> I've attached the remaining patches for CI. I've made some minor
> changes in separate patches and drafted the commit message for
> tidstore patch.
>
> While reviewing the tidstore code, I thought that it would be more
> appropriate to place tidstore.c under src/backend/lib instead of
> src/backend/common/access since the tidstore is no longer implemented
> only for heap or other access methods, and it might also be used by
> executor nodes in the future. What do you think?

That's a heck of a good question. I don't think src/backend/lib is
right -- it seems that's for general-purpose data structures.
Something like backend/utils is also too general.
src/backend/access/common has things for tuple descriptors, toast,
sessions, and I don't think tidstore is out of place here. I'm not
sure there's a better place, but I could be convinced otherwise.

v68-0001:

I'm not sure if commit messages are much of a subject of review, and it's
up to the committer, but I'll share a couple comments just as
something to think about, not something I would ask you to change: I
think it's a bit distracting that the commit message talks about the
justification to use it for vacuum. Let's save that for the commit
with actual vacuum changes. Also, I suspect saying there are a "wide
range" of uses is over-selling it a bit, and that paragraph is a bit
awkward aside from that.

+ /* Collect TIDs extracted from the key-value pair */
+ result->num_offsets = 0;
+

This comment has nothing at all to do with this line. If the comment
is for several lines following, some of which are separated by blank
lines, there should be a blank line after the comment. Also, why isn't
tidstore_iter_extract_tids() responsible for setting that to zero?

+ ts->context = CurrentMemoryContext;

As far as I can tell, this member is never accessed again -- am I
missing something?

+ /* DSA for tidstore will be detached at the end of session */

No other test module pins the mapping, but that doesn't necessarily
mean it's wrong. Is there some advantage over explicitly detaching?

+-- Add tids in random order.

I don't see any randomization here. I do remember adding row_number to
remove whitespace in the output, but I don't remember a random order.
On that subject, the row_number was an easy trick to avoid extra
whitespace, but maybe we should just teach the setting function to
return blocknumber rather than null?

+Datum
+tidstore_create(PG_FUNCTION_ARGS)
+{
...
+ tidstore = TidStoreCreate(max_bytes, dsa);

+Datum
+tidstore_set_block_offsets(PG_FUNCTION_ARGS)
+{

+ TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs);

These names are too similar. Maybe the test module should do
s/tidstore_/test_/ or similar.

+/* Sanity check if we've called tidstore_create() */
+static void
+check_tidstore_available(void)
+{
+ if (tidstore == NULL)
+ elog(ERROR, "tidstore is not initialized");
+}

I don't find this very helpful. If a developer wiped out the create
call, wouldn't the test crash and burn pretty obviously?

In general, the .sql file is still very hard-coded. Functions are
created that contain a VALUES statement. Maybe it's okay for now, but
wanted to mention it. Ideally, we'd have some randomized tests,
without having to display it. That could be in addition to (not
replacing) the small tests we have that display input. (see below)


v68-0002:

@@ -329,6 +381,13 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)

  ret = (page->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0;

+#ifdef TIDSTORE_DEBUG
+ if (!TidStoreIsShared(ts))
+ {
+ bool ret_debug = ts_debug_is_member(ts, tid);;
+ Assert(ret == ret_debug);
+ }
+#endif

This is only checking the case where we haven't returned already. In particular...

+ /* no entry for the blk */
+ if (page == NULL)
+ return false;
+
+ wordnum = WORDNUM(off);
+ bitnum = BITNUM(off);
+
+ /* no bitmap for the off */
+ if (wordnum >= page->nwords)
+ return false;

...these results are not checked.

More broadly, it seems like the test module should be able to test
everything that the debug-build array would complain about. Including
ordered iteration. This may require first saving our test input to a
table. We could create a cursor on a query that fetches the ordered
input from the table and verifies that the tid store iterate produces
the same ordered set, maybe with pl/pgSQL. Or something like that.
Seems like not a whole lot of work. I can try later in the week, if
you like.

v68-0005/6 look ready to squash

v68-0008 - I'm not a fan of capitalizing short comment fragments. I use
the style of either: short lower-case phrases, or full sentences
including capitalization, correct grammar and period. I see these two
styles all over the code base, as appropriate.

+ /* Remain attached until end of backend */

We'll probably want this comment, if in fact we want this behavior.

+ /*
+ * Note that funcctx->call_cntr is incremented in SRF_RETURN_NEXT
+ * before return.
+ */

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-07 Thread John Naylor
On Fri, Mar 8, 2024 at 9:53 AM John Naylor  wrote:
>
> On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada  wrote:
> > Yesterday I confirmed that something like the below fixes the
> > problem that happened in Windows CI:
> >
> > --- a/src/test/modules/test_radixtree/meson.build
> > +++ b/src/test/modules/test_radixtree/meson.build
> > @@ -12,6 +12,7 @@ endif
> >
> >  test_radixtree = shared_module('test_radixtree',
> >test_radixtree_sources,
> > +  link_with: host_system == 'windows' ? pgport_srv : [],
> >kwargs: pg_test_mod_args,
> >  )
> >  test_install_libs += test_radixtree
>
> pgport_srv is for backend, shared libraries should be using pgport_shlib

> In any case, I'm trying it in CI branch with pgport_shlib now.

That seems to work, so I'll push that just to get things green again.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-07 Thread John Naylor
On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada  wrote:
> Yesterday I confirmed that something like the below fixes the
> problem that happened in Windows CI:
>
> --- a/src/test/modules/test_radixtree/meson.build
> +++ b/src/test/modules/test_radixtree/meson.build
> @@ -12,6 +12,7 @@ endif
>
>  test_radixtree = shared_module('test_radixtree',
>test_radixtree_sources,
> +  link_with: host_system == 'windows' ? pgport_srv : [],
>kwargs: pg_test_mod_args,
>  )
>  test_install_libs += test_radixtree

pgport_srv is for backend, shared libraries should be using pgport_shlib

Further, the top level meson.build has:

# all shared libraries not part of the backend should depend on this
frontend_shlib_code = declare_dependency(
  include_directories: [postgres_inc],
  link_with: [common_shlib, pgport_shlib],
  sources: generated_headers,
  dependencies: [shlib_code, os_deps, libintl],
)

...but the only things that declare needing frontend_shlib_code are in
src/interfaces/.

In any case, I'm trying it in CI branch with pgport_shlib now.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-07 Thread John Naylor
On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada  wrote:
>
> Yesterday I confirmed that something like the below fixes the
> problem that happened in Windows CI:

Glad you shared before I went and did it.

> --- a/src/test/modules/test_radixtree/meson.build
> +++ b/src/test/modules/test_radixtree/meson.build
> @@ -12,6 +12,7 @@ endif
>
>  test_radixtree = shared_module('test_radixtree',
>test_radixtree_sources,
> +  link_with: host_system == 'windows' ? pgport_srv : [],

I don't see any similar coding elsewhere, so that leaves me wondering
if we're missing something. On the other hand, maybe no test modules
use files in src/port ...

>kwargs: pg_test_mod_args,
>  )
>  test_install_libs += test_radixtree
>
> But I'm not sure it's the right fix especially because I guess it
> could raise "AddressSanitizer: odr-violation" error on Windows.

Well, now it can see zero definitions, so I imagine it's possible that
adding the above would not cause more than one. In any
case, we might not know since as far as I can tell the MSVC animals
don't have address sanitizer. I'll look around some more, and if I
don't get any revelations, I guess we should go with the above.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-07 Thread John Naylor
On Thu, Mar 7, 2024 at 11:15 PM Masahiko Sawada  wrote:
>
> It looks like it requires a link with pgport_srv but I'm not sure. It
> seems that the recent commit 1f1d73a8b breaks CI, Windows - Server
> 2019, VS 2019 - Meson & ninja, too.

Unfortunately, none of the Windows animals happened to run both after
the initial commit and before removing the (seemingly useless on our
daily platforms) link. I'll confirm on my own CI branch in a few
minutes.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-07 Thread John Naylor
On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada  wrote:
>
> On Thu, Mar 7, 2024 at 6:37 PM John Naylor  wrote:

> > $ git grep 'link_with: pgport_srv'
> > src/test/modules/test_radixtree/meson.build:  link_with: pgport_srv,
> >
> > No other test module uses this directive, and indeed, removing this
> > still builds fine for me. Thoughts?
>
> Yeah, it could be the culprit. The test_radixtree/meson.build is the
> sole extension that explicitly specifies a link with pgport_srv. I
> think we can get rid of it, as I've also confirmed the build is still fine
> even without it.

olingo and grassquit have turned green, so that must have been it.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-07 Thread John Naylor
On Thu, Mar 7, 2024 at 1:19 PM John Naylor  wrote:
>
> In addition, olingo and grassquit are showing different kinds of
> "AddressSanitizer: odr-violation" errors, which I'm not sure what to
> make of -- example:

This might be relevant:

$ git grep 'link_with: pgport_srv'
src/test/modules/test_radixtree/meson.build:  link_with: pgport_srv,

No other test module uses this directive, and indeed, removing this
still builds fine for me. Thoughts?




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-07 Thread John Naylor
On Thu, Mar 7, 2024 at 1:49 PM Masahiko Sawada  wrote:
> odr-violation seems to refer to One Definition Rule (ODR). According
> to Wikipedia[1]:
>
> The One Definition Rule (ODR) is an important rule of the C++
> programming language that prescribes that classes/structs and
> non-inline functions cannot have more than one definition in the
> entire program and template and types cannot have more than one
> definition by translation unit. It is defined in the ISO C++ Standard
> (ISO/IEC 14882) 2003, at section 3.2. Some other programming languages
> have similar but differently defined rules towards the same objective.
>
> I don't fully understand this concept yet but are these two different
> build failures related?

I thought it may have something to do with the prerequisite commit
that moved some symbols from bitmapset.c to .h:

/* Select appropriate bit-twiddling functions for bitmap word size */
#if BITS_PER_BITMAPWORD == 32
#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
#define bmw_popcount(w) pg_popcount32(w)
#elif BITS_PER_BITMAPWORD == 64
#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
#define bmw_popcount(w) pg_popcount64(w)
#else
#error "invalid BITS_PER_BITMAPWORD"
#endif

...but olingo's error seems strange to me, because it is complaining
of pg_leftmost_one_pos, which refers to the lookup table in
pg_bitutils.c -- I thought all buildfarm members used the bitscan
instructions.

grassquit is complaining of pg_popcount64, which is a global function,
also in pg_bitutils.c. Not sure what to make of this, since we're just
pointing symbols at things which should have a single definition...




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread John Naylor
On Thu, Mar 7, 2024 at 2:14 PM Masahiko Sawada  wrote:
>
> On Thu, Mar 7, 2024 at 4:01 PM John Naylor  wrote:
> >
> > On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Thu, Mar 7, 2024 at 3:20 PM John Naylor  
> > > wrote:
> > > >
> > > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor  
> > > > wrote:
> >
> > > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
> > > > > feature [-Werror,-Wtypedef-redefinition]"
> > > > >
> > > > > I'll look in the other templates to see what they do.
> > > >
> > > > Their "declare" sections have full typedefs. I found it works to leave
> > > > out the typedef for the "define" section, but I first want to
> > > > reproduce the build failure.
> > >
> > > Right. I've reproduced this build failure on my machine by specifying
> > > flags "-Wtypedef-redefinition -std=gnu99" to clang. Something like the
> > > below change seems to fix the problem:
> >
> > Confirmed, will push shortly.
>
> mamba complained different build errors[1]:
>
>  2740 |  fprintf(stderr, "num_keys = %ld\\n", tree->ctl->num_keys);
>   |  ~~^ ~~~
>   ||  |
>   |long int   int64 {aka long long 
> int}
>   |  %lld
> ../../../../src/include/lib/radixtree.h:2752:30: error: format '%ld'
> expects argument of type 'long int', but argument 4 has type 'int64'
> {aka 'long long int'} [-Werror=format=]
>  2752 |   fprintf(stderr, ", n%d = %ld", size_class.fanout,
> tree->ctl->num_nodes[i]);
>   |~~^
> ~~~
>   |  |
>  |
>   |  long int
>  int64 {aka long long int}
>   |%lld
> ../../../../src/include/lib/radixtree.h:2755:32: error: format '%ld'
> expects argument of type 'long int', but argument 3 has type 'int64'
> {aka 'long long int'} [-Werror=format=]
>  2755 |  fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves);
>   |  ~~^   ~
>   |||
>   |long int int64 {aka long long int}
>   |  %lld
>
> Regards,
>
> [1] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-03-07%2006%3A05%3A18

Yeah, the attached fixes it for me.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 93e6a7d809..b8ad51c14d 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -2737,7 +2737,7 @@ RT_SCOPE void
 RT_STATS(RT_RADIX_TREE * tree)
 {
 	fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
-	fprintf(stderr, "num_keys = %ld\n", tree->ctl->num_keys);
+	fprintf(stderr, "num_keys = %lld\n", (long long) tree->ctl->num_keys);
 
 #ifdef RT_SHMEM
 	fprintf(stderr, "handle = " DSA_POINTER_FORMAT "\n", tree->ctl->handle);
@@ -2749,10 +2749,10 @@ RT_STATS(RT_RADIX_TREE * tree)
 	{
 		RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
 
-		fprintf(stderr, ", n%d = %ld", size_class.fanout, tree->ctl->num_nodes[i]);
+		fprintf(stderr, ", n%d = %lld", size_class.fanout, (long long) tree->ctl->num_nodes[i]);
 	}
 
-	fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves);
+	fprintf(stderr, ", leaves = %lld", (long long) tree->ctl->num_leaves);
 
 	fprintf(stderr, "\n");
 }


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread John Naylor
On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada  wrote:
>
> On Thu, Mar 7, 2024 at 3:20 PM John Naylor  wrote:
> >
> > On Thu, Mar 7, 2024 at 12:59 PM John Naylor  wrote:

> > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
> > > feature [-Werror,-Wtypedef-redefinition]"
> > >
> > > I'll look in the other templates to see what they do.
> >
> > Their "declare" sections have full typedefs. I found it works to leave
> > out the typedef for the "define" section, but I first want to
> > reproduce the build failure.
>
> Right. I've reproduced this build failure on my machine by specifying
> flags "-Wtypedef-redefinition -std=gnu99" to clang. Something like the
> below change seems to fix the problem:

Confirmed, will push shortly.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread John Naylor
On Thu, Mar 7, 2024 at 12:59 PM John Naylor  wrote:
>
> On Thu, Mar 7, 2024 at 12:55 PM John Naylor  wrote:
> >
> > On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada  
> > wrote:
> > >
> > > > + /* Find the control object in shared memory */
> > > > + control = handle;
> > >
> > > I think it's mostly because of readability; it makes clear that the
> > > handle should be castable to dsa_pointer and it's a control object. I
> > > borrowed it from dshash_attach().
> >
> > I find that a bit strange, but I went ahead and kept it.
> >
> >
> >
> > On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada  
> > wrote:
> >
> > > The 0001 patch looks good to me. I have some minor comments:
> >
> > > +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
> > > +
> > >
> > > "src/backend/lib/radixtree.c" should be updated to
> > > "src/include/lib/radixtree.h".
> >
> > Done.
> >
> > > --- /dev/null
> > > +++ b/src/test/modules/test_radixtree/README
> > > @@ -0,0 +1,7 @@
> > > +test_integerset contains unit tests for testing the integer set 
> > > implementation
> > > +in src/backend/lib/integerset.c.
> > > +
> > > +The tests verify the correctness of the implementation, but they can 
> > > also be
> > > +used as a micro-benchmark.  If you set the 'intset_test_stats' flag in
> > > +test_integerset.c, the tests will print extra information about 
> > > execution time
> > > +and memory usage.
> > >
> > > This file is not updated for test_radixtree. I think we can remove it
> > > as the test cases in test_radixtree are clear.
> >
> > Done. I pushed this with a few last-minute cosmetic adjustments. This
> > has been a very long time coming, but we're finally in the home
> > stretch!
> >
> > Already, I see sifaka doesn't like this, and I'm looking now...
>
> It's complaining that these forward declarations...
>
> /* generate forward declarations necessary to use the radix tree */
> #ifdef RT_DECLARE
>
> typedef struct RT_RADIX_TREE RT_RADIX_TREE;
> typedef struct RT_ITER RT_ITER;
>
> ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
> feature [-Werror,-Wtypedef-redefinition]"
>
> I'll look in the other templates to see what they do.

Their "declare" sections have full typedefs. I found it works to leave
out the typedef for the "define" section, but I first want to
reproduce the build failure.
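
For the archives, a standalone illustration of the arrangement that
works (type names made up): keep the full typedef in the forward
declarations, and define the struct without repeating it, since
redefining a typedef is only legal as of C11:

/* "declare" section: the typedef doubles as a forward declaration */
typedef struct my_tree my_tree;

/* "define" section: define the struct, but don't typedef it again */
struct my_tree
{
	int			nkeys;
};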

In addition, olingo and grassquit are showing different kinds of
"AddressSanitizer: odr-violation" errors, which I'm not sure what to
make of -- example:

==1862767==ERROR: AddressSanitizer: odr-violation (0x7fc257476b60):
  [1] size=256 'pg_leftmost_one_pos'
/home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34
  [2] size=256 'pg_leftmost_one_pos'
/home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34
These globals were registered at these points:
  [1]:
#0 0x563564b97bf6 in __asan_register_globals
(/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6)
(BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
#1 0x563564b98d1d in __asan_register_elf_globals
(/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d)
(BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
#2 0x7fc265c3fe3d in call_init elf/dl-init.c:74:3
#3 0x7fc265c3fe3d in call_init elf/dl-init.c:26:1

  [2]:
#0 0x563564b97bf6 in __asan_register_globals
(/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6)
(BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
#1 0x563564b98d1d in __asan_register_elf_globals
(/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d)
(BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
#2 0x7fc2649847f5 in call_init csu/../csu/libc-start.c:145:3
#3 0x7fc2649847f5 in __libc_start_main csu/../csu/libc-start.c:347:5




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread John Naylor
On Thu, Mar 7, 2024 at 12:55 PM John Naylor  wrote:
>
> On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada  wrote:
> >
> > > + /* Find the control object in shared memory */
> > > + control = handle;
> >
> > I think it's mostly because of readability; it makes clear that the
> > handle should be castable to dsa_pointer and it's a control object. I
> > borrowed it from dshash_attach().
>
> I find that a bit strange, but I went ahead and kept it.
>
>
>
> On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada  wrote:
>
> > The 0001 patch looks good to me. I have some minor comments:
>
> > +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
> > +
> >
> > "src/backend/lib/radixtree.c" should be updated to
> > "src/include/lib/radixtree.h".
>
> Done.
>
> > --- /dev/null
> > +++ b/src/test/modules/test_radixtree/README
> > @@ -0,0 +1,7 @@
> > +test_integerset contains unit tests for testing the integer set 
> > implementation
> > +in src/backend/lib/integerset.c.
> > +
> > +The tests verify the correctness of the implementation, but they can also 
> > be
> > +used as a micro-benchmark.  If you set the 'intset_test_stats' flag in
> > +test_integerset.c, the tests will print extra information about execution 
> > time
> > +and memory usage.
> >
> > This file is not updated for test_radixtree. I think we can remove it
> > as the test cases in test_radixtree are clear.
>
> Done. I pushed this with a few last-minute cosmetic adjustments. This
> has been a very long time coming, but we're finally in the home
> stretch!
>
> Already, I see sifaka doesn't like this, and I'm looking now...

It's complaining that these forward declarations...

/* generate forward declarations necessary to use the radix tree */
#ifdef RT_DECLARE

typedef struct RT_RADIX_TREE RT_RADIX_TREE;
typedef struct RT_ITER RT_ITER;

... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
feature [-Werror,-Wtypedef-redefinition]"

I'll look in the other templates to see what they do.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread John Naylor
On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada  wrote:
>
> > + /* Find the control object in shared memory */
> > + control = handle;
>
> I think it's mostly because of readability; it makes clear that the
> handle should be castable to dsa_pointer and it's a control object. I
> borrowed it from dshash_attach().

I find that a bit strange, but I went ahead and kept it.



On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada  wrote:

> The 0001 patch looks good to me. I have some minor comments:

> +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
> +
>
> "src/backend/lib/radixtree.c" should be updated to
> "src/include/lib/radixtree.h".

Done.

> --- /dev/null
> +++ b/src/test/modules/test_radixtree/README
> @@ -0,0 +1,7 @@
> +test_integerset contains unit tests for testing the integer set 
> implementation
> +in src/backend/lib/integerset.c.
> +
> +The tests verify the correctness of the implementation, but they can also be
> +used as a micro-benchmark.  If you set the 'intset_test_stats' flag in
> +test_integerset.c, the tests will print extra information about execution 
> time
> +and memory usage.
>
> This file is not updated for test_radixtree. I think we can remove it
> as the test cases in test_radixtree are clear.

Done. I pushed this with a few last-minute cosmetic adjustments. This
has been a very long time coming, but we're finally in the home
stretch!

Already, I see sifaka doesn't like this, and I'm looking now...




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread John Naylor
Actually, I forgot -- I had one more question: Masahiko, is there a
reason for this extra local variable, which uses the base type, rather
than the typedef'd parameter?

+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread John Naylor
On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada  wrote:
>
> > I'd like to push 0001 and 0002 shortly, and then do another sweep over
> > 0003, with remaining feedback, and get that in so we get some
> > buildfarm testing before the remaining polishing work on
> > tidstore/vacuum.
>
> Sounds like a reasonable plan. 0001 and 0002 look good to me. I'm going to
> polish tidstore and vacuum patches and update commit messages.

I don't think v66 got a CI run because of vacuumlazy.c bitrot, so I'm
attaching v67 which fixes that and has some small cosmetic adjustments
to the template. One functional change for debugging build is that
RT_STATS now prints out the number of leaves. I'll squash and push
0001 tomorrow morning unless there are further comments.


v67-ART.tar.gz
Description: application/gzip


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread John Naylor
On Wed, Mar 6, 2024 at 3:40 PM Masahiko Sawada  wrote:
>
> On Wed, Mar 6, 2024 at 5:33 PM John Naylor  wrote:

> > I've looked around and it seems clang is more lax on conversions.
> > Since it works fine for clang, I think we just need a cast here for
> > gcc. I've attached a blind attempt at a fix -- I'll apply shortly
> > unless someone happens to test and find it doesn't work.
>
> I've reproduced the same error on my raspberry pi, and confirmed the
> patch fixes the error.
>
> My previous idea was wrong. With my proposal, the regression test for
> radix tree failed on my raspberry pi. On the other hand, with your
> patch the tests passed.

Pushed, and at least parula's green now, thanks for testing! And
thanks, Andres, for the ping!




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread John Naylor
On Wed, Mar 6, 2024 at 3:02 PM Masahiko Sawada  wrote:
>
> ../../src/include/port/simd.h:326:71: error: incompatible type for
> argument 1 of 'vshrq_n_s8'
>   uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
>                                                                        ^
>
> Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead.

I've looked around and it seems clang is more lax on conversions.
Since it works fine for clang, I think we just need a cast here for
gcc. I've attached a blind attempt at a fix -- I'll apply shortly
unless someone happens to test and find it doesn't work.
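
For the record, vshrq_n_s8() is declared to take int8x16_t, and gcc
enforces that strictly where clang accepts the implicit conversion. A
minimal standalone sketch of the double cast (function name is mine):

#include <arm_neon.h>

static uint8x16_t
shift_high_bits(uint8x16_t v)
{
	/* cast to the signed type the intrinsic wants, then back */
	return (uint8x16_t) vshrq_n_s8((int8x16_t) v, 7);
}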
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 326b4faff5..597496f2fb 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -323,7 +323,7 @@ vector8_highbit_mask(const Vector8 v)
 		1 << 4, 1 << 5, 1 << 6, 1 << 7,
 	};
 
-	uint8x16_t	masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+	uint8x16_t	masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8((int8x16_t) v, 7));
 	uint8x16_t	maskedhi = vextq_u8(masked, masked, 8);
 
 	return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread John Naylor
On Wed, Mar 6, 2024 at 3:06 PM John Naylor  wrote:
>
> (Hmm, I thought we had run this code on Arm already...)

CI MacOS uses Clang on aarch64, which has been working fine. The
failing animals are on gcc 7.3...




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread John Naylor
On Wed, Mar 6, 2024 at 3:02 PM Masahiko Sawada  wrote:
>
> On Wed, Mar 6, 2024 at 4:41 PM Andres Freund  wrote:

> > A few ARM buildfarm animals are complaining:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-06%2007%3A34%3A02
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-03-06%2007%3A34%3A03
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=massasauga&dt=2024-03-06%2007%3A33%3A18
> >
>
> The error message we got is:
>
> ../../src/include/port/simd.h:326:71: error: incompatible type for
> argument 1 of 'vshrq_n_s8'
>   uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
>                                                                        ^
>
> Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead.

That sounds plausible, and I'll look further.

(Hmm, I thought we had run this code on Arm already...)




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-05 Thread John Naylor
On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada  wrote:
>
> On Tue, Mar 5, 2024 at 6:41 PM John Naylor  wrote:

> > Done in v66-0009. I'd be curious to hear any feedback. I like the
> > aspect that the random numbers come from a different seed every time
> > the test runs.
>
> The new tests look good. Here are some comments:
>
> ---
> +   expected = keys[i];
> +   iterval = rt_iterate_next(iter, &iterkey);
>
> -   ndeleted++;
> +   EXPECT_TRUE(iterval != NULL);
> +   EXPECT_EQ_U64(iterkey, expected);
> +   EXPECT_EQ_U64(*iterval, expected);
>
> Can we verify that the iteration returns keys in ascending order?

We get the "expected" value from the keys we saved in the now-sorted
array, so we do already. Unless I misunderstand you.

> ---
> + /* reset random number generator for deletion */
> + pg_prng_seed(&state, seed);
>
> Why is resetting the seed required here?

Good catch - my intention was to delete in the same random order we
inserted in. We still have the keys in the array, but they're sorted
by now. I forgot to take the extra step of using the prng when
generating the keys for deletion -- will fix.
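
Something like this, in sketch form (assuming the deletion loop mirrors
insertion, with "filter" standing in for whatever mask the insert loop
used):

	pg_prng_seed(&state, seed);
	for (int i = 0; i < num_keys; i++)
	{
		uint64		key = pg_prng_uint64(&state) & filter;

		rt_delete(radixtree, key);
	}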

> ---
> The radix tree (and dsa in TSET_SHARED_RT case) should be freed at the end.

Will fix.

> ---
> radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
>   "test_radix_tree",
>   ALLOCSET_DEFAULT_SIZES);
>
> We use a mix of ALLOCSET_DEFAULT_SIZES and ALLOCSET_SMALL_SIZES. I
> think it's better to use either one for consistency.

Will change to "small", since 32-bit platforms will use slab for leaves.

I'll look at the memory usage and estimate what 32-bit platforms will
use, and maybe adjust the number of keys. A few megabytes is fine, but
not many megabytes.

> > I'd like to push 0001 and 0002 shortly, and then do another sweep over
> > 0003, with remaining feedback, and get that in so we get some
> > buildfarm testing before the remaining polishing work on
> > tidstore/vacuum.
>
> Sounds like a reasonable plan. 0001 and 0002 look good to me. I'm going to
> polish tidstore and vacuum patches and update commit messages.

Sounds good.




Re: Change GUC hashtable to use simplehash?

2024-03-05 Thread John Naylor
On Tue, Jan 30, 2024 at 5:04 PM John Naylor  wrote:
>
> On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma  wrote:
> > But given that we know the data length and we have it in a register
> > already, it's easy enough to just mask out data past the end with a
> > shift. See patch 1. Performance benefit is about 1.5x Measured on a
> > small test harness that just hashes and finalizes an array of strings,
> > with a data dependency between consecutive hashes (next address
> > depends on the previous hash output).
>
> Interesting work! I've taken this idea and (I'm guessing, haven't
> tested) improved it by re-using an intermediate step for the
> conditional, simplifying the creation of the mask, and moving the
> bitscan out of the longest dependency chain.

This needed a rebase, and is now 0001. I plan to push this soon.

I also went and looked at the simplehash instances and found a few
that would be easy to switch over. Rather than try to figure out which
could benefit from shaving cycles, I changed all the string hashes,
and one more, in 0002 so they can act as examples.

0003 uses fasthash for resowner, as suggested by Heikki upthread.  Now
murmur64 has no callers, but it (or similar *) could be used in
pg_dump/common.c for hashing CatalogId (8 bytes).

Commit 42a1de3013 added a new use for string_hash, but I can't tell
from a quick glance whether it uses the truncation, so I'm going to
take a closer look before re-attaching the proposed dynahash change
again.

* some examples here:
https://www.boost.org/doc/libs/1_84_0/boost/container_hash/detail/hash_mix.hpp
From 63d8140f146b58ea044f3516ae5472febd6d1caf Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 6 Feb 2024 13:11:33 +0700
Subject: [PATCH v19 1/3] Speed up tail processing when hashing aligned C
 strings

After encountering the NUL terminator, the word-at-a-time loop exits
and we must hash the remaining bytes. Previously we calculated the
terminator's position and re-loaded the remaining bytes from the input
string. We already have all the data we need in a register, so let's
just mask off the bytes we need and hash them immediately. The mask can
be cheaply computed without knowing the terminator's position. We still
need that position for the length calculation, but the CPU can now
do that in parallel with other work, shortening the dependency chain.

Ants Aasma and John Naylor

Discussion: https://postgr.es/m/CANwKhkP7pCiW_5fAswLhs71-JKGEz1c1%2BPC0a_w1fwY4iGMqUA%40mail.gmail.com
---
 src/include/common/hashfn_unstable.h | 44 +---
 1 file changed, 34 insertions(+), 10 deletions(-)

diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h
index 791750d136..bd7323fe05 100644
--- a/src/include/common/hashfn_unstable.h
+++ b/src/include/common/hashfn_unstable.h
@@ -219,8 +219,9 @@ static inline size_t
 fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 {
 	const char *const start = str;
-	size_t		remainder;
+	uint64		chunk;
 	uint64		zero_byte_low;
+	uint64		mask;
 
 	Assert(PointerIsAligned(start, uint64));
 
@@ -239,7 +240,7 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 	 */
 	for (;;)
 	{
-		uint64		chunk = *(uint64 *) str;
+		chunk = *(uint64 *) str;
 
 #ifdef WORDS_BIGENDIAN
 		zero_byte_low = haszero64(pg_bswap64(chunk));
@@ -254,14 +255,37 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 		str += FH_SIZEOF_ACCUM;
 	}
 
-	/*
-	 * The byte corresponding to the NUL will be 0x80, so the rightmost bit
-	 * position will be in the range 7, 15, ..., 63. Turn this into byte
-	 * position by dividing by 8.
-	 */
-	remainder = pg_rightmost_one_pos64(zero_byte_low) / BITS_PER_BYTE;
-	fasthash_accum(hs, str, remainder);
-	str += remainder;
+	if (zero_byte_low & 0xFF)
+	{
+		/*
+		 * The next byte in the input is the NUL terminator, so we have
+		 * nothing to do.
+		 */
+	}
+	else
+	{
+		/*
+		 * Create a mask for the remaining bytes so we can combine them into
+		 * the hash. The mask also covers the NUL terminator, but that's
+		 * harmless. The mask could contain 0x80 in bytes corresponding to the
+		 * input past the terminator, but only where the input byte is zero or
+		 * one, so also harmless.
+		 */
+		mask = zero_byte_low | (zero_byte_low - 1);
+#ifdef WORDS_BIGENDIAN
+		/* need to mask the upper bytes */
+		mask = pg_bswap64(mask);
+#endif
+		hs->accum = chunk & mask;
+		fasthash_combine(hs);
+
+		/*
+		 * The byte corresponding to the NUL will be 0x80, so the rightmost
+		 * bit position will be in the range 15, 23, ..., 63. Turn this into
+		 * byte position by dividing by 8.
+		 */
+		str += pg_rightmost_one_pos64(zero_byte_low) / BITS_PER_BYTE;
+	}
 
 	return str - start;
 }
-- 
2.43.0

From 5dad8c783bc5d0d5f573ea136b13e79aa22d0371 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 5 Mar 2024 17:22:

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-05 Thread John Naylor
On Tue, Mar 5, 2024 at 8:27 AM Masahiko Sawada  wrote:
>
> On Mon, Mar 4, 2024 at 8:48 PM John Naylor  wrote:
> >
> > On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada  
> > wrote:

> > > Resetting the tree->context seems to work. But I think we should note
> > > for callers that the dsa_area passed to RT_CREATE should be created in
> > > a different context than the context passed to RT_CREATE because
> > > otherwise RT_FREE() will also free the dsa_area. For example, the
> > > following code in test_radixtree.c will no longer work:

I've added a comment in v66-0004, which contains a number of other
small corrections and edits.

On Fri, Mar 1, 2024 at 3:01 PM Masahiko Sawada  wrote:
> > Thirdly, cosmetic: With the introduction of single-value leaves, it
> > seems we should do s/RT_NODE_PTR/RT_CHILD_PTR/ -- what do you think?
>
> Agreed.

Done in v66-0005.

v66-0006 removes outdated tests for invalid root that somehow got left over.

> > I wrote:
> > > > Secondly, I thought about my recent work to skip checking if we first
> > > > need to create a root node, and that has a harmless (for vacuum at
> > > > least) but slightly untidy behavior: When RT_SET is first called, and
> > > > the key is bigger than 255, new nodes will go on top of the root node.
> > > > These have chunk '0'. If all subsequent keys are big enough, the
> > > > original root node will stay empty. If all keys are deleted, there will
> > > > be a chain of empty nodes remaining. Again, I believe this is
> > > > harmless, but to make it tidy, it should be easy to teach RT_EXTEND_UP to
> > > > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> > > > on this, but likely not today.

> > I put a little more work into this, and got it working, just needs a
> > small amount of finicky coding. I'll share tomorrow.

Done in v66-0007. I'm a bit disappointed in the extra messiness this
adds, although it's not a lot.

> > + check_stack_depth();
> > + CHECK_FOR_INTERRUPTS();
> >
> > I'm not sure why these are here: The first seems overly paranoid,
> > although harmless, but the second is probably a bad idea. Why should
> > the user be able to interrupt the freeing of memory?
>
> Good catch. We should not check the interruption there.

Removed in v66-0008.

> > Also, I'm not quite happy that RT_ITER has a copy of a pointer to the
> > tree, leading to coding like "iter->tree->ctl->root". I *think* it
> > would be easier to read if the tree was a parameter to these iteration
> > functions. That would require an API change, so the tests/tidstore
> > would have some churn. I can do that, but before trying I wanted to
> > see what you think -- is there some reason to keep the current way?
>
> I considered both usages, there are two reasons for the current style.
> I'm concerned that if we pass both the tree and RT_ITER to iteration
> functions, the caller could mistakenly pass a different tree than the
> one that was specified to create the RT_ITER. And the second reason is
> just to make it consistent with other data structures such as
> dynahash.c and dshash.c, but I now realized that in simplehash.h we
> pass both the hash table and the iterator.

Okay, then I don't think it's worth messing with at this point.

On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada  wrote:
>
> On Fri, Feb 2, 2024 at 8:47 PM John Naylor  wrote:

> > It's pretty hard to see what test_pattern() is doing, or why it's
> > useful. I wonder if instead the test could use something like the
> > benchmark where random integers are masked off. That seems simpler. I
> > can work on that, but I'd like to hear your side about test_pattern().
>
> Yeah, test_pattern() is originally created for the integerset so it
> doesn't necessarily fit the radixtree. I agree to use some tests from
> benchmarks.

Done in v66-0009. I'd be curious to hear any feedback. I like the
aspect that the random numbers come from a different seed every time
the test runs.

v66-0010/0011 run pgindent, the latter with one typedef added for the
test module. 0012 - 0017 are copied from v65, and I haven't done any
work on tidstore or vacuum, except for squashing most v65 follow-up
patches.

I'd like to push 0001 and 0002 shortly, and then do another sweep over
0003, with remaining feedback, and get that in so we get some
buildfarm testing before the remaining polishing work on
tidstore/vacuum.


v66-ART.tar.gz
Description: application/gzip


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-04 Thread John Naylor
On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada  wrote:
>
> On Sun, Mar 3, 2024 at 2:43 PM John Naylor  wrote:

> > Right, I should have said "reset". Resetting a context will delete
> > its children as well, and it seems like it should work to reset the tree
> > context, and we don't have to know whether that context actually
> > contains leaves at all. That should allow copying "tree context" to
> > "leaf context" in the case where we have no special context for
> > leaves.
>
> Resetting the tree->context seems to work. But I think we should note
> for callers that the dsa_area passed to RT_CREATE should be created in
> a different context than the context passed to RT_CREATE because
> otherwise RT_FREE() will also free the dsa_area. For example, the
> following code in test_radixtree.c will no longer work:
>
> dsa = dsa_create(tranche_id);
> radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
> :
> rt_free(radixtree);
> dsa_detach(dsa); // dsa is already freed.
>
> So I think that a practical usage of the radix tree will be that the
> caller  creates a memory context for a radix tree and passes it to
> RT_CREATE().

That sounds workable to me.

> I've attached an update patch set:
>
> - 0008 updates RT_FREE_RECURSE().

Thanks!

> - 0009 patch is an updated version of cleanup radix tree memory handling.

Looks pretty good, as does the rest. I'm going through again,
squashing and making tiny adjustments to the template. The only thing
not done is changing the test with many values to resemble the perf
test more.

I wrote:
> > Secondly, I thought about my recent work to skip checking if we first
> > need to create a root node, and that has a harmless (for vacuum at
> > least) but slightly untidy behavior: When RT_SET is first called, and
> > the key is bigger than 255, new nodes will go on top of the root node.
> > These have chunk '0'. If all subsequent keys are big enough, the
> > original root node will stay empty. If all keys are deleted, there will
> > be a chain of empty nodes remaining. Again, I believe this is
> > harmless, but to make it tidy, it should be easy to teach RT_EXTEND_UP to
> > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> > on this, but likely not today.
>
> This turns out to be a lot trickier than it looked, so it seems best
> to allow a trivial amount of waste, as long as it's documented
> somewhere. It also wouldn't be terrible to re-add those branches,
> since they're highly predictable.

I put a little more work into this, and got it working, just needs a
small amount of finicky coding. I'll share tomorrow.

I have a question about RT_FREE_RECURSE:

+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();

I'm not sure why these are here: The first seems overly paranoid,
although harmless, but the second is probably a bad idea. Why should
the user be able to interrupt the freeing of memory?

Also, I'm not quite happy that RT_ITER has a copy of a pointer to the
tree, leading to coding like "iter->tree->ctl->root". I *think* it
would be easier to read if the tree was a parameter to these iteration
functions. That would require an API change, so the tests/tidstore
would have some churn. I can do that, but before trying I wanted to
see what you think -- is there some reason to keep the current way?




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-02 Thread John Naylor
On Fri, Mar 1, 2024 at 3:01 PM Masahiko Sawada  wrote:
>
> On Thu, Feb 29, 2024 at 8:43 PM John Naylor  wrote:

> > + ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
> > +"tidstore storage",
> >
> > "tidstore storage" sounds a bit strange -- maybe look at some other
> > context names for ideas.
>
> Agreed. How about "tidstore's radix tree"?

That might be okay. I'm now thinking "TID storage". On that note, one
improvement needed when we polish tidstore.c is to make sure it's
spelled "TID" in comments, like other files do already.

> > - leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize);
> > + leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL
> > +? tree->leaf_ctx
> > +: tree->context, allocsize);
> >
> > Instead of branching here, can we copy "context" to "leaf_ctx" when
> > necessary (those names should look more like each other, btw)? I think
> > that means anything not covered by this case:
> >
> > +#ifndef RT_VARLEN_VALUE_SIZE
> > + if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
> > + tree->leaf_ctx = SlabContextCreate(ctx,
> > +RT_STR(RT_PREFIX) "radix_tree leaf contex",
> > +RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> > +sizeof(RT_VALUE_TYPE));
> > +#endif /* !RT_VARLEN_VALUE_SIZE */
> >
> > ...also, we should document why we're using slab here. On that, I
> > don't recall why we are? We've never had a fixed-length type test case
> > on 64-bit, so it wasn't because it won through benchmarking. It seems
> > a hold-over from the days of "multi-value leaves". Is it to avoid the
> > possibility of space wastage with non-power-of-two size types?
>
> Yes, it matches my understanding.

There are two issues quoted here, so not sure if you mean both or only
the last one...

For the latter, I'm not sure it makes sense to have code and #ifdef's
to force slab for large-enough fixed-length values just because we
can. There may never be such a use-case anyway. I'm also not against
it, either, but it seems like a premature optimization.

> > For this stanza that remains unchanged:
> >
> > for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
> > {
> >   MemoryContextDelete(tree->node_slabs[i]);
> > }
> >
> > if (tree->leaf_ctx)
> > {
> >   MemoryContextDelete(tree->leaf_ctx);
> > }
> >
> > ...is there a reason we can't just delete tree->ctx, and let that
> > recursively delete child contexts?
>
> I thought that considering the RT_CREATE doesn't create its own memory
> context but just uses the passed context, it might be a bit unusable
> to delete the passed context in the radix tree code. For example, if a
> caller creates a radix tree (or tidstore) on a memory context and
> wants to recreate it again and again, he also needs to re-create the
> memory context together. It might be okay if we leave comments on
> RT_CREATE as a side effect, though. This is the same reason why we
> don't destroy tree->dsa in RT_FREE(). And, as for RT_FREE_RECURSE(),

Right, I should have said "reset". Resetting a context will delete
its children as well, and it seems like it should work to reset the tree
context, and we don't have to know whether that context actually
contains leaves at all. That should allow copying "tree context" to
"leaf context" in the case where we have no special context for
leaves.
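
Concretely, I imagine RT_FREE could then boil down to little more than
this (sketch only, assuming the structs live in that context too):

	MemoryContextReset(tree->context);	/* child slab contexts go too */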




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-29 Thread John Naylor
I wrote:

> Secondly, I thought about my recent work to skip checking if we first
> need to create a root node, and that has a harmless (for vacuum at
> least) but slightly untidy behavior: When RT_SET is first called, and
> the key is bigger than 255, new nodes will go on top of the root node.
> These have chunk '0'. If all subsequent keys are big enough, the
> original root node will stay empty. If all keys are deleted, there will
> be a chain of empty nodes remaining. Again, I believe this is
> harmless, but to make it tidy, it should be easy to teach RT_EXTEND_UP to
> call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> on this, but likely not today.

This turns out to be a lot trickier than it looked, so it seems best
to allow a trivial amount of waste, as long as it's documented
somewhere. It also wouldn't be terrible to re-add those branches,
since they're highly predictable.

I just noticed there are a lot of unused function parameters
(referring to parent slots) leftover from a few weeks ago. Those are
removed in v64-0009. 0010 makes the obvious name change in those
remaining to "parent_slot". 0011 is a simplification in two places
regarding reserving slots. This should be a bit easier to read and
possibly makes it easier on the compiler.


v64-ART.tar.gz
Description: application/gzip


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-29 Thread John Naylor
I'm looking at RT_FREE_RECURSE again (only used for DSA memory), and
I'm not convinced it's freeing all the memory. It's been many months
since we discussed this last, but IIRC we cannot just tell DSA to free
all its segments, right? Is there currently anything preventing us
from destroying the whole DSA area at once?

+ /* The last level node has pointers to values */
+ if (shift == 0)
+ {
+   dsa_free(tree->dsa, ptr);
+   return;
+ }

IIUC, this doesn't actually free leaves, it only frees the last-level
node. And, this function is unaware of whether children could be
embedded values. I'm thinking we need to get rid of the above
pre-check and instead have each node kind do something like (e.g.
node4):

RT_PTR_ALLOC child = n4->children[i];

if (shift > 0)
  RT_FREE_RECURSE(tree, child, shift - RT_SPAN);
else if (!RT_CHILDPTR_IS_VALUE(child))
  dsa_free(tree->dsa, child);

...or am I missing something?




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-29 Thread John Naylor
On Tue, Feb 20, 2024 at 1:59 PM Masahiko Sawada  wrote:

> - v63-0008 patch fixes a bug in tidstore.

- page->nwords = wordnum + 1;
- Assert(page->nwords = WORDS_PER_PAGE(offsets[num_offsets - 1]));
+ page->nwords = wordnum;
+ Assert(page->nwords == WORDS_PER_PAGE(offsets[num_offsets - 1]));

Yikes, I'm guessing this failed in non-assert builds? I wonder why
my compiler didn't yell at me... Have you tried a tidstore-debug build
without asserts?
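
For anyone following along: with "=" the Assert both assigns and tests
the (nonzero) result, so assert-enabled builds accidentally overwrote
the off-by-one value with the correct one, while non-assert builds
compiled the Assert away and kept wordnum + 1:

	page->nwords = wordnum + 1;	/* off by one */
	/* assigns WORDS_PER_PAGE(...), then tests nonzero -- hides the bug */
	Assert(page->nwords = WORDS_PER_PAGE(offsets[num_offsets - 1]));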

> - v63-0009 patch is a draft idea of cleanup memory context handling.

Thanks, looks pretty good!

+ ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
+"tidstore storage",

"tidstore storage" sounds a bit strange -- maybe look at some other
context names for ideas.

- leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize);
+ leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL
+? tree->leaf_ctx
+: tree->context, allocsize);

Instead of branching here, can we copy "context" to "leaf_ctx" when
necessary (those names should look more like each other, btw)? I think
that means anything not covered by this case (sketch further below):

+#ifndef RT_VARLEN_VALUE_SIZE
+ if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
+ tree->leaf_ctx = SlabContextCreate(ctx,
+RT_STR(RT_PREFIX) "radix_tree leaf contex",
+RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
+sizeof(RT_VALUE_TYPE));
+#endif /* !RT_VARLEN_VALUE_SIZE */

...also, we should document why we're using slab here. On that, I
don't recall why we are? We've never had a fixed-length type test case
on 64-bit, so it wasn't because it won through benchmarking. It seems
a hold-over from the days of "multi-value leaves". Is it to avoid the
possibility of space wastage with non-power-of-two size types?
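
To be concrete about copying the context, I mean something like this in
RT_CREATE (sketch only), so the allocation site can use tree->leaf_ctx
unconditionally:

	if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
		tree->leaf_ctx = SlabContextCreate(ctx,
										   RT_STR(RT_PREFIX) "radix_tree leaf contex",
										   RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
										   sizeof(RT_VALUE_TYPE));
	else
		tree->leaf_ctx = ctx;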

For this stanza that remains unchanged:

for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
  MemoryContextDelete(tree->node_slabs[i]);
}

if (tree->leaf_ctx)
{
  MemoryContextDelete(tree->leaf_ctx);
}

...is there a reason we can't just delete tree->ctx, and let that
recursively delete child contexts?
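
i.e. just

	MemoryContextDelete(tree->context);

since deleting a context takes its descendants with it.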

Secondly, I thought about my recent work to skip checking if we first
need to create a root node, and that has a harmless (for vacuum at
least) but slightly untidy behavior: When RT_SET is first called, and
the key is bigger than 255, new nodes will go on top of the root node.
These have chunk '0'. If all subsequent keys are big enough, the
original root node will stay empty. If all keys are deleted, there will
be a chain of empty nodes remaining. Again, I believe this is
harmless, but to make it tidy, it should be easy to teach RT_EXTEND_UP to
call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
on this, but likely not today.

Thirdly, cosmetic: With the introduction of single-value leaves, it
seems we should do s/RT_NODE_PTR/RT_CHILD_PTR/ -- what do you think?




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-19 Thread John Naylor
On Mon, Feb 19, 2024 at 9:02 AM Masahiko Sawada  wrote:
>
> I think that vacuum and tidbitmap (and future users) would end up
> having the same max block size calculation. And it seems slightly odd
> layering to me that max-block-size-specified context is created on
> vacuum (or tidbitmap) layer, a varlen-value radix tree is created by
> tidstore layer, and the passed context is used for leaves (if
> varlen-value is used) on radix tree layer.

That sounds slightly more complicated than I was thinking of, but we
could actually be talking about the same thing: I'm drawing a
distinction between "used = must be detected / #ifdef'd" and "used =
actually happens to call allocation". I meant that the passed context
would _always_ be used for leaves, regardless of varlen or not. So
with fixed-length values short enough to live in child pointer slots,
that context would still be used for iteration etc.

> Another idea is to create a
> max-block-size-specified context on the tidstore layer. That is,
> vacuum and tidbitmap pass a work_mem and a flag indicating whether the
> tidstore can use the bump context, and tidstore creates a (aset of
> bump) memory context with the calculated max block size and passes it
> to the radix tree.

That might be a better abstraction since both uses have some memory limit.

> As for using the bump memory context, I feel that we need to store
> iterator struct in aset context at least as it can be individually
> freed and re-created. Or it might not be necessary to allocate the
> iterator struct in the same context as radix tree.

Okay, that's one thing I was concerned about. Since we don't actually
have a bump context yet, it seems simple to assume aset for non-nodes,
and if we do get it, we can adjust slightly. Anyway, this seems like a
good thing to try to clean up, but it's also not a show-stopper.

On that note: I will be going on honeymoon shortly, and then to PGConf
India, so I will have sporadic connectivity for the next 10 days and
won't be doing any hacking during that time.

Andres, did you want to take a look at the radix tree patch 0003?
Aside from the above possible cleanup, most of it should be stable.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-15 Thread John Naylor
On Fri, Feb 16, 2024 at 10:05 AM Masahiko Sawada  wrote:
> > v61-0007: Runtime-embeddable tids -- Optional for v17, but should
> > reduce memory regressions, so should be considered. Up to 3 tids can
> > be stored in the last level child pointer. It's not polished, but I'll
> > only proceed with that if we think we need this. "flags" iis called
> > that because it could hold tidbitmap.c booleans (recheck, lossy) in
> > the future, in addition to reserving space for the pointer tag. Note:
> > I hacked the tests to only have 2 offsets per block to demo, but of
> > course both paths should be tested.
>
> Interesting. I've run the same benchmark tests we did[1][2] (the
> median of 3 runs):
>
> monotonically ordered int column index:
>
> master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s
> v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s
> v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s

Hmm, that's strange -- this test is intended to delete all records
from the last 20% of the blocks, so I wouldn't expect any improvement
here, only in the sparse case. Maybe something is wrong. All the more
reason to put it off...

> I'm happy to see a huge improvement. While it's really fascinating to
> me, I'm concerned about the time left until the feature freeze. We
> need to polish both tidstore and vacuum integration patches in 5
> weeks. Personally I'd like to have it as a separate patch for now, and
> focus on completing the main three patches since we might face some
> issues after pushing these patches. I think with 0007 patch it's a big
> win but it's still a win even without 0007 patch.

Agreed to not consider it for initial commit. I'll hold on to it for
some future time.

> > 2. Management of memory contexts. It's pretty verbose and messy. I
> > think the abstraction could be better:
> > A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we
> > can't destroy or reset it. That means we have to do a lot of manual
> > work.
> > B: Passing "max_bytes" to the radix tree was my idea, I believe, but
> > it seems the wrong responsibility. Not all uses will have a
> > work_mem-type limit, I'm guessing. We only use it for limiting the max
> > block size, and aset's default 8MB is already plenty small for
> > vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so
> > smaller, and there it makes sense to limit the max blocksize this way.
> > C: The context for values has complex #ifdefs based on the value
> > length/varlen, but it's both too much and not enough. If we get a bump
> > context, how would we shoehorn that in for values for vacuum but not
> > for tidbitmap?
> >
> > Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to
> > TidStoreCreate(), and then to RT_CREATE. That context will contain the
> > values (for local mem), and the node slabs will be children of the
> > value context. That way, measuring memory usage and free-ing can just
> > call with this parent context, and let recursion handle the rest.
> > Perhaps the passed context can also hold the radix-tree struct, but
> > I'm not sure since I haven't tried it. What do you think?
>
> If I understand your idea correctly, RT_CREATE() creates the context
> for values as a child of the passed context and the node slabs as
> children of the value context. That way, measuring memory usage can
> just call with the value context. It sounds like a good idea. But it
> was not clear to me how to address point B and C.

For B & C, vacuum would create a context to pass to TidStoreCreate,
and it wouldn't need to bother changing max block size. RT_CREATE
would use that directly for leaves (if any), and would only create
child slab contexts under it. It would not need to know about
max_bytes. Modifyng your diagram a bit, something like:

- caller-supplied radix tree memory context (the 3 structs -- and
leaves, if any) (aset (or future bump?))
- node slab contexts

This might only be workable with aset, if we need to individually free
the structs. (I haven't studied this, it was a recent idea)
It's simpler, because with small fixed length values, we don't need to
detect that and avoid creating a leaf context. All leaves would live
in the same context as the structs.
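
In rough code, the caller side could look something like this
(TidStoreCreate's signature here is hypothetical, just to show the
ownership):

	MemoryContext ts_ctx = AllocSetContextCreate(CurrentMemoryContext,
												 "TID storage",
												 ALLOCSET_DEFAULT_SIZES);
	TidStore   *ts = TidStoreCreate(max_bytes, ts_ctx);

	/* ... use the store ... */
	MemoryContextDelete(ts_ctx);	/* frees structs, leaves, node slabs */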

> Another variant of this idea would be that RT_CREATE() creates the
> parent context of the value context to store radix-tree struct. That
> is, the hierarchy would be like:
>
> A MemoryContext (passed by vacuum through tidstore)
> - radix tree memory context (store radx-tree struct, control
> struct, and iterator)
> - value context (aset, slab, or bump)
> - node slab contexts

The template handling the value context here is complex, and is what I
meant by 'C' above. Most fixed length allocations in all of the
backend are aset, so it seems fine to use it always.

> Freeing can just call with the radix tree memory context. And perhaps
> it works even if tidstore passes CurrentMemo

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-15 Thread John Naylor
v61 had a brown-paper-bag bug in the embedded tids patch that didn't
show up in the tidstore test, but caused vacuum to fail; fixed in v62.


v62-ART.tar.gz
Description: application/gzip

