Re: postgres_fdw fails to see that array type belongs to extension

2024-01-15 Thread David Geier

Hi,

I realized that ALTER EXTENSION foo ADD TYPE _bar does pretty much the 
same via ExecAlterExtensionContentsStmt(). So the code in the patch 
seems fine.


On 1/8/24 12:21, David Geier wrote:
The attached patch just adds a second dependency between the array type 
and the extension, using recordDependencyOnCurrentExtension(). It seems 
that the other internal dependency on the element type must stay? If 
that seems reasonable I can add a test to modules/test_extensions. Can 
extensions in contrib use test extensions in their own tests? It looks 
like postgres_fdw doesn't test any of the shippability logic.



--
David Geier
(ServiceNow)





Re: postgres_fdw fails to see that array type belongs to extension

2024-01-08 Thread David Geier

Hi,

On 12/27/23 18:38, Tom Lane wrote:

Hmm.  It seems odd that if an extension defines a type, the type is
listed as a member of the extension but the array type is not.
That makes it look like the array type is an externally-created
thing that happens to depend on the extension, when it's actually
part of the extension.  I'm surprised we've not run across other
misbehaviors traceable to that.

Agreed.

Of course, fixing it like that leads to needing to change the
contents of pg_depend, so it wouldn't be back-patchable.  But it
seems like the best way in the long run.


The attached patch just adds a second dependency between the array type 
and the extension, using recordDependencyOnCurrentExtension(). It seems 
that the other internal dependency on the element type must stay? If 
that seems reasonable I can add a test to modules/test_extensions. Can 
extensions in contrib use test extensions in their own tests? It looks 
like postgres_fdw doesn't test any of the shippability logic.


--
David Geier
(ServiceNow)
From de23a4e9f1b0620a5204594139568cdcb3d57885 Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Mon, 8 Jan 2024 10:58:21 +0100
Subject: [PATCH] Fix dependency of array of type owned by extension

---
 src/backend/catalog/pg_type.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/catalog/pg_type.c b/src/backend/catalog/pg_type.c
index d7167108fb..3544a3af1a 100644
--- a/src/backend/catalog/pg_type.c
+++ b/src/backend/catalog/pg_type.c
@@ -724,8 +724,10 @@ GenerateTypeDependencies(HeapTuple typeTuple,
 	if (OidIsValid(typeForm->typelem))
 	{
 		ObjectAddressSet(referenced, TypeRelationId, typeForm->typelem);
-		recordDependencyOn(&myself, &referenced,
-						   isImplicitArray ? DEPENDENCY_INTERNAL : DEPENDENCY_NORMAL);
+		recordDependencyOn(&myself, &referenced, isImplicitArray ? DEPENDENCY_INTERNAL : DEPENDENCY_NORMAL);
+
+		if (makeExtensionDep)
+			recordDependencyOnCurrentExtension(&myself, rebuild);
 	}
 }
 
-- 
2.39.2



postgres_fdw fails to see that array type belongs to extension

2023-12-27 Thread David Geier

Hi hackers,

We use postgres_fdw to connect two databases. Both DBs have an extension 
installed which provides a custom string data type. Our extension is 
known to the FDW as we created the foreign server with our extension 
listed in the "extensions" option.


The filter clause of the query SELECT * FROM test WHERE col = 'foo' OR 
col = 'bar' is pushed down to the remote, while the filter clause of the 
semantically equivalent query SELECT * FROM test WHERE col IN ('foo', 
'bar') is not.


I traced this down to getExtensionOfObject(), called from 
lookup_shippable(). getExtensionOfObject() doesn't recurse; it only 
checks first-level dependencies, and only dependencies on extensions. 
However, the IN operator takes an array of our custom data type as its 
argument (the array type is typically prefixed with _ in pg_type). That 
array type only depends on our extension indirectly, via the custom 
element type, i.e. in two steps, which postgres_fdw doesn't see. 
Therefore, postgres_fdw doesn't allow the IN clause to be pushed down.
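
For illustration, here is a condensed sketch of the check involved 
(modeled on what lookup_shippable() in postgres_fdw's shippable.c 
effectively does; simplified, not the literal code):

static bool
lookup_shippable(Oid objectId, Oid classId, PgFdwRelationInfo *fpinfo)
{
	Oid			extensionOid;

	/*
	 * Only the direct pg_depend entry is consulted.  For the implicit
	 * array type "_bar" the only recorded dependency is the internal one
	 * on its element type "bar", so no extension is found here.
	 */
	extensionOid = getExtensionOfObject(classId, objectId);

	return OidIsValid(extensionOid) &&
		list_member_oid(fpinfo->shippable_extensions, extensionOid);
}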


Thoughts?

--
David Geier
(ServiceNow)





Fix assertion in autovacuum worker

2023-11-28 Thread David Geier

Hi hackers,

PostgreSQL hit the following assertion during error cleanup, after 
running out of memory in dsa_allocate0():


void
dshash_detach(dshash_table *hash_table)
{
	ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table);


dshash_detach() is called from pgstat_shutdown_hook(), which is called 
from shmem_exit(), called from proc_exit(), called from the exception 
handler.


The partition locks were previously acquired via

AutoVacWorkerMain()
  pgstat_report_autovac()
    pgstat_get_entry_ref_locked()
      pgstat_get_entry_ref()
        dshash_find_or_insert()
          resize()

resize() locks all partitions so the hash table can safely be resized. 
Then it calls dsa_allocate0(). If dsa_allocate0() fails to allocate, it 
errors out. The exception handler calls proc_exit(), which normally 
calls LWLockReleaseAll() via AbortTransaction(), but only if there's an 
active transaction. However, pgstat_report_autovac() runs before a 
transaction has been started, and hence LWLockReleaseAll() doesn't run 
before pgstat_shutdown_hook() is called.


See attached patch for an attempt to fix this issue.

--
David Geier
(ServiceNow)
From 5580e3680b2211235e4bc2b5dcbfe6b4f5b8eee5 Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Tue, 28 Nov 2023 18:52:46 +0100
Subject: [PATCH] Fix autovacuum cleanup on error

---
 src/backend/postmaster/autovacuum.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index f929b62e8a..5de55649d9 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1584,6 +1584,9 @@ AutoVacWorkerMain(int argc, char *argv[])
 		/* Report the error to the server log */
 		EmitErrorReport();
 
+		/* Make sure all locks are released so assertions don't hit in at-exit callbacks */
+		LWLockReleaseAll();
+
 		/*
 		 * We can now go away.  Note that because we called InitProcess, a
 		 * callback was registered to do ProcKill, which will clean up
-- 
2.39.2



Re: how to do profile for pg?

2023-09-21 Thread David Geier

Hi,

This talk from Andres seems to have some relevant information for you:

https://www.youtube.com/watch?v=HghP4D72Noc

--
David Geier
(ServiceNow)





Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

2023-09-20 Thread David Geier

Hi Dmitry,

Thanks for looking at the patch and sorry for the long line.

On 3/17/23 21:14, Dmitry Dolgov wrote:

The idea sounds reasonable to me, but I have to admit snapshot_and_stats
implementation looks awkward. Maybe it would be better to have a
separate structure field for both stats and snapshot, which will be set
to point to a corresponding place in the shared FAM e.g. when the worker
is getting initialized? shm_toc_allocate mentions BUFFERALIGN to handle
possibility of some atomic operations needing it, so I guess that would
have to be an alignment in this case as well.

Probably another option would be to allocate two separate pieces of
shared memory, which resolves questions like proper alignment, but
annoyingly will require an extra lookup and a new key.


I considered the other options and it seems to me none of them is 
particularly superior. All of them have pros and cons with the cons 
mostly outweighing the pros. Let me quickly elaborate:


1. Use multiple shm_toc entries: the shared state is split into multiple 
pieces, and extra pointers in BitmapHeapScanState are needed to point at 
the split-out data. BitmapHeapScanState already has a shared_info 
member, which is not a pointer to the shared memory but a pointer to 
leader-local data allocated to store the instrumentation data from the 
workers. This is confusing but at least consistent with how it's done in 
other places (e.g. execSort.c, nodeHash.c, nodeIncrementalSort.c). 
Having another pointer there which points to the shared memory makes it 
even more confusing. If we go this way we would have e.g. 
shared_info_copy and shared_info members in BitmapHeapScanState.


2. Store two extra pointers to the shared FAM entries in 
BitmapHeapScanState: IMHO, that is the better variant of (1), as it 
doesn't need an extra TOC entry but comes with the same confusion of 
having multiple pointers to SharedBitmapHeapScanInfo in 
BitmapHeapScanState. But maybe that's not too bad?


3. Solution in initial patch (use two functions to obtain pointers 
where/when needed): Avoids the need for another pointer in 
BitmapHeapScanState at the cost / ugliness of having to call the helper 
functions.


Another, not yet discussed, option I can see work is:

4. Allocate a fixed amount of memory for the instrumentation stats, 
based on MAX_PARALLEL_WORKER_LIMIT: MAX_PARALLEL_WORKER_LIMIT is 1024 
and is used as the limit of the max_parallel_workers GUC. This way 
MAX_PARALLEL_WORKER_LIMIT * sizeof(BitmapHeapScanInstrumentation) = 1024 
* 8 = 8192 bytes would be allocated. To cut this down in half we could 
additionally change the type of lossy_pages and exact_pages from long to 
uint32. Only the memory that is actually needed would have to be 
initialized; the remaining, unused memory would remain untouched so as 
not to waste cycles.
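
A rough sketch of what option (4) could look like (the struct and field 
names follow the terminology used above and in the patch, but the exact 
layout is illustrative only; uint32 is PostgreSQL's usual typedef):

typedef struct BitmapHeapScanInstrumentation
{
	uint32		exact_pages;
	uint32		lossy_pages;
} BitmapHeapScanInstrumentation;

typedef struct SharedBitmapHeapScanInfo
{
	int			num_workers;

	/*
	 * Sized for the theoretical maximum number of workers; only the
	 * entries of launched workers ever get initialized and read.
	 */
	BitmapHeapScanInstrumentation sinstrument[MAX_PARALLEL_WORKER_LIMIT];
} SharedBitmapHeapScanInfo;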


My first preference is the new option (4). My second preference is 
option (1). What's your take?


--
David Geier
(ServiceNow)





Re: Eliminate redundant tuple visibility check in vacuum

2023-09-01 Thread David Geier

Hi,

On 9/1/23 03:25, Peter Geoghegan wrote:

On Thu, Aug 31, 2023 at 3:35 PM Melanie Plageman
 wrote:

Any inserting transaction which aborts after heap_page_prune()'s
visibility check will now be of no concern to lazy_scan_prune(). Since
we don't do the visibility check again, we won't find the tuple
HEAPTUPLE_DEAD and thus won't have the problem of adding the tuple to
the array of tuples for vacuum to reap. This does mean that we won't
reap and remove tuples whose inserting transactions abort right after
heap_page_prune()'s visibility check. But, this doesn't seem like an
issue.

It's definitely not an issue.

The loop is essentially a hacky way of getting a consistent picture of
which tuples should be treated as HEAPTUPLE_DEAD, and which tuples
need to be left behind (consistent at the level of each page and each
HOT chain, at least). Making that explicit seems strictly better.


They may not be removed until the next vacuum, but ISTM it is
actually worse to pay the cost of re-pruning the whole page just to
clean up that one tuple. Maybe if that renders the page all visible
and we can mark it as such in the visibility map -- but that seems
like a relatively narrow use case.

The chances of actually hitting the retry are microscopic anyway. It
has nothing to do with making sure that dead tuples from aborted
tuples get removed for its own sake, or anything. Rather, the retry is
all about making sure that all TIDs that get removed from indexes can
only point to LP_DEAD stubs. Prior to Postgres 14, HEAPTUPLE_DEAD
tuples with storage would very occasionally be left behind, which made
life difficult in a bunch of other places -- for no good reason.

That makes sense and seems like a much better compromise. Thanks for the 
explanations. Please update the comment to document the corner case and 
how we handle it.


--
David Geier
(ServiceNow)





Re: Eliminate redundant tuple visibility check in vacuum

2023-08-31 Thread David Geier

Hi Melanie,

On 8/31/23 02:59, Melanie Plageman wrote:

I created a large table and then updated a tuple on every page in the
relation and vacuumed it. I saw a consistent slight improvement in
vacuum execution time. I profiled a bit with perf stat as well. The
difference is relatively small for this kind of example, but I can
work on a more compelling, realistic example. I think eliminating the
retry loop is also useful, as I have heard that users have had to
cancel vacuums which were in this retry loop for too long.

Just to provide a specific test case, if you create a small table like this

create table foo (a int, b int, c int) with(autovacuum_enabled=false);
insert into foo select i, i, i from generate_series(1, 1000);

And then vacuum it. I find that with my patch applied I see a
consistent ~9% speedup (averaged across multiple runs).

master: ~533ms
patch: ~487ms

And in the profile, with my patch applied, you notice less time spent
in HeapTupleSatisfiesVacuumHorizon()

master:
 11.83%  postgres  postgres   [.] heap_page_prune
 11.59%  postgres  postgres   [.] heap_prepare_freeze_tuple
  8.77%  postgres  postgres   [.] lazy_scan_prune
  8.01%  postgres  postgres   [.] HeapTupleSatisfiesVacuumHorizon
  7.79%  postgres  postgres   [.] heap_tuple_should_freeze

patch:
 13.41%  postgres  postgres   [.] heap_prepare_freeze_tuple
  9.88%  postgres  postgres   [.] heap_page_prune
  8.61%  postgres  postgres   [.] lazy_scan_prune
  7.00%  postgres  postgres   [.] heap_tuple_should_freeze
  6.43%  postgres  postgres   [.] HeapTupleSatisfiesVacuumHorizon


Thanks a lot for providing additional information and the test case.

I tried it on a release build and I also see a 10% speed-up. I reset the 
visibility map between VACUUM runs, see:


CREATE EXTENSION pg_visibility;
CREATE TABLE foo (a INT, b INT, c INT) WITH (autovacuum_enabled = FALSE);
INSERT INTO foo SELECT i, i, i FROM generate_series(1, 1000) i;
VACUUM foo;
SELECT pg_truncate_visibility_map('foo');
VACUUM foo;
SELECT pg_truncate_visibility_map('foo');
VACUUM foo;
...


The first patch, which refactors the code so we can pass the result of 
the visibility checks to the caller, looks good to me.
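
For reference, a simplified sketch of the shape of that refactoring (the 
struct and field names are illustrative, not the patch's exact ones; 
int8 and MaxHeapTuplesPerPage are the usual PostgreSQL definitions):

typedef struct PruneResult
{
	/*
	 * Visibility result computed by heap_page_prune() for each offset,
	 * so that lazy_scan_prune() can reuse it instead of calling
	 * HeapTupleSatisfiesVacuum() a second time.
	 */
	int8		htsv[MaxHeapTuplesPerPage + 1];
} PruneResult;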


Regarding the 2nd patch (disclaimer: I'm not too familiar with that area 
of the code): I don't completely understand why the retry loop is not 
needed anymore and how you now detect/handle the possible race 
condition? It can still happen that an aborting transaction changes the 
state of a row after heap_page_prune() looked at that row. Would that 
case now not be ignored?


--
David Geier
(ServiceNow)





Re: Eliminate redundant tuple visibility check in vacuum

2023-08-29 Thread David Geier

Hi Melanie,

On 8/29/23 01:49, Melanie Plageman wrote:

While working on a set of patches to combine the freeze and visibility
map WAL records into the prune record, I wrote the attached patches
reusing the tuple visibility information collected in heap_page_prune()
back in lazy_scan_prune().

heap_page_prune() collects the HTSV_Result for every tuple on a page
and saves it in an array used by heap_prune_chain(). If we make that
array available to lazy_scan_prune(), it can use it when collecting
stats for vacuum and determining whether or not to freeze tuples.
This avoids calling HeapTupleSatisfiesVacuum() again on every tuple in
the page.

It also gets rid of the retry loop in lazy_scan_prune().


How did you test this change?

Could you measure any performance difference?

If so, could you provide your test case?

--
David Geier
(ServiceNow)





Re: Let's make PostgreSQL multi-threaded

2023-08-25 Thread David Geier

Hi,

On 8/11/23 14:05, Merlin Moncure wrote:

On Thu, Jul 27, 2023 at 8:28 AM David Geier  wrote:

Hi,

On 6/7/23 23:37, Andres Freund wrote:
> I think we're starting to hit quite a few limits related to the
process model,
> particularly on bigger machines. The overhead of cross-process
context
> switches is inherently higher than switching between threads in
the same
> process - and my suspicion is that that overhead will continue to
> increase. Once you have a significant number of connections we
end up spending
> a *lot* of time in TLB misses, and that's inherent to the
process model,
> because you can't share the TLB across processes.

Another problem I haven't seen mentioned yet is the excessive kernel
memory usage because every process has its own set of page table
entries
(PTEs). Without huge pages the amount of wasted memory can be huge if
shared buffers are big.


Hm, noted this upthread, but asking again: does this help/benefit 
interactions with the operating system and make OOM-kill situations less 
likely?   These things are the bane of my existence, 
and I'm having a hard time finding a solution that prevents them other 
than running pgbouncer and lowering max_connections, which adds 
complexity.  I suspect I'm not the only one dealing with this.  
 What's really scary about these situations is they come without 
warning.  Here's a pretty typical example per sar -r.


The conjecture here is that lots of idle connections make the server 
appear to have less memory available than it actually has, and sudden 
transient demands can cause it to destabilize.


It does, in the sense that your server will have more memory available 
if you have many long-lived connections around; every connection has 
less kernel memory overhead, if you will. Of course, even then a runaway 
query will be able to invoke the OOM killer. The unfortunate thing about 
the OOM killer is that, in my experience, it often kills the 
checkpointer. That's because the checkpointer will touch all of shared 
buffers over time, which makes it likely to get selected by the OOM 
killer. Have you tried disabling memory overcommit?


--
David Geier
(ServiceNow)





Re: Let's make PostgreSQL multi-threaded

2023-07-27 Thread David Geier

Hi,

On 6/7/23 23:37, Andres Freund wrote:

I think we're starting to hit quite a few limits related to the process model,
particularly on bigger machines. The overhead of cross-process context
switches is inherently higher than switching between threads in the same
process - and my suspicion is that that overhead will continue to
increase. Once you have a significant number of connections we end up spending
a *lot* of time in TLB misses, and that's inherent to the process model,
because you can't share the TLB across processes.


Another problem I haven't seen mentioned yet is the excessive kernel 
memory usage because every process has its own set of page table entries 
(PTEs). Without huge pages the amount of wasted memory can be huge if 
shared buffers are big.


For example with 256 GiB of used shared buffers a single process needs 
about 256 MiB for the PTEs (for simplicity I ignored the tree structure 
of the page tables and just took the number of 4k pages times 4 bytes 
per PTE). With 512 connections, which is not uncommon for machines with 
many cores, a total of 128 GiB of memory is just spent on page tables.
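
A quick back-of-the-envelope check of these numbers (a standalone sketch 
using the same simplifying assumptions as above, i.e. a flat page table 
with 4 bytes per PTE):

#include <stdio.h>

int
main(void)
{
	const unsigned long long shared_buffers = 256ULL << 30;	/* 256 GiB */
	const unsigned long long page_size = 4096;					/* 4 KiB pages */
	const unsigned long long pte_size = 4;						/* bytes per PTE */
	const unsigned long long connections = 512;

	unsigned long long per_process = shared_buffers / page_size * pte_size;

	printf("PTEs per process: %llu MiB\n", per_process >> 20);	/* 256 MiB */
	printf("PTEs for %llu connections: %llu GiB\n",
		   connections, (per_process * connections) >> 30);	/* 128 GiB */
	return 0;
}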


We used non-transparent huge pages to work around this limitation but 
they come with plenty of provisioning challenges, especially in cloud 
infrastructures where different services run next to each other on the 
same server. Transparent huge pages have unpredictable performance 
disadvantages. Also if some backends only use shared buffers sparsely, 
memory is wasted for the remaining, unused range inside the huge page.


--
David Geier
(ServiceNow)





Re: pg_stat_statements and "IN" conditions

2023-02-23 Thread David Geier

Hi,


Seems like supporting only constants is a good starting
point. The only thing that is likely confusing for users is that NUMERICs
(and potentially constants of other types) are unsupported. Wouldn't it be
fairly simple to support them via something like the following?

     is_const(element) || (is_coercion(element) && is_const(element->child))

It definitely makes sense to implement that, although I don't think it's
going to be acceptable to do that via directly listing conditions an
element has to satisfy. It probably has to be more flexible, since we
would like to extend it in the future. My plan is to address this in a
follow-up patch, when the main mechanism is approved. Would you agree
with this approach?


I still think it's counterintuitive, and I'm pretty sure people would 
even report this as a bug: without knowing about the difference in 
internal representation, you would expect NUMERICs to work the same way 
other constants do. If anything, we would have to document it.


Can't we do something pragmatic and have something like 
IsMergableInElement() which for now only supports constants and can 
later be extended? Or what exactly do you mean by "more flexible"?
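
Just to make the idea concrete, a hedged sketch of what such a check 
could look like (the function name and the exact set of coercion cases 
are illustrative, not from the patch; it assumes the usual PostgreSQL 
node and list headers, and accepts plain constants as well as constants 
wrapped in a single implicit coercion, which is how NUMERIC literals 
show up per the discussion above):

static bool
IsMergableInElement(Node *element)
{
	/* plain constant */
	if (IsA(element, Const))
		return true;

	/*
	 * Constant wrapped in an implicit cast, e.g. an integer literal
	 * coerced to NUMERIC.
	 */
	if (IsA(element, FuncExpr))
	{
		FuncExpr   *func = (FuncExpr *) element;

		return func->funcformat == COERCE_IMPLICIT_CAST &&
			list_length(func->args) == 1 &&
			IsA(linitial(func->args), Const);
	}

	/* binary-compatible relabeling of a constant */
	if (IsA(element, RelabelType))
		return IsA(((RelabelType *) element)->arg, Const);

	return false;
}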


--
David Geier
(ServiceNow)





Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

2023-02-21 Thread David Geier

Hi,

On 1/20/23 09:34, David Geier wrote:
EXPLAIN ANALYZE for parallel Bitmap Heap Scans currently only reports 
the number of heap blocks processed by the leader. It's missing the 
per-worker stats. The attached patch adds that functionality in the 
spirit of e.g. Sort or Memoize. Here is a simple test case and the 
EXPLAIN ANALYZE output with and without the patch:


Attached is a rebased version of the patch. I would appreciate someone 
taking a look.


As background: the change doesn't come out of thin air. We repeatedly 
drew wrong conclusions in our query analysis because we assumed that the 
reported block counts included the workers'.


If no one objects I would also register the patch at the commit fest. 
The patch is passing cleanly on CI.


--
David Geier
(ServiceNow)
From de03dfb101810f051b883804947707ecddaffceb Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Tue, 8 Nov 2022 19:40:31 +0100
Subject: [PATCH v2] Parallel Bitmap Heap Scan reports per-worker stats

Similarly to other nodes (e.g. hash join, sort, memoize),
Bitmap Heap Scan now reports per-worker stats in the EXPLAIN
ANALYZE output. Previously only the heap blocks stats for the
leader were reported which was incomplete in parallel scans.

	Discussion: https://www.postgresql.org/message-id/flat/b3d80961-c2e5-38cc-6a32-61886cdf766d%40gmail.com
---
 src/backend/commands/explain.c| 45 ++--
 src/backend/executor/execParallel.c   |  3 +
 src/backend/executor/nodeBitmapHeapscan.c | 88 +++
 src/include/executor/nodeBitmapHeapscan.h |  1 +
 src/include/nodes/execnodes.h | 23 +-
 5 files changed, 137 insertions(+), 23 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e57bda7b62..b5b31a763c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3396,26 +3396,57 @@ show_hashagg_info(AggState *aggstate, ExplainState *es)
 static void
 show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
 {
+	Assert(es->analyze);
+
 	if (es->format != EXPLAIN_FORMAT_TEXT)
 	{
 		ExplainPropertyInteger("Exact Heap Blocks", NULL,
-			   planstate->exact_pages, es);
+			   planstate->stats.exact_pages, es);
 		ExplainPropertyInteger("Lossy Heap Blocks", NULL,
-			   planstate->lossy_pages, es);
+			   planstate->stats.lossy_pages, es);
 	}
 	else
 	{
-		if (planstate->exact_pages > 0 || planstate->lossy_pages > 0)
+		if (planstate->stats.exact_pages > 0 || planstate->stats.lossy_pages > 0)
 		{
 			ExplainIndentText(es);
 			appendStringInfoString(es->str, "Heap Blocks:");
-			if (planstate->exact_pages > 0)
-appendStringInfo(es->str, " exact=%ld", planstate->exact_pages);
-			if (planstate->lossy_pages > 0)
-appendStringInfo(es->str, " lossy=%ld", planstate->lossy_pages);
+			if (planstate->stats.exact_pages > 0)
+appendStringInfo(es->str, " exact=%ld", planstate->stats.exact_pages);
+			if (planstate->stats.lossy_pages > 0)
+appendStringInfo(es->str, " lossy=%ld", planstate->stats.lossy_pages);
 			appendStringInfoChar(es->str, '\n');
 		}
 	}
+
+	if (planstate->shared_info != NULL)
+	{
+		for (int n = 0; n < planstate->shared_info->num_workers; n++)
+		{
+			BitmapHeapScanInstrumentation *si = &planstate->shared_info->sinstrument[n];
+
+			if (si->exact_pages == 0 && si->lossy_pages == 0)
+continue;
+
+			if (es->workers_state)
+ExplainOpenWorker(n, es);
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+ExplainIndentText(es);
+appendStringInfo(es->str, "Heap Blocks: exact=%ld lossy=%ld\n",
+		 si->exact_pages, si->lossy_pages);
+			}
+			else
+			{
+ExplainPropertyInteger("Exact Heap Blocks", NULL, si->exact_pages, es);
+ExplainPropertyInteger("Lossy Heap Blocks", NULL, si->lossy_pages, es);
+			}
+
+			if (es->workers_state)
+ExplainCloseWorker(n, es);
+		}
+	}
 }
 
 /*
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index aa3f283453..73ef4f0c0f 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -1072,6 +1072,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_MemoizeState:
 			ExecMemoizeRetrieveInstrumentation((MemoizeState *) planstate);
 			break;
+		case T_BitmapHeapScanState:
+			ExecBitmapHeapRetrieveInstrumentation((BitmapHeapScanState *) planstate);
+			break;
 		default:
 			break;
 	}
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index f35df0b8bf..6d36e7922b 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -239,9 +239,9 @@ BitmapHeapNext(BitmapHeapScanState *node)
 			}

Re: Performance issues with parallelism and LIMIT

2023-02-20 Thread David Geier

Hi,

On 2/8/23 11:42, Tomas Vondra wrote:

On 2/1/23 14:41, David Geier wrote:

Yeah, this is a pretty annoying regression. We already can hit poor
behavior when matching rows are not distributed uniformly in the tables
(which is what LIMIT costing assumes), and this makes it more likely to
hit similar issues. A bit like when doing many HTTP requests makes it
more likely to hit at least one 99% outlier.
Are you talking about the use of ordering vs filtering indexes in 
queries where there's both an ORDER BY and a filter present (e.g. using 
an ordering index but then all rows passing the filter are at the end of 
the table)? If not, can you elaborate a bit more on that and maybe give 
an example.

No opinion on these options, but worth a try. Alternatively, we could
try the usual doubling approach - start with a low threshold (and set
the latch frequently), and then gradually increase it up to the 1/4.

That should work both for queries expecting only few rows and those
producing a lot of data.


I was thinking about this variant as well. One more alternative would be 
latching the leader once a worker has produced 1/Nth of the LIMIT where 
N is the number of workers. Both variants have the disadvantage that 
there are still corner cases where the latch is set too late; but it 
would for sure be much better than what we have today.


I also did some profiling and - at least on my development laptop with 8 
physical cores - the original example motivating the batching change is 
slower than when batching is disabled by commenting out:


    if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2))

SET parallel_tuple_cost TO 0;
CREATE TABLE b (a int);
INSERT INTO b SELECT generate_series(1, 2);
ANALYZE b;
EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM b;

 Gather  (cost=1000.00..1200284.61 rows=200375424 width=4) (actual rows=2 loops=1)
   Workers Planned: 7
   Workers Launched: 7
   ->  Parallel Seq Scan on b  (cost=0.00..1199284.61 rows=28625061 width=4) (actual rows=2500 loops=8)


Always latch: 19055 ms
Batching:     19575 ms

If I find some time, I'll play around a bit more and maybe propose a patch.


...

We would need something similar to CHECK_FOR_INTERRUPTS() which returns
a NULL slot if a parallel worker is supposed to stop execution (we could
e.g. check if the queue got detached). Or could we amend
CHECK_FOR_INTERRUPTS() to just stop the worker gracefully if the queue
got detached?


That sounds reasonable, but I'm not very familiar the leader-worker
communication, so no opinion on how it should be done.


I think an extra macro that needs to be called from dozens of places to 
check if parallel execution is supposed to end is the least preferred 
approach. I'll read up more on how CHECK_FOR_INTERRUPTS() works and on 
whether we can instead actively signal the workers that they should stop.


--
David Geier
(ServiceNow)





Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-02-20 Thread David Geier

Hi!

On 2/14/23 13:48, David Geier wrote:


It still fails.

I'll get Cirrus-CI working on my own Github fork so I can make sure it 
really compiles on all platforms before I submit a new version.


It took some time until Cirrus CI allowed me to run tests against my new 
GitHub account (there's a 3-day freeze to prevent people from using 
Cirrus CI nodes to mine bitcoins :-D). Attached is now the latest patch, 
which passes the builds, rebased on latest master.


I also reviewed the first two patches a while ago in [1]. I hope we can 
progress with them to further reduce the size of this patch set.


Beyond that: I could work on support for more OSs (e.g. starting with 
Windows). Is there appetite for that, or would we rather start with a 
smaller patch?


[1] 
https://www.postgresql.org/message-id/3ac157f7-085d-e071-45fc-b87cd306360c%40gmail.com


--
David Geier
(ServiceNow)
From d03a9be2522b0ef22fd58cbcfc95eb19ca8b2bea Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Fri, 20 Jan 2023 15:31:54 -0800
Subject: [PATCH v9 1/3] instr_time: Add INSTR_TIME_SET_SECONDS(),
 INSTR_TIME_IS_LT()

INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop
without having to convert into time units (which is
costly). INSTR_TIME_IS_LT() can be used to check the loop condition.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/portability/instr_time.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index cc85138e21..aab80effb0 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -15,6 +15,8 @@
  *
  * INSTR_TIME_IS_ZERO(t)			is t equal to zero?
  *
+ * INSTR_TIME_IS_LT(x, y)			x < y
+ *
  * INSTR_TIME_SET_ZERO(t)			set t to zero (memset is acceptable too)
  *
  * INSTR_TIME_SET_CURRENT(t)		set t to current time
@@ -22,6 +24,8 @@
  * INSTR_TIME_SET_CURRENT_LAZY(t)	set t to current time if t is zero,
  *	evaluates to whether t changed
  *
+ * INSTR_TIME_SET_SECONDS(t, s)		set t to s seconds
+ *
  * INSTR_TIME_ADD(x, y)x += y
  *
  * INSTR_TIME_SUBTRACT(x, y)		x -= y
@@ -122,6 +126,9 @@ pg_clock_gettime_ns(void)
 #define INSTR_TIME_SET_CURRENT(t) \
 	((t) = pg_clock_gettime_ns())
 
+#define INSTR_TIME_SET_SECONDS(t, s) \
+	((t).ticks = NS_PER_S * (s))
+
 #define INSTR_TIME_GET_NANOSEC(t) \
 	((int64) (t).ticks)
 
@@ -156,6 +163,9 @@ GetTimerFrequency(void)
 #define INSTR_TIME_SET_CURRENT(t) \
 	((t) = pg_query_performance_counter())
 
+#define INSTR_TIME_SET_SECONDS(t, s) \
+	((t).ticks = s * GetTimerFrequency())
+
 #define INSTR_TIME_GET_NANOSEC(t) \
 	((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency(
 
@@ -168,6 +178,8 @@ GetTimerFrequency(void)
 
 #define INSTR_TIME_IS_ZERO(t)	((t).ticks == 0)
 
+#define INSTR_TIME_IS_LT(x, y)	((x).ticks < (y).ticks)
+
 
 #define INSTR_TIME_SET_ZERO(t)	((t).ticks = 0)
 
-- 
2.34.1

From 639213b01102a7320a62ba5ed68a8e3d6a05514b Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 16 Jan 2023 11:19:11 -0800
Subject: [PATCH v9 2/3] wip: report nanoseconds in pg_test_timing

This commit also updates pg_test_timing's documentation:

- compare EXPLAIN (ANALYZE, TIMING ON/OFF) instead of comparing performance of
  of statement with/without EXPLAIN ANALYZE
- explain the 2x overhead (due to two timestamp acquisitions per row)
- remove old section about old versions of linux - I couldn't update the
  numbers, and it's old enough nobody would care
---
 doc/src/sgml/ref/pgtesttiming.sgml  | 117 ++--
 src/bin/pg_test_timing/pg_test_timing.c |  74 +--
 2 files changed, 95 insertions(+), 96 deletions(-)

diff --git a/doc/src/sgml/ref/pgtesttiming.sgml b/doc/src/sgml/ref/pgtesttiming.sgml
index a5eb3aa25e..7e0266cf58 100644
--- a/doc/src/sgml/ref/pgtesttiming.sgml
+++ b/doc/src/sgml/ref/pgtesttiming.sgml
@@ -93,28 +93,34 @@ PostgreSQL documentation
 
   
Good results will show most (>90%) individual timing calls take less than
-   one microsecond. Average per loop overhead will be even lower, below 100
-   nanoseconds. This example from an Intel i7-860 system using a TSC clock
-   source shows excellent performance:
+   one microsecond (1000 nanoseconds). Average per loop overhead will be even
+   lower, below 100 nanoseconds. This example from an Intel i9-9880H system
+   using a TSC clock source shows excellent performance:
 
 
   
 
   
-   Note that different units are used for the per loop time than the
-   histogram. The loop can have resolution within a few nanoseconds (ns),
-   while the individual timing calls can only resolve down to one microsecond
-   (us).
+   Note that the accuracy of the histogram entries may be lower than the
+   per loop time.
   
 
  
@@ -125,24 +131,25 @@ Histogram of timing durations:
When the query executor is running a statement using
EXPLAIN ANALYZE, individual operations are time

Re: pg_stat_statements and "IN" conditions

2023-02-14 Thread David Geier

Hi,

On 2/11/23 13:08, Dmitry Dolgov wrote:

On Sat, Feb 11, 2023 at 11:47:07AM +0100, Dmitry Dolgov wrote:

The original version of the patch was doing all of this, i.e. handling
numerics, Param nodes, RTE_VALUES. The commentary about
find_const_walker in tests is referring to a part of that, that was
dealing with evaluation of expression to see if it could be reduced to a
constant.

Unfortunately there was a significant push back from reviewers because
of those features. That's why I've reduced the patch to it's minimally
useful version, having in mind re-implementing them as follow-up patches
in the future. This is the reason as well why I left tests covering all
this missing functionality -- as breadcrumbs to already discovered
cases, important for the future extensions.

I'd like to elaborate on this a bit and remind about the origins of the
patch, as it's lost somewhere in the beginning of the thread. The idea
is not pulled out of thin air, everything is coming from our attempts to
improve one particular monitoring infrastructure in a real commercial
setting. Every covered use case and test in the original proposal was a
result of field trials, when some application-side library or ORM was
responsible for gigabytes of data in pgss, chocking the monitoring agent.


Thanks for the clarifications. I didn't mean to contest the usefulness 
of the patch, and I wasn't aware that you had already jumped through the 
hoops of handling Param nodes, etc. Seems like supporting only constants 
is a good starting point. The only thing that is likely confusing for 
users is that NUMERICs (and potentially constants of other types) are 
unsupported. Wouldn't it be fairly simple to support them via something 
like the following?


    is_const(element) || (is_coercion(element) && is_const(element->child))

--
David Geier
(ServiceNow)





Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-02-14 Thread David Geier

Hi!

On 2/14/23 12:11, David Geier wrote:

Hi,

I think I fixed the compilation errors. It was due to a few variables 
being declared under


#if defined(__x86_64__) && defined(__linux__)

while also being used on non-x86 Linux.

I also removed the code to obtain the TSC frequency under hypervisors 
again, because the TSC is usually emulated and therefore no faster than 
clock_gettime() anyway. So we now simply fall back to clock_gettime() on 
hypervisors when we cannot obtain the frequency via leaf 0x16.


Beyond that I reviewed the first two patches a while ago in [1]. I 
hope we can progress with them to further reduce the size of this 
patch set.


[1] 
https://www.postgresql.org/message-id/3ac157f7-085d-e071-45fc-b87cd306360c%40gmail.com 




It still fails.

I'll get Cirrus-CI working on my own Github fork so I can make sure it 
really compiles on all platforms before I submit a new version.


--
David Geier
(ServiceNow)





Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-02-14 Thread David Geier

Hi,

On 2/7/23 19:12, Andres Freund wrote:

This fails to build on several platforms:

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3751


I think I fixed the compilation errors. It was due to a few variables 
being declared under


#if defined(__x86_64__) && defined(__linux__)

while also being used on non-x86 Linux.

I also removed the code to obtain the TSC frequency under hypervisors 
again, because the TSC is usually emulated and therefore no faster than 
clock_gettime() anyway. So we now simply fall back to clock_gettime() on 
hypervisors when we cannot obtain the frequency via leaf 0x16.


Beyond that I reviewed the first two patches a while ago in [1]. I hope 
we can progress with them to further reduce the size of this patch set.


[1] 
https://www.postgresql.org/message-id/3ac157f7-085d-e071-45fc-b87cd306360c%40gmail.com 



--
David Geier
(ServiceNow)
From 36ff7f7ee14bf42ef0fb775cec428180251c3ff9 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Fri, 20 Jan 2023 15:31:54 -0800
Subject: [PATCH v8 1/3] instr_time: Add INSTR_TIME_SET_SECONDS(),
 INSTR_TIME_IS_LT()

INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop
without having to convert into time units (which is
costly). INSTR_TIME_IS_LT() can be used to check the loop condition.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/portability/instr_time.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index cc85138e21..aab80effb0 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -15,6 +15,8 @@
  *
  * INSTR_TIME_IS_ZERO(t)			is t equal to zero?
  *
+ * INSTR_TIME_IS_LT(x, y)			x < y
+ *
  * INSTR_TIME_SET_ZERO(t)			set t to zero (memset is acceptable too)
  *
  * INSTR_TIME_SET_CURRENT(t)		set t to current time
@@ -22,6 +24,8 @@
  * INSTR_TIME_SET_CURRENT_LAZY(t)	set t to current time if t is zero,
  *	evaluates to whether t changed
  *
+ * INSTR_TIME_SET_SECONDS(t, s)		set t to s seconds
+ *
  * INSTR_TIME_ADD(x, y)x += y
  *
  * INSTR_TIME_SUBTRACT(x, y)		x -= y
@@ -122,6 +126,9 @@ pg_clock_gettime_ns(void)
 #define INSTR_TIME_SET_CURRENT(t) \
 	((t) = pg_clock_gettime_ns())
 
+#define INSTR_TIME_SET_SECONDS(t, s) \
+	((t).ticks = NS_PER_S * (s))
+
 #define INSTR_TIME_GET_NANOSEC(t) \
 	((int64) (t).ticks)
 
@@ -156,6 +163,9 @@ GetTimerFrequency(void)
 #define INSTR_TIME_SET_CURRENT(t) \
 	((t) = pg_query_performance_counter())
 
+#define INSTR_TIME_SET_SECONDS(t, s) \
+	((t).ticks = s * GetTimerFrequency())
+
 #define INSTR_TIME_GET_NANOSEC(t) \
 	((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency(
 
@@ -168,6 +178,8 @@ GetTimerFrequency(void)
 
 #define INSTR_TIME_IS_ZERO(t)	((t).ticks == 0)
 
+#define INSTR_TIME_IS_LT(x, y)	((x).ticks < (y).ticks)
+
 
 #define INSTR_TIME_SET_ZERO(t)	((t).ticks = 0)
 
-- 
2.34.1

From 1200bfee5b8ebf7c68d8cefb3771a4c3523c2cea Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 16 Jan 2023 11:19:11 -0800
Subject: [PATCH v8 2/3] wip: report nanoseconds in pg_test_timing

This commit also updates pg_test_timing's documentation:

- compare EXPLAIN (ANALYZE, TIMING ON/OFF) instead of comparing performance of
  of statement with/without EXPLAIN ANALYZE
- explain the 2x overhead (due to two timestamp acquisitions per row)
- remove old section about old versions of linux - I couldn't update the
  numbers, and it's old enough nobody would care
---
 doc/src/sgml/ref/pgtesttiming.sgml  | 117 ++--
 src/bin/pg_test_timing/pg_test_timing.c |  74 +--
 2 files changed, 95 insertions(+), 96 deletions(-)

diff --git a/doc/src/sgml/ref/pgtesttiming.sgml b/doc/src/sgml/ref/pgtesttiming.sgml
index a5eb3aa25e..7e0266cf58 100644
--- a/doc/src/sgml/ref/pgtesttiming.sgml
+++ b/doc/src/sgml/ref/pgtesttiming.sgml
@@ -93,28 +93,34 @@ PostgreSQL documentation
 
   
Good results will show most (>90%) individual timing calls take less than
-   one microsecond. Average per loop overhead will be even lower, below 100
-   nanoseconds. This example from an Intel i7-860 system using a TSC clock
-   source shows excellent performance:
+   one microsecond (1000 nanoseconds). Average per loop overhead will be even
+   lower, below 100 nanoseconds. This example from an Intel i9-9880H system
+   using a TSC clock source shows excellent performance:
 
 
   
 
   
-   Note that different units are used for the per loop time than the
-   histogram. The loop can have resolution within a few nanoseconds (ns),
-   while the individual timing calls can only resolve down to one microsecond
-   (us).
+   Note that the accuracy of the histogram entries may be lower than the
+   per loop time.
   
 
  
@@ -125,24 +131,25 @@ Histogram of timing durations:
When the query executor is running a statement using
EXPLAIN ANALYZE, indi

Re: pg_stat_statements and "IN" conditions

2023-02-11 Thread David Geier

Hi,

On 2/9/23 16:02, Dmitry Dolgov wrote:

Unfortunately, rebase is needed again due to recent changes in queryjumblefuncs 
( 9ba37b2cb6a174b37fc51d0649ef73e56eae27fc )

I reviewed the last patch applied to some commit from Feb. 4th.

It seems a little strange to me that with const_merge_threshold = 1, such a 
test case gives the same result as with const_merge_threshold = 2

select pg_stat_statements_reset();
set const_merge_threshold to 1;
select * from test where i in (1,2,3);
select * from test where i in (1,2);
select * from test where i in (1);
select query, calls from pg_stat_statements order by query;

                query                 | calls
--------------------------------------+-------
 select * from test where i in (...)  |     2
 select * from test where i in ($1)   |     1

Probably const_merge_threshold = 1 should produce only "i in (...)"?

Well, it's not intentional, probably I need to be more careful with
off-by-one. Although I agree to a certain extent with Peter questioning


Please add tests for all the corner cases. At least for (1) an IN list 
that only contains a single element and (2) const_merge_threshold = 1.


Beyond that:

- There's a comment about find_const_walker(). I cannot find that 
function anywhere. What am I missing?


- What about renaming IsConstList() to IsMergableConstList()?

- Don't you intend to use the NUMERIC data column in SELECT * FROM 
test_merge_numeric WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)? 
Otherwise, the test is identical to previous test cases and you're not 
checking for what happens with NUMERICs which are wrapped in FuncExpr 
because of the implicit coercion.


- Don't we want to extend IsConstList() to allow a list of all 
implicitly coerced constants? It's inconsistent that otherwise e.g. 
NUMERICs don't work.


- Typo in /* The firsts merged constant */ (first not firsts)

- Prepared statements are not supported, as they contain INs with Param 
instead of Const nodes. While less likely to matter, I've seen 
applications that use prepared statements in conjunction with queries 
generated through a UI, which ended up with tons of prepared queries 
with different numbers of elements in the IN clause. Not necessarily 
something that must go into this patch but maybe worth thinking about.


- The setting name const_merge_threshold is not very telling without 
knowing the context. While being a little longer, what about 
jumble_const_merge_threshold or queryid_const_merge_threshold or similar?


- Why do we actually only want to merge constants? Why don't we ignore 
the type of element in the IN and merge whatever is there? Is this 
because the original jumbling logic as of now only has support for 
constants?


- Ideally we would even remove duplicates. That would even improve 
cardinality estimation but I guess we don't want to spend the cycles on 
doing so in the planner?


- Why did you change palloc() to palloc0() for clocations array? The 
length is initialized to 0 and FWICS RecordConstLocation() initializes 
all members. Seems to me like we don't have to spend these cycles.


- Can't the logic at the end of IsConstList() be simplified to:

    foreach(temp, elements)
        if (!IsA(lfirst(temp), Const))
            return false;

    /* All elements are of type Const */
    *firstConst = linitial_node(Const, elements);
    *lastConst = llast_node(Const, elements);
    return true;

--
David Geier
(ServiceNow)





Performance issues with parallelism and LIMIT

2023-02-01 Thread David Geier
e on detached queues):


	/*
	 * If we are not able to send the tuple, we assume the destination has
	 * closed and no more tuples can be sent. If that's the case, end the
	 * loop.
	 */
	if (!dest->receiveSlot(slot, dest))
		break;

Reproduction steps for this problem are below. Here the worker getting 
the first table page will be done right away, but the query takes as 
long as it takes to scan all pages of the entire table.


CREATE TABLE bar (col INT);
INSERT INTO bar SELECT generate_series(1, 500);
SET max_parallel_workers_per_gather = 8;
EXPLAIN ANALYZE VERBOSE SELECT col FROM bar WHERE col = 1 LIMIT 1;

 Limit  (cost=0.00..1.10 rows=1 width=4) (actual time=32.289..196.200 rows=1 loops=1)
   Output: col
   ->  Gather  (cost=0.00..30939.03 rows=28208 width=4) (actual time=32.278..196.176 rows=1 loops=1)
         Output: col
         Workers Planned: 8
         Workers Launched: 7
         ->  Parallel Seq Scan on public.bar  (cost=0.00..30939.03 rows=3526 width=4) (actual time=137.251..137.255 rows=0 loops=7)
               Output: col
               Filter: (bar.col = 1)
               Rows Removed by Filter: 713769
               Worker 0:  actual time=160.177..160.181 rows=0 loops=1
               Worker 1:  actual time=160.111..160.115 rows=0 loops=1
               Worker 2:  actual time=0.043..0.047 rows=1 loops=1
               Worker 3:  actual time=160.040..160.044 rows=0 loops=1
               Worker 4:  actual time=160.167..160.171 rows=0 loops=1
               Worker 5:  actual time=160.018..160.022 rows=0 loops=1
               Worker 6:  actual time=160.201..160.205 rows=0 loops=1
 Planning Time: 0.087 ms
 Execution Time: 196.247 ms

We would need something similar to CHECK_FOR_INTERRUPTS() which returns 
a NULL slot if a parallel worker is supposed to stop execution (we could 
e.g. check if the queue got detached). Or could we amend 
CHECK_FOR_INTERRUPTS() to just stop the worker gracefully if the queue 
got detached?


Jasper Smit, Spiros Agathos and Dimos Stamatakis helped work on this.

[1] 
https://www.postgresql.org/message-id/flat/CAFiTN-tVXqn_OG7tHNeSkBbN%2BiiCZTiQ83uakax43y1sQb2OBA%40mail.gmail.com


--
David Geier
(ServiceNow)




Re: Lazy JIT IR code generation to increase JIT speed with partitions

2023-01-30 Thread David Geier

Hi Dmitry,

On 1/27/23 16:18, Dmitry Dolgov wrote:

As I've noted off-list, there was noticeable difference in the dumped
bitcode, which I haven't noticed since we were talking mostly about
differences between executions of the same query.

Thanks for the clarification and also thanks for helping with this effort.

--
David Geier
(ServiceNow)





Re: Lazy JIT IR code generation to increase JIT speed with partitions

2023-01-27 Thread David Geier

Hi,

Thanks for taking a look!

On 12/1/22 22:12, Dmitry Dolgov wrote:


First to summarize things a bit: from what I understand there are two
suggestions on the table, one is about caching modules when doing
inlining, the second is about actual lazy jitting. Are those to tightly
coupled together, could they be split into two separate patches? It
would make it a bit easier to review and test.

Yes.


I was playing with the caching part of the patch (still have to think
about the lazy jitting), which generally seems to be a good idea.  From
what I see the main concern here is a chance that IRMover will rename
types, degrading performance of the generated code in long run. I have
to admit, I'm not fully sure mentioned LLVM 13 fix [1] completely
eliminates those concerns, somehow its description is formulated in not
very committing way ("don't know if this patch fixes ..., but it does
fix a few soundness issues that have crept in."). But I haven't found
any crashes or false asserts coming from the LLVM side (using v13),
running several rounds of regression tests with forced jitting, so a
point to the fix.

I would be curious to learn how Andres was troubleshooting type renaming
issues? Using LLVM 13 from packages, jitting the same query twice and
dumping the bitcode out showed some difference in types a-la
"%struct.ExprState.142" vs "%struct.ExprState.430" (from what I
understood, the whole thing is an identifier, including the number) with
the patch, but the very same changes are happening on the main branch as
well. Of course, I was inspecting bitcode only for certain queries, it
doesn't exclude that some other examples actually feature type renaming.

In general, it would be interesting to know how to do some sort of "smoke
tests" for the generated code, e.g. in case if LLVM has fixed this
particular issue, but they might reappear in the future?

+1, especially with my findings described below.

I did few performance tests and got numbers similar to posted in the
thread, inlining time being impressively reduced (~10 times) as well as
(suspiciously) optimization time (~1.5 times). The approach was a bit
different though, I've run the sequence of example queries from the
thread using pgbench and checked jit counters from pgss.

Few small notes:

If caching of modules is safe from LLVM >= 13, I guess it should be
wrapped into a corresponding condition on LLVM_VERSION_MAJOR, right?

Why the assert about hasExternalLinkage was removed from the
llvmjit_inline.cpp?

For so much discussion about such a small change there is definitely not
enough commentaries in the code about dangers of cloning modules.

Also, maybe a stupid question, but how big this cache should be? From
what I understand it will be cleared on llvm_shutdown, does it mean it
can grow unbounded if e.g. a single session is firing all the time
different queries at the db?
The cache doesn't contain the modules generated for a query, but the 
bitcode files with the to-be-inlined functions that are stored on disk. 
So the cache would stop growing as soon as a connection process has 
loaded all of them. Nevertheless, if a workload uses many connections 
and truly all, or at least most, of the bitcode files, that could be 
quite some memory.


Beyond that I made an unfortunate discovery: the inlining time goes down 
by so much because no inlining is happening anymore :-/. This is because 
I cloned the module only when passing it to the IRMover, not right after 
taking it from the cache. However, after the module is taken from the 
cache it's still modified. llvm_execute_inline_plan() executes


if (valueToImport->hasExternalLinkage())
{
    valueToImport->setLinkage(LinkageTypes::AvailableExternallyLinkage);
}


But when checking in function_inlinable() whether a function is 
inlinable, we check


if (F.hasAvailableExternallyLinkage())
    return false;


which makes the code bail out for any function that is to be inlined.

It's very curious why we didn't really see that when dumping the 
bitcode. It seems like the bitcode is always different enough not to 
spot it. So +1 on your point about having a smoke test in place to 
verify things still work.


If I change the code to clone the module right after taking it from the 
cache and before making the changes to it, the JIT time remains high and 
seems even higher than when re-reading the file from disk. Profiling 
showed that llvm::CloneModule() is just super slow, especially for big 
bitcode files like numeric.bc. I haven't investigated why that is and if 
we can do something about it. I also don't plan to do so for the time 
being.


For reference, I attached the patch which only contains the caching 
code. It's the variant that clones early.


--
David Geier
(ServiceNow)
From 6a13803de83aaa46d34880e1b7c44bf93eb87f35 Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Mon, 27 Jun 2022 12:28:29 +0200
Subject: [PATCH] Cache modules in JIT inlining

---
 src/backend/jit/llvm/llvm

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-26 Thread David Geier

Hi,

On 1/23/23 21:30, Andres Freund wrote:

That's been the case since my first post in the thread :). Mainly, it seems
easier to detect underflow cases during subtraction that way. And the factor
of 2 in range doesn't change a whole lot.

I just realized it the other day :).

If you have time to look at the pg_test_timing part, it'd be
appreciated. That's a bit larger, and nobody has looked at it yet. So I'm a bit
hesitant to push it.

I haven't yet pushed the pg_test_timing (nor it's small prerequisite)
patch.

I've attached those two patches. Feel free to include them in your series if
you want, then the CF entry (and thus cfbot) makes sense again...

I'll include them in my new patch set and also have a careful look at them.


I reviewed the prerequisite patch which introduces 
INSTR_TIME_SET_SECONDS(), as well as the pg_test_timing patch. Here are 
my comments:


- The prerequisite patch looks good to me.

- By default, the test query in the pg_test_timing doc runs serially. 
What about adding SET max_parallel_workers_per_gather = 0 to make sure 
it really always does (e.g. on a system with different settings for 
parallel_tuple_cost / parallel_setup_cost)? Otherwise, the numbers will 
be much more flaky.


- Why have you added a case distinction for diff == 0? Have you 
encountered this case? If so, how? Maybe add a comment.


- To further reduce overhead we could call INSTR_TIME_SET_CURRENT() 
multiple times. But then again: why do we actually care about the 
per-loop time? Why not instead sum up the diffs and divide by the number 
of iterations, to exclude all the overhead in the first place (see the 
sketch below this list)?


- In the computation of the per-loop time in nanoseconds you can now use 
INSTR_TIME_GET_NANOSEC() instead of INSTR_TIME_GET_DOUBLE() * NS_PER_S.
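
As a standalone illustration of one way to read that suggestion (a 
plain-C sketch using clock_gettime() directly rather than 
pg_test_timing's instr_time machinery; all names are made up for the 
example): measure back-to-back pairs of timing calls, sum the deltas and 
divide by the iteration count, so the loop overhead itself never enters 
the measurement.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

int
main(void)
{
	int64_t		total_ns = 0;
	const int	loops = 1000000;

	for (int i = 0; i < loops; i++)
	{
		struct timespec t1, t2;

		clock_gettime(CLOCK_MONOTONIC, &t1);
		clock_gettime(CLOCK_MONOTONIC, &t2);

		/* delta between two back-to-back calls: excludes loop overhead */
		total_ns += (t2.tv_sec - t1.tv_sec) * 1000000000LL +
					(t2.tv_nsec - t1.tv_nsec);
	}

	printf("avg per-call overhead: %.1f ns\n", (double) total_ns / loops);
	return 0;
}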


The rest looks good to me. The rebased patches are part of the patch set 
I sent out yesterday in reply to another mail in this thread.


--
David Geier
(ServiceNow)





Re: pg_upgrade from PG-14.5 to PG-15.1 failing due to non-existing function

2023-01-26 Thread David Geier

Hi,

On 1/25/23 19:38, Dimos Stamatakis wrote:


Hi hackers,

I attempted to perform an upgrade from PG-14.5 to PG-15.1 with 
pg_upgrade and unfortunately it errors out because of a function that 
does not exist anymore in PG-15.1.


The function is ‘pg_catalog.close_lb’ and it exists in 14.5 but not in 
15.1.


In our scenario we changed the permissions of this function in PG14.5 
(via an automated tool) and then pg_upgrade tries to change the 
permissions in PG15.1 as well.



Here [1] is a very similar issue that was reported in 2019.

The patch didn't make it in, but it also doesn't seem to fix the issue 
reported by Dimos: the patch in [1] seems to be concerned with changed 
function signatures rather than with dropped functions. Maybe [1] could 
be revived and extended to also ignore dropped functions?


[1] 
https://www.postgresql.org/message-id/flat/f85991ad-bbd4-ad57-fde4-e12f0661dbf0%40postgrespro.ru


--
David Geier
(ServiceNow)





Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-24 Thread David Geier

Hi


I think at least some should be converted to just accumulate in an
instr_time...

I think that's for a later patch though?

Yep, at least quite similar.

OK. I coded it up in the latest version of the patch.

Depending on how low we want to keep the error, I don't think we can:

If I set the allowed deviation to 10**-9, we end up requiring a shift by 29
for common ghz ranges. Clearly 33bits isn't an interesting range.

But even if you accept a higher error - we don't have *that* much range
available. Assuming an uint64, the range is ~584 years. If we want 10 years
range, we end up

   math.log(((2**64)-1) / (10 * 365 * 60 * 60 * 24 * 10**9), 2)
   ~= 5.87

So 5 bits available that we could "use" for multiply/shift. For something like
2.5ghz, that'd be ~2% error, clearly not acceptable.  And even just a year of
range, ends up allowing a failure of 30796s = 8min over a year, still too
high.
Thanks for doing the math. Agreed. The latest patch detects overflow and 
correctly handles it.

But I don't think it's really an issue - normally that branch will never be
taken (at least within the memory of the branch predictor), which on modern
CPUs means it'll just be predicted as not taken. So as long as we tell the
compiler what's the likely branch, it should be fine. At least as long as the
branch compares with a hardcoded number.
Yeah. The overflow detection just compares two int64. The "overflow 
threshold" is pre-computed.

- the restriction just to linux, that'll make testing harder for some, and
   ends up encoding too much OS dependency
- I think we need both the barrier and non-barrier variant, otherwise I
   suspect we'll end up with inccuracies we don't want
- needs lots more documentation about why certain cpuid registers are used
- cpu microarch dependencies - isn't there, e.g., the case that the scale on
   nehalem has to be different than on later architectures?
- lack of facility to evaluate how well the different time sources work
Makes sense. I carried that list over to my latest e-mail which also 
includes the patch to have some sort of summary of where we are in a 
single place.


--
David Geier
(ServiceNow)





Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-24 Thread David Geier

Hi,

On 1/23/23 18:41, Andres Freund wrote:

If we add it, it probably shouldn't depend on TIMING, but on
SUMMARY. Regression test queries showing EXPLAIN ANALYZE output all do
something like
   EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)

the SUMMARY OFF gets rid of the "top-level" "Planning Time" and "Execution
Time", whereas the TIMING OFF gets rid of the per-node timing. Those are
separate options because per-node timing is problematic performance-wise
(right now), but whole-query timing rarely is.
Makes sense. I wasn't aware of SUMMARY. Let's keep this option in mind, 
in case we'll revisit exposing the clock source in the future.

Another, independent, thing worth thinking about: I think we might want to
expose both rdtsc and rdtscp. For something like
InstrStartNode()/InstrStopNode(), avoiding the "one-way barrier" of rdtscp is
quite important to avoid changing the query performance. But for measuring
whole-query time, we likely want to measure the actual time.

It probably won't matter hugely for the whole query time - the out of order
window of modern CPUs is large, but not *that* large - but I don't think we
can generally assume that.


That's what I thought as well. I added INSTR_TIME_SET_CURRENT_FAST() and 
for now call that variant from InstrStartNode(), InstrEndNode() and 
pg_test_timing. To do so in InstrEndNode(), I removed 
INSTR_TIME_SET_CURRENT_LAZY(). Otherwise, two variants of that macro 
would be needed. INSTR_TIME_SET_CURRENT_LAZY() was only used in a single 
place and the code is more readable that way. INSTR_TIME_SET_CURRENT() 
is called from a bunch of places. I still have to go through all of them 
and see which should be changed to call the _FAST() variant.
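
For illustration, a rough sketch of how the fast and default variants
could look on x86-64; the bodies are simplified assumptions, and the
actual patch additionally falls back to clock_gettime() when the TSC
cannot be used:

#include <x86intrin.h>		/* __rdtsc(), __rdtscp() */

/*
 * Sketch only: the fast variant uses plain RDTSC, which may be reordered
 * by the CPU but is cheap and fine for per-node instrumentation. The
 * default variant uses RDTSCP, which waits for preceding instructions to
 * finish and is preferable for whole-query timings.
 */
#define INSTR_TIME_SET_CURRENT_FAST(t) \
	((t).ticks = __rdtsc())

#define INSTR_TIME_SET_CURRENT(t) \
	do { \
		unsigned int aux; \
		(t).ticks = __rdtscp(&aux); \
	} while (0)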


Attached is v7 of the patch:

- Rebased on latest master (most importantly on top of the int64
  instr_time commits).
- Includes two commits from Andres which introduce
  INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() and WIP to report
  pg_test_timing output in nanoseconds.
- Converts ticks to nanoseconds only with integer math, while accounting
  for overflow.
- Supports RDTSCP via INSTR_TIME_SET_CURRENT() and introduced
  INSTR_TIME_SET_CURRENT_FAST() which uses RDTSC.


I haven't gotten to the following:

- Looking through all calls to INSTR_TIME_SET_CURRENT() and checking if
  they should be replaced by INSTR_TIME_SET_CURRENT_FAST().
- Reviewing Andres' commits. Potentially improving on pg_test_timing's
  output.
- Looking at enabling RDTSC on more platforms. Is there a minimum set of
  platforms we would like support for? Windows should be easy. That would
  also allow to unify the code a little more.
- Adding more documentation and doing more testing around the calls to
  CPUID.
- Profiling and optimizing the code. A quick test showed about 10%
  improvement over master with TIMING ON vs TIMING OFF, when using the
  test case from Andres' e-mail that started this thread.


I hope I'll find time to work on these points during the next days.

--
David Geier
(ServiceNow)
From 0b5ce706bed13c0c242e6ace809d3c37a8064029 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Fri, 20 Jan 2023 15:31:54 -0800
Subject: [PATCH v7 1/3] instr_time: Add INSTR_TIME_SET_SECONDS(),
 INSTR_TIME_IS_LT()

INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop
without having to convert into time units (which is
costly). INSTR_TIME_IS_LT() can be used to check the loop condition.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/portability/instr_time.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index cc85138e21..aab80effb0 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -15,6 +15,8 @@
  *
  * INSTR_TIME_IS_ZERO(t)			is t equal to zero?
  *
+ * INSTR_TIME_IS_LT(x, y)			x < y
+ *
  * INSTR_TIME_SET_ZERO(t)			set t to zero (memset is acceptable too)
  *
  * INSTR_TIME_SET_CURRENT(t)		set t to current time
@@ -22,6 +24,8 @@
  * INSTR_TIME_SET_CURRENT_LAZY(t)	set t to current time if t is zero,
  *	evaluates to whether t changed
  *
+ * INSTR_TIME_SET_SECONDS(t, s)		set t to s seconds
+ *
  * INSTR_TIME_ADD(x, y)x += y
  *
  * INSTR_TIME_SUBTRACT(x, y)		x -= y
@@ -122,6 +126,9 @@ pg_clock_gettime_ns(void)
 #define INSTR_TIME_SET_CURRENT(t) \
 	((t) = pg_clock_gettime_ns())
 
+#define INSTR_TIME_SET_SECONDS(t, s) \
+	((t).ticks = NS_PER_S * (s))
+
 #define INSTR_TIME_GET_NANOSEC(t) \
 	((int64) (t).ticks)
 
@@ -156,6 +163,9 @@ GetTimerFrequency(void)
 #define INSTR_TIME_SET_CURRENT(t) \
 	((t) = pg_query_performance_counter())
 
+#define INSTR_TIME_SET_SECONDS(t, s) \
+	((t).ticks = s * GetTimerFrequency())
+
 #define INSTR_TIME_GET_NANOSEC(t) \
 	((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency(
 
@@ -168,6 +178,8 @@ GetTimerFrequency(void)
 
 #define INSTR_TIME_IS_ZERO(t)	((t).ticks 

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-23 Thread David Geier

Hi,

On 1/21/23 06:31, Andres Freund wrote:

I pushed the int64-ification commits.


Great. I started rebasing.

One thing I was wondering about: why did you choose to use a signed 
instead of an unsigned 64-bit integer for the ticks?

If you have time to look at the pg_test_timing part, it'd be
appreciated. That's a bit larger, and nobody has looked at it yet. So I'm a bit
hesitant to push it.

I haven't yet pushed the pg_test_timing (nor it's small prerequisite)
patch.

I've attached those two patches. Feel free to include them in your series if
you want, then the CF entry (and thus cfbot) makes sense again...

I'll include them in my new patch set and also have a careful look at them.

--
David Geier
(ServiceNow)





Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-23 Thread David Geier

Hi,

On 1/21/23 05:12, Andres Freund wrote:

We do currently do the conversion quite frequently.  Admittedly I was
partially motivated by trying to get the per-loop overhead in pg_test_timing
down ;)

But I think it's a real issue. Places where we do, but shouldn't, convert:

- ExecReScan() - quite painful, we can end up with a lot of those
- InstrStopNode() - adds a good bit of overhead to simple
InstrStopNode() doesn't convert in the general case but only for the 
first tuple or when async. So it goes somewhat hand in hand with 
ExecReScan().

- PendingWalStats.wal_write_time - this is particularly bad because it happens
   within very contended code
- calls to pgstat_count_buffer_read_time(), pgstat_count_buffer_write_time() -
   they can be very frequent
- pgbench.c, as we already discussed
- pg_stat_statements.c
- ...

These all will get a bit slower when moving to a "variable" frequency.
I wonder if we will be able to measure any of them easily. But given 
that it's many more places than I had realized and given that the 
optimized code is not too involved, let's give it a try.

What was your approach for avoiding the costly operation?  I ended up with a
integer multiplication + shift approximation for the floating point
multiplication (which in turn uses the inverse of the division by the
frequency). To allow for sufficient precision while also avoiding overflows, I
had to make that branch conditional, with a slow path for large numbers of
nanoseconds.


It seems like we ended up with the same. I do:

sec = ticks / frequency_hz
ns  = ticks / frequency_hz * 1,000,000,000
ns  = ticks * (1,000,000,000 / frequency_hz)
ns  = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz

Now, the constant scaling factor in parentheses is typically a
non-integral floating point number. For example, for a frequency of
2.5 GHz it would be 0.4. To work around that we can do something like:


ns  = ticks * (1,000,000 * scaler / frequency_khz) / scaler

Where scaler is a power-of-2, big enough to maintain enough precision 
while allowing for a shift to implement the division.


The additional multiplication with scaler makes the maximum range go
down, because we must ensure we never overflow. I'm wondering if we
cannot pick scaler in such a way that the remaining range of cycles is
large enough for our use case, so that we can live without bothering
about the overflow case. What would be "enough"? 1 year? 10 years? ...


Otherwise, we indeed need code that handles the potential overflow. My
hunch is that it can be done branchless, but that for sure adds dependent
instructions. Maybe in that case a branch that will almost certainly
never be taken is the better option?
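
To make the scheme above concrete, here is a minimal sketch of the
multiply-and-shift conversion with an overflow fallback, using
PostgreSQL's likely() and PG_UINT64_MAX; the shift width, names and
initialization are illustrative assumptions, not the exact code from the
patch:

#define TICKS_SHIFT 29

static uint64 ticks_to_ns_mul;	/* round(ns_per_tick * 2^TICKS_SHIFT) */
static uint64 max_ticks_fast;	/* largest tick count that cannot overflow */

/* Precompute the scaled multiplier once, from the measured TSC frequency. */
static void
initialize_ticks_conversion(double tsc_freq_hz)
{
	double		ns_per_tick = 1e9 / tsc_freq_hz;

	ticks_to_ns_mul = (uint64) (ns_per_tick * ((uint64) 1 << TICKS_SHIFT) + 0.5);
	max_ticks_fast = PG_UINT64_MAX / ticks_to_ns_mul;
}

static inline int64
ticks_to_nanoseconds(uint64 ticks)
{
	/* fast path: a single multiplication plus a shift, no division */
	if (likely(ticks <= max_ticks_fast))
		return (int64) ((ticks * ticks_to_ns_mul) >> TICKS_SHIFT);

	/* rare slow path: go through double to avoid the 64-bit overflow */
	return (int64) ((double) ticks * ticks_to_ns_mul / ((uint64) 1 << TICKS_SHIFT));
}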


I'll include the code in the new patch set, which I'll submit by
tomorrow at the latest.



I think it'd be great - but I'm not sure we're there yet, reliability and
code-complexity wise.
Thanks to your commits, the diff of the new patch set will be already 
much smaller and easier to review. What's your biggest concern in terms 
of reliability?

I think it might be worth makign the rdts aspect somewhat
measurable. E.g. allowing pg_test_timing to use both at the same time, and
have it compare elapsed time with both sources of counters.
I haven't yet looked into pg_test_timing. I'll do that while including 
your patches into the new patch set.


--
David Geier
(ServiceNow)





Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-23 Thread David Geier

Hi,

On 1/21/23 05:14, Andres Freund wrote:

The elapsed time is already inherently unstable, so we shouldn't have any test
output showing the time.

But I doubt showing it in every explain is a good idea - we use instr_time in
plenty of other places. Why show it in explain, but not in all those other
places?


Yeah. I thought it would only be an issue if we showed it 
unconditionally in EXPLAIN ANALYZE. If we only show it with TIMING ON, 
we're likely fine with pretty much all regression tests.


But given the different opinions, I'll leave it out in the new patch set 
for the moment being.


--
David Geier
(ServiceNow)





Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

2023-01-20 Thread David Geier

Hi hackers,

EXPLAIN ANALYZE for parallel Bitmap Heap Scans currently only reports 
the number of heap blocks processed by the leader. It's missing the 
per-worker stats. The attached patch adds that functionality in the 
spirit of e.g. Sort or Memoize. Here is a simple test case and the 
EXPLAIN ANALYZE output with and without the patch:


create table foo(col0 int, col1 int);
insert into foo select generate_series(1, 1000, 0.001), 
generate_series(1000, 2000, 0.001);

create index idx0 on foo(col0);
create index idx1 on foo(col1);
set parallel_tuple_cost = 0;
set parallel_setup_cost = 0;
explain (analyze, costs off, timing off) select * from foo where col0 > 
900 or col1 = 1;


With the patch:

 Gather (actual rows=99501 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Bitmap Heap Scan on foo (actual rows=33167 loops=3)
 Recheck Cond: ((col0 > 900) OR (col1 = 1))
 Heap Blocks: exact=98
 Worker 0:  Heap Blocks: exact=171 lossy=0
 Worker 1:  Heap Blocks: exact=172 lossy=0
 ->  BitmapOr (actual rows=0 loops=1)
   ->  Bitmap Index Scan on idx0 (actual rows=99501 loops=1)
 Index Cond: (col0 > 900)
   ->  Bitmap Index Scan on idx1 (actual rows=0 loops=1)
 Index Cond: (col1 = 1)

Without the patch:

 Gather (actual rows=99501 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Bitmap Heap Scan on foo (actual rows=33167 loops=3)
 Recheck Cond: ((col0 > 900) OR (col1 = 1))
 Heap Blocks: exact=91
 ->  BitmapOr (actual rows=0 loops=1)
   ->  Bitmap Index Scan on idx0 (actual rows=99501 loops=1)
 Index Cond: (col0 > 900)
   ->  Bitmap Index Scan on idx1 (actual rows=0 loops=1)
 Index Cond: (col1 = 1)

So in total the parallel Bitmap Heap Scan actually processed 441 heap 
blocks instead of just 91.


For the shared state, two variable-length arrays (VLAs) would now be
needed, one for the snapshot and one for the stats. As this obviously
doesn't work, I now use a single, big VLA and added functions to retrieve
pointers to the respective fields. I'm using MAXALIGN() to make sure the
latter field is aligned properly. Am I doing this correctly? I'm not
entirely sure about the alignment conventions and requirements of other
platforms.
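
To illustrate the layout in question, here is a rough sketch of a single
shared chunk holding the serialized snapshot followed by the per-worker
stats, with the stats placed at a MAXALIGNed offset. The struct and
function names are made up for illustration and are not the ones from the
attached patch:

typedef struct BitmapWorkerStats
{
	long		exact_pages;
	long		lossy_pages;
} BitmapWorkerStats;

typedef struct SharedBitmapState
{
	int			num_workers;
	Size		snapshot_len;	/* length of the serialized snapshot */
	char		data[FLEXIBLE_ARRAY_MEMBER];	/* snapshot, then stats */
} SharedBitmapState;

/* The serialized snapshot starts right at the flexible array. */
static inline char *
shared_snapshot_ptr(SharedBitmapState *state)
{
	return state->data;
}

/* The per-worker stats start at the first MAXALIGNed offset after it. */
static inline BitmapWorkerStats *
shared_stats_ptr(SharedBitmapState *state)
{
	return (BitmapWorkerStats *)
		(state->data + MAXALIGN(state->snapshot_len));
}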


I couldn't find existing tests that exercise the EXPLAIN ANALYZE output 
of specific nodes. I could only find a few basic smoke tests for EXPLAIN 
ANALYZE with parallel nodes in parallel_select.sql. Do we want tests for 
the changed functionality? If so I could right away also add tests for 
EXPLAIN ANALYZE including other parallel nodes.


Thank you for your feedback.

--
David Geier
(ServiceNow)
From b2c84fb16e9521d6cfadb0c069e27a213e8e8471 Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Tue, 8 Nov 2022 19:40:31 +0100
Subject: [PATCH v1] Parallel Bitmap Heap Scan reports per-worker stats

Similarly to other nodes (e.g. hash join, sort, memoize),
Bitmap Heap Scan now reports per-worker stats in the EXPLAIN
ANALYZE output. Previously only the heap blocks stats for the
leader were reported which was incomplete in parallel scans.
---
 src/backend/commands/explain.c| 46 ++--
 src/backend/executor/execParallel.c   |  3 +
 src/backend/executor/nodeBitmapHeapscan.c | 88 +++
 src/include/executor/nodeBitmapHeapscan.h |  1 +
 src/include/nodes/execnodes.h | 23 +-
 5 files changed, 138 insertions(+), 23 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 5212a64b1e..d532fc4a87 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3396,26 +3396,58 @@ show_hashagg_info(AggState *aggstate, ExplainState *es)
 static void
 show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
 {
+	Assert(es->analyze);
+
 	if (es->format != EXPLAIN_FORMAT_TEXT)
 	{
 		ExplainPropertyInteger("Exact Heap Blocks", NULL,
-			   planstate->exact_pages, es);
+			   planstate->stats.exact_pages, es);
 		ExplainPropertyInteger("Lossy Heap Blocks", NULL,
-			   planstate->lossy_pages, es);
+			   planstate->stats.lossy_pages, es);
 	}
 	else
 	{
-		if (planstate->exact_pages > 0 || planstate->lossy_pages > 0)
+		if (planstate->stats.exact_pages > 0 || planstate->stats.lossy_pages > 0)
 		{
 			ExplainIndentText(es);
 			appendStringInfoString(es->str, "Heap Blocks:");
-			if (planstate->exact_pages > 0)
-appendStringInfo(es->str, " exact=%ld", planstate->exact_pages);
-			if (planstate->lossy_pages > 0)
-appendStringInfo(es->str, " lossy=%ld", planstate->lossy_pages);
+			if (planstate->stats.exact_pages > 0)
+appendStringInfo(es->str, " exact=%ld", planstate

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-19 Thread David Geier

On 1/18/23 13:52, David Geier wrote:

On 1/16/23 21:39, Pavel Stehule wrote:


On Mon, 16 Jan 2023 at 21:34, Tomas Vondra wrote:


    Hi,

    there's minor bitrot in the Mkvcbuild.pm change, making cfbot 
unhappy.


    As for the patch, I don't have much comments. I'm wondering if 
it'd be

    useful to indicate which timing source was actually used for EXPLAIN
    ANALYZE, say something like:

     Planning time: 0.197 ms
     Execution time: 0.225 ms
     Timing source: clock_gettime (or tsc)

+1


I like the idea of exposing the timing source in the EXPLAIN ANALYZE 
output.
It's a good tradeoff between inspectability and effort, given that 
RDTSC should always be better to use.

If there are no objections I go this way.
Thinking about this a little more made me realize that this will cause 
different pg_regress output depending on the platform. So if we go this 
route we would at least need an option for EXPLAIN ANALYZE to disable 
it. Or rather have it disabled by default and allow for enabling it. 
Thoughts?


--
David Geier
(ServiceNow)





Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-19 Thread David Geier

Hi Andres,


I also couldn't help and hacked a bit on the rdtsc pieces. I did figure out
how to do the cycles->nanosecond conversion with integer shift and multiply in
the common case, which does show a noticable speedup. But that's for another
day.
I also have code for that here. I decided against integrating it because 
we don't convert frequently enough to make it matter. Or am I missing 
something?

I fought a bit with myself about whether to send those patches in this thread,
because it'll take over the CF entry. But decided that it's ok, given that
David's patches should be rebased over these anyway?

That's alright.
Though, I would hope we attempt to bring your patch set as well as the 
RDTSC patch set in.


--
David Geier
(ServiceNow)





Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-18 Thread David Geier

Hi,

@Andres: will you take care of these changes and provide me with an 
updated patch set so I can rebase the RDTSC changes?
Otherwise, I can also apply Tom's suggestions to your patch set and send 
out the complete patch set.


--
David Geier
(ServiceNow)





Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-18 Thread David Geier

On 1/16/23 18:37, Andres Freund wrote:

Hi,

On 2023-01-02 14:28:20 +0100, David Geier wrote:

I'm doubtful this is worth the complexity it incurs. By the time we convert
out of the instr_time format, the times shouldn't be small enough that the
accuracy is affected much.
I don't feel strongly about it, and you have a point that we most likely
only convert once we've accumulated a fair amount of cycles.

Looking around, most of the existing uses of INSTR_TIME_GET_MICROSEC()
actually accumulate themselves, and should instead keep things in the
instr_time format and convert later. We'd win more accuracy / speed that way.

I don't think the introduction of pg_time_usec_t was a great idea, but oh
well.
Fully agreed. Why not replace pg_time_usec_t with instr_time in a
separate patch? I haven't looked carefully enough at whether all
occurrences could sanely be replaced, but at least the ones that
accumulate time seem like good starting points.

Additionally, I initialized a few variables of type instr_time which
otherwise resulted in warnings due to use of potentially uninitialized
variables.

Unless we decide, as I suggested downthread, that we deprecate
INSTR_TIME_SET_ZERO(), that's unfortunately not the right fix. I've a similar
patch that adds all the necesarry INSTR_TIME_SET_ZERO() calls.
I don't feel strongly about it, but like Tom I tend towards keeping the
initialization macro.
Thanks that you have improved on the first patch and fixed these issues 
in a better way.

What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so that
it's consistent with the _MILLISEC() and _MICROSEC() variants?
The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants
return double. This seems error prone. What about renaming the function or
also have the function return a double and cast where necessary at the call
site?

I think those should be a separate discussion / patch.


OK. I'll propose follow-on patches once we're done with the ones at hand.

I'll then rebase the RDTSC patches on your patch set.

--
David Geier
(ServiceNow)





Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-18 Thread David Geier

On 1/16/23 21:39, Pavel Stehule wrote:


On Mon, 16 Jan 2023 at 21:34, Tomas Vondra wrote:


Hi,

there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy.

As for the patch, I don't have much comments. I'm wondering if it'd be
useful to indicate which timing source was actually used for EXPLAIN
ANALYZE, say something like:

 Planning time: 0.197 ms
 Execution time: 0.225 ms
 Timing source: clock_gettime (or tsc)

There has been a proposal to expose this as a GUC (or perhaps as
explain
option), to allow users to pick what timing source to use. I
wouldn't go
that far - AFAICS is this is meant to be universally better when
available. But knowing which source was used seems useful.


+1


Thanks for looking at the patch.

I'll fix the merge conflict.

I like the idea of exposing the timing source in the EXPLAIN ANALYZE output.
It's a good tradeoff between inspectability and effort, given that RDTSC 
should always be better to use.

If there are no objections I go this way.

--
David Geier
(ServiceNow)





Re: Sampling-based timing for EXPLAIN ANALYZE

2023-01-13 Thread David Geier

Nice idea.

On 1/6/23 10:19, Jelte Fennema wrote:
Do you have some performance comparison between TIMING ON and TIMING 
SAMPLING?


+1 to see some numbers compared to TIMING ON.

Mostly I'm wondering if the sampling-based approach gains us enough to
be worth it once the patch to use RDTSC has hopefully landed (see [1]).
I believe that with the RDTSC patch the overhead of TIMING ON is lower
than the overhead of using ANALYZE with TIMING OFF in the first place.
Hence, to be really useful, it would be great if, on top of TIMING
SAMPLING, we could also lower the overhead of ANALYZE itself further
(e.g. by using a fast path for the default EXPLAIN (ANALYZE, TIMING ON /
SAMPLING)). Currently, InstrStartNode() and InstrStopNode() have a ton
of branches; without all the typically deactivated code the
implementation would be very small and could be placed in an inlinable
function.
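
As a rough illustration of that last point, the default path could
collapse into inline helpers along the following lines. This is a sketch
which assumes that the rarely used features (buffer usage, WAL counters,
async handling) are handled in a separate out-of-line slow path; it
reuses the INSTR_TIME_SET_CURRENT_FAST() macro from the RDTSC patch:

/* Sketch only: slim fast path for the default EXPLAIN (ANALYZE) case. */
static inline void
InstrStartNodeFast(Instrumentation *instr)
{
	INSTR_TIME_SET_CURRENT_FAST(instr->starttime);
}

static inline void
InstrStopNodeFast(Instrumentation *instr, double nTuples)
{
	instr_time	endtime;

	INSTR_TIME_SET_CURRENT_FAST(endtime);
	INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
	instr->tuplecount += nTuples;
}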


[1] 
https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de


--
David Geier
(ServiceNow)





Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-04 Thread David Geier

Hi,


CFBot shows some compilation errors as in [1], please post an updated
version for the same:
09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning:
relocation against `cycles_to_sec' in read-only section `.text'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `pg_clock_gettime_ref_cycles':
[09:08:12.525] 
/tmp/cirrus-ci-build/build/../src/include/portability/instr_time.h:119:
undefined reference to `use_rdtsc'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `test_timing':
[09:08:12.525] 
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:135:
undefined reference to `pg_clock_gettime_initialize_rdtsc'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:137:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:146:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:169:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:176:
undefined reference to `cycles_to_sec'
[09:08:12.525] /usr/bin/ld: warning: creating DT_TEXTREL in a PIE
[09:08:12.525] collect2: error: ld returned 1 exit status

[1] - https://cirrus-ci.com/task/5375312565895168

Regards,
Vignesh


I fixed the compilation error on CFBot.
I missed adding instr_time.c to the Meson makefile.
New patch set attached.

--
David Geier
(ServiceNow)
From be18633d4735f680c7910fcb4e8ac90c4eada131 Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Thu, 17 Nov 2022 10:22:01 +0100
Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
 cheaper.

---
 src/include/portability/instr_time.h | 62 
 1 file changed, 26 insertions(+), 36 deletions(-)

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 9ea1a68bd9..c64f21eb53 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -80,63 +80,53 @@
 #define PG_INSTR_CLOCK	CLOCK_REALTIME
 #endif
 
-typedef struct timespec instr_time;
+typedef int64 instr_time;
+#define NS_PER_S INT64CONST(10)
+#define US_PER_S INT64CONST(100)
+#define MS_PER_S INT64CONST(1000)
 
-#define INSTR_TIME_IS_ZERO(t)	((t).tv_nsec == 0 && (t).tv_sec == 0)
+#define NS_PER_MS INT64CONST(100)
+#define NS_PER_US INT64CONST(1000)
 
-#define INSTR_TIME_SET_ZERO(t)	((t).tv_sec = 0, (t).tv_nsec = 0)
+#define INSTR_TIME_IS_ZERO(t)	((t) == 0)
 
-#define INSTR_TIME_SET_CURRENT(t)	((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_ZERO(t)	((t) = 0)
+
+static inline instr_time pg_clock_gettime_ns(void)
+{
+	struct timespec tmp;
+
+	clock_gettime(PG_INSTR_CLOCK, );
+
+	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+}
+
+#define INSTR_TIME_SET_CURRENT(t) \
+	(t) = pg_clock_gettime_ns()
 
 #define INSTR_TIME_ADD(x,y) \
 	do { \
-		(x).tv_sec += (y).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec >= 10) \
-		{ \
-			(x).tv_nsec -= 10; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y); \
 	} while (0)
 
 #define INSTR_TIME_SUBTRACT(x,y) \
 	do { \
-		(x).tv_sec -= (y).tv_sec; \
-		(x).tv_nsec -= (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 10; \
-			(x).tv_sec--; \
-		} \
+		(x) -= (y); \
 	} while (0)
 
 #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
 	do { \
-		(x).tv_sec += (y).tv_sec - (z).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
-		/* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 10; \
-			(x).tv_sec--; \
-		} \
-		while ((x).tv_nsec >= 10) \
-		{ \
-			(x).tv_nsec -= 10; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y) - (z); \
 	} while (0)
 
 #define INSTR_TIME_GET_DOUBLE(t) \
-	(((double) (t).tv_sec) + ((double) (t).tv_nsec) / 10.0)
+	((double) (t) / NS_PER_S)
 
 #define INSTR_TIME_GET_MILLISEC(t) \
-	(((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 100.0)
+	((double) (t) / NS_PER_MS)
 
 #define INSTR_TIME_GET_MICROSEC(t) \
-	(((uint64) (t).tv_sec * (uint64) 100) + (uint64) ((t).tv_nsec / 1000))
+	((double) (t) / NS_PER_US)
 
 #else			/* WIN32 */
 
-- 
2.34.1

From 190ca09566eabb017ed25b1512225173ca47fb88 Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Thu, 17 Nov 2022 13:03:59 +0100
Subject: [PATCH 2/3] Use CPU reference cycles, via RDTSC, to measure time for
 instrumentation.

For now this is only enabled on Linux/x86 when the system clocksource is
marked tsc as well, as determined at runtime. This way we can rely on the
Linux kernel to make a decision whether tsc is invariant and u

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-03 Thread David Geier

Hi Lukas,

On 1/2/23 20:50, Lukas Fittl wrote:
Thanks for continuing to work on this patch, and my apologies for 
silence on the patch.


It would be great if you could review it.
Please also share your thoughts on exposing the used clock source as a
GUC and on renaming INSTR_TIME_GET_DOUBLE() to _SECS().


I rebased again on master because of [1]. Patches attached.



It's been hard to make time, and especially so because I typically 
develop on an ARM-based macOS system where I can't test this directly 
- hence my tests with virtualized EC2 instances, where I ran into the 
timing oddities.
That's good and bad: bad for doing the development, but good for testing
the implementation on more virtualized setups, given that I also
encountered "interesting" behavior on VMWare (see my previous mails).


On Mon, Jan 2, 2023 at 5:28 AM David Geier  wrote:

The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other
variants
return double. This seems error prone. What about renaming the
function
or also have the function return a double and cast where necessary at
the call site?


Minor note, but in my understanding using a uint64 (where we can) is 
faster for any simple arithmetic we do with the values.


That's true. So the argument could be that for seconds and milliseconds
we want the extra precision, while microseconds are precise enough.
Still, we could also make the seconds and milliseconds conversion code
integer-only and e.g. return two integers with the values before and
after the decimal point. FWICS, the functions are nowhere used in
performance-critical code, so it doesn't really make a difference
performance-wise.
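
A tiny sketch of what such an integer-only variant could look like with
the nanosecond-based instr_time from the attached patch (a hypothetical
helper, not part of the patch set):

/* Split a nanosecond-based instr_time into whole seconds and remainder. */
static inline void
instr_time_get_secs_parts(instr_time t, int64 *secs, int64 *frac_ns)
{
	*secs = t / NS_PER_S;		/* digits before the decimal point */
	*frac_ns = t % NS_PER_S;	/* remaining nanoseconds */
}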




+1, and feel free to carry this patch forward - I'll try to make an 
effort to review my earlier testing issues again, as well as your 
later improvements to the patch.

Moved to the current commit fest. Will you become a reviewer?


Also, FYI, I just posted an alternate idea for speeding up EXPLAIN 
ANALYZE with timing over in [0], using a sampling-based approach to 
reduce the timing overhead.


Interesting idea. I'll reply with some thoughts on the corresponding thread.

[1] 
https://www.postgresql.org/message-id/flat/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw%40mail.gmail.com


--
David Geier
(ServiceNow)
From f63527c8e4b3b0b71ffacaa111dd93325d973432 Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Thu, 17 Nov 2022 10:22:01 +0100
Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
 cheaper.

---
 src/include/portability/instr_time.h | 62 
 1 file changed, 26 insertions(+), 36 deletions(-)

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 9ea1a68bd9..c64f21eb53 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -80,63 +80,53 @@
 #define PG_INSTR_CLOCK	CLOCK_REALTIME
 #endif
 
-typedef struct timespec instr_time;
+typedef int64 instr_time;
+#define NS_PER_S INT64CONST(10)
+#define US_PER_S INT64CONST(100)
+#define MS_PER_S INT64CONST(1000)
 
-#define INSTR_TIME_IS_ZERO(t)	((t).tv_nsec == 0 && (t).tv_sec == 0)
+#define NS_PER_MS INT64CONST(100)
+#define NS_PER_US INT64CONST(1000)
 
-#define INSTR_TIME_SET_ZERO(t)	((t).tv_sec = 0, (t).tv_nsec = 0)
+#define INSTR_TIME_IS_ZERO(t)	((t) == 0)
 
-#define INSTR_TIME_SET_CURRENT(t)	((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_ZERO(t)	((t) = 0)
+
+static inline instr_time pg_clock_gettime_ns(void)
+{
+	struct timespec tmp;
+
+	clock_gettime(PG_INSTR_CLOCK, );
+
+	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+}
+
+#define INSTR_TIME_SET_CURRENT(t) \
+	(t) = pg_clock_gettime_ns()
 
 #define INSTR_TIME_ADD(x,y) \
 	do { \
-		(x).tv_sec += (y).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec >= 10) \
-		{ \
-			(x).tv_nsec -= 10; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y); \
 	} while (0)
 
 #define INSTR_TIME_SUBTRACT(x,y) \
 	do { \
-		(x).tv_sec -= (y).tv_sec; \
-		(x).tv_nsec -= (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 10; \
-			(x).tv_sec--; \
-		} \
+		(x) -= (y); \
 	} while (0)
 
 #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
 	do { \
-		(x).tv_sec += (y).tv_sec - (z).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
-		/* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 10; \
-			(x).tv_sec--; \
-		} \
-		while ((x).tv_nsec >= 10) \
-		{ \
-			(x).tv_nsec -= 10; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y) - (z); \
 	} while (0)
 
 #define INSTR_TIME_GET_DOUBLE(t) \
-	(((double) (t).tv_sec) + ((double) (t).tv_nsec) / 10.0)
+	((double) (t) / NS_PER_S)
 
 #define INSTR_TIME_GET_MILLISEC(t) \
-	(((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 100.0)
+	((double) (t) / NS_PER_MS)
 

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-02 Thread David Geier

Hi,

I re-based again on master and applied the following changes:

I removed the fallback for obtaining the TSC frequency from /proc/cpu as 
suggested by Andres. Worst-case we fall back to clock_gettime().


I added code to obtain the TSC frequency via CPUID when running under a
hypervisor. I had to use __cpuid() directly instead of __get_cpuid(),
because __get_cpuid() returns an error if the leaf is > 0x8000
(probably the implementation pre-dates the hypervisor timing leaves).
Unfortunately, while testing my implementation under VMWare, I found
that RDTSC runs awfully slow there (like 30x slower). [1] indicates that
we cannot generally rely on RDTSC being actually fast on VMs. However,
the same applies to clock_gettime(): it runs as slow as RDTSC on my
VMWare setup. Hence, using RDTSC is not at a disadvantage. I'm not
entirely sure whether there aren't cases where e.g. clock_gettime() is
actually faster than RDTSC and it would be advantageous to use
clock_gettime(). We could add a GUC so that the user can decide which
clock source to use. Any thoughts?
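
For context, a condensed sketch of the difference: __get_cpuid()
validates the requested leaf against the maximum supported leaf and can
refuse the hypervisor leaves, whereas __cpuid() issues the instruction
unconditionally. The leaf number and kHz unit below are assumptions
about the paravirtualized timing interface, so treat the sketch as
illustrative only:

#include <cpuid.h>

/*
 * Sketch only: query the hypervisor timing leaf directly with __cpuid();
 * __get_cpuid() may refuse leaves above the reported maximum. Returns
 * the TSC frequency in Hz, or 0 if it could not be determined.
 */
static double
get_tsc_freq_via_hypervisor(void)
{
	unsigned int eax = 0,
				ebx = 0,
				ecx = 0,
				edx = 0;

	__cpuid(0x40000010, eax, ebx, ecx, edx);	/* assumed timing leaf */

	if (eax == 0)
		return 0;

	return (double) eax * 1000.0;	/* EAX assumed to report kHz */
}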


I also somewhat improved the accuracy of the cycles to milli- and 
microseconds conversion functions by having two more multipliers with 
higher precision. For microseconds we could also keep the computation 
integer-only. I'm wondering what to best do for seconds and 
milliseconds. I'm currently leaning towards just keeping it as is, 
because the durations measured and converted are usually long enough 
that precision shouldn't be a problem.


In vacuumlazy.c we do if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000). I
changed that to use INSTR_TIME_GET_MILLISEC() instead. Additionally, I
initialized a few variables of type instr_time which otherwise resulted
in warnings due to the use of potentially uninitialized variables.


I also couldn't reproduce the reported timing discrepancy. For me the 
runtime reported by \timing is just slightly higher than the time 
reported by EXPLAIN ANALYZE, which is expected.


Beyond that:

What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so 
that it's consistent with the _MILLISEC() and _MICROSEC() variants?


The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants 
return double. This seems error prone. What about renaming the function 
or also have the function return a double and cast where necessary at 
the call site?


If no one objects I would also re-register this patch in the commit fest.

[1] 
https://vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf 
(page 11 "Virtual TSC")


--
David Geier
(ServiceNow)
From 321d00ae5dd1bcffc8fbdb39879b7f5c78e3930f Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Thu, 17 Nov 2022 10:22:01 +0100
Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
 cheaper.

---
 src/include/portability/instr_time.h | 62 
 1 file changed, 26 insertions(+), 36 deletions(-)

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 22bcf3d288..4bd555113b 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -80,63 +80,53 @@
 #define PG_INSTR_CLOCK	CLOCK_REALTIME
 #endif
 
-typedef struct timespec instr_time;
+typedef int64 instr_time;
+#define NS_PER_S INT64CONST(10)
+#define US_PER_S INT64CONST(100)
+#define MS_PER_S INT64CONST(1000)
 
-#define INSTR_TIME_IS_ZERO(t)	((t).tv_nsec == 0 && (t).tv_sec == 0)
+#define NS_PER_MS INT64CONST(100)
+#define NS_PER_US INT64CONST(1000)
 
-#define INSTR_TIME_SET_ZERO(t)	((t).tv_sec = 0, (t).tv_nsec = 0)
+#define INSTR_TIME_IS_ZERO(t)	((t) == 0)
 
-#define INSTR_TIME_SET_CURRENT(t)	((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_ZERO(t)	((t) = 0)
+
+static inline instr_time pg_clock_gettime_ns(void)
+{
+	struct timespec tmp;
+
+	clock_gettime(PG_INSTR_CLOCK, );
+
+	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+}
+
+#define INSTR_TIME_SET_CURRENT(t) \
+	(t) = pg_clock_gettime_ns()
 
 #define INSTR_TIME_ADD(x,y) \
 	do { \
-		(x).tv_sec += (y).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec >= 10) \
-		{ \
-			(x).tv_nsec -= 10; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y); \
 	} while (0)
 
 #define INSTR_TIME_SUBTRACT(x,y) \
 	do { \
-		(x).tv_sec -= (y).tv_sec; \
-		(x).tv_nsec -= (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 10; \
-			(x).tv_sec--; \
-		} \
+		(x) -= (y); \
 	} while (0)
 
 #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
 	do { \
-		(x).tv_sec += (y).tv_sec - (z).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
-		/* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 10; \
-			(x).tv_sec--; \
-		} \
-		while ((x).tv_nsec >= 10) \
-		{ \
-			(x).tv_nsec -= 10; \
-	

Re: Aggregate node doesn't include cost for sorting

2022-12-08 Thread David Geier

Hi David,

Thanks for the explanations and awesome that this was improved on already.
I didn't have this change on my radar.

On 12/8/22 11:40, David Rowley wrote:


This output surely must be from a version of PostgreSQL prior to
1349d279?  I can't quite figure out why you've added a "SET
enable_seqscan = FALSE;". That makes it look like you've used the same
version of PostgreSQL to produce both of these plans.  The 2nd plan
you've shown must be post 1349d279.



Both plans were captured on 14.5, which is indeed prior to 1349d279.

I disabled sequential scan to show that there's an alternative plan 
which is superior to the chosen plan: Index Only Scan is more expensive 
and takes longer than the Seq Scan, but the subsequent Aggregate runs 
much faster as it doesn't have to sort, making the plan overall superior.




So, with the assumption that you've used 2 different versions to show
this output, for post 1349d279, there does not seem to be much choice
on how the aggregate is executed. What's your concern about the
costings having to be accurate given there's no other plan choice?


There's another plan choice which is using the index to get pre-sorted 
input rows, see previous comment.




The new pre-sorted aggregate code will always request the sort order
that suits the largest number of ORDER BY / DISTINCT aggregates.  When
there are multiple ORDER BY / DISTINCT aggregates and they have
different sort requirements then there certainly are competing ways
that the aggregation portion of the plan can be executed.  I opted to
make the choice just based on the number of aggregates that could
become presorted.  nodeAgg.c is currently not very smart about sharing
sorts between multiple aggregates with the same sort requirements.  If
that was made better, there might be more motivation to have better
costing code in make_pathkeys_for_groupagg().  However, the reasons
for the reversion of db0d67db seemed to be mainly around the lack of
ability to accurately cost multiple competing sort orders. We'd need
to come up with some better way to do that if we were to want to give
make_pathkeys_for_groupagg() similar abilities.


Also, why does the Aggregate node sort itself? Why don't we instead emit
an explicit Sort node in the plan so that it's visible to the user what
is happening? As soon as there's also a GROUP BY in the query, a Sort
node occurs in the plan. This seems inconsistent.

Post 1349d279 we do that, but it can only do it for 1 sort order.
There can be any number of aggregate functions which require a sort
and they don't all have to have the same sort order requirements.  We
can't do the same as WindowAgg does by chaining nodes together either
because the aggregate node aggregates the results and we'd need all
the same input rows to be available at each step.

The only other way would be to have it so an Aggregate node could be
fed by multiple different input nodes and then it would only work on
the aggregates that suit the given input before reading the next input
and doing the other aggregates.  Making it work like that would cause
quite a bit of additional effort during planning (not to mention the
executor). We'd have to run the join search once per required order,
which is one of the slowest parts of planning.  Right now, you could
probably make that work by just writing the SQL to have a subquery per
sort requirement.


Thanks for the explanation!

--
David Geier
(ServiceNow)





Aggregate node doesn't include cost for sorting

2022-12-08 Thread David Geier

Hi hackers,

The cost of an Aggregate node seems to be the same regardless of whether
the input is pre-sorted or not. If, however, the input is not sorted, the
Aggregate node must additionally perform a sort, which can impact runtime
significantly. Here is an example:


CREATE TABLE foo(col0 INT, col1 TEXT);
INSERT INTO foo SELECT generate_series(1, 10), 
'aa' || md5(random()::TEXT);

CREATE INDEX foo_idx ON foo(col1);
SET max_parallel_workers_per_gather = 0;
SET enable_bitmapscan = FALSE;

-- Unsorted input. Aggregate node must additionally sort all rows.
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(col1)) FROM foo;

    QUERY PLAN
---
 Aggregate  (cost=2584.00..2584.01 rows=1 width=8) (actual 
time=1684.705..1684.809 rows=1 loops=1)
   ->  Seq Scan on foo  (cost=0.00..2334.00 rows=10 width=71) 
(actual time=0.018..343.280 rows=10 loops=1)

 Planning Time: 0.317 ms
 Execution Time: 1685.543 ms


-- Pre-sorted input. Aggregate node doesn't have to sort all rows.
SET enable_seqscan = FALSE;
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(col1)) FROM foo;

QUERY PLAN
---
 Aggregate  (cost=6756.42..6756.43 rows=1 width=8) (actual 
time=819.028..819.041 rows=1 loops=1)
   ->  Index Only Scan using foo_idx on foo (cost=6506.42..6506.42 
rows=10 width=71) (actual time=0.046..404.260 rows=10 loops=1)

 Heap Fetches: 10
 Heap Prefetches: 1334
 Planning Time: 0.438 ms
 Execution Time: 819.515 ms

The incremental cost of the Aggregate node is the same in both cases
(250.0), while its runtime is 1.3s in the unsorted case and 0.4s in the
pre-sorted case.


Also, why does the Aggregate node sort itself? Why don't we instead emit 
an explicit Sort node in the plan so that it's visible to the user what 
is happening? As soon as there's also a GROUP BY in the query, a Sort 
node occurs in the plan. This seems inconsistent.


--
David Geier
(ServiceNow)





Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2022-11-19 Thread David Geier

I missed attaching the patches.

--
David Geier
(ServiceNow)
From f4e962729ca605498d0c8bfc97d0f42d68a0df06 Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Thu, 17 Nov 2022 10:22:01 +0100
Subject: [PATCH 1/2] WIP: Change instr_time to just store nanoseconds, that's
 cheaper.

---
 src/include/portability/instr_time.h | 62 
 1 file changed, 26 insertions(+), 36 deletions(-)

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 22bcf3d288..4bd555113b 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -80,63 +80,53 @@
 #define PG_INSTR_CLOCK	CLOCK_REALTIME
 #endif
 
-typedef struct timespec instr_time;
+typedef int64 instr_time;
+#define NS_PER_S INT64CONST(10)
+#define US_PER_S INT64CONST(100)
+#define MS_PER_S INT64CONST(1000)
 
-#define INSTR_TIME_IS_ZERO(t)	((t).tv_nsec == 0 && (t).tv_sec == 0)
+#define NS_PER_MS INT64CONST(100)
+#define NS_PER_US INT64CONST(1000)
 
-#define INSTR_TIME_SET_ZERO(t)	((t).tv_sec = 0, (t).tv_nsec = 0)
+#define INSTR_TIME_IS_ZERO(t)	((t) == 0)
 
-#define INSTR_TIME_SET_CURRENT(t)	((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_ZERO(t)	((t) = 0)
+
+static inline instr_time pg_clock_gettime_ns(void)
+{
+	struct timespec tmp;
+
+	clock_gettime(PG_INSTR_CLOCK, );
+
+	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+}
+
+#define INSTR_TIME_SET_CURRENT(t) \
+	(t) = pg_clock_gettime_ns()
 
 #define INSTR_TIME_ADD(x,y) \
 	do { \
-		(x).tv_sec += (y).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec >= 10) \
-		{ \
-			(x).tv_nsec -= 10; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y); \
 	} while (0)
 
 #define INSTR_TIME_SUBTRACT(x,y) \
 	do { \
-		(x).tv_sec -= (y).tv_sec; \
-		(x).tv_nsec -= (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 10; \
-			(x).tv_sec--; \
-		} \
+		(x) -= (y); \
 	} while (0)
 
 #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
 	do { \
-		(x).tv_sec += (y).tv_sec - (z).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
-		/* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 10; \
-			(x).tv_sec--; \
-		} \
-		while ((x).tv_nsec >= 10) \
-		{ \
-			(x).tv_nsec -= 10; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y) - (z); \
 	} while (0)
 
 #define INSTR_TIME_GET_DOUBLE(t) \
-	(((double) (t).tv_sec) + ((double) (t).tv_nsec) / 10.0)
+	((double) (t) / NS_PER_S)
 
 #define INSTR_TIME_GET_MILLISEC(t) \
-	(((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 100.0)
+	((double) (t) / NS_PER_MS)
 
 #define INSTR_TIME_GET_MICROSEC(t) \
-	(((uint64) (t).tv_sec * (uint64) 100) + (uint64) ((t).tv_nsec / 1000))
+	((double) (t) / NS_PER_US)
 
 #else			/* WIN32 */
 
-- 
2.34.1

From 7a6317fdf5b1d82f120a4fd5535cfe40c8165153 Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Thu, 17 Nov 2022 13:03:59 +0100
Subject: [PATCH 2/2] WIP: Use cpu reference cycles, via rdtsc, to measure time
 for instrumentation.

For now this is only enabled on Linux/x86 when the system clocksource is
marked tsc as well, as determined at runtime. This way we can rely on the
Linux kernel to make a decision whether tsc is invariant and usable on the
current CPU architecture. In all other cases we continue to use the
clock_gettime() implementation like before.

Note that this intentionally uses rdtsc, not rdtscp, as rdtscp waits for
currently running CPU instructions to have finished, and that adds up to
noticable latency for little benefit in the typical InstrStartNode() /
InstrStopNode() use case.
---
 src/backend/utils/init/postinit.c   |   3 +
 src/bin/pg_test_timing/pg_test_timing.c |   1 +
 src/bin/pgbench/pgbench.c   |   3 +
 src/bin/psql/startup.c  |   4 +
 src/common/Makefile |   1 +
 src/common/instr_time.c | 103 
 src/include/portability/instr_time.h|  50 +---
 src/tools/msvc/Mkvcbuild.pm |   2 +-
 8 files changed, 157 insertions(+), 10 deletions(-)
 create mode 100644 src/common/instr_time.c

diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a990c833c5..c684725af3 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -804,6 +804,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
 	/* Initialize portal manager */
 	EnablePortalManager();
 
+	/* initialize high-precision interval timing */
+	INSTR_TIME_INITIALIZE();
+
 	/* Initialize status reporting */
 	pgstat_beinit();
 
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index c29d6f8762..0d667ff5a7 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -132,6 +132,7 @@ test_timi

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2022-11-19 Thread David Geier
I think it would be great to get this patch committed. Beyond the
reasons already mentioned, the significant overhead also tends to skew
the reported runtimes in ways that make it difficult to compare them.
For example, if two nodes are executed equally often but one needs twice
the time to process its rows, EXPLAIN ANALYZE should report timings that
are 2x apart. However, currently, the high overhead of clock_gettime()
tends to skew the relative runtimes.


On 10/12/22 10:33, Michael Paquier wrote:

No rebased version has been sent since this update, so this patch has
been marked as RwF.


I've rebased the patch set on latest master and fixed a few compiler 
warnings. Beyond that some findings and thoughts:


You're only using RDTSC if the clock source is 'tsc'. Great idea to not 
bother caring about a lot of hairy TSC details. Looking at the kernel 
code this seems to imply that the TSC is frequency invariant. I don't 
think though that this implies that Linux is not running under a 
hypervisor; which is good because I assume PostgreSQL is used a lot in 
VMs. However, when running under a hypervisor (at least with VMWare) 
CPUID leaf 0x16 is not available. In my tests __get_cpuid() indicated 
success but the returned values were garbage. Instead of using leaf 
0x16, we should then use the hypervisor interface to obtain the TSC 
frequency. Checking if a hypervisor is active can be done via:


bool IsHypervisorActive()
{
    uint32 cpuinfo[4] = {0};
    int res = __get_cpuid(0x1, &cpuinfo[0], &cpuinfo[1], &cpuinfo[2],
                          &cpuinfo[3]);

    return res > 0 && (cpuinfo[2] & (1 << 30));
}

Obtaining the TSC frequency via the hypervisor interface can be done 
with the following code. See https://lwn.net/Articles/301888/ for more 
details.


// Under hypervisors (tested with VMWare) leaf 0x16 is not available,
// even though __get_cpuid() succeeds.
// Hence, if running under a hypervisor, use the hypervisor interface to
// obtain TSC frequency.

uint32 cpuinfo[4] = {0};
if (IsHypervisorActive() &&
    __get_cpuid(0x4001, &cpuinfo[0], &cpuinfo[1], &cpuinfo[2], &cpuinfo[3]) > 0)
    cycles_to_sec = 1.0 / ((double)cpuinfo[0] * 1000 * 1000);

Given that we anyway switch between RDTSC and clock_gettime() with a
global variable, what about exposing the clock source as a GUC? That way
the user can switch back to a working clock source in case we miss a
detail around activating or reading the TSC.
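
If we go that route, a minimal sketch of what such a setting could look
like as a guc.c enum entry follows; the GUC name, enum values and
grouping are made up for illustration:

typedef enum
{
	INSTR_CLOCK_SOURCE_CLOCK_GETTIME,
	INSTR_CLOCK_SOURCE_TSC
} InstrClockSource;

static const struct config_enum_entry instr_clock_source_options[] = {
	{"clock_gettime", INSTR_CLOCK_SOURCE_CLOCK_GETTIME, false},
	{"tsc", INSTR_CLOCK_SOURCE_TSC, false},
	{NULL, 0, false}
};

int			instr_clock_source = INSTR_CLOCK_SOURCE_CLOCK_GETTIME;

/* Hypothetical entry for ConfigureNamesEnum[] in guc.c: */
	{
		{"instr_time_clock_source", PGC_SUSET, STATS_MONITORING,
			gettext_noop("Selects the clock source used for instrumentation timings."),
			NULL
		},
		&instr_clock_source,
		INSTR_CLOCK_SOURCE_CLOCK_GETTIME,
		instr_clock_source_options,
		NULL, NULL, NULL
	},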


I'm happy to update the patches accordingly.

--
David Geier
(ServiceNow)





Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-18 Thread David Geier

On 11/18/22 14:00, Tomas Vondra wrote:

Seems fine. I wonder if we could/should introduce a new constant for 0,
similar to ATTSTATSSLOT_NUMBERS/ATTSTATSSLOT_VALUES, instead of using a
magic constant. Say, ATTSTATSSLOT_NONE or ATTSTATSSLOT_CHECK.

Good idea. I called it ATTSTATSSLOT_EXISTS. New patch attached.

I don't think you can write a test for this, because there is no change
to behavior that can be observed by the user. If one side has no MCV,
the only difference is whether we try to load the other MCV or not.


Yeah. I thought along the lines of checking the number of pages read 
when the pg_stats entry is not in syscache yet. But that seems awfully 
implementation specific. So no test provided.


--
David Geier
(ServiceNow)
From 5c5c0fb9dd99e79daaa015984c9dda22e4ccda17 Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Fri, 18 Nov 2022 09:35:08 +0100
Subject: [PATCH] Don't read MCV stats needlessly in join selectivity
 estimation

Join selectivity estimation only uses MCV stats if they're present
on both join attributes. Therefore, if MCV stats are not available
on one of the two columns, skip reading MCV stats altogether.

Frequently encountered situations in which MCV stats not available
is unique columns or columns for which all values have roughly
the same frequency and therefore ANALYZE decided to not store them.
---
 src/backend/utils/adt/selfuncs.c| 17 +++--
 src/backend/utils/cache/lsyscache.c |  4 
 src/include/utils/lsyscache.h   |  3 ++-
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d597b7e81f..85a29eaeb9 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2263,6 +2263,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	bool		have_mcvs2 = false;
 	bool		join_is_reversed;
 	RelOptInfo *inner_rel;
+	bool		get_mcv_stats;
 
 	get_join_variables(root, args, sjinfo,
 	   , , _is_reversed);
@@ -2275,11 +2276,23 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	memset(, 0, sizeof(sslot1));
 	memset(, 0, sizeof(sslot2));
 
+	/*
+	 * There is no use in fetching one side's MCVs if we lack MCVs for the
+	 * other side, so do a quick check to verify that both stats exist.
+	 */
+	get_mcv_stats = (HeapTupleIsValid(vardata1.statsTuple) &&
+			 HeapTupleIsValid(vardata2.statsTuple) &&
+			 get_attstatsslot(, vardata1.statsTuple,
+	  STATISTIC_KIND_MCV, InvalidOid, ATTSTATSSLOT_EXISTS) &&
+			 get_attstatsslot(, vardata2.statsTuple,
+	  STATISTIC_KIND_MCV, InvalidOid, ATTSTATSSLOT_EXISTS));
+
+
 	if (HeapTupleIsValid(vardata1.statsTuple))
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats1 = (Form_pg_statistic) GETSTRUCT(vardata1.statsTuple);
-		if (statistic_proc_security_check(, opfuncoid))
+		if (get_mcv_stats && statistic_proc_security_check(, opfuncoid))
 			have_mcvs1 = get_attstatsslot(, vardata1.statsTuple,
 		  STATISTIC_KIND_MCV, InvalidOid,
 		  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
@@ -2289,7 +2302,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats2 = (Form_pg_statistic) GETSTRUCT(vardata2.statsTuple);
-		if (statistic_proc_security_check(, opfuncoid))
+		if (get_mcv_stats && statistic_proc_security_check(, opfuncoid))
 			have_mcvs2 = get_attstatsslot(, vardata2.statsTuple,
 		  STATISTIC_KIND_MCV, InvalidOid,
 		  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index a16a63f495..191f68f7b4 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -3157,6 +3157,10 @@ get_attavgwidth(Oid relid, AttrNumber attnum)
  * reqkind: STAKIND code for desired statistics slot kind.
  * reqop: STAOP value wanted, or InvalidOid if don't care.
  * flags: bitmask of ATTSTATSSLOT_VALUES and/or ATTSTATSSLOT_NUMBERS.
+ *Passing 0 will only check if the requested slot type exists but not
+ *extract any content. This is useful in cases where MCV stats are only
+ *useful if they exists on all involved columns (e.g. during join
+ *selectivity computation).
  *
  * If a matching slot is found, true is returned, and *sslot is filled thus:
  * staop: receives the actual STAOP value.
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index 50f0288305..d9a3916997 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -39,7 +39,8 @@ typedef enum IOFuncSelector
 } IOFuncSelector;
 
 /* Flag bits for get_attstatsslot */
-#define ATTSTATSSLOT_VALUES		0x01
+#define ATTSTATSSLOT_EXISTS	0x00
+#define ATTSTATSSLOT_VALUES	0x01
 #define ATTSTATSSLOT_NUMBERS	0x02
 
 /* Result struct for get_attstatsslot */
-- 
2.34.1



Re: CREATE UNLOGGED TABLE seq faults when debug_discard_caches=1

2022-11-18 Thread David Geier

Hi Tom,

Back-patching but keeping RelationOpenSmgr() for extensions sounds
reasonable.


On a different note: are we frequently running our test suites with
debug_discard_caches=1 enabled?

It doesn't seem like it. I just ran make check with debug_discard_caches=1 on

- latest master: everything passes.
- version 14.5: fails in create_index, create_index_spgist, create_view.

So the buggy code path is at least covered by the tests. But it seems 
like we could have found it earlier by regularly running with 
debug_discard_caches=1.


--
David Geier
(ServiceNow)

On 11/17/22 18:51, Tom Lane wrote:

I wrote:

I wonder whether we ought to back-patch f10f0ae42.  We could
leave the RelationOpenSmgr macro in existence to avoid unnecessary
breakage of extension code, but stop using it within our own code.

Concretely, about like this for v14 (didn't look at the older
branches yet).

I'm not sure whether to recommend that outside extensions switch to using
RelationGetSmgr in pre-v15 branches.  If they do, they run a risk
of compile failure should they be built against old back-branch
headers.  Once compiled, though, they'd work against any minor release
(since RelationGetSmgr is static inline, not something in the core
backend).  So maybe that'd be good enough, and keeping their code in
sync with what they need for v15 would be worth something.

regards, tom lane






Re: Assertion failure with barriers in parallel hash join

2022-11-18 Thread David Geier

Thanks!

Please let me know if I can help out, e.g. with re-testing.

--
David Geier
(ServiceNow)

On 11/17/22 08:28, Thomas Munro wrote:

On Thu, Nov 17, 2022 at 8:01 PM David Geier  wrote:

Can we make progress with this patch in the current commit fest, or discuss 
what is still missing to bring this in?

Hi David,
Sorry for the delay.  I'll aim to get this done in the next few days.





Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-18 Thread David Geier

Thanks everyone for the great feedback and suggestions.




Yes, it is.  Using zero flag would short-cut get_attstatsslot() to just
return whether the slot type exists without loading it.  Do you think we
need to emphasize this use case in the comments for 'flags'?

Perhaps, it's not really obvious now.


Comment added.



I wonder whether we need to also check statistic_proc_security_check()
when determining if MCVs exist on both sides.

Yeah, I thought about hoisting the statistic_proc_security_check
tests up into get_mcv_stats.  I don't think it's a great idea
though.  Again, it'd complicate untangling this if we ever
generalize the use of MCVs in this function.  Also, I don't
think we should be micro-optimizing the case where the security
check doesn't pass --- if it doesn't, you're going to be hurting
from bad plans a lot more than you are from some wasted cycles
here.


Sounds reasonable.

Attached is v2 of the patch.
This is basically Tom's version plus a comment for the flags of 
get_attstatsslot() as suggested by Richard.


I couldn't come up with any reasonable way of writing an automated test 
for that.

Any ideas?

--
David Geier
(ServiceNow)
From 66b16792e8e16dad33adb5614a0936cc41002f4c Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Fri, 18 Nov 2022 09:35:08 +0100
Subject: [PATCH] Don't read MCV stats needlessly in join selectivity
 estimation

Join selectivity estimation only uses MCV stats if they're present
on both join attributes. Therefore, if MCV stats are not available
on one of the two columns, skip reading MCV stats altogether.

MCV stats are frequently unavailable for unique columns, or for columns
where all values have roughly the same frequency and ANALYZE therefore
decided not to store them.
---
 src/backend/utils/adt/selfuncs.c| 15 +--
 src/backend/utils/cache/lsyscache.c |  4 
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d597b7e81f..5baab4e3ac 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2263,6 +2263,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	bool		have_mcvs2 = false;
 	bool		join_is_reversed;
 	RelOptInfo *inner_rel;
+	bool		get_mcv_stats;
 
 	get_join_variables(root, args, sjinfo,
 					   &vardata1, &vardata2, &join_is_reversed);
@@ -2275,11 +2276,21 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	memset(&sslot1, 0, sizeof(sslot1));
 	memset(&sslot2, 0, sizeof(sslot2));
 
+	/*
+	 * There is no use in fetching one side's MCVs if we lack MCVs for the
+	 * other side, so do a quick check to verify that both stats exist.
+	 */
+	get_mcv_stats = (HeapTupleIsValid(vardata1.statsTuple) &&
+			 HeapTupleIsValid(vardata2.statsTuple) &&
+			 get_attstatsslot(&sslot1, vardata1.statsTuple, STATISTIC_KIND_MCV, InvalidOid, 0) &&
+			 get_attstatsslot(&sslot2, vardata2.statsTuple, STATISTIC_KIND_MCV, InvalidOid, 0));
+
+
 	if (HeapTupleIsValid(vardata1.statsTuple))
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats1 = (Form_pg_statistic) GETSTRUCT(vardata1.statsTuple);
-		if (statistic_proc_security_check(&vardata1, opfuncoid))
+		if (get_mcv_stats && statistic_proc_security_check(&vardata1, opfuncoid))
 			have_mcvs1 = get_attstatsslot(&sslot1, vardata1.statsTuple,
 		  STATISTIC_KIND_MCV, InvalidOid,
 		  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
@@ -2289,7 +2300,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats2 = (Form_pg_statistic) GETSTRUCT(vardata2.statsTuple);
-		if (statistic_proc_security_check(&vardata2, opfuncoid))
+		if (get_mcv_stats && statistic_proc_security_check(&vardata2, opfuncoid))
 			have_mcvs2 = get_attstatsslot(&sslot2, vardata2.statsTuple,
 		  STATISTIC_KIND_MCV, InvalidOid,
 		  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index a16a63f495..191f68f7b4 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -3157,6 +3157,10 @@ get_attavgwidth(Oid relid, AttrNumber attnum)
  * reqkind: STAKIND code for desired statistics slot kind.
  * reqop: STAOP value wanted, or InvalidOid if don't care.
  * flags: bitmask of ATTSTATSSLOT_VALUES and/or ATTSTATSSLOT_NUMBERS.
+ *Passing 0 will only check if the requested slot type exists but not
+ *extract any content. This is useful in cases where MCV stats are only
+ *useful if they exist on all involved columns (e.g. during join
+ *selectivity computation).
  *
  * If a matching slot is found, true is returned, and *sslot is filled thus:
  * staop: receives the actual STAOP value.
-- 
2.34.1



Re: Assertion failure with barriers in parallel hash join

2022-11-16 Thread David Geier

Hi Thomas,

Can we make progress with this patch in the current commit fest, or 
discuss what is still missing to bring this in?


Thanks!

--
David Geier
(ServiceNow)

On 6/6/22 17:01, David Geier wrote:

Hi Thomas,

Correct. We're running with disabled parallel leader participation and 
we have to do so, because another custom plan node we built relies on 
that.

That would be great. Anything else I can help with to get this patch in?

Thanks!

--
David
(ServiceNow)

On Fri, 3 Jun 2022 at 00:06, Thomas Munro  wrote:

On Thu, Jun 2, 2022 at 9:31 PM David Geier 
wrote:
> We recently encountered the same bug in the field. Oleksii
Kozlov managed to come up with reproduction steps, which reliably
trigger it. Interestingly, the bug does not only manifest as
failing assertion, but also as segmentation fault; in builds with
disabled and with enabled (!) assertions. So it can crash
production environments. We applied the proposed patch v3 from
Melanie to the REL_14_3 branch and can confirm that with the patch
neither the assertion nor the segmentation fault still occur.

Thanks for the report, testing, and for creating the CF entry.

I assume you are using parallel_leader_participation=off, and the
reason we haven't heard more about this is because few people do that.
By coincidence I was just about to restart a bunch of hash join
projects and have been paging the topic area back into my brain, so
I'll start with another round of testing/analysis of this bug/patch
next week.


Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-14 Thread David Geier

Hi Tom,

There won't *be* any MCV stats for a column that ANALYZE perceives to
be unique, so I'm not quite sure where the claimed savings comes from.

We save if one join attribute is unique while the other isn't. In that 
case stored MCV stats are read for the non-unique attribute but then 
never used. This is because MCV stats in join selectivity estimation are 
only used if they're present on both columns.

Please provide a concrete example.


A super simple case already showing a significant speedup is the 
following. The more ways to join two tables and the more joins overall, 
the higher the expected gain.


CREATE TABLE bar(col INT UNIQUE);
CREATE TABLE foo (col INT);
INSERT INTO foo SELECT generate_series(1, 100, 0.5);
SET default_statistics_target = 1;
ANALYZE foo, bar;
\timing on
EXPLAIN SELECT * FROM foo, bar WHERE foo.col = bar.col;

Running the above query five times gave me average runtimes of:

- 0.62 ms without the patch and
- 0.48 ms with the patch.
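
As a quick sanity check that the asymmetry is really there, pg_stats makes 
it visible: only the non-unique column can end up with an MCV list here 
(a sketch using the tables from the example above; the column alias is made up):

SELECT tablename, attname, most_common_vals IS NOT NULL AS has_mcv
FROM pg_stats
WHERE tablename IN ('foo', 'bar');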

--
David Geier
(ServiceNow)





Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-11 Thread David Geier

Hi hackers,

eqjoinsel() can be optimized by not reading MCV stats if at least one of 
the two join attributes is unique. As primary keys are implicitly unique 
this situation can occur frequently. For unique columns no MCV stats are 
stored and eqjoinsel_inner() and eqjoinsel_semi(), called from 
eqjoinsel(), only consider MCV stats in the join selectivity computation 
if they're present on both columns. Attached is a small patch that 
implements the skipping.
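
A minimal sketch of the kind of join this targets (made-up table names, for 
illustration only): dim.id is implicitly unique via its primary key, so with 
the patch eqjoinsel() skips fetching any MCV list that ANALYZE stored for 
fact.dim_id.

CREATE TABLE dim (id INT PRIMARY KEY);
CREATE TABLE fact (dim_id INT);
INSERT INTO dim SELECT generate_series(0, 99);
INSERT INTO fact SELECT i % 100 FROM generate_series(1, 10000) AS i;
INSERT INTO fact SELECT 1 FROM generate_series(1, 1000); -- skew so ANALYZE keeps an MCV list
ANALYZE dim, fact;
EXPLAIN SELECT * FROM fact JOIN dim ON fact.dim_id = dim.id;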


With this change we saw some queries improve planning time by more than 
2x, especially with larger values for default_statistics_target. That's 
because get_attstatsslot() deconstructs the array holding the MCV. The 
size of that array depends on default_statistics_target.


Thanks for your consideration!

--
David Geier
(ServiceNow)
From 7a8176b9eb9dd9982662d83b86996c5402378674 Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Fri, 11 Nov 2022 12:33:11 +0100
Subject: [PATCH] Skip reading MCV stats for unique join attributes

MCV stats are not stored for unique columns. If one of the two join
attributes is unique, reading the MCV stats can be skipped for both
columns because they are only ever used during join selectivity
estimation if they're present on both attributes.
---
 src/backend/utils/adt/selfuncs.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d597b7e81f..3c9007c8d1 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2263,6 +2263,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	bool		have_mcvs2 = false;
 	bool		join_is_reversed;
 	RelOptInfo *inner_rel;
+	bool		get_mcv_stats;
 
 	get_join_variables(root, args, sjinfo,
 					   &vardata1, &vardata2, &join_is_reversed);
@@ -2270,6 +2271,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	nd1 = get_variable_numdistinct(&vardata1, &isdefault1);
 	nd2 = get_variable_numdistinct(&vardata2, &isdefault2);
 
+	get_mcv_stats = !vardata1.isunique && !vardata2.isunique;
 	opfuncoid = get_opcode(operator);
 
 	memset(&sslot1, 0, sizeof(sslot1));
@@ -2279,7 +2281,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats1 = (Form_pg_statistic) GETSTRUCT(vardata1.statsTuple);
-		if (statistic_proc_security_check(&vardata1, opfuncoid))
+		if (get_mcv_stats && statistic_proc_security_check(&vardata1, opfuncoid))
 			have_mcvs1 = get_attstatsslot(&sslot1, vardata1.statsTuple,
 		  STATISTIC_KIND_MCV, InvalidOid,
 		  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
@@ -2289,7 +2291,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats2 = (Form_pg_statistic) GETSTRUCT(vardata2.statsTuple);
-		if (statistic_proc_security_check(, opfuncoid))
+		if (get_mcv_stats && statistic_proc_security_check(, opfuncoid))
 			have_mcvs2 = get_attstatsslot(, vardata2.statsTuple,
 		  STATISTIC_KIND_MCV, InvalidOid,
 		  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
-- 
2.34.1



Re: Add explicit casts in four places to simplehash.h

2022-11-04 Thread David Geier

On 11/3/22 15:50, Tom Lane wrote:

Seems reasonable, so done (I fixed one additional spot you missed).

Thanks!

The bigger picture here is that we do actually endeavor to keep
(most of) our headers C++ clean, but our tool cpluspluscheck misses
these problems because it doesn't try to use these macros.
I wonder whether there is a way to do better.


What about having a custom header alongside cpluspluscheck which 
references all macros we care about?
We could start with the really big macros like the ones from 
simplehash.h and add as we go.

I could give this a try if deemed useful.

--
David Geier
(ServiceNow)





Add explicit casts in four places to simplehash.h

2022-11-03 Thread David Geier

Hi hackers,

While working on an extension, I found that simplehash.h is missing 
explicit casts in four places. Without these casts, compiling code 
including simplehash.h yields warnings if the code is compiled with 
-Wc++-compat.


PostgreSQL seems to mostly prefer omitting the explicit casts; however, 
there are many places where an explicit cast is actually used. Among 
many others, see e.g.


bool.c:
    state = (BoolAggState *) MemoryContextAlloc(agg_context, 
sizeof(BoolAggState));


domains.c:
   my_extra = (DomainIOData *) MemoryContextAlloc(mcxt, 
sizeof(DomainIOData));


What about, while not being strictly necessary for PostgreSQL itself, 
also adding such casts to simplehash.h so that it can be used in code 
where -Wc++-compat is enabled?


Attached is a small patch that adds the aforementioned casts.
Thanks for your consideration!

--
David Geier
(ServiceNow)
From d2665b4065227ffa6efd37ee912e4add082869cb Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Thu, 3 Nov 2022 12:28:08 +0100
Subject: [PATCH] Add explicit casts to simplehash.h

Without these casts compilation yields warnings when compiling the
code with -Wc++-compat.
---
 src/include/lib/simplehash.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index 329687c1a5..cdfae537a1 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -438,7 +438,7 @@ SH_CREATE(MemoryContext ctx, uint32 nelements, void *private_data)
 #ifdef SH_RAW_ALLOCATOR
 	tb = SH_RAW_ALLOCATOR(sizeof(SH_TYPE));
 #else
-	tb = MemoryContextAllocZero(ctx, sizeof(SH_TYPE));
+	tb = (SH_TYPE *) MemoryContextAllocZero(ctx, sizeof(SH_TYPE));
 	tb->ctx = ctx;
 #endif
 	tb->private_data = private_data;
@@ -448,7 +448,7 @@ SH_CREATE(MemoryContext ctx, uint32 nelements, void *private_data)
 
 	SH_COMPUTE_PARAMETERS(tb, size);
 
-	tb->data = SH_ALLOCATE(tb, sizeof(SH_ELEMENT_TYPE) * tb->size);
+	tb->data = (SH_ELEMENT_TYPE *) SH_ALLOCATE(tb, sizeof(SH_ELEMENT_TYPE) * tb->size);
 
 	return tb;
 }
@@ -493,7 +493,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 	/* compute parameters for new table */
 	SH_COMPUTE_PARAMETERS(tb, newsize);
 
-	tb->data = SH_ALLOCATE(tb, sizeof(SH_ELEMENT_TYPE) * tb->size);
+	tb->data = (SH_ELEMENT_TYPE *) SH_ALLOCATE(tb, sizeof(SH_ELEMENT_TYPE) * tb->size);
 
 	newdata = tb->data;
 
@@ -1059,7 +1059,7 @@ SH_STAT(SH_TYPE * tb)
 	double		fillfactor;
 	uint32		i;
 
-	uint32	   *collisions = palloc0(tb->size * sizeof(uint32));
+	uint32		*collisions = (uint32 *) palloc0(tb->size * sizeof(uint32));
 	uint32		total_collisions = 0;
 	uint32		max_collisions = 0;
 	double		avg_collisions;
-- 
2.34.1



Re: Reducing planning time on tables with many indexes

2022-08-19 Thread David Geier

On 8/1/22 15:33, David Geier wrote:

Hi Tom,

On Wed, Jul 27, 2022 at 7:15 PM Tom Lane  wrote:

I wrote:
> Unfortunately, as things stand today, the planner needs more
than the
> right to look at the indexes' schemas, because it makes physical
accesses
> to btree indexes to find out their tree height (and I think
there are some
> comparable behaviors in other AMs).  I've never particularly
cared for
> that implementation, and would be glad to rip out that behavior
if we can
> find another way.  Maybe we could persuade VACUUM or ANALYZE to
store that
> info in the index's pg_index row, or some such, and then the planner
> could use it with no lock?

It seems like _bt_getrootheight() first checks if the height is cached 
and only if it isn't does it access index meta pages.
If the index locks are only taken for the sake of _bt_getrootheight() 
accessing index meta pages in case they are not cached, maybe the 
index locks could be taken conditionally.
However, postponing the call to where it is really needed sounds even 
better.



A first step here could just be to postpone fetching
_bt_getrootheight()
until we actually need it during cost estimation.  That would
avoid the
need to do it at all for indexes that indxpath.c discards as
irrelevant,
which is a decision made on considerably more information than the
proposed patch uses.


Hi Tom,

I gave the idea of moving _bt_getrootheight() into costsize.c and 
filling IndexOptInfo in get_relation_info() via syscache instead of 
relcache a try, but didn't get very far.
Moving out _bt_getrootheight() was straightforward, and we should do so 
nevertheless. However, it seems like get_relation_info() strongly 
depends on the index's Relation for quite a few things. A fair number of 
fields I could actually fill from syscache, but there are some that 
either need data not stored in syscache (e.g. estimate_rel_size(), 
Relation::rd_smgr needed by RelationGetNumberOfBlocksInFork()) or need 
fields that are cached in the index's Relation and would have to be 
recomputed otherwise (e.g. Relation::rd_indexprs filled by 
RelationGetIndexExpressions(), Relation::rd_indpred filled by 
RelationGetIndexPredicate()). Even if we could somehow obtain the 
missing info from somewhere, recomputing the otherwise cached fields 
from Relation would likely cause a significant slowdown in the serial case.


Beyond that I did some off-CPU profiling to precisely track down which 
lock serializes execution. It turned out to be the MyProc::fpInfoLock 
lightweight lock. This lock is used in the fast path of the heavyweight 
lock. In the contenting case, fpInfoLock is acquired in LW_EXCLUSIVE 
mode to (1) check if there is no other process holding a stronger lock, 
and if not, to reserve a process local fast path lock slot and (2) to 
return the fast path lock slots all in one go. To do so, the current 
implementation always linearly iterates over all lock slots. The 
corresponding call stacks are:


get_relation_info()   CommitTransaction()
  index_open()  ResourceOwnerRelease()
    relation_open() ResourceOwnerReleaseInternal()
  LockRelationOid() ProcReleaseLocks()
    LockAcquireExtended() LockReleaseAll() <-- called 
twice from ProcReleaseLocks()

  LWLockAcquire()

On top of that there are only 16 fast path lock slots. One slot is 
always taken up by the parent relation, leaving only 15 slots for the 
indexes. As soon as a session process runs out of slots, it falls back 
to the normal lock path which has to mess around with the lock table. To 
do so it also acquires a lightweight lock in LW_EXCLUSIVE mode. This 
lightweight lock however is partitioned and therefore does not contend. 
Hence, normal lock acquisition is slower but contends less.
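
How quickly the 16 fast path slots run out can also be observed from SQL, 
since pg_locks exposes a fastpath column (a sketch; synth_table is the 
benchmark table from this thread, and the relation locks are held until 
transaction end):

BEGIN;
SELECT * FROM synth_table WHERE col05 = 52;
SELECT fastpath, count(*)
FROM pg_locks
WHERE pid = pg_backend_pid() AND locktype = 'relation'
GROUP BY fastpath;
COMMIT;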


To prove the above findings I increased the number of fast path lock slots 
per connection and optimized FastPathGrantRelationLock() and 
FastPathUnGrantRelationLock(). With these changes the lock contention 
disappeared and the workload scales linearly (the code I tested with 
also included moving out _bt_getrootheight()):


| Parallel streams | TPS  | TPS / stream  |
|--|--|---|
| 1    |   5,253  | 5,253 |
| 10   |  51,406  | 5,140 |
| 20   | 101,401  | 5,070 |
| 30   | 152,023  | 5,067 |
| 40   | 200,607  | 5,015 |
| 50   | 245,359  | 4,907 |
| 60   | 302,994  | 5,049 |

However, with the very same setup, the index filtering approach yields 
486k TPS with 60 streams and 9,827 TPS with a single stream. The single 
stream number shows that this is not because it scales even better, but 
just because less work is spent during planning. A quick perf session 
showed that a fair amo

Re: Reducing planning time on tables with many indexes

2022-08-04 Thread David Geier
Hi Tom,

On Wed, Jul 27, 2022 at 6:39 PM Tom Lane  wrote:

> David Geier  writes:
> > We tracked down the root cause of this slowdown to lock contention in
> > 'get_relation_info()'. The index lock of every single index of every
> single
> > table used in that query is acquired. We attempted a fix by pre-filtering
> > out all indexes that anyways cannot be used with a certain query, without
> > taking the index locks (credits to Luc Vlaming for idea and
> > implementation). The patch does so by caching the columns present in
> every
> > index, inside 'struct Relation', similarly to 'rd_indexlist'.
>
> I wonder how much thought you gave to the costs imposed by that extra
> cache space.  We have a lot of users who moan about relcache bloat
> already.


The current representation could be compacted (e.g. by storing the column
indexes as 16-bit integers, instead of using a Bitmapset). That should make
the additional space needed negligible compared to the current size of
RelationData.  On top of that we could maybe reorder the members of
RelationData to save padding bytes. Currently, there is lots of
interleaving of members of different sizes. Even when taking cache locality
into consideration it looks like a fair amount could be saved, probably
accounting for the additional space needed to store the index column data.

  But more to the point, I do not buy the assumption that
> an index's set of columns is a good filter for which indexes are of
> interest.  A trivial counterexample from the regression database is
>
> regression=# explain select count(*) from tenk1;
>                                          QUERY PLAN
> ----------------------------------------------------------------------------------------------
>  Aggregate  (cost=219.28..219.29 rows=1 width=8)
>    ->  Index Only Scan using tenk1_hundred on tenk1  (cost=0.29..194.28 rows=10000 width=0)
> (2 rows)
>
Only for queries without index conditions does the current version of the
patch simply discard all indexes but the one with the fewest columns. This
is case (3) in s64_IsUnnecessaryIndex(). This is an over-simplification
with the goal of picking the index which yields the least I/O. For single
column indexes that works, but it can fall short for multi-column indexes
(e.g. an [INT, TEXT] index and an [INT, INT] index both have 2 columns, but
the latter would be better suited when there's no other index and we want to
read the first integer column). What we should do here instead is to
discard indexes based on storage size.


> It looks to me like the patch also makes unwarranted assumptions about
> being able to discard all but the smallest index having a given set
> of columns.  This would, for example, possibly lead to dropping the
> index that has the most useful sort order, or that has the operator
> class needed to support a specific WHERE clause.
>
Why would that be? If we keep all indexes that contain columns that are
used by the query, we also keep the indexes which provide a certain sort
order or operator class.

>
> In short, I'm not sure I buy this concept at all.  I think it might
> be more useful to attack the locking overhead more directly.  I kind
> of wonder why we need per-index locks at all during planning ---
> I think that we already interpret AccessShareLock on the parent table
> as being sufficient to block schema changes on existing indexes.
>
As I said in the reply to your other mail, there are huge savings also in
the serial case where lock contention is not an issue. It seems like
pre-filtering saves work down the road. I'll do some perf runs to track
down where exactly the savings come from. One source I can think of is only
having to consider a subset of all indexes during path creation.


> Unfortunately, as things stand today, the planner needs more than the
> right to look at the indexes' schemas, because it makes physical accesses
> to btree indexes to find out their tree height (and I think there are some
> comparable behaviors in other AMs).  I've never particularly cared for
> that implementation, and would be glad to rip out that behavior if we can
> find another way.  Maybe we could persuade VACUUM or ANALYZE to store that
> info in the index's pg_index row, or some such, and then the planner
> could use it with no lock?
>
That's another interesting approach, but I would love to save the planner
cycles for the sequential case.

--
David Geier
(ServiceNow)


Re: Reducing planning time on tables with many indexes

2022-08-01 Thread David Geier
Hi Tom,

On Wed, Jul 27, 2022 at 7:15 PM Tom Lane  wrote:

> I wrote:
> > Unfortunately, as things stand today, the planner needs more than the
> > right to look at the indexes' schemas, because it makes physical accesses
> > to btree indexes to find out their tree height (and I think there are
> some
> > comparable behaviors in other AMs).  I've never particularly cared for
> > that implementation, and would be glad to rip out that behavior if we can
> > find another way.  Maybe we could persuade VACUUM or ANALYZE to store
> that
> > info in the index's pg_index row, or some such, and then the planner
> > could use it with no lock?
>
It seems like _bt_getrootheight() first checks if the height is cached and
only if it isn't does it access index meta pages.
If the index locks are only taken for the sake of _bt_getrootheight()
accessing index meta pages in case they are not cached, maybe the index
locks could be taken conditionally.
However, postponing the call to where it is really needed sounds even
better.

>
> A first step here could just be to postpone fetching _bt_getrootheight()
> until we actually need it during cost estimation.  That would avoid the
> need to do it at all for indexes that indxpath.c discards as irrelevant,
> which is a decision made on considerably more information than the
> proposed patch uses.
>
> Having done that, you could look into revising plancat.c to fill the
> IndexOptInfo structs from catcache entries instead of opening the
> index per se.  (You'd have to also make sure that the appropriate
> index locks are acquired eventually, for indexes the query does use
> at runtime.  I think that's the case, but I'm not sure if anything
> downstream has been optimized on the assumption the planner did it.)
>
I can give this a try.
That way we would get rid of the scalability issues.
However, what about the runtime savings observed with a single query stream?
In that case there is no contention, so it seems like having fewer indexes
to look at further down the road also yields substantial savings.
Any clue where exactly these savings might come from? Or is it actually
the calls to _bt_getrootheight()? I can also do a few perf runs to track
that down.


> This'd probably get us a large part of the way there.  Further
> optimization of acquisition of tree height etc could be an
> optional follow-up.
>
> regards, tom lane
>


Re: [PoC] Reducing planning time on tables with many indexes

2022-07-27 Thread David Geier
Sorry, by accident I sent this one out twice.

--
David Geier
(ServiceNow)

On Wed, Jul 27, 2022 at 2:42 PM David Geier  wrote:

> Hi hackers,
>
> We came across a slowdown in planning, where queries use tables with many
> indexes. In setups with wide tables it is not uncommon to have easily
> 10-100 indexes on a single table. The slowdown is already visible in serial
> workloads with just a handful of indexes, but gets drastically amplified
> when running queries with more indexes in parallel at high throughput.
>
> We measured the TPS and planning time of running parallel streams of
> simple point look-up queries on a single empty table with 60 columns and 60
> indexes. The query used is 'SELECT * FROM synth_table WHERE col5 = 42'. No
> rows are returned because the table is empty. We used a machine with 64
> physical CPU cores. The schema and sysbench script to reproduce these
> numbers are attached. We used the TPS as reported by sysbench and obtained
> planning time by running 'EXPLAIN ANALYZE' on the same query in a
> separately opened connection. We averaged the planning time of 3 successive
> 'EXPLAIN ANALYZE' runs. sysbench ran on the same machine with varying
> numbers of threads using the following command line:
>
> sysbench repro.lua --db-driver=pgsql --pgsql-host=localhost
> --pgsql-db=postgres --pgsql-port=? --pgsql-user=? --pgsql-password=?
> --report-interval=1 --threads=64 run
>
> The following table shows the results. It is clearly visible that the TPS
> flatten out already at 8 parallel streams, while the planning time is
> increasing drastically.
>
> Parallel streams | TPS (before) | Planning time (before)
> -|--|---
> 1|  5,486   | 0.13 ms
> 2|  8,098   | 0.22 ms
> 4| 15,013   | 0.19 ms
> 8| 27,107   | 0.29 ms
> 16   | 30,938   | 0.43 ms
> 32   | 26,330   | 1.68 ms
> 64   | 24,314   | 2.48 ms
>
> We tracked down the root cause of this slowdown to lock contention in
> 'get_relation_info()'. The index lock of every single index of every single
> table used in that query is acquired. We attempted a fix by pre-filtering
> out all indexes that anyways cannot be used with a certain query, without
> taking the index locks (credits to Luc Vlaming for idea and
> implementation). The patch does so by caching the columns present in every
> index, inside 'struct Relation', similarly to 'rd_indexlist'. Then, before
> opening (= locking) the indexes in 'get_relation_info()', we check if the
> index can actually contribute to the query and if not it is discarded right
> away. Caching the index info saves considerable work for every query run
> subsequently, because less indexes must be inspected and thereby locked.
> This way we also save cycles in any code that later on goes over all
> relation indexes.
>
> The work-in-progress version of the patch is attached. It is still fairly
> rough (e.g. uses a global variable, selects the best index in scans without
> restrictions by column count instead of physical column size, is missing
> some renaming, etc.), but shows the principle.
>
> The following table shows the TPS, planning time and speed-ups after
> applying the patch and rerunning above described benchmark. Now, the
> planning time remains roughly constant and TPS roughly doubles each time
> the number of parallel streams is doubled. The higher the stream count the
> more severe the lock contention is and the more pronounced the gained
> speed-up gets. Interestingly, even for a single query stream the speed-up
> in planning time is already very significant. This applies also for lower
> index counts. For example just with 10 indexes the TPS for a single query
> stream goes from 9,159 to 12,558. We can do more measurements if there is
> interest in details for a lower number of indexes.
>
> Parallel streams | TPS (after) | Planning time (after) | Speed-up TPS |
> Speed-up planning
>
> -|-|---|--|--
> 1|  10,344 | 0.046 |  1.9x|
>  2.8x
> 2|  20,140 | 0.045 ms  |  2.5x|
>  4.9x
> 4|  40,349 | 0.047 ms  |  2.7x|
>  4.0x
> 8|  80,121 | 0.046 ms  |  3.0x|
>  6.3x
> 16   | 152,632 | 0.051 ms  |  4.9x|
>  8.4x
> 32   | 301,359 | 0.052 ms  | 11.4x|
> 32.3x
> 64   | 525,115 | 0.062 ms  | 21.6x|
> 40.0x
>
> We are happy to receive your feedback and polish up the patch.
>
> --
> David Geier
> (ServiceNow)
>


[PoC] Reducing planning time on tables with many indexes

2022-07-27 Thread David Geier
Hi hackers,

We came across a slowdown in planning, where queries use tables with many
indexes. In setups with wide tables it is not uncommon to have easily
10-100 indexes on a single table. The slowdown is already visible in serial
workloads with just a handful of indexes, but gets drastically amplified
when running queries with more indexes in parallel at high throughput.

We measured the TPS and planning time of running parallel streams of simple
point look-up queries on a single empty table with 60 columns and 60
indexes. The query used is 'SELECT * FROM synth_table WHERE col5 = 42'. No
rows are returned because the table is empty. We used a machine with 64
physical CPU cores. The schema and sysbench script to reproduce these
numbers are attached. We used the TPS as reported by sysbench and obtained
planning time by running 'EXPLAIN ANALYZE' on the same query in a
separately opened connection. We averaged the planning time of 3 successive
'EXPLAIN ANALYZE' runs. sysbench ran on the same machine with varying
numbers of threads using the following command line:

sysbench repro.lua --db-driver=pgsql --pgsql-host=localhost
--pgsql-db=postgres --pgsql-port=? --pgsql-user=? --pgsql-password=?
--report-interval=1 --threads=64 run

The following table shows the results. It is clearly visible that the TPS
flatten out already at 8 parallel streams, while the planning time is
increasing drastically.

Parallel streams | TPS (before) | Planning time (before)
-----------------|--------------|-----------------------
               1 |  5,486       | 0.13 ms
               2 |  8,098       | 0.22 ms
               4 | 15,013       | 0.19 ms
               8 | 27,107       | 0.29 ms
              16 | 30,938       | 0.43 ms
              32 | 26,330       | 1.68 ms
              64 | 24,314       | 2.48 ms

We tracked down the root cause of this slowdown to lock contention in
'get_relation_info()'. The index lock of every single index of every single
table used in that query is acquired. We attempted a fix by pre-filtering
out all indexes that anyways cannot be used with a certain query, without
taking the index locks (credits to Luc Vlaming for idea and
implementation). The patch does so by caching the columns present in every
index, inside 'struct Relation', similarly to 'rd_indexlist'. Then, before
opening (= locking) the indexes in 'get_relation_info()', we check if the
index can actually contribute to the query and if not it is discarded right
away. Caching the index info saves considerable work for every query run
subsequently, because less indexes must be inspected and thereby locked.
This way we also save cycles in any code that later on goes over all
relation indexes.

The work-in-progress version of the patch is attached. It is still fairly
rough (e.g. uses a global variable, selects the best index in scans without
restrictions by column count instead of physical column size, is missing
some renaming, etc.), but shows the principle.

The following table shows the TPS, planning time and speed-ups after
applying the patch and rerunning above described benchmark. Now, the
planning time remains roughly constant and TPS roughly doubles each time
the number of parallel streams is doubled. The higher the stream count the
more severe the lock contention is and the more pronounced the gained
speed-up gets. Interestingly, even for a single query stream the speed-up
in planning time is already very significant. This applies also for lower
index counts. For example just with 10 indexes the TPS for a single query
stream goes from 9,159 to 12,558. We can do more measurements if there is
interest in details for a lower number of indexes.

Parallel streams | TPS (after) | Planning time (after) | Speed-up TPS | Speed-up planning
-----------------|-------------|-----------------------|--------------|------------------
               1 |  10,344     | 0.046 ms              |  1.9x        |  2.8x
               2 |  20,140     | 0.045 ms              |  2.5x        |  4.9x
               4 |  40,349     | 0.047 ms              |  2.7x        |  4.0x
               8 |  80,121     | 0.046 ms              |  3.0x        |  6.3x
              16 | 152,632     | 0.051 ms              |  4.9x        |  8.4x
              32 | 301,359     | 0.052 ms              | 11.4x        | 32.3x
              64 | 525,115     | 0.062 ms              | 21.6x        | 40.0x

We are happy to receive your feedback and polish up the patch.

--
David Geier
(ServiceNow)
From 60e300b5cac8f527e61483296df81ab783670ac9 Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Fri, 15 Jul 2022 11:53:27 +0200
Subject: [PATCH] Index filtering

---
 src/backend/optimizer/path/Makefile   |   3 +-
 src/backend/optimizer/path/index_filtering.c  | 220 ++
 src/backend/optimizer/plan/planmain.c |   7 +
 src/backend/optimizer/util/plancat.c  | 124 ++
 src/backend/utils/cache/relcache.c|  17 +-
 src/include

Reducing planning time on tables with many indexes

2022-07-27 Thread David Geier
Hi hackers,

We came across a slowdown in planning, where queries use tables with many
indexes. In setups with wide tables it is not uncommon to have easily
10-100 indexes on a single table. The slowdown is already visible in serial
workloads with just a handful of indexes, but gets drastically amplified
when running queries with more indexes in parallel at high throughput.

We measured the TPS and planning time of running parallel streams of simple
point look-up queries on a single empty table with 60 columns and 60
indexes. The query used is 'SELECT * FROM synth_table WHERE col5 = 42'. No
rows are returned because the table is empty. We used a machine with 64
physical CPU cores. The schema and sysbench script to reproduce these
numbers are attached. We used the TPS as reported by sysbench and obtained
planning time by running 'EXPLAIN ANALYZE' on the same query in a
separately opened connection. We averaged the planning time of 3 successive
'EXPLAIN ANALYZE' runs. sysbench ran on the same machine with varying
numbers of threads using the following command line:

sysbench repro.lua --db-driver=pgsql --pgsql-host=localhost
--pgsql-db=postgres --pgsql-port=? --pgsql-user=? --pgsql-password=?
--report-interval=1 --threads=64 run

The following table shows the results. It is clearly visible that the TPS
flatten out already at 8 parallel streams, while the planning time is
increasing drastically.

Parallel streams | TPS (before) | Planning time (before)
-----------------|--------------|-----------------------
               1 |  5,486       | 0.13 ms
               2 |  8,098       | 0.22 ms
               4 | 15,013       | 0.19 ms
               8 | 27,107       | 0.29 ms
              16 | 30,938       | 0.43 ms
              32 | 26,330       | 1.68 ms
              64 | 24,314       | 2.48 ms

We tracked down the root cause of this slowdown to lock contention in
'get_relation_info()'. The index lock of every single index of every single
table used in that query is acquired. We attempted a fix by pre-filtering
out all indexes that anyways cannot be used with a certain query, without
taking the index locks (credits to Luc Vlaming for idea and
implementation). The patch does so by caching the columns present in every
index, inside 'struct Relation', similarly to 'rd_indexlist'. Then, before
opening (= locking) the indexes in 'get_relation_info()', we check if the
index can actually contribute to the query and if not it is discarded right
away. Caching the index info saves considerable work for every query run
subsequently, because less indexes must be inspected and thereby locked.
This way we also save cycles in any code that later on goes over all
relation indexes.

The work-in-progress version of the patch is attached. It is still fairly
rough (e.g. uses a global variable, selects the best index in scans without
restrictions by column count instead of physical column size, is missing
some renaming, etc.), but shows the principle.

The following table shows the TPS, planning time and speed-ups after
applying the patch and rerunning above described benchmark. Now, the
planning time remains roughly constant and TPS roughly doubles each time
the number of parallel streams is doubled. The higher the stream count the
more severe the lock contention is and the more pronounced the gained
speed-up gets. Interestingly, even for a single query stream the speed-up
in planning time is already very significant. This applies also for lower
index counts. For example just with 10 indexes the TPS for a single query
stream goes from 9,159 to 12,558. We can do more measurements if there is
interest in details for a lower number of indexes.

Parallel streams | TPS (after) | Planning time (after) | Speed-up TPS | Speed-up planning
-----------------|-------------|-----------------------|--------------|------------------
               1 |  10,344     | 0.046 ms              |  1.9x        |  2.8x
               2 |  20,140     | 0.045 ms              |  2.5x        |  4.9x
               4 |  40,349     | 0.047 ms              |  2.7x        |  4.0x
               8 |  80,121     | 0.046 ms              |  3.0x        |  6.3x
              16 | 152,632     | 0.051 ms              |  4.9x        |  8.4x
              32 | 301,359     | 0.052 ms              | 11.4x        | 32.3x
              64 | 525,115     | 0.062 ms              | 21.6x        | 40.0x

We are happy to receive your feedback and polish up the patch.

--
David Geier
(ServiceNow)
function thread_init()
  drv = sysbench.sql.driver()
  con = drv:connect()
end

function thread_done()
  con:disconnect()
end

function event()
  con:query('SELECT * FROM synth_table WHERE col05 = 52')
end

schema.sql
Description: application/sql
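
For reference, a rough sketch of a schema equivalent to the attached
schema.sql (hypothetical, for illustration: 60 integer columns with one
single-column index each):

DO $$
DECLARE
    cols text;
BEGIN
    SELECT string_agg(format('col%s int', lpad(i::text, 2, '0')), ', ' ORDER BY i)
      INTO cols
      FROM generate_series(1, 60) AS i;
    EXECUTE format('CREATE TABLE synth_table (%s)', cols);
    FOR i IN 1..60 LOOP
        EXECUTE format('CREATE INDEX ON synth_table (col%s)', lpad(i::text, 2, '0'));
    END LOOP;
END $$;
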
From 60e300b5cac8f527e61483296df81ab783670ac9 Mon Sep 17 00:00:00 2001
From: David Geier 
Date: Fri, 15 Jul 2022 11:53:27 +0200
Subject: [PATCH] Index filtering

---
 src/backend/optimizer/path/Makefile   |   3 +-
 src/backend

Re: Lazy JIT IR code generation to increase JIT speed with partitions

2022-07-18 Thread David Geier

Can you elaborate a bit more on how you conclude that?

Looking at the numbers I measured in one of my previous e-mails, it 
looks to me like the overhead of using multiple modules is fairly low 
and only measurable in queries with dozens of modules. Given that JIT is 
most useful in queries that process a fair number of rows, having to 
spend marginally more time on creating the JIT program while being able 
to apply JIT at a much finer granularity seems desirable. For example, the time 
you lose for handling more modules, you save right away because not the 
whole plan gets JIT compiled.


It is a trade-off between optimizing for the best case where everything 
in the plan can truly benefit from jitting and hence a single module 
that has it all is best, vs the worst-case where almost nothing truly 
profits from jitting and hence only a small fraction of the plan should 
actually be jitted. The penalty for the best case seems low though, 
because (1) the overhead is low in absolute terms, and (2) even if the 
entire plan truly benefits from jitting, spending sub-ms more per node 
seems negligible because there is anyway going to be significant time 
spent.


--
David Geier
(ServiceNow)

On 7/4/22 22:23, Andres Freund wrote:

Hi,

On 2022-07-04 06:43:00 +, Luc Vlaming Hummel wrote:

Thanks for reviewing this and the interesting examples!

Wanted to give a bit of extra insight as to why I'd love to have a system that 
can lazily emit JIT code and hence creates roughly a module per function:
In the end I'm hoping that we can migrate to a system where we only JIT after a 
configurable cost has been exceeded for this node, as well as a configurable 
amount of rows has actually been processed.
Reason is that this would safeguard against some problematic planning issues
wrt JIT (node not being executed, row count being massively off).

I still don't see how it's viable to move to always doing function-by-function
emission overhead wise.

I also want to go to do JIT in the background and triggered by acutal
usage. But to me it seems a dead end to require moving to
one-function-per-module model for that.



If this means we have to invest more in making it cheap(er) to emit modules,
I'm all for that.

I think that's just inherently more expensive and thus a no-go.



@Andres if there's any other things we ought to fix to make this cheap
(enough) compared to the previous code I'd love to know your thoughts.

I'm not seeing it.

Greetings,

Andres Freund

Re: Lazy JIT IR code generation to increase JIT speed with partitions

2022-07-14 Thread David Geier
On Mon, Jul 4, 2022 at 10:32 PM Andres Freund  wrote:

> Hi,
>
> On 2022-06-27 16:55:55 +0200, David Geier wrote:
> > Indeed, the total JIT time increases the more modules are used. The
> reason
> > for this to happen is that the inlining pass loads and deserializes all
> to
> > be inlined modules (.bc files) from disk prior to inlining them via
> > llvm::IRMover. There's already a cache for such modules in the code, but
> it
> > is currently unused. This is because llvm::IRMover takes the module to be
> > inlined as std::unique_ptr. The by-value argument requires
> > the source module to be moved, which means it cannot be reused
> afterwards.
> > The code is accounting for that by erasing the module from the cache
> after
> > inlining it, which in turns requires reloading the module next time a
> > reference to it is encountered.
> >
> > Instead of each time loading and deserializing all to be inlined modules
> > from disk, they can reside in the cache and instead be cloned via
> > llvm::CloneModule() before they get inlined. Key to calling
> > llvm::CloneModule() is fully deserializing the module upfront, instead of
> > loading the module lazily. That is why I changed the call from
> > LLVMGetBitcodeModuleInContext2() (which lazily loads the module via
> > llvm::getOwningLazyBitcodeModule()) to LLVMParseBitCodeInContext2()
> (which
> > fully loads the module via llvm::parseBitcodeFile()). Beyond that it
> seems
> > like that prior to LLVM 13, cloning modules could fail with an assertion
> > (not sure though if that would cause problems in a release build without
> > assertions). Andres reported this problem back in the days here [1]. In
> the
> > meanwhile the issue got discussed in [2] and finally fixed for LLVM 13,
> see
> > [3].
>
> Unfortunately that doesn't work right now - that's where I had started. The
> problem is that IRMover renames types. Which, in the case of cloned modules
> unfortunately means that types used cloned modules are also renamed in the
> "origin" module. Which then causes problems down the line, because parts of
> the LLVM code match types by type names.
>
> That can then have the effect of drastically decreasing code generation
> quality over time, because e.g. inlining never manages to find signatures
> compatible.
>
> Looking at the LLVM patches these issues seem to be gone.
The same suggests dumping the bitcode and running all tests with JIT
force-enabled.
See below.

>
> > However, curiously the time spent on optimizing is also reduced (95ms
> > instead of 164ms). Could this be because some of the applied
> optimizations
> > are ending up in the cached module?
>
> I suspect it's more that optimization stops being able to do a lot, due to
> the
> type renamign issue.
>
>
> > @Andres: could you provide me with the queries that caused the assertion
> > failure in LLVM?
>
> I don't think I have the concrete query. What I tend to do is to run the
> whole
> regression tests with forced JITing. I'm fairly certain this triggered the
> bug
> at the time.
>
>
> > Have you ever observed a segfault with a non-assert-enabled build?
>
> I think I observed bogus code generation that then could lead to segfaults
> or
> such.
>
OK, then using a non-assert-enabled LLVM build seems good enough.
Especially, given the fixes they merged in [3].

>
>
> > I just want to make sure this is truly fixed in LLVM 13. Running 'make
> > check-world' all tests passed.
>
> With jit-ing forced for everything?
>
> One more thing to try is to jit-compile twice and ensure the code is the
> same. It certainly wasn't in the past due to the above issue.
>
Yes, with jitting force-enabled everything passes (I removed all checks
for the various 'jit_above' costs from planner.c).

I also dumped the bitcode via 'SET jit_dump_bitcode = ON' and compared the
bitcode output after inlining and after optimization for two successive
executions of the very same query. The .bc files after inlining only
differ in two bytes. The .bc files after optimization differ in a lot more
bytes. However, the same is true for the master branch. How do you exactly
conclude the identity of compilation output?

--
David Geier
(ServiceNow)


Improving scalability of Parallel Bitmap Heap/Index Scan

2022-07-14 Thread David Geier
Hi hackers,

While debugging some slow queries containing Bitmap Heap/Index Scans (in
short BHS / BIS), we observed a few issues regarding scalability:

   1. The BIS always runs in only a single process, even when the parent
   BHS is parallel. The first process arriving in the BHS serves as leader and
   executes the BIS.
   2. As long as execution is "exact" (TIDs are stored instead of page
   bits), the parallel BHS sorts all TIDs to ensure pages are accessed
   sequentially. The sort is also performed just by a single worker. Already
   with a few tens of thousands of pages to scan, the sort time can make up a
   significant portion of the total runtime. Large page counts and the need
   for parallelism are not uncommon for BHS, as one use case is closing the
   gap between index and sequential scans. The BHS costing seems to not
   account for that.
   3. The BHS does not scale well with an increasing number of parallel
   workers, even when accounting for the sequential parts of execution. A perf
   profile shows that the TID list / bitmap iteration code heavily contends on
   a mutex taken for every single TID / page bit (see
   LWLockAcquire(&istate->lock, LW_EXCLUSIVE) in tidbitmap.c:1067).
   4. The EXPLAIN ANALYZE statistics of the parallel BHS do not include the
   statistics of the parallel workers. For example, the number of heap pages
   processed reflects only what the leader did. Similarly to other parallel plan
   nodes we should aggregate statistics across workers.

The EXPLAIN ANALYZE output below shows (1) to (3) happening in action for
different numbers of workers. I had to obfuscate the query slightly. The
difference between the startup time of the BHS and the BIS is the time it
takes to sort the TID list. The self time of the BHS is just the time spent
on processing the shared TID list and processing the pages. That part runs
in parallel but does not scale.

Workers | Total runtime | Startup time BIS | Startup time BHS | Self time BHS (excl. sorting)
--------|---------------|------------------|------------------|------------------------------
2       | 15322 ms      | 3107 ms          | 5912 ms          | 9269 ms
4       | 13277 ms      | 3094 ms          | 5869 ms          | 7260 ms
8       | 14628 ms      | 3106 ms          | 5882 ms          | 8598 ms
None of this is really new and some of it is even documented. So, what I am
more wondering about is why things are the way they are and how hard it
would be to change them. I am especially curious about:

   - What stops us from extending the BIS to run in parallel? Parallel
   Bitmap Index Scans are also supported.
   - What about reducing the sort time by, e.g.
  - dividing TIDs across workers, ending up with N independently sorted
  streams,
  - cooperatively sorting the TIDs with multiple workers using barriers
  for synchronization,
  - optimizing the PagetableEntry data structure for size and using a
  faster sorting algorithm like e.g. radix sort
  - a combination of the first three options
   - With separate TID lists per worker process the iteration problem would
   be solved. Otherwise, we could
  - optimize the iteration code and thereby minimize the duration of
  the critical section,
  - have worker processes acquire chunks of TIDs / page bits to reduce
  locking.

Is there interest in patches improving on the above mentioned shortcomings?
If so, which options do you deem best?
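
For anyone who wants to experiment with this plan shape on their own data,
roughly the following coaxes the planner into a parallel Bitmap Heap Scan
(a sketch; the table, column and filter below are placeholders and assume an
index on that column):

SET enable_seqscan = off;
SET enable_indexscan = off;
SET enable_indexonlyscan = off;
SET max_parallel_workers_per_gather = 4;
EXPLAIN (ANALYZE, VERBOSE)
SELECT count(*) FROM foo WHERE col BETWEEN 1 AND 1000000;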

--
David Geier
(ServiceNow)



-- 2 workers

 Finalize Aggregate (actual time=15228.937..15321.356 rows=1 loops=1)
   Output: count(*)
   ->  Gather (actual time=15187.942..15321.345 rows=2 loops=1)
 Output: (PARTIAL count(*))
 Workers Planned: 2
 Workers Launched: 2
 ->  Partial Aggregate (actual time=15181.486..15181.488 rows=1
loops=2)
   Output: PARTIAL count(*)
   Worker 0:  actual time=15181.364..15181.366 rows=1 loops=1
   Worker 1:  actual time=15181.608..15181.610 rows=1 loops=1
   ->  Parallel Bitmap Heap Scan on foo (actual
time=5912.731..15166.992 rows=269713 loops=2)
 Filter: ...
 Rows Removed by Filter: 4020149
 Worker 0:  actual time=5912.498..15166.936 rows=269305
loops=1
 Worker 1:  actual time=5912.963..15167.048 rows=270121
loops=1
 ->  Bitmap Index Scan on foo_idx (actual
time=3107.947..3107.948 rows=8579724 loops=1)
   Index Cond: -
   Worker 1:  actual time=3107.947..3107.948
rows=8579724 loops=1
 Planning Time: 0.167 ms
 Execution Time: 15322.081 ms


-- 4 workers

 Finalize Aggregate (actual time=13175.765..13276.415 rows=1 loops=1)
   Output: count(*)
   ->  Gather (actual time=13137.981..13276.403 rows=4 loops=1)
 Output: (PARTIAL count(*))
 Workers Planned: 4
 Workers Launched: 4
 ->  Partial Aggregate (a

Re: Lazy JIT IR code generation to increase JIT speed with partitions

2022-06-29 Thread David Geier
Hi Alvaro,

That's a very interesting case and might indeed be fixed or at least
improved by this patch. I tried to reproduce this, but at least when
running a simple, serial query with increasing numbers of functions, the
time spent per function is linear or even slightly sub-linear (same as Tom
observed in [1]).

I also couldn't reproduce the JIT runtimes you shared, when running the
attached catalog query. The catalog query ran serially for me with the
following JIT stats:

 JIT:
   Functions: 169
   Options: Inlining true, Optimization true, Expressions true, Deforming
true
   Timing: Generation 12.223 ms, Inlining 17.323 ms, Optimization 388.491
ms, Emission 283.464 ms, Total 701.501 ms

Is it possible that the query ran in parallel for you? For parallel
queries, every worker JITs all of the functions it uses. Even though the
workers might JIT the functions in parallel, the time reported in the
EXPLAIN ANALYZE output is the sum of the time spent by all workers. With
this patch applied, the JIT time drops significantly, as many of the
generated functions remain unused.

 JIT:
   Modules: 15
   Functions: 26
   Options: Inlining true, Optimization true, Expressions true, Deforming
true
   Timing: Generation 1.931 ms, Inlining 0.722 ms, Optimization 67.195 ms,
Emission 70.347 ms, Total 140.195 ms

Of course, this does not prove that the nonlinearity that you observed went
away. Could you share with me how you ran the query so that I can reproduce
your numbers on master to then compare them with the patched version? Also,
which LLVM version did you run with? I'm currently running with LLVM 13.
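
In case it helps with reproducing: serial execution with jitting forced can
also be arranged with standard GUCs alone (a sketch; the query itself is the
one from the linked gist):

SET max_parallel_workers_per_gather = 0;
SET jit = on;
SET jit_above_cost = 0;
SET jit_inline_above_cost = 0;
SET jit_optimize_above_cost = 0;
EXPLAIN (ANALYZE) ... ;  -- the catalog query goes here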

Thanks!

--
David Geier
(ServiceNow)

On Mon, Jun 27, 2022 at 5:37 PM Alvaro Herrera 
wrote:

> On 2021-Jan-18, Luc Vlaming wrote:
>
> > I would like this topic to somehow progress and was wondering what other
> > benchmarks / tests would be needed to have some progress? I've so far
> > provided benchmarks for small(ish) queries and some tpch numbers,
> assuming
> > those would be enough.
>
> Hi, some time ago I reported a case[1] where our JIT implementation does
> a very poor job and perhaps the changes that you're making could explain
> what is going on, and maybe even fix it:
>
> [1] https://postgr.es/m/20241706.wqq7xoyigwa2@alvherre.pgsql
>
> The query for which I investigated the problem involved some pg_logical
> metadata tables, so I didn't post it anywhere public; but the blog post
> I found later contains a link to a query that shows the same symptoms,
> and which is luckily still available online:
> https://gist.github.com/saicitus/251ba20b211e9e73285af35e61b19580
> I attach it here in case it goes missing sometime.
>
> --
> Álvaro Herrera   48°01'N 7°57'E  —
> https://www.EnterpriseDB.com/
>


Re: Lazy JIT IR code generation to increase JIT speed with partitions

2022-06-27 Thread David Geier
les are used.
However, curiously the time spent on optimizing is also reduced (95ms
instead of 164ms). Could this be because some of the applied optimizations
are ending up in the cached module? With more modules the total time spent
on jitting seems to increase slightly (175ms vs 195ms). Almost all of that
time is spent on optimization and emission. Time for inlining only
increases from ~0.2ms to ~1.1ms. With improved optimization settings the
final time spent is pretty much on par with the single module variant
(175ms vs 181ms). So overall it looks like using many modules would this
way have competitive performance for the worst-case (all functions used),
but much better performance for cases where a significant amount of the
functions remain unused.

@Andres: could you provide me with the queries that caused the assertion
failure in LLVM? Have you ever observed a segfault with a
non-assert-enabled build? I just want to make sure this is truly fixed in
LLVM 13. Running 'make check-world' all tests passed.

The attached patch currently does not yet have a case distinction for LLVM
13. But that would be straightforward to add.

Thanks for your consideration.

--
David Geier
(ServiceNow)

[1] https://lists.llvm.org/pipermail/llvm-dev/2018-March/122111.html
[2] https://reviews.llvm.org/D96531
[3] https://reviews.llvm.org/rG22a52dfddcefad4f275eb8ad1cc0e200074c2d8a

On Fri, Jun 24, 2022 at 2:13 PM Luc Vlaming  wrote:

> On 18-01-2021 08:47, Luc Vlaming wrote:
> > Hi everyone, Andres,
> >
> > On 03-01-2021 11:05, Luc Vlaming wrote:
> >> On 30-12-2020 14:23, Luc Vlaming wrote:
> >>> On 30-12-2020 02:57, Andres Freund wrote:
> >>>> Hi,
> >>>>
> >>>> Great to see work in this area!
> >
> > I would like this topic to somehow progress and was wondering what other
> > benchmarks / tests would be needed to have some progress? I've so far
> > provided benchmarks for small(ish) queries and some tpch numbers,
> > assuming those would be enough.
> >
> >>>>
> >>>> On 2020-12-28 09:44:26 +0100, Luc Vlaming wrote:
> >>>>> I would like to propose a small patch to the JIT machinery which
> >>>>> makes the
> >>>>> IR code generation lazy. The reason for postponing the generation
> >>>>> of the IR
> >>>>> code is that with partitions we get an explosion in the number of JIT
> >>>>> functions generated as many child tables are involved, each with
> >>>>> their own
> >>>>> JITted functions, especially when e.g. partition-aware
> >>>>> joins/aggregates are
> >>>>> enabled. However, only a fraction of those functions is actually
> >>>>> executed
> >>>>> because the Parallel Append node distributes the workers among the
> >>>>> nodes.
> >>>>> With the attached patch we get a lazy generation which makes that
> >>>>> this is no
> >>>>> longer a problem.
> >>>>
> >>>> I unfortunately don't think this is quite good enough, because it'll
> >>>> lead to emitting all functions separately, which can also lead to very
> >>>> substantial increases of the required time (as emitting code is an
> >>>> expensive step). Obviously that is only relevant in the cases where
> the
> >>>> generated functions actually end up being used - which isn't the
> >>>> case in
> >>>> your example.
> >>>>
> >>>> If you e.g. look at a query like
> >>>>SELECT blub, count(*),sum(zap) FROM foo WHERE blarg = 3 GROUP BY
> >>>> blub;
> >>>> on a table without indexes, you would end up with functions for
> >>>>
> >>>> - WHERE clause (including deforming)
> >>>> - projection (including deforming)
> >>>> - grouping key
> >>>> - aggregate transition
> >>>> - aggregate result projection
> >>>>
> >>>> with your patch each of these would be emitted separately, instead of
> >>>> one go. Which IIRC increases the required time by a significant
> amount,
> >>>> especially if inlining is done (where each separate code generation
> >>>> ends
> >>>> up with copies of the inlined code).
> >>>>
> >>>>
> >>>> As far as I can see you've basically falsified the second part of this
> >>>> comment (which you moved):
> >>>>
> >>>>> +
> >>>>> +/*

Re: Assertion failure with barriers in parallel hash join

2022-06-06 Thread David Geier
Hi Thomas,

Correct. We're running with disabled parallel leader participation and we
have to do so, because another custom plan node we built relies on that.
That would be great. Anything else I can help with to get this patch in?

Thanks!

--
David
(ServiceNow)

On Fri, 3 Jun 2022 at 00:06, Thomas Munro  wrote:

> On Thu, Jun 2, 2022 at 9:31 PM David Geier  wrote:
> > We recently encountered the same bug in the field. Oleksii Kozlov
> managed to come up with reproduction steps, which reliably trigger it.
> Interestingly, the bug does not only manifest as failing assertion, but
> also as segmentation fault; in builds with disabled and with enabled (!)
> assertions. So it can crash production environments. We applied the
> proposed patch v3 from Melanie to the REL_14_3 branch and can confirm that
> with the patch neither the assertion nor the segmentation fault still occur.
>
> Thanks for the report, testing, and for creating the CF entry.
>
> I assume you are using parallel_leader_participation=off, and the
> reason we haven't heard more about this is because few people do that.
> By coincidence I was just about to restart a bunch of hash join
> projects and have been paging the topic area back into my brain, so
> I'll start with another round of testing/analysis of this bug/patch
> next week.
>


Re: Assertion failure with barriers in parallel hash join

2022-06-02 Thread David Geier
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

Hi all,

We recently encountered the same bug in the field. Oleksii Kozlov managed to
come up with reproduction steps which reliably trigger it. Interestingly, the
bug manifests not only as a failing assertion, but also as a segmentation
fault; in builds with assertions disabled as well as enabled (!). So it can
crash production environments. We applied the proposed patch v3 from Melanie
to the REL_14_3 branch and can confirm that with the patch neither the
assertion failure nor the segmentation fault occurs anymore.

I have also glanced at the code and the implementation looks fine. However, I'm
not an expert on the fairly involved hash join state machine.
There seems to be no need for additional documentation.

For completeness, here is the stack trace of the segmentation fault.
Like the stack trace of the assertion failure initially shared by Michael and
also encountered by us, it contains ExecParallelHashJoinNewBatch().

#9  | Source "/opt/src/backend/executor/execMain.c", line 361, in standard_ExecutorRun
    | Source "/opt/src/backend/executor/execMain.c", line 1551, in ExecutePlan
      Source "/opt/src/include/executor/executor.h", line 257, in ExecProcNode [0x657e4d]
#8  | Source "/opt/src/backend/executor/nodeAgg.c", line 2179, in ExecAgg
      Source "/opt/src/backend/executor/nodeAgg.c", line 2364, in agg_retrieve_direct [0x66ba60]
#7  | Source "/opt/src/backend/executor/nodeAgg.c", line 581, in fetch_input_tuple
      Source "/opt/src/include/executor/executor.h", line 257, in ExecProcNode [0x66d585]
#6  | Source "/opt/src/backend/executor/nodeHashjoin.c", line 607, in ExecParallelHashJoin
    | Source "/opt/src/backend/executor/nodeHashjoin.c", line 560, in ExecHashJoinImpl
      Source "/opt/src/backend/executor/nodeHashjoin.c", line 1132, in ExecParallelHashJoinNewBatch [0x67a89b]
#5  | Source "/opt/src/backend/storage/ipc/barrier.c", line 242, in BarrierAttach
      Source "/opt/src/include/storage/s_lock.h", line 228, in tas [0x7c2a1b]
#4    Object "/lib/x86_64-linux-gnu/libpthread.so.0", at 0x7f4db364841f, in __funlockfile

--
David Geier
(ServiceNow)

The new status of this patch is: Ready for Committer


Re: WIP Patch: Precalculate stable functions, infrastructure v1

2022-05-23 Thread David Geier
Hi hackers!

I would like to revive this thread. At ServiceNow we regularly encounter
queries that are much slower than they would have to be, because of
frequent calls to uncached stable functions with constant arguments (mostly
to_date()). We've seen queries that get more than 8x faster just by
temporarily changing to_date() from stable to immutable.
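
To illustrate the pattern, here is a minimal sketch (table and column names
are made up, but the shape matches what we see in the field):

  -- to_date() is stable, so the planner cannot constant-fold it and, e.g.
  -- in a seq scan filter, the call is re-evaluated for every row:
  SELECT count(*)
  FROM orders
  WHERE order_date >= to_date('20220101', 'YYYYMMDD');

  -- Today's manual workaround is to force a single evaluation, e.g. via an
  -- uncorrelated scalar sub-select that runs once as an InitPlan:
  SELECT count(*)
  FROM orders
  WHERE order_date >= (SELECT to_date('20220101', 'YYYYMMDD'));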

I would be glad to help bring this effort forward. Was there more work left
on the patch than rebasing it onto latest master?
@Marina: do you have any plans to continue with this?

For reference here are all existing mailing list discussions I could find
on this topic:

- [WIP] Caching constant stable expressions per execution (Marti, 2011),
https://www.postgresql.org/message-id/flat/CABRT9RC-1wGxZC_Z5mwkdk70fgY2DRX3sLXzdP4voBKuKPZDow%40mail.gmail.com
- Caching for stable expressions with constant arguments v6 (Marti, 2012),
https://www.postgresql.org/message-id/flat/CABRT9RA-RomVS-yzQ2wUtZ=m-ev61lcbrl1p1j3jydpsttf...@mail.gmail.com
- WIP Patch: Precalculate stable functions (Marina, 2017),
https://www.postgresql.org/message-id/flat/ba261b9fc25dea4069d8ba9a8fcad...@postgrespro.ru
- WIP Patch: Precalculate stable functions, infrastructure v1 (Marina,
2017),
https://www.postgresql.org/message-id/flat/da87bb6a014e029176a04f6e50033cfb%40postgrespro.ru

--
David Geier
(ServiceNow)

On Mon, 23 May 2022 at 17:06, Andres Freund  wrote:

> On 2018-11-29 18:00:15 +0100, Dmitry Dolgov wrote:
> > > On Tue, Oct 2, 2018 at 4:22 AM Michael Paquier 
> wrote:
> > >
> > > On Thu, May 24, 2018 at 04:00:33PM +0300, Marina Polyakova wrote:
> > > > Here there's a 9-th version of the patches for the precalculation of
> stable
> > > > or immutable functions, stable or immutable operators and other
> nonvolatile
> > > > expressions. This is a try to execute cached expressions as
> PARAM_EXEC,
> > > > thanks to the comments of Tom Lane and Andres Freund [1].
> > >
> > > Please note that v9-0004 fails to apply, so a rebase is needed.  This
> > > patch is moved to next CF, waiting on author.
> >
> > Unfortunately, patch still has some conflicts, could you please post an
> updated
> > version?
>
> As nothing has happened since, I'm marking this as returned with
> feedback.
>
> Greetings,
>
> Andres Freund
>
>
>
>


Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault

2021-01-18 Thread David Geier

Hi,

On 18.01.21 23:42, Tom Lane wrote:

David Geier  writes:

On 18.01.21 19:46, Tom Lane wrote:

Hm.  I agree that we shouldn't simply assume that ss_currentRelation
isn't null.  However, we cannot make search_plan_tree() descend
through non-leaf CustomScan nodes, because we don't know what processing
is involved there.  We need to find a scan that is guaranteed to return
rows that are one-to-one with the cursor output.  This is why the function
doesn't descend through join or aggregation nodes, and I see no argument
by which we should assume we know more about what a customscan node will
do than we know about those.

That makes sense. Thanks for the explanation.

OK, cool.  I was afraid you'd argue that you really needed your CustomScan
node to be transparent in such cases.  We could imagine inventing an
additional custom-scan-provider callback to embed the necessary knowledge,
but I'd rather not add the complexity until someone has a use-case.


I was thinking about that. Generally, having such a possibility would be
very useful. Unfortunately, I believe that in my specific case even having
such functionality wouldn't allow the query to work with CURRENT OF, because
my CSP behaves similarly to a Materialize node.


My understanding is that only those plans are supported whose nodes guarantee
that the tuple returned by the plan is the unmodified tuple the scan leaf node
is currently positioned on?


Still, if there's interest I would be happy to draft a patch. Instead of 
a separate CSP callback, we could also provide an additional flag like 
CUSTOMPATH_SUPPORT_CURRENT_OF. The advantage of the callback would be 
that we could delay the decision until execution time where potentially 
more information is available.

I updated the patch to match your proposal.

WFM, will push in a bit.

regards, tom lane

Best regards,
David
Swarm64




Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault

2021-01-18 Thread David Geier

Hi,

On 18.01.21 19:46, Tom Lane wrote:

David Geier  writes:

search_plan_tree() assumes that
CustomScanState::ScanState::ss_currentRelation is never NULL. In my
understanding that only holds for CustomScanState nodes which are at the
bottom of the plan and actually read from a relation. CustomScanState
nodes which are not at the bottom don't have ss_currentRelation set. I
believe for such nodes, instead search_plan_tree() should recurse into
CustomScanState::custom_ps.

Hm.  I agree that we shouldn't simply assume that ss_currentRelation
isn't null.  However, we cannot make search_plan_tree() descend
through non-leaf CustomScan nodes, because we don't know what processing
is involved there.  We need to find a scan that is guaranteed to return
rows that are one-to-one with the cursor output.  This is why the function
doesn't descend through join or aggregation nodes, and I see no argument
by which we should assume we know more about what a customscan node will
do than we know about those.

That makes sense. Thanks for the explanation.


So I'm inclined to think a suitable fix is just

-   if (RelationGetRelid(sstate->ss_currentRelation) == table_oid)
+   if (sstate->ss_currentRelation &&
+   RelationGetRelid(sstate->ss_currentRelation) == table_oid)
 result = sstate;

regards, tom lane



I updated the patch to match your proposal.

Best regards,
David
Swarm64
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index c7f909241b..fb3cf9537c 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -330,7 +330,8 @@ search_plan_tree(PlanState *node, Oid table_oid,
 			{
 				ScanState  *sstate = (ScanState *) node;
 
-				if (RelationGetRelid(sstate->ss_currentRelation) == table_oid)
+				if (sstate->ss_currentRelation &&
+					RelationGetRelid(sstate->ss_currentRelation) == table_oid)
 					result = sstate;
 				break;
 			}


search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault

2021-01-18 Thread David Geier

Hi hackers,

While working with cursors that reference plans with CustomScanState
nodes, I encountered a segfault which originates from 
search_plan_tree(). The query plan is the result of a simple SELECT 
statement into which I inject a Custom Scan node at the root to do some 
post-processing before returning rows. This plan is referenced by a 
second plan with a Tid Scan which originates from a query of the form 
DELETE FROM foo WHERE CURRENT OF my_cursor;
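
For reference, a minimal sketch of the scenario (names are placeholders; the
Custom Scan node is injected at the plan root by our extension and not shown
here):

  BEGIN;
  DECLARE my_cursor CURSOR FOR SELECT * FROM foo;
  FETCH 1 FROM my_cursor;
  -- The DELETE below goes through execCurrent.c/search_plan_tree() and
  -- crashes because of the Custom Scan node at the root of the cursor plan:
  DELETE FROM foo WHERE CURRENT OF my_cursor;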


search_plan_tree() assumes that 
CustomScanState::ScanState::ss_currentRelation is never NULL. In my 
understanding that only holds for CustomScanState nodes which are at the 
bottom of the plan and actually read from a relation. CustomScanState 
nodes which are not at the bottom don't have ss_currentRelation set. I 
believe for such nodes, instead search_plan_tree() should recurse into 
CustomScanState::custom_ps.


I attached a patch. Any thoughts?

Best regards,
David
Swarm64

diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index f89319fcd8..0d5f09402b 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -326,7 +326,6 @@ search_plan_tree(PlanState *node, Oid table_oid,
 		case T_BitmapHeapScanState:
 		case T_TidScanState:
 		case T_ForeignScanState:
-		case T_CustomScanState:
 			{
 				ScanState  *sstate = (ScanState *) node;
 
@@ -335,6 +334,39 @@ search_plan_tree(PlanState *node, Oid table_oid,
 				break;
 			}
 
+
+			/*
+			 * Custom scan nodes can be leaf nodes or inner nodes and therefore need special treatment.
+			 */
+		case T_CustomScanState:
+			{
+				CustomScanState *css = castNode(CustomScanState, node);
+				ScanState  *sstate = (ScanState *) node;
+
+				if (sstate->ss_currentRelation == NULL)	/* inner node */
+				{
+					ListCell   *lc;
+
+					foreach(lc, css->custom_ps)
+					{
+						ScanState  *elem = search_plan_tree((PlanState *) lfirst(lc), table_oid, pending_rescan);
+
+						if (!elem)
+							continue;
+						if (result)
+							return NULL;	/* multiple matches */
+						result = elem;
+					}
+				}
+				else	/* leaf node */
+				{
+					if (RelationGetRelid(sstate->ss_currentRelation) == table_oid)
+						result = sstate;
+				}
+
+				break;
+			}
+
 			/*
 			 * For Append, we must look through the members; watch out for
 			 * multiple matches (possible if it was from UNION ALL)