Re: Table AM Interface Enhancements

2024-05-24 Thread Mats Kindahl
On Thu, Nov 23, 2023 at 1:43 PM Alexander Korotkov 
wrote:

> Hello PostgreSQL Hackers,
>
> I am pleased to submit a series of patches related to the Table Access
> Method (AM) interface, which I initially announced during my talk at
> PGCon 2023 [1]. These patches are primarily designed to support the
> OrioleDB engine, but I believe they could be beneficial for other
> table AM implementations as well.
>
> The focus of these patches is to introduce more flexibility and
> capabilities into the Table AM interface. This is particularly
> relevant for advanced use cases like index-organized tables,
> alternative MVCC implementations, etc.
>

Hi Alexander, and great to see some action around the table access method
interface.

Sorry for being late to the game, but I am wondering a few things about the
patches. I'll start with the first one that caught my eye.

0007-Allow-table-AM-tuple_insert-method-to-return-the--v1.patch
>
> This allows table AM to return a native tuple slot, which is aware of
> table AM-specific system attributes.
>

This patch seems straightforward enough, but from reading the surrounding
code and trying to understand the context, I am wondering a few things.
Reading the thread, I am unsure if this will go in or not, but I just wanted
to point out a concern I had. My apologies if I am raising an issue that has
already been resolved.

AFAICT, the general contract for working with table tuple slots is to create
a slot for a particular purpose, fill it in, and then pass around a pointer
to it. Since the slot is created using a "source" implementation, the
"source" is responsible for memory allocation as well as other updates to
the state. Please correct me if I have misunderstood how this is intended to
work, but this seems like a good API since it avoids unnecessary allocation
and, in particular, unbounded creation of new slots affecting memory usage
while a query is executing. For a plan you want to execute, you just make
sure that you have slots of the right kind in each plan node and there is no
need to dynamically allocate more slots. If you want one for the table
access method, you just fetch the slot callbacks from the table access
method and use those correctly. As a result, the number of slots used during
execution is bounded.
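
To illustrate the contract as I understand it, here is a minimal sketch
using the existing executor API (the relation, command id, and the single
int4 column are placeholder assumptions):

#include "postgres.h"

#include "access/tableam.h"
#include "executor/tuptable.h"
#include "utils/rel.h"

/* Sketch: insert a few rows into "rel", assumed to have one int4 column. */
static void
insert_example_rows(Relation rel, CommandId cid)
{
    /* The "source" (here, the table AM) decides the slot implementation. */
    TupleTableSlot *slot =
        MakeSingleTupleTableSlot(RelationGetDescr(rel),
                                 table_slot_callbacks(rel));

    for (int i = 0; i < 10; i++)
    {
        ExecClearTuple(slot);   /* reset state between tuples, no free */
        slot->tts_values[0] = Int32GetDatum(i);
        slot->tts_isnull[0] = false;
        ExecStoreVirtualTuple(slot);
        table_tuple_insert(rel, slot, cid, 0, NULL);
    }

    ExecDropSingleTupleTableSlot(slot); /* released once, by its creator */
}

The same slot is reused for every tuple, so the number of slots stays
bounded no matter how many rows are inserted.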

Assuming that I've understood this correctly: if a TTS is created and passed
to tuple_insert, and tuple_insert needs to return a different slot, this
raises two questions:

   - As Andres pointed out: who is responsible for taking care of and
   dealing with the cleanup of the returned slot here? Note that this is not
   just a matter of releasing memory; there is other state in the slot,
   created by the TAM, that might need to be dealt with. For this, some sort
   of callback is needed, and the tuple_insert implementation needs to call
   it correctly.
   - The dual is the cleanup of the "original" slot passed in: a slot of a
   particular kind is passed in, and you need to deal with it correctly to
   release the resources allocated by the original slot, again using some
   sort of callback.

For both these cases, the question is what cleanup function to call.

In most cases, the slot comes from a subplan and is not dynamically
allocated, i.e., you cannot just use release() since the slot is reused
later. For example, ExecScanFetch returns the slot ss_ScanTupleSlot, which
is then used with tuple_insert (unless I've misread the code), and that slot
is typically cleared, not released.

If clear() is used instead, and you clear this slot as part of inserting a
tuple, you can inadvertently clear an intermediate result
(ss_ScanTupleSlot, in the example above), which can cause strange issues if
this result is needed later.

So, given that the dynamic allocation of new slots is unbounded within a
query, and that it is complicated to make sure that slots are
cleared/reset/released correctly depending on context, this seems hard to
get right without risking the introduction of bugs. IMHO, it would be
preferable to keep a very simple contract where you init, set, clear, and
release the slot, which is what the PostgreSQL code mostly has now, to avoid
bugs creeping into the code.

So, the question here is why changing the slot implementation is needed. I
do not know the details of OrioleDB, but this slot is immediately used with
ExecInsertIndexTuples() after the call in nodeModifyTable. If the need is to
pass information from the TAM to the IAM, then it might be better to store
this information in the execution state. If there is a case where the
correct slot is not created, then fixing that location might be better.
(I've noticed that the copyFrom code has a somewhat naïve assumption about
what slot implementation should be used, but that is a separate discussion.)

Best wishes,
Mats Kindahl


Re: Hooking into ExplainOneQuery() complicated by missing standard_ExplainOneQuery

2024-03-11 Thread Mats Kindahl
On Mon, Mar 11, 2024 at 12:42 AM Michael Paquier 
wrote:

> On Thu, Mar 07, 2024 at 07:45:01AM +0900, Michael Paquier wrote:
> > That's nice.  You would be able to shave quite a bit of code.  If
> > there are no objections, I propose to apply the change of this thread
> > to have this standard explain wrapper at the beginning of next week.
> > If others have any comments, feel free.
>
> Well, done as of a04ddd077e61.
>

Thanks Michael!
-- 
Best wishes,
Mats Kindahl, Timescale


Re: Hooking into ExplainOneQuery() complicated by missing standard_ExplainOneQuery

2024-03-05 Thread Mats Kindahl
On Wed, Mar 6, 2024 at 3:27 AM Andrei Lepikhov 
wrote:

> On 6/3/2024 06:25, Michael Paquier wrote:
> >> Just to elaborate: the intention was to allow a section to be added to
> >> every node in the plan containing information from further down and also
> >> to allow this information to propagate upwards. We happen to have buffer
> >> information right now, but this would allow something similar to be added
> >> dynamically by extending ExplainNode and passing down a callback to
> >> standard_ExplainOneQuery.
> >
> > Or an extra hook at the end of ExplainNode() to be able to append more
> > information at node level?  Not sure if others would agree with that,
> > though.
>

That is what I had in mind, yes.


> We already discussed EXPLAIN hooks, at least in [1]. IMO, extensions
> should have a chance to add something to the node explain and the
> summary, if only because they can significantly influence the planner
> and executor's behaviour.
>
> [1]
>
> https://www.postgresql.org/message-id/flat/6cd5caa7-06e1-4460-bf35-00a59da3f677%40garret.ru


This is an excellent example of where such a hook would be useful.
-- 
Best wishes,
Mats Kindahl, Timescale


Re: Hooking into ExplainOneQuery() complicated by missing standard_ExplainOneQuery

2024-03-04 Thread Mats Kindahl
On Tue, Mar 5, 2024 at 7:31 AM Michael Paquier  wrote:

> On Mon, Mar 04, 2024 at 03:41:16PM +0300, Aleksander Alekseev wrote:
> >> I wanted to hook into the EXPLAIN output for queries and add some
> >> extra information, but since there is no standard_ExplainOneQuery() I
> >> had to copy the code and create my own version.
> >>
> >> Since the pattern with other hooks for a function
> >> WhateverFunction() seems to be that there is a
> >> standard_WhateverFunction() for each WhateverFunction_hook, I
> >> created a patch to follow this pattern for your consideration.
>
> So you've wanted to be able to add some custom information at the end
> or the beginning of ExplainState's output buffer, before falling back
> to the in-core path.  What was the use case, if I may ask?
>

Yes, that was the use-case. We have some caches added by extending
ExecutorStart and other executor-related functions using the hooks there
and want to show cache hits and misses in the plan.

I realize that a more advanced system could be created where you can
customize the output even more, but in this case I just wanted to add a
section with some additional information related to plan execution. Also,
the code in explain.c does not seem to be written with extensibility in
mind, so I did not want to make too big a change here before thinking
through how this would work.


> >> I was also considering adding a callback so that you can annotate
> >> any node with explanatory information that is not a custom scan
> >> node. This could be used to propagate and summarize information
> >> from custom scan nodes, but I had no immediate use for that so did
> >> not add it here. I would still be interested in hearing if you
> >> think this is something that would be useful to the community.
>
> That depends.
>

Just to elaborate: the intention was to allow a section to be added to
every node in the plan containing information from further down and also to
allow this information to propagate upwards. We happen to have buffer
information right now, but this would allow something similar to be added
dynamically by extending ExplainNode and passing down a callback to
standard_ExplainOneQuery.

Best wishes,
Mats Kindahl


Hooking into ExplainOneQuery() complicated by missing standard_ExplainOneQuery

2024-03-04 Thread Mats Kindahl
Hi hackers,

I wanted to hook into the EXPLAIN output for queries and add some extra
information, but since there is no standard_ExplainOneQuery() I had to copy
the code and create my own version.

Since the pattern with other hooks for a function WhateverFunction() seems
to be that there is a standard_WhateverFunction() for each
WhateverFunction_hook, I created a patch to follow this pattern for your
consideration.
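
For concreteness, here is a sketch of how an extension could use the hook
once standard_ExplainOneQuery() exists (the "Cache Hits" counter is a
made-up example, assumed to be maintained elsewhere in the extension):

#include "postgres.h"

#include "commands/explain.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

/* Hypothetical counter maintained by other executor hooks. */
static int64 cache_hits = 0;

static ExplainOneQuery_hook_type prev_ExplainOneQuery_hook = NULL;

static void
my_ExplainOneQuery(Query *query, int cursorOptions, IntoClause *into,
                   ExplainState *es, const char *queryString,
                   ParamListInfo params, QueryEnvironment *queryEnv)
{
    /* Run the previous hook, if any, or else the standard implementation. */
    if (prev_ExplainOneQuery_hook)
        prev_ExplainOneQuery_hook(query, cursorOptions, into, es,
                                  queryString, params, queryEnv);
    else
        standard_ExplainOneQuery(query, cursorOptions, into, es,
                                 queryString, params, queryEnv);

    /* Append an extra section to the EXPLAIN output. */
    ExplainPropertyInteger("Cache Hits", NULL, cache_hits, es);
}

void
_PG_init(void)
{
    prev_ExplainOneQuery_hook = ExplainOneQuery_hook;
    ExplainOneQuery_hook = my_ExplainOneQuery;
}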

I was also considering adding a callback so that you can annotate any node
with explanatory information that is not a custom scan node. This could be
used to propagate and summarize information from custom scan nodes, but I
had no immediate use for that so did not add it here. I would still be
interested in hearing if you think this is something that would be useful
to the community.

Best wishes,
Mats Kindahl, Timescale
From eaab4d7c088ff3ee9b0e6ec3de96677bafd184c0 Mon Sep 17 00:00:00 2001
From: Mats Kindahl 
Date: Mon, 4 Mar 2024 12:38:05 +0100
Subject: Improve support for ExplainOneQuery hook

There is a hook ExplainOneQuery_hook, but there is no corresponding
standard_ExplainOneQuery and the code that belongs there is written
in-line in ExplainOneQuery(). As a result, anybody adding a hook for
ExplainOneQuery_hook has to copy the code from ExplainOneQuery() to run
the standard ExplainOneQuery before adding their own messages.

This commit fixes this by refactoring ExplainOneQuery(), factoring out
the standard code into standard_ExplainOneQuery(), and calling it from
ExplainOneQuery() in a similar manner to how it is done for other
hooks.
---
 src/backend/commands/explain.c | 106 ++---
 src/include/commands/explain.h |   4 ++
 2 files changed, 61 insertions(+), 49 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 78754bc6ba..3b7bed3ca2 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -391,63 +391,71 @@ ExplainOneQuery(Query *query, int cursorOptions,
 
 	/* if an advisor plugin is present, let it manage things */
 	if (ExplainOneQuery_hook)
-		(*ExplainOneQuery_hook) (query, cursorOptions, into, es,
- queryString, params, queryEnv);
+		(*ExplainOneQuery_hook)(query, cursorOptions, into, es,
+			 queryString, params, queryEnv);
 	else
-	{
-		PlannedStmt *plan;
-		instr_time	planstart,
-	planduration;
-		BufferUsage bufusage_start,
-	bufusage;
-		MemoryContextCounters mem_counters;
-		MemoryContext planner_ctx = NULL;
-		MemoryContext saved_ctx = NULL;
-
-		if (es->memory)
-		{
-			/*
-			 * Create a new memory context to measure planner's memory
-			 * consumption accurately.  Note that if the planner were to be
-			 * modified to use a different memory context type, here we would
-			 * be changing that to AllocSet, which might be undesirable.
-			 * However, we don't have a way to create a context of the same
-			 * type as another, so we pray and hope that this is OK.
-			 */
-			planner_ctx = AllocSetContextCreate(CurrentMemoryContext,
-"explain analyze planner context",
-ALLOCSET_DEFAULT_SIZES);
-			saved_ctx = MemoryContextSwitchTo(planner_ctx);
-		}
+		standard_ExplainOneQuery(query, cursorOptions, into, es,
+ queryString, params, queryEnv);
+}
 
-		if (es->buffers)
-			bufusage_start = pgBufferUsage;
-		INSTR_TIME_SET_CURRENT(planstart);
+void
+standard_ExplainOneQuery(Query *query, int cursorOptions,
+		 IntoClause *into, ExplainState *es,
+		 const char *queryString, ParamListInfo params,
+		 QueryEnvironment *queryEnv)
+{
+	PlannedStmt *plan;
+	instr_time	planstart,
+planduration;
+	BufferUsage bufusage_start,
+bufusage;
+	MemoryContextCounters mem_counters;
+	MemoryContext planner_ctx = NULL;
+	MemoryContext saved_ctx = NULL;
+
+	if (es->memory)
+	{
+		/*
+		 * Create a new memory context to measure planner's memory consumption
+		 * accurately.  Note that if the planner were to be modified to use a
+		 * different memory context type, here we would be changing that to
+		 * AllocSet, which might be undesirable.  However, we don't have a way
+		 * to create a context of the same type as another, so we pray and
+		 * hope that this is OK.
+		 */
+		planner_ctx = AllocSetContextCreate(CurrentMemoryContext,
+			"explain analyze planner context",
+			ALLOCSET_DEFAULT_SIZES);
+		saved_ctx = MemoryContextSwitchTo(planner_ctx);
+	}
 
-		/* plan the query */
-		plan = pg_plan_query(query, queryString, cursorOptions, params);
+	if (es->buffers)
+		bufusage_start = pgBufferUsage;
+	INSTR_TIME_SET_CURRENT(planstart);
 
-		INSTR_TIME_SET_CURRENT(planduration);
-		INSTR_TIME_SUBTRACT(planduration, planstart);
+	/* plan the query */
+	plan = pg_plan_query(query, queryString, cursorOptions, params);
 
-		if (es->memory)
-		{
-			MemoryContextSwitchTo(saved_ctx);
-			MemoryContextMemConsumed(planner_ctx, &mem_counters);
-		}
+	INSTR_TIME_SET_CURRENT(planduration);

Re: glibc qsort() vulnerability

2024-02-17 Thread Mats Kindahl
On Fri, Feb 16, 2024 at 9:09 PM Nathan Bossart 
wrote:

> On Fri, Feb 16, 2024 at 01:45:52PM +0100, Mats Kindahl wrote:
> > Looks good to me.
>
> Committed.
>

Thanks Nathan!


>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>


Re: glibc qsort() vulnerability

2024-02-16 Thread Mats Kindahl
On Fri, Feb 16, 2024 at 12:32 AM Nathan Bossart 
wrote:

> Here is what I have staged for commit.
>

Looks good to me.

Checked that all of the comparisons are in the expected order, except
inside compDESC, cmp_lsn, and resource_priority_cmp, where the order is
reversed.

Best wishes,
Mats Kindahl

>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>


Re: glibc qsort() vulnerability

2024-02-13 Thread Mats Kindahl
On Tue, Feb 13, 2024 at 12:41 AM Andres Freund  wrote:

> Hi,
>
> On 2024-02-12 17:04:23 -0600, Nathan Bossart wrote:
> > On Mon, Feb 12, 2024 at 01:31:30PM -0800, Andres Freund wrote:
> > > One thing that's worth checking is if this ends up with *worse* code
> when the
> > > comparators are inlined. I think none of the changed comparators will
> end up
> > > getting used with an inlined sort, but ...
> >
> > Yeah, AFAICT the only inlined sorts are in tuplesort.c and bufmgr.c, and
> > the patches don't touch those files.
> >
> > > The reason we could end up with worse code is that when inlining the
> > > comparisons would make less sense for the compiler. Consider e.g.
> > > return DO_COMPARE(a, b) < 0 ?
> > > (DO_COMPARE(b, c) < 0 ? b : (DO_COMPARE(a, c) < 0 ? c : a))
> > > : (DO_COMPARE(b, c) > 0 ? b : (DO_COMPARE(a, c) < 0 ? a :
> c));
> > >
> > > With a naive implementation the compiler will understand it only cares
> about
> > > a < b, not about the other possibilities. I'm not sure that's still
> true with
> > > the more complicated optimized version.
> >
> > You aren't kidding [0].  Besides perhaps adding a comment in
> > sort_template.h, is there anything else you think we should do about this
> > now?
>
> I'd add also a comment to the new functions. I think it's fine otherwise. I
> wish there were formulation that'd be optimal for both cases, but this way
> we
> can at least adapt all places at once if either find a better formulation
> or
> change all our sorts to happen via an inline implementation of qsort or
> such.
>

I suspect that this has to do with the fact that we "abuse" the type system
by performing arithmetic on booleans converted to integers, and the
compilers do not have rewrite rules for dealing with these.

For example, take the inline function "static inline int cmp(int a, int b)
{ return a < b ? -1 : a > b ? 1 : 0; }".

This can trivially be rewritten by the compiler using very basic rules, as
follows:

DO_COMPARE(a,b) < 0
  (a < b ? -1 : a > b ? 1 : 0) < 0
  a < b ? (-1 < 0) : a > b ? (1 < 0) : (0 < 0)
  a < b ? true : a > b ? false : false
  a < b ? true : false
  a < b
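
For reference, the two formulations side by side (a sketch; the function
names are mine):

/* Ternary form: when inlined, "cmp_ternary(a, b) < 0" reduces to "a < b". */
static inline int
cmp_ternary(int32 a, int32 b)
{
    return a < b ? -1 : a > b ? 1 : 0;
}

/* Branch-free form: compilers seem to lack rules to simplify this one. */
static inline int
cmp_branchfree(int32 a, int32 b)
{
    return (a > b) - (a < b);
}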

When it comes to (a > b) - (a < b), there are no such rewrite rules, since
it is not a very common pattern.

Clang fares better for this case and at least shows the two alternatives as
equal.

Maybe we should change to use the original version equivalent to the inline
function above since that works better with surrounding code?

Best wishes,
Mats Kindahl


>
> Greetings,
>
> Andres
>


Re: glibc qsort() vulnerability

2024-02-12 Thread Mats Kindahl
On Mon, Feb 12, 2024 at 4:57 PM Nathan Bossart 
wrote:

> On Sun, Feb 11, 2024 at 03:44:42PM +0100, Mats Kindahl wrote:
> > On Sat, Feb 10, 2024 at 9:53 PM Nathan Bossart  >
> > wrote:
> >> and I think we should expand on some of the commentary in int.h.
> >> For example, the comment at the top of int.h seems very tailored to the
> >> existing functions and should probably be adjusted.
> >
> >
> > I rewrote the beginning to the following, does that look good?
> >
> >  * int.h
> >  *  Routines to perform signed and unsigned integer arithmetic,
> including
> >  *  comparisons, in an overflow-safe way.
> >
> >
> >
> >> And the "comparison
> >> routines for integers" comment might benefit from some additional
> details
> >> about the purpose and guarantees of the new functions.
> >>
> >
> > I expanded that into the following. WDYT?
> >
> >
> /*
> >  * Comparison routines for integers.
> >  *
> >  * These routines are used to implement comparison functions for, e.g.,
> >  * qsort(). They are designed to be efficient and not risk overflows in
> >  * internal computations that could cause strange results, such as
> INT_MIN >
> >  * INT_MAX if you just return "lhs - rhs".
> >
> *----
>
> LGTM.  I might editorialize a bit before committing, but I think your
> proposed wording illustrates the thrust of the change.
>

Thanks Nathan,

Here are the two fixed patches.

Best wishes,
Mats Kindahl


>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>
From b68f08b6afd62b75755e88f11a3ca31cfdce2645 Mon Sep 17 00:00:00 2001
From: Mats Kindahl 
Date: Fri, 9 Feb 2024 21:48:27 +0100
Subject: Use integer comparison functions

Use overflow-safe integer functions when implementing
qsort() comparison functions.
---
 contrib/hstore/hstore_gist.c|  3 ++-
 contrib/intarray/_intbig_gist.c |  3 ++-
 contrib/pg_stat_statements/pg_stat_statements.c |  8 ++--
 contrib/pg_trgm/trgm_op.c   |  8 ++--
 src/backend/access/nbtree/nbtinsert.c   |  8 ++--
 src/backend/access/nbtree/nbtpage.c | 10 +++---
 src/backend/access/nbtree/nbtsplitloc.c |  8 ++--
 src/backend/access/spgist/spgdoinsert.c |  5 ++---
 src/backend/access/spgist/spgtextproc.c |  3 ++-
 src/backend/backup/basebackup_incremental.c |  8 ++--
 src/backend/backup/walsummary.c |  7 ++-
 src/backend/catalog/heap.c  |  8 ++--
 src/backend/nodes/list.c| 13 +++--
 src/backend/nodes/tidbitmap.c   |  7 ++-
 src/backend/parser/parse_agg.c  |  7 ---
 src/backend/postmaster/autovacuum.c |  6 ++
 src/backend/replication/logical/reorderbuffer.c |  7 ++-
 src/backend/replication/syncrep.c   |  8 ++--
 src/backend/utils/adt/oid.c |  7 ++-
 src/backend/utils/adt/tsgistidx.c   | 10 +++---
 src/backend/utils/adt/tsquery_gist.c|  6 ++
 src/backend/utils/adt/tsvector.c|  5 ++---
 src/backend/utils/adt/tsvector_op.c |  5 ++---
 src/backend/utils/adt/xid.c |  7 ++-
 src/backend/utils/cache/relcache.c  |  3 ++-
 src/backend/utils/cache/syscache.c  |  5 ++---
 src/backend/utils/cache/typcache.c  |  8 ++--
 src/backend/utils/resowner/resowner.c   | 10 ++
 src/bin/pg_dump/pg_dump_sort.c  |  7 ++-
 src/bin/pg_upgrade/function.c   | 13 +++--
 src/bin/pg_walsummary/pg_walsummary.c   |  8 ++--
 src/bin/psql/crosstabview.c |  3 ++-
 src/include/access/gin_private.h|  8 ++--
 33 files changed, 76 insertions(+), 156 deletions(-)

diff --git a/contrib/hstore/hstore_gist.c b/contrib/hstore/hstore_gist.c
index fe343739eb..1079f25bf7 100644
--- a/contrib/hstore/hstore_gist.c
+++ b/contrib/hstore/hstore_gist.c
@@ -7,6 +7,7 @@
 #include "access/reloptions.h"
 #include "access/stratnum.h"
 #include "catalog/pg_type.h"
+#include "common/int.h"
 #include "hstore.h"
 #include "utils/pg_crc.h"
 
@@ -356,7 +357,7 @@ typedef struct
 static int
 comparecost(const void *a, const void *b)
 {
-	return ((const SPLITCOST *) a)->cost - ((const SPLITCOST *) b)->cost;
+	return pg_cmp_s32(((const SPLITCOST *) a)->cost, ((const SPLITCOST *) b)->cost);
 }
 
 
diff --git a/contrib/intarray/_intbig_gist.c b

Re: glibc qsort() vulnerability

2024-02-11 Thread Mats Kindahl
On Sat, Feb 10, 2024 at 9:53 PM Nathan Bossart 
wrote:

> On Sat, Feb 10, 2024 at 08:59:06AM +0100, Mats Kindahl wrote:
> > Split the code into two patches: one that just adds the functions
> > (including the new pg_cmp_size()) to common/int.h and one that starts
> using
> > them. I picked the name "pg_cmp_size" rather than "pg_cmp_size_t" since
> > "_t" is usually used as a suffix for types.
> >
> > I added a comment to the (a > b) - (a < b) return and have also added
> casts
> > to (int32) for the int16 and uint16 functions (we need a signed int for
> > uint16 since we need to be able to get a negative number).
> >
> > Changed the type of two instances that had an implicit cast from size_t
> to
> > int and used the new pg_cmp_size() function.
> >
> > Also fixed the missed replacements in the "contrib" directory.
>
> Thanks for the new patches.  I think the comparison in resowner.c is
> backwards,


Thanks for catching that.



> and I think we should expand on some of the commentary in int.h.
> For example, the comment at the top of int.h seems very tailored to the
> existing functions and should probably be adjusted.


I rewrote the beginning to the following, does that look good?

 * int.h
 *  Routines to perform signed and unsigned integer arithmetic, including
 *  comparisons, in an overflow-safe way.



> And the "comparison
> routines for integers" comment might benefit from some additional details
> about the purpose and guarantees of the new functions.
>

I expanded that into the following. WDYT?

/*
 * Comparison routines for integers.
 *
 * These routines are used to implement comparison functions for, e.g.,
 * qsort(). They are designed to be efficient and not risk overflows in
 * internal computations that could cause strange results, such as INT_MIN >
 * INT_MAX if you just return "lhs - rhs".
 *


Best wishes,
Mats Kindahl


> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>


Re: glibc qsort() vulnerability

2024-02-09 Thread Mats Kindahl
On Fri, Feb 9, 2024 at 9:08 PM Nathan Bossart 
wrote:

> On Fri, Feb 09, 2024 at 08:43:21PM +0100, Mats Kindahl wrote:
> > QQ: right now it looks like this:
> >
> > static inline int
> > pg_cmp_u16(uint16 a, uint16 b)
> > {
> > 	return (int32) a - (int32) b;
> > }
> >
> > and
> >
> > static inline int
> > pg_cmp_u32(uint32 a, uint32 b)
> > {
> > 	return (a > b) - (a < b);
> > }
> >
> >
> > I think that is clear enough, but do you want more casts added for the
> > return value as well?
>
> I think that is reasonably clear.  The latter does require you to know that
> < and > return (int) 0 or (int) 1, which might be worth a short comment.
> But that's just nitpicking...
>
>
Hi all,

Split the code into two patches: one that just adds the functions
(including the new pg_cmp_size()) to common/int.h and one that starts using
them. I picked the name "pg_cmp_size" rather than "pg_cmp_size_t" since
"_t" is usually used as a suffix for types.

I added a comment to the (a > b) - (a < b) return and have also added casts
to (int32) for the int16 and uint16 functions (we need a signed int for
uint16 since we need to be able to get a negative number).

Changed the type of two instances that had an implicit cast from size_t to
int and used the new pg_cmp_size() function.

Also fixed the missed replacements in the "contrib" directory.

Best wishes,
Mats Kindahl


> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>
From 84ed91e96f950cdb92952c8248b2c754d3c4a034 Mon Sep 17 00:00:00 2001
From: Mats Kindahl 
Date: Fri, 9 Feb 2024 21:48:27 +0100
Subject: Use integer comparison functions

Use overflow-safe integer functions when implementing
qsort() comparison functions.
---
 contrib/hstore/hstore_gist.c|  3 ++-
 contrib/intarray/_intbig_gist.c |  3 ++-
 contrib/pg_stat_statements/pg_stat_statements.c |  8 ++--
 contrib/pg_trgm/trgm_op.c   |  8 ++--
 src/backend/access/nbtree/nbtinsert.c   |  8 ++--
 src/backend/access/nbtree/nbtpage.c | 10 +++---
 src/backend/access/nbtree/nbtsplitloc.c |  8 ++--
 src/backend/access/spgist/spgdoinsert.c |  5 ++---
 src/backend/access/spgist/spgtextproc.c |  3 ++-
 src/backend/backup/basebackup_incremental.c |  8 ++--
 src/backend/backup/walsummary.c |  7 ++-
 src/backend/catalog/heap.c  |  8 ++--
 src/backend/nodes/list.c| 13 +++--
 src/backend/nodes/tidbitmap.c   |  7 ++-
 src/backend/parser/parse_agg.c  |  7 ---
 src/backend/postmaster/autovacuum.c |  6 ++
 src/backend/replication/logical/reorderbuffer.c |  7 ++-
 src/backend/replication/syncrep.c   |  8 ++--
 src/backend/utils/adt/oid.c |  7 ++-
 src/backend/utils/adt/tsgistidx.c   | 10 +++---
 src/backend/utils/adt/tsquery_gist.c|  6 ++
 src/backend/utils/adt/tsvector.c|  5 ++---
 src/backend/utils/adt/tsvector_op.c |  5 ++---
 src/backend/utils/adt/xid.c |  7 ++-
 src/backend/utils/cache/relcache.c  |  3 ++-
 src/backend/utils/cache/syscache.c  |  5 ++---
 src/backend/utils/cache/typcache.c  |  8 ++--
 src/backend/utils/resowner/resowner.c   | 10 ++
 src/bin/pg_dump/pg_dump_sort.c  |  7 ++-
 src/bin/pg_upgrade/function.c   | 13 +++--
 src/bin/pg_walsummary/pg_walsummary.c   |  8 ++--
 src/bin/psql/crosstabview.c |  3 ++-
 src/include/access/gin_private.h|  8 ++--
 33 files changed, 76 insertions(+), 156 deletions(-)

diff --git a/contrib/hstore/hstore_gist.c b/contrib/hstore/hstore_gist.c
index fe343739eb..1079f25bf7 100644
--- a/contrib/hstore/hstore_gist.c
+++ b/contrib/hstore/hstore_gist.c
@@ -7,6 +7,7 @@
 #include "access/reloptions.h"
 #include "access/stratnum.h"
 #include "catalog/pg_type.h"
+#include "common/int.h"
 #include "hstore.h"
 #include "utils/pg_crc.h"
 
@@ -356,7 +357,7 @@ typedef struct
 static int
 comparecost(const void *a, const void *b)
 {
-	return ((const SPLITCOST *) a)->cost - ((const SPLITCOST *) b)->cost;
+	return pg_cmp_s32(((const SPLITCOST *) a)->cost, ((const SPLITCOST *) b)->cost);
 }
 
 
diff --git a/contrib/intarray/_intbig_gist.c b/contrib/intarray/_intbig_gist.c
index 8c6c4b5d05..aef9e13dcc 100644
--- a/contrib/intarray/_intbig_gist.c
+++ b/contrib/intarray/_intbig_gist.c
@@ -9,6 +9,7 @@
 #include "access/gist.h"
 #incl

Re: glibc qsort() vulnerability

2024-02-09 Thread Mats Kindahl
On Fri, Feb 9, 2024 at 8:26 PM Mats Kindahl  wrote:

> On Fri, Feb 9, 2024 at 5:24 PM Nathan Bossart 
> wrote:
>
>> On Fri, Feb 09, 2024 at 08:52:26AM +0100, Mats Kindahl wrote:
>> > Here is a new version introducing pg_cmp_s32 and friends and use them
>> > instead of the INT_CMP macro introduced before. It also moves the
>> > definitions to common/int.h and adds that as an include to all locations
>> > using these functions.
>>
>> Thanks for the new version of the patch.
>>
>> > Note that for integers with sizes less than sizeof(int), C standard
>> > conversions will convert the values to "int" before doing the
>> arithmetic,
>> > so no casting is *necessary*. I did not force the 16-bit functions to
>> > return -1 or 1 and have updated the comment accordingly.
>>
>> It might not be necessary, but this is one of those places where I would
>> add casting anyway to make it abundantly clear what we are expecting to
>> happen and why it is safe.
>>
>
> I'll add it.
>
>
>> > The types "int" and "size_t" are treated as s32 and u32 respectively
>> since
>> > that seems to be the case for most of the code, even if strictly not
>> > correct (size_t can be an unsigned long int for some architecture).
>>
>> Why is it safe to do this?
>>
>> > - return ((const SPLITCOST *) a)->cost - ((const SPLITCOST *)
>> b)->cost;
>> > + return INT_CMP(((const SPLITCOST *) a)->cost, ((const SPLITCOST
>> *) b)->cost);
>>
>> The patch still contains several calls to INT_CMP.
>>
>
> I'll fix it.
>
>
>>
>> >
>> +/*
>> > + * Comparison routines for integers
>> > +
>> *
>> > + */
>>
>> I'd suggest separating this part out to a 0001 patch to make it easier to
>> review.  The 0002 patch could take care of converting the existing qsort
>> comparators.
>>
>
> Ok. Will split it out into two patches.
>
>
>>
>> > +static inline int
>> > +pg_cmp_s16(int16 a, int16 b)
>> > +{
>> > + return a - b;
>> > +}
>> > +
>> > +static inline int
>> > +pg_cmp_u16(uint16 a, uint16 b)
>> > +{
>> > + return a - b;
>> > +}
>> > +
>> > +static inline int
>> > +pg_cmp_s32(int32 a, int32 b)
>> > +{
>> > + return (a > b) - (a < b);
>> > +}
>> > +
>> > +static inline int
>> > +pg_cmp_u32(uint32 a, uint32 b)
>> > +{
>> > + return (a > b) - (a < b);
>> > +}
>> > +
>> > +static inline int
>> > +pg_cmp_s64(int64 a, int64 b)
>> > +{
>> > + return (a > b) - (a < b);
>> > +}
>> > +
>> > +static inline int
>> > +pg_cmp_u64(uint64 a, uint64 b)
>> > +{
>> > + return (a > b) - (a < b);
>> > +}
>>
>> As suggested above, IMHO we should be rather liberal with the casting to
>> ensure it is abundantly clear what is happening here.
>>
>
> Ok.
>

QQ: right now it looks like this:

static inline int
pg_cmp_u16(uint16 a, uint16 b)
{
	return (int32) a - (int32) b;
}

and

static inline int
pg_cmp_u32(uint32 a, uint32 b)
{
	return (a > b) - (a < b);
}


I think that is clear enough, but do you want more casts added for the
return value as well?

Best wishes,
Mats Kindahl


>
>
>>
>> --
>> Nathan Bossart
>> Amazon Web Services: https://aws.amazon.com
>>
>


Re: glibc qsort() vulnerability

2024-02-09 Thread Mats Kindahl
On Fri, Feb 9, 2024 at 5:27 PM Tom Lane  wrote:

> Nathan Bossart  writes:
> > On Fri, Feb 09, 2024 at 08:52:26AM +0100, Mats Kindahl wrote:
> >> The types "int" and "size_t" are treated as s32 and u32 respectively
> since
> >> that seems to be the case for most of the code, even if strictly not
> >> correct (size_t can be an unsigned long int for some architecture).
>
> > Why is it safe to do this?
>
> We do pretty much assume that "int" is "int32".  But I agree that
> assuming anything about the width of size_t is bad.  I think we need
> a separate pg_cmp_size() or pg_cmp_size_t().
>

Do we want to have something similar for "int" as well? It seems to be
quite common and even though it usually is an int32, it does not have to be.

Best wishes,
Mats Kindahl

>
> regards, tom lane
>


Re: glibc qsort() vulnerability

2024-02-09 Thread Mats Kindahl
On Fri, Feb 9, 2024 at 5:27 PM Tom Lane  wrote:

> Nathan Bossart  writes:
> > On Fri, Feb 09, 2024 at 08:52:26AM +0100, Mats Kindahl wrote:
> >> The types "int" and "size_t" are treated as s32 and u32 respectively
> since
> >> that seems to be the case for most of the code, even if strictly not
> >> correct (size_t can be an unsigned long int for some architecture).
>
> > Why is it safe to do this?
>
> We do pretty much assume that "int" is "int32".  But I agree that
> assuming anything about the width of size_t is bad.  I think we need
> a separate pg_cmp_size() or pg_cmp_size_t().
>

I added precisely such a function at first, but removed it when I saw that
all uses assumed that it was an int. :)

I'll add it back.
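
Something along these lines, as a sketch (the final name and placement are
for the thread to decide):

static inline int
pg_cmp_size(size_t a, size_t b)
{
    return (a > b) - (a < b);
}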

Best wishes,
Mats Kindahl

>
> regards, tom lane
>


Re: glibc qsort() vulnerability

2024-02-09 Thread Mats Kindahl
On Fri, Feb 9, 2024 at 5:24 PM Nathan Bossart 
wrote:

> On Fri, Feb 09, 2024 at 08:52:26AM +0100, Mats Kindahl wrote:
> > Here is a new version introducing pg_cmp_s32 and friends and use them
> > instead of the INT_CMP macro introduced before. It also moves the
> > definitions to common/int.h and adds that as an include to all locations
> > using these functions.
>
> Thanks for the new version of the patch.
>
> > Note that for integers with sizes less than sizeof(int), C standard
> > conversions will convert the values to "int" before doing the arithmetic,
> > so no casting is *necessary*. I did not force the 16-bit functions to
> > return -1 or 1 and have updated the comment accordingly.
>
> It might not be necessary, but this is one of those places where I would
> add casting anyway to make it abundantly clear what we are expecting to
> happen and why it is safe.
>

I'll add it.


> > The types "int" and "size_t" are treated as s32 and u32 respectively
> since
> > that seems to be the case for most of the code, even if strictly not
> > correct (size_t can be an unsigned long int for some architecture).
>
> Why is it safe to do this?
>
> > - return ((const SPLITCOST *) a)->cost - ((const SPLITCOST *)
> b)->cost;
> > + return INT_CMP(((const SPLITCOST *) a)->cost, ((const SPLITCOST *)
> b)->cost);
>
> The patch still contains several calls to INT_CMP.
>

I'll fix it.


>
> >
> +/*
> > + * Comparison routines for integers
> > +
> *
> > + */
>
> I'd suggest separating this part out to a 0001 patch to make it easier to
> review.  The 0002 patch could take care of converting the existing qsort
> comparators.
>

Ok. Will split it out into two patches.


>
> > +static inline int
> > +pg_cmp_s16(int16 a, int16 b)
> > +{
> > + return a - b;
> > +}
> > +
> > +static inline int
> > +pg_cmp_u16(uint16 a, uint16 b)
> > +{
> > + return a - b;
> > +}
> > +
> > +static inline int
> > +pg_cmp_s32(int32 a, int32 b)
> > +{
> > + return (a > b) - (a < b);
> > +}
> > +
> > +static inline int
> > +pg_cmp_u32(uint32 a, uint32 b)
> > +{
> > + return (a > b) - (a < b);
> > +}
> > +
> > +static inline int
> > +pg_cmp_s64(int64 a, int64 b)
> > +{
> > + return (a > b) - (a < b);
> > +}
> > +
> > +static inline int
> > +pg_cmp_u64(uint64 a, uint64 b)
> > +{
> > + return (a > b) - (a < b);
> > +}
>
> As suggested above, IMHO we should be rather liberal with the casting to
> ensure it is abundantly clear what is happening here.
>

Ok.


>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>


Re: glibc qsort() vulnerability

2024-02-08 Thread Mats Kindahl
On Thu, Feb 8, 2024 at 9:39 PM Tom Lane  wrote:

> Nathan Bossart  writes:
> > On Thu, Feb 08, 2024 at 11:59:54AM -0800, Andres Freund wrote:
> >> I'd put these static inlines into common/int.h. I don't think this is
> common
> >> enough to warrant being in c.h. Probably also doesn't hurt to have a
> not quite
> >> as generic name as INT_CMP, I'd not be too surprised if that's defined
> in some
> >> library.
> >>
> >> I think it's worth following int.h's pattern of including
> [s]igned/[u]nsigned
> >> in the name, an efficient implementation for signed might not be the
> same as
> >> for unsigned. And if we use static inlines, we need to do so for correct
> >> semantics anyway.
>
> > Seems reasonable to me.
>
> +1 here also.
>

Here is a new version introducing pg_cmp_s32 and friends and using them
instead of the INT_CMP macro introduced before. It also moves the
definitions to common/int.h and adds that as an include to all locations
using these functions.

Note that for integers with sizes less than sizeof(int), C standard
conversions will convert the values to "int" before doing the arithmetic,
so no casting is *necessary*. I did not force the 16-bit functions to
return -1 or 1 and have updated the comment accordingly.

The types "int" and "size_t" are treated as s32 and u32 respectively since
that seems to be the case for most of the code, even if strictly not
correct (size_t can be an unsigned long int for some architecture).

I also noted that in many situations size_t values are treated as "int" so
there is an overflow risk here, even if small. For example, the result of
"list_length" is assigned to an integer. I do not think this is an
immediate concern, but just wanted to mention it.

Best wishes,
Mats Kindahl

>
> regards, tom lane
>
From 6980199cefde28b4ccf5f0fc3d89e001e8b0c4ef Mon Sep 17 00:00:00 2001
From: Mats Kindahl 
Date: Tue, 6 Feb 2024 14:53:53 +0100
Subject: Add integer comparison functions for qsort()

Introduce functions pg_cmp_xyz() that standardize integer comparison for
implementing comparison functions for qsort(). The functions return a
negative integer, 0, or a positive integer without risking the overflow
that can result from subtracting the two integers, which is otherwise a
common pattern for implementing this.

For integer sizes less than sizeof(int) we can use normal subtraction,
which is faster, but for integers with size greater than or equal to
sizeof(int) we need to handle this differently.
---
 contrib/hstore/hstore_gist.c  |  2 +-
 contrib/intarray/_intbig_gist.c   |  2 +-
 .../pg_stat_statements/pg_stat_statements.c   |  7 +---
 contrib/pg_trgm/trgm_op.c |  7 +---
 contrib/seg/seg.c |  7 +---
 src/backend/access/nbtree/nbtinsert.c |  8 +---
 src/backend/access/nbtree/nbtpage.c   | 10 ++---
 src/backend/access/nbtree/nbtsplitloc.c   |  8 +---
 src/backend/access/spgist/spgdoinsert.c   |  5 +--
 src/backend/access/spgist/spgtextproc.c   |  3 +-
 src/backend/backup/basebackup_incremental.c   |  8 +---
 src/backend/backup/walsummary.c   |  7 +---
 src/backend/catalog/heap.c|  8 +---
 src/backend/nodes/list.c  | 13 ++
 src/backend/nodes/tidbitmap.c |  7 +---
 src/backend/parser/parse_agg.c|  3 +-
 src/backend/postmaster/autovacuum.c   |  6 +--
 .../replication/logical/reorderbuffer.c   |  7 +---
 src/backend/replication/syncrep.c |  8 +---
 src/backend/utils/adt/oid.c   |  7 +---
 src/backend/utils/adt/tsgistidx.c | 10 ++---
 src/backend/utils/adt/tsquery_gist.c  |  6 +--
 src/backend/utils/adt/tsvector.c  |  5 +--
 src/backend/utils/adt/tsvector_op.c   |  5 +--
 src/backend/utils/adt/xid.c   |  7 +---
 src/backend/utils/cache/relcache.c|  3 +-
 src/backend/utils/cache/syscache.c|  5 +--
 src/backend/utils/cache/typcache.c|  8 +---
 src/backend/utils/resowner/resowner.c | 10 +
 src/bin/pg_dump/pg_dump_sort.c|  7 +---
 src/bin/pg_upgrade/function.c | 13 +++---
 src/bin/pg_walsummary/pg_walsummary.c |  8 +---
 src/bin/psql/crosstabview.c   |  3 +-
 src/include/access/gin_private.h  |  8 +---
 src/include/common/int.h  | 41 +++
 35 files changed, 112 insertions(+), 160 deletions(-)

diff --git a/contrib/hstore/hstore_gist.c b/contrib/hstore/hstore_gist.c
index fe343739eb..e125b76290 100644
--- a/contrib/hstore/hstore_gist.c
+++ b/contrib/hstore/hstore_gist.c
@@ -356,7 +356,7 @@ typedef struct
 static int
 comparecost(const void *a, const void *b)

Re: glibc qsort() vulnerability

2024-02-08 Thread Mats Kindahl
On Thu, Feb 8, 2024 at 9:07 PM Nathan Bossart 
wrote:

> On Thu, Feb 08, 2024 at 11:59:54AM -0800, Andres Freund wrote:
> > On 2024-02-08 13:44:02 -0500, Tom Lane wrote:
> >> Are we okay with using macros that (a) have double evaluation hazards
> >> and (b) don't enforce the data types being compared are the same?
> >> I think static inlines might be a safer technology.
> >
> > +1
>
> Agreed on static inlines.
>

There seems to be a general consensus on static inlines. I'll put together a
new patch.


> > I'd put these static inlines into common/int.h. I don't think this is
> common
> > enough to warrant being in c.h. Probably also doesn't hurt to have a not
> quite
> > as generic name as INT_CMP, I'd not be too surprised if that's defined
> in some
> > library.
> >
> >
> > I think it's worth following int.h's pattern of including
> [s]igned/[u]nsigned
> > in the name, an efficient implementation for signed might not be the
> same as
> > for unsigned. And if we use static inlines, we need to do so for correct
> > semantics anyway.
>
> Seems reasonable to me.
>

Agree.

Best wishes,
Mats Kindahl


>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>


Re: glibc qsort() vulnerability

2024-02-08 Thread Mats Kindahl
On Thu, Feb 8, 2024 at 7:44 PM Tom Lane  wrote:

> Nathan Bossart  writes:
> > On Thu, Feb 08, 2024 at 02:16:11PM +0100, Mats Kindahl wrote:
> >> +/*
> >> + * Compare two integers and return -1, 0, or 1 without risking
> overflow.
> >> + *
> >> + * This macro is used to avoid running into overflow issues because a
> simple
> >> + * subtraction of the two values when implementing a cmp function for
> qsort().
> >> +*/
> >> +#define INT_CMP(lhs,rhs) (((lhs) > (rhs)) - ((lhs) < (rhs)))
>
> > I think we should offer a few different macros, i.e., separate macros for
> > int8, uint8, int16, uint16, int32, etc.  For int16, we can do something
> > faster like
>
> >   (int32) (lhs) - (int32) (rhs)
>
> > but for int32, we need to do someting more like what's in the patch.
>
> Are we okay with using macros that (a) have double evaluation hazards
> and (b) don't enforce the data types being compared are the same?
> I think static inlines might be a safer technology.  Perhaps it's
> okay given the only expected use is in qsort comparators, but ...
>

I picked a macro simply because it can deal with all kinds of integers, but
if we want to have separate implementations for each then inline functions
work just as well and will be safer.
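
To spell out both hazards with the INT_CMP macro from the patch (a sketch):

#define INT_CMP(lhs, rhs) (((lhs) > (rhs)) - ((lhs) < (rhs)))

void
hazards(int *p, int b)
{
    /*
     * Double evaluation: expands to ((*p++) > (b)) - ((*p++) < (b)), so
     * the pointer is advanced twice per "comparison".
     */
    int r1 = INT_CMP(*p++, b);

    /*
     * No type enforcement: -1 is converted to a huge unsigned value, so
     * this evaluates to 1, i.e., it claims -1 > 1U.
     */
    int r2 = INT_CMP(-1, 1U);

    (void) r1;
    (void) r2;
}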

Best wishes,
Mats Kindahl


> regards, tom lane
>


Re: glibc qsort() vulnerability

2024-02-08 Thread Mats Kindahl
On Thu, Feb 8, 2024 at 3:56 AM Nathan Bossart 
wrote:

> On Thu, Feb 08, 2024 at 03:49:03PM +1300, Thomas Munro wrote:
> > On Thu, Feb 8, 2024 at 3:38 PM Thomas Munro 
> wrote:
> >> Perhaps you could wrap it in a branch-free sign() function so you get
> >> a narrow answer?
> >>
> >> https://stackoverflow.com/questions/14579920/fast-sign-of-integer-in-c
> >
> > Ah, strike that, it is much like the suggested (a > b) - (a < b) but
> > with extra steps...
>
> Yeah, https://godbolt.org/ indicates that the sign approach compiles to
>
>         movsx   rsi, esi
>         movsx   rdi, edi
>         xor     eax, eax
>         sub     rdi, rsi
>         test    rdi, rdi
>         setg    al
>         shr     rdi, 63
>         sub     eax, edi
>         ret
>
> while the approach Andres suggested compiles to
>
>         xor     eax, eax
>         cmp     edi, esi
>         setl    dl
>         setg    al
>         movzx   edx, dl
>         sub     eax, edx
>         ret
>

Here is a patch that fixes existing cases and introduces a macro for this
comparison (it uses the (a > b) - (a < b) approach). Not sure where to
place the macro nor what a suitable name should be, so feel free to suggest
anything.

I also noted that some functions are duplicated and it might be an idea to
introduce a few standard functions like pg_qsort_strcmp for, e.g., integers
and other common types.

I also noted that it is quite common in various places to have this pattern
for a lexicographic sort of multiple values, continuing the comparison when
the earlier values are equal. Not sure if that is something we should look
at.

Best wishes,
Mats Kindahl

>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>
From 8c183806a6b0f95ab53db4b029bc82823785c363 Mon Sep 17 00:00:00 2001
From: Mats Kindahl 
Date: Tue, 6 Feb 2024 14:53:53 +0100
Subject: Standardize integer comparison for qsort()

Introduce a macro INT_CMP() that standardizes integer comparison for
implementing comparison functions for qsort(). The macro ensures that the
comparison returns -1, 0, or 1 without risking the overflow that can result
from subtracting the two integers, which is otherwise a common pattern for
implementing this.
---
 contrib/hstore/hstore_gist.c|  2 +-
 contrib/intarray/_intbig_gist.c |  2 +-
 contrib/pg_stat_statements/pg_stat_statements.c |  7 +--
 contrib/pg_trgm/trgm_op.c   |  7 +--
 contrib/seg/seg.c   |  7 +--
 src/backend/access/nbtree/nbtinsert.c   |  7 +--
 src/backend/access/nbtree/nbtpage.c |  9 ++---
 src/backend/access/nbtree/nbtsplitloc.c |  7 +--
 src/backend/access/spgist/spgdoinsert.c |  4 +---
 src/backend/access/spgist/spgkdtreeproc.c   |  8 ++--
 src/backend/access/spgist/spgtextproc.c |  2 +-
 src/backend/backup/basebackup_incremental.c |  7 +--
 src/backend/backup/walsummary.c |  6 +-
 src/backend/catalog/heap.c  |  7 +--
 src/backend/nodes/list.c| 12 ++--
 src/backend/nodes/tidbitmap.c   |  6 +-
 src/backend/optimizer/geqo/geqo_pool.c  |  7 +--
 src/backend/parser/parse_agg.c  |  2 +-
 src/backend/postmaster/autovacuum.c |  5 +
 src/backend/replication/logical/reorderbuffer.c |  6 +-
 src/backend/replication/syncrep.c   |  7 +--
 src/backend/utils/adt/oid.c |  6 +-
 src/backend/utils/adt/tsgistidx.c   |  9 ++---
 src/backend/utils/adt/tsquery_gist.c|  5 +
 src/backend/utils/adt/tsvector.c|  4 +---
 src/backend/utils/adt/tsvector_op.c |  4 +---
 src/backend/utils/adt/xid.c |  6 +-
 src/backend/utils/cache/relcache.c  |  2 +-
 src/backend/utils/cache/syscache.c  |  4 +---
 src/backend/utils/cache/typcache.c  |  7 +--
 src/backend/utils/resowner/resowner.c   |  9 +
 src/bin/pg_dump/pg_dump_sort.c  |  6 +-
 src/bin/pg_upgrade/function.c   |  8 
 src/bin/pg_walsummary/pg_walsummary.c   |  7 +--
 src/bin/psql/crosstabview.c |  2 +-
 src/include/access/gin_private.h|  7 +--
 src/include/c.h |  8 
 37 files changed, 51 insertions(+), 170 deletions(-)

diff --git a/contrib/hstore/hstore_gist.c b/contrib/hstore/hstore_gist.c
index fe343739eb..e125b76290 100644
--- a/contrib/hstore/hstore_gist.c
+++ b/contrib/hstore/hstore_gist.c
@@ -356,7 +356,7 @@ typedef struct
 static int
 comparecost(const void *a, const void *b)
 {
-	return ((const SPLITCOST *) a)->cost - ((const SPLITCOS

Re: glibc qsort() vulnerability

2024-02-08 Thread Mats Kindahl
On Thu, Feb 8, 2024 at 12:01 PM Mats Kindahl  wrote:

> On Wed, Feb 7, 2024 at 9:56 PM Nathan Bossart 
> wrote:
>
>> On Wed, Feb 07, 2024 at 08:46:56PM +0200, Heikki Linnakangas wrote:
>> > Doesn't hurt to fix the comparison functions, and +1 on using the same
>> > pattern everywhere.
>>
>> I attached a new version of the patch with some small adjustments.  I
>> haven't looked through all in-tree qsort() comparators to see if any
>> others
>> need to be adjusted, but we should definitely do so as part of this
>> thread.
>> Mats, are you able to do this?
>>
>
> Sure, I checked them and the only ones remaining are those using int16.
> Shall I modify those as well?
>

Seems your additional patch dealt with at least one of the cases.


>
>
>> > However, we use our qsort() with user-defined comparison functions, and
>> we
>> > cannot make any guarantees about what they might do. So we must ensure
>> that
>> > our qsort() doesn't overflow, no matter what the comparison function
>> does.
>> >
>> > Looking at our ST_SORT(), it seems safe to me.
>>
>> Cool.
>>
>> --
>> Nathan Bossart
>> Amazon Web Services: https://aws.amazon.com
>>
>


Re: glibc qsort() vulnerability

2024-02-08 Thread Mats Kindahl
On Thu, Feb 8, 2024 at 4:08 AM Nathan Bossart 
wrote:

> Mats, I apologize for steamrolling a bit here.  I'll take a step back into
> a reviewer role.
>

No worries. :)


>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>


Re: glibc qsort() vulnerability

2024-02-08 Thread Mats Kindahl
On Wed, Feb 7, 2024 at 9:56 PM Nathan Bossart 
wrote:

> On Wed, Feb 07, 2024 at 08:46:56PM +0200, Heikki Linnakangas wrote:
> > Doesn't hurt to fix the comparison functions, and +1 on using the same
> > pattern everywhere.
>
> I attached a new version of the patch with some small adjustments.  I
> haven't looked through all in-tree qsort() comparators to see if any others
> need to be adjusted, but we should definitely do so as part of this thread.
> Mats, are you able to do this?
>

Sure, I checked them and the only ones remaining are those using int16.
Shall I modify those as well?


> > However, we use our qsort() with user-defined comparison functions, and
> we
> > cannot make any guarantees about what they might do. So we must ensure
> that
> > our qsort() doesn't overflow, no matter what the comparison function
> does.
> >
> > Looking at our ST_SORT(), it seems safe to me.
>
> Cool.
>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>


Re: glibc qsort() vulnerability

2024-02-07 Thread Mats Kindahl
On Tue, Feb 6, 2024 at 9:56 PM Tom Lane  wrote:

> Nathan Bossart  writes:
> > Even if the glibc issue doesn't apply to Postgres, I'm tempted to suggest
> > that we make it project policy that comparison functions must be
> > transitive.  There might be no real issues today, but if we write all
> > comparison functions the way Mats is suggesting, it should be easier to
> > reason about overflow risks.
>
> A comparison routine that is not is probably broken, agreed.
> I didn't look through the details of the patch --- I was more
> curious whether we had a version of the qsort bug, because
> if we do, we should fix that too.
>

The patch removes the risk of overflow in three routines by using normal
comparisons and returning -1, 0, or 1 explicitly, and adds a comment to a
fourth.

The modified routines subtract two ints and return the result, which can
overflow. The same method is used for some int16 comparisons as well, but
since standard conversions in C perform the arithmetic in "int" precision,
those cannot overflow, so I added a comment there. It might still be a good
idea to follow the same pattern for the int16 routines, but since there is
no bug there, I did not add them to the patch.
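
For example, a comparator of this shape is safe despite the subtraction (a
sketch; int16 is the usual PostgreSQL typedef):

/*
 * Safe: the usual arithmetic conversions promote both operands to int
 * before the subtraction, and the difference of two int16 values always
 * fits in an int.
 */
static int
cmp_int16(const void *a, const void *b)
{
    return *(const int16 *) a - *(const int16 *) b;
}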

Best wishes,
Mats Kindahl



>
> regards, tom lane
>


Re: glibc qsort() vulnerability

2024-02-07 Thread Mats Kindahl
On Tue, Feb 6, 2024 at 4:11 PM Tom Lane  wrote:

> Mats Kindahl  writes:
> > There is a bug in glibc's qsort() algorithm that runs the risk of
> creating
> > an out-of-bounds error if the comparison function is not transitive, for
> > example, if subtraction is used so that it can create an overflow.
>
> We don't use glibc's qsort.  Have you checked whether there's a
> problem with the code we do use?
>

Interesting. No, haven't checked. Will do that.

Best wishes,
Mats Kindahl

>
> regards, tom lane
>


glibc qsort() vulnerability

2024-02-06 Thread Mats Kindahl
Hi hackers,

There is a bug in glibc's qsort() algorithm that runs the risk of an
out-of-bounds read or write if the comparison function is not transitive,
for example, if subtraction is used in a way that can overflow.

See
https://packetstormsecurity.com/files/176931/glibc-qsort-Out-Of-Bounds-Read-Write.html

I checked the existing functions in the latest version of the Postgres
source code and most are safe, but there were a few that could lead to
overflow. I do not know if these can actually lead to problems, but better
safe than sorry, so I created a patch to fix those few cases and add a
comment to one case where it was not obvious that it could not overflow.
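
To illustrate the problem, here is a standalone sketch (signed overflow is
undefined behavior, so the result in the comment is only what typical
two's-complement platforms produce):

#include <limits.h>
#include <stdio.h>

/* Overflow-prone comparator: fine for small values, broken at the extremes. */
static int
cmp_sub(const void *a, const void *b)
{
    return *(const int *) a - *(const int *) b; /* can overflow */
}

int
main(void)
{
    int lo = INT_MIN;
    int hi = INT_MAX;

    /*
     * INT_MIN - INT_MAX overflows; it typically wraps to 1, so the
     * comparator claims INT_MIN > INT_MAX and breaks the ordering.
     */
    printf("cmp(INT_MIN, INT_MAX) = %d\n", cmp_sub(&lo, &hi));
    return 0;
}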

Best wishes,
Mats Kindahl, Timescale
From 83e2f14ab259f568034b07c2f99e34c6dff2c7b5 Mon Sep 17 00:00:00 2001
From: Mats Kindahl 
Date: Tue, 6 Feb 2024 14:53:53 +0100
Subject: Ensure comparison functions are transitive

There are a few comparison functions to qsort() that are non-transitive
because they can cause an overflow. Fix these functions to instead use
normal comparisons and return -1, 0, or 1 explicitly.
---
 src/backend/access/spgist/spgtextproc.c |  6 +-
 src/backend/utils/cache/relcache.c  |  2 ++
 src/bin/pg_upgrade/function.c   | 10 +++---
 src/bin/psql/crosstabview.c |  8 +++-
 4 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/spgist/spgtextproc.c b/src/backend/access/spgist/spgtextproc.c
index b8fd0c2ad8..d0a2b4e6e1 100644
--- a/src/backend/access/spgist/spgtextproc.c
+++ b/src/backend/access/spgist/spgtextproc.c
@@ -325,7 +325,11 @@ cmpNodePtr(const void *a, const void *b)
 	const spgNodePtr *aa = (const spgNodePtr *) a;
 	const spgNodePtr *bb = (const spgNodePtr *) b;
 
-	return aa->c - bb->c;
+	if (aa->c > bb->c)
+		return 1;
+	if (aa->c < bb->c)
+		return -1;
+	return 0;
 }
 
 Datum
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index ac106b40e3..42e4359c7b 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -4520,6 +4520,8 @@ AttrDefaultCmp(const void *a, const void *b)
 	const AttrDefault *ada = (const AttrDefault *) a;
 	const AttrDefault *adb = (const AttrDefault *) b;
 
+	/* Cannot overflow because standard conversions will convert to int and
+	 * sizeof(int16) < sizeof(int) */
 	return ada->adnum - adb->adnum;
 }
 
diff --git a/src/bin/pg_upgrade/function.c b/src/bin/pg_upgrade/function.c
index af998f74d3..c82ae11016 100644
--- a/src/bin/pg_upgrade/function.c
+++ b/src/bin/pg_upgrade/function.c
@@ -32,14 +32,18 @@ library_name_compare(const void *p1, const void *p2)
 	int			slen1 = strlen(str1);
 	int			slen2 = strlen(str2);
 	int			cmp = strcmp(str1, str2);
+	int lhsdb = ((const LibraryInfo *) p1)->dbnum;
+	int rhsdb = ((const LibraryInfo *) p2)->dbnum;
 
 	if (slen1 != slen2)
 		return slen1 - slen2;
 	if (cmp != 0)
 		return cmp;
-	else
-		return ((const LibraryInfo *) p1)->dbnum -
-			((const LibraryInfo *) p2)->dbnum;
+	if (lhsdb < rhsdb)
+		return -1;
+	if (lhsdb > rhsdb)
+		return 1;
+	return 0;
 }
 
 
diff --git a/src/bin/psql/crosstabview.c b/src/bin/psql/crosstabview.c
index c6116b7238..948ed9b1fe 100644
--- a/src/bin/psql/crosstabview.c
+++ b/src/bin/psql/crosstabview.c
@@ -709,5 +709,11 @@ pivotFieldCompare(const void *a, const void *b)
 static int
 rankCompare(const void *a, const void *b)
 {
-	return *((const int *) a) - *((const int *) b);
+	const int lhs = *(const int *) a;
+	const int rhs = *(const int *) b;
+	if (lhs < rhs)
+		return -1;
+	if (lhs > rhs)
+		return 1;
+	return 0;
 }
-- 
2.34.1



Re: Add test module for Table Access Method

2024-01-16 Thread Mats Kindahl
On Tue, Jan 16, 2024 at 6:15 AM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Tue, Jan 16, 2024 at 10:28 AM Michael Paquier 
> wrote:
> >
> > Hmm.  I'd rather have it do something useful in terms of test coverage
> > rather than being just an empty skull.
> >
> > How about adding the same kind of coverage as dummy_index_am with a
> > couple of reloptions then?  That can serve as a point of reference
> > when a table AM needs a few custom options.  A second idea would be to
> > show how to use toast relations when implementing your new AM, where a
> > toast table could be created even in cases where we did not want one
> > with heap, when it comes to size limitations with char and/or varchar,
> > and that makes for a simpler needs_toast_table callback.
>
> I think a test module for a table AM will really help developers. Just
> to add to the above list - how about the table AM implementing a
> simple in-memory (columnar if possible) database storing tables
> in-memory and subsequently providing readers with the access to the
> tables?
>

Hi,

One idea I wanted to implement is a table access method that you can use to
test the interface: something like a "mock TAM" where you can
programmatically decide on the responses in order to unit-test the API. I
was thinking that you could implement a framework that allows you to write
the TAM in some scripting language like Perl, Python, or (horrors) Tcl for
easy prototyping.

Best wishes,
Mats Kindahl


> --
> Bharath Rupireddy
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
>
>
>


Re: Add test module for Table Access Method

2024-01-16 Thread Mats Kindahl
Hi all,

On Tue, Jan 16, 2024 at 10:40 AM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> Hi,
>
> > > I think a test module for a table AM will really help developers. Just
> > > to add to the above list - how about the table AM implementing a
> > > simple in-memory (columnar if possible) database storing tables
> > > in-memory and subsequently providing readers with the access to the
> > > tables?
> >
> > That's a good idea.
>
> Personally I would be careful with this idea.
>
> Practice shows that when you show the first incomplete, limited, and
> buggy PoC, it ends up in the production environment the next day :)
> In other words, sooner or later there will be users demanding full
> in-memory columnar storage support from Postgres. I believe that would
> be a problem. Last time I checked, TAM was not extremely good for
> implementing proper columnar storage, and there are lots of open
> questions when it comes to in-memory tables (e.g. what to do with
> foreign keys, inherited tables, etc).
>
> All in all I don't think we should provide something that can look like /
> be interpreted as a first-class alternative storage but in fact is not.
>

I tossed together a table access method for in-memory storage in column
format for experimental purposes over the holidays (I actually have a
row-based one as well, but that is in no shape to share at this point).
It's available under https://github.com/mkindahl/pg_arrow. The intention
was mostly to have something simple to play and experiment with. It is
loosely based on the Apache Arrow Columnar format, but the normal data
structures are not suitable for storing in shared memory so I have tweaked
it a little.


> > How about adding the same kind of coverage as dummy_index_am with a
> > couple of reloptions then?  That can serve as a point of reference
> > when a table AM needs a few custom options.  A second idea would be to
> > show how to use toast relations when implementing your new AM, where a
> > toast table could be created even in cases where we did not want one
> > with heap, when it comes to size limitations with char and/or varchar,
> > and that makes for a simpler needs_toast_table callback.
>
> Good ideas. Additionally we could provide a proxy TAM for the heap TAM
> which does nothing but log the TAM methods used, their arguments, and
> return values. This would be a good example and could also potentially
> be used as a debugging tool.
>

We wrote a table access method to experiment with and to trace what
happens while executing various statements. It is available
under https://github.com/timescale/pg_traceam for anybody who is interested.
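
The core of such a proxy is quite small. Here is a minimal sketch that
traces only scan_begin; the trace_* names are made up, while
GetHeapamTableAmRoutine is the existing function that returns heap's
routine table:

static const TableAmRoutine *heap_routine;
static TableAmRoutine trace_methods;

static TableScanDesc
trace_scan_begin(Relation rel, Snapshot snapshot, int nkeys, ScanKey key,
				 ParallelTableScanDesc pscan, uint32 flags)
{
	elog(LOG, "scan_begin(\"%s\", nkeys=%d)",
		 RelationGetRelationName(rel), nkeys);
	return heap_routine->scan_begin(rel, snapshot, nkeys, key, pscan, flags);
}

PG_FUNCTION_INFO_V1(trace_tam_handler);

Datum
trace_tam_handler(PG_FUNCTION_ARGS)
{
	heap_routine = GetHeapamTableAmRoutine();
	trace_methods = *heap_routine;	/* start from heap's callbacks */
	trace_methods.scan_begin = trace_scan_begin;	/* wrap what we trace */
	PG_RETURN_POINTER(&trace_methods);
}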

Best wishes,
Mats Kindahl


>
> --
> Best regards,
> Aleksander Alekseev
>
>
>


Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]

2022-11-16 Thread Mats Kindahl
Hello all,

I think the discussion went a little sideways, so let me recap what I'm
suggesting:

   1. I mentioned that there is a missing callback when the filenode is
   unlinked, and this is particularly evident when dropping a table.
   2. It was correctly pointed out to me that an implementor needs to ensure
   that dropping a table is transactional.
   3. I argued that the callback is still correct and outlined how this can
   be handled by a table access method using xact callbacks, if necessary.

I see huge potential in the table access method interface and would like to
do my part in helping it succeed. I noted that the API is biased toward how
the existing heap implementation works and is also very focused on
implementations of "local" storage engines. For more novel architectures
(for example, various sorts of distributed architectures) and to make the
API easier to work with, I think it can be improved in a few places. This
is a first step in the direction of making the API both easier to use and
capable of supporting more novel use cases.

Making implementations easy to write is mostly a matter of having the right
callbacks in the right places and providing the right information, but in
some cases it is simply not possible to implement functionality efficiently
with the current interface. I think it is useful to separate these two
kinds of enhancements when discussing the API, but both are important for
table access methods to be practically usable and to leverage the full
power of this concept.

On Tue, Aug 2, 2022 at 1:44 AM Andres Freund  wrote:

> Hi,
>
> On 2021-04-05 21:57:12 +0200, Mats Kindahl wrote:
> >2. In the storage layer, the function RelationDropStorage is called,
> >which will record the table to be dropped in the pendingDeletes
> >
> > When committing (or aborting) the transaction, there are two calls that
> > are interesting, in this order:
> >
> >1. CallXactCallbacks which calls registered callbacks
> >2. smgrDoPendingDeletes, which calls the storage layer directly to
> >perform the actual deletion, if necessary.
> >
> > Now, suppose that we want to replace the storage layer with a different
> > one. It is straightforward to replace it by implementing the Table AM
> > methods above, but we are missing a callback on dropping the table. If we
> > have that, we can record the table-to-be-dropped in a similar manner to
> > how the heap AM does it and register a transaction callback using
> > RegisterXactCallback.
>
> I don't think implementing dropping relation data at-commit/rollback using
> xact callbacks can be correct. The dropping needs to be integrated with the
> commit / abort records, so it is redone during crash recovery - that's not
> possible with xact callbacks.
>

Yes, but this patch is about making the extension aware that a file node is
being unlinked.


> To me it still seems fundamentally the wrong direction to implement a "drop
> relation callback" tableam callback.
>

This is not really a "drop table" callback; it is just the most obvious
case where the callback is missing. So, just to recap the situation as it
looks right now:

Here is (transactional) truncate table:

   1. Allocate a new file node in the same tablespace as the table
   2. Add the file node to the list of pending nodes to delete
   3. Overwrite the existing file node in the relation with the new one
   4. Call table_relation_set_new_filenode to tell the extension that there
   is a new filenode for the relation

Here is drop table:

   1. Add the existing file node to the list of pending deletes
   2. Remove the table from the catalogs

For an extension writer, the disappearance of the old file node is
"invisible" since there is no callback about this, but it is very clear
when a new file node is allocated. In addition to being inconsistent, it
adds an extra burden on the extension writer. To notice that a file node
has been unlinked you can register a transaction handler and investigate
the pending list at commit or abort time. Even though possible, there are
two problems with this: 1) the table access method is notified "late" in
the transaction that the file node is going away, and 2) it is
unnecessarily complicated to register a transaction handler only for
inspecting this.
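
To illustrate the second point, the workaround looks roughly like the
sketch below. Note that pendingDeletes is local to storage.c, so the access
method even has to keep its own list of file nodes to release; the my_*
names are hypothetical:

#include "access/xact.h"

static void
my_xact_callback(XactEvent event, void *arg)
{
	switch (event)
	{
		case XACT_EVENT_COMMIT:
		case XACT_EVENT_ABORT:
			/*
			 * Only now do we learn that the file nodes we tracked
			 * ourselves are going away, "late" in the transaction.
			 * (my_release_tracked_filenodes is a hypothetical helper.)
			 */
			my_release_tracked_filenodes(event == XACT_EVENT_COMMIT);
			break;
		default:
			break;
	}
}

void
_PG_init(void)
{
	RegisterXactCallback(my_xact_callback, NULL);
}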

Adding a callback to tell the access method that the filenode is being
unlinked is by far the best solution, since it does not affect existing
extensions and gives table access methods the opportunity to act on the
unlink immediately.

I have attached an updated patch that changes the names of the callbacks
since there was a name change. I had also missed the case of unlinking a
file node when tables were truncated, so I have added a callback for this
as well.

Best wishes,
Mats Kindahl


> Greetings,
>
> Andres Freund
>

Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]

2022-07-29 Thread Mats Kindahl
on all remote nodes
   (this is already what PostgreSQL does, locking the table on a drop). I do
   realize that distributed transactions are not that simple and there are
   other problems associated with this, but this would still introduce an
   unnecessary restriction on what you can do.
   3. A problem with optimistic protocols in general is that their
   performance drops when you have a lot of writes. It is simply the case
   that other, smaller transactions will constantly force a long-running
   transaction to be aborted. This also means that there is a risk that a
   transaction that drops a table will have to be aborted out of necessity
   since other transactions are updating the table. In a distributed system
   there will be more of those, so the odds of aborting "more important"
   transactions (in the sense of needing stronger locks) are higher.
   4. A smaller issue is that right now the storage manager (smgr) and the
   transaction system are quite tightly coupled, which makes it more difficult
   to make the storage system "pluggable". I think that not requiring the use
   of pendingDeletes would move one step in the direction of removing this
   coupling, but I am not entirely sure here.

It is likely that many of these problems can be worked around by placing
restrictions on how DDLs can be used in transactions, but that would create
unnecessary restrictions for the end-user. It might also be possible to
find workarounds by placing code at strategic points in the implementation,
but that is error prone.

Best wishes,
Mats Kindahl


>
> - Heikki
>


Re: RFC: Table access methods and scans

2021-06-04 Thread Mats Kindahl
Hi Jeff,

On Fri, Jun 4, 2021 at 2:52 AM Jeff Davis  wrote:

> Hi,
>
> On Wed, 2021-03-31 at 22:10 +0200, Mats Kindahl wrote:
> > As an example of how this is useful, I noticed the work by Heikki and
> > Ashwin [1], where they return a `TableScanDesc` that contains
> > information about what columns to scan, which looks very useful.
> > Since
> > the function `table_beginscan` in `src/include/access/tableam.h`
> > accepts a `ScanKey` as input, this is (AFAICT) what Heikki and Ashwin
> > were exploiting to create a specialized scan for a columnar store.
>
> I don't think ScanKeys are the right place to store information about
> what columns would be useful. See another thread[2] about that topic.
>

Yeah, it is not a good example; the examples below are better. The scan
keys are not sufficient to get all the columns, but AFAICT it is this
callback that is exploited in the patch.


>
> > Another example of where this can be useful is to optimize access
> > during a sequential scan when you can handle some specific scans very
> > efficiently and can "skip ahead" many tuples if you know what is
> > being
> > looked for instead of filtering "late". Two examples of where this
> > could be useful are:
> >
> > - An access method that reads data from a remote system and doesn't
> > want
> >   to transfer all tuples unless necessary.
> > - Some sort of log-structured storage with Bloom filters that allows
> >   you to quickly skip suites that do not have a key.
>
> I agree that would be very convenient for non-heap AMs. There's a very
> old commit[3] that says:
>
> +   /*
> +* Note that unlike IndexScan, SeqScan never use keys
> +* in heap_beginscan (and this is very bad) - so, here
> +* we have not check are keys ok or not.
> +*/
>
> and that language has just been carried forward for decades. I wonder
> if there's any major reason this hasn't been done yet. Does it just not
> improve performance for a heap, or is there some other reason?
>

That is basically the question. I'm prepared to take a shot at it unless
there is a good reason not to.

Best wishes,
Mats Kindahl



>
> Regards,
> Jeff Davis
>
> [2]
>
> https://www.postgresql.org/message-id/cae-ml+9rmtnzkcntzpqf8o3b-ujhwgfbsoxpqa3wvuc8ybb...@mail.gmail.com
>
> [3]
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e3a1ab764ef2
>
>
>


Table AM and DROP TABLE [ Was: Table AM and DDLs]

2021-04-05 Thread Mats Kindahl
On Mon, Mar 22, 2021 at 12:16 AM Andres Freund  wrote:

> Hi,
>
> On 2021-03-03 22:15:18 +0100, Mats Kindahl wrote:
> > On Tue, Feb 23, 2021 at 2:11 AM Andres Freund 
> wrote:
> > Thanks for the answer and sorry about the late reply.
>
> Mine is even later ;)
>

:)

Seems I keep the tradition. :)

Thanks a lot for the pointers. I have some comments below about both DROP
TABLE and ALTER TABLE.


> > > I don't think that's quite right. It's not exactly obvious from the
> > > name, but RelationDropStorage() does not actually drop storage. Instead
> > > it *schedules* the storage to be dropped upon commit.
> > >
> > > The reason for deferring the dropping of table storage is that DDL in
> > > postgres is transactional. Therefore we cannot remove the storage at
> > > the moment the DROP TABLE is executed - only when the transaction that
> > > performed the DDL commits. Therefore just providing you with a callback
> > > that runs in heap_drop_with_catalog() doesn't really achieve much -
> > > you'd not have a way to execute the "actual" dropping of the relation
> > > at the later stage.
> > >
> >
> > Yeah, I found the chain (performDeletion -> deleteOneObject -> doDeletion
> > -> heap_drop_with_catalog) where the delete was just scheduled for
> > deletion but it appeared like this was the place to actually perform the
> > "actual" delete. Looking closer, I see this was the wrong location.
> > However, the intention was to get a callback when the "actual" delete
> > should happen. Before that, the blocks are still potentially alive and
> > could be read, so shouldn't be recycled.
> >
> > It seems the right location is in the storage manager (smgr_unlink
> > in smgr.c), but that does not seem to be extensible, or are there any
> > plans to make it available so that you can implement something other
> > than just "magnetic disk"?
>
> There've been patches to add new types of storage below smgr.c, but not
> in a way that can be done outside of build time, as far as I recall.
>

I have done some more research and I do not think it is necessary to extend
the storage layer. As a matter of fact, I think the patch I suggested is
the right approach: let me elaborate on why.

Let's look at how the implementation works with the heap access method (the
file heapam_handler.c), and for this case let's use CREATE TABLE, DROP
TABLE, and TRUNCATE TABLE (the last one because it is supported in the
Table AM interface and hence is a good reference for the comparison).

Disregarding surrounding layers, we have three layers that are important
here:

   1. Heap catalog layer (not sure what to call it, but it's the
   src/backend/catalog/heap.c file)
   2. AM layer (the src/backend/access/heap/heapam_handler.c file)
   3. Storage layer (the src/backend/catalog/storage.c file) "code to
   create and destroy physical storage for relations".

Looking at TRUNCATE TABLE, we have the following calls through these
layers:

   1. In the heap catalog layer we have a call of heap_truncate_one_rel
   which calls the table AM layer.
   2. In the Table AM layer heapam_relation_nontransactional_truncate will
   just call the storage layer to truncate the storage.
   3. The storage layer gets called through RelationTruncate, which will
   truncate the actual files.
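
For reference, the heap implementation of step 2 is essentially a one-line
forward to the storage layer (from heapam_handler.c):

static void
heapam_relation_nontransactional_truncate(Relation rel)
{
	RelationTruncate(rel, 0);
}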

Looking at CREATE TABLE, we have a similar pattern:

   1. In the heap catalog layer heap_create_with_catalog is called, which
   in turn calls heap_create, which will create the actual relcache entry
   and also call the table AM layer if it is a relation, materialized view,
   or toast table.
   2. In the Table AM layer, heapam_relation_set_new_filenode is called
   which will record the transaction identifiers and call the storage layer to
   create the underlying storage.
   3. In the storage layer, RelationCreateStorage will create the necessary
   storage, but also register the table for deletion if the transaction is
   aborted.

Note here that the storage layer remembers the table for deletion by saving
it in pendingDeletes, which is local to the storage layer.
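
The scheduling is easy to see in the storage layer (slightly simplified
from RelationDropStorage in storage.c):

void
RelationDropStorage(Relation rel)
{
	PendingRelDelete *pending;

	/* Add the relation to the list of stuff to delete at commit */
	pending = (PendingRelDelete *)
		MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
	pending->relnode = rel->rd_node;
	pending->backend = rel->rd_backend;
	pending->atCommit = true;	/* delete if commit */
	pending->nestLevel = GetCurrentTransactionNestLevel();
	pending->next = pendingDeletes;
	pendingDeletes = pending;

	/* ... */
	RelationCloseSmgr(rel);
}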

Looking at DROP TABLE, we have a similar pattern, but one step is missing:

   1. In the heap catalog layer the function heap_drop_with_catalog is
   called, which releases the system cache and calls the storage layer to drop
   the relation
   2. In the storage layer, the function RelationDropStorage is called,
   which will record the table to be dropped in the pendingDeletes

When committing (or aborting) the transaction, there are two calls that are
interesting, in this order:

   1. CallXactCallbacks which calls registered callbacks
   2. smgrDoPendingDeletes, which calls the storage layer directly to
   perform the actual deletion, if necessary.

RFC: Table access methods and scans

2021-03-31 Thread Mats Kindahl
Hi all,

I started looking into how table scans are handled for table access
methods and have discovered a few things that I find odd. I cannot
find any material regarding why this particular choice was made (if
anybody has pointers, I would be very grateful).

I am quite new to PostgreSQL so forgive me if my understanding of the
code below is wrong and please clarify what I have misunderstood.

I noted that `scan_begin` accepts a `ScanKey` and my *guess* was that
the intention for adding this to the interface was to support primary
indexes for table access methods (the comment is a little vague, but
it seems to point to that). However, looking at where `scan_begin` is
called from, I see that it is called from the following methods in
`tableam.h`:

- `table_beginscan` is always called using zero scan keys and NULL.
- `table_beginscan_strat` is mostly called with zero keys and NULL,
  with the exception of `systable_beginscan`, which is only for system
  tables. It does use this feature.
- `table_beginscan_bm` is only called with zero keys and NULL.
- `table_beginscan_sampling` is only called with zero keys and NULL.
- `table_beginscan_tid` calls `scan_begin` with zero keys and NULL.
- `table_beginscan_analyze` calls `scan_begin` with zero keys and NULL.
- `table_beginscan_catalog` is called with more than one key, but
  AFAICT this is only for catalog tables.
- `table_beginscan_parallel` calls `scan_begin` with zero keys and NULL.

I draw the conclusion that the scan keys only make sense for a table
access method in the odd case where it is used for system tables or
catalog tables, so for all practical purposes the scan key cannot be
used to implement a primary index for general tables.
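
For comparison, this is the pattern where scan keys actually are passed
down, a system-table scan; pg_class_rel is assumed to be an already-opened
pg_class relation and the table name is just a placeholder:

ScanKeyData key;
SysScanDesc sscan;

ScanKeyInit(&key,
			Anum_pg_class_relname,
			BTEqualStrategyNumber, F_NAMEEQ,
			CStringGetDatum("some_table"));
sscan = systable_beginscan(pg_class_rel, ClassNameNspIndexId, true,
						   NULL, 1, &key);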

As an example of how this is useful, I noticed the work by Heikki and
Ashwin [1], where they return a `TableScanDesc` that contains
information about what columns to scan, which looks very useful. Since
the function `table_beginscan` in `src/include/access/tableam.h`
accepts a `ScanKey` as input, this is (AFAICT) what Heikki and Ashwin
were exploiting to create a specialized scan for a columnar store.

Another example of where this can be useful is to optimize access
during a sequential scan when you can handle some specific scans very
efficiently and can "skip ahead" many tuples if you know what is being
looked for instead of filtering "late". Two examples of where this
could be useful are:

- An access method that reads data from a remote system and doesn't want
  to transfer all tuples unless necessary.
- Some sort of log-structured storage with Bloom filters that allows
  you to quickly skip suites that do not have a key.

Interestingly enough, `ScanKey` is generated for `IndexScan` and I
think that the same approach could be used for sequential scans: pick
out the quals that can be used for filtering and offer them to the
table access method through the `scan_begin` callback.
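
To make the idea concrete, here is a sketch of what the receiving side
could look like; everything prefixed with my_ is hypothetical:

typedef struct MyScanDesc
{
	TableScanDescData rs_base;	/* AM-private skip state would follow */
} MyScanDesc;

static TableScanDesc
my_scan_begin(Relation rel, Snapshot snapshot, int nkeys, ScanKey key,
			  ParallelTableScanDesc pscan, uint32 flags)
{
	MyScanDesc *scan = palloc0(sizeof(MyScanDesc));

	scan->rs_base.rs_rd = rel;
	scan->rs_base.rs_snapshot = snapshot;
	scan->rs_base.rs_nkeys = nkeys;
	scan->rs_base.rs_key = key;
	scan->rs_base.rs_flags = flags;
	scan->rs_base.rs_parallel = pscan;

	/*
	 * With the quals available here, the AM could probe its Bloom filters
	 * and mark whole segments as skippable before the first getnextslot
	 * call (my_prime_skip_filter is a hypothetical helper).
	 */
	my_prime_skip_filter(scan, nkeys, key);

	return (TableScanDesc) scan;
}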

Thoughts around this?

Best wishes,
Mats Kindahl

[1] 
https://www.postgresql-archive.org/Zedstore-compressed-in-core-columnar-storage-tp6081536.html




Re: Table AM and DDLs

2021-03-03 Thread Mats Kindahl
On Tue, Feb 23, 2021 at 2:11 AM Andres Freund  wrote:

> Hi,
>

Hi Andres,

Thanks for the answer and sorry about the late reply.


> On 2021-02-22 08:33:21 +0100, Mats Kindahl wrote:
> > I started to experiment with the table access method interface to see if
> it
> > can be used for some ideas I have.
>
> Cool.
>
>
> > The `relation_set_new_filenode` is indirectly called from
> > `heap_create_with_catalog`, but there is no corresponding callback from
> > `heap_drop_with_catalog`. It also seems like the intention is that the
> > callback should call `RelationCreateStorage` itself (makes sense, since
> > the access method knows about how to use the storage), so it seems
> > natural to add a `relation_reset_filenode` to the table AM that is
> > called from `heap_drop_with_catalog` for tables and add that to the heap
> > implementation (see the attached patch).
>
> I don't think that's quite right. It's not exactly obvious from the
> name, but RelationDropStorage() does not actually drop storage. Instead
> it *schedules* the storage to be dropped upon commit.
>
> The reason for deferring the dropping of table storage is that DDL in
> postgres is transactional. Therefore we cannot remove the storage at the
> moment the DROP TABLE is executed - only when the transaction that
> performed the DDL commits. Therefore just providing you with a callback
> that runs in heap_drop_with_catalog() doesn't really achieve much -
> you'd not have a way to execute the "actual" dropping of the relation at
> the later stage.
>

Yeah, I found the chain (performDeletion -> deleteOneObject -> doDeletion
-> heap_drop_with_catalog) where the delete was just scheduled for deletion
but it appeared like this was the place to actually perform the "actual"
delete. Looking closer, I see this was the wrong location. However, the
intention was to get a callback when the "actual" delete should happen.
Before that, the blocks are still potentially alive and could be read, so
shouldn't be recycled.

It seems the right location is in the storage manager (smgr_unlink
in smgr.c), but that does not seem to be extensible, or are there any plans
to make it available so that you can implement something other than just
"magnetic disk"?


> > Creating new blocks for a table is straightforward to implement by using
> > the `relation_set_new_filenode` callback where you can create new memory
> > blocks for a relation, but I cannot find a way to clean up those blocks
> > when the table is dropped nor a way to handle a change of the schema for
> > a table.
>
> What precisely do you mean with the "handle a change of the schema" bit?
> I.e. what would you like to do, and what do you think is preventing you
> from it? But before you answer see my next point below.
>
>
> > Altering the schema does not seem to be covered at all, but this is
> > something that table access methods need to know about since they might
> > want to optimize the internal storage when the schema changes. I have
> > not been able to find any discussions around this, but it seems like a
> > natural thing to do with a table. Have I misunderstood how this works?
>
> Due to postgres' transactional DDL you cannot really change the storage
> layout of *existing data* when that DDL command is executed - the data
> still needs to be interpretable in case the DDL is rolled back
> (including when crashing).
>

No, I didn't expect this, but some means to see that a schema change is
about to happen would be useful.


> Before I explain some more: Could you describe in a bit more detail what
> kind of optimization you'd like to make?
>

This is not really about any optimizations, it is more about a good API for
tables and managing storage. If a memory table can be implemented entirely
in the extension, with the storage fully managed by it, there is a lot of
interesting potential for various implementations of table backends. For
this to work I think it is necessary to be able to handle schema changes in
the backend storage in addition to scans, inserts, updates, and deletes,
but I am not sure if it is already possible in some way that I haven't
discovered or if I should just try to propose something (making the storage
manager API extensible seems like a good first attempt).


> Back to schema change handling:
>
> For some schema changes postgres assumes that they can be done
> "in-place", e.g. adding a column to a table.
>
> Other changes, e.g. changing the type of a column "sufficiently", will
> cause a so called table rewrite. Which means that a new relation will be
> created (including a call to relation_set_new_filenode()), then that new
> relation will get all the tuples from the old relation inserted.

Table AM and DDLs

2021-02-21 Thread Mats Kindahl
Hi all,

I am quite new to PostgreSQL so forgive me if my understanding of the code
below is wrong and please clarify what I have misunderstood.

I started to experiment with the table access method interface to see if it
can be used for some ideas I have.

For the experiment, I am using a simple in-memory table access method that
stores all the data as shared memory pages instead of disk pages. I know
there are other ways to get good performance, but this implementation is
good enough for my experiments since it tests a few things with the Table
AM interface that I am wondering about.

Now, the first question I was looking at is whether it is possible to
handle DDL properly if you have non-standard storage: creating new storage
blocks on table creation, cleaning up when dropping a table, as well as
handling schema changes on ALTER TABLE.

Creating new blocks for a table is straightforward to implement by using
the `relation_set_new_filenode` callback where you can create new memory
blocks for a relation, but I cannot find a way to clean up those blocks
when the table is dropped nor a way to handle a change of the schema for a
table.
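
For contrast, the creation side could look roughly like this for an
in-memory AM; a sketch only, and the mem_* names are made up:

static void
mem_relation_set_new_filenode(Relation rel,
							  const RelFileNode *newrnode,
							  char persistence,
							  TransactionId *freezeXid,
							  MultiXactId *minmulti)
{
	/* Purely in-memory storage needs no xid-based freezing */
	*freezeXid = InvalidTransactionId;
	*minmulti = InvalidMultiXactId;

	/* Allocate shared-memory blocks keyed by the new file node
	 * (mem_blocks_create is a hypothetical helper). */
	mem_blocks_create(newrnode, persistence);
}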

The `relation_set_new_filenode` is indirectly called from
`heap_create_with_catalog`, but there is no corresponding callback from
`heap_drop_with_catalog`. It also seems like the intention is that the
callback should call `RelationCreateStorage` itself (makes sense, since the
access method knows about how to use the storage), so it seems natural to
add a `relation_reset_filenode` to the table AM that is called from
`heap_drop_with_catalog` for tables and add that to the heap implementation
(see the attached patch).

Altering the schema does not seem to be covered at all, but this is
something that table access methods need to know about since they might
want to optimize the internal storage when the schema changes. I have not been
able to find any discussions around this, but it seems like a natural thing
to do with a table. Have I misunderstood how this works?

Best wishes,
Mats Kindahl
Timescale
From 354ec1b5e86c673c27d1e578c8cddfa483fac659 Mon Sep 17 00:00:00 2001
From: Mats Kindahl 
Date: Sun, 14 Feb 2021 18:58:43 +0100
Subject: Add callback to reset filenode to table access method

This commit adds a callback to `TableAccessMethod` that is called when
the table should be scheduled for unlinking and implement the method
for the heap access method.
---
 src/backend/access/heap/heapam_handler.c |  7 +++++++
 src/backend/access/table/tableamapi.c    |  1 +
 src/backend/catalog/heap.c               | 23 ++++++++++++++++++++++-
 src/include/access/tableam.h             | 20 ++++++++++++++++++++
 4 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 4a70e20a14..287f8a47cf 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -613,6 +613,12 @@ heapam_relation_set_new_filenode(Relation rel,
 	smgrclose(srel);
 }
 
+static void
+heapam_relation_reset_filenode(Relation rel)
+{
+	RelationDropStorage(rel);
+}
+
 static void
 heapam_relation_nontransactional_truncate(Relation rel)
 {
@@ -2566,6 +2572,7 @@ static const TableAmRoutine heapam_methods = {
 	.index_delete_tuples = heap_index_delete_tuples,
 
 	.relation_set_new_filenode = heapam_relation_set_new_filenode,
+	.relation_reset_filenode = heapam_relation_reset_filenode,
 	.relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
 	.relation_copy_data = heapam_relation_copy_data,
 	.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 325ecdc122..30e070383c 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -83,6 +83,7 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->tuple_lock != NULL);
 
 	Assert(routine->relation_set_new_filenode != NULL);
+	Assert(routine->relation_reset_filenode != NULL);
 	Assert(routine->relation_nontransactional_truncate != NULL);
 	Assert(routine->relation_copy_data != NULL);
 	Assert(routine->relation_copy_for_cluster != NULL);
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 9abc4a1f55..e1e8b93285 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1990,7 +1990,28 @@ heap_drop_with_catalog(Oid relid)
 	 * Schedule unlinking of the relation's physical files at commit.
 	 */
 	if (RELKIND_HAS_STORAGE(rel->rd_rel->relkind))
-		RelationDropStorage(rel);
+		switch (rel->rd_rel->relkind)
+		{
+			case RELKIND_VIEW:
+			case RELKIND_COMPOSITE_TYPE:
+			case RELKIND_FOREIGN_TABLE:
+			case RELKIND_PARTITIONED_TABLE:
+			case RELKIND_PARTITIONED_INDEX:
+				Assert(false);
+				break;
+
+			case RELKIND_INDEX:
+			case RELKIND_SEQUENCE:
+				RelationDropStorage(rel);
+				break;
+
+			case RELKIND_RELATION:
+			cas