Re: Synchronizing slots from primary to standby

2023-06-26 Thread Amit Kapila
On Mon, Jun 26, 2023 at 11:15 AM Drouvot, Bertrand
 wrote:
>
> On 6/20/23 12:22 PM, Amit Kapila wrote:
> > On Mon, Jun 19, 2023 at 9:56 PM Drouvot, Bertrand
> >  wrote:
>
> >> In such a case (slot valid on the primary but invalidated on the standby) 
> >> then I think we
> >> could drop and recreate the invalidated slot on the standby.
> >>
> >
> > Will it be safe? Because after recreating the slot, it will reserve
> > the new WAL location and build the snapshot based on that which might
> > miss some important information in the snapshot. For example, to
> > update the slot's position with new information from the primary, the
> > patch uses pg_logical_replication_slot_advance() which means it will
> > process all records and update the snapshot via
> > DecodeCommit->SnapBuildCommitTxn().
>
> Your concern is that the slot could have been consumed on the standby?
>
> I mean, if we suppose the "synchronized" slot can't be consumed on the
> standby, then
> dropping/recreating such an invalidated slot would be ok?
>

That also may not be sufficient because as soon as the slot is
invalidated/dropped, the required WAL could be removed on the standby.

-- 
With Regards,
Amit Kapila.




Fwd: Castable Domains for different JSON representations

2023-06-26 Thread Steve Chavez
> The bigger picture here, though, is what are you really buying
> compared to just invoking the special conversion function explicitly?
> If you have to write "sometsrangecolumn::mytsrange::json", that's
> not shorter and certainly not clearer than writing a function call.

The main benefit is to be able to call `json_agg` on tables with these
custom json representations. Then the defined json casts work
transparently when doing:

select json_agg(x) from mytbl x;
   json_agg
---
 [{"id":1,"val":{"lower" : "2022-12-31T11:00:00", "upper" :
"2023-01-01T06:00:00", "lower_inc" : false, "upper_inc" : false}}]

-- example table
create table mytbl(id int, val mytsrange);
insert into mytbl values (1, '(2022-12-31 11:00, 2023-01-01 06:00)');

This output is directly consumable by web applications and, as
you can see, the expression is pretty short, with no need to use
explicit casts as `json_agg` already applies them internally.
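For readers following along: the definition of `mytsrange` is not shown in
this thread. A minimal sketch of what such a castable domain could look
like (the function body and names below are assumptions for illustration,
not the actual patch; the cast on the domain only takes effect under the
proposed behaviour):

```sql
-- Hypothetical sketch: a domain over tsrange with a custom cast to json.
create domain mytsrange as tsrange;

create function mytsrange_to_json(mytsrange) returns json as $$
  select json_build_object(
    'lower',     lower($1),
    'upper',     upper($1),
    'lower_inc', lower_inc($1),
    'upper_inc', upper_inc($1));
$$ language sql immutable;

-- With the proposal, json_agg would pick up this domain cast transparently.
create cast (mytsrange as json) with function mytsrange_to_json(mytsrange);
```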

Best regards,
Steve


Clean up JumbleQuery() from query text

2023-06-26 Thread Michael Paquier
Hi all,

Joe reported to me offlist that JumbleQuery() includes a dependency
on the query text, but we don't use that anymore as the query ID is
generated from the Query structure instead.

Any thoughts about the attached cleanup?  This is simple and a new
thing, so I'd rather clean it up now rather than later.

It is not urgent, so I am fine with postponing it until after beta2 is
released on 17~, if there are any objections to that, of course.

Thoughts?
--
Michael
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index 3499bfd4a0..7649e095aa 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -64,7 +64,7 @@ extern PGDLLIMPORT int compute_query_id;
 
 
 extern const char *CleanQuerytext(const char *query, int *location, int *len);
-extern JumbleState *JumbleQuery(Query *query, const char *querytext);
+extern JumbleState *JumbleQuery(Query *query);
 extern void EnableQueryId(void);
 
 extern PGDLLIMPORT bool query_id_enabled;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 15f9bddcdf..8570b14f62 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -255,7 +255,7 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 
 	query = castNode(Query, stmt->query);
 	if (IsQueryIdEnabled())
-		jstate = JumbleQuery(query, pstate->p_sourcetext);
+		jstate = JumbleQuery(query);
 
 	if (post_parse_analyze_hook)
 		(*post_parse_analyze_hook) (pstate, query, jstate);
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index d7fd72d70f..281907a4d8 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -97,7 +97,7 @@ CleanQuerytext(const char *query, int *location, int *len)
 }
 
 JumbleState *
-JumbleQuery(Query *query, const char *querytext)
+JumbleQuery(Query *query)
 {
 	JumbleState *jstate = NULL;
 
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 70932dba61..4006632092 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -124,7 +124,7 @@ parse_analyze_fixedparams(RawStmt *parseTree, const char *sourceText,
 	query = transformTopLevelStmt(pstate, parseTree);
 
 	if (IsQueryIdEnabled())
-		jstate = JumbleQuery(query, sourceText);
+		jstate = JumbleQuery(query);
 
 	if (post_parse_analyze_hook)
 		(*post_parse_analyze_hook) (pstate, query, jstate);
@@ -166,7 +166,7 @@ parse_analyze_varparams(RawStmt *parseTree, const char *sourceText,
 	check_variable_parameters(pstate, query);
 
 	if (IsQueryIdEnabled())
-		jstate = JumbleQuery(query, sourceText);
+		jstate = JumbleQuery(query);
 
 	if (post_parse_analyze_hook)
 		(*post_parse_analyze_hook) (pstate, query, jstate);
@@ -203,7 +203,7 @@ parse_analyze_withcb(RawStmt *parseTree, const char *sourceText,
 	query = transformTopLevelStmt(pstate, parseTree);
 
 	if (IsQueryIdEnabled())
-		jstate = JumbleQuery(query, sourceText);
+		jstate = JumbleQuery(query);
 
 	if (post_parse_analyze_hook)
 		(*post_parse_analyze_hook) (pstate, query, jstate);




pg_decode_message vs skip_empty_xacts and xact_wrote_changes

2023-06-26 Thread Ashutosh Bapat
Hi All,
Every pg_decode routine that decodes a transactional change, except
pg_decode_message(), has the following block:
/* output BEGIN if we haven't yet */
if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
{
pg_output_begin(ctx, data, txn, false);
}
txndata->xact_wrote_changes = true;

But pg_decode_message() doesn't call pg_output_begin(). If a WAL
message is the first change in the transaction, it won't have a BEGIN
before it. That looks like a bug. Why is pg_decode_message() an
exception?
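For concreteness, a sketch (untested, not a patch; the rest of the function
body is elided) of how pg_decode_message() in test_decoding could adopt the
same guard for transactional messages, while leaving non-transactional
messages, which have no surrounding transaction, untouched:

```c
/* Sketch only: add the BEGIN guard the other callbacks already have. */
static void
pg_decode_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                  XLogRecPtr lsn, bool transactional,
                  const char *prefix, Size sz, const char *message)
{
    TestDecodingData *data = ctx->output_plugin_private;

    if (transactional && data->skip_empty_xacts)
    {
        TestDecodingTxnData *txndata = txn->output_plugin_private;

        /* output BEGIN if we haven't yet */
        if (!txndata->xact_wrote_changes)
            pg_output_begin(ctx, data, txn, false);
        txndata->xact_wrote_changes = true;
    }

    /* ... existing message output unchanged ... */
}
```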

-- 
Best Wishes,
Ashutosh Bapat




Re: Do we want a hashset type?

2023-06-26 Thread jian he
On Mon, Jun 26, 2023 at 4:36 AM Joel Jacobson  wrote:
>
> On Sun, Jun 25, 2023, at 11:42, Joel Jacobson wrote:
> > SELECT hashset_contains('{}'::int4hashset, NULL::int);
> >
> > would be False, according to the General Rules.
> >
> ...
> > Applying the same rules, we'd have to return Unknown (which we
> > represent as
> > null) for:
> >
> > SELECT hashset_contains('{null}'::int4hashset, NULL::int);
> >
>
> Aha! I just discovered to my surprise that the corresponding array
> queries gives the same result:
>
> SELECT NULL = ANY(ARRAY[]::int[]);
>  ?column?
> --
>  f
> (1 row)
>
> SELECT NULL = ANY(ARRAY[NULL]::int[]);
>  ?column?
> --
>
> (1 row)
>
> I have no more objections; let's stick to the same null semantics as
> arrays and multisets.
>
> /Joel

Can you try to glue the attached to the hashset data type input function?
The attached will parse a cstring with or without double quotes, so
'{1,2,3}' == '{"1","2","3"}'. Obviously, quoting preserves the inner
string as is. Currently the int4hashset input is delimited by commas; if
you want to deal with ranges then you need to escape the comma.
/*

gcc -I/home/jian/postgres/2023_05_25_beta5421/include/server -fPIC -c /home/jian/Desktop/regress_pgsql/input_validate.c
gcc -shared  -o /home/jian/Desktop/regress_pgsql/input_validate.so /home/jian/Desktop/regress_pgsql/input_validate.o

CREATE OR REPLACE FUNCTION str_delim_count_validate(cstring) RETURNS BOOL SET search_path from current
AS '/home/jian/Desktop/regress_pgsql/input_validate', 'str_delim_count_validate'
LANGUAGE C IMMUTABLE;

select str_delim_count_validate('{"23890","2","3",  "a",1,2,3,4,NULL,2022-01-01,"[1,2]"}');
select str_delim_count_validate('{"3 ", }'); --fail
select str_delim_count_validate('{"3 " }'); --ok
select str_delim_count_validate('{"""23890"}'); --fail.
select str_delim_count_validate('{}'); --ok
select str_delim_count_validate('}');   --fail.
select str_delim_count_validate('{');  --fail.
select str_delim_count_validate('{{}}');  --fail.
select str_delim_count_validate('{{}}');  --fail.
select str_delim_count_validate('{"22022-01-01,[1,2]}'); --fail.
select str_delim_count_validate('{" 2022-01-01 "}'); --ok
select str_delim_count_validate('{ 2022-01-01}'); --ok
select str_delim_count_validate('{  2022-01-01  ,"[1,2]"}  '); --ok
select str_delim_count_validate('{ 2023-06-26 16:45:02.454293+08   ,"2","3"}'); --ok.
select str_delim_count_validate('{"\\t"}');--ok
*/
#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "utils/builtins.h"
#include "utils/numeric.h"
#include "funcapi.h"
#include "utils/lsyscache.h"
#include "utils/fmgrprotos.h"
#include "common/hashfn.h"
PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(str_delim_count_validate);
static	int SetCount(const char *str, char typdelim, Node *escontext);
static bool ReadSetStr(char *arrayStr,const char *origStr, char typdelim, Node *escontext); 
static bool set_isspace(char ch);

Datum
str_delim_count_validate(PG_FUNCTION_ARGS)
{
	char	   *string = PG_GETARG_CSTRING(0);
	char	   *string_save;
	char	   *p;
	char		typdelim = ',';
	int			nitems;

	/* Make a modifiable copy of the input */
	string_save = pstrdup(string);
	p = string_save;
	nitems = SetCount(p, typdelim, fcinfo->context);

	if (!ReadSetStr(p, string, typdelim, fcinfo->context))
		elog(INFO, "delimiting string failed");

	elog(INFO, "line %d nitems=%d", __LINE__, nitems);
	PG_RETURN_BOOL(true);
}


/*
 * set_isspace() --- a non-locale-dependent isspace()
 *
 * We used to use isspace() for parsing array values, but that has
 * undesirable results: an array value might be silently interpreted
 * differently depending on the locale setting.  Now we just hard-wire
 * the traditional ASCII definition of isspace().
 */
static bool
set_isspace(char ch)
{
	if (ch == ' ' ||
		ch == '\t' ||
		ch == '\n' ||
		ch == '\r' ||
		ch == '\v' ||
		ch == '\f')
		return true;
	return false;
}

static bool
ReadSetStr(char *arrayStr,const char *origStr, char typdelim, Node *escontext)
{
	int			i;
	char	   *srcptr;
	bool		in_quotes = false;
	bool		eoArray = false;
	bool		hasnull;
	int32		totbytes;
	int			indx = 0;

	/*
	 * We have to remove " and \ characters to create a clean item value to
	 * pass to the datatype input routine.  We overwrite each item value
	 * in-place within arrayStr to do this.  srcptr is the current scan point,
	 * and dstptr is where we are copying to.
	 *
	 * We also want to suppress leading and trailing unquoted whitespace. We
	 * use the leadingspace flag to suppress leading space.  Trailing space is
	 * tracked by using dstendptr to point to the last significant output
	 * character.
	 *
	 * The error checking in this routine is mostly pro-forma, since we expect
	 * that SetCount() already validated the string.  So we don't bother
	 * with errdetail messages.
	 */
srcptr  = arrayStr;

Re: [Patch] Use *other* indexes on the subscriber when REPLICA IDENTITY is FULL

2023-06-26 Thread Önder Kalacı
Hi Hayato, all

>
>
> This is a follow-up thread of [1]. The commit allowed subscribers to use
> indexes
> other than PK and REPLICA IDENTITY when REPLICA IDENTITY is FULL on
> publisher,
> but the index must be a B-tree. In this proposal, I aim to extend this
> functionality to allow
> for hash indexes and potentially other types.
>
>
Cool, thanks for taking the time to work on this.


> # Current problem
>
> The current limitation comes from the function build_replindex_scan_key(),
> specifically these lines:
>

When I last dealt with the same issue, I was examining it from a slightly
broader perspective. I think
my conclusion was that RelationFindReplTupleByIndex() is designed for the
constraints of UNIQUE INDEX
and Primary Key. Hence, the btree limitation was a given.

So, my main point is that it might be useful to check
RelationFindReplTupleByIndex() once more in detail
to see if there is anything else that is specific to btree indexes.

build_replindex_scan_key() is definitely one of the major culprits but see
below as well.

I think we should also be mindful about tuples_equal() function. When an
index returns more than
one tuple, we rely on tuples_equal() function to make sure non-relevant
tuples are skipped.

For btree indexes, it was safe to rely on that function as the columns that
are indexed using btree
always have equality operator. I think we can safely assume the same for
hash indexes.

However, say we indexed the "point" type using a "gist" index. Then, if we
let this logic kick in,
I think tuples_equal() would fail, saying that no equality operator
exists.

One might argue that it is already the case for RelationFindReplTupleSeq()
or when you
have an index but it is on a different column. But still, it seems useful
to make sure
you are aware of this limitation as well.
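To illustrate the concern: the built-in "point" type has no equality
operator at all (only the ~= "same as" operator), so an equality
comparison on it fails outright:

```sql
-- "point" provides ~= but no =, so this raises an error:
select '(1,2)'::point = '(1,2)'::point;
-- ERROR:  operator does not exist: point = point
```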


>
> ## Current difficulties
>
> The challenge with supporting other indexes is that they lack a fixed set
> of strategies,
> making it difficult to choose the correct strategy number based on the
> index
> access method. Even within the same index type, different operator classes
> can
> use different strategy numbers for the same operation.
> E.g. [2] shows that number 6 can be used for the purpose, but other
> operator classes
> added by btree_gist [3] seem to use number 3 for the equality comparison.
>
>
Also, build_replindex_scan_key() seems like too late a place to check this.
I mean, what
if there is no equality operator, how should the code react to that? It
effectively becomes
RelationFindReplTupleSeq(), so maybe better to follow that route upfront?

In other words, maybe that decision should
happen in IsIndexUsableForReplicaIdentityFull()?

For the specific notes you raised about strategy numbers / operator
classes, I need to
study a bit :) Though, I'll be available to do that early next week.

Thanks,
Onder


Re: Optimize walsender handling invalid messages of 'drop publication'

2023-06-26 Thread Bowen Shi
Dears,

This issue has been pending for several months without any response.
And this problem still exists in the latest minor versions of PG 12
and PG 13.

I believe that the fix in this patch is helpful.

The patch has been submitted
https://commitfest.postgresql.org/43/4393/ .  Anyone who is interested
in this issue can help with the review.

Regards!
BowenShi

On Mon, 27 Feb 2023 at 16:12, Bowen Shi  wrote:
>
> Hi all,
>
> Before PG 14, the walsender process has to handle invalidation messages
> in one XLOG record (PG 14 provides a particular XLOG type:
> XLOG_XACT_INVALIDATIONS). This may bring some problems, which have been
> discussed in a previous mail:
> https://www.postgresql.org/message-id/flat/CAM_vCufO3eeRZ_O04z9reiE%2BB644%2BRgczbAVo9C5%2BoHV9S7%2B-g%40mail.gmail.com#981e65567784e0aefa4474cc3fd840f6
>
> This patch can solve the problem. It has three parts:
> 1. pgoutput does not do useless cache invalidations anymore;
> 2. Add a relid->relfilenode hash map to avoid hash seq searches;
> 3. test case: it needs two or three minutes to finish.
>
> The patch is based on the HEAD of branch REL_13_STABLE. It also works
> for PG 10~12.
>
> Thanks.
> Bowenshi




Re: Row pattern recognition

2023-06-26 Thread Tatsuo Ishii
>> In this case, we should require the user to specify AFTER MATCH SKIP
>> TO NEXT ROW so that behavior doesn't change when we implement the
>> standard default.  (Your patch might do this already.)
> 
> Agreed. I will implement AFTER MATCH SKIP PAST LAST ROW in the next
> patch and I will change the default to AFTER MATCH SKIP PAST LAST ROW.

Attached is the v2 patch to add support for AFTER MATCH SKIP PAST LAST
ROW and AFTER MATCH SKIP TO NEXT ROW. The default is AFTER MATCH
SKIP PAST LAST ROW, as the standard specifies. Here are some examples to
demonstrate how those clauses affect the query result.

SELECT i, rpr(i) OVER w
  FROM (VALUES (1), (2), (3), (4)) AS v (i)
  WINDOW w AS (
   ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
   AFTER MATCH SKIP PAST LAST ROW
   PATTERN (A B)
   DEFINE
A AS i <= 2,
B AS i <= 3
);
 i | rpr 
---+-
 1 |   1
 2 |
 3 |
 4 |
(4 rows)

In this example rpr starts from i = 1 and finds that row i = 1
satisfies A and that row i = 2 satisfies B. Then rpr moves to row i = 3
and finds that it does not satisfy A, thus the result is NULL. The
same can be said of row i = 4.

SELECT i, rpr(i) OVER w
  FROM (VALUES (1), (2), (3), (4)) AS v (i)
  WINDOW w AS (
   ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
   AFTER MATCH SKIP TO NEXT ROW
   PATTERN (A B)
   DEFINE
A AS i <= 2,
B AS i <= 3
);
 i | rpr 
---+-
 1 |   1
 2 |   2
 3 |
 4 |
(4 rows)

In this example rpr starts from i = 1 and finds that row i = 1
satisfies A and that row i = 2 satisfies B (same as above). Then rpr
moves to row i = 2, rather than 3, because AFTER MATCH SKIP TO NEXT ROW
is specified.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
From 92aebdb8a8c6d3739ee49b755f281e77c34f5d36 Mon Sep 17 00:00:00 2001
From: Tatsuo Ishii 
Date: Mon, 26 Jun 2023 17:05:35 +0900
Subject: [PATCH v2 1/7] Row pattern recognition patch for raw parser.

---
 src/backend/parser/gram.y   | 218 +---
 src/include/nodes/parsenodes.h  |  54 
 src/include/parser/kwlist.h |   8 ++
 src/include/parser/parse_node.h |   1 +
 4 files changed, 266 insertions(+), 15 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 39ab7eac0d..5cd3ebaa98 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -251,6 +251,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	DefElem	   *defelt;
 	SortBy	   *sortby;
 	WindowDef  *windef;
+	RPCommonSyntax	*rpcom;
+	RPSubsetItem	*rpsubset;
 	JoinExpr   *jexpr;
 	IndexElem  *ielem;
 	StatsElem  *selem;
@@ -453,8 +455,12 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 TriggerTransitions TriggerReferencing
 vacuum_relation_list opt_vacuum_relation_list
 drop_option_list pub_obj_list
-
-%type 	opt_routine_body
+row_pattern_measure_list row_pattern_definition_list
+opt_row_pattern_subset_clause
+row_pattern_subset_list row_pattern_subset_rhs
+row_pattern
+%type 	 row_pattern_subset_item
+%type 	opt_routine_body row_pattern_term
 %type  group_clause
 %type 	group_by_list
 %type 	group_by_item empty_grouping_set rollup_clause cube_clause
@@ -551,6 +557,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 	relation_expr_opt_alias
 %type 	tablesample_clause opt_repeatable_clause
 %type 	target_el set_target insert_column_item
+row_pattern_measure_item row_pattern_definition
 
 %type 		generic_option_name
 %type 	generic_option_arg
@@ -633,6 +640,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 	window_clause window_definition_list opt_partition_clause
 %type 	window_definition over_clause window_specification
 opt_frame_clause frame_extent frame_bound
+%type 	opt_row_pattern_common_syntax opt_row_pattern_skip_to
+%type 	opt_row_pattern_initial_or_seek
+%type 	opt_row_pattern_measures
 %type 	opt_window_exclusion_clause
 %type 		opt_existing_window_name
 %type  opt_if_not_exists
@@ -645,7 +655,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 		hash_partbound
 %type 		hash_partbound_elem
 
-
 %type 		json_format_clause_opt
 	json_value_expr
 	json_output_clause_opt
@@ -705,7 +714,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE
 
 	DATA_P DATABASE DAY_P DEALLOCATE DEC DECIMAL_P DECLARE DEFAULT DEFAULTS
-	DEFERRABLE DEFERRED DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC
+	DEFERRABLE DEFERRED DEFINE DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC
 	DETACH DICTIONARY DISABLE_P DISCARD DISTINCT DO DOCUMENT_P DOMAIN_P
 	DOUBLE_P DROP
 
@@ -721,7 +730,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	HANDLER HAVING HEADER_P HOLD HOUR_P
 
 

Re: Parallelize correlated subqueries that execute within each worker

2023-06-26 Thread Richard Guo
On Mon, Jun 12, 2023 at 10:23 AM James Coleman  wrote:

> BTW are you by any chance testing on ARM macOS? I reproduced the issue
> there, but for some reason I did not reproduce the error (and the plan
> wasn't parallelized) when I tested this on linux. Perhaps I missed
> setting something up; it seems odd.


Hmm, that's weird.  I was also testing that query on linux.  But please
note that several GUC settings are needed to generate a parallel plan for
that query.

set min_parallel_table_scan_size to 0;
set parallel_setup_cost to 0;
set parallel_tuple_cost to 0;

Thanks
Richard


Re: Add GUC to tune glibc's malloc implementation.

2023-06-26 Thread Ronan Dunklau
Le vendredi 23 juin 2023, 22:55:51 CEST Peter Eisentraut a écrit :
> On 22.06.23 15:35, Ronan Dunklau wrote:
> > The thing is, by default, those parameters are adjusted dynamically by the
> > glibc itself. It starts with quite small thresholds, and raises them when
> > the program frees some memory, up to a certain limit. This patch proposes
> > a new GUC allowing the user to adjust those settings according to their
> > workload.
> > 
> > This can cause problems. Let's take for example a table with 10k rows, and
> > 32 columns (as defined by a bench script David Rowley shared last year
> > when discussing the GenerationContext for tuplesort), and execute the
> > following
> > query, with 32MB of work_mem:

> I don't follow what you are trying to achieve with this.  The examples
> you show appear to work sensibly in my mind.  Using this setting, you
> can save some of the adjustments that glibc does after the first query.
> But that seems only useful if your session only does one query.  Is that
> what you are doing?

No, not at all: glibc does not do the right thing, and we don't "save" it.
I will try to rephrase that.

In the first test case I showed, we see that glibc adjusts its threshold, but 
to a suboptimal value since repeated executions of a query needing the same 
amount of memory will release it back to the kernel, and move the brk pointer 
again, and will not adjust it again. On the other hand, by manually adjusting 
the thresholds, we can set them to a higher value which means that the memory 
will be kept in malloc's freelist for reuse for the next queries. As shown in 
the benchmark results I posted, this can have quite a dramatic effect, going 
from 396 tps to 894.  For ease of benchmarking, it is a single query being 
executed over and over again, but the same thing would be true if different 
queries allocating memories were executed by a single backend. 

The worst part of this means it is unpredictable: depending on past memory 
allocation patterns, glibc will end up in different states, and exhibit 
completely different performance for all subsequent queries. In fact, this is 
what Tomas noticed last year, (see [0]),  which led to investigation into 
this. 

I also tried to show that in certain cases glibc's behaviour can, on the
contrary, be too greedy, and hold on to too much memory if we just need
the memory once and never allocate it again.

I hope what I'm trying to achieve is clearer that way. Maybe this patch is not 
the best way to go about this, but since the memory allocator behaviour can 
have such an impact it's a bit sad we have to leave half the performance on 
the table because of it when there are easily accessible knobs to avoid it.

[0] 
https://www.postgresql.org/message-id/bcdd4e3e-c12d-cd2b-7ead-a91ad416100a%40enterprisedb.com






Clean up command argument assembly

2023-06-26 Thread Peter Eisentraut

This is a small code cleanup patch.

Several commands internally assemble command lines to call other 
commands.  This includes initdb, pg_dumpall, and pg_regress.  (Also 
pg_ctl, but that is different enough that I didn't consider it here.) 
This has all evolved a bit organically, with fixed-size buffers, and 
various optional command-line arguments being injected with 
confusing-looking code, and the spacing between options handled in 
inconsistent ways.  This patch cleans all this up a bit to look clearer 
and be more easily extensible with new arguments and options.  We start 
each command with printfPQExpBuffer(), and then append arguments as 
necessary with appendPQExpBuffer().  Also standardize on using 
initPQExpBuffer() over createPQExpBuffer() where possible.  pg_regress 
uses StringInfo instead of PQExpBuffer, but many of the same ideas apply.

From dbd9d7b7ff2c069a131c9efa8bff2597c0d9e1c8 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 26 Jun 2023 11:30:24 +0200
Subject: [PATCH] Clean up command argument assembly

Several commands internally assemble command lines to call other
commands.  This includes initdb, pg_dumpall, and pg_regress.  (Also
pg_ctl, but that is different enough that I didn't consider it here.)
This has all evolved a bit organically, with fixed-size buffers, and
various optional command-line arguments being injected with
confusing-looking code, and the spacing between options handled in
inconsistent ways.  Clean all this up a bit to look clearer and be
more easily extensible with new arguments and options.  We start each
command with printfPQExpBuffer(), and then append arguments as
necessary with appendPQExpBuffer().  Also standardize on using
initPQExpBuffer() over createPQExpBuffer() where possible.  pg_regress
uses StringInfo instead of PQExpBuffer, but many of the same ideas
apply.
---
 src/bin/initdb/initdb.c   | 56 +++
 src/bin/pg_dump/pg_dumpall.c  | 27 +
 src/test/regress/pg_regress.c | 26 +---
 3 files changed, 62 insertions(+), 47 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index fc1fb363e7..3f4167682a 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -309,16 +309,16 @@ void  initialize_data_directory(void);
 /*
  * macros for running pipes to postgres
  */
-#define PG_CMD_DECLchar cmd[MAXPGPATH]; FILE *cmdfd
+#define PG_CMD_DECLFILE *cmdfd
 
-#define PG_CMD_OPEN \
+#define PG_CMD_OPEN(cmd) \
 do { \
cmdfd = popen_check(cmd, "w"); \
if (cmdfd == NULL) \
exit(1); /* message already printed by popen_check */ \
 } while (0)
 
-#define PG_CMD_CLOSE \
+#define PG_CMD_CLOSE() \
 do { \
if (pclose_check(cmdfd)) \
exit(1); /* message already printed by pclose_check */ \
@@ -1138,13 +1138,15 @@ test_config_settings(void)
 static bool
 test_specific_config_settings(int test_conns, int test_buffs)
 {
-   PQExpBuffer cmd = createPQExpBuffer();
+   PQExpBufferData cmd;
_stringlist *gnames,
   *gvalues;
int status;
 
+   initPQExpBuffer(&cmd);
+
/* Set up the test postmaster invocation */
-   printfPQExpBuffer(cmd,
+   printfPQExpBuffer(&cmd,
  "\"%s\" --check %s %s "
  "-c max_connections=%d "
  "-c shared_buffers=%d "
@@ -1158,18 +1160,18 @@ test_specific_config_settings(int test_conns, int test_buffs)
 gnames != NULL;/* assume lists have the same length */
 gnames = gnames->next, gvalues = gvalues->next)
{
-   appendPQExpBuffer(cmd, " -c %s=", gnames->str);
-   appendShellString(cmd, gvalues->str);
+   appendPQExpBuffer(&cmd, " -c %s=", gnames->str);
+   appendShellString(&cmd, gvalues->str);
}
 
-   appendPQExpBuffer(cmd,
+   appendPQExpBuffer(,
  " < \"%s\" > \"%s\" 2>&1",
  DEVNULL, DEVNULL);
 
fflush(NULL);
-   status = system(cmd->data);
+   status = system(cmd.data);
 
-   destroyPQExpBuffer(cmd);
+   termPQExpBuffer(&cmd);
 
return (status == 0);
 }
@@ -1469,6 +1471,7 @@ static void
 bootstrap_template1(void)
 {
PG_CMD_DECL;
+   PQExpBufferData cmd;
char  **line;
char  **bki_lines;
charheaderline[MAXPGPATH];
@@ -1530,16 +1533,17 @@ bootstrap_template1(void)
/* Also ensure backend isn't confused by this environment var: */
unsetenv("PGCLIENTENCODING");
 
-   snprintf(cmd, sizeof(cmd),
-"\"%s\" --boot -X %d %s %s %s %s",
-backend_exec,
-wal_segment_size_mb * (1024 * 1024),
-  

Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

2023-06-26 Thread Andrey Lepikhov

On 24/6/2023 17:23, Tomas Vondra wrote:

I really hope what I just wrote makes at least a little bit of sense.

Throw in one more example:

SELECT i AS id INTO l FROM generate_series(1,10) i;
CREATE TABLE r (id int8, v text);
INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
ANALYZE l,r;
EXPLAIN ANALYZE
SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;

Here you can see the same kind of underestimation:
Hash Left Join  (... rows=500 width=14) (... rows=9 ...)

So the eqjoinsel_unmatch_left() function should be modified for the case 
where nd1

--
regards,
Andrey Lepikhov
Postgres Professional





Re: pg_decode_message vs skip_empty_xacts and xact_wrote_changes

2023-06-26 Thread Amit Kapila
On Mon, Jun 26, 2023 at 3:07 PM Ashutosh Bapat
 wrote:
>
> Hi All,
> Every pg_decode routine that decodes a transactional change, except
> pg_decode_message(), has the following block:
> /* output BEGIN if we haven't yet */
> if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
> {
> pg_output_begin(ctx, data, txn, false);
> }
> txndata->xact_wrote_changes = true;
>
> But pg_decode_message() doesn't call pg_output_begin(). If a WAL
> message is the first change in the transaction, it won't have a BEGIN
> before it. That looks like a bug. Why is pg_decode_message() an
> exception?
>

I can't see a reason why we shouldn't have a similar check for
transactional messages. So, agreed this is a bug.

-- 
With Regards,
Amit Kapila.




Improve comment on cid mapping (was Re: Adding CommandID to heap xlog records)

2023-06-26 Thread Heikki Linnakangas

On 28/02/2023 15:52, Heikki Linnakangas wrote:

So unfortunately I don't see much opportunity to simplify logical
decoding with this. However, please take a look at the first two patches
attached. They're tiny cleanups that make sense on their own.


Rebased these small patches. I'll add this to the commitfest.

--
Heikki Linnakangas
Neon (https://neon.tech)
From 66289440ac65ea386cd138aeeed27b0032c2bb80 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 26 Jun 2023 09:48:26 +0300
Subject: [PATCH v2 1/2] Improve comment on why we need ctid->(cmin,cmax)
 mapping.

Combocids are only part of the problem. Explain the problem in more detail.

Discussion: https://www.postgresql.org/message-id/1ba2899e-77f8-7866-79e5-f3b7d1251...@iki.fi
---
 src/backend/replication/logical/snapbuild.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0786bb0ab7..e403feeccd 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -41,10 +41,15 @@
  * transactions we need Snapshots that see intermediate versions of the
  * catalog in a transaction. During normal operation this is achieved by using
  * CommandIds/cmin/cmax. The problem with that however is that for space
- * efficiency reasons only one value of that is stored
- * (cf. combocid.c). Since combo CIDs are only available in memory we log
- * additional information which allows us to get the original (cmin, cmax)
- * pair during visibility checks. Check the reorderbuffer.c's comment above
+ * efficiency reasons, the cmin and cmax are not included in WAL records. We
+ * cannot read the cmin/cmax from the tuple itself, either, because it is
+ * reset on crash recovery. Even if we could, we could not decode combocids
+ * which are only tracked in the original backend's memory. To work around
+ * that, heapam writes an extra WAL record (XLOG_HEAP2_NEW_CID) every time a
+ * catalog row is modified, which includes the cmin and cmax of the
+ * tuple. During decoding, we insert the ctid->(cmin,cmax) mappings into the
+ * reorder buffer, and use them at visibility checks instead of the cmin/cmax
+ * on the tuple itself. Check the reorderbuffer.c's comment above
  * ResolveCminCmaxDuringDecoding() for details.
  *
  * To facilitate all this we need our own visibility routine, as the normal
-- 
2.30.2

From 9140a0d98fd21b595eac6d75521a6b1a9f1b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 26 Jun 2023 09:56:02 +0300
Subject: [PATCH v2 2/2] Remove redundant check for fast_forward.

We already checked for it earlier in the function.

Discussion: https://www.postgresql.org/message-id/1ba2899e-77f8-7866-79e5-f3b7d1251...@iki.fi
---
 src/backend/replication/logical/decode.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d91055a440..7039d425e2 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -422,8 +422,7 @@ heap2_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	switch (info)
 	{
 		case XLOG_HEAP2_MULTI_INSERT:
-			if (!ctx->fast_forward &&
-				SnapBuildProcessChange(builder, xid, buf->origptr))
+			if (SnapBuildProcessChange(builder, xid, buf->origptr))
 				DecodeMultiInsert(ctx, buf);
 			break;
 		case XLOG_HEAP2_NEW_CID:
-- 
2.30.2



Re: 'converts internal representation to "..."' comment is confusing

2023-06-26 Thread Heikki Linnakangas

On 24/06/2023 23:52, Steve Chavez wrote:
On Tue, 16 May 2023 at 07:49, Robert Haas wrote:


On Sun, May 14, 2023 at 9:37 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
 > Steve Chavez <st...@supabase.io> writes:
 > > I found "..." confusing in some comments, so this patch changes
it to
 > > "cstring". Which seems to be the intention after all.
 >
 > Those comments are Berkeley-era, making them probably a decade older
 > than the "cstring" pseudotype (invented in b663f3443).  Perhaps what
 > you suggest is an improvement, but I'm not sure that appealing to
 > original intent can make the case.

FWIW, it does seem like an improvement to me.

Tom, could we apply this patch since Robert agrees it's an improvement?


Looking around at other input/output functions, we're not very
consistent; there are many variants of "converts string to [datatype]",
"converts C string to [datatype]", and "input routine for [datatype]".
They are all fine, even though they're inconsistent. It doesn't seem worth
the code churn to change them.


Anyway, I agree this patch is an improvement, so applied. Thanks!

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: function arguments are not PG_FUNCTION_ARGS, how to pass Node *escontext

2023-06-26 Thread jian he
On Mon, Jun 26, 2023 at 7:50 PM Andrew Dunstan  wrote:
>
>
> On 2023-06-26 Mo 07:20, jian he wrote:
>
> static
> Datum return_numeric_datum(char *token)
> {
> Datum   numd;
> Node*escontext;
>
> if (!DirectInputFunctionCallSafe(numeric_in, token,
> InvalidOid, -1,
> escontext,
> &numd));
> elog(INFO,"something is wrong");
> return numd;
> }
>
>
> To start with, the semicolon at the end of that if appears bogus. The elog is 
> indented to look like it's conditioned by the if but the semicolon makes it 
> not be.
>
> There are compiler switches in modern gcc at least that help you detect 
> things like this.
>
>
> cheers
>
>
> andrew
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com


sorry. It was my mistake.

> Node*escontext;
> if (!DirectInputFunctionCallSafe(numeric_in, token,
> InvalidOid, -1,
> escontext,
> &numd))
> elog(INFO,"something is wrong");

I wonder about the implication of just declaring Node *escontext in here.
In this DirectInputFunctionCallSafe, what does escontext point to.




Re: function arguments are not PG_FUNCTION_ARGS, how to pass Node *escontext

2023-06-26 Thread Andrew Dunstan


On 2023-06-26 Mo 09:45, jian he wrote:



On Mon, Jun 26, 2023 at 9:32 PM jian he  
wrote:

>
> On Mon, Jun 26, 2023 at 7:50 PM Andrew Dunstan  
wrote:

> >
> >
> > On 2023-06-26 Mo 07:20, jian he wrote:
> >
> > static
> > Datum return_numeric_datum(char *token)
> > {
> >     Datum   numd;
> >     Node    *escontext;
> >
> >     if (!DirectInputFunctionCallSafe(numeric_in, token,
> >                                     InvalidOid, -1,
> >                                     escontext,
> >                                     &numd));
> >         elog(INFO,"something is wrong");
> >     return numd;
> > }
> >
> >
> > To start with, the semicolon at the end of that if appears bogus. 
The elog is indented to look like it's conditioned by the if but the 
semicolon makes it not be.

> >
> > There are compiler switches in modern gcc at least that help you 
detect things like this.

> >
> >
> > cheers
> >
> >
> > andrew
> >
> > --
> > Andrew Dunstan
> > EDB: https://www.enterprisedb.com
>
>
> sorry. It was my mistake.
>
> >     Node    *escontext;
> >     if (!DirectInputFunctionCallSafe(numeric_in, token,
> >                                     InvalidOid, -1,
> >                                     escontext,
> >                                     &numd))
> >         elog(INFO,"something is wrong");
>
> I wonder about the implication of just declaring Node *escontext in 
here.

> In this DirectInputFunctionCallSafe, what does escontext point to.

gcc -Wempty-body will detect my error.

However, gcc -Wextra includes -Wempty-body and -Wuninitialized and
others.

I guess the real question here is how to initialize escontext.



See code for pg_input_error_info and friends for examples.
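For illustration, a minimal sketch of that idiom adapted to return_numeric_datum might look like this (a sketch only, not compiled; ErrorSaveContext and T_ErrorSaveContext are the error-saving node types from miscnodes.h that pg_input_error_info itself uses):

```c
/* Sketch only: allocate an ErrorSaveContext on the stack and pass it,
 * cast to Node *, as the escontext argument.  On a soft error the call
 * returns false and escontext.error_occurred is set, instead of ereport
 * jumping away. */
static Datum
return_numeric_datum(char *token)
{
	Datum		numd;
	ErrorSaveContext escontext = {T_ErrorSaveContext};

	if (!DirectInputFunctionCallSafe(numeric_in, token,
									 InvalidOid, -1,
									 (Node *) &escontext,
									 &numd))
		elog(INFO, "conversion failed");
	return numd;
}
```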


cheers


andrew



--
Andrew Dunstan
EDB: https://www.enterprisedb.com


function arguments are not PG_FUNCTION_ARGS, how to pass Node *escontext

2023-06-26 Thread jian he
hi. simple question

The following one works.

Datum
test_direct_inputcall(PG_FUNCTION_ARGS)
{
char*token  = PG_GETARG_CSTRING(0);
Datum   numd;
if (!DirectInputFunctionCallSafe(numeric_in, token,
InvalidOid, -1,
fcinfo->context,
&numd))
{
elog(INFO,"convert to cstring failed");
PG_RETURN_BOOL(false);
}
elog(INFO,"%s",DatumGetCString(DirectFunctionCall1(numeric_out,numd)));
PG_RETURN_BOOL(true);
}

-- the following one does not work; it will print out "something is wrong"

Datum
test_direct_inputcall(PG_FUNCTION_ARGS)
{
char*token  = PG_GETARG_CSTRING(0);
Datum   numd;
numd= return_numeric_datum(token);
elog(INFO,"%s",DatumGetCString(DirectFunctionCall1(numeric_out,numd)));
PG_RETURN_BOOL(true);
}

static
Datum return_numeric_datum(char *token)
{
Datum   numd;
Node*escontext;

if (!DirectInputFunctionCallSafe(numeric_in, token,
InvalidOid, -1,
escontext,
&numd));
elog(INFO,"something is wrong");
return numd;
}

I wonder how to make return_numeric_datum work in functions whose
arguments are not PG_FUNCTION_ARGS.

To make it work, I need to understand the Node *context, which is kind
of a vague idea for me.
In the top level function (input as PG_FUNCTION_ARGS) the Node
*context can be used to print out errors and back to normal state, I
kind of get it.

I really appreciate someone showing a minimal, reproducible example
based on return_numeric_datum




Re: logical decoding and replication of sequences, take 2

2023-06-26 Thread Ashutosh Bapat
This is a review of the 0003 patch. Overall the patch looks good and helps
understand the decoding logic better.

+  data
+
+ BEGIN
+ sequence public.test_sequence: transactional:1 last_value: 1 log_cnt: 0 is_called:0
+ COMMIT

Looking at this output, I am wondering how this patch would work with DDL
replication. I should have noticed this earlier, sorry. A sequence DDL has two
parts: changes to the catalogs and changes to the data file. Support for
replicating the data file changes is added by these patches. The catalog
changes will need to be supported by the DDL replication patch. When applying
the DDL changes, there are two ways: 1. just apply the catalog changes and let
the support added here apply the data changes; 2. apply both changes. If the
second route is chosen, all the "transactional" decoding and application
support added by this patch will need to be ripped out. That will make the
"transactional" field in the protocol useless. It has the potential to waste
bandwidth in the future.

OTOH, I feel that waiting for the DDL replication patch set to be committed will
cause this patchset to be delayed for an unknown duration. That's undesirable
too.

One solution I see is to use Storage RMID WAL again. While decoding it we send
a message to the subscriber telling it that a new relfilenode is being
allocated to a sequence. The subscriber too then allocates new relfilenode to
the sequence. The sequence data changes are decoded without "transactional"
flag; but they are decoded as transactional or non-transactional using the same
logic as the current patch-set. The subscriber will always apply these changes
to the relfilenode associated with the sequence at that point in time. This
would have the same effect as the current patch-set. But then there is a
potential that the DDL replication patchset will render the Storage decoding
useless, so it's not an option. But anyway, I will leave this as a comment, an
alternative thought considered and discarded. Also, this might trigger a better idea.

What do you think?

+-- savepoint test on table with serial column
+BEGIN;
+CREATE TABLE test_table (a SERIAL, b INT);
+INSERT INTO test_table (b) VALUES (100);
+INSERT INTO test_table (b) VALUES (200);
+SAVEPOINT a;
+INSERT INTO test_table (b) VALUES (300);
+ROLLBACK TO SAVEPOINT a;

The third implicit nextval won't be logged, so whether the subtransaction is
rolled back or committed won't have much effect on the decoding. Adding a
subtransaction around the first INSERT itself might be useful to test that the
subtransaction rollback does not roll back the sequence changes.

After adding {'include_sequences', false} to the calls to
pg_logical_slot_get_changes() in other tests, the SQL statement has grown
beyond 80 characters. Need to split it into multiple lines.

 }
+else if (strcmp(elem->defname, "include-sequences") == 0)
+{
+
+if (elem->arg == NULL)
+data->include_sequences = false;

By default include_sequences = true. Shouldn't it then be set to true here?

After looking at the option processing code in
pg_logical_slot_get_changes_guts(), it looks like an argument can never be
NULL. But I see we have checks for NULL values of other arguments so it's ok to
keep a NULL check here.

I will look at 0004 next.

-- 
Best Wishes,
Ashutosh Bapat




Re: function arguments are not PG_FUNCTION_ARGS, how to pass Node *escontext

2023-06-26 Thread Andrew Dunstan


On 2023-06-26 Mo 07:20, jian he wrote:

static
Datum return_numeric_datum(char *token)
{
 Datum   numd;
 Node*escontext;

 if (!DirectInputFunctionCallSafe(numeric_in, token,
 InvalidOid, -1,
 escontext,
 &numd));
 elog(INFO,"something is wrong");
 return numd;
}



To start with, the semicolon at the end of that if appears bogus. The 
elog is indented to look like it's conditioned by the if but the 
semicolon makes it not be.


There are compiler switches in modern gcc at least that help you detect 
things like this.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


pgbnech: allow to cancel queries during benchmark

2023-06-26 Thread Yugo NAGATA
Hello,

This attached patch enables pgbench to cancel queries during benchmark.

Formerly, Ctrl+C during benchmark killed pgbench immediately, but backend
processes executing long queries remained for a while. You can simply
reproduce this problem by cancelling the pgbench running a custom script
executing "SELECT pg_sleep(10)".

The patch fixes this so that cancel requests are sent for all connections on
Ctrl+C, and all running queries are cancelled before pgbench exits.

In thread #0, setup_cancel_handler is called before the loop, and the
CancelRequested flag is set when Ctrl+C is issued. In the loop, cancel
requests are sent when the flag is set, only from thread #0. SIGINT is
blocked in the other threads, but those threads will exit after their
queries are cancelled. If thread safety is disabled or the OS is Windows,
the signal is not blocked because pthread_sigmask cannot be used.
(I didn't test the patch on Windows yet, though.)

I chose the design in which the signal handler and the query cancel are
performed only in thread #0 because I wanted to make the behavior as
predictable as possible. However, another design, in which all threads can
receive SIGINT and the first thread that catches the signal is
responsible for sending cancel requests for all connections, may also work.

Also, the array of CState that contains all clients state is changed to
a global variable so that all connections can be accessed within a thread.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 1d1670d4c2..4e80afdf9b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -606,6 +606,7 @@ typedef enum
 typedef struct
 {
 	PGconn	   *con;			/* connection handle to DB */
+	PGcancel   *cancel;			/* query cancel */
 	int			id;/* client No. */
 	ConnectionStateEnum state;	/* state machine's current state. */
 	ConditionalStack cstack;	/* enclosing conditionals state */
@@ -648,6 +649,8 @@ typedef struct
  * here */
 } CState;
 
+CState	*all_state;		/* status of all clients */
+
 /*
  * Thread state
  */
@@ -3639,6 +3642,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 		st->state = CSTATE_ABORTED;
 		break;
 	}
+	st->cancel = PQgetCancel(st->con);
 
 	/* reset now after connection */
 	now = pg_time_now();
@@ -4670,6 +4674,18 @@ disconnect_all(CState *state, int length)
 		finishCon(&state[i]);
 }
 
+/* send cancel requests to all connections */
+static void
+cancel_all()
+{
+	for (int i = 0; i < nclients; i++)
+	{
+		char errbuf[1];
+		if (all_state[i].cancel != NULL)
+			(void) PQcancel(all_state[i].cancel, errbuf, sizeof(errbuf));
+	}
+}
+
 /*
  * Remove old pgbench tables, if any exist
  */
@@ -6607,7 +6623,6 @@ main(int argc, char **argv)
 	bool		initialization_option_set = false;
 	bool		internal_script_used = false;
 
-	CState	   *state;			/* status of clients */
 	TState	   *threads;		/* array of thread */
 
 	pg_time_usec_t
@@ -6656,7 +6671,7 @@ main(int argc, char **argv)
 		}
 	}
 
-	state = (CState *) pg_malloc0(sizeof(CState));
+	all_state = (CState *) pg_malloc0(sizeof(CState));
 
 	/* set random seed early, because it may be used while parsing scripts. */
 	if (!set_random_seed(getenv("PGBENCH_RANDOM_SEED")))
@@ -6715,7 +6730,7 @@ main(int argc, char **argv)
 		pg_fatal("invalid variable definition: \"%s\"", optarg);
 
 	*p++ = '\0';
-	if (!putVariable(&state[0].variables, "option", optarg, p))
+	if (!putVariable(&all_state[0].variables, "option", optarg, p))
 		exit(1);
 }
 break;
@@ -7087,28 +7102,28 @@ main(int argc, char **argv)
 
 	if (nclients > 1)
 	{
-		state = (CState *) pg_realloc(state, sizeof(CState) * nclients);
-		memset(state + 1, 0, sizeof(CState) * (nclients - 1));
+		all_state = (CState *) pg_realloc(all_state, sizeof(CState) * nclients);
+		memset(all_state + 1, 0, sizeof(CState) * (nclients - 1));
 
 		/* copy any -D switch values to all clients */
 		for (i = 1; i < nclients; i++)
 		{
 			int			j;
 
-			state[i].id = i;
-			for (j = 0; j < state[0].variables.nvars; j++)
+			all_state[i].id = i;
+			for (j = 0; j < all_state[0].variables.nvars; j++)
 			{
-				Variable   *var = &state[0].variables.vars[j];
+				Variable   *var = &all_state[0].variables.vars[j];
 
 if (var->value.type != PGBT_NO_VALUE)
 {
-	if (!putVariableValue(&state[i].variables, "startup",
+	if (!putVariableValue(&all_state[i].variables, "startup",
 		  var->name, &var->value))
 		exit(1);
 }
 else
 {
-	if (!putVariable(&state[i].variables, "startup",
+	if (!putVariable(&all_state[i].variables, "startup",
 	 var->name, var->svalue))
 		exit(1);
 }
@@ -7119,8 +7134,8 @@ main(int argc, char **argv)
 	/* other CState initializations */
 	for (i = 0; i < nclients; i++)
 	{
-		state[i].cstack = conditional_stack_create();
-		initRandomState(&state[i].cs_func_rs);
+		all_state[i].cstack = conditional_stack_create();
+		initRandomState(&all_state[i].cs_func_rs);
 	}
 
 	/* opening 

RE: [Patch] Use *other* indexes on the subscriber when REPLICA IDENTITY is FULL

2023-06-26 Thread Hayato Kuroda (Fujitsu)
Dear Önder,

Thank you for giving comments! The author's comment is quite helpful for us.

>
When I last dealt with the same issue, I was examining it from a slightly 
broader perspective. I think
my conclusion was that RelationFindReplTupleByIndex() is designed for the 
constraints of UNIQUE INDEX 
and Primary Key.
>

I see. IIUC that's why you have added tuples_equal() in the 
RelationFindReplTupleByIndex().

>
I think we should also be mindful about tuples_equal() function. When an index 
returns more than
one tuple, we rely on tuples_equal() function to make sure non-relevant tuples 
are skipped.

For btree indexes, it was safe to rely on that function as the columns that are 
indexed using btree
always have equality operator. I think we can safely assume the same for hash 
indexes.

However, say we indexed "point" type using "gist" index. Then, if we let this 
logic to kick in,
I think tuples_equal() would fail saying that there is no equality operator 
exists.
>

Good point. Actually I have tested the "point" datatype but it did not work
well. Now I understand the reason.
It seemed that when TYPECACHE_EQ_OPR_FINFO is requested from lookup_type_cache(),
it can return a valid value only if the datatype has an operator class for Btree
or Hash. So tuples_equal() might not be usable if tuples have point, box, or
circle types.


BTW, I suspect that this restriction is not related to your commit.
In other words, if the table has attributes whose datatype has no operator
class for Btree, we cannot use REPLICA IDENTITY FULL. IIUC this is not
documented.
Please see the attached script to reproduce it. The final DELETE statement
cannot be replicated at the subscriber in my environment.

>
For the specific notes you raised about strategy numbers / operator classes, I 
need to
study a bit :) Though, I'll be available to do that early next week.
>

Thanks! I'm looking forward to seeing your opinions...

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Changing types of block and chunk sizes in memory contexts

2023-06-26 Thread Melih Mutlu
Hi hackers,

In memory contexts, block and chunk sizes are limited by some upper
bounds. Examples of those bounds are MEMORYCHUNK_MAX_BLOCKOFFSET and
MEMORYCHUNK_MAX_VALUE, both of which are only 1 less than 1GB.
This means memory contexts have blocks/chunks with sizes less than
1GB, and such sizes can be stored in 32 bits. Currently the "Size" type,
which is 64-bit, is used, but 32-bit integers should be enough to
store any value less than 1GB.

Attached patch is an attempt to change the types of some fields to
uint32 from Size in aset, slab and generation memory contexts.
I tried to find most of the places that needed to be changed to
uint32, but I probably missed some. I can add more places if you feel
like it.

I would appreciate any feedback.

Thanks,
-- 
Melih Mutlu
Microsoft


0001-Change-memory-context-fields-to-uint32.patch
Description: Binary data


Re: Clean up JumbleQuery() from query text

2023-06-26 Thread Nathan Bossart
On Mon, Jun 26, 2023 at 05:44:49PM +0900, Michael Paquier wrote:
> Joe has reported me offlist that JumbleQuery() includes a dependency
> to the query text, but we don't use that anymore as the query ID is
> generated from the Query structure instead.
> 
> Any thoughts about the cleanup attached?  But at the same time, this
> is simple and a new thing, so I'd rather clean up that now rather than
> later. 

LGTM

> It is not urgent, so I am fine to postpone that after beta2 is
> released on 17~ if there are any objections to that, of course.

Even if extensions are using it, GA for v16 is still a few months away, so
IMO it's okay to apply it to v16.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: function arguments are not PG_FUNCTION_ARGS, how to pass Node *escontext

2023-06-26 Thread jian he
On Mon, Jun 26, 2023 at 9:32 PM jian he  wrote:
>
> On Mon, Jun 26, 2023 at 7:50 PM Andrew Dunstan 
wrote:
> >
> >
> > On 2023-06-26 Mo 07:20, jian he wrote:
> >
> > static
> > Datum return_numeric_datum(char *token)
> > {
> > Datum   numd;
> > Node*escontext;
> >
> > if (!DirectInputFunctionCallSafe(numeric_in, token,
> > InvalidOid, -1,
> > escontext,
> > &numd));
> > elog(INFO,"something is wrong");
> > return numd;
> > }
> >
> >
> > To start with, the semicolon at the end of that if appears bogus. The
elog is indented to look like it's conditioned by the if but the semicolon
makes it not be.
> >
> > There are compiler switches in modern gcc at least that help you detect
things like this.
> >
> >
> > cheers
> >
> >
> > andrew
> >
> > --
> > Andrew Dunstan
> > EDB: https://www.enterprisedb.com
>
>
> sorry. It was my mistake.
>
> > Node*escontext;
> > if (!DirectInputFunctionCallSafe(numeric_in, token,
> > InvalidOid, -1,
> > escontext,
> > &numd))
> > elog(INFO,"something is wrong");
>
> I wonder about the implication of just declaring Node *escontext in here.
> In this DirectInputFunctionCallSafe, what does escontext point to.

gcc -Wempty-body will detect my error.

However, gcc -Wextra includes -Wempty-body and -Wuninitialized and others.
I guess the real question here is how to initialize escontext.


Re: Stampede of the JIT compilers

2023-06-26 Thread Laurenz Albe
On Sun, 2023-06-25 at 11:10 +0200, Michael Banck wrote:
> On Sat, Jun 24, 2023 at 01:54:53PM -0400, Tom Lane wrote:
> > I don't know whether raising the default would be enough to fix that
> > in a nice way, and I certainly don't pretend to have a specific value
> > to offer.  But it's undeniable that we have a serious problem here,
> > to the point where JIT is a net negative for quite a few people.
> 
> Some further data: to my knowledge, most major managed postgres
> providers disable jit for their users.

I have also started recommending jit=off for all but analytic workloads.

Yours,
Laurenz Albe




Re: logical decoding and replication of sequences, take 2

2023-06-26 Thread Tomas Vondra



On 6/26/23 15:18, Ashutosh Bapat wrote:
> This is review of 0003 patch. Overall the patch looks good and helps
> understand the decoding logic better.
> 
> +  data
> +
> + BEGIN
> + sequence public.test_sequence: transactional:1 last_value: 1
> log_cnt: 0 is_called:0
> + COMMIT
> 
> Looking at this output, I am wondering how would this patch work with DDL
> replication. I should have noticed this earlier, sorry. A sequence DDL has two
> parts, changes to the catalogs and changes to the data file. Support for
> replicating the data file changes is added by these patches. The catalog
> changes will need to be supported by DDL replication patch. When applying the
> DDL changes, there are two ways 1. just apply the catalog changes and let the
> support added here apply the data changes. 2. Apply both the changes. If the
> second route is chosen, all the "transactional" decoding and application
> support added by this patch will need to be ripped out. That will make the
> "transactional" field in the protocol will become useless. It has potential to
> be waste bandwidth in future.
> 

I don't understand why it would need to be ripped out. Why would it make
the transactional behavior useless? Can you explain?

IMHO we replicate either changes (and then DDL replication does not
interfere with that), or DDL (and then this patch should not interfere).

> OTOH, I feel that waiting for the DDL repliation patch set to be commtted will
> cause this patchset to be delayed for an unknown duration. That's undesirable
> too.
> 
> One solution I see is to use Storage RMID WAL again. While decoding it we send
> a message to the subscriber telling it that a new relfilenode is being
> allocated to a sequence. The subscriber too then allocates new relfilenode to
> the sequence. The sequence data changes are decoded without "transactional"
> flag; but they are decoded as transactional or non-transactional using the 
> same
> logic as the current patch-set. The subscriber will always apply these changes
> to the reflilenode associated with the sequence at that point in time. This
> would have the same effect as the current patch-set. But then there is
> potential that the DDL replication patchset will render the Storage decoding
> useless. So not an option. But anyway, I will leave this as a comment as an
> alternative thought and discarded. Also this might trigger a better idea.
> 
> What do you think?
> 


I don't understand what the problem with DDL is, so I can't judge how
this is supposed to solve it.

> +-- savepoint test on table with serial column
> +BEGIN;
> +CREATE TABLE test_table (a SERIAL, b INT);
> +INSERT INTO test_table (b) VALUES (100);
> +INSERT INTO test_table (b) VALUES (200);
> +SAVEPOINT a;
> +INSERT INTO test_table (b) VALUES (300);
> +ROLLBACK TO SAVEPOINT a;
> 
> The third implicit nextval won't be logged so whether subtransaction is rolled
> back or committed, it won't have much effect on the decoding. Adding
> subtransaction around the first INSERT itself might be useful to test that the
> subtransaction rollback does not rollback the sequence changes.
> 
> After adding {'include_sequences', false} to the calls to
> pg_logical_slot_get_changes() in other tests, the SQL statement has grown
> beyond 80 characters. Need to split it into multiple lines.
> 
>  }
> +else if (strcmp(elem->defname, "include-sequences") == 0)
> +{
> +
> +if (elem->arg == NULL)
> +data->include_sequences = false;
> 
> By default inlclude_sequences = true. Shouldn't then it be set to true here?
> 

I don't follow. Is this still related to the DDL replication, or are you
describing some new issue with savepoints?

> After looking at the option processing code in
> pg_logical_slot_get_changes_guts(), it looks like an argument can never be
> NULL. But I see we have checks for NULL values of other arguments so it's ok 
> to
> keep a NULL check here.
> 
> I will look at 0004 next.
> 

OK

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: psql: Add role's membership options to the \du+ command

2023-06-26 Thread Pavel Luzanov

Please find attached new patch version.
It implements \drg command and hides duplicates in \du & \dg commands.

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
From a117f13fd497bf6ff8a504bcda6cb10d34dd22a7 Mon Sep 17 00:00:00 2001
From: Pavel Luzanov 
Date: Mon, 26 Jun 2023 22:10:31 +0300
Subject: [PATCH v8] psql: add \drg command to show role memberships and grants
 information.

New command shows assigned privileges for each membership (ADMIN, INHERIT
or SET) and who is grantor. This is important to know which privileges can
be used and how to revoke membership. Without this command you need to query
pg_auth_members directly.

Since v16 it is possible to include one role to another several times with
different grantors. Therefore, the 'Member of' column of the \du
and \dg commands now shows a sorted array of distinct roles to hide duplicates.
---
 doc/src/sgml/ref/psql-ref.sgml | 27 +-
 src/bin/psql/command.c |  2 +
 src/bin/psql/describe.c| 80 +-
 src/bin/psql/describe.h|  3 ++
 src/bin/psql/help.c|  1 +
 src/bin/psql/tab-complete.c|  6 ++-
 src/test/regress/expected/psql.out | 41 +++
 src/test/regress/sql/psql.sql  | 26 ++
 8 files changed, 180 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 35aec6d3ce..2d2e394a84 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1722,7 +1722,8 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 'second value' \g
 Lists database roles.
 (Since the concepts of users and groups have been
 unified into roles, this command is now equivalent to
-\du.)
+\du.) For each role a sorted list of distinct roles
+of which it is a member is shown in array format.
 By default, only user-created roles are shown; supply the
 S modifier to include system roles.
 If pattern is specified,
@@ -1883,6 +1884,27 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 'second value' \g
 
   
 
+
+  
+\drg[S] [ pattern ]
+
+
+Lists detailed information about each role membership, including
+assigned options (ADMIN,
+INHERIT or SET) and grantor.
+See the GRANT
+command for their meaning.
+
+
+By default, only user-created roles are shown; supply the
+S modifier to include system roles.
+If pattern is specified,
+only those roles whose names match the pattern are listed.
+
+
+  
+
+
   
 \drds [ role-pattern [ database-pattern ] ]
 
@@ -1957,7 +1979,8 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 'second value' \g
 Lists database roles.
 (Since the concepts of users and groups have been
 unified into roles, this command is now equivalent to
-\dg.)
+\dg.) For each role a sorted list of distinct roles
+of which it is a member is shown in array format.
 By default, only user-created roles are shown; supply the
 S modifier to include system roles.
 If pattern is specified,
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 511debbe81..88a653a709 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -918,6 +918,8 @@ exec_command_d(PsqlScanState scan_state, bool active_branch, const char *cmd)
 
 	free(pattern2);
 }
+else if (strncmp(cmd, "drg", 3) == 0)
+	success = describeRoleGrants(pattern, show_system);
 else
 	status = PSQL_CMD_UNKNOWN;
 break;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 9325a46b8f..e4101fe36b 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -3632,10 +3632,11 @@ describeRoles(const char *pattern, bool verbose, bool showSystem)
 	  "SELECT r.rolname, r.rolsuper, r.rolinherit,\n"
 	  "  r.rolcreaterole, r.rolcreatedb, r.rolcanlogin,\n"
 	  "  r.rolconnlimit, r.rolvaliduntil,\n"
-	  "  ARRAY(SELECT b.rolname\n"
+	  "  ARRAY(SELECT DISTINCT b.rolname\n"
 	  "FROM pg_catalog.pg_auth_members m\n"
 	  "JOIN pg_catalog.pg_roles b ON (m.roleid = b.oid)\n"
-	  "WHERE m.member = r.oid) as memberof");
+	  "WHERE m.member = r.oid\n"
+	  "ORDER BY 1) as memberof");
 
 	if (verbose)
 	{
@@ -3753,6 +3754,81 @@ describeRoles(const char *pattern, bool verbose, bool showSystem)
 	return true;
 }
 
+/*
+ * \drg
+ * Describes role grants.
+ */
+bool
+describeRoleGrants(const char *pattern, bool showSystem)
+{
+	PQExpBufferData buf;
+	PGresult   *res;
+	printQueryOpt myopt = pset.popt;
+
+	initPQExpBuffer();
+	printfPQExpBuffer(,
+	  "SELECT m.rolname AS \"%s\", r.rolname AS \"%s\",\n"
+	  "  pg_catalog.concat_ws(', ',\n",
+	  gettext_noop("Role name"),

Re: Analyze on table creation?

2023-06-26 Thread James Coleman
cc'ing Tom because I'm curious if he's willing to provide some greater
context on the commit in question.

On Mon, Jun 26, 2023 at 2:16 PM Pavel Stehule  wrote:
>
>
>
> On Mon, Jun 26, 2023 at 19:48, James Coleman  wrote:
>>
>> On Mon, Jun 26, 2023 at 1:45 PM Pavel Stehule  
>> wrote:
>> >
>> >
>> >
>> > On Mon, Jun 26, 2023 at 19:43, Pavel Stehule  wrote:
>> > napsal:
>> >>
>> >> Hi
>> >>
>> >> On Mon, Jun 26, 2023 at 19:41, James Coleman  wrote:
>> >>>
>> >>> Hello,
>> >>>
>> >>> Have we ever discussed running an analyze immediately after creating a 
>> >>> table?
>> >>>
>> >>> Consider the following:
>> >>>
>> >>> create table stats(i int, t text not null);
>> >>> explain select * from stats;
>> >>>Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36
>> >>> analyze stats;
>> >>> explain select * from stats;
>> >>>Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
>> >>>
>> >>> Combined with rapidly increasing error margin on row estimates when
>> >>> adding joins means that a query joining to a bunch of empty tables
>> >>> when a database first starts up can result in some pretty wild plan
>> >>> costs.
>> >>>
>> >>> This feels like a simple idea to me, and so I assume people have
>> >>> considered it before. If so, I'd like to understand why the conclusion
>> >>> was not to do it, or, alternatively if it's a lack of tuits.
>> >>
>> >>
>> >> I like this. On the other hand, the described behaviour is designed to
>> >> ensure backward compatibility.
>> >
>> >
>> > if you break this back compatibility, then the immediate ANALYZE is not 
>> > necessary
>>
>> I don't follow what backwards compatibility you're referencing. Could
>> you expand on that?
>
>
> Originally, until the table had at least one row, PostgreSQL assumed it
> occupied 10 pages. This was changed in PostgreSQL 14.
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3d351d916b20534f973eda760cde17d96545d4c4
>

From that commit message:
> Historically, we've considered the state with relpages and reltuples
> both zero as indicating that we do not know the table's tuple density.
> This is problematic because it's impossible to distinguish "never yet
> vacuumed" from "vacuumed and seen to be empty".  In particular, a user
> cannot use VACUUM or ANALYZE to override the planner's normal heuristic
> that an empty table should not be believed to be empty because it is
> probably about to get populated.  That heuristic is a good safety
> measure, so I don't care to abandon it, but there should be a way to
> override it if the table is indeed intended to stay empty.

So that implicitly provides our reasoning for not analyzing up-front
on table creation.

I haven't thought about this too deeply yet, but it seems plausible to
me that the dangers of overestimating row count here (at minimum in
queries like I described with lots of joins) are higher than the
dangers of underestimating, which we would do if we believed the table
was empty. One critical question would be how fast we can assume the
table will be auto-analyzed (i.e., how fast would the underestimate be
corrected).

Regards,
James Coleman




Re: Stampede of the JIT compilers

2023-06-26 Thread Andres Freund
Hi,

On 2023-06-24 13:54:53 -0400, Tom Lane wrote:
> I think there is *plenty* of evidence that it is too low, or at least
> that for some reason we are too willing to invoke JIT when the result
> is to make the overall cost of a query far higher than it is without.
> Just see all the complaints on the mailing lists that have been
> resolved by advice to turn off JIT.  You do not even have to look
> further than our own regression tests: on my machine with current
> HEAD, "time make installcheck-parallel" reports
> 
> real0m8.544s
> user0m0.906s
> sys 0m0.863s
> 
> for a build without --with-llvm, and
> 
> real0m13.211s
> user0m0.917s
> sys 0m0.811s

IIRC those are all, or nearly all, cases where we have no stats and the plans
have ridiculous costs (and other reasons like enable_seqscans = false and
using seqscans nonetheless). In those cases no cost based approach will work
:(.


> I don't know whether raising the default would be enough to fix that
> in a nice way, and I certainly don't pretend to have a specific value
> to offer.  But it's undeniable that we have a serious problem here,
> to the point where JIT is a net negative for quite a few people.

Yea, I think at the moment it's not working well enough to be worth having on
by default. Some of that is due to partitioning having become much more
common, leading to much bigger plan trees, some of it is just old stuff that I
had hoped could be addressed more easily.

FWIW, Daniel Gustafsson is hacking on an old patch of mine that was working
towards making the JIT result cacheable (and providing noticeably bigger
performance gains).


> > In that context capping the number of backends compiling, particularly
> > where plans (and JIT?) might be cached, could well save us (depending
> > on workload).
> 
> TBH I do not find this proposal attractive in the least.

Yea, me neither. It doesn't address any of the actual problems and will add
new contention.

Greetings,

Andres Freund




Analyze on table creation?

2023-06-26 Thread James Coleman
Hello,

Have we ever discussed running an analyze immediately after creating a table?

Consider the following:

create table stats(i int, t text not null);
explain select * from stats;
   Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36
analyze stats;
explain select * from stats;
   Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)

Combined with rapidly increasing error margin on row estimates when
adding joins means that a query joining to a bunch of empty tables
when a database first starts up can result in some pretty wild plan
costs.

This feels like a simple idea to me, and so I assume people have
considered it before. If so, I'd like to understand why the conclusion
was not to do it, or, alternatively if it's a lack of tuits.

Regards,
James Coleman




Re: Analyze on table creation?

2023-06-26 Thread James Coleman
On Mon, Jun 26, 2023 at 4:00 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-06-26 13:40:49 -0400, James Coleman wrote:
> > Have we ever discussed running an analyze immediately after creating a 
> > table?
>
> That doesn't make a whole lot of sense to me - we could just insert the
> constants stats we wanted in that case.
>

I thought that was implicit in that, but fair enough :)

> > Consider the following:
> >
> > create table stats(i int, t text not null);
> > explain select * from stats;
> >Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36
> > analyze stats;
> > explain select * from stats;
> >Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
> >
> > Combined with rapidly increasing error margin on row estimates when
> > adding joins means that a query joining to a bunch of empty tables
> > when a database first starts up can result in some pretty wild plan
> > costs.
>
> The issue is that the table stats are likely going to quickly go out of date
> in that case; even a handful of inserts (which wouldn't trigger
> autovacuum analyzing) would lead to the "0 rows" stats causing very bad plans.
>

It's not obvious to me (as noted elsewhere in the thread) which is
worse: a bunch of JOINs on empty tables can result in (specific
example) plans with cost=15353020, and then trigger JIT, and...here we
collide with my other thread about JIT [1].

Regards,
James Coleman

1: 
https://www.postgresql.org/message-id/CAAaqYe-g-Q0Mm5H9QLcu8cHeMwok%2BHaxS4-UC9Oj3bK3a5jPvg%40mail.gmail.com




Re: Analyze on table creation?

2023-06-26 Thread Pavel Stehule
Hi

po 26. 6. 2023 v 19:41 odesílatel James Coleman  napsal:

> Hello,
>
> Have we ever discussed running an analyze immediately after creating a
> table?
>
> Consider the following:
>
> create table stats(i int, t text not null);
> explain select * from stats;
>Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36
> analyze stats;
> explain select * from stats;
>Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
>
> Combined with rapidly increasing error margin on row estimates when
> adding joins means that a query joining to a bunch of empty tables
> when a database first starts up can result in some pretty wild plan
> costs.
>
> This feels like a simple idea to me, and so I assume people have
> considered it before. If so, I'd like to understand why the conclusion
> was not to do it, or, alternatively if it's a lack of tuits.
>

I like this. On the other hand, the described behaviour is designed to
ensure backward compatibility.

Regards

Pavel



> Regards,
> James Coleman
>
>
>


Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

2023-06-26 Thread Alena Rybakina

Hi, all!

On 24.06.2023 14:23, Tomas Vondra wrote:

On 6/24/23 02:08, Tom Lane wrote:

Tomas Vondra  writes:

The problem is that the selectivity for "IS NULL" is estimated using the
table-level statistics. But the LEFT JOIN entirely breaks the idea that
the null_frac has anything to do with NULLs in the join result.

Right.


I wonder how to improve this, say by adjusting the IS NULL selectivity
when we know to operate on the outer side of the join. We're able to
do this for antijoins, so maybe we could do that here, somehow?

This mess is part of the long-term plan around the work I've been doing
on outer-join-aware Vars.  We now have infrastructure that can let
the estimator routines see "oh, this Var isn't directly from a scan
of its table, it's been passed through a potentially-nulling outer
join --- and I can see which one".  I don't have more than vague ideas
about what happens next, but that is clearly an essential step on the
road to doing better.


I was wondering if that work on outer-join-aware Vars could help with
this, but I wasn't following it very closely. I agree the ability to
check if the Var could be NULL due to an outer join seems useful, as it
says whether applying raw attribute statistics makes sense or not.

I was thinking about what to do for the case when that's not possible,
i.e. when the Var refers to nullable side of the join. Knowing that this
is happening is clearly not enough - we need to know how many new NULLs
are "injected" into the join result, and "communicate" that to the
estimation routines.

Attached is a very ugly experimental patch doing that, and with it the
estimate changes to this:

  QUERY PLAN
   --
Hash Left Join  (cost=3.25..18179.88 rows=00 width=16)
(actual time=0.528..596.151 rows=00 loops=1)
  Hash Cond: (large.id = small.id)
  Filter: ((small.id IS NULL) OR
   (large.a = ANY ('{1000,2000,3000,4000,5000}'::integer[])))
  Rows Removed by Filter: 100
  ->  Seq Scan on large  (cost=0.00..14425.00 rows=100 width=8)
  (actual time=0.069..176.138 rows=100 loops=1)
  ->  Hash  (cost=2.00..2.00 rows=100 width=8)
(actual time=0.371..0.373 rows=100 loops=1)
Buckets: 1024  Batches: 1  Memory Usage: 12kB
->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8)
  (actual time=0.032..0.146 rows=100 loops=1)
Planning Time: 3.845 ms
Execution Time: 712.405 ms
   (10 rows)

Seems nice, but. The patch is pretty ugly, I don't claim it works for
other queries or that this is exactly what we should do. It calculates
"unmatched frequency" next to eqjoinsel_inner, stashes that info into
sjinfo and the estimator (nulltestsel) then uses that to adjust the
nullfrac it gets from the statistics.
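In formula form, the adjustment experimented with above amounts to something like the following, where $f_u$ denotes the estimated fraction of outer-side rows without a join match (this is a reading of the patch's intent, not its exact code):

$$
\operatorname{sel}(x \;\text{IS NULL}) \;\approx\; f_u + (1 - f_u)\cdot \mathrm{null\_frac}
$$

That is, unmatched rows contribute NULLs with certainty, while matched rows contribute NULLs at the rate seen in the base table's statistics.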

The good thing is this helps even for IS NULL checks on non-join-key
columns (where we don't switch to an antijoin), but there's a couple
things that I dislike ...

1) It's not restricted to outer joins or anything like that (this is
mostly just my laziness / interest in one particular query, but also
something the outer-join-aware patch might help with).

2) We probably don't want to pass this kind of information through
sjinfo. That was the simplest thing for an experimental patch, but I
suspect it's not the only piece of information we may need to pass to
the lower levels of estimation code.

3) I kinda doubt we actually want to move this responsibility (to
consider fraction of unmatched rows) to the low-level estimation
routines (e.g. nulltestsel and various others). AFAICS this just
"introduces NULLs" into the relation, so maybe we could "adjust" the
attribute statistics (in examine_variable?) by inflating null_frac and
modifying the other frequencies in MCV/histogram.

4) But I'm not sure we actually want to do that in these low-level
selectivity functions. The outer join essentially produces output with
two subsets - one with matches on the outer side, one without them. But
the side without matches has NULLs in all columns. In a way, we know
exactly how are these columns correlated - if we do the usual estimation
(even with the null_frac adjusted), we just throw this information away.
And when there's a lot of rows without a match, that seems bad.

So maybe we should split the join estimate into two parts, one for each
subset of the join result. One for the rows with a match (and then we
can just do what we do now, with the attribute stats we already have).
And one for the "unmatched part" where we know the values on the outer
side are NULL (and then we can easily "fake" stats with null_frac=1.0).


I really hope what I just wrote makes at least a little bit of sense.


regards


I am also interested in this problem.

I did some refactoring of the source code in the patch, moved the 
calculation of unmatched_fraction to 

Re: Analyze on table creation?

2023-06-26 Thread Pavel Stehule
>
> >
> > Originally, until the table had at least one row, PostgreSQL
> > assumed it occupied 10 pages. This was changed in PostgreSQL 14.
> >
> >
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3d351d916b20534f973eda760cde17d96545d4c4
> >
>
> From that commit message:
> > Historically, we've considered the state with relpages and reltuples
> > both zero as indicating that we do not know the table's tuple density.
> > This is problematic because it's impossible to distinguish "never yet
> > vacuumed" from "vacuumed and seen to be empty".  In particular, a user
> > cannot use VACUUM or ANALYZE to override the planner's normal heuristic
> > that an empty table should not be believed to be empty because it is
> > probably about to get populated.  That heuristic is a good safety
> > measure, so I don't care to abandon it, but there should be a way to
> > override it if the table is indeed intended to stay empty.
>
> So that implicitly provides our reasoning for not analyzing up-front
> on table creation.
>
> I haven't thought about this too deeply yet, but it seems plausible to
> me that the dangers of overestimating row count here (at minimum in
> queries like I described with lots of joins) are higher than the
> dangers of underestimating, which we would do if we believed the table
> was empty. One critical question would be how fast we can assume the
> table will be auto-analyzed (i.e., how fast would the underestimate be
> corrected).
>

I found this issue a few years ago. That application had 40% of tables with
one or zero rows, 30% of usual size, and 30% that were sometimes really big.
This can be relatively common in OLAP applications.

The estimation was terrible, and I don't think there is a better
heuristic.  Maybe we can introduce a table option like expected size
that can be used when real statistics are not available.

Some like

CREATE TABLE foo(...) WITH (default_relpages = x)

It is not a perfect solution, but it allows fixing this issue with one command.


> Regards,
> James Coleman
>


Re: Fixing tab-complete for dollar-names

2023-06-26 Thread Mikhail Gribkov
Hi hackers,

As not much preliminary interest seems to be here, I'm sending the patch to
the upcoming commitfest.

--
 best regards,
Mikhail A. Gribkov


On Sat, Jun 17, 2023 at 12:51 AM Mikhail Gribkov  wrote:

> Hi hackers,
>
> In modern versions of Postgres the dollar sign is a totally legal
> character for identifiers (except for the first character), but
> tab-complete does not treat such identifiers well.
> For example, if one tries to create an Oracle-style view like this:
>
> create view v$activity as select * from pg_stat_activity;
>
> , he will get a normally functioning view, but psql tab-complete will not
> help him. Type "v", "v$" or "v$act" and press <Tab> - nothing will be
> suggested.
>
> Attached is a small patch fixing this problem.
> Honestly I'm a little surprised that this was not done before. Maybe,
> there are some special considerations I am not aware of, and the patch will
> break something?
> What would you say?
> --
>  best regards,
> Mikhail A. Gribkov
>


Re: Analyze on table creation?

2023-06-26 Thread Pavel Stehule
po 26. 6. 2023 v 19:43 odesílatel Pavel Stehule 
napsal:

> Hi
>
> po 26. 6. 2023 v 19:41 odesílatel James Coleman  napsal:
>
>> Hello,
>>
>> Have we ever discussed running an analyze immediately after creating a
>> table?
>>
>> Consider the following:
>>
>> create table stats(i int, t text not null);
>> explain select * from stats;
>>Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36
>> analyze stats;
>> explain select * from stats;
>>Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
>>
>> Combined with rapidly increasing error margin on row estimates when
>> adding joins means that a query joining to a bunch of empty tables
>> when a database first starts up can result in some pretty wild plan
>> costs.
>>
>> This feels like a simple idea to me, and so I assume people have
>> considered it before. If so, I'd like to understand why the conclusion
>> was not to do it, or, alternatively if it's a lack of tuits.
>>
>
> I like this. On the other hand, the described behaviour is designed to
> ensure backward compatibility.
>

if you break this backward compatibility, then the immediate ANALYZE is not
necessary



>
> Regards
>
> Pavel
>
>
>
>> Regards,
>> James Coleman
>>
>>
>>


Re: Analyze on table creation?

2023-06-26 Thread James Coleman
On Mon, Jun 26, 2023 at 1:45 PM Pavel Stehule  wrote:
>
>
>
> po 26. 6. 2023 v 19:43 odesílatel Pavel Stehule  
> napsal:
>>
>> Hi
>>
>> po 26. 6. 2023 v 19:41 odesílatel James Coleman  napsal:
>>>
>>> Hello,
>>>
>>> Have we ever discussed running an analyze immediately after creating a 
>>> table?
>>>
>>> Consider the following:
>>>
>>> create table stats(i int, t text not null);
>>> explain select * from stats;
>>>Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36
>>> analyze stats;
>>> explain select * from stats;
>>>Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
>>>
>>> Combined with rapidly increasing error margin on row estimates when
>>> adding joins means that a query joining to a bunch of empty tables
>>> when a database first starts up can result in some pretty wild plan
>>> costs.
>>>
>>> This feels like a simple idea to me, and so I assume people have
>>> considered it before. If so, I'd like to understand why the conclusion
>>> was not to do it, or, alternatively if it's a lack of tuits.
>>
>>
>> I like this. On the other hand, the described behaviour is designed to
>> ensure backward compatibility.
>
>
> if you break this back compatibility, then the immediate ANALYZE is not 
> necessary

I don't follow what backwards compatibility you're referencing. Could
you expand on that?

Regards,
James Coleman




Re: Analyze on table creation?

2023-06-26 Thread Andres Freund
Hi,

On 2023-06-26 13:40:49 -0400, James Coleman wrote:
> Have we ever discussed running an analyze immediately after creating a table?

That doesn't make a whole lot of sense to me - we could just insert the
constants stats we wanted in that case.


> Consider the following:
> 
> create table stats(i int, t text not null);
> explain select * from stats;
>Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36
> analyze stats;
> explain select * from stats;
>Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
> 
> Combined with rapidly increasing error margin on row estimates when
> adding joins means that a query joining to a bunch of empty tables
> when a database first starts up can result in some pretty wild plan
> costs.

The issue is that the table stats are likely going to quickly go out of date in
that case; even a handful of inserts (which wouldn't trigger
autovacuum analyzing) would lead to the "0 rows" stats causing very bad plans.

Greetings,

Andres Freund




Re: Do we want a hashset type?

2023-06-26 Thread Joel Jacobson
On Mon, Jun 26, 2023, at 13:06, jian he wrote:
> Can you try to glue the attached to the hashset data type input 
> function.
> the attached will parse cstring with double quote and not. so '{1,2,3}' 
> == '{"1","2","3"}'. obviously quote will preserve the inner string as 
> is.
> currently int4hashset input is delimited by comma, if you want deal 
> with range then you need escape the comma.

Not sure what you're trying to do here; what's the problem with
the current int4hashset_in()?

I think it might be best to focus on null semantics / three-valued logic
before moving on and trying to implement support for more types,
otherwise we would need to rewrite more code if we find general
thinkos that are problems in all types.

Help wanted to reason about what the following queries should return:

SELECT hashset_union(NULL::int4hashset, '{}'::int4hashset);

SELECT hashset_intersection(NULL::int4hashset, '{}'::int4hashset);

SELECT hashset_difference(NULL::int4hashset, '{}'::int4hashset);

SELECT hashset_symmetric_difference(NULL::int4hashset, '{}'::int4hashset);

Should they return NULL, the empty set or something else?

I've renamed hashset_merge() -> hashset_union() to better match
SQL's MULTISET feature which has a MULTISET UNION.

/Joel




Re: pgbnech: allow to cancel queries during benchmark

2023-06-26 Thread Kirk Wolak
On Mon, Jun 26, 2023 at 9:46 AM Yugo NAGATA  wrote:

> Hello,
>
> This attached patch enables pgbench to cancel queries during benchmark.
>
> Formerly, Ctrl+C during benchmark killed pgbench immediately, but backend
> processes executing long queries remained for a while. You can simply
> reproduce this problem by cancelling the pgbench running a custom script
> executing "SELECT pg_sleep(10)".
>
> The patch fixes this so that cancel requests are sent for all connections
> on
> Ctrl+C, and all running queries are cancelled before pgbench exits.
>
> In thread #0, setup_cancel_handler is called before the loop, the
> CancelRequested flag is set when Ctrl+C is issued. In the loop, cancel
> requests are sent when the flag is set only in thread #0. SIGINT is
> blocked in other threads, but the threads will exit after their query
> are cancelled. If thread safety is disabled or OS is Windows, the signal
> is not blocked because pthread_sigmask cannot be used.
> (I didn't test the patch on Windows yet, though.)
>
> I chose the design in which the signal handler and the query cancel are
> performed only in thread #0 because I wanted to make the behavior as
> predictable as possible. However, another design in which all threads can
> receive SIGINT and the first thread that catches the signal is
> responsible for sending cancel requests for all connections may also work.
>
> Also, the array of CState that contains all clients state is changed to
> a global variable so that all connections can be accessed within a thread.
>
>

+1
I also like the thread #0 handling design.  I have NOT reviewed/tested
this yet.


Detecting use-after-free bugs using gcc's malloc() attribute

2023-06-26 Thread Andres Freund
Hi,

I played around with adding
  __attribute__((malloc(free_func), malloc(another_free_func)))
annotations to a few functions in pg. See
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html


Adding them to pg_list.h seems to have found two valid issues when compiling
without optimization:

[1001/2331 22  42%] Compiling C object 
src/backend/postgres_lib.a.p/commands_tablecmds.c.o
../../../../home/andres/src/postgresql/src/backend/commands/tablecmds.c: In 
function ‘ATExecAttachPartition’:
../../../../home/andres/src/postgresql/src/backend/commands/tablecmds.c:17966:25:
 warning: pointer ‘partBoundConstraint’ may be used after ‘list_concat’ 
[-Wuse-after-free]
17966 | 
get_proposed_default_constraint(partBoundConstraint);
  | 
^~~~
../../../../home/andres/src/postgresql/src/backend/commands/tablecmds.c:17919:26:
 note: call to ‘list_concat’ here
17919 | partConstraint = list_concat(partBoundConstraint,
  |  ^~~~
17920 |  
RelationGetPartitionQual(rel));
  |  
~~
[1233/2331 22  52%] Compiling C object 
src/backend/postgres_lib.a.p/rewrite_rewriteHandler.c.o
../../../../home/andres/src/postgresql/src/backend/rewrite/rewriteHandler.c: In 
function ‘rewriteRuleAction’:
../../../../home/andres/src/postgresql/src/backend/rewrite/rewriteHandler.c:550:41:
 warning: pointer ‘newjointree’ may be used after ‘list_concat’ 
[-Wuse-after-free]
  550 | checkExprHasSubLink((Node *) 
newjointree);
  | 
^
../../../../home/andres/src/postgresql/src/backend/rewrite/rewriteHandler.c:542:33:
 note: call to ‘list_concat’ here
  542 | list_concat(newjointree, 
sub_action->jointree->fromlist);
  | 
^~~~


Both of these bugs seem somewhat older, the latter going back to 19ff959bff0,
in 2005.  I'm a bit surprised that we haven't hit them before, via
DEBUG_LIST_MEMORY_USAGE?


When compiling with optimization, another issue is reported:

[508/1814 22  28%] Compiling C object 
src/backend/postgres_lib.a.p/commands_explain.c.o
../../../../home/andres/src/postgresql/src/backend/commands/explain.c: In 
function 'ExplainNode':
../../../../home/andres/src/postgresql/src/backend/commands/explain.c:2037:25: 
warning: pointer 'ancestors' may be used after 'lcons' [-Wuse-after-free]
 2037 | show_upper_qual(plan->qual, "Filter", 
planstate, ancestors, es);
  | 
^~~
In function 'show_group_keys',
inlined from 'ExplainNode' at 
../../../../home/andres/src/postgresql/src/backend/commands/explain.c:2036:4:
../../../../home/andres/src/postgresql/src/backend/commands/explain.c:2564:21: 
note: call to 'lcons' here
 2564 | ancestors = lcons(plan, ancestors);
  | ^~

which looks like it might be valid - the caller's "ancestors" variable could
now be freed?  There do appear to be further instances of the issue, e.g. in
show_agg_keys(), that aren't flagged for some reason.



For something like pg_list.h the malloc(free) attribute is a bit awkward to
use, because one a) needs to list ~30 functions that can free a list and b)
the referenced functions need to be declared.

In my quick hack I just duplicated the relevant part of pg_list.h and added
the appropriate attributes to the copy of the original declaration.


I also added such attributes to bitmapset.h and palloc() et al, but that
didn't find existing bugs.  It does find use-after-free instances if I add
some, similarly it does find cases of mismatching palloc with free etc.


The attached prototype is quite rough and will likely fail on anything but a
recent gcc (likely >= 12).


Do others think this would be useful enough to be worth polishing? And do you
agree the warnings above are bugs?

Greetings,

Andres Freund
From e05c1260ee70457e9a899d5c32e5b85702193739 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 26 Jun 2023 12:17:30 -0700
Subject: [PATCH v1 1/2] Add allocator attributes.

Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/c.h   |  9 
 src/include/nodes/bitmapset.h | 40 ++-
 src/include/nodes/pg_list.h   | 97 +++
 src/include/utils/palloc.h| 55 
 4 files changed, 167 insertions(+), 34 deletions(-)

diff --git a/src/include/c.h b/src/include/c.h
index f69d739be57..920fdf983a1 100644
--- a/src/include/c.h
+++ b/src/include/c.h

Re: Analyze on table creation?

2023-06-26 Thread Pavel Stehule
po 26. 6. 2023 v 19:48 odesílatel James Coleman  napsal:

> On Mon, Jun 26, 2023 at 1:45 PM Pavel Stehule 
> wrote:
> >
> >
> >
> > po 26. 6. 2023 v 19:43 odesílatel Pavel Stehule 
> napsal:
> >>
> >> Hi
> >>
> >> po 26. 6. 2023 v 19:41 odesílatel James Coleman 
> napsal:
> >>>
> >>> Hello,
> >>>
> >>> Have we ever discussed running an analyze immediately after creating a
> table?
> >>>
> >>> Consider the following:
> >>>
> >>> create table stats(i int, t text not null);
> >>> explain select * from stats;
> >>>Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36
> >>> analyze stats;
> >>> explain select * from stats;
> >>>Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
> >>>
> >>> Combined with rapidly increasing error margin on row estimates when
> >>> adding joins means that a query joining to a bunch of empty tables
> >>> when a database first starts up can result in some pretty wild plan
> >>> costs.
> >>>
> >>> This feels like a simple idea to me, and so I assume people have
> >>> considered it before. If so, I'd like to understand why the conclusion
> >>> was not to do it, or, alternatively if it's a lack of tuits.
> >>
> >>
> >> I like this. On the other hand, the described behaviour is designed to
> >> ensure backward compatibility.
> >
> >
> > if you break this back compatibility, then the immediate ANALYZE is not
> necessary
>
> I don't follow what backwards compatibility you're referencing. Could
> you expand on that?
>

Originally, until the table had at least one row, PostgreSQL
assumed it occupied 10 pages. This was changed in PostgreSQL 14.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3d351d916b20534f973eda760cde17d96545d4c4

Regards

Pavel


> Regards,
> James Coleman
>


Re: Do we want a hashset type?

2023-06-26 Thread Kirk Wolak
On Mon, Jun 26, 2023 at 4:55 PM Joel Jacobson  wrote:

> On Mon, Jun 26, 2023, at 13:06, jian he wrote:
> > Can you try to glue the attached to the hashset data type input
> > function.
> > the attached will parse cstring with double quote and not. so '{1,2,3}'
> > == '{"1","2","3"}'. obviously quote will preserve the inner string as
> > is.
> > currently int4hashset input is delimited by comma, if you want deal
> > with range then you need escape the comma.
>
> Not sure what you're trying to do here; what's the problem with
> the current int4hashset_in()?
>
> I think it might be best to focus on null semantics / three-valued logic
> before moving on and trying to implement support for more types,
> otherwise we would need to rewrite more code if we find general
> thinkos that are problems in all types.
>
> Help wanted to reason about what the following queries should return:
>
> SELECT hashset_union(NULL::int4hashset, '{}'::int4hashset);
>
> SELECT hashset_intersection(NULL::int4hashset, '{}'::int4hashset);
>
> SELECT hashset_difference(NULL::int4hashset, '{}'::int4hashset);
>
> SELECT hashset_symmetric_difference(NULL::int4hashset, '{}'::int4hashset);
>
> Should they return NULL, the empty set or something else?
>
> I've renamed hashset_merge() -> hashset_union() to better match
> SQL's MULTISET feature which has a MULTISET UNION.
>

Shouldn't they return the same thing that left(NULL::text,1) returns?
(NULL)...
Typically any operation on NULL is NULL.

Kirk...


Re: Row pattern recognition

2023-06-26 Thread Vik Fearing

On 6/26/23 03:05, Tatsuo Ishii wrote:

I don't understand this.  RPR in a window specification limits the
window to the matched rows, so this looks like your rpr() function is
just the regular first_value() window function that we already have?


No, rpr() is different from first_value(). rpr() returns the argument
value at the first row in a frame only when matched rows are found. On the
other hand first_value() returns the argument value at the first row
in a frame unconditionally.

company  |   tdate| price | rpr  | first_value
--++---+--+-
  company1 | 2023-07-01 |   100 |  | 100
  company1 | 2023-07-02 |   200 |  200 | 200
  company1 | 2023-07-03 |   150 |  150 | 150
  company1 | 2023-07-04 |   140 |  | 140
  company1 | 2023-07-05 |   150 |  150 | 150
  company1 | 2023-07-06 |90 |  |  90
  company1 | 2023-07-07 |   110 |  | 110
  company1 | 2023-07-08 |   130 |  | 130
  company1 | 2023-07-09 |   120 |  | 120
  company1 | 2023-07-10 |   130 |  | 130

For example, a frame starting with (tdate = 2023-07-02, price = 200)
consists of rows (price = 200, 150, 140, 150) satisfying the pattern,
thus rpr returns 200. Since in this example frame option "ROWS BETWEEN
CURRENT ROW AND UNBOUNDED FOLLOWING" is specified, next frame starts
with (tdate = 2023-07-03, price = 150). This frame satisfies the
pattern too (price = 150, 140, 150), and rpr returns 150... and so on.



Okay, I see the problem now, and why you need the rpr() function.

You are doing this as something that happens over a window frame, but it 
is actually something that *reduces* the window frame.  The pattern 
matching needs to be done when the frame is calculated and not when any 
particular function is applied over it.


This query (with all the defaults made explicit):

SELECT s.company, s.tdate, s.price,
   FIRST_VALUE(s.tdate) OVER w,
   LAST_VALUE(s.tdate) OVER w,
   lowest OVER w
FROM stock AS s
WINDOW w AS (
  PARTITION BY s.company
  ORDER BY s.tdate
  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  EXCLUDE NO OTHERS
  MEASURES
LAST(DOWN) AS lowest
  AFTER MATCH SKIP PAST LAST ROW
  INITIAL PATTERN (START DOWN+ UP+)
  DEFINE
START AS TRUE,
UP AS price > PREV(price),
DOWN AS price < PREV(price)
);

Should produce this result:

 company  |   tdate    | price | first_value | last_value | lowest
----------+------------+-------+-------------+------------+--------
 company1 | 07-01-2023 |   100 |             |            |
 company1 | 07-02-2023 |   200 | 07-02-2023  | 07-05-2023 |    140
 company1 | 07-03-2023 |   150 |             |            |
 company1 | 07-04-2023 |   140 |             |            |
 company1 | 07-05-2023 |   150 |             |            |
 company1 | 07-06-2023 |    90 |             |            |
 company1 | 07-07-2023 |   110 |             |            |
 company1 | 07-08-2023 |   130 | 07-08-2023  | 07-10-2023 |    120
 company1 | 07-09-2023 |   120 |             |            |
 company1 | 07-10-2023 |   130 |             |            |
(10 rows)
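For what it's worth, the INITIAL PATTERN (START DOWN+ UP+) with AFTER MATCH
SKIP PAST LAST ROW behaviour shown above can be cross-checked with a small
simulation (a rough Python sketch of greedy pattern matching over the sample
data; the function and variable names are ad hoc, and this is not the
proposed executor code):

```python
# dates 07-01 .. 07-10, one price per day (the sample stock data)
prices = [100, 200, 150, 140, 150, 90, 110, 130, 120, 130]
dates = [f"07-{d:02d}" for d in range(1, 11)]

def match_start_down_up(p, i):
    """Greedily match START DOWN+ UP+ anchored at row i; return
    (last_matched_row, LAST(DOWN) price) or None when there is no match."""
    j = i
    while j + 1 < len(p) and p[j + 1] < p[j]:   # DOWN+: price < PREV(price)
        j += 1
    if j == i:
        return None                              # need at least one DOWN
    lowest = p[j]                                # LAST(DOWN)
    k = j
    while k + 1 < len(p) and p[k + 1] > p[k]:   # UP+: price > PREV(price)
        k += 1
    if k == j:
        return None                              # need at least one UP
    return k, lowest

measures = [None] * len(prices)
i = 0
while i < len(prices):
    m = match_start_down_up(prices, i)
    if m is None:
        i += 1                                   # no match: empty reduced frame
    else:
        k, lowest = m
        # first_value / last_value / lowest for the matched frame
        measures[i] = (dates[i], dates[k], lowest)
        i = k + 1                                # AFTER MATCH SKIP PAST LAST ROW

for d, p, m in zip(dates, prices, measures):
    print(d, p, m)
```

Only the 07-02 and 07-08 rows come out with measures, matching the result
table above.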

Or if we switch to AFTER MATCH SKIP TO NEXT ROW, then we get:

 company  |   tdate    | price | first_value | last_value | lowest
----------+------------+-------+-------------+------------+--------
 company1 | 07-01-2023 |   100 |             |            |
 company1 | 07-02-2023 |   200 | 07-02-2023  | 07-05-2023 |    140
 company1 | 07-03-2023 |   150 | 07-03-2023  | 07-05-2023 |    140
 company1 | 07-04-2023 |   140 |             |            |
 company1 | 07-05-2023 |   150 | 07-05-2023  | 07-08-2023 |     90
 company1 | 07-06-2023 |    90 |             |            |
 company1 | 07-07-2023 |   110 |             |            |
 company1 | 07-08-2023 |   130 | 07-08-2023  | 07-10-2023 |    120
 company1 | 07-09-2023 |   120 |             |            |
 company1 | 07-10-2023 |   130 |             |            |
(10 rows)

And then if we change INITIAL to SEEK:

 company  |   tdate    | price | first_value | last_value | lowest
----------+------------+-------+-------------+------------+--------
 company1 | 07-01-2023 |   100 | 07-02-2023  | 07-05-2023 |    140
 company1 | 07-02-2023 |   200 | 07-02-2023  | 07-05-2023 |    140
 company1 | 07-03-2023 |   150 | 07-03-2023  | 07-05-2023 |    140
 company1 | 07-04-2023 |   140 | 07-05-2023  | 07-08-2023 |     90
 company1 | 07-05-2023 |   150 | 07-05-2023  | 07-08-2023 |     90
 company1 | 07-06-2023 |    90 | 07-08-2023  | 07-10-2023 |    120
 company1 | 07-07-2023 |   110 | 07-08-2023  | 07-10-2023 |    120
 company1 | 07-08-2023 |   130 | 07-08-2023  | 07-10-2023 |    120
 company1 | 07-09-2023 |   120 |             |            |
 company1 | 07-10-2023 |   130 |             |            |
(10 rows)

Since the pattern recognition is part of the frame, the window 
aggregates should Just Work.




o SUBSET is not supported


Is this because 

ReadRecentBuffer() doesn't scale well

2023-06-26 Thread Andres Freund
Hi,

As mentioned nearby [1], Thomas brought up [2] the idea of using
ReadRecentBuffer() in _bt_getroot().  I couldn't resist and prototyped it.

Unfortunately it scaled way worse at first. This is not an inherent issue, but
due to an implementation choice in ReadRecentBuffer().  Whereas the normal
BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
LockBufHdr(), checks if the buffer ID is the same and then uses
PinBuffer_Locked().

The problem with that is that PinBuffer() takes care to not hold the buffer
header spinlock, it uses compare_exchange to atomically acquire the pin, while
guaranteeing nobody holds the lock.  When holding the buffer header spinlock,
there obviously is the risk of being scheduled out (or even just not have
exclusive access to the cacheline).

ReadRecentBuffer() scales worse even if LockBufHdr() is immediately followed
by PinBuffer_Locked(), so it's really just holding the lock that is the issue.


The fairly obvious solution to this is to just use PinBuffer() and just unpin
the buffer if its identity was changed concurrently. There could be an
unlocked pre-check as well.  However, there's the following comment in
ReadRecentBuffer():
 * It's now safe to pin the buffer.  We can't pin first and ask
 * questions later, because it might confuse code paths like
 * InvalidateBuffer() if we pinned a random non-matching buffer.
 */

But I'm not sure I buy that - there's plenty other things that can briefly
acquire a buffer pin (e.g. checkpointer, reclaiming the buffer for other
contents, etc).
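The pin-then-revalidate protocol being proposed can be sketched as follows (a
toy Python stand-in for the C logic; BufferDesc and read_recent_buffer are
illustrative names, not the actual bufmgr API, and a plain lock stands in for
the atomic refcount operations):

```python
import threading

class BufferDesc:
    """Toy stand-in for a shared buffer header: the tag identifies which
    page currently lives in the buffer; refcount is the pin count."""
    def __init__(self, tag):
        self.tag = tag
        self.refcount = 0
        self._lock = threading.Lock()   # stands in for atomic refcount ops

    def pin(self):
        with self._lock:
            self.refcount += 1

    def unpin(self):
        with self._lock:
            self.refcount -= 1

def read_recent_buffer(buf, expected_tag):
    """Pin optimistically, then re-check the identity; unpin on mismatch."""
    if buf.tag != expected_tag:          # unlocked pre-check
        return False
    buf.pin()                            # no buffer-header spinlock held here
    if buf.tag != expected_tag:          # buffer got recycled concurrently?
        buf.unpin()                      # transient stray pin, give it back
        return False
    return True

buf = BufferDesc(tag=("rel1", 0))
assert read_recent_buffer(buf, ("rel1", 0)) and buf.refcount == 1
assert not read_recent_buffer(buf, ("rel1", 99)) and buf.refcount == 1
```

The key property is that nothing comparable to the buffer-header spinlock is
held across the identity check; a mismatch only costs one transient pin.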



Another difference between using PinBuffer() and PinBuffer_locked() is that
the latter does not adjust a buffer's usagecount.

Leaving the scalability issue aside, isn't it somewhat odd that optimizing a
codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not
increasing usagecount anymore?


FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(), caching
the root page's buffer id in RelationData, seems a noticeable win. About 7% in
a concurrent, read-only pgbench that utilizes batches of 10. And it should be
easy to get much bigger wins, e.g. with an index nested loop with a relatively
small index on the inner side.


Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20230627013458.axge7iylw7llyvww%40awork3.anarazel.de
[2] https://twitter.com/MengTangmu/status/1673439083518115840




Re: ReadRecentBuffer() doesn't scale well

2023-06-26 Thread Thomas Munro
On Tue, Jun 27, 2023 at 2:05 PM Andres Freund  wrote:
> As mentioned nearby [1], Thomas brought up [2] the idea of using
> ReadRecentBuffer() in _bt_getroot().  I couldn't resist and prototyped it.

Thanks!

> Unfortunately it scaled way worse at first. This is not an inherent issue, but
> due to an implementation choice in ReadRecentBuffer().  Whereas the normal
> BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
> LockBufHdr(), checks if the buffer ID is the same and then uses
> PinBuffer_Locked().
>
> The problem with that is that PinBuffer() takes care to not hold the buffer
> header spinlock, it uses compare_exchange to atomically acquire the pin, while
> guaranteeing nobody holds the lock.  When holding the buffer header spinlock,
> there obviously is the risk of being scheduled out (or even just not have
> exclusive access to the cacheline).

Yeah.  Aside from inherent nastiness of user-space spinlocks, this new
use case is also enormously more likely to contend and then get into
trouble by being preempted due to btree root pages being about the
hottest pages in the universe than the use case I was focusing on at
the time.

> The fairly obvious solution to this is to just use PinBuffer() and just unpin
> the buffer if its identity was changed concurrently. There could be an
> unlocked pre-check as well.  However, there's the following comment in
> ReadRecentBuffer():
>  * It's now safe to pin the buffer.  We can't pin first and ask
>  * questions later, because it might confuse code paths like
>  * InvalidateBuffer() if we pinned a random non-matching buffer.
>  */
>
> But I'm not sure I buy that - there's plenty other things that can briefly
> acquire a buffer pin (e.g. checkpointer, reclaiming the buffer for other
> contents, etc).

I may well have been too cautious with that.  The worst thing I can
think of right now is that InvalidateBuffer() would busy loop (as it
already does in other rare cases) when it sees a pin.

> Another difference between using PinBuffer() and PinBuffer_locked() is that
> the latter does not adjust a buffer's usagecount.
>
> Leaving the scalability issue aside, isn't it somewhat odd that optimizing a
> codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not
> increasing usagecount anymore?

Yeah, that is not great.  The simplification you suggest would fix
that too, though I guess it would also bump the usage count of buffers
that don't have the tag we expected; that's obviously rare and erring
on the better side though.

> FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(), caching
> the root page's buffer id in RelationData, seems a noticeable win. About 7% in
> a concurrent, read-only pgbench that utilizes batches of 10. And it should be
> easy to get much bigger wins, e.g. with an index nested loop with a relatively
> small index on the inner side.

Wooo, that's better than I was hoping.  Thanks for trying it out!  I
think, for the complexity involved (ie very little), it's a nice
result, and worth considering even though it's also a solid clue that
we could do much better than this with a (yet to be designed)
longer-lived pin scheme.  smgr_targblock could be another
easy-to-cache candidate, ie a place where there is a single
interesting hot page that we're already keeping track of with no
requirement for new backend-local mapping machinery.




Re: ReadRecentBuffer() doesn't scale well

2023-06-26 Thread Peter Geoghegan
On Mon, Jun 26, 2023 at 8:34 PM Thomas Munro  wrote:
> Yeah.  Aside from inherent nastiness of user-space spinlocks, this new
> use case is also enormously more likely to contend and then get into
> trouble by being preempted due to btree root pages being about the
> hottest pages in the universe than the use case I was focusing on at
> the time.

They're not just the hottest. They're also among the least likely to
change from one moment to the next. (If that ever failed to hold then
it wouldn't take long for the index to become grotesquely tall.)

-- 
Peter Geoghegan




Re: Add GUC to tune glibc's malloc implementation.

2023-06-26 Thread Andres Freund
Hi,

On 2023-06-22 15:35:12 +0200, Ronan Dunklau wrote:
> This can cause problems. Let's take for example a table with 10k rows, and 32 
> columns (as defined by a bench script David Rowley shared last year when 
> discussing the GenerationContext for tuplesort), and execute the following 
> query, with 32MB of work_mem:
> 
> select * from t order by a offset 10;
>
> I attached the results of the 10k rows / 32 columns / 32MB work_mem benchmark 
> with different values for glibc_malloc_max_trim_threshold.

Could you provide instructions for the benchmark that don't require digging
into the archive to find an email by David?

Greetings,

Andres Freund




Re: Add GUC to tune glibc's malloc implementation.

2023-06-26 Thread Andres Freund
Hi,

On 2023-06-26 08:38:35 +0200, Ronan Dunklau wrote:
> I hope what I'm trying to achieve is clearer that way. Maybe this patch is not
> the best way to go about this, but since the memory allocator behaviour can
> have such an impact it's a bit sad we have to leave half the performance on
> the table because of it when there are easily accessible knobs to avoid it.

I'm *quite* doubtful this patch is the way to go.  If we want to more tightly
control memory allocation patterns, because we have more information than
glibc, we should do that, rather than try to nudge glibc's malloc in a random
direction.  In contrast to a generic malloc() implementation, we can have much
more information about memory lifetimes etc. due to memory contexts.

We e.g. could keep a larger number of memory blocks reserved
ourselves. Possibly by delaying the release of additionally held blocks until
we have been idle for a few seconds or such.


WRT to the difference in TPS in the benchmark you mention - I suspect that we
are doing something bad that needs to be improved regardless of the underlying
memory allocator implementation.  Due to the lack of detailed instructions I
couldn't reproduce the results immediately.

Greetings,

Andres Freund




Re: Improving btree performance through specializing by key shape, take 2

2023-06-26 Thread Dilip Kumar
On Fri, Jun 23, 2023 at 8:16 PM Matthias van de Meent
 wrote:
>
> On Fri, 23 Jun 2023 at 11:26, Dilip Kumar  wrote:

>
> > and I have one confusion in the
> > below hunk in the _bt_moveright function, basically, if the parent
> > page's right key is exactly matching the HIGH key of the child key
> > then I suppose while doing the "_bt_compare" with the HIGH_KEY we can
> > use the optimization right, i.e. column number from where we need to
> > start the comparison should be used what is passed by the caller.  But
> > in the below hunk, you are always passing that as 'cmpcol' which is 1.
> > I think this should be '*comparecol' because '*comparecol' will either
> > hold the value passed by the parent if high key data exactly match
> > with the parent's right tuple or it will hold 1 in case it doesn't
> > match.  Am I missing something?
>
> We can't carry _bt_compare prefix results across pages, because the
> key range of a page may shift while we're not holding a lock on that
> page. That's also why the code resets the prefix to 1 every time it
> accesses a new page ^1: it cannot guarantee correct results otherwise.
> See also [0] and [1] for why that is important.

Yeah that makes sense

> ^1: When following downlinks, the code above your quoted code tries to
> reuse the _bt_compare result of the parent page in the common case of
> a child page's high key that is bytewise equal to the right separator
> tuple of the parent page's downlink to this page.  However, if it
> detects that this child high key has changed (i.e. not 100% bytewise
> equal), we can't reuse that result, and we'll have to re-establish all
> prefix info on that page from scratch.
> In any case, this only establishes the prefix for the right half of
> the page's keyspace, the prefix of the left half of the data still
> needs to be established separately.
>
> I hope this explains the reasons for why we can't reuse comparecol as
> _bt_compare argument.

Yeah got it, thanks for explaining this.  Now I see you have explained
this in comments as well above the memcmp() statement.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Fixing tab-complete for dollar-names

2023-06-26 Thread Vik Fearing

On 6/26/23 22:10, Mikhail Gribkov wrote:

Hi hackers,

As not much preliminary interest seem to be here, I'm sending the patch to
the upcoming commitfest


I have added myself as reviewer.  I already had taken a look at it, and 
it seemed okay, but I have not yet searched for corner cases.

--
Vik Fearing





[PATCH] Honor PG_TEST_NOCLEAN for tempdirs

2023-06-26 Thread Jacob Champion
Hello,

I was running the test_pg_dump extension suite, and I got annoyed that
I couldn't keep it from deleting its dump artifacts after a successful
run. Here's a patch to make use of PG_TEST_NOCLEAN (which currently
covers the test cluster's base directory) with the Test::Utils
tempdirs too.

(Looks like this idea was also discussed last year [1]; let me know if
I missed any more recent suggestions.)

Thanks,
--Jacob

[1] https://www.postgresql.org/message-id/yypd9unv14sx2...@paquier.xyz
From 4e00947b66edc83ab66a06aa8a9f4c1591a2734c Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Fri, 23 Jun 2023 13:30:20 -0700
Subject: [PATCH] Test::Utils: honor PG_TEST_NOCLEAN for tempdirs

---
 src/test/perl/PostgreSQL/Test/Utils.pm | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index a27fac83d2..e5a08d7b1a 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -272,7 +272,7 @@ sub all_tests_passing
 
 Securely create a temporary directory inside C<$tmp_check>, like C,
 and return its name.  The directory will be removed automatically at the
-end of the tests.
+end of the tests, unless the PG_TEST_NOCLEAN envvar is provided.
 
 If C is given, the new directory is templated as C<${prefix}_>.
 Otherwise the template is C.
@@ -286,7 +286,7 @@ sub tempdir
 	return File::Temp::tempdir(
 		$prefix . '_',
 		DIR => $tmp_check,
-		CLEANUP => 1);
+		CLEANUP => not defined $ENV{'PG_TEST_NOCLEAN'});
 }
 
 =pod
@@ -301,7 +301,7 @@ name, to avoid path length issues.
 sub tempdir_short
 {
 
-	return File::Temp::tempdir(CLEANUP => 1);
+	return File::Temp::tempdir(CLEANUP => not defined $ENV{'PG_TEST_NOCLEAN'});
 }
 
 =pod
-- 
2.25.1



Re: [Patch] Use *other* indexes on the subscriber when REPLICA IDENTITY is FULL

2023-06-26 Thread Peter Smith
On Mon, Jun 26, 2023 at 11:44 PM Hayato Kuroda (Fujitsu)
 wrote:
...
> BTW, I suspect that the restriction is not related to your commit.
> In other words, if the table has attributes whose datatype is not covered by
> a Btree operator class, we could not use REPLICA IDENTITY FULL. IIUC it is
> not documented.
> Please see attached script to reproduce that. The final DELETE statement
> cannot be replicated at the subscriber on my env.
>

Missing attachment?

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: ReadRecentBuffer() doesn't scale well

2023-06-26 Thread Andres Freund
Hi,

On 2023-06-27 15:33:57 +1200, Thomas Munro wrote:
> On Tue, Jun 27, 2023 at 2:05 PM Andres Freund  wrote:
> > Unfortunately it scaled way worse at first. This is not an inherent issue,
> > but due to an implementation choice in ReadRecentBuffer().  Whereas the
> > normal BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
> > LockBufHdr(), checks if the buffer ID is the same and then uses
> > PinBuffer_Locked().
> >
> > The problem with that is that PinBuffer() takes care to not hold the buffer
> > header spinlock, it uses compare_exchange to atomically acquire the pin,
> > while guaranteeing nobody holds the lock.  When holding the buffer header
> > spinlock, there obviously is the risk of being scheduled out (or even just
> > not have exclusive access to the cacheline).
> 
> Yeah.  Aside from inherent nastiness of user-space spinlocks

I've been wondering about making our backoff path use futexes, after some
adaptive spinning.


> > The fairly obvious solution to this is to just use PinBuffer() and just
> > unpin the buffer if its identity was changed concurrently. There could be
> > an unlocked pre-check as well.  However, there's the following comment in
> > ReadRecentBuffer():
> >  * It's now safe to pin the buffer.  We can't pin first and ask
> >  * questions later, because it might confuse code paths like
> >  * InvalidateBuffer() if we pinned a random non-matching buffer.
> >  */
> >
> > But I'm not sure I buy that - there's plenty other things that can briefly
> > acquire a buffer pin (e.g. checkpointer, reclaiming the buffer for other
> > contents, etc).
> 
> I may well have been too cautious with that.  The worst thing I can
> think of right now is that InvalidateBuffer() would busy loop (as it
> already does in other rare cases) when it sees a pin.

Right. Particularly if we were to add a pre-check for the tag to match, that
should be extremely rare.


> > Another difference between using PinBuffer() and PinBuffer_locked() is that
> > the latter does not adjust a buffer's usagecount.
> >
> > Leaving the scalability issue aside, isn't it somewhat odd that optimizing a
> > codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not
> > increasing usagecount anymore?
> 
> Yeah, that is not great.  The simplification you suggest would fix
> that too, though I guess it would also bump the usage count of buffers
> that don't have the tag we expected; that's obviously rare and erring
> on a better side though.

Yea, I'm not worried about that. If somebody is, we could just add code to
decrement the usagecount again.


> > FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(), caching
> > the root page's buffer id in RelationData, seems a noticeable win. About 7%
> > in a concurrent, read-only pgbench that utilizes batches of 10. And it
> > should be easy to get much bigger wins, e.g. with an index nested loop with
> > a relatively small index on the inner side.
> 
> Wooo, that's better than I was hoping.  Thanks for trying it out!  I
> think, for the complexity involved (ie very little)

I don't really have a concrete thought for where to store the id of the recent
buffer. I just added a new field into some padding in RelationData, but we
might go for something fancier.


> smgr_targblock could be another easy-to-cache candidate, ie a place where
> there is a single interesting hot page that we're already keeping track of
> with no requirement for new backend-local mapping machinery.

I wonder if we should simply add a generic field for such a Buffer to
RelationData, that the AM can use as it desires. For btree that would be the
root page, for heap the target block ...


> it's a nice result, and worth considering even though it's also a solid clue
> that we could do much better than this with a (yet to be designed)
> longer-lived pin scheme.

Indeed. PinBuffer() is pretty hot after the change. As is the buffer content
lock.

Particularly for the root page, it'd be really interesting to come up with a
scheme that keeps an offline copy of the root page while also pinning the real
root page. I think we should be able to have a post-check that can figure out
if the copied root page is out of date after searching it, without needing the
content lock.
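One way such a post-check could work in principle: keep a version token (e.g.
something LSN-like) alongside the offline copy, search the copy without any
lock, and only afterwards verify the token against the live page. A toy,
single-threaded Python sketch of that idea (FakePage and search_cached are
invented names; a real implementation would have to worry about torn reads
and memory ordering):

```python
class FakePage:
    """Toy shared page whose lsn advances on every modification."""
    def __init__(self):
        self.data = {"root": 1}
        self.lsn = 100
    def snapshot(self):
        return dict(self.data), self.lsn

def search_cached(page, cache):
    """Search a lock-free local copy; an LSN re-check validates the result."""
    while True:
        if cache["lsn"] != page.lsn:        # copy stale (or empty): refresh it
            cache["copy"], cache["lsn"] = page.snapshot()
        result = cache["copy"]["root"]
        if page.lsn == cache["lsn"]:        # page unchanged while searching
            return result

page = FakePage()
cache = {"copy": None, "lsn": None}
assert search_cached(page, cache) == 1
page.data["root"], page.lsn = 2, 101        # concurrent modification
assert search_cached(page, cache) == 2      # stale copy detected, refreshed
```

The slow path (refresh and retry) only triggers when the hot page actually
changed, which for a btree root is rare.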


Greetings,

Andres Freund




RE: [Patch] Use *other* indexes on the subscriber when REPLICA IDENTITY is FULL

2023-06-26 Thread Hayato Kuroda (Fujitsu)
> Please see attached script to reproduce that. The final DELETE statement
> cannot be replicated at the subscriber on my env.

Sorry, I forgot to attach...

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



test_point.sh
Description: test_point.sh


False sharing for PgBackendStatus, made worse by in-core query_id handling

2023-06-26 Thread Andres Freund
Hi,

On Twitter Thomas thoroughly nerdsniped me [1]. As part of that I ran a
concurrent readonly pgbench workload and analyzed cacheline "contention" using
perf c2c.

One of the cacheline conflicts, by far not the most common, but significant,
is one I hadn't noticed in the past. The cacheline accesses are by
pgstat_report_query_id() and pgstat_report_activity().

The reason for the conflict is simple - we don't ensure any alignment for
PgBackendStatus. Thus the end of one backend's PgBackendStatus will be in the
same cacheline as another backend's start of PgBackendStatus.

Historically that didn't show up too much, because the end of PgBackendStatus
happened to contain less-frequently changing data since at least 9.6, namely
int64   st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];

which effectively avoided any relevant false sharing.


But in 4f0b0966c866 a new trailing element was added to PgBackendStatus:

	/* query identifier, optionally computed using post_parse_analyze_hook */
	uint64		st_query_id;

which is very frequently set, due to the following in ExecutorStart:
/*
 * In some cases (e.g. an EXECUTE statement) a query execution will skip
 * parse analysis, which means that the query_id won't be reported.  Note
 * that it's harmless to report the query_id multiple times, as the call
 * will be ignored if the top level query_id has already been reported.
 */
pgstat_report_query_id(queryDesc->plannedstmt->queryId, false);



The benchmarks I ran used -c 48 -j 48 clients on my two socket workstation, 2x
10/20 cores/threads.

With a default pgbench -S workload, the regression is barely visible - the
context switches between pgbench and backend use too many resources. But a
pipelined pgbench -S shows a 1-2% regression and server-side query execution
of a simple statement [2] regresses by ~5.5%.

Note that this is with compute_query_id = auto, without any extensions
loaded.

The fix for this is quite simple, something like:
#ifdef pg_attribute_aligned
pg_attribute_aligned(PG_CACHE_LINE_SIZE)
#endif

at the start of PgBackendStatus.
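Why padding each PgBackendStatus up to a cache-line multiple eliminates the
sharing can be illustrated with a toy model (not PostgreSQL code; the entry
sizes below are made up):

```python
CACHE_LINE = 64

def shared_lines(entry_size, n):
    """Count cache lines that hold bytes from more than one array entry."""
    owners = {}
    for i in range(n):
        start, end = i * entry_size, (i + 1) * entry_size - 1
        for line in range(start // CACHE_LINE, end // CACHE_LINE + 1):
            owners.setdefault(line, set()).add(i)
    return sum(1 for s in owners.values() if len(s) > 1)

# Unaligned entries: each neighbour pair straddles one line; padding the
# entry size up to a multiple of 64 removes the overlap entirely.
print(shared_lines(200, 4))   # -> 3 lines shared between neighbouring entries
print(shared_lines(256, 4))   # -> 0 once each entry fills whole cache lines
```

Any frequently written field that lands on one of those shared lines (like a
trailing st_query_id) then ping-pongs the line between backends.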


Unfortunately we can't fix that in the backbranches, as it obviously is an ABI
violation.


Leaving the performance issue aside for a moment, I'm somewhat confused by the
maintenance of PgBackendStatus->st_query_id:

1) Why are there pgstat_report_query_id() calls in parse_analyze_*()? We aren't
   executing the statements at that point?

2) pgstat_report_query_id() doesn't overwrite a non-zero query_id unless force
   is passed in. Force is only passed in exec_simple_query(). query_id is also
   reset when pgstat_report_activity(STATE_RUNNING) is called.

   I think this means that e.g. bgworkers issuing queries will often get stuck
   on the first query_id used, unless they call pgstat_report_activity()?

Greetings,

Andres Freund

[1] https://twitter.com/MengTangmu/status/1673439083518115840
[2] DO $$ BEGIN FOR i IN 1..1 LOOP EXECUTE 'SELECT'; END LOOP;END;$$;




Incorrect comment for memset() on pgssHashKey?

2023-06-26 Thread Japin Li

Hi,

Commit 6b4d23feef introduces a toplevel field in pgssHashKey, which leads to
padding.  In pgss_store(), it comments that memset() is required when
pgssHashKey is without padding only.

@@ -1224,9 +1227,14 @@ pgss_store(const char *query, uint64 queryId,
 query = CleanQuerytext(query, &query_location, &query_len);

 /* Set up key for hashtable search */
+
+/* memset() is required when pgssHashKey is without padding only */
+memset(&key, 0, sizeof(pgssHashKey));
+
 key.userid = GetUserId();
 key.dbid = MyDatabaseId;
 key.queryid = queryId;
+key.toplevel = (exec_nested_level == 0);

 /* Lookup the hash table entry with shared lock. */
 LWLockAcquire(pgss->lock, LW_SHARED);

However, we need memset() only when pgssHashKey has padding, right?

-- 
Regrads,
Japin Li.

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 21bede29fe..f6ed87f1e6 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1257,7 +1257,7 @@ pgss_store(const char *query, uint64 queryId,
 
 	/* Set up key for hashtable search */
 
-	/* memset() is required when pgssHashKey is without padding only */
+	/* memset() is required when pgssHashKey is with padding only */
	memset(&key, 0, sizeof(pgssHashKey));
 
 	key.userid = GetUserId();
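The underlying point - that the memset() matters exactly when the struct has
padding bytes, because hash/compare functions look at the raw bytes - can be
demonstrated outside C as well (a Python/ctypes illustration; the bool+uint64
layout is an analogy for a key with padding, not pgssHashKey's real
definition):

```python
import ctypes

class Key(ctypes.Structure):
    # a bool followed by a uint64 leaves padding bytes between the fields
    _fields_ = [("toplevel", ctypes.c_bool), ("queryid", ctypes.c_uint64)]

def make_key(fill):
    raw = bytearray(fill * ctypes.sizeof(Key))   # simulate stack garbage
    k = Key.from_buffer(raw)
    k.toplevel, k.queryid = True, 42             # set every *field*
    return bytes(raw)

# Same field values, different padding garbage: bytewise-unequal keys.
# That is what breaks memcmp()-style hashing/comparison, hence the memset()
# when - and only when - the struct has padding.
print(make_key(b"\x00") == make_key(b"\xff"))    # -> False (padding differs)
```

Zeroing the whole struct before assigning the fields makes the padding bytes
deterministic, so bytewise comparison then agrees with fieldwise equality.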


Re: ReadRecentBuffer() doesn't scale well

2023-06-26 Thread Peter Geoghegan
On Mon, Jun 26, 2023 at 9:09 PM Andres Freund  wrote:
> I think we should be able to have a post-check that can figure out
> if the copied root page is out of date after searching it, without needing the
> content lock.

I'm guessing that you're thinking of doing something with the page LSN?

-- 
Peter Geoghegan




Re: Improve comment on cid mapping (was Re: Adding CommandID to heap xlog records)

2023-06-26 Thread Andres Freund
Hi,

On 2023-06-26 09:57:56 +0300, Heikki Linnakangas wrote:
> diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
> index 0786bb0ab7..e403feeccd 100644
> --- a/src/backend/replication/logical/snapbuild.c
> +++ b/src/backend/replication/logical/snapbuild.c
> @@ -41,10 +41,15 @@
>   * transactions we need Snapshots that see intermediate versions of the
>   * catalog in a transaction. During normal operation this is achieved by 
> using
>   * CommandIds/cmin/cmax. The problem with that however is that for space
> - * efficiency reasons only one value of that is stored
> - * (cf. combocid.c). Since combo CIDs are only available in memory we log
> - * additional information which allows us to get the original (cmin, cmax)
> - * pair during visibility checks. Check the reorderbuffer.c's comment above
> + * efficiency reasons, the cmin and cmax are not included in WAL records. We
> + * cannot read the cmin/cmax from the tuple itself, either, because it is
> + * reset on crash recovery. Even if we could, we could not decode combocids
> + * which are only tracked in the original backend's memory. To work around
> + * that, heapam writes an extra WAL record (XLOG_HEAP2_NEW_CID) every time a
> + * catalog row is modified, which includes the cmin and cmax of the
> + * tuple. During decoding, we insert the ctid->(cmin,cmax) mappings into the
> + * reorder buffer, and use them at visibility checks instead of the cmin/cmax
> + * on the tuple itself. Check the reorderbuffer.c's comment above
>   * ResolveCminCmaxDuringDecoding() for details.
>   *
>   * To facilitate all this we need our own visibility routine, as the normal
> -- 
> 2.30.2

LGTM


> From 9140a0d98fd21b595eac6d75521a6b1a9f1b Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Mon, 26 Jun 2023 09:56:02 +0300
> Subject: [PATCH v2 2/2] Remove redundant check for fast_forward.
> 
> We already checked for it earlier in the function.
> 
> Discussion: https://www.postgresql.org/message-id/1ba2899e-77f8-7866-79e5-f3b7d1251...@iki.fi
> ---
>  src/backend/replication/logical/decode.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
> index d91055a440..7039d425e2 100644
> --- a/src/backend/replication/logical/decode.c
> +++ b/src/backend/replication/logical/decode.c
> @@ -422,8 +422,7 @@ heap2_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
>   switch (info)
>   {
>   case XLOG_HEAP2_MULTI_INSERT:
> - if (!ctx->fast_forward &&
> - SnapBuildProcessChange(builder, xid, buf->origptr))
> + if (SnapBuildProcessChange(builder, xid, buf->origptr))
>   DecodeMultiInsert(ctx, buf);
>   break;
>   case XLOG_HEAP2_NEW_CID:
> -- 
> 2.30.2

LGTM^2

Greetings,

Andres Freund




Re: [Patch] Use *other* indexes on the subscriber when REPLICA IDENTITY is FULL

2023-06-26 Thread Peter Smith
On Thu, Jun 22, 2023 at 11:37 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear hackers,
> (CC: Önder because he owned the related thread)
>
...
> The current patch only includes tests for hash indexes. These are separated 
> into
> the file 032_subscribe_use_index.pl for convenience, but will be integrated 
> in a
> later version.
>

Hi Kuroda-san.

I am still studying the docs references given in your initial post.

Meanwhile, FWIW, here are some minor review comments for the patch you gave.

==
src/backend/executor/execReplication.c

1. get_equal_strategy_number

+/*
+ * Return the appropriate strategy number which corresponds to the equality
+ * comparisons.
+ *
+ * TODO: support other indexes: GiST, GIN, SP-GiST, BRIN
+ */
+static int
+get_equal_strategy_number(Oid opclass)
+{
+ Oid am_method = get_opclass_method(opclass);
+ int ret;
+
+ switch (am_method)
+ {
+ case BTREE_AM_OID:
+ ret = BTEqualStrategyNumber;
+ break;
+ case HASH_AM_OID:
+ ret = HTEqualStrategyNumber;
+ break;
+ case GIST_AM_OID:
+ case GIN_AM_OID:
+ case SPGIST_AM_OID:
+ case BRIN_AM_OID:
+ /* TODO */
+ default:
+ /* XXX: Do we have to support extended indexes? */
+ ret = InvalidStrategy;
+ break;
+ }
+
+ return ret;
+}

1a.
In the file syscache.c there are already some other functions like
get_op_opfamily_strategy so I am wondering if this new function also
really belongs in that file.

~

1b.
Should that say "operator" instead of "comparisons"?

~

1c.
"am" stands for "access method" so "am_method" is like "access method
method" – is it correct?

~~~

2. build_replindex_scan_key

  int table_attno = indkey->values[index_attoff];
+ int strategy_number;


Ought to say this is a strategy for "equality", so a better varname
might be "equality_strategy_number" or "eq_strategy" or similar.

==
src/backend/replication/logical/relation.c

3. IsIndexUsableForReplicaIdentityFull

It seems there is some overlap between this code which hardwired 2
valid AMs and the switch statement in your other
get_equal_strategy_number function which returns a strategy number for
those 2 AMs.

Would it be better to make another common function like
get_equality_strategy_for_am(), and then you don’t have to hardwire
anything? Instead, you can say:

is_usable_type = get_equality_strategy_for_am(indexInfo->ii_Am) !=
InvalidStrategy;

==
src/backend/utils/cache/lsyscache.c

4. get_opclass_method

+/*
+ * get_opclass_method
+ *
+ * Returns the OID of the index access method operator family is for.
+ */
+Oid
+get_opclass_method(Oid opclass)
+{
+ HeapTuple tp;
+ Form_pg_opclass cla_tup;
+ Oid result;
+
+ tp = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclass));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for opclass %u", opclass);
+ cla_tup = (Form_pg_opclass) GETSTRUCT(tp);
+
+ result = cla_tup->opcmethod;
+ ReleaseSysCache(tp);
+ return result;
+}

Is the comment correct? IIUC, this function is not doing anything for
"families".

==
src/test/subscription/t/034_hash.pl

5.
+# insert some initial data within the range 0-9 for y
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO test_replica_id_full SELECT i, (i%10)::text FROM
generate_series(0,10) i"
+);

Why does the comment only say "for y"?

~~~

6.
+# wait until the index is used on the subscriber
+# XXX: the test will be suspended here
+$node_publisher->wait_for_catchup($appname);
+$node_subscriber->poll_query_until('postgres',
+ q{select (idx_scan = 4) from pg_stat_all_indexes where indexrelname
= 'test_replica_id_full_idx';}
+  )
+  or die
+  "Timed out while waiting for check subscriber tap_sub_rep_full
updates 4 rows via index";
+

AFAIK this is OK but it was slightly misleading because it says
"updates 4 rows" whereas the previous UPDATE was only for 2 rows. So
here I think the 4 also counts the 2 DELETED rows. The comment can be
expanded slightly to clarify this.

~~~

7.
FYI, when I ran the TAP test the result was this:

t/034_hash.pl .. 1/? # Tests were run but no plan
was declared and done_testing() was not seen.
t/034_hash.pl .. All 2 subtests passed
t/100_bugs.pl .. ok

Test Summary Report
---
t/034_hash.pl(Wstat: 0 Tests: 2 Failed: 0)
  Parse errors: No plan found in TAP output
Files=36, Tests=457, 338 wallclock secs ( 0.29 usr  0.07 sys + 206.73
cusr 47.94 csys = 255.03 CPU)
Result: FAIL

~

Just adding the missing done_testing() seemed to fix this.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Do we want a hashset type?

2023-06-26 Thread jian he
On Tue, Jun 27, 2023 at 4:55 AM Joel Jacobson  wrote:
>
> On Mon, Jun 26, 2023, at 13:06, jian he wrote:
> > Can you try to glue the attached to the hashset data type input
> > function.
> > the attached will parse cstring with double quote and not. so '{1,2,3}'
> > == '{"1","2","3"}'. obviously quote will preserve the inner string as
> > is.
> > currently int4hashset input is delimited by comma, if you want deal
> > with range then you need escape the comma.
>
> Not sure what you're trying to do here; what's the problem with
> the current int4hashset_in()?
>
> I think it might be best to focus on null semantics / three-valued logic
> before moving on and trying to implement support for more types,
> otherwise we would need to rewrite more code if we find general
> thinkos that are problems in all types.
>
> Help wanted to reason about what the following queries should return:
>
> SELECT hashset_union(NULL::int4hashset, '{}'::int4hashset);
>
> SELECT hashset_intersection(NULL::int4hashset, '{}'::int4hashset);
>
> SELECT hashset_difference(NULL::int4hashset, '{}'::int4hashset);
>
> SELECT hashset_symmetric_difference(NULL::int4hashset, '{}'::int4hashset);
>
> Should they return NULL, the empty set or something else?
>
> I've renamed hashset_merge() -> hashset_union() to better match
> SQL's MULTISET feature which has a MULTISET UNION.
>
> /Joel

In SQLMultiSets.pdf (from the earlier thread) I found a related
explanation on pages 45 and 46.

(CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
T1.V FROM UNNEST (OP1) AS T1 (V) INTERSECT SQ SELECT T2.V FROM UNNEST
(OP2) AS T2 (V) ) END)

CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
T1.V FROM UNNEST (OP1) AS T1 (V) UNION SQ SELECT T2.V FROM UNNEST
(OP2) AS T2 (V) ) END

(CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
T1.V FROM UNNEST (OP1) AS T1 (V) EXCEPT SQ SELECT T2.V FROM UNNEST
(OP2) AS T2 (V) ) END)

On page 11:
>
> Unlike the corresponding table operators UNION, INTERSECT and EXCEPT, we have 
> chosen ALL as the default, since this is the most natural interpretation of 
> MULTISET UNION, etc

Also on page 11 there is an aggregate named FUSION. (I like the name.)




Assert !bms_overlap(joinrel->relids, required_outer)

2023-06-26 Thread Jaime Casanova
Hi,

The attached query makes beta2 crash with attached backtrace.
Interestingly the index on ref_6 is needed to make it crash, without
it the query works fine.

-- 
Jaime Casanova
Director de Servicios Profesionales
SYSTEMGUARDS - Consultores de PostgreSQL
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
set = {__val = {4194304, 140734892134240, 2, 6, 7058365, 
94467046612912, 4611686018427388799, 140602878769846, 0, 281470681751456, 0, 0, 
0, 0, 0, 0}}
pid = 
tid = 
ret = 
#1  0x7fe0a7c60535 in __GI_abort () at abort.c:79
save_stage = 1
act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, 
sa_mask = {__val = {0, 0, 0, 0, 0, 140602865618952, 2, 7018074096076324867, 
  7017846515177828964, 94467046612912, 7003721282549509696, 0, 
6603711216602630912, 140734892134480, 0, 140734892135344}}, 
  sa_flags = -759062608, sa_restorer = 0x0}
sigs = {__val = {32, 0 }}
#2  0x55ead332796a in ExceptionalCondition (conditionName=0x55ead34de1f0 
"!bms_overlap(joinrel->relids, required_outer)", 
fileName=0x55ead34ddcef "relnode.c", lineNumber=1652) at assert.c:66
No locals.
#3  0x55ead306f7b8 in get_joinrel_parampathinfo (root=0x7fe09c5bd0a0, 
joinrel=0x7fe09c399d98, outer_path=0x7fe09c391e68, inner_path=0x7fe09c38fd28, 
sjinfo=0x7fe09c4766a0, required_outer=0x7fe09c39a218, 
restrict_clauses=0x7fff6540d548) at relnode.c:1652
ppi = 0x10
join_and_req = 0x7fff6540d4a0
outer_and_req = 0x0
inner_and_req = 0x7fe09c391a78
pclauses = 0x55ead4a896e0
eclauses = 0x386540d4a0
dropped_ecs = 0x68
rows = 4.6672926152656088e-310
lc = 0x55ead3366497 
#4  0x55ead30605b6 in create_nestloop_path (root=0x7fe09c5bd0a0, 
joinrel=0x7fe09c399d98, jointype=JOIN_LEFT, workspace=0x7fff6540d590, 
extra=0x7fff6540d740, outer_path=0x7fe09c391e68, inner_path=0x7fe09c38fd28, 
restrict_clauses=0x0, pathkeys=0x0, required_outer=0x7fe09c39a218)
at pathnode.c:2462
pathnode = 0x7fe09c399a08
inner_req_outer = 0x0
#5  0x55ead3007a7d in try_nestloop_path (root=0x7fe09c5bd0a0, 
joinrel=0x7fe09c399d98, outer_path=0x7fe09c391e68, inner_path=0x7fe09c38fd28, 
pathkeys=0x0, jointype=JOIN_LEFT, extra=0x7fff6540d740) at joinpath.c:775
required_outer = 0x7fe09c39a218
workspace = {startup_cost = 2.1988, total_cost = 
10.754533615595889, run_cost = 8.336155958894, 
  inner_run_cost = 6.9466950277108173e-310, inner_rescan_run_cost = 0, 
outer_rows = 0, inner_rows = 6.9532275375539389e-310, 
  outer_skip_rows = 4.667292444873237e-310, inner_skip_rows = 
3.9525251667299724e-322, numbuckets = 0, numbatches = 0, 
  inner_rows_total = 6.946694916976477e-310}
innerrel = 0x7fe09c4743d0
outerrel = 0x7fe09c391868
innerrelids = 0x7fe09c474670
outerrelids = 0x7fe09c391a78
inner_paramrels = 0x0
outer_paramrels = 0x7fe09c391d18
#6  0x55ead300904d in match_unsorted_outer (root=0x7fe09c5bd0a0, 
joinrel=0x7fe09c399d98, outerrel=0x7fe09c391868, innerrel=0x7fe09c4743d0, 
jointype=JOIN_LEFT, extra=0x7fff6540d740) at joinpath.c:1802
innerpath = 0x7fe09c38fd28
mpath = 0x0
lc2__state = {l = 0x7fe09c38ff68, i = 0}
lc2 = 0x7fe09c38ff80
outerpath = 0x7fe09c391e68
merge_pathkeys = 0x0
lc1__state = {l = 0x7fe09c391dd8, i = 0}
save_jointype = JOIN_LEFT
nestjoinOK = true
useallclauses = false
inner_cheapest_total = 0x7fe09c38fd28
matpath = 0x7fe09c393458
lc1 = 0x7fe09c391df0
__func__ = "match_unsorted_outer"
#7  0x55ead3006ed3 in add_paths_to_joinrel (root=0x7fe09c5bd0a0, 
joinrel=0x7fe09c399d98, outerrel=0x7fe09c391868, innerrel=0x7fe09c4743d0, 
jointype=JOIN_LEFT, sjinfo=0x7fe09c4766a0, restrictlist=0x0) at 
joinpath.c:294
extra = {restrictlist = 0x0, mergeclause_list = 0x0, inner_unique = 
false, sjinfo = 0x7fe09c4766a0, semifactors = {
outer_match_frac = 6.946694916976477e-310, match_count = 0}, 
param_source_rels = 0x7fe09c39a1f8}
mergejoin_allowed = true
lc = 0x0
joinrelids = 0x7fe09c399d78
#8  0x55ead300b9e9 in populate_joinrel_with_paths (root=0x7fe09c5bd0a0, 
rel1=0x7fe09c391868, rel2=0x7fe09c4743d0, joinrel=0x7fe09c399d98, 
sjinfo=0x7fe09c4766a0, restrictlist=0x0) at joinrels.c:942
__func__ = "populate_joinrel_with_paths"
#9  0x55ead300b5b7 in make_join_rel (root=0x7fe09c5bd0a0, 
rel1=0x7fe09c391868, rel2=0x7fe09c4743d0) at joinrels.c:776
joinrelids = 0x7fe09c38fe58
sjinfo = 0x7fe09c4766a0
reversed = false
pushed_down_joins = 0x0
sjinfo_data = {type = 2621925520, min_lefthand = 0x7fe09c474670, 
min_righthand = 0x7fe09c4737a0, syn_lefthand = 0x1, syn_righthand = 
0x7fff6540d8c0, 
  jointype = 3540362286, ojrelid = 

Re: Optimize walsender handling invalid messages of 'drop publication'

2023-06-26 Thread Andres Freund
Hi,

On 2023-06-26 15:01:22 +0800, Bowen Shi wrote:
> This issue has been pending for several months without any response.
> And this problem still exists in the latest minor versions of PG 12
> and PG 13.
>
> I believe that the fix in this patch is helpful.
>
> The patch has been submitted
> https://commitfest.postgresql.org/43/4393/ .  Anyone who is interested
> in this issue can help with the review.

ISTM that the path for people encountering this issue is to upgrade.

It's not unheard of that we'd backpatch a performance improvements to the
backbranches, but it's pretty rare.  It's one thing to decide to backpatch an
optimization if it had time to "mature" in the development branch, but from
what I understand, you're proposing to apply this just to the back branches.

Greetings,

Andres Freund




Re: Add some more corruption error codes to relcache

2023-06-26 Thread Kirk Wolak
On Fri, Jun 16, 2023 at 9:18 AM Andrey M. Borodin 
wrote:

> Hi hackers,
>
> Relcache errors from time to time detect catalog corruptions. For example,
> recently I observed following:
> 1. Filesystem or nvme disk zeroed out leading 160Kb of catalog index. This
> type of corruption passes through data_checksums.
> 2. RelationBuildTupleDesc() was failing with "catalog is missing 1
> attribute(s) for relid 2662".
> 3. We monitor corruption error codes and alert on-call DBAs when see one,
> but the message is not marked as XX001 or XX002. It's XX000 which happens
> from time to time due to less critical reasons than data corruption.
> 4. High-availability automation switched primary to other host and other
> monitoring checks did not ring too.
>
> This particular case is not very illustrative. In fact we had index
> corruption that looked like catalog corruption.
> But still it looks to me that catalog inconsistencies (like relnatts !=
> number of pg_attribute rows) could be marked with ERRCODE_DATA_CORRUPTED.
> This particular error code in my experience proved to be a good indicator
> for early corruption detection.
>
> What do you think?
> What other subsystems can be improved in the same manner?
>
> Best regards, Andrey Borodin.
>

Andrey, I think this is a good idea.  But your #1 item sounds familiar.
There was a thread about someone creating/dropping lots of databases, who
found some kind of race condition that would ZERO out pg_ catalog entries,
just like you are mentioning.  I think he found the problem when
relations could not be found and/or the DB did not want to start.  I just
spent 30 minutes looking for it, but my "search-fu" is apparently failing.

Which leads me to ask if there is a way to detect the corrupting write
(writing all zeroes to the file when we know better?  A zeroed-out header
when one cannot exist?)  Hoping this triggers a bright idea on your end...

Kirk...


Re: ReadRecentBuffer() doesn't scale well

2023-06-26 Thread Thomas Munro
On Tue, Jun 27, 2023 at 4:53 PM Peter Geoghegan  wrote:
> On Mon, Jun 26, 2023 at 9:40 PM Thomas Munro  wrote:
> > If the goal is to get rid of both pins and content locks, LSN isn't
> > enough.  A page might be evicted and replaced by another page that has
> > the same LSN because they were modified by the same record.  Maybe
> > that's vanishingly rare, but the correct thing would be a counter that
> > goes up on modification AND eviction.
>
> It should be safe to allow searchers to see a version of the root page
> that is out of date. The Lehman & Yao design is very permissive about
> these things. There aren't any special cases where the general rules
> are weakened in some way that might complicate this approach.
> Searchers need to check the high key to determine if they need to move
> right -- same as always.

OK.  I guess I'm talking about a slightly more general version of the
problem, inspired by the stuff I mentioned in parentheses, which would
simply get the wrong answer if the mapping changed; here, by contrast,
you'd use the cached copy in a race case, which should still work for
searches.

So I guess the question for this thread is: do we want to work on
ReadRecentBuffer(), or just take this experiment as evidence of even
more speed-up available and aim for that directly?




Re: ReadRecentBuffer() doesn't scale well

2023-06-26 Thread Thomas Munro
On Tue, Jun 27, 2023 at 4:32 PM Peter Geoghegan  wrote:
> On Mon, Jun 26, 2023 at 9:09 PM Andres Freund  wrote:
> > I think we should be able to have a post-check that can figure out
> > if the copied root page is out of date after searching it, without needing 
> > the
> > content lock.
>
> I'm guessing that you're thinking of doing something with the page LSN?

If the goal is to get rid of both pins and content locks, LSN isn't
enough.  A page might be evicted and replaced by another page that has
the same LSN because they were modified by the same record.  Maybe
that's vanishingly rare, but the correct thing would be a counter that
goes up on modification AND eviction.  (FWIW I toyed with variants of
this concept in the context of SLRU -> buffer pool migration, where I
was trying to do zero-lock CLOG lookups; in that case I didn't need
the copy of the page being discussed here due to the data being
atomically readable, but I had the same requirement for a
"did-it-change-under-my-feet?" check).




Re: ReadRecentBuffer() doesn't scale well

2023-06-26 Thread Peter Geoghegan
On Mon, Jun 26, 2023 at 9:40 PM Thomas Munro  wrote:
> If the goal is to get rid of both pins and content locks, LSN isn't
> enough.  A page might be evicted and replaced by another page that has
> the same LSN because they were modified by the same record.  Maybe
> that's vanishingly rare, but the correct thing would be a counter that
> goes up on modification AND eviction.

It should be safe to allow searchers to see a version of the root page
that is out of date. The Lehman & Yao design is very permissive about
these things. There aren't any special cases where the general rules
are weakened in some way that might complicate this approach.
Searchers need to check the high key to determine if they need to move
right -- same as always.

More concretely: A root page can be concurrently split when there is
an in-flight index scan that is about to land on it (which becomes the
left half of the split). It doesn't matter if it's a searcher that is
"between" the meta page and the root page. It doesn't matter if a
level was added. This is true even though nothing that you'd usually
think of as an interlock is held "between levels". The root page isn't
really special, except in the obvious way. We can even have two roots
at the same time (the true root, and the fast root).

-- 
Peter Geoghegan




Re: Assert !bms_overlap(joinrel->relids, required_outer)

2023-06-26 Thread Michael Paquier
On Mon, Jun 26, 2023 at 11:05:43PM -0500, Jaime Casanova wrote:
> The attached query makes beta2 crash with attached backtrace.
> Interestingly the index on ref_6 is needed to make it crash, without
> it the query works fine.

Issue reproduced here.  I am adding an open item, whose owner should
be Tom?
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Honor PG_TEST_NOCLEAN for tempdirs

2023-06-26 Thread Michael Paquier
On Mon, Jun 26, 2023 at 04:55:47PM -0700, Jacob Champion wrote:
> I was running the test_pg_dump extension suite, and I got annoyed that
> I couldn't keep it from deleting its dump artifacts after a successful
> run. Here's a patch to make use of PG_TEST_NOCLEAN (which currently
> covers the test cluster's base directory) with the Test::Utils
> tempdirs too.

I am still +1 in doing that.

> (Looks like this idea was also discussed last year [1]; let me know if
> I missed any more recent suggestions.)

I don't recall any specific suggestions related to that, but perhaps
it got mentioned somewhere else.

src/test/perl/README and regress.sgml both describe what
PG_TEST_NOCLEAN does, and it seems to me that these should be updated
to say that temporary files are kept in addition to the data
folders?
--
Michael


signature.asc
Description: PGP signature


Re: ReadRecentBuffer() doesn't scale well

2023-06-26 Thread Andres Freund
Hi,

On 2023-06-27 16:40:08 +1200, Thomas Munro wrote:
> On Tue, Jun 27, 2023 at 4:32 PM Peter Geoghegan  wrote:
> > On Mon, Jun 26, 2023 at 9:09 PM Andres Freund  wrote:
> > > I think we should be able to have a post-check that can figure out
> > > if the copied root page is out of date after searching it, without 
> > > needing the
> > > content lock.
> >
> > I'm guessing that you're thinking of doing something with the page LSN?

Yes, that seems to be the most obvious.


> If the goal is to get rid of both pins and content locks, LSN isn't
> enough.

I was imagining you'd have a long-lived pin. I don't think trying to make it
work without that is particularly promising in this context, where it seems
quite feasible to keep pins around for a while.


> A page might be evicted and replaced by another page that has the same LSN
> because they were modified by the same record.  Maybe that's vanishingly
> rare, but the correct thing would be a counter that goes up on modification
> AND eviction.

I don't think it would need to be a single counter. If we wanted to do
something like this, I think you'd have to have a counter in the buffer desc
that's incremented whenever the page is replaced. Plus the LSN for the page
content change "counter".

Greetings,

Andres Freund




Re: Improving btree performance through specializing by key shape, take 2

2023-06-26 Thread Dilip Kumar
On Tue, Jun 27, 2023 at 9:42 AM Dilip Kumar  wrote:
>
> On Fri, Jun 23, 2023 at 8:16 PM Matthias van de Meent
>  wrote:
> >
> > On Fri, 23 Jun 2023 at 11:26, Dilip Kumar  wrote:
>
> >
> > > and I have one confusion in the
> > > below hunk in the _bt_moveright function, basically, if the parent
> > > page's right key is exactly matching the HIGH key of the child key
> > > then I suppose while doing the "_bt_compare" with the HIGH_KEY we can
> > > use the optimization right, i.e. column number from where we need to
> > > start the comparison should be used what is passed by the caller.  But
> > > in the below hunk, you are always passing that as 'cmpcol' which is 1.
> > > I think this should be '*comparecol' because '*comparecol' will either
> > > hold the value passed by the parent if high key data exactly match
> > > with the parent's right tuple or it will hold 1 in case it doesn't
> > > match.  Am I missing something?
> >
> > We can't carry _bt_compare prefix results across pages, because the
> > key range of a page may shift while we're not holding a lock on that
> > page. That's also why the code resets the prefix to 1 every time it
> > accesses a new page ^1: it cannot guarantee correct results otherwise.
> > See also [0] and [1] for why that is important.
>
> Yeah that makes sense
>
> > ^1: When following downlinks, the code above your quoted code tries to
> > reuse the _bt_compare result of the parent page in the common case of
> > a child page's high key that is bytewise equal to the right separator
> > tuple of the parent page's downlink to this page.  However, if it
> > detects that this child high key has changed (i.e. not 100% bytewise
> > equal), we can't reuse that result, and we'll have to re-establish all
> > prefix info on that page from scratch.
> > In any case, this only establishes the prefix for the right half of
> > the page's keyspace, the prefix of the left half of the data still
> > needs to be established separately.
> >
> > I hope this explains the reasons for why we can't reuse comparecol as
> > _bt_compare argument.
>
> Yeah got it, thanks for explaining this.  Now I see you have explained
> this in comments as well above the memcmp() statement.

At high level 0001 looks fine to me, just some suggestions

1.
+Notes about dynamic prefix truncation
+-

I feel that instead of calling it "dynamic prefix truncation" we could
call it "dynamic prefix skipping"; I mean, we are not
really truncating anything, right? We are just skipping those
attributes during comparison.

2.
I think we should add some comments in the _bt_binsrch() function,
where the main logic around maintaining highcmpcol and
lowcmpcol lives.
The notes section explains it very clearly, but adding some comments
here, along with a reference to that section of the README, would be
good.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com