Re: Additional size of hash table is always zero for hash aggregates

2020-03-13 Thread Pengzhou Tang
Thanks, Andres Freund and Andrew Gierth.

Relatedly, may I invite you to help review the parallel grouping sets
patches? It would be great to hear your comments, since you contributed
most of the grouping sets code.

The thread is:
https://www.postgresql.org/message-id/CAG4reAQ8rFCc%2Bi0oju3VjaW7xSOJAkvLrqa4F-NYZzAG4SW7iQ%40mail.gmail.com

Thanks,
Pengzhou

On Fri, Mar 13, 2020 at 3:16 AM Andres Freund  wrote:

> Hi,
>
>
> On 2020-03-12 16:35:15 +0800, Pengzhou Tang wrote:
> > When reading the grouping sets code, I found that the additional size of
> > the hash table for hash aggregates is always zero. This seems to be
> > incorrect to me; I've attached a patch to fix it, please help to check.
>
> Indeed, that's incorrect. Causes the number of buckets for the hashtable
> to be set higher - the size is just used for that.  I'm a bit wary of
> changing this in the stable branches - could cause performance changes?
>
> Greetings,
>
> Andres Freund
>


Re: RETURNING does not explain evaluation context for subqueries

2020-03-13 Thread Bruce Momjian
On Wed, Feb  5, 2020 at 04:32:45PM +, PG Doc comments form wrote:
> The following documentation comment has been logged on the website:
> 
> Page: https://www.postgresql.org/docs/11/dml-returning.html
> Description:
> 
> In the docs explaining RETURNING
> https://www.postgresql.org/docs/11/dml-returning.html there is no mention of
> the fact that a nested sub-select in the RETURNING statement executes on the
> table as if the INSERT/UPDATE had not happened. 
> 
> I suppose this might be obvious if you understand how SQL works, but I
> think it is nuanced enough to be worth explaining here, since it provides
> some useful features for UPSERT queries. Example:
> 
> ```sql
> create table foo (x int primary key, y int);
> --=> CREATE TABLE
> insert into foo (x, y) values (1, 1);
> --=> INSERT 0 1
> update foo set y = 2 where x = 1 returning (select y from foo where x = 1)
> as old_y;
> /* =>
>  *  old_y 
>  * ---
>  *  1
>  * (1 row)
>  *
>  * UPDATE 1
>  */
> select * from foo;
> /* =>
>  *  x | y 
>  * ---+---
>  *  1 | 2
>  * (1 row)
>  */
> ```

Sorry for the delay in replying.  I am moving this thread to hackers
because it isn't clearly a documentation issue.  I did some research on
this and it is kind of confusing:

CREATE TABLE foo (x INT PRIMARY KEY, y INT);

INSERT INTO foo (x, y) VALUES (1, 1);

UPDATE foo SET y = y + 1 WHERE x = 1 RETURNING y;
 y
---
 2
SELECT y FROM foo;
 y
---
 2

UPDATE foo SET y = y + 1 WHERE x = 1 RETURNING (y);
 y
---
 3
SELECT y FROM foo;
 y
---
 3

UPDATE foo SET y = y + 1 WHERE x = 1 RETURNING (SELECT y);
 y
---
 4
SELECT y FROM foo;
 y
---
 4

UPDATE foo SET y = y + 1 WHERE x = 1 RETURNING (SELECT y FROM foo);
 y
---
 4
SELECT y FROM foo;
 y
---
 5

So, it is only when querying 'foo' that it uses the pre-UPDATE
visibility snapshot.  The 'y' in 'SELECT y' is the 'y' from the
update, but the 'y' from 'SELECT y FROM foo' uses the snapshot from
before the update.  My guess is that we just didn't consider the rules
for what the 'y' references, and I bet if I dig into the code I can find
out why this is happening.

RETURNING for INSERT/UPDATE/DELETE isn't part of the SQL standard, so we
don't have much guidance there.  It is as though the 'FROM foo' changes
the resolution of the 'y' because it is closer.

I am unclear if this should be documented or changed, or neither.
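
Putting those results together, the behavior the original reporter is
relying on can be written against the example table above as a single
statement; at least in the tests above, the subselect over foo sees the
pre-UPDATE row while the bare column reference sees the new one:

    UPDATE foo SET y = y + 1 WHERE x = 1
    RETURNING (SELECT y FROM foo WHERE x = 1) AS old_y, y AS new_y;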

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Additional size of hash table is always zero for hash aggregates

2020-03-13 Thread Pengzhou Tang
On Fri, Mar 13, 2020 at 8:34 AM Andrew Gierth 
wrote:

> > "Justin" == Justin Pryzby  writes:
>
>  > On Thu, Mar 12, 2020 at 12:16:26PM -0700, Andres Freund wrote:
>  >> Indeed, that's incorrect. Causes the number of buckets for the
>  >> hashtable to be set higher - the size is just used for that. I'm a
>  >> bit wary of changing this in the stable branches - could cause
>  >> performance changes?
>
> I think (offhand, not tested) that the number of buckets would only be
> affected if the (planner-supplied) numGroups value would cause work_mem
> to be exceeded; the planner doesn't plan a hashagg at all in that case
> unless forced to (grouping by a hashable but not sortable column). Note
> that for various reasons the planner tends to over-estimate the memory
> requirement anyway.
>
>
That makes sense, thanks


Re: Additional size of hash table is always zero for hash aggregates

2020-03-13 Thread Pengzhou Tang
>
> On 2020-03-12 16:35:15 +0800, Pengzhou Tang wrote:
> > When reading the grouping sets code, I found that the additional size of
> > the hash table for hash aggregates is always zero. This seems to be
> > incorrect to me; I've attached a patch to fix it, please help to check.
>
> Indeed, that's incorrect. Causes the number of buckets for the hashtable
> to be set higher - the size is just used for that.  I'm a bit wary of
> changing this in the stable branches - could cause performance changes?
>
>
Thanks for confirming this.


Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-13 Thread James Coleman
On Friday, March 13, 2020, Tomas Vondra 
wrote:

> On Fri, Mar 13, 2020 at 04:31:16PM -0400, James Coleman wrote:
>
>> On Fri, Mar 13, 2020 at 2:23 PM James Coleman  wrote:
>>
>>>
>>> On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
>>>  wrote:
>>> > 3) Most of the execution plans look reasonable, except that some of the
>>> > plans look like this:
>>> >
>>> >
>>> >   QUERY PLAN
>>> >-
>>> > Limit
>>> >   ->  GroupAggregate
>>> > Group Key: t.a, t.b, t.c, t.d
>>> > ->  Incremental Sort
>>> >   Sort Key: t.a, t.b, t.c, t.d
>>> >   Presorted Key: t.a, t.b, t.c
>>> >   ->  Incremental Sort
>>> > Sort Key: t.a, t.b, t.c
>>> > Presorted Key: t.a, t.b
>>> > ->  Index Scan using t_a_b_idx on t
>>> >(10 rows)
>>> >
>>> > i.e. there are two incremental sorts on top of each other, with
>>> > different prefixes. But this is not a new issue - it happens with
>>> > queries like this:
>>> >
>>> >SELECT a, b, c, d, count(*) FROM (
>>> >  SELECT * FROM t ORDER BY a, b, c
>>> >) foo GROUP BY a, b, c, d limit 1000;
>>> >
>>> > i.e. there's a subquery with a subset of pathkeys. Without incremental
>>> > sort the plan looks like this:
>>> >
>>> > QUERY PLAN
>>> >-
>>> > Limit
>>> >   ->  GroupAggregate
>>> > Group Key: t.a, t.b, t.c, t.d
>>> > ->  Sort
>>> >   Sort Key: t.a, t.b, t.c, t.d
>>> >   ->  Sort
>>> > Sort Key: t.a, t.b, t.c
>>> > ->  Seq Scan on t
>>> >(8 rows)
>>> >
>>> > so essentially the same plan shape. What bugs me though is that there
>>> > seems to be some sort of memory leak, so that this query consumes
>>> > gigabytes of RAM before it gets killed by OOM. But the memory seems not
>>> > to be allocated in any memory context (at least MemoryContextStats
>>> don't
>>> > show anything like that), so I'm not sure what's going on.
>>> >
>>> > Reproducing it is fairly simple:
>>> >
>>> >CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
>>> >INSERT INTO t SELECT
>>> >  1000*random(), 1000*random(), 1000*random(), 1000*random()
>>> >FROM generate_series(1,1000) s(i);
>>> >CREATE INDEX idx ON t(a,b);
>>> >ANALYZE t;
>>> >
>>> >EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
>>> >FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
>>> >LIMIT 100;
>>>
>>> While trying to reproduce this, instead of lots of memory usage, I got
>>> the attached assertion failure instead.
>>>
>>
>> And, without the EXPLAIN ANALYZE was able to get this one, which will
>> probably be a lot more helpful.
>>
>>
> Hmmm, I'll try reproducing it, but can you investigate the values in the
> Assert? I mean, it fails on this:
>
>   Assert(total_allocated == context->mem_allocated);
>
> so can you get a core or attach to the process using gdb, and see what's
> the expected / total value?
>
> BTW, I might have copied the wrong query - can you try with a higher
> value in the LIMIT clause? For example:
>
> EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
> FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
> LIMIT 100;
>
> I think this might be the difference why you don't see the memory leak.
> Or maybe it was because of asserts? I'm not sure if I had enabled them
> in the build ...
>

I’m not at my laptop right now; I’ve started looking at it, but I
haven’t figured it out yet. Going from memory, it had allocated 16384 but
expected 8192 (I think I have the order of that right).

It’s very consistently reproducible, thankfully, but doesn’t always happen
on the first query; IIRC always the 2nd with LIMIT 100, and I could get it
to happen with first at 96 and second at 97, but repeating 96 many times
didn’t seem to trigger it.

I’m hoping it’s the same root cause as the memory leak, but unsure.

I’ll try a higher number when I get a chance.

James


Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-13 Thread Tomas Vondra

On Fri, Mar 13, 2020 at 04:31:16PM -0400, James Coleman wrote:

On Fri, Mar 13, 2020 at 2:23 PM James Coleman  wrote:


On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
 wrote:
> 3) Most of the execution plans look reasonable, except that some of the
> plans look like this:
>
>
>   QUERY PLAN
>-
> Limit
>   ->  GroupAggregate
> Group Key: t.a, t.b, t.c, t.d
> ->  Incremental Sort
>   Sort Key: t.a, t.b, t.c, t.d
>   Presorted Key: t.a, t.b, t.c
>   ->  Incremental Sort
> Sort Key: t.a, t.b, t.c
> Presorted Key: t.a, t.b
> ->  Index Scan using t_a_b_idx on t
>(10 rows)
>
> i.e. there are two incremental sorts on top of each other, with
> different prefixes. But this is not a new issue - it happens with
> queries like this:
>
>SELECT a, b, c, d, count(*) FROM (
>  SELECT * FROM t ORDER BY a, b, c
>) foo GROUP BY a, b, c, d limit 1000;
>
> i.e. there's a subquery with a subset of pathkeys. Without incremental
> sort the plan looks like this:
>
> QUERY PLAN
>-
> Limit
>   ->  GroupAggregate
> Group Key: t.a, t.b, t.c, t.d
> ->  Sort
>   Sort Key: t.a, t.b, t.c, t.d
>   ->  Sort
> Sort Key: t.a, t.b, t.c
> ->  Seq Scan on t
>(8 rows)
>
> so essentially the same plan shape. What bugs me though is that there
> seems to be some sort of memory leak, so that this query consumes
> gigabytes of RAM before it gets killed by OOM. But the memory seems not
> to be allocated in any memory context (at least MemoryContextStats don't
> show anything like that), so I'm not sure what's going on.
>
> Reproducing it is fairly simple:
>
>CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
>INSERT INTO t SELECT
>  1000*random(), 1000*random(), 1000*random(), 1000*random()
>FROM generate_series(1,1000) s(i);
>CREATE INDEX idx ON t(a,b);
>ANALYZE t;
>
>EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
>FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
>LIMIT 100;

While trying to reproduce this, instead of lots of memory usage, I got
the attached assertion failure instead.


And, without the EXPLAIN ANALYZE was able to get this one, which will
probably be a lot more helpful.



Hmmm, I'll try reproducing it, but can you investigate the values in the
Assert? I mean, it fails on this:

  Assert(total_allocated == context->mem_allocated);

so can you get a core or attach to the process using gdb, and see what's
the expected / total value?

BTW, I might have copied the wrong query - can you try with a higher
value in the LIMIT clause? For example:

EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
LIMIT 100;

I think this might be the difference why you don't see the memory leak.
Or maybe it was because of asserts? I'm not sure if I had enabled them
in the build ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-13 Thread Justin Pryzby
On Fri, Mar 13, 2020 at 02:38:51PM -0700, Andres Freund wrote:
> > |One disadvantage of decreasing vacuum_freeze_min_age is that it might cause
> > |VACUUM to do useless work: freezing a row version is a waste of time if the row
> > |is modified soon thereafter (causing it to acquire a new XID). So the setting
> > |should be large enough that rows are not frozen until they are unlikely to
> > |change any more.
> 
> I think the overhead here might be a bit overstated. Once a page is

Could you clarify if you mean the language in docs in general or specifically
in the context of this patch?

-- 
Justin




Re: proposal: new polymorphic types - commontype and commontypearray

2020-03-13 Thread Tom Lane
Pavel Stehule  writes:
> [ anycompatible-types-20191127.patch ]

I'm starting to review this patch seriously.  I've found some bugs and
things I didn't like, and the documentation certainly needs work, but
I think I can get it to a committable state before too much longer.

What I want to talk about right now is some preliminary refactoring
that I'd like to do, as shown in the 0001 patch below.  (0002 is the
rest of the patch as I currently have it.)  There are two main things
in it:

1. Rearrange the macros in pseudotypes.c so that we don't have any
pure-boilerplate functions that aren't built by the macros.  I don't
think this should be controversial, as it's not changing anything
functionally.

2. Refactor the function signature validation logic in pg_proc.c and
pg_aggregate.c to avoid having duplicate logic between those two.
I did that by creating new functions in parse_coerce.c (for lack of
a better place) that say whether a proposed result type or aggregate
transition type is valid given a particular set of declared input types.
The reason that this might be controversial is that it forces a slightly
less precise error detail message to be issued, since the call site that's
throwing the error doesn't know exactly which rule was being violated.
(For example, before there was a specific error message about anyrange
result requiring an anyrange input, and now there isn't.)
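
To make the trade-off concrete, the kind of case that loses its specific
message is something like this (hypothetical function; error text
paraphrased from memory, so treat it as approximate):

    -- polymorphic result (anyrange) whose only polymorphic input is anyelement
    CREATE FUNCTION broken(x anyelement) RETURNS anyrange
        LANGUAGE sql AS 'SELECT NULL';
    -- ERROR:  cannot determine result data type
    -- DETAIL:  A function returning "anyrange" must have at least one
    --          "anyrange" argument.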

I think this is all right, mainly because we'd probably end up with
less-precise messages anyway for the more complex rules that 0002 is
going to add.  If anybody's really hot about it, we could complicate
the API, say by having the call sites pass in the primary error message
or by having the checking subroutines pass back an errdetail string.

We definitely need to do *something* about that, because it's already
the case that pg_aggregate.c is out of step with pg_proc.c about
polymorphism rules --- it's not enforcing the anyrange rule.  I think
there's probably no user-reachable bug in that, because an aggregate
is constrained by its implementation functions for which the rule
would be enforced, but it still seems not good.

Thoughts?

regards, tom lane

diff --git a/src/backend/catalog/pg_aggregate.c b/src/backend/catalog/pg_aggregate.c
index 0b7face..354d00a 100644
--- a/src/backend/catalog/pg_aggregate.c
+++ b/src/backend/catalog/pg_aggregate.c
@@ -93,8 +93,6 @@ AggregateCreate(const char *aggName,
 	Oid			mfinalfn = InvalidOid;	/* can be omitted */
 	Oid			sortop = InvalidOid;	/* can be omitted */
 	Oid		   *aggArgTypes = parameterTypes->values;
-	bool		hasPolyArg;
-	bool		hasInternalArg;
 	bool		mtransIsStrict = false;
 	Oid			rettype;
 	Oid			finaltype;
@@ -131,36 +129,29 @@ AggregateCreate(const char *aggName,
 			   FUNC_MAX_ARGS - 1,
 			   FUNC_MAX_ARGS - 1)));
 
-	/* check for polymorphic and INTERNAL arguments */
-	hasPolyArg = false;
-	hasInternalArg = false;
-	for (i = 0; i < numArgs; i++)
-	{
-		if (IsPolymorphicType(aggArgTypes[i]))
-			hasPolyArg = true;
-		else if (aggArgTypes[i] == INTERNALOID)
-			hasInternalArg = true;
-	}
-
 	/*
 	 * If transtype is polymorphic, must have polymorphic argument also; else
 	 * we will have no way to deduce the actual transtype.
 	 */
-	if (IsPolymorphicType(aggTransType) && !hasPolyArg)
+	if (!is_valid_polymorphic_signature(aggTransType,
+		aggArgTypes,
+		numArgs))
 		ereport(ERROR,
 (errcode(ERRCODE_INVALID_FUNCTION_DEFINITION),
  errmsg("cannot determine transition data type"),
- errdetail("An aggregate using a polymorphic transition type must have at least one polymorphic argument.")));
+ errdetail("An aggregate using a polymorphic transition type must have at least one matching polymorphic argument.")));
 
 	/*
 	 * Likewise for moving-aggregate transtype, if any
 	 */
 	if (OidIsValid(aggmTransType) &&
-		IsPolymorphicType(aggmTransType) && !hasPolyArg)
+		!is_valid_polymorphic_signature(aggmTransType,
+		aggArgTypes,
+		numArgs))
 		ereport(ERROR,
 (errcode(ERRCODE_INVALID_FUNCTION_DEFINITION),
  errmsg("cannot determine transition data type"),
- errdetail("An aggregate using a polymorphic transition type must have at least one polymorphic argument.")));
+ errdetail("An aggregate using a polymorphic transition type must have at least one matching polymorphic argument.")));
 
 	/*
 	 * An ordered-set aggregate that is VARIADIC must be VARIADIC ANY.  In
@@ -492,12 +483,13 @@ AggregateCreate(const char *aggName,
 	 * that itself violates the rule against polymorphic result with no
 	 * polymorphic input.)
 	 */
-	if (IsPolymorphicType(finaltype) && !hasPolyArg)
+	if (!is_valid_polymorphic_signature(finaltype,
+		aggArgTypes,
+		numArgs))
 		ereport(ERROR,
 (errcode(ERRCODE_DATATYPE_MISMATCH),
  errmsg("cannot determine result data type"),
- errdetail("An aggregate returning a polymorphic type "
-		   "must have at least one polymorphic 

Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-13 Thread Laurenz Albe
On Fri, 2020-03-13 at 13:44 -0500, Justin Pryzby wrote:
> Possible it would be better to run VACUUM *without* freeze_min_age=0 ?  (I get
> confused and have to spend 20min re-reading the vacuum GUC docs every time I
> deal with this stuff, so maybe I'm off).
> 
> As I understand, the initial motivation of this patch was to avoid disruptive
> anti-wraparound vacuums on insert-only table.  But if vacuum were triggered at
> all, it would freeze the oldest tuples, which is all that's needed; especially
> since fd31cd2651 "Don't vacuum all-frozen pages.", those pages would never need
> to be vacuumed again.  Recently written tuples wouldn't be frozen, which is ok,
> they're handled next time.

Freezing tuples too early is wasteful if the tuples get updated or deleted
soon after, but based on the assumption that an autovacuum triggered by insert
is dealing with an insert-mostly table, it is not that wasteful.

If we didn't freeze all tuples, it is easy to envision a situation where
bulk data loads load several million rows in a few transactions, which
would trigger a vacuum.  With the normal vacuum_freeze_min_age, that vacuum
would do nothing at all.  It is better if each vacuum freezes some rows,
in other words, if it does some of the anti-wraparound work.
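
(For anyone who wants to watch how much of that anti-wraparound work is
still outstanding, the usual query is something like:)

    SELECT relname, age(relfrozenxid) AS xid_age
      FROM pg_class
     WHERE relkind = 'r'
     ORDER BY age(relfrozenxid) DESC
     LIMIT 10;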

> Another motivation of the patch is to allow indexonly scan, for which the
> planner looks at pages' "relallvisible" fraction (and at execution if a page
> isn't allvisible, visits the heap).  Again, that happens if vacuum were run at
> all.  Again, some pages won't be marked allvisible, which is fine, they're
> handled next time.

Yes, freezing is irrelevant with respect to index only scans, but it helps
with mitigating the impact of anti-wraparound vacuum runs.

> I think freeze_min_age=0 could negatively affect people who have insert-mostly
> tables (I'm not concerned, but that includes us).  If they consistently hit the
> autovacuum insert threshold before the cleanup threshold for updated/deleted
> tuples, any updated/deleted tuples would be frozen, which would be
> wasteful:

I don't get that.  Surely tuples whose xmax is committed won't be frozen.

> So my question is if autovacuum triggered by insert threshold should trigger
> VACUUM with the same settings as a vacuum due to deleted tuples.  I realize the
> DBA could just configure the thresholds so they'd hit vacuum for cleaning dead
> tuples, so my suggestion maybe just improves the case with the default
> settings.  It's possible to set the reloption autovacuum_freeze_min_age, which
> I think supports the idea of running a vacuum normally and letting it (and the
> DBA) decide what to do with existing logic.

Yes, the DBA can explicitly set vacuum_freeze_min_age to 0.
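
For completeness, that kind of per-table tuning is already possible today,
for example (table name made up):

    ALTER TABLE bulk_load_target SET (autovacuum_freeze_min_age = 0);
    -- or, for a manual run, FREEZE is equivalent to vacuum_freeze_min_age = 0:
    VACUUM (FREEZE, VERBOSE) bulk_load_target;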

But for one DBA who understands his or her workload well enough, and who knows
the workings of autovacuum well enough to do that kind of tuning, there are
99 DBAs who don't, and it is the goal of the patch (expressed in the subject)
to make things work for those people who go with the default.

And I believe that is better achieved with freezing as many tuples as possible.

> Also, there was a discussion about index cleanup with the conclusion that it
> was safer not to skip it, since otherwise indexes might bloat.  I think that's
> right, since vacuum for cleanup is triggered by the number of dead heap tuples.
> To skip index cleanup, I think you'd want a metric for
> n_dead_since_index_cleanup.  (Or maybe analyze could track dead index tuples
> and trigger vacuum of each index separately).

Yes, I think we pretty much all agree on that.

> Having now played with the patch, I'll suggest that 1000 is too high a
> threshold.  If autovacuum runs without FREEZE, I don't see why it couldn't be
> much lower (10?) or use (0.2 * n_ins + 50) like the other autovacuum GUC.

There is the concern that that might treat large tables too seldom.
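
To put a number on that, with the documented defaults for the existing
dead-tuple trigger (threshold 50, scale factor 0.2):

    -- existing rule: vacuum when n_dead_tup > threshold + scale_factor * reltuples
    SELECT 50 + 0.2 * 10000000 AS trigger_for_a_10m_row_table;  -- 2000050
    -- an insert-based rule of the same shape would wait for ~2 million inserts
    -- on such a table, while a fixed threshold of 1000 fires almost immediately.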

I am curious - what were the findings that led you to think that 1000
is too high?

Yours,
Laurenz Albe





Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-13 Thread Andres Freund
Hi,

On 2020-03-13 13:44:42 -0500, Justin Pryzby wrote:
> As I understand, the initial motivation of this patch was to avoid disruptive
> anti-wraparound vacuums on insert-only table.  But if vacuum were triggered at
> all, it would freeze the oldest tuples, which is all that's needed; especially
> since fd31cd2651 "Don't vacuum all-frozen pages.", those pages would never need
> to be vacuumed again.  Recently written tuples wouldn't be frozen, which is ok,
> they're handled next time.
> 
> Another motivation of the patch is to allow indexonly scan, for which the
> planner looks at pages' "relallvisible" fraction (and at execution if a page
> isn't allvisible, visits the heap).  Again, that happens if vacuum were run at
> all.  Again, some pages won't be marked allvisible, which is fine, they're
> handled next time.
> 
> I think freeze_min_age=0 could negatively affect people who have insert-mostly
> tables (I'm not concerned, but that includes us).  If they consistently hit the
> autovacuum insert threshold before the cleanup threshold for updated/deleted
> tuples, any updated/deleted tuples would be frozen, which would be
> wasteful:

I think that's a valid concern.


> |One disadvantage of decreasing vacuum_freeze_min_age is that it might cause
> |VACUUM to do useless work: freezing a row version is a waste of time if the row
> |is modified soon thereafter (causing it to acquire a new XID). So the setting
> |should be large enough that rows are not frozen until they are unlikely to
> |change any more.

I think the overhead here might be a bit overstated. Once a page is
dirtied (or already dirty) during vacuum, and we freeze a single row
(necessitating WAL logging), there's not really a good reason to not also
freeze the rest of the rows on that page. The added cost for freezing
another row is minuscule compared to the "constant" cost of freezing
anything on the page.  It's of course different if there are otherwise
no tuples worth freezing on the page (not uncommon). But there's really
no reason for that to be the case:

Afaict the only problem with more aggressively freezing when we touch
(beyond hint bits) the page anyway is that we commonly end up with
multiple WAL records for the same page:

1) lazy_scan_heap()->heap_page_prune() will log a XLOG_HEAP2_CLEAN record, but
   leave itemids in place most of the time
2) lazy_scan_heap()->log_heap_freeze() will log a XLOG_HEAP2_FREEZE_PAGE record
3a) if no indexes exist/index cleanup is disabled:
  lazy_scan_heap()->lazy_vacuum_page() will log a XLOG_HEAP2_CLEAN
  record, removing dead tuples (including itemids)
3b) if indexes need to be cleaned up,
  lazy_vacuum_heap()->lazy_vacuum_page() will log a XLOG_HEAP2_CLEAN

which is not nice. It likely is worth merging xl_heap_freeze_page into
xl_heap_clean, and having heap pruning always freeze once it decides to
dirty a page.

We could probably always prune dead tuples as part of heap_prune_chain()
if there's no indexes - but I'm doubtful it's worth it, since there'll
be few tables with lots of dead tuples that don't have indexes.

Merging 3b's WAL record would be harder, I think.


There's also a significant source of additional WAL records here, one
that I think should really not have been introduced:

4) HeapTupleSatisfiesVacuum() called both by heap_prune_chain(), and
  lazy_scan_heap() will often trigger a WAL record when checksums or
  wal_log_hints are enabled. If the page hasn't been modified in the
  current checkpoint window (extremely common for VACUUM, reasonably
  common for opportunistic pruning), we will log a full page write.

  Imo this really should have been avoided when checksums were added,
  that's a pretty substantial and unnecessary increase in overhead.
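
(To check whether this applies to a given cluster:)

    SHOW data_checksums;
    SHOW wal_log_hints;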


It's probably overkill to tie fixing the 'insert only' case to improving
the WAL logging for vacuuming / pruning. But it'd certainly would
largely remove the tradeoff discussed here, by removing additional
overhead of freezing in tables that are also updated.


> Also, there was a discussion about index cleanup with the conclusion that it
> was safer not to skip it, since otherwise indexes might bloat.  I think that's
> right, since vacuum for cleanup is triggered by the number of dead heap tuples.
> To skip index cleanup, I think you'd want a metric for
> n_dead_since_index_cleanup.  (Or maybe analyze could track dead index tuples
> and trigger vacuum of each index separately).
> 
> Having now played with the patch, I'll suggest that 1000 is too high a
> threshold.  If autovacuum runs without FREEZE, I don't see why it couldn't be
> much lower (10?) or use (0.2 * n_ins + 50) like the other autovacuum GUC.

ISTM that the danger of regressing workloads due to suddenly repeatedly
scanning huge indexes that previously were never / rarely scanned is
significant (if there's a few dead tuples; otherwise most indexes will
be able to skip the scan since the vacuum_cleanup_index_scale_factor
introduction).


Re: backend type in log_line_prefix?

2020-03-13 Thread Peter Eisentraut

On 2020-03-10 19:07, Alvaro Herrera wrote:

I like these patches; the first two are nice cleanup.

My only gripe is that pgstat_get_backend_desc() is not really a pgstat
function; I think it should have a different name with a prototype in
miscadmin.h (next to the enum's new location, which I would put
someplace near the "pmod.h" comment rather than where you put it;
perhaps just above the AuxProcType definition), and implementation
probably in miscinit.c.


I have committed the refactoring patches with adjustments along these 
lines.  The patch with the log_line_prefix and csvlog enhancements is 
still under discussion.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: backend type in log_line_prefix?

2020-03-13 Thread Peter Eisentraut

On 2020-03-11 19:53, Justin Pryzby wrote:

Can I suggest:

$ git diff
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 3a6f7f9456..56e0a1437e 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -2945,7 +2945,7 @@ write_csvlog(ErrorData *edata)
 	if (MyProcPid == PostmasterPid)
 		appendCSVLiteral(&buf, "postmaster");
 	else if (MyBackendType == B_BG_WORKER)
-		appendCSVLiteral(&buf, MyBgworkerEntry->bgw_type);
+		appendCSVLiteral(&buf, MyBgworkerEntry->bgw_name);
 	else
 		appendCSVLiteral(&buf, pgstat_get_backend_desc(MyBackendType));


The difference is intentional.  bgw_type is so that you can filter and 
group by type.  The bgw_name could be totally different for each instance.
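
(For comparison, this is the same distinction pg_stat_activity already
exposes; as far as I recall, backend_type reports bgw_type for background
workers:)

    SELECT pid, backend_type, application_name FROM pg_stat_activity;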


Having the bgw name available somehow would perhaps also be useful, but 
then we should also do this in a consistent way for processes that are 
not background workers, such as regular client backends or wal senders 
or autovacuum workers.  Doing it just for background workers would 
create inconsistencies that the introduction of bgw_type some time ago 
sought to eliminate.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: WAL prefetch (another approach)

2020-03-13 Thread Alvaro Herrera
I tried my luck at a quick read of this patchset.
I didn't manage to go over 0005 though, but I agree with Tomas that
having this be configurable in terms of bytes of WAL is not very
user-friendly.

First of all, let me join the crowd chanting that this is badly needed;
I don't need to repeat what Chittenden's talk showed.  "WAL recovery is
now 10x-20x times faster" would be a good item for pg13 press release, 
I think.

> From a61b4e00c42ace5db1608e02165f89094bf86391 Mon Sep 17 00:00:00 2001
> From: Thomas Munro 
> Date: Tue, 3 Dec 2019 17:13:40 +1300
> Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.
> 
> Previously a Relation was required, but it's annoying to have
> to create a "fake" one in recovery.

LGTM.

It's a pity to have to include smgr.h in bufmgr.h.  Maybe it'd be sane
to use a forward struct declaration and "struct SMgrRelation *" instead.


> From acbff1444d0acce71b0218ce083df03992af1581 Mon Sep 17 00:00:00 2001
> From: Thomas Munro 
> Date: Mon, 9 Dec 2019 17:10:17 +1300
> Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().
> 
> The new name better reflects the fact that the value it returns
> is updated only when received data has been flushed to disk.
> 
> An upcoming patch will make use of the latest data that was
> written without waiting for it to be flushed, so use more
> precise function names.

Ugh.  (Not for your patch -- I mean for the existing naming convention).
It would make sense to rename WalRcvData->receivedUpto in this commit,
maybe to flushedUpto.


> From d7fa7d82c5f68d0cccf441ce9e8dfa40f64d3e0d Mon Sep 17 00:00:00 2001
> From: Thomas Munro 
> Date: Mon, 9 Dec 2019 17:22:07 +1300
> Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition).
> 
> A later patch will read received WAL to prefetch referenced blocks,
> without waiting for the data to be flushed to disk.  To do that,
> it needs to be able to see the write pointer advancing in shared
> memory.
> 
> The function formerly bearing this name was recently renamed to
> WalRcvGetFlushRecPtr(), which better described what it does.

> + pg_atomic_init_u64(&WalRcv->writtenUpto, 0);

Umm, how come you're using WalRcv here instead of walrcv?  I would flag
this patch for sneaky nastiness if this weren't mostly harmless.  (I
think we should do away with local walrcv pointers altogether.  But that
should be a separate patch, I think.)

> + pg_atomic_uint64 writtenUpto;

Are we already using uint64s for XLogRecPtrs anywhere?  This seems
novel.  Given this, I wonder if the comment near "mutex" needs an
update ("except where atomics are used"), or perhaps just move the
member to after the line with mutex.


I didn't understand the purpose of inc_counter() as written.  Why not
just pg_atomic_fetch_add_u64(..., 1)?

>  /*
>   *   smgrprefetch() -- Initiate asynchronous read of the specified block of 
> a relation.
> + *
> + *   In recovery only, this can return false to indicate that a file
> + *   doesn't exist (presumably it has been dropped by a later WAL
> + *   record).
>   */
> -void
> +bool
>  smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)

I think this API, where the behavior of a low-level module changes
depending on InRecovery, is confusingly crazy.  I'd rather have the
callers specifying whether they're OK with a file that doesn't exist.

> +extern PrefetchBufferResult SharedPrefetchBuffer(SMgrRelation smgr_reln,
> +                                                  ForkNumber forkNum,
> +                                                  BlockNumber blockNum);
>  extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
>                             BlockNumber blockNum);

Umm, I would keep the return values of both these functions in sync.
It's really strange that PrefetchBuffer does not return
PrefetchBufferResult, don't you think?

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-13 Thread James Coleman
On Fri, Mar 13, 2020 at 2:23 PM James Coleman  wrote:
>
> On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
>  wrote:
> > 3) Most of the execution plans look reasonable, except that some of the
> > plans look like this:
> >
> >
> >   QUERY PLAN
> >-
> > Limit
> >   ->  GroupAggregate
> > Group Key: t.a, t.b, t.c, t.d
> > ->  Incremental Sort
> >   Sort Key: t.a, t.b, t.c, t.d
> >   Presorted Key: t.a, t.b, t.c
> >   ->  Incremental Sort
> > Sort Key: t.a, t.b, t.c
> > Presorted Key: t.a, t.b
> > ->  Index Scan using t_a_b_idx on t
> >(10 rows)
> >
> > i.e. there are two incremental sorts on top of each other, with
> > different prefixes. But this is not a new issue - it happens with
> > queries like this:
> >
> >SELECT a, b, c, d, count(*) FROM (
> >  SELECT * FROM t ORDER BY a, b, c
> >) foo GROUP BY a, b, c, d limit 1000;
> >
> > i.e. there's a subquery with a subset of pathkeys. Without incremental
> > sort the plan looks like this:
> >
> > QUERY PLAN
> >-
> > Limit
> >   ->  GroupAggregate
> > Group Key: t.a, t.b, t.c, t.d
> > ->  Sort
> >   Sort Key: t.a, t.b, t.c, t.d
> >   ->  Sort
> > Sort Key: t.a, t.b, t.c
> > ->  Seq Scan on t
> >(8 rows)
> >
> > so essentially the same plan shape. What bugs me though is that there
> > seems to be some sort of memory leak, so that this query consumes
> > gigabytes of RAM before it gets killed by OOM. But the memory seems not
> > to be allocated in any memory context (at least MemoryContextStats don't
> > show anything like that), so I'm not sure what's going on.
> >
> > Reproducing it is fairly simple:
> >
> >CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
> >INSERT INTO t SELECT
> >  1000*random(), 1000*random(), 1000*random(), 1000*random()
> >FROM generate_series(1,1000) s(i);
> >CREATE INDEX idx ON t(a,b);
> >ANALYZE t;
> >
> >EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
> >FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
> >LIMIT 100;
>
> While trying to reproduce this, instead of lots of memory usage, I got
> the attached assertion failure instead.

And, without the EXPLAIN ANALYZE was able to get this one, which will
probably be a lot more helpful.

James
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7f15f92dd535 in __GI_abort () at abort.c:79
#2  0x5620946125d8 in ExceptionalCondition (conditionName=conditionName@entry=0x5620947c0eb0 "total_allocated == context->mem_allocated", errorType=errorType@entry=0x562094668028 "FailedAssertion", fileName=fileName@entry=0x5620947c0c33 "aset.c", lineNumber=lineNumber@entry=1541) at assert.c:67
#3  0x562094637761 in AllocSetCheck (context=context@entry=0x562096344590) at aset.c:1541
#4  0x5620946378aa in AllocSetDelete (context=0x562096344590) at aset.c:655
#5  0x56209463e8e0 in MemoryContextDelete (context=0x562096344590) at mcxt.c:245
#6  0x56209463e96f in MemoryContextDeleteChildren (context=context@entry=0x562096350610) at mcxt.c:265
#7  0x56209463e9c0 in MemoryContextReset (context=0x562096350610) at mcxt.c:142
#8  0x562094647bac in tuplesort_free (state=state@entry=0x56209633e680) at tuplesort.c:1327
#9  0x56209464ceb4 in tuplesort_reset (state=state@entry=0x56209633e680) at tuplesort.c:1406
#10 0x56209438596f in ExecIncrementalSort (pstate=0x56209631e550) at nodeIncrementalSort.c:584
#11 0x562094385b76 in ExecProcNode (node=0x56209631e550) at ../../../src/include/executor/executor.h:245
#12 ExecIncrementalSort (pstate=0x56209631e338) at nodeIncrementalSort.c:638
#13 0x56209437432c in ExecProcNode (node=0x56209631e338) at ../../../src/include/executor/executor.h:245
#14 fetch_input_tuple (aggstate=aggstate@entry=0x56209631df38) at nodeAgg.c:411
#15 0x562094376eb6 in agg_retrieve_direct (aggstate=aggstate@entry=0x56209631df38) at nodeAgg.c:1877
#16 0x5620943770e0 in ExecAgg (pstate=0x56209631df38) at nodeAgg.c:1597
#17 0x5620943897db in ExecProcNode (node=0x56209631df38) at ../../../src/include/executor/executor.h:245
#18 ExecLimit (pstate=0x56209631dd30) at nodeLimit.c:149
#19 0x5620943615c4 in ExecProcNode (node=0x56209631dd30) at ../../../src/include/executor/executor.h:245
#20 ExecutePlan (estate=estate@entry=0x56209631daf0, planstate=0x56209631dd30, use_parallel_mode=<optimized out>, operation=operation@entry=CMD_SELECT, sendTuples=sendTuples@entry=true, numberTuples=numberTuples@entry=0, direction=ForwardScanDirection, dest=0x56209632b248, execute_once=true) at execMain.c:1646
#21 0x562094362294 

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-13 Thread Tom Lane
Alvaro Herrera  writes:
> Also, I wonder if it would be better to modify our policies so that we
> update typedefs.list more frequently.  Some people include additions
> with their commits, but it's far from SOP.

Perhaps.  My own workflow includes pulling down a fresh typedefs.list
from the buildfarm (which is trivial to automate) and then adding any
typedefs invented by the patch I'm working on.  The latter part of it
makes it hard to see how the in-tree list would be very helpful; and
if we started expecting patches to include typedef updates, I'm afraid
we'd get lots of patch collisions in that file.

I don't have any big objection to updating the in-tree list more often,
but personally I wouldn't use it, unless we can find a better workflow.

regards, tom lane




Re: database stuck in __epoll_wait_nocancel(). Are infinite timeouts safe?

2020-03-13 Thread Merlin Moncure
On Fri, Mar 13, 2020 at 2:28 PM Andres Freund  wrote:
>
> Hi,
>
> On March 13, 2020 12:08:32 PM PDT, Merlin Moncure  wrote:
>I have 5 servers in a testing environment that comprise a data
>warehousing cluster.   They will typically each get exactly the
> >same query at approximately the same time.  Yesterday, around 1pm, 3
> >of the five got stuck on the same query.  Each of them yields similar
> >stack traces.  This  happens now and then.  The server is 9.6.12
> >(which is obviously old, but I did not see any changes in relevant
> >code).
> >
> >(gdb) bt
> >#0  0x7fe856c0b463 in __epoll_wait_nocancel () from
> >/lib64/libc.so.6
> >#1  0x006b4416 in WaitEventSetWaitBlock (nevents=1,
> >occurred_events=0x7ffc9f2b0f60, cur_timeout=-1, set=0x27cace8) at
> >latch.c:1053
> >#2  WaitEventSetWait (set=0x27cace8, timeout=timeout@entry=-1,
> >occurred_events=occurred_events@entry=0x7ffc9f2b0f60,
> >nevents=nevents@entry=1) at latch.c:1007
> >#3  0x005f26dd in secure_write (port=0x27f16a0,
> >ptr=ptr@entry=0x27f5528, len=len@entry=192) at be-secure.c:255
> >#4  0x005fb51b in internal_flush () at pqcomm.c:1410
> >#5  0x005fb72a in internal_putbytes (s=0x2a4f245 "14M04",
> >s@entry=0x2a4f228 "", len=70) at pqcomm.c:1356
> >#6  0x005fb7f0 in socket_putmessage (msgtype=68 'D',
> >s=0x2a4f228 "", len=) at pqcomm.c:1553
> >#7  0x005fd5d9 in pq_endmessage (buf=buf@entry=0x7ffc9f2b1040)
> >at pqformat.c:347
> >#8  0x00479a63 in printtup (slot=0x2958fc8, self=0x2b6bca0) at
> >printtup.c:372
> >#9  0x005c1cc9 in ExecutePlan (dest=0x2b6bca0,
> >direction=, numberTuples=0, sendTuples=1 '\001',
> >operation=CMD_SELECT,
> >use_parallel_mode=, planstate=0x2958cf8,
> >estate=0x2958be8) at execMain.c:1606
> >#10 standard_ExecutorRun (queryDesc=0x2834998, direction= >out>, count=0) at execMain.c:339
> >#11 0x006d69a7 in PortalRunSelect
> >(portal=portal@entry=0x2894e38, forward=forward@entry=1 '\001',
> >count=0, count@entry=9223372036854775807,
> >dest=dest@entry=0x2b6bca0) at pquery.c:948
> >#12 0x006d7dbb in PortalRun (portal=0x2894e38,
> >count=9223372036854775807, isTopLevel=, dest=0x2b6bca0,
> >altdest=0x2b6bca0,
> >completionTag=0x7ffc9f2b14e0 "") at pquery.c:789
> >#13 0x006d5a06 in PostgresMain (argc=,
> >argv=, dbname=, username= >out>) at postgres.c:1109
> >#14 0x0046fc28 in BackendRun (port=0x27f16a0) at
> >postmaster.c:4342
> >#15 BackendStartup (port=0x27f16a0) at postmaster.c:4016
> >#16 ServerLoop () at postmaster.c:1721
> >#17 0x00678119 in PostmasterMain (argc=argc@entry=3,
> >argv=argv@entry=0x27c8c90) at postmaster.c:1329
> >#18 0x0047088e in main (argc=3, argv=0x27c8c90) at main.c:228
> >(gdb) quit
> >
> >Now, the fact that this happened to multiple servers at time strongly
> >suggest an external (to the database) problem.  The system initiating
> >the query, a cross database query over dblink, has been has given up
> >(and has been restarted as a precaution) a long time ago, and the
> >connection is dead.   secure_write() sets however an infinite timeout
> >to the latch, and there are clearly scenarios where epoll waits
> >forever for an event that is never going to occur.  If/when this
> >happens, the only recourse is to restart the impacted database.  The
> >question is, shouldn't the latch have a looping timeout that checks
> >for interrupts?   What would the risks be of jumping directly out of
> >the latch loop?
>
> Unless there is a kernel problem latches are interruptible by signals, as the 
> signal handler should do a SetLatch().
>
> This backtrace just looks like the backend is trying to send data to the 
> client? What makes you think it's stuck?

Well, the client has been gone for > 24 hours.   But you're right, when
I send cancel to the backend, here is what happens according to
strace:
epoll_wait(3, 0x2915e08, 1, -1) = -1 EINTR (Interrupted system call)
--- SIGINT {si_signo=SIGINT, si_code=SI_USER, si_pid=5024, si_uid=26} ---
write(13, "\0", 1)  = 1
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
sendto(11, "\0\0\0\00545780\0\0\0\003508D\0\0\0d\0\t\0\0\0\00615202"...,
5640, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=43081176, u64=43081176}}], 1, -1) = 1
read(12, "\0", 16)  = 1
epoll_wait(3,


...pg_terminate_backend(), however, does properly kill the query.
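
For anyone else hitting this, a sketch of how to spot and clear such
backends (the one-hour cutoff is arbitrary):

    SELECT pid, state, now() - query_start AS runtime, left(query, 60) AS query
      FROM pg_stat_activity
     WHERE state = 'active' AND now() - query_start > interval '1 hour';

    SELECT pg_terminate_backend(pid)
      FROM pg_stat_activity
     WHERE state = 'active' AND now() - query_start > interval '1 hour';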

> If the connection is dead, epoll should return (both because we ask for the 
> relevant events, and because it just always implicitly does do so).
>
> So it seems likely that either your connection isn't actually dead (e.g. 
> waiting for tcp timeouts), or you have a kernel bug.

Maybe; I suspect a firewall issue. Hard to say.
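
If it is a firewall silently dropping idle connections, enabling
server-side TCP keepalives should at least let the kernel notice a dead
peer eventually; something like the following (values made up, and it
won't rescue an already-stuck backend):

    ALTER SYSTEM SET tcp_keepalives_idle = 60;
    ALTER SYSTEM SET tcp_keepalives_interval = 10;
    ALTER SYSTEM SET tcp_keepalives_count = 6;
    SELECT pg_reload_conf();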

merlin




Re: Re: Optimize crash recovery

2020-03-13 Thread Thomas Munro
On Sat, Mar 14, 2020 at 5:31 AM Alvaro Herrera  wrote:
> On 2020-Mar-14, Thunder wrote:
> > For example, if the page LSN in storage is 0x9 and we start to replay
> > from 0x1: if 0x1 is a full-page xlog record, then we can skip replaying
> > the xlog between 0x1~0x9 for this page.
> >
> > Is there any correctness issue if the page exists in the buffer pool and
> > we skip replaying a full-page or init-page record when the page LSN is
> > larger than the LSN of the xlog record?
>
> Oh! right.  The assumption, before we had page-level checksums, was that
> the page at LSN 0x9 could have been partially written, so the upper
> half of the contents would actually be older and thus restoring the FPI
> (and all subsequent WAL changes) was mandatory.  But if the page
> checksum verifies, then there's no need to return the page back to an
> old state only to replay everything to bring it to the new state again.
>
> This seems a potentially worthwhile optimization ...

One problem is that you now have to read the block from disk, which
causes an I/O stall if the page is not already in the kernel page
cache.  That could be worse than the cost of replaying all the WAL
records you get to skip with this trick.  My WAL prefetching patch[1]
could mitigate that problem to some extent, depending on how much
prefetching your system can do.  The current version of the patch has
a GUC wal_prefetch_fpw to control whether it bothers to prefetch pages
that we have an FPI for, because normally there's no point, but with this
trick you'd want to turn that on.

[1] https://commitfest.postgresql.org/27/2410/




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-13 Thread Alvaro Herrera
On 2020-Mar-13, Tom Lane wrote:

> Alvaro Herrera  writes:
> > ... You can specify a filelist to pgindent, also.  What I do is super
> > low-tech: do a "git diff origin/master", copy the filelist, and then
> > ^V^E to paste that list into a command line to run pgindent (editing to
> > remove the change histogram and irrelevant files).  I should automate
> > this ...
> 
> Yeah.  I tend to keep copies of the files I'm specifically hacking on
> in a separate work directory, and then I re-indent just that directory.
> But that's far from ideal as well.  I wonder if it'd be worth teaching
> pgindent to have some option to indent only files that are already
> modified according to git?

A quick look at git-ls-files manpage suggests that this might work:

 src/tools/pgindent/pgindent $(git ls-files --modified -- *.[ch])

If it's that easy, maybe it's not worth messing with pgindent ...


Also, I wonder if it would be better to modify our policies so that we
update typedefs.list more frequently.  Some people include additions
with their commits, but it's far from SOP.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-13 Thread Tom Lane
Alvaro Herrera  writes:
> ... You can specify a filelist to pgindent, also.  What I do is super
> low-tech: do a "git diff origin/master", copy the filelist, and then
> ^V^E to paste that list into a command line to run pgindent (editing to
> remove the change histogram and irrelevant files).  I should automate
> this ...

Yeah.  I tend to keep copies of the files I'm specifically hacking on
in a separate work directory, and then I re-indent just that directory.
But that's far from ideal as well.  I wonder if it'd be worth teaching
pgindent to have some option to indent only files that are already
modified according to git?

regards, tom lane




Re: database stuck in __epoll_wait_nocancel(). Are infinite timeouts safe?

2020-03-13 Thread Andres Freund
Hi, 

On March 13, 2020 12:08:32 PM PDT, Merlin Moncure  wrote:
>I have 5 servers in a testing environment that comprise a data
>warehousing cluster.   They will typically each get exactly the
>same query at approximately the same time.  Yesterday, around 1pm, 3
>of the five got stuck on the same query.  Each of them yields similar
>stack traces.  This  happens now and then.  The server is 9.6.12
>(which is obviously old, but I did not see any changes in relevant
>code).
>
>(gdb) bt
>#0  0x7fe856c0b463 in __epoll_wait_nocancel () from
>/lib64/libc.so.6
>#1  0x006b4416 in WaitEventSetWaitBlock (nevents=1,
>occurred_events=0x7ffc9f2b0f60, cur_timeout=-1, set=0x27cace8) at
>latch.c:1053
>#2  WaitEventSetWait (set=0x27cace8, timeout=timeout@entry=-1,
>occurred_events=occurred_events@entry=0x7ffc9f2b0f60,
>nevents=nevents@entry=1) at latch.c:1007
>#3  0x005f26dd in secure_write (port=0x27f16a0,
>ptr=ptr@entry=0x27f5528, len=len@entry=192) at be-secure.c:255
>#4  0x005fb51b in internal_flush () at pqcomm.c:1410
>#5  0x005fb72a in internal_putbytes (s=0x2a4f245 "14M04",
>s@entry=0x2a4f228 "", len=70) at pqcomm.c:1356
>#6  0x005fb7f0 in socket_putmessage (msgtype=68 'D',
>s=0x2a4f228 "", len=) at pqcomm.c:1553
>#7  0x005fd5d9 in pq_endmessage (buf=buf@entry=0x7ffc9f2b1040)
>at pqformat.c:347
>#8  0x00479a63 in printtup (slot=0x2958fc8, self=0x2b6bca0) at
>printtup.c:372
>#9  0x005c1cc9 in ExecutePlan (dest=0x2b6bca0,
>direction=, numberTuples=0, sendTuples=1 '\001',
>operation=CMD_SELECT,
>use_parallel_mode=, planstate=0x2958cf8,
>estate=0x2958be8) at execMain.c:1606
>#10 standard_ExecutorRun (queryDesc=0x2834998, direction=out>, count=0) at execMain.c:339
>#11 0x006d69a7 in PortalRunSelect
>(portal=portal@entry=0x2894e38, forward=forward@entry=1 '\001',
>count=0, count@entry=9223372036854775807,
>dest=dest@entry=0x2b6bca0) at pquery.c:948
>#12 0x006d7dbb in PortalRun (portal=0x2894e38,
>count=9223372036854775807, isTopLevel=, dest=0x2b6bca0,
>altdest=0x2b6bca0,
>completionTag=0x7ffc9f2b14e0 "") at pquery.c:789
>#13 0x006d5a06 in PostgresMain (argc=,
>argv=, dbname=, username=out>) at postgres.c:1109
>#14 0x0046fc28 in BackendRun (port=0x27f16a0) at
>postmaster.c:4342
>#15 BackendStartup (port=0x27f16a0) at postmaster.c:4016
>#16 ServerLoop () at postmaster.c:1721
>#17 0x00678119 in PostmasterMain (argc=argc@entry=3,
>argv=argv@entry=0x27c8c90) at postmaster.c:1329
>#18 0x0047088e in main (argc=3, argv=0x27c8c90) at main.c:228
>(gdb) quit
>
>Now, the fact that this happened to multiple servers at time strongly
>suggest an external (to the database) problem.  The system initiating
>the query, a cross database query over dblink, gave up
>(and has been restarted as a precaution) a long time ago, and the
>connection is dead.   secure_write() sets however an infinite timeout
>to the latch, and there are clearly scenarios where epoll waits
>forever for an event that is never going to occur.  If/when this
>happens, the only recourse is to restart the impacted database.  The
>question is, shouldn't the latch have a looping timeout that checks
>for interrupts?   What would the risks be of jumping directly out of
>the latch loop?

Unless there is a kernel problem latches are interruptible by signals, as the 
signal handler should do a SetLatch().

This backtrace just looks like the backend is trying to send data to the 
client? What makes you think it's stuck?

If the connection is dead, epoll should return (both because we ask for the 
relevant events, and because it just always implicitly does do so).

So it seems likely that either your connection isn't actually dead (e.g. 
waiting for tcp timeouts), or you have a kernel bug.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Extracting only the columns needed for a query

2020-03-13 Thread Dmitry Dolgov
> On Tue, Feb 18, 2020 at 03:26:16PM -0800, Melanie Plageman wrote:
>
> > > I believe it would be beneficial to add this potential API extension
> > patch into
> > > the thread (as an example of an interface defining how scanCols could be
> > used)
> > > and review them together.
> >
> > As for including some code that uses the scanCols, after discussing
> > off-list with a few folks, there are three options I would like to
> > pursue for doing this.
> >
> > One option I will pursue is using the scanCols to inform the columns
> > needed to be spilled for memory-bounded hashagg (mentioned by Jeff
> > here [1]).
> >
> >
> > The third is exercising it with a test only but providing an example
> > of how a table AM API user like Zedstore uses the columns during
> > planning.
> >
>
> Basically, scanCols are simply columns that need to be scanned. It is
> probably okay if it is only used by table access method API users, as
> Pengzhou's patch illustrates.

Thanks for the update. Sure, that would be fine. At the moment I have a
couple of intermediate comments.

In general the implemented functionality looks good. I've checked how it
works on the existing tests; almost everywhere the required columns were
not missing from scanCols (which is probably the most important part).
Sometimes expressions were checked multiple times, which could
potentially introduce some overhead, but I believe this effect is
negligible. Just to mention some counterintuitive examples, for this
kind of query

SELECT min(q1) FROM INT8_TBL;

the same column was checked 5 times in my tests, since it's present also
in baserestrictinfo, and build_minmax_path does one extra round of
planning and invoking make_one_rel. I've also noticed that for
partitioned tables every partition is evaluated separately. IIRC their
structure cannot differ, so does it make sense then? Another interesting
example is Values Scan (e.g. in an insert statement with multiple
records); can an abstract table AM user leverage information about
columns in it?

One case where I believe columns were missing is statements with
RETURNING:

INSERT INTO foo (col1)
  VALUES ('col1'), ('col2'), ('col3')
  RETURNING *;

It looks like in this situation the only expression in reltarget is the
one for col1, but the returning result contains all columns.

And just out of curiosity, what do you think about table AM specific
columns e.g. ctid, xmin/xmax etc? I mean, they're not included into
scanCols and should not be since they're heap AM related. But is there a
chance that there would be some AM specific columns relevant to e.g.
the columnar storage that would also make sense to put into scanCols?




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-13 Thread Alvaro Herrera
On 2020-Mar-13, James Coleman wrote:

> > I don't propose to commit 0003 of course, since it's not our policy;
> > that's just to allow running pgindent sanely, which gives you 0004
> > (though my local pgindent has an unrelated fix).  And after that you
> > notice the issue that 0005 fixes.
> 
> Is there a page on how you're supposed to run pgindent/when stuff like
> this does get added/etc.? It's all a big mystery to me right now.
> 
> Also, I noticed some of the pgindent changes aren't for changes in
> this patch series; I have that as a separate patch, but not attached
> because I see that running pgindent locally generates a massive patch,
> so I'm assuming we just ignore those for now?

Ah, I should have paid more attention to what I was attaching.  Yeah,
ideally you run pgindent and then only include the changes that are
relevant to your patch series.  We run pgindent across the whole tree
once every release, so about yearly.  Some commits go in that are not
indented correctly, and those bother everyone -- I'm guilty of this
myself more frequently than I'd like.

You can specify a filelist to pgindent, also.  What I do is super
low-tech: do a "git diff origin/master", copy the filelist, and then
^V^E to paste that list into a command line to run pgindent (editing to
remove the change histogram and irrelevant files).  I should automate
this ...

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




database stuck in __epoll_wait_nocancel(). Are infinite timeouts safe?

2020-03-13 Thread Merlin Moncure
I have 5 servers in a testing environment that comprise a data
warehousing cluster.   They will typically each get exactly the
same query at approximately the same time.  Yesterday, around 1pm, 3
of the five got stuck on the same query.  Each of them yields similar
stack traces.  This  happens now and then.  The server is 9.6.12
(which is obviously old, but I did not see any changes in relevant
code).

(gdb) bt
#0  0x7fe856c0b463 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x006b4416 in WaitEventSetWaitBlock (nevents=1,
occurred_events=0x7ffc9f2b0f60, cur_timeout=-1, set=0x27cace8) at
latch.c:1053
#2  WaitEventSetWait (set=0x27cace8, timeout=timeout@entry=-1,
occurred_events=occurred_events@entry=0x7ffc9f2b0f60,
nevents=nevents@entry=1) at latch.c:1007
#3  0x005f26dd in secure_write (port=0x27f16a0,
ptr=ptr@entry=0x27f5528, len=len@entry=192) at be-secure.c:255
#4  0x005fb51b in internal_flush () at pqcomm.c:1410
#5  0x005fb72a in internal_putbytes (s=0x2a4f245 "14M04",
s@entry=0x2a4f228 "", len=70) at pqcomm.c:1356
#6  0x005fb7f0 in socket_putmessage (msgtype=68 'D',
s=0x2a4f228 "", len=) at pqcomm.c:1553
#7  0x005fd5d9 in pq_endmessage (buf=buf@entry=0x7ffc9f2b1040)
at pqformat.c:347
#8  0x00479a63 in printtup (slot=0x2958fc8, self=0x2b6bca0) at
printtup.c:372
#9  0x005c1cc9 in ExecutePlan (dest=0x2b6bca0,
direction=, numberTuples=0, sendTuples=1 '\001',
operation=CMD_SELECT,
use_parallel_mode=, planstate=0x2958cf8,
estate=0x2958be8) at execMain.c:1606
#10 standard_ExecutorRun (queryDesc=0x2834998, direction=, count=0) at execMain.c:339
#11 0x006d69a7 in PortalRunSelect
(portal=portal@entry=0x2894e38, forward=forward@entry=1 '\001',
count=0, count@entry=9223372036854775807,
dest=dest@entry=0x2b6bca0) at pquery.c:948
#12 0x006d7dbb in PortalRun (portal=0x2894e38,
count=9223372036854775807, isTopLevel=, dest=0x2b6bca0,
altdest=0x2b6bca0,
completionTag=0x7ffc9f2b14e0 "") at pquery.c:789
#13 0x006d5a06 in PostgresMain (argc=,
argv=, dbname=, username=) at postgres.c:1109
#14 0x0046fc28 in BackendRun (port=0x27f16a0) at postmaster.c:4342
#15 BackendStartup (port=0x27f16a0) at postmaster.c:4016
#16 ServerLoop () at postmaster.c:1721
#17 0x00678119 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x27c8c90) at postmaster.c:1329
#18 0x0047088e in main (argc=3, argv=0x27c8c90) at main.c:228
(gdb) quit

Now, the fact that this happened to multiple servers at the same time
strongly suggests an external (to the database) problem.  The system
initiating the query, a cross database query over dblink, gave up
(and was restarted as a precaution) a long time ago, and the
connection is dead.   secure_write(), however, sets an infinite timeout
on the latch, and there are clearly scenarios where epoll waits
forever for an event that is never going to occur.  If/when this
happens, the only recourse is to restart the impacted database.  The
question is, shouldn't the latch have a looping timeout that checks
for interrupts?   What would the risks be of jumping directly out of
the latch loop?

merlin




Re: proposal: schema variables

2020-03-13 Thread Pavel Stehule
Hi

ne 8. 3. 2020 v 19:12 odesílatel Pavel Stehule 
napsal:

>
>
> so 7. 3. 2020 v 22:15 odesílatel Pavel Stehule 
> napsal:
>
>> Hi
>>
>> I fixed some ugly parts of this patch - now LET x = DEFAULT and LET
>> x = NULL are processed in a more usual way.
>> The LET statement doesn't switch between STMT_UTILITY and
>> STMT_PLAN_UTILITY like before. It is only a STMT_PLAN_UTILITY statement.
>>
>
> I did some cleaning - I mainly replaced CMD_PLAN_UTILITY by CMD_LET
> because there is no other similar statement in the queue.
>

Another round of cleaning - I replaced CMD_LET with CMD_SELECT_UTILITY. This
name is more illustrative, and the code is a little bit cleaner.

CMD_SELECT_UTILITY is a mix of CMD_SELECT and CMD_UTILITY. It allows PREPARE
and parametrization like CMD_SELECT, but execution is like CMD_UTILITY, via
its own utility routine with an explicit dest receiver setting. This design
doesn't need to introduce new executor and planner nodes (like ModifyTable
and ModifyTablePath).

Regards

Pavel



> Regards
>
> Pavel
>
>
>> Regards
>>
>> Pavel
>>
>


schema-variables-20200313.patch.gz
Description: application/gzip


Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-13 Thread Justin Pryzby
On Tue, Mar 10, 2020 at 01:53:42PM +1300, David Rowley wrote:
> 2. Perhaps the documentation in maintenance.sgml should mention that
> the table will be vacuumed with the equivalent of having
> vacuum_freeze_min_age = 0, instead of:
> 
> "Such a vacuum will aggressively freeze tuples."
> 
> aggressive is the wrong word here. We call it an aggressive vacuum if
> we disable page skipping, not for setting the vacuum_freeze_min_age to
> 0.

Possibly it would be better to run VACUUM *without* freeze_min_age=0?  (I get
confused and have to spend 20min re-reading the vacuum GUC docs every time I
deal with this stuff, so maybe I'm off).

As I understand, the initial motivation of this patch was to avoid disruptive
anti-wraparound vacuums on insert-only tables.  But if vacuum were triggered at
all, it would freeze the oldest tuples, which is all that's needed; especially
since fd31cd2651 "Don't vacuum all-frozen pages.", those pages would never need
to be vacuumed again.  Recently written tuples wouldn't be frozen, which is ok,
they're handled next time.

Another motivation of the patch is to allow indexonly scan, for which the
planner looks at pages' "relallvisible" fraction (and at execution if a page
isn't allvisible, visits the heap).  Again, that happens if vacuum were run at
all.  Again, some pages won't be marked allvisible, which is fine, they're
handled next time.
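
For instance, a quick way to eyeball that fraction on a given table (relpages
and relallvisible are the pg_class columns the planner consults; the table
name is just an example):

    SELECT relname, relpages, relallvisible,
           round(100.0 * relallvisible / greatest(relpages, 1), 1) AS pct_allvisible
      FROM pg_class WHERE relname = 't';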

I think freeze_min_age=0 could negatively affect people who have insert-mostly
tables (I'm not concerned, but that includes us).  If they consistently hit the
autovacuum insert threshold before the cleanup threshold for updated/deleted
tuples, any updated/deleted tuples would be frozen, which would be
wasteful:  

|One disadvantage of decreasing vacuum_freeze_min_age is that it might cause
|VACUUM to do useless work: freezing a row version is a waste of time if the row
|is modified soon thereafter (causing it to acquire a new XID). So the setting
|should be large enough that rows are not frozen until they are unlikely to
|change any more.

So my question is whether autovacuum triggered by the insert threshold should
run VACUUM with the same settings as a vacuum due to deleted tuples.  I realize
the DBA could just configure the thresholds so they'd hit vacuum for cleaning
dead tuples, so my suggestion maybe just improves the case with the default
settings.  It's possible to set the reloption autovacuum_freeze_min_age, which
I think supports the idea of running a vacuum normally and letting it (and the
DBA) decide what to do with the existing logic.
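
(For reference, the per-table override mentioned above is just this, with a
hypothetical table name:)

    ALTER TABLE t SET (autovacuum_freeze_min_age = 0);
    ALTER TABLE t RESET (autovacuum_freeze_min_age);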

Also, there was a discussion about index cleanup with the conclusion that it
was safer not to skip it, since otherwise indexes might bloat.  I think that's
right, since vacuum for cleanup is triggered by the number of dead heap tuples.
To skip index cleanup, I think you'd want a metric for
n_dead_since_index_cleanup.  (Or maybe analyze could track dead index tuples
and trigger vacuum of each index separately).

Having now played with the patch, I'll suggest that 1000 is too high a
threshold.  If autovacuum runs without FREEZE, I don't see why it couldn't be
much lower (10?) or use (0.2 * n_ins + 50) like the other autovacuum GUC.  
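
For comparison, with the stock settings the existing dead-tuple trigger fires
only after roughly this many rows have been updated or deleted (table name
hypothetical; 50 and 0.2 are the defaults for autovacuum_vacuum_threshold and
autovacuum_vacuum_scale_factor):

    SELECT 50 + 0.2 * reltuples AS dead_tuple_trigger
      FROM pg_class WHERE relname = 't';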

-- 
Justin




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-13 Thread James Coleman
On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
 wrote:
> 3) Most of the execution plans look reasonable, except that some of the
> plans look like this:
>
>
>   QUERY PLAN
>-
> Limit
>   ->  GroupAggregate
> Group Key: t.a, t.b, t.c, t.d
> ->  Incremental Sort
>   Sort Key: t.a, t.b, t.c, t.d
>   Presorted Key: t.a, t.b, t.c
>   ->  Incremental Sort
> Sort Key: t.a, t.b, t.c
> Presorted Key: t.a, t.b
> ->  Index Scan using t_a_b_idx on t
>(10 rows)
>
> i.e. there are two incremental sorts on top of each other, with
> different prefixes. But this is not a new issue - it happens with
> queries like this:
>
>SELECT a, b, c, d, count(*) FROM (
>  SELECT * FROM t ORDER BY a, b, c
>) foo GROUP BY a, b, c, d limit 1000;
>
> i.e. there's a subquery with a subset of pathkeys. Without incremental
> sort the plan looks like this:
>
> QUERY PLAN
>-
> Limit
>   ->  GroupAggregate
> Group Key: t.a, t.b, t.c, t.d
> ->  Sort
>   Sort Key: t.a, t.b, t.c, t.d
>   ->  Sort
> Sort Key: t.a, t.b, t.c
> ->  Seq Scan on t
>(8 rows)
>
> so essentially the same plan shape. What bugs me though is that there
> seems to be some sort of memory leak, so that this query consumes
gigabytes of RAM before it gets killed by OOM. But the memory seems not
> to be allocated in any memory context (at least MemoryContextStats don't
> show anything like that), so I'm not sure what's going on.
>
> Reproducing it is fairly simple:
>
>CREATE TABLE t (a bigint, b bigint, c bigint, d bigint);
>INSERT INTO t SELECT
>  1000*random(), 1000*random(), 1000*random(), 1000*random()
>FROM generate_series(1,1000) s(i);
>CREATE INDEX idx ON t(a,b);
>ANALYZE t;
>
>EXPLAIN ANALYZE SELECT a, b, c, d, count(*)
>FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d
>LIMIT 100;

While trying to reproduce this, instead of lots of memory usage, I got
the attached assertion failure instead.

James
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7f12d7fa5535 in __GI_abort () at abort.c:79
#2  0x556dfbafe5d8 in ExceptionalCondition 
(conditionName=conditionName@entry=0x556dfbcaceb0 "total_allocated == 
context->mem_allocated",
errorType=errorType@entry=0x556dfbb54028 "FailedAssertion", 
fileName=fileName@entry=0x556dfbcacc33 "aset.c", 
lineNumber=lineNumber@entry=1541)
at assert.c:67
#3  0x556dfbb23761 in AllocSetCheck (context=context@entry=0x556dfdd5cf40) 
at aset.c:1541
#4  0x556dfbb238aa in AllocSetDelete (context=0x556dfdd5cf40) at aset.c:655
#5  0x556dfbb2a8e0 in MemoryContextDelete (context=0x556dfdd5cf40) at 
mcxt.c:245
#6  0x556dfbb0b6e0 in hash_destroy (hashp=hashp@entry=0x556dfdd6bf68) at 
dynahash.c:829
#7  0x556dfba9ca90 in set_rtable_names (dpns=dpns@entry=0x7ffeb9b1ad50, 
parent_namespaces=parent_namespaces@entry=0x0, rels_used=0x556dfdd3b590)
at ruleutils.c:3578
#8  0x556dfbaa0343 in select_rtable_names_for_explain (rtable=, rels_used=) at ruleutils.c:3418
#9  0x556dfb7d994e in ExplainPrintPlan (es=es@entry=0x556dfdc75d10, 
queryDesc=queryDesc@entry=0x556dfdd3b4f8) at explain.c:702
#10 0x556dfb7d9fe1 in ExplainOnePlan 
(plannedstmt=plannedstmt@entry=0x556dfdd3b3e0, into=into@entry=0x0, 
es=es@entry=0x556dfdc75d10,
queryString=queryString@entry=0x556dfdc51690 "EXPLAIN ANALYZE SELECT a, b, 
c, d, count(*)\n   FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, 
c, d\n   LIMIT 100;", params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, 
planduration=0x7ffeb9b1aeb0) at explain.c:564
#11 0x556dfb7da1e2 in ExplainOneQuery (query=, 
cursorOptions=cursorOptions@entry=256, into=into@entry=0x0, 
es=es@entry=0x556dfdc75d10,
queryString=0x556dfdc51690 "EXPLAIN ANALYZE SELECT a, b, c, d, count(*)\n   
FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d\n   LIMIT 
100;", params=params@entry=0x0, queryEnv=0x0) at explain.c:388
#12 0x556dfb7dab4c in ExplainQuery (pstate=pstate@entry=0x556dfdc74e80, 
stmt=stmt@entry=0x556dfdc534c8, params=params@entry=0x0,
dest=dest@entry=0x556dfdc75c78) at ../../../src/include/nodes/nodes.h:594
#13 0x556dfb9e2ca1 in standard_ProcessUtility (pstmt=0x556dfdcfca78,
queryString=0x556dfdc51690 "EXPLAIN ANALYZE SELECT a, b, c, d, count(*)\n   
FROM (SELECT * FROM t ORDER BY a, b, c) foo GROUP BY a, b, c, d\n   LIMIT 
100;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, 
dest=0x556dfdc75c78, qc=0x7ffeb9b1b080) at utility.c:827
#14 0x556dfb9e31f2 in ProcessUtility (pstmt=pstmt@entry=0x556dfdcfca78, 
queryString=, 

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-13 Thread Tomas Vondra

On Fri, Mar 13, 2020 at 01:16:44PM -0400, James Coleman wrote:

On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
 wrote:
...

Now, a couple comments about parts 0001 - 0003 of the patch ...

1) I see a bunch of failures in the regression test, due to minor
differences in the explain output. All the differences are about minor
changes in memory usage, like this:

-   "Sort Space Used": 30, +
+   "Sort Space Used": 29, +

I'm not sure if it happens on my machine only, but maybe the test is not
entirely stable.


make check passes on multiple machines for me; what arch/distro are you using?



Nothing exotic - Fedora on x64. My guess is that things like enabling
asserts will make a difference too, because then we track additional
stuff for allocated chunks etc. So I agree with Tom that trying to keep this
stable is a lost cause, essentially.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: explain HashAggregate to report bucket and memory stats

2020-03-13 Thread Andres Freund
Hi,

On 2020-03-13 10:53:17 -0700, Jeff Davis wrote:
> On Fri, 2020-03-13 at 10:27 -0700, Andres Freund wrote:
> > On 2020-03-13 10:15:46 -0700, Jeff Davis wrote:
> > > Also, is there a reason you report two different memory values
> > > (hashtable and tuples)? I don't object, but it seems like a little
> > > too
> > > much detail.
> > 
> > Seems useful to me - the hashtable is pre-allocated based on
> > estimates,
> > whereas the tuples are allocated "on demand". So seeing the
> > difference
> > will allow us to investigate the more crucial issue...
> 
> Then do we also want to report separately on the by-ref transition
> values? That could be useful if you are using ARRAY_AGG and the states
> grow larger than you might expect.

I can see that being valuable - I've had to debug cases with too much
memory being used due to aggregate transitions before. Right now it'd be
mixed in with tuples, I believe - and we'd need a separate context for
tracking the transition values? Due to that I'm inclined to not report
separately for now.

Greetings,

Andres Freund




Re: explain HashAggregate to report bucket and memory stats

2020-03-13 Thread Jeff Davis
On Fri, 2020-03-13 at 10:27 -0700, Andres Freund wrote:
> On 2020-03-13 10:15:46 -0700, Jeff Davis wrote:
> > Also, is there a reason you report two different memory values
> > (hashtable and tuples)? I don't object, but it seems like a little
> > too
> > much detail.
> 
> Seems useful to me - the hashtable is pre-allocated based on
> estimates,
> whereas the tuples are allocated "on demand". So seeing the
> difference
> will allow us to investigate the more crucial issue...

Then do we also want to report separately on the by-ref transition
values? That could be useful if you are using ARRAY_AGG and the states
grow larger than you might expect.

Regards,
Jeff Davis






Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-13 Thread Andres Freund
Hi,

On 2020-03-13 13:36:44 -0400, Tom Lane wrote:
> James Coleman  writes:
> > On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
> >  wrote:
> >> 1) I see a bunch of failures in the regression test, due to minor
> >> differences in the explain output. All the differences are about minor
> >> changes in memory usage, like this:
> >> 
> >> -   "Sort Space Used": 30, +
> >> +   "Sort Space Used": 29, +
> >> 
> >> I'm not sure if it happens on my machine only, but maybe the test is not
> >> entirely stable.
> 
> > make check passes on multiple machines for me; what arch/distro are you 
> > using?
> 
> I think there's exactly zero chance of such output being stable across
> different platforms, particularly 32-vs-64-bit.  You'll need to either
> drop that test or find some way to mask the variability.

+1


> > Is there a better way to test these? I would prefer these code paths
> > have test coverage, but the standard SQL tests don't leave a good way
> > to handle stuff like this.
> 
> In some places we use plpgsql code to filter the EXPLAIN output.

I still think we should just go for a REPRODUCIBLE, TESTING, REGRESS or
similar EXPLAIN option, instead of playing whack-a-mole. Given the
amount of discussion, the reduced test coverage, the increased test
complexity, and the reduced quality of EXPLAIN output for humans, we are
well past the point where adding such an option would be worth its cost.
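
(For context, the closest thing available today -- which still leaves the
run-to-run "Sort Space Used" / memory numbers in the output, hence the
proposal -- is something like:)

    EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF)
    SELECT * FROM t ORDER BY a, b, c LIMIT 10;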

Greetings,

Andres Freund




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-03-13 Thread Tom Lane
James Coleman  writes:
> On Tue, Mar 10, 2020 at 10:44 PM Tomas Vondra
>  wrote:
>> 1) I see a bunch of failures in the regression test, due to minor
>> differences in the explain output. All the differences are about minor
>> changes in memory usage, like this:
>> 
>> -   "Sort Space Used": 30, +
>> +   "Sort Space Used": 29, +
>> 
>> I'm not sure if it happens on my machine only, but maybe the test is not
>> entirely stable.

> make check passes on multiple machines for me; what arch/distro are you using?

I think there's exactly zero chance of such output being stable across
different platforms, particularly 32-vs-64-bit.  You'll need to either
drop that test or find some way to mask the variability.

> Is there a better way to test these? I would prefer these code paths
> have test coverage, but the standard SQL tests don't leave a good way
> to handle stuff like this.

In some places we use plpgsql code to filter the EXPLAIN output.
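
(Not taken from any existing test -- just a minimal sketch of the idea, with a
made-up function name and filter patterns:)

    create function explain_filter(sql text) returns setof text
    language plpgsql as
    $$
    declare
        ln text;
    begin
        for ln in execute 'explain (analyze, format text) ' || sql loop
            -- hide values that vary from run to run, e.g. memory and timing
            ln := regexp_replace(ln, 'Memory: \S*', 'Memory: xxx');
            ln := regexp_replace(ln, 'actual time=\S*', 'actual time=xxx');
            return next ln;
        end loop;
    end;
    $$;

    select * from explain_filter('select * from t order by a, b, c limit 10');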

regards, tom lane




Re: explain HashAggregate to report bucket and memory stats

2020-03-13 Thread Andres Freund
Hi,

On 2020-03-13 10:15:46 -0700, Jeff Davis wrote:
> Also, is there a reason you report two different memory values
> (hashtable and tuples)? I don't object, but it seems like a little too
> much detail.

Seems useful to me - the hashtable is pre-allocated based on estimates,
whereas the tuples are allocated "on demand". So seeing the difference
will allow us to investigate the more crucial issue...

Greetings,

Andres Freund




Re: explain HashAggregate to report bucket and memory stats

2020-03-13 Thread Jeff Davis


+   /* hashtable->entrysize includes additionalsize */
+   hashtable->instrument.space_peak_hash = Max(
+   hashtable->instrument.space_peak_hash,
+   hashtable->hashtab->size * sizeof(TupleHashEntryData));
+
+   hashtable->instrument.space_peak_tuples = Max(
+   hashtable->instrument.space_peak_tuples,
+   hashtable->hashtab->members * hashtable->entrysize);

I think, in general, we should avoid estimates/projections for
reporting and try to get at a real number, like
MemoryContextMemAllocated(). (Aside: I may want to tweak exactly what
that function reports so that it doesn't count the unused portion of
the last block.)

For instance, the report is still not accurate, because it doesn't
account for pass-by-ref transition state values.

To use memory-context-based reporting, it's hard to make the stats a
part of the tuple hash table, because the tuple hash table doesn't own
the memory contexts (they are passed in). It's also hard to make it
per-hashtable (e.g. for grouping sets), unless we put each grouping set
in its own memory context.

Also, is there a reason you report two different memory values
(hashtable and tuples)? I don't object, but it seems like a little too
much detail.

Regards,
Jeff Davis






Re: shared-memory based stats collector

2020-03-13 Thread Andres Freund
Hi,

On 2020-03-13 16:34:50 +0900, Kyotaro Horiguchi wrote:
> Thank you very much!!
> 
> At Thu, 12 Mar 2020 20:13:24 -0700, Andres Freund  wrote 
> in 
> > Hi,
> > 
> > Thomas, could you look at the first two patches here, and my review
> > questions?
> > 
> > 
> > General comments about this series:
> > - A lot of the review comments feel like I've written them before, a
> >   year or more ago. I feel this patch ought to be in a much better
> >   state. There's a lot of IMO fairly obvious stuff here, and things that
> >   have been mentioned multiple times previously.
> 
> I apologize for all of the obvious stuff or things that have been
> mentioned..  I'll address them.
> 
> > - There's a *lot* of typos in here. I realize being an ESL is hard, but
> >   a lot of these can be found with the simplest spellchecker.  That's
> >   one thing for a patch that just has been hacked up as a POC, but this
> >   is a multi year thread?
> 
> I'll review all the changed parts again.  I used ispell, but I must have
> failed to check many of the changes.
> 
> > - There's some odd formatting. Consider using pgindent more regularly.
> 
> I'll do so.
> 
> > More detailed comments below.
> 
> Thank you very much for the intensive review, I'm going to revise the
> patch according to them.
> 
> > I'm considering rewriting the parts of the patchset that I don't like -
> > but it'll look quite different afterwards.

I take your response to mean that you'd prefer to evolve the patch
largely on your own? I'm mainly asking because I think there's some
chance that we could still get this into v13, but if so we'll have to go
for it now.

Greetings,

Andres Freund




Re: range_agg

2020-03-13 Thread Tom Lane
Paul A Jungwirth  writes:
> On Wed, Mar 11, 2020 at 4:39 PM Paul A Jungwirth
>  wrote:
>> Oh, my last email left out the most important part. :-) Is this
>> failure online somewhere so I can take a look at it and fix it?

Look for your patch(es) at

http://commitfest.cputube.org

Right now it's not even applying, presumably because Alvaro already
pushed some pieces, so you need to rebase.  But when it was applying,
one or both of the test builds was failing.

regards, tom lane




Re: Add an optional timeout clause to isolationtester step.

2020-03-13 Thread Tom Lane
Julien Rouhaud  writes:
> It seems that for all the possibly interesting cases, what we want to wait on
> is a heavyweight lock, which is already what isolationtester detects.  Maybe
> we could simply implement something like

> step "" [ WAIT UNTIL BLOCKED ] {  }

> without any change to the blocking detection function?

Um, isn't that the existing built-in behavior?

I could actually imagine some uses for the reverse option, *don't* wait
for it to become blocked but just immediately continue with issuing
the next step.

regards, tom lane




Re: Additional improvements to extended statistics

2020-03-13 Thread Dean Rasheed
On Mon, 9 Mar 2020 at 00:06, Tomas Vondra  wrote:
>
> On Mon, Mar 09, 2020 at 01:01:57AM +0100, Tomas Vondra wrote:
> >
> >Attaches is an updated patch series
> >with parts 0002 and 0003 adding tests demonstrating the issue and then
> >fixing it (both shall be merged to 0001).
> >
>
> One day I won't forget to actually attach the files ...
>

0001-0003 look reasonable to me.

One minor point -- there are now 2 code blocks that are basically the
same, looping over a list of clauses, calling clause_selectivity() and
then applying the "s1 = s1 + s2 - s1 * s2" formula. Perhaps they could
be combined into a new function (clauselist_selectivity_simple_or(),
say). I guess it would need to be passed the initial starting
selectivity s1, but it ought to help reduce code duplication.

Regards,
Dean




Re: backup manifests

2020-03-13 Thread Robert Haas
On Fri, Mar 13, 2020 at 9:53 AM tushar  wrote:
> run pg_validatebackup -
>
> [centos@tushar-ldap-docker bin]$ ./pg_validatebackup data1
> pg_validatebackup: error: could not open file "pg_hba.conf": Permission denied
>
> run pg_validatebackup  with switch -s
>
> [centos@tushar-ldap-docker bin]$ ./pg_validatebackup data1 -s
> pg_validatebackup: backup successfully verified
>
> here the file is not accessible, so I think it should throw an error (the
> same one as above) instead of blindly skipping it.

I don't really want to do that. That would require it to open every
file even if it doesn't need to read the data in the files. I think in
most cases that would just slow it down for no real benefit. If you've
specified -s, you have to be OK with getting a less complete check for
problems.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: range_agg

2020-03-13 Thread Paul A Jungwirth
On Wed, Mar 11, 2020 at 4:39 PM Paul A Jungwirth
 wrote:
>
> On Sat, Mar 7, 2020 at 12:20 PM Tom Lane  wrote:
> > Alvaro Herrera  writes:
> > > [ v11 patches ]
> > The cfbot isn't too happy with this; it's getting differently-ordered
> > results than you apparently did for the list of owned objects in
> > dependency.out's DROP OWNED BY test.  Not sure why that should be ---
> > it seems like af6550d34 should have ensured that there's only one
> > possible ordering.
>
> Oh, my last email left out the most important part. :-) Is this
> failure online somewhere so I can take a look at it and fix it?

Looks like I sent this just to Tom before. This is something I need to
fix, right?

Regards,
Paul




Re: range_agg

2020-03-13 Thread Paul A Jungwirth
On Thu, Mar 12, 2020 at 5:38 AM Alvaro Herrera  wrote:
> ... thinking about gist+spgist, I think they could be written
> identically to those for ranges, using the lowest (first) lower bound
> and the higher (last) upper bound.
>
> ... thinking about selectivity, I think the way to write that is to
> first compute the selectivity for the range across the first lower bound
> and the last upper bound, and then subtract that for the "negative"
> space between the contained ranges.
>
> I have no immediate thoughts about typanalyze.  I suppose it should be
> somehow based on the implementation for ranges ... maybe a first-cut is
> to construct fake ranges covering the whole multirange (as above) and
> just use the ranges implementation (compute_range_stats).

Thanks, this is pretty much what I was thinking too, but I'm really
glad to have someone who knows better confirm it. I can get started on
these right away, and I'll let folks know if I need any help. When I
looked at this last fall there was a lot I didn't understand. More or
less using the existing ranges implementation should be a big help
though.

Paul




Re: Re: Optimize crash recovery

2020-03-13 Thread Alvaro Herrera
On 2020-Mar-14, Thunder wrote:

> For example, if the page LSN in storage is 0x9 and we start to replay
> from 0x1: if 0x1 is a full-page xlog record, then we can skip replaying
> the xlog between 0x1~0x9 for this page.
> 
> 
> Is there any correctness issue if the page exists in the buffer pool and
> we skip replaying a full-page or init-page record when the page LSN is
> larger than the LSN of the xlog record?

Oh! right.  The assumption, before we had page-level checksums, was that
the page at LSN 0x9 could have been partially written, so the upper
half of the contents would actually be older and thus restoring the FPI
(and all subsequent WAL changes) was mandatory.  But if the page
checksum verifies, then there's no need to return the page back to an
old state only to replay everything to bring it to the new state again.

This seems a potentially worthwhile optimization ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Add an optional timeout clause to isolationtester step.

2020-03-13 Thread Julien Rouhaud
On Fri, Mar 13, 2020 at 10:12:20AM -0400, Tom Lane wrote:
> Julien Rouhaud  writes:
>
> > I'm not familiar with those test so I'm probably missing something, but 
> > looks
> > like all isolation tests that setup a timeout are doing so to test server 
> > side
> > features (deadlock detection, statement and lock timeout).  I'm not sure how
> > adding a client-side facility to detect locks earlier is going to help 
> > reducing
> > the server side timeouts?
>
> The point is that those timeouts have to be set long enough for even a
> very slow machine to reach a desired state before the timeout happens;
> on faster machines the test is just uselessly sleeping for a long time,
> because of the fixed timeout.  My thought was that maybe the tests could
> be recast as "watch for session to reach $expected_state and then do
> the next thing", allowing them to be automatically adaptive to the
> machine's speed.  This might require some rather subtle test redesign
> and/or addition of more infrastructure (to allow recognition of the
> desired state and/or taking an appropriate next action).  I'm prepared
> to believe that not much can be done about timeouts.spec in particular,
> but it seems to me that the long delays in the deadlock tests are not
> inherent in what we need to test.


Ah I see.  I'll try to see if that could help the deadlock tests, but for sure
such a feature would allow us to get rid of the two pg_sleep(5) in
tuplelock-update.

It seems that for all the possibly interesting cases, what we want to wait on
is a heavyweight lock, which is already what isolationtester detects.  Maybe
we could simply implement something like

step "" [ WAIT UNTIL BLOCKED ] {  }

without any change to the blocking detection function?


> > For the REINDEX CONCURRENTLY failure test, the problem that needs to be 
> > solved
> > isn't detecting that the command is blocked as it's already getting blocked 
> > on
> > a heavyweight lock, but being able to reliably cancel a specific query as 
> > early
> > as possible, which AFAICS isn't possible with current isolation tester:
>
> Right, it's the same thing of needing to wait till the backend has reached
> a particular state before you do the next thing.
>
> > So we would actually only need something like this to make it work:
> > step "" [ CANCEL IF BLOCKED ] { 
> I continue to resist the idea of hard-wiring this feature to query cancel
> as the action-to-take.  That will more or less guarantee that it's not
> good for anything but this one test case.  I think that the feature
> should have the behavior of "treat this step as blocked once it's reached
> state X", and then you make the next step in the permutation be one that
> issues a query cancel.  (Possibly, using pg_stat_activity and
> pg_cancel_backend for that will be painful enough that we'd want to
> invent separate script syntax that says "send a cancel to session X".
> But that's a separate discussion.)


I agree.  A new step option to kill a session rather than executing sql would
go perfectly with the above new active-wait-for-blocking-state feature.




Re: BEFORE ROW triggers for partitioned tables

2020-03-13 Thread Alvaro Herrera
On 2020-Mar-11, Ashutosh Bapat wrote:

> On Thu, Feb 27, 2020 at 10:22 PM Alvaro Herrera
>  wrote:

> > * The new function I added, ReportTriggerPartkeyChange(), contains one
> > serious bug (namely: it doesn't map attribute numbers properly if
> > partitions are differently defined).
> 
> IIUC the code in your patch, it seems you are just looking at
> partnatts. But partition key can contain expressions also which are
> stored in partexprs. So, I think the code won't catch change in the
> partition key values when it contains expression. Using
> RelationGetPartitionQual() will fix this problem and also problem of
> attribute remapping across the partition hierarchy.

Oh, of course.

In fact, I don't need to deal with PartitionQual directly; I can just
use ExecPartitionCheck(), since in ExecBRInsertTriggers et al we already
have all we need.  v2 attached.

Here's some example output.  With my previous patch, this was the error
we reported:

 insert into parted values (1, 1, 'uno uno v2');-- fail
 ERROR:  changing partitioning columns in a before trigger is not supported
 DETAIL:  Column "a" was changed by trigger "t".

Now, passing emitError=true to ExecPartitionCheck, I get this:

 insert into parted values (1, 1, 'uno uno v2');-- fail
 ERROR:  new row for relation "parted_1_1" violates partition constraint
 DETAIL:  Failing row contains (2, 1, uno uno v2).

Note the discrepancy in the table named in the INSERT vs. the one in the
error message.  This is a low-quality error IMO.  So I'd instead pass
emitError=false, and produce my own error.  It's useful to report the
trigger name and original partition name:

 insert into parted values (1, 1, 'uno uno v2');-- fail
 ERROR:  moving row to another partition during a BEFORE trigger is not 
supported
 DETAIL:  Before trigger "t", row was to be in partition "public.parted_1_1"

Note that in this implementation I no longer know which column is the
problematic one, but I suppose users have clue enough.  Wording
suggestions welcome.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From f13d7ad4c081ede118b0f6c262ad92a49ea9f64c Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 26 Feb 2020 17:34:54 -0300
Subject: [PATCH v2] Enable BEFORE row-level triggers for partitioned tables

... with the limitation that if partitioning columns are changed, the
operation is aborted.  (The reason for this limitation is that it might
require re-routing the tuple, which doesn't work.)
---
 src/backend/commands/tablecmds.c   |  7 
 src/backend/commands/trigger.c | 40 +--
 src/backend/partitioning/partdesc.c|  2 +-
 src/include/utils/reltrigger.h |  1 +
 src/test/regress/expected/triggers.out | 53 --
 src/test/regress/sql/triggers.sql  | 39 ++-
 6 files changed, 116 insertions(+), 26 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3eb861bfbf..08ad595596 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -16501,13 +16501,6 @@ CloneRowTriggersToPartition(Relation parent, Relation partition)
 			 !OidIsValid(trigForm->tgparentid)))
 			continue;
 
-		/*
-		 * Complain if we find an unexpected trigger type.
-		 */
-		if (!TRIGGER_FOR_AFTER(trigForm->tgtype))
-			elog(ERROR, "unexpected trigger \"%s\" found",
- NameStr(trigForm->tgname));
-
 		/* Use short-lived context for CREATE TRIGGER */
 		oldcxt = MemoryContextSwitchTo(perTupCxt);
 
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 513427edf1..98911facad 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -221,18 +221,6 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
 		 */
 		if (stmt->row)
 		{
-			/*
-			 * BEFORE triggers FOR EACH ROW are forbidden, because they would
-			 * allow the user to direct the row to another partition, which
-			 * isn't implemented in the executor.
-			 */
-			if (stmt->timing != TRIGGER_TYPE_AFTER)
-ereport(ERROR,
-		(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-		 errmsg("\"%s\" is a partitioned table",
-RelationGetRelationName(rel)),
-		 errdetail("Partitioned tables cannot have BEFORE / FOR EACH ROW triggers.")));
-
 			/*
 			 * Disallow use of transition tables.
 			 *
@@ -1658,6 +1646,7 @@ RelationBuildTriggers(Relation relation)
 		build->tgtype = pg_trigger->tgtype;
 		build->tgenabled = pg_trigger->tgenabled;
 		build->tgisinternal = pg_trigger->tgisinternal;
+		build->tgisclone = OidIsValid(pg_trigger->tgparentid);
 		build->tgconstrrelid = pg_trigger->tgconstrrelid;
 		build->tgconstrindid = pg_trigger->tgconstrindid;
 		build->tgconstraint = pg_trigger->tgconstraint;
@@ -1961,6 +1950,8 @@ equalTriggerDescs(TriggerDesc *trigdesc1, TriggerDesc *trigdesc2)
 return false;
 			if (trig1->tgisinternal != 

Re:Re: Optimize crash recovery

2020-03-13 Thread Thunder
For example, if the page LSN in storage is 0x9 and we start to replay from 0x1:
if 0x1 is a full-page xlog record, then we can skip replaying the xlog between
0x1~0x9 for this page.


Is there any correctness issue if the page exists in the buffer pool and we
skip replaying a full-page or init-page record when the page LSN is larger
than the LSN of the xlog record?


At 2020-03-13 23:41:03, "Alvaro Herrera"  wrote:
>On 2020-Mar-13, Thunder wrote:
>
>> Hello hackers:
>> 
>> 
>> During crash recovery, we compare the LSN of most xlog records with the page
>> LSN to determine if the record has already been replayed.
>> The exceptions are full-page and init-page xlog records.
>> The page is restored if the xlog record includes a full-page image of the
>> page, and the page is initialized if the xlog record includes init-page
>> information.
>> 
>> 
>> When we enable checksums for the page and the page verifies successfully,
>> can we compare the page LSN with the LSN of a full-page or init-page xlog
>> record to determine whether it has already been replayed?
>
>In order to verify that the checksum passes, you have to read the page
>first.  So what are you optimizing?
>
>-- 
>Álvaro Herrera                https://www.2ndQuadrant.com/
>PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PATCH] Use PKG_CHECK_MODULES to detect the libxml2 library

2020-03-13 Thread Tom Lane
Daniel Gustafsson  writes:
>> On 12 Mar 2020, at 17:39, Tom Lane  wrote:
>> I'd originally thought that we might back-patch this, but I'm now of
>> the opinion that we probably should not.  If pkg-config is present,
>> this can change the default behavior about where we get libxml from,
>> which seems like something not to do in minor releases.  (OTOH, it'd
>> only matter if the default pkg-config choice is different from the
>> default xml2-config choice, so maybe the risk of breakage is small
>> enough to be acceptable?)

> I read this as a preventative patch to stay ahead of future changes to
> packaging.  If these changes do materialize, won't they be equally likely to
> hit installations for backbranch minors as v13?

Yeah, that's the argument *for* back-patching.  Question is whether it
outweighs the risk of silently breaking somebody's build by linking
to the wrong libxml2 version.

I could go either way, honestly.  The risk doesn't seem large, but
it's not zero.

> We refer to both libxml and libxml2 in these paragraphs.  Since upstream is
> consistently referring to it as libxml2, maybe we should take this as
> opportunity to switch to that for the docs?

I think we're kind of stuck with "--with-libxml".  Conceivably we
could introduce "--with-libxml2", redefine the old switch as an
obsolete alias, and start saying "libxml2" instead of "libxml".
But I'm not sure that's worth the trouble, and it seems like
material for a different patch anyway.

regards, tom lane




Re: PATCH: add support for IN and @> in functional-dependency statistics use

2020-03-13 Thread Tomas Vondra

On Fri, Mar 13, 2020 at 08:42:49AM +, Dean Rasheed wrote:

On Thu, 12 Mar 2020 at 17:30, Tomas Vondra  wrote:


I'm sorry, but I don't see how we could do this for arbitrary clauses. I
think we could do that for clauses that have equality semantics and
reference column values as a whole. So I think it's possible to do this
for IN clauses (which is what the first part of the patch does), but I
don't think we can do it for the containment operator.

I.e. we can do that for

 WHERE a IN (...) AND b IN (...)



Hmm, the difficulty always comes back to the compatibility of the
clauses though. It's easy to come up with artificial examples for
which functional dependencies come up with bad estimates, even with
just = and IN (...) operators. For example, given a perfect
correlation like

 a | b
---
 1 | 1
 2 | 2
 3 | 3
 : | :

you only need to write a query like "WHERE a IN (1,3,5,7,9,...) AND b
IN (2,4,6,8,...)" to get a very bad estimate from functional
dependencies.
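
(A self-contained way to reproduce that kind of mismatch; the table and
statistics names are arbitrary, and the overestimate only appears once
dependencies are applied to IN lists as proposed here:)

    CREATE TABLE dep_test (a int, b int);
    INSERT INTO dep_test SELECT i, i FROM generate_series(1, 10000) AS i;
    CREATE STATISTICS dep_stats (dependencies) ON a, b FROM dep_test;
    ANALYZE dep_test;
    -- a perfectly determines b, but these two IN lists never match the
    -- same row, so an estimate driven by the dependency is far too high
    EXPLAIN SELECT * FROM dep_test
     WHERE a IN (1, 3, 5, 7, 9) AND b IN (2, 4, 6, 8, 10);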

However, I don't think such artificial examples are that useful. I
think you have to think in terms of real data distributions together
with real queries expected to go with them. For example:

Using the OP's original example of a multi-tenant system, you might
well have a table with columns (product_type, tenant_id) and a
functional dependency product_type => tenant_id. In that case, it
could well be very useful in optimising queries like "WHERE
product_type IN (X,Y,Z) AND tenant_id = 123".

But this isn't necessarily limited to = and IN (...). For example,
consider a table with UK-based geographic data with columns (location
point, postcode text). Then there would be a strong functional
dependency location => postcode (and possibly also the other way
round, depending on how dense the points were). That dependency could
be used to estimate much more general queries like "WHERE location <@
some_area AND postcode ~ '^CB.*'", where there may be no useful stats
on location, but a histogram on postcode might give a reasonable
estimate.

This also extends to inequalities. For example a table with columns
(weight, category) might have a strong functional dependency weight =>
category. Then a query like "WHERE weight > 10 AND weight < 20 AND
category = 'large'" could get a decent estimate from a histogram on
the weight column, and then use the functional dependency to note that
that implies the category. Note that such an example would work with
my patch from the other thread, because it groups clauses by column,
and uses clauselist_selectivity_simple() on them. So in this case, the
clauses "weight > 10 AND weight < 20" would be estimated together, and
would be able to make use of the RangeQueryClause code.

Of course, it's equally easy to come up with counter-example queries
for any of those examples, where using the functional dependency would
produce a poor estimate. Ultimately, it's up to the user to decide
whether or not to build functional dependency statistics, and that
decision needs to be based not just on the data distribution, but also
on the types of queries expected.



Well, yeah. I'm sure we can produce countless examples where applying
the functional dependencies to additional types of clauses helps a lot.
I'm somewhat hesitant to just drop any restrictions, though, because
it's equally simple to produce examples with poor results.

The main issue I have with just applying dependencies to arbitrary
clauses is that it uses the "degree" computed for the value as a whole,
and just uses it to estimate dependency between pieces of the values.

The IN() clause does not have this problem; the other cases like @> or
pattern matching do.


Given the timing though, perhaps it is best to limit this to IN (..)
clauses for PG13, and we can consider other possibilities later.



Yeah, I was gonna propose the same thing. I'll get the IN bit committed
shortly (both for dependencies and MCV), along with improved handling of OR
clauses and some additional regression tests to increase the coverage.

Then we can discuss these improvements in more detail for PG14.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Use PKG_CHECK_MODULES to detect the libxml2 library

2020-03-13 Thread Tom Lane
Hugh McMaster  writes:
> On Fri, 13 Mar 2020 at 03:39, Tom Lane wrote:
>> That won't win us any friends, so the attached revision
>> doesn't call PKG_CHECK_MODULES unless we found pkg-config.

> Did you mean to terminate configure if pkg-config cannot find
> libxml-2.0 or the library is too old? Your doc changes don't indicate
> that intent, nor was it prior behaviour, but some projects like that
> behaviour and others don't.

Yeah, a potential edge-case here is that pkg-config is in the PATH
but it has no information about libxml2 (I doubt we need to consider
the risk that it has info about a pre-2.6.23 version).  Looking
again at the generated configure code, I realize I shouldn't have
left off the ACTION-IF-NOT-FOUND argument --- the default is to
throw an error, but we'd rather fall through and try to use xml2-config.
The eventual AC_CHECK_LIB(xml2, ...) test will catch the situation
where the library isn't there.  Updated patch attached.

> You might consider this an edge case, but you override custom
> XML2_CFLAGS/LIBS if xml2-config is detected.

Yeah, if pkg-config fails and xml2-config is present, that's true.
Since those weren't there before, and we now document them as just
a fallback solution, I think that's fine.

regards, tom lane

diff --git a/configure b/configure
index 1a0aca9..601e6e0 100755
--- a/configure
+++ b/configure
@@ -703,6 +703,8 @@ with_zlib
 with_system_tzdata
 with_libxslt
 with_libxml
+XML2_LIBS
+XML2_CFLAGS
 XML2_CONFIG
 UUID_EXTRA_OBJS
 with_uuid
@@ -719,13 +721,13 @@ with_perl
 with_tcl
 ICU_LIBS
 ICU_CFLAGS
-PKG_CONFIG_LIBDIR
-PKG_CONFIG_PATH
-PKG_CONFIG
 with_icu
 enable_thread_safety
 INCLUDES
 autodepend
+PKG_CONFIG_LIBDIR
+PKG_CONFIG_PATH
+PKG_CONFIG
 TAS
 GCC
 CPP
@@ -887,6 +889,8 @@ PKG_CONFIG_LIBDIR
 ICU_CFLAGS
 ICU_LIBS
 XML2_CONFIG
+XML2_CFLAGS
+XML2_LIBS
 LDFLAGS_EX
 LDFLAGS_SL
 PERL
@@ -1589,6 +1593,8 @@ Some influential environment variables:
   ICU_CFLAGS  C compiler flags for ICU, overriding pkg-config
   ICU_LIBSlinker flags for ICU, overriding pkg-config
   XML2_CONFIG path to xml2-config utility
+  XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
+  XML2_LIBS   linker flags for XML2, overriding pkg-config
   LDFLAGS_EX  extra linker flags for linking executables only
   LDFLAGS_SL  extra linker flags for linking shared libraries only
   PERLPerl program
@@ -7153,6 +7159,129 @@ else
 fi
 
 
+#
+# Set up pkg_config in case we need it below
+#
+
+
+
+
+
+
+
+if test "x$ac_cv_env_PKG_CONFIG_set" != "xset"; then
+	if test -n "$ac_tool_prefix"; then
+  # Extract the first word of "${ac_tool_prefix}pkg-config", so it can be a program name with args.
+set dummy ${ac_tool_prefix}pkg-config; ac_word=$2
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for $ac_word" >&5
+$as_echo_n "checking for $ac_word... " >&6; }
+if ${ac_cv_path_PKG_CONFIG+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  case $PKG_CONFIG in
+  [\\/]* | ?:[\\/]*)
+  ac_cv_path_PKG_CONFIG="$PKG_CONFIG" # Let the user override the test with a path.
+  ;;
+  *)
+  as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+for ac_exec_ext in '' $ac_executable_extensions; do
+  if as_fn_executable_p "$as_dir/$ac_word$ac_exec_ext"; then
+ac_cv_path_PKG_CONFIG="$as_dir/$ac_word$ac_exec_ext"
+$as_echo "$as_me:${as_lineno-$LINENO}: found $as_dir/$ac_word$ac_exec_ext" >&5
+break 2
+  fi
+done
+  done
+IFS=$as_save_IFS
+
+  ;;
+esac
+fi
+PKG_CONFIG=$ac_cv_path_PKG_CONFIG
+if test -n "$PKG_CONFIG"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: result: $PKG_CONFIG" >&5
+$as_echo "$PKG_CONFIG" >&6; }
+else
+  { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+fi
+
+
+fi
+if test -z "$ac_cv_path_PKG_CONFIG"; then
+  ac_pt_PKG_CONFIG=$PKG_CONFIG
+  # Extract the first word of "pkg-config", so it can be a program name with args.
+set dummy pkg-config; ac_word=$2
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for $ac_word" >&5
+$as_echo_n "checking for $ac_word... " >&6; }
+if ${ac_cv_path_ac_pt_PKG_CONFIG+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  case $ac_pt_PKG_CONFIG in
+  [\\/]* | ?:[\\/]*)
+  ac_cv_path_ac_pt_PKG_CONFIG="$ac_pt_PKG_CONFIG" # Let the user override the test with a path.
+  ;;
+  *)
+  as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+for ac_exec_ext in '' $ac_executable_extensions; do
+  if as_fn_executable_p "$as_dir/$ac_word$ac_exec_ext"; then
+ac_cv_path_ac_pt_PKG_CONFIG="$as_dir/$ac_word$ac_exec_ext"
+$as_echo "$as_me:${as_lineno-$LINENO}: found $as_dir/$ac_word$ac_exec_ext" >&5
+break 2
+  fi
+done
+  done
+IFS=$as_save_IFS
+
+  ;;
+esac
+fi
+ac_pt_PKG_CONFIG=$ac_cv_path_ac_pt_PKG_CONFIG
+if test -n "$ac_pt_PKG_CONFIG"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_pt_PKG_CONFIG" >&5
+$as_echo "$ac_pt_PKG_CONFIG" 

Re: Optimize crash recovery

2020-03-13 Thread Alvaro Herrera
On 2020-Mar-13, Thunder wrote:

> Hello hackers:
> 
> 
> During crash recovery, we compare the LSN of most xlog records with the page
> LSN to determine if the record has already been replayed.
> The exceptions are full-page and init-page xlog records.
> The page is restored if the xlog record includes a full-page image of the
> page, and the page is initialized if the xlog record includes init-page
> information.
> 
> 
> When we enable checksums for the page and the page verifies successfully,
> can we compare the page LSN with the LSN of a full-page or init-page xlog
> record to determine whether it has already been replayed?

In order to verify that the checksum passes, you have to read the page
first.  So what are you optimizing?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Optimize crash recovery

2020-03-13 Thread Thunder
Hello hackers:


During crash recovery, we compare the LSN of most xlog records with the page LSN
to determine if the record has already been replayed.
The exceptions are full-page and init-page xlog records.
The page is restored if the xlog record includes a full-page image of the page,
and the page is initialized if the xlog record includes init-page information.


When we enable checksums for the page and the page verifies successfully, can we
compare the page LSN with the LSN of a full-page or init-page xlog record to
determine whether it has already been replayed?


BRS
Ray

Re: Refactor compile-time assertion checks for C/C++

2020-03-13 Thread Tom Lane
Michael Paquier  writes:
> Hmm.  v3 actually broke the C++ fallback of StaticAssertExpr() and
> StaticAssertStmt() (v1 did not), a simple fix being something like
> the attached.

The buildfarm seems happy, so why do you think it's broken?

If we do need to change it, I'd be inclined to just use the do{}
block everywhere, not bothering with the extra #if test.

regards, tom lane




Re: make check crashes on POWER8 machine

2020-03-13 Thread Tom Lane
Victor Wagner  writes:
> Justin Pryzby  wrote:
>> On Fri, Mar 13, 2020 at 10:29:13AM +0300, Victor Wagner wrote:
>>> I've encountered a problem with Postgres on PowerPC machine.

>> Is it related to
>> https://www.postgresql.org/message-id/20032.1570808731%40sss.pgh.pa.us
>> https://bugzilla.kernel.org/show_bug.cgi?id=205183

> I don't think so. At least I cannot see any signal handler-related stuff
> in the trace, but see lots of calls to stored procedure executor
> instead.

Read the whole thread.  We fixed the issue with recursion in the
postmaster (9abb2bfc0); but the intermittent failure in infinite_recurse
is exactly the same as what we've been seeing for a long time in the
buildfarm, and there is zero doubt that it's that kernel bug.

In the other thread I'd suggested that we could quit running
errors.sql in parallel with other tests, but that would slow down
parallel regression testing for everybody.  I'm disinclined to do
that now, since the buildfarm problem is intermittent and easily
recognized.

regards, tom lane




Re: allow online change primary_conninfo

2020-03-13 Thread Alvaro Herrera
On 2020-Jan-22, Michael Paquier wrote:

> On Tue, Jan 21, 2020 at 06:03:18PM +0300, Sergei Kornilov wrote:
> > PS: also, I am surprised why it's ok for wal_receiver_create_temp_slot
> > to be PGC_SIGHUP and ignore a change of this setting until the walreceiver
> > reconnects for an unrelated reason. I mean the walreceiver does nothing
> > special on SIGHUP. In the common case a change of the
> > wal_receiver_create_temp_slot setting will take effect only during a
> > restart of the walreceiver process. And therefore we will switch to
> > archive recovery. But such a design was strongly rejected for my patch a
> > year ago. 
> 
> [ Looks at 3297308... ]
> Yeah, you are right.  I was not paying much attention but something
> does not stick here.  My understanding is that we should have the WAL
> receiver receive the value it needs to use from the startup process
> (aka via RequestXLogStreaming from xlog.c), and that we ought to make
> this new parameter PGC_POSTMASTER instead of PGC_SIGHUP.  HEAD is
> inconsistent here.

I'm CCing Peter as committer of 329730827848.

What are the downsides of changing wal_receiver_create_temp_slot to
PGC_POSTMASTER?  It seems pretty nasty to require a full server
restart.  Maybe we can signal all walreceivers at that point so that
they restart with the correct setting?  That's much less problematic, I
would think.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: allow online change primary_conninfo

2020-03-13 Thread Alvaro Herrera
On 2020-Mar-13, Alvaro Herrera wrote:

> What are the downsides of changing wal_receiver_create_temp_slot to
> PGC_POSTMASTER?  It seems pretty nasty to require a full server
> restart.  Maybe we can signal all walreceivers at that point so that
> they restart with the correct setting?  That's much less problematic, I
> would think.

... oh, that's exactly what 0002 does.  So patch 0001 is only about
making a temporary change to the create_temp_slot to be consistent with
existing policy, before changing the policy.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [proposal] de-TOAST'ing using a iterator

2020-03-13 Thread Alvaro Herrera
On 2020-Jan-12, John Naylor wrote:
> 
> I took a brief look at v11 to see if there's anything I can do to help
> it move forward. I'm not yet sure how it would look code-wise to
> implement Alvaro and Tomas's comments upthread, but I'm pretty sure
> this part means the iterator-related structs are just as exposed as
> before, but in a roundabout way that completely defeats the purpose of
> hiding internals:

Agreed -- I think this patch still needs more work before being
committable; I agree with John that the changes after v10 made it worse,
not better.  Rather than cross-including header files, it seems better
to expose some struct definitions after all and let the main iterator
interface (detoast_iterate) be a "static inline" function in detoast.h.

So let's move forward with v10 (submitted on Sept 10th).

Looking at that version, I don't think the function protos that were put
in toast_internals.h should be there at all; I think they should be in
detoast.h so that they can be used.  But I don't like the fact that
detoast.h now has to include genam.h; that seems pretty bogus.  I think
this can be fixed by moving the FetchDatumIteratorData struct definition
(but not its typedef) to toast_internals.h.

OTOH we've recently seen the TOAST interface (and header files) heavily
reworked because of table-AM considerations, so probably this needs even
more changes to avoid parts of it becoming heapam-dependent again.

create_toast_buffer() doing two pallocs seems a waste.  It could be a
single one,
+   buf = (ToastBuffer *) palloc0(MAXALIGN(sizeof(ToastBuffer)) + size);
+   buf->buf = (char *) buf + MAXALIGN(sizeof(ToastBuffer));
(I'm not sure that the MAXALIGNs are strictly necessary there; I think
we access the buf as four-byte aligned stuff somewhere in the toast
innards, but maybe I'm wrong about that.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: add support for IN and @> in functional-dependency statistics use

2020-03-13 Thread Bruce Momjian
On Fri, Mar 13, 2020 at 08:42:49AM +, Dean Rasheed wrote:
> On Thu, 12 Mar 2020 at 17:30, Tomas Vondra  
> wrote:
> >
> > I'm sorry, but I don't see how we could do this for arbitrary clauses. I
> > think we could do that for clauses that have equality semantics and
> > reference column values as a whole. So I think it's possible to do this
> > for IN clauses (which is what the first part of the patch does), but I
> > don't think we can do it for the containment operator.
> >
> > I.e. we can do that for
> >
> >  WHERE a IN (...) AND b IN (...)
> >
> 
> Hmm, the difficulty always comes back to the compatibility of the
> clauses though. It's easy to come up with artificial examples for
> which functional dependencies come up with bad estimates, even with
> just = and IN (...) operators. For example, given a perfect
> correlation like
> 
>   a | b
>  ---
>   1 | 1
>   2 | 2
>   3 | 3
>   : | :
> 
> you only need to write a query like "WHERE a IN (1,3,5,7,9,...) AND b
> IN (2,4,6,8,...)" to get a very bad estimate from functional
> dependencies.

Wow, that is a very good example --- the arrays do not tie elements in
one array to elements in another array;  good point.  I get it now!

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Global temporary tables

2020-03-13 Thread Prabhat Sahu
Hi Wenjing,

Please check the below combinations of GTT with primary key and foreign key
relations, along with the ERROR message.


*Case1:*
postgres=# CREATE GLOBAL TEMPORARY TABLE gtt1(c1 serial PRIMARY
KEY, c2 VARCHAR (50) UNIQUE NOT NULL) ON COMMIT *DELETE* ROWS;
CREATE TABLE

postgres=# CREATE GLOBAL TEMPORARY TABLE gtt2(c1 integer NOT NULL, c2
integer NOT NULL,
PRIMARY KEY (c1, c2),
FOREIGN KEY (c1) REFERENCES gtt1 (c1)) ON COMMIT *PRESERVE* ROWS;
ERROR:  unsupported ON COMMIT and foreign key combination
DETAIL:  Table "gtt2" references "gtt1", but *they do not have the same ON
COMMIT setting*.

*Case2:*
postgres=# CREATE GLOBAL TEMPORARY TABLE gtt1(c1 serial PRIMARY KEY, c2
VARCHAR (50) UNIQUE NOT NULL) ON COMMIT *PRESERVE* ROWS;
CREATE TABLE

postgres=# CREATE GLOBAL TEMPORARY TABLE gtt2(c1 integer NOT NULL, c2
integer NOT NULL,
PRIMARY KEY (c1, c2),
FOREIGN KEY (c1) REFERENCES gtt1 (c1)) ON COMMIT *DELETE* ROWS;
CREATE TABLE

In "case2" although both the primary table and foreign key GTT *do not have
the same ON COMMIT setting*, still we are able to create the PK-FK
relations with GTT.

So I hope the detail message(DETAIL:  Table "gtt2" references "gtt1", but
they do not have the same ON COMMIT setting.) in "Case1" should be more
clear(something like "wrong combination of ON COMMIT setting").

-- 

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com


Re: Add an optional timeout clause to isolationtester step.

2020-03-13 Thread Tom Lane
Julien Rouhaud  writes:
> On Wed, Mar 11, 2020 at 05:52:54PM -0300, Alvaro Herrera wrote:
>> On 2020-Mar-11, Tom Lane wrote:
>>> I'd like to see an attempt to rewrite some of the existing
>>> timeout-dependent test cases to use this facility instead of
>>> long timeouts.

>> +1.  Those long timeouts are annoying enough that infrastructure to make
>> a run shorter in normal circumstances might be sufficient justification
>> for this patch ...

> I'm not familiar with those tests so I'm probably missing something, but it
> looks like all isolation tests that set up a timeout do so to test server-side
> features (deadlock detection, statement and lock timeout).  I'm not sure how
> adding a client-side facility to detect locks earlier is going to help reduce
> the server-side timeouts?

The point is that those timeouts have to be set long enough for even a
very slow machine to reach a desired state before the timeout happens;
on faster machines the test is just uselessly sleeping for a long time,
because of the fixed timeout.  My thought was that maybe the tests could
be recast as "watch for session to reach $expected_state and then do
the next thing", allowing them to be automatically adaptive to the
machine's speed.  This might require some rather subtle test redesign
and/or addition of more infrastructure (to allow recognition of the
desired state and/or taking an appropriate next action).  I'm prepared
to believe that not much can be done about timeouts.spec in particular,
but it seems to me that the long delays in the deadlock tests are not
inherent in what we need to test.

> For the REINDEX CONCURRENTLY failure test, the problem that needs to be solved
> isn't detecting that the command is blocked as it's already getting blocked on
> a heavyweight lock, but being able to reliably cancel a specific query as early
> as possible, which AFAICS isn't possible with current isolation tester:

Right, it's the same thing of needing to wait till the backend has reached
a particular state before you do the next thing.

> So we would actually only need something like this to make it work:
> step "" [ CANCEL IF BLOCKED ] { 

Re: bitmaps and correlation

2020-03-13 Thread Justin Pryzby
There were no comments last month, so rebased, fixed tests, and kicked to next
CF.

-- 
Justin
>From e754a93aff10cb435f5ecef923a810b9edc02d68 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 8 Jan 2020 19:23:51 -0600
Subject: [PATCH v5 1/2] Make more clear the computation of min/max IO..

..and specifically the double use and effect of correlation.

Avoid re-use of the "pages_fetched" variable
---
 src/backend/optimizer/path/costsize.c | 47 ++-
 1 file changed, 25 insertions(+), 22 deletions(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b5a0033721..bdc23a075f 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -491,12 +491,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
 csquared;
 	double		spc_seq_page_cost,
 spc_random_page_cost;
-	Cost		min_IO_cost,
+	double		min_pages_fetched,	/* The min and max page count based on index correlation */
+max_pages_fetched;
+	Cost		min_IO_cost,	/* The min and max cost based on index correlation */
 max_IO_cost;
 	QualCost	qpqual_cost;
 	Cost		cpu_per_tuple;
 	double		tuples_fetched;
-	double		pages_fetched;
 	double		rand_heap_pages;
 	double		index_pages;
 
@@ -579,7 +580,8 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
 	 * (just after a CLUSTER, for example), the number of pages fetched should
 	 * be exactly selectivity * table_size.  What's more, all but the first
 	 * will be sequential fetches, not the random fetches that occur in the
-	 * uncorrelated case.  So if the number of pages is more than 1, we
+	 * uncorrelated case (the index is expected to read fewer pages, *and* each
+	 * page read is cheaper).  So if the number of pages is more than 1, we
 	 * ought to charge
 	 *		spc_random_page_cost + (pages_fetched - 1) * spc_seq_page_cost
 	 * For partially-correlated indexes, we ought to charge somewhere between
@@ -604,17 +606,17 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
 		 * pro-rate the costs for one scan.  In this case we assume all the
 		 * fetches are random accesses.
 		 */
-		pages_fetched = index_pages_fetched(tuples_fetched * loop_count,
+		max_pages_fetched = index_pages_fetched(tuples_fetched * loop_count,
 			baserel->pages,
 			(double) index->pages,
 			root);
 
 		if (indexonly)
-			pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));
+			max_pages_fetched = ceil(max_pages_fetched * (1.0 - baserel->allvisfrac));
 
-		rand_heap_pages = pages_fetched;
+		rand_heap_pages = max_pages_fetched;
 
-		max_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;
+		max_IO_cost = (max_pages_fetched * spc_random_page_cost) / loop_count;
 
 		/*
 		 * In the perfectly correlated case, the number of pages touched by
@@ -626,17 +628,17 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
 		 * where such a plan is actually interesting, only one page would get
 		 * fetched per scan anyway, so it shouldn't matter much.)
 		 */
-		pages_fetched = ceil(indexSelectivity * (double) baserel->pages);
+		min_pages_fetched = ceil(indexSelectivity * (double) baserel->pages);
 
-		pages_fetched = index_pages_fetched(pages_fetched * loop_count,
+		min_pages_fetched = index_pages_fetched(min_pages_fetched * loop_count,
 			baserel->pages,
 			(double) index->pages,
 			root);
 
 		if (indexonly)
-			pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));
+			min_pages_fetched = ceil(min_pages_fetched * (1.0 - baserel->allvisfrac));
 
-		min_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;
+		min_IO_cost = (min_pages_fetched * spc_random_page_cost) / loop_count;
 	}
 	else
 	{
@@ -644,30 +646,31 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
 		 * Normal case: apply the Mackert and Lohman formula, and then
 		 * interpolate between that and the correlation-derived result.
 		 */
-		pages_fetched = index_pages_fetched(tuples_fetched,
+
+		/* For the perfectly uncorrelated case (csquared=0) */
+		max_pages_fetched = index_pages_fetched(tuples_fetched,
 			baserel->pages,
 			(double) index->pages,
 			root);
 
 		if (indexonly)
-			pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));
+			max_pages_fetched = ceil(max_pages_fetched * (1.0 - baserel->allvisfrac));
 
-		rand_heap_pages = pages_fetched;
+		rand_heap_pages = max_pages_fetched;
 
-		/* max_IO_cost is for the perfectly uncorrelated case (csquared=0) */
-		max_IO_cost = pages_fetched * spc_random_page_cost;
+		max_IO_cost = max_pages_fetched * spc_random_page_cost;
 
-		/* min_IO_cost is for the perfectly correlated case (csquared=1) */
-		pages_fetched = ceil(indexSelectivity * (double) baserel->pages);
+		/* For the perfectly correlated case (csquared=1) */
+		min_pages_fetched = ceil(indexSelectivity * (double) baserel->pages);

Re: backup manifests

2020-03-13 Thread tushar

On 3/12/20 8:16 PM, tushar wrote:

Seems like expected behavior to me. We could consider providing a more
descriptive error message, but there's no way for it to work.


Right, the error message needs to be more user-friendly.


One scenario which I feel should error out even if the -s option is
specified:


Create a base backup directory (./pg_basebackup data1).
Connect as the root user and remove read permission from the pg_hba.conf
file (chmod 004 pg_hba.conf).


Run pg_validatebackup:

[centos@tushar-ldap-docker bin]$ ./pg_validatebackup data1
pg_validatebackup: error: could not open file "pg_hba.conf": Permission denied


Run pg_validatebackup with the -s switch:

[centos@tushar-ldap-docker bin]$ ./pg_validatebackup data1 -s
pg_validatebackup: backup successfully verified

Here the file is not accessible, so I think it should throw an error (the
same one as above) instead of blindly skipping it.


--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-03-13 Thread Dilip Kumar
On Fri, Mar 13, 2020 at 3:39 PM Amit Kapila  wrote:
>
> On Fri, Mar 13, 2020 at 8:37 AM Dilip Kumar  wrote:
> >
> > On Thu, Mar 12, 2020 at 5:28 PM Amit Kapila  wrote:
> > >
> > > On Thu, Mar 12, 2020 at 11:15 AM Dilip Kumar  
> > > wrote:
> > > >
> > > > I have fixed this in the attached patch set.
> > > >
> > >
> > > I have modified your
> > > v4-0003-Conflict-Extension-Page-lock-in-group-member patch.  The
> > > modifications are (a) Change src/backend/storage/lmgr/README to
> > > reflect new behaviour, (b) Introduce a new macro LOCK_LOCKTAG which
> > > slightly simplifies the code, (c) moved the deadlock.c check a few
> > > lines up and (d) changed a few comments.
> >
> > Changes look fine to me.
> >
>
> Today, while looking at this patch again, I realized that there is a
> case where we sometimes allow group members to jump the wait queue.  This
> is primarily to avoid creating deadlocks (see ProcSleep).  Now,
> ideally, we don't need this for relation extension or page locks as
> those can never lead to deadlocks.  However, the current code will
> give group members more priority to acquire relation extension or page
> locks if any one of the members has held those locks.  Now, if we want
> we can prevent giving group members priority for these locks, but I am
> not sure how important is that case.  So, I have left that as it is by
> adding a few comments.  What do you think?
>
> Additionally, I have changed/added a few more sentences in README.

I have included all your changes in the latest patch set.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-03-13 Thread Dilip Kumar
On Fri, Mar 13, 2020 at 11:16 AM Dilip Kumar  wrote:
>
> On Fri, Mar 13, 2020 at 11:08 AM Amit Kapila  wrote:
> >
> > On Thu, Mar 12, 2020 at 3:04 PM Dilip Kumar  wrote:
> > >
> > > On Wed, Mar 11, 2020 at 2:36 PM Amit Kapila  
> > > wrote:
> > > >
> > > >
> > > > If we have no other choice, then I see a few downsides of adding a
> > > > special check in the LockRelease() call:
> > > >
> > > > 1. Instead of resetting/decrement the variable from specific APIs like
> > > > UnlockRelationForExtension or UnlockPage, we need to have it in
> > > > LockRelease. It will also look odd, if set variable in
> > > > LockRelationForExtension, but don't reset in the
> > > > UnlockRelationForExtension variant.  Now, maybe we can allow to reset
> > > > it at both places if it is a flag, but not if it is a counter
> > > > variable.
> > > >
> > > > 2. One can argue that adding extra instructions in a generic path
> > > > (like LockRelease) is not a good idea, especially if those are for an
> > > > Assert. I understand this won't add anything which we can measure by
> > > > standard benchmarks.
> > >
> > > I have just written a WIP patch for relation extension lock where
> > > instead of incrementing and decrementing the counter in
> > > LockRelationForExtension and UnlockRelationForExtension respectively.
> > > We can just set and reset the flag in LockAcquireExtended and
> > > LockRelease.  So this patch appears simple to me as we are not
> > > involving the transaction APIs to set and reset the flag.  However, we
> > > need to add an extra check as you have already mentioned.  I think we
> > > could measure the performance and see whether it has any impact or
> > > not?
> > >
> >
> > LockAcquireExtended()
> > {
> > ..
> > + if (locktag->locktag_type == LOCKTAG_RELATION_EXTEND)
> > + IsRelationExtensionLockHeld = true;
> > ..
> > }
> >
> > Can we move this check inside a function (CheckAndSetLockHeld or
> > something like that) as we need to add a similar thing for page lock?
>
> ok

Done

>
> > Also, how about moving the set and reset of these flags to
> > GrantLockLocal and RemoveLocalLock as that will further reduce the
> > number of places where we need to add such a check.
>
> Make sense to me.

Done
>

>  Another thing is
> > to see if it makes sense to have a macro like LOCALLOCK_LOCKMETHOD to
> > get the lock tag.
>
> ok

Done

Apart from that, I have also extended the solution for the page lock.
And, I have also broken down the 3rd patch in two parts for relation
extension and for the page lock.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v6-0001-Add-assert-check-for-relation-extension-lock.patch
Description: Binary data


v6-0004-Page-lock-to-conflict-among-parallel-group-member.patch
Description: Binary data


v6-0002-Extend-the-assert-for-the-page-lock.patch
Description: Binary data


v6-0003-Relation-extension-lock-to-conflict-among-paralle.patch
Description: Binary data


Re: make check crashes on POWER8 machine

2020-03-13 Thread Victor Wagner
On Fri, 13 Mar 2020 07:43:59 -0500
Justin Pryzby  wrote:

> On Fri, Mar 13, 2020 at 10:29:13AM +0300, Victor Wagner wrote:
> > Hi,
> > 
> > I've encountered a problem with Postgres on PowerPC machine.
> > Sometimes  
> 
> Is it related to
> https://www.postgresql.org/message-id/20032.1570808731%40sss.pgh.pa.us
> https://bugzilla.kernel.org/show_bug.cgi?id=205183

I don't think so. At least I cannot see any signal handler-related stuff
in the trace, but I see lots of calls to the stored procedure executor
instead.

However, several different stack traces show completely different parts
of the code when the SIGSEGV arrives, which may point to an asynchronous
nature of the problem.

Unfortunately I've not kept all the cores I've seen.

It rather looks like, in some rare circumstances, Postgres is unable
to properly detect the end-of-stack condition.
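
(For context, the failing statement comes from the errors.sql regression
test, which exercises the stack-depth check with a self-recursive SQL
function, roughly:

create function infinite_recurse() returns int
    as 'select infinite_recurse()' language sql;
select infinite_recurse();  -- expected: ERROR:  stack depth limit exceeded

so a SIGSEGV there means the max_stack_depth guard did not fire in time.)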
-- 





Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)

2020-03-13 Thread Justin Pryzby
@cfbot: rebased onto 085b6b6679e73b9b386f209b4d625c7bc60597c0

The merge conflict presents another opportunity to solicit comments on the new
approach.  Rather than making "recurse into tmpdir" the end goal:

  - add a function to show metadata of an arbitrary dir;
  - add isdir arguments to pg_ls_* functions (including pg_ls_tmpdir but not
pg_ls_dir).
  - maybe add pg_ls_dir_recurse, which satisfies the original need;
  - retire pg_ls_dir (does this work with tuplestore?)
  - profit

The alternative seems to be to go back to Alvaro's earlier proposal:
 - not only add "isdir", but also recurse;

I think I would insist on adding a general function to recurse into any dir.
And *optionally* change pg_ls_* to recurse (either by accepting an argument, or
by making that a separate patch to debate).  tuplestore is certainly better
than keeping a stack/List of DIRs for this.

On Tue, Mar 10, 2020 at 01:30:37PM -0500, Justin Pryzby wrote:
> I took a step back, and I wondered whether we should add a generic function 
> for
> listing a dir with metadata, possibly instead of changing the existing
> functions.  Then one could do pg_ls_dir_metadata('pg_wal',false,false);
> 
> Since pg8.1, we have pg_ls_dir() to show a list of files.  Since pg10, we've
> had pg_ls_logdir and pg_ls_waldir, which show not only file names but also
> (some) metadata (size, mtime).  And since pg12, we've had pg_ls_tmpfile and
> pg_ls_archive_statusdir, which also show metadata.
> 
> ...but there's no a function which lists the metadata of an directory other
> than tmp, wal, log.
> 
> One can do this:
> |SELECT b.*, c.* FROM (SELECT 'base' a)a, LATERAL (SELECT 
> a||'/'||pg_ls_dir(a.a)b)b, pg_stat_file(b)c;
> ..but that's not as helpful as allowing:
> |SELECT * FROM pg_ls_dir_metadata('.',true,true);
> 
> There's also no function which recurses into an arbitrary directory, so it
> seems shortsighted to provide a function to recursively list a tmpdir.
> 
> Also, since pg_ls_dir_metadata indicates whether the path is a dir, one can
> write a SQL function to show the dir recursively.  It'd be trivial to plug in
> wal/log/tmp (it seems like tmpdirs of other tablespace's are not entirely
> trivial).
> |SELECT * FROM pg_ls_dir_recurse('base/pgsql_tmp');
> 
> Also, on a neighboring thread[1], Tom indicated that the pg_ls_* functions
> should enumerate all files during the initial call, which sounds like a bad
> idea when recursively showing directories.  If we add a function recursing 
> into
> a directory, we'd need to discuss all the flags to expose to it, like recurse,
> ignore_errors, one_filesystem?, show_dotfiles (and eventually bikeshed all the
> rest of the flags in find(1)).
> 
> My initial patch [2] changed ls_tmpdir to show metadata columns including
> is_dir, but not decend.  It's pretty unfortunate if a function called
> pg_ls_tmpdir hides shared filesets, so maybe it really is best to change that
> (it's new in v12).
> 
> I'm interested to in feedback on the alternative approach, as attached.  The
> final patch to include all the rest of columns shown by pg_stat_file() is more
> of an idea/proposal and not sure if it'll be desirable.  But pg_ls_tmpdir() is
> essentially the same as my v1 patch.
> 
> This is intended to be mostly independent of any fix to the WARNING I reported
> [1].  Since my patch collapses pg_ls_dir into pg_ls_dir_files, we'd only need
> to fix one place.  I'm planning to eventually look into Tom's suggestion of
> returning tuplestore to fix that, and maybe rebase this patchset on top of
> that.
> 
> -- 
> Justin
> 
> [1] 
> https://www.postgresql.org/message-id/flat/20200308173103.GC1357%40telsasoft.com
> [2] 
> https://www.postgresql.org/message-id/20191214224735.GA28433%40telsasoft.com
>From 950eed8812a167a12da49553f7e3a2d1438d779b Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 6 Mar 2020 16:50:07 -0600
Subject: [PATCH v10 1/9] Document historic behavior about hiding directories
 and special files

Should backpatch to v10: tmpdir, waldir and archive_statusdir
---
 doc/src/sgml/func.sgml | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 323366feb6..4c0ea5ab3f 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -21450,6 +21450,7 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
 (mtime) of each file in the log directory. By default, only superusers
 and members of the pg_monitor role can use this function.
 Access may be granted to others using GRANT.
+Filenames beginning with a dot, directories, and other special files are not shown.

 

@@ -21461,6 +21462,7 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
 default only superusers and members of the pg_monitor role
 can use this function. Access may be granted to others using
 GRANT.
+Filenames beginning with a dot, directories, and other special files are not shown.

 

@@ -21473,6 

Re: make check crashes on POWER8 machine

2020-03-13 Thread Justin Pryzby
On Fri, Mar 13, 2020 at 10:29:13AM +0300, Victor Wagner wrote:
> Hi,
> 
> I've encountered a problem with Postgres on PowerPC machine. Sometimes

Is it related to
https://www.postgresql.org/message-id/20032.1570808731%40sss.pgh.pa.us
https://bugzilla.kernel.org/show_bug.cgi?id=205183

(My initial report on that thread was unrelated user-error on my part)

> It seems that problem is in errors.sql when executed 
> 
> select infinite_recurse(); statement
> 
> so stack trace, produced by gdb is too long to post here.
> 
> Problem is rare and doesn't occur on all runs of make check.
> When I run make check repeatedly it occurs once in several hundred runs.
> 
> It seems that problem is architecture-dependent, because I cannot
> reproduce it on x86_64 CPU with more than thousand runs of make check.

That's all consistent with the above problem.

> Running RedHat 7.6.

-- 
Justin




Re: [Proposal] Global temporary tables

2020-03-13 Thread tushar

On 3/9/20 10:01 PM, 曾文旌(义从) wrote:

Fixed in global_temporary_table_v18-pg13.patch.


Thanks Wenjing.

I am getting the error "ERROR:  could not open file
"base/13589/t3_16440": No such file or directory" if
max_active_global_temporary_table is set to 0.


Please refer to this scenario:

postgres=# create global temp table  tab1 (n int ) with ( 
on_commit_delete_rows='true');

CREATE TABLE
postgres=# insert into tab1 values (1);
INSERT 0 1
postgres=# select * from tab1;
 n
---
(0 rows)

postgres=# alter system set max_active_global_temporary_table=0;
ALTER SYSTEM
postgres=# \q
[tushar@localhost bin]$ ./pg_ctl -D data/ restart -c -l logs123

waiting for server to start done
server started

[tushar@localhost bin]$ ./psql postgres
psql (13devel)
Type "help" for help.

postgres=# insert into tab1 values (1);
ERROR:  could not open file "base/13589/t3_16440": No such file or directory
postgres=#

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company





Re: Restore replication settings when modifying a field type

2020-03-13 Thread Peter Eisentraut

On 2020-03-10 14:16, Euler Taveira wrote:
On Thu, 5 Mar 2020 at 09:45, Peter Eisentraut 
> wrote:


On 2020-02-11 00:38, Quan Zongliang wrote:
 > new patch attached.

I didn't like so much how the updating of the replica identity was
hacked into the middle of ATRewriteCatalogs().  I have an alternative
proposal in the attached patch that queues up an ALTER TABLE ...
REPLICA
IDENTITY command into the normal ALTER TABLE processing.  I have also
added tests to the test suite.

LGTM. Tests are OK. I've rebased it (because of commit
61d7c7bce3686ec02bd64abac742dd35ed9b9b01). Are you planning to backpatch
it? IMHO you should, because it is a bug (present since REPLICA IDENTITY
was introduced in 9.4). This patch can be applied as-is to 12 but not to
older branches. I attached new patches.


Thanks.  This has been committed and backpatched to 9.5.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Use PKG_CHECK_MODULES to detect the libxml2 library

2020-03-13 Thread Hugh McMaster
On Fri, 13 Mar 2020 at 03:39, Tom Lane wrote:
> I poked at this issue a bit more and realized that Hugh's patch
> flat-out breaks building --with-libxml in environments without
> pkg-config, because PKG_CHECK_MODULES just gives up and dies
> if there's no pkg-config (as we'd already found out in connection
> with ICU).

As you found out, that is by design. PKG_CHECK_MODULES actually checks
for pkg-config via PKG_PROG_PKG_CONFIG, but only in the first
expansion of PKG_CHECK_MODULES. If the first instance is in a
conditional, then the check for pkg-config is also in that
conditional. Once a build system begins using pkg-config without a
fallback (e.g. like for ICU), pkg-config becomes a build dependency
(and, yes, I realise ICU isn't mandatory here).

> That won't win us any friends, so the attached revision
> doesn't call PKG_CHECK_MODULES unless we found pkg-config.

Did you mean to terminate configure if pkg-config cannot find
libxml-2.0 or the library is too old? Your doc changes don't indicate
that intent, nor was it prior behaviour, but some projects like that
behaviour and others don't.

> I also concluded that if the user has set XML2_CONFIG, it's pretty clear
> that her intent is to use whatever that is pointing at, so we should
> not use pkg-config in that case either.
>
> Also, I'd been going back and forth about whether it was worth
> documenting XML2_CFLAGS/XML2_LIBS, but I realized that use of
> PKG_CHECK_MODULES(XML2, ...) basically forces the issue for us:
> it does AC_ARG_VAR on them, which puts them into configure's
> --help output and makes configure picky about caching them.
> So we can't really pretend they're boring implementation detail.

You might consider this an edge case, but you override custom
XML2_CFLAGS/LIBS if xml2-config is detected.

> So the attached mostly adopts Peter's old suggested docs, but
> I added discussion of XML2_CFLAGS/XML2_LIBS and dropped the mention
> of forcing matters with --with-libs/--with-libraries (not because
> that doesn't work anymore but because it seemed like we were offering
> quite enough alternatives already).
>
> I'd originally thought that we might back-patch this, but I'm now of
> the opinion that we probably should not.  If pkg-config is present,
> this can change the default behavior about where we get libxml from,
> which seems like something not to do in minor releases.  (OTOH, it'd
> only matter if the default pkg-config choice is different from the
> default xml2-config choice, so maybe the risk of breakage is small
> enough to be acceptable?)




Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-13 Thread Laurenz Albe
On Fri, 2020-03-13 at 07:00 -0500, Justin Pryzby wrote:
> > 2. The new feature can be completely disabled. This might be very
> > useful for people who suffer from auto-vacuum starvation.
> 
> > Yes, but in particular so it can be completely disabled easily.
> 
> How is it disabled ?  By setting scale_factor=100 ?
> 
> +   {
> +       "autovacuum_vacuum_insert_scale_factor",
> +       "Number of tuple inserts prior to vacuum as a fraction of reltuples",
> +       RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
> +       ShareUpdateExclusiveLock
> +   },
> +   -1, 0.0, 100.0
>
> Note, vacuum_cleanup_index_scale_factor uses max: 1e10
> See 4d54543efa5eb074ead4d0fadb2af4161c943044

By setting the threshold very high, or by setting the scale factor to 100.
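
For example, per table, using the reloption name from the patch (a sketch;
the table name is made up and the option only exists with the patch
applied):

ALTER TABLE bulk_insert_only SET (autovacuum_vacuum_insert_scale_factor = 100.0);

or the corresponding setting in postgresql.conf for the whole cluster.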

Yours,
Laurenz Albe





Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-13 Thread Laurenz Albe
On Fri, 2020-03-13 at 12:05 +0300, Darafei "Komяpa" Praliaskouski wrote:
>  1. introduce no new parameters and trigger autovacuum if the number
> > of inserts exceeds the regular vacuum threshold.
> > 
> > 2. introduce the new parameters with high base threshold and zero scale 
> > factor.
> 
> Both of these look good to me. 1 is approach in my initial patch
> sketch, 2 is approach taken by Laurenz.
> Values I think in when considering vacuum is "how many megabytes of
> table aren't frozen/visible" (since that's what translates into
> processing time knowing io limits of storage), and "how many pages
> aren't yet vacuumed".
> 
> Threshold in Laurenz's patch was good enough for my taste - it's
> basically "vacuum after every gigabyte", and that's exactly what we
> implemented when working around this issue manually. There's enough
> chance that latest gigabyte is in RAM and vacuum will be super fast on
> it; reading a gigabyte of data is not a showstopper for most
> contemporary physical and cloud environments I can think of. If
> reading a gigabyte is a problem already then wraparound is a
> guaranteed disaster.
> 
> About index only scan, this threshold seems good enough too. There's a
> good chance last gig is already in RAM, and previous data was
> processed with previous vacuum. Anyway - with this patch Index Only
> Scan starts actually working :)
> 
> I'd vote for 2 with a note "rip it off all together later and redesign
> scale factors and thresholds system to something more easily
> graspable". Whoever needs to cancel the new behavior for some reason
> will have a knob then, and patch is laid out already.
> 
> > 3. introduce the new parameters with low base threshold and high scale 
> > factor.
> 
> This looks bad to me. "the bigger the table, the longer we wait" does
> not look good for me for something designed as a measure preventing
> issues with big tables.

Thanks for the feedback.

It looks like we have a loose consensus on #2, i.e. my patch.

Yours,
Laurenz Albe





Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-13 Thread Justin Pryzby
On Wed, Mar 11, 2020 at 10:32:47AM +1300, David Rowley wrote:
> 2. The new feature can be completely disabled. This might be very
> useful for people who suffer from auto-vacuum starvation.

On Thu, Mar 12, 2020 at 08:28:05PM +1300, David Rowley wrote:
> Yes, but in particular so it can be completely disabled easily.

How is it disabled ?  By setting scale_factor=100 ?

+   {
+       "autovacuum_vacuum_insert_scale_factor",
+       "Number of tuple inserts prior to vacuum as a fraction of reltuples",
+       RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+       ShareUpdateExclusiveLock
+   },
+   -1, 0.0, 100.0

Note, vacuum_cleanup_index_scale_factor uses max: 1e10
See 4d54543efa5eb074ead4d0fadb2af4161c943044

-- 
Justin




Re: Refactor compile-time assertion checks for C/C++

2020-03-13 Thread Michael Paquier
On Fri, Mar 13, 2020 at 03:12:34PM +0900, Michael Paquier wrote:
> On Thu, Mar 12, 2020 at 09:43:54AM -0400, Tom Lane wrote:
>> I don't feel a need to expend a whole lot of sweat there.  The existing
>> text is fine, it just bugged me that the code deals with three cases
>> while the comment block only acknowledged two.  So I'd just go with
>> what you have in v3.
> 
> Thanks, Tom.  I have committed v3 then.

Hmm.  v3 actually broke the C++ fallback of StaticAssertExpr() and
StaticAssertStmt() (v1 did not), a simple fix being something like
the attached.

The buildfarm does not really care about that, but it could, for
example, exercise it via the only C++ code compiled in the tree, in
src/backend/jit/.  That also means that only builds using --with-llvm
with an old enough compiler would trigger that fallback.
--
Michael
diff --git a/src/include/c.h b/src/include/c.h
index 6558801e5f..51db902fc3 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -860,8 +860,13 @@ extern void ExceptionalCondition(const char *conditionName,
 	static_assert(condition, errmessage)
 #else
 /* Fallback implementation for C and C++ */
+#ifndef __cplusplus
 #define StaticAssertStmt(condition, errmessage) \
 	((void) sizeof(struct { int static_assert_failure : (condition) ? 1 : -1; }))
+#else
+#define StaticAssertStmt(condition, errmessage) \
+	do { struct static_assert_struct { int static_assert_failure : (condition) ? 1 : -1; }; } while(0)
+#endif		/* __cplusplus */
 #define StaticAssertExpr(condition, errmessage) \
 	StaticAssertStmt(condition, errmessage)
 #define StaticAssertDecl(condition, errmessage) \


signature.asc
Description: PGP signature


Re: truncating timestamps on arbitrary intervals

2020-03-13 Thread Isaac Morland
On Fri, 13 Mar 2020 at 03:13, John Naylor 
wrote:

> On Wed, Feb 26, 2020 at 11:36 PM Tom Lane  wrote:
> >
> > * In general, binning involves both an origin and a stride.  When
> > working with plain numbers it's almost always OK to set the origin
> > to zero, but it's less clear to me whether that's all right for
> > timestamps.  Do we need another optional argument?  Even if we
> > don't, "zero" for tm_year is 1900, which is going to give results
> > that surprise somebody.
>

- align weeks to start on Sunday
> select date_trunc_interval('7 days'::interval, TIMESTAMP '2020-02-11
> 01:01:01.0', TIMESTAMP '1900-01-02');
>  date_trunc_interval
> -
>  2020-02-09 00:00:00
> (1 row)
>

I'm confused by this. If my calendars are correct, both 1900-01-02
and 2020-02-11 are Tuesdays. So if the date being adjusted and the origin
are both Tuesday, shouldn't the day part be left alone when truncating to 7
days? Also, I'd like to confirm that the default starting point for 7 day
periods (weeks) is Monday, per ISO. I know it's very fashionable in North
America to split the weekend in half but it's not the international
standard.

Perhaps the starting point for dates should be either 0001-01-01 (the
proleptic beginning of the CE calendar) or 2001-01-01 (the beginning of the
current 400-year repeating cycle of leap years and weeks, and a Monday,
giving the appropriate ISO result for truncating to 7 day periods).
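
For example, with the patch's three-argument form I would expect (not
tested, just reasoning from the origin):

select date_trunc_interval('7 days'::interval, TIMESTAMP '2020-02-11 01:01:01', TIMESTAMP '2001-01-01');
-- expected: 2020-02-10 00:00:00, i.e. the Monday of that week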


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-03-13 Thread Dilip Kumar
On Fri, Mar 13, 2020 at 2:32 PM Kuntal Ghosh  wrote:
>
> On Fri, Mar 13, 2020 at 8:29 AM Amit Kapila  wrote:
> >
> > On Thu, Mar 12, 2020 at 7:50 PM Kuntal Ghosh  
> > wrote:
> > > I think moving them inside a macro is a good idea. Also, I think we
> > > should move all the Assert related code inside some debugging macro
> > > similar to this:
> > > #ifdef LOCK_DEBUG
> > > 
> > > #endif
> > >
> > If we move it under some macro, then those Asserts will be only
> > enabled when that macro is defined.  I think we want these Asserts to
> > be enabled always in assert enabled build, these will be like any
> > other Asserts in the code.  What is the advantage of doing those under
> > macro?
> >
> My concern is related to performance regression. We're using two
> static variables in hot paths only for checking a few asserts. So, I'm
> not sure whether we should enable the same by default, especially when
> asserts are themselves disabled.
> -ResetRelExtLockHeldCount()
> +ResetRelExtPageLockHeldCount()
>  {
>   RelationExtensionLockHeldCount = 0;
> + PageLockHeldCount = 0;
> +}
> Also, we're calling this method from frequently used functions like
> Commit/AbortTransaction. So, it's better if these two static variables
> share the same cache line and are reinitialized with a single
> instruction.

In the recent version of the patch, instead of a counter, we have gone
with a flag.  So I think now we can just keep a single variable and
reset the bit in a single instruction.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-03-13 Thread Amit Kapila
On Fri, Mar 13, 2020 at 8:37 AM Dilip Kumar  wrote:
>
> On Thu, Mar 12, 2020 at 5:28 PM Amit Kapila  wrote:
> >
> > On Thu, Mar 12, 2020 at 11:15 AM Dilip Kumar  wrote:
> > >
> > > I have fixed this in the attached patch set.
> > >
> >
> > I have modified your
> > v4-0003-Conflict-Extension-Page-lock-in-group-member patch.  The
> > modifications are (a) Change src/backend/storage/lmgr/README to
> > reflect new behaviour, (b) Introduce a new macro LOCK_LOCKTAG which
> > slightly simplifies the code, (c) moved the deadlock.c check a few
> > lines up and (d) changed a few comments.
>
> Changes look fine to me.
>

Today, while looking at this patch again, I realized that there is a
case where we sometimes allow group members to jump the wait queue.  This
is primarily to avoid creating deadlocks (see ProcSleep).  Now,
ideally, we don't need this for relation extension or page locks as
those can never lead to deadlocks.  However, the current code will
give group members more priority to acquire relation extension or page
locks if any one of the members has held those locks.  Now, if we want
we can prevent giving group members priority for these locks, but I am
not sure how important is that case.  So, I have left that as it is by
adding a few comments.  What do you think?

Additionally, I have changed/added a few more sentences in README.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


0002-Allow-relation-extension-and-page-locks-to-conflict-.v2.patch
Description: Binary data


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-03-13 Thread Kuntal Ghosh
On Fri, Mar 13, 2020 at 8:42 AM Dilip Kumar  wrote:
> > On Thu, Mar 12, 2020 at 7:50 PM Kuntal Ghosh  
> > wrote:
> >
> > > + /*
> > > + * The relation extension or page lock can never participate in actual
> > > + * deadlock cycle.  See Asserts in LockAcquireExtended.  So, there is
> > > + * no advantage in checking wait edges from it.
> > > + */
> > > + if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
> > > + (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
> > > + return false;
> > > +
> > > Since this is true, we can also avoid these kind of locks in
> > > ExpandConstraints, right?
>
> I am not sure I understand this part.  Because the topological sort will
> work on the soft edges we have created when we found the cycle, but
> since for relation extension/page locks we are completely ignoring
> hard/soft edges, they will never participate in the topological sort
> either.  Am I missing something?
>
No, I think you're right. We only add constraints if we've detected a
cycle in the graph. Hence, you don't need the check here.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-13 Thread Komяpa
On Fri, Mar 13, 2020 at 3:19 AM Laurenz Albe  wrote:
>
> On Fri, 2020-03-13 at 09:10 +1300, David Rowley wrote:
> > So you're suggesting we drive the insert-vacuums from existing
> > scale_factor and threshold?  What about the 1 billion row table
> > example above?
>
> I am still not 100% certain if that is really realistic.
> Transactions that insert only a single row are probably the
> exception in large insert-only tables.
>
> But I think that we probably always can find a case where any given
> parameter setting is not so great, so in order to get ahead
> let's decide on something that is not right out stupid.
> Changing the defaults later is always an option.
>
> So the three options are:
>
> 1. introduce no new parameters and trigger autovacuum if the number
>of inserts exceeds the regular vacuum threshold.
>
> 2. introduce the new parameters with high base threshold and zero scale 
> factor.

Both of these look good to me. 1 is approach in my initial patch
sketch, 2 is approach taken by Laurenz.
Values I think in when considering vacuum is "how many megabytes of
table aren't frozen/visible" (since that's what translates into
processing time knowing io limits of storage), and "how many pages
aren't yet vacuumed".

Threshold in Laurenz's patch was good enough for my taste - it's
basically "vacuum after every gigabyte", and that's exactly what we
implemented when working around this issue manually. There's enough
chance that latest gigabyte is in RAM and vacuum will be super fast on
it; reading a gigabyte of data is not a showstopper for most
contemporary physical and cloud environments I can think of. If
reading a gigabyte is a problem already then wraparound is a
guaranteed disaster.

About index only scan, this threshold seems good enough too. There's a
good chance last gig is already in RAM, and previous data was
processed with previous vacuum. Anyway - with this patch Index Only
Scan starts actually working :)

I'd vote for 2 with a note "rip it off all together later and redesign
scale factors and thresholds system to something more easily
graspable". Whoever needs to cancel the new behavior for some reason
will have a knob then, and patch is laid out already.

> 3. introduce the new parameters with low base threshold and high scale factor.

This looks bad to me. "the bigger the table, the longer we wait" does
not look good for me for something designed as a measure preventing
issues with big tables.

> I think all three are viable.
> If nobody else wants to weigh in, throw a coin.
>
> Yours,
> Laurenz Albe
>


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa




Re: Add an optional timeout clause to isolationtester step.

2020-03-13 Thread Julien Rouhaud
On Wed, Mar 11, 2020 at 05:52:54PM -0300, Alvaro Herrera wrote:
> On 2020-Mar-11, Tom Lane wrote:
>
> > We could re-use Julien's ideas about the isolation spec syntax by
> > making it be, roughly,
> >
> > step "" {  } [ blocked if "" "" ]
> >
> > and then those items would need to be passed as parameters of the prepared
> > query.
>
> I think for test readability's sake, it'd be better to put the BLOCKED
> IF clause ahead of the SQL, so you can write it in the same line and let
> the SQL flow to the next one:
>
> STEP "long_select" BLOCKED IF "lwlock" "ClogControlLock"
>   { select foo from pg_class where ... some more long clauses ... }
>
> otherwise I think a step would require more lines to write.
>
> > I'd like to see an attempt to rewrite some of the existing
> > timeout-dependent test cases to use this facility instead of
> > long timeouts.  If we could get rid of the timeouts in the
> > deadlock tests, that'd go a long way towards showing that this
> > idea is actually any good.
>
> +1.  Those long timeouts are annoying enough that infrastructure to make
> a run shorter in normal circumstances might be sufficient justification
> for this patch ...


I'm not familiar with those tests so I'm probably missing something, but it
looks like all isolation tests that set up a timeout do so to test server-side
features (deadlock detection, statement and lock timeout).  I'm not sure how
adding a client-side facility to detect locks earlier is going to help reduce
the server-side timeouts?

For the REINDEX CONCURRENTLY failure test, the problem that needs to be solved
isn't detecting that the command is blocked as it's already getting blocked on
a heavyweight lock, but being able to reliably cancel a specific query as early
as possible, which AFAICS isn't possible with current isolation tester:

- either we reliably cancel the query using a statement timeout, but we'll make
  it slow for everyone,
- or we send a blind pg_cancel_backend() hoping that we don't catch
  anything else (and also make it slower than required to make sure that it's
  not canceled too early)

So we would actually only need something like this to make it work:

step "" [ CANCEL IF BLOCKED ] { 

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-03-13 Thread Kuntal Ghosh
On Fri, Mar 13, 2020 at 8:29 AM Amit Kapila  wrote:
>
> On Thu, Mar 12, 2020 at 7:50 PM Kuntal Ghosh  
> wrote:
> > I think moving them inside a macro is a good idea. Also, I think we
> > should move all the Assert related code inside some debugging macro
> > similar to this:
> > #ifdef LOCK_DEBUG
> > 
> > #endif
> >
> If we move it under some macro, then those Asserts will be only
> enabled when that macro is defined.  I think we want these Asserts to
> be enabled always in assert enabled build, these will be like any
> other Asserts in the code.  What is the advantage of doing those under
> macro?
>
My concern is related to performance regression. We're using two
static variables in hot paths only for checking a few asserts. So, I'm
not sure whether we should enable the same by default, especially when
asserts are themselves disabled.
-ResetRelExtLockHeldCount()
+ResetRelExtPageLockHeldCount()
 {
  RelationExtensionLockHeldCount = 0;
+ PageLockHeldCount = 0;
+}
Also, we're calling this method from frequently used functions like
Commit/AbortTransaction. So, it's better if these two static variables
share the same cache line and are reinitialized with a single
instruction.

>
> >   /*
> > + * The relation extension or page lock conflict even between the group
> > + * members.
> > + */
> > + if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
> > + (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
> > + {
> > + PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
> > + proclock);
> > + return true;
> > + }
> > This check includes the heavyweight locks that conflict even under
> > same parallel group. It also has another property that they can never
> > participate in deadlock cycles. And, the number of locks under this
> > category is likely to increase in future with new parallel features.
> > Hence, it could be used in multiple places. Should we move the
> > condition inside a macro and just call it from here?
> >
>
> Right, this is what I have suggested upthread. Do you have any
> suggestions for naming such a macro or function?  I could think of
> something like LocksConflictAmongGroupMembers or
> LocksNotParticipateInDeadlock. The first one suits more for its usage
> in LockCheckConflicts and the second in the deadlock.c code. So none
> of those sound perfect to me.
>
Actually, I'm not able to come up with a good suggestion. I'm trying
to think of a generic name similar to strong or weak locks but with
the following properties:
a. Locks that don't participate in deadlock detection
b. Locks that conflicts in the same parallel group

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: add support for IN and @> in functional-dependency statistics use

2020-03-13 Thread Dean Rasheed
On Thu, 12 Mar 2020 at 17:30, Tomas Vondra  wrote:
>
> I'm sorry, but I don't see how we could do this for arbitrary clauses. I
> think we could do that for clauses that have equality semantics and
> reference column values as a whole. So I think it's possible to do this
> for IN clauses (which is what the first part of the patch does), but I
> don't think we can do it for the containment operator.
>
> I.e. we can do that for
>
>  WHERE a IN (...) AND b IN (...)
>

Hmm, the difficulty always comes back to the compatibility of the
clauses though. It's easy to come up with artificial examples for
which functional dependencies come up with bad estimates, even with
just = and IN (...) operators. For example, given a perfect
correlation like

  a | b
 ---
  1 | 1
  2 | 2
  3 | 3
  : | :

you only need to write a query like "WHERE a IN (1,3,5,7,9,...) AND b
IN (2,4,6,8,...)" to get a very bad estimate from functional
dependencies.
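
To make that concrete, a hypothetical reproduction (table, statistics and
data are invented, and it assumes the IN-clause support from the first
part of the patch):

CREATE TABLE t (a int, b int);
INSERT INTO t SELECT i, i FROM generate_series(1, 100000) AS i;
CREATE STATISTICS t_dep (dependencies) ON a, b FROM t;
ANALYZE t;
-- matches no rows at all, but the strong dependency a => b makes the
-- estimate roughly the selectivity of the "a" clause alone:
EXPLAIN SELECT * FROM t WHERE a IN (1,3,5,7,9) AND b IN (2,4,6,8);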

However, I don't think such artificial examples are that useful. I
think you have to think in terms of real data distributions together
with real queries expected to go with them. For example:

Using the OP's original example of a multi-tenant system, you might
well have a table with columns (product_type, tenant_id) and a
functional dependency product_type => tenant_id. In that case, it
could well be very useful in optimising queries like "WHERE
product_type IN (X,Y,Z) AND tenant_id = 123".

But this isn't necessarily limited to = and IN (...). For example,
consider a table with UK-based geographic data with columns (location
point, postcode text). Then there would be a strong functional
dependency location => postcode (and possibly also the other way
round, depending on how dense the points were). That dependency could
be used to estimate much more general queries like "WHERE location <@
some_area AND postcode ~ '^CB.*'", where there may be no useful stats
on location, but a histogram on postcode might give a reasonable
estimate.

This also extends to inequalities. For example a table with columns
(weight, category) might have a strong functional dependency weight =>
category. Then a query like "WHERE weight > 10 AND weight < 20 AND
category = 'large'" could get a decent estimate from a histogram on
the weight column, and then use the functional dependency to note that
that implies the category. Note that such an example would work with
my patch from the other thread, because it groups clauses by column,
and uses clauselist_selectivity_simple() on them. So in this case, the
clauses "weight > 10 AND weight < 20" would be estimated together, and
would be able to make use of the RangeQueryClause code.

Of course, it's equally easy to come up with counter-example queries
for any of those examples, where using the functional dependency would
produce a poor estimate. Ultimately, it's up to the user to decide
whether or not to build functional dependency statistics, and that
decision needs to be based not just on the data distribution, but also
on the types of queries expected.

Given the timing though, perhaps it is best to limit this to IN (..)
clauses for PG13, and we can consider other possibilities later.

Regards,
Dean




Re: shared-memory based stats collector

2020-03-13 Thread Kyotaro Horiguchi
Thank you very much!!

At Thu, 12 Mar 2020 20:13:24 -0700, Andres Freund  wrote in 
> Hi,
> 
> Thomas, could you look at the first two patches here, and my review
> questions?
> 
> 
> General comments about this series:
> - A lot of the review comments feel like I've written them before, a
>   year or more ago. I feel this patch ought to be in a much better
>   state. There's a lot of IMO fairly obvious stuff here, and things that
>   have been mentioned multiple times previously.

I apologize for all of the obvious stuff and the things that have been
mentioned before.  I'll address them.

> - There's a *lot* of typos in here. I realize being an ESL is hard, but
>   a lot of these can be found with the simplest spellchecker.  That's
>   one thing for a patch that just has been hacked up as a POC, but this
>   is a multi year thread?

I'll review all the changed parts again.  I used ispell, but I must have
failed to check many of the changes.

> - There's some odd formatting. Consider using pgindent more regularly.

I'll do so.

> More detailed comments below.

Thank you very much for the intensive review; I'm going to revise the
patch according to your comments.

> I'm considering rewriting the parts of the patchset that I don't like -
> but it'll look quite different afterwards.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BEFORE ROW triggers for partitioned tables

2020-03-13 Thread Peter Eisentraut

On 2020-03-12 05:17, Ashutosh Bapat wrote:

On Wed, Mar 11, 2020 at 8:53 PM Ashutosh Bapat
 wrote:

Will it be easier to subject the new tuple to the partition level
constraints themselves and report if those are violated. See
RelationGetPartitionQual() for getting partition constraints. This
function includes partition constraints from all the levels so in your
function you don't have to walk up the partition tree. It includes
constraints from the level above the table that was named in the
command, but that might be fine. We will catch the error earlier and
may be provide a better error message.


I realized that this will implement the third option in your original
proposal, not the second one. I suppose that's fine too?


It might be that that is actually easier to do.  Instead of trying to 
figure out which columns have changed, in the face of different column 
ordering and general expressions, just check after a trigger whether the 
column still fits into the partition.
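
A sketch of the case such a check has to catch (names invented; this
assumes the patch allowing BEFORE ROW triggers on partitioned tables):

create table p (a int) partition by list (a);
create table p1 partition of p for values in (1);
create function shift_a() returns trigger language plpgsql as
  $$ begin new.a := new.a + 1; return new; end $$;
create trigger p_br before insert on p for each row
  execute function shift_a();
-- the row is routed for a = 1, but the trigger then changes a to 2, so
-- the tuple no longer satisfies p1's partition constraint and must be
-- rejected
insert into p values (1);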


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Reducing WaitEventSet syscall churn

2020-03-13 Thread Kyotaro Horiguchi
Hello.

At Tue, 10 Mar 2020 08:19:24 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
me> I'll continue reviewing in a later mail.
me> 
me> > 0005: "libpq: Add PQsocketChangeCount to advertise socket changes."
me>  

At Thu, 27 Feb 2020 12:17:45 +1300, Thomas Munro  wrote
in 
> On Sat, Feb 8, 2020 at 10:15 AM Thomas Munro  wrote:
> 0005: "libpq: Add PQsocketChangeCount to advertise socket changes."
> 
> To support a long lived WES, libpq needs a way tell us when the socket
> changes underneath our feet.  This is the simplest thing I could think
> of; better ideas welcome.

I think at least Windows is the reason for not just detecting a
change in the value of the fd.  Instead of counting disconnections, we
could use the libpq event system (libpq-events).

PQregisterEventProc returns false for all of bad parameter,
already registered, out of memory and proc rejection. I don't think that
is a usable interface, so the attached 0005 patch fixes it. (I later
found it is not strictly needed after making 0007, but I included it
as a proposal separate from this patch set. It does not include the
corresponding doc fix.)

> 0006: "Reuse a WaitEventSet in libpqwalreceiver.c."
> 
> Rather than having all users of libpqwalreceiver.c deal with the
> complicated details of wait set management, have libpqwalreceiver
> expose a waiting interface that understands socket changes.

Looks reasonable. The attached 0006 and 0007 are a possible
replacement if we use libpq-event.

> Unfortunately, I couldn't figure out how to use CommonWaitSet for this
> (ie adding and removing sockets to that as required), due to
> complications with the bookkeeping required to provide the fd_closed
> flag to RemoveWaitEvent().  So it creates its own internal long lived
> WaitEventSet.

Agreed, since they are used in different ways. But with the attached
patch, a closed connection is marked with wes_socket_position = -1.

> 0007: "Use a WaitEventSet for postgres_fdw."

Continues..

The attached are:
0001-0004 Not changed
0005 Fix interface of PQregisterEventProc
0006 Add new libpq event for this use.
0007 Another version of "0006 Reuse a WaitEventSet in
 libpqwalreceiver.c" based on libpq event.
0008-0011 Not changed (old 0007-0010, blindly appended)

passed the regression (includeing TAP recovery test) up to here.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 1a80abd292ff0d6a47274eb188a5576b2ef3cf6e Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 25 Feb 2020 02:53:35 +1300
Subject: [PATCH 01/11] Don't use EV_CLEAR for kqueue events.

For the semantics to match the epoll implementation, we need a
socket to continue to appear readable/writable if you wait
multiple times without doing I/O in between (in Linux
terminology, level-triggered rather than edge-triggered).
Similar to commit 3b790d256f8 for Windows.

Author: Thomas Munro
Discussion: https://postgr.es/m/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com
---
 src/backend/storage/ipc/latch.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 046ca5c6c7..3b6acfb251 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -991,7 +991,7 @@ WaitEventAdjustKqueueAdd(struct kevent *k_ev, int filter, int action,
 {
 	k_ev->ident = event->fd;
 	k_ev->filter = filter;
-	k_ev->flags = action | EV_CLEAR;
+	k_ev->flags = action;
 	k_ev->fflags = 0;
 	k_ev->data = 0;
 	AccessWaitEvent(k_ev) = event;
@@ -1003,7 +1003,7 @@ WaitEventAdjustKqueueAddPostmaster(struct kevent *k_ev, WaitEvent *event)
 	/* For now postmaster death can only be added, not removed. */
 	k_ev->ident = PostmasterPid;
 	k_ev->filter = EVFILT_PROC;
-	k_ev->flags = EV_ADD | EV_CLEAR;
+	k_ev->flags = EV_ADD;
 	k_ev->fflags = NOTE_EXIT;
 	k_ev->data = 0;
 	AccessWaitEvent(k_ev) = event;
-- 
2.18.2

>From f93551a42b9e9ad0da05197d48fac5790474249d Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 24 Feb 2020 15:39:24 +1300
Subject: [PATCH 02/11] Use a long lived WaitEventSet for WaitLatch().

Create CommonWaitSet at backend startup time, and use it to
implement WaitLatch().  This avoids a bunch of epoll/kqueue
system calls, and makes sure we don't run into EMFILE later
due to lack of file descriptors.

Reorder SubPostmasterMain() slightly so that we restore the
postmaster pipe and Windows signal before we reach
InitPostmasterChild(), to make this work in EXEC_BACKEND
builds.

Author: Thomas Munro
Discussion: https://postgr.es/m/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com
---
 src/backend/postmaster/postmaster.c | 24 +++---
 src/backend/storage/ipc/latch.c | 49 +++--
 src/backend/utils/init/miscinit.c   |  2 ++
 src/include/storage/latch.h |  1 +
 4 files changed, 61 insertions(+), 15 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 46be78aadb..c472971ce0 100644
--- 

Re: truncating timestamps on arbitrary intervals

2020-03-13 Thread John Naylor
On Wed, Feb 26, 2020 at 11:36 PM Tom Lane  wrote:
>
> * In general, binning involves both an origin and a stride.  When
> working with plain numbers it's almost always OK to set the origin
> to zero, but it's less clear to me whether that's all right for
> timestamps.  Do we need another optional argument?  Even if we
> don't, "zero" for tm_year is 1900, which is going to give results
> that surprise somebody.

I tried the simplest way in the attached v5. Examples (third param is origin):

-- same result as no origin:
select date_trunc_interval('5 min'::interval, TIMESTAMP '2020-02-01
01:01:01', TIMESTAMP '2020-02-01');
 date_trunc_interval
-
 2020-02-01 01:00:00
(1 row)

-- shift bins by 2.5 min:
select date_trunc_interval('5 min'::interval, TIMESTAMP '2020-02-1
01:01:01', TIMESTAMP '2020-02-01 00:02:30');
 date_trunc_interval
-
 2020-02-01 00:57:30
(1 row)

-- align weeks to start on Sunday
select date_trunc_interval('7 days'::interval, TIMESTAMP '2020-02-11
01:01:01.0', TIMESTAMP '1900-01-02');
 date_trunc_interval
-
 2020-02-09 00:00:00
(1 row)

I've put off adding documentation on the origin piece pending comments
about the approach.

I haven't thought seriously about timezone yet, but hopefully it's
just work and nothing to think too hard about.


--
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


v5-datetrunc_interval.patch
Description: Binary data


Re: [PATCH] Skip llvm bytecode generation if LLVM is missing

2020-03-13 Thread Andres Freund
On 2020-03-13 14:08:12 +0800, Craig Ringer wrote:
> The alternative would be to detect a missing clang and emit a much more
> informative error than the current one that explicitly suggests retrying
> with
> 
> make with_llvm=no
> 
> or setting with_llvm=no in the environment.

That, that, that's what I suggested upthread?




RE: Planning counters in pg_stat_statements (using pgss_store)

2020-03-13 Thread imai.yoshik...@fujitsu.com
On Thu, Mar 12, 2020 at 6:37 PM, Julien Rouhaud wrote:
> On Thu, Mar 12, 2020 at 1:11 PM Marco Slot  wrote:
> > On Thu, Mar 12, 2020 at 11:31 AM Julien Rouhaud 
> wrote:
> > > There's at least the current version of IVM patchset that lacks the
> > > querytext.  Looking at various extensions, I see that pg_background
> > > and pglogical call pg_plan_query internally but shouldn't have any
> > > issue providing the query text.  But there's also citus extension,
> > > which don't keep around the query string at least when distributing
> > > plans, which makes sense since it's of no use and they're heavily
> > > modifying the original Query.  I think that citus folks opinion on
> > > the subject would be useful, so I'm Cc-ing Marco.
> >
> > Most of the time we keep our Query * data structures in a form that
> > can be deparsed back into a query string by a modified copy of
> > ruleutils.c, so we could generate a correct query string if absolutely
> > necessary, though there are performance-sensitive cases where we'd
> > rather not have the deparsing overhead.
> 
> Yes, deparsing is probably too expensive for that use case.
> 
> > In case of INSERT..SELECT into a distributed table, we might call
> > pg_plan_query on the SELECT part (Query *) and send the output into a
> > DestReceiver that sends tuples into shards of the distributed table
> > via COPY. The fact that SELECT does not show up in pg_stat_statements
> > separately is generally fine because it's an implementation detail,
> > and it would probably be a little confusing because the user never ran
> > the SELECT query. Moreover, the call to pg_plan_query would already be
> > reflected in the planning or execution time of the top-level query, so
> > it would be double counted if it had its own entry.
> 
> Well, surprising statements can already appears in pg_stat_statements when
> you use some psql features, or if you have triggers as those will run 
> additional
> queries under the hood.
> 
> The difference here is that since citus is a CustomNode, underlying calls to
> planner will be accounted for that node, and that's indeed annoying.  I can 
> see
> that citus is doing some calls to spi_exec or
> Executor* (in ExecuteLocalTaskPlan), which could also trigger
> pg_stat_statements, but I don't know if a queryid is present there.
> 
> > Another case is when some of the shards turn out to be local to the
> > server that handles the distributed query. In that case we plan the
> > queries on those shards via pg_plan_query instead of deparsing and
> > sending the query string to a remote server. It would be less
> > confusing for these queries to show in pg_stat_statements, because the
> > queries on the shards on remote servers will show up as well. However,
> > this is a performance-sensitive code path where we'd rather avoid
> > deparsing.
> 
> Agreed.
> 
> > In general, I'd prefer if there was no requirement to pass a correct
> > query string. I'm ok with passing "SELECT 'citus_internal'" or just ""
> > if that does not lead to issues. Passing NULL to signal that the
> > planner call should not be tracked separately does seem a bit cleaner.
> 
> That's very interesting feedback, thanks!  I'm not a fan of giving a way for
> queries to say that they want to be ignored by pg_stat_statements, but double
> counting the planning counters seem even worse, so I'm +0.5 to accept NULL
> query string in the planner, incidentally making pgss ignore those.

It is preferable to count statistics for as many queries as possible, but if
doing so adds overhead even when pg_stat_statements is not in use, we should
not force callers to provide an appropriate query_text.
As for setting a fixed string in query_text, I don't think it makes much
sense, because users can't take any action based on those queries' stats.
Moreover, if we set a fixed string in query_text only to avoid
pg_stat_statements errors, the code would be inexplicable to other developers
who don't know about that requirement.
So I agree with accepting a NULL query string in the planner.

I don't know whether it is useful here, but there is also existing code that
avoids an error when sourceText is NULL:

executor_errposition(EState *estate, int location)
{
    ...
    /* Can't do anything if source text is not available */
    if (estate == NULL || estate->es_sourceText == NULL)
        return 0;
    ...
}
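For what it's worth, the shape I have in mind on the planner side is roughly the sketch below. The query_string parameter (and standard_planner() accepting it) is what this patch proposes, not something that exists today, and the elog() merely stands in for the pgss_store() bookkeeping:

#include "postgres.h"
#include "nodes/params.h"
#include "optimizer/planner.h"
#include "portability/instr_time.h"

static PlannedStmt *
pgss_planner_sketch(Query *parse, const char *query_string,
                    int cursorOptions, ParamListInfo boundParams)
{
    PlannedStmt *result;
    instr_time  start;
    instr_time  duration;

    /* Callers such as IVM or citus could pass NULL to opt out of tracking. */
    if (query_string == NULL)
        return standard_planner(parse, query_string, cursorOptions,
                                boundParams);

    INSTR_TIME_SET_CURRENT(start);
    result = standard_planner(parse, query_string, cursorOptions,
                              boundParams);
    INSTR_TIME_SET_CURRENT(duration);
    INSTR_TIME_SUBTRACT(duration, start);

    /* The real hook would feed this into pgss_store(); just log it here. */
    elog(DEBUG1, "planning took %0.3f ms",
         INSTR_TIME_GET_MILLISEC(duration));

    return result;
}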

--
Yoshikazu Imai


Re: logical replication empty transactions

2020-03-13 Thread Craig Ringer
On Tue, 10 Mar 2020 at 02:30, Andres Freund  wrote:

> Hi,
>
> On 2020-03-06 13:53:02 +0800, Craig Ringer wrote:
> > On Mon, 2 Mar 2020 at 19:26, Amit Kapila 
> wrote:
> >
> > > One thing that is not clear to me is how will we advance restart_lsn
> > > if we don't send any empty xact in a system where there are many such
> > > xacts?
> >
> > Same way we already do it for writes that are not replicated over
> > logical replication, like vacuum work etc. The upstream sends feedback
> > with reply-requested. The downstream replies. The upstream advances
> > confirmed_flush_lsn, and that lazily updates restart_lsn.
>
> It'll still delay it a bit.
>

Right, but we don't generally care because there's no sync rep txn waiting
for confirmation. If we lose progress due to a crash it doesn't matter. It
does delay removal of old WAL a little, but it hardly matters.


> Somewhat independent from the issue at hand: It'd be really good if we
> could evolve the syncrep framework to support per-database waiting... It
> shouldn't be that hard, and the current situation sucks quite a bit (and
> yes, I'm to blame).
>

Hardly, you just didn't get the chance to fix that on top of the umpteen
other things you had to change to make all the logical stuff work. You
didn't break it, just didn't implement every single possible enhancement
all at once. Shocking, I tell you.


> I'm not quite sure what you mean by "poke the walsender"? Kinda sounds
> like sending a signal, but decoding happens inside after the walsender,
> so there's no need for that. Do you just mean somehow requesting that
> walsender sends a feedback message?
>

Right. I had in mind something like sending a ProcSignal via our funky
multiplexed signal mechanism to ask the walsender to immediately generate a
keepalive message with a reply-requested flag, then set the walsender's
latch so we wake it promptly.


> To address the volume we could:
>
> 1a) Introduce a pgoutput message type to indicate that the LSN has
>   advanced, without needing separate BEGIN/COMMIT. Right now BEGIN is
>   21 bytes, COMMIT is 26. But we really don't need that much here. A
>   single message should do the trick.
>

It would. Is it worth caring though? Especially since it seems rather
unlikely that the actual network data volume of begin/commit msgs will be
much of a concern. It's not like we're PITRing logical streams, and if we
did, we could just filter out empty commits on the receiver side.

That message pretty much already exists in the form of a walsender
keepalive anyway so we might as well re-use that and not upset the protocol.


> 1b) Add a LogicalOutputPluginWriterUpdateProgress parameter (and
>   possibly rename) that indicates that we are intentionally "ignoring"
>   WAL. For walsender that callback then could check if it could just
>   forward the position of the client (if it was entirely caught up
>   before), or if it should send a feedback request (if syncrep is
>   enabled, or distance is big).
>

I can see something like that being very useful, because at present only
the output plugin knows if a txn is "empty" as far as that particular slot
and output plugin is concerned. The reorder buffering mechanism cannot do
relation-level filtering before it sends the changes to the output plugin
during ReorderBufferCommit, since it only knows about relfilenodes not
relation oids. And the output plugin might be doing finer grained filtering
using row-filter expressions or who knows what else.

But as described above that will only help for txns done in DBs other than
the one the logical slot is for or txns known to have an empty
ReorderBuffer when the commit is seen.

If there's a txn in the slot's db with a non-empty reorderbuffer, the
output plugin won't know if the txn is empty or not until it finishes
processing all callbacks and sees the commit for the txn. So it will
generally have emitted the Begin message on the wire by the time it knows
it has nothing useful to say. And Pg won't know that this txn is empty as
far as this output plugin with this particular slot, set of output plugin
params, and current user-catalog state is concerned, so it won't have any
way to call the output plugin's "update progress" callback instead of the
usual begin/change/commit callbacks.

But I think we can already skip empty txns (unless sync-rep is enabled) with
no core changes, and send empty txns as walsender keepalives instead, by
altering only output plugins, like this (rough sketch after the list):

* Stash the BEGIN data in the plugin's
LogicalDecodingContext.output_plugin_private when the plugin's begin callback
is called; don't write anything to the output stream
* Write out the BEGIN message lazily when any other callback generates a
message that does need to be written out
* If no BEGIN has been written by the time the COMMIT callback is called,
discard the COMMIT too. Check whether sync rep is enabled; if it is,
call LogicalDecodingContext.update_progress from within the output plugin
commit handler, otherwise just ignore the commit totally. Probably by

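Rough sketch of the lazy-BEGIN idea from the list above (the callback shapes follow the output plugin API, but the message payloads and the sync-rep/update_progress handling are elided, and SketchData and the sketch_* names are made up for illustration):

#include "postgres.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

/* Per-decoding-session state kept in ctx->output_plugin_private. */
typedef struct SketchData
{
    bool        begin_pending;  /* BEGIN seen but not yet written out */
} SketchData;

static void
sketch_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
    SketchData *data = (SketchData *) ctx->output_plugin_private;

    /* Don't write anything yet; just remember that a BEGIN is owed. */
    data->begin_pending = true;
}

static void
sketch_write_pending_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
    SketchData *data = (SketchData *) ctx->output_plugin_private;

    if (!data->begin_pending)
        return;
    OutputPluginPrepareWrite(ctx, false);
    /* ... append the real BEGIN message for txn here ... */
    OutputPluginWrite(ctx, false);
    data->begin_pending = false;
}

static void
sketch_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
              Relation relation, ReorderBufferChange *change)
{
    /* First change that produces output: emit the deferred BEGIN first. */
    sketch_write_pending_begin(ctx, txn);
    /* ... write the change message as usual ... */
}

static void
sketch_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                  XLogRecPtr commit_lsn)
{
    SketchData *data = (SketchData *) ctx->output_plugin_private;

    if (data->begin_pending)
    {
        /* Nothing was written: drop the whole txn (or nudge progress). */
        data->begin_pending = false;
        return;
    }
    /* ... write the COMMIT message as usual ... */
}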
RE: Planning counters in pg_stat_statements (using pgss_store)

2020-03-13 Thread imai.yoshik...@fujitsu.com
On Thu, Mar 12, 2020 at 10:31 AM, Julien Rouhaud wrote:
> > * bufusage still only counts the buffer usage during executor.
> >
> >   Now we have the ability to count the buffer usage during planner but we
> keep
> >   the bufusage count the buffer usage during executor for now.
> 
> The bufusage should reflect the sum of planning and execution usage if
> track_planning is enabled.  Did I miss something there?

Ah, you're right. I somehow misunderstood it. Sorry for the noise.
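Just to confirm my (now corrected) understanding, the planning-side contribution would be captured around the planner call roughly like the sketch below. plan_query_somehow() is only a stand-in for the standard_planner() call; BufferUsageAccumDiff() and pgBufferUsage are the existing instrumentation facilities:

#include "postgres.h"
#include "executor/instrument.h"

/* Measure the buffer usage incurred while the passed-in planning step runs. */
static BufferUsage
planning_buffer_usage(void (*plan_query_somehow) (void))
{
    BufferUsage start = pgBufferUsage;  /* snapshot before planning */
    BufferUsage used;

    plan_query_somehow();

    /* used = pgBufferUsage - start, so it can be summed with execution. */
    memset(&used, 0, sizeof(used));
    BufferUsageAccumDiff(&used, &pgBufferUsage, &start);

    return used;
}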

> > * We add Assert in pg_plan_query so that query_text will not be NULL
> > when executing planner.
> >
> >   There's no case query_text will be NULL with current sources. It is not
> >   ensured there will be any case query_text will be possibly NULL in the
> >   future though. Some considerations are needed by other people about
> this.
> 
> There's at least the current version of IVM patchset that lacks the querytext.

I saw the IVM patchset, but I thought it would be difficult to require it to
provide an appropriate querytext.


> Looking at various extensions, I see that pg_background and pglogical call
> pg_plan_query internally but shouldn't have any issue providing the query 
> text.
> But there's also citus extension, which don't keep around the query string at
> least when distributing plans, which makes sense since it's of no use and
> they're heavily modifying the original Query.  I think that citus folks 
> opinion on
> the subject would be useful, so I'm Cc-ing Marco.

Thank you for looking at that code. I will comment on this in another mail.

--
Yoshikazu Imai


Re: Refactor compile-time assertion checks for C/C++

2020-03-13 Thread Michael Paquier
On Thu, Mar 12, 2020 at 09:43:54AM -0400, Tom Lane wrote:
> I don't feel a need to expend a whole lot of sweat there.  The existing
> text is fine, it just bugged me that the code deals with three cases
> while the comment block only acknowledged two.  So I'd just go with
> what you have in v3.

Thanks, Tom.  I have committed v3 then.
--
Michael




Re: [PATCH] Skip llvm bytecode generation if LLVM is missing

2020-03-13 Thread Craig Ringer
On Fri, 13 Mar 2020 at 04:35, Andres Freund  wrote:

>
> IMO only if the packager screwed up. The dependencies of the package
> that includes pgxs, headers should have the dependencies to llvm. Which
> e.g. debian's does:
>

Yes, I agree that the underlying issue is mainly with packaging.

This proposal came out of chasing down some packaging problems relating to
the EL-7 -devel packages for Pg 11 and 12, per the mails linked in the
initial post. They don't declare a runtime dependency on the LLVM toolset or
clang, so they're basically broken given the way we assume the presence of
those tools.

But

(a) do we really want to force everyone to pull in clang and the llvm
toolset when they install the -devel pkg, even if they don't install or
need the postgresqlNN-llvmjit  package?
(b) EL-7 doesn't have a usable llvm/clang version even in EPEL, you have to
add a separate SCL LLVM toolset repo. So adding a dependency on
llvm-toolset into EL-7's postgresql11-devel and postgresql12-devel is most
undesirable, especially in a point release, as it'll make lots of stuff
explode messily.

I learned (b) the hard way. Don't do that.

If the consensus is that this is a packaging issue, that (a) is fine, and
that we should just quit whining and install a suitable clang/llvm, I'll cope
with that. I'll ask yum-packagers to patch Makefile.global for EL-7 with a
workaround like the one proposed here, and, for newer RH where a suitable
LLVM is available, to just declare it as a dependency of the -devel pkg going
forward, then make lots of noise about the change.

But the problem is that even for newer RH "enterprise" distros LLVM/clang
live in EPEL, and IIRC we don't presently require any dependencies from
EPEL to install the base postgresqlNN-* packages (except llvmjit). So we'd
be making EPEL a new repo dependency for the -devel pkg and that's not
something I'm too fond of doing in a minor release.

The alternative would be to detect a missing clang and emit a much more
informative error than the current one that explicitly suggests retrying
with

make with_llvm=no

or setting with_llvm=no in the environment.

The whole thing is a mess caused by this "enterprise-y" repository split
between core and "extras", and I'm rather frustrated by it, but the current
situation isn't much fun for users.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise