Re: Planner chose a much slower plan in hashjoin, using a large table as the inner table.

2019-11-18 Thread Thomas Munro
On Mon, Nov 18, 2019 at 7:48 PM Jinbao Chen  wrote:
> In the test case above, the small table has 3000 tuples and 100 distinct
> values in column 'a'. If we use the small table as the inner table, the chain
> length per bucket is 30, and we need to search the whole chain when probing
> the hash table. So the cost of probing is bigger than the cost of building
> the hash table, and the planner uses the big table as the inner instead.
>
> But in fact this is not true. We initialized 620,000 buckets in the hash
> table, but only 100 buckets have chains of length 30. The other buckets are
> empty; for those, only hash values need to be compared, and that cost is very
> small. We have 100,000 distinct keys and 100,000,000 tuples in the outer
> table. Only (100/100,000)*tuple_num tuples will search a whole chain; the
> other tuples (number = (99,900/100,000)*tuple_num) in the outer table just
> compare the hash value. So the actual cost is much smaller than the planner
> calculated. This is the reason why using the small table as inner is faster.

So basically we think that if t_big is on the outer side, we'll do
100,000,000 probes and each one is going to scan a t_small bucket with
chain length 30, so that looks really expensive.  Actually only a
small percentage of its probes find tuples with the right hash value,
but final_cost_hash_join() doesn't know that.  So we hash t_big
instead, which we estimated pretty well and it finishes up with
buckets of length 1,000 (which is actually fine in this case, they're
not unwanted hash collisions, they're duplicate keys that we need to
emit) and we probe them 3,000 times (which is also fine in this case),
but we had to do a bunch of memory allocation and/or batch file IO and
that turns out to be slower.
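
For readers without the earlier messages, the scenario is roughly the
following (a sketch only: the table names t_small/t_big come from the thread,
but the column name shared by both tables and the exact data distribution are
assumptions):

  -- t_small: 3,000 rows, 100 distinct values => bucket chains of ~30
  CREATE TABLE t_small AS
      SELECT i % 100 AS a FROM generate_series(1, 3000) AS i;

  -- t_big: 100,000,000 rows, 100,000 distinct join keys
  CREATE TABLE t_big AS
      SELECT i % 100000 AS a FROM generate_series(1, 100000000) AS i;

  ANALYZE t_small;
  ANALYZE t_big;

  EXPLAIN ANALYZE SELECT * FROM t_big JOIN t_small USING (a);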

I am not at all sure about this but I wonder if it would be better to
use something like:

  run_cost += outer_path_rows * some_small_probe_cost;
  run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();

If we can estimate how many tuples will actually match accurately,
that should also be the number of times we have to run the quals,
since we don't usually expect hash collisions (bucket collisions, yes,
but hash collisions where the key doesn't turn out to be equal, no*).

* ... but also yes as you approach various limits, so you could also
factor in bucket chain length that is due to being prevented from
expanding the number of buckets by arbitrary constraints, and perhaps
also birthday_problem(hash size, key space) to factor in unwanted hash
collisions that start to matter once you get to billions of keys and
expect collisions with short hashes.
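
Concretely, inside final_cost_hash_join() that might look something like the
following (a sketch only, not a worked-out patch; cpu_operator_cost stands in
here for the hypothetical some_small_probe_cost, and approximate_tuple_count()
is the existing helper in costsize.c that the mergejoin costing already uses):

  /* charge a small per-probe cost for every outer tuple ... */
  run_cost += outer_path_rows * cpu_operator_cost;
  /* ... but charge the hash quals only for tuple pairs expected to match */
  run_cost += hash_qual_cost.per_tuple *
      approximate_tuple_count(root, &path->jpath, hashclauses);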




Re: Hypothetical indexes using BRIN broken since pg10

2019-11-18 Thread Julien Rouhaud
On Tue, Nov 19, 2019 at 6:40 AM Michael Paquier  wrote:
>
> On Fri, Nov 15, 2019 at 12:07:15PM +0900, Michael Paquier wrote:
> > So, Heikki, are you planning to work more on that and commit a change
> > close to what has been proposed upthread in [1]?  It sounds to me that
> > this has the advantage to be non-intrusive and a similar solution has
> > been used for GIN indexes.  Moving the redesign out of the discussion,
> > is there actually a downside with back-patching something like
> > Heikki's version?
>
> So...  I have been looking at this patch, and indeed it would be nice
> to pass down a better value than BRIN_DEFAULT_PAGES_PER_RANGE to be
> able to compute the stats in brincostestimate().  Still, it looks also
> to me that this allows the code to be able to compute some stats
> directly.  As there is no consensus on a backpatch yet, my take would
> be for now to apply just the attached on HEAD, and consider a
> back-patch later on if there are more arguments in favor of it.  If
> you actually test hypopg currently, the code fails when attempting to
> open the relation to get the stats now.
>
> Attached are the patch for HEAD, as well as a patch to apply to hypopg
> on branch REL1_STABLE to make the module compatible with PG13~.
>
> Any objections?

None from me.  I'm obviously biased, but I hope that it can get
backpatched.  BRIN is probably seldom used, but we shouldn't make it
harder to use, even if that's only for hypothetical usage, and
even if it'll still be quite inexact.

> NB @Julien: perhaps you'd want to apply the second patch to the
> upstream repo of hypopg, and add more tests for other index AMs like
> GIN and BRIN.

Thanks!  I didn't notice that the compatibility macro for heap_open
was removed in f25968c49; I'll commit this patch on hypopg with some
compatibility macros to make sure that it compiles against all
versions.  GIN (and some others) are unfortunately explicitly
disallowed with hypopg.  Actually, most of the code already handles it,
but I have no clear idea how to estimate the number of tuples and
the size of such indexes.  But yes, I should definitely add more tests
for the supported AMs, although I can't add any for BRIN until a fix is
committed :(
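
(For anyone curious, the compatibility shim usually amounts to something like
the following sketch; this is not the actual hypopg code:)

  #if PG_VERSION_NUM >= 120000
  #include "access/table.h"
  #else
  /* pre-12 servers only have the heap_* spellings */
  #define table_open(rel, lockmode)   heap_open(rel, lockmode)
  #define table_close(rel, lockmode)  heap_close(rel, lockmode)
  #endif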




RE: Recovery performance of DROP DATABASE with many tablespaces

2019-11-18 Thread k.jami...@fujitsu.com
On Wed, Nov 13, 2019 5:34PM (GMT+9), Fujii Masao wrote:
> On Wed, Nov 13, 2019 at 3:57 PM k.jami...@fujitsu.com 
> wrote:
> >
> > On Wed, Oct. 2, 2019 5:40 PM, Fujii Masao wrote:
> > > On Tue, Jul 10, 2018 at 3:04 PM Michael Paquier 
> wrote:
> > > >
> > > > On Thu, Jul 05, 2018 at 01:42:20AM +0900, Fujii Masao wrote:
> > > > > TBH, I have no numbers measured by the test.
> > > > > One question about your test is; how did you measure the
> > > > > *recovery
> > > > > time* of DROP DATABASE? Since it's *recovery* performance,
> > > > > basically it's not easy to measure that.
> > > >
> > > > It would be simple to measure the time it takes to replay this
> > > > single DROP DATABASE record by putting two gettimeofday() calls or
> > > > such things and then take the time difference.  There are many
> > > > methods that you could use here, and I suppose that with a shared
> > > > buffer setting of a couple of GBs of shared buffers you would see
> > > > a measurable difference with a dozen of tablespaces or so.  You
> > > > could also take a base backup after creating all the tablespaces,
> > > > connect the standby and then drop the database on the primary to
> > > > see the actual time it takes.  Your patch looks logically correct
> > > > to me because DropDatabaseBuffers is a
> > > > *bottleneck* with large shared_buffers, and it would be nice to
> > > > see numbers.
> > >
> > > Thanks for the comment!
> > >
> > > I measured how long it takes to replay DROP DATABASE with 1000
> > > tablespaces, in master and patched version. shared_buffers was set to
> 16GB.
> > >
> > > [master]
> > > It took 8 seconds to replay DROP DATABASE with 1000 tablespaces, as 
> > > follows.
> > > In this case, 16GB shared_buffers was fully scanned 1000 times.
> > >
> > > 2019-10-02 16:50:14 JST LOG:  redo starts at 0/228
> > > 2019-10-02 16:50:22 JST LOG:  redo done at 0/300A298
> > >
> > > [patched]
> > > It took less than 1 second to replay DROP DATABASE with 1000
> > > tablespaces, as follows. In this case, 16GB shared_buffers was scanned
> only one time.
> > >
> > > 2019-10-02 16:47:03 JST LOG:  redo starts at 0/228
> > > 2019-10-02 16:47:03 JST LOG:  redo done at 0/3001588
> > >
> >
> > Hi Fujii-san,
> >
> > It's been a while, so I checked the patch once again.
> > It's fairly straightforward and I saw no problems or bugs in the code.
> 
> Thanks for the review!
> 
> > > [patched]
> > > It took less than 1 second to replay DROP DATABASE with 1000
> > > tablespaces,
> > The results are good.
> > I want to replicate the performance to confirm the results as well.
> > Could you share how you measured the recovery replay?
> 
> I forgot the actual steps that I used for the measurement.
> But I think they are something like
> 
> 1. create database "hoge"
> 2. create 1,000 tablespaces
> 3. create 1,000 tables on the database "hoge".
> each table should be placed in different tablespace.
> 4. take a base backup
> 5. drop database "hoge"
> 6. shutdown the server with immediate mode
> 7. start an archive recovery from the backup taken at #4
> 8. measure how long it takes to apply DROP DATABASE record by
> checking the timestamp at REDO start and REDO end.
> 
> I think that I performed the above steps on the master and the patched 
> version.
> 
> > Did you actually execute a failover?
> 
> No.

I'm sorry for the late reply, and thanks for the guide above.
I replicated the same recovery test above on a standalone server
and have confirmed with the logs that the patch made the recovery faster.

[MASTER/UNPATCHED] ~10 seconds
2019-11-19 15:25:23.891 JST [23042] LOG:  redo starts at 0/180006A0
...
2019-11-19 15:25:34.492 JST [23042] LOG:  redo done at 0/1800A478

[PATCHED]  ~less than 1 sec
2019-11-19 15:31:59.415 JST [17625] LOG:  redo starts at 0/40005B8
...
2019-11-19 15:32:00.159 JST [17625] CONTEXT:  WAL redo at 0/4000668 for 
Database/DROP: dir 1663/16384 16385/16384...//further details omitted//...
...
2019-11-19 15:32:00.159 JST [17625] LOG:  redo done at 0/4001638

I believe there are no problems, so I am marking this patch now
as "Ready for Committer".

Regards,
Kirk Jamison


Re: Hypothetical indexes using BRIN broken since pg10

2019-11-18 Thread Michael Paquier
On Fri, Nov 15, 2019 at 12:07:15PM +0900, Michael Paquier wrote:
> So, Heikki, are you planning to work more on that and commit a change
> close to what has been proposed upthread in [1]?  It sounds to me that
> this has the advantage to be non-intrusive and a similar solution has
> been used for GIN indexes.  Moving the redesign out of the discussion,
> is there actually a downside with back-patching something like
> Heikki's version?

So...  I have been looking at this patch, and indeed it would be nice
to pass down a better value than BRIN_DEFAULT_PAGES_PER_RANGE to be
able to compute the stats in brincostestimate().  Still, it looks also
to me that this allows the code to be able to compute some stats
directly.  As there is no consensus on a backpatch yet, my take would
be for now to apply just the attached on HEAD, and consider a
back-patch later on if there are more arguments in favor of it.  If
you actually test hypopg currently, the code fails when attempting to
open the relation to get the stats now.

Attached are the patch for HEAD, as well as a patch to apply to hypopg
on branch REL1_STABLE to make the module compatible with PG13~.

Any objections?

NB @Julien: perhaps you'd want to apply the second patch to the
upstream repo of hypopg, and add more tests for other index AMs like
GIN and BRIN.
--
Michael
diff --git a/hypopg.c b/hypopg.c
index 4ab76b7..fe8be62 100644
--- a/hypopg.c
+++ b/hypopg.c
@@ -140,22 +140,22 @@ hypo_getNewOid(Oid relid)
 	char		relpersistence;
 
 	/* Open the relation on which we want a new OID */
-	relation = heap_open(relid, AccessShareLock);
+	relation = table_open(relid, AccessShareLock);
 
 	reltablespace = relation->rd_rel->reltablespace;
 	relpersistence = relation->rd_rel->relpersistence;
 
 	/* Close the relation and release the lock now */
-	heap_close(relation, AccessShareLock);
+	table_close(relation, AccessShareLock);
 
 	/* Open pg_class to aks a new OID */
-	pg_class = heap_open(RelationRelationId, RowExclusiveLock);
+	pg_class = table_open(RelationRelationId, RowExclusiveLock);
 
 	/* ask for a new relfilenode */
 	newoid = GetNewRelFileNode(reltablespace, pg_class, relpersistence);
 
 	/* Close pg_class and release the lock now */
-	heap_close(pg_class, RowExclusiveLock);
+	table_close(pg_class, RowExclusiveLock);
 
 	return newoid;
 }
@@ -297,7 +297,7 @@ hypo_get_relation_info_hook(PlannerInfo *root,
 		Relation	relation;
 
 		/* Open the current relation */
-		relation = heap_open(relationObjectId, AccessShareLock);
+		relation = table_open(relationObjectId, AccessShareLock);
 
 		if (relation->rd_rel->relkind == RELKIND_RELATION
 #if PG_VERSION_NUM >= 90300
@@ -324,7 +324,7 @@ hypo_get_relation_info_hook(PlannerInfo *root,
 		}
 
 		/* Close the relation release the lock now */
-		heap_close(relation, AccessShareLock);
+		table_close(relation, AccessShareLock);
 	}
 
 	if (prev_get_relation_info_hook)
diff --git a/hypopg_index.c b/hypopg_index.c
index db119e1..7a1d18a 100644
--- a/hypopg_index.c
+++ b/hypopg_index.c
@@ -1396,7 +1396,7 @@ hypopg_get_indexdef(PG_FUNCTION_ARGS)
 			if (indexpr_item == NULL)
 elog(ERROR, "too few entries in indexprs list");
 			indexkey = (Node *) lfirst(indexpr_item);
-			indexpr_item = lnext(indexpr_item);
+			indexpr_item = lnext(entry->indexprs, indexpr_item);
 
 			/* Deparse */
 			str = deparse_expression(indexkey, context, false, false);
@@ -1546,7 +1546,7 @@ hypo_estimate_index_simple(hypoIndex * entry, BlockNumber *pages, double *tuples
 	rel = makeNode(RelOptInfo);
 
 	/* Open the hypo index' relation */
-	relation = heap_open(entry->relid, AccessShareLock);
+	relation = table_open(entry->relid, AccessShareLock);
 
 	if (!RelationNeedsWAL(relation) && RecoveryInProgress())
 		ereport(ERROR,
@@ -1567,7 +1567,7 @@ hypo_estimate_index_simple(hypoIndex * entry, BlockNumber *pages, double *tuples
  	  &rel->pages, &rel->tuples, &rel->allvisfrac);
 
 	/* Close the relation and release the lock now */
-	heap_close(relation, AccessShareLock);
+	table_close(relation, AccessShareLock);
 
 	hypo_estimate_index(entry, rel);
 	*pages = entry->pages;
diff --git a/import/hypopg_import_index.c b/import/hypopg_import_index.c
index d482f30..e6b6a2c 100644
--- a/import/hypopg_import_index.c
+++ b/import/hypopg_import_index.c
@@ -91,7 +91,7 @@ build_index_tlist(PlannerInfo *root, IndexOptInfo *index,
 			if (indexpr_item == NULL)
 elog(ERROR, "wrong number of index expressions");
 			indexvar = (Expr *) lfirst(indexpr_item);
-			indexpr_item = lnext(indexpr_item);
+			indexpr_item = lnext(index->indexprs, indexpr_item);
 		}
 
 		tlist = lappend(tlist,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 024f325eb0..abbc7e9e7e 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -102,6 +102,7 @@
 #include 
 
 #include "access/brin.h"
+#include "access/brin_page.h"
 #include "access/gin.h"
 #include "access/table.h"
 #include "access/tableam.h"
@@ -6865,12 +6866,35 @@ 

Re: [PATCH] Implement INSERT SET syntax

2019-11-18 Thread Gareth Palmer



> On 15/11/2019, at 10:20 AM, Tom Lane  wrote:
> 
> Gareth Palmer  writes:
>>> On 19/08/2019, at 3:00 AM, Tom Lane  wrote:
>>> Perhaps the way to resolve Peter's objection is to make the syntax
>>> more fully like UPDATE:
>>>INSERT INTO target SET c1 = x, c2 = y+z, ... FROM tables-providing-x-y-z
>>> (with the patch as-submitted corresponding to the case with an empty
>>> FROM clause, hence no variables in the expressions-to-be-assigned).
> 
>> Thanks for the feedback. Attached is version 3 of the patch that makes
>> the syntax work more like an UPDATE statement when a FROM clause is used.
> 
> Since nobody has objected to this, I'm supposing that there's general
> consensus that that design sketch is OK, and we can move on to critiquing
> implementation details.  I took a look, and didn't like much of what I saw.
> 
> * In the grammar, there's no real need to have separate productions
> for the cases with FROM and without.  The way you have it is awkward,
> and it arbitrarily rejects combinations that work fine in plain
> SELECT, such as WHERE without FROM.  You should just do
> 
> insert_set_clause:
>   SET set_clause_list from_clause where_clause
> group_clause having_clause window_clause opt_sort_clause
> opt_select_limit
> 
> relying on the ability of all those symbols (except set_clause_list) to
> reduce to empty.
> 
> * This is randomly inconsistent with select_no_parens, and not in a
> good way, because you've omitted the option that's actually most likely
> to be useful, namely for_locking_clause.  I wonder whether it's practical
> to refactor select_no_parens so that the stuff involving optional trailing
> clauses can be separated out into a production that insert_set_clause
> could also use.  Might not be worth the trouble, but I'm concerned
> about select_no_parens growing additional clauses that we then forget
> to also add to insert_set_clause.
> 
> * I'm not sure if it's worth also refactoring simple_select so that
> the "into_clause ... window_clause" business could be shared.  But
> it'd likely be a good idea to at least have a comment there noting
> that any changes in that production might need to be applied to
> insert_set_clause as well.
> 
> * In kind of the same vein, it feels like the syntax documentation
> is awkwardly failing to share commonality that it ought to be
> able to share with the SELECT man page.
> 
> * I dislike the random hacking you did in transformMultiAssignRef.
> That weakens a useful check for error cases, and it's far from clear
> why the new assertion is OK.  It also raises the question of whether
> this is really the only place you need to touch in parse analysis.
> Perhaps it'd be better to consider inventing new EXPR_KIND_ values
> for this situation; you'd then have to run around and look at all the
> existing EXPR_KIND uses, but that seems like a useful cross-check
> activity anyway.  Or maybe we need to take two steps back and
> understand why that change is needed at all.  I'd imagined that this
> patch would be only syntactic sugar for something you can do already,
> so it's not quite clear to me why we need additional changes.
> 
> (If it's *not* just syntactic sugar, then the scope of potential
> problems becomes far greater, eg does ruleutils.c need to know
> how to reconstruct a valid SQL command from a querytree like this.
> If we're not touching ruleutils.c, we need to be sure that every
> command that can be written this way can be written old-style.)

So it appears as though it may not require any changes to ruleutils.c
as the parser is converting the multi-assignments into separate
columns, eg:

CREATE RULE r1 AS ON INSERT TO tab1
  DO INSTEAD
  INSERT INTO tab2 SET (col2, col1) = (new.col2, 0), col3 = tab3.col3
  FROM tab3

The rule generated is:

 r1 AS ON INSERT TO tab1 DO INSTEAD
   INSERT INTO tab2 (col2, col1, col3)
 SELECT new.col2, 0 AS col1, tab3.col3 FROM tab3

It will trigger that Assert() though, as EXPR_KIND_SELECT_TARGET is
now also being passed to transformMultiAssignRef().

> * Other documentation gripes: the lone example seems insufficient,
> and there needs to be an entry under COMPATIBILITY pointing out
> that this is not per SQL spec.
> 
> * Some of the test cases seem to be expensively repeating
> construction/destruction of tables that they could have shared with
> existing test cases.  I do not consider it a virtue for new tests
> added to an existing test script to be resolutely independent of
> what's already in that script.
> 
> I'm setting this back to Waiting on Author.
> 
>   regards, tom lane





Re: Invisible PROMPT2

2019-11-18 Thread Thomas Munro
On Tue, Nov 19, 2019 at 12:09 PM Tom Lane  wrote:
> You should follow the logic in pg_wcswidth: compute PQmblen() first,
> and bail out if it's more than the remaining string length, otherwise
> it's ok to apply PQdsplen().

Got it.  I was worried that it wasn't safe to call even PQmblen(),
because I didn't know a fact about all encodings: as described in the
comment of pg_gb18030_mblen(), all implementations read only the first
byte to determine the length, except for GB18030 which reads the
second byte too, and that's OK because there's always a null
terminator.

> It might be a good idea to explicitly initialize last_prompt1_width to
> zero, for clarity.
>
> Should the user docs explicitly say "of the same width as the most recent
> output of PROMPT1", as you have in the comments?  That seems a more
> precise specification, and it will eliminate some questions people will
> otherwise ask.
>
> LGTM otherwise.

Done, and pushed.  I also skipped negative results from PQdsplen like
pg_wcswidth() does (that oversight explained why a non-readline build
showed the correct alignment for PROMPT1 '%[%033[1m%]%M
%n@%/%R%[%033[0m%]%# ' by strange coincidence).
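
For anyone following along, the loop is roughly the following (a sketch of the
logic described above, not the committed code):

#include "libpq-fe.h"		/* PQmblen(), PQdsplen() */

/*
 * Sketch: visible width of a prompt string, following the pg_wcswidth()
 * pattern.  PQmblen() is consulted first, and we bail out if it claims more
 * bytes than remain (a possibly invalid multibyte sequence at the end of the
 * buffer); negative PQdsplen() results are ignored, as pg_wcswidth() does.
 */
static int
prompt_display_width(const char *s, int len, int encoding)
{
	int			width = 0;

	while (len > 0)
	{
		int			chlen = PQmblen(s, encoding);
		int			chwidth;

		if (chlen > len)
			break;
		chwidth = PQdsplen(s, encoding);
		if (chwidth > 0)
			width += chwidth;
		s += chlen;
		len -= chlen;
	}
	return width;
}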

Thanks all for the feedback.  I think the new bikeshed colour looks good.




Re: [PATCH] Implement INSERT SET syntax

2019-11-18 Thread Gareth Palmer



> On 15/11/2019, at 10:20 AM, Tom Lane  wrote:
> 
> Gareth Palmer  writes:
>>> On 19/08/2019, at 3:00 AM, Tom Lane  wrote:
>>> Perhaps the way to resolve Peter's objection is to make the syntax
>>> more fully like UPDATE:
>>>INSERT INTO target SET c1 = x, c2 = y+z, ... FROM tables-providing-x-y-z
>>> (with the patch as-submitted corresponding to the case with an empty
>>> FROM clause, hence no variables in the expressions-to-be-assigned).
> 
>> Thanks for the feedback. Attached is version 3 of the patch that makes
>> the syntax work more like an UPDATE statement when a FROM clause is used.
> 
> Since nobody has objected to this, I'm supposing that there's general
> consensus that that design sketch is OK, and we can move on to critiquing
> implementation details.  I took a look, and didn't like much of what I saw.
> 
> * In the grammar, there's no real need to have separate productions
> for the cases with FROM and without.  The way you have it is awkward,
> and it arbitrarily rejects combinations that work fine in plain
> SELECT, such as WHERE without FROM.  You should just do
> 
> insert_set_clause:
>   SET set_clause_list from_clause where_clause
> group_clause having_clause window_clause opt_sort_clause
> opt_select_limit
> 
> relying on the ability of all those symbols (except set_clause_list) to
> reduce to empty.

There are two separate productions to match the two different types
of inserts: INSERT with VALUES and INSERT with SELECT.

The former has to store the values in valuesLists so that DEFAULT
can still be used.
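
For illustration, under the proposed syntax the two cases look roughly like
this (table and column names are made up):

-- VALUES-style production (no FROM): expressions go into valuesLists,
-- so DEFAULT still works
INSERT INTO t SET c1 = 1, c2 = DEFAULT;

-- SELECT-style production (with FROM): behaves like INSERT ... SELECT
INSERT INTO t SET c1 = s.x, c2 = s.y + 1 FROM s WHERE s.y > 0;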

Allowing a WHERE without a FROM also means that while this would
work:

INSERT INTO t SET c = DEFAULT;

this would fail with 'DEFAULT is not allowed in this context':

INSERT INTO t SET c = DEFAULT WHERE true;

I should have put a comment explaining why there are two rules.

It could be combined into one production but there would have to be
a check that $4 .. $9 are NULL to determine what type of INSERT to
use.

transformInsertStmt() also has an optimisation for the case of a
single valueLists entry.

> * This is randomly inconsistent with select_no_parens, and not in a
> good way, because you've omitted the option that's actually most likely
> to be useful, namely for_locking_clause.  I wonder whether it's practical
> to refactor select_no_parens so that the stuff involving optional trailing
> clauses can be separated out into a production that insert_set_clause
> could also use.  Might not be worth the trouble, but I'm concerned
> about select_no_parens growing additional clauses that we then forget
> to also add to insert_set_clause.
> 
> * I'm not sure if it's worth also refactoring simple_select so that
> the "into_clause ... window_clause" business could be shared.  But
> it'd likely be a good idea to at least have a comment there noting
> that any changes in that production might need to be applied to
> insert_set_clause as well.

I can add opt_for_locking_clause and a comment to simple_select to
start with while the format of insert_set_clause is still being
worked out.

> * In kind of the same vein, it feels like the syntax documentation
> is awkwardly failing to share commonality that it ought to be
> able to share with the SELECT man page.

I could collapse the from clause to just '[ FROM from_clause ]'
and have it refer to the from clause and everything after it in
SELECT.

> * I dislike the random hacking you did in transformMultiAssignRef.
> That weakens a useful check for error cases, and it's far from clear
> why the new assertion is OK.  It also raises the question of whether
> this is really the only place you need to touch in parse analysis.
> Perhaps it'd be better to consider inventing new EXPR_KIND_ values
> for this situation; you'd then have to run around and look at all the
> existing EXPR_KIND uses, but that seems like a useful cross-check
> activity anyway.  Or maybe we need to take two steps back and
> understand why that change is needed at all.  I'd imagined that this
> patch would be only syntactic sugar for something you can do already,
> so it's not quite clear to me why we need additional changes.
> 
> (If it's *not* just syntactic sugar, then the scope of potential
> problems becomes far greater, eg does ruleutils.c need to know
> how to reconstruct a valid SQL command from a querytree like this.
> If we're not touching ruleutils.c, we need to be sure that every
> command that can be written this way can be written old-style.)

It was intended to be just syntactic sugar; however, because
set_clause_list is being re-used, the ability to do multi-assignment
in an INSERT's targetList 'came along for the ride', which has
no equivalent in the current INSERT syntax.

That would be why those EXPR_KINDs are now appearing in
transformMultiAssignRef().

There are 3 things that could be done here:

1. Update ruleutils.c to emit INSERT SET in get_insert_query_def()
   if query->hasSubLinks is true.

2. 

Re: progress report for ANALYZE

2019-11-18 Thread Amit Langote
Yamada-san,

Thanks for working on this.

On Wed, Nov 6, 2019 at 2:50 PM Tatsuro Yamada
 wrote:
> I revised the patch as follows because I realized counting the types of ext
> stats is not useful for users.
>
>   - The attached new patch counts the number of ext stats instead of the
> types of ext stats.
>
> So we can see the counter go to "2" if we create the above ext stats (pg_ext1
> and
> pg_ext2) and analyze as you wrote. :)

I have looked at the patch and here are some comments.

I think include_children and current_relid are not enough to
understand the progress of analyzing inheritance trees, because even
with current_relid being updated, I can't tell how many more there
will be.  I think it'd be better to show the total number of children
and the number of children processed, just like
pg_stat_progress_create_index does for partitions.  So, instead of
include_children and current_relid, I think it's better to have
child_tables_total, child_tables_done, and current_child_relid, placed
last in the set of columns.
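
(For reference, that's how the CREATE INDEX progress view already exposes
partition progress, e.g.:)

SELECT pid, phase, partitions_total, partitions_done
  FROM pg_stat_progress_create_index;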

Also, inheritance tree stats are created *after* creating single table
stats, so I think that it would be better to have a distinct phase
name for that, say "acquiring inherited sample rows".  In
do_analyze_rel(), you can select which of two phases to set based on
whether inh is true or not.  For partitioned tables, the progress
output will immediately switch to this phase, because partitioned
table itself is empty so there's nothing to do in the "acquiring
sample rows" phase.

That's all for now.

Thanks,
Amit




Re: PITR on DROP DATABASE, deleting of the database directory despite the recovery_target_time set before.

2019-11-18 Thread Craig Ringer
On Mon, 18 Nov 2019 at 18:48, Nicolas Lutic  wrote:

> Dear Hackers,
>
> After a drop database
>

with FORCE?


> , he tried to recover the data on the last inserted transaction by using
> the recovery_target_time.
> The issue is the database is present in the system catalog but the
> directory was still deleted.
> Here the technical information of the database
> version 11
> default  postgresql.conf except for this options
> wal_level = replica
> archive_mode = on
> archive_command = 'cp %p /tmp/wal_archive/%f '
> log_statement = 'all'
> log_min_messages = debug5
>
>
> The following method was used
>
>- create cluster
>- create database
>- create 1 table
>- create 1 index on 1 column
>- insert 1 row
>- backup with pg_basebackup
>- insert 2 rows

autocommit?

>- drop database

force?

>- Change recovery behaviour in that case to prevent any xact operation
>from being applied until its COMMIT timestamp has been checked against the
>recovery_target_time bound (but it seems to be difficult, as stated in
> https://www.postgresql.org/message-id/flat/20141125160629.GC21475%40msg.df7cb.de
>which also identifies the problem and tries to give some solutions).  Maybe
>another way, as a trivial guess (all apologies), is to buffer immediate
>xacts until we have the commit for each and apply the whole xact buffer
>once the timestamp is known (and checked against the recovery_target_time
>value);
>
>- The other way to improve this is to update the PostgreSQL
>documentation by specifying that recovery_target_time cannot be used
>in this case. There should be multiple places where it can be stated.
>The best one (if only one) seems to be in
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=doc/src/sgml/config.sgml;h=
>
>
If this only happens when a DB is dropped under load with force, I lean
toward just documenting it as a corner case.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: physical slot xmin dependency on logical slot?

2019-11-18 Thread Craig Ringer
On Tue, 19 Nov 2019 at 05:37, Jeremy Finzel  wrote:

> We had a scenario today that was new to us.  We had a logical replication
> slot that was severely far behind.  Before dropping this logical slot, we
> made a physical point-in-time-recovery snapshot of the system with this
> logical slot.
>
> This logical slot was causing severe catalog bloat.  We proceeded to drop
> the logical slot which was over 12000 WAL segments behind.  The physical
> slot was only a few 100 segments behind and still in place.
>
> But now proceeding to VAC FULL the catalog tables did not recover any
> bloat beyond the now-dropped logical slot.  Eventually to our surprise, we
> found that dropping the physical slot allowed us to recover the bloat.
>
> We saw in forensics after the fact that xmin of the physical slot equaled
> the catalog_xmin of the logical slot.  Is there some dependency here where
> physical slots made of a system retain all transactions of logical slots it
> contains as well?  If so, could someone help us understand this, and is
> there documentation around this?  Is this by design?
>

I expect that you created the replica in a manner that preserved the
logical replication slot on it. You also had hot_standby_feedback enabled.

PostgreSQL standbys send the global xmin and (in Pg10+) catalog_xmin to the
upstream when hot_standby_feedback is enabled. If there's a slot on the
replica holding the catalog_xmin down, that'll be passed on via
hot_standby_feedback to the upstream. On Pg 9.6 or older, or if the replica
isn't using a physical replication slot, the catalog_xmin is treated as a
regular xmin since there's nowhere in PGPROC or PGXACT to track the
separate catalog_xmin. If the standby uses a physical slot, then on pg10+
the catalog_xmin sent by the replica is stored as the catalog_xmin on the
physical slot instead.

Either way, if you have hot_standby_feedback enabled on a standby, that
feedback includes the requirements of any replication slots on the standby.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: Postgres on IBM z/OS 2.2.0 and 2.3.0

2019-11-18 Thread Tom Lane
parveen mehta  writes:
> Thanks for providing valuable inputs into my concerns. In the last part you 
> mentioned, and I am quoting it here, "would limit the parts of the system
> that would have to be cleansed of ASCII-isms to libpq and src/bin/.
> But that's already a nontrivial headache I suspect." I am not clear on the
> ASCII-isms to libpq and src/bin/. Can you share some knowledge on those
> items? Are those standard directory locations? Sorry if I am being ignorant.

I'm just speaking of those subtrees of PG's source code:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=tree

Some of the concrete problems are likely to be the same ones mentioned
on wikipedia's EBCDIC page:

https://en.wikipedia.org/wiki/EBCDIC

notably that in ASCII the upper-case English letters have consecutive
codes, as do the lower-case ones, but that's not the case in EBCDIC.
This'd break pg_toupper and pg_tolower, and perhaps other places.
(Or perhaps that's the only place, but there's a lot of code to be
audited to find out.)
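
For instance, an idiom along these lines (illustrative only, not the exact
pg_toupper code) is correct in ASCII but silently wrong in EBCDIC, where the
letters are split across non-adjacent ranges:

static char
ascii_only_toupper(char ch)
{
	/* assumes 'a'..'z' are contiguous -- true for ASCII, not for EBCDIC */
	if (ch >= 'a' && ch <= 'z')
		return ch + ('A' - 'a');
	return ch;
}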

The lack of well-defined equivalents for a lot of common ASCII
punctuation is likely to be a problem as well.  psql's internal
parsing of SQL commands, for example, is entirely unprepared for
the idea that characters it needs to recognize might be encoding
dependent.  But unless you want to restrict yourself to just one
EBCDIC code page, something would have to be done about that.

Another issue, which is something that might be unique to
Postgres, is that we expect that bytes with the high bit set
are elements of multibyte characters, while bytes without are
plain ASCII.  I think you could set things up so that mblen()
returns 1 despite the high bit being set, but at the very least
this would result in an efficiency hit due to taking the slow
"multibyte" code paths even for plain English letters.  Perhaps
it wouldn't matter too much on the client side; hard to tell
without investing a lot of work to try it.

The server-side code has a lot more of these assumptions buried
in it than the client side, which is why I doubt it's feasible
to get the server to run in EBCDIC.  But maybe you could create
encoding translation functions and treat EBCDIC code pages as
client-only encodings, which is a concept we already have.

On the whole though, it seems like a lot of work for a dubious
goal, which is probably why nobody's tackled it.

regards, tom lane




Re: Duplicate Workers entries in some EXPLAIN plans

2019-11-18 Thread Maciek Sakrejda
On Thu, Oct 24, 2019 at 6:48 PM Andres Freund  wrote:
> Unfortunately I think the fix isn't all that trivial, due to the way we
> output the per-worker information at the end of ExplainNode(), by just
> dumping things into a string.  It seems to me that a step in the right
> direction would be for ExplainNode() to create
> planstate->worker_instrument StringInfos, which can be handed to
> routines like show_sort_info(), which would print the per-node
> information into that, rather than directly dumping into
> es->output. Most of the current "Show worker detail" would be done
> earlier in ExplainNode(), at the place where we current display the
> "actual rows" bit.
>
> ISTM that should include removing the duplication of the contents of
> show_sort_info(), and probably also for the Gather, GatherMerge blocks
> (I've apparently skipped adding the JIT information to the latter, not
> sure if we ought to fix that in the stable branches).
>
> Any chance you want to take a stab at that?

It took me a while, but I did take a stab at it (thanks for your
off-list help). Attached is my patch that changes the structured
formats to merge sort worker output in with costs/timing/buffers
worker output. I have not touched any other worker output yet, since
it's not under a Workers group as far as I can tell (so it does not
exhibit the problem I originally reported). I can try to make further
changes here if the approach is deemed sound. I also have not touched
text output; above you had proposed

>  Sort Method: external merge  Disk: 4920kB
>  Buffers: shared hit=682 read=10188, temp read=1415 written=2101
>  Worker 0: actual time=130.058..130.324 rows=1324 loops=1
>Sort Method: external merge  Disk: 5880kB
>Buffers: shared hit=337 read=3489, temp read=505 written=739
>  Worker 1: actual time=130.273..130.512 rows=1297 loops=1
>Buffers: shared hit=345 read=3507, temp read=505 written=744
>Sort Method: external merge  Disk: 5920kB

which makes sense to me, but I'd like to confirm this is the approach
we want before I add it to the patch.

This is my first C in close to a decade (and I was never much of a C
programmer to begin with), so please be gentle.

As Andres suggested off-list, I also changed the worker output to
order fields that also occur in the parent node in the same way as the
parent node.

I've also added a test for the patch, and because this is really an
EXPLAIN issue rather than a query feature issue, I added a
src/test/regress/sql/explain.sql for the test. I added a couple of
utility functions for munging json-formatted EXPLAIN plans into
something we can repeatably verify in regression tests (the functions
use json rather than jsonb to preserve field order). I have not added
this for YAML or XML (even though they should behave the same way),
since I'm not familiar with the functions to manipulate those data
types in a similar way (if they exist). My hunch is due to the
similarity of structured formats, just testing JSON is enough, but I
can expand/adjust tests as necessary.
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62fb3434a3..8f898fc20c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -100,7 +100,7 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
    Oid sortOperator, Oid collation, bool nullsFirst);
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 			 List *ancestors, ExplainState *es);
-static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_sort_info(SortState *sortstate, StringInfo* worker_strs, ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 ExplainState *es);
@@ -290,6 +290,7 @@ NewExplainState(void)
 	es->costs = true;
 	/* Prepare output buffer. */
 	es->str = makeStringInfo();
+	es->root_str = es->str;
 
 	return es;
 }
@@ -1073,9 +1074,20 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	const char *partialmode = NULL;
 	const char *operation = NULL;
 	const char *custom_name = NULL;
+	StringInfo *worker_strs = NULL;
 	int			save_indent = es->indent;
 	bool		haschildren;
 
+	/* Prepare per-worker output */
+	if (es->analyze && planstate->worker_instrument) {
+		int num_workers = planstate->worker_instrument->num_workers;
+		int n;
+		worker_strs = (StringInfo *) palloc0(num_workers * sizeof(StringInfo));
+		for (n = 0; n < num_workers; n++) {
+			worker_strs[n] = makeStringInfo();
+		}
+	}
+
 	switch (nodeTag(plan))
 	{
 		case T_Result:
@@ -1340,6 +1352,64 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		ExplainPropertyBool("Parallel Aware", plan->parallel_aware, es);
 	}
 
+	/* Prepare worker general execution details */
+	if (es->analyze && es->verbose && planstate->worker_instrument)
+	{
+		WorkerInstrumentation *w = planstate->worker_instrument;
+		int			n;
+
+		for (n = 0; 

Re: Invisible PROMPT2

2019-11-18 Thread Tom Lane
Thomas Munro  writes:
> Right, a PQdsplen()/PQmblen() loop works nicely, as attached.

> I spotted a potential problem: I suppose I could write a PROMPT1 that
> includes an invalid multibyte sequence at the end of the buffer and
> trick PQmblen() or PQdsplen() into reading a few bytes past the end.
> Two defences against that would be (1) use pg_encoding_verifymb()
> instead of PQmblen() and (2) use pg_encoding_max_length() to make sure
> you can't get close enough to the end of the buffer, but neither of
> those functions are available to psql.

You should follow the logic in pg_wcswidth: compute PQmblen() first,
and bail out if it's more than the remaining string length, otherwise
it's ok to apply PQdsplen().

It might be a good idea to explicitly initialize last_prompt1_width to
zero, for clarity.

Should the user docs explicitly say "of the same width as the most recent
output of PROMPT1", as you have in the comments?  That seems a more
precise specification, and it will eliminate some questions people will
otherwise ask.

LGTM otherwise.

regards, tom lane




Re: physical slot xmin dependency on logical slot?

2019-11-18 Thread Andres Freund
Hi,

On 2019-11-18 15:36:47 -0600, Jeremy Finzel wrote:
> We had a scenario today that was new to us.  We had a logical replication
> slot that was severely far behind.  Before dropping this logical slot, we
> made a physical point-in-time-recovery snapshot of the system with this
> logical slot.

> This logical slot was causing severe catalog bloat.  We proceeded to drop
> the logical slot which was over 12000 WAL segments behind.  The physical
> slot was only a few 100 segments behind and still in place.
> 
> But now proceeding to VAC FULL the catalog tables did not recover any bloat
> beyond the now-dropped logical slot.  Eventually to our surprise, we found
> that dropping the physical slot allowed us to recover the bloat.
> 
> We saw in forensics after the fact that xmin of the physical slot equaled
> the catalog_xmin of the logical slot.  Is there some dependency here where
> physical slots made of a system retain all transactions of logical slots it
> contains as well?  If so, could someone help us understand this, and is
> there documentation around this?  Is this by design?
> 
> We had thought that the physical slot would only retain the WAL it needed
> for its own restart_lsn, not the segments needed by only logical slots as
> well.  Any explanation would be much appreciated!

The logical slot on the standby affects hot_standby_feedback, which in
turn means that the physical slot also transports xmin horizons to the
primary.

Note that our docs suggest to drop slots when cloning a node (and
pg_basebackup/basebackup on the server side do so automatically):
   
It is often a good idea to also omit from the backup the files
within the cluster's pg_replslot/ directory, so that
replication slots that exist on the master do not become part of the
backup.  Otherwise, the subsequent use of the backup to create a standby
may result in indefinite retention of WAL files on the standby, and
possibly bloat on the master if hot standby feedback is enabled, because
the clients that are using those replication slots will still be connecting
to and updating the slots on the master, not the standby.  Even if the
backup is only intended for use in creating a new master, copying the
replication slots isn't expected to be particularly useful, since the
contents of those slots will likely be badly out of date by the time
the new master comes on line.
   

It's generally useful to look at pg_stat_replication for these kinds of
things...
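
For example, something along these lines on the primary shows which standbys
are reporting feedback and which slots are pinning xmin/catalog_xmin (these
are standard columns of those views):

SELECT application_name, backend_xmin, replay_lsn
  FROM pg_stat_replication;

SELECT slot_name, slot_type, active, xmin, catalog_xmin, restart_lsn
  FROM pg_replication_slots;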

Greetings,

Andres Freund




Re: Reverse collations (initially for making keyset pagination cover more cases)

2019-11-18 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 >> Well, one obvious completely general method is to teach the planner
 >> (somehow) to spot conditions of the form
 >> (a > $1 OR (a = $1 AND b > $2) OR (a = $1 AND b = $2 AND c > $3) ...)
 >> etc. and make them indexable if the sense of the > or < operator at
 >> each step matched an ASC or DESC column in the index.

 Tom> I think really the only attraction of that is that it could be
 Tom> argued to be standard --- but I rather doubt that it's common for
 Tom> DBMSes to recognize such things.

At least MSSQL can recognize that a query with

  WHERE (a > @a OR (a = @a AND b > @b)) ORDER BY a,b

can be satisfied with an ordered index scan on an (a,b) index and no
sort, which is good enough for pagination queries. Haven't confirmed
yes/no for any other databases yet.

(As an aside, if you try and do that in PG using UNION ALL in place of
the OR, to try and get a mergeappend of two index scans, it doesn't work
well because of how we discard redundant pathkeys; you end up with Sort
nodes in the plan.)
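
That rewrite looks roughly like this (illustrative; as noted above, PG
currently ends up planning it with Sort nodes rather than a MergeAppend of
two index scans):

SELECT * FROM t WHERE a > $1
UNION ALL
SELECT * FROM t WHERE a = $1 AND b > $2
ORDER BY a, b;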

 Tom> It'd certainly be a royal pain in the rear both to implement and
 Tom> to use, at least for more than about two sort columns.

For pagination one or two columns seems most likely, but in any event
the query can be generated mechanically if need be.

-- 
Andrew (irc:RhodiumToad)




physical slot xmin dependency on logical slot?

2019-11-18 Thread Jeremy Finzel
We had a scenario today that was new to us.  We had a logical replication
slot that was severely far behind.  Before dropping this logical slot, we
made a physical point-in-time-recovery snapshot of the system with this
logical slot.

This logical slot was causing severe catalog bloat.  We proceeded to drop
the logical slot which was over 12000 WAL segments behind.  The physical
slot was only a few 100 segments behind and still in place.

But now proceeding to VAC FULL the catalog tables did not recover any bloat
beyond the now-dropped logical slot.  Eventually to our surprise, we found
that dropping the physical slot allowed us to recover the bloat.

We saw in forensics after the fact that xmin of the physical slot equaled
the catalog_xmin of the logical slot.  Is there some dependency here where
physical slots made of a system retain all transactions of logical slots it
contains as well?  If so, could someone help us understand this, and is
there documentation around this?  Is this by design?

We had thought that the physical slot would only retain the WAL it needed
for its own restart_lsn, not the segments needed by only logical slots as
well.  Any explanation would be much appreciated!

Thanks,
Jeremy


Re: Invisible PROMPT2

2019-11-18 Thread Thomas Munro
On Tue, Nov 19, 2019 at 6:21 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > Yeah.  Maybe pg_wcswidth() would be OK though, and it's available in
> > psql, though I guess you'd have to make a copy with the escaped bits
> > stripped out.
>
> Right, you should use pg_wcswidth() or the underlying PQdsplen() function
> to compute display width.  The latter might be more convenient since
> you could apply it character by character rather than making a copy
> of the string.

Right, a PQdsplen()/PQmblen() loop works nicely, as attached.

I spotted a potential problem: I suppose I could write a PROMPT1 that
includes an invalid multibyte sequence at the end of the buffer and
trick PQmblen() or PQdsplen() into reading a few bytes past the end.
Two defences against that would be (1) use pg_encoding_verifymb()
instead of PQmblen() and (2) use pg_encoding_max_length() to make sure
you can't get close enough to the end of the buffer, but neither of
those functions are available to psql.


v2-0001-Allow-invisible-PROMPT2-in-psql.patch
Description: Binary data


Re: segfault in geqo on experimental gcc animal

2019-11-18 Thread Fabien COELHO



Hello Martin,


The issue is resolved now and tests are fine for me.


I recompiled gcc trunk and the moonjelly is back to green.

Thanks!

--
Fabien.




Re: Postgres on IBM z/OS 2.2.0 and 2.3.0

2019-11-18 Thread parveen mehta
 Tom,
Thanks for providing valuable inputs into my concerns. In the last part you 
mentioned, and I am quoting it here, "would limit the parts of the system
that would have to be cleansed of ASCII-isms to libpq and src/bin/.
But that's already a nontrivial headache I suspect." I am not clear on the
ASCII-isms to libpq and src/bin/. Can you share some knowledge on those items?
Are those standard directory locations? Sorry if I am being ignorant.
Regards

On Saturday, November 16, 2019, 10:33:28 AM EST, Tom Lane 
 wrote:  
 
 parveen mehta  writes:
> I am researching whether an postgres installation can be done on Unix system 
> services(USS) in z/OS. USS is a POSIX compliant OS on z/OS and i wonder if 
> you have any experience with installing it there that you can share with me. 
> I would be highly appreciative of your comments and thoughts.

The last discussion around this [1] concluded that you'd probably crash
and burn due to z/OS wanting to use EBCDIC encoding.  There's a lot of
ASCII-related assumptions in our code, and nobody is interested in
trying to get rid of them.

It's possible that you could run the server in ASCII and treat EBCDIC
as a client-only encoding, which would limit the parts of the system
that would have to be cleansed of ASCII-isms to libpq and src/bin/.
But that's already a nontrivial headache I suspect.

            regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/BLU437-SMTP4B3FF36035D8A3C3816D49C160%40phx.gbl
  

Re: 'Invalid lp' during heap_xlog_delete

2019-11-18 Thread Daniel Wood
> I'll try to look into that with lower page sizes for relation and WAL
> pages.

The page size is totally unrelated to this bug.  When you repro the redo 
failure it is because the log record is being applied to an old page version.  
The correct newer page version never got written because of the truncate page 
invalidation.  The cause is not a torn write.

> What's that?

Azure PostgreSQL on Windows has proprietary mechanisms in its filesystem to 
guarantee that a write is atomic.

- Dan

> On November 18, 2019 at 4:58 AM Michael Paquier <mich...@paquier.xyz> wrote:
>
> On Thu, Nov 14, 2019 at 07:38:19PM -0800, Daniel Wood wrote:
> > Sorry I missed one thing. Turn off full page writes.
>
> Hmm.  Linux FSes use typically 4kB pages.  I'll try to look into that
> with lower page sizes for relation and WAL pages.
>
> > I'm running in an env. with atomic 8K writes.
>
> What's that?
> --
> Michael
>


Re: [HACKERS] proposal: schema variables

2019-11-18 Thread Pavel Stehule
On Sun, 3 Nov 2019 at 17:27, Pavel Stehule wrote:

>
>
> On Thu, 10 Oct 2019 at 11:41, Pavel Stehule wrote:
>
>> Hi
>>
>> minor change - replace heap_tuple_fetch_attr by detoast_external_attr.
>>
>>
> similar update - heap_open, heap_close was replaced by table_open,
> table_close
>

fresh rebase

Regards

Pavel


> Regards
>
> Pavel
>


schema-variables-20191118.patch.gz
Description: application/gzip


Re: [HACKERS] pg_shmem_allocations view

2019-11-18 Thread Andres Freund
Hi,

On 2019-11-18 21:49:55 +0900, Michael Paquier wrote:
> +/* SQL SRF showing allocated shared memory */
> +Datum
> +pg_get_shmem_allocations(PG_FUNCTION_ARGS)
> This could be more talkative.

I don't really see what it'd say, except restate the function name as a
sentence? I think that kind of comment has negative, not positive value.

- Andres




Re: Reverse collations (initially for making keyset pagination cover more cases)

2019-11-18 Thread Tom Lane
Andrew Gierth  writes:
> "Tom" == Tom Lane  writes:
>  Tom> Lastly, your proposed use-case has some attraction, but this
>  Tom> proposal only supports it if the column you need to be differently
>  Tom> sorted is textual. What if the sort columns are all numerics and
>  Tom> timestamps?

> There are already trivial ways to reverse the orders of those, viz.
> (-number) and (-extract(epoch from timestampcol)). The lack of any
> equivalent method for text is what prompted this idea.

Those "trivial ways" have failure cases, eg with INT_MIN.  I don't buy
that this argument justifies introducing a kluge into text comparison.

>  Tom> Thinking about that, it seems like what we'd want is some sort of
>  Tom> more-general notion of row comparison, to express "bounded below
>  Tom> in an arbitrary ORDER BY ordering". Not quite sure what it ought
>  Tom> to look like.

> Well, one obvious completely general method is to teach the planner
> (somehow) to spot conditions of the form
>   (a > $1 OR (a = $1 AND b > $2) OR (a = $1 AND b = $2 AND c > $3) ...)
> etc. and make them indexable if the sense of the > or < operator at
> each step matched an ASC or DESC column in the index.

I think really the only attraction of that is that it could be argued
to be standard --- but I rather doubt that it's common for DBMSes to
recognize such things.  It'd certainly be a royal pain in the rear
both to implement and to use, at least for more than about two sort
columns.

Back at
https://www.postgresql.org/message-id/10492.1531515255%40sss.pgh.pa.us
I proposed that we might consider allowing row comparisons to specify
an explicit list of operators:

>> One idea for resolving that is to extend the OPERATOR syntax to allow
>> multiple operator names for row comparisons, along the lines of
>>  ROW(a,b) OPERATOR(pg_catalog.<, public.<) ROW(c,d)

I wonder whether it'd be feasible to solve this problem by doing that
and then allowing the operators to be of different comparison types,
that is "ROW(a,b) OPERATOR(<, >) ROW(c,d)".  The semantics would be
that the first not-equal column pair determines the result according
to the relevant operator.  But I'm not quite sure what to do if the
rows are in fact equal --- if some of the operators are like "<" and
some are like "<=", what should the result be?  Maybe let the last
column's operator decide that?

regards, tom lane




Re: [PATCH][BUG FIX] Unsafe access pointers.

2019-11-18 Thread Alvaro Herrera
On 2019-Nov-15, Ranier Vilela wrote:

> Hi,
> Last time, I promise.
> 
> It's probably not happening, but it can happen, I think.

This patch assumes that anything can happen after elog(ERROR).  That's
wrong -- under ERROR or higher, elog() (as well as ereport) never
returns to the caller.  If this was possible, there would be thousands
of places that would need to be patched, all over the server code.  But
it's not.

>   /* Fetch opclass information */
>   classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
> - if (!HeapTupleIsValid(classtup))
> + if (!HeapTupleIsValid(classtup)) {
>   elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
> +    return false;
> + }


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Invisible PROMPT2

2019-11-18 Thread Tom Lane
Thomas Munro  writes:
> On Mon, Nov 18, 2019 at 1:49 PM Alvaro Herrera  
> wrote:
>> On 2019-Nov-18, Thomas Munro wrote:
>>> Nice idea.  Here's one like that, that just does the counting at the
>>> end and looks out for readline control codes.  It's pretty naive about
>>> what "width" means though: you'll get two spaces for UTF-8 encoded é,
>>> and I suppose a complete implementation would know about the half
>>> width/full width thing for Chinese and Japanese etc.

> Yeah.  Maybe pg_wcswidth() would be OK though, and it's available in
> psql, though I guess you'd have to make a copy with the escaped bits
> stripped out.

Right, you should use pg_wcswidth() or the underlying PQdsplen() function
to compute display width.  The latter might be more convenient since
you could apply it character by character rather than making a copy
of the string.

regards, tom lane




Re: Role membership and DROP

2019-11-18 Thread Laurenz Albe
On Fri, 2019-11-15 at 13:41 -0500, Tom Lane wrote:
> Laurenz Albe  writes:
> > On Wed, 2019-11-13 at 17:17 -0500, Tom Lane wrote:
> > > It might be worth clarifying this point in section 5.7,
> > > https://www.postgresql.org/docs/devel/ddl-priv.html
> > > but let's not duplicate that in every ref/ page.
> > I have attached a proposed patch.
> 
>
> The right to modify or destroy an object is always the privilege of
> -   the owner only.
> +   the owner.  Like all privileges, that right can be inherited by members of
> +   the owning role.
>
> 
> Hm.  This is more or less contradicting the original meaning of the
> existing sentence, so maybe we need to rewrite a bit more.  What do
> you think of
> 
> The right to modify or destroy an object is inherent in being the
> object's owner.  Like all privileges, that right can be inherited by
> members of the owning role; but there is no way to grant or revoke
> it more selectively.
> 
> A larger problem (pre-existing, since there's a reference to being a
> member of the owning role just a bit further down) is that I don't think
> we've defined role membership at this point, so the reader is quite
> entitled to come away more confused than they were before.  It might not
> be advisable to try to cover role membership here, but we should at
> least add a cross-reference to where it's explained.

I think you are right about the potential confusion; I have added a
cross-reference.  That cross-reference is hopefully still in short-term
memory when the reader proceeds to the second reference to role membership
a few sentences later.

I like your second sentence, but I think that "the right ... is inherent
in being the ... owner" is unnecessarily complicated.
Removing the "always" and "only" makes the apparent contradiction between
the sentences less jarring to me.

I won't fight about words though.  Attached is my second attempt.

Yours,
Laurenz Albe
From 2e3abaaa3b0a5deb006d2210c3e66f5b3571bfd2 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Mon, 18 Nov 2019 15:23:10 +0100
Subject: [PATCH] Document that the right to ALTER or DROP is hereditary

Discussion: https://postgr.es/m/504497aca66bf34bdcdd90bd0bcebdc3a33f577b.ca...@cybertec.at
---
 doc/src/sgml/ddl.sgml | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 9d6ec2c738..030c896f82 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -1578,8 +1578,10 @@ ALTER TABLE products RENAME TO items;
   
 
   
-   The right to modify or destroy an object is always the privilege of
-   the owner only.
+   The right to modify or destroy an object is the privilege of the owner.
+   Like all privileges, that right can be inherited by members of the owning role,
+   but there is no way to grant or revoke it more selectively.
+   See <xref linkend="role-membership"/> for more about role membership.
   
 
   
-- 
2.21.0



Re: function calls optimization

2019-11-18 Thread Andrzej Barszcz
Hi

I need advice.
ResetExprContext(econtext) is defined as
MemoryContextReset((econtext)->ecxt_per_tuple_memory).
I can register a callback in the MemoryContext, but it is always cleaned up
on every call to MemoryContextReset().
How can I reset some fields of ExprContext (living in per_query_memory) when
ResetExprContext is called?

On Thu, 31 Oct 2019 at 16:52, Andres Freund  wrote:

> Hi,
>
> On October 31, 2019 8:51:11 AM PDT, Andrzej Barszcz 
> wrote:
> >x <> 0 is evaluated first, 1/x only when x <> 0, not ?
> >
> >On Thu, 31 Oct 2019 at 16:45, Tom Lane  wrote:
> >
> >> Andres Freund  writes:
> >> > Potentially related note: for nodes like seqscan, combining the
> >qual and
> >> projection processing into one expression seems to be a noticable win
> >(at
> >> least when taking care do emit two different sets of deform
> >expression
> >> steps).
> >>
> >> There's just one problem: that violates SQL semantics, and not in
> >> a small way.
> >>
> >> select 1/x from tab where x <> 0
>
> On postgres lists the policy is to reply below the quoted bit, and to trim
> the quote appropriately.
>
> Andres
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-18 Thread Amit Khandekar
On Mon, 18 Nov 2019 at 17:52, Amit Kapila  wrote:
> I have one more question regarding this patch.   It seems to me that
> the files opened via OpenTransientFile or OpenTemporaryFile are
> automatically closed at transaction end(abort), but that doesn't seem
> to be the case for files opened with PathNameOpenFile.  See
> AtEOXact_Files and AtEOSubXact_Files.  So, now with the change
> proposed by this patch, don't we need to deal it in some other way?

For the APIs that use VFDs (like PathNameOpenFile), the files opened
are always recorded in the VfdCache array, so no cleanup is required at
(sub)transaction end: the kernel fds get closed dynamically in
ReleaseLruFiles() whenever we reach the max_safe_fds limit. So if a
transaction aborts, the fds might remain open, but they will get
cleaned up whenever we require more fds, through ReleaseLruFiles().
Whereas for files opened through OpenTransientFile(), the VfdCache is
not involved, so those files do need transaction-end cleanup.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
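
A minimal sketch contrasting the two fd.c APIs discussed here, assuming the
usual backend headers (illustration only, not code from the patch): a VFD
from PathNameOpenFile() is tracked in the VfdCache and its kernel fd can be
closed and reopened by ReleaseLruFiles(), while a transient fd keeps a
kernel fd until it is closed or end-of-transaction cleanup releases it.

#include "postgres.h"

#include <fcntl.h>

#include "storage/fd.h"

static void
spill_file_open_sketch(const char *path)
{
	File		vfd;
	int			tfd;

	/*
	 * Virtual file descriptor: tracked in fd.c's VfdCache, so the kernel fd
	 * behind it can be released by ReleaseLruFiles() when max_safe_fds would
	 * be exceeded, and reopened on demand.  It is not closed automatically
	 * at transaction abort, hence the discussion above.
	 */
	vfd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
	if (vfd < 0)
		elog(ERROR, "could not open file \"%s\": %m", path);
	FileClose(vfd);

	/*
	 * Transient kernel fd: counts against max_safe_fds while open, but
	 * AtEOXact_Files() closes it automatically if the transaction aborts
	 * before CloseTransientFile() is reached.
	 */
	tfd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
	if (tfd < 0)
		elog(ERROR, "could not open file \"%s\": %m", path);
	CloseTransientFile(tfd);
}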




Re: dropdb --force

2019-11-18 Thread Pavel Stehule
On Mon, 18 Nov 2019 at 07:39, Pavel Stehule 
wrote:

>
>
> On Mon, 18 Nov 2019 at 07:37, Amit Kapila 
> wrote:
>
>> On Mon, Nov 18, 2019 at 10:59 AM Pavel Stehule 
>> wrote:
>> > On Mon, 18 Nov 2019 at 06:24, Amit Kapila 
>> wrote:
>> >>
>> >> On Mon, Nov 18, 2019 at 10:33 AM Pavel Stehule <
>> pavel.steh...@gmail.com> wrote:
>> >> >
>> >> > On Mon, 18 Nov 2019 at 04:43, vignesh C 
>> wrote:
>> >> >>
>> >> >>
>> >> >> I had seen that isolation test(src/test/isolation) has a framework
>> to
>> >> >> support this. You can have a look to see if it can be handled using
>> >> >> that.
>> >> >
>> >> >
>> >> > I'll look there
>> >> >
>> >>
>> >> If we want to have a test for this, then you might want to look at
>> >> test src/test/recovery/t/013_crash_restart.  In that test, we keep a
>> >> connection open and then validate whether it is terminated.  Having
>> >> said that, I think it might be better to add this as a separate test
>> >> patch apart from main patch because it is a bit of a timing-dependent
>> >> test and might fail on some slow machines.  We can always revert this
>> >> if it turns out to be an unstable test.
>> >
>> >
>> > +1
>> >
>>
>> So, are you planning to give it a try?  I think even if we want to
>> commit this separately, it is better to have a test for this before we
>> commit.
>>
>
> I'll send this test today
>

here it is

Regards

Pavel

>
> Pavel
>
>
>>
>> --
>> With Regards,
>> Amit Kapila.
>> EnterpriseDB: http://www.enterprisedb.com
>>
>
diff --git a/src/bin/scripts/t/060_dropdb_force.pl b/src/bin/scripts/t/060_dropdb_force.pl
new file mode 100644
index 00..b895a7fbba
--- /dev/null
+++ b/src/bin/scripts/t/060_dropdb_force.pl
@@ -0,0 +1,149 @@
+#
+# Tests killing session connected to dropped database
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 10;
+
+# To avoid hanging while expecting some specific input from a psql
+# instance being driven by us, add a timeout high enough that it
+# should never trigger even on very slow machines, unless something
+# is really wrong.
+my $psql_timeout = IPC::Run::timer(60);
+
+my $node = get_new_node('master');
+$node->init;
+$node->start;
+
+# Create database that will be dropped
+$node->safe_psql('postgres', 'CREATE DATABASE foobar1');
+
+# Run psql, keeping session alive, so we have an alive backend to kill.
+my ($killme_stdin, $killme_stdout, $killme_stderr) = ('', '', '');
+my $killme = IPC::Run::start(
+	[
+		'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+		$node->connstr('foobar1')
+	],
+	'<',
+	\$killme_stdin,
+	'>',
+	\$killme_stdout,
+	'2>',
+	\$killme_stderr,
+	$psql_timeout);
+
+#Ensure killme process is active
+$killme_stdin .= q[
+SELECT pg_backend_pid();
+];
+ok(pump_until($killme, \$killme_stdout, qr/[[:digit:]]+[\r\n]$/m),
+	'acquired pid');
+my $pid = $killme_stdout;
+chomp($pid);
+$killme_stdout = '';
+$killme_stderr = '';
+
+# and it is connected to foobar1 database
+is($node->safe_psql('postgres', qq[SELECT pid FROM pg_stat_activity WHERE datname='foobar1' AND pid = $pid;]),
+	$pid, 'database foobar1 is used');
+
+$node->safe_psql('postgres', 'DROP DATABASE foobar1 WITH (FORCE)');
+
+# Check that psql sees the killed backend as having been terminated
+$killme_stdin .= q[
+SELECT 1;
+];
+ok( pump_until(
+		$killme,
+		\$killme_stderr,
+		qr/FATAL:  terminating connection due to administrator command/m
+	),
+	"psql query died successfully after SIGTERM");
+$killme_stderr = '';
+$killme_stdout = '';
+$killme->finish;
+
+# and there is not a database with this name
+is($node->safe_psql('postgres', qq[SELECT EXISTS(SELECT * FROM pg_database WHERE datname='foobar1');]),
+	'f', 'database foobar1 was removed');
+
+# Create database that will be dropped
+$node->safe_psql('postgres', 'CREATE DATABASE foobar1');
+
+# restart the psql process, now that the database has been re-created
+($killme_stdin, $killme_stdout, $killme_stderr) = ('', '', '');
+$killme->run();
+
+# Acquire pid of new backend
+$killme_stdin .= q[
+SELECT pg_backend_pid();
+];
+ok(pump_until($killme, \$killme_stdout, qr/[[:digit:]]+[\r\n]$/m),
+	"acquired pid for SIGKILL");
+$pid = $killme_stdout;
+chomp($pid);
+$killme_stdout = '';
+$killme_stderr = '';
+
+# and it is connected to foobar1 database
+is($node->safe_psql('postgres', qq[SELECT pid FROM pg_stat_activity WHERE datname='foobar1' AND pid = $pid;]),
+	$pid, 'database foobar1 is used');
+
+# Now drop database with dropdb --force command
+$node->issues_sql_like(
+	[ 'dropdb', '--force', 'foobar1' ],
+	qr/statement: DROP DATABASE foobar1 WITH \(FORCE\);/,
+	'SQL DROP DATABASE (FORCE) run');
+
+# Check that psql sees the killed backend as having been terminated
+$killme_stdin .= q[
+SELECT 1;
+];
+ok( pump_until(
+		$killme,
+		\$killme_stderr,
+		qr/FATAL:  terminating connection due to administrator command/m
+	),
+	"psql query died successfully after SIGTERM");
+$killme_stderr = '';
+$killme_stdout = '';
+$killme->finish;
+
+# and 

Re: 'Invalid lp' during heap_xlog_delete

2019-11-18 Thread Michael Paquier
On Thu, Nov 14, 2019 at 07:38:19PM -0800, Daniel Wood wrote:
> Sorry I missed one thing.  Turn off full page writes.

Hmm.  Linux FSes typically use 4kB pages.  I'll try to look into that
with lower page sizes for relation and WAL pages.

> I'm running in an env. with atomic 8K writes.

What's that?
--
Michael




Re: Getting Recordset through returning refcursor

2019-11-18 Thread Michael Paquier
On Fri, Nov 15, 2019 at 03:55:28PM +0530, Amit Kapila wrote:
> I think the chances of getting an answer on ODBC related queries will
> be more if you post on pgsql-interfaces.

There is also a mailing list dedicated to Postgres ODBC:
https://www.postgresql.org/list/pgsql-odbc/
--
Michael




Re: [HACKERS] pg_shmem_allocations view

2019-11-18 Thread Michael Paquier
On Fri, Nov 15, 2019 at 11:59:34AM -0800, Andres Freund wrote:
> On 2019-11-15 14:43:09 -0500, Robert Haas wrote:
>> This never got applied, and that annoyed me again today, so here's a
>> new version that I've whacked around somewhat and propose to commit. I
>> ripped out the stuff pertaining to dynamic shared memory segments,
>> both because I think it might need some more thought and discussion,
>> and because the part that pertains to the main shared memory segment is
>> the part I keep wishing we had. We can add that other part later if
>> we're all agreed on it, but let's go ahead and add this part now.
> 
> Oh, nice!  Makes sense to me to split off the dsm part.

Last Friday we had a conference in Tokyo, and this was actually
mentioned during an after-dinner chat with a couple of other hackers.
Then, a couple of hours later, this thread rises from the ashes.

+/* SQL SRF showing allocated shared memory */
+Datum
+pg_get_shmem_allocations(PG_FUNCTION_ARGS)
This could be more talkative.

>> +# shared memory usage
>> +{ oid => '8613',
>> +  descr => 'allocations from the main shared memory segment',
>> +  proname => 'pg_get_shmem_allocations', 'prorows' => 10, 'proretset' => 
>> 't',
>> +  provolatile => 's', 'prorettype' => 'record', 'proargtypes' => '',
>> +  proallargtypes => '{text,int8,int8}', proargmodes => '{o,o,o}',
>> +  proargnames => '{name,off,size}',
>> +  prosrc => 'pg_get_shmem_allocations' },
> 
> Hm. I think the function is actually volatile, rather than stable?
> Queries can trigger shmem allocations internally, right?

+1.
--
Michael




Re: Attempt to consolidate reading of XLOG page

2019-11-18 Thread Michael Paquier
On Fri, Nov 15, 2019 at 06:41:02PM -0300, Alvaro Herrera wrote:
> I don't quite understand why you backed off from switching to pread.  It
> seemed a good change to me.
>
> [...]
>
> Having seek/open be a boolean "xlr_seek" seems a bit weird.  Changed to
> an "operation" enum.  (Maybe if we go back to pg_pread we can get rid of
> this.)  Accordingly, change WALReadRaiseError and WALDumpReadPage.

This has been quickly mentioned on the thread which has introduced
pread():
https://www.postgresql.org/message-id/c2f56d0a-cadd-3df1-ae48-b84dc8128...@redhat.com

Now, in terms of cost, read() beats pread(), which beats
read()+lseek(), and we don't actually need to seek within the file in
all the cases where we read a WAL page.  On a platform that uses the
fallback implementation, this increases the number of lseek() calls.
Still, I can see that, as you say, using it directly in the
refactoring simplifies the code.
--
Michael


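
A minimal sketch of what such a fallback looks like, assuming only POSIX
lseek() and read() (illustration only; the real fallback lives in
src/port/pread.c): every emulated pread() costs an extra lseek(), which is
the concern above for platforms without a native pread().

#include <sys/types.h>
#include <unistd.h>

/* Emulate pread() on platforms that only provide lseek() + read(). */
static ssize_t
pread_fallback(int fd, void *buf, size_t nbytes, off_t offset)
{
	if (lseek(fd, offset, SEEK_SET) < 0)
		return -1;				/* errno has been set by lseek() */

	return read(fd, buf, nbytes);
}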


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-18 Thread Amit Kapila
On Mon, Nov 18, 2019 at 5:20 PM Amit Kapila  wrote:
>
> On Thu, Oct 3, 2019 at 4:48 PM Amit Khandekar  wrote:
> >
> > On Wed, 18 Sep 2019 at 12:24, Amit Khandekar  wrote:
> > > Probably, for now at least, what everyone seems to agree is to take my
> > > earlier attached patch forward.
> > >
> > > I am going to see if I can add a TAP test for the patch, and will add
> > > the patch into the commitfest soon.
> >
> > Attached is an updated patch v2.
> >
>
> I see that you have made changes in ReorderBufferRestoreChanges to use
> PathNameOpenFile, but not in ReorderBufferSerializeTXN.  Is there a
> reason for the same?
>

I have one more question regarding this patch.   It seems to me that
the files opened via OpenTransientFile or OpenTemporaryFile are
automatically closed at transaction end(abort), but that doesn't seem
to be the case for files opened with PathNameOpenFile.  See
AtEOXact_Files and AtEOSubXact_Files.  So, now with the change
proposed by this patch, don't we need to deal it in some other way?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-18 Thread Amit Khandekar
On Mon, 18 Nov 2019 at 17:20, Amit Kapila  wrote:
> I see that you have made changes in ReorderBufferRestoreChanges to use
> PathNameOpenFile, but not in ReorderBufferSerializeTXN.  Is there a
> reason for the same?  In my test environment, with the test provided
> by you, I got the error (reported in this thread) via
> ReorderBufferSerializeTXN.

You didn't get this error with the patch applied, did you?

If you were debugging this without the patch applied, I suspect the
reason ReorderBufferSerializeTXN() => OpenTransientFile() generates
this error is that the max limit must already have been crossed by
earlier calls to ReorderBufferRestoreChanges().

Note that in ReorderBufferSerializeTXN(), OpenTransientFile() is
sufficient because the code in that function makes sure the fd gets
closed there itself.

If you are getting this error even with the patch applied, then this
needs investigation.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: segfault in geqo on experimental gcc animal

2019-11-18 Thread Martin Liška
Hello.

The issue is resolved now and tests are fine for me.

Martin

On Fri, 15 Nov 2019 at 13:11, Martin Liška  wrote:
>
> Heh, it's me who now breaks postgresql build:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92529
>
> Martin
>
> On Fri, 15 Nov 2019 at 13:01, Fabien COELHO
>  wrote:
> >
> >
> > > Yes, after the revision I see other failing tests like:
> >
> > Indeed, I can confirm there are still 18/195 fails with the updated gcc.
> >
> > > I'm going to investigate that and will inform you guys.
> >
> > Great, thanks!
> >
> > --
> > Fabien.




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-18 Thread Amit Kapila
On Thu, Oct 3, 2019 at 4:48 PM Amit Khandekar  wrote:
>
> On Wed, 18 Sep 2019 at 12:24, Amit Khandekar  wrote:
> > Probably, for now at least, what everyone seems to agree is to take my
> > earlier attached patch forward.
> >
> > I am going to see if I can add a TAP test for the patch, and will add
> > the patch into the commitfest soon.
>
> Attached is an updated patch v2.
>

I see that you have made changes in ReorderBufferRestoreChanges to use
PathNameOpenFile, but not in ReorderBufferSerializeTXN.  Is there a
reason for the same?  In my test environment, with the test provided
by you, I got the error (reported in this thread) via
ReorderBufferSerializeTXN.  See call stack below:

!errfinish(int dummy=0, ...)  Line 441 C
!OpenTransientFilePerm(const char * fileName=0x012deeac, int
fileFlags=33033, unsigned short fileMode=384)  Line 2272 + 0x57 bytes
C
!OpenTransientFile(const char * fileName=0x012deeac, int
fileFlags=33033)  Line 2256 + 0x15 bytes C
!ReorderBufferSerializeTXN(ReorderBuffer * rb=0x01ee4d80,
ReorderBufferTXN * txn=0x1f9a6ce8)  Line 2302 + 0x11 bytes C
!ReorderBufferIterTXNInit(ReorderBuffer * rb=0x01ee4d80,
ReorderBufferTXN * txn=0x01f08f80)  Line 1044 + 0xd bytes C


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-11-18 Thread Dilip Kumar
On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila  wrote:
>
> On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar  wrote:
> >
> > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila  wrote:
> > >
> > >
> > > Few other comments on this patch:
> > > 1.
> > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > +
> > > + /*
> > > + * Execute the invalidation message locally.
> > > + *
> > > + * XXX Do we need to care about relcacheInitFileInval and
> > > + * the other fields added to ReorderBufferChange, or just
> > > + * about the message itself?
> > > + */
> > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > + break;
> > >
> > > Here, why are we executing messages individually?  Can't we just
> > > follow what we do in DecodeCommit which is to record the invalidations
> > > in ReorderBufferTXN as we encounter them and then allow them to
> > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > receive any invalidation message?

I think it's fine to call ReorderBufferXidSetCatalogChanges only on
commit, because it is required only for adding a committed transaction
to the snapshot if it has made catalog changes.  So I think there is
no point in setting that flag every time we get an invalidation
message.


> > IMHO, the reason is that in DecodeCommit, we get all the invalidation
> > at one time so, at REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID, we don't
> > know which invalidation message to execute so for being safe we have
> > to execute all.  But, since we are logging all invalidation
> > individually, we exactly know at this stage which cache to invalidate.
> > So it is better to only invalidate required cache not all.
> >
>
> In that case, invalidations can be processed multiple times, the first
> time when these individual WAL logs for invalidation are processed and
> then later at commit time when we accumulate all invalidation messages
> and then execute them for REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.
> Can we avoid to execute invalidations from other places after this
> patch which also includes executing them as part of XLOG_INVALIDATIONS
> processing?
I think we can avoid the invalidations that are done as part of
REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  I need to investigate
further the invalidations done as part of XLOG_INVALIDATIONS.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




PITR on DROP DATABASE, deleting of the database directory despite the recovery_target_time set before.

2019-11-18 Thread Nicolas Lutic
Dear Hackers,

A customer reported a strange behaviour during a PITR restoration.
After a DROP DATABASE, he tried to recover the data up to the last inserted
transaction by using recovery_target_time.
The issue is that the database is present in the system catalog but its
directory was still deleted.
Here is the technical information of the database:
version 11
default postgresql.conf except for these options:
    wal_level = replica
    archive_mode = on
    archive_command = 'cp %p /tmp/wal_archive/%f '
    log_statement = 'all'
    log_min_messages = debug5

  
The following method was used:

  * create cluster

  * create database

  * create 1 table

  * create 1 index on 1 column

  * insert 1 row

  * backup with pg_basebackup

  * insert 2 rows

  * drop database

  * stop instance

  * found the last inserted transaction timestamp ('2019-11-13 11:49:08.413744+01') before the drop database

  * replace $datadir with the pg_basebackup archive

  * edit recovery.conf

  * restore_command = 'cp /tmp/wal_archive/%f "%p"'

  * recovery_target_time = '2019-11-13 11:49:08.413744+01'

  * recovery_target_inclusive = true

  * restart cluster

I tried to understand what's happening.  When we analyse the postgresql.log
(log_min_messages = debug5), we can see that the recovery stopped before
transaction 574 (the DROP DATABASE transaction), i.e. at transaction 573,
the last insert, but the database directory was still deleted.

2019-11-13 11:55:12.732 CET [30666] DEBUG:  remove KnownAssignedXid 572
2019-11-13 11:55:12.732 CET [30666] CONTEXT:  WAL redo at 0/3000178 for Transaction/COMMIT: 2019-11-13 11:49:08.248928+01
2019-11-13 11:55:12.732 CET [30666] DEBUG:  record known xact 573 latestObservedXid 572
2019-11-13 11:55:12.732 CET [30666] CONTEXT:  WAL redo at 0/30001A0 for Heap/INSERT: off 3
2019-11-13 11:55:12.732 CET [30666] DEBUG:  record known xact 573 latestObservedXid 573
2019-11-13 11:55:12.732 CET [30666] CONTEXT:  WAL redo at 0/30001F8 for Btree/INSERT_LEAF: off 3
2019-11-13 11:55:12.732 CET [30666] DEBUG:  record known xact 573 latestObservedXid 573
2019-11-13 11:55:12.732 CET [30666] CONTEXT:  WAL redo at 0/3000238 for Transaction/COMMIT: 2019-11-13 11:49:08.413744+01
2019-11-13 11:55:12.732 CET [30666] DEBUG:  record known xact 573 latestObservedXid 573
2019-11-13 11:55:12.732 CET [30666] CONTEXT:  WAL redo at 0/3000238 for Transaction/COMMIT: 2019-11-13 11:49:08.413744+01
2019-11-13 11:55:12.732 CET [30666] DEBUG:  remove KnownAssignedXid 573
2019-11-13 11:55:12.732 CET [30666] CONTEXT:  WAL redo at 0/3000238 for Transaction/COMMIT: 2019-11-13 11:49:08.413744+01
2019-11-13 11:55:12.732 CET [30666] DEBUG:  record known xact 574 latestObservedXid 573
2019-11-13 11:55:12.732 CET [30666] CONTEXT:  WAL redo at 0/3000260 for Heap/DELETE: off 4 KEYS_UPDATED
2019-11-13 11:55:12.732 CET [30666] DEBUG:  prune KnownAssignedXids to 574
2019-11-13 11:55:12.732 CET [30666] CONTEXT:  WAL redo at 0/3000730 for Standby/RUNNING_XACTS: nextXid 575 latestCompletedXid 573 oldestRunningXid 574; 1 xacts: 574
2019-11-13 11:55:12.732 CET [30666] DEBUG:  record known xact 574 latestObservedXid 574
2019-11-13 11:55:12.732 CET [30666] CONTEXT:  WAL redo at 0/30007D8 for Database/DROP: dir 16384/1663
2019-11-13 11:55:12.738 CET [30666] LOG:  recovery stopping before commit of transaction 574, time 2019-11-13 11:49:10.683426+01
2019-11-13 11:55:12.738 CET [30666] LOG:  recovery has paused

By analysing the WAL file with pg_waldump:

rmgr: Heap    len (rec/tot): 54/   198, tx:    572, lsn: 0/0328, prev 0/02F8, desc: INSERT off 2, blkref #0: rel 1663/16384/16385 blk 0 FPW
rmgr: Btree   len (rec/tot): 53/   133, tx:    572, lsn: 0/03F0, prev 0/0328, desc: INSERT_LEAF off 2, blkref #0: rel 1663/16384/16388 blk 1 FPW
rmgr: Transaction len (rec/tot): 34/    34, tx:    572, lsn: 0/03000178, prev 0/03F0, desc: COMMIT 2019-11-13 11:49:08.248928 CET
rmgr: Heap    len (rec/tot): 87/    87, tx:    573, lsn: 0/030001A0, prev 0/03000178, desc: INSERT off 3, blkref #0: rel 1663/16384/16385 blk 0
rmgr: Btree   len (rec/tot): 64/    64, tx:    573, lsn: 0/030001F8, prev 0/030001A0, desc: INSERT_LEAF off 3, blkref #0: rel 1663/16384/16388 blk 1
rmgr: Transaction len (rec/tot): 34/    34, tx:    573, lsn: 0/03000238, prev 0/030001F8, desc: COMMIT 2019-11-13 11:49:08.413744 CET
rmgr: Heap    len (rec/tot): 59/  1227, tx:    574, lsn: 0/03000260, prev 0/03000238, desc: DELETE off 4 KEYS_UPDATED , blkref #0: rel 1664/0/1262 blk 0 FPW
rmgr: Standby len (rec/tot): 54/    54, tx:  0, lsn: 0/03000730, prev 0/03000260, desc: RUNNING_XACTS nextXid 575 latestCompletedXid 573 oldestRunningXid 574; 1 xacts: 574
rmgr: XLOG    len (rec/tot):    106/

Re: adding partitioned tables to publications

2019-11-18 Thread Amit Langote
On Tue, Nov 12, 2019 at 10:11 AM Amit Langote  wrote:
> Initial
> syncing code can be easily modified to support any combination of
> source and target relations, but changes needed for real-time
> replication seem non-trivial.

I have spent some time hacking on this.  With the attached updated
patch, adding a partitioned table to a publication results in publishing
the inserts, updates, deletes of the table's leaf partitions as
inserts, updates, deletes of the table itself (it all happens inside
pgoutput).  So, the replication target table doesn't necessarily have
to be a partitioned table and even if it is partitioned its partitions
don't have to match one-to-one.

One restriction remains though: partitioned tables on a subscriber
can't accept updates and deletes, because we'd need to map those to
updates and deletes of their partitions, including handling a tuple
possibly moving from one partition to another during an update.

Also, I haven't added subscription tests yet.

Attached is the updated patch.  The previous division into a refactoring
patch and a feature patch no longer made sense to me, so there is only
one this time.

Thanks,
Amit


v5-0001-Support-adding-partitioned-tables-to-publications.patch
Description: Binary data


Re: SegFault on 9.6.14

2019-11-18 Thread Amit Kapila
On Fri, Oct 18, 2019 at 10:08 AM Amit Kapila  wrote:
>
> On Thu, Oct 17, 2019 at 10:51 AM Thomas Munro  wrote:
> >
> > ===
> > Don't shut down Gather[Merge] early under Limit.
> >
> > Revert part of commit 19df1702f5.
> >
> > Early shutdown was added by that commit so that we could collect
> > statistics from workers, but unfortunately it interacted badly with
> > rescans.  Rescanning a Limit over a Gather node could produce a SEGV
> > on 9.6 and resource leak warnings on later releases.  By reverting the
> > early shutdown code, we might lose statistics in some cases of Limit
> > over Gather, but that will require further study to fix.
> >
>
> How about some text like below?  I have added slightly different text
> to explain the reason for the problem.
>
> "Early shutdown was added by that commit so that we could collect
> statistics from workers, but unfortunately, it interacted badly with
> rescans.  The problem is that we ended up destroying the parallel
> context which is required for rescans.  This leads to rescans of a
> Limit node over a Gather node to produce unpredictable results as it
> tries to access destroyed parallel context.  By reverting the early
> shutdown code, we might lose statistics in some cases of Limit over
> Gather, but that will require further study to fix."
>
> I am not sure but we can even add a comment in the place where we are
> removing some code (in nodeLimit.c) to indicate that 'Ideally we
> should shutdown parallel resources here to get the correct stats, but
> that would lead to rescans misbehaving when there is a Gather [Merge]
> node beneath it.  (Explain the reason for misbehavior and the ideas we
> discussed in this thread to fix the same) ."
>
> I can try to come up with comments in nodeLimit.c on the above lines
> if we think that is a good idea?
>

I have modified the commit message as proposed above and additionally
added comments in nodeLimit.c.  I think we should move ahead with this
bug-fix patch.  If we don't like the comment, it can anyway be
improved later.

Any suggestions?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


0001-Don-t-shut-down-Gather-Merge-early-under-Limit.patch
Description: Binary data