date:20191102

> I am sending patch where kill was replaced by SendProcSignal. I tested it
> on pg_bench with 400 connections, and it works without problems.
>

I tested it under high load and it is works. But it is slower than just
kill - under load the killing 400 connections needs about 1.5 sec.

Maybe for forced drop database we can increase max time limit to 10 seconds
(maybe more (statement timeout)) ? It is granted so we wait just on sigterm
handling. Other situations are finished by error. So in this case is very
probable so waiting will be successful and then we can wait long time.


> Regards
>
> Pavel
>
>>
>> Pavel
>>
>>
>>> --
>>> With Regards,
>>> Amit Kapila.
>>> EnterpriseDB: http://www.enterprisedb.com
>>>
>>

Re: CREATE TEXT SEARCH DICTIONARY segfaulting on 9.6+

Arthur Zakirov  writes:
> On 2019/10/13 10:26, Tomas Vondra wrote:
>> So I think we need some sort of cross-check here. We certainly need to
>> make NISortDictionary() check the affix value is within AffixData
>> bounds, and error out when the index is non-sensical (maybe negative
>> and/or exceeding nAffixData).

> I agree, I attached the patch which do this. I also added couple 
> asserts, tests and fixed condition in getAffixFlagSet():

> - if (curaffix > 0 && curaffix <= Conf->nAffixData)
> + if (curaffix > 0 && curaffix < Conf->nAffixData)

Looks reasonable to me, and we need to get something done before
the upcoming releases, so I pushed this.  Perhaps there's more
that could be done later.

regards, tom lane

Re: dropdb --force

Hi

so 2. 11. 2019 v 17:18 odesílatel Pavel Stehule 
napsal:

>
>
> pá 25. 10. 2019 v 4:55 odesílatel Amit Kapila 
> napsal:
>
>> On Thu, Oct 24, 2019 at 8:22 PM Pavel Stehule 
>> wrote:
>> >
>> > čt 24. 10. 2019 v 11:10 odesílatel Amit Kapila 
>> napsal:
>> >>
>> >> While making some changes in the patch, I noticed that in
>> >> TerminateOtherDBBackends, there is a race condition where after we
>> >> release the ProcArrayLock, the backend process to which we decided to
>> >> send a signal exits by itself and the same pid can be assigned to
>> >> another backend which is connected to some other database.  This leads
>> >> to terminating a wrong backend.  I think we have some similar race
>> >> condition at few other places like in pg_terminate_backend(),
>> >> ProcSleep() and CountOtherDBBackends().  I think here the risk is a
>> >> bit more because there could be a long list of pids.
>> >>
>> >> One idea could be that we write a new function similar to IsBackendPid
>> >> which takes dbid and ensures that pid belongs to that database and use
>> >> that before sending kill signal, but still it will not be completely
>> >> safe.  But, I think it can be closer to cases like we already have in
>> >> code.
>> >>
>> >> Another possible idea could be to use the SendProcSignal mechanism
>> >> similar to how we have used it in CancelDBBackends() to allow the
>> >> required backends to exit by themselves.  This might be safer.
>> >>
>> >> I am not sure if we can entirely eliminate this race condition and
>> >> whether it is a good idea to accept such a race condition even though
>> >> it exists in other parts of code.  What do you think?
>> >>
>> >> BTW, I have added/revised some comments in the code and done few other
>> >> cosmetic changes, the result of which is attached.
>> >
>> >
>> > Tomorrow I'll check variants that you mentioned.
>> >
>> > We sure so there are not any new connect to removed database, because
>> we hold lock there.
>> >
>>
>> Right.
>>
>> > So check if oid db is same should be enough.
>> >
>>
>> We can do this before sending a kill signal but is it enough?  Because
>> as soon as we release ProcArrayLock anytime the other process can exit
>> and a new process can use its pid.  I think this is more of a
>> theoretical risk and might not be easy to hit, but still, we can't
>> ignore it.
>>
>
> yes, there is a theoretical risk probably - the released pid should near
> current fresh pid from range 0 .. pid_max.
>
> Probably the best solutions is enhancing SendProcSignal and using it here
> and fix CountOtherDBBackends.
>
> Alternative solution can be killing in block 50 processes and recheck.
> I'll try to write both and you can decide for one
>

I am sending patch where kill was replaced by SendProcSignal. I tested it
on pg_bench with 400 connections, and it works without problems.

Regards

Pavel

>
> Pavel
>
>
>> --
>> With Regards,
>> Amit Kapila.
>> EnterpriseDB: http://www.enterprisedb.com
>>
>
diff --git a/doc/src/sgml/ref/drop_database.sgml b/doc/src/sgml/ref/drop_database.sgml
index 3ac06c984a..cef1b90421 100644
--- a/doc/src/sgml/ref/drop_database.sgml
+++ b/doc/src/sgml/ref/drop_database.sgml
@@ -21,7 +21,11 @@ PostgreSQL documentation
 
  
 
-DROP DATABASE [ IF EXISTS ] name
+DROP DATABASE [ IF EXISTS ] name [ [ WITH ] ( option [, ...] ) ] 
+
+where option can be:
+
+FORCE
 
  
 
@@ -32,9 +36,11 @@ DROP DATABASE [ IF EXISTS ] name
DROP DATABASE drops a database. It removes the
catalog entries for the database and deletes the directory
containing the data.  It can only be executed by the database owner.
-   Also, it cannot be executed while you or anyone else are connected
-   to the target database.  (Connect to postgres or any
-   other database to issue this command.)
+   It cannot be executed while you are connected to the target database.
+   (Connect to postgres or any other database to issue this
+   command.)
+   Also, if anyone else is connected to the target database, this command will
+   fail unless you use the FORCE option described below.
   
 
   
@@ -64,6 +70,25 @@ DROP DATABASE [ IF EXISTS ] name
  
 

+
+   
+FORCE
+
+ 
+  Attempt to terminate all existing connections to the target database.
+  It doesn't terminate if prepared transactions, active logical replication
+  slots or subscriptions are present in the target database.
+ 
+ 
+  This will fail if the current user has no permissions to terminate other
+  connections. Required permissions are the same as with 
+  pg_terminate_backend, described in
+  .  This will also fail if we
+  are not able to terminate connections.
+ 
+
+   
+
   
  
 
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 01d66212e9..38a2bfa969 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -811,7 +811,7 @@ createdb_failure_callback(int code, Datum arg)
  * DROP DATABASE
  */

The proposed option is not much different from what it is now.We are not trying to save some space - we will reuse the existing one. We just work in 64 bit transaction counters. Correct me if I'm wrong - the address of the next version of the line is stored in the 6 byte field t_cid in the tuple header - which is not attached to the current page in any way - and can be stored anywhere in the table. Nothing changes.I often explain things very poorly, but I will try.for exampleEach transaction is assigned a unique 64-bit xid.In the tuple header, we replace 32-bit xmin and xmax - with one 64-bit field - let's call it xid.SupposeTransaction 1 does INSERTThe first version is created (Tuple1).Tuple1. Tuple_header.xid = Transacrion1.xid and Tuple1. Tuple_header. t_cid = 0;Transaction 3 (started after transaction 1) does UPDATEThe second version is created (Tuple2).Tuple1. Tuple_header. t_cid = (address) Tuple2;Tuple2. Tuple_header.xid = Transacrion3.xid and Tuple2. Tuple_header. t_cid = 0;Transaction 2 (started between transaction1 and transaction2) makes SELECTReads Tuple1. Transaction 2 sees that Tuple1.Tuple_header.xid 02.11.2019, 21:15, "Tomas Vondra" :On Sat, Nov 02, 2019 at 07:07:17PM +0300, Павел Ерёмин wrote: The proposed option does not need to change the length of either the page header or tuple header. Therefore, you will not need to physically change the data.So how do you link the tuple versions together? Clearly, that has to bestored somewhere ...-- Tomas Vondra http://www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Index Skip Scan

2019-11-02 Thread Peter Geoghegan

On Sat, Nov 2, 2019 at 11:56 AM Peter Geoghegan  wrote:
> On Wed, Sep 25, 2019 at 2:33 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > v27-0001-Index-skip-scan.patch
>
> Some random thoughts on this:

And now some more:

* I'm confused about this code in _bt_skip():

> /*
>  * Andvance backward but read forward. At this moment we are at the next
>  * distinct key at the beginning of the series. In case if scan just
>  * started, we can read forward without doing anything else. Otherwise 
> find
>  * previous distinct key and the beginning of it's series and read forward
>  * from there. To do so, go back one step, perform binary search to find
>  * the first item in the series and let _bt_readpage do everything else.
>  */
> else if (ScanDirectionIsBackward(dir) && ScanDirectionIsForward(indexdir))
> {
> if (!scanstart)
> {
> _bt_drop_lock_and_maybe_pin(scan, >currPos);
> offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
>
> /* One step back to find a previous value */
> _bt_readpage(scan, dir, offnum);

Why is it okay to call _bt_drop_lock_and_maybe_pin() like this? It
looks like that will drop the lock (but not the pin) on the same
buffer that you binary search with _bt_binsrch() (since the local
variable "buf" is also the buf in "so->currPos").

* It also seems a bit odd that you assume that the scan is
"scan->xs_want_itup", but then check that condition many times. Why
bother?

* Similarly, why bother using _bt_drop_lock_and_maybe_pin() at all,
rather than just unlocking the buffer directly? We'll only drop the
pin for a scan that is "!scan->xs_want_itup", which is never the case
within _bt_skip().

I think that the macros and stuff that manage pins and buffer locks in
nbtsearch.c is kind of a disaster anyway [1]. Maybe there is some
value in trying to be consistent with existing nbtsearch.c code in
ways that aren't strictly necessary.

* Not sure why you need this code after throwing an error:

> else
> {
> elog(ERROR, "Could not read closest index tuples: %d", 
> offnum);
> pfree(so->skipScanKey);
> so->skipScanKey = NULL;
> return false;
> }

[1] 
https://www.postgresql.org/message-id/flat/CAH2-Wz=m674-rkqdcg+jcd9qgzn1kcg-fodyw4-j+5_pfch...@mail.gmail.com
--
Peter Geoghegan

Re: bitmaps and correlation

2019-11-02 Thread Justin Pryzby

Attached is a fixed and rebasified patch for cfbot.
Included inline for conceptual review.


>From f3055a5696924427dda280da702c41d2d2796a24 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 1 Jan 2019 16:17:28 -0600
Subject: [PATCH v2] Use correlation statistic in costing bitmap scans..

Same as for an index scan, an uncorrelated bitmap (like modulus) which access a
certain number of pages across the entire length of a table should have cost
estimate heavily weighted by random access, compared to an bitmap scan which
accesses same number of pages across a small portion of the table.

Note, Tom points out that there are cases where a column could be
tightly-clumped without being hightly-ordered.  Since we have correlation
already, we use that for now, and if someone creates a statistic for
clumpiness, we'll re-evaluate at some later date.
---
 src/backend/optimizer/path/costsize.c | 84 ---
 src/backend/optimizer/path/indxpath.c |  8 ++--
 src/include/nodes/pathnodes.h |  3 ++
 src/include/optimizer/cost.h  |  2 +-
 4 files changed, 77 insertions(+), 20 deletions(-)

diff --git a/src/backend/optimizer/path/costsize.c 
b/src/backend/optimizer/path/costsize.c
index c5f6593..aaac29a 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -549,11 +549,12 @@ cost_index(IndexPath *path, PlannerInfo *root, double 
loop_count,
 
/*
 * Save amcostestimate's results for possible use in bitmap scan 
planning.
-* We don't bother to save indexStartupCost or indexCorrelation, 
because a
-* bitmap scan doesn't care about either.
+* We don't bother to save indexStartupCost, because a
+* bitmap scan doesn't care.
 */
path->indextotalcost = indexTotalCost;
path->indexselectivity = indexSelectivity;
+   path->indexCorrelation = indexCorrelation;
 
/* all costs for touching index itself included here */
startup_cost += indexStartupCost;
@@ -986,12 +987,33 @@ cost_bitmap_heap_scan(Path *path, PlannerInfo *root, 
RelOptInfo *baserel,
 * appropriate to charge spc_seq_page_cost apiece.  The effect is
 * nonlinear, too. For lack of a better idea, interpolate like this to
 * determine the cost per page.
+* Note this works at PAGE granularity, so even if we read 1% of a
+* table's tuples, if we have to read nearly every page, it should be
+* considered sequential.
 */
-   if (pages_fetched >= 2.0)
+   if (pages_fetched >= 2.0) {
+   double correlation, cost_per_page2;
cost_per_page = spc_random_page_cost -
(spc_random_page_cost - spc_seq_page_cost)
* sqrt(pages_fetched / T);
-   else
+
+   // XXX: interpolate based on correlation from seq_page_cost to 
rand_page_cost?
+   // a highly correlated bitmap scan 1) likely reads fewer pages; 
and,
+   // 2) at higher "density" (more sequential).
+   // double correlation = get_indexpath_correlation(root, 
bitmapqual);
+   correlation = ((IndexPath *)bitmapqual)->indexCorrelation;
+   cost_per_page2 = spc_seq_page_cost +
+   (1-correlation*correlation)*(spc_random_page_cost - 
spc_seq_page_cost);  // XXX: *sqrt(pages_fetched/T) ?
+   // There are two variables: fraction of pages(T) and 
correlation.
+   // If T is high, giving sequential reads, we want low 
cost_per_page regardless of correlation.
+   // If correlation is high, we (probably) want low cost per page.
+   // ...the exception is if someone does an =ANY() query on a 
list of non-consecutive values.
+   // Something like start_time=ANY('2017-01-01', '2017-02-01',...)
+   // which reads small number of rows from pages all across the 
length of the table.
+   // But index scan doesn't seem to do address that at all, so 
leave it alone for now.
+   cost_per_page=Min(cost_per_page, cost_per_page2);
+   // cost_per_page=sqrt(cost_per_page*cost_per_page + 
cost_per_page2*cost_per_page2);
+   } else
cost_per_page = spc_random_page_cost;
 
run_cost += pages_fetched * cost_per_page;
@@ -1035,15 +1057,16 @@ cost_bitmap_heap_scan(Path *path, PlannerInfo *root, 
RelOptInfo *baserel,
 
 /*
  * cost_bitmap_tree_node
- * Extract cost and selectivity from a bitmap tree node 
(index/and/or)
+ * Extract cost, selectivity, and correlation from a bitmap tree 
node (index/and/or)
  */
 void
-cost_bitmap_tree_node(Path *path, Cost *cost, Selectivity *selec)
+cost_bitmap_tree_node(Path *path, Cost *cost, Selectivity *selec, double 
*correlation)
 {
if (IsA(path, IndexPath))
{
*cost = ((IndexPath *) path)->indextotalcost;
*selec =

Re: v12.0: ERROR: could not find pathkey item to sort

Amit Langote  writes:
>> This would
>> probably require refactoring things so that there are separate
>> entry points to add child equivalences for base rels and join rels.
>> But that seems cleaner anyway than what you've got here.

> Separate entry points sounds better, but only in HEAD?

I had actually had in mind that we might have two wrappers around a
common search-and-replace routine, but after studying the code I see that
there's just enough differences to make it probably not worth the trouble
to do it like that.  I did spend a bit of time removing cosmetic
differences between the two versions, though, mostly by creating
common local variables.

I think the way you did the matching_ecs computation is actually wrong:
we need to find ECs that reference any rel of the join, not only those
that reference both inputs.  If nothing else, the way you have it here
makes the results dependent on which pair of input rels gets considered
first, and that's certainly bad for multiway joins.

Also, I thought we should try to put more conditions on whether we
invoke add_child_join_rel_equivalences at all.  In the attached I did

if ((enable_partitionwise_join || enable_partitionwise_aggregate) &&
(joinrel->has_eclass_joins ||
 has_useful_pathkeys(root, parent_joinrel)))

but I wonder if we could do more, like checking to see if the parent
joinrel is partitioned.  Alternatively, maybe it's unnecessary because
we won't try to build child joinrels unless these conditions are true?

I did not like the test case either.  Creating a whole new (and rather
large) test table just for this one case is unreasonably expensive ---
it about doubles the runtime of the equivclass test for me.  There's
already a perfectly good test table in partition_join.sql, which seems
like a more natural home for this anyhow.  After a bit of finagling
I was able to adapt the test query to fail on that table.

Patch v4 attached.  I've not looked at what we need to do to make this
work in v12.

regards, tom lane

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index ccc07ba..082776f 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -2209,7 +2209,7 @@ match_eclasses_to_foreign_key_col(PlannerInfo *root,

 /*
  * add_child_rel_equivalences
- *	  Search for EC members that reference the parent_rel, and
+ *	  Search for EC members that reference the root parent of child_rel, and
  *	  add transformed members referencing the child_rel.
  *
  * Note that this function won't be called at all unless we have at least some
@@ -2217,6 +2217,12 @@ match_eclasses_to_foreign_key_col(PlannerInfo *root,
  *
  * parent_rel and child_rel could be derived from appinfo, but since the
  * caller has already computed them, we might as well just pass them in.
+ *
+ * The passed-in AppendRelInfo is not used when the parent_rel is not a
+ * top-level baserel, since it shows the mapping from the parent_rel but
+ * we need to translate EC expressions that refer to the top-level parent.
+ * Using it is faster than using adjust_appendrel_attrs_multilevel(), though,
+ * so we prefer it when we can.
  */
 void
 add_child_rel_equivalences(PlannerInfo *root,
@@ -2224,6 +2230,8 @@ add_child_rel_equivalences(PlannerInfo *root,
 		   RelOptInfo *parent_rel,
 		   RelOptInfo *child_rel)
 {
+	Relids		top_parent_relids = child_rel->top_parent_relids;
+	Relids		child_relids = child_rel->relids;
 	int			i;

 	/*
@@ -2248,7 +2256,7 @@ add_child_rel_equivalences(PlannerInfo *root,
 			continue;

 		/* Sanity check eclass_indexes only contain ECs for parent_rel */
-		Assert(bms_is_subset(child_rel->top_parent_relids, cur_ec->ec_relids));
+		Assert(bms_is_subset(top_parent_relids, cur_ec->ec_relids));

 		/*
 		 * We don't use foreach() here because there's no point in scanning
@@ -2268,13 +2276,14 @@ add_child_rel_equivalences(PlannerInfo *root,
 			 * already-transformed child members.  Otherwise, if some original
 			 * member expression references more than one appendrel, we'd get
 			 * an O(N^2) explosion of useless derived expressions for
-			 * combinations of children.
+			 * combinations of children.  (But add_child_join_rel_equivalences
+			 * may add targeted combinations for partitionwise-join purposes.)
 			 */
 			if (cur_em->em_is_child)
 continue;		/* ignore children here */

 			/* Does this member reference child's topmost parent rel? */
-			if (bms_overlap(cur_em->em_relids, child_rel->top_parent_relids))
+			if (bms_overlap(cur_em->em_relids, top_parent_relids))
 			{
 /* Yes, generate transformed child version */
 Expr	   *child_expr;
@@ -2295,8 +2304,8 @@ add_child_rel_equivalences(PlannerInfo *root,
 	child_expr = (Expr *)
 		adjust_appendrel_attrs_multilevel(root,
 		  (Node *) cur_em->em_expr,
-		  child_rel->relids,
-		  child_rel->top_parent_relids);
+

Re: 64 bit transaction id

2019-11-02 Thread Павел Ерёмин

The proposed option does not need to change the length of either the page header or tuple header. Therefore, you will not need to physically change the data. regards01.11.2019, 20:10, "Tomas Vondra" :On Fri, Nov 01, 2019 at 10:25:17AM +0100, Pavel Stehule wrote:Hipá 1. 11. 2019 v 10:11 odesílatel Павел Ерёмин napsal: Hi. sorry for my English. I want to once again open the topic of 64 bit transaction id. I did not manage to find in the archive of the option that I want to discuss, so I write. If I searched poorly, then please forgive me. The idea is not very original and probably has already been considered, again I repeat - I did not find it. Therefore, please do not scold me severely. In discussions of 64-bit transaction id, I did not find mention of an algorithm for storing them, as it was done, for example, in MS SQL Server. What if instead of 2 fields (xmin and xmax) with a total length of 64 bits - use 1 field (let's call it xid) with a length of 64 bits in tuple header? In this field store the xid of the transaction that created the version. In this case, the new transaction in order to understand whether the read version is suitable for it or not, will have to read the next version as well. Those. The downside of such decision is of course an increase in I / O. Transactions will have to read the +1 version. On the plus side, the title of the tuple remains the same length.is 32 bit tid really problem? Why you need to know state of last 2^31transactions? Is not problem in too low usage (or maybe too high overhead)of VACUUM FREEZE.It certainly can be an issue for large and busy systems, that may needanti-wraparoud vacuum every couple of days. If that requires rewriting acouple of TB of data, it's not particularly nice. That's why 64-bit XIDswere discussed repeatedly in the past, and it's likely to get even morepressing as the systems get larger.I am not sure if increasing this range can has much more fatal problems(maybe later)Well, not fatal, but naive approaches can increase per-tuple overhead.And we already have plenty of that, hence there were proposals to usepage epochs and so on.regards-- Tomas Vondra http://www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Index Skip Scan

2019-11-02 Thread Peter Geoghegan

On Wed, Sep 25, 2019 at 2:33 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> v27-0001-Index-skip-scan.patch

Some random thoughts on this:

* Is _bt_scankey_within_page() really doing the right thing within empty pages?

It looks like you're accidentally using the high key when the leaf
page is empty with forward scans (assuming that the leaf page isn't
rightmost). You'll need to think about empty pages for both forward
and backward direction scans there.

Actually, using the high key in some cases may be desirable, once the
details are worked out -- the high key is actually very helpful with
low cardinality indexes. If you populate an index with retail
insertions (i.e. don't just do a CREATE INDEX after the table is
populated), and use low cardinality data in the indexed columns then
you'll see this effect. You can have a few hundred index entries for
each distinct value, and the page split logic added to Postgres 12 (by
commit fab25024) will still manage to "trap" each set of duplicates on
their own distinct leaf page. Leaf pages will have a high key that
looks like the values that appear on the page to the right. The goal
is for forward index scans to access the minimum number of leaf pages,
especially with low cardinality data and with multi-column indexes.
(See also: commit 29b64d1d)

A good way to see this for yourself is to get the Wisconsin Benchmark
tables (the tenk1 table and related tables from the regression tests)
populated using retail insertions. "CREATE TABLE __tenk1(like tenk1
including indexes); INSERT INTO __tenk1 SELECT * FROM tenk1;" is how I
like to set this up. Then you can see that we only access one leaf
page easily by forcing bitmap scans (i.e. "set enable* ..."), and
using "EXPLAIN (analyze, buffers) SELECT ... FROM __tenk1 WHERE ...",
where the SELECT query is a simple point lookup query (bitmap scans
happen to instrument the index buffer accesses in a way that makes it
absolutely clear how many index page buffers were touched). IIRC the
existing tenk1 indexes have no more than a few hundred duplicates for
each distinct value in all cases, so only one leaf page needs to be
accessed by simple "key = val" queries in all cases.

(I imagine that the "four" index you add in the regression test would
generally need to visit more than one leaf page for simple point
lookup queries, but in any case the high key is a useful way of
detecting a "break" in the values when indexing low cardinality data
-- these breaks are generally "aligned" to leaf page boundaries.)

I also like to visualize the keyspace of indexes when poking around at
that stuff, generally by using some of the queries that you can find
on the Wiki [1].

* The extra scankeys that you are using in most of the new nbtsearch.c
code is an insertion scankey -- not a search style scankey. I think
that you should try to be a bit clearer on that distinction in
comments. It is already confusing now, but at least there is only zero
or one insertion scankeys per scan (for the initial positioning).

* There are two _bt_skip() prototypes in nbtree.h (actually, there is
a btskip() and a _bt_skip()). I understand that the former is a public
wrapper of the latter, but I find the naming a little bit confusing.
Maybe rename _bt_skip() to something that is a little bit more
suggestive of its purpose.

* Suggest running pgindent on the patch.

[1] 
https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index
-- 
Peter Geoghegan

Re: 64 bit transaction id

2019-11-02 Thread Tomas Vondra


On Sat, Nov 02, 2019 at 07:07:17PM +0300, Павел Ерёмин wrote:

   
  The proposed option does not need to change the length of either the page
  header or tuple header. Therefore, you will not need to physically change
  the data.
   


So how do you link the tuple versions together? Clearly, that has to be
stored somewhere ...


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pg_upgrade and subscriptions

2019-11-02 Thread Oleksii Kliukin

Hello,

I came across a surprising behavior when upgrading our PostgreSQL 10 DBs that
also serve as a destination for the logical replication of some reference 
tables.

pg_upgrade turns off all subscriptions on the cluster and doesn't turn them on.
Specifically, it creates them with connect = false, as discussed at the thread
starting at
https://www.postgresql.org/message-id/e4fbfad5-c6ac-fd50-6777-18c84b34e...@2ndquadrant.com

Unfortunately, I can't find any mention of this in the docs of pg_upgrade, so
I am at leas willing to add those if we can't resolve this in a more automated
way (read below).

Since we can determine those subscriptions that were active on the old cluster
immediately before the upgrade, we could collect those and emit a script at
the end of the pg_upgrade to turn them on, similar to the one pg_upgrade
produces for the analyze.

There are two options when re-enabling the subscriptions: either continue from
the current position (possibly discarding all changes that happened while the
database was offline), or truncate the destination tables and copy the data 
again.
The first one corresponds to setting copy_data=false when doing refresh 
publication.
The second one is a combination of a prior truncate + refresh publication with
copy_data=true and doesn't seem like an action that is performed in a
single transaction. Would it make sense to add a copy_truncate flag, false
by default, that would instruct the logical replication worker to truncate the
destination table immediately before resyncing it from the origin?

Regards,
Oleksii

Re: [Proposal] Global temporary tables

And pg_catalog.pg_statistics_gtt() is set returning functions?
>

yes

I afraid that it is not acceptable solution from performance point of view:
> pg_statictic table is accessed by keys (,,)
>

I don't think so it is problem. The any component, that needs to use fast
access can use some special function that check index or check some memory
buffers.


If it can not be done using index scan, then it can cause significant
> performance slow down.
>

where you need fast access when you use SQL access? Inside postgres
optimizer is caches everywhere. And statistics cache should to know so have
to check index and some memory buffers.

The proposed view will not be used by optimizer, but it can be used by some
higher layers. I think so there is a agreement so GTT metadata should not
be stored in system catalogue. If are stored in some syscache or somewhere
else is not important in this moment. But can be nice if for user the GTT
metadata should not be black hole. I think so is better to change some
current tables to views, than use some special function just specialized
for GTT (these functions should to exists in both variants). When I think
about it - this is important not just for functionality that we expect from
GTT. It is important for consistency of Postgres catalog - how much
different should be GTT than other types of tables in system catalogue from
user's perspective.

Re: Getting psql to redisplay command after \e

Fabien COELHO  writes:
> My point is to possibly not implicitely  at the end of \e, but to 
> behave as if we were moving in history, which allows editing the lines, so 
> that you would get
>psql=> select 1
> Instead of the above.

>> I agree it might be nicer if you could do that, but that's *far* beyond 
>> the scope of this patch.  It would take entirely fundamental rethinking 
>> of our use of libreadline, if indeed it's possible at all.  I also don't 
>> see how we could have syntax-aware per-line prompts if we were allowing 
>> readline to treat the whole query as one line.

> I was suggesting something much simpler than rethinking readline handling. 
> Does not mean that it is a good idea, but while testing the patch I would 
> have liked the unfinished line to be in the current editing buffer, 
> basically as if I had not typed .

I did experiment with trying to do that, but I couldn't get it to work,
even with the single version of libreadline I had at hand.  It appears
to me that readline() starts by clearing the internal buffer.  Even if
we could persuade it to work in a particular readline version, I think
the odds of making it portable across all the libreadline and libedit
versions that are out there aren't very good.  And there's definitely
no chance of being remotely compatible with that behavior when using the
bare tty drivers (psql -n).

In practice, if you decide that you don't like what you're looking at,
you're probably going to go back into the editor to fix it, ie issue
another \e.  So I'm not sure that it's worth such pushups to get the
data into readline's buffer.

regards, tom lane

Re: dropdb --force

pá 25. 10. 2019 v 4:55 odesílatel Amit Kapila 
napsal:

> On Thu, Oct 24, 2019 at 8:22 PM Pavel Stehule 
> wrote:
> >
> > čt 24. 10. 2019 v 11:10 odesílatel Amit Kapila 
> napsal:
> >>
> >> While making some changes in the patch, I noticed that in
> >> TerminateOtherDBBackends, there is a race condition where after we
> >> release the ProcArrayLock, the backend process to which we decided to
> >> send a signal exits by itself and the same pid can be assigned to
> >> another backend which is connected to some other database.  This leads
> >> to terminating a wrong backend.  I think we have some similar race
> >> condition at few other places like in pg_terminate_backend(),
> >> ProcSleep() and CountOtherDBBackends().  I think here the risk is a
> >> bit more because there could be a long list of pids.
> >>
> >> One idea could be that we write a new function similar to IsBackendPid
> >> which takes dbid and ensures that pid belongs to that database and use
> >> that before sending kill signal, but still it will not be completely
> >> safe.  But, I think it can be closer to cases like we already have in
> >> code.
> >>
> >> Another possible idea could be to use the SendProcSignal mechanism
> >> similar to how we have used it in CancelDBBackends() to allow the
> >> required backends to exit by themselves.  This might be safer.
> >>
> >> I am not sure if we can entirely eliminate this race condition and
> >> whether it is a good idea to accept such a race condition even though
> >> it exists in other parts of code.  What do you think?
> >>
> >> BTW, I have added/revised some comments in the code and done few other
> >> cosmetic changes, the result of which is attached.
> >
> >
> > Tomorrow I'll check variants that you mentioned.
> >
> > We sure so there are not any new connect to removed database, because we
> hold lock there.
> >
>
> Right.
>
> > So check if oid db is same should be enough.
> >
>
> We can do this before sending a kill signal but is it enough?  Because
> as soon as we release ProcArrayLock anytime the other process can exit
> and a new process can use its pid.  I think this is more of a
> theoretical risk and might not be easy to hit, but still, we can't
> ignore it.
>

yes, there is a theoretical risk probably - the released pid should near
current fresh pid from range 0 .. pid_max.

Probably the best solutions is enhancing SendProcSignal and using it here
and fix CountOtherDBBackends.

Alternative solution can be killing in block 50 processes and recheck. I'll
try to write both and you can decide for one

Pavel


> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>

Re: [Proposal] Global temporary tables

2019-11-02 Thread Julien Rouhaud

On Sat, Nov 2, 2019 at 4:09 PM Konstantin Knizhnik
 wrote:
>
> On 02.11.2019 10:19, Julien Rouhaud wrote:
> > On Sat, Nov 2, 2019 at 6:31 AM Pavel Stehule  
> > wrote:
> >> pá 1. 11. 2019 v 17:09 odesílatel Konstantin Knizhnik 
> >>  napsal:
> >>> On 01.11.2019 18:26, Robert Haas wrote:
>  On Fri, Nov 1, 2019 at 11:15 AM Konstantin Knizhnik
>   wrote:
> > It seems to me that I have found quite elegant solution for per-backend 
> > statistic for GTT: I just inserting it in backend's catalog cache, but 
> > not in pg_statistic table itself.
> > To do it I have to add InsertSysCache/InsertCatCache functions which 
> > insert pinned entry in the correspondent cache.
> > I wonder if there are some pitfalls of such approach?
>  That sounds pretty hackish. You'd have to be very careful, for
>  example, that if the tables were dropped or re-analyzed, all of the
>  old entries got removed --
> >>> I have checked it:
> >>> - when table is reanalyzed, then cache entries are replaced.
> >>> - when table is dropped, then cache entries are removed.
> >>>
>  and then it would still fail if any code
>  tried to access the statistics directly from the table, rather than
>  via the caches. My assumption is that the statistics ought to be
>  stored in some backend-private data structure designed for that
>  purpose, and that the code that needs the data should be taught to
>  look for it there when the table is a GTT.
> >>> Yes, if you do "select * from pg_statistic" then you will not see
> >>> statistic for GTT in this case.
> >>> But I do not think that it is so critical. I do not believe that anybody
> >>> is trying to manually interpret values in this table.
> >>> And optimizer is retrieving statistic through sys-cache mechanism and so
> >>> is able to build correct plan in this case.
> >>
> >> Years ago, when I though about it, I wrote patch with similar design. It's 
> >> working, but surely it's ugly.
> >>
> >> I have another idea. Can be pg_statistics view instead a table?
> >>
> >> Some like
> >>
> >> SELECT * FROM pg_catalog.pg_statistics_rel
> >> UNION ALL
> >> SELECT * FROM pg_catalog.pg_statistics_gtt();
> >>
> >> Internally - when stat cache is filled, then there can be used 
> >> pg_statistics_rel and pg_statistics_gtt() directly. What I remember, there 
> >> was not possibility to work with queries, only with just relations.
> > It'd be a loss if you lose the ability to see the statistics, as there
> > are valid use cases where you need to see the stats, eg. understanding
> > why you don't get the plan you wanted.  There's also at least one
> > extension [1] that allows you to backup and use restored statistics,
> > so there are definitely people interested in it.
> >
> > [1]: https://github.com/ossc-db/pg_dbms_stats
> It seems to have completely no sense to backup and restore statistic for
> temporary tables which life time is limited to life time of backend,
> doesn't it?

In general yes I agree, but it doesn't if the goal is to understand
why even after an analyze on the temporary table your query is still
behaving poorly.  It can be useful to allow reproduction or just give
someone else the statistics to see what's going on.

Re: Remove configure --disable-float4-byval and --disable-float8-byval

Peter Eisentraut  writes:
> On 2019-11-01 15:41, Robert Haas wrote:
>> On a related note, why do we store typbyval in the catalog anyway
>> instead of inferring it from typlen and maybe typalign? It seems like
>> a bad idea to record on disk the way we pass around values in memory,
>> because it means that a change to how values are passed around in
>> memory has ramifications for on-disk compatibility.

> This sounds interesting.  It would remove a pg_upgrade hazard (in the 
> long run).

> There is some backward compatibility to be concerned about.

Yeah.  The point here is that typbyval specifies what the C functions
concerned with the datatype are expecting.  We can't just up and say
"we're going to decide that for you".

I do get the point that supporting two different typbyval options
for float8 and related types is a nontrivial cost.

regards, tom lane

Re: [Proposal] Global temporary tables

2019-11-02 Thread Konstantin Knizhnik

On 02.11.2019 8:30, Pavel Stehule wrote:

pá 1. 11. 2019 v 17:09 odesílatel Konstantin Knizhnik 
mailto:k.knizh...@postgrespro.ru>> napsal:

On 01.11.2019 18:26, Robert Haas wrote:
> On Fri, Nov 1, 2019 at 11:15 AM Konstantin Knizhnik
> mailto:k.knizh...@postgrespro.ru>>
wrote:
>> It seems to me that I have found quite elegant solution for
per-backend statistic for GTT: I just inserting it in backend's
catalog cache, but not in pg_statistic table itself.
>> To do it I have to add InsertSysCache/InsertCatCache functions
which insert pinned entry in the correspondent cache.
>> I wonder if there are some pitfalls of such approach?
> That sounds pretty hackish. You'd have to be very careful, for
> example, that if the tables were dropped or re-analyzed, all of the
> old entries got removed --

I have checked it:
- when table is reanalyzed, then cache entries are replaced.
- when table is dropped, then cache entries are removed.

> and then it would still fail if any code
> tried to access the statistics directly from the table, rather than
> via the caches. My assumption is that the statistics ought to be
> stored in some backend-private data structure designed for that
> purpose, and that the code that needs the data should be taught to
> look for it there when the table is a GTT.

Yes, if you do "select * from pg_statistic" then you will not see
statistic for GTT in this case.
But I do not think that it is so critical. I do not believe that
anybody
is trying to manually interpret values in this table.
And optimizer is retrieving statistic through sys-cache mechanism
and so
is able to build correct plan in this case.

Years ago, when I though about it, I wrote patch with similar design. 
It's working, but surely it's ugly.

I have another idea. Can be pg_statistics view instead a table?

Some like

SELECT * FROM pg_catalog.pg_statistics_rel
UNION ALL
SELECT * FROM pg_catalog.pg_statistics_gtt();

And pg_catalog.pg_statistics_gtt() is set returning functions?
I afraid that it is not acceptable solution from performance point of 
view: pg_statictic table is accessed by keys (,,)
If it can not be done using index scan, then it can cause significant 
performance slow down.

Internally - when stat cache is filled, then there can be used 
pg_statistics_rel and pg_statistics_gtt() directly. What I remember, 
there was not possibility to work with queries, only with just relations.

Or crazy idea - today we can implement own types of heaps. Is possible 
to create engine where result can be combination of some shared data 
and local data. So union will be implemented on heap level.
This implementation can be simple, just scanning pages from shared 
buffers and from local buffers. For these tables we don't need complex 
metadata. It's crazy idea, and I think so union with table function 
should be best.

Frankly speaking, implementing special heap access method for 
pg_statistic just to handle case of global temp tables seems to be overkill
from my point of view. It requires a lot coding (or at least copying a 
lot of code from heapam). Also, as I wrote above, we need also index for 
efficient lookup of statistic.

Re: [Proposal] Global temporary tables

2019-11-02 Thread Konstantin Knizhnik





On 02.11.2019 10:19, Julien Rouhaud wrote:

On Sat, Nov 2, 2019 at 6:31 AM Pavel Stehule  wrote:

pá 1. 11. 2019 v 17:09 odesílatel Konstantin Knizhnik 
 napsal:

On 01.11.2019 18:26, Robert Haas wrote:

On Fri, Nov 1, 2019 at 11:15 AM Konstantin Knizhnik
 wrote:

It seems to me that I have found quite elegant solution for per-backend 
statistic for GTT: I just inserting it in backend's catalog cache, but not in 
pg_statistic table itself.
To do it I have to add InsertSysCache/InsertCatCache functions which insert 
pinned entry in the correspondent cache.
I wonder if there are some pitfalls of such approach?

That sounds pretty hackish. You'd have to be very careful, for
example, that if the tables were dropped or re-analyzed, all of the
old entries got removed --

I have checked it:
- when table is reanalyzed, then cache entries are replaced.
- when table is dropped, then cache entries are removed.


and then it would still fail if any code
tried to access the statistics directly from the table, rather than
via the caches. My assumption is that the statistics ought to be
stored in some backend-private data structure designed for that
purpose, and that the code that needs the data should be taught to
look for it there when the table is a GTT.

Yes, if you do "select * from pg_statistic" then you will not see
statistic for GTT in this case.
But I do not think that it is so critical. I do not believe that anybody
is trying to manually interpret values in this table.
And optimizer is retrieving statistic through sys-cache mechanism and so
is able to build correct plan in this case.


Years ago, when I though about it, I wrote patch with similar design. It's 
working, but surely it's ugly.

I have another idea. Can be pg_statistics view instead a table?

Some like

SELECT * FROM pg_catalog.pg_statistics_rel
UNION ALL
SELECT * FROM pg_catalog.pg_statistics_gtt();

Internally - when stat cache is filled, then there can be used 
pg_statistics_rel and pg_statistics_gtt() directly. What I remember, there was 
not possibility to work with queries, only with just relations.

It'd be a loss if you lose the ability to see the statistics, as there
are valid use cases where you need to see the stats, eg. understanding
why you don't get the plan you wanted.  There's also at least one
extension [1] that allows you to backup and use restored statistics,
so there are definitely people interested in it.

[1]: https://github.com/ossc-db/pg_dbms_stats
It seems to have completely no sense to backup and restore statistic for 
temporary tables which life time is limited to life time of backend,

doesn't it?

Re: On disable_cost

Tomas Vondra  writes:
> On Fri, Nov 01, 2019 at 09:30:52AM -0700, Jim Finnerty wrote:
>> re: coping with adding disable_cost more than once
>> 
>> Another option would be to have a 2-part Cost structure.  If disable_cost is
>> ever added to the Cost, then you set a flag recording this.  If any plans
>> exist that have no disable_costs added to them, then the planner chooses the
>> minimum cost among those, otherwise you choose the minimum cost path.

> Yeah, I agree having is_disabled flag, and treat all paths with 'true'
> as more expensive than paths with 'false' (and when both paths have the
> same value then actually compare the cost) is probably the way forward.

It would have to be a count, not a boolean --- for example, you want to
prefer a path that uses one disabled SeqScan over a path that uses two.

I'm with Andres in being pretty worried about the extra burden imposed
on add_path comparisons.

regards, tom lane

Re: Refactor parse analysis of EXECUTE command

Peter Eisentraut  writes:
> This patch moves the parse analysis component of ExecuteQuery() and
> EvaluateParams() into a new transformExecuteStmt() that is called from
> transformStmt().

Uhmm ... no actual patch attached?

regards, tom lane

Re: On disable_cost

Zhenghua Lyu  writes:
>> I think a more robust way to disable forbidden plan types would be to
>> handle the disabling in add_path(). Instead of having a high disable cost
>> on the Path itself, the comparison add_path() would always consider
>> disabled paths as more expensive than others, regardless of the cost.

Getting rid of disable_cost would be a nice thing to do, but I would
rather not do it by adding still more complexity to add_path(), not
to mention having to bloat Paths with a separate "disabled" marker.

The idea that I've been thinking about is to not generate disabled
Paths in the first place, thus not only fixing the problem but saving
some cycles.  While this seems easy enough for "optional" paths,
we have to reserve the ability to generate certain path types regardless,
if there's no other way to implement the query.  This is a bit of a
stumbling block :-(.  At the base relation level, we could do something
like generating seqscan last, and only if no other path has been
successfully generated.  But I'm not sure how to scale that up to
joins.  In particular, imagine that we consider joining A to B, and
find that the only way is a nestloop, so we generate a nestloop join
despite that being nominally disabled.  The next join level would
then see that as an available path, and it might decide that
((A nestjoin B) join C) is the cheapest choice, even though there
might have been a way to do, say, ((A join C) join B) with no use of
nestloops.  Users would find this surprising.

Maybe the only way to do this is a separate number-of-uses-of-
disabled-plan-types cost figure in Paths, but I still don't want
to go there.  The number of cases where disable_cost's shortcomings
really matter is too small to justify that, IMHO.

regards, tom lane

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-11-02 Thread Masahiko Sawada

On Sat, 2 Nov 2019 at 21:33, Antonin Houska  wrote:

> Robert Haas  wrote:
>
> > On Tue, Aug 6, 2019 at 10:36 AM Bruce Momjian  wrote:
>
> > Seems reasonable (not that I am an encryption expert).
> >
> > > For WAL, we effectively create a 16MB bitstream, though we can create
> it
> > > in parts as needed.  (Creating it in parts is easier in CTR mode.)  The
> > > nonce is the segment number, but each 16-byte chunk uses a different
> > > counter.  Therefore, even if you are encrypting the same 8k page
> several
> > > times in the WAL, the 8k page would be different because of the LSN
> (and
> > > other changes), and the bitstream you encrypt/XOR it with would be
> > > different because the counter would be different for that offset in the
> > > WAL.
> >
> > But, if you encrypt the same WAL page several times, the LSN won't
> > change, because a WAL page doesn't have an LSN on it, and if it did,
> > it wouldn't be changing, because an LSN is just a position within the
> > WAL stream, so any given byte on any given WAL page always has the
> > same LSN, whatever it is.
> >
> > And if the counter value changed on re-encryption, I don't see how
> > we'd know what counter value to use when decrypting.  There's no way
> > for the code that is decrypting to know how many times the page got
> > rewritten as it was being filled.
> >
> > Please correct me if I'm being stupid here.
>
> In my implementation (I haven't checked whether Masahiko Sawada changed
> this
> in his patch) I avoided repeated encryption of different data using the
> same
> key+IV by omitting the unused part of the WAL page from encryption. Already
> written records can be encrypted repeatedly because they do not change.
>
>
Yeah my patch doesn't change this part. IV for WAL encryption consists of
the segment file number, page offset within segment file and the counter
for CTR cipher mode.

Regards,

--
Masahiko Sawada  http://www.2ndQuadrant.com/ 
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: alternative to PG_CATCH

Peter Eisentraut  writes:
> committed with a leading underscore

I hadn't actually tested this patch before commit, but now that
it's in, I'm seeing assorted compiler warnings:

plpy_exec.c: In function 'PLy_procedure_call':
plpy_exec.c:1028: warning: 'rv' may be used uninitialized in this function
pltcl.c: In function 'pltcl_handler':
pltcl.c:721: warning: 'retval' may be used uninitialized in this function
plpy_typeio.c: In function 'PLyObject_ToBytea':
plpy_typeio.c:904: warning: 'rv' may be used uninitialized in this function
plperl.c: In function 'plperl_call_handler':
plperl.c:1843: warning: 'retval' may be used uninitialized in this function

I'm not happy about that.

regards, tom lane

Re: Can avoid list_copy in recomputeNamespacePath() conditionally?

amul sul  writes:
> I wondered can we have a shortcut somewhat similar to following POC
> in recomputeNamespacePath () when the recomputed path is the same as the
> previous baseSearchPath/activeSearchPath :
> +   /* TODO: POC */
> +   if (equal(oidlist, baseSearchPath))
> +   return;

There's an awful lot missing from that sketch; all of the remaining
steps still need to be done:

baseCreationNamespace = firstNS;
baseTempCreationPending = temp_missing;

/* Mark the path valid. */
baseSearchPathValid = true;
namespaceUser = roleid;

/* And make it active. */
activeSearchPath = baseSearchPath;
activeCreationNamespace = baseCreationNamespace;
activeTempCreationPending = baseTempCreationPending;

/* Clean up. */
pfree(rawname);
list_free(namelist);
list_free(oidlist);

More to the point, I think the onus would be on the patch submitter
to prove that the extra complexity had some measurable benefit.
I really doubt that it would, since the list_copy is surely trivial
compared to the catalog lookup work we had to do to compute the OID
list above here.

It'd likely be more useful to see if you could reduce the number of
places where we have to invalidate the path in the first place.

regards, tom lane

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-11-02 Thread Antonin Houska

Robert Haas  wrote:

> On Tue, Aug 6, 2019 at 10:36 AM Bruce Momjian  wrote:

> Seems reasonable (not that I am an encryption expert).
> 
> > For WAL, we effectively create a 16MB bitstream, though we can create it
> > in parts as needed.  (Creating it in parts is easier in CTR mode.)  The
> > nonce is the segment number, but each 16-byte chunk uses a different
> > counter.  Therefore, even if you are encrypting the same 8k page several
> > times in the WAL, the 8k page would be different because of the LSN (and
> > other changes), and the bitstream you encrypt/XOR it with would be
> > different because the counter would be different for that offset in the
> > WAL.
> 
> But, if you encrypt the same WAL page several times, the LSN won't
> change, because a WAL page doesn't have an LSN on it, and if it did,
> it wouldn't be changing, because an LSN is just a position within the
> WAL stream, so any given byte on any given WAL page always has the
> same LSN, whatever it is.
> 
> And if the counter value changed on re-encryption, I don't see how
> we'd know what counter value to use when decrypting.  There's no way
> for the code that is decrypting to know how many times the page got
> rewritten as it was being filled.
> 
> Please correct me if I'm being stupid here.

In my implementation (I haven't checked whether Masahiko Sawada changed this
in his patch) I avoided repeated encryption of different data using the same
key+IV by omitting the unused part of the WAL page from encryption. Already
written records can be encrypted repeatedly because they do not change.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-11-02 Thread Antonin Houska

Robert Haas  wrote:

> On Mon, Aug 5, 2019 at 8:44 PM Bruce Momjian  wrote:
> > Right.  The 8k page LSN changes each time the page is modified, and the
> > is part of the page nonce.
> 
> What about hint bit changes?
> 
> I think even with wal_log_hints=on, it's not the case that *every*
> change to hint bits results in an LSN change.

Change to hint bits does not result in LSN change in the case I described here

https://www.postgresql.org/message-id/28452.1572443058%40antos

but I consider this a bug (BTW, I discovered this problem when thinking about
the use of LSN as encryption IV). Do you mean any other case? If LSN does not
get changed, then the related full-page image WAL record is not guaranteed to
be on disk during crash recovery. Thus if page checksum is invalid due to
torn-page write, there's now WAL record to fix the page.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Re: pglz performance

2019-11-02 Thread Andrey Borodin



> 1 нояб. 2019 г., в 18:48, Alvaro Herrera  
> написал(а):
> 
> On 2019-Nov-01, Peter Eisentraut wrote:
> 
>> On 2019-10-25 07:05, Andrey Borodin wrote:
 21 окт. 2019 г., в 14:09, Andrey Borodin  написал(а):
 
 With Silesian corpus pglz_decompress_hacked is actually decreasing 
 performance on high-entropy data.
 Meanwhile pglz_decompress_hacked8 is still faster than usual 
 pglz_decompress.
 In spite of this benchmarks, I think that pglz_decompress_hacked8 is safer 
 option.
>>> 
>>> Here's v3 which takes into account recent benchmarks with Silesian Corpus 
>>> and have better comments.
>> 
>> Your message from 21 October appears to say that this change makes the
>> performance worse.  So I don't know how to proceed with this.
> 
> As I understand that report, in these results "less is better", so the
> hacked8 variant shows better performance (33.8) than current (42.5).
> The "hacked" variant shows worse performance (48.2) that the current
> code.
This is correct. Thanks, Álvaro.

> The "in spite" phrase seems to have been a mistake.
Yes. Sorry, I actually thought that "in spite" is a contradiction of "despite" 
and means "In view of".

> I am surprised that there is so much variability in the performance
> numbers, though, based on such small tweaks of the code.
Silesian Corpus is very different from WALs and PG data files. Data files are 
rich in long sequences of same byte. This sequences are long, thus unrolled 
very effectively by memcpy method.
But Silesian corpus is rich in short matches of few bytes.

> 1 нояб. 2019 г., в 19:59, Tomas Vondra  
> написал(а):
> I'd try running the benchmarks to verify the numbers, and maybe do some
> additional tests, but it's not clear to me which patches should I use.
Cool, thanks!

> I think the last patches with 'hacked' and 'hacked8' in the name are a
> couple of months old, and the recent posts attach just a single patch.
> Andrey, can you post current versions of both patches?
PFA two patches:
v4-0001-Use-memcpy-in-pglz-decompression.patch (known as 'hacked' in test_pglz 
extension)
v4-0001-Use-memcpy-in-pglz-decompression-for-long-matches.patch (known as 
'hacked8')

Best regards, Andrey Borodin.


v4-0001-Use-memcpy-in-pglz-decompression.patch
Description: Binary data


v4-0001-Use-memcpy-in-pglz-decompression-for-long-matches.patch
Description: Binary data

Re: Adding percentile metrics to pg_stat_statements module

2019-11-02 Thread Adrien Nayrat

On 10/31/19 8:32 PM, Tomas Vondra wrote:
> IMO having some sort of CDF approximation (being a q-digest or t-digest)
> would be useful, although it'd probably need to be optional (mostly
> becuase of memory consumption).

+1, I like this idea. If we are afraid of CPU cost we could imagine some kind of
sampling or add the possibility to collect only for a specific queryid.

I dreamed of this kind of feature for PoWA.  Thus, it could make possible to
compare CDF between two days for example, before and after introducing a change.

Regards,

-- 
Adrien NAYRAT

Re: [Proposal] Global temporary tables

2019-11-02 Thread Julien Rouhaud

On Sat, Nov 2, 2019 at 8:23 AM Pavel Stehule  wrote:
>
> so 2. 11. 2019 v 8:18 odesílatel Julien Rouhaud  napsal:
>>
>> On Sat, Nov 2, 2019 at 6:31 AM Pavel Stehule  wrote:
>> >
>> > pá 1. 11. 2019 v 17:09 odesílatel Konstantin Knizhnik 
>> >  napsal:
>> >>
>> >> On 01.11.2019 18:26, Robert Haas wrote:
>> >> > On Fri, Nov 1, 2019 at 11:15 AM Konstantin Knizhnik
>> >> >  wrote:
>> >> >> It seems to me that I have found quite elegant solution for 
>> >> >> per-backend statistic for GTT: I just inserting it in backend's 
>> >> >> catalog cache, but not in pg_statistic table itself.
>> >> >> To do it I have to add InsertSysCache/InsertCatCache functions which 
>> >> >> insert pinned entry in the correspondent cache.
>> >> >> I wonder if there are some pitfalls of such approach?
>> >> > That sounds pretty hackish. You'd have to be very careful, for
>> >> > example, that if the tables were dropped or re-analyzed, all of the
>> >> > old entries got removed --
>> >>
>> >> I have checked it:
>> >> - when table is reanalyzed, then cache entries are replaced.
>> >> - when table is dropped, then cache entries are removed.
>> >>
>> >> > and then it would still fail if any code
>> >> > tried to access the statistics directly from the table, rather than
>> >> > via the caches. My assumption is that the statistics ought to be
>> >> > stored in some backend-private data structure designed for that
>> >> > purpose, and that the code that needs the data should be taught to
>> >> > look for it there when the table is a GTT.
>> >>
>> >> Yes, if you do "select * from pg_statistic" then you will not see
>> >> statistic for GTT in this case.
>> >> But I do not think that it is so critical. I do not believe that anybody
>> >> is trying to manually interpret values in this table.
>> >> And optimizer is retrieving statistic through sys-cache mechanism and so
>> >> is able to build correct plan in this case.
>> >
>> >
>> > Years ago, when I though about it, I wrote patch with similar design. It's 
>> > working, but surely it's ugly.
>> >
>> > I have another idea. Can be pg_statistics view instead a table?
>> >
>> > Some like
>> >
>> > SELECT * FROM pg_catalog.pg_statistics_rel
>> > UNION ALL
>> > SELECT * FROM pg_catalog.pg_statistics_gtt();
>> >
>> > Internally - when stat cache is filled, then there can be used 
>> > pg_statistics_rel and pg_statistics_gtt() directly. What I remember, there 
>> > was not possibility to work with queries, only with just relations.
>>
>> It'd be a loss if you lose the ability to see the statistics, as there
>> are valid use cases where you need to see the stats, eg. understanding
>> why you don't get the plan you wanted.  There's also at least one
>> extension [1] that allows you to backup and use restored statistics,
>> so there are definitely people interested in it.
>>
>> [1]: https://github.com/ossc-db/pg_dbms_stats
>
>
> I don't think - the extensions can use UNION and the content will be same as 
> caches used by planner.

Yes, I agree that changing pg_statistics to be a view as you showed
would fix the problem.  I was answering Konstantin's point:

>> >> But I do not think that it is so critical. I do not believe that anybody
>> >> is trying to manually interpret values in this table.
>> >> And optimizer is retrieving statistic through sys-cache mechanism and so
>> >> is able to build correct plan in this case.

which is IMHO a wrong assumption.

Re: update ALTER TABLE with ATTACH PARTITION lock mode (docs)

On Fri, Nov 01, 2019 at 11:58:43AM -0500, Justin Pryzby wrote:
> Attaching a partition acquires a SHARE UPDATE EXCLUSIVE 
> lock
> on the parent table, in addition to ACCESS EXCLUSIVE locks
> on the table to be attached and the DEFAULT partition (if
> any). 
> 

Sounds fine.  So gathering everything I get the attached.  Any
thoughts from others?
--
Michael
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index d7158c1b03..8e5592c106 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3952,14 +3952,14 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
 
  Before running the ATTACH PARTITION command, it is
  recommended to create a CHECK constraint on the table to
- be attached describing the desired partition constraint.  That way,
+ be attached matching the desired partition constraint. That way,
  the system will be able to skip the scan to validate the implicit
- partition constraint. Without such a constraint, the table will be
+ partition constraint. Without the CHECK constraint, the table will be
  scanned to validate the partition constraint while holding an
  ACCESS EXCLUSIVE lock on that partition
  and a SHARE UPDATE EXCLUSIVE lock on the parent table.
- One may then drop the constraint after ATTACH PARTITION
- is finished, because it is no longer necessary.
+ It may be desired to drop the redundant CHECK constraint
+ after ATTACH PARTITION is finished.
 
 
 
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 62542cd8a1..d3a37b4a6a 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -841,7 +841,7 @@ WITH ( MODULUS numeric_literal, REM
   or as a default partition by using DEFAULT.
   For each index in the target table, a corresponding
   one will be created in the attached table; or, if an equivalent
-  index already exists, will be attached to the target table's index,
+  index already exists, it will be attached to the target table's index,
   as if ALTER INDEX ATTACH PARTITION had been executed.
   Note that if the existing table is a foreign table, it is currently not
   allowed to attach the table as a partition of the target table if there
@@ -864,17 +864,17 @@ WITH ( MODULUS numeric_literal, REM
   already exist.
   If any of the CHECK constraints of the table being
   attached is marked NO INHERIT, the command will fail;
-  such a constraint must be recreated without the NO INHERIT
+  such constraints must be recreated without the NO INHERIT
   clause.
  
 
  
   If the new partition is a regular table, a full table scan is performed
-  to check that no existing row in the table violates the partition
+  to check that existing rows in the table do not violate the partition
   constraint.  It is possible to avoid this scan by adding a valid
-  CHECK constraint to the table that would allow only
-  the rows satisfying the desired partition constraint before running this
-  command.  It will be determined using such a constraint that the table
+  CHECK constraint to the table that allows only
+  rows satisfying the desired partition constraint before running this
+  command.  The CHECK constraint will be used to determine that the table  
   need not be scanned to validate the partition constraint.  This does not
   work, however, if any of the partition keys is an expression and the
   partition does not accept NULL values.  If attaching
@@ -900,6 +900,14 @@ WITH ( MODULUS numeric_literal, REM
   the scan of the new partition, it is always skipped when the default
   partition is a foreign table.
  
+
+ 
+  Attaching a partition acquires a
+  SHARE UPDATE EXCLUSIVE lock on the parent table,
+  in addition to ACCESS EXCLUSIVE locks on the table
+  to be attached and the default partition (if any).
+ 
+
 

 


signature.asc
Description: PGP signature

Re: Remove configure --disable-float4-byval and --disable-float8-byval

2019-11-02 Thread Peter Eisentraut


On 2019-11-01 15:41, Robert Haas wrote:

On a related note, why do we store typbyval in the catalog anyway
instead of inferring it from typlen and maybe typalign? It seems like
a bad idea to record on disk the way we pass around values in memory,
because it means that a change to how values are passed around in
memory has ramifications for on-disk compatibility.


This sounds interesting.  It would remove a pg_upgrade hazard (in the 
long run).


There is some backward compatibility to be concerned about.  This change 
would require extension authors to change their code to insert #ifdef 
USE_FLOAT8_BYVAL or similar, where currently their code might only 
support one method or the other.



rhaas=# select typname, typlen, typbyval, typalign from pg_type where
typlen in (1,2,4,8) != typbyval;


There are also typlen=6 types.  Who knew. ;-)


  typname  | typlen | typbyval | typalign
--++--+--
  macaddr8 |  8 | f| i
(1 row)


This might be a case of the above issue: It's easier to just make it 
pass by reference always than deal with a bunch of #ifdefs.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Application name for pg_basebackup and friends

On Thu, Oct 31, 2019 at 03:10:35PM +0200, Heikki Linnakangas wrote:
> You can already set application name with the environment variable or on the
> database connections string:
> 
> pg_receivewal -D /wal -d "host=myhost application_name=myreceiver"
> 
> I don't think we need a new comand line switch for it.

+1.
--
Michael


signature.asc
Description: PGP signature

Re: TestLib::command_fails_like enhancement

On Thu, Oct 31, 2019 at 01:02:58PM -0400, Andrew Dunstan wrote:
> This small patch authored by my colleague Craig Ringer enhances
> Testlib's command_fails_like by allowing the passing of extra keyword
> type arguments. The keyword initially recognized is 'extra_ipcrun_opts'.
> The value for this keyword needs to be an array, and is passed through
> to the call to IPC::Run.

Why not.

> Some patches I will be submitting shortly rely on this enhancement.

Anything submitted yet or any examples?  I was just wondering in which
case it mattered.
--
Michael


signature.asc
Description: PGP signature

Re: The command tag of "ALTER MATERIALIZED VIEW RENAME COLUMN"

On Fri, Nov 01, 2019 at 02:17:03PM +0500, Ibrar Ahmed wrote:
> Do we really need a regression test cases for such small oversights?

It is possible to get the command tags using an event trigger...  But
that sounds hack-ish.
--
Michael


signature.asc
Description: PGP signature

Re: Remove configure --disable-float4-byval and --disable-float8-byval

2019-11-02 Thread Peter Eisentraut


On 2019-10-31 14:36, Tom Lane wrote:

Peter Eisentraut  writes:

float4 is now always pass-by-value; the pass-by-reference code path is
completely removed.


I think this is OK.


OK, here is a patch for just this part, and we can continue the 
discussion on the rest in the meantime.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 117590e3e1bb74ce8bebbd4791fcfcb514e24136 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sat, 2 Nov 2019 08:09:01 +0100
Subject: [PATCH v2] Remove configure --disable-float4-byval

This build option was only useful to maintain compatibility for
version-0 functions, but those are no longer supported, so this option
can be removed.

float4 is now always pass-by-value; the pass-by-reference code path is
completely removed.

Discussion: 
https://www.postgresql.org/message-id/flat/f3e1e576-2749-bbd7-2d57-3f9dcf752...@2ndquadrant.com
---
 configure   | 42 -
 configure.in| 10 --
 doc/src/sgml/func.sgml  |  5 ---
 doc/src/sgml/installation.sgml  | 12 ---
 src/backend/access/index/indexam.c  |  5 ---
 src/backend/access/transam/xlog.c   | 17 --
 src/backend/bootstrap/bootstrap.c   |  2 +-
 src/backend/catalog/genbki.pl   |  2 +-
 src/backend/commands/analyze.c  |  2 +-
 src/backend/utils/fmgr/dfmgr.c  |  9 --
 src/backend/utils/fmgr/fmgr.c   | 19 ++-
 src/backend/utils/misc/pg_controldata.c | 17 --
 src/bin/initdb/initdb.c |  3 --
 src/bin/pg_controldata/pg_controldata.c |  2 --
 src/bin/pg_resetwal/pg_resetwal.c   |  3 --
 src/include/catalog/catversion.h|  2 +-
 src/include/catalog/pg_control.h|  4 +--
 src/include/catalog/pg_proc.dat |  6 ++--
 src/include/catalog/pg_type.dat |  2 +-
 src/include/fmgr.h  |  2 --
 src/include/pg_config.h.in  |  7 -
 src/include/postgres.h  | 23 +-
 src/tools/msvc/Solution.pm  | 11 ---
 src/tools/msvc/config_default.pl|  1 -
 24 files changed, 18 insertions(+), 190 deletions(-)

diff --git a/configure b/configure
index 6b1c779ee3..accd06c2e6 100755
--- a/configure
+++ b/configure
@@ -866,7 +866,6 @@ with_system_tzdata
 with_zlib
 with_gnu_ld
 enable_largefile
-enable_float4_byval
 enable_float8_byval
 '
   ac_precious_vars='build_alias
@@ -1525,7 +1524,6 @@ Optional Features:
   --enable-cassertenable assertion checks (for debugging)
   --disable-thread-safety disable thread-safety in client libraries
   --disable-largefile omit support for large files
-  --disable-float4-byval  disable float4 passed by value
   --disable-float8-byval  disable float8 passed by value
 
 Optional Packages:
@@ -16801,46 +16799,6 @@ _ACEOF
 
 
 
-# Decide whether float4 is passed by value: user-selectable, enabled by default
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with float4 
passed by value" >&5
-$as_echo_n "checking whether to build with float4 passed by value... " >&6; }
-
-
-# Check whether --enable-float4-byval was given.
-if test "${enable_float4_byval+set}" = set; then :
-  enableval=$enable_float4_byval;
-  case $enableval in
-yes)
-
-$as_echo "#define USE_FLOAT4_BYVAL 1" >>confdefs.h
-
-   float4passbyval=true
-  ;;
-no)
-  float4passbyval=false
-  ;;
-*)
-  as_fn_error $? "no argument expected for --enable-float4-byval option" 
"$LINENO" 5
-  ;;
-  esac
-
-else
-  enable_float4_byval=yes
-
-$as_echo "#define USE_FLOAT4_BYVAL 1" >>confdefs.h
-
-   float4passbyval=true
-fi
-
-
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $enable_float4_byval" >&5
-$as_echo "$enable_float4_byval" >&6; }
-
-cat >>confdefs.h <<_ACEOF
-#define FLOAT4PASSBYVAL $float4passbyval
-_ACEOF
-
-
 # Decide whether float8 is passed by value.
 # Note: this setting also controls int8 and related types such as timestamp.
 # If sizeof(Datum) >= 8, this is user-selectable, enabled by default.
diff --git a/configure.in b/configure.in
index 2b9025cac3..e5ad3bf883 100644
--- a/configure.in
+++ b/configure.in
@@ -1931,16 +1931,6 @@ AC_CHECK_SIZEOF([void *])
 AC_CHECK_SIZEOF([size_t])
 AC_CHECK_SIZEOF([long])
 
-# Decide whether float4 is passed by value: user-selectable, enabled by default
-AC_MSG_CHECKING([whether to build with float4 passed by value])
-PGAC_ARG_BOOL(enable, float4-byval, yes, [disable float4 passed by value],
-  [AC_DEFINE([USE_FLOAT4_BYVAL], 1,
- [Define to 1 if you want float4 values to be passed 
by value. (--enable-float4-byval)])
-   float4passbyval=true],
-  [float4passbyval=false])
-AC_MSG_RESULT([$enable_float4_byval])
-AC_DEFINE_UNQUOTED([FLOAT4PASSBYVAL], [$float4passbyval], [float4 values are

Re: [PATCH][DOC] Fix for PREPARE TRANSACTION doc and postgres_fdw message.

On Fri, Nov 01, 2019 at 05:29:23PM +0100, Gilles Darold wrote:
> I have attached a patch to the documentation that adds remote tables to
> the list of objects where any operation prevent using a prepared
> transaction, currently it is just notified "operations involving
> temporary tables or the session's temporary namespace".

Perhaps we had better use foreign tables for the error message and the
docs?
--
Michael


signature.asc
Description: PGP signature

Re: [Proposal] Global temporary tables

so 2. 11. 2019 v 8:23 odesílatel Pavel Stehule 
napsal:

>
>
> so 2. 11. 2019 v 8:18 odesílatel Julien Rouhaud 
> napsal:
>
>> On Sat, Nov 2, 2019 at 6:31 AM Pavel Stehule 
>> wrote:
>> >
>> > pá 1. 11. 2019 v 17:09 odesílatel Konstantin Knizhnik <
>> k.knizh...@postgrespro.ru> napsal:
>> >>
>> >> On 01.11.2019 18:26, Robert Haas wrote:
>> >> > On Fri, Nov 1, 2019 at 11:15 AM Konstantin Knizhnik
>> >> >  wrote:
>> >> >> It seems to me that I have found quite elegant solution for
>> per-backend statistic for GTT: I just inserting it in backend's catalog
>> cache, but not in pg_statistic table itself.
>> >> >> To do it I have to add InsertSysCache/InsertCatCache functions
>> which insert pinned entry in the correspondent cache.
>> >> >> I wonder if there are some pitfalls of such approach?
>> >> > That sounds pretty hackish. You'd have to be very careful, for
>> >> > example, that if the tables were dropped or re-analyzed, all of the
>> >> > old entries got removed --
>> >>
>> >> I have checked it:
>> >> - when table is reanalyzed, then cache entries are replaced.
>> >> - when table is dropped, then cache entries are removed.
>> >>
>> >> > and then it would still fail if any code
>> >> > tried to access the statistics directly from the table, rather than
>> >> > via the caches. My assumption is that the statistics ought to be
>> >> > stored in some backend-private data structure designed for that
>> >> > purpose, and that the code that needs the data should be taught to
>> >> > look for it there when the table is a GTT.
>> >>
>> >> Yes, if you do "select * from pg_statistic" then you will not see
>> >> statistic for GTT in this case.
>> >> But I do not think that it is so critical. I do not believe that
>> anybody
>> >> is trying to manually interpret values in this table.
>> >> And optimizer is retrieving statistic through sys-cache mechanism and
>> so
>> >> is able to build correct plan in this case.
>> >
>> >
>> > Years ago, when I though about it, I wrote patch with similar design.
>> It's working, but surely it's ugly.
>> >
>> > I have another idea. Can be pg_statistics view instead a table?
>> >
>> > Some like
>> >
>> > SELECT * FROM pg_catalog.pg_statistics_rel
>> > UNION ALL
>> > SELECT * FROM pg_catalog.pg_statistics_gtt();
>> >
>> > Internally - when stat cache is filled, then there can be used
>> pg_statistics_rel and pg_statistics_gtt() directly. What I remember, there
>> was not possibility to work with queries, only with just relations.
>>
>> It'd be a loss if you lose the ability to see the statistics, as there
>> are valid use cases where you need to see the stats, eg. understanding
>> why you don't get the plan you wanted.  There's also at least one
>> extension [1] that allows you to backup and use restored statistics,
>> so there are definitely people interested in it.
>>
>> [1]: https://github.com/ossc-db/pg_dbms_stats
>
>
> I don't think - the extensions can use UNION and the content will be same
> as caches used by planner.
>

sure, if some one try to modify directly system tables, then it should be
fixed.

Re: [Proposal] Global temporary tables