[PATCH] psql: add tab completion for \df slash command suffixes

2019-08-29 Thread Ian Barwick

Hi

I just noticed "\df[TAB]" fails to offer any tab-completion for
the possible suffixes ("\dfa", "\dfn", "\dfp", "\dft" and "\dfw").
Trivial patch attached, which applies back to Pg96, and separate
patches for Pg95 and Pg94.

I'll add this to the next commitfest.

Regards

Ian Barwick

--
 Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
commit 41a4ca2f70a1ea3bf0deb17ad54a14446de0a300
Author: Ian Barwick 
Date:   Fri Aug 30 13:30:47 2019 +0900

psql: add tab completion for \df slash commands

"\df" itself was listed for tab completion, but none of the
possible suffixes were.

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index bcc7404c55..bc8a7bd78b 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1418,7 +1418,8 @@ psql_completion(const char *text, int start, int end)
 		"\\connect", "\\conninfo", "\\C", "\\cd", "\\copy",
 		"\\copyright", "\\crosstabview",
 		"\\d", "\\da", "\\dA", "\\db", "\\dc", "\\dC", "\\dd", "\\ddp", "\\dD",
-		"\\des", "\\det", "\\deu", "\\dew", "\\dE", "\\df",
+		"\\des", "\\det", "\\deu", "\\dew", "\\dE",
+		"\\df", "\\dfa", "\\dfn", "\\dfp", "\\dft", "\\dfw",
 		"\\dF", "\\dFd", "\\dFp", "\\dFt", "\\dg", "\\di", "\\dl", "\\dL",
 		"\\dm", "\\dn", "\\do", "\\dO", "\\dp", "\\dP", "\\dPi", "\\dPt",
 		"\\drds", "\\dRs", "\\dRp", "\\ds", "\\dS",
commit 616745c6c2c1f96446cebed7dbadd6f6307c7817
Author: Ian Barwick 
Date:   Fri Aug 30 13:45:24 2019 +0900

psql: add tab completion for \df slash commands

"\df" itself was listed for tab completion, but none of the
possible suffixes were.

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index d6988f2654..9c8a18ab48 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -870,7 +870,8 @@ psql_completion(const char *text, int start, int end)
 
 	static const char *const backslash_commands[] = {
 		"\\a", "\\connect", "\\conninfo", "\\C", "\\cd", "\\copy", "\\copyright",
-		"\\d", "\\da", "\\db", "\\dc", "\\dC", "\\dd", "\\dD", "\\des", "\\det", "\\deu", "\\dew", "\\df",
+		"\\d", "\\da", "\\db", "\\dc", "\\dC", "\\dd", "\\dD", "\\des", "\\det", "\\deu", "\\dew",
+		"\\df", "\\dfa", "\\dfn", "\\dfp", "\\dft", "\\dfw",
 		"\\dF", "\\dFd", "\\dFp", "\\dFt", "\\dg", "\\di", "\\dl", "\\dL",
 		"\\dn", "\\do", "\\dp", "\\drds", "\\ds", "\\dS", "\\dt", "\\dT", "\\dv", "\\du", "\\dx",
 		"\\e", "\\echo", "\\ef", "\\encoding",
commit 8ed98bb4cad91c71f0a04c28428c1e4cd862d938
Author: Ian Barwick 
Date:   Fri Aug 30 13:41:09 2019 +0900

psql: add tab completion for \df slash commands

"\df" itself was listed for tab completion, but none of the
possible suffixes were.

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index b657ab7b8d..784e2a6bad 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -913,7 +913,8 @@ psql_completion(const char *text, int start, int end)
 	static const char *const backslash_commands[] = {
 		"\\a", "\\connect", "\\conninfo", "\\C", "\\cd", "\\copy", "\\copyright",
 		"\\d", "\\da", "\\db", "\\dc", "\\dC", "\\dd", "\\ddp", "\\dD",
-		"\\des", "\\det", "\\deu", "\\dew", "\\dE", "\\df",
+		"\\des", "\\det", "\\deu", "\\dew", "\\dE",
+		"\\df", "\\dfa", "\\dfn", "\\dfp", "\\dft", "\\dfw",
 		"\\dF", "\\dFd", "\\dFp", "\\dFt", "\\dg", "\\di", "\\dl", "\\dL",
 		"\\dm", "\\dn", "\\do", "\\dO", "\\dp", "\\drds", "\\ds", "\\dS",
 		"\\dt", "\\dT", "\\dv", "\\du", "\\dx", "\\dy",


Re: Improve error detections in TAP tests by spreading safe_psql

2019-08-29 Thread Michael Paquier
On Wed, Aug 28, 2019 at 09:44:58PM -0400, Tom Lane wrote:
> Well, it's useful if you just want the stdout back.  But its name
> is a bit misleading if the default behavior of psql is just as
> safe.  Not sure whether renaming it is worthwhile.

It is not that complicated to capture stdout with
PostgresNode::psql either, so if we are making the default of one
(PostgresNode::psql) look like the other (PostgresNode::safe_psql),
then we lose no actual feature by dropping safe_psql.

> Yeah, but only if the test cases are independent, which I think is
> mostly not true in our TAP scripts.  Otherwise you're just looking
> at cascading errors.

Yep.  We have a couple of cases though where things are quite
independent, like the 2PC suite divided into sequences of different
transactions, but mostly there are dependencies between one step and
the follow-up ones.

> Yup, the tradeoff is people possibly having test scripts outside
> our tree that would break, versus the possibility that we'll back-patch
> test changes that don't behave as expected in back branches.  I don't
> know if the former is true, but I'm afraid the latter is a certainty
> if we don't back-patch.

Exactly, that's why I am wondering how widespread the breakage in those
out-of-tree tests would be, given that we have had PostgresNode since 9.6.
Facing this choice makes me uneasy, which is why I would tend to only
do incompatible things on HEAD.  Another, safer, possibility would be
to introduce in back-branches an extra psql-like routine enforcing
errors which we use in our tests, to keep the former ones for
compatibility, and to drop the old ones on HEAD.  It is never fun to
face sudden errors on a system when doing a minor upgrade.
--
Michael




Re: Yet another fast GiST build

2019-08-29 Thread Peter Geoghegan
On Thu, Aug 29, 2019 at 8:22 PM Alexander Korotkov
 wrote:
> Alternatively you can encode size in Z-value.  But this increases
> dimensionality of space and decreases efficiency of join.  Also,
> spatial join can be made using two indexes, even just current GiST
> without Z-values.  We've prototyped that, see [1].

I'm pretty sure that spatial joins generally need two spatial indexes
(usually R-Trees). There seems to have been quite a lot of research into
this in the 1990s.

-- 
Peter Geoghegan




Re: Consolidate 'unique array values' logic into a reusable function?

2019-08-29 Thread Thomas Munro
Hello,

I'm reviving a thread from 2016, because I wanted this thing again today.

Tom Lane  wrote:
> Thomas Munro  writes:
> > Here's a sketch patch that creates a function array_unique which takes
> > the same arguments as qsort or qsort_arg and returns the new length.
>
> Hmm ... I'd be against using this in backend/regex/, because I still
> have hopes of converting that to a standalone library someday (and
> in any case it needs to stay compatible with Tcl's copy of the code).
> But otherwise this seems like a reasonable proposal.
>
> As for the function name, maybe "qunique()" to go with "qsort()"?
> I'm not thrilled with "array_unique" because that sounds like it
> is meant for Postgres' array data types.

OK, here it is renamed to qunique() and qunique_arg().  It's a bit odd
because it has nothing to do with the quicksort algorithm, but it makes
some sense because it's always used with qsort().  I suppose we could
save a few more lines if there were a qsort_unique() function that
does both, since the arguments are identical.  I also moved it into a
new header lib/qunique.h.  Any better ideas for where it should live?
I removed the hunk under regex.
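
For anyone skimming the thread, a minimal sketch of the shape qunique()
takes -- the version in the attached patch may differ in details:

#include <stddef.h>
#include <string.h>

/*
 * Sketch of qunique(): remove adjacent duplicates from a sorted array in
 * place and return the new element count.  Same argument style as qsort(),
 * so the two compose naturally.
 */
static size_t
qunique(void *array, size_t elements, size_t width,
        int (*compare) (const void *, const void *))
{
    char   *bytes = (char *) array;
    size_t  i, j;

    if (elements <= 1)
        return elements;

    for (i = 1, j = 0; i < elements; i++)
    {
        if (compare(bytes + i * width, bytes + j * width) != 0)
        {
            ++j;
            if (j != i)
                memcpy(bytes + j * width, bytes + i * width, width);
        }
    }
    return j + 1;
}

Typical use is a pair of calls such as qsort(a, n, sizeof(int), int_cmp)
followed by n = qunique(a, n, sizeof(int), int_cmp), where int_cmp is
whatever comparator was used for the sort.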

One thing I checked is that on my system it is inlined along with the
comparator when that is visible, so no performance should be lost by
throwing away the open coded versions.  This makes me think that eg
oid_cmp() should probably be defined in a header; clearly we're also
carrying a few functions that should be consolidated into a new
int32_cmp() function, somewhere, too.  (It might also be interesting
to use the pg_attribute_always_inline trick to instantiate some common
qsort() specialisations for a bit of speed-up, but that's another
topic.)

Adding to CF.

--
Thomas Munro
https://enterprisedb.com
From b5fd67b9586cfdc3e66ccbdf019ec6d95f6206ec Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 30 Aug 2019 13:41:04 +1200
Subject: [PATCH] Consolidate code that makes a sorted array unique.

Introduce qunique() and qunique_arg() which can be used after qsort()
and qsort_arg() respectively to remove duplicate values.

Author: Thomas Munro
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAEepm%3D2vmFTNpAmwbGGD2WaryM6T3hSDVKQPfUwjdD_5XY6vAA%40mail.gmail.com
---
 contrib/hstore/hstore_io.c   |  5 +++
 contrib/intarray/_int_tool.c | 19 +++-
 contrib/pg_trgm/trgm_op.c| 25 ++-
 src/backend/access/nbtree/nbtutils.c | 19 ++--
 src/backend/executor/nodeTidscan.c   | 13 ++
 src/backend/utils/adt/acl.c  | 15 ++-
 src/backend/utils/adt/tsgistidx.c| 29 ++---
 src/backend/utils/adt/tsquery_op.c   | 29 ++---
 src/backend/utils/adt/tsvector.c |  5 ++-
 src/backend/utils/adt/tsvector_op.c  | 59 -
 src/backend/utils/cache/syscache.c   | 21 +++--
 src/include/lib/qunique.h| 65 
 12 files changed, 113 insertions(+), 191 deletions(-)
 create mode 100644 src/include/lib/qunique.h

diff --git a/contrib/hstore/hstore_io.c b/contrib/hstore/hstore_io.c
index 745497c76f..86d7c7c2d8 100644
--- a/contrib/hstore/hstore_io.c
+++ b/contrib/hstore/hstore_io.c
@@ -323,6 +323,11 @@ hstoreUniquePairs(Pairs *a, int32 l, int32 *buflen)
 	}
 
 	qsort((void *) a, l, sizeof(Pairs), comparePairs);
+
+	/*
+	 * We can't use qunique here because we have some clean-up code to run on
+	 * removed elements.
+	 */
 	ptr = a + 1;
 	res = a;
 	while (ptr - a < l)
diff --git a/contrib/intarray/_int_tool.c b/contrib/intarray/_int_tool.c
index fde8d15e2c..27d9ce7297 100644
--- a/contrib/intarray/_int_tool.c
+++ b/contrib/intarray/_int_tool.c
@@ -6,6 +6,7 @@
 #include 
 
 #include "catalog/pg_type.h"
+#include "lib/qunique.h"
 
 #include "_int.h"
 
@@ -310,23 +311,13 @@ internal_size(int *a, int len)
 ArrayType *
 _int_unique(ArrayType *r)
 {
-	int		   *tmp,
-			   *dr,
-			   *data;
 	int			num = ARRNELEMS(r);
+	bool		duplicates_found;	/* not used */
 
-	if (num < 2)
-		return r;
+	num = qunique_arg(ARRPTR(r), num, sizeof(int), isort_cmp,
+					  &duplicates_found);
 
-	data = tmp = dr = ARRPTR(r);
-	while (tmp - data < num)
-	{
-		if (*tmp != *dr)
-			*(++dr) = *tmp++;
-		else
-			tmp++;
-	}
-	return resize_intArrayType(r, dr + 1 - ARRPTR(r));
+	return resize_intArrayType(r, num);
 }
 
 void
diff --git a/contrib/pg_trgm/trgm_op.c b/contrib/pg_trgm/trgm_op.c
index 0d4614e9c8..42fb625e7f 100644
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -8,6 +8,7 @@
 #include "trgm.h"
 
 #include "catalog/pg_type.h"
+#include "lib/qunique.h"
 #include "tsearch/ts_locale.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -163,26 +164,6 @@ comp_trgm(const void *a, const void *b)
 	return CMPTRGM(a, b);
 }
 
-static int
-unique_array(trgm *a, int len)
-{
-	trgm	   *curend,
-			   *tmp;
-
-	curend = tmp = a;
-	while (tmp - a < len)
-		if (CMPTRGM(tmp, curend))
-		{
-			curend++;
-			CPTRGM(curend, tmp);
-			tmp++;
-		}
-		else
-			tmp++;
-
-	

Re: Yet another fast GiST build

2019-08-29 Thread Alexander Korotkov
On Fri, Aug 30, 2019 at 2:34 AM Peter Geoghegan  wrote:
> On Thu, Aug 29, 2019 at 3:48 PM Alexander Korotkov
>  wrote:
> > > As you can see, Z-order build is an order of magnitude faster. Select
> > > performance is roughly the same. Also, index is significantly smaller.
> >
> > Cool!  These experiments bring me to following thoughts.  Can we not
> > only build, but also maintain GiST indexes in B-tree-like manner?  If
> > we put Z-value together with MBR to the non-leaf keys, that should be
> > possible.  Maintaining it in B-tree-like manner would have a lot of
> > advantages.
>
> I'm not an expert on GiST, but that seems like it would have a lot of
> advantages in the long term. It is certainly theoretically appealing.
>
> Could this make it easier to use merge join with containment
> operators? I'm thinking of things like geospatial joins, which can
> generally only be performed as nested loop joins at the moment. This
> is often wildly inefficient.

AFAICS, spatial joins aren't going to be as easy as just merge joins
on Z-value.  When searching for close points in two datasets (closer
than a given threshold) we can scan some ranges of Z-values in one
dataset while iterating over the other.  But dealing with extended
spatial objects is not that easy.  In order to determine whether there are
matching values in a given Z-range you also need to be aware of the size
of the objects inside that Z-range.  So, before a merge-like join you
need to not only sort, but build some index-like structure.

Alternatively you can encode size in Z-value.  But this increases
dimensionality of space and decreases efficiency of join.  Also,
spatial join can be made using two indexes, even just current GiST
without Z-values.  We've prototyped that, see [1].

Links
1. https://github.com/pgsphere/pgsphere/blob/crossmatch_cnode/crossmatch.c

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: A problem about partitionwise join

2019-08-29 Thread Richard Guo
On Fri, Aug 30, 2019 at 2:08 AM Etsuro Fujita 
wrote:

> On Thu, Aug 29, 2019 at 6:45 PM Richard Guo  wrote:
> > On Wed, Aug 28, 2019 at 6:49 PM Etsuro Fujita 
> wrote:
> >> On Tue, Aug 27, 2019 at 4:57 PM Richard Guo  wrote:
> >> > Check the query below as a more illustrative example:
> >> >
> >> > create table p (k int, val int) partition by range(k);
> >> > create table p_1 partition of p for values from (1) to (10);
> >> > create table p_2 partition of p for values from (10) to (100);
> >> >
> >> > If we use quals 'foo.k = bar.k and foo.k = bar.val', we can generate
> >> > partitionwise join:
> >> >
> >> > # explain (costs off)
> >> > select * from p as foo join p as bar on foo.k = bar.k and foo.k = bar.val;
> >> >QUERY PLAN
> >> > -
> >> >  Append
> >> >->  Hash Join
> >> >  Hash Cond: (foo.k = bar.k)
> >> >  ->  Seq Scan on p_1 foo
> >> >  ->  Hash
> >> >->  Seq Scan on p_1 bar
> >> >  Filter: (k = val)
> >> >->  Hash Join
> >> >  Hash Cond: (foo_1.k = bar_1.k)
> >> >  ->  Seq Scan on p_2 foo_1
> >> >  ->  Hash
> >> >->  Seq Scan on p_2 bar_1
> >> >  Filter: (k = val)
> >> > (13 rows)
> >> >
> >> > But if we exchange the order of the two quals to 'foo.k = bar.val and
> >> > foo.k = bar.k', then partitionwise join cannot be generated any more,
> >> > because we only have joinclause 'foo.k = bar.val' as it first reached
> >> > score of 3. We have missed the joinclause on the partition key although
> >> > it does exist.
> >> >
> >> > # explain (costs off)
> >> > select * from p as foo join p as bar on foo.k = bar.val and foo.k = bar.k;
> >> >QUERY PLAN
> >> > -
> >> >  Hash Join
> >> >Hash Cond: (foo.k = bar.val)
> >> >->  Append
> >> >  ->  Seq Scan on p_1 foo
> >> >  ->  Seq Scan on p_2 foo_1
> >> >->  Hash
> >> >  ->  Append
> >> >->  Seq Scan on p_1 bar
> >> >  Filter: (val = k)
> >> >->  Seq Scan on p_2 bar_1
> >> >  Filter: (val = k)
> >> > (11 rows)
> >>
> >> I think it would be nice if we can address this issue.
>
> > Attached is a patch as an attempt to address this issue. The idea is
> > quite straightforward. When building partition info for joinrel, we
> > generate any possible EC-derived joinclauses of form 'outer_em =
> > inner_em', which will be used together with the original restrictlist to
> > check if there exists an equi-join condition for each pair of partition
> > keys.
>
> Thank you for the patch!  Will review.  Could you add the patch to the
> upcoming CF so that it doesn’t get lost?
>

Added this patch: https://commitfest.postgresql.org/24/2266/

Thanks
Richard


Re: Wrong value in metapage of GIN INDEX.

2019-08-29 Thread keisuke kuroda
Hi Moon-san.

Thank you for posting.

We are testing the GIN index on the JSONB type.
The default maintenance_work_mem (64MB) was fine in most cases.
However, this problem occurs when indexing very large JSONB data.

best regards,
Keisuke Kuroda

On Thu, Aug 29, 2019 at 17:20, Moon, Insung wrote:

> Dear Hackers.
>
> Kuroda-san and I are interested in the GIN index and have been testing
> various things.
> While testing, we found a little bug.
> In some cases, the value of nEntries in the metapage was set to the
> wrong value.
>
> Here is how to reproduce the bug:
> =# SET maintenance_work_mem TO '1MB';
> =# CREATE TABLE foo(i jsonb);
> =# INSERT INTO foo(i) select jsonb_build_object('foobar001', i) FROM
> generate_series(1, 1) AS i;
>
> # Input the same value again.
> =# INSERT INTO foo(i) select jsonb_build_object('foobar001', i) FROM
> generate_series(1, 1) AS i;
> # Creates GIN Index.
> =# CREATE INDEX foo_idx ON foo USING gin (i jsonb_ops) WITH
> (fastupdate=off);
>
>
> =# SELECT * FROM gin_metapage_info(get_raw_page('foo_idx', 0));
> -[ RECORD 1 ]+---
> pending_head | 4294967295
> pending_tail | 4294967295
> tail_free_size   | 0
> n_pending_pages  | 0
> n_pending_tuples | 0
> n_total_pages| 74
> n_entry_pages| 69
> n_data_pages | 4
> n_entries| 20004 <--★
> version  | 2
>
> In this example, the nentries value should be 10001, because the gin
> index stores duplicate values in one leaf (posting tree or posting
> list).
> But if we look at the nentries value of the metapage using pageinspect,
> it is stored as 20004.
> So, let's run VACUUM.
>
>
> =# VACUUM foo;
> =# SELECT * FROM gin_metapage_info(get_raw_page('foo_idx', 0));
> -[ RECORD 1 ]+---
> pending_head | 4294967295
> pending_tail | 4294967295
> tail_free_size   | 0
> n_pending_pages  | 0
> n_pending_tuples | 0
> n_total_pages| 74
> n_entry_pages| 69
> n_data_pages | 4
> n_entries| 10001 <--★
> version  | 2
>
> Ah. After running VACUUM, nEntries changes to the correct value.
>
> There is a problem in the ginEntryInsert function. During the table
> scan that builds the gin index, the ginBuildCallback function
> accumulates the new heap values inside the GinBuildState struct.
> Then, once the memory used by the GinBuildState struct is equal to or
> larger than the maintenance_work_mem value, the accumulated values are
> flushed into the GIN index.
> This flushing is done by the ginEntryInsert function.
> ginEntryInsert should increase the value of nEntries only when it
> determines that a new entry is being added.
> However, currently ginEntryInsert first increases the value of
> nEntries, and only then determines whether an equal entry already
> exists in the GIN index.
> That causes the bug.
>
> The patch is very simple.
> Fix it so that the value of nEntries is increased only when a
> non-duplicate GIN entry is added.
>
> This bug detection and code fix worked with Kuroda-san.
>
> Best Regards.
> Moon.
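
The shape of the fix described above is roughly the following -- an
illustrative sketch with hypothetical helper names, not the actual
ginEntryInsert code:

/* Count an entry only after we know it is genuinely new. */
if (entry_already_exists_in_index(ginstate, attnum, key, category))
{
    /* duplicate key: append the item pointers, do not bump nEntries */
    add_items_to_existing_entry(ginstate, items, nitem);
}
else
{
    insert_new_entry(ginstate, attnum, key, category, items, nitem);
    if (buildStats)
        buildStats->nEntries++;   /* was previously bumped unconditionally */
}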
>


Re: Yet another fast GiST build

2019-08-29 Thread Peter Geoghegan
On Thu, Aug 29, 2019 at 3:48 PM Alexander Korotkov
 wrote:
> > As you can see, Z-order build is an order of magnitude faster. Select
> > performance is roughly the same. Also, index is significantly smaller.
>
> Cool!  These experiments bring me to following thoughts.  Can we not
> only build, but also maintain GiST indexes in B-tree-like manner?  If
> we put Z-value together with MBR to the non-leaf keys, that should be
> possible.  Maintaining it in B-tree-like manner would have a lot of
> advantages.

I'm not an expert on GiST, but that seems like it would have a lot of
advantages in the long term. It is certainly theoretically appealing.

Could this make it easier to use merge join with containment
operators? I'm thinking of things like geospatial joins, which can
generally only be performed as nested loop joins at the moment. This
is often wildly inefficient.

-- 
Peter Geoghegan




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-08-29 Thread Peter Geoghegan
On Thu, Aug 29, 2019 at 5:13 AM Anastasia Lubennikova
 wrote:
> Your explanation helped me to understand that this approach can be
> extended to
> the case of insertion into posting list, that doesn't trigger posting
> split,
> and that nbtsplitloc indeed doesn't need to know about posting tuples
> specific.
> The code is much cleaner now.

Fantastic!

> Some individual indexes are larger, some are smaller compared to the
> expected output.

I agree that v9 might be ever so slightly more space efficient than v5
was, on balance. In any case v9 completely fixes the regression that I
saw in the last version. I have pushed the changes to the test output
for the serial tests that I privately maintain, that I gave you access
to. The MGD test output also looks perfect.

We may find that deduplication is a little too effective, in the sense
that it packs so many tuples on to leaf pages that *concurrent*
inserters will tend to get excessive page splits. We may find that it
makes sense to aim for posting lists that are maybe 96% of
BTMaxItemSize() -- note that BTREE_SINGLEVAL_FILLFACTOR is 96 for this
reason. Concurrent inserters will tend to have heap TIDs that are
slightly out of order, so we want to at least have enough space
remaining on the left half of a "single value mode" split. We may end
up with a design where deduplication anticipates what will be useful
for nbtsplitloc.c.

I still think that it's too early to start worrying about problems
like this one -- I feel it will be useful to continue to focus on the
code and the space utilization of the serial test cases for now. We
can look at it at the same time that we think about adding back
something like BT_COMPRESS_THRESHOLD. I am mentioning it now because
it's probably a good time for you to start thinking about it, if you
haven't already (actually, maybe I'm just describing what
BT_COMPRESS_THRESHOLD was supposed to do in the first place). We'll
need to have a good benchmark to assess these questions, and it's not
obvious what that will be. Two possible candidates are TPC-H and
TPC-E. (Of course, I mean running them for real -- not using their
indexes to make sure that the nbtsplitloc.c stuff works well in
isolation.)

Any thoughts on a conventional benchmark that allows us to understand
the patch's impact on both throughput and latency?

BTW, I notice that we often have indexes that are quite a lot smaller
when they were created with retail insertions rather than with CREATE
INDEX/REINDEX. This is not new, but the difference is much larger than
it typically is without the patch. For example, the TPC-E index on
trade.t_ca_id (which is named "i_t_ca_id" or  "i_t_ca_id2" in my test)
is 162 MB with CREATE INDEX/REINDEX, and 121 MB with retail insertions
(assuming the insertions use the actual order from the test). I'm not
sure what to do about this, if anything. I mean, the reason that the
retail insertions do better is that they have the nbtsplitloc.c stuff,
and because we don't split the page until it's 100% full and until
deduplication stops helping -- we could apply several rounds of
deduplication before we actually have to split the page. So the
difference that we see here is both logical and surprising.

How do you feel about this CREATE INDEX index-size-is-larger business?

-- 
Peter Geoghegan




Re: REINDEX filtering in the backend

2019-08-29 Thread Michael Paquier
On Thu, Aug 29, 2019 at 10:52:55AM +0200, Julien Rouhaud wrote:
> That was already suggested by Thomas and seconded by Peter E., see
> https://www.postgresql.org/message-id/2b1504ac-3d6c-11ec-e1ce-3daf132b3d37%402ndquadrant.com.
> 
> I personally think that it's a sensible approach, and I'll be happy to
> propose a patch for that too if no one worked on this yet.

That may be interesting to sort out first then because we'd likely
want to know what is first in the catalogs before designing the
filtering processing looking at it, no?
--
Michael




Re: Yet another fast GiST build

2019-08-29 Thread Alexander Korotkov
On Mon, Aug 26, 2019 at 10:59 AM Andrey Borodin  wrote:
> In many cases GiST index can be build fast using z-order sorting.
>
> I've looked into proof of concept by Nikita Glukhov [0] and it looks very 
> interesting.
> So, I've implemented yet another version of B-tree-like GiST build.
> It's main use case and benefits can be summarized with small example:
>
> postgres=# create table x as select point (random(),random()) from 
> generate_series(1,300,1);
> SELECT 300
> Time: 5061,967 ms (00:05,062)
> postgres=# create index ON x using gist (point ) with 
> (fast_build_sort_function=gist_point_sortsupport);
> CREATE INDEX
> Time: 6140,227 ms (00:06,140)
> postgres=# create index ON x using gist (point );
> CREATE INDEX
> Time: 32061,200 ms (00:32,061)
>
> As you can see, Z-order build is an order of magnitude faster. Select
> performance is roughly the same. Also, index is significantly smaller.

Cool!  These experiments bring me to following thoughts.  Can we not
only build, but also maintain GiST indexes in B-tree-like manner?  If
we put Z-value together with MBR to the non-leaf keys, that should be
possible.  Maintaining it in B-tree-like manner would have a lot of
advantages.
1) Binary search in non-leaf pages instead of probing each key is much faster.
2) Split algorithm is also simpler and faster.
3) We can refind existing index tuple in predictable time of O(log N).
And this doesn't depend on MBR overlapping degree.  This is valuable
for future table AMs, which would have notion of deleting individual
index tuples (for instance, zheap promises to eventually support
delete-marking indexes).

Eventually, we may come to the idea of B-tree indexes with
user-defined additional keys in non-leaf tuples, which could be used
for scanning.
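
For readers unfamiliar with Z-values, a self-contained sketch of the 2-D
Z-order (Morton) encoding such a sort key boils down to -- illustrative
only, not the proposed sort-support code:

#include <stdint.h>

/* Spread the 32 bits of v so they occupy the even bit positions. */
static uint64_t
spread_bits(uint32_t v)
{
    uint64_t x = v;

    x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;
    x = (x | (x << 8)) & 0x00FF00FF00FF00FFULL;
    x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0FULL;
    x = (x | (x << 2)) & 0x3333333333333333ULL;
    x = (x | (x << 1)) & 0x5555555555555555ULL;
    return x;
}

/*
 * Interleave the bits of x and y into one Z-order value.  Points close in
 * 2-D space tend to be close in this 1-D ordering, which is what makes
 * sorting by it useful for building the index bottom-up.
 */
static uint64_t
zorder(uint32_t x, uint32_t y)
{
    return spread_bits(x) | (spread_bits(y) << 1);
}

(Point coordinates are float8, so a real implementation would first map
them into unsigned integer space.)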

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: BUG #15977: Inconsistent behavior in chained transactions

2019-08-29 Thread fn ln
 > The usual approach is to send self-contained and numbered patches,
> eg "chain-fix-1.patch", "chain-fix-2.patch", and so on, unless there are
> complex patches designed to be committed in stages.
Thanks, I got it. I have never made a patch before, so I'll keep that in
mind.
A self-contained patch is now attached.

> it can be left blank
I'm fine with that.


disable_implicit_transaction_chaining-3.patch
Description: Binary data


Re: pg_upgrade: Error out on too many command-line arguments

2019-08-29 Thread Peter Eisentraut
On 2019-08-26 17:45, Ibrar Ahmed wrote:
> On Mon, Aug 26, 2019 at 9:46 AM Michael Paquier  > wrote:
> 
> On Sun, Aug 25, 2019 at 05:10:47PM +0200, Julien Rouhaud wrote:
> > I did some searching, and oid2name.c is also missing this.
> 
> And pgbench, no?
> 
>  
> Yes, the patch is slightly different.

Thanks, pushed all that together.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-08-29 Thread Tomas Vondra

On Thu, Aug 29, 2019 at 05:37:45PM +0300, Alexey Kondratov wrote:

On 28.08.2019 22:06, Tomas Vondra wrote:




Interesting. Any idea where the extra overhead in this particular
case comes from? It's hard to deduce that from the single flame graph,
when I don't have anything to compare it with (i.e. the flame graph for
the "normal" case).

I guess that bottleneck is in disk operations. You can check
logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
writes (~26%) take around 35% of CPU time in summary. To compare,
please, see attached flame graph for the following transaction:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 2000)) FROM generate_series(1, 100);

Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)

where disk IO is only ~7-8% in total. So we get very roughly the same
~x4-5 performance drop here. JFYI, I am using a machine with 
SSD for tests.


Therefore, probably you may write changes on the receiver in bigger
chunks, not each change separately.


Possibly, I/O is certainly a possible culprit, although we should be
using buffered I/O and there certainly are not any fsyncs here. So I'm
not sure why it would be cheaper to do the writes in batches.

BTW does this mean you see the overhead on the apply side? Or are you
running this on a single machine, and it's difficult to decide?


I run this on a single machine, but walsender and worker are 
utilizing almost 100% of CPU per each process all the time, and 
at apply side I/O syscalls take about 1/3 of CPU time. Though I 
am still not sure, but for me this result somehow links 
performance drop with problems at receiver side.


Writing in batches was just a hypothesis and to validate it I 
have performed test with large txn, but consisting of a smaller 
number of wide rows. This test does not exhibit any significant 
performance drop, while it was streamed too. So it seems to be 
valid. Anyway, I do not have other reasonable ideas beside that 
right now.


It seems that overhead added by synchronous replica is lower by 
2-3 times compared with Postgres master and streaming with 
spilling. Therefore, the original patch eliminated delay before 
large transaction processing start by sender, while this 
additional patch speeds up the applier side.


Although the overall speed up is surely measurable, there is a 
room for improvements yet:


1) Currently bgworkers are only spawned on demand without some 
initial pool and never stopped. Maybe we should create a small 
pool on replication start and offload some of idle bgworkers if 
they exceed some limit?


2) Probably we can track somehow that incoming change has 
conflicts with some of being processed xacts, so we can wait for 
specific bgworkers only in that case?


3) Since the communication between the main logical apply worker and
each bgworker from the pool is a 'single producer --- single consumer'
problem, it is probably possible to wait and set/check flags without
locks, using just atomics.


What do you think about this concept in general? Any concerns and 
criticism are welcome!






Hi Tomas,

Thank you for a quick response.


I don't think it matters very much whether the workers are started at the
beginning or allocated ad hoc, that's IMO a minor implementation detail.


OK, I had the same vision about this point. Any minor differences here 
will be neglectable for a sufficiently large transaction.




There's one huge challenge that I however don't see mentioned in your
message or in the patch (after cursory reading) - ensuring the same commit
order, and introducing deadlocks that would not exist in single-process
apply.


Probably I haven't explained well this part, sorry for that. In my 
patch I don't use workers pool for a concurrent transaction apply, but 
rather for a fast context switch between long-lived streamed 
transactions. In other words we apply all changes arrived from the 
sender in a completely serial manner. Being written step-by-step it 
looks like:


1) Read STREAM START message and figure out the target worker by xid.

2) Put all changes which belong to this xact into the selected worker
one by one via shm_mq_send.

3) Read STREAM STOP message and wait until our worker has applied all
changes in the queue.

4) Process all other chunks of streamed xacts in the same manner.

5) Process all non-streamed xacts immediately in the main apply worker loop.

6) If we read STREAMED COMMIT/ABORT we again wait until the selected
worker either commits or aborts.

Thus, it automatically guarantees the same commit order on the replica as
on the master.  Yes, we lose some performance here, since we don't apply
transactions concurrently, but doing that would bring all those problems
you have described.




OK, so it's apply in multiple processes, but at any moment only a single
apply process is active. 

However, you helped me to figure out another point I have forgotten. 
Although we ensure commit order 

Re: A problem about partitionwise join

2019-08-29 Thread Etsuro Fujita
On Thu, Aug 29, 2019 at 6:45 PM Richard Guo  wrote:
> On Wed, Aug 28, 2019 at 6:49 PM Etsuro Fujita  wrote:
>> On Tue, Aug 27, 2019 at 4:57 PM Richard Guo  wrote:
>> > Check the query below as a more illustrative example:
>> >
>> > create table p (k int, val int) partition by range(k);
>> > create table p_1 partition of p for values from (1) to (10);
>> > create table p_2 partition of p for values from (10) to (100);
>> >
>> > If we use quals 'foo.k = bar.k and foo.k = bar.val', we can generate
>> > partitionwise join:
>> >
>> > # explain (costs off)
>> > select * from p as foo join p as bar on foo.k = bar.k and foo.k = bar.val;
>> >QUERY PLAN
>> > -
>> >  Append
>> >->  Hash Join
>> >  Hash Cond: (foo.k = bar.k)
>> >  ->  Seq Scan on p_1 foo
>> >  ->  Hash
>> >->  Seq Scan on p_1 bar
>> >  Filter: (k = val)
>> >->  Hash Join
>> >  Hash Cond: (foo_1.k = bar_1.k)
>> >  ->  Seq Scan on p_2 foo_1
>> >  ->  Hash
>> >->  Seq Scan on p_2 bar_1
>> >  Filter: (k = val)
>> > (13 rows)
>> >
>> > But if we exchange the order of the two quals to 'foo.k = bar.val and
>> > foo.k = bar.k', then partitionwise join cannot be generated any more,
>> > because we only have joinclause 'foo.k = bar.val' as it first reached
>> > score of 3. We have missed the joinclause on the partition key although
>> > it does exist.
>> >
>> > # explain (costs off)
>> > select * from p as foo join p as bar on foo.k = bar.val and foo.k = bar.k;
>> >QUERY PLAN
>> > -
>> >  Hash Join
>> >Hash Cond: (foo.k = bar.val)
>> >->  Append
>> >  ->  Seq Scan on p_1 foo
>> >  ->  Seq Scan on p_2 foo_1
>> >->  Hash
>> >  ->  Append
>> >->  Seq Scan on p_1 bar
>> >  Filter: (val = k)
>> >->  Seq Scan on p_2 bar_1
>> >  Filter: (val = k)
>> > (11 rows)
>>
>> I think it would be nice if we can address this issue.

> Attached is a patch as an attempt to address this issue. The idea is
> quite straightforward. When building partition info for joinrel, we
> generate any possible EC-derived joinclauses of form 'outer_em =
> inner_em', which will be used together with the original restrictlist to
> check if there exists an equi-join condition for each pair of partition
> keys.

Thank you for the patch!  Will review.  Could you add the patch to the
upcoming CF so that it doesn’t get lost?

Best regards,
Etsuro Fujita




Re: block-level incremental backup

2019-08-29 Thread Jeevan Ladhe
Due to the inherent nature of pg_basebackup, the incremental backup also
allows taking backup in tar and compressed format. But, pg_combinebackup
does not understand how to restore this. I think we should either make
pg_combinebackup support restoration of tar incremental backup or restrict
taking the incremental backup in tar format, by making the options
'--lsn' and '-Ft' mutually exclusive, until pg_combinebackup supports
the restoration.

It is arguable that one can take the incremental backup in tar format,
extract that manually, and then give the resultant directory as input to
pg_combinebackup, but I think that kills the purpose of having the
pg_combinebackup utility.

Thoughts?

Regards,
Jeevan Ladhe


Re: BUG #15977: Inconsistent behavior in chained transactions

2019-08-29 Thread Fabien COELHO



Thanks, I got it. I have never made a patch before so I'll keep it in my 
mind. Self-contained patch is now attached.


v3 applies, compiles, "make check" ok.

I marked it as Ready on the commitfest app.

--
Fabien




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-08-29 Thread Alexey Kondratov

On 28.08.2019 22:06, Tomas Vondra wrote:




Interesting. Any idea where the extra overhead in this particular
case comes from? It's hard to deduce that from the single flame graph,
when I don't have anything to compare it with (i.e. the flame graph for
the "normal" case).

I guess that bottleneck is in disk operations. You can check
logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
writes (~26%) take around 35% of CPU time in summary. To compare,
please, see attached flame graph for the following transaction:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 2000)) FROM generate_series(1, 100);

Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)

where disk IO is only ~7-8% in total. So we get very roughly the same
~x4-5 performance drop here. JFYI, I am using a machine with SSD 
for tests.


Therefore, probably you may write changes on the receiver in bigger
chunks, not each change separately.


Possibly, I/O is certainly a possible culprit, although we should be
using buffered I/O and there certainly are not any fsyncs here. So I'm
not sure why it would be cheaper to do the writes in batches.

BTW does this mean you see the overhead on the apply side? Or are you
running this on a single machine, and it's difficult to decide?


I run this on a single machine, but walsender and worker are 
utilizing almost 100% of CPU per each process all the time, and at 
apply side I/O syscalls take about 1/3 of CPU time. Though I am 
still not sure, but for me this result somehow links performance 
drop with problems at receiver side.


Writing in batches was just a hypothesis and to validate it I have 
performed test with large txn, but consisting of a smaller number of 
wide rows. This test does not exhibit any significant performance 
drop, while it was streamed too. So it seems to be valid. Anyway, I 
do not have other reasonable ideas beside that right now.


It seems that overhead added by synchronous replica is lower by 2-3 
times compared with Postgres master and streaming with spilling. 
Therefore, the original patch eliminated delay before large 
transaction processing start by sender, while this additional patch 
speeds up the applier side.


Although the overall speed up is surely measurable, there is a room 
for improvements yet:


1) Currently bgworkers are only spawned on demand without some 
initial pool and never stopped. Maybe we should create a small pool 
on replication start and offload some of idle bgworkers if they 
exceed some limit?


2) Probably we can track somehow that incoming change has conflicts 
with some of being processed xacts, so we can wait for specific 
bgworkers only in that case?


3) Since the communication between the main logical apply worker and each
bgworker from the pool is a 'single producer --- single consumer'
problem, it is probably possible to wait and set/check flags without
locks, using just atomics (a toy sketch follows below).
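
As a toy illustration of point 3, a single-producer/single-consumer ring
using only C11 atomics; PostgreSQL itself would use its pg_atomic_*
primitives and shm_mq, so treat this purely as a sketch of the idea:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 1024          /* must be a power of two */

typedef struct
{
    _Atomic uint32_t head;      /* written only by the producer */
    _Atomic uint32_t tail;      /* written only by the consumer */
    uint64_t    slots[RING_SIZE];
} spsc_ring;

/* Producer side: returns false when the ring is full. */
static bool
ring_put(spsc_ring *r, uint64_t item)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return false;           /* full; caller may spin or sleep */
    r->slots[head & (RING_SIZE - 1)] = item;
    /* release: publish the slot write before the new head is visible */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool
ring_get(spsc_ring *r, uint64_t *item)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return false;           /* empty */
    *item = r->slots[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}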


What do you think about this concept in general? Any concerns and 
criticism are welcome!






Hi Tomas,

Thank you for a quick response.


I don't think it matters very much whether the workers are started at the
beginning or allocated ad hoc, that's IMO a minor implementation detail.


OK, I had the same vision about this point. Any minor differences here 
will be neglectable for a sufficiently large transaction.




There's one huge challenge that I however don't see mentioned in your
message or in the patch (after cursory reading) - ensuring the same commit
order, and introducing deadlocks that would not exist in single-process
apply.


Probably I haven't explained well this part, sorry for that. In my patch 
I don't use workers pool for a concurrent transaction apply, but rather 
for a fast context switch between long-lived streamed transactions. In 
other words we apply all changes arrived from the sender in a completely 
serial manner. Being written step-by-step it looks like:


1) Read STREAM START message and figure out the target worker by xid.

2) Put all changes which belong to this xact into the selected worker
one by one via shm_mq_send.

3) Read STREAM STOP message and wait until our worker has applied all
changes in the queue.

4) Process all other chunks of streamed xacts in the same manner.

5) Process all non-streamed xacts immediately in the main apply worker loop.

6) If we read STREAMED COMMIT/ABORT we again wait until the selected
worker either commits or aborts.

Thus, it automatically guarantees the same commit order on the replica as
on the master.  Yes, we lose some performance here, since we don't apply
transactions concurrently, but doing that would bring all those problems
you have described.


However, you helped me to figure out another point I have forgotten. 
Although we ensure commit order automatically, the beginning of streamed 
xacts may reorder. It happens if some small xacts have been commited on 
master since the streamed one started, because we do not start 

Re: RFC: seccomp-bpf support

2019-08-29 Thread Joe Conway
On 8/29/19 10:00 AM, Tom Lane wrote:
> Joe Conway  writes:
>> Clearly Joshua and I disagree, but understand that the consensus is not
>> on our side. It is our assessment that PostgreSQL will be subject to
>> seccomp willingly or not (e.g., via docker, systemd, etc.) and the
>> community might be better served to get out in front and have first
>> class support.
> 
> Sure, but ...
> 
>> But I don't want to waste any more of anyone's time on this topic,
>> except to ask if two strategically placed hooks are asking too much?
> 
> ... hooks are still implying a design with the filter control inside
> Postgres.  Which, as I said before, seems like a fundamentally incorrect
> architecture.  I'm not objecting to having such control, but I think
> it has to be outside the postmaster, or it's just not a credible
> security improvement.

I disagree. Once a filter is loaded there is no way to unload it short
of a postmaster restart. That is an easily detected event that can be
alerted upon, and that is definitely a security improvement.

Perhaps that is a reason to also set the session level GUC to
PGC_POSTMASTER, but that is an easy change if deemed necessary.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: RFC: seccomp-bpf support

2019-08-29 Thread Joshua Brindle
On Thu, Aug 29, 2019 at 10:01 AM Tom Lane  wrote:
>
> Joe Conway  writes:
> > Clearly Joshua and I disagree, but understand that the consensus is not
> > on our side. It is our assessment that PostgreSQL will be subject to
> > seccomp willingly or not (e.g., via docker, systemd, etc.) and the
> > community might be better served to get out in front and have first
> > class support.
>
> Sure, but ...
>
> > But I don't want to waste any more of anyone's time on this topic,
> > except to ask if two strategically placed hooks are asking too much?
>
> ... hooks are still implying a design with the filter control inside
> Postgres.  Which, as I said before, seems like a fundamentally incorrect
> architecture.  I'm not objecting to having such control, but I think
> it has to be outside the postmaster, or it's just not a credible
> security improvement.  It doesn't help to say "I'm going to install
> a lock to keep out a thief who *by assumption* is carrying lock
> picking tools."
>

I recognize this discussion is over but this is a mischaracterization
of the intent. Upthread I said:
"This may not do it alone, and security
conscious integrators will want to, for example, add seccomp filters
to systemd to prevent superuser from disabling them. The postmaster
and per-role lists can further reduce the available syscalls based on
the exact extensions and PLs being used. Each step reduced the surface
more and throwing it all out because one step can go rogue is
unsatisfying."

There are no security silver bullets, each thing we do is risk
reduction, and that includes this patchset, whether you can see it or
not.

Thank you.




Re: RFC: seccomp-bpf support

2019-08-29 Thread Tom Lane
Joe Conway  writes:
> Clearly Joshua and I disagree, but understand that the consensus is not
> on our side. It is our assessment that PostgreSQL will be subject to
> seccomp willingly or not (e.g., via docker, systemd, etc.) and the
> community might be better served to get out in front and have first
> class support.

Sure, but ...

> But I don't want to waste any more of anyone's time on this topic,
> except to ask if two strategically placed hooks are asking too much?

... hooks are still implying a design with the filter control inside
Postgres.  Which, as I said before, seems like a fundamentally incorrect
architecture.  I'm not objecting to having such control, but I think
it has to be outside the postmaster, or it's just not a credible
security improvement.  It doesn't help to say "I'm going to install
a lock to keep out a thief who *by assumption* is carrying lock
picking tools."

regards, tom lane




pg_resetwal and --wal-segsize

2019-08-29 Thread Pavel Demidov
Hello

I hear it is not recommended to use pg_resetwal with --wal-segsize to
increase the WAL segment size. Does any more detailed information exist
about this? What effects could it have? Is it possible to change it
safely while the cluster is fully offline?

Regards,
Paul


Re: RFC: seccomp-bpf support

2019-08-29 Thread Joe Conway
On 8/28/19 4:07 PM, Peter Eisentraut wrote:
> On 2019-08-28 21:38, Joshua Brindle wrote:
>> I think we need to reign in the thread somewhat. The feature allows
>> end users to define some sandboxing within PG. Nothing is being forced
>> on anyone
> 
> Features come with a maintenance cost.  If we ship it, then people are
> going to try it out.  Then weird things will happen.  They will report
> mysterious bugs.  They will complain to their colleagues.  It all comes
> with a cost.
> 
>> but we would like the capability to harden a PG installation
>> for many reasons already stated.
> 
> Most if not all of those reasons seem to have been questioned.


Clearly Joshua and I disagree, but understand that the consensus is not
on our side. It is our assessment that PostgreSQL will be subject to
seccomp willingly or not (e.g., via docker, systemd, etc.) and the
community might be better served to get out in front and have first
class support.

But I don't want to waste any more of anyone's time on this topic,
except to ask if two strategically placed hooks are asking too much?

Specifically hooks to replace these two stanzas in the patch:

8<--
diff --git a/src/backend/postmaster/postmaster.c
b/src/backend/postmaster/postmaster.c
index 62dc93d..2216d49 100644
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*** PostmasterMain(int argc, char *argv[])
*** 963,968 
--- 963,982 

[...]

diff --git a/src/backend/utils/init/postinit.c
b/src/backend/utils/init/postinit.c
index 43b9f17..aac1940 100644
*** a/src/backend/utils/init/postinit.c
--- b/src/backend/utils/init/postinit.c
*** InitPostgres(const char *in_dbname, Oid
*** 1056,1061 
--- 1056,1076 

[...]

8<--
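
Concretely, the sort of thing those stanzas would expose is the usual
PostgreSQL hook convention -- a hypothetical sketch, since none of these
names exist today:

/* hypothetical declarations in a new header */
typedef void (*seccomp_postmaster_hook_type) (void);
typedef void (*seccomp_session_hook_type) (void);

extern PGDLLIMPORT seccomp_postmaster_hook_type seccomp_postmaster_hook;
extern PGDLLIMPORT seccomp_session_hook_type seccomp_session_hook;

/* in PostmasterMain(), roughly where the first stanza applies */
if (seccomp_postmaster_hook)
    (*seccomp_postmaster_hook) ();

/* in InitPostgres(), roughly where the second stanza applies */
if (seccomp_session_hook)
    (*seccomp_session_hook) ();

A loadable module could then install its filter policy from _PG_init(),
keeping the policy itself out of core.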


We will continue to pursue this development for customers that require
it and plan to provide an update on our analysis and results.

We thank you for your comments and suggestions.

Joe
-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: BUG #15977: Inconsistent behavior in chained transactions

2019-08-29 Thread Fabien COELHO


Hello,


transaction_read_only must be 'on' because AND CHAIN test sets the
default_transaction_read_only to 'on'.


Failure of this test means that the transaction was chained from an 
implicit transaction, which is not our desired behavior. Perhaps you are 
using a wrong binary?


Nope, I blindly assumed that your patch was self contained, but it is not, 
and my test did not include the initial fix.


The usual approach is to send self-contained and numbered patches,
eg "chain-fix-1.patch", "chain-fix-2.patch", and so on, unless there are 
complex patches designed to be committed in stages.


For the expected result, I was wrongly assuming that "SET TRANSACTION" was
session persistent, which is not the case; there is a "SET SESSION …" for
that. Moreover, the usual transaction default is read-write, so I was a
little surprised. It would help to show the (unusual) current setting
before the test.


I also have registered the patch to the CF app:

https://commitfest.postgresql.org/24/2265/

But I cannot fill in your name, maybe you could register and add it, or it 
can be left blank.


--
Fabien.

Re: BUG #15977: Inconsistent behavior in chained transactions

2019-08-29 Thread fn ln
Added two kinds of test for the implicit transaction: in single query and
in implicit block.
The patch file is now created with Unix-style line ending (LF).

On Thu, Aug 29, 2019 at 15:30, Fabien COELHO wrote:

>
> Hello,
>
> > COMMIT AND CHAIN in implicit block leaves blockState as TBLOCK_STARTED,
> > which doesn't trigger the chaining. but ROLLBACK AND CHAIN sets the
> > blockState into TBLOCK_ABORT_PENDING, so the chaining is triggered.
> >
> > I think disabling s->chain beforehand should do the desired behavior.
>
> Patch applies with "patch", although "git apply" complained because of
> CRLF line terminations forced by the octet-stream mime type.
>
> Patch compiles cleanly. Make check ok.
>
> Patch works for me, and solution seems appropriate. It should be committed
> for pg 12.0.
>
> There could be a test added in "regress/sql/transactions.sql", I'd suggest
> something like:
>
> -- implicit transaction and not chained.
> COMMIT AND CHAIN;
> COMMIT;
> ROLLBACK AND CHAIN;
> ROLLBACK;
>
> which should show the appropriate "no transaction in progress" warnings.
>
> Doc could be made a little clearer about what to expect when there is no
> explicit transaction in progress.
>
> --
> Fabien.
>


implicit_xact_chain_test.patch
Description: Binary data


Re: BUG #15977: Inconsistent behavior in chained transactions

2019-08-29 Thread fn ln
 transaction_read_only must be 'on' because AND CHAIN test sets the
default_transaction_read_only to 'on'.
Failure of this test means that the transaction was chained from an
implicit transaction, which is not our desired behavior.
Perhaps you are using a wrong binary?

On Thu, Aug 29, 2019 at 21:10, Fabien COELHO wrote:

>
> Hello,
>
> > Added two kinds of test for the implicit transaction: in single query and
> > in implicit block.
>
> Ok.
>
> > The patch file is now created with Unix-style line ending (LF).
>
> Thanks.
>
> Patch applies and compiles cleanly. However, "make check" is not ok
> on the added tests.
>
> SHOW transaction_read_only;
>  transaction_read_only
> ---
>- on
>+ off
>
> ISTM that the run is right and the patch test output is wrong, i.e. the
> transaction_read_only is expected to stay as is.
>
> --
> Fabien.
>


Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-08-29 Thread Anastasia Lubennikova

28.08.2019 6:19, Peter Geoghegan wrote:

On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova
  wrote:

Now the algorithm is the following:
- In case page split is needed, pass both tuples to _bt_split().
  _bt_findsplitloc() is now aware of upcoming replacement of origtup with
neworigtup, so it uses correct item size where needed.

It seems that now all replace operations are crash-safe. The new patch passes
all regression tests, so I think it's ready for review again.

I think that the way this works within nbtsplitloc.c is too
complicated. In v5, the only thing that  nbtsplitloc.c knew about
deduplication was that it could be sure that suffix truncation would
at least make a posting list into a single heap TID in the worst case.
This consideration was mostly about suffix truncation, not
deduplication, which seemed like a good thing to me. _bt_split() and
_bt_findsplitloc() should know as little as possible about posting
lists.

Obviously it will sometimes be necessary to deal with the case where a
posting list is about to become too big (i.e. it's about to go over
BTMaxItemSize()), and so must be split. Less often, a page split will
be needed because of one of these posting list splits. These are two
complicated areas (posting list splits and page splits), and it would
be a good idea to find a way to separate them as much as possible.
Remember, nbtsplitloc.c works by pretending that the new item that
cannot fit on the page is already on its own imaginary version of the
page that *can* fit the new item, along with everything else from the
original/actual page. That gets *way* too complicated when it has to
deal with the fact that the new item is being merged with an existing
item. Perhaps nbtsplitloc.c could also "pretend" that the new item is
always a plain tuple, without knowing anything about posting lists.
Almost like how it worked in v5.

We always want posting lists to be as close to the BTMaxItemSize()
size as possible, because that helps with space utilization. In v5 of
the patch, this was what happened, because, in effect, we didn't try
to do anything complicated with the new item. This worked well, apart
from the crash safety issue. Maybe we can simulate the v5 approach,
giving us the best of all worlds (good space utilization, simplicity,
and crash safety). Something like this:

* Posting list splits should always result in one posting list that is
at or just under BTMaxItemSize() in size, plus one plain tuple to its
immediate right on the page. This is similar to the more common case
where we cannot add additional tuples to a posting list due to the
BTMaxItemSize() restriction, and so end up with a single tuple (or a
smaller posting list with the same value) to the right of a
BTMaxItemSize()-sized posting list tuple. I don't see a reason to
split a posting list in the middle -- we should always split to the
right, leaving the posting list as large as possible.

* When there is a simple posting list split, with no page split, the
logic required is fairly straightforward: We rewrite the posting list
in-place so that our new item goes wherever it belongs in the existing
posting list on the page (we memmove() the posting list to make space
for the new TID, basically). The old last/rightmost TID in the
original posting list becomes a new, plain tuple. We may need a new
WAL record for this, but it's not that different to a regular leaf
page insert. (A toy sketch of this in-place rewrite appears after this
list.)

* When this happens to result in a page split, we then have a "fake"
new item -- the right half of the posting list that we split, which is
always a plain item. Obviously we need to be a bit careful with the
WAL logging, but the space accounting within _bt_split() and
_bt_findsplitloc() can work just the same as now. nbtsplitloc.c can
work like it did in v5, when the only thing it knew about posting
lists was that _bt_truncate() always removes them, maybe leaving a
single TID behind in the new high key. (Note also that it's not okay
to remove the conservative assumption about at least having space for
one heap TID within _bt_recsplitloc() -- that needs to be restored to
its v5 state in the next version of the patch.)
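
To make the in-place rewrite concrete, a toy model with plain integers
standing in for heap TIDs (nothing here is actual nbtree code):

#include <stdint.h>
#include <string.h>

/*
 * Toy model only: insert newtid into a sorted "posting list" of ntids
 * TIDs without growing it, returning the evicted rightmost TID, which
 * would become a new plain tuple to the right of the posting list.
 * Assumes newtid sorts below the current rightmost TID.
 */
static uint64_t
posting_list_replace(uint64_t *tids, int ntids, uint64_t newtid)
{
    uint64_t    evicted = tids[ntids - 1];
    int         pos = 0;

    while (pos < ntids - 1 && tids[pos] < newtid)
        pos++;
    /* memmove() the tail right by one, dropping the old last TID */
    memmove(&tids[pos + 1], &tids[pos],
            (ntids - 1 - pos) * sizeof(uint64_t));
    tids[pos] = newtid;
    return evicted;
}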

Because deduplication is lazy, there is little value in doing
deduplication of the new item (which may or may not be the fake new
item). The nbtsplitloc.c logic will "trap" duplicates on the same page
today, so we can just let deduplication of the new item happen at a
later time. _bt_split() can almost pretend that posting lists don't
exist, and nbtsplitloc.c needs to know nothing about posting lists
(apart from the way that _bt_truncate() behaves with posting lists).
We "lie" to  _bt_findsplitloc(), and tell it that the new item is our
fake new item -- it doesn't do anything that will be broken by that
lie, because it doesn't care about the actual content of posting
lists. And, we can fix the "fake new item is not actually real new
item" issue at one point within _bt_split(), just as we're about to
WAL log.

What do you think of that 

Re: BUG #15977: Inconsistent behavior in chained transactions

2019-08-29 Thread Fabien COELHO



Hello,


Added two kinds of test for the implicit transaction: in single query and
in implicit block.


Ok.


The patch file is now created with Unix-style line ending (LF).


Thanks.

Patch applies and compiles cleanly. However, "make check" is not ok
on the added tests.

   SHOW transaction_read_only;
transaction_read_only
   ---
  - on
  + off

ISTM that the run is right and the patch test output is wrong, i.e. the 
transaction_read_only is expected to stay as is.


--
Fabien.




Re: Zedstore - compressed in-core columnar storage

2019-08-29 Thread Heikki Linnakangas

On 29/08/2019 14:30, Ashutosh Sharma wrote:


On Wed, Aug 28, 2019 at 5:30 AM Alexandra Wang wrote:


You are correct that we currently go through each item in the leaf page
that contains the given tid; specifically, the logic to retrieve all the
attribute items inside a ZSAttStream is now moved to decode_attstream()
in the latest code, and then in zsbt_attr_fetch() we again loop through
each item we previously retrieved from decode_attstream() and look for
the given tid.



Okay. Any idea why this new way of storing attribute data as streams
(lowerstream and upperstream) has been chosen just for the attributes
but not for tids? Are only the attribute blocks compressed, and not the
tid blocks?


Right, only attribute blocks are currently compressed. Tid blocks need
to be modified when there are UPDATEs or DELETEs, so I think having to
decompress and recompress them would be more costly. Also, there is no 
user data on the TID tree, and the Simple-8b encoded codewords used to 
represent the TIDs are already pretty compact. I'm not sure how much 
gain you would get from passing it through a general purpose compressor.


I could be wrong though. We could certainly try it out, and see how it 
performs.
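
For anyone unfamiliar with the encoding, a Simple-8b-style codeword
packs a variable number of equal-width integers into one 64-bit word,
chosen by a 4-bit selector -- which is why the TIDs are already compact.
The sketch below uses a made-up mode table for illustration; it is not
zedstore's actual table:

#include <stdint.h>

/* Decode one codeword into out[] (caller must have room for up to 60
 * values); returns the number of values decoded. Illustrative only. */
static int
simple8b_decode(uint64_t codeword, uint64_t *out)
{
    /* made-up mode table: nvalues * bits <= 60 for every selector */
    static const struct { int nvalues; int bits; } modes[16] = {
        {60, 1}, {30, 2}, {20, 3}, {15, 4}, {12, 5}, {10, 6},
        {8, 7}, {7, 8}, {6, 10}, {5, 12}, {4, 15}, {3, 20},
        {2, 30}, {1, 60}, {1, 60}, {1, 60}
    };
    int         sel = (int) (codeword >> 60);   /* top 4 bits */
    int         n = modes[sel].nvalues;
    int         bits = modes[sel].bits;
    uint64_t    mask = (UINT64_C(1) << bits) - 1;

    for (int i = 0; i < n; i++)
        out[i] = (codeword >> (i * bits)) & mask;
    return n;
}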



One optimization we can do is to tell decode_attstream() to stop
decoding at the tid we are interested in. We can also apply other
tricks to speed up the lookups in the page: for fixed length
attributes, it is easy to do binary search instead of linear search,
and for variable length attributes, we can probably try something that
we didn't think of yet.



I think we can probably ask decode_attstream() to stop once it has found 
the tid that we are searching for but then we only need to do that for 
Index Scans.


I've been thinking that we should add a few "bookmarks" on long streams, 
so that you could skip e.g. to the midpoint in a stream. It's a tradeoff 
though; when you add more information for random access, it makes the 
representation less compact.



Zedstore currently implements update as delete+insert, hence the old
tid is not reused. We don't store the tuple in our UNDO log, and we
only store the transaction information in the UNDO log. Reusing the
tid of the old tuple means putting the old tuple in the UNDO log,
which we have not implemented yet.

Okay, so that means performing an update on a non-key attribute would also 
require changes in the index table. In short, HOT update is currently 
not possible with a zedstore table. Am I right?


That's right. There's a lot of potential gain for doing HOT updates. For 
example, if you UPDATE one column on every row on a table, ideally you 
would only modify the attribute tree containing that column. But that 
hasn't been implemented.


- Heikki




Re: Email to hackers for test coverage

2019-08-29 Thread Ahsan Hadi
On Wed, Aug 28, 2019 at 9:43 PM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-08-22 11:46, movead...@highgo.ca wrote:
> > *1. src/include/utils/float.h:140*
> >
> > Analyze:
> > This is an error report line when converting a big float8 value
> > which a float4 can not store to float4.
> >
> > Test case:
> > Add a test case as below in file float4.sql:
> > select float4(1234567890123456789012345678901234567890::float8);
>
> > +-- Add test case for float4() type fonversion function
>
> Check spelling
>
> > *2. src/include/utils/float.h:145*
> >
> > Analyze:
> > This is an error report line when converting a small float8 value
> > which a float4 can not store to float4.
> >
> > Test case:
> > Add a test case as below in file float4.sql:
> > select float4(0.01::float8);
> >
> > *3.src/include/utils/sortsupport.h:264*
> >
> > Analyze:
> > It is reverse sorting for the data types that have abbreviated keys
> > for sorting, for example macaddr, uuid, numeric, network, and I chose
> > numeric to do it.
> >
> > Test case:
> > Add a test case as below in file numeric.sql:
> > INSERT INTO num_input_test(n1) values('99.998');
> > INSERT INTO num_input_test(n1) values('99.997');
> > SELECT * FROM num_input_test ORDER BY n1 DESC;
>
> >  INSERT INTO num_input_test(n1) VALUES ('nan');
> > +INSERT INTO num_input_test(n1) values('99.998');
> > +INSERT INTO num_input_test(n1) values('99.997');
>
> Make spaces and capitalization match surrounding code.
>
> > Result and patch
> >
> > By adding the test cases, the test coverage of float.h increased from
> > 97.7% to 100% and sortsupport.h increased from 76.7% to 80.0%.
>
> That's fine, but I suggest that if you really want to make an impact in
> test coverage, look to increase function coverage rather than line
> coverage.  Or look for files that are not covered at all.
>
>
+1


> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>
>


Re: no mailing list hits in google

2019-08-29 Thread Magnus Hagander
On Wed, Aug 28, 2019 at 10:31 PM Alvaro Herrera 
wrote:

> On 2019-Aug-28, Thomas Kellerer wrote:
>
> > Merlin Moncure schrieb am 28.08.2019 um 18:22:
> > > My test case here is the query: pgsql-hackers
> >
> > That search term is the first hit on DuckDuckGo:
> > https://duckduckgo.com/?q=pgsql-hackers+ExecHashJoinNewBatch=h_=web
>
> Yes, but that's an old post, not the one from this year.
>
>
It does show another interesting point though -- it *also* includes hits
from third party list archiving sites, which are *also* gone from Google at
this point. And those are definitely not gone from Google because we have a
robots.txt blocking /list/ -- it must be something else.

//Magnus


Re: Zedstore - compressed in-core columnar storage

2019-08-29 Thread Ashutosh Sharma
On Wed, Aug 28, 2019 at 5:30 AM Alexandra Wang  wrote:

> You are correct that we currently go through each item in the leaf page
> that
> contains the given tid, specifically, the logic to retrieve all the
> attribute
> items inside a ZSAttStream is now moved to decode_attstream() in the latest
> code, and then in zsbt_attr_fetch() we again loop through each item we
> previously retrieved from decode_attstream() and look for the given tid.
>

Okay. Any idea why this new way of storing attribute data as streams
(lowerstream and upperstream) has been chosen just for the attributes but
not for tids? Are only attribute blocks compressed but not the tid blocks?


> One
> optimization we can do is to tell decode_attstream() to stop decoding at
> the
> tid we are interested in. We can also apply other tricks to speed up the
> lookups in the page, for fixed length attribute, it is easy to do binary
> search
> instead of linear search, and for variable length attribute, we can
> probably
> try something that we didn't think of yet.
>

I think we can probably ask decode_attstream() to stop once it has found
the tid that we are searching for but then we only need to do that for
Index Scans.

Zedstore currently implements update as delete+insert, hence the old tid is
> not
> reused. We don't store the tuple in our UNDO log, and we only store the
> transaction information in the UNDO log. Reusing the tid of the old tuple
> means
> putting the old tuple in the UNDO log, which we have not implemented yet.
>
>
Okay, so that means performing an update on a non-key attribute would also
require changes in the index table. In short, HOT update is currently not
possible with a zedstore table. Am I right?

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com


Re: pg_get_databasebyid(oid)

2019-08-29 Thread Ibrar Ahmed
On Thu, Aug 29, 2019 at 3:16 PM Sergei Kornilov  wrote:

> Hello
>
> > Is there a need for this function for the user?
>
> This was a feature request from a user. I got this comment:
>
> This function is useful when working with pg_stat_statements. For
> obtaining a database name for a particular query you need to join the
> pg_database relation, but for obtaining a username you just need
> pg_get_userbyid(). So it will be useful not to join an extra relation
> and get a database name using a similar function - pg_get_databasebyid().
>
> regards, Sergei
>

Hi,
I think it's a user request and doesn't need to be in the core of
PostgreSQL.
A simple SQL function can fulfill the requirement of the user.

CREATE OR REPLACE FUNCTION pg_get_databasebyid(dboid integer) RETURNS name
AS $$
  SELECT datname FROM pg_database WHERE oid = dboid;
$$ LANGUAGE SQL;

-- 
Ibrar Ahmed


Re: pg_get_databasebyid(oid)

2019-08-29 Thread Sergei Kornilov
Hello

> Is there a need for this function for the user?

This was a feature request from a user. I got this comment:

This function is useful when working with pg_stat_statements. For obtaining a
database name for a particular query you need to join the pg_database
relation, but for obtaining a username you just need pg_get_userbyid(). So it
will be useful not to join an extra relation and get a database name using a
similar function - pg_get_databasebyid().

regards, Sergei




Re: basebackup.c's sendFile() ignores read errors

2019-08-29 Thread Jeevan Ladhe
Hi Jeevan,

On Wed, Aug 28, 2019 at 10:26 PM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
>
> On Tue, Aug 27, 2019 at 10:33 PM Robert Haas 
> wrote:
>
>> While reviewing a proposed patch to basebackup.c this morning, I found
>> myself a bit underwhelmed by the quality of the code and comments in
>> basebackup.c's sendFile(). I believe it's already been pointed out
>> that the the retry logic here is not particularly correct, and the
>> comments demonstrate a pretty clear lack of understanding of how the
>> actual race conditions that are possible here might operate.
>>
>> However, I then noticed another problem which I think is significantly
>> more serious: this code totally ignores the possibility of a failure
>> in fread().  It just assumes that if fread() fails to return a
>> positive value, it's reached the end of the file, and if that happens,
>> it just pads out the rest of the file with zeroes.  So it looks to me
>> like if fread() encountered, say, an I/O error while trying to read a
>> data file, and if that error were on the first data block, then the
>> entire contents of that file would be replaced with zero bytes in the
>> backup, and no error or warning of any kind would be given to the
>> user.  If it hit the error later, everything from the point of the
>> error onward would just get replaced with zero bytes.  To be clear, I
>> think it is fine for basebackup.c to fill out the rest of the expected
>> length with zeroes if in fact the file has been truncated during the
>> backup; recovery should fix it.  But recovery is not going to fix data
>> that got eaten because EIO was encountered while copying a file.
>>
>
> Per the fread(3) manpage, we cannot distinguish between an error and EOF.
> So to check for any error we must use ferror() after fread(). Attached is
> a patch which tests ferror() and throws an error if it returns true.
>

You are right. This seems the right approach to me. I can see at least a
couple of places already using ferror() to guard errors of fread(), like
CopyGetData(), read_backup_label() etc.


> However, I think fread()/ferror() both do not set errno (per some web
> reference), and thus I am throwing a generic error here. I have manually
> tested it.
>

If we are interested in getting the errno, then instead of fread(), we can
use read(). But here, in particular, we are not making any decision
depending on errno, so I think we should be fine with your current patch.
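
For concreteness, the pattern under discussion is roughly the following
minimal sketch -- send_chunk() is a hypothetical consumer, and the
zero-padding policy follows the discussion above rather than
basebackup.c verbatim:

#include <stdio.h>
#include <string.h>

extern void send_chunk(const char *buf, size_t len);    /* hypothetical */

/* Returns 0 on success, -1 on a real I/O error. A short read is padded
 * with zeros only once ferror() has ruled out an error, i.e. it is a
 * genuine EOF such as a concurrent truncation. */
static int
copy_file_checked(FILE *fp, size_t expected_len)
{
    char        buf[8192];
    size_t      done = 0;

    while (done < expected_len)
    {
        size_t      cnt = fread(buf, 1, sizeof(buf), fp);

        if (cnt == 0)
        {
            if (ferror(fp))
                return -1;                  /* I/O error: report, never pad */
            memset(buf, 0, sizeof(buf));    /* feof(): pad with zeros */
            cnt = sizeof(buf);
        }
        if (cnt > expected_len - done)
            cnt = expected_len - done;
        send_chunk(buf, cnt);
        done += cnt;
    }
    return 0;
}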


>
>> The logic that rereads a block appears to have the opposite problem.
>> Here, it will detect an error, but it will also fail in the face of a
>> concurrent truncation, which it shouldn't.
>>
>
> For this, I have checked for feof() assuming that when the file gets
> truncated
> we reach EOF. And if yes, getting out of the loop instead of throwing an
> error.
> I may be wrong as I couldn't reproduce this issue.
>

I had a look at the patch and this seems correct to me.

Few minor comments:

+ /* Check fread() error. */
+ CHECK_FREAD_ERROR(fp, pathbuf);
+

The comments above the macro call at both places are not necessary, as
your macro name itself is self-explanatory.

--
+ /*
+ * If file is truncated, then we will hit
+ * end-of-file error in which case we don't
+ * want to error out, instead just pad it with
+ * zeros.
+ */
+ if (feof(fp))

The if block does not do the truncation right away, so I think the comment
above can be reworded to explain why we reset cnt?

Regards,
Jeevan Ladhe


Re: A problem about partitionwise join

2019-08-29 Thread Richard Guo
On Wed, Aug 28, 2019 at 6:49 PM Etsuro Fujita 
wrote:

> Hi,
>
> On Tue, Aug 27, 2019 at 4:57 PM Richard Guo  wrote:
> > Check the query below as a more illustrative example:
> >
> > create table p (k int, val int) partition by range(k);
> > create table p_1 partition of p for values from (1) to (10);
> > create table p_2 partition of p for values from (10) to (100);
> >
> > If we use quals 'foo.k = bar.k and foo.k = bar.val', we can generate
> > partitionwise join:
> >
> > # explain (costs off)
> > select * from p as foo join p as bar on foo.k = bar.k and foo.k =
> bar.val;
> >QUERY PLAN
> > -
> >  Append
> >->  Hash Join
> >  Hash Cond: (foo.k = bar.k)
> >  ->  Seq Scan on p_1 foo
> >  ->  Hash
> >->  Seq Scan on p_1 bar
> >  Filter: (k = val)
> >->  Hash Join
> >  Hash Cond: (foo_1.k = bar_1.k)
> >  ->  Seq Scan on p_2 foo_1
> >  ->  Hash
> >->  Seq Scan on p_2 bar_1
> >  Filter: (k = val)
> > (13 rows)
> >
> > But if we exchange the order of the two quals to 'foo.k = bar.val and
> > foo.k = bar.k', then partitionwise join cannot be generated any more,
> > because we only have joinclause 'foo.k = bar.val' as it first reached
> > score of 3. We have missed the joinclause on the partition key although
> > it does exist.
> >
> > # explain (costs off)
> > select * from p as foo join p as bar on foo.k = bar.val and foo.k =
> bar.k;
> >QUERY PLAN
> > -
> >  Hash Join
> >Hash Cond: (foo.k = bar.val)
> >->  Append
> >  ->  Seq Scan on p_1 foo
> >  ->  Seq Scan on p_2 foo_1
> >->  Hash
> >  ->  Append
> >->  Seq Scan on p_1 bar
> >  Filter: (val = k)
> >->  Seq Scan on p_2 bar_1
> >  Filter: (val = k)
> > (11 rows)
>
> I think it would be nice if we can address this issue.
>

Thank you.

Attached is a patch as an attempt to address this issue. The idea is
quite straightforward. When building partition info for the joinrel, we
generate any possible EC-derived joinclauses of the form 'outer_em =
inner_em', which are used together with the original restrictlist to
check whether there exists an equi-join condition for each pair of
partition keys.
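
To make the idea concrete, here is a toy model of the check -- the
types and helpers are invented for illustration; the actual patch works
with the planner's EquivalenceClass machinery:

#include <stdbool.h>

/* Toy equivalence class: ids of expressions known to be all equal. */
typedef struct
{
    int     members[8];
    int     nmembers;
} ToyEquivClass;

static bool
ec_has_member(const ToyEquivClass *ec, int expr)
{
    for (int i = 0; i < ec->nmembers; i++)
        if (ec->members[i] == expr)
            return true;
    return false;
}

/*
 * A pair of partition keys supports a partitionwise join if some
 * equivalence class contains both keys -- regardless of which derived
 * clause happened to survive in the restrictlist.
 */
static bool
partkeys_known_equal(const ToyEquivClass *ecs, int necs,
                     int outer_partkey, int inner_partkey)
{
    for (int i = 0; i < necs; i++)
        if (ec_has_member(&ecs[i], outer_partkey) &&
            ec_has_member(&ecs[i], inner_partkey))
            return true;
    return false;
}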

Any comments are welcome!

Thanks
Richard


v1-0001-Fix-up-partitionwise-join.patch
Description: Binary data


Re: REINDEX filtering in the backend

2019-08-29 Thread Julien Rouhaud
On Thu, Aug 29, 2019 at 2:09 AM Michael Paquier  wrote:
>
> On Wed, Aug 28, 2019 at 10:22:07AM +0200, Julien Rouhaud wrote:
> >>> The filtering is done at table level (with and without the
> >>> concurrently option), so SCHEMA, DATABASE and SYSTEM automatically
> >>> benefit from it.  If this clause is used with a REINDEX INDEX, the
> >>> statement errors out, as I don't see a valid use case for providing a
> >>> single index name and asking to possibly filter it at the same time.
> >>
> >> Supporting that case would not be that much amount of work, no?
> >
> > Probably not, but I'm dubious about the use case.  Adding the lib
> > version in the catalog would be more useful for people who want to
> > write their own rules to reindex specific set of indexes.
>
> Hearing from others here would be helpful.  My take is that having a
> simple option doing the filtering, without the need to store anything
> in the catalogs, would be helpful enough for users mixing both index
> types on a single table.  Others may not agree.

That was already suggested by Thomas and seconded by Peter E., see
https://www.postgresql.org/message-id/2b1504ac-3d6c-11ec-e1ce-3daf132b3d37%402ndquadrant.com.

I personally think that it's a sensible approach, and I'll be happy to
propose a patch for that too if no one has worked on this yet.

> >>  I think that it would be nice to be
> >> able to do both, generating an error for REINDEX INDEX if the index
> >> specified is not compatible, and a LOG if the index is not filtered
> >> out when a list is processed.
> >
> > Do you mean having an error if the index does not contain any
> > collation based type?  Also, REINDEX only accepts a single name, so
> > there shouldn't be any list processing for REINDEX INDEX?  I'm not
> > really in favour of adding extra code for the filtering when the user
> > asks for a specific index name to be reindexed.
>
> I was referring to adding an error if trying to reindex an index with
> the filtering enabled.  That's useful to inform the user that what he
> intends to do is not compatible with the options provided.

Ok, I can add it if needed.

> > That's what I did when I first submitted the feature in reindexdb.  I
> > didn't use them because it means switching to TAP tests.  I can drop
> the simple regression test (especially since I now realize that one is
> > quite broken) and use the TAP one if, or should I keep both?
>
> There is no need for TAP I think.  You could for example store the
> relid and its relfilenode in a temporary table before running the
> reindex, run the REINDEX and then compare with what pg_class stores.
> And that's much cheaper than setting a new instance for a TAP test.

Oh indeed, good point!  I'll work on better tests using this approach then.




Re: pg_get_databasebyid(oid)

2019-08-29 Thread Ibrar Ahmed
On Wed, Aug 28, 2019 at 6:05 PM Sergei Kornilov  wrote:

> > Please add that to commitfest.
>
> Done: https://commitfest.postgresql.org/24/2261/
>
> regards, Sergei
>
Hi,

I have checked the code; the function "pg_get_userbyid" is used in many
places in the code. I am just curious why we need that "pg_get_databasebyid"
function. Is there a need for this function for the user?


-- 
Ibrar Ahmed


Wrong value in metapage of GIN INDEX.

2019-08-29 Thread Moon, Insung
Dear Hackers.

Kuroda-san and I are interested in the GIN index and have been testing
various things.
While testing, we found a little bug: in some cases, the value of
nEntries in the metapage was set to the wrong value.

Here is a reproduction of the bug:
=# SET maintenance_work_mem TO '1MB';
=# CREATE TABLE foo(i jsonb);
=# INSERT INTO foo(i) SELECT jsonb_build_object('foobar001', i) FROM
generate_series(1, 10000) AS i;

# Insert the same values again.
=# INSERT INTO foo(i) SELECT jsonb_build_object('foobar001', i) FROM
generate_series(1, 10000) AS i;

# Create the GIN index.
=# CREATE INDEX foo_idx ON foo USING gin (i jsonb_ops) WITH
(fastupdate=off);

=# SELECT * FROM gin_metapage_info(get_raw_page('foo_idx', 0));
-[ RECORD 1 ]+---
pending_head | 4294967295
pending_tail | 4294967295
tail_free_size   | 0
n_pending_pages  | 0
n_pending_tuples | 0
n_total_pages| 74
n_entry_pages| 69
n_data_pages | 4
n_entries| 20004 <--★
version  | 2

In this example, the nentries value should be 10001, because the gin
index stores duplicate values in one leaf (posting tree or posting
list).
But if we look at the nentries value of the metapage using pageinspect,
it is stored as 20004.
So, let's run a vacuum.


=# VACUUM foo;
=# SELECT * FROM gin_metapage_info(get_raw_page('foo_idx', 0));
-[ RECORD 1 ]+---
pending_head | 4294967295
pending_tail | 4294967295
tail_free_size   | 0
n_pending_pages  | 0
n_pending_tuples | 0
n_total_pages| 74
n_entry_pages| 69
n_data_pages | 4
n_entries| 10001 <--★
version  | 2

Ah. After running the vacuum, nEntries changes back to the normal value.

There is a problem with the ginEntryInsert function. During the table
scan that builds the gin index, the ginBuildCallback function stores
the new heap values inside the buildstate struct.
In the next step, if the memory used by the GinBuildState struct is
equal to or larger than the maintenance_work_mem value, the accumulated
values are inserted into the GIN index.
This insertion is done by the ginEntryInsert function.
The ginEntryInsert function called at this time assumes that a new
entry is being added and increases the value of nEntries.
However, currently ginEntryInsert first increases the value of
nEntries, and only then determines whether the same entry already
exists in the GIN index.
That causes the bug.

The patch is very simple.
It fixes ginEntryInsert to increase the value of nEntries only when a
non-duplicate GIN index entry is added.
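
In toy form, the corrected ordering looks like this (stubbed helpers,
not the actual ginEntryInsert() code):

#include <stdbool.h>

typedef struct { long nEntries; } ToyGinMeta;

static bool entry_exists(const char *key) { (void) key; return false; } /* stub */
static void insert_new_entry(const char *key) { (void) key; }           /* stub */
static void append_to_posting(const char *key) { (void) key; }          /* stub */

static void
toy_gin_entry_insert(ToyGinMeta *meta, const char *key)
{
    if (entry_exists(key))
        append_to_posting(key);     /* duplicate: nEntries must not move */
    else
    {
        insert_new_entry(key);
        meta->nEntries++;           /* count only genuinely new entries */
    }
}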

This bug was detected and fixed together with Kuroda-san.

Best Regards.
Moon.


GIN_Metapage_bugfix.patch
Description: Binary data


Re: BUG #15977: Inconsistent behavior in chained transactions

2019-08-29 Thread fn ln
COMMIT AND CHAIN in an implicit block leaves blockState as TBLOCK_STARTED,
which doesn't trigger the chaining, but ROLLBACK AND CHAIN sets the
blockState to TBLOCK_ABORT_PENDING, so the chaining is triggered.

I think disabling s->chain beforehand should do the desired behavior.
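
A minimal sketch of the idea, as a toy model of xact.c's states (not
the attached patch verbatim):

#include <stdbool.h>
#include <stdio.h>

/* Simplified state names, borrowed from xact.c. */
typedef enum { TBLOCK_STARTED, TBLOCK_INPROGRESS, TBLOCK_ABORT_PENDING } ToyBlockState;

typedef struct
{
    ToyBlockState   blockState;
    bool            chain;      /* start a new transaction afterwards? */
} ToyXactState;

/* ROLLBACK [AND CHAIN] outside an explicit block should warn and not
 * chain, matching what COMMIT AND CHAIN already does. */
static void
toy_user_abort_transaction_block(ToyXactState *s, bool chain)
{
    if (s->blockState == TBLOCK_STARTED)
    {
        fprintf(stderr, "WARNING:  there is no transaction in progress\n");
        s->chain = false;           /* disable s->chain beforehand */
    }
    else
    {
        s->blockState = TBLOCK_ABORT_PENDING;
        s->chain = chain;
    }
}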

On Sun, Aug 25, 2019 at 6:11 PM Fabien COELHO wrote:

>
> > The following bug has been logged on the website:
> >
> > Bug reference:  15977
> > Logged by:  mtlh kdvt
> > Email address:  emuser20140...@gmail.com
> > PostgreSQL version: 12beta3
> > Operating system:   Windows
> > Description:
> >
> > When a ROLLBACK AND CHAIN command is executed in the implicit transaction
> > block, a new transaction will be started:
> >   db=# ROLLBACK AND CHAIN;
> >   WARNING:  there is no transaction in progress
> >   ROLLBACK
> >   db=# ROLLBACK AND CHAIN;
> >   ROLLBACK
> >
> > However, a COMMIT AND CHAIN command won't start a new transaction:
> >   db=# COMMIT AND CHAIN;
> >   WARNING:  there is no transaction in progress
> >   COMMIT
> >   db=# COMMIT AND CHAIN;
> >   WARNING:  there is no transaction in progress
> >   COMMIT
>
> Thanks for the report.
>
> Indeed, I confirm, and I should have caught this one while reviewing…
>
> Doc says:
>
> "If AND CHAIN is specified, a new transaction is immediately started with
> the same transaction characteristics as the just finished one. Otherwise,
> no new transaction is started."
>
> If there is no transaction in progress, the spec is undefined. Logically,
> ITSM that there should be no tx reset if none was in progress, so ROLLBACK
> has the wrong behavior?
>
> A quick glance at the code did not yield any obvious culprit, but maybe
> I'm not looking at the right piece of code.
>
> Doc could append ", if any" to be clearer.
>
> --
> Fabien.


disable_implicit_xact_chaining.patch
Description: Binary data


Re: Resume vacuum and autovacuum from interruption and cancellation

2019-08-29 Thread Masahiko Sawada
On Tue, Aug 27, 2019 at 2:55 PM Jamison, Kirk  wrote:
>
> On Monday, August 19, 2019 10:39 AM (GMT+9), Masahiko Sawada wrote:
> > Fixed.
> >
> > Attached the updated version patch.
>
> Hi Sawada-san,
>
> I haven't tested it with heavily updated large tables, but I think the patch
> is reasonable as it helps to shorten the execution time of vacuum by removing
> the redundant vacuuming and prioritizing reclaiming the garbage instead.
> I'm not sure if it's commonly reported to have problems even when we repeat
> vacuuming the already-vacuumed blocks, but I think it's a reasonable 
> improvement.
>
> I skimmed the patch and have few comments. If they deem fit, feel free to
> follow, but it's also ok if you don't.

Thank you for reviewing this patch! I've attached an updated patch
incorporating all your comments and some improvements.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From 0bd1cf64aa55977029371e3402d05240dfdfec21 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Wed, 10 Jul 2019 19:02:56 +0900
Subject: [PATCH v3] Add RESUME option to VACUUM and autovacuum.

This commit adds a new reloption, vacuum_resume, which controls
whether vacuum attempts to resume vacuuming from the last vacuumed
block saved in the vacuum_resume_block column of pg_stat_all_tables.
It also adds a new option to the VACUUM command, RESUME, which can be
used to override the reloption.
---
 doc/src/sgml/monitoring.sgml   |  5 ++
 doc/src/sgml/ref/vacuum.sgml   | 18 +++
 src/backend/access/common/reloptions.c | 13 -
 src/backend/access/heap/vacuumlazy.c   | 88 +++---
 src/backend/catalog/system_views.sql   |  1 +
 src/backend/commands/vacuum.c  | 13 +
 src/backend/postmaster/pgstat.c| 42 
 src/backend/utils/adt/pgstatfuncs.c| 14 ++
 src/include/catalog/pg_proc.dat|  5 ++
 src/include/commands/vacuum.h  |  5 +-
 src/include/pgstat.h   | 14 ++
 src/include/utils/rel.h|  2 +
 src/test/regress/expected/rules.out|  3 ++
 src/test/regress/expected/vacuum.out   | 20 
 src/test/regress/sql/vacuum.sql| 21 
 15 files changed, 255 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index bf72d0c..dfd8669 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2785,6 +2785,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
  Estimated number of rows modified since this table was last analyzed
 
 
+ vacuum_resume_block
+ bigint
+ Block number to resume vacuuming
+
+
  last_vacuum
  timestamp with time zone
  Last time at which this table was manually vacuumed
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index f9b0fb8..ae3798b 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -34,6 +34,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ boolean ]
 INDEX_CLEANUP [ boolean ]
 TRUNCATE [ boolean ]
+RESUME [ boolean ]
 
 and table_and_columns is:
 
@@ -224,6 +225,23 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ . This behavior is helpful
+  when resuming vacuum run from interruption and cancellation. The default
+  is false unless the vacuum_resume option has been
+  set to true. This option is ignored if either the FULL,
+  FREEZE, or DISABLE_PAGE_SKIPPING
+  option is used.
+ 
+
+   
+
+   
 boolean
 
  
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 42647b0..0200eb0 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"vacuum_resume",
+			"Enables vacuum to resume vacuuming from the last vacuumed block",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		false
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1412,7 +1421,9 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
 		offsetof(StdRdOptions, vacuum_index_cleanup)},
 		{"vacuum_truncate", RELOPT_TYPE_BOOL,
-		offsetof(StdRdOptions, vacuum_truncate)}
+		offsetof(StdRdOptions, vacuum_truncate)},
+		{"vacuum_resume", RELOPT_TYPE_BOOL,
+		offsetof(StdRdOptions, vacuum_resume)}
 	};
 
 	options = parseRelOptions(reloptions, validate, kind, &numoptions);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a3c4a1d..1895ca0 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -92,6 +92,14 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
+ * When a table has no indexes, save the progress every 8GB so that we can
+ * 

Re: BUG #15977: Inconsistent behavior in chained transactions

2019-08-29 Thread Fabien COELHO



Patch works for me, and solution seems appropriate. It should be committed 
for pg 12.0.


I have listed this as an open issue of the upcoming pg 12:

https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items#Open_Issues

--
Fabien.




Re: BUG #15977: Inconsistent behavior in chained transactions

2019-08-29 Thread Fabien COELHO



Hello,

COMMIT AND CHAIN in an implicit block leaves blockState as TBLOCK_STARTED,
which doesn't trigger the chaining, but ROLLBACK AND CHAIN sets the
blockState to TBLOCK_ABORT_PENDING, so the chaining is triggered.


I think disabling s->chain beforehand should do the desired behavior.


Patch applies with "patch", although "git apply" complained because of 
CRLF line terminations forced by the octet-stream mime type.


Patch compiles cleanly. Make check ok.

Patch works for me, and solution seems appropriate. It should be committed 
for pg 12.0.


There could be a test added in "regress/sql/transactions.sql", I'd suggest 
something like:


-- implicit transaction and not chained.
COMMIT AND CHAIN;
COMMIT;
ROLLBACK AND CHAIN;
ROLLBACK;

which should show the appropriate "no transaction in progress" warnings.

Doc could be made a little clearer about what to expect when there is no 
explicit transaction in progress.


--
Fabien.




Re: refactoring - share str2*int64 functions

2019-08-29 Thread Fabien COELHO


Bonjour Michaël,


 - *ptr && WHATEVER(*ptr)
  *ptr is redundant, as WHATEVER yields false on '\0', and it costs on
  every char except at the end. It might be debatable in some places,
  e.g. it is likely that there are no spaces in the string, but likely
  that there is more than one digit.


Still this makes the checks less robust?


I do not see any downside, but maybe I lack imagination.

On my laptop, isWHATEVER is implemented through an array mapping
characters to a bitfield saying whether each char is WHATEVER,
depending on the bit. This array is well defined for index 0 ('\0').


If an implementation is based on comparisons, for isdigit it would be:

   c >= '0' && c <= '9'

Then checking c != '\0' is redundant with c >= '0'.

Given the way the character checks are used in the function, we do not go 
beyond the end of the string because we only proceed further when a 
character is actually recognized, else we return.


So I cannot see any robustness issue, just a few cycles saved.
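
For instance, a typical parse loop can rely on this (overflow checking
omitted here, see below):

#include <ctype.h>
#include <stdint.h>

/* '\0' is not a digit, so the loop stops at the end of the string
 * without a separate *ptr != '\0' test on every character. */
static uint64_t
parse_digits(const char *ptr)
{
    uint64_t    val = 0;

    while (isdigit((unsigned char) *ptr))
        val = val * 10 + (uint64_t) (*ptr++ - '0');
    return val;
}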


Hmmm. Have you looked at the fallback cases when the corresponding builtins
are not available?

I'm unsure of a reliable way to detect a generic unsigned int overflow
without expensive dividing back and having to care about zero…


Mr Freund has mentioned that here:
https://www.postgresql.org/message-id/20190717184820.iqz7schxdbucm...@alap3.anarazel.de


Yep, that is what I mean by expensive: there is an integer division, which 
is avoided if b is known to be 10, hence is not zero and the limit value 
can be checked directly on the input without having to perform a division 
each time.
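
Concretely, with the base fixed at 10 the limits become compile-time
constants, so the per-digit check needs no runtime division. The names
below are mine, not the patch's:

#include <stdbool.h>
#include <stdint.h>

/* Returns false on overflow; UINT64_MAX / 10 and UINT64_MAX % 10 are
 * folded by the compiler, so no division happens at run time. */
static bool
mul10_add_digit(uint64_t val, unsigned digit, uint64_t *result)
{
    if (val > UINT64_MAX / 10 ||
        (val == UINT64_MAX / 10 && digit > UINT64_MAX % 10))
        return false;
    *result = val * 10 + digit;
    return true;
}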



So I was pretty happy with my two discreet, small and efficient tests.


That's also a matter of code and interface consistency IMHO.


Possibly.

ISTM that part of the motivation is to reduce costs for heavily used 
conversions, eg on COPY. Function "scanf" is overly expensive because it 
has to interpret its format, and we have to check for overflows…


Anyway, if we assume that the builtins exist and rely on efficient 
hardware check, maybe we do not care about the fallback cases which would 
just be slow but never executed.


Note that all this is moot, as none of the existing instances of
string to uint64 conversion check for errors.


Attached v7 does create uint64 overflow inline functions. The stuff is
not yet moved to "common/int.c", a file which does not exist, even
though "common/int.h" does.


--
Fabien.

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 221b47298c..28891ba337 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -79,6 +79,7 @@
 #include "utils/builtins.h"
 #include "utils/hashutils.h"
 #include "utils/memutils.h"
+#include "common/string.h"
 
 PG_MODULE_MAGIC;
 
@@ -1022,7 +1023,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 		/* parse command tag to retrieve the number of affected rows. */
 		if (completionTag &&
 			strncmp(completionTag, "COPY ", 5) == 0)
-			rows = pg_strtouint64(completionTag + 5, NULL, 10);
+			(void) pg_strtouint64(completionTag + 5, &rows);
 		else
 			rows = 0;
 
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 2c0ae395ba..8e75d52b06 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -21,6 +21,7 @@
 #include "catalog/heap.h"
 #include "catalog/pg_type.h"
 #include "commands/trigger.h"
+#include "common/string.h"
 #include "executor/executor.h"
 #include "executor/spi_priv.h"
 #include "miscadmin.h"
@@ -2338,8 +2339,7 @@ _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
 	CreateTableAsStmt *ctastmt = (CreateTableAsStmt *) stmt->utilityStmt;
 
 	if (strncmp(completionTag, "SELECT ", 7) == 0)
-		_SPI_current->processed =
-			pg_strtouint64(completionTag + 7, NULL, 10);
+		(void) pg_strtouint64(completionTag + 7, &_SPI_current->processed);
 	else
 	{
 		/*
@@ -2361,8 +2361,7 @@ _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
 else if (IsA(stmt->utilityStmt, CopyStmt))
 {
 	Assert(strncmp(completionTag, "COPY ", 5) == 0);
-	_SPI_current->processed = pg_strtouint64(completionTag + 5,
-			 NULL, 10);
+	(void) pg_strtouint64(completionTag + 5, &_SPI_current->processed);
 }
 			}
 
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb90c..17cde42b4d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -34,6 +34,7 @@
 
 #include "fmgr.h"
 #include "miscadmin.h"
+#include "common/string.h"
 #include "nodes/extensible.h"
 #include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
@@ -80,7 +81,7 @@
 #define READ_UINT64_FIELD(fldname) \
 	token = pg_strtok();		/* skip :fldname */ \
 	token = pg_strtok();		/* get field value */ \
-	local_node->fldname = pg_strtouint64(token, NULL, 10)
+	(void) pg_strtouint64(token, &local_node->fldname)
 
 /* Read a long integer field