Misleading comment about single_copy, and some bikeshedding

2019-06-23 Thread Thomas Munro
Hi,

/*
 * GatherMergePath runs several copies of a plan in parallel and collects
 * the results, preserving their common sort order.  For gather merge, the
 * parallel leader always executes the plan too, so we don't need single_copy.
 */
typedef struct GatherMergePath

The second sentence is not true as of commit e5253fdc, and the
attached patch removes it.

Even before that commit, the comment was a bit circular: the reason
GatherMergePath doesn't need a single_copy field is that
force_parallel_mode specifically means "try to stick a Gather node on
top in a test mode with one worker and no leader participation", and
this isn't a Gather node.

Hmm.  I wonder if we should rename force_parallel_mode to
force_gather_node in v13.  The current name has always seemed slightly
misleading to me; it sounds like some kind of turbo boost button but
really it's a developer-only test mode.  Also, does it belong under
DEVELOPER_OPTIONS instead of QUERY_TUNING_OTHER?  I'm also wondering
if the variable single_copy would be better named
no_leader_participation or single_participant.  I find "copy" a
slightly strange way to refer to the number of copies *allowed to
run*, but maybe that's just me.

-- 
Thomas Munro
https://enterprisedb.com


0001-Remove-misleading-comment-from-pathnodes.h.patch
Description: Binary data


Re: using explicit_bzero

2019-06-23 Thread Michael Paquier
On Sun, Jun 23, 2019 at 09:56:40PM +0200, Peter Eisentraut wrote:
> On 2019-06-21 15:45, Dagfinn Ilmari Mannsåker wrote:
>> Also, on Linux it requires libbsd: https://libbsd.freedesktop.org/
> 
> No, it's in glibc.

From the man page:
explicit_bzero() first appeared in glibc 2.25.
--
Michael




Re: using explicit_bzero

2019-06-23 Thread Michael Paquier
On Sun, Jun 23, 2019 at 09:57:18PM +0200, Peter Eisentraut wrote:
> On 2019-06-23 21:55, Peter Eisentraut wrote:
>> On 2019-06-21 15:25, Tom Lane wrote:
>>> Peter Eisentraut  writes:
 +#ifndef HAVE_EXPLICIT_BZERO
 +#define explicit_bzero(b, len) bzero(b, len)
 +#endif
>>>
>>> This presumes that every platform has bzero, which is unsafe (POSIX
>>> doesn't specify it) and is an assumption we kicked to the curb a dozen
>>> years ago (067a5cdb3).  Please use memset() for the substitute instead.

+1.

>> OK, done.
> 
> and with patch attached

CreateRole() and AlterRole() can manipulate a password in plain text
in memory.  The cleanup could be done just after calling
encrypt_password() in user.c.
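
For what it's worth, the pattern is simple enough; here is a small
standalone sketch of it (encrypt_password_stub() is only a stand-in for
the real encrypt_password() in src/backend/libpq/crypt.c, and the
fallback macro mirrors the one added to port.h in the patch):

#include <stdio.h>
#include <string.h>

#ifndef HAVE_EXPLICIT_BZERO
#define explicit_bzero(b, len) memset(b, 0, len)
#endif

/* stand-in for encrypt_password(); only illustrates the call pattern */
static void
encrypt_password_stub(const char *plain, char *verifier, size_t verifier_size)
{
	snprintf(verifier, verifier_size,
			 "SCRAM-SHA-256$<verifier derived from %zu bytes of input>",
			 strlen(plain));
}

int
main(void)
{
	char		password[] = "plain password";
	char		verifier[128];

	encrypt_password_stub(password, verifier, sizeof(verifier));

	/* wipe the plaintext copy as soon as the verifier has been computed */
	explicit_bzero(password, strlen(password));

	printf("%s\n", verifier);
	return 0;
}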

Would it be possible to add the new flag to pg_config.h.win32?
--
Michael




Re: check_recovery_target_lsn() does a PG_CATCH without a throw

2019-06-23 Thread Michael Paquier
On Sun, Jun 23, 2019 at 07:21:02PM +0200, Peter Eisentraut wrote:
> Updated patch for that.

I have been looking at this patch set.  Patch 0001 looks good to me.
You are removing all traces of a set of timestamp keywords not
supported anymore, and no objections from my side for this cleanup.

+#define MAXPG_LSNCOMPONENT 8
+
 static bool
 check_recovery_target_lsn(char **newval, void **extra, GucSource source)
Let's avoid the duplication for the declarations.  I would suggest
moving the definitions of MAXPG_LSNLEN and MAXPG_LSNCOMPONENT to
pg_lsn.h.  Funnily enough, I actually needed this definition a
couple of days ago for an LSN string in a frontend tool.  I would
suggest renaming them at the same time:
- PG_LSN_LEN
- PG_LSN_COMPONENT

I think we should also have a third definition for
"0123456789abcdefABCDEF", say PG_LSN_CHARACTERS, and we could have one
more for the separator '/'.
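
For concreteness, a rough sketch of what that could look like in
pg_lsn.h (the numeric values just mirror the existing MAXPG_LSNLEN and
MAXPG_LSNCOMPONENT definitions, and PG_LSN_SEPARATOR is only a name
picked here for the last item):

#define PG_LSN_COMPONENT	8	/* max hex digits in each half of an LSN */
#define PG_LSN_LEN			17	/* "XXXXXXXX/XXXXXXXX" */
#define PG_LSN_CHARACTERS	"0123456789abcdefABCDEF"
#define PG_LSN_SEPARATOR	'/'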

Avoiding the duplication between pg_lsn.c and guc.c is proving to be
rather ugly and reduces the readability within pg_lsn.c, so please let
me withdraw my previous objection.  (Looked at that part.)

-   if (strcmp(*newval, "epoch") == 0 ||
-   strcmp(*newval, "infinity") == 0 ||
-   strcmp(*newval, "-infinity") == 0 ||
Why do you remove these?  They should still be rejected because they
make no sense as recovery targets, no?

It may be worth mentioning that AdjustTimestampForTypmod() is not
duplicated because we don't care about the typmod in this case.
--
Michael




Re: Problem with default partition pruning

2019-06-23 Thread Alvaro Herrera
On 2019-Jun-24, shawn wang wrote:

Hello,

> Thank you for your reply.
> You can see that the mail start time is February 22. So I looked at the
> latest version at that time. I found that v11.2 was the newest branch at
> the time. So I tried to merge this patch into the code, and I found that
> everything worked.

I see -- I only tried master; it didn't occur to me to try it against
REL_11_STABLE.

> So I tested on this branch and got the results.
> You need to add the v4_default_partition_pruning.patch first,
> and then add the v3_ignore_contradictory_where_clauses_at_partprune_step.patch.

Oh, so there are two patches?  It's easier to keep track if they're
always posted together.  Anyway, I may have some time to have a look
tomorrow (Monday).

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Optimize partial TOAST decompression

2019-06-23 Thread Binguo Bao
> This is not correct: L bytes of compressed data cannot always be
decoded into at least L bytes of data. At worst we have one control byte
per 8 bytes of literal bytes. This means at most we need (L*9 + 8) / 8
bytes with the current pglz format.

Good catch! I've corrected the related code in the patch.

> Also, I'm not sure you use SET_VARSIZE_COMPRESSED correctly...
I followed the code in the toast_fetch_datum function[1], and I didn't see
anything wrong with it.

Best regards, Binguo Bao

[1]
https://github.com/postgres/postgres/blob/master/src/backend/access/heap/tuptoaster.c#L1898

Andrey Borodin  wrote on Sunday, June 23, 2019 at 5:23 PM:

> Hi, Binguo!
>
> > On 2 June 2019, at 19:48, Binguo Bao  wrote:
> >
> > Hi, hackers!
> 
> > This seems to have a 10x improvement. If there are more TOAST data chunks,
> I believe the patch can play an even greater role; there are about 200
> related TOAST data chunks for each entry in this case.
>
> That's really cool that you could produce a meaningful patch long before the
> end of GSoC!
>
> I'll describe what is going on a little:
> 1. We have a compressed value, which resides in the TOAST table.
> 2. We want only some fraction of this value. We want some prefix with
> length L.
> 3. Previously Paul Ramsey submitted a patch that omits decompression of
> the value beyond the desired L bytes.
> 4. Binguo's patch tries not to fetch compressed data which will not be
> needed by the decompressor. In fact it fetches L bytes from the TOAST table.
>
> This is not correct: L bytes of compressed data cannot always be
> decoded into at least L bytes of data. At worst we have one control byte
> per 8 bytes of literal bytes. This means at most we need (L*9 + 8) / 8
> bytes with the current pglz format.
>
> Also, I'm not sure you use SET_VARSIZE_COMPRESSED correctly...
>
> Best regards, Andrey Borodin.
From 505dcc4fdef18710c98718685180190056b4d9ed Mon Sep 17 00:00:00 2001
From: BBG 
Date: Sun, 2 Jun 2019 19:18:46 +0800
Subject: [PATCH] Optimize partial TOAST decompression

---
 src/backend/access/heap/tuptoaster.c | 34 +++---
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 55d6e91..89ffc2d 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -262,6 +262,8 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 	struct varlena *result;
 	char	   *attrdata;
 	int32		attrsize;
+	int32 		needsize;
+	int32 		max_compressed_size;
 
 	if (VARATT_IS_EXTERNAL_ONDISK(attr))
 	{
@@ -273,8 +275,22 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 		if (!VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
 			return toast_fetch_datum_slice(attr, sliceoffset, slicelength);
 
-		/* fetch it back (compressed marker will get set automatically) */
-		preslice = toast_fetch_datum(attr);
+		needsize = sliceoffset + slicelength;
+		if (needsize > (INT_MAX / 9))
+		{
+			/* Approximate to avoid overflow */
+			max_compressed_size = (needsize / 8 + 1) * 9;
+		}
+		else
+		{
+			max_compressed_size = (needsize * 9) / 8 + 1;
+		}
+
+		/*
+		 * Be sure to get enough compressed slice
+		 * and compressed marker will get set automatically
+		 */
+		preslice = toast_fetch_datum_slice(attr, 0, max_compressed_size);
 	}
 	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
 	{
@@ -2031,7 +2047,8 @@ toast_fetch_datum(struct varlena *attr)
  *	Reconstruct a segment of a Datum from the chunks saved
  *	in the toast relation
  *
- *	Note that this function only supports non-compressed external datums.
+ *	Note that this function supports non-compressed external datums
+ *	and compressed external datum slices at the start of the object.
  * --
  */
 static struct varlena *
@@ -2072,10 +2089,9 @@ toast_fetch_datum_slice(struct varlena *attr, int32 sliceoffset, int32 length)
 	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
 
 	/*
-	 * It's nonsense to fetch slices of a compressed datum -- this isn't lo_*
-	 * we can't return a compressed datum which is meaningful to toast later
+	 * It's meaningful to fetch slices at the start of a compressed datum.
 	 */
-	Assert(!VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer));
+	Assert(!VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer) || 0 == sliceoffset);
 
 	attrsize = toast_pointer.va_extsize;
 	totalchunks = ((attrsize - 1) / TOAST_MAX_CHUNK_SIZE) + 1;
@@ -2091,7 +2107,11 @@ toast_fetch_datum_slice(struct varlena *attr, int32 sliceoffset, int32 length)
 
 	result = (struct varlena *) palloc(length + VARHDRSZ);
 
-	SET_VARSIZE(result, length + VARHDRSZ);
+	if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer)) {
+		SET_VARSIZE_COMPRESSED(result, length + VARHDRSZ);
+	} else {
+		SET_VARSIZE(result, length + VARHDRSZ);
+	}
 
 	if (length == 0)
 		return result;			/* Can save a lot of work at this point! */
-- 
2.7.4



Re: Plugging some testing holes

2019-06-23 Thread Michael Paquier
On Sun, Jun 23, 2019 at 06:15:06PM -0400, Andrew Dunstan wrote:
> Alvaro pointed out to me recently that the buildfarm client doesn't have
> any provision for running module tests like commit_ts and
> snapshot_too_old that use NO_INSTALLCHECK. On looking into this a bit
> more, I noticed that we also don't run any TAP tests in
> src/test/modules. I'm adding some code to the client to remedy both of
> these, and crake has been running it quite happily for a week or so. Are
> there any other holes of this nature that should be plugged?

src/test/kerberos/ and src/test/ldap/.

Contrib modules that have TAP tests are actually able to run them.
Only an installcheck triggered from contrib/ happens at the
contrib-install-check-C step, right?

> We'll need some MSVC build tools support for some of it.

This one is more complex.  We don't actually track TAP_TESTS in
src/tools/msvc/ yet.  Cough.
--
Michael




Re: Fix doc bug in logical replication.

2019-06-23 Thread Robert Treat
On Sun, Jun 23, 2019 at 1:25 PM Peter Eisentraut
 wrote:
>
> On 2019-04-12 19:52, Robert Treat wrote:
> > It is clear to me that the docs are wrong, but I don't see anything
> > inherently incorrect about the code itself. Do you have suggestions
> > for how you would like to see the code comments improved?
>
> The question is perhaps whether we want to document that non-matching
> data types do work.  It happens to work now, but do we always want to
> guarantee that?  There is talk of a binary mode for example.
>

Whether or not we *want* to document that it works, documenting that it
doesn't work when it does can't be the right answer. If you want to
couch the language to leave the door open that we may not support this
the same way in the future, I wouldn't be opposed to that, but at this
point we will have three releases with the current behavior in
production, so if we decide to change the behavior, it is likely going
to break certain use cases. That may be ok, but I'd expect a
documentation update to accompany such a breaking change.

Robert Treat
https://xzilla.net




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-06-23 Thread Masahiko Sawada
On Thu, Jun 20, 2019 at 10:46 PM Joe Conway  wrote:
>
> On 6/20/19 8:34 AM, Masahiko Sawada wrote:
> > I think even if we provide per-table encryption we can have
> > encryption keys per tablespace. That is, all tables in the same
> > tablespace use the same encryption key, but the user can
> > control which objects are encrypted per table.
> >
> >> Will we add the table-level TDE in the first version?
> >
> > I hope so, but it's under discussion now.
>
> +1
>
> >> And I have two questions.
> >> 1. Will we add hooks to support replacing the encryption algorithms?
> >> 2. Will we add some encryption algorithm or use some in some libraries?
> >
> > Currently the WIP patch uses openssl and supports only AES-256, and I
> > don't have a plan for such extensibility for now. But it might be
> > a good idea in the future. I think it would not be hard to support
> > symmetric encryption algorithms supported by openssl, but would you
> > like to support other encryption algorithms?
>
> Supporting other symmetric encryption algorithms would be nice but I
> don't think that is required for the first version. It would also be
> nice but not initially required to support different encryption
> libraries. The implementation should be written with both of these
> eventualities in mind though IMHO.

Agreed.

>
> I would like to see strategically placed hooks in the key management so
> that an extension could, for example, layer another key in between the
> master key and the per-tablespace key. This would allow extensions to
> provide additional control over when decryption is allowed.

Interesting.

The master key management is an important topic. In my proposal, we
provide generic key management APIs such as getkey, removekey and
generatekey in order to manage the master key. A key management
extension could get the master key from an arbitrary external system,
so it could also layer another key in between the master key and
the per-tablespace key.

Do you have any thoughts on the master key management? That design
would be flexible but complicated. In particular, the API design would
be controversial.
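
To make that a bit more concrete, here is a purely hypothetical sketch
of the shape such an API could take; none of these names are from the
WIP patch, they only illustrate the idea of an extension supplying the
master key:

typedef struct KeyManagementRoutines
{
	/* fetch the master key identified by keyid, e.g. from an external KMS */
	bool		(*getkey) (const char *keyid, unsigned char *key, int keylen);

	/* generate and register a new master key */
	bool		(*generatekey) (const char *keyid);

	/* remove a master key, e.g. after key rotation */
	bool		(*removekey) (const char *keyid);
} KeyManagementRoutines;

/* an extension would register its implementation through a hook like this */
extern KeyManagementRoutines *key_management_routines;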

There is a way to enter a passphrase into Postgres to get the master
key stored in the database, but the passphrase could be written to the
server log if we pass it using an SQL command, which is bad. It would
require inventing another system to prevent particular SQL from being
written to the server log. Another option is to enter the password or
passphrase via a command line option, but that could also require users
to write the plain passphrase or password in script files, which is
bad too.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: Problem with default partition pruning

2019-06-23 Thread shawn wang
Hi Alvaro,
Thank you for your reply.
You can see that the mail start time is February 22. So I looked at the
latest version at that time. I found that v11.2 was the newest branch at
the time. So I tried to merge this patch into the code, and I found that
everything worked. So I tested on this branch and got the results.
You need to add the v4_default_partition_pruning.patch first,
and then add the v3_ignore_contradictory_where_clauses_at_partprune_step.patch.
Otherwise, you will find some errors.
I hope this helps you.

Regards.

-- 
Shawn Wang

Alvaro Herrera  wrote on Saturday, June 22, 2019 at 4:03 AM:

> On 2019-Jun-17, Shawn Wang wrote:
>
> > I tested different types of key values, and multi-level partitioned
> tables, and found no problems.
> > Only the SQL in the file src/test/regress/results/partition_prune.out
> has a space that caused the regression test to fail.
>
> It's not clear to me what patch you were reviewing.  The latest patch I
> see in this thread, in [1], does not apply in any branches.  As another
> test, I tried to apply the patch on commit 489e431ba56b (before Tom's
> partprune changes in mid May); if you use "patch -p1
> --ignore-whitespace" it is accepted, but the failure case proposed at
> the start of the thread shows the same behavior (namely, that test1_def
> is scanned when it is not needed):
>
> 55432 12devel 23506=# create table test1(id int, val text) partition by
> range (id);
> create table test1_1 partition of test1 for values from (0) to (100);
> create table test1_2 partition of test1 for values from (150) to (200);
> create table test1_def partition of test1 default;
> explain select * from test1 where id > 0 and id < 30;
> CREATE TABLE
> Duration: 5,736 ms
> CREATE TABLE
> Duration: 5,622 ms
> CREATE TABLE
> Duration: 3,585 ms
> CREATE TABLE
> Duration: 3,828 ms
>QUERY PLAN
> ─
>  Append  (cost=0.00..58.16 rows=12 width=36)
>->  Seq Scan on test1_1  (cost=0.00..29.05 rows=6 width=36)
>  Filter: ((id > 0) AND (id < 30))
>->  Seq Scan on test1_def  (cost=0.00..29.05 rows=6 width=36)
>  Filter: ((id > 0) AND (id < 30))
> (5 rows)
>
> Duration: 2,465 ms
>
>
> [1] https://postgr.es/m/00cf01d4eea7$afa43370$0eec9a50$@lab.ntt.co.jp
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>
>


Re: Choosing values for multivariate MCV lists

2019-06-23 Thread Tomas Vondra

On Mon, Jun 24, 2019 at 01:42:32AM +0200, Tomas Vondra wrote:

On Sun, Jun 23, 2019 at 10:23:19PM +0200, Tomas Vondra wrote:

On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:

On Sat, 22 Jun 2019 at 15:10, Tomas Vondra  wrote:

One annoying thing I noticed is that the base_frequency tends to end up
being 0, most likely due to getting too small. It's a bit strange, though,
because with statistic target set to 10k the smallest frequency for a
single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
something the float8 can represent).



Yeah, it should be impossible for the base frequency to underflow to
0. However, it looks like the problem is with mcv_list_items()'s use
of %f to convert to text, which is pretty ugly.



Yeah, I realized that too, eventually. One way to fix that would be
adding %.15f to the sprintf() call, but that just adds ugliness. It's
probably time to rewrite the function to build the tuple from datums,
instead of relying on BuildTupleFromCStrings.



OK, attached is a patch doing this. It's pretty simple, and it does
resolve the issue with frequency precision.

There's one issue with the signature, though - currently the function
returns the null flags as a bool array, but the values are returned as a
simple text value (formatted in an array-like way, but still just text).

In the attached patch I've reworked both to proper arrays, but obviously
that'd require a CATVERSION bump - and there's not much appetite for that
past beta2, I suppose. So I'll just undo this bit.



Meh, I forgot to attach the patch, of course ...


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 
>From 9aa3ce9bfdcab50a6d5dcf289bcf715e4a8aed9c Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Mon, 24 Jun 2019 00:50:53 +0200
Subject: [PATCH 1/3] rework pg_mcv_list_items

---
 src/backend/statistics/mcv.c| 106 
 src/include/catalog/pg_proc.dat |   2 +-
 2 files changed, 42 insertions(+), 66 deletions(-)

diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 5fe61ea0a4..1b1a5c9846 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1134,9 +1134,6 @@ Datum
 pg_stats_ext_mcvlist_items(PG_FUNCTION_ARGS)
 {
FuncCallContext *funcctx;
-   int call_cntr;
-   int max_calls;
-   AttInMetadata *attinmeta;
 
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
@@ -1166,13 +1163,13 @@ pg_stats_ext_mcvlist_items(PG_FUNCTION_ARGS)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("function returning record 
called in context "
"that cannot accept 
type record")));
+   tupdesc = BlessTupleDesc(tupdesc);
 
/*
 * generate attribute metadata needed later to produce tuples 
from raw
 * C strings
 */
-   attinmeta = TupleDescGetAttInMetadata(tupdesc);
-   funcctx->attinmeta = attinmeta;
+   funcctx->attinmeta = TupleDescGetAttInMetadata(tupdesc);
 
MemoryContextSwitchTo(oldcontext);
}
@@ -1180,18 +1177,17 @@ pg_stats_ext_mcvlist_items(PG_FUNCTION_ARGS)
/* stuff done on every call of the function */
funcctx = SRF_PERCALL_SETUP();
 
-   call_cntr = funcctx->call_cntr;
-   max_calls = funcctx->max_calls;
-   attinmeta = funcctx->attinmeta;
-
-   if (call_cntr < max_calls)  /* do when there is more left to send */
+   if (funcctx->call_cntr < funcctx->max_calls)/* do when there is 
more left to send */
{
-   char  **values;
+   Datum   values[5];
+   boolnulls[5];
HeapTuple   tuple;
Datum   result;
+   int dims[1];
+   int lbs[1];
 
-   StringInfoData itemValues;
-   StringInfoData itemNulls;
+   Datum *itemNulls;
+   Datum *itemValues;
 
int i;
 
@@ -1203,24 +1199,16 @@ pg_stats_ext_mcvlist_items(PG_FUNCTION_ARGS)
 
mcvlist = (MCVList *) funcctx->user_fctx;
 
-   Assert(call_cntr < mcvlist->nitems);
-
-   item = &mcvlist->items[call_cntr];
-
-   /*
-* Prepare a values array for building the returned tuple. This 
should
-* be an array of C strings which will be processed later by 
the type
-* input functions.
-*/
-   values = (char **) palloc0(5 * sizeof(char *));
+   Assert(funcctx->call_cntr < mcvlist->nitems);
 
-   

Re: Choosing values for multivariate MCV lists

2019-06-23 Thread Tomas Vondra

On Sat, Jun 22, 2019 at 04:10:52PM +0200, Tomas Vondra wrote:

On Fri, Jun 21, 2019 at 08:50:33AM +0100, Dean Rasheed wrote:

On Thu, 20 Jun 2019 at 23:35, Tomas Vondra  wrote:


On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:


I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
though, because that would seem likely to put too much weight on the
least commonly occurring values.


But would that be an issue, or a good thing? I mean, as long as the item
is above mincount, we take the counts as reliable. As I explained, my
motivation for proposing that was that both

  ... (cost=... rows=1 ...) (actual=... rows=101 ...)

and

  ... (cost=... rows=100 ...) (actual=... rows=200 ...)

have exactly the same Abs(freq - base_freq), but I think we both agree
that the first misestimate is much more dangerous, because it's off by six
orders of magnitude.



Hmm, that's a good example. That definitely suggests that we should be
trying to minimise the relative error, but also perhaps that what we
should be looking at is actually just the ratio freq / base_freq,
rather than their difference.



Attached are patches that implement this (well, the first one optimizes
how the base frequency is computed, which helps for high statistic target
values). The 0002 part picks the items based on

 Max(freq/base_freq, base_freq/freq)

It did help with the addresses data set quite a bit, but I'm sure it needs
more testing. I've also tried using

 freq * abs(freq - base_freq)

but the estimates were not as good.

One annoying thing I noticed is that the base_frequency tends to end up
being 0, most likely due to getting too small. It's a bit strange, though,
because with statistic target set to 10k the smallest frequency for a
single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
something the float8 can represent).



FWIW while doing more tests on this, I've noticed a rather annoying
behavior when increasing the statistics target.

With the current algorithm picking values merely based on frequency, the
MCV list is expanded in a stable way. Increasing the statistics target
means the MCV list may grow, but the larger MCV list will contain the
smaller MCV one (ignoring changes due to differences in the sample).

After switching to the two-phase algorithm (first picking candidates
based on mincount, then picking the items based on error) that's no
longer true. I've repeatedly seen cases where increasing the target
lowered mincount, adding candidates with larger frequency errors,
removing some of the items from the "smaller" MCV list.

In practice this means that increasing the statistics target may easily
make the estimates much worse (for some of the values).
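
To illustrate why the ratio-based metric in 0002 behaves better than the
absolute difference (this is not code from the patches; it just loosely
maps the estimated row count to base_freq and the actual count to freq
in the earlier example):

#include <stdio.h>
#include <math.h>

/* selection metric from the 0002 patch: Max(freq/base_freq, base_freq/freq) */
static double
mcv_error(double freq, double base_freq)
{
	return fmax(freq / base_freq, base_freq / freq);
}

int
main(void)
{
	/* rows=1 estimated vs rows=101 actual, say out of 1e6 sampled rows */
	printf("%g\n", mcv_error(101.0 / 1e6, 1.0 / 1e6));		/* prints 101 */

	/* rows=100 estimated vs rows=200 actual */
	printf("%g\n", mcv_error(200.0 / 1e6, 100.0 / 1e6));	/* prints 2 */

	/* Abs(freq - base_freq) would be the same 1e-4 in both cases */
	return 0;
}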


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Choosing values for multivariate MCV lists

2019-06-23 Thread Tomas Vondra

On Sun, Jun 23, 2019 at 10:23:19PM +0200, Tomas Vondra wrote:

On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:

On Sat, 22 Jun 2019 at 15:10, Tomas Vondra  wrote:

One annoying thing I noticed is that the base_frequency tends to end up
being 0, most likely due to getting too small. It's a bit strange, though,
because with statistic target set to 10k the smallest frequency for a
single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
something the float8 can represent).



Yeah, it should be impossible for the base frequency to underflow to
0. However, it looks like the problem is with mcv_list_items()'s use
of %f to convert to text, which is pretty ugly.



Yeah, I realized that too, eventually. One way to fix that would be
adding %.15f to the sprintf() call, but that just adds ugliness. It's
probably time to rewrite the function to build the tuple from datums,
instead of relying on BuildTupleFromCStrings.



OK, attached is a patch doing this. It's pretty simple, and it does
resolve the issue with frequency precision.

There's one issue with the signature, though - currently the function
returns the null flags as a bool array, but the values are returned as a
simple text value (formatted in an array-like way, but still just text).

In the attached patch I've reworked both to proper arrays, but obviously
that'd require a CATVERSION bump - and there's not much appetite for that
past beta2, I suppose. So I'll just undo this bit.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Plugging some testing holes

2019-06-23 Thread Andrew Dunstan


Alvaro pointed out to me recently that the buildfarm client doesn't have
any provision for running module tests like commit_ts and
snapshot_too_old that use NO_INSTALLCHECK. On looking into this a bit
more, I noticed that we also don't run any TAP tests in
src/test/modules. I'm adding some code to the client to remedy both of
these, and crake has been running it quite happily for a week or so. Are
there any other holes of this nature that should be plugged? We'll need
some MSVC build tools support for some of it.


cheers


andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Choosing values for multivariate MCV lists

2019-06-23 Thread Tomas Vondra

On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:

On Sat, 22 Jun 2019 at 15:10, Tomas Vondra  wrote:

One annoying thing I noticed is that the base_frequency tends to end up
being 0, most likely due to getting too small. It's a bit strange, though,
because with statistic target set to 10k the smallest frequency for a
single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
something the float8 can represent).



Yeah, it should be impossible for the base frequency to underflow to
0. However, it looks like the problem is with mcv_list_items()'s use
of %f to convert to text, which is pretty ugly.



Yeah, I realized that too, eventually. One way to fix that would be
adding %.15f to the sprintf() call, but that just adds ugliness. It's
probably time to rewrite the function to build the tuple from datums,
instead of relying on BuildTupleFromCStrings.
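
The gist of that rework, as a sketch (illustrative only; the real thing
is in the patch posted later in the thread, and the column indexes here
simply follow the five-column output of pg_mcv_list_items):

#include "postgres.h"
#include "access/htup_details.h"

/*
 * Build one result row from datums so the frequencies keep their full
 * float8 precision, instead of sprintf("%f") + BuildTupleFromCStrings().
 */
static HeapTuple
mcv_item_tuple_sketch(TupleDesc tupdesc, double frequency, double base_frequency)
{
	Datum		values[5];
	bool		nulls[5];

	memset(values, 0, sizeof(values));
	memset(nulls, 0, sizeof(nulls));

	/* columns 0-2 (item index, values, nulls) omitted in this sketch */
	nulls[0] = nulls[1] = nulls[2] = true;

	values[3] = Float8GetDatum(frequency);
	values[4] = Float8GetDatum(base_frequency);

	return heap_form_tuple(tupdesc, values, nulls);
}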


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: unlogged sequences

2019-06-23 Thread Peter Eisentraut
On 2019-06-21 07:31, Michael Paquier wrote:
> 1) Some SQL queries:
> create unlogged sequence popo;
> alter sequence popo increment 2;

The problem is that the above command does a relation rewrite but the
code doesn't know to copy the init fork of the sequence.  That will need
to be addressed.

> select nextval('popo');
> select nextval('popo');
> 2) Then a hard crash:
> pg_ctl stop -m immediate
> pg_ctl start
> 3) Again, with a crash:
> select nextval('popo');   
>   
> 
> #2  0x55ce60f3208d in ExceptionalCondition

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Multivariate MCV stats can leak data to unprivileged users

2019-06-23 Thread Tomas Vondra

On Sun, Jun 23, 2019 at 06:56:53PM +0100, Dean Rasheed wrote:

On Mon, 13 May 2019 at 23:36, Tomas Vondra  wrote:


On Fri, May 10, 2019 at 10:19:44AM +0100, Dean Rasheed wrote:
>While working on 1aebfbea83c, I noticed that the new multivariate MCV
>stats feature suffers from the same problem, and also the original
>problems that were fixed in e2d4ef8de8 and earlier --- namely that a
>user can see values in the MCV lists that they shouldn't see (values
>from tables that they don't have privileges on).
>
>I think there are 2 separate issues here:
>
>1). The table pg_statistic_ext is accessible to anyone, so any user
>can see the MCV lists of any table. I think we should give this the
>same treatment as pg_statistic, and hide it behind a security barrier
>view, revoking public access from the table.
>
>2). The multivariate MCV stats planner code can be made to invoke
>user-defined operators, so a user can create a leaky operator and use
>it to reveal data values from the MCV lists even if they have no
>permissions on the table.
>
>Attached is a draft patch to fix (2), which hooks into
>statext_is_compatible_clause().
>

I think that patch is good.



I realised that we forgot to push this second part, so I've just done so.



Whoops! Too many patches in this thread. Thanks for noticing.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: using explicit_bzero

2019-06-23 Thread Peter Eisentraut
On 2019-06-23 21:55, Peter Eisentraut wrote:
> On 2019-06-21 15:25, Tom Lane wrote:
>> Peter Eisentraut  writes:
>>> +#ifndef HAVE_EXPLICIT_BZERO
>>> +#define explicit_bzero(b, len) bzero(b, len)
>>> +#endif
>>
>> This presumes that every platform has bzero, which is unsafe (POSIX
>> doesn't specify it) and is an assumption we kicked to the curb a dozen
>> years ago (067a5cdb3).  Please use memset() for the substitute instead.
> 
> OK, done.

and with patch attached

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From e193fbcf04833de829069b5e737a27b83c667703 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sun, 23 Jun 2019 21:53:14 +0200
Subject: [PATCH v2] Use explicit_bzero

---
 configure| 2 +-
 configure.in | 1 +
 src/backend/libpq/be-secure-common.c | 1 +
 src/include/pg_config.h.in   | 3 +++
 src/include/port.h   | 4 
 src/interfaces/libpq/fe-connect.c| 8 
 6 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/configure b/configure
index 8d47071e4a..c88153fff5 100755
--- a/configure
+++ b/configure
@@ -15176,7 +15176,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt clock_gettime copyfile fdatasync getifaddrs getpeerucred 
getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat 
pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open 
strchrnul strsignal symlink sync_file_range uselocale utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime copyfile explicit_bzero fdatasync getifaddrs 
getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat 
pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open 
strchrnul strsignal symlink sync_file_range uselocale utime utimes wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 74938d4190..7e96d1d419 100644
--- a/configure.in
+++ b/configure.in
@@ -1615,6 +1615,7 @@ AC_CHECK_FUNCS(m4_normalize([
cbrt
clock_gettime
copyfile
+   explicit_bzero
fdatasync
getifaddrs
getpeerucred
diff --git a/src/backend/libpq/be-secure-common.c 
b/src/backend/libpq/be-secure-common.c
index 877226d377..4c1c6cb3c4 100644
--- a/src/backend/libpq/be-secure-common.c
+++ b/src/backend/libpq/be-secure-common.c
@@ -118,6 +118,7 @@ run_ssl_passphrase_command(const char *prompt, bool 
is_server_start, char *buf,
buf[--len] = '\0';
 
 error:
+   explicit_bzero(buf, size);
pfree(command.data);
return len;
 }
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 6cd4cfed0a..0062a4a426 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -201,6 +201,9 @@
 /* Define to 1 if you have the  header file. */
 #undef HAVE_EDITLINE_READLINE_H
 
+/* Define to 1 if you have the `explicit_bzero' function. */
+#undef HAVE_EXPLICIT_BZERO
+
 /* Define to 1 if you have the `fdatasync' function. */
 #undef HAVE_FDATASYNC
 
diff --git a/src/include/port.h b/src/include/port.h
index b5c03d912b..7c8b5138ba 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -381,6 +381,10 @@ extern int isinf(double x);
 #endif /* __clang__ && 
!__cplusplus */
 #endif /* !HAVE_ISINF */
 
+#ifndef HAVE_EXPLICIT_BZERO
+#define explicit_bzero(b, len) memset(b, 0, len)
+#endif
+
 #ifndef HAVE_STRTOF
 extern float strtof(const char *nptr, char **endptr);
 #endif
diff --git a/src/interfaces/libpq/fe-connect.c 
b/src/interfaces/libpq/fe-connect.c
index c800d7921e..887b8f6775 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -3884,7 +3884,10 @@ freePGconn(PGconn *conn)
if (conn->connhost[i].port != NULL)
free(conn->connhost[i].port);
if (conn->connhost[i].password != NULL)
+   {
+   explicit_bzero(conn->connhost[i].password, 
strlen(conn->connhost[i].password));
free(conn->connhost[i].password);
+   }
}
free(conn->connhost);
}
@@ -3918,7 +3921,10 @@ freePGconn(PGconn *conn)
if (conn->pguser)
free(conn->pguser);
if (conn->pgpass)
+   {
+   explicit_bzero(conn->pgpass, strlen(conn->pgpass));
free(conn->pgpass);
+   }
if (conn->pgpassfile)
free(conn->pgpassfile);
if (conn->keepalives)
@@ -6935,6 +6941,7 @@ passwordFromFile(const char *hostname, const char *port, 
const char *dbname,

Re: using explicit_bzero

2019-06-23 Thread Peter Eisentraut
On 2019-06-21 15:45, Dagfinn Ilmari Mannsåker wrote:
> Also, on Linux it requires libbsd: https://libbsd.freedesktop.org/

No, it's in glibc.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Choosing values for multivariate MCV lists

2019-06-23 Thread Dean Rasheed
On Sat, 22 Jun 2019 at 15:10, Tomas Vondra  wrote:
> One annoying thing I noticed is that the base_frequency tends to end up
> being 0, most likely due to getting too small. It's a bit strange, though,
> because with statistic target set to 10k the smallest frequency for a
> single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
> something the float8 can represent).
>

Yeah, it should be impossible for the base frequency to underflow to
0. However, it looks like the problem is with mcv_list_items()'s use
of %f to convert to text, which is pretty ugly.

Regards,
Dean




Re: how to run encoding-dependent tests by default

2019-06-23 Thread Peter Eisentraut
On 2019-06-17 18:39, Andres Freund wrote:
> Basically something like:
> 
> \gset SELECT my_encodings_are_compatible() AS compatible
> \if :compatible
> test;
> contents;
> \endif

Cool, that works out quite well.  See attached patch.  I flipped the
logic around to make it \quit if not compatible.  That way the
alternative expected file is shorter and doesn't need to be updated all
the time.  But it gets the job done either way.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 1fd55c0a64b11ea23ab0c3791e21c21879a1b1de Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sun, 23 Jun 2019 21:40:53 +0200
Subject: [PATCH] Run UTF8-requiring collation tests by default

The tests collate.icu.utf8 and collate.linux.utf8 were previously only
run when explicitly selected via EXTRA_TESTS.  They require a UTF8
database, because the error messages in the expected files refer to
that, and they use some non-ASCII characters in the tests.  Since
users can select any locale and encoding for the regression test run,
it was not possible to include these tests automatically.

To fix, use psql's \if facility to check the server encoding and quit
the tests at the very beginning if a server encoding other than UTF8
is set.  We then need to maintain alternative expected files for these
tests, but they are very tiny and never need to change after this.

These two tests are now run automatically as part of the regression
tests, given an appropriate build environment (Linux or ICU enabled,
respectively).

Discussion: 
https://www.postgresql.org/message-id/flat/052295c2-a2e1-9a21-bd36-8fbff8686cf3%402ndquadrant.com
---
 doc/src/sgml/regress.sgml  | 8 
 src/test/regress/GNUmakefile   | 8 
 src/test/regress/expected/collate.icu.utf8.out | 5 +
 src/test/regress/expected/collate.icu.utf8_1.out   | 7 +++
 src/test/regress/expected/collate.linux.utf8.out   | 4 
 src/test/regress/expected/collate.linux.utf8_1.out | 8 
 src/test/regress/sql/collate.icu.utf8.sql  | 6 ++
 src/test/regress/sql/collate.linux.utf8.sql| 5 +
 8 files changed, 43 insertions(+), 8 deletions(-)
 create mode 100644 src/test/regress/expected/collate.icu.utf8_1.out
 create mode 100644 src/test/regress/expected/collate.linux.utf8_1.out

diff --git a/doc/src/sgml/regress.sgml b/doc/src/sgml/regress.sgml
index 448e2d19f7..101dc659e4 100644
--- a/doc/src/sgml/regress.sgml
+++ b/doc/src/sgml/regress.sgml
@@ -316,14 +316,6 @@ Extra Tests
 
 make check EXTRA_TESTS=numeric_big
 
-To run the collation tests:
-
-make check EXTRA_TESTS='collate.linux.utf8 collate.icu.utf8' LANG=en_US.utf8
-
-The collate.linux.utf8 test works only on Linux/glibc
-platforms.  The collate.icu.utf8 test only works when
-support for ICU was built.  Both tests will only succeed when run in a
-database that uses UTF-8 encoding.

   
 
diff --git a/src/test/regress/GNUmakefile b/src/test/regress/GNUmakefile
index a24cfd4e01..80d515003c 100644
--- a/src/test/regress/GNUmakefile
+++ b/src/test/regress/GNUmakefile
@@ -128,6 +128,14 @@ tablespace-setup:
 
 REGRESS_OPTS = --dlpath=. --max-concurrent-tests=20 $(EXTRA_REGRESS_OPTS)
 
+ifeq ($(PORTNAME),linux)
+EXTRA_TESTS += collate.linux.utf8
+endif
+
+ifeq ($(with_icu),yes)
+EXTRA_TESTS += collate.icu.utf8
+endif
+
 check: all tablespace-setup
$(pg_regress_check) $(REGRESS_OPTS) 
--schedule=$(srcdir)/parallel_schedule $(MAXCONNOPT) $(EXTRA_TESTS)
 
diff --git a/src/test/regress/expected/collate.icu.utf8.out 
b/src/test/regress/expected/collate.icu.utf8.out
index 01bd9fb5dd..2a927771a2 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1,6 +1,11 @@
 /*
  * This test is for ICU collations.
  */
+/* skip test if not UTF8 server encoding */
+select getdatabaseencoding() <> 'UTF8' AS server_encoding_incompatible \gset
+\if :server_encoding_incompatible
+\quit
+\endif
 SET client_encoding TO UTF8;
 CREATE SCHEMA collate_tests;
 SET search_path = collate_tests;
diff --git a/src/test/regress/expected/collate.icu.utf8_1.out 
b/src/test/regress/expected/collate.icu.utf8_1.out
new file mode 100644
index 00..a622b35fb4
--- /dev/null
+++ b/src/test/regress/expected/collate.icu.utf8_1.out
@@ -0,0 +1,7 @@
+/*
+ * This test is for ICU collations.
+ */
+/* skip test if not UTF8 server encoding */
+select getdatabaseencoding() <> 'UTF8' AS server_encoding_incompatible \gset
+\if :server_encoding_incompatible
+\quit
diff --git a/src/test/regress/expected/collate.linux.utf8.out 
b/src/test/regress/expected/collate.linux.utf8.out
index 619688f851..000d883e70 100644
--- a/src/test/regress/expected/collate.linux.utf8.out
+++ b/src/test/regress/expected/collate.linux.utf8.out
@@ -3,6 +3,10 @@
  * locales is installed.  It must be run in a database with UTF-8 encoding,
  * because other 

Misleading comment in tuplesort_set_bound

2019-06-23 Thread James Coleman
While digging into the incremental sort patch, I noticed in
tuplesort.c at the beginning of the function in $SUBJECT we have this
comment and assertion:

tuplesort_set_bound(Tuplesortstate *state, int64 bound)
{
/* Assert we're called before loading any tuples */
Assert(state->status == TSS_INITIAL);

But AFAICT from reading the code in puttuple_common the state remains
TSS_INITIAL while tuples are inserted (unless we reach a point where
we decide to transition it to TSS_BOUNDED or TSS_BUILDRUNS).

Therefore it's not true that the assertion guards against having
loaded any tuples; rather it guarantees that we remain in standard
memory quicksort mode.

Assuming my understanding is correct, I've attached a small patch to
update the comment to "Assert we're still in memory quicksort mode and
haven't transitioned to heap or tape mode".

Note: this also means the function header comment "Must be called
before inserting any tuples" is a condition that isn't actually
validated, but I think that's fine given it's not a new problem and
even more so since the same comment goes on to say that that's
probably not a strict requirement.

James Coleman
commit bd185bbac4275299a587eae67eae58f856052a93
Author: jcoleman 
Date:   Sun Jun 23 19:13:09 2019 +

Correct assertion comment in tuplesort_set_bound

The implementation of puttuple_common keeps the state TSS_INITIAL
while tuples are inserted unless we reach a point where we decide
to transition to a different mode and set the state to TSS_BOUNDED
or TSS_BUILDRUNS.

Therefore it's not true that the assertion guards against having loaded
any tuples; rather it guarantees that we remain in standard memory
quicksort mode.

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 7b8e67899e..c108bd4513 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1185,7 +1185,9 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 void
 tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 {
-	/* Assert we're called before loading any tuples */
+	/*
+	 * Assert we're still in memory quicksort mode and haven't transitioned to
+	 * heap or tape mode. */
 	Assert(state->status == TSS_INITIAL);
 	Assert(state->memtupcount == 0);
 	Assert(!state->bounded);


Re: Add CREATE DATABASE LOCALE option

2019-06-23 Thread Peter Eisentraut
On 2019-06-06 21:52, Alvaro Herrera wrote:
> On 2019-Jun-06, Fabrízio de Royes Mello wrote:
> 
 Cool... would be nice also add some test cases.
>>>
>>> Right.  Any suggestions where to put them?
>>
>> Hmm... good question... I thought we already have some regression tests for
>> {CREATE|DROP} DATABASE but actually we don't... should we add a new one?
> 
> I think pg_dump/t/002_pg_dump.pl might be a good place.  Not the easiest
> program in the world to work with, admittedly.

Updated patch with test and expanded documentation.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 7ba4151235a32ebfd11bd80a5a01d9ef347d2c11 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sun, 23 Jun 2019 20:09:00 +0200
Subject: [PATCH v2] Add CREATE DATABASE LOCALE option

This sets both LC_COLLATE and LC_CTYPE with one option.  Similar
behavior is already supported in initdb, CREATE COLLATION, and
createdb.

Discussion: 
https://www.postgresql.org/message-id/flat/d9d5043a-dc70-da8a-0166-1e218e6e34d4%402ndquadrant.com
---
 doc/src/sgml/ref/create_database.sgml | 25 +++--
 src/backend/commands/dbcommands.c | 20 
 src/bin/pg_dump/pg_dump.c | 23 +--
 src/bin/pg_dump/t/002_pg_dump.pl  |  9 +
 4 files changed, 69 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/ref/create_database.sgml 
b/doc/src/sgml/ref/create_database.sgml
index b2c9e241c2..4014f6703b 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -25,6 +25,7 @@
 [ [ WITH ] [ OWNER [=] user_name ]
[ TEMPLATE [=] template ]
[ ENCODING [=] encoding ]
+   [ LOCALE [=] locale ]
[ LC_COLLATE [=] lc_collate ]
[ LC_CTYPE [=] lc_ctype ]
[ TABLESPACE [=] tablespace_name ]
@@ -111,6 +112,26 @@ Parameters

   
  
+ 
+  locale
+  
+   
+This is a shortcut for setting LC_COLLATE
+and LC_CTYPE at once.  If you specify this,
+you cannot specify either of those parameters.
+   
+   
+
+ The other locale settings , , , and
+  are not fixed per database and are not
+ set by this command.  If you want to make them the default for a
+ specific database, you can use ALTER DATABASE
+ ... SET.
+
+   
+  
+ 
  
   lc_collate
   
@@ -287,7 +308,7 @@ Examples
To create a database music with a different locale:
 
 CREATE DATABASE music
-LC_COLLATE 'sv_SE.utf8' LC_CTYPE 'sv_SE.utf8'
+LOCALE 'sv_SE.utf8'
 TEMPLATE template0;
 
 In this example, the TEMPLATE template0 clause is 
required if
@@ -300,7 +321,7 @@ Examples
different character set encoding:
 
 CREATE DATABASE music2
-LC_COLLATE 'sv_SE.iso885915' LC_CTYPE 'sv_SE.iso885915'
+LOCALE 'sv_SE.iso885915'
 ENCODING LATIN9
 TEMPLATE template0;
 
diff --git a/src/backend/commands/dbcommands.c 
b/src/backend/commands/dbcommands.c
index 15207bf75a..ac275f7e64 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -124,6 +124,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
DefElem*downer = NULL;
DefElem*dtemplate = NULL;
DefElem*dencoding = NULL;
+   DefElem*dlocale = NULL;
DefElem*dcollate = NULL;
DefElem*dctype = NULL;
DefElem*distemplate = NULL;
@@ -184,6 +185,15 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 parser_errposition(pstate, 
defel->location)));
dencoding = defel;
}
+   else if (strcmp(defel->defname, "locale") == 0)
+   {
+   if (dlocale)
+   ereport(ERROR,
+   (errcode(ERRCODE_SYNTAX_ERROR),
+errmsg("conflicting or 
redundant options"),
+parser_errposition(pstate, 
defel->location)));
+   dlocale = defel;
+   }
else if (strcmp(defel->defname, "lc_collate") == 0)
{
if (dcollate)
@@ -244,6 +254,11 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 parser_errposition(pstate, 
defel->location)));
}
 
+   if (dlocale && (dcollate || dctype))
+   ereport(ERROR,
+   (errcode(ERRCODE_SYNTAX_ERROR),
+errmsg("conflicting or redundant options")));
+
if (downer && downer->arg)
dbowner = defGetString(downer);
if (dtemplate && dtemplate->arg)
@@ -276,6 +291,11 @@ createdb(ParseState 

Re: Multivariate MCV stats can leak data to unprivileged users

2019-06-23 Thread Dean Rasheed
On Mon, 13 May 2019 at 23:36, Tomas Vondra  wrote:
>
> On Fri, May 10, 2019 at 10:19:44AM +0100, Dean Rasheed wrote:
> >While working on 1aebfbea83c, I noticed that the new multivariate MCV
> >stats feature suffers from the same problem, and also the original
> >problems that were fixed in e2d4ef8de8 and earlier --- namely that a
> >user can see values in the MCV lists that they shouldn't see (values
> >from tables that they don't have privileges on).
> >
> >I think there are 2 separate issues here:
> >
> >1). The table pg_statistic_ext is accessible to anyone, so any user
> >can see the MCV lists of any table. I think we should give this the
> >same treatment as pg_statistic, and hide it behind a security barrier
> >view, revoking public access from the table.
> >
> >2). The multivariate MCV stats planner code can be made to invoke
> >user-defined operators, so a user can create a leaky operator and use
> >it to reveal data values from the MCV lists even if they have no
> >permissions on the table.
> >
> >Attached is a draft patch to fix (2), which hooks into
> >statext_is_compatible_clause().
> >
>
> I think that patch is good.
>

I realised that we forgot to push this second part, so I've just done so.

Regards,
Dean




Re: Fix doc bug in logical replication.

2019-06-23 Thread Peter Eisentraut
On 2019-04-12 19:52, Robert Treat wrote:
> It is clear to me that the docs are wrong, but I don't see anything
> inherently incorrect about the code itself. Do you have suggestions
> for how you would like to see the code comments improved?

The question is perhaps whether we want to document that non-matching
data types do work.  It happens to work now, but do we always want to
guarantee that?  There is talk of a binary mode for example.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: check_recovery_target_lsn() does a PG_CATCH without a throw

2019-06-23 Thread Peter Eisentraut
On 2019-06-20 18:05, Andres Freund wrote:
> Hi,
> 
> On 2019-06-20 15:42:14 +0200, Peter Eisentraut wrote:
>> On 2019-06-12 13:16, Peter Eisentraut wrote:
>>> I haven't figured out the time zone issue yet, but I guess the solution
>>> might involve moving some of the code from check_recovery_target_time()
>>> to assign_recovery_target_time().
>>
>> I think that won't work either.  What we need to do is postpone the
>> interpretation of the timestamp string until after all the GUC
>> processing is done.  So check_recovery_target_time() would just do some
>> basic parsing checks, but stores the string.  Then when we need the
>> recovery_target_time_value we do the final parsing.  Then we can be sure
>> that the time zone is all set.
> 
> That sounds right to me.

Updated patch for that.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 67099f9487354efd771fd4211b0737e1d7d40dec Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 12 Jun 2019 11:29:53 +0200
Subject: [PATCH v2 1/2] Remove explicit error handling for obsolete date/time
 values

The date/time values 'current', 'invalid', and 'undefined' were
removed a long time ago, but the code still contains explicit error
handling for the transition.  To simplify the code and avoid having to
handle these values everywhere, just remove the recognition of these
tokens altogether now.
---
 src/backend/utils/adt/date.c   |  8 
 src/backend/utils/adt/datetime.c   | 20 
 src/backend/utils/adt/timestamp.c  | 22 --
 src/include/utils/datetime.h   |  2 --
 src/interfaces/ecpg/pgtypeslib/dt.h|  2 --
 src/interfaces/ecpg/pgtypeslib/dt_common.c |  5 -
 src/interfaces/ecpg/pgtypeslib/timestamp.c |  4 
 src/test/regress/expected/date.out |  3 ---
 src/test/regress/expected/timestamp.out| 13 -
 src/test/regress/expected/timestamptz.out  | 13 -
 src/test/regress/sql/date.sql  |  1 -
 src/test/regress/sql/timestamp.sql |  4 
 src/test/regress/sql/timestamptz.sql   |  4 
 13 files changed, 101 deletions(-)

diff --git a/src/backend/utils/adt/date.c b/src/backend/utils/adt/date.c
index 1ff3cfea8b..e440a4fedd 100644
--- a/src/backend/utils/adt/date.c
+++ b/src/backend/utils/adt/date.c
@@ -138,14 +138,6 @@ date_in(PG_FUNCTION_ARGS)
case DTK_DATE:
break;
 
-   case DTK_CURRENT:
-   ereport(ERROR,
-   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-errmsg("date/time value \"current\" is 
no longer supported")));
-
-   GetCurrentDateTime(tm);
-   break;
-
case DTK_EPOCH:
GetEpochTime(tm);
break;
diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index 9def318c57..e9add385ba 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -99,7 +99,6 @@ static const datetkn datetktbl[] = {
{"aug", MONTH, 8},
{"august", MONTH, 8},
{DB_C, ADBC, BC},   /* "bc" for years <= 0 */
-   {DCURRENT, RESERV, DTK_CURRENT},/* "current" is always now */
{"d", UNITS, DTK_DAY},  /* "day of month" for ISO input */
{"dec", MONTH, 12},
{"december", MONTH, 12},
@@ -113,7 +112,6 @@ static const datetkn datetktbl[] = {
{"friday", DOW, 5},
{"h", UNITS, DTK_HOUR}, /* "hour" */
{LATE, RESERV, DTK_LATE},   /* "infinity" reserved for "late time" 
*/
-   {INVALID, RESERV, DTK_INVALID}, /* "invalid" reserved for bad time */
{"isodow", UNITS, DTK_ISODOW},  /* ISO day of week, Sunday == 7 */
{"isoyear", UNITS, DTK_ISOYEAR},/* year in terms of the ISO 
week date */
{"j", UNITS, DTK_JULIAN},
@@ -157,7 +155,6 @@ static const datetkn datetktbl[] = {
{"tue", DOW, 2},
{"tues", DOW, 2},
{"tuesday", DOW, 2},
-   {"undefined", RESERV, DTK_INVALID}, /* pre-v6.1 invalid time */
{"wed", DOW, 3},
{"wednesday", DOW, 3},
{"weds", DOW, 3},
@@ -191,7 +188,6 @@ static const datetkn deltatktbl[] = {
{"hours", UNITS, DTK_HOUR}, /* "hours" relative */
{"hr", UNITS, DTK_HOUR},/* "hour" relative */
{"hrs", UNITS, DTK_HOUR},   /* "hours" relative */
-   {INVALID, RESERV, DTK_INVALID}, /* reserved for invalid time */
{"m", UNITS, DTK_MINUTE},   /* "minute" relative */
{"microsecon", UNITS, DTK_MICROSEC},/* "microsecond" relative */
{"mil", UNITS, DTK_MILLENNIUM}, /* "millennium" relative */
@@ -222,7 +218,6 @@ static const datetkn deltatktbl[] = {
{DTIMEZONE, UNITS, DTK_TZ}, /* "timezone" time 

Re: make \d pg_toast.foo show its indices ; and, \d toast show its main table ; and \d relkind=I show its partitions (and tablespace)

2019-06-23 Thread Justin Pryzby
I'm continuing this thread with an additional change to slash dee for
partitioned indexes.

postgres=# \d ttz_i_idx
 Partitioned index "public.ttz_i_idx"
 Column |  Type   | Key? | Definition 
+-+--+
 i  | integer | yes  | i
btree, for table "public.ttz"
Number of partitions: 2 (Use \d+ to list them.)

postgres=# \d+ ttz_i_idx
 Partitioned index "public.ttz_i_idx"
 Column |  Type   | Key? | Definition | Storage | Stats target 
+-+--++-+--
 i  | integer | yes  | i  | plain   | 
btree, for table "public.ttz"
Partitions: ttz1_i_idx,
ttz2_i_idx, PARTITIONED

Showing the list of index partitions is probably not frequently useful, but
consider the case of non-default names, for example due to truncation.

I didn't update the regression output; note that this patch also, by chance,
causes the tablespace of partitioned indexes to be output, which I think is
good; it's an oversight that it isn't currently shown.

I added a CF entry and am including the previous two patches for CFBOT purposes.

Recap: Tom, Andreas, Robert, Stephen and I agree that \d toast should show the
main table.  Rafia and Alvaro think that \d on the main table should (also?)
show its toast.

Justin
>From 29e4c0b9700b9dee5f6ff2abc442e08e5221eb93 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 30 Apr 2019 19:05:53 -0500
Subject: [PATCH v3] print table associated with given TOAST table

---
 src/bin/psql/describe.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index c65bc82..ff98c4f 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2153,6 +2153,28 @@ describeOneTableDetails(const char *schemaname,
 		}
 	}
 
+	/* print table associated with given TOAST table */
+	if (tableinfo.relkind == RELKIND_TOASTVALUE)
+	{
+		PGresult   *result = NULL;
+		printfPQExpBuffer(&buf,
+		  "SELECT relnamespace::pg_catalog.regnamespace, relname FROM pg_class WHERE reltoastrelid = '%s'",
+		  oid);
+		result = PSQLexec(buf.data);
+		if (!result) {
+			goto error_return;
+		} else if (1 != PQntuples(result)) {
+			PQclear(result);
+			goto error_return;
+		} else {
+			char	   *schemaname = PQgetvalue(result, 0, 0);
+			char	   *relname = PQgetvalue(result, 0, 1);
+			appendPQExpBuffer(&tmpbuf, _("For table: \"%s.%s\""),
+		  schemaname, relname);
+			printTableAddFooter(&cont, tmpbuf.data);
+		}
+	}
+
 	if (tableinfo.relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		/* Get the partition key information  */
-- 
2.7.4

>From 185d8723bec45824a0db245f69e948080b7fbbb2 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 3 May 2019 09:24:51 -0500
Subject: [PATCH v3] make \d pg_toast.foo show its indices

---
 src/bin/psql/describe.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index af2f440..c65bc82 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2275,6 +2275,7 @@ describeOneTableDetails(const char *schemaname,
 	else if (tableinfo.relkind == RELKIND_RELATION ||
 			 tableinfo.relkind == RELKIND_MATVIEW ||
 			 tableinfo.relkind == RELKIND_FOREIGN_TABLE ||
+			 tableinfo.relkind == RELKIND_TOASTVALUE ||
 			 tableinfo.relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		/* Footer information about a table */
-- 
2.7.4

>From 1281338ce47d0dbf0e0f93d1e483721396ac2402 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 19 Jun 2019 15:41:25 -0500
Subject: [PATCH v3] show childs of partitioned indices

---
 src/bin/psql/describe.c | 52 +++--
 1 file changed, 24 insertions(+), 28 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index ff98c4f..3a55dc9 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -3063,6 +3063,7 @@ describeOneTableDetails(const char *schemaname,
 	if (tableinfo.relkind == RELKIND_RELATION ||
 		tableinfo.relkind == RELKIND_MATVIEW ||
 		tableinfo.relkind == RELKIND_FOREIGN_TABLE ||
+		tableinfo.relkind == RELKIND_PARTITIONED_INDEX ||
 		tableinfo.relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		PGresult   *result;
@@ -3178,7 +3179,8 @@ describeOneTableDetails(const char *schemaname,
 		 * Otherwise, we will not print "Partitions" section for a partitioned
 		 * table without any partitions.
 		 */
-		if (tableinfo.relkind == RELKIND_PARTITIONED_TABLE && tuples == 0)
+		if (tuples == 0 && (tableinfo.relkind == RELKIND_PARTITIONED_TABLE ||
+	tableinfo.relkind == RELKIND_PARTITIONED_INDEX))
 		{
 			printfPQExpBuffer(, _("Number of partitions: %d"), tuples);
 			printTableAddFooter(, buf.data);
@@ -3188,7 +3190,7 @@ describeOneTableDetails(const char *schemaname,
 			/* print the number of child tables, if any */
 			if (tuples > 0)
 			{
-if (tableinfo.relkind != RELKIND_PARTITIONED_TABLE)
+if (tableinfo.relkind != RELKIND_PARTITIONED_TABLE && tableinfo.relkind != 

Re: New vacuum option to do only freezing

2019-06-23 Thread Michael Paquier
On Thu, Jun 20, 2019 at 03:50:32PM +0900, Michael Paquier wrote:
> And I finish with the attached.  Thoughts?

So, are there any objections with this patch?  Or do people think that
it's too late for v12 and that it is better to wait until v13 opens
for business?
--
Michael


signature.asc
Description: PGP signature


Refactoring base64 encoding and decoding into a safer interface

2019-06-23 Thread Michael Paquier
Hi all,

After the issues behind CVE-2019-10164, it seems that we can do much
better with the current interface of decoding and encoding functions
for base64 in src/common/.

The root issue is that the callers of pg_b64_decode() and
pg_b64_encode() provide a buffer where the result gets stored, allocated
using respectively pg_b64_dec_len() and pg_b64_enc_len() (those routines
overestimate the allocation on purpose), but callers cannot pass the
length of the allocated buffer, so the decoding and encoding routines
have no sanity check that the input cannot overflow the result buffer.

One thing I have noticed is that many projects on the net include this
code for their own purpose, and I have suspicions that many other
projects link to the code from Postgres and make use of it.  So that's
rather scary.

Attached is a refactoring patch for those interfaces, which introduces
a set of overflow checks so that we cannot repeat the errors of the past.
This adds one argument to pg_b64_decode() and pg_b64_encode() for the
size of the result buffer, and we make use of it in the code to make
sure that an error is reported in case of an overflow.  That's the
status code -1, which is already used for other errors, for simplicity.
One thing to note is that the decoding path could already complain about
some errors, basically an incorrectly shaped encoded string, but the
encoding path had no failure modes until now, so we need to make sure
that all the existing callers of pg_b64_encode() report errors correctly
with the new interface.
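
For illustration, the calling convention the patch moves to looks roughly
like this (a sketch only, mirroring the hunks below; src and srclen stand
for any caller-provided input buffer):

    int         enclen;
    char       *encoded;

    enclen = pg_b64_enc_len(srclen);
    encoded = palloc(enclen + 1);       /* keep room for the trailing zero */
    enclen = pg_b64_encode(src, srclen, encoded, enclen);
    if (enclen < 0)
        elog(ERROR, "could not encode data");
    encoded[enclen] = '\0';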

I am adding that to the next CF for v13.

Any thoughts?
--
Michael
diff --git a/src/backend/libpq/auth-scram.c b/src/backend/libpq/auth-scram.c
index 6b60abe1dd..91ed71391d 100644
--- a/src/backend/libpq/auth-scram.c
+++ b/src/backend/libpq/auth-scram.c
@@ -510,9 +510,11 @@ scram_verify_plain_password(const char *username, const char *password,
 		return false;
 	}
 
-	salt = palloc(pg_b64_dec_len(strlen(encoded_salt)));
-	saltlen = pg_b64_decode(encoded_salt, strlen(encoded_salt), salt);
-	if (saltlen == -1)
+	saltlen = pg_b64_dec_len(strlen(encoded_salt));
+	salt = palloc(saltlen);
+	saltlen = pg_b64_decode(encoded_salt, strlen(encoded_salt), salt,
+			saltlen);
+	if (saltlen < 0)
 	{
 		ereport(LOG,
 (errmsg("invalid SCRAM verifier for user \"%s\"", username)));
@@ -596,9 +598,10 @@ parse_scram_verifier(const char *verifier, int *iterations, char **salt,
 	 * Verify that the salt is in Base64-encoded format, by decoding it,
 	 * although we return the encoded version to the caller.
 	 */
-	decoded_salt_buf = palloc(pg_b64_dec_len(strlen(salt_str)));
+	decoded_len = pg_b64_dec_len(strlen(salt_str));
+	decoded_salt_buf = palloc(decoded_len);
 	decoded_len = pg_b64_decode(salt_str, strlen(salt_str),
-decoded_salt_buf);
+decoded_salt_buf, decoded_len);
 	if (decoded_len < 0)
 		goto invalid_verifier;
 	*salt = pstrdup(salt_str);
@@ -606,16 +609,18 @@ parse_scram_verifier(const char *verifier, int *iterations, char **salt,
 	/*
 	 * Decode StoredKey and ServerKey.
 	 */
-	decoded_stored_buf = palloc(pg_b64_dec_len(strlen(storedkey_str)));
+	decoded_len = pg_b64_dec_len(strlen(storedkey_str));
+	decoded_stored_buf = palloc(decoded_len);
 	decoded_len = pg_b64_decode(storedkey_str, strlen(storedkey_str),
-decoded_stored_buf);
+decoded_stored_buf, decoded_len);
 	if (decoded_len != SCRAM_KEY_LEN)
 		goto invalid_verifier;
 	memcpy(stored_key, decoded_stored_buf, SCRAM_KEY_LEN);
 
-	decoded_server_buf = palloc(pg_b64_dec_len(strlen(serverkey_str)));
+	decoded_len = pg_b64_dec_len(strlen(serverkey_str));
+	decoded_server_buf = palloc(decoded_len);
 	decoded_len = pg_b64_decode(serverkey_str, strlen(serverkey_str),
-decoded_server_buf);
+decoded_server_buf, decoded_len);
 	if (decoded_len != SCRAM_KEY_LEN)
 		goto invalid_verifier;
 	memcpy(server_key, decoded_server_buf, SCRAM_KEY_LEN);
@@ -649,8 +654,18 @@ mock_scram_verifier(const char *username, int *iterations, char **salt,
 	/* Generate deterministic salt */
 	raw_salt = scram_mock_salt(username);
 
-	encoded_salt = (char *) palloc(pg_b64_enc_len(SCRAM_DEFAULT_SALT_LEN) + 1);
-	encoded_len = pg_b64_encode(raw_salt, SCRAM_DEFAULT_SALT_LEN, encoded_salt);
+	encoded_len = pg_b64_enc_len(SCRAM_DEFAULT_SALT_LEN);
+	/* don't forget the zero-terminator */
+	encoded_salt = (char *) palloc(encoded_len + 1);
+	encoded_len = pg_b64_encode(raw_salt, SCRAM_DEFAULT_SALT_LEN, encoded_salt,
+encoded_len);
+
+	/*
+	 * Note that we cannot reveal any information to an attacker here
+	 * so the error message needs to remain generic.
+	 */
+	if (encoded_len < 0)
+		elog(ERROR, "could not encode salt");
 	encoded_salt[encoded_len] = '\0';
 
 	*salt = encoded_salt;
@@ -1144,8 +1159,15 @@ build_server_first_message(scram_state *state)
 (errcode(ERRCODE_INTERNAL_ERROR),
  errmsg("could not generate random nonce")));
 
-	state->server_nonce = 

Re: Index Skip Scan

2019-06-23 Thread Tomas Vondra

Hi,

I've done some initial review on v20 - just reading through the code, no
tests at this point. Here are my comments:


1) config.sgml

I'm not sure why the enable_indexskipscan section says

   This parameter requires that enable_indexonlyscan
   is on.

I suppose it's the same thing as for enable_indexscan, and we don't say
anything like that for that GUC.


2) indices.sgml

The new section is somewhat unclear and difficult to understand; I think
it deserves a rewording. Also, I wonder if we really should link to the
wiki page about FSM problems. We have a couple of wiki links in the sgml
docs, but those seem more generic, while this seems like a development page
that might disappear. But more importantly, that wiki page does not say
anything about "Loose Index scans", so is it even the right wiki page?


3) nbtsearch.c

 _bt_skip - comments are formatted incorrectly
 _bt_update_skip_scankeys - missing comment
 _bt_scankey_within_page - missing comment

4) explain.c

There are duplicate blocks of code for IndexScan and IndexOnlyScan:

   if (indexscan->skipPrefixSize > 0)
   {
   if (es->format != EXPLAIN_FORMAT_TEXT)
   ExplainPropertyInteger("Distinct Prefix", NULL,
  indexscan->skipPrefixSize,
  es);
   }

I suggest we wrap this into a function ExplainIndexSkipScanKeys() or
something like that.
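
For instance, a minimal sketch of such a helper could look like this
(assuming the existing ExplainPropertyInteger() API; the function name is
only the one floated above):

    static void
    ExplainIndexSkipScanKeys(int skipPrefixSize, ExplainState *es)
    {
        /* non-text formats print the distinct prefix length explicitly */
        if (skipPrefixSize > 0 && es->format != EXPLAIN_FORMAT_TEXT)
            ExplainPropertyInteger("Distinct Prefix", NULL,
                                   skipPrefixSize, es);
    }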

Also, there's this:

   if (((IndexScan *) plan)->skipPrefixSize > 0)
   {
   ExplainPropertyText("Scan mode", "Skip scan", es);
   }

That does not make much sense - there's just a single 'scan mode' value.
So I suggest we do the same thing as for unique joins, i.e. 


   ExplainPropertyBool("Skip Scan",
   (((IndexScan *) plan)->skipPrefixSize > 0),
   es);


5) nodeIndexOnlyScan.c

In ExecInitIndexOnlyScan, we should initialize the ioss_ fields a bit
later, with the existing ones. This is just a cosmetic issue, though.


6) nodeIndexScan.c

I wonder why we even add and initialize the ioss_ fields for IndexScan
nodes, when the skip scans require index-only scans?


7) pathnode.c

I wonder how much the costing was discussed. It seems to me the logic is
fairly similar to ideas discussed in the incremental sort patch, and
we've been discussing some weak points there. I'm not sure how much we
need to consider those issues here.


8) execnodes.h

The comment before IndexScanState mentions a new field NumDistinctKeys,
but there's no such field added to the struct.


9) pathnodes.h

I don't understand what the uniq_distinct_pathkeys comment says :-(


10) plannodes.h

The naming of the new field (skipPrefixSize) added to IndexScan and
IndexOnlyScan is clearly inconsistent with the naming convention of the
existing fields.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Code comment change

2019-06-23 Thread Vik Fearing
On 23/06/2019 12:35, Thomas Munro wrote:
> On Sun, Jun 23, 2019 at 9:21 PM Vik Fearing  
> wrote:
>> There is some language in a code comment that has been bothering me for
>> several years now.  After pointing it out in a conversation off-list
>> recently, I figured it was past time to do something about it.
>>
>> Patch attached.
> 
> Pushed.  Thanks!

Thank you!
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: Index Skip Scan

2019-06-23 Thread Floris Van Nee

> Try the attached patch -- it has DEBUG1 traces with the contents of
> index tuples from key points during index scans, allowing you to see
> what's going on both as a B-Tree is descended, and as a range scan is
> performed. It also shows details of the insertion scankey that is set
> up within _bt_first(). This hasn't been adopted to the patch at all,
> so you'll probably need to do that.

Thanks! That works quite nicely.

I've pinpointed the problem to within _bt_skip. I'll try to illustrate with my 
test case. The data in the table is (a,b)=(1,1), (1,2) ... (1,1), (2, 1), 
(2,2), ... (2,1) until (5,1).
Running the query
SELECT DISTINCT ON (a) a,b FROM a WHERE b=2;
The flow is like this:
_bt_first is called first - it sees there are no suitable scan keys to start at 
a custom location in the tree, so it just starts from the beginning and 
searches until it finds the first tuple (1,2).
After the first tuple has been yielded, _bt_skip kicks in. It constructs an insert
scan key with a=1 and nextkey=true; doing the _bt_search + _bt_binsrch on
this, it finds the first tuple larger than that: (2,1). This is not the tuple
that it stops at, though, because after that it does this:

if (ScanDirectionIsForward(dir))
    /* Move back for _bt_next */
    offnum = OffsetNumberPrev(offnum);

/* Now read the data */
if (!_bt_readpage(scan, dir, offnum))
{
    /*
     * There's no actually-matching data on this page.  Try to advance to
     * the next page.  Return false if there's no matching data at all.
     */
    LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
    if (!_bt_steppage(scan, dir))

First, it takes the previous tuple with OffsetNumberPrev (so the tuple before
(2,1), which is (1,1)). This is done because if this tuple were to be
returned, there would be a call to _bt_next afterwards, which would then
conveniently land on the tuple (2,1) that we want. However, _bt_readpage messes
things up, because it only reads tuples that match all the provided keys (so
where b=2). The only tuple it'll return is (2,2), and that is the tuple that
gets set; however, on the call to _bt_next, the tuple is first incremented, so
we'll find (2,3) there, which doesn't match our keys. This leads it to skip
(2,2) in our result set.

I was wondering about something else: don't we also have another problem with 
updating this current index tuple by skipping before calling 
btgettuple/_bt_next? I see there's some code in btgettuple to kill dead tuples 
when scan->kill_prior_tuple is true. I'm not too familiar with the concept of 
killing dead tuples while doing index scans, but by looking at the code it 
seems to be possible that btgettuple returns a tuple, the caller processes it
and sets kill_prior_tuple to true in order to have it killed. However, then the
skip scan kicks in, which sets the current tuple to a completely different 
tuple. Then, on the next call of btgettuple, the wrong tuple gets killed. Is my 
reasoning correct here or am I missing something?

-Floris




Re: Code comment change

2019-06-23 Thread Thomas Munro
On Sun, Jun 23, 2019 at 9:21 PM Vik Fearing  wrote:
> There is some language in a code comment that has been bothering me for
> several years now.  After pointing it out in a conversation off-list
> recently, I figured it was past time to do something about it.
>
> Patch attached.

Pushed.  Thanks!

--
Thomas Munro
https://enterprisedb.com




Re: Optimize partial TOAST decompression

2019-06-23 Thread Andrey Borodin
Hi, Binguo!

> On 2 June 2019, at 19:48, Binguo Bao  wrote:
> 
> Hi, hackers!

> This seems to have a 10x improvement. If the number of toast data chunks is 
> more, I believe that patch can play a greater role, there are about 200 
> related TOAST data chunks for each entry in the case.

It's really cool that you could produce a meaningful patch long before the
end of GSoC!

I'll describe what is going on a little:
1. We have a compressed value, which resides in the TOAST table.
2. We want only some fraction of this value: a prefix of length L.
3. Previously, Paul Ramsey submitted a patch that omits decompression of the
value beyond the desired L bytes.
4. Binguo's patch tries not to fetch compressed data that the decompressor
will not need. In fact it fetches L bytes from the TOAST table.

This is not correct: L bytes of compressed data cannot always be decoded
into at least L bytes of data. At worst we have one control byte per 8
literal bytes. This means we need at most (L*9 + 8) / 8 bytes of compressed
input with the current pglz format.
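
(For example, to guarantee a prefix of L = 1000 decompressed bytes, the fetch
has to be bounded by (1000*9 + 8) / 8 = 1126 bytes of compressed input, not
1000.)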

Also, I'm not sure you use SET_VARSIZE_COMPRESSED correctly...

Best regards, Andrey Borodin.



Code comment change

2019-06-23 Thread Vik Fearing
There is some language in a code comment that has been bothering me for
several years now.  After pointing it out in a conversation off-list
recently, I figured it was past time to do something about it.

Patch attached.
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index de4d4efc46..03570300d8 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -268,7 +268,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
  *		its location from the metadata page, and then read the root page
  *		itself.  If no root page exists yet, we have to create one.  The
  *		standard class of race conditions exists here; I think I covered
- *		them all in the Hopi Indian rain dance of lock requests below.
+ *		them all in the intricate dance of lock requests below.
  *
  *		The access type parameter (BT_READ or BT_WRITE) controls whether
  *		a new root page will be created or not.  If access = BT_READ,