Re: Custom table AMs need to include heapam.h because of BulkInsertState

2019-06-29 Thread David Rowley
On Mon, 24 Jun 2019 at 23:12, David Rowley  wrote:
>
> On Mon, 24 Jun 2019 at 22:16, Michael Paquier  wrote:
> >
> > Don't take me bad, but I find the solution of defining and using a new
> > callback to call the table AM callback not really elegant, and keeping
> > all table AM callbacks called at a higher level than the executor
> > makes the code easier to follow.  Shouldn't we try to keep any calls
> > to table_finish_bulk_insert() within copy.c for each partition
> > instead?
>
> I'm not quite sure if I follow you since the call to
> table_finish_bulk_insert() is within copy.c still.
>
> The problem was that PartitionTupleRouting is private to
> execPartition.c, and we need a way to determine which of the
> partitions we routed tuples to. It seems inefficient to flush all of
> them if only a small number had tuples inserted into them and to me,
> it seems inefficient to add some additional tracking in CopyFrom(),
> like a hash table to store partition Oids that we inserted into. Using
> PartitionTupleRouting makes sense. It's just a question of how to
> access it, which is not so easy due to it being private.
>
> I did suggest a few other ways that we could solve this. I'm not so
> clear on which one of those you're suggesting or if you're thinking of
> something new.

Any further thoughts on this Michael?

Or Andres? Do you have a preference to which of the approaches
(mentioned upthread) I use for the fix?

If I don't hear anything I'll probably just push the first fix. The
inefficiency does not affect heap, so likely the people with the most
interest in improving that will be authors of other table AMs that
actually do something during table_finish_bulk_insert() for
partitions. We could revisit this in PG13 if someone comes up with a
need to improve things here.
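
For illustration only, one shape the execPartition.c side of this could take
(the function name and the used-partitions bookkeeping are hypothetical, since
the real PartitionTupleRouting layout is private to execPartition.c):

/*
 * Hypothetical helper: let execPartition.c, which owns the private
 * PartitionTupleRouting struct, finish bulk inserts on just the partitions
 * that actually received tuples.  "used_parts" and "nused" are stand-ins
 * for whatever tracking the real struct would keep; they are not actual
 * fields.
 */
void
ExecFinishBulkInsertOnPartitions(PartitionTupleRouting *proute, int ti_options)
{
    for (int i = 0; i < proute->nused; i++)          /* hypothetical field */
    {
        ResultRelInfo *rri = proute->used_parts[i];  /* hypothetical field */

        table_finish_bulk_insert(rri->ri_RelationDesc, ti_options);
    }
}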

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] proposal: schema variables

2019-06-29 Thread Pavel Stehule
On Fri, 24 May 2019 at 19:12, Pavel Stehule wrote:

> Hi
>
> On Thu, 9 May 2019 at 6:34, Pavel Stehule wrote:
>
>> Hi
>>
>> rebased patch
>>
>
> rebase after pgindent
>

fresh rebase

Regards

Pavel


> Regards
>
> Pavel
>
>>
>> Regards
>>
>> Pavel
>>
>>
>>


schema-variables-20190630.patch.gz
Description: application/gzip


Re: Where is SSPI auth username determined for TAP tests?

2019-06-29 Thread Michael Paquier
On Sat, Jun 29, 2019 at 04:36:51PM -0400, Tom Lane wrote:
> I think this is likely a consequence of ca129e58c0 having modified
> 010_dump_connstr.pl to use "regress_postgres" not "postgres" as the
> bootstrap superuser name in the source cluster.  I suppose I overlooked
> some dependency on the user name that only affects SSPI ... but what?
> I don't see anything about the destination cluster configuration (which
> already used a nondefault superuser name) that I didn't replicate
> in the source cluster configuration.

Didn't you get trapped with something similar to what has been fixed
in d9f543e?  If you want pg_hba.conf to be correctly set up for SSPI
on Windows, you should pass "auth_extra => ['--create-role',
'regress_postgres']" to the init() method initializing the node.

Looking at the commit...
 my $node = get_new_node('main');
-$node->init(extra => [ '--locale=C', '--encoding=LATIN1' ]);
+$node->init(extra =>
+ [ '-U', $src_bootstrap_super, '--locale=C', '--encoding=LATIN1' ]);
[...]
 $node->run_log(
 [
$ENV{PG_REGRESS}, '--config-auth',
$node->data_dir,  '--create-role',
-   "$dbname1,$dbname2,$dbname3,$dbname4"
+   "$username1,$username2,$username3,$username4"
 ]);
 
This part is wrong and just needs to be updated so that
$src_bootstrap_super also gets its role added in --create-role, which
would set up pg_hba.conf as you would like.
--
Michael


signature.asc
Description: PGP signature


Re: Usage of epoch in txid_current

2019-06-29 Thread Robert Haas
On Mon, Jun 24, 2019 at 1:43 PM Alvaro Herrera  wrote:
> I think enlarging multixacts to 64 bits is a terrible idea.  I would
> rather get rid of multixacts completely; zheap is proposing not to use
> multixacts at all, for example.

But zedstore, at least as of the Saturday after PGCon, is proposing to
keep using them, after first widening them to 64 bits:

https://www.postgresql.org/message-id/CA+TgmoYeTeQSmALox0PmSm5Gh03oe=unjhml+k+btofy_u2...@mail.gmail.com

I think we all have a visceral reaction against MultiXacts at this
point; they've just caused us too much pain, and the SLRU
infrastructure is a bunch of crap.[1] However, the case where N
sessions all do "SELECT * FROM sometab FOR SHARE" is pretty wretched
without multixacts.  You'll just have to keep reducing the tuple
density per page to fit all the locks, whereas the current heap is
totally fine and neither the heap nor the multixact space bloats all
that terribly much.

I currently think it's likely that zheap will use something
multixact-like rather than actually using multixacts per se.  I am
having trouble seeing how we can have some sort of fixed-bit-width ID
number that represents a set of XIDs for purposes of lock-tracking
without creating some nasty degenerate cases that don't exist
today.[2]

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] At least in comparison to other parts of the PostgreSQL
infrastructure which are more awesome.
[2] I'd like to credit Andres Freund for helping me understand this
issue better.




Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)

2019-06-29 Thread Robert Haas
On Thu, Jun 27, 2019 at 1:58 PM Tom Lane  wrote:
> It's not really clear to me that the IANA folk intend those files to
> be read as a list of preferred zone names.  If they do, what are we
> to make of the fact that no variant of "UTC" appears in them?

I think their intent is key.  We can't make reasonable decisions about
what to do with some data if we don't know what the data is intended
to mean.

> I think that predicting what IANA will do in the future is a fool's
> errand.  Our contract is to select some one of the aliases that the
> tzdb database presents, not to guess about whether it might present
> a different set in the future.  (Also note that a lot of the observed
> variation here has to do with whether individual platforms choose to
> install backward-compatibility zone names.  I think the odds that
> IANA proper will remove those links are near zero; TTBOMK they
> never have removed one yet.)

That doesn't make it a good idea to call Mountain time "Navajo," as
Andrew alleges we are doing.  Then again, the MacBook upon which I am
writing this email thinks that my time zone is "America/New_York,"
whereas I think it is "US/Eastern," which I suppose reinforces your
point about all of this being political. But on the third hand, if
somebody tells me that my time zone is America/New_York, I can say to
myself "oh, they mean Eastern time," whereas if they say that I'm on
"Navajo" time, I'm going to have to sit down with 'diff' and the
zoneinfo files to figure out what that actually means.

I note that https://github.com/eggert/tz/blob/master/backward seems
pretty clear about which things are backward compatibility aliases,
which seems to imply that we would not be taking a political position
separate from the upstream position if we tried to de-prioritize
those.

Also, https://github.com/eggert/tz/blob/master/theory.html says...

Names normally have the form
AREA/LOCATION, where
AREA is a continent or ocean, and
LOCATION is a specific location within the area.

...which seems to imply that AREA/LOCATION is the "normal" and thus
preferred form, and also that...

The file 'zone1970.tab' lists geographical locations used
to name timezones.
It is intended to be an exhaustive list of names for geographic
regions as described above; this is a subset of the timezones in the data.

...which seems to support Andrew's idea that you can identify
AREA/LOCATION time zones by looking in that file.

Long story short, I agree with you that most people probably don't
care about this very much, but I also agree with Andrew that some of
the current choices we're making are pretty strange, and I'm not as
convinced as you are that it's impossible to make a principled choice
between alternatives in all cases. The upstream data appears to
contain some information about intent; it's not just a jumble of
exactly-equally-preferred alternatives.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Where is SSPI auth username determined for TAP tests?

2019-06-29 Thread Tom Lane
bowerbird is failing the pg_dump regression tests with a lot of

FATAL:  SSPI authentication failed for user "regress_postgres"

I think this is likely a consequence of ca129e58c0 having modified
010_dump_connstr.pl to use "regress_postgres" not "postgres" as the
bootstrap superuser name in the source cluster.  I suppose I overlooked
some dependency on the user name that only affects SSPI ... but what?
I don't see anything about the destination cluster configuration (which
already used a nondefault superuser name) that I didn't replicate
in the source cluster configuration.

regards, tom lane




Increasing default value for effective_io_concurrency?

2019-06-29 Thread Tomas Vondra

Hi,

I think we should consider changing the effective_io_concurrency default
value, i.e. the guc that determines how many pages we try to prefetch in
a couple of places (the most important being Bitmap Heap Scan).

The default has been 1 since forever, but in my experience it's hardly ever
the right value, no matter what storage system you use. I've always ended up with
values that are either 0 (so, disabled prefetching) or significantly
higher (at least 8 or 16). In fact, e_i_c=1 can easily be detrimental
depending on the workload and storage system.

Which is an issue, because people often don't know how to tune this and
I see systems with the default value quite often.

So I propose to increase the default to a value between 4 and 16.
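
For a rough sense of what the knob means, here is a small standalone
illustration. It assumes the n*H(n) relationship that (as far as I remember)
the server uses to turn effective_io_concurrency into a prefetch distance, so
treat the numbers as approximate, not as the exact server code:

#include <stdio.h>

int
main(void)
{
    /*
     * Assumed mapping: with n concurrent requests in flight, the expected
     * prefetch depth is roughly sum(n / i) for i = 1..n, i.e. n * H(n).
     */
    for (int n = 1; n <= 128; n *= 2)
    {
        double  prefetch = 0.0;

        for (int i = 1; i <= n; i++)
            prefetch += (double) n / i;

        printf("effective_io_concurrency = %3d  ->  ~%.0f pages prefetched\n",
               n, prefetch);
    }
    return 0;
}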


I'm hardly the first person to notice this, as illustrated for example
by this [1] post by Merlin Moncure on pgsql-hackers from 2017, which
measured this behavior on Intel S3500 SSD:

 effective_io_concurrency 1: 46.3 sec, ~ 170 mb/sec peak via iostat
 effective_io_concurrency 2:  49.3 sec, ~ 158 mb/sec peak via iostat
 effective_io_concurrency 4:  29.1 sec, ~ 291 mb/sec peak via iostat
 effective_io_concurrency 8:  23.2 sec, ~ 385 mb/sec peak via iostat
 effective_io_concurrency 16:  22.1 sec, ~ 409 mb/sec peak via iostat
 effective_io_concurrency 32:  20.7 sec, ~ 447 mb/sec peak via iostat
 effective_io_concurrency 64:  20.0 sec, ~ 468 mb/sec peak via iostat
 effective_io_concurrency 128:  19.3 sec, ~ 488 mb/sec peak via iostat
 effective_io_concurrency 256:  19.2 sec, ~ 494 mb/sec peak via iostat

That's just one anecdotal example of behavior, of course, so I've
decided to do a couple of tests on different storage systems. Attached
are a couple of scripts I used to generate synthetic data sets with data
laid out in different patterns (random vs. regular), and to run queries
scanning various fractions of the table (1%, 5%, ...) using bitmap index
scan plans.

I've done that on three different storage systems:

1) SATA RAID (3 x 7.2k drives in RAID0)
2) SSD RAID (6 x SATA SSD in RAID0)
3) NVMe drive

Attached is a spreadsheet with a summary of results for the tested cases.
In general, the data support what I already wrote above - the current
default is pretty bad.

In some cases it helps a bit, but a bit higher value (4 or 8) performs
significantly better. Consider for example this "sequential" data set
from the 6xSSD RAID system (x-axis shows e_i_c values, pct means what
fraction of pages matches the query):

   pct        0        1        4       16       64      128
   ----------------------------------------------------------
     1    25990    18624     3269     2219     2189     2171
     5    88116    60242    14002     8663     8560     8726
    10   120556    99364    29856    17117    16590    17383
    25   101080   184327    79212    47884    46846    46855
    50   130709   309857   163614   103001    94267    94809
    75   126516   435653   248281   156586   139500   140087

compared to the e_i_c=0 case, it looks like this:

   pct        1        4       16       64      128
   -------------------------------------------------
     1      72%      13%       9%       8%       8%
     5      68%      16%      10%      10%      10%
    10      82%      25%      14%      14%      14%
    25     182%      78%      47%      46%      46%
    50     237%     125%      79%      72%      73%
    75     344%     196%     124%     110%     111%

So for 1% of the table the e_i_c=1 is faster by about ~30%, but with
e_i_c=4 (or more) it's ~10x faster. This is a fairly common pattern, not
just on this storage system.

The e_i_c=1 case can perform pretty poorly, especially when the query matches
a large fraction of the table - for example here it's 2-3x slower compared
to no prefetching, and higher e_i_c values limit the damage quite a bit.

It's not entirely terrible because in most cases those queries would use
seqscan (the benchmark forces queries to use bitmap heap scan), but it's
not something we can ignore either because of possible underestimates.

Furthermore, there are cases with much worse behavior. For example, one
of the tests on SATA RAID behaves like this:

   pct        1        4       16       64      128
   -------------------------------------------------
     1     147%     101%      61%      52%      55%
     5     180%     106%      71%      71%      70%
    10     208%     106%      73%      80%      79%
    25     225%     118%      84%      96%      86%
    50     234%     123%      91%     102%      95%
    75     241%     127%      94%     103%      98%

Pretty much all cases are significantly slower with e_i_c=1.

Of course, I'm sure there may be other things to consider. For example,
these tests were done in isolation, while on actual systems there will
be other queries running concurrently (and those may also generate I/O).


regards

[1] 

base backup client as auxiliary backend process

2019-06-29 Thread Peter Eisentraut
Setting up a standby instance is still quite complicated.  You need to
run pg_basebackup with all the right options.  You need to make sure
pg_basebackup has the right permissions for the target directories.  The
created instance has to be integrated into the operating system's start
scripts.  There is this slightly awkward business of the --recovery-conf
option and how it interacts with other features.  And you should
probably run pg_basebackup under screen.  And then how do you get
notified when it's done.  And when it's done you have to log back in and
finish up.  Too many steps.

My idea is that the postmaster can launch a base backup worker, wait
till it's done, then proceed with the rest of the startup.  initdb gets
a special option to create a "minimal" data directory with only a few
files, directories, and the usual configuration files.  Then you create
a $PGDATA/basebackup.signal, start the postmaster as normal.  It sees
the signal file, launches an auxiliary process that runs the base
backup, then proceeds with normal startup in standby mode.

This makes a whole bunch of things much nicer: The connection
information for where to get the base backup from comes from
postgresql.conf, so you only need to specify it in one place.
pg_basebackup is completely out of the picture; no need to deal with
command-line options, --recovery-conf, screen, monitoring for
completion, etc.  If something fails, the base backup process can
automatically be restarted (maybe).  Operating system integration is
much easier: You only call initdb and then pg_ctl or postgres, as you
are already doing.  Automated deployment systems don't need to wait for
pg_basebackup to finish: You only call initdb, then start the server,
and then you're done -- waiting for the base backup to finish can be
done by the regular monitoring system.

Attached is a very hackish patch to implement this.  It works like this:

# (assuming you have a primary already running somewhere)
initdb -D data2 --minimal
$EDITOR data2/postgresql.conf  # set primary_conninfo
pg_ctl -D data2 start

(Curious side note: If you don’t set primary_conninfo in these steps,
then libpq defaults apply, so the default behavior might end up being
that a given instance attempts to replicate from itself.)

It works for basic cases.  It's missing tablespace support, proper
fsyncing, progress reporting, probably more.  Those would be pretty
straightforward I think.  The interesting bit is the delicate ordering
of the postmaster startup:  Normally, the pg_control file is read quite
early, but if starting from a minimal data directory, we need to wait
until the base backup is done.  There is also the question what you do
if the base backup fails halfway through.  Currently you probably need
to delete the whole data directory and start again with initdb.  Better
might be a way to start again and overwrite any existing files, but that
can clearly also be dangerous.  All this needs some careful analysis,
but I think it's doable.

Any thoughts?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 1cf36db2514b04428570496fc9d1145591fda0fc Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sat, 29 Jun 2019 21:52:21 +0200
Subject: [PATCH v1] Base backup client as auxiliary backend process

---
 src/backend/access/transam/xlog.c |  14 +-
 src/backend/bootstrap/bootstrap.c |   9 +
 src/backend/postmaster/pgstat.c   |   6 +
 src/backend/postmaster/postmaster.c   |  95 +-
 src/backend/replication/Makefile  |   2 +-
 src/backend/replication/basebackup_client.c   |  33 ++
 .../libpqwalreceiver/libpqwalreceiver.c   | 308 ++
 src/bin/initdb/initdb.c   |  39 ++-
 src/include/access/xlog.h |   4 +
 src/include/miscadmin.h   |   2 +
 src/include/pgstat.h  |   1 +
 src/include/replication/basebackup_client.h   |   1 +
 src/include/replication/walreceiver.h |   4 +
 13 files changed, 502 insertions(+), 16 deletions(-)
 create mode 100644 src/backend/replication/basebackup_client.c
 create mode 100644 src/include/replication/basebackup_client.h

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index e08320e829..da97970703 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -905,7 +905,6 @@ static XLogRecord *ReadCheckpointRecord(XLogReaderState 
*xlogreader,

XLogRecPtr RecPtr, int whichChkpti, bool report);
 static bool rescanLatestTimeLine(void);
 static void WriteControlFile(void);
-static void ReadControlFile(void);
 static char *str_time(pg_time_t tnow);
 static bool CheckForStandbyTrigger(void);
 
@@ -4572,7 +4571,7 @@ WriteControlFile(void)
 

Re: Avoid full GIN index scan when possible

2019-06-29 Thread Alexander Korotkov
On Sat, Jun 29, 2019 at 3:51 PM Julien Rouhaud  wrote:
> On Sat, Jun 29, 2019 at 12:25 PM Tomas Vondra
>  wrote:
> >
> > On Sat, Jun 29, 2019 at 11:10:03AM +0200, Julien Rouhaud wrote:
> > >On Sat, Jun 29, 2019 at 12:51 AM Nikita Glukhov
> > >> -- patched
> > >> EXPLAIN ANALYZE SELECT * FROM test WHERE t LIKE '%1234%' AND t LIKE 
> > >> '%1%';
> > >>   QUERY PLAN
> > >> ---
> > >>  Bitmap Heap Scan on test  (cost=20.43..176.79 rows=42 width=6) (actual 
> > >> time=0.287..0.424 rows=300 loops=1)
> > >>Recheck Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text))
> > >>Rows Removed by Index Recheck: 2
> > >>Heap Blocks: exact=114
> > >>->  Bitmap Index Scan on test_t_idx  (cost=0.00..20.42 rows=42 
> > >> width=0) (actual time=0.271..0.271 rows=302 loops=1)
> > >>  Index Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text))
> > >>  Planning Time: 0.080 ms
> > >>  Execution Time: 0.450 ms
> > >> (8 rows)
> > >
> > >One thing that's bothering me is that the explain implies that the
> > >LIKE '%1%' was part of the index scan, while in reality it wasn't.  One
> > >of the reasons why I tried to modify the qual while generating the path
> > >was to have the explain be clearer about what is really done.
> >
> > Yeah, I think that's a bit annoying - it'd be nice to make it clear
> > which quals were actually used to scan the index. In some cases it may
> > not be possible (e.g. in cases when the decision is done at runtime, not
> > while planning the query), but it'd be nice to show it when possible.
>
> Maybe we could somehow add some runtime information about ignored
> quals, similar to the "never executed" information for loops?

+1, this sounds reasonable to me.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Avoid full GIN index scan when possible

2019-06-29 Thread Alexander Korotkov
On Sat, Jun 29, 2019 at 1:25 PM Tomas Vondra
 wrote:
> A related issue is that during costing is too late to modify cardinality
> estimates, so the 'Bitmap Index Scan' will be expected to return fewer
> rows than it actually returns (after ignoring the full-scan quals).
> Ignoring redundant quals (the way btree does it at execution) does not
> have such consequence, of course.

Adjusting cardinality estimates should be possible in gincostestimate(),
because we call the extractquery() method there.  However, that seems to be
quite an independent issue.  The number of rows returned by 'Bitmap Index
Scan' doesn't vary much whether we execute GIN_SEARCH_MODE_ALL or not.
The only difference is for a multicolumn index, where GIN_SEARCH_MODE_ALL
allows excluding NULLs on one column when a normal scan is performed on
another column.  And we can take that into account in gincostestimate().

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Regression tests vs existing users in an installation

2019-06-29 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> We could make the new subdirectory be something specific like
>> "src/test/modules/test_rolenames", but I think very likely we'll be
>> wanting some additional test scripts that we likewise deem unsafe to
>> run during "installcheck".  So I'd rather choose a more generic module
>> name, but I'm not sure what ... "unsafe_tests"?

> Agreed but haven't got any particularly good suggestions on names..

Hearing no better suggestions, I went with "unsafe_tests" in the
attached.

This patch just moves rolenames.sql lock-stock-and-barrel into
src/test/modules/unsafe_tests.  Another approach would be to split
the test script into a portion that doesn't violate any installcheck
rule and could be kept in the core tests, versus the unsafe tests.
I lack the interest to do that, but if somebody else is excited enough
about it, have at it.

I'm wondering whether we ought to back-patch this.  The odds that
somebody would be affected by "make installcheck" resetting the
application_name property of existing roles seem pretty small,
but it could be nasty if it did matter.  Perhaps squeezing this into
v12 is good enough.

Another idea would be to just take out the ALTER USER ALL tests
in the back branches.

Thoughts?

regards, tom lane

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index dfd0956..60d6d7b 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
 		  test_rbtree \
 		  test_rls_hooks \
 		  test_shm_mq \
+		  unsafe_tests \
 		  worker_spi
 
 $(recurse)
diff --git a/src/test/modules/unsafe_tests/.gitignore b/src/test/modules/unsafe_tests/.gitignore
new file mode 100644
index 000..5dcb3ff
--- /dev/null
+++ b/src/test/modules/unsafe_tests/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/unsafe_tests/Makefile b/src/test/modules/unsafe_tests/Makefile
new file mode 100644
index 000..321252f
--- /dev/null
+++ b/src/test/modules/unsafe_tests/Makefile
@@ -0,0 +1,14 @@
+# src/test/modules/unsafe_tests/Makefile
+
+REGRESS = rolenames
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/unsafe_tests
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/unsafe_tests/README b/src/test/modules/unsafe_tests/README
new file mode 100644
index 000..a7e5b2a
--- /dev/null
+++ b/src/test/modules/unsafe_tests/README
@@ -0,0 +1,8 @@
+This directory doesn't actually contain any extension module.
+
+What it is is a home for regression tests that we don't want to run
+during "make installcheck" because they could have side-effects that
+seem undesirable for a production installation.
+
+An example is that rolenames.sql tests ALTER USER ALL and so could
+have effects on pre-existing roles.
diff --git a/src/test/modules/unsafe_tests/expected/rolenames.out b/src/test/modules/unsafe_tests/expected/rolenames.out
new file mode 100644
index 000..03c1a25
--- /dev/null
+++ b/src/test/modules/unsafe_tests/expected/rolenames.out
@@ -0,0 +1,1010 @@
+CREATE OR REPLACE FUNCTION chkrolattr()
+ RETURNS TABLE ("role" name, rolekeyword text, canlogin bool, replication bool)
+ AS $$
+SELECT r.rolname, v.keyword, r.rolcanlogin, r.rolreplication
+ FROM pg_roles r
+ JOIN (VALUES(CURRENT_USER, 'current_user'),
+ (SESSION_USER, 'session_user'),
+ ('current_user', '-'),
+ ('session_user', '-'),
+ ('Public', '-'),
+ ('None', '-'))
+  AS v(uname, keyword)
+  ON (r.rolname = v.uname)
+ ORDER BY 1;
+$$ LANGUAGE SQL;
+CREATE OR REPLACE FUNCTION chksetconfig()
+ RETURNS TABLE (db name, "role" name, rolkeyword text, setconfig text[])
+ AS $$
+SELECT COALESCE(d.datname, 'ALL'), COALESCE(r.rolname, 'ALL'),
+	   COALESCE(v.keyword, '-'), s.setconfig
+ FROM pg_db_role_setting s
+ LEFT JOIN pg_roles r ON (r.oid = s.setrole)
+ LEFT JOIN pg_database d ON (d.oid = s.setdatabase)
+ LEFT JOIN (VALUES(CURRENT_USER, 'current_user'),
+ (SESSION_USER, 'session_user'))
+  AS v(uname, keyword)
+  ON (r.rolname = v.uname)
+   WHERE (r.rolname) IN ('Public', 'current_user', 'regress_testrol1', 'regress_testrol2')
+ORDER BY 1, 2;
+$$ LANGUAGE SQL;
+CREATE OR REPLACE FUNCTION chkumapping()
+ RETURNS TABLE (umname name, umserver name, umoptions text[])
+ AS $$
+SELECT r.rolname, s.srvname, m.umoptions
+ FROM pg_user_mapping m
+ LEFT JOIN pg_roles r ON (r.oid = m.umuser)
+ JOIN pg_foreign_server s ON (s.oid = m.umserver)
+ ORDER BY 2;
+$$ LANGUAGE SQL;
+--
+-- We test creation and use of these role names to ensure that the server
+-- correctly distinguishes role keywords from quoted names that look like
+-- those keywords.  In a test environment, creation of these roles may
+-- provoke 

Re: Avoid full GIN index scan when possible

2019-06-29 Thread Alexander Korotkov
Hi!

On Sat, Jun 29, 2019 at 1:52 AM Nikita Glukhov  wrote:
> We have a similar solution for this problem.  The idea is to avoid full index
> scan inside GIN itself when we have some GIN entries, and forcibly recheck
> all tuples if triconsistent() returns GIN_MAYBE for the keys that emitted no
> GIN entries.
>
> The attached patch in its current shape contain at least two ugly places:
>
> 1. We still need to initialize empty scan key to call triconsistent(), but
>then we have to remove it from the list of scan keys.  Simple refactoring
>of ginFillScanKey() can be helpful here.
>
> 2. We need to replace GIN_SEARCH_MODE_EVERYTHING with GIN_SEARCH_MODE_ALL
>if there are no GIN entries and some key requested GIN_SEARCH_MODE_ALL
>because we need to skip NULLs in GIN_SEARCH_MODE_ALL.  Simplest example 
> here
>is "array @> '{}'": triconsistent() returns GIN_TRUE, recheck is not 
> forced,
>and GIN_SEARCH_MODE_EVERYTHING returns NULLs that are not rechecked.  Maybe
>it would be better to introduce new GIN_SEARCH_MODE_EVERYTHING_NON_NULL.

Thank you for publishing this!

What would happen when a two-column index has a GIN_SEARCH_MODE_DEFAULT
scan on the first column and GIN_SEARCH_MODE_ALL on the second?  I think even
if triconsistent() for the second column returns GIN_TRUE, we still need
to recheck to verify the second column is not NULL.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Fix runtime errors from -fsanitize=undefined

2019-06-29 Thread didier
Hi,

I tested this patch with clang 7 on master.
- On unpatched master I can't reproduce errors with make check-world in:
src/backend/access/heap/heapam.c
src/backend/utils/cache/relcache.c (IIRC I triggered this one in a previous
pg version)
src/backend/utils/misc/guc.c

- I have a hard-to-reproduce one not covered by this patch:
src/backend/storage/ipc/shm_mq.c line 727

About the changes:
- In src/fe_utils/print.c, the line
memset(header_done, false, col_count * sizeof(bool));
is redundant and should be removed rather than guarded with if (header_done);
header_done is either NULL or already zeroed, since it is pg_malloc0'ed.

In all cases but one, the patched version short-circuits an undefined no-op,
but in src/backend/access/transam/clog.c
memcmp() of 0 bytes returns 0, so the current change modifies the code path:
before, with nsubxids == 0, the if branch was taken; now it is not.
It could end up waiting more often while taking the lock; no idea if that's
relevant.
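
A standalone illustration of the memcmp() point (editorial, not PostgreSQL
code): with a length of zero memcmp() always compares equal, which is why
guarding the call with nsubxids > 0 changes which branch is taken.

#include <stdio.h>
#include <string.h>

int
main(void)
{
    char    a[1] = {1};
    char    b[1] = {2};
    int     nsubxids = 0;

    /* length 0: contents are irrelevant, memcmp() reports equality */
    printf("%d\n", memcmp(a, b, nsubxids) == 0);                    /* prints 1 */

    /* guarded form: the "equal" branch is not taken when nsubxids == 0 */
    printf("%d\n", nsubxids > 0 && memcmp(a, b, nsubxids) == 0);    /* prints 0 */

    return 0;
}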

Regards
Didier

On Thu, Jun 6, 2019 at 11:36 AM Peter Eisentraut
 wrote:
>
> On 2019-06-05 21:30, Robert Haas wrote:
> > On Mon, Jun 3, 2019 at 3:22 PM Peter Eisentraut
> >  wrote:
> >> After many years of trying, it seems the -fsanitize=undefined checking
> >> in gcc is now working somewhat reliably.  Attached is a patch that fixes
> >> all errors of the kind
> >
> > Is this as of some particular gcc version?
>
> I used gcc-8.
>
> The option has existed in gcc for quite some time, but in previous
> releases it always tended to hang or get confused somewhere.
>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>




Re: Avoid full GIN index scan when possible

2019-06-29 Thread Tomas Vondra

On Sat, Jun 29, 2019 at 03:28:11PM +0200, Julien Rouhaud wrote:

On Sat, Jun 29, 2019 at 3:11 PM Tomas Vondra
 wrote:


On Sat, Jun 29, 2019 at 02:50:51PM +0200, Julien Rouhaud wrote:
>On Sat, Jun 29, 2019 at 12:25 PM Tomas Vondra
>> A related issue is that during costing is too late to modify cardinality
>> estimates, so the 'Bitmap Index Scan' will be expected to return fewer
>> rows than it actually returns (after ignoring the full-scan quals).
>> Ignoring redundant quals (the way btree does it at execution) does not
>> have such consequence, of course.
>>
>> Which may be an issue, because we essentially want to modify the list of
>> quals to minimize the cost of
>>
>>bitmap index scan + recheck during bitmap heap scan
>>
>> OTOH it's not a huge issue, because it won't affect the rest of the plan
>> (because that uses the bitmap heap scan estimates, and those are not
>> affected by this).
>
>Doesn't this problem already exist, as the quals that we could drop
>can't actually reduce the node's results?

How could it not reduce the node's results, if you ignore some quals
that are not redundant? My understanding is we have a plan like this:

Bitmap Heap Scan
  -> Bitmap Index Scan

and by ignoring some quals at the index scan level, we trade the (high)
cost of evaluating the qual there for a plain recheck at the bitmap heap
scan. But it means the index scan may produce more rows, so it's only a
win if the "extra rechecks" are cheaper than the (removed) full scan.


Sorry, by node I meant the BitmapIndexScan.  AIUI, if you have for
instance WHERE val LIKE '%abcde%' AND val LIKE '%z%' and a trgm index,
the BitmapIndexScan will have to go through the whole index and discard
rows based on the only opclass-optimizable qual (LIKE '%abcde%'),
letting the recheck do the proper filtering for the other qual.  So
whether you have the LIKE '%z%' or not in the index scan, the
BitmapIndexScan will return the same number of rows, the only
difference being that in one case you'll have to scan the whole index,
while in the other case you won't.



Oh! I thought 'full scan' means we have to scan all the keys in the GIN
index, but we can still eliminate some of the keys (for example for the
trigrams we might check if the trigram contains the short string). But
clearly I was mistaken and it does not work like that ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: mcvstats serialization code is still shy of a load

2019-06-29 Thread Tomas Vondra

On Thu, Jun 27, 2019 at 01:26:32PM +0200, Tomas Vondra wrote:

On Thu, Jun 27, 2019 at 12:04:30AM -0400, Tom Lane wrote:

Tomas Vondra  writes:

OK. Attached is a patch ditching the alignment in serialized data. I've
ditched the macros to access parts of serialized data, and everything
gets copied.


I lack energy to actually read this patch right now, and I don't currently
have an opinion about whether it's worth another catversion bump to fix
this stuff in v12.  But I did test the patch, and I can confirm it gets
through the core regression tests on hppa (both gaur's host environment
with gcc 3.4.6, and the OpenBSD installation with gcc 4.2.1).



Thanks for running it through regression tests, that alone is a very
useful piece of information for me.

As for the catversion bump - I'd probably vote to do it. Not just because
of this serialization stuff, but to fix the pg_mcv_list_items function.
It's not something I'm very enthusiastic about (kinda embarrassed about it,
really), but it seems better than shipping something that we'll need to
rework in PG13.



Attached is a slightly improved version of the serialization patch. The
main difference is that when serializing varlena values, the previous
patch version stored

   length (uint32) + full varlena (incl. the header)

which is kinda redundant, because the varlena stores the length too. So
now it only stores the length + data, without the varlena header. I
don't think there's a better way to store varlena values without
enforcing alignment (which is what happens in current master).
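
To make the layout concrete, storing one value would look roughly like this
(an editorial sketch, not the actual patch code):

#include "postgres.h"

/*
 * Write "value" as a uint32 length followed by the raw payload, without
 * the varlena header (sketch; assumes the value is already detoasted and
 * uncompressed).  Returns the number of bytes written.
 */
static Size
serialize_varlena_nohdr(char *ptr, struct varlena *value)
{
    uint32      len = VARSIZE_ANY_EXHDR(value);

    memcpy(ptr, &len, sizeof(uint32));
    memcpy(ptr + sizeof(uint32), VARDATA_ANY(value), len);

    return sizeof(uint32) + len;
}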

There's one additional change I failed to mention before - I had to add
another field to DimensionInfo, tracking how much space will be needed
for deserialized data. This is needed because the deserialization
allocates the whole MCV as a single chunk of memory, to reduce palloc
overhead. It could parse the data twice (first to determine the space,
then to actually parse it), this allows doing just a single pass. Which
seems useful for large MCV lists, but maybe it's not worth it?

Barring objections I'll commit this together with the pg_mcv_list_items
fix, posted in a separate thread. Of course, this requires catversion
bump - an alternative would be to keep enforcing the alignment, but
tweak the macros to work on all platforms without SIGBUS.

Considering how troublesome this serialization part of the patch turned
out to be, I'm not really sure about anything at this point. So I'd welcome
thoughts about the proposed changes.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 5fe61ea0a4..256728a02a 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -52,15 +52,7 @@
  *  ndim * (sizeof(uint16) + sizeof(bool)) + 2 * sizeof(double)
  */
 #define ITEM_SIZE(ndims)   \
-   MAXALIGN((ndims) * (sizeof(uint16) + sizeof(bool)) + 2 * sizeof(double))
-
-/*
- * Macros for convenient access to parts of a serialized MCV item.
- */
-#define ITEM_INDEXES(item) ((uint16 *) (item))
-#define ITEM_NULLS(item,ndims) ((bool *) (ITEM_INDEXES(item) + 
(ndims)))
-#define ITEM_FREQUENCY(item,ndims) ((double *) (ITEM_NULLS(item, ndims) + 
(ndims)))
-#define ITEM_BASE_FREQUENCY(item,ndims)((double *) 
(ITEM_FREQUENCY(item, ndims) + 1))
+   ((ndims) * (sizeof(uint16) + sizeof(bool)) + 2 * sizeof(double))
 
 /*
  * Used to compute size of serialized MCV list representation.
@@ -68,10 +60,16 @@
 #define MinSizeOfMCVList   \
(VARHDRSZ + sizeof(uint32) * 3 + sizeof(AttrNumber))
 
+/*
+ * Size of the serialized MCV list, exluding the space needed for
+ * deduplicated per-dimension values. The macro is meant to be used
+ * when it's not yet safe to access the serialized info about amount
+ * of data for each column.
+ */
 #define SizeOfMCVList(ndims,nitems)\
-   (MAXALIGN(MinSizeOfMCVList + sizeof(Oid) * (ndims)) + \
-MAXALIGN((ndims) * sizeof(DimensionInfo)) + \
-MAXALIGN((nitems) * ITEM_SIZE(ndims)))
+   ((MinSizeOfMCVList + sizeof(Oid) * (ndims)) + \
+((ndims) * sizeof(DimensionInfo)) + \
+((nitems) * ITEM_SIZE(ndims)))
 
 static MultiSortSupport build_mss(VacAttrStats **stats, int numattrs);
 
@@ -491,19 +489,16 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats 
**stats)
int i;
int dim;
int ndims = mcvlist->ndimensions;
-   int itemsize = ITEM_SIZE(ndims);
 
SortSupport ssup;
DimensionInfo *info;
 
Sizetotal_length;
 
-   /* allocate the item just once */
-   char   *item = palloc0(itemsize);
-
/* serialized items (indexes into arrays, etc.) */
-   char   *raw;
+   bytea  *raw;
char   *ptr;
+   

Re: TM format can mix encodings in to_char()

2019-06-29 Thread Tom Lane
Juanjo Santamaria Flecha  writes:
> It looks as if no work is left for this patch, so maybe updating the author
> to Tom Lane (I'm just a reporter at this point, which is fine) and the
> status to ready for committer would better reflect its current status. Does
> anyone think otherwise?

Yeah, this was dealt with in 7ad1cd31b et al.  I didn't realize there
was a CF entry for it, or I would have closed it then.  I've done so now.

regards, tom lane




Re: Optimize partial TOAST decompression

2019-06-29 Thread Andrey Borodin
Hi!
Please do not use top-posting, i.e. the reply style where you quote the whole
message under your response. It makes reading the archives tedious.

> On 24 June 2019, at 7:53, Binguo Bao wrote:
> 
>> This is not correct: L bytes of compressed data cannot always be decoded
>> into at least L bytes of data. At worst we have one control byte per 8 bytes
>> of literal bytes. This means at most we need (L*9 + 8) / 8 bytes with the
>> current pglz format.
> 
> Good catch! I've corrected the related code in the patch.
> ...
> <0001-Optimize-partial-TOAST-decompression-2.patch>

I've taken a look at the code.
I think we should extract a function for computing max_compressed_size and
put it somewhere along with the pglz code, so that if someone changes
something about pglz they won't forget about this assumption about the
compression algorithm.

Also, I suggest just using 64-bit computation to avoid overflows. And I think
it's worth checking whether max_compressed_size exceeds the whole data, and
using the minimum of (max_compressed_size, uncompressed_data_size).

Also, you declared needsize and max_compressed_size too far from where they
are used, but that will be solved by the function extraction anyway.
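
For illustration, the extracted helper could look roughly like this (a minimal
sketch under the assumptions above; the name and exact placement are only
illustrative):

#include "postgres.h"

/*
 * Upper bound on the number of compressed bytes needed to decode the first
 * rawsize bytes, assuming the current pglz format (at worst one control
 * byte per 8 literal bytes), computed in 64-bit arithmetic and capped by
 * the total compressed size.
 */
static int32
pglz_max_compressed_size(int32 rawsize, int32 total_compressed_size)
{
    int64       bound = ((int64) rawsize * 9 + 8) / 8;

    if (bound > total_compressed_size)
        bound = total_compressed_size;

    return (int32) bound;
}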

Thanks!

Best regards, Andrey Borodin.



Re: Avoid full GIN index scan when possible

2019-06-29 Thread Julien Rouhaud
On Sat, Jun 29, 2019 at 3:11 PM Tomas Vondra
 wrote:
>
> On Sat, Jun 29, 2019 at 02:50:51PM +0200, Julien Rouhaud wrote:
> >On Sat, Jun 29, 2019 at 12:25 PM Tomas Vondra
> >> A related issue is that during costing is too late to modify cardinality
> >> estimates, so the 'Bitmap Index Scan' will be expected to return fewer
> >> rows than it actually returns (after ignoring the full-scan quals).
> >> Ignoring redundant quals (the way btree does it at execution) does not
> >> have such consequence, of course.
> >>
> >> Which may be an issue, because we essentially want to modify the list of
> >> quals to minimize the cost of
> >>
> >>bitmap index scan + recheck during bitmap heap scan
> >>
> >> OTOH it's not a huge issue, because it won't affect the rest of the plan
> >> (because that uses the bitmap heap scan estimates, and those are not
> >> affected by this).
> >
> >Doesn't this problem already exist, as the quals that we could drop
> >can't actually reduce the node's results?
>
> How could it not reduce the node's results, if you ignore some quals
> that are not redundant? My understanding is we have a plan like this:
>
> Bitmap Heap Scan
>   -> Bitmap Index Scan
>
> and by ignoring some quals at the index scan level, we trade the (high)
> cost of evaluating the qual there for a plain recheck at the bitmap heap
> scan. But it means the index scan may produce more rows, so it's only a
> win if the "extra rechecks" are cheaper than the (removed) full scan.

Sorry, by node I meant the BitmapIndexScan.  AIUI, if you have for
instance WHERE val LIKE '%abcde%' AND val LIKE '%z%' and a trgm index,
the BitmapIndexScan will have to go through the whole index and discard
rows based on the only opclass-optimizable qual (LIKE '%abcde%'),
letting the recheck do the proper filtering for the other qual.  So
whether you have the LIKE '%z%' or not in the index scan, the
BitmapIndexScan will return the same number of rows, the only
difference being that in one case you'll have to scan the whole index,
while in the other case you won't.

> clearly whatever we do the results from the bitmap heap scan
> must remain the same.

Of course.




Re: Avoid full GIN index scan when possible

2019-06-29 Thread Tomas Vondra

On Sat, Jun 29, 2019 at 02:50:51PM +0200, Julien Rouhaud wrote:

On Sat, Jun 29, 2019 at 12:25 PM Tomas Vondra
 wrote:


On Sat, Jun 29, 2019 at 11:10:03AM +0200, Julien Rouhaud wrote:
>On Sat, Jun 29, 2019 at 12:51 AM Nikita Glukhov
>> -- patched
>> EXPLAIN ANALYZE SELECT * FROM test WHERE t LIKE '%1234%' AND t LIKE '%1%';
>>   QUERY PLAN
>> 
---
>>  Bitmap Heap Scan on test  (cost=20.43..176.79 rows=42 width=6) (actual 
time=0.287..0.424 rows=300 loops=1)
>>Recheck Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text))
>>Rows Removed by Index Recheck: 2
>>Heap Blocks: exact=114
>>->  Bitmap Index Scan on test_t_idx  (cost=0.00..20.42 rows=42 width=0) 
(actual time=0.271..0.271 rows=302 loops=1)
>>  Index Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text))
>>  Planning Time: 0.080 ms
>>  Execution Time: 0.450 ms
>> (8 rows)
>
>One thing that's bothering me is that the explain implies that the
>LIKE '%1%' was part of the index scan, while in reality it wasn't.  One
>of the reasons why I tried to modify the qual while generating the path
>was to have the explain be clearer about what is really done.

Yeah, I think that's a bit annoying - it'd be nice to make it clear
which quals were actually used to scan the index. In some cases it may
not be possible (e.g. in cases when the decision is done at runtime, not
while planning the query), but it'd be nice to show it when possible.


Maybe we could somehow add some runtime information about ignored
quals, similar to the "never executed" information for loops?



Maybe. I suppose it depends on when exactly we make the decision about
which quals to ignore.


A related issue is that during costing is too late to modify cardinality
estimates, so the 'Bitmap Index Scan' will be expected to return fewer
rows than it actually returns (after ignoring the full-scan quals).
Ignoring redundant quals (the way btree does it at execution) does not
have such consequence, of course.

Which may be an issue, because we essentially want to modify the list of
quals to minimize the cost of

   bitmap index scan + recheck during bitmap heap scan

OTOH it's not a huge issue, because it won't affect the rest of the plan
(because that uses the bitmap heap scan estimates, and those are not
affected by this).


Doesn't this problem already exist, as the quals that we could drop
can't actually reduce the node's results?


How could it not reduce the node's results, if you ignore some quals
that are not redundant? My understanding is we have a plan like this:

   Bitmap Heap Scan
 -> Bitmap Index Scan

and by ignoring some quals at the index scan level, we trade the (high)
cost of evaluating the qual there for a plain recheck at the bitmap heap
scan. But it means the index scan may produce more rows, so it's only a
win if the "extra rechecks" are cheaper than the (removed) full scan.

So the full scan might actually reduce the number of rows from the index
scan, but clearly whatever we do the results from the bitmap heap scan
must remain the same.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Choosing values for multivariate MCV lists

2019-06-29 Thread Tomas Vondra

On Tue, Jun 25, 2019 at 11:18:19AM +0200, Tomas Vondra wrote:

On Mon, Jun 24, 2019 at 02:54:01PM +0100, Dean Rasheed wrote:

On Mon, 24 Jun 2019 at 00:42, Tomas Vondra  wrote:


On Sun, Jun 23, 2019 at 10:23:19PM +0200, Tomas Vondra wrote:

On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:

On Sat, 22 Jun 2019 at 15:10, Tomas Vondra  wrote:

One annoying thing I noticed is that the base_frequency tends to end up
being 0, most likely due to getting too small. It's a bit strange, though,
because with statistic target set to 10k the smallest frequency for a
single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
something the float8 can represent).



Yeah, it should be impossible for the base frequency to underflow to
0. However, it looks like the problem is with mcv_list_items()'s use
of %f to convert to text, which is pretty ugly.



Yeah, I realized that too, eventually. One way to fix that would be
adding %.15f to the sprintf() call, but that just adds ugliness. It's
probably time to rewrite the function to build the tuple from datums,
instead of relying on BuildTupleFromCStrings.



OK, attached is a patch doing this. It's pretty simple, and it does
resolve the issue with frequency precision.

There's one issue with the signature, though - currently the function
returns null flags as bool array, but values are returned as simple
text value (formatted in array-like way, but still just a text).

In the attached patch I've reworked both to proper arrays, but obviously
that'd require a CATVERSION bump - and there's not much apetite for that
past beta2, I suppose. So I'll just undo this bit.



Hmm, I didn't spot that the old code was using a single text value
rather than a text array. That's clearly broken, especially since it
wasn't even necessarily constructing a valid textual representation of
an array (e.g., if an individual value's textual representation
included the array markers "{" or "}").

IMO fixing this to return a text array is worth doing, even though it
means a catversion bump.



Yeah :-(

It used to be just a "debugging" function, but now that we're using it
e.g. in pg_stats_ext definition, we need to be more careful about the
output. Presumably we could keep the text output and make sure it's
escaped properly etc. We could even build an array internally and then
run it through an output function. That'd not require catversion bump.

I'll cleanup the patch changing the function signature. If others think
the catversion bump would be too significant annoyance at this point, I
will switch back to the text output (with proper formatting).

Opinions?



Attached is a cleaned-up version of that patch. The main difference is
that instead of using construct_md_array() this uses ArrayBuildState to
construct the arrays, which is much easier. The docs don't need any
update because those were already using text[] for the parameter, the
code was inconsistent with it.
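
For reference, with ArrayBuildState the per-element work boils down to
something like this (a minimal sketch, not the exact code from the patch):

#include "postgres.h"

#include "catalog/pg_type.h"
#include "utils/array.h"
#include "utils/builtins.h"

/* Minimal sketch: accumulate C strings into a text[] Datum. */
static Datum
build_text_array(char **values, int nvalues)
{
    ArrayBuildState *astate = NULL;

    for (int i = 0; i < nvalues; i++)
        astate = accumArrayResult(astate,
                                  CStringGetTextDatum(values[i]),
                                  false,    /* not null */
                                  TEXTOID,
                                  CurrentMemoryContext);

    return makeArrayResult(astate, CurrentMemoryContext);
}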

This does require catversion bump, but as annoying as it is, I think
it's worth it (and there's also the thread discussing the serialization
issues). Barring objections, I'll get it committed later next week, once
I get back from PostgresLondon.

As I mentioned before, if we don't want any additional catversion bumps,
it's possible to pass the arrays through output functions - that would
allow us keeping the text output (but correct, unlike what we have now).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 5fe61ea0a4..498d704a62 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1134,9 +1134,6 @@ Datum
 pg_stats_ext_mcvlist_items(PG_FUNCTION_ARGS)
 {
FuncCallContext *funcctx;
-   int call_cntr;
-   int max_calls;
-   AttInMetadata *attinmeta;
 
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
@@ -1166,13 +1163,13 @@ pg_stats_ext_mcvlist_items(PG_FUNCTION_ARGS)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("function returning record 
called in context "
"that cannot accept 
type record")));
+   tupdesc = BlessTupleDesc(tupdesc);
 
/*
 * generate attribute metadata needed later to produce tuples 
from raw
 * C strings
 */
-   attinmeta = TupleDescGetAttInMetadata(tupdesc);
-   funcctx->attinmeta = attinmeta;
+   funcctx->attinmeta = TupleDescGetAttInMetadata(tupdesc);
 
MemoryContextSwitchTo(oldcontext);
}
@@ -1180,18 +1177,14 @@ pg_stats_ext_mcvlist_items(PG_FUNCTION_ARGS)
/* stuff done on every call of the function */
  

Re: Avoid full GIN index scan when possible

2019-06-29 Thread Julien Rouhaud
On Sat, Jun 29, 2019 at 12:25 PM Tomas Vondra
 wrote:
>
> On Sat, Jun 29, 2019 at 11:10:03AM +0200, Julien Rouhaud wrote:
> >On Sat, Jun 29, 2019 at 12:51 AM Nikita Glukhov
> >> -- patched
> >> EXPLAIN ANALYZE SELECT * FROM test WHERE t LIKE '%1234%' AND t LIKE '%1%';
> >>   QUERY PLAN
> >> ---
> >>  Bitmap Heap Scan on test  (cost=20.43..176.79 rows=42 width=6) (actual 
> >> time=0.287..0.424 rows=300 loops=1)
> >>Recheck Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text))
> >>Rows Removed by Index Recheck: 2
> >>Heap Blocks: exact=114
> >>->  Bitmap Index Scan on test_t_idx  (cost=0.00..20.42 rows=42 width=0) 
> >> (actual time=0.271..0.271 rows=302 loops=1)
> >>  Index Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text))
> >>  Planning Time: 0.080 ms
> >>  Execution Time: 0.450 ms
> >> (8 rows)
> >
> >One thing that's bothering me is that the explain implies that the
> >LIKE '%1%' was part of the index scan, while in reality it wasn't.  One
> >of the reasons why I tried to modify the qual while generating the path
> >was to have the explain be clearer about what is really done.
>
> Yeah, I think that's a bit annoying - it'd be nice to make it clear
> which quals were actually used to scan the index. In some cases it may
> not be possible (e.g. in cases when the decision is done at runtime, not
> while planning the query), but it'd be nice to show it when possible.

Maybe we could somehow add some runtime information about ignored
quals, similar to the "never executed" information for loops?

> A related issue is that during costing is too late to modify cardinality
> estimates, so the 'Bitmap Index Scan' will be expected to return fewer
> rows than it actually returns (after ignoring the full-scan quals).
> Ignoring redundant quals (the way btree does it at execution) does not
> have such consequence, of course.
>
> Which may be an issue, because we essentially want to modify the list of
> quals to minimize the cost of
>
>bitmap index scan + recheck during bitmap heap scan
>
> OTOH it's not a huge issue, because it won't affect the rest of the plan
> (because that uses the bitmap heap scan estimates, and those are not
> affected by this).

Doesn't this problem already exist, as the quals that we could drop
can't actually reduce the node's results?




Re: Multivariate MCV list vs. statistics target

2019-06-29 Thread Tomas Vondra

On Fri, Jun 21, 2019 at 08:09:18AM +0100, Dean Rasheed wrote:

On Thu, 20 Jun 2019 at 23:12, Tomas Vondra  wrote:


On Thu, Jun 20, 2019 at 08:08:44AM +0100, Dean Rasheed wrote:
>On Tue, 18 Jun 2019 at 22:34, Tomas Vondra  
wrote:
>>
>> So I'm thinking we should allow tweaking the statistics for extended
>> stats, and serialize it in the pg_statistic_ext catalog. Any opinions
>> why that would be a bad idea?
>
>Seems reasonable to me. This might not be the only option we'll ever
>want to add though, so perhaps a "stxoptions text[]" column along the
>lines of a relation's reloptions would be the way to go.

I don't know - I kinda dislike the idea of stashing stuff like this into
text[] arrays unless there's a clear need for such flexibility (i.e.
vision to have more such options). Which I'm not sure is the case here.
And we kinda have a precedent in pg_attribute.attstattarget, so I'd use
the same approach here.



Hmm, maybe. I can certainly understand your dislike of using text[].
I'm not sure that we can confidently say that multivariate statistics
won't ever need additional tuning knobs, but I have no idea at the
moment what they might be, and nothing else has come up so far in all
the time spent considering MCV lists and histograms, so maybe this is
OK.



OK, attached is a patch implementing this - it adds

   ALTER STATISTICS ... SET STATISTICS ...

modifying a new stxstattarget column in pg_statistic_ext catalog,
following the same logic as pg_attribute.attstattarget.

During analyze, the per-ext-statistic value is determined like this:

1) When pg_statistic_ext.stxstattarget != (-1), we just use this value
and we're done.

2) Otherwise we inspect per-column attstattarget values, and use the
largest value. This is what we do now, so it's backwards-compatible
behavior.

3) If the value is still (-1), we use default_statistics_target.
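
In other words, something like this (a simplified sketch; the patch's actual
ComputeExtStatisticsTarget() also has to look up the statistics objects
defined on the relation):

/*
 * Simplified sketch of the three-step resolution described above.
 * attstattargets[] stands in for the per-column pg_attribute.attstattarget
 * values of the columns covered by the statistics object.
 */
static int
resolve_ext_stats_target(int stxstattarget,
                         const int *attstattargets, int nattrs,
                         int default_target)
{
    int         target = stxstattarget;

    /* 2) not set explicitly: use the largest per-column target */
    if (target < 0)
    {
        for (int i = 0; i < nattrs; i++)
            if (attstattargets[i] > target)
                target = attstattargets[i];
    }

    /* 3) still undefined: fall back to default_statistics_target */
    if (target < 0)
        target = default_target;

    return target;
}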



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 
diff --git a/doc/src/sgml/ref/alter_statistics.sgml 
b/doc/src/sgml/ref/alter_statistics.sgml
index 58c7ed020d..987f9dbc6f 100644
--- a/doc/src/sgml/ref/alter_statistics.sgml
+++ b/doc/src/sgml/ref/alter_statistics.sgml
@@ -26,6 +26,7 @@ PostgreSQL documentation
 ALTER STATISTICS name OWNER TO { 
new_owner | CURRENT_USER | 
SESSION_USER }
 ALTER STATISTICS name RENAME TO 
new_name
 ALTER STATISTICS name SET SCHEMA 
new_schema
+ALTER STATISTICS name SET 
STATISTICS new_target
 
  
 
@@ -93,6 +94,22 @@ ALTER STATISTICS name SET SCHEMA 
  
 
+ 
+  new_target
+  
+   
+The statistic-gathering target for this statistic object for subsequent
+ operations.
The target can be set in the range 0 to 10000; alternatively, set it
+to -1 to revert to using the system default statistics
+target ().
+For more information on the use of statistics by the
+PostgreSQL query planner, refer to
+.
+   
+  
+ 
+
 

   
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index d7004e5313..caa4fca99d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -490,6 +490,19 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
}
}
 
+   /*
+* Look at extended statistic objects too, because those may define 
their
+* own statistic target. So we need to sample enough rows and then build
+* the statistics with enough detail.
+*/
+   {
+   int nrows = ComputeExtStatisticsTarget(onerel,
+   
   attr_cnt, vacattrstats);
+
+   if (targrows < nrows)
+   targrows = nrows;
+   }
+
/*
 * Acquire the sample rows
 */
diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c
index cf406f6f96..356b6096ac 100644
--- a/src/backend/commands/statscmds.c
+++ b/src/backend/commands/statscmds.c
@@ -14,6 +14,7 @@
  */
 #include "postgres.h"
 
+#include "access/heapam.h"
 #include "access/relation.h"
 #include "access/relscan.h"
 #include "access/table.h"
@@ -21,6 +22,7 @@
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
 #include "catalog/namespace.h"
+#include "catalog/objectaccess.h"
 #include "catalog/pg_namespace.h"
 #include "catalog/pg_statistic_ext.h"
 #include "catalog/pg_statistic_ext_data.h"
@@ -29,6 +31,7 @@
 #include "miscadmin.h"
 #include "statistics/statistics.h"
 #include "utils/builtins.h"
+#include "utils/fmgroids.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -336,6 +339,7 @@ CreateStatistics(CreateStatsStmt *stmt)
values[Anum_pg_statistic_ext_stxrelid - 1] = ObjectIdGetDatum(relid);
values[Anum_pg_statistic_ext_stxname - 1] = NameGetDatum();
values[Anum_pg_statistic_ext_stxnamespace - 1] = 

Re: Avoid full GIN index scan when possible

2019-06-29 Thread Tomas Vondra

On Sat, Jun 29, 2019 at 11:10:03AM +0200, Julien Rouhaud wrote:

On Sat, Jun 29, 2019 at 12:51 AM Nikita Glukhov
 wrote:

On 29.06.2019 1:23, Julien Rouhaud wrote:

But that kinda resembles stuff we already have - selectivity/cost. So
why shouldn't this be considered as part of costing?

Yeah, I'm not entirely convinced that we need anything new here.
The cost estimate function can detect such situations, and so can
the index AM at scan start --- for example, btree checks for
contradictory quals at scan start.  There's a certain amount of
duplicative effort involved there perhaps, but you also have to
keep in mind that we don't know the values of run-time-determined
comparison values until scan start.  So if you want certainty rather
than just a cost estimate, you may have to do these sorts of checks
at scan start.

Ah, I didn't know about _bt_preprocess_keys().  I'm not familiar with
this code, so please bear with me.  IIUC the idea would be to add
additional logic in gingetbitmap() / ginNewScanKey() to drop some
quals at runtime.  But that would mean that additional logic would
also be required in BitmapHeapScan, or that all the returned bitmap
should be artificially marked as lossy to enforce a recheck?

We have a similar solution for this problem.  The idea is to avoid full index
scan inside GIN itself when we have some GIN entries, and forcibly recheck
all tuples if triconsistent() returns GIN_MAYBE for the keys that emitted no
GIN entries.


Thanks for looking at it.  That's I think a way better approach.


The attached patch in its current shape contains at least two ugly places:

1. We still need to initialize empty scan key to call triconsistent(), but
   then we have to remove it from the list of scan keys.  Simple refactoring
   of ginFillScanKey() can be helpful here.

2. We need to replace GIN_SEARCH_MODE_EVERYTHING with GIN_SEARCH_MODE_ALL
   if there are no GIN entries and some key requested GIN_SEARCH_MODE_ALL
   because we need to skip NULLs in GIN_SEARCH_MODE_ALL.  Simplest example here
   is "array @> '{}'": triconsistent() returns GIN_TRUE, recheck is not forced,
   and GIN_SEARCH_MODE_EVERYTHING returns NULLs that are not rechecked.  Maybe
   it would be better to introduce new GIN_SEARCH_MODE_EVERYTHING_NON_NULL.
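
A minimal example of why the NULL handling matters (table and index names
are made up here; it assumes a GIN index on the array column):

    CREATE TABLE arr_test (a int[]);
    CREATE INDEX ON arr_test USING gin (a);
    INSERT INTO arr_test VALUES ('{1,2}'), (NULL);
    -- must return only the '{1,2}' row: the NULL row does not satisfy
    -- "a @> '{}'", so an EVERYTHING-mode scan that returns NULLs without
    -- a recheck would give a wrong answer.
    SELECT * FROM arr_test WHERE a @> '{}';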


Also

+   if (searchMode == GIN_SEARCH_MODE_ALL && nQueryValues <= 0)
+   {
+   /*
+* Don't emit ALL key with no entries, check only whether
+* unconditional recheck is needed.
+*/
+   GinScanKey  key = >keys[--so->nkeys];
+
+   hasSearchAllMode = true;
+   so->forcedRecheck = key->triConsistentFn(key) != GIN_TRUE;
+   }

Shouldn't you make sure that the forcedRecheck flag can't be reset?


-- patched
EXPLAIN ANALYZE SELECT * FROM test WHERE t LIKE '%1234%' AND t LIKE '%1%';
  QUERY PLAN
---
 Bitmap Heap Scan on test  (cost=20.43..176.79 rows=42 width=6) (actual 
time=0.287..0.424 rows=300 loops=1)
   Recheck Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text))
   Rows Removed by Index Recheck: 2
   Heap Blocks: exact=114
   ->  Bitmap Index Scan on test_t_idx  (cost=0.00..20.42 rows=42 width=0) 
(actual time=0.271..0.271 rows=302 loops=1)
 Index Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text))
 Planning Time: 0.080 ms
 Execution Time: 0.450 ms
(8 rows)


One thing that's bothering me is that the explain implies that the
LIKE '%1%' was part of the index scan, while in reality it wasn't.  One
of the reasons why I tried to modify the qual while generating the path
was to have the explain be clearer about what is really done.


Yeah, I think that's a bit annoying - it'd be nice to make it clear
which quals were actually used to scan the index. In some cases it may
not be possible (e.g. in cases when the decision is done at runtime, not
while planning the query), but it'd be nice to show it when possible.

A related issue is that costing happens too late to modify cardinality
estimates, so the 'Bitmap Index Scan' will be expected to return fewer
rows than it actually returns (after ignoring the full-scan quals).
Ignoring redundant quals (the way btree does it at execution) does not
have such a consequence, of course.

Which may be an issue, because we essentially want to modify the list of
quals to minimize the cost of

  bitmap index scan + recheck during bitmap heap scan

OTOH it's not a huge issue, because it won't affect the rest of the plan
(because that uses the bitmap heap scan estimates, and those are not
affected by this).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Avoid full GIN index scan when possible

2019-06-29 Thread Julien Rouhaud
On Sat, Jun 29, 2019 at 12:51 AM Nikita Glukhov
 wrote:>
> On 29.06.2019 1:23, Julien Rouhaud wrote:
>
> But that kinda resembles stuff we already have - selectivity/cost. So
> why shouldn't this be considered as part of costing?
>
> Yeah, I'm not entirely convinced that we need anything new here.
> The cost estimate function can detect such situations, and so can
> the index AM at scan start --- for example, btree checks for
> contradictory quals at scan start.  There's a certain amount of
> duplicative effort involved there perhaps, but you also have to
> keep in mind that we don't know the values of run-time-determined
> comparison values until scan start.  So if you want certainty rather
> than just a cost estimate, you may have to do these sorts of checks
> at scan start.
>
> Ah, I didn't know about _bt_preprocess_keys().  I'm not familiar with
> this code, so please bear with me.  IIUC the idea would be to add
> additional logic in gingetbitmap() / ginNewScanKey() to drop some
> quals at runtime.  But that would mean that additional logic would
> also be required in BitmapHeapScan, or that all the returned bitmap
> should be artificially marked as lossy to enforce a recheck?
>
> We have a similar solution for this problem.  The idea is to avoid full index
> scan inside GIN itself when we have some GIN entries, and forcibly recheck
> all tuples if triconsistent() returns GIN_MAYBE for the keys that emitted no
> GIN entries.

Thanks for looking at it.  That's I think a way better approach.

> The attached patch in its current shape contains at least two ugly places:
>
> 1. We still need to initialize empty scan key to call triconsistent(), but
>then we have to remove it from the list of scan keys.  Simple refactoring
>of ginFillScanKey() can be helpful here.
>
> 2. We need to replace GIN_SEARCH_MODE_EVERYTHING with GIN_SEARCH_MODE_ALL
>if there are no GIN entries and some key requested GIN_SEARCH_MODE_ALL
>because we need to skip NULLs in GIN_SEARCH_MODE_ALL.  Simplest example 
> here
>is "array @> '{}'": triconsistent() returns GIN_TRUE, recheck is not 
> forced,
>and GIN_SEARCH_MODE_EVERYTHING returns NULLs that are not rechecked.  Maybe
>it would be better to introduce new GIN_SEARCH_MODE_EVERYTHING_NON_NULL.

Also

+   if (searchMode == GIN_SEARCH_MODE_ALL && nQueryValues <= 0)
+   {
+   /*
+* Don't emit ALL key with no entries, check only whether
+* unconditional recheck is needed.
+*/
+   GinScanKey  key = >keys[--so->nkeys];
+
+   hasSearchAllMode = true;
+   so->forcedRecheck = key->triConsistentFn(key) != GIN_TRUE;
+   }

Shouldn't you make sure that the forcedRecheck flag can't be reset?

> -- patched
> EXPLAIN ANALYZE SELECT * FROM test WHERE t LIKE '%1234%' AND t LIKE '%1%';
>   QUERY PLAN
> ---
>  Bitmap Heap Scan on test  (cost=20.43..176.79 rows=42 width=6) (actual 
> time=0.287..0.424 rows=300 loops=1)
>Recheck Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text))
>Rows Removed by Index Recheck: 2
>Heap Blocks: exact=114
>->  Bitmap Index Scan on test_t_idx  (cost=0.00..20.42 rows=42 width=0) 
> (actual time=0.271..0.271 rows=302 loops=1)
>  Index Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text))
>  Planning Time: 0.080 ms
>  Execution Time: 0.450 ms
> (8 rows)

One thing that's bothering me is that the explain implies that the
LIKE '%1%' was part of the index scan, while in reality it wasn't.  One
of the reasons why I tried to modify the qual while generating the path
was to have the explain be clearer about what is really done.




Re: proposal - patch: psql - sort_by_size

2019-06-29 Thread Pavel Stehule
On Sat, Jun 29, 2019 at 9:32, Fabien COELHO  wrote:

>
> Hello Pavel,
>
> > \set SORT_BY_SIZE on
> > \dt -- sorted by schema, name (size is not calculated and is not visible)
> > \dt+ -- sorted by size
>
> Patch applies cleanly, compiles, runs. "make check" ok. doc build ok.
>
> There are no tests. Some infrastructure should be in place so that such
> features can be tested, e.g. some psql-specific TAP tests. ISTM that there was
> a patch submitted for that, but I cannot find it:-( Maybe it is combined
> with some other patch in the CF.
>

It is not possible - the size of relations is not stable (it can differ
across platforms), and because showing the size is the basis of this patch, I
cannot write tests. Maybe only the set/unset of the variable could be tested.


>
> I agree that the simpler the better for such a feature.
>
> ISTM that the fact that the option is ignored on \dt is a little bit
> annoying. It means that \dt and \dt+ would not show their results in the
> same order. I understand that the point is to avoid the cost of computing
> the sizes, but if the user asked for it, should it be done anyway?
>

It was one of the objections against some previous patches. At this moment I
don't see anything wrong with a different order between \dt and \dt+. When
the "size" column is displayed, the ordering of the report is clear.

I am not strongly against this - implementing SORT_BY_SIZE support for
non-verbose mode is only a few lines more. But for now (and it is just my
opinion and feeling, nothing more), I think sorting reports by invisible
columns can be messy. But if somebody has a strongly different opinion on
this point, I am able to accept it. Both variants make some sense and have
some benefits - each is consistent with some rules (but they cannot hold
together).


> I'm wondering whether it would make sense to have a slightly more generic
> interface allowing for more values, eg:
>
>   \set DESCRIPTION_SORT "name"
>   \set DESCRIPTION_SORT "size"
>
> Well, possibly this is a bad idea, so it is really a question.
>

We were at this point already :). If you introduce this, then you have to
support combinations like schema_name, name_schema, size, schema_size, ...

My goal is to implement the most commonly missing alternative in psql -
but I would not like to make the implementation too generic - that needs a
more complex design (and UI), and I don't think people would use it.
SORT_BY_SIZE (on/off) looks simple, and because (if this is not changed) it
has no impact on non-verbose mode, it can stay active permanently (and if
not, it is not hard to set it).

I think a more generic solution needs an interactive UI. I am now working on
vertical cursor support for pspg (https://github.com/okbob/pspg). The next
step will be sorting by the column under the vertical cursor. So, I hope, it
can be good enough for simple sorting by any column of a report (but to be
user friendly, it needs an interactive UI). Because pspg is not installed
everywhere, I would like to push some simple solution (I prefer simplicity
over genericity) into psql.



>
> +   Setting this variable to on causes so results of
> +   \d* commands will be sorted by size, when size
> +   is displayed.
>
> Maybe the simpler: "Setting this variable on sorts \d* outputs by size,
> when size is displayed."
>
> ISTM that the documentation is more generic than reality. Does it work
> with \db+? It seems to work with \dm+.
>
> On equality, ISTM it should sort by name as a secondary criterion.
>
> I tested a few cases, although not partitioned tables.
>

Thank you - I now support relations (tables, indexes, ...), databases, and
tablespaces. The size column is displayed for the data types report, but I am
not sure there is any benefit in that case.

Regards

Pavel


> --
> Fabien.
>
>
>


Re: Commitfest 2019-07, the first of five* for PostgreSQL 13

2019-06-29 Thread Thomas Munro
On Sat, Jun 29, 2019 at 9:47 AM David Steele  wrote:
> On 6/28/19 1:15 PM, Tom Lane wrote:
> > Stephen Frost  writes:
> >> sh, don't look now, but there might be a "Resend email" button in
> >> the archives now that you can click to have an email sent to you...
> >
> > Oooh, lovely.
> >
> >> (thank you Magnus)
> >
> > +many
>
> Thank you, Magnus, this is really helpful!

Thanks, that's great news.  So, just to recap for new people who want
to get involved in testing and reviewing, the steps are:

1.  Subscribe to the pgsql-hackers mailing list, starting here:
https://lists.postgresql.org/
2.  In the process of doing that, you'll create a PostgreSQL community account.
3.  Choose a patch you're interested in from
https://commitfest.postgresql.org/23/ , and possibly add yourself as a
reviewer.
4.  Follow the link to the email thread.
5.  Click on the shiny new "Resend email" link on the latest email in
the thread to receive a copy, if you didn't have it already.
6.  You can reply-all to that email to join the discussion.

(As with all busy mailing lists, you'll probably want to set up
filtering to put pgsql-hackers messages into a separate folder/label
due to volume.)

-- 
Thomas Munro
https://enterprisedb.com




Re: proposal - patch: psql - sort_by_size

2019-06-29 Thread Fabien COELHO



Hello Pavel,


\set SORT_BY_SIZE on
\dt -- sorted by schema, name (size is not calculated and is not visible)
\dt+ -- sorted by size


Patch applies cleanly, compiles, runs. "make check" ok. doc build ok.

There are no tests. Some infrastructure should be in place so that such 
features can be tested, e.g. some psql-specific TAP tests. ISTM that there was 
a patch submitted for that, but I cannot find it:-( Maybe it is combined 
with some other patch in the CF.


I agree that the simpler the better for such a feature.

ISTM that the fact that the option is ignored on \dt is a little bit 
annoying. It means that \dt and \dt+ would not show their results in the 
same order. I understand that the point is to avoid the cost of computing 
the sizes, but if the user asked for it, should it be done anyway?


I'm wondering whether it would make sense to have a slightly more generic 
interface allowing for more values, eg:


 \set DESCRIPTION_SORT "name"
 \set DESCRIPTION_SORT "size"

Well, possibly this is a bad idea, so it is really a question.


+   Setting this variable to on causes so results of
+   \d* commands will be sorted by size, when size
+   is displayed.

Maybe the simpler: "Setting this variable on sorts \d* outputs by size, 
when size is displayed."


ISTM that the documentation is more generic than reality. Does it work 
with \db+? It seems to work with \dm+.


On equality, ISTM it should sort by name as a secondary criterion.

I tested a few cases, although not partitioned tables.

--
Fabien.




Re: TM format can mix encodings in to_char()

2019-06-29 Thread Juanjo Santamaria Flecha
It looks as if no work is left for this patch, so maybe updating the author to
Tom Lane (I'm just a reporter at this point, which is fine) and the status to
ready for committer would better reflect its current status. Does anyone think
otherwise?

Regards,

Juan José Santamaría Flecha