Re: Custom table AMs need to include heapam.h because of BulkInsertState
On Mon, 24 Jun 2019 at 23:12, David Rowley wrote:
> On Mon, 24 Jun 2019 at 22:16, Michael Paquier wrote:
>> Don't take me bad, but I find the solution of defining and using a new
>> callback to call the table AM callback not really elegant, and keeping
>> all table AM callbacks called at a higher level than the executor
>> makes the code easier to follow.  Shouldn't we try to keep any calls
>> to table_finish_bulk_insert() within copy.c for each partition
>> instead?
>
> I'm not quite sure if I follow you, since the call to
> table_finish_bulk_insert() is within copy.c still.
>
> The problem was that PartitionTupleRouting is private to
> execPartition.c, and we need a way to determine which of the
> partitions we routed tuples to.  It seems inefficient to flush all of
> them if only a small number had tuples inserted into them, and to me
> it seems inefficient to add some additional tracking in CopyFrom(),
> like a hash table to store partition Oids that we inserted into.
> Using PartitionTupleRouting makes sense.  It's just a question of how
> to access it, which is not so easy due to it being private.
>
> I did suggest a few other ways that we could solve this.  I'm not so
> clear on which one of those you're suggesting or if you're thinking of
> something new.

Any further thoughts on this Michael? Or Andres?  Do you have a
preference for which of the approaches (mentioned upthread) I use for
the fix?  If I don't hear anything I'll probably just push the first
fix.

The inefficiency does not affect heap, so likely the people with the
most interest in improving that will be authors of other table AMs that
actually do something during table_finish_bulk_insert() for partitions.
We could revisit this in PG13 if someone comes up with a need to
improve things here.

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] proposal: schema variables
On Fri, 24 May 2019 at 19:12, Pavel Stehule wrote:
> Hi
>
> On Thu, 9 May 2019 at 6:34, Pavel Stehule wrote:
>> Hi
>>
>> rebased patch
>
> rebase after pgindent
>
> Regards
>
> Pavel

fresh rebase

Regards

Pavel

schema-variables-20190630.patch.gz
Description: application/gzip
Re: Where is SSPI auth username determined for TAP tests?
On Sat, Jun 29, 2019 at 04:36:51PM -0400, Tom Lane wrote:
> I think this is likely a consequence of ca129e58c0 having modified
> 010_dump_connstr.pl to use "regress_postgres" not "postgres" as the
> bootstrap superuser name in the source cluster.  I suppose I overlooked
> some dependency on the user name that only affects SSPI ... but what?
> I don't see anything about the destination cluster configuration (which
> already used a nondefault superuser name) that I didn't replicate
> in the source cluster configuration.

Didn't you get trapped by something similar to what has been fixed in
d9f543e?  If you want pg_hba.conf to be correctly set up for SSPI on
Windows, you should pass "auth_extra => ['--create-role',
'regress_postgres']" to the init() method initializing the node.

Looking at the commit...

 my $node = get_new_node('main');
-$node->init(extra => [ '--locale=C', '--encoding=LATIN1' ]);
+$node->init(extra =>
+	  [ '-U', $src_bootstrap_super, '--locale=C', '--encoding=LATIN1' ]);
[...]
 $node->run_log(
 	[
 		$ENV{PG_REGRESS}, '--config-auth',
 		$node->data_dir,  '--create-role',
-		"$dbname1,$dbname2,$dbname3,$dbname4"
+		"$username1,$username2,$username3,$username4"
 	]);

This part is wrong and just needs to be updated so that
$src_bootstrap_super also gets its role added in --create-role, which
would set up pg_hba.conf as you would like.
--
Michael

signature.asc
Description: PGP signature
Re: Usage of epoch in txid_current
On Mon, Jun 24, 2019 at 1:43 PM Alvaro Herrera wrote:
> I think enlarging multixacts to 64 bits is a terrible idea.  I would
> rather get rid of multixacts completely; zheap is proposing not to use
> multixacts at all, for example.

But zedstore, at least as of the Saturday after PGCon, is proposing to
keep using them, after first widening them to 64 bits:

https://www.postgresql.org/message-id/CA+TgmoYeTeQSmALox0PmSm5Gh03oe=unjhml+k+btofy_u2...@mail.gmail.com

I think we all have a visceral reaction against MultiXacts at this
point; they've just caused us too much pain, and the SLRU
infrastructure is a bunch of crap.[1]

However, the case where N sessions all do "SELECT * FROM sometab FOR
SHARE" is pretty wretched without multixacts.  You'll just have to keep
reducing the tuple density per page to fit all the locks, whereas the
current heap is totally fine and neither the heap nor the multixact
space bloats all that terribly much.

I currently think it's likely that zheap will use something
multixact-like rather than actually using multixacts per se.  I am
having trouble seeing how we can have some sort of fixed-bit-width ID
number that represents a set of XIDs for purposes of lock-tracking
without creating some nasty degenerate cases that don't exist today.[2]

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] At least in comparison to other parts of the PostgreSQL
infrastructure, which are more awesome.

[2] I'd like to credit Andres Freund for helping me understand this
issue better.
Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)
On Thu, Jun 27, 2019 at 1:58 PM Tom Lane wrote:
> It's not really clear to me that the IANA folk intend those files to
> be read as a list of preferred zone names.  If they do, what are we
> to make of the fact that no variant of "UTC" appears in them?

I think their intent is key.  We can't make reasonable decisions about
what to do with some data if we don't know what the data is intended to
mean.

> I think that predicting what IANA will do in the future is a fool's
> errand.  Our contract is to select some one of the aliases that the
> tzdb database presents, not to guess about whether it might present
> a different set in the future.  (Also note that a lot of the observed
> variation here has to do with whether individual platforms choose to
> install backward-compatibility zone names.  I think the odds that
> IANA proper will remove those links are near zero; TTBOMK they
> never have removed one yet.)

That doesn't make it a good idea to call Mountain time "Navajo," as
Andrew alleges we are doing.  Then again, the MacBook upon which I am
writing this email thinks that my time zone is "America/New_York,"
whereas I think it is "US/Eastern," which I suppose reinforces your
point about all of this being political.  But on the third hand, if
somebody tells me that my time zone is America/New_York, I can say to
myself "oh, they mean Eastern time," whereas if they say that I'm on
"Navajo" time, I'm going to have to sit down with 'diff' and the
zoneinfo files to figure out what that actually means.

I note that https://github.com/eggert/tz/blob/master/backward seems
pretty clear about which things are backward-compatibility aliases,
which seems to imply that we would not be taking a political position
separate from the upstream position if we tried to de-prioritize those.

Also, https://github.com/eggert/tz/blob/master/theory.html says...

    Names normally have the form AREA/LOCATION, where AREA is a
    continent or ocean, and LOCATION is a specific location within
    the area.

...which seems to imply that AREA/LOCATION is the "normal" and thus
preferred form, and also that...

    The file 'zone1970.tab' lists geographical locations used to name
    timezones.  It is intended to be an exhaustive list of names for
    geographic regions as described above; this is a subset of the
    timezones in the data.

...which seems to support Andrew's idea that you can identify
AREA/LOCATION time zones by looking in that file.

Long story short, I agree with you that most people probably don't care
about this very much, but I also agree with Andrew that some of the
current choices we're making are pretty strange, and I'm not as
convinced as you are that it's impossible to make a principled choice
between alternatives in all cases.  The upstream data appears to
contain some information about intent; it's not just a jumble of
exactly-equally-preferred alternatives.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Where is SSPI auth username determined for TAP tests?
bowerbird is failing the pg_dump regression tests with a lot of

FATAL:  SSPI authentication failed for user "regress_postgres"

I think this is likely a consequence of ca129e58c0 having modified
010_dump_connstr.pl to use "regress_postgres" not "postgres" as the
bootstrap superuser name in the source cluster.  I suppose I overlooked
some dependency on the user name that only affects SSPI ... but what?
I don't see anything about the destination cluster configuration (which
already used a nondefault superuser name) that I didn't replicate
in the source cluster configuration.

			regards, tom lane
Increasing default value for effective_io_concurrency?
Hi,

I think we should consider changing the effective_io_concurrency
default value, i.e. the GUC that determines how many pages we try to
prefetch in a couple of places (the most important being Bitmap Heap
Scan).

The default is 1 since forever, but from my experience that's hardly
the right value, no matter what storage system you use.  I've always
ended up with values that are either 0 (so, disabled prefetching) or
significantly higher (at least 8 or 16).  In fact, e_i_c=1 can easily
be detrimental depending on the workload and storage system.  Which is
an issue, because people often don't know how to tune this, and I see
systems with the default value quite often.

So I propose to increase the default to a value between 4 and 16.

I'm hardly the first person to notice this, as illustrated for example
by this [1] post by Merlin Moncure on pgsql-hackers from 2017, which
measured this behavior on an Intel S3500 SSD:

effective_io_concurrency 1: 46.3 sec, ~ 170 mb/sec peak via iostat
effective_io_concurrency 2: 49.3 sec, ~ 158 mb/sec peak via iostat
effective_io_concurrency 4: 29.1 sec, ~ 291 mb/sec peak via iostat
effective_io_concurrency 8: 23.2 sec, ~ 385 mb/sec peak via iostat
effective_io_concurrency 16: 22.1 sec, ~ 409 mb/sec peak via iostat
effective_io_concurrency 32: 20.7 sec, ~ 447 mb/sec peak via iostat
effective_io_concurrency 64: 20.0 sec, ~ 468 mb/sec peak via iostat
effective_io_concurrency 128: 19.3 sec, ~ 488 mb/sec peak via iostat
effective_io_concurrency 256: 19.2 sec, ~ 494 mb/sec peak via iostat

That's just one anecdotal example of behavior, of course, so I've
decided to do a couple of tests on different storage systems.  Attached
is a couple of scripts I used to generate synthetic data sets with data
laid out in different patterns (random vs. regular), and running
queries scanning various fractions of the table (1%, 5%, ...) using
plans with bitmap index scans.
I've done that on three different storage systems:

1) SATA RAID (3 x 7.2k drives in RAID0)
2) SSD RAID (6 x SATA SSD in RAID0)
3) NVMe drive

Attached is a spreadsheet with a summary of results for the tested
cases.  In general, the data support what I already wrote above - the
current default is pretty bad.  In some cases it helps a bit, but a
bit higher value (4 or 8) performs significantly better.

Consider for example this "sequential" data set from the 6xSSD RAID
system (x-axis shows e_i_c values, pct means what fraction of pages
matches the query):

    pct        0        1        4       16       64      128
    ---
      1    25990    18624     3269     2219     2189     2171
      5    88116    60242    14002     8663     8560     8726
     10   120556    99364    29856    17117    16590    17383
     25   101080   184327    79212    47884    46846    46855
     50   130709   309857   163614   103001    94267    94809
     75   126516   435653   248281   156586   139500   140087

Compared to the e_i_c=0 case, it looks like this:

    pct        1        4       16       64      128
      1      72%      13%       9%       8%       8%
      5      68%      16%      10%      10%      10%
     10      82%      25%      14%      14%      14%
     25     182%      78%      47%      46%      46%
     50     237%     125%      79%      72%      73%
     75     344%     196%     124%     110%     111%

So for 1% of the table the e_i_c=1 case is faster by about ~30%, but
with e_i_c=4 (or more) it's ~10x faster.  This is a fairly common
pattern, not just on this storage system.

The e_i_c=1 case can perform pretty poorly, especially when the query
matches a large fraction of the table - for example here it's 2-3x
slower compared to no prefetching, and higher e_i_c values limit the
damage quite a bit.  It's not entirely terrible, because in most cases
those queries would use a seqscan (the benchmark forces queries to use
bitmap heap scans), but it's not something we can ignore either because
of possible underestimates.

Furthermore, there are cases with much worse behavior.  For example,
one of the tests on SATA RAID behaves like this:

    pct        1        4       16       64      128
      1     147%     101%      61%      52%      55%
      5     180%     106%      71%      71%      70%
     10     208%     106%      73%      80%      79%
     25     225%     118%      84%      96%      86%
     50     234%     123%      91%     102%      95%
     75     241%     127%      94%     103%      98%

Pretty much all cases are significantly slower with e_i_c=1.
Of course, I'm sure there may be other things to consider. For example, these tests were done in isolation, while on actual systems there will be other queries running concurrently (and those may also generate I/O). regards [1]
base backup client as auxiliary backend process
Setting up a standby instance is still quite complicated.  You need to
run pg_basebackup with all the right options.  You need to make sure
pg_basebackup has the right permissions for the target directories.
The created instance has to be integrated into the operating system's
start scripts.  There is this slightly awkward business of the
--recovery-conf option and how it interacts with other features.  And
you should probably run pg_basebackup under screen.  And then how do
you get notified when it's done?  And when it's done you have to log
back in and finish up.  Too many steps.

My idea is that the postmaster can launch a base backup worker, wait
till it's done, then proceed with the rest of the startup.  initdb gets
a special option to create a "minimal" data directory with only a few
files, directories, and the usual configuration files.  Then you create
a $PGDATA/basebackup.signal and start the postmaster as normal.  It
sees the signal file, launches an auxiliary process that runs the base
backup, then proceeds with normal startup in standby mode.

This makes a whole bunch of things much nicer:

- The connection information for where to get the base backup from
  comes from postgresql.conf, so you only need to specify it in one
  place.

- pg_basebackup is completely out of the picture; no need to deal with
  command-line options, --recovery-conf, screen, monitoring for
  completion, etc.

- If something fails, the base backup process can automatically be
  restarted (maybe).

- Operating system integration is much easier: You only call initdb
  and then pg_ctl or postgres, as you are already doing.

- Automated deployment systems don't need to wait for pg_basebackup to
  finish: You only call initdb, then start the server, and then you're
  done -- waiting for the base backup to finish can be done by the
  regular monitoring system.

Attached is a very hackish patch to implement this.
It works like this:

    # (assuming you have a primary already running somewhere)
    initdb -D data2 --minimal
    $EDITOR data2/postgresql.conf   # set primary_conninfo
    pg_ctl -D data2 start

(Curious side note: If you don't set primary_conninfo in these steps,
then libpq defaults apply, so the default behavior might end up being
that a given instance attempts to replicate from itself.)

It works for basic cases.  It's missing tablespace support, proper
fsyncing, progress reporting, probably more.  Those would be pretty
straightforward, I think.

The interesting bit is the delicate ordering of the postmaster startup:
Normally, the pg_control file is read quite early, but if starting from
a minimal data directory, we need to wait until the base backup is
done.

There is also the question of what you do if the base backup fails
halfway through.  Currently you probably need to delete the whole data
directory and start again with initdb.  Better might be a way to start
again and overwrite any existing files, but that can clearly also be
dangerous.  All this needs some careful analysis, but I think it's
doable.

Any thoughts?
-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From 1cf36db2514b04428570496fc9d1145591fda0fc Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Sat, 29 Jun 2019 21:52:21 +0200
Subject: [PATCH v1] Base backup client as auxiliary backend process

---
 src/backend/access/transam/xlog.c             |  14 +-
 src/backend/bootstrap/bootstrap.c             |   9 +
 src/backend/postmaster/pgstat.c               |   6 +
 src/backend/postmaster/postmaster.c           |  95 +-
 src/backend/replication/Makefile              |   2 +-
 src/backend/replication/basebackup_client.c   |  33 ++
 .../libpqwalreceiver/libpqwalreceiver.c       | 308 ++
 src/bin/initdb/initdb.c                       |  39 ++-
 src/include/access/xlog.h                     |   4 +
 src/include/miscadmin.h                       |   2 +
 src/include/pgstat.h                          |   1 +
 src/include/replication/basebackup_client.h   |   1 +
 src/include/replication/walreceiver.h         |   4 +
 13 files changed, 502 insertions(+), 16 deletions(-)
 create mode 100644 src/backend/replication/basebackup_client.c
 create mode 100644 src/include/replication/basebackup_client.h

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e08320e829..da97970703 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -905,7 +905,6 @@ static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
 					XLogRecPtr RecPtr, int whichChkpti, bool report);
 static bool rescanLatestTimeLine(void);
 static void WriteControlFile(void);
-static void ReadControlFile(void);
 static char *str_time(pg_time_t tnow);
 static bool CheckForStandbyTrigger(void);
 
@@ -4572,7 +4571,7 @@ WriteControlFile(void)
Re: Avoid full GIN index scan when possible
On Sat, Jun 29, 2019 at 3:51 PM Julien Rouhaud wrote:
> On Sat, Jun 29, 2019 at 12:25 PM Tomas Vondra wrote:
>> On Sat, Jun 29, 2019 at 11:10:03AM +0200, Julien Rouhaud wrote:
>>> On Sat, Jun 29, 2019 at 12:51 AM Nikita Glukhov wrote:
>>>> -- patched
>>>> EXPLAIN ANALYZE SELECT * FROM test WHERE t LIKE '%1234%' AND t LIKE '%1%';
>>>>                            QUERY PLAN
>>>> ---
>>>>  Bitmap Heap Scan on test  (cost=20.43..176.79 rows=42 width=6) (actual time=0.287..0.424 rows=300 loops=1)
>>>>    Recheck Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text))
>>>>    Rows Removed by Index Recheck: 2
>>>>    Heap Blocks: exact=114
>>>>    ->  Bitmap Index Scan on test_t_idx  (cost=0.00..20.42 rows=42 width=0) (actual time=0.271..0.271 rows=302 loops=1)
>>>>          Index Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text))
>>>>  Planning Time: 0.080 ms
>>>>  Execution Time: 0.450 ms
>>>> (8 rows)
>>>
>>> One thing that's bothering me is that the explain implies that the
>>> LIKE '%1%' was part of the index scan, while in reality it wasn't.
>>> One of the reasons why I tried to modify the qual while generating
>>> the path was to have the explain be clearer about what is really
>>> done.
>>
>> Yeah, I think that's a bit annoying - it'd be nice to make it clear
>> which quals were actually used to scan the index.  In some cases it
>> may not be possible (e.g. in cases when the decision is done at
>> runtime, not while planning the query), but it'd be nice to show it
>> when possible.
>
> Maybe we could somehow add some runtime information about ignored
> quals, similar to the "never executed" information for loops?

+1, this sounds reasonable to me.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Avoid full GIN index scan when possible
On Sat, Jun 29, 2019 at 1:25 PM Tomas Vondra wrote:
> A related issue is that during costing is too late to modify
> cardinality estimates, so the 'Bitmap Index Scan' will be expected to
> return fewer rows than it actually returns (after ignoring the
> full-scan quals).  Ignoring redundant quals (the way btree does it at
> execution) does not have such consequence, of course.

Adjusting cardinality estimates should be possible in
gincostestimate(), because we call the extractquery() method there.
However, it seems to be quite an independent issue.  The number of rows
returned by 'Bitmap Index Scan' doesn't vary much whether we execute
GIN_SEARCH_MODE_ALL or not.  The only difference is for a multicolumn
index: GIN_SEARCH_MODE_ALL allows excluding NULLs on one column when a
normal scan is performed on another column.  And we can take that into
account in gincostestimate().

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Regression tests vs existing users in an installation
Stephen Frost writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> We could make the new subdirectory be something specific like
>> "src/test/modules/test_rolenames", but I think very likely we'll be
>> wanting some additional test scripts that we likewise deem unsafe to
>> run during "installcheck".  So I'd rather choose a more generic
>> module name, but I'm not sure what ... "unsafe_tests"?

> Agreed but haven't got any particularly good suggestions on names..

Hearing no better suggestions, I went with "unsafe_tests" in the
attached.

This patch just moves rolenames.sql lock, stock, and barrel into
src/test/modules/unsafe_tests.  Another approach would be to split the
test script into a portion that doesn't violate any installcheck rule
and could be kept in the core tests, versus the unsafe tests.  I lack
the interest to do that, but if somebody else is excited enough about
it, have at it.

I'm wondering whether we ought to back-patch this.  The odds that
somebody would be affected by "make installcheck" resetting the
application_name property of existing roles seem pretty small, but it
could be nasty if it did matter.  Perhaps squeezing this into v12 is
good enough.  Another idea would be to just take out the ALTER USER ALL
tests in the back branches.  Thoughts?
			regards, tom lane

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index dfd0956..60d6d7b 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
 	  test_rbtree \
 	  test_rls_hooks \
 	  test_shm_mq \
+	  unsafe_tests \
 	  worker_spi
 
 $(recurse)
diff --git a/src/test/modules/unsafe_tests/.gitignore b/src/test/modules/unsafe_tests/.gitignore
new file mode 100644
index 000..5dcb3ff
--- /dev/null
+++ b/src/test/modules/unsafe_tests/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/unsafe_tests/Makefile b/src/test/modules/unsafe_tests/Makefile
new file mode 100644
index 000..321252f
--- /dev/null
+++ b/src/test/modules/unsafe_tests/Makefile
@@ -0,0 +1,14 @@
+# src/test/modules/unsafe_tests/Makefile
+
+REGRESS = rolenames
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/unsafe_tests
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/unsafe_tests/README b/src/test/modules/unsafe_tests/README
new file mode 100644
index 000..a7e5b2a
--- /dev/null
+++ b/src/test/modules/unsafe_tests/README
@@ -0,0 +1,8 @@
+This directory doesn't actually contain any extension module.
+
+What it is is a home for regression tests that we don't want to run
+during "make installcheck" because they could have side-effects that
+seem undesirable for a production installation.
+
+An example is that rolenames.sql tests ALTER USER ALL and so could
+have effects on pre-existing roles.
diff --git a/src/test/modules/unsafe_tests/expected/rolenames.out b/src/test/modules/unsafe_tests/expected/rolenames.out
new file mode 100644
index 000..03c1a25
--- /dev/null
+++ b/src/test/modules/unsafe_tests/expected/rolenames.out
@@ -0,0 +1,1010 @@
+CREATE OR REPLACE FUNCTION chkrolattr()
+ RETURNS TABLE ("role" name, rolekeyword text, canlogin bool, replication bool)
+ AS $$
+SELECT r.rolname, v.keyword, r.rolcanlogin, r.rolreplication
+ FROM pg_roles r
+ JOIN (VALUES(CURRENT_USER, 'current_user'),
+             (SESSION_USER, 'session_user'),
+             ('current_user', '-'),
+             ('session_user', '-'),
+             ('Public', '-'),
+             ('None', '-'))
+      AS v(uname, keyword)
+      ON (r.rolname = v.uname)
+ ORDER BY 1;
+$$ LANGUAGE SQL;
+CREATE OR REPLACE FUNCTION chksetconfig()
+ RETURNS TABLE (db name, "role" name, rolkeyword text, setconfig text[])
+ AS $$
+SELECT COALESCE(d.datname, 'ALL'), COALESCE(r.rolname, 'ALL'),
+       COALESCE(v.keyword, '-'), s.setconfig
+ FROM pg_db_role_setting s
+ LEFT JOIN pg_roles r ON (r.oid = s.setrole)
+ LEFT JOIN pg_database d ON (d.oid = s.setdatabase)
+ LEFT JOIN (VALUES(CURRENT_USER, 'current_user'),
+                  (SESSION_USER, 'session_user'))
+      AS v(uname, keyword)
+      ON (r.rolname = v.uname)
+ WHERE (r.rolname) IN ('Public', 'current_user', 'regress_testrol1', 'regress_testrol2')
+ORDER BY 1, 2;
+$$ LANGUAGE SQL;
+CREATE OR REPLACE FUNCTION chkumapping()
+ RETURNS TABLE (umname name, umserver name, umoptions text[])
+ AS $$
+SELECT r.rolname, s.srvname, m.umoptions
+ FROM pg_user_mapping m
+ LEFT JOIN pg_roles r ON (r.oid = m.umuser)
+ JOIN pg_foreign_server s ON (s.oid = m.umserver)
+ ORDER BY 2;
+$$ LANGUAGE SQL;
+--
+-- We test creation and use of these role names to ensure that the server
+-- correctly distinguishes role keywords from quoted names that look like
+-- those keywords.  In a test environment, creation of these roles may
+-- provoke
Re: Avoid full GIN index scan when possible
Hi!

On Sat, Jun 29, 2019 at 1:52 AM Nikita Glukhov wrote:
> We have a similar solution for this problem.  The idea is to avoid
> full index scan inside GIN itself when we have some GIN entries, and
> forcibly recheck all tuples if triconsistent() returns GIN_MAYBE for
> the keys that emitted no GIN entries.
>
> The attached patch in its current shape contains at least two ugly
> places:
>
> 1. We still need to initialize an empty scan key to call
>    triconsistent(), but then we have to remove it from the list of
>    scan keys.  Simple refactoring of ginFillScanKey() can be helpful
>    here.
>
> 2. We need to replace GIN_SEARCH_MODE_EVERYTHING with
>    GIN_SEARCH_MODE_ALL if there are no GIN entries and some key
>    requested GIN_SEARCH_MODE_ALL, because we need to skip NULLs in
>    GIN_SEARCH_MODE_ALL.  The simplest example here is
>    "array @> '{}'": triconsistent() returns GIN_TRUE, recheck is not
>    forced, and GIN_SEARCH_MODE_EVERYTHING returns NULLs that are not
>    rechecked.  Maybe it would be better to introduce a new
>    GIN_SEARCH_MODE_EVERYTHING_NON_NULL.

Thank you for publishing this!

What would happen when a two-column index has a GIN_SEARCH_MODE_DEFAULT
scan on the first column and GIN_SEARCH_MODE_ALL on the second?  I
think even if triconsistent() for the second column returns GIN_TRUE,
we still need to recheck to verify the second column is not NULL.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Fix runtime errors from -fsanitize=undefined
Hi,

I tested this patch with clang 7 on master.

- On unpatched master I can't reproduce errors with make check-world in:
    src/backend/access/heap/heapam.c
    src/backend/utils/cache/relcache.c (IIRC I triggered this one in a
      previous pg version)
    src/backend/utils/misc/guc.c

- I have a hard-to-reproduce one not in this patch:
    src/backend/storage/ipc/shm_mq.c line 727

About the changes:

- In src/fe_utils/print.c, the line
      memset(header_done, false, col_count * sizeof(bool));
  is redundant and should be removed, not guarded with
  if (header_done): header_done is either NULL or already zeroed,
  since it is pg_malloc0'ed.

- In all cases but one the patched version short-circuits an undefined
  no-op, but in src/backend/access/transam/clog.c memcmp of 0 bytes
  returns 0, so the current change modifies the code path: before, with
  nsubxids == 0, the if branch was taken; now it's not.  It could wait
  more often while taking the lock, no idea if it's relevant.

Regards
Didier

On Thu, Jun 6, 2019 at 11:36 AM Peter Eisentraut wrote:
> On 2019-06-05 21:30, Robert Haas wrote:
>> On Mon, Jun 3, 2019 at 3:22 PM Peter Eisentraut wrote:
>>> After many years of trying, it seems the -fsanitize=undefined
>>> checking in gcc is now working somewhat reliably.  Attached is a
>>> patch that fixes all errors of the kind
>>
>> Is this as of some particular gcc version?
>
> I used gcc-8.
>
> The option has existed in gcc for quite some time, but in previous
> releases it always tended to hang or get confused somewhere.
Re: Avoid full GIN index scan when possible
On Sat, Jun 29, 2019 at 03:28:11PM +0200, Julien Rouhaud wrote:
> On Sat, Jun 29, 2019 at 3:11 PM Tomas Vondra wrote:
>> On Sat, Jun 29, 2019 at 02:50:51PM +0200, Julien Rouhaud wrote:
>>> On Sat, Jun 29, 2019 at 12:25 PM Tomas Vondra wrote:
>>>> A related issue is that during costing is too late to modify
>>>> cardinality estimates, so the 'Bitmap Index Scan' will be expected
>>>> to return fewer rows than it actually returns (after ignoring the
>>>> full-scan quals).  Ignoring redundant quals (the way btree does it
>>>> at execution) does not have such consequence, of course.
>>>>
>>>> Which may be an issue, because we essentially want to modify the
>>>> list of quals to minimize the cost of
>>>>
>>>>    bitmap index scan + recheck during bitmap heap scan
>>>>
>>>> OTOH it's not a huge issue, because it won't affect the rest of the
>>>> plan (because that uses the bitmap heap scan estimates, and those
>>>> are not affected by this).
>>>
>>> Doesn't this problem already exist, as the quals that we could drop
>>> can't actually reduce the node's results?
>>
>> How could it not reduce the node's results, if you ignore some quals
>> that are not redundant?
>>
>> My understanding is we have a plan like this:
>>
>>    Bitmap Heap Scan
>>      -> Bitmap Index Scan
>>
>> and by ignoring some quals at the index scan level, we trade the
>> (high) cost of evaluating the qual there for a plain recheck at the
>> bitmap heap scan.  But it means the index scan may produce more rows,
>> so it's only a win if the "extra rechecks" are cheaper than the
>> (removed) full scan.
>
> Sorry, by node I meant the BitmapIndexScan.  AIUI, if you have for
> instance
>
>    WHERE val LIKE '%abcde%' AND val LIKE '%z%'
>
> and a trgm index, the BitmapIndexScan will have to go through the
> whole index and discard rows based on the only opclass-optimizable
> qual (LIKE '%abcde%'), letting the recheck do the proper filtering for
> the other qual.  So whether you have the LIKE '%z%' or not in the
> index scan, the BitmapIndexScan will return the same number of rows;
> the only difference being that in one case you'll have to scan the
> whole index, while in the other case you won't.

Oh!  I thought 'full scan' meant we have to scan all the keys in the
GIN index, but we can still eliminate some of the keys (for example for
the trigrams we might check if the trigram contains the short string).
But clearly I was mistaken and it does not work like that ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: mcvstats serialization code is still shy of a load
On Thu, Jun 27, 2019 at 01:26:32PM +0200, Tomas Vondra wrote:
> On Thu, Jun 27, 2019 at 12:04:30AM -0400, Tom Lane wrote:
>> Tomas Vondra writes:
>>> OK.  Attached is a patch ditching the alignment in serialized data.
>>> I've ditched the macros to access parts of serialized data, and
>>> everything gets copied.
>>
>> I lack energy to actually read this patch right now, and I don't
>> currently have an opinion about whether it's worth another catversion
>> bump to fix this stuff in v12.  But I did test the patch, and I can
>> confirm it gets through the core regression tests on hppa (both
>> gaur's host environment with gcc 3.4.6, and the OpenBSD installation
>> with gcc 4.2.1).
>
> Thanks for running it through regression tests, that alone is a very
> useful piece of information for me.
>
> As for the catversion bump - I'd probably vote to do it.  Not just
> because of this serialization stuff, but to fix the pg_mcv_list_items
> function.  It's not something I'm very enthusiastic about (kinda
> embarrassed about it, really), but it seems better than shipping
> something that we'll need to rework in PG13.

Attached is a slightly improved version of the serialization patch.

The main difference is that when serializing varlena values, the
previous patch version stored length (uint32) + full varlena (incl. the
header), which is kinda redundant, because the varlena stores the
length too.  So now it only stores the length + data, without the
varlena header.  I don't think there's a better way to store varlena
values without enforcing alignment (which is what happens in current
master).

There's one additional change I failed to mention before - I had to add
another field to DimensionInfo, tracking how much space will be needed
for deserialized data.  This is needed because the deserialization
allocates the whole MCV as a single chunk of memory, to reduce palloc
overhead.  It could parse the data twice (first to determine the space,
then to actually parse it); this extra field allows doing just a single
pass.
Which seems useful for large MCV lists, but maybe it's not worth it? Barring objections I'll commit this together with the pg_mcv_list_items fix, posted in a separate thread. Of course, this requires a catversion bump - an alternative would be to keep enforcing the alignment, but tweak the macros to work on all platforms without SIGBUS. Considering how troublesome this serialization part of the patch turned out to be, I'm not really sure about anything at this point. So I'd welcome thoughts about the proposed changes. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c index 5fe61ea0a4..256728a02a 100644 --- a/src/backend/statistics/mcv.c +++ b/src/backend/statistics/mcv.c @@ -52,15 +52,7 @@ * ndim * (sizeof(uint16) + sizeof(bool)) + 2 * sizeof(double) */ #define ITEM_SIZE(ndims) \ - MAXALIGN((ndims) * (sizeof(uint16) + sizeof(bool)) + 2 * sizeof(double)) - -/* - * Macros for convenient access to parts of a serialized MCV item. - */ -#define ITEM_INDEXES(item) ((uint16 *) (item)) -#define ITEM_NULLS(item,ndims) ((bool *) (ITEM_INDEXES(item) + (ndims))) -#define ITEM_FREQUENCY(item,ndims) ((double *) (ITEM_NULLS(item, ndims) + (ndims))) -#define ITEM_BASE_FREQUENCY(item,ndims)((double *) (ITEM_FREQUENCY(item, ndims) + 1)) + ((ndims) * (sizeof(uint16) + sizeof(bool)) + 2 * sizeof(double)) /* * Used to compute size of serialized MCV list representation. @@ -68,10 +60,16 @@ #define MinSizeOfMCVList \ (VARHDRSZ + sizeof(uint32) * 3 + sizeof(AttrNumber)) +/* + * Size of the serialized MCV list, excluding the space needed for + * deduplicated per-dimension values. The macro is meant to be used + * when it's not yet safe to access the serialized info about amount + * of data for each column. 
+ */ #define SizeOfMCVList(ndims,nitems)\ - (MAXALIGN(MinSizeOfMCVList + sizeof(Oid) * (ndims)) + \ -MAXALIGN((ndims) * sizeof(DimensionInfo)) + \ -MAXALIGN((nitems) * ITEM_SIZE(ndims))) + ((MinSizeOfMCVList + sizeof(Oid) * (ndims)) + \ +((ndims) * sizeof(DimensionInfo)) + \ +((nitems) * ITEM_SIZE(ndims))) static MultiSortSupport build_mss(VacAttrStats **stats, int numattrs); @@ -491,19 +489,16 @@ statext_mcv_serialize(MCVList *mcvlist, VacAttrStats **stats) int i; int dim; int ndims = mcvlist->ndimensions; - int itemsize = ITEM_SIZE(ndims); SortSupport ssup; DimensionInfo *info; Sizetotal_length; - /* allocate the item just once */ - char *item = palloc0(itemsize); - /* serialized items (indexes into arrays, etc.) */ - char *raw; + bytea *raw; char *ptr; +
Re: TM format can mix encodings in to_char()
Juanjo Santamaria Flecha writes: > It looks as if no work is left for this patch, so maybe updating the author > to Tom Lane (I'm just a reporter at this point, which is fine) and the > status to ready for committer would better reflect its current status. Does > anyone think otherwise? Yeah, this was dealt with in 7ad1cd31b et al. I didn't realize there was a CF entry for it, or I would have closed it then. I've done so now. regards, tom lane
Re: Optimize partial TOAST decompression
Hi! Please, do not use top-posting, i.e. a reply style where you quote the whole message under your response. It makes reading the archives tiresome. > On 24 June 2019, at 7:53, Binguo Bao wrote: > >> This is not correct: L bytes of compressed data cannot always be decoded >> into at least L bytes of data. At worst we have one control byte per 8 bytes >> of literal bytes. This means at most we need (L*9 + 8) / 8 bytes with the >> current pglz format. > > Good catch! I've corrected the related code in the patch. > ... > <0001-Optimize-partial-TOAST-decompression-2.patch> I've taken a look at the code. I think we should extract a function for the computation of max_compressed_size and put it somewhere along with the pglz code, so that anyone who later changes something about pglz will not forget about this assumption of the compression algorithm. Also, I suggest just using 64-bit computation to avoid overflows. And I think it is worth checking whether max_compressed_size covers the whole datum, and using the min of (max_compressed_size, uncompressed_data_size). Also, you declared needsize and max_compressed_size too far from where they are used, but that will be solved by the function extraction anyway. Thanks! Best regards, Andrey Borodin.
Re: Avoid full GIN index scan when possible
On Sat, Jun 29, 2019 at 3:11 PM Tomas Vondra wrote: > > On Sat, Jun 29, 2019 at 02:50:51PM +0200, Julien Rouhaud wrote: > >On Sat, Jun 29, 2019 at 12:25 PM Tomas Vondra > >> A related issue is that during costing it is too late to modify cardinality > >> estimates, so the 'Bitmap Index Scan' will be expected to return fewer > >> rows than it actually returns (after ignoring the full-scan quals). > >> Ignoring redundant quals (the way btree does it at execution) does not > >> have such consequence, of course. > >> > >> Which may be an issue, because we essentially want to modify the list of > >> quals to minimize the cost of > >> > >>bitmap index scan + recheck during bitmap heap scan > >> > >> OTOH it's not a huge issue, because it won't affect the rest of the plan > >> (because that uses the bitmap heap scan estimates, and those are not > >> affected by this). > > > >Doesn't this problem already exist, as the quals that we could drop > >can't actually reduce the node's results? > > How could it not reduce the node's results, if you ignore some quals > that are not redundant? My understanding is we have a plan like this: > > Bitmap Heap Scan > -> Bitmap Index Scan > > and by ignoring some quals at the index scan level, we trade the (high) > cost of evaluating the qual there for a plain recheck at the bitmap heap > scan. But it means the index scan may produce more rows, so it's only a > win if the "extra rechecks" are cheaper than the (removed) full scan. Sorry, by node I meant the BitmapIndexScan. AIUI, if you have for instance WHERE val LIKE '%abcde%' AND val LIKE '%z%' and a trgm index, the BitmapIndexScan will have to go through the whole index and discard rows based on the only opclass-optimizable qual (LIKE '%abcde%'), letting the recheck do the proper filtering for the other qual. 
So whether you have the LIKE '%z%' or not in the index scan, the BitmapIndexScan will return the same number of rows, the only difference being that in one case you'll have to scan the whole index, while in the other case you won't. > clearly whatever we do the results from the bitmap heap scan > must remain the same. Of course.
Re: Avoid full GIN index scan when possible
On Sat, Jun 29, 2019 at 02:50:51PM +0200, Julien Rouhaud wrote: On Sat, Jun 29, 2019 at 12:25 PM Tomas Vondra wrote: On Sat, Jun 29, 2019 at 11:10:03AM +0200, Julien Rouhaud wrote: >On Sat, Jun 29, 2019 at 12:51 AM Nikita Glukhov >> -- patched >> EXPLAIN ANALYZE SELECT * FROM test WHERE t LIKE '%1234%' AND t LIKE '%1%'; >> QUERY PLAN >> --- >> Bitmap Heap Scan on test (cost=20.43..176.79 rows=42 width=6) (actual time=0.287..0.424 rows=300 loops=1) >>Recheck Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text)) >>Rows Removed by Index Recheck: 2 >>Heap Blocks: exact=114 >>-> Bitmap Index Scan on test_t_idx (cost=0.00..20.42 rows=42 width=0) (actual time=0.271..0.271 rows=302 loops=1) >> Index Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text)) >> Planning Time: 0.080 ms >> Execution Time: 0.450 ms >> (8 rows) > >One thing that's bothering me is that the explain implies that the >LIKE '%1%' was part of the index scan, while in reality it wasn't. One >of the reasons why I tried to modify the qual while generating the path >was to have the explain be clearer about what is really done. Yeah, I think that's a bit annoying - it'd be nice to make it clear which quals were actually used to scan the index. In some cases it may not be possible (e.g. in cases when the decision is done at runtime, not while planning the query), but it'd be nice to show it when possible. Maybe we could somehow add some runtime information about ignored quals, similar to the "never executed" information for loops? Maybe. I suppose it depends on when exactly we make the decision about which quals to ignore. A related issue is that during costing it is too late to modify cardinality estimates, so the 'Bitmap Index Scan' will be expected to return fewer rows than it actually returns (after ignoring the full-scan quals). Ignoring redundant quals (the way btree does it at execution) does not have such consequence, of course. 
Which may be an issue, because we essentially want to modify the list of quals to minimize the cost of bitmap index scan + recheck during bitmap heap scan OTOH it's not a huge issue, because it won't affect the rest of the plan (because that uses the bitmap heap scan estimates, and those are not affected by this). Doesn't this problem already exist, as the quals that we could drop can't actually reduce the node's results? How could it not reduce the node's results, if you ignore some quals that are not redundant? My understanding is we have a plan like this: Bitmap Heap Scan -> Bitmap Index Scan and by ignoring some quals at the index scan level, we trade the (high) cost of evaluating the qual there for a plain recheck at the bitmap heap scan. But it means the index scan may produce more rows, so it's only a win if the "extra rechecks" are cheaper than the (removed) full scan. So the full scan might actually reduce the number of rows from the index scan, but clearly whatever we do the results from the bitmap heap scan must remain the same. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Choosing values for multivariate MCV lists
On Tue, Jun 25, 2019 at 11:18:19AM +0200, Tomas Vondra wrote: On Mon, Jun 24, 2019 at 02:54:01PM +0100, Dean Rasheed wrote: On Mon, 24 Jun 2019 at 00:42, Tomas Vondra wrote: On Sun, Jun 23, 2019 at 10:23:19PM +0200, Tomas Vondra wrote: On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote: On Sat, 22 Jun 2019 at 15:10, Tomas Vondra wrote: One annoying thing I noticed is that the base_frequency tends to end up being 0, most likely due to getting too small. It's a bit strange, though, because with statistic target set to 10k the smallest frequency for a single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is something the float8 can represent). Yeah, it should be impossible for the base frequency to underflow to 0. However, it looks like the problem is with mcv_list_items()'s use of %f to convert to text, which is pretty ugly. Yeah, I realized that too, eventually. One way to fix that would be adding %.15f to the sprintf() call, but that just adds ugliness. It's probably time to rewrite the function to build the tuple from datums, instead of relying on BuildTupleFromCStrings. OK, attached is a patch doing this. It's pretty simple, and it does resolve the issue with frequency precision. There's one issue with the signature, though - currently the function returns null flags as a bool array, but values are returned as a simple text value (formatted in an array-like way, but still just text). In the attached patch I've reworked both to proper arrays, but obviously that'd require a CATVERSION bump - and there's not much appetite for that past beta2, I suppose. So I'll just undo this bit. Hmm, I didn't spot that the old code was using a single text value rather than a text array. That's clearly broken, especially since it wasn't even necessarily constructing a valid textual representation of an array (e.g., if an individual value's textual representation included the array markers "{" or "}"). 
IMO fixing this to return a text array is worth doing, even though it means a catversion bump. Yeah :-( It used to be just a "debugging" function, but now that we're using it e.g. in the pg_stats_ext definition, we need to be more careful about the output. Presumably we could keep the text output and make sure it's escaped properly etc. We could even build an array internally and then run it through an output function. That'd not require a catversion bump. I'll clean up the patch changing the function signature. If others think the catversion bump would be too significant an annoyance at this point, I will switch back to the text output (with proper formatting). Opinions? Attached is a cleaned-up version of that patch. The main difference is that instead of using construct_md_array() this uses ArrayBuildState to construct the arrays, which is much easier. The docs don't need any update because they were already using text[] for the parameter; the code was inconsistent with them. This does require a catversion bump, but as annoying as it is, I think it's worth it (and there's also the thread discussing the serialization issues). Barring objections, I'll get it committed later next week, once I get back from PostgresLondon. As I mentioned before, if we don't want any additional catversion bumps, it's possible to pass the arrays through output functions - that would allow us to keep the text output (but correct, unlike what we have now). 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c index 5fe61ea0a4..498d704a62 100644 --- a/src/backend/statistics/mcv.c +++ b/src/backend/statistics/mcv.c @@ -1134,9 +1134,6 @@ Datum pg_stats_ext_mcvlist_items(PG_FUNCTION_ARGS) { FuncCallContext *funcctx; - int call_cntr; - int max_calls; - AttInMetadata *attinmeta; /* stuff done only on the first call of the function */ if (SRF_IS_FIRSTCALL()) @@ -1166,13 +1163,13 @@ pg_stats_ext_mcvlist_items(PG_FUNCTION_ARGS) (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("function returning record called in context " "that cannot accept type record"))); + tupdesc = BlessTupleDesc(tupdesc); /* * generate attribute metadata needed later to produce tuples from raw * C strings */ - attinmeta = TupleDescGetAttInMetadata(tupdesc); - funcctx->attinmeta = attinmeta; + funcctx->attinmeta = TupleDescGetAttInMetadata(tupdesc); MemoryContextSwitchTo(oldcontext); } @@ -1180,18 +1177,14 @@ pg_stats_ext_mcvlist_items(PG_FUNCTION_ARGS) /* stuff done on every call of the function */
Re: Avoid full GIN index scan when possible
On Sat, Jun 29, 2019 at 12:25 PM Tomas Vondra wrote: > > On Sat, Jun 29, 2019 at 11:10:03AM +0200, Julien Rouhaud wrote: > >On Sat, Jun 29, 2019 at 12:51 AM Nikita Glukhov > >> -- patched > >> EXPLAIN ANALYZE SELECT * FROM test WHERE t LIKE '%1234%' AND t LIKE '%1%'; > >> QUERY PLAN > >> --- > >> Bitmap Heap Scan on test (cost=20.43..176.79 rows=42 width=6) (actual > >> time=0.287..0.424 rows=300 loops=1) > >>Recheck Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text)) > >>Rows Removed by Index Recheck: 2 > >>Heap Blocks: exact=114 > >>-> Bitmap Index Scan on test_t_idx (cost=0.00..20.42 rows=42 width=0) > >> (actual time=0.271..0.271 rows=302 loops=1) > >> Index Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text)) > >> Planning Time: 0.080 ms > >> Execution Time: 0.450 ms > >> (8 rows) > > > >One thing that's bothering me is that the explain implies that the > >LIKE '%1%' was part of the index scan, while in reality it wasn't. One > >of the reasons why I tried to modify the qual while generating the path > >was to have the explain be clearer about what is really done. > > Yeah, I think that's a bit annoying - it'd be nice to make it clear > which quals were actually used to scan the index. In some cases it may > not be possible (e.g. in cases when the decision is done at runtime, not > while planning the query), but it'd be nice to show it when possible. Maybe we could somehow add some runtime information about ignored quals, similar to the "never executed" information for loops? > A related issue is that during costing it is too late to modify cardinality > estimates, so the 'Bitmap Index Scan' will be expected to return fewer > rows than it actually returns (after ignoring the full-scan quals). > Ignoring redundant quals (the way btree does it at execution) does not > have such consequence, of course. 
> > Which may be an issue, because we essentially want to modify the list of > quals to minimize the cost of > >bitmap index scan + recheck during bitmap heap scan > > OTOH it's not a huge issue, because it won't affect the rest of the plan > (because that uses the bitmap heap scan estimates, and those are not > affected by this). Doesn't this problem already exist, as the quals that we could drop can't actually reduce the node's results?
Re: Multivariate MCV list vs. statistics target
On Fri, Jun 21, 2019 at 08:09:18AM +0100, Dean Rasheed wrote: On Thu, 20 Jun 2019 at 23:12, Tomas Vondra wrote: On Thu, Jun 20, 2019 at 08:08:44AM +0100, Dean Rasheed wrote: >On Tue, 18 Jun 2019 at 22:34, Tomas Vondra wrote: >> >> So I'm thinking we should allow tweaking the statistics for extended >> stats, and serialize it in the pg_statistic_ext catalog. Any opinions >> why that would be a bad idea? > >Seems reasonable to me. This might not be the only option we'll ever >want to add though, so perhaps a "stxoptions text[]" column along the >lines of a relation's reloptions would be the way to go. I don't know - I kinda dislike the idea of stashing stuff like this into text[] arrays unless there's a clear need for such flexibility (i.e. vision to have more such options). Which I'm not sure is the case here. And we kinda have a precedent in pg_attribute.attstattarget, so I'd use the same approach here. Hmm, maybe. I can certainly understand your dislike of using text[]. I'm not sure that we can confidently say that multivariate statistics won't ever need additional tuning knobs, but I have no idea at the moment what they might be, and nothing else has come up so far in all the time spent considering MCV lists and histograms, so maybe this is OK. OK, attached is a patch implementing this - it adds ALTER STATISTICS ... SET STATISTICS ... modifying a new stxstattarget column in pg_statistic_ext catalog, following the same logic as pg_attribute.attstattarget. During analyze, the per-ext-statistic value is determined like this: 1) When pg_statistic_ext.stxstattarget != (-1), we just use this value and we're done. 2) Otherwise we inspect per-column attstattarget values, and use the largest value. This is what we do now, so it's backwards-compatible behavior. 3) If the value is still (-1), we use default_statistics_target. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/doc/src/sgml/ref/alter_statistics.sgml b/doc/src/sgml/ref/alter_statistics.sgml index 58c7ed020d..987f9dbc6f 100644 --- a/doc/src/sgml/ref/alter_statistics.sgml +++ b/doc/src/sgml/ref/alter_statistics.sgml @@ -26,6 +26,7 @@ PostgreSQL documentation ALTER STATISTICS name OWNER TO { new_owner | CURRENT_USER | SESSION_USER } ALTER STATISTICS name RENAME TO new_name ALTER STATISTICS name SET SCHEMA new_schema +ALTER STATISTICS name SET STATISTICS new_target @@ -93,6 +94,22 @@ ALTER STATISTICS name SET SCHEMA + + new_target + + +The statistic-gathering target for this statistic object for subsequent + ANALYZE operations. +The target can be set in the range 0 to 10000; alternatively, set it +to -1 to revert to using the system default statistics +target (). +For more information on the use of statistics by the +PostgreSQL query planner, refer to +. + + + + diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index d7004e5313..caa4fca99d 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -490,6 +490,19 @@ do_analyze_rel(Relation onerel, VacuumParams *params, } } + /* +* Look at extended statistic objects too, because those may define their +* own statistic target. So we need to sample enough rows and then build +* the statistics with enough detail. 
+*/ + { + int nrows = ComputeExtStatisticsTarget(onerel, + attr_cnt, vacattrstats); + + if (targrows < nrows) + targrows = nrows; + } + /* * Acquire the sample rows */ diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c index cf406f6f96..356b6096ac 100644 --- a/src/backend/commands/statscmds.c +++ b/src/backend/commands/statscmds.c @@ -14,6 +14,7 @@ */ #include "postgres.h" +#include "access/heapam.h" #include "access/relation.h" #include "access/relscan.h" #include "access/table.h" @@ -21,6 +22,7 @@ #include "catalog/dependency.h" #include "catalog/indexing.h" #include "catalog/namespace.h" +#include "catalog/objectaccess.h" #include "catalog/pg_namespace.h" #include "catalog/pg_statistic_ext.h" #include "catalog/pg_statistic_ext_data.h" @@ -29,6 +31,7 @@ #include "miscadmin.h" #include "statistics/statistics.h" #include "utils/builtins.h" +#include "utils/fmgroids.h" #include "utils/inval.h" #include "utils/memutils.h" #include "utils/rel.h" @@ -336,6 +339,7 @@ CreateStatistics(CreateStatsStmt *stmt) values[Anum_pg_statistic_ext_stxrelid - 1] = ObjectIdGetDatum(relid); values[Anum_pg_statistic_ext_stxname - 1] = NameGetDatum(); values[Anum_pg_statistic_ext_stxnamespace - 1] =
Re: Avoid full GIN index scan when possible
On Sat, Jun 29, 2019 at 11:10:03AM +0200, Julien Rouhaud wrote: On Sat, Jun 29, 2019 at 12:51 AM Nikita Glukhov wrote:> On 29.06.2019 1:23, Julien Rouhaud wrote: But that kinda resembles stuff we already have - selectivity/cost. So why shouldn't this be considered as part of costing? Yeah, I'm not entirely convinced that we need anything new here. The cost estimate function can detect such situations, and so can the index AM at scan start --- for example, btree checks for contradictory quals at scan start. There's a certain amount of duplicative effort involved there perhaps, but you also have to keep in mind that we don't know the values of run-time-determined comparison values until scan start. So if you want certainty rather than just a cost estimate, you may have to do these sorts of checks at scan start. Ah, I didn't know about _bt_preprocess_keys(). I'm not familiar with this code, so please bear with me. IIUC the idea would be to add additional logic in gingetbitmap() / ginNewScanKey() to drop some quals at runtime. But that would mean that additional logic would also be required in BitmapHeapScan, or that all the returned bitmap should be artificially marked as lossy to enforce a recheck? We have a similar solution for this problem. The idea is to avoid full index scan inside GIN itself when we have some GIN entries, and forcibly recheck all tuples if triconsistent() returns GIN_MAYBE for the keys that emitted no GIN entries. Thanks for looking at it. That's I think a way better approach. The attached patch in its current shape contain at least two ugly places: 1. We still need to initialize empty scan key to call triconsistent(), but then we have to remove it from the list of scan keys. Simple refactoring of ginFillScanKey() can be helpful here. 2. We need to replace GIN_SEARCH_MODE_EVERYTHING with GIN_SEARCH_MODE_ALL if there are no GIN entries and some key requested GIN_SEARCH_MODE_ALL because we need to skip NULLs in GIN_SEARCH_MODE_ALL. 
Simplest example here is "array @> '{}'": triconsistent() returns GIN_TRUE, recheck is not forced, and GIN_SEARCH_MODE_EVERYTHING returns NULLs that are not rechecked. Maybe it would be better to introduce new GIN_SEARCH_MODE_EVERYTHING_NON_NULL. Also + if (searchMode == GIN_SEARCH_MODE_ALL && nQueryValues <= 0) + { + /* +* Don't emit ALL key with no entries, check only whether +* unconditional recheck is needed. +*/ + GinScanKey key = &so->keys[--so->nkeys]; + + hasSearchAllMode = true; + so->forcedRecheck = key->triConsistentFn(key) != GIN_TRUE; + } Shouldn't you make sure that the forcedRecheck flag can't be reset? -- patched EXPLAIN ANALYZE SELECT * FROM test WHERE t LIKE '%1234%' AND t LIKE '%1%'; QUERY PLAN --- Bitmap Heap Scan on test (cost=20.43..176.79 rows=42 width=6) (actual time=0.287..0.424 rows=300 loops=1) Recheck Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text)) Rows Removed by Index Recheck: 2 Heap Blocks: exact=114 -> Bitmap Index Scan on test_t_idx (cost=0.00..20.42 rows=42 width=0) (actual time=0.271..0.271 rows=302 loops=1) Index Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text)) Planning Time: 0.080 ms Execution Time: 0.450 ms (8 rows) One thing that's bothering me is that the explain implies that the LIKE '%1%' was part of the index scan, while in reality it wasn't. One of the reasons why I tried to modify the qual while generating the path was to have the explain be clearer about what is really done. Yeah, I think that's a bit annoying - it'd be nice to make it clear which quals were actually used to scan the index. In some cases it may not be possible (e.g. in cases when the decision is done at runtime, not while planning the query), but it'd be nice to show it when possible. A related issue is that during costing it is too late to modify cardinality estimates, so the 'Bitmap Index Scan' will be expected to return fewer rows than it actually returns (after ignoring the full-scan quals). 
Ignoring redundant quals (the way btree does it at execution) does not have such consequence, of course. Which may be an issue, because we essentially want to modify the list of quals to minimize the cost of bitmap index scan + recheck during bitmap heap scan OTOH it's not a huge issue, because it won't affect the rest of the plan (because that uses the bitmap heap scan estimates, and those are not affected by this). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Avoid full GIN index scan when possible
On Sat, Jun 29, 2019 at 12:51 AM Nikita Glukhov wrote:> > On 29.06.2019 1:23, Julien Rouhaud wrote: > > But that kinda resembles stuff we already have - selectivity/cost. So > why shouldn't this be considered as part of costing? > > Yeah, I'm not entirely convinced that we need anything new here. > The cost estimate function can detect such situations, and so can > the index AM at scan start --- for example, btree checks for > contradictory quals at scan start. There's a certain amount of > duplicative effort involved there perhaps, but you also have to > keep in mind that we don't know the values of run-time-determined > comparison values until scan start. So if you want certainty rather > than just a cost estimate, you may have to do these sorts of checks > at scan start. > > Ah, I didn't know about _bt_preprocess_keys(). I'm not familiar with > this code, so please bear with me. IIUC the idea would be to add > additional logic in gingetbitmap() / ginNewScanKey() to drop some > quals at runtime. But that would mean that additional logic would > also be required in BitmapHeapScan, or that all the returned bitmap > should be artificially marked as lossy to enforce a recheck? > > We have a similar solution for this problem. The idea is to avoid full index > scan inside GIN itself when we have some GIN entries, and forcibly recheck > all tuples if triconsistent() returns GIN_MAYBE for the keys that emitted no > GIN entries. Thanks for looking at it. That's I think a way better approach. > The attached patch in its current shape contain at least two ugly places: > > 1. We still need to initialize empty scan key to call triconsistent(), but >then we have to remove it from the list of scan keys. Simple refactoring >of ginFillScanKey() can be helpful here. > > 2. We need to replace GIN_SEARCH_MODE_EVERYTHING with GIN_SEARCH_MODE_ALL >if there are no GIN entries and some key requested GIN_SEARCH_MODE_ALL >because we need to skip NULLs in GIN_SEARCH_MODE_ALL. 
> Simplest example here is "array @> '{}'": triconsistent() returns GIN_TRUE, > recheck is not forced, and GIN_SEARCH_MODE_EVERYTHING returns NULLs that are > not rechecked. Maybe it would be better to introduce new > GIN_SEARCH_MODE_EVERYTHING_NON_NULL. Also + if (searchMode == GIN_SEARCH_MODE_ALL && nQueryValues <= 0) + { + /* +* Don't emit ALL key with no entries, check only whether +* unconditional recheck is needed. +*/ + GinScanKey key = &so->keys[--so->nkeys]; + + hasSearchAllMode = true; + so->forcedRecheck = key->triConsistentFn(key) != GIN_TRUE; + } Shouldn't you make sure that the forcedRecheck flag can't be reset? > -- patched > EXPLAIN ANALYZE SELECT * FROM test WHERE t LIKE '%1234%' AND t LIKE '%1%'; > QUERY PLAN > --- > Bitmap Heap Scan on test (cost=20.43..176.79 rows=42 width=6) (actual > time=0.287..0.424 rows=300 loops=1) >Recheck Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text)) >Rows Removed by Index Recheck: 2 >Heap Blocks: exact=114 >-> Bitmap Index Scan on test_t_idx (cost=0.00..20.42 rows=42 width=0) > (actual time=0.271..0.271 rows=302 loops=1) > Index Cond: ((t ~~ '%1234%'::text) AND (t ~~ '%1%'::text)) > Planning Time: 0.080 ms > Execution Time: 0.450 ms > (8 rows) One thing that's bothering me is that the explain implies that the LIKE '%1%' was part of the index scan, while in reality it wasn't. One of the reasons why I tried to modify the qual while generating the path was to have the explain be clearer about what is really done.
Re: proposal - patch: psql - sort_by_size
On Sat, 29 Jun 2019 at 9:32, Fabien COELHO wrote: > > Hello Pavel, > > > \set SORT_BY_SIZE on > > \dt -- sorted by schema, name (size is not calculated and is not visible) > > \dt+ -- sorted by size > > Patch applies cleanly, compiles, runs. "make check" ok. doc build ok. > > There are no tests. Some infrastructure should be in place so that such > features can be tested, eg some psql-specific TAP tests. ISTM that there was > a patch submitted for that, but I cannot find it:-( Maybe it is combined > with some other patch in the CF. > It is not possible - the size of relations is not stable (it can differ across platforms), and because showing the size is the basis of this patch, I cannot write tests. Maybe only the set/unset of the variable. > > I agree that the simpler the better for such a feature. > > ISTM that the fact that the option is ignored on \dt is a little bit > annoying. It means that \dt and \dt+ would not show their results in the > same order. I understand that the point is to avoid the cost of computing > the sizes, but if the user asked for it, should it be done anyway? > It was one objection against some previous patches. At this moment I don't see anything wrong with a different order between \dt and \dt+. When the "size" column is displayed, the ordering of the report will be clear. I am not strongly against this - implementing SORT_BY_SIZE support for non-verbose mode is only a few more lines. But for now (and it is just my opinion and feeling, nothing more), I think sorting reports by invisible columns can be messy. If somebody has a strongly different opinion on this point, I am able to accept it. Both variants can make sense and have some benefits - each is consistent with some rules (but they cannot both apply at once). 
> I'm wondering whether it would make sense to have a slightly more generic > interface allowing for more values, eg: > > \set DESCRIPTION_SORT "name" > \set DESCRIPTION_SORT "size" > > Well, possibly this is a bad idea, so it is really a question. > We were at this point already :). If you introduce this, then you have to support combinations like schema_name, name_schema, size, schema_size, ... My goal is to implement the most common missing alternative in psql - I would not want to make the implementation too generic - that needs a more complex design (and UI), and I don't think people would use it. SORT_BY_SIZE (on/off) looks simple, and because (if this is not changed) it has no impact on non-verbose mode, it can stay active permanently (and if not, it is not hard mental work to set it). I think a more generic solution needs an interactive UI. I am now working on vertical cursor support for pspg https://github.com/okbob/pspg. The next step will be sorting by the column under the vertical cursor. So, I hope, that can be good enough for simple sorting by any column of a report (but to be user friendly, it needs an interactive UI). Because pspg is not installed everywhere, I would like to push some simple solution (I prefer simplicity over genericity) into psql. > > + Setting this variable to on causes so results of > + \d* commands will be sorted by size, when size > + is displayed. > > Maybe the simpler: "Setting this variable on sorts \d* outputs by size, > when size is displayed." > > ISTM that the documentation is more generic than reality. Does it work > with \db+? It seems to work with \dm+. > > On equality, ISTM it should sort by name as a secondary criterion. > > I tested a few cases, although not partitioned tables. > Thank you - I now support relations (tables, indexes, ), databases, and tablespaces. The size column is displayed for the data types report, but I am not sure about any benefit in this case. Regards Pavel > -- > Fabien. > > >
Re: Commitfest 2019-07, the first of five* for PostgreSQL 13
On Sat, Jun 29, 2019 at 9:47 AM David Steele wrote:
> On 6/28/19 1:15 PM, Tom Lane wrote:
> > Stephen Frost writes:
> >> sh, don't look now, but there might be a "Resend email" button in
> >> the archives now that you can click to have an email sent to you...
> >
> > Oooh, lovely.
> >
> >> (thank you Magnus)
> >
> > +many
>
> Thank you, Magnus, this is really helpful!

Thanks, that's great news. So, just to recap for new people who want to get involved in testing and reviewing, the steps are:

1. Subscribe to the pgsql-hackers mailing list, starting here: https://lists.postgresql.org/
2. In the process of doing that, you'll create a PostgreSQL community account.
3. Choose a patch you're interested in from https://commitfest.postgresql.org/23/ , and possibly add yourself as a reviewer.
4. Follow the link to the email thread.
5. Click on the shiny new "Resend email" link on the latest email in the thread to receive a copy, if you didn't have it already.
6. You can reply-all to that email to join the discussion.

(As with all busy mailing lists, you'll probably want to set up filtering to put pgsql-hackers messages into a separate folder/label due to volume.)

--
Thomas Munro
https://enterprisedb.com
Re: proposal - patch: psql - sort_by_size
Hello Pavel,

> \set SORT_BY_SIZE on
> \dt -- sorted by schema, name (size is not calculated and is not visible)
> \dt+ -- sorted by size

Patch applies cleanly, compiles, runs. "make check" ok. doc build ok.

There are no tests. Some infrastructure should be in place so that such features can be tested, e.g. some psql-specific TAP tests. ISTM that there was a patch submitted for that, but I cannot find it:-( Maybe it is combined with some other patch in the CF.

I agree that the simpler the better for such a feature.

ISTM that the fact that the option is ignored on \dt is a little bit annoying. It means that \dt and \dt+ would not show their results in the same order. I understand that the point is to avoid the cost of computing the sizes, but if the user asked for it, should it be done anyway?

I'm wondering whether it would make sense to have a slightly more generic interface allowing for more values, eg:

\set DESCRIPTION_SORT "name"
\set DESCRIPTION_SORT "size"

Well, possibly this is a bad idea, so it is really a question.

+ Setting this variable to on causes so results of
+ \d* commands will be sorted by size, when size
+ is displayed.

Maybe the simpler: "Setting this variable on sorts \d* outputs by size, when size is displayed."

ISTM that the documentation is more generic than reality. Does it work with \db+? It seems to work with \dm+.

On equality, ISTM it should sort by name as a secondary criterion.

I tested a few cases, although not partitioned tables.

--
Fabien.
Re: TM format can mix encodings in to_char()
It looks as if no work is left for this patch, so maybe updating the author to Tom Lane (I'm just a reporter at this point, which is fine) and the status to Ready for Committer would better reflect its current status. Does anyone think otherwise?

Regards,

Juan José Santamaría Flecha