Re: Tab completion for ALTER INDEX|TABLE ALTER COLUMN SET STATISTICS
On Fri, Dec 21, 2018 at 02:51:51PM +0900, Tatsuro Yamada wrote:
> Attached file is a WIP patch.

Before sorting out the exotic part of the feature, why not first fix the
simple cases where we know that tab completion is broken and can be
improved?  This is what I meant in this email:
https://www.postgresql.org/message-id/20181219092255.gc...@paquier.xyz

The attached patch implements the idea.  What do you think?
--
Michael

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 5ba6ffba8c..7b603508db 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1566,9 +1566,20 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("PARTITION");
 	else if (Matches("ALTER", "INDEX", MatchAny, "ATTACH", "PARTITION"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_indexes, NULL);
-	/* ALTER INDEX <name> ALTER COLUMN <colnum> */
-	else if (Matches("ALTER", "INDEX", MatchAny, "ALTER", "COLUMN", MatchAny))
+	/* ALTER INDEX <name> ALTER [COLUMN] <colnum> */
+	else if (Matches("ALTER", "INDEX", MatchAny, "ALTER", MatchAny) ||
+			 Matches("ALTER", "INDEX", MatchAny, "ALTER", "COLUMN", MatchAny))
 		COMPLETE_WITH("SET STATISTICS");
+	/* ALTER INDEX <name> ALTER [COLUMN] <colnum> SET */
+	else if (Matches("ALTER", "INDEX", MatchAny, "ALTER", MatchAny, "SET") ||
+			 Matches("ALTER", "INDEX", MatchAny, "ALTER", "COLUMN", MatchAny, "SET"))
+		COMPLETE_WITH("STATISTICS");
+	/* ALTER INDEX <name> ALTER [COLUMN] <colnum> SET STATISTICS */
+	else if (Matches("ALTER", "INDEX", MatchAny, "ALTER", MatchAny, "SET", "STATISTICS") ||
+			 Matches("ALTER", "INDEX", MatchAny, "ALTER", "COLUMN", MatchAny, "SET", "STATISTICS"))
+	{
+		/* Enforce no completion here, as an integer has to be specified */
+	}
 	/* ALTER INDEX <name> SET */
 	else if (Matches("ALTER", "INDEX", MatchAny, "SET"))
 		COMPLETE_WITH("(", "TABLESPACE");
@@ -1874,6 +1885,12 @@ psql_completion(const char *text, int start, int end)
 	else if (Matches("ALTER", "TABLE", MatchAny, "ALTER", "COLUMN", MatchAny, "SET", "STORAGE") ||
 			 Matches("ALTER", "TABLE", MatchAny, "ALTER", MatchAny, "SET", "STORAGE"))
 		COMPLETE_WITH("PLAIN", "EXTERNAL", "EXTENDED", "MAIN");
+	/* ALTER TABLE ALTER [COLUMN] SET STATISTICS */
+	else if (Matches("ALTER", "TABLE", MatchAny, "ALTER", MatchAny, "SET", "STATISTICS") ||
+			 Matches("ALTER", "TABLE", MatchAny, "ALTER", "COLUMN", MatchAny, "SET", "STATISTICS"))
+	{
+		/* Enforce no completion here, as an integer has to be specified */
+	}
 	/* ALTER TABLE ALTER [COLUMN] DROP */
 	else if (Matches("ALTER", "TABLE", MatchAny, "ALTER", "COLUMN", MatchAny, "DROP") ||
 			 Matches("ALTER", "TABLE", MatchAny, "ALTER", MatchAny, "DROP"))
RE: Timeout parameters
Hi, Fabien.

The next CF will start, so I want to restart the discussion.

About "socket_timeout":
If you face the following situation, this parameter will be needed.  If
you feel that this situation can't happen or the use case is too
limited, please point that out.

>> I think that there is some kind of a misnomer: this is not a
>> socket-level timeout, but a client-side query timeout, so it should
>> be named differently?
Yes, I think so.

>> I'm not sure how to name it, though.
Me neither.  Since I want the parameter name to reflect what it
monitors, let's decide on the name while designing.

>> I think that the way it works is a little extreme, basically the
>> connection is aborted from within pqWait, and then restarted from
>> scratch.
Which part seems uncomfortable -- the abort, the restart, or both?

>> There is no clear way to know about the value of the setting (SHOW,
>> \set, \pset...).
That is a nice idea!  If this parameter implementation is decided on,
I'll also add these features.

About "TCP_USER_TIMEOUT":
I describe a test method for TCP_USER_TIMEOUT.  The only test method I
came up with uses "iptables".  Since this command can be used only by
root, I didn't create a script.

Best regards,
- Ryohei Nagaura
Speeding up creating UPDATE/DELETE generic plan for partitioned table into a lot
Hi,

I want to speed up the creation of UPDATE/DELETE generic plans for
tables partitioned into many partitions.

Currently, when creating a generic UPDATE/DELETE plan for such a table,
the planner creates a plan that scans all partitions, so it takes a very
long time.  I tried with a table partitioned into 8192 partitions, and
it took 12 seconds.

*setup*

postgres=# create table t(aid int, abalance int) partition by range(aid);
CREATE TABLE
postgres=# \o /dev/null
postgres=# select 'create table t_' || x || ' partition of t for values from (' || x || ') to (' || x+1 || ')' from generate_series(1, 8192) x;
postgres=# \gexec

*explain analyze*

I use master (commit 71a05b2232, Wed Dec 5) + v8 patch[1] + v1 patch[2].

postgres=# explain analyze execute update_stmt(999);
                             QUERY PLAN
-----------------------------------------------------------------------
 Update on t  (cost=0.00..313569.28 rows=90112 width=14) (actual time=42.805..42.805 rows=0 loops=1)
   Update on t_1
   Update on t_2
   Update on t_3
   ...
   ->  Seq Scan on t_1  (cost=0.00..38.28 rows=11 width=14) (actual time=0.021..0.022 rows=0 loops=1)
         Filter: (aid = $1)
   ->  Seq Scan on t_2  (cost=0.00..38.28 rows=11 width=14) (actual time=0.004..0.005 rows=0 loops=1)
         Filter: (aid = $1)
   ->  Seq Scan on t_3  (cost=0.00..38.28 rows=11 width=14) (actual time=0.004..0.004 rows=0 loops=1)
         Filter: (aid = $1)
   ->  Seq Scan on t_4  (cost=0.00..38.28 rows=11 width=14) (actual time=0.004..0.005 rows=0 loops=1)
   ...
 Planning Time: 12367.833 ms
 Execution Time: 490.082 ms
(24579 rows)

In most cases, since only some of the partitions need to be accessed, I
think the planner does not need to create a Scan path for every
partition.  Is there a better way?  For example, can the planner create
generic plans from the parameters specified for EXECUTE?

[1]: https://www.postgresql.org/message-id/9d7c5112-cb99-6a47-d3be-cf1ee6862...@lab.ntt.co.jp
[2]: https://www.postgresql.org/message-id/CAKJS1f-=fnmqmqp6qitkd+xeddxw22yslp-0xfk3jaqux2y...@mail.gmail.com

regards,
Sho Kato
Re: Remove Deprecated Exclusive Backup Mode
On 12/21/18 2:01 AM, Michael Paquier wrote:
> On Thu, Dec 20, 2018 at 12:29:48PM +0200, David Steele wrote:
>> Cannot move patch to the same commitfest, and multiple future
>> commitfests exist!
>
> I am not sure what it means either.  What if you just mark the existing
> entry as returned with feedback, and create a new one ahead?

Yeah, I can do that, but it would be nice to know why this is not
working.  It would also be nice to preserve the history.

Magnus?

--
-David
da...@pgmasters.net
Re: Change pgarch_readyXlog() to return .history files first
On 12/21/18 6:49 AM, Kyotaro HORIGUCHI wrote:
> "else if (!historyFound || ishistory)"

I think breaking up the if does make the code more readable.

>> 	strcpy(xlog, newxlog);
>
> The caller prepares sufficient memory for basename, and we no longer
> copy ".ready" into newxlog.  Couldn't we work directly on xlog instead
> of allocating newxlog?

I thought about doing that, but wanted to focus on the task at hand.  It
does save a strcpy and a bit of stack space, so it seems like a win.

Overall, the patch looks good to me.

Regards,
--
-David
da...@pgmasters.net
Re: Add timeline to partial WAL segments
On 12/20/18 10:56 PM, Robert Haas wrote:
> On Fri, Dec 14, 2018 at 6:05 PM David Steele wrote:
>> The question in my mind: is it safe to back-patch?
>
> I cannot imagine it being a good idea to stick a behavioral change like
> this into a minor release.  Yeah, it lets people get out from under
> this problem a lot sooner, but it potentially breaks backup scripts
> even for people who were not suffering in the first place.  I don't
> think solving this problem for the people who have it is worth
> inflicting that kind of breakage on everybody.

Fair enough -- figured I would make an argument and see what people
thought.  It's easy enough to solve on my end, but I'll wait and see
what naming scheme we come up with.

--
-David
da...@pgmasters.net
Re: Tab completion for ALTER INDEX|TABLE ALTER COLUMN SET STATISTICS
On 2018/12/20 10:47, Tatsuro Yamada wrote:
> On 2018/12/20 10:38, Michael Paquier wrote:
>> On Thu, Dec 20, 2018 at 10:05:30AM +0900, Tatsuro Yamada wrote:
>>> Alright, I'll create new patches including these:
>>>   - No completion after "ALTER TABLE/INDEX SET STATISTICS" instead
>>>     of schema names
>>>   - Complete "ALTER INDEX foo ALTER COLUMN SET" with STATISTICS by
>>>     using *column_numbers*
>>
>> Thanks for considering it!
>
> My pleasure, Neo. :)
> Please wait for new WIP patches.

Attached file is a WIP patch.

*Example after patching*

create table hoge (a integer, b integer, c integer);
create index ind_hoge on hoge(a, b, c, (c*1), (c*2), (c*3), (c*4),
                              (c*5), (c*6), (c*7), (c*8), (c*9));

# \d+ ind_hoge
                    Index "public.ind_hoge"
 Column |  Type   | Key? | Definition | Storage | Stats target
--------+---------+------+------------+---------+--------------
 a      | integer | yes  | a          | plain   |
 b      | integer | yes  | b          | plain   |
 c      | integer | yes  | c          | plain   |
 expr   | integer | yes  | (c * 1)    | plain   |
 expr1  | integer | yes  | (c * 2)    | plain   |
 expr2  | integer | yes  | (c * 3)    | plain   |
 expr3  | integer | yes  | (c * 4)    | plain   |
 expr4  | integer | yes  | (c * 5)    | plain   |
 expr5  | integer | yes  | (c * 6)    | plain   |
 expr6  | integer | yes  | (c * 7)    | plain   |
 expr7  | integer | yes  | (c * 8)    | plain   |
 expr8  | integer | yes  | (c * 9)    | plain   |
btree, for table "public.hoge"

# alter index ind_hoge alter column
1   10  11  12  2   3   4   5   6   7   8   9
# alter index ind_hoge alter column 1
1   10  11  12
# alter index ind_hoge alter column 10 SET STATISTICS
# alter index ind_hoge alter column 10 SET STATISTICS 100;
ALTER INDEX

# \d+ ind_hoge
                    Index "public.ind_hoge"
 Column |  Type   | Key? | Definition | Storage | Stats target
--------+---------+------+------------+---------+--------------
 a      | integer | yes  | a          | plain   |
 b      | integer | yes  | b          | plain   |
 c      | integer | yes  | c          | plain   |
 expr   | integer | yes  | (c * 1)    | plain   |
 expr1  | integer | yes  | (c * 2)    | plain   |
 expr2  | integer | yes  | (c * 3)    | plain   |
 expr3  | integer | yes  | (c * 4)    | plain   |
 expr4  | integer | yes  | (c * 5)    | plain   |
 expr5  | integer | yes  | (c * 6)    | plain   |
 expr6  | integer | yes  | (c * 7)    | plain   | 100
 expr7  | integer | yes  | (c * 8)    | plain   |
 expr8  | integer | yes  | (c * 9)    | plain   |
btree, for table "public.hoge"

As you know, the completed 1, 2 and 3 above are not expression columns,
so it might be better to remove them from the completion.  However, I
didn't do that because the query for getting a more suitable attnum of
the index became complicated.  So, the patch includes a new query to get
attribute numbers, like this:

+#define Query_for_list_of_attribute_numbers \
+"SELECT attnum "\
+"  FROM pg_catalog.pg_attribute a, "\
+"       pg_catalog.pg_class c "\
+" WHERE c.oid = a.attrelid "\
+"   AND a.attnum > 0 "\
+"   AND NOT a.attisdropped "\
+"   /* %d %s */" \
+"   AND a.attrelid = (select oid from pg_catalog.pg_class where relname = '%s') "\
+"   AND pg_catalog.pg_table_is_visible(c.oid) "\
+"order by a.attnum asc "

I have a question.  I read the following comment of
_complete_from_query(), but I'm not sure whether "%d" is needed or not
in the above query.  Any advice welcome!

 * 1. A simple query which must contain a %d and a %s, which will be replaced
 * by the string length of the text and the text itself.  The query may also
 * have up to four more %s in it; the first two such will be replaced by the
 * value of completion_info_charp, the next two by the value of
 * completion_info_charp2.
Thanks,
Tatsuro Yamada

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 5ba6ffba8c..28f2e9a0f9 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -583,6 +583,18 @@ static const SchemaQuery Query_for_list_of_statistics = {
 "OR '\"' || relname || '\"'='%s') "\
 "   AND pg_catalog.pg_table_is_visible(c.oid)"
 
+#define Query_for_list_of_attribute_numbers \
+"SELECT attnum "\
+"  FROM pg_catalog.pg_attribute a, "\
+"       pg_catalog.pg_class c "\
+" WHERE c.oid = a.attrelid "\
+"   AND a.attnum > 0 "\
+"   AND NOT a.attisdropped "\
+"   /* %d %s */" \
+"   AND a.attrelid = (select oid from pg_catalog.pg_class where relname = '%s') "\
+"   AND pg_catalog.pg_table_is_visible(c.oid) "\
+"order by a.attnum asc "
+
 #define Query_for_list_of_attributes_with_schema \
 "SELECT
Re: Change pgarch_readyXlog() to return .history files first
At Fri, 21 Dec 2018 14:17:25 +0900, Michael Paquier wrote in
<20181221051724.gg1...@paquier.xyz>
> On Fri, Dec 21, 2018 at 01:49:18PM +0900, Kyotaro HORIGUCHI wrote:
>> FWIW it seems to me a bug that an inconsistent set of files is made
>> in the archive directory.
>
> Okay, point taken!  FWIW, I have no actual objections to not
> back-patching that.

I maybe(?) know.

>> At Fri, 21 Dec 2018 12:19:21 +0900, Michael Paquier wrote in
>> <20181221031921.ge1...@paquier.xyz>
>>> On Thu, Dec 20, 2018 at 01:57:30PM +0200, David Steele wrote:
>>>> Good point.  The new patch uses IsTLHistoryFileName().
>>>
>>> OK, I have been reviewing the patch and the logic is correct, still
>>> I could not resist reducing the number of inner if's for
>>> readability.
>>
>> +1 basically.  But I think that tail(name, 6) != ".ready" can happen
>> with a certain frequency, but strspn(name) < basenamelen only rarely
>> in the normal case.
>
> So that's +0.5 if "basically" means a partial agreement? :p

Mmm.  No, +0.9.

>> In the else if condition, ishistory must be false in the right hand
>> of ||.  What we do here is ignoring non-history files once a history
>> file is found.  (Just a logic condensing, and it would be done by the
>> compiler, though.)
>
> Yes, this can be simplified.  So let's do so.

>> The caller prepares sufficient memory for basename, and we no longer
>> copy ".ready" into newxlog.  Wouldn't we work directly on

Sorry for the silly typo, but the 'W' was a 'C', I meant :p

>> xlog instead of allocating newxlog?
>
> Okay, let's simplify that as you suggest.

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Change pgarch_readyXlog() to return .history files first
On Fri, Dec 21, 2018 at 01:49:18PM +0900, Kyotaro HORIGUCHI wrote:
> FWIW it seems to me a bug that an inconsistent set of files is made in
> the archive directory.

Okay, point taken!  FWIW, I have no actual objections to not
back-patching that.

> At Fri, 21 Dec 2018 12:19:21 +0900, Michael Paquier wrote in
> <20181221031921.ge1...@paquier.xyz>
>> On Thu, Dec 20, 2018 at 01:57:30PM +0200, David Steele wrote:
>>> Good point.  The new patch uses IsTLHistoryFileName().
>>
>> OK, I have been reviewing the patch and the logic is correct, still I
>> could not resist reducing the number of inner if's for readability.
>
> +1 basically.  But I think that tail(name, 6) != ".ready" can happen
> with a certain frequency, but strspn(name) < basenamelen only rarely
> in the normal case.

So that's +0.5 if "basically" means a partial agreement? :p

> In the else if condition, ishistory must be false in the right hand of
> ||.  What we do here is ignoring non-history files once a history file
> is found.  (Just a logic condensing, and it would be done by the
> compiler, though.)

Yes, this can be simplified.  So let's do so.

> The caller prepares sufficient memory for basename, and we no longer
> copy ".ready" into newxlog.  Wouldn't we work directly on xlog instead
> of allocating newxlog?

Okay, let's simplify that as you suggest.
--
Michael
Re: Change pgarch_readyXlog() to return .history files first
Hello.

FWIW, it seems to me a bug that an inconsistent set of files is made in
the archive directory.

At Fri, 21 Dec 2018 12:19:21 +0900, Michael Paquier wrote in
<20181221031921.ge1...@paquier.xyz>
> On Thu, Dec 20, 2018 at 01:57:30PM +0200, David Steele wrote:
>> Good point.  The new patch uses IsTLHistoryFileName().
>
> OK, I have been reviewing the patch and the logic is correct, still I
> could not resist reducing the number of inner if's for readability.

+1 basically.  But I think that tail(name, 6) != ".ready" can happen
with a certain frequency, but strspn(name) < basenamelen only rarely in
the normal case.

> I also did not like the high-jacking of rlde->d_name so instead let's
> use an intermediate variable to store the basename of a scanned entry.
> The format of the if/elif with comments in-between was not really
> consistent with the common practice as well.  pg_indent has also been
> applied.
>
>> Ah, I see.  Yes, that's exactly how I tested it, in addition to doing
>> real promotions.
>
> OK, so am I doing.
>
> Attached is an updated patch.  Does that look fine to you?  The base
> logic is unchanged, and after a promotion history files get archived
> before the last partial segment.

Renaming history to ishistory looks good.

	if (!found || (ishistory && !historyFound))
	{
		/* init */
		found = true;
		historyFound = ishistory;
	}
	else if (ishistory || (!ishistory && !historyFound))	/* compare/replace */

In the else if condition, ishistory must be false in the right hand of
||.  What we do here is ignoring non-history files once a history file
is found.  (Just a logic condensing, and it would be done by the
compiler, though.)

"else if (!historyFound || ishistory)"

> 	strcpy(xlog, newxlog);

The caller prepares sufficient memory for basename, and we no longer
copy ".ready" into newxlog.  Couldn't we work directly on xlog instead
of allocating newxlog?

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Clean up some elog messages and comments for do_pg_stop_backup and do_pg_start_backup
Hi all,

Alvaro has cleaned up a couple of error messages recently so that they
do not include the function name in what gets translated, as per
68f6f2b7.  While looking in the code for similar patterns, I have been
reminded that pg_stop_backup() is included in some messages when
waiting for segments to be archived.  This has resulted in an exchange
between Tom and me here:
https://www.postgresql.org/message-id/e1gzg2w-0008lq...@gemulon.postgresql.org

The thing is that the current messages are actually misleading, because
for base backups taken via the replication protocol pg_stop_backup is
never called, which is I think confusing.  While looking around I have
also noticed that the top comments of do_pg_start_backup and
do_pg_stop_backup fail to mention that they are also used with
BASE_BACKUP.

Attached is a patch to reduce the confusion and improve the related
comments and messages.  Thoughts?
--
Michael

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c80b14ed97..57b5a91f96 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10163,9 +10163,10 @@ XLogFileNameP(TimeLineID tli, XLogSegNo segno)
 }
 
 /*
- * do_pg_start_backup is the workhorse of the user-visible pg_start_backup()
- * function.  It creates the necessary starting checkpoint and constructs the
- * backup label file.
+ * do_pg_start_backup
+ *
+ * Utility function called at the start of an online backup.  It creates the
+ * necessary starting checkpoint and constructs the backup label file.
  *
  * There are two kind of backups: exclusive and non-exclusive.  An exclusive
  * backup is started with pg_start_backup(), and there can be only one active
@@ -10705,8 +10706,10 @@ get_backup_status(void)
 }
 
 /*
- * do_pg_stop_backup is the workhorse of the user-visible pg_stop_backup()
- * function.
+ * do_pg_stop_backup
+ *
+ * Utility function called at the end of an online backup.  It cleans up the
+ * backup state and can optionally wait for WAL segments to be archived.
  *
  * If labelfile is NULL, this stops an exclusive backup.  Otherwise this stops
  * the non-exclusive backup specified by 'labelfile'.
@@ -11077,7 +11080,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 			if (!reported_waiting && waits > 5)
 			{
 				ereport(NOTICE,
-						(errmsg("pg_stop_backup cleanup done, waiting for required WAL segments to be archived")));
+						(errmsg("waiting for required WAL segments to be archived")));
 				reported_waiting = true;
 			}
 
@@ -11087,16 +11090,16 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 			{
 				seconds_before_warning *= 2;	/* This wraps in >10 years... */
 				ereport(WARNING,
-						(errmsg("pg_stop_backup still waiting for all required WAL segments to be archived (%d seconds elapsed)",
+						(errmsg("still waiting for all required WAL segments to be archived (%d seconds elapsed)",
 								waits),
 						 errhint("Check that your archive_command is executing properly.  "
-								 "pg_stop_backup can be canceled safely, "
+								 "Backup can be canceled safely, "
 								 "but the database backup will not be usable without all the WAL segments.")));
 			}
 		}
 
 		ereport(NOTICE,
-				(errmsg("pg_stop_backup complete, all required WAL segments have been archived")));
+				(errmsg("all required WAL segments have been archived")));
 	}
 	else if (waitforarchive)
 		ereport(NOTICE,
Re: Tid scan improvements
Edmund Horner writes:
> For the forward scan, I seem to recall, from your merge join example,
> that it's useful to set the pathkeys even when there are no
> query_pathkeys.  We just have to unconditionally set them so that the
> larger plan can make use of them.

No.  Look at indxpath.c: it does not worry about pathkeys unless
has_useful_pathkeys is true, and it definitely does not generate
pathkeys that don't get past truncate_useless_pathkeys.  Those
functions are responsible for worrying about whether mergejoin can use
the pathkeys.  It's not tidpath.c's job to outthink them.

			regards, tom lane
Re: Improve selectivity estimate for range queries
Hello.

At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" wrote in
<008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>
> In my environment, the selectivity for id > 0 was 0.1, and the
> selectivity for id < 1 was 0.1.  Then, the values of rqlist->hibound
> and rqlist->lobound are set respectively.  On the other hand,
> DEFAULT_INEQ_SEL is 0.3333333333333333.  As a result, the final
> selectivity is computed according to DEFAULT_RANGE_INEQ_SEL, since
> the condition of the second if statement was satisfied.  However,
> these selectivities were computed according to the statistics, so the
> final selectivity had to be calculated from rqlist->hibound +
> rqlist->lobound - 1.0.
> My test case might be uncommon, but I think it would cause some
> problems.

Yeah, it is a known behavior, as the comment just above says.  If, in
your example query, the table were very large and had an index, it
could wrongly run an index scan over a fair amount of tuples.  But I'm
not sure how much it matters, and your example doesn't show that.

The behavior discussed here was introduced by this commit:

| commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a
| Author: Tom Lane
| Date:   Tue Nov 9 00:34:46 2004 +0000
|
|     Use a hopefully-more-reliable method of detecting default selectivity
|     estimates when combining the estimates for a range query.  As pointed out
|     by Miquel van Smoorenburg, the existing check for an impossible combined
|     result would quite possibly fail to detect one default and one non-default
|     input.  It seems better to use the default range query estimate in such
|     cases.  To do so, add a check for an estimate of exactly DEFAULT_INEQ_SEL.
|     This is a bit ugly because it introduces additional coupling between
|     clauselist_selectivity and scalarltsel/scalargtsel, but it's not like
|     there wasn't plenty already...

> To handle such cases I've thought up an idea based on a previous
> commit[1] which solved a similar problem related to
> DEFAULT_NUM_DISTINCT.  Just like this modification, I added a flag to
> clauselist_selectivity which shows whether the selectivity was
> calculated according to the statistics or not, and used it as the
> condition of the if statement instead of rqlist->hibound ==
> DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL.  I am
> afraid that the changes were more than I had expected, but we can
> estimate selectivity correctly.
>
> Patch attached.  Do you have any thoughts?

The commit [1] added almost no complexity, but this does.  It affects
many functions to introduce tighter coupling between
operator-selectivity functions and clause-selectivity functions.
isdefault travels alone too long just to bind remote functions.  We
might need a more principled way to handle that.

I've just run over the patch, but I have some comments.

+			if (isdefault)
+				rqelem->lobound_isdefault = true;

This is taking the disjunction of all lobounds ever added; I suppose it
should be the last lobound's isdefault.

As you see, four other instances of "selec ==/!= DEFAULT_*" exist in
mergejoinsel.  Don't they lead to a similar kind of problem?

 	Selectivity lobound;		/* Selectivity of a var > something clause */
 	Selectivity hibound;		/* Selectivity of a var < something clause */
+	bool		lobound_isdefault;	/* lobound is a default selectivity? */
+	bool		hibound_isdefault;	/* hibound is a default selectivity? */
 } RangeQueryClause;

Around [1], there was a discussion about introducing the notion of
uncertainty to selectivity.  The isdefault flag is a kind of
uncertainty value indicating '0/100% uncertain'.  So my feeling is that
it would be better for Selectivity to have a certainty component, and
then to build selectivity arithmetic involving uncertainty, though I
don't have a clear picture.

> [1]
> https://git.postgresql.org/gitweb/?p=postgresql.git=commitdiff=4c2777d0b733220d9029f78817af8ce

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Macros bundling RELKIND_* conditions
Mmm.  My mail on this topic seems to have been sent to nowhere..

At Fri, 21 Dec 2018 07:50:04 +0530, Ashutosh Bapat wrote in
> On Wed, Dec 19, 2018 at 11:37 PM Alvaro Herrera wrote:
>> On 2017-Aug-02, Tom Lane wrote:
>>
>>> I think Peter's got the error and the detail backwards.  It should
>>> be more like
>>>
>>>     ERROR: "someview" cannot have constraints
>>>     DETAIL: "someview" is a view.
>>>
>>> If we do it like that, we need one ERROR message per error reason,
>>> and one DETAIL per relkind, which should be manageable.
>>
>> I support this idea.  Here's a proof-of-concept patch that
>> corresponds to one of the cases that Ashutosh was on about
>> (specifically, the one that uses the RELKIND_CAN_HAVE_STORAGE macro
>> I just added).  If there are no objections to this approach, I'm
>> going to complete it along these lines.
>>
>> I put the new function at the bottom of heapam.c but I think it
>> probably needs a better place.
>>
>> BTW are there other opinions on the RELKIND_HAS_STORAGE vs.
>> RELKIND_CAN_HAVE_STORAGE debate?  I'm inclined to change it to the
>> former.
>
> +1.  I liked the idea.

+1, as I posted to another thread [1].

[1] https://www.postgresql.org/message-id/20181218.145600.172055615.horiguchi.kyot...@lab.ntt.co.jp

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Tid scan improvements
On Fri, 21 Dec 2018 at 13:25, David Rowley wrote:
> On Fri, 21 Dec 2018 at 13:09, Edmund Horner wrote:
>> On Fri, 21 Dec 2018 at 11:21, Tom Lane wrote:
>>> I'm having a hard time wrapping my mind around why you'd bother with
>>> backwards TID scans.  The amount of code needed versus the amount of
>>> usefulness seems like a pretty bad cost/benefit ratio, IMO.  I can
>>> see that there might be value in knowing that a regular scan has
>>> "ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin
>>> on TID without an explicit sort).  It does not, however, follow that
>>> there's any additional value in supporting the DESC case.
>>
>> I have occasionally found myself running "SELECT MAX(ctid) FROM t"
>> when I was curious about why a table is so big after vacuuming.
>>
>> Perhaps that's not a common enough use case to justify the amount of
>> code, especially the changes to heapam.c and explain.c.
>>
>> We'd still need the pathkeys to make good use of forward scans.  (And
>> I think the executor still needs to support seeking backward for
>> cursors.)
>
> I think the best thing to do here is separate out all the additional
> backwards scan code into a separate patch to allow it to be more
> easily considered and approved, or rejected.  I think if there's any
> hint of this blocking the main patch then it should be a separate
> patch to allow its worth to be considered independently.

Yeah, I think you're right.  I'll separate those parts into the basic
forward scan, and then the optional backward scan support.  I think
we'll still only generate a backward scan if the query_pathkeys makes
use of it.

For the forward scan, I seem to recall, from your merge join example,
that it's useful to set the pathkeys even when there are no
query_pathkeys.  We just have to unconditionally set them so that the
larger plan can make use of them.

> Also, my primary interest in this patch is to find tuples that are
> stopping the heap being truncated during a vacuum.  Generally, when
> I'm looking for that I have a good idea of what size I expect the
> relation should be (otherwise I'd not think it was bloated), in which
> case I'd be doing WHERE ctid >= '(N,1)'.  However, it might be easier
> to write some auto-bloat-removal script if we could have an ORDER BY
> ctid DESC LIMIT n.
Re: Change pgarch_readyXlog() to return .history files first
On Thu, Dec 20, 2018 at 01:57:30PM +0200, David Steele wrote:
> Good point.  The new patch uses IsTLHistoryFileName().

OK, I have been reviewing the patch and the logic is correct, still I
could not resist reducing the number of inner if's for readability.  I
also did not like the high-jacking of rlde->d_name so instead let's use
an intermediate variable to store the basename of a scanned entry.  The
format of the if/elif with comments in-between was not really
consistent with the common practice as well.  pg_indent has also been
applied.

> Ah, I see.  Yes, that's exactly how I tested it, in addition to doing
> real promotions.

OK, so am I doing.

Attached is an updated patch.  Does that look fine to you?  The base
logic is unchanged, and after a promotion history files get archived
before the last partial segment.
--
Michael

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index e88d545d65..bbd6311b35 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -695,11 +695,12 @@ pgarch_archiveXlog(char *xlog)
  *		2) because the oldest ones will sooner become candidates for
  *		recycling at time of checkpoint
  *
- * NOTE: the "oldest" comparison will presently consider all segments of
- * a timeline with a smaller ID to be older than all segments of a timeline
- * with a larger ID; the net result being that past timelines are given
- * higher priority for archiving.  This seems okay, or at least not
- * obviously worth changing.
+ * NOTE: the "oldest" comparison will consider any .history file to be older
+ * than any other file except another .history file.  Segments on a timeline
+ * with a smaller ID will be older than all segments on a timeline with a
+ * larger ID; the net result being that past timelines are given higher
+ * priority for archiving.  This seems okay, or at least not obviously worth
+ * changing.
  */
 static bool
 pgarch_readyXlog(char *xlog)
@@ -715,6 +716,7 @@ pgarch_readyXlog(char *xlog)
 	DIR		   *rldir;
 	struct dirent *rlde;
 	bool		found = false;
+	bool		historyFound = false;
 
 	snprintf(XLogArchiveStatusDir, MAXPGPATH, XLOGDIR "/archive_status");
 	rldir = AllocateDir(XLogArchiveStatusDir);
@@ -722,32 +724,54 @@ pgarch_readyXlog(char *xlog)
 	while ((rlde = ReadDir(rldir, XLogArchiveStatusDir)) != NULL)
 	{
 		int			basenamelen = (int) strlen(rlde->d_name) - 6;
+		char		basename[MAX_XFN_CHARS + 1];
+		bool		ishistory;
 
-		if (basenamelen >= MIN_XFN_CHARS &&
-			basenamelen <= MAX_XFN_CHARS &&
-			strspn(rlde->d_name, VALID_XFN_CHARS) >= basenamelen &&
-			strcmp(rlde->d_name + basenamelen, ".ready") == 0)
+		/* Ignore entries with unexpected number of characters */
+		if (basenamelen < MIN_XFN_CHARS ||
+			basenamelen > MAX_XFN_CHARS)
+			continue;
+
+		/* Ignore entries with unexpected characters */
+		if (strspn(rlde->d_name, VALID_XFN_CHARS) < basenamelen)
+			continue;
+
+		/* Ignore anything not suffixed with .ready */
+		if (strcmp(rlde->d_name + basenamelen, ".ready") != 0)
+			continue;
+
+		/* Truncate off the .ready */
+		memcpy(basename, rlde->d_name, basenamelen);
+		basename[basenamelen] = '\0';
+
+		/* Is this a history file? */
+		ishistory = IsTLHistoryFileName(basename);
+
+		/*
+		 * Consume the file to archive.  History files have the highest
+		 * priority.  If this is the first file or the first history file
+		 * ever, copy it.  In the presence of a history file already chosen as
+		 * target, ignore all other files except history files which have been
+		 * generated for an older timeline than what is already chosen as
+		 * target to archive.
+		 */
+		if (!found || (ishistory && !historyFound))
 		{
-			if (!found)
-			{
-				strcpy(newxlog, rlde->d_name);
-				found = true;
-			}
-			else
-			{
-				if (strcmp(rlde->d_name, newxlog) < 0)
-					strcpy(newxlog, rlde->d_name);
-			}
+			strcpy(newxlog, basename);
+			found = true;
+			historyFound = ishistory;
+		}
+		else if (ishistory || (!ishistory && !historyFound))
+		{
+			if (strcmp(basename, newxlog) < 0)
+				strcpy(newxlog, basename);
 		}
 	}
 	FreeDir(rldir);
 
 	if (found)
-	{
-		/* truncate off the .ready */
-		newxlog[strlen(newxlog) - 6] = '\0';
 		strcpy(xlog, newxlog);
-	}
+
 	return found;
 }
Re: [HACKERS] Macros bundling RELKIND_* conditions
On Wed, Dec 19, 2018 at 11:37 PM Alvaro Herrera wrote: > On 2017-Aug-02, Tom Lane wrote: > > > I think Peter's got the error and the detail backwards. It should be > > more like > > > > ERROR: "someview" cannot have constraints > > DETAIL: "someview" is a view. > > > > If we do it like that, we need one ERROR message per error reason, > > and one DETAIL per relkind, which should be manageable. > > I support this idea. Here's a proof-of-concept patch that corresponds > to one of the cases that Ashutosh was on about (specifically, the one > that uses the RELKIND_CAN_HAVE_STORAGE macro I just added). If there > are no objections to this approach, I'm going to complete it along these > lines. > > I put the new function at the bottom of heapam.c but I think it probably > needs a better place. > > BTW are there other opinions on the RELKIND_HAS_STORAGE vs. > RELKIND_CAN_HAVE_STORAGE debate? I'm inclined to change it to the > former. > +1 I liked the idea. -- Best Wishes, Ashutosh Bapat
RE: [Proposal] Add accumulated statistics
On Wed, Nov 21, 2018 at 9:27 PM, Bruce Momjian wrote: Hi, thank you for the information. I understand that sampling is effective for investigating wait events. By the way, you can see the number of wait events with "LWLOCK_STATS", right? Was that facility implemented because the number of wait events is needed for investigation? If so, isn't the number of wait events useful information in itself? Right now I need to rebuild the server to get this information, which is inconvenient. So, how about exposing the number of wait events in the view? I also think it would be useful to know the waiting time: investigation is easier when it is clear how much of the processing time is taken up by waiting. -- Naoki Yotsunaga
Re: A few new options for vacuumdb
On Thu, Dec 20, 2018 at 04:48:11PM +, Bossart, Nathan wrote: > The --skip-locked option in vacuumdb is part of 0002, so I don't think > there's much precedent here. It looks like I was not looking at the master branch here ;) > We do currently fall back to the > unparenthesized syntax for VACUUM for versions before 9.0, so there is > some version handling already. What do you think about removing > version checks for unsupported major versions? I am not exactly sure down to which version this needs to be supported and what's the policy about that (if others have an opinion to share, please feel free), but it surely does not hurt to keep those code paths from a maintenance point of view either. > If we went that route, these new patches could add version checks only > for options that were added in a supported major version (e.g. > mxid_age() was added in 9.5). Either way, we'll still have to decide > whether to fail or to silently skip the option if you do something > like specify --min-mxid-age for a 9.4 server. There are downsides and upsides to each approach. Reporting a failure makes it clear that an option is not available, with a clear error message, however it complicates the error handling a bit for parallel runs. And vacuum_one_database() should be the code path doing so, as that's where all the connections are taken, even when all databases are processed. Ignoring the option, as your patch set does, has the disadvantage of potentially confusing users; say, we may see complaints like "why is VACUUM locking even if I specified --skip-locked?". Still, this keeps the code minimalistic and simple. Just passing down the options and seeing a failure in the generated query is also a confusing experience, I guess, for not-so-advanced users, even if it may simplify the error handling. Issuing a proper error feels more natural to me, I think, as users can react to it properly. Opinions from others are welcome. -- Michael
Re: lock level for DETACH PARTITION looks sketchy
On 2018/12/21 6:07, Alvaro Herrera wrote: > On 2018-Dec-20, Robert Haas wrote: > >> On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera >> wrote: >>> I think what prompted the lock to be AccessShareLock for the child rel >>> in the first place is the fact that ATExecDropInherit() (ALTER TABLE NO >>> INHERIT) uses AccessShare for the *parent* relation. >> >> Seems like apples and oranges, > > Yes, but I can understand someone looking at ATExecDropInherit while > writing ATExecDetachRelation and thinking "ah, I have to grab > AccessShareLock on the other relation" without stopping to think in what > direction the parenthood between the rels goes. That was definitely wrong. A partition's schema changes when it's detached, whereas a (regular) inheritance parent's doesn't when one of its children is removed. >> and also maybe not that safe. > > I think it's strange, but I'm not interested in analyzing that at this > time. Its comment does say that DROP TABLE (of the child, I surmise) does > not acquire *any* lock on the parent, which is also strange. Per comments in ATExecDropInherit, the reason we lock the parent with AccessShareLock in the DROP INHERIT case is that RemoveInheritance inspects the parent's schema to see which of the child's columns to mark as having one less parent. DROP TABLE child doesn't need to do that, for reasons you can obviously imagine. Thank you both for working on this. Thanks, Amit
Re: Tid scan improvements
On Fri, 21 Dec 2018 at 13:09, Edmund Horner wrote: > On Fri, 21 Dec 2018 at 11:21, Tom Lane wrote: > > I'm having a hard time wrapping my mind around why you'd bother with > > backwards TID scans. The amount of code needed versus the amount of > > usefulness seems like a pretty bad cost/benefit ratio, IMO. I can > > see that there might be value in knowing that a regular scan has > > "ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin > > on TID without an explicit sort). It does not, however, follow that > > there's any additional value in supporting the DESC case. > > I have occasionally found myself running "SELECT MAX(ctid) FROM t" > when I was curious about why a table is so big after vacuuming. > > Perhaps that's not a common enough use case to justify the amount of > code, especially the changes to heapam.c and explain.c. > > We'd still need the pathkeys to make good use of forward scans. (And > I think the executor still needs to support seeking backward for > cursors.) I think the best thing to do here is separate out all the additional backwards scan code into a separate patch, so it can be considered and approved, or rejected, more easily. If there's any hint of this blocking the main patch, then it should be a separate patch so that its worth can be considered independently. Also, my primary interest in this patch is to find tuples that are stopping the heap being truncated during a vacuum. Generally, when I'm looking for that I have a good idea of what size I expect the relation should be (otherwise I'd not think it was bloated), in which case I'd be doing WHERE ctid >= '(N,1)'. However, it might be easier to write some auto-bloat-removal script if we could have an ORDER BY ctid DESC LIMIT n. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Remove Deprecated Exclusive Backup Mode
On Thu, Dec 20, 2018 at 12:29:48PM +0200, David Steele wrote: > Cannot move patch to the same commitfest, and multiple future commitfests > exist! I am not sure what it means either. What if you just mark the existing entry as returned with feedback, and create a new one ahead? -- Michael
Re: gist microvacuum doesn't appear to care about hot standby?
On Thu, Dec 20, 2018 at 1:41 AM Alexander Korotkov wrote: > Please, find attached two patches I'm going to commit: for master and > for backpatching. So, pushed. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Tid scan improvements
Hi, On 2018-12-20 18:06:41 -0500, Tom Lane wrote: > Andres Freund writes: > > On 2018-12-20 17:21:07 -0500, Tom Lane wrote: > >> I'm having a hard time wrapping my mind around why you'd bother with > >> backwards TID scans. > > > I've not followed this thread, but wouldn't that be quite useful to be > > able to move old tuples to free space earlier in the table? > > I've written multiple scripts that update the later pages in a table, to > > force reuse of earlier free pages (in my case by generating ctid = ANY() > > style queries with all possible tids for the last few pages, the most > > efficient way I could think of). > > Sure, but wouldn't you now write those using something on the order of > > WHERE ctid >= '(cutoff_page_here, 1)' > > ? I don't see that you'd want to write "ORDER BY ctid DESC LIMIT n" > because you wouldn't know what value of n to use to get all the > tuples on some-number-of-ending-pages. I think you'd want both, to make sure there aren't more tuples than estimated. With the limit calculated to ensure there's enough free space for them to actually fit. Greetings, Andres Freund
Re: Tid scan improvements
Andres Freund writes: > On 2018-12-20 17:21:07 -0500, Tom Lane wrote: >> I'm having a hard time wrapping my mind around why you'd bother with >> backwards TID scans. > I've not followed this thread, but wouldn't that be quite useful to be > able to move old tuples to free space earlier in the table? > I've written multiple scripts that update the later pages in a table, to > force reuse of earlier free pages (in my case by generating ctid = ANY() > style queries with all possible tids for the last few pages, the most > efficient way I could think of). Sure, but wouldn't you now write those using something on the order of WHERE ctid >= '(cutoff_page_here, 1)' ? I don't see that you'd want to write "ORDER BY ctid DESC LIMIT n" because you wouldn't know what value of n to use to get all the tuples on some-number-of-ending-pages. regards, tom lane
Re: Why are we PageInit'ing buffers in RelationAddExtraBlocks()?
Hi, On 2018-12-19 16:56:36 -0800, Andres Freund wrote: > On 2018-12-19 19:39:33 -0500, Tom Lane wrote: > > Robert Haas writes: > > > On Wed, Dec 19, 2018 at 5:37 PM Andres Freund wrote: > > >> What's gained by the logic of emitting that warning in VACUUM after a > > >> crash? I don't really see any robustness advantages in it. > > > > > I don't know. I am just normally reluctant to change things > > > precipitously that are of long tenure. > > > > Me too, but I think Andres has a point here. Those warnings in VACUUM > > are ancient, probably predating the introduction of WAL :-(. At the > > time there was good reason to be suspicious of zeroed pages in tables. > > Now, though, we have (what we think is) a bulletproof crash recovery > > procedure in which possibly-zeroed pages are to be expected; so we're > > just causing users unnecessary alarm by warning about them. I think > > it's reasonable to, if not remove the messages entirely, at least > > downgrade them to a low DEBUG level. > > Yea, I think that'd be reasonable. > > I think we ought to add them, as new/zero pages, to the FSM. If we did > so both during vacuum and RelationAddExtraBlocks() we'd avoid the > redundant writes, and avoid depending on heap layout in > RelationAddExtraBlocks(). > > I wonder if it'd make sense to only log a DEBUG if there's no > corresponding FSM entry. That'd obviously not be bulletproof, but it'd > reduce the spamminess considerably. Here's a prototype patch implementing this change. I'm not sure the new DEBUG message is really worth it? Looking at the surrounding code made me wonder about the wisdom of entering empty pages as all-visible and all-frozen into the VM. That'll mean we'll never re-discover them on a primary, after promotion. There's no mechanism to add such pages to the FSM on a standby (in contrast to e.g. pages where tuples are modified), as vacuum will never visit that page again. 
Now obviously it has the advantage of avoiding re-processing the page in the next vacuum, but is that really an important goal? If there's frequent vacuums, there got to be a fair amount of modifications, which ought to lead to re-use of such pages at a not too far away point? Greetings, Andres Freund diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index b8b5871559b..aa0bc4d8f9c 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -216,18 +216,15 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate) BufferGetBlockNumber(buffer), RelationGetRelationName(relation)); - PageInit(page, BufferGetPageSize(buffer), 0); - /* - * We mark all the new buffers dirty, but do nothing to write them - * out; they'll probably get used soon, and even if they are not, a - * crash will leave an okay all-zeroes page on disk. + * Add the page to the FSM without initializing. If we were to + * initialize here the page would potentially get flushed out to disk + * before we add any useful content. There's no guarantee that that'd + * happen however, so we need to deal with unitialized pages anyway, + * thus let's avoid the potential for unnecessary writes. */ - MarkBufferDirty(buffer); - - /* we'll need this info below */ blockNum = BufferGetBlockNumber(buffer); - freespace = PageGetHeapFreeSpace(page); + freespace = BufferGetPageSize(buffer) - SizeOfPageHeaderData; UnlockReleaseBuffer(buffer); @@ -479,6 +476,18 @@ loop: * we're done. */ page = BufferGetPage(buffer); + + /* + * Initialize page, it'll be used soon. We could avoid dirtying the + * buffer here, and rely on the caller to do so whenever it puts a + * tuple onto the page, but there seems not much benefit in doing so. 
+ */ + if (PageIsNew(page)) + { + PageInit(page, BufferGetPageSize(buffer), 0); + MarkBufferDirty(buffer); + } + pageFreeSpace = PageGetHeapFreeSpace(page); if (len + saveFreeSpace <= pageFreeSpace) { diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c index 8134c52253e..2d8d3569372 100644 --- a/src/backend/commands/vacuumlazy.c +++ b/src/backend/commands/vacuumlazy.c @@ -862,42 +862,42 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, if (PageIsNew(page)) { /* - * An all-zeroes page could be left over if a backend extends the - * relation but crashes before initializing the page. Reclaim such - * pages for use. + * All-zeroes pages can be left over if either a backend extends + * the relation by a single page, but crashes before initializing + * the page, or when bulk-extending the relation (which creates a + * number of empty pages at the tail end of the relation, but + * enters them into the FSM). * - * We have to be careful here because we could be looking at a - * page that someone has just added to the relation and not yet - * been able to initialize (see RelationGetBufferForTuple). To - *
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
John Naylor writes: > On 12/18/18, Tom Lane wrote: >> I'd be kind of inclined to convert all uses of ScanKeyword to the new way, >> if only for consistency's sake. On the other hand, I'm not the one >> volunteering to do the work. > That's reasonable, as long as the design is nailed down first. Along > those lines, attached is a heavily WIP patch that only touches plpgsql > unreserved keywords, to test out the new methodology in a limited > area. After settling APIs and name/directory bikeshedding, I'll move > on to the other four keyword types. Let the bikeshedding begin ... > There's a new Perl script, src/common/gen_keywords.pl, I'd be inclined to put the script in src/tools, I think. IMO src/common is for code that actually gets built into our executables. > which takes > pl_unreserved_kwlist.h as input and outputs > pl_unreserved_kwlist_offset.h and pl_unreserved_kwlist_string.h. I wonder whether we'd not be better off producing just one output file, in which we have the offsets emitted as PG_KEYWORD macros and then the giant string emitted as a macro definition, ie something like #define PG_KEYWORD_STRING \ "absolute\0" \ "alias\0" \ ... That simplifies the Makefile-hacking, at least, and it possibly gives callers more flexibility about what they actually want to do with the string. > The > output headers are not installed or symlinked anywhere. Since the > input keyword lists will never be #included directly, they might be > better as .txt files, like errcodes.txt. If we went that far, we might > also remove the PG_KEYWORD macros (they'd still be in the output > files) and rename parser/kwlist.h to common/core_kwlist.txt. There's > also a case for not changing things unnecessarily, especially if > there's ever a new reason to include the base keyword list directly. I'm for "not change things unnecessarily". 
People might well be scraping the keyword list out of parser/kwlist.h for other purposes right now --- indeed, it's defined the way it is exactly to let people do that. I don't see a good reason to force them to redo whatever tooling they have that depends on that. So let's build kwlist_offsets.h alongside that, but not change kwlist.h itself. > To keep the other keyword types functional, I had to add a separate > new struct ScanKeywordOffset and new function > ScanKeywordLookupOffset(), so the patch is a bit messier than the > final will be. Check. > I used the global .gitignore, but maybe that's an abuse of it. Yeah, I'd say it is. > +# TODO: Error out if the keyword names are not in ASCII order. +many for including such a check. Also note that we don't require people to have Perl installed when building from a tarball. Therefore, these derived headers must get built during "make distprep" and removed by maintainer-clean but not distclean. I think this also has some implications for VPATH builds, but as long as you follow the pattern used for other derived header files (e.g. fmgroids.h), you should be fine. regards, tom lane
Re: Tid scan improvements
Hi, On 2018-12-20 17:21:07 -0500, Tom Lane wrote: > Edmund Horner writes: > > [ tid scan patches ] > > I'm having a hard time wrapping my mind around why you'd bother with > backwards TID scans. The amount of code needed versus the amount of > usefulness seems like a pretty bad cost/benefit ratio, IMO. I can > see that there might be value in knowing that a regular scan has > "ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin > on TID without an explicit sort). It does not, however, follow that > there's any additional value in supporting the DESC case. I've not followed this thread, but wouldn't that be quite useful to be able to move old tuples to free space earlier in the table? I've written multiple scripts that update the later pages in a table, to force reuse of earlier free pages (in my case by generating ctid = ANY() style queries with all possible tids for the last few pages, the most efficient way I could think of). Greetings, Andres Freund
Re: Tid scan improvements
Edmund Horner writes: > [ tid scan patches ] I'm having a hard time wrapping my mind around why you'd bother with backwards TID scans. The amount of code needed versus the amount of usefulness seems like a pretty bad cost/benefit ratio, IMO. I can see that there might be value in knowing that a regular scan has "ORDER BY ctid ASC" pathkeys (mainly, that it might let us mergejoin on TID without an explicit sort). It does not, however, follow that there's any additional value in supporting the DESC case. regards, tom lane
monitoring CREATE INDEX [CONCURRENTLY]
Monitoring progress of CREATE INDEX [CONCURRENTLY] is sure to be welcome, so here's a proposal.

There are three distinct interesting cases. One is straight CREATE INDEX of a standalone table; then we have CREATE INDEX CONCURRENTLY; finally, CREATE INDEX on a partitioned table. Note that there's no CONCURRENTLY for partitioned tables.

A non-concurrent build is very straightforward: we call create_index, which does index_build, done. See below for how to report for index_build, which is the interesting part. I propose not to report anything other than that for a non-concurrent build. There's some preparatory work that's identical to that for CIC (see below). Like VACUUM, it seems a bit pointless to report an initial phase that's almost immediate, so I propose we just don't report anything until the actual index building starts.

CREATE INDEX CONCURRENTLY does these things first, which we would not report (this is just like VACUUM, which only starts reporting once it starts scanning blocks):

a. lock rel. No metrics to report.
b. other prep; includes lots of catalog access. Unlikely to lock, but not impossible. No metrics to report.
c. create_index. CIC skips index_build here, so there's no reason to report it either.

We would start reporting at this point, with these phases:

1. WaitForLockers 1. Report how many xacts we need to wait for, how many are done.
2. index_build. See below.
3. WaitForLockers 2. Report how many xacts we need to wait for, how many are done.
4. validate_index. Scans the whole rel again. Report number of blocks scanned.
5. wait for virtual XIDs. Like WaitForLockers: report how many xacts we need to wait for, how many are done.

We're done. (Alternatively, we could have an initial "prep" phase covering a/b/c in the concurrent case and a/b for non-concurrent. I'm just not sure it's useful.)

index_build --- The actual index building is an AM-specific undertaking, and we report its progress separately from the AM-agnostic code. 
That is, each AM has freedom to define its own list of phases and counters, separate from the generic code. This avoids the need to provide a new AM method or invoke callbacks. So when you see that CREATE_INDEX_PHASE is either "index build" you'll have a separate BTREE_CREATE_PHASE value set to either "scanning heap" or "sorting" or "building upper layers"; equivalently for other AMs. Partitioned indexes --- For partitioned indexes, we only have the index build phase, but we repeat it for each partition. In addition to the index_build metrics described above, we should report how many partitions we need to handle in total and how many partitions are already done. (I'm avoiding getting in the trouble of reporting *which* partition we're currently handling and have already handled.) Thoughts? -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: ATTACH/DETACH PARTITION CONCURRENTLY
On Thu, Dec 20, 2018 at 4:11 PM Alvaro Herrera wrote: > Oh, so maybe this case is already handled by plan invalidation -- I > mean, if we run DDL, the stored plan is thrown away and a new one > recomputed. IOW this was already a solved problem and I didn't need to > spend effort on it. /me slaps own forehead I'm kinda saying the opposite - I'm not sure that it's safe even with the higher lock levels. If the plan is relying on the same partition descriptor being in effect at plan time as at execution time, that sounds kinda dangerous to me. Lowering the lock level might also make something that was previously safe into something unsafe, because now there's no longer a guarantee that invalidation messages are received soon enough. With AccessExclusiveLock, we'll send invalidation messages before releasing the lock, and other processes will acquire the lock and then AcceptInvalidationMessages(). But with ShareUpdateExclusiveLock the locks can coexist, so now there might be trouble. I think this is an area where we need to do some more investigation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: ATTACH/DETACH PARTITION CONCURRENTLY
On 2018-Dec-20, Robert Haas wrote: > I didn't handle that. If partition pruning relies on nothing changing > between planning and execution, isn't that broken regardless of any of > this? It's true that with the simple query protocol we'll hold locks > continuously from planning into execution, and therefore with the > current locking regime we couldn't really have a problem. But unless > I'm confused, with the extended query protocol it's quite possible to > generate a plan, release locks, and then reacquire locks at execution > time. Unless we have some guarantee that a new plan will always be > generated if any DDL has happened in the middle, I think we've got > trouble, and I don't think that is guaranteed in all cases. Oh, so maybe this case is already handled by plan invalidation -- I mean, if we run DDL, the stored plan is thrown away and a new one recomputed. IOW this was already a solved problem and I didn't need to spend effort on it. /me slaps own forehead -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: lock level for DETACH PARTITION looks sketchy
On 2018-Dec-20, Robert Haas wrote: > On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera > wrote: > > I think what prompted the lock to be AccessShareLock for the child rel > > in the first place is the fact that ATExecDropInherit() (ALTER TABLE NO > > INHERIT) uses AccessShare for the *parent* relation. > > Seems like apples and oranges, Yes, but I can understand someone looking at ATExecDropInherit while writing ATExecDetachRelation and thinking "ah, I have to grab AccessShareLock on the other relation" without stopping to think in what direction the parenthood between the rels goes. > and also maybe not that safe. I think it's strange, but I'm not interested in analyzing that at this time. Its comment does say that DROP TABLE (of the child, I surmise) does not acquire *any* lock on the parent, which is also strange. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: ATTACH/DETACH PARTITION CONCURRENTLY
On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera wrote: > > Introduce the concept of a partition directory. > > > > Teach the optimizer and executor to use it, so that a single planning > > cycle or query execution gets the same PartitionDesc for the same table > > every time it looks it up. This does not prevent changes between > > planning and execution, nor does it guarantee that all tables are > > expanded according to the same snapshot. > > Namely: how does this handle the case of partition pruning structure > being passed from planner to executor, if an attach happens in the > middle of it and puts a partition in between existing partitions? Array > indexes of any partitions that appear later in the partition descriptor > will change. > > This is the reason I used the query snapshot rather than EState. I didn't handle that. If partition pruning relies on nothing changing between planning and execution, isn't that broken regardless of any of this? It's true that with the simple query protocol we'll hold locks continuously from planning into execution, and therefore with the current locking regime we couldn't really have a problem. But unless I'm confused, with the extended query protocol it's quite possible to generate a plan, release locks, and then reacquire locks at execution time. Unless we have some guarantee that a new plan will always be generated if any DDL has happened in the middle, I think we've got trouble, and I don't think that is guaranteed in all cases. Maybe I'm all wet, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: lock level for DETACH PARTITION looks sketchy
On Thu, Dec 20, 2018 at 3:58 PM Alvaro Herrera wrote: > I think what prompted the lock to be AccessShareLock for the child rel > in the first place is the fact that ATExecDropInherit() (ALTER TABLE NO > INHERIT) uses AccessShare for the *parent* relation. Seems like apples and oranges, and also maybe not that safe. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: ATTACH/DETACH PARTITION CONCURRENTLY
Thanks for this work! I like the name "partition directory". On 2018-Dec-20, Robert Haas wrote: > 0002 introduces the concept of a partition directory. The idea is > that the planner will create a partition directory, and so will the > executor, and all calls which occur in those places to > RelationGetPartitionDesc() will instead call > PartitionDirectoryLookup(). Every lookup for the same relation in the > same partition directory is guaranteed to produce the same answer. I > believe this patch still has a number of weaknesses. More on that > below. The commit message for this one also points out another potential problem, > Introduce the concept of a partition directory. > > Teach the optimizer and executor to use it, so that a single planning > cycle or query execution gets the same PartitionDesc for the same table > every time it looks it up. This does not prevent changes between > planning and execution, nor does it guarantee that all tables are > expanded according to the same snapshot. Namely: how does this handle the case of partition pruning structure being passed from planner to executor, if an attach happens in the middle of it and puts a partition in between existing partitions? Array indexes of any partitions that appear later in the partition descriptor will change. This is the reason I used the query snapshot rather than EState. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: lock level for DETACH PARTITION looks sketchy
I think what prompted the lock to be AccessShareLock for the child rel in the first place is the fact that ATExecDropInherit() (ALTER TABLE NO INHERIT) uses AccessShare for the *parent* relation. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Add timeline to partial WAL segments
On Fri, Dec 14, 2018 at 6:05 PM David Steele wrote: > The question in my mind: is it safe to back-patch? I cannot imagine it being a good idea to stick a behavioral change like this into a minor release. Yeah, it lets people get out from under this problem a lot sooner, but it potentially breaks backup scripts even for people who were not suffering in the first place. I don't think solving this problem for the people who have it is worth inflicting that kind of breakage on everybody. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: ATTACH/DETACH PARTITION CONCURRENTLY
On Tue, Dec 18, 2018 at 8:04 PM Michael Paquier wrote: > On Tue, Dec 18, 2018 at 01:41:06PM -0500, Robert Haas wrote: > > OK. I'll post what I have by the end of the week. > > Thanks, Robert. OK, so I got slightly delayed here by utterly destroying my laptop, but I've mostly reconstructed what I did. I think there are some remaining problems, but this seems like a good time to share what I've got so far. I'm attaching three patches. 0001 is one which I posted before. It attempts to fix up RelationBuildPartitionDesc() so that this function will always return a partition descriptor based on a consistent snapshot of the catalogs. Without this, I think there's nothing to prevent a commit which happens while the function is running from causing the function to fail or produce nonsense answers. 0002 introduces the concept of a partition directory. The idea is that the planner will create a partition directory, and so will the executor, and all calls which occur in those places to RelationGetPartitionDesc() will instead call PartitionDirectoryLookup(). Every lookup for the same relation in the same partition directory is guaranteed to produce the same answer. I believe this patch still has a number of weaknesses. More on that below. 0003 actually lowers the lock level. The comment here might need some more work. Here is a list of possible or definite problems that are known to me: - I think we need a way to make partition directory lookups consistent across backends in the case of parallel query. I believe this can be done with a dshash and some serialization and deserialization logic, but I haven't attempted that yet. - I refactored expand_inherited_rtentry() to drive partition expansion entirely off of PartitionDescs. 
The reason why this is necessary is that it clearly will not work to have find_all_inheritors() use a current snapshot to decide what children we have and lock them, and then consult a different source of truth to decide which relations to open with NoLock. There's nothing to keep the lists of partitions from being different in the two cases, and that demonstrably causes assertion failures if you SELECT with an ATTACH/DETACH loop running in the background. However, it also changes the order in which tables get locked. Possibly that could be fixed by teaching expand_partitioned_rtentry() to qsort() the OIDs the way find_inheritance_children() does. It also loses the infinite-loop protection which find_all_inheritors() has. Not sure what to do about that.

- In order for the new PartitionDirectory machinery to avoid use-after-free bugs, we have to either copy the PartitionDesc out of the relcache into the partition directory or avoid freeing it while it is still in use. Copying it seems unappealing for performance reasons, so I took the latter approach. However, what I did here in terms of reclaiming memory is just about the least aggressive strategy short of leaking it altogether - it just keeps it around until the next rebuild that occurs while the relcache entry is not in use. We might want to do better, e.g. freeing old copies immediately as soon as the relcache reference count drops to 0. I just did it this way because it was simple to code.

- I tried this with Alvaro's isolation tests and it fails some tests. That's because Alvaro's tests expect that the list of accessible partitions is based on the query snapshot. For reasons I attempted to explain in the comments in 0003, I think the idea that we can choose the set of accessible partitions based on the query snapshot is very wrong. For example, suppose transaction 1 begins, reads an unrelated table to establish a snapshot, and then goes idle.
Then transaction 2 comes along, detaches a partition from an important table, and then does crazy stuff with that table -- changes the column list, drops it, reattaches it with different bounds, whatever. Then it commits. If transaction 1 now comes along and uses the query snapshot to decide that the detached partition ought to still be seen as a partition of that partitioned table, disaster will ensue.

- I don't have any tests here, but I think it would be good to add some, perhaps modeled on Alvaro's, and also some that involve multiple ATTACH and DETACH operations mixed with other interesting kinds of DDL. I also didn't make any attempt to update the documentation, which is another thing that will probably have to be done at some point. Not sure how much documentation we have of any of this now.

- I am uncertain whether the logic that builds up the partition constraint is really safe in the face of concurrent DDL. I kinda suspect there are some problems there, but maybe not. Once you hold ANY lock on a partition, the partition constraint can't concurrently become stricter, because no ATTACH can be done without AccessExclusiveLock on the partition being attached; but it could concurrently become less strict, because you only need a lesser lock for a detach. Not
Re: Switching to 64-bit Bitmapsets
On Fri, 21 Dec 2018 at 06:26, Tom Lane wrote:
> Pushed with some fiddling with the comment.

Great. Thanks!

> I wasn't excited about the test case you offered --- on HEAD, it pretty
> much all devolves to file access operations (probably, checking the
> current length of all the child relations). Instead I experimented
> with an old test case I had for a complex-to-plan query, and got these
> results:

oh meh. I forgot to mention I was running with plan_cache_mode = force_generic_plan and max_parallel_workers_per_gather = 0. If you tried without those then you'd have seen a massively different result. Your test seems better, regardless.

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: lock level for DETACH PARTITION looks sketchy
On 2018-Dec-20, Robert Haas wrote:
> One problem about which I thought is the partition check constraint.
> When we attach, we need to validate that the check constraint holds of
> every row in the partition, which means that we need to keep new rows
> from being added while we're attaching. But when we detach, there's
> no corresponding problem. If we detach a leaf partition, the check
> constraint just disappears; if we detach an intermediate partitioned
> table, the check constraint loses some clauses. Either way, there's
> no way for a row that was previously acceptable to become unacceptable
> after the detach operation. There is, however, the possibility that
> the invalidation messages generated by the detach operation might not
> get processed immediately by other backends, so those backends might
> continue to enforce the partition constraint for some period of time
> after it changes. That seems OK.

Check.

> There would be trouble, though, if
> we freed the old copy of the partition constraint during a relcache
> rebuild while somebody still had a pointer to it. I'm not sure that
> can happen, though.

(Reading the pg10 source) AFAICS the partition constraint is stored in resultRelInfo->ri_PartitionCheck, not affected by relcache invals, so it seems fine. We also read a copy of it at plan time for constraint exclusion purposes, similarly not affected by relcache inval.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Performance issue in foreign-key-aware join estimation
In connection with David Rowley's proposal to change bitmapset.c to use 64-bit words, I dug out an old test case I had for a complex-to-plan query (attached). Andres Freund posted this to the lists perhaps ten years ago, though I can't locate the original posting right now.

I was distressed to discover via perf that 69% of the runtime of this test now goes into match_eclasses_to_foreign_key_col(). That seems clearly unacceptable. There aren't an unreasonable number of foreign key constraints in the underlying schema --- mostly one per table, and there's only one table with as many as 3. However, poking around in the planner data structures, it turns out there are:

888 base relations
1005 EquivalenceClasses
167815 fkey_list entries initially
690 fkey_list entries after match_foreign_keys_to_quals trims them

So the reason match_eclasses_to_foreign_key_col is so dominant in the runtime is it's invoked 167815 times and has to scan a thousand EquivalenceClasses (unsuccessfully) on most of those calls.

How did the fkey_list get that big? I think the issue is that the query touches the same tables many many times (888 baserels, but there are only 20 distinct tables in the schema) and we get an O(N^2) growth in the apparent number of FKs.

Clearly, we ought to rethink that data structure. I'm not sure offhand how to make it better, but this is pretty awful. Perhaps there'd also be some use in having better indexing for the EquivalenceClass list, but again I'm not sure what that'd look like.
regards, tom lane

\set ON_ERROR_STOP on

DROP SCHEMA IF EXISTS test_data CASCADE;
DROP SCHEMA IF EXISTS test_view CASCADE;

CREATE SCHEMA test_data;
CREATE SCHEMA test_view;

SET SEARCH_PATH = test_data, test_view;

CREATE TABLE proband (
    proband_id bigserial PRIMARY KEY
);

CREATE TABLE sample (
    sample_id bigserial PRIMARY KEY
);

CREATE TABLE proband__sample (
    proband_id bigint NOT NULL REFERENCES proband,
    sample_id bigint NOT NULL REFERENCES sample,
    UNIQUE(proband_id, sample_id)
);

CREATE TABLE project (
    project_id bigserial PRIMARY KEY,
    name text NOT NULL UNIQUE
);

/*
 * Stuff like:
 * - Double Entry
 * - Single Entry
 * - Machine Read
 * - ...
 */
CREATE TABLE data_quality (
    data_quality_id bigserial PRIMARY KEY,
    name text NOT NULL UNIQUE,
    description text
);

CREATE TABLE information_set (
    information_set_id bigserial PRIMARY KEY,
    name text NOT NULL UNIQUE,
    description text
);

CREATE TABLE information (
    information_id bigserial PRIMARY KEY,
    information_set_id bigint NOT NULL REFERENCES information_set,
    name text NOT NULL UNIQUE,
    description text
);

CREATE INDEX information__information_set_id ON information (information_set_id);

CREATE TABLE information_set_instance (
    information_set_instance_id bigserial PRIMARY KEY,
    information_set_id bigint NOT NULL REFERENCES information_set
);

CREATE INDEX information_set_instance__information_set_id ON information_set_instance (information_set_id);

CREATE TABLE information_set_instance__proband (
    information_set_instance_id bigint NOT NULL REFERENCES information_set_instance,
    proband_id bigint NOT NULL REFERENCES proband,
    UNIQUE (information_set_instance_id, proband_id)
);

CREATE INDEX information_set_instance__proband__information_set_id ON information_set_instance__proband (information_set_instance_id);
CREATE INDEX information_set_instance__proband__proband_id ON information_set_instance__proband (proband_id);

CREATE TABLE information_set_instance__sample (
    information_set_instance_id bigint NOT NULL REFERENCES
information_set_instance,
    sample_id bigint NOT NULL REFERENCES sample,
    UNIQUE (information_set_instance_id, sample_id)
);

CREATE INDEX information_set_instance__sample__information_set_id ON information_set_instance__sample (information_set_instance_id);
CREATE INDEX information_set_instance__sample__sample_id ON information_set_instance__sample (sample_id);

CREATE TABLE information_instance (
    information_instance_id bigserial PRIMARY KEY,
    information_set_instance_id bigint NOT NULL REFERENCES information_set_instance,
    information_id bigint NOT NULL REFERENCES information,
    data_quality_id int REFERENCES data_quality
);

CREATE INDEX information_instance__information_set_instance_id ON information_set_instance(information_set_instance_id);
CREATE INDEX information_instance__information_id ON information(information_id);

CREATE TABLE information_about_tnm (
    information_about_tnm bigserial PRIMARY KEY,
    information_instance_id bigint REFERENCES information_instance,
    t int,
    n int,
    m int
);

CREATE INDEX information_about_tnm__information_instance_id ON information_about_tnm(information_instance_id);

CREATE TABLE information_about_sequenced_data (
Re: lock level for DETACH PARTITION looks sketchy
On Thu, Dec 20, 2018 at 9:29 AM Alvaro Herrera wrote:
> I can patch that one too, but it's separate -- it goes back to pg10 I
> think (the other to pg11) -- and let's think about the lock mode for a
> bit: as far as I can see, ShareUpdateExclusive is enough; the difference
> with AccessExclusive is that it lets through modes AccessShare, RowShare
> and RowExclusive. According to mvcc.sgml, that means we're letting
> SELECT, SELECT FOR SHARE/UPDATE and INS/UPD/DEL in the partition being
> detached, respectively. All those seem acceptable to me. All DDL is
> blocked, of course.

One problem about which I thought is the partition check constraint. When we attach, we need to validate that the check constraint holds of every row in the partition, which means that we need to keep new rows from being added while we're attaching. But when we detach, there's no corresponding problem. If we detach a leaf partition, the check constraint just disappears; if we detach an intermediate partitioned table, the check constraint loses some clauses. Either way, there's no way for a row that was previously acceptable to become unacceptable after the detach operation. There is, however, the possibility that the invalidation messages generated by the detach operation might not get processed immediately by other backends, so those backends might continue to enforce the partition constraint for some period of time after it changes. That seems OK. There would be trouble, though, if we freed the old copy of the partition constraint during a relcache rebuild while somebody still had a pointer to it. I'm not sure that can happen, though.

> So right now if you insert via the parent table, detaching the partition
> would be blocked because we also acquire lock on the parent (and detach
> acquires AccessExclusive on the parent). But after DETACH CONCURRENTLY,
> inserting via the parent should let rows to the partition through,
> because there's no conflicting lock to stop them.
> Inserting rows to the
> partition directly would be let through both before and after DETACH
> CONCURRENTLY.

Yeah, that's worrisome. Initially, I thought it was flat-out insane, because if somebody modifies that table in some way that makes it unsuitable as an insertion target for the existing stream of tuples -- adding constraints, changing column definitions, attaching to a new parent, dropping -- then we'd be in big trouble. Then I thought that maybe it's OK, because all of those operations take AccessExclusiveLock and therefore even if the detach goes through without a lock, the subsequent change that is needed to create a problem can't. I can't say that I'm 100% convinced that's bulletproof, though. Suppose for example that a single session opens a cursor, then does the detach, then does something else, then tries to read more from the cursor. Now the locks don't conflict, but we're still OK (I think) because those operations should all CheckTableNotInUse(). But if we try to lower the lock level for some other things in the future, we might open up other problems here, and it's unclear to me what guiding principles we can rely on here to keep us out of trouble.

> One note about DETACH CONCURRENTLY: detaching the index from parent now
> uses AccessExclusiveLock (the block I just patched). For concurrent
> detach that won't work, so we'll need to reduce that lock level too. Not
> sure offhand if there are any problems with that.

Me neither. I'm pretty sure that you need at least ShareUpdateExclusiveLock, but I don't know whether you need AccessExclusiveLock. Here again, the biggest risk I can see is the sinval race - nothing forces the invalidation messages to be read before opening the relation for read or write.
I think it would be nice if it ended up that the lock level is the same for the table and the indexes, because otherwise the legal sequences of operations vary depending on whether you've got a table with inherited indexes or one without, and that multiplies the difficulty of testing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Switching to 64-bit Bitmapsets
David Rowley writes:
> On Thu, 20 Dec 2018 at 17:50, Tom Lane wrote:
>> Hm, are you thinking of making BITS_PER_BITMAPWORD match sizeof(Pointer)
>> or something like that? That seems like a good compromise from here.
>
> Yeah, something along those lines. I've implemented that in the attached.

Pushed with some fiddling with the comment.

I wasn't excited about the test case you offered --- on HEAD, it pretty much all devolves to file access operations (probably, checking the current length of all the child relations). Instead I experimented with an old test case I had for a complex-to-plan query, and got these results:

HEAD, asserts off, pgbench reports:

latency average = 11704.992 ms
tps = 0.085434 (including connections establishing)
tps = 0.085437 (excluding connections establishing)

Top hits according to perf:

+ 65.42%  65.16%  158499  postmaster  postgres  [.] match_eclasses_to_foreign_key_col
+  6.85%   6.82%   16634  postmaster  postgres  [.] bms_overlap
+  6.25%   6.22%   15173  postmaster  postgres  [.] generate_join_implied_equalities_for_ecs
+  3.91%   3.88%    9465  postmaster  postgres  [.] make_canonical_pathkey
+  3.04%   3.01%    7351  postmaster  postgres  [.] have_relevant_eclass_joinclause
+  2.06%   2.05%    4992  postmaster  postgres  [.] bms_is_subset
+  1.19%   1.18%    2887  postmaster  postgres  [.] equal
+  0.95%   0.95%    2309  postmaster  postgres  [.] get_eclass_for_sort_expr
+  0.90%   0.90%    2189  postmaster  postgres  [.] add_paths_to_joinrel

With patch:

latency average = 10741.595 ms
tps = 0.093096 (including connections establishing)
tps = 0.093099 (excluding connections establishing)

+ 69.03%  68.76%  178278  postmaster  postgres  [.] match_eclasses_to_foreign_key_col
+  5.85%   5.82%   15138  postmaster  postgres  [.] generate_join_implied_equalities_for_ecs
+  4.58%   4.55%   11842  postmaster  postgres  [.] bms_overlap
+  4.38%   4.36%   11338  postmaster  postgres  [.] make_canonical_pathkey
+  2.77%   2.75%    7155  postmaster  postgres  [.] have_relevant_eclass_joinclause
+  1.30%   1.29%    3364  postmaster  postgres  [.] equal
+  1.26%   1.25%    3261  postmaster  postgres  [.] bms_is_subset
+  1.06%   1.05%    2722  postmaster  postgres  [.] get_eclass_for_sort_expr
+  0.70%   0.70%    1813  postmaster  postgres  [.] add_paths_to_joinrel

Ignoring the, um, elephant in the room, this shows a pretty clear win for the performance of bms_overlap and bms_is_subset, confirming the thesis that this change is worthwhile. I'll start a separate thread about match_eclasses_to_foreign_key_col.

regards, tom lane
Re: A case for UPDATE DISTINCT attribute
Hello Gajus,

> I have observed that the following pattern is repeating in our data
> management programs:
>
> UPDATE event
> SET fuid = ${fuid},
>     venue_id = ${venueId},
>     url = ${url}
> WHERE id = ${id} AND
>     fuid IS DISTINCT FROM ${fuid} AND
>     venue_id IS DISTINCT FROM ${venueId} AND
>     url IS DISTINCT FROM ${url};
> ...
> Meanwhile, a WHERE condition that excludes rows with matching values
> makes this into a noop in case of matching target column values.

For this to hold, you need your conditions in WHERE to be ORed, not ANDed.

> UPDATE DISTINCT event
> SET fuid = ${fuid},
>     venue_id = ${venueId},
>     url = ${url}
> WHERE id = ${id}
>
> would encourage greater adoption of such pattern. Is there a technical
> reason this does not exist already?

At least a bunch of questions and concerns arise. Like these:

1) We cannot treat it as syntactic sugar only and just expand it at the parsing stage, as the expression that generates the assigned value may be volatile, like UPDATE ... SET ... = random();

2) How should this interact with triggers? E.g. when NEW and OLD were the same before BEFORE UPDATE trigger execution, but would be different after, or vice versa. Should such rows be included in transition tables?

3) Should the RETURNING clause return the non-updated rows?

4) It must not be easy to guarantee anything if there is a FROM clause and a target row is present in the join more than once.

5) We need to fail correctly if one of the column data types doesn't have an equality operator.

Best regards,
Alexey
Re: A few new options for vacuumdb
Hi Michael,

Thanks for taking a look.

On 12/19/18, 8:05 PM, "Michael Paquier" wrote:
> On Wed, Dec 19, 2018 at 08:50:10PM +0000, Bossart, Nathan wrote:
>> If an option is specified for a server version that is not supported,
>> the option is silently ignored. For example, SKIP_LOCKED was only
>> added to VACUUM and ANALYZE for v12. Alternatively, I think we could
>> fail in vacuum_one_database() if an unsupported option is specified.
>> Some of these options will work on all currently supported versions,
>> so I am curious what others think about skipping some of these version
>> checks altogether.
>
> prepare_vacuum_command() already handles that by ignoring silently
> unsupported options (see the case of SKIP_LOCKED). So why not doing the
> same?

The --skip-locked option in vacuumdb is part of 0002, so I don't think there's much precedent here. We do currently fall back to the unparenthesized syntax for VACUUM for versions before 9.0, so there is some version handling already.

What do you think about removing version checks for unsupported major versions? If we went that route, these new patches could add version checks only for options that were added in a supported major version (e.g. mxid_age() was added in 9.5). Either way, we'll still have to decide whether to fail or to silently skip the option if you do something like specify --min-mxid-age for a 9.4 server.

>> It does not seem clear whether the user wants us to process mytable
>> only if it is at least 1 GB, or if we should process mytable in
>> addition to any other relations over 1 GB. Either way, I think trying
>> to support these combinations of options adds more complexity than it
>> is worth.
>
> It seems to me that a combination of both options means that the listed
> table should be processed only if its minimum size is 1GB. If multiple
> tables are specified with --table, then only those reaching 1GB would be
> processed. So this restriction can go away.
> The same applies for the
> proposed --min-xid-age and --min-mxid-age.

Okay. This should be pretty easy to do using individual catalog lookups before we call prepare_vacuum_command(). Alternatively, I could add a clause for each specified table in the existing query in vacuum_one_database(). At that point, it may be cleaner to just use that catalog query in all cases instead of leaving paths where tables can remain NULL.

> +
> + Only execute the vacuum or analyze commands on tables with a multixact
> + ID age of at least <replaceable class="parameter">mxid_age</replaceable>.
> +
>
> Adding a link to explain the multixact business may be helpful, say
> vacuum-for-multixact-wraparound. Same comment for XID.

Good idea.

> There is an argument about also adding DISABLE_PAGE_SKIPPING.

I'll add this in the next set of patches. I need to add the parenthesized syntax and --skip-locked for ANALYZE in prepare_vacuum_command(), too.

Nathan
Re: insensitive collations
Tom Lane wrote:
> I don't really find it "natural" for equality to consider obviously
> distinct values to be equal.

According to https://www.merriam-webster.com/dictionary/natural "natural" has no less than 15 meanings. The first in the list is "based on an inherent sense of right and wrong", which I admit is not what we want to imply in this context. The meaning that I was thinking about was close to definition 4: "following from the nature of the one in question", or 7: "having a specified character by nature", or 13: "closely resembling an original : true to nature".

When postgres uses the comparison from a collation with no modification whatsoever, it's true to that collation. When it changes the result from equal to non-equal, it's not. If a collation says that "ABC" = "abc" and postgres says, mmh, OK thanks but I'll go with "ABC" != "abc", then that denatures the collation, in the sense of: "to deprive of natural qualities : change the nature of" (https://www.merriam-webster.com/dictionary/denature)

Aside from that, I'd be +1 for "linguistic" as the opposite of "bytewise". I think it tends to be easily understood when expressing that a strcoll()-like function is used as opposed to a strcmp()-like function.

I'm -1 for "deterministic" as a replacement for "bytewise". Even if Unicode has chosen that term for exactly the behavior we're talking about, it's heavily used in the more general sense of "given a particular input, will always produce the same output" (quoted from https://en.wikipedia.org/wiki/Deterministic_algorithm), which we very much expect from all our string comparisons no matter the flags we may put on the collations. "bytewise" might be less academic but it has less potential for wrong interpretations.

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
Re: lock level for DETACH PARTITION looks sketchy
On 2018-Dec-19, Robert Haas wrote:
> On Wed, Dec 19, 2018 at 2:44 PM Alvaro Herrera
> wrote:
> > Oh, I remember eyeing that suspiciously, but was too distracted on
> > making the other thing work to realize it was actually wrong :-(
> > I agree that it's wrong.
>
> OK, cool. If you're going to push a fix for the other changes, do you
> want to do this one too, or should I fix it separately?

I can patch that one too, but it's separate -- it goes back to pg10 I think (the other to pg11) -- and let's think about the lock mode for a bit: as far as I can see, ShareUpdateExclusive is enough; the difference with AccessExclusive is that it lets through modes AccessShare, RowShare and RowExclusive. According to mvcc.sgml, that means we're letting SELECT, SELECT FOR SHARE/UPDATE and INS/UPD/DEL in the partition being detached, respectively. All those seem acceptable to me. All DDL is blocked, of course.

So right now if you insert via the parent table, detaching the partition would be blocked because we also acquire lock on the parent (and detach acquires AccessExclusive on the parent). But after DETACH CONCURRENTLY, inserting via the parent should let rows to the partition through, because there's no conflicting lock to stop them. Inserting rows to the partition directly would be let through both before and after DETACH CONCURRENTLY.

One note about DETACH CONCURRENTLY: detaching the index from parent now uses AccessExclusiveLock (the block I just patched). For concurrent detach that won't work, so we'll need to reduce that lock level too. Not sure offhand if there are any problems with that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Index Skip Scan
> On Wed, Nov 21, 2018 at 9:56 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Wed, Nov 21, 2018 at 4:38 PM Alexander Kuzmenkov
> > wrote:
> >
> > On 11/18/18 02:27, Dmitry Dolgov wrote:
> > >
> > > [0001-Index-skip-scan-v4.patch]
> >
> > I ran a couple of tests on this, please see the cases below. As before,
> > I'm setting total_cost = 1 for index skip scan so that it is chosen.
> > Case 1 breaks because we determine the high key incorrectly, it is the
> > second tuple on page or something like that, not the last tuple.
>
> From what I see it wasn't about the high key, just a regular off-by-one
> error. But anyway, thanks for noticing - for some reason it wasn't always
> reproducible for me, so I missed this issue. Please find the fixed patch
> attached.
> Also I think it invalidates my previous performance tests, so I would
> appreciate it if you can check it out too.

I've performed some testing, and on my environment with a dataset of 10^7 records:

* everything below 7.5 * 10^5 unique records out of 10^7 was faster with skip scan.
* above 7.5 * 10^5 unique records skip scan was slower, e.g. for 10^6 unique records it was about 20% slower than the regular index scan.

For me these numbers sound good, since even in the quite extreme case of approximately 10 records per group the performance of index skip scan is close to the same as for the regular index only scan.

> On Tue, Dec 4, 2018 at 4:26 AM Peter Geoghegan wrote:
>
> > On Wed, Nov 21, 2018 at 12:55 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > Well, no, it's called with ScanDirectionIsForward(dir). But as far as I
> > remember from the previous discussions the entire topic of backward scan is
> > questionable for this patch, so I'll try to invest some time in it.
>
> Another thing that I think is related to skip scans that you should be
> aware of is dynamic prefix truncation, which I started a thread on
> just now [1].
> While I see one big problem with the POC patch I came up
> with, I think that that optimization is going to be something that
> ends up happening at some point. Repeatedly descending a B-Tree when
> the leading column is very low cardinality can be made quite a lot
> less expensive by dynamic prefix truncation. Actually, it's almost a
> perfect case for it.

Thanks, sounds cool. I'll try it out as soon as I have some spare time.
START/END line number for COPY FROM
Hi,

Currently we can skip the header line on COPY FROM, but the ability to start and stop copying at an arbitrary line could be used to divide a long copy operation, to copy a subset of the file, and to skip a footer. Attached is a patch for it.

Regards,
Surafel

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 411941ed31..86f9a6a905 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -43,6 +43,8 @@ COPY { table_name [ ( column_name [, ...] )
     FORCE_NULL ( column_name [, ...] )
     ENCODING 'encoding_name'
+    START starting_line_number
+    END ending_line_number
@@ -353,6 +355,24 @@ COPY { table_name [ (
+
+   START
+
+    Specifies the line number to begin copying.
+
+
+   END
+
+    Specifies the line number to end copying.
+
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 4311e16007..4d07716a9f 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -121,6 +121,8 @@ typedef struct CopyStateData
 	int			file_encoding;	/* file or remote side's character encoding */
 	bool		need_transcoding;	/* file encoding diff from server? */
 	bool		encoding_embeds_ascii;	/* ASCII can be non-first byte?
 */
+	int			start_postion;	/* copying start line */
+	int			end_postion;	/* copying end line */

 	/* parameters from the COPY command */
 	Relation	rel;			/* relation to copy to or from */
@@ -347,6 +349,7 @@ static void CopySendInt32(CopyState cstate, int32 val);
 static bool CopyGetInt32(CopyState cstate, int32 *val);
 static void CopySendInt16(CopyState cstate, int16 val);
 static bool CopyGetInt16(CopyState cstate, int16 *val);
+static void skipLines(CopyState cstate);

 /*
@@ -1223,6 +1226,34 @@ ProcessCopyOptions(ParseState *pstate,
 							defel->defname),
 					 parser_errposition(pstate, defel->location)));
 		}
+		else if (strcmp(defel->defname, "start") == 0)
+		{
+			if (cstate->start_postion)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options"),
+						 parser_errposition(pstate, defel->location)));
+			cstate->start_postion = defGetInt64(defel);
+			if (cstate->start_postion < 0)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid START line number"),
+						 parser_errposition(pstate, defel->location)));
+		}
+		else if (strcmp(defel->defname, "end") == 0)
+		{
+			if (cstate->end_postion)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options"),
+						 parser_errposition(pstate, defel->location)));
+			cstate->end_postion = defGetInt64(defel);
+			if (cstate->end_postion < 0)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("invalid END line number"),
+						 parser_errposition(pstate, defel->location)));
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -1373,6 +1404,13 @@ ProcessCopyOptions(ParseState *pstate,
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("CSV quote character must not appear in the NULL specification")));
+
+	if (cstate->end_postion != 0 &&
+		cstate->start_postion > cstate->end_postion)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("START line number cannot be greater than END line number")));
+
 }

 /*
@@ -2317,6 +2355,7 @@
CopyFrom(CopyState cstate)
 	uint64		lastPartitionSampleLineNo = 0;
 	uint64		nPartitionChanges = 0;
 	double		avgTuplesPerPartChange = 0;
+	uint64		end_line = 1;

 	Assert(cstate->rel);
@@ -2619,6 +2658,20 @@ CopyFrom(CopyState cstate)
 	has_instead_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
 								   resultRelInfo->ri_TrigDesc->trig_insert_instead_row);
+	if (cstate->start_postion)
+	{
+		end_line = cstate->start_postion;
+		skipLines(cstate);
+	}
+
+	/* throwing the header line away means starting the copy at the second line */
+	if (cstate->start_postion == 0 && cstate->header_line)
+	{
+		cstate->start_postion = 2;
+		end_line = cstate->start_postion;
+		skipLines(cstate);
+	}
+
 	/*
 	 * Check BEFORE STATEMENT insertion triggers. It's debatable whether we
 	 * should do this for COPY, since it's not really an "INSERT" statement as
@@ -2644,6 +2697,9 @@ CopyFrom(CopyState cstate)
 		TupleTableSlot *slot;
 		bool		skip_tuple;

+		if (cstate->end_postion != 0 && cstate->end_postion < end_line++)
+			break;
+
 		CHECK_FOR_INTERRUPTS();

 		if (nBufferedTuples == 0)
@@ -3394,14 +3450,6 @@ NextCopyFromRawFields(CopyState cstate, char ***fields, int *nfields)
 	/* only available for text or csv input */
 	Assert(!cstate->binary);

-	/* on input just throw the header line away */
-	if (cstate->cur_lineno == 0 && cstate->header_line)
-	{
-		cstate->cur_lineno++;
-		if (CopyReadLine(cstate))
-			return false;		/* done */
-	}
-
 	cstate->cur_lineno++;

 	/* Actually read the line into memory here
Re: Add timeline to partial WAL segments
On 12/15/18 1:56 AM, Michael Paquier wrote:

On Fri, Dec 14, 2018 at 06:05:18PM -0500, David Steele wrote:
On 12/14/18 3:26 PM, Robert Haas wrote:
The new TLI is the only thing that is guaranteed to be unique with each new promotion, and I would guess that it is therefore the right thing to use to disambiguate them.

This is the conclusion we came to after a few months of diagnosing and working on this problem.

Hm. Okay. From that point of view I think that we should also make pg_receivewal able to generate the same kind of segment format when it detects a timeline switch for the last partial segment of the previous timeline. Switching to a new timeline makes the streaming finish, so we could use that to properly close the WAL segment with the new name. At the same time it would be nice to have a macro able to generate a partial segment name and also check after it.

Or perhaps just always add the timeline to the .partial? That way it doesn't need to be renamed later. Also, there would be a consistent name, rather than sometimes .partial and sometimes .<tli>.partial.

The question in my mind: is it safe to back-patch?

That's unfortunately a visibility change, meaning that any existing backup solution able to handle .partial segments would get broken in a minor release. From that point of view this should go only on HEAD.

That means every archive command needs to deal with this issue on its own. It's definitely not a trivial thing to do. It's obviously strong to call this a bug, but there's very clearly an issue here. I wonder if there is anything else we can do that would work?

The other patch to reorder the priority of the .ready files could go down without any problem though.

Glad to hear it.

-- 
-David
da...@pgmasters.net
Re: Change pgarch_readyXlog() to return .history files first
On 12/15/18 2:10 AM, Michael Paquier wrote: On Fri, Dec 14, 2018 at 08:43:20AM -0500, David Steele wrote: On 12/13/18 7:15 PM, Michael Paquier wrote: Or you could just use IsTLHistoryFileName here? We'd have to truncate .ready off the string to make that work, which seems easy enough. Is that what you were thinking? Yes, that's the idea. pgarch_readyXlog returns the segment name which should be archived, so you could just compute it after detecting a .ready file. One thing to consider is the check above is more efficient than IsTLHistoryFileName() and it potentially gets run a lot. This check misses strspn(), so any file finishing with .history would get eaten even if that's unlikely to happen. Good point. The new patch uses IsTLHistoryFileName(). If one wants to simply check this code, you can just create dummy orphan files in archive_status and see in which order they get removed. Seems awfully racy. Are there currently any tests like this for the archiver that I can look at extending? Sorry for the confusion, I was referring to manual testing here. Ah, I see. Yes, that's exactly how I tested it, in addition to doing real promotions. Thinking about it, we could have an automatic test to check for the file order pattern by creating dummy files, starting the server with archiver enabled, and then parse the logs as orphan .ready files would get removed in the order their archiving is attempted with one WARNING entry generated for each. I am not sure if that is worth a test though. Yes, parsing the logs was the best thing I could think of, too. -- -David da...@pgmasters.net From 1987a90b724506c7d9e453d9a976e3f982f625ec Mon Sep 17 00:00:00 2001 From: David Steele Date: Thu, 20 Dec 2018 13:28:51 +0200 Subject: [PATCH] Change pgarch_readyXlog() to return .history files first. The alphabetical ordering of pgarch_readyXlog() means that on promotion 000100010001.partial will be archived before 0002.history. 
This appears harmless, but the .history files are what other potential primaries use to decide what timeline they should pick. The additional latency of compressing/transferring the much larger partial file means that archiving of the .history file is delayed and greatly increases the chance that another primary will promote to the same timeline. Teach pgarch_readyXlog() to return .history files first (and in order) to reduce the window where this can happen. This won't prevent all conflicts, but it is a simple change and should greatly reduce real-world occurrences. --- src/backend/postmaster/pgarch.c | 36 +++-- 1 file changed, 25 insertions(+), 11 deletions(-) diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index e88d545d65..50dc2027b1 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -695,11 +695,11 @@ pgarch_archiveXlog(char *xlog) * 2) because the oldest ones will sooner become candidates for * recycling at time of checkpoint * - * NOTE: the "oldest" comparison will presently consider all segments of - * a timeline with a smaller ID to be older than all segments of a timeline - * with a larger ID; the net result being that past timelines are given - * higher priority for archiving. This seems okay, or at least not - * obviously worth changing. + * NOTE: the "oldest" comparison will consider any .history file to be older + * than any other file except another .history file. Segments on a timeline + * with a smaller ID will be older than all segments on a timeline with a larger + * ID; the net result being that past timelines are given higher priority for + * archiving. This seems okay, or at least not obviously worth changing. 
*/ static bool pgarch_readyXlog(char *xlog) @@ -715,6 +715,7 @@ pgarch_readyXlog(char *xlog) DIR *rldir; struct dirent *rlde; bool found = false; + bool historyFound = false; snprintf(XLogArchiveStatusDir, MAXPGPATH, XLOGDIR "/archive_status"); rldir = AllocateDir(XLogArchiveStatusDir); @@ -728,12 +729,28 @@ pgarch_readyXlog(char *xlog) strspn(rlde->d_name, VALID_XFN_CHARS) >= basenamelen && strcmp(rlde->d_name + basenamelen, ".ready") == 0) { - if (!found) + /* Truncate off the .ready */ + rlde->d_name[basenamelen] = '\0'; + + /* Is this a history file? */ + bool history = IsTLHistoryFileName(rlde->d_name); + + /* + * If this is the first file or the first history file, copy it + */ + if (!found || (history && !historyFound)) { strcpy(newxlog, rlde->d_name); found = true; +
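[Editor's sketch] Stripped of the directory-scanning details, the ordering rule the patch implements reduces to a two-level comparison: any history file sorts before any non-history file, and ties fall back to plain name order (oldest first). The sketch below illustrates just that rule — it is not the patch's code, which tracks found/historyFound while scanning and uses IsTLHistoryFileName() rather than the loose suffix check used here.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if status file "a" should be archived before "b"
 * (both names assumed already stripped of their ".ready" suffix).
 */
static bool
archive_before(const char *a, const char *b)
{
    /* crude stand-in for IsTLHistoryFileName() */
    bool a_hist = strstr(a, ".history") != NULL;
    bool b_hist = strstr(b, ".history") != NULL;

    if (a_hist != b_hist)
        return a_hist;          /* history files always win */
    return strcmp(a, b) < 0;    /* otherwise alphabetical, i.e. oldest first */
}
```

Because history files are tiny, archiving them first closes the window in which another promoting standby could pick the same timeline, while leaving the relative order of ordinary segments unchanged.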
Re: Tid scan improvements
Review of v5: 0001: looks good. 0002: 1. I don't think you need palloc0() here. palloc() looks like it would be fine. if (tidRangeArray->ranges == NULL) tidRangeArray->ranges = (TidRange *) palloc0(tidRangeArray->numAllocated * sizeof(TidRange)); if that wasn't the case, then you'll need to also zero the additional memory when you repalloc(). 2. Can't the following code be moved into the correct forwards/backwards if block inside the if inscan block above? /* If we've finished iterating over the ranges, exit the loop. */ if (node->tss_CurrentTidRange >= numRanges || node->tss_CurrentTidRange < 0) break; Something like: if (bBackward) { if (node->tss_CurrentTidRange < 0) { /* initialize for backward scan */ node->tss_CurrentTidRange = numRanges - 1; } else if (node->tss_CurrentTidRange == 0) break; else node->tss_CurrentTidRange--; } else { if (node->tss_CurrentTidRange < 0) { /* initialize for forward scan */ node->tss_CurrentTidRange = 0; } else if (node->tss_CurrentTidRange >= numRanges - 1) break; else node->tss_CurrentTidRange++; } I think that's a few less lines and instructions and (I think) a bit neater too. 3. if (found_quals != NIL) (yeah, I know there's already lots of places not doing this) /* If we found any, make an AND clause out of them. */ if (found_quals) likewise in: /* Use a range qual if any were found. */ if (found_quals) 4. The new tests in tidscan.sql should drop the newly created tables. (I see some get dropped in the 0004 patch, but not all. Best not to rely on a later patch to do work that this patch should do) 0003: looks okay. 0004: 5. Please add a comment to scandir in: typedef struct TidScan { Scan scan; List *tidquals; /* qual(s) involving CTID = something */ ScanDirection scandir; } TidScan; /* forward or backward or don't care */ would do. Likewise for struct TidPath. Likely IndexPath can be used for guidance. 6. Is it worth adding a Merge Join regression test for this patch? 
Something like: postgres=# explain select * from t1 inner join t1 t2 on t1.ctid = t2.ctid order by t1.ctid desc; QUERY PLAN - Merge Join (cost=0.00..21.25 rows=300 width=14) Merge Cond: (t1.ctid = t2.ctid) -> Tid Scan Backward on t1 (cost=0.00..8.00 rows=300 width=10) -> Materialize (cost=0.00..8.75 rows=300 width=10) -> Tid Scan Backward on t1 t2 (cost=0.00..8.00 rows=300 width=10) (5 rows) 0005: 7. I see the logic behind this new patch, but quite possibly the majority of the time the relpages will be out of date and you'll mistakenly apply this to not the final page. I'm neither here nor there with it. I imagine you might feel the same since you didn't merge it with 0001. Maybe we can leave it out for now and see what others think. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Using POPCNT and other advanced bit manipulation instructions
On 20/12/18 6:53, David Rowley wrote: Back in 2016 [1] there was some discussion about using the POPCNT instruction to improve the performance of counting the number of bits set in a word. Improving this helps various cases, such as bms_num_members and also things like counting the allvisible and frozen pages in the visibility map. [snip] I've put together a very rough patch to implement using POPCNT and the leading and trailing 0-bit instructions to improve the performance of bms_next_member() and bms_prev_member(). The correct function should be determined on the first call to each function by way of setting a function pointer variable to the most suitable supported implementation. I've not yet gone through and removed all the number_of_ones[] arrays to replace with a pg_popcount*() call. IMVHO: Please do not disregard potential optimization by the compiler around those calls.. o_0 That might explain the reduced performance improvement observed. Not that I can see any obvious alternative to your implementation right away ... That seems to have mostly been done in Thomas' patch [3], part of which I've used for the visibilitymap.c code changes. If this patch proves to be possible, then I'll look at including the other changes Thomas made in his patch too. What I'm really looking for by posting now are reasons why we can't do this. I'm also interested in getting some testing done on older machines, particularly machines with processors that are from before 2007, both AMD and Intel. I can offer a 2005-vintage Opteron 2216 rev3 (bought late 2007) to test on. Feel free to toss me some test code. cpuinfo flags: fpu de tsc msr pae mce cx8 apic mca cmov pat clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good nopl extd_apicid eagerfpu pni cx16 hypervisor lahf_lm cmp_legacy 3dnowprefetch vmmcall 2007-2008 seems to be around the time both AMD and Intel added support for POPCNT and LZCNT, going by [4]. Thanks
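[Editor's sketch] The "choose on first call" scheme described above can be sketched as below. This is an assumption about the patch's shape, not its actual code: the names are made up, and the real version would consult CPUID to detect POPCNT support instead of unconditionally settling on the portable loop.

```c
#include <assert.h>
#include <stdint.h>

static int pg_popcount64_portable(uint64_t word);
static int pg_popcount64_choose(uint64_t word);

/* starts out pointing at the chooser; rebound on the first call */
static int (*pg_popcount64)(uint64_t) = pg_popcount64_choose;

/*
 * Fallback: classic per-bit loop (a number_of_ones[] lookup table plays
 * this role in core today).
 */
static int
pg_popcount64_portable(uint64_t word)
{
    int count = 0;

    while (word)
    {
        count += word & 1;
        word >>= 1;
    }
    return count;
}

static int
pg_popcount64_choose(uint64_t word)
{
    /*
     * The real patch would issue CPUID here and pick a POPCNT-based
     * implementation when the instruction is available; this sketch
     * always picks the portable one.
     */
    pg_popcount64 = pg_popcount64_portable;
    return pg_popcount64(word);
}
```

Subsequent calls go straight through the resolved pointer, so the capability check is paid only once per process rather than per call.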
[patch] de.po REINDEX error
de.po's error message when you try to "REINDEX DATABASE otherdb" has been the wrong way round since 2011: diff --git a/src/backend/po/de.po b/src/backend/po/de.po index ca52df6731..6aa4354d82 100644 --- a/src/backend/po/de.po +++ b/src/backend/po/de.po @@ -7540,7 +7540,7 @@ msgstr "Tabelle »%s« hat keine Indexe" #: commands/indexcmds.c:2329 #, c-format msgid "can only reindex the currently open database" -msgstr "aktuell geöffnete Datenbank kann nicht reindiziert werden" +msgstr "nur die aktuell geöffnete Datenbank kann reindiziert werden" #: commands/indexcmds.c:2435 #, c-format Spotted by a participant during a PostgreSQL training. Christoph
Re: Remove Deprecated Exclusive Backup Mode
On 12/20/18 1:35 AM, Michael Paquier wrote: On Wed, Dec 19, 2018 at 06:38:00PM +0200, David Steele wrote: I'll push this to the first post-PG12 CF when one appears. Can we just go ahead and create a 2019-07 CF now? I know we generally discuss this at PGCon, but we can always rename it, right? The schedule gets decided at the developer meeting, but the commit fests created can have their name and start/end dates modified as needed. So, I think that it is not an issue to add one ahead now. And here you go: https://commitfest.postgresql.org/23/ This request was going to pop out at some point anyway in the follow-up months. Thanks, Michael. Getting this error, not sure what it means: Cannot move patch to the same commitfest, and multiple future commitfests exist! -- -David da...@pgmasters.net
Re: [HACKERS] WAL logging problem in 9.4.3?
Hello. At Fri, 30 Nov 2018 18:27:05 +0100, Dmitry Dolgov <9erthali...@gmail.com> wrote in > > On Wed, Nov 14, 2018 at 4:48 AM Kyotaro HORIGUCHI > > wrote: > > > > 0004 was shot by e9edc1ba0b. Rebased to the current HEAD. > > Successfully built and passeed all regression/recovery tests > > including additional recovery/t/016_wal_optimize.pl. > > Thank you for working on this patch. Unfortunately, cfbot complains that > v4-0004-Fix-WAL-skipping-feature.patch could not be applied without conflicts. > Could you please post a rebased version one more time? Thanks. Here's the rebased version. I found no other amendment required other than the apparent conflict. > > On Fri, Jul 27, 2018 at 9:26 PM Andrew Dunstan > > wrote: > > > > On 07/18/2018 10:58 AM, Heikki Linnakangas wrote: > > > On 18/07/18 16:29, Robert Haas wrote: > > >> On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier > > >> wrote: > > What's wrong with the approach proposed in > > http://postgr.es/m/55afc302.1060...@iki.fi ? > > >>> > > >>> For back-branches that's very invasive so that seems risky to me > > >>> particularly seeing the low number of complaints on the matter. > > >> > > >> Hmm. I think that if you disable the optimization, you're betting that > > >> people won't mind losing performance in this case in a maintenance > > >> release. If you back-patch Heikki's approach, you're betting that the > > >> committed version doesn't have any bugs that are worse than the status > > >> quo. Personally, I'd rather take the latter bet. Maybe the patch > > >> isn't all there yet, but that seems like something we can work > > >> towards. If we just give up and disable the optimization, we won't > > >> know how many people we ticked off or how badly until after we've done > > >> it. > > > > > > Yeah. I'm not happy about backpatching a big patch like what I > > > proposed, and Kyotaro developed further. But I think it's the least > > > bad option we have, the other options discussed seem even worse. 
> > > > > > One way to review the patch is to look at what it changes, when > > > wal_level is *not* set to minimal, i.e. what risk or overhead does it > > > pose to users who are not affected by this bug? It seems pretty safe > > > to me. > > > > > > The other aspect is, how confident are we that this actually fixes the > > > bug, with least impact to users using wal_level='minimal'? I think > > > it's the best shot we have so far. All the other proposals either > > > don't fully fix the bug, or hurt performance in some legit cases. > > > > > > I'd suggest that we continue based on the patch that Kyotaro posted at > > > https://www.postgresql.org/message-id/20180330.100646.86008470.horiguchi.kyotaro%40lab.ntt.co.jp. > > > > > I have just spent some time reviewing Kyotaro's patch. I'm a bit > > nervous, too, given the size. But I'm also nervous about leaving things > > as they are. I suspect the reason we haven't heard more about this is > > that these days use of "wal_level = minimal" is relatively rare. > > I'm totally out of context of this patch, but reading this makes me nervous > too. Taking into account that the problem now is lack of review, do you have > plans to spend more time reviewing this patch? regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 120f3f1d4dc47eb74a6ad7fde3c116e31b8eab3e Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 11 Oct 2018 10:03:21 +0900 Subject: [PATCH 1/4] TAP test for copy-truncation optimization. 
--- src/test/recovery/t/016_wal_optimize.pl | 192 1 file changed, 192 insertions(+) create mode 100644 src/test/recovery/t/016_wal_optimize.pl diff --git a/src/test/recovery/t/016_wal_optimize.pl b/src/test/recovery/t/016_wal_optimize.pl new file mode 100644 index 00..310772a2b3 --- /dev/null +++ b/src/test/recovery/t/016_wal_optimize.pl @@ -0,0 +1,192 @@ +# Test WAL replay for optimized TRUNCATE and COPY records +# +# WAL truncation is optimized in some cases with TRUNCATE and COPY queries +# which sometimes interact badly with the other optimizations in line with +# several setting values of wal_level, particularly when using "minimal" or +# "replica". The optimization may be enabled or disabled depending on the +# scenarios dealt here, and should never result in any type of failures or +# data loss. +use strict; +use warnings; + +use PostgresNode; +use TestLib; +use Test::More tests => 14; + +# Wrapper routine tunable for wal_level. +sub run_wal_optimize +{ + my $wal_level = shift; + + # Primary needs to have wal_level = minimal here + my $node = get_new_node("node_$wal_level"); + $node->init; + $node->append_conf('postgresql.conf', qq( +wal_level = $wal_level +)); + $node->start; + + # Test direct truncation optimization. No tuples + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test1 (id serial PRIMARY KEY); + TRUNCATE test1; + COMMIT;"); + + $node->stop('immediate'); + $node->start; + + my $result =
Improve selectivity estimate for range queries
Hi, I found the problem about selectivity estimate for range queries on master as follows. This is my test case: postgres=# CREATE TABLE test(id int, val text); CREATE TABLE postgres=# INSERT INTO test SELECT i, 'testval' FROM generate_series(0,2)i; INSERT 0 3 postgres=# VACUUM analyze test; VACUUM postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE id > 0 and id < 1; QUERY PLAN --- Seq Scan on test (cost=0.00..613.00 rows=150 width=12) (actual time=0.044..21.371 rows= loops=1) Filter: ((id > 0) AND (id < 1)) Rows Removed by Filter: 20001 Planning Time: 0.099 ms Execution Time: 37.142 ms (5 rows) Although the actual number of rows was , the estimated number of rows was 150. So I dug to see what happened and thought that the following part in clauselist_selectivity caused this problem. /* * Now scan the rangequery pair list. */ while (rqlist != NULL) { RangeQueryClause *rqnext; if (rqlist->have_lobound && rqlist->have_hibound) { /* Successfully matched a pair of range clauses */ Selectivity s2; /* * Exact equality to the default value probably means the * selectivity function punted. This is not airtight but should * be good enough. */ if (rqlist->hibound == DEFAULT_INEQ_SEL || rqlist->lobound == DEFAULT_INEQ_SEL) { s2 = DEFAULT_RANGE_INEQ_SEL; } else { s2 = rqlist->hibound + rqlist->lobound - 1.0; In my environment, the selectivity for id > 0 was 0.1, and the selectivity for id < 1 was 0.1. Then, the value of rqlist->hibound and rqlist->lobound are set respectively. On the other hand, DEFAULT_INEQ_SEL is 0.. As a result, the final selectivity is computed according to DEFAULT_RANGE_INEQ_SEL, since the condition of the second if statement was satisfied. However, these selectivities were computed according to the statistics, so the final selectivity had to be calculated from rqlist->hibound + rqlist->lobound - 1.0. My test case might be uncommon, but I think it would cause some problems. 
To handle such cases I've thought up an idea based on a previous commit[1] which solved a similar problem related to DEFAULT_NUM_DISTINCT. Just like that modification, I added a flag to clauselist_selectivity which shows whether the selectivity was calculated according to the statistics or not, and used it as the condition of the if statement instead of rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL. I am afraid the changes are larger than I had expected, but with them we can estimate selectivity correctly. Patch attached. Do you have any thoughts? [1] https://git.postgresql.org/gitweb/?p=postgresql.git=commitdiff=4c2777d0b733220d9029f78817af8ce Best regards, Yuzuko Hosoya NTT Open Source Software Center improve_selectivity_estimate_for_range_queries.patch Description: Binary data
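[Editor's sketch] The failure mode described in this thread can be reproduced arithmetically in isolation. The function below mirrors the quoted clauselist_selectivity() branch (a sketch, not the server code); the constants are the ones defined in selfuncs.h in current sources.

```c
#include <assert.h>
#include <math.h>

#define DEFAULT_INEQ_SEL        0.3333333333333333
#define DEFAULT_RANGE_INEQ_SEL  0.005

/* Mirror of the range-pair branch quoted in the mail above. */
static double
range_pair_selectivity(double lobound, double hibound)
{
    /*
     * Exact equality to DEFAULT_INEQ_SEL is taken to mean the estimator
     * punted -- even when the value legitimately came from statistics,
     * which is precisely the reported problem.
     */
    if (hibound == DEFAULT_INEQ_SEL || lobound == DEFAULT_INEQ_SEL)
        return DEFAULT_RANGE_INEQ_SEL;
    return hibound + lobound - 1.0;
}
```

With statistics-derived bounds of, say, 0.9 and 0.6 the result is the expected 0.5; but if the statistics genuinely produce exactly 1/3 for one side, the function falls back to 0.005 regardless — which is what the proposed "calculated from statistics" flag would prevent.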
RE: Localization tools for Postgres
Hi, cc'ed pgsql-translators. I personally use 'poedit' when working with po files. Takeshi Ideriha Fujitsu Limited From: Дмитрий Воронин [mailto:carriingfat...@yandex.ru] Sent: Thursday, December 20, 2018 12:54 AM To: pgsql-hack...@postgresql.org Subject: Localization tools for Postgres Hi, hackers. Can you tell me which tools you prefer when you update *.po translations? Thanks!
Re: Using POPCNT and other advanced bit manipulation instructions
On Thu, 20 Dec 2018 at 20:17, Gavin Flower wrote: > Looking at the normalized standard deviations, the patched results have > a higher than 5% chance of being better simply by chance. I suspect > that you have made an improvement, but the statistics are not convincing. Yeah, I'd hoped that I could have gotten a better signal to noise ratio by running the test many times, but you're right. That was on my laptop. I've run the test again on an AWS instance and the results seem to be a bit more stable. Same table with 1 int column and 100m rows. statistics set to 10. Unpatched postgres=# analyze a; Time: 38.248 ms Time: 35.185 ms Time: 35.067 ms Time: 34.879 ms Time: 34.816 ms Time: 34.558 ms Time: 34.722 ms Time: 34.427 ms Time: 34.214 ms Time: 34.301 ms Time: 35.751 ms Time: 33.993 ms Time: 33.880 ms Time: 33.617 ms Time: 33.381 ms Time: 33.326 ms Patched: postgres=# analyze a; Time: 34.421 ms Time: 33.523 ms Time: 33.230 ms Time: 33.678 ms Time: 32.987 ms Time: 32.914 ms Time: 33.165 ms Time: 32.707 ms Time: 32.645 ms Time: 32.814 ms Time: 32.082 ms Time: 32.143 ms Time: 32.310 ms Time: 31.966 ms Time: 31.702 ms Time: 32.089 ms Avg +5.72%, Median +5.29% -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
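[Editor's note] For reference, the quoted "+5.72% / +5.29%" figures follow from the raw timings above, taking improvement as (unpatched / patched − 1) × 100 over the mean and the median of the sixteen runs. A quick standalone check:

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

#define N 16

static const double unpatched[N] = {
    38.248, 35.185, 35.067, 34.879, 34.816, 34.558, 34.722, 34.427,
    34.214, 34.301, 35.751, 33.993, 33.880, 33.617, 33.381, 33.326
};
static const double patched[N] = {
    34.421, 33.523, 33.230, 33.678, 32.987, 32.914, 33.165, 32.707,
    32.645, 32.814, 32.082, 32.143, 32.310, 31.966, 31.702, 32.089
};

static int
cmp_double(const void *a, const void *b)
{
    double d = *(const double *) a - *(const double *) b;

    return (d > 0) - (d < 0);
}

static double
mean(const double *v)
{
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        sum += v[i];
    return sum / N;
}

static double
median(const double *v)
{
    double sorted[N];

    memcpy(sorted, v, sizeof(sorted));
    qsort(sorted, N, sizeof(double), cmp_double);
    return (sorted[N / 2 - 1] + sorted[N / 2]) / 2.0;   /* N is even */
}

/* percentage speedup of "after" relative to "before" */
static double
pct_improvement(double before, double after)
{
    return (before / after - 1.0) * 100.0;
}
```

Evaluating pct_improvement(mean(unpatched), mean(patched)) gives roughly 5.72, and the same over the medians gives roughly 5.29, matching the summary line.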