[HACKERS] Quorum commit for multiple synchronous replication.
Hi all, In the 9.6 development cycle we discussed the configuration syntax for a long time with future expansion in mind. As a result we got a new dedicated language for multiple synchronous replication, but it supports only the priority method. We know that quorum commit is very useful for many users, and the dedicated language can easily be extended to support it, so I'd like to propose quorum commit for multiple synchronous replication here. The attached patches make the following changes:

- Add new syntax 'Any N ( node1, node2, ... )' to synchronous_standby_names for quorum commit.
- In quorum commit, the master can return commit to the client after receiving ACK from *at least* any N servers of the listed standbys.
- sync_priority of all listed servers is the same, 1.
- Add a regression test for quorum commit.

I was thinking that the syntax for the quorum method would use '[ ... ]', but that could be confused with the '( ... )' used by the priority method. The 001 patch adds the 'Any N ( ... )' style syntax, but we might still need to discuss a better syntax; discussion is very welcome. Attached is a draft patch; please give me feedback.

Regards,

--
Masahiko Sawada

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 67249d8..0ce5399 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -76,9 +76,9 @@ char *SyncRepStandbyNames;
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
 
-static bool announce_next_takeover = true;
+SyncRepConfigData *SyncRepConfig = NULL;
 
-static SyncRepConfigData *SyncRepConfig = NULL;
+static bool announce_next_takeover = true;
 static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
@@ -89,7 +89,12 @@ static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
 					   XLogRecPtr *flushPtr,
 					   XLogRecPtr *applyPtr,
 					   bool *am_sync);
+static bool SyncRepGetNNewestSyncRecPtr(XLogRecPtr *writePtr,
+					   XLogRecPtr *flushPtr,
+					   XLogRecPtr *applyPtr,
+					   int pos, bool *am_sync);
 static int	SyncRepGetStandbyPriority(void);
+static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
 static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -391,7 +396,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	XLogRecPtr	flushPtr;
 	XLogRecPtr	applyPtr;
-	bool		got_oldest;
+	bool		got_recptr;
 	bool		am_sync;
 	int			numwrite = 0;
 	int			numflush = 0;
@@ -418,11 +423,16 @@ SyncRepReleaseWaiters(void)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 	/*
-	 * Check whether we are a sync standby or not, and calculate the oldest
-	 * positions among all sync standbys.
+	 * Check whether we are a sync standby or not, and calculate the synced
+	 * positions among all sync standbys using the configured method.
 	 */
-	got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
-											&applyPtr, &am_sync);
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+		got_recptr = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
+												&applyPtr, &am_sync);
+	else	/* SYNC_REP_QUORUM */
+		got_recptr = SyncRepGetNNewestSyncRecPtr(&writePtr, &flushPtr,
+												 &applyPtr, SyncRepConfig->num_sync,
+												 &am_sync);
 
 	/*
 	 * If we are managing a sync standby, though we weren't prior to this,
@@ -440,7 +450,7 @@ SyncRepReleaseWaiters(void)
 	 * If the number of sync standbys is less than requested or we aren't
 	 * managing a sync standby then just leave.
 	 */
-	if (!got_oldest || !am_sync)
+	if (!got_recptr || !am_sync)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -476,6 +486,88 @@ SyncRepReleaseWaiters(void)
 }
 
 /*
+ * Calculate the 'pos' newest Write, Flush and Apply positions among sync standbys.
+ *
+ * Return false if the number of sync standbys is less than
+ * synchronous_standby_names specifies. Otherwise return true and
+ * store the 'pos' newest positions into *writePtr, *flushPtr, *applyPtr.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * a sync standby. Otherwise it's set to false.
+ */
+static bool
+SyncRepGetNNewestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
+							XLogRecPtr *applyPtr, int pos, bool *am_sync)
+{
+	XLogRecPtr *write_array;
+	XLogRecPtr *flush_array;
+	XLogRecPtr *apply_array;
+	List	   *sync_standbys;
+	ListCell   *cell;
+	int			len;
+	int			i = 0;
+
+	*writePtr = InvalidXLogRecPtr;
+	*flushPtr = InvalidXLogRecPtr;
+	*applyPtr = InvalidXLogRecPtr;
+	*am_sync = false;
+
+	/* Get standbys that are considered as synchronous at this moment */
+	sync_standbys = SyncRepGetSyncStandbys(am_sync);
+
+	/*
+	 * Quick exit if we are not managing a sync standby or there are not
+	 * enough synchronous standbys.
+	 */
+	if (!(*am_sync) ||
+		SyncRepConfig == NULL ||
+		list_length(sync_standbys) < SyncRepConfig->num_sync)
+	{
+		list_free(sync_standbys);
+		return false;
+	}
+
+	len = list_length(sync_standbys);
+	write_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+	flush_array =
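To make the proposed semantics concrete, here is a minimal sketch of the setting (the keyword spelling and standby names follow the patch's proposed grammar, not any committed syntax):

    # postgresql.conf: commit returns once ACKs arrive from any 2 of the 3 standbys
    synchronous_standby_names = 'Any 2 (s1, s2, s3)'

By contrast, the existing priority form '2 (s1, s2, s3)' waits specifically for the two highest-priority connected standbys in the list.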
Re: [HACKERS] Why we lost Uber as a user
On 03/08/16 02:27, Robert Haas wrote:
> Personally, I think that incremental surgery on our current heap format to try to fix this is not going to get very far. If you look at the history of this, 8.3 was a huge release for timely cleanup of dead tuples. There was also significant progress in 8.4 as a result of 5da9da71c44f27ba48fdad08ef263bf70e43e689. As far as I can recall, we then made no progress at all in 9.0 - 9.4. We made a very small improvement in 9.5 with 94028691609f8e148bd4ce72c46163f018832a5b, but that's pretty niche. In 9.6, we have "snapshot too old", which I'd argue is potentially a large improvement, but it was big and invasive and will no doubt pose code maintenance hazards in the years to come; also, many people won't be able to use it or won't realize that they should use it. I think it is likely that further incremental improvements here will be quite hard to find, and the amount of effort will be large relative to the amount of benefit. I think we need a new storage format where the bloat is cleanly separated from the data rather than intermingled with it; every other major RDBMS works that way. Perhaps this is a case of "the grass is greener on the other side of the fence", but I don't think so.

Yeah, I think this is a good summary of the state of play. The only other new db development to use a non-overwriting design like ours that I know of was Jim Starkey's Falcon engine for (ironically) MySQL 6.0. Not sure if anyone is still progressing that at all now.

I do wonder if Uber could have successfully tamed dead tuple bloat with aggressive per-table autovacuum settings (and if in fact they tried), but as I think Robert said earlier, it is pretty easy to come up with a high-update (or insert + delete) workload that makes for a pretty ugly bloat component even with really aggressive autovacuuming.

Cheers

Mark
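For reference, per-table tuning of the sort Mark mentions might look like this (the table name and numbers are purely illustrative; the right values are workload-dependent):

    -- hypothetical aggressive per-table autovacuum settings
    ALTER TABLE hot_table SET (
        autovacuum_vacuum_scale_factor = 0.01,  -- trigger at ~1% dead tuples
        autovacuum_vacuum_threshold = 1000,
        autovacuum_vacuum_cost_delay = 0        -- don't throttle the worker
    );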
Re: [HACKERS] regression test for extended query protocol
[Sorry for a top post. For some unknown reason I couldn't get mails from pgsql-hackers since this morning and had to check the archive.]

>> In my understanding, we don't have any regression test for protocol
>> level prepared query (we do have SQL level prepared query tests,
>> though). Shouldn't we add those tests to the regression test suites?
>
> I thought that ECPG was covering a portion of that. Right?
> --
> Michael

In my understanding, ECPG calls libpq, thus the test cases are limited to the cases which are only possible with libpq (or it may be even limited to the cases which ECPG deals with).

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
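For what it's worth, a protocol-level prepared query is reachable from plain libpq, so a test could exercise something like this (a minimal sketch with invented statement name and query; error handling omitted):

    /* Parse, then Bind/Execute, without any SQL-level PREPARE/EXECUTE */
    const char *values[1] = {"42"};
    PGresult   *res;

    res = PQprepare(conn, "tst1", "SELECT $1::int", 1, NULL);     /* Parse message */
    PQclear(res);
    res = PQexecPrepared(conn, "tst1", 1, values, NULL, NULL, 0); /* Bind + Execute */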
Re: [HACKERS] Slowness of extended protocol
On Sun, Jul 31, 2016 at 05:57:12PM -0400, Tom Lane wrote:
> In hindsight it seems clear that what a lot of apps want out of extended
> protocol is only the ability to send parameter values out-of-line instead
> of having to quote/escape them into SQL literals. Maybe an idea for the
> fabled V4 protocol update is some compromise query type that corresponds
> precisely to PQexecParams's feature set: you can send parameter values
> out-of-line, and you can specify text or binary results, but there's no
> notion of any persistent state being created and no feedback about
> parameter data types.

Do you want this on the TODO list?

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
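For readers who haven't used it, the feature set Tom refers to is what PQexecParams already exposes on the client side (a sketch; the table and values are illustrative):

    /* out-of-line parameter, no persistent server-side state created */
    const char *values[1] = {"123"};
    PGresult   *res = PQexecParams(conn,
                                   "SELECT * FROM t WHERE id = $1",
                                   1,      /* one parameter */
                                   NULL,   /* let the server infer its type */
                                   values,
                                   NULL,   /* text-format params need no lengths */
                                   NULL,   /* all params in text format */
                                   0);     /* ask for text-format results */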
Re: [HACKERS] regression test for extended query protocol
On Wed, Aug 3, 2016 at 11:33 AM, Tatsuo Ishii wrote:
> In my understanding, we don't have any regression test for protocol
> level prepared query (we do have SQL level prepared query tests,
> though). Shouldn't we add those tests to the regression test suites?

I thought that ECPG was covering a portion of that. Right?
--
Michael
[HACKERS] regression test for extended query protocol
In my understanding, we don't have any regression test for protocol level prepared query (we do have SQL level prepared query tests, though). Shouldn't we add those tests to the regression test suites?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: [HACKERS] Why we lost Uber as a user
On Tue, Aug 2, 2016 at 07:30:22PM -0700, Alfred Perlstein wrote:
> So for instance, let's say there is a bug in the master's write to disk.
> The logical replication acts as a barrier from that bad write going to the
> slaves. With bad writes going to slaves then any corruption experienced on
> the master will quickly reach the slaves and they too will be corrupted.
>
> With logical replication a bug may be stopped at the replication layer. At
> that point you can resync the slave from the master.
>
> Now in the case of physical replication all your base are belong to zuul and
> you are in a very bad state.
>
> That said with logical replication, who's to say that if the statement is
> replicated to a slave that the slave won't experience the same bug and also
> corrupt itself.
>
> We may be saying the same thing, but still there is something to be said for
> logical replication... also, didn't they show that logical replication was
> faster for some use cases at Uber?

I saw from the Uber article that they were moving not to per-row logical replication but to _statement_ replication, which is very hard to do because typical SQL doesn't record which concurrent transactions committed before a new statement's transaction snapshot is taken, and doesn't record lock order for row updates blocked by concurrent activity --- both of which affect the final result of the query. So, for statement replication, it is not a question of whether the code has bugs, but that the replay is not 100% possible in all cases, unless you switch to some statement-row-lock hybrid ability.

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
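A tiny illustration of the lock-order point Bruce raises (hypothetical table t with one row, id = 1 and n = 10):

    -- session A:
    UPDATE t SET n = n * 2  WHERE id = 1;
    -- session B, concurrently, blocked on the same row:
    UPDATE t SET n = n + 10 WHERE id = 1;

If A commits first, n ends up 30; if B wins the row lock first, n ends up 40. A statement-level replica that replays the two statements in the other order silently diverges from the master, which is why the raw SQL text alone cannot guarantee faithful replay.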
Re: [HACKERS] Why we lost Uber as a user
On 8/2/16 2:14 PM, Tom Lane wrote:
> Stephen Frost writes:
>> With physical replication, there is the concern that a bug in *just* the
>> physical (WAL) side of things could cause corruption.
>
> Right. But with logical replication, there's the same risk that the
> master's state could be fine but a replication bug creates corruption on
> the slave. Assuming that the logical replication works by issuing valid
> SQL commands to the slave, one could hope that this sort of "corruption"
> only extends to having valid data on the slave that fails to match the
> master. But that's still not a good state to be in. And to the extent
> that performance concerns lead the implementation to bypass some levels
> of the SQL engine, you can easily lose that guarantee too.
>
> In short, I think Uber's position that logical replication is somehow
> more reliable than physical is just wishful thinking. If anything, my
> money would be on the other way around: there's a lot less mechanism
> that can go wrong in physical replication. Which is not to say there
> aren't good reasons to use logical replication; I just do not believe
> that one.
>
> regards, tom lane

The reason it can be less catastrophic is that with logical replication you may futz up your data, but you are safe from corrupting your entire db. Meaning if an update is missed or doubled, that may be addressed by a fixup SQL stmt; however, if the replication causes a write to the entirely wrong place in the db file, then you need to "fsck" your db and hope that nothing super critical was blown away. The impact across a cluster is potentially magnified by physical replication.

So for instance, let's say there is a bug in the master's write to disk. The logical replication acts as a barrier from that bad write going to the slaves. With bad writes going to slaves, any corruption experienced on the master will quickly reach the slaves and they too will be corrupted.

With logical replication a bug may be stopped at the replication layer. At that point you can resync the slave from the master.

Now in the case of physical replication all your base are belong to zuul and you are in a very bad state.

That said, with logical replication, who's to say that if the statement is replicated to a slave the slave won't experience the same bug and also corrupt itself.

We may be saying the same thing, but still there is something to be said for logical replication... also, didn't they show that logical replication was faster for some use cases at Uber?

-Alfred
Re: [HACKERS] Increasing timeout of poll_query_until for TAP tests
On Wed, Aug 3, 2016 at 7:21 AM, Alvaro Herrera wrote:
> Michael Paquier wrote:
>
>> Here using pg_xlog_replay_resume() is not the correct solution because
>> this would cause the node to finish recovery before we want it to, and
>> so is recovery_target_action = 'promote'. If we look at the test, it
>> is doing the following when getting the TXID that is used as recovery
>> target:
>> $node_master->safe_psql('postgres',
>> 	"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
>> my $recovery_txid =
>> 	$node_master->safe_psql('postgres', "SELECT txid_current()");
>> my $lsn2 =
>> 	$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
>> What I think we had better do is reverse the calls
>> pg_current_xlog_location() and txid_current() so as we are sure that
>> the LSN we track for replay is lower than the real target LSN. The
>> same problem exists when defining the timestamp target.
>>
>> The patch attached does that,
>
> Why not capture both items in a single select, such as in the attached
> patch?

Let me test this [... A while after ...] This looks to work properly. 12 runs in a row have passed.
--
Michael
Re: [HACKERS] Slowness of extended protocol
Doesn't this patch break an existing behavior of unnamed statements? That is, an unnamed statement shall exist until the next Parse message using the unnamed statement is received. It is possible to use the same unnamed statement multiple times in a transaction.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
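A minimal libpq sketch of the behavior Ishii describes (purely illustrative): the empty statement name selects the unnamed statement, and today it can be executed repeatedly until the next Parse replaces it:

    const char *values[1] = {"7"};
    PGresult   *res;

    res = PQprepare(conn, "", "SELECT $1::int", 1, NULL);      /* Parse into unnamed stmt */
    PQclear(res);
    res = PQexecPrepared(conn, "", 1, values, NULL, NULL, 0);  /* first use */
    PQclear(res);
    res = PQexecPrepared(conn, "", 1, values, NULL, NULL, 0);  /* reuse must still work */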
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Tue, Aug 2, 2016 at 5:05 PM, Robert Haas wrote:
> I think that's just making life difficult. If nothing else, sqlsmith
> hunts around for functions it can call that return internal errors,
> and if we refuse to fix all of them to return user-facing errors, then
> it's just crap for the people running sqlsmith to sift through and
> it's a judgment call whether to fix each particular case. Even aside
> from that, I think it's much better to have a clear and unambiguous
> rule that elog is only for can't-happen things, not
> we-don't-recommend-it things.

+1. This also has value in the context of automatically surfacing situations where "can't happen" errors do in fact happen at scale.

--
Peter Geoghegan
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Tue, Aug 2, 2016 at 7:16 PM, Tom Lane wrote:
> Robert Haas writes:
>> On Tue, Aug 2, 2016 at 6:03 AM, Dilip Kumar wrote:
>>> There are many more such exposed functions, which can throw cache lookup
>>> failure error if we pass wrong value.
>>>
>>> i.e.
>>> record_in
>>> domain_in
>>> fmgr_c_validator
>>> plpgsql_validator
>>> pg_procedure_is_visible
>>>
>>> Are we planning to change these also..
>
>> I think if they are SQL-callable functions that users can invoke from
>> a SQL prompt, they shouldn't throw "cache lookup failed" errors or,
>> for that matter, any other error that is reported with elog().
>
> FWIW, I would restrict that to "functions that users are *intended* to
> call from a SQL prompt". If you are fooling around calling internal
> functions with garbage input, you should not be surprised to get an
> internal error back. So of the above list, only pg_procedure_is_visible
> seems particularly worth changing to me. And for it, returning NULL
> (ie, "unknown") seems reasonable enough.

I think that's just making life difficult. If nothing else, sqlsmith hunts around for functions it can call that return internal errors, and if we refuse to fix all of them to return user-facing errors, then it's just crap for the people running sqlsmith to sift through, and it's a judgment call whether to fix each particular case. Even aside from that, I think it's much better to have a clear and unambiguous rule that elog is only for can't-happen things, not we-don't-recommend-it things.

> There's no such animal as pg_procedure_is_visible. I suspect that's an
> EDB-ism that hasn't caught up with commit 66bb74dbe8206a35.

Yep.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
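To illustrate the rule being argued for, here is a sketch of the two error-reporting conventions (the specific checks and OID variable are illustrative, adapted from the style used throughout the backend):

    /* internal, can't-happen condition: elog() is appropriate */
    tup = SearchSysCache1(PROCOID, ObjectIdGetDatum(funcid));
    if (!HeapTupleIsValid(tup))
        elog(ERROR, "cache lookup failed for function %u", funcid);

    /* user-reachable condition: ereport() with a proper SQLSTATE */
    if (!OidIsValid(funcid))
        ereport(ERROR,
                (errcode(ERRCODE_UNDEFINED_FUNCTION),
                 errmsg("function with OID %u does not exist", funcid)));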
Re: [HACKERS] Changed SRF in targetlist handling
On 2016-08-02 19:02:38 -0400, Tom Lane wrote:
> Andres Freund writes:
>> I've an implementation that
>
>> 1) turns all targetlist SRFs (tSRF from now on) into ROWS FROM
>>    expressions. If there's a tSRF in the argument of a tSRF, that becomes
>>    a separate, lateral, ROWS FROM expression.
>
>> 2) If grouping/window functions are present, the entire query is wrapped
>>    in a subquery RTE, except for the set-returning function. All
>>    referenced Var|Aggref|GroupingFunc|WindowFunc|Param nodes in the
>>    original targetlist are made to reference that subquery, which gets a
>>    TargetEntry for them.
>
> FWIW, I'd be inclined to do the subquery RTE all the time,

Yea, that's what I ended up doing.

> adding some optimization fence to ensure it doesn't get folded back.
> That fixes your problem here:
>> So far I have one problem without an easy solution: Historically queries
>> like
>> =# SELECT id, generate_series(1,2) FROM (VALUES(1),(2)) few(id);
>> ┌────┬─────────────────┐
>> │ id │ generate_series │
>> ├────┼─────────────────┤
>> │  1 │               1 │
>> │  1 │               2 │
>> │  2 │               1 │
>> │  2 │               2 │
>> └────┴─────────────────┘
>> have preserved the SRF output ordering. But by turning the SRF into a
>> ROWS FROM, there's no guarantee that the cross join between "few" and
>> generate_series(1,3) above is implemented in that order.

But I don't see how that fixes the above problem? The join, on the top level because of aggregates, can still be implemented as subquery join srf or as srf join subquery, with the different output order that implies. I've duct-taped together a solution for that, by forcing the lateral machinery to always see a dependency from the SRF to the subquery; but that probably needs a nicer fix than a RangeTblEntry->deps field which is processed in extract_lateral_references() ;)

>> Besides that I'm structurally wondering whether turning the original
>> query into a subquery is the right thing to do. It requires some kind of
>> ugly munching of Query->*, and has the above problem.
>
> It does not seem like it should be that hard, certainly no worse than
> subquery pullup. Want to show code?

It's not super hard; there's some stuff like pushing/not-pushing various sortgrouprefs to the subquery. But I think we can live with it. Let me clean up the code some; I hope to have something today or tomorrow.

>> An alternative approach would be to do this during parse-analysis, but I
>> think that might end up being confusing, because stored rules would
>> suddenly have a noticeably different structure, and it'd tie us a lot
>> more to the details of that transformation than I'd like.
>
> -1 on that; we do not want this transformation visible in stored rules.

Agreed.

>> Besides the removal of the least-common-multiple behaviour of tSRF queries,
>> there's some other consequences that using function scans have:
>> Previously if a tSRF was never evaluated, it didn't cause the number of
>> rows from being increased. E.g.
>> SELECT id, COALESCE(1, generate_series(1,2)) FROM (VALUES(1),(2)) few(id);
>> only produced two rows. But using joins means that a simple
>> implementation of using ROWS FROM returns four rows.
>
> Hmm. I don't mind changing behavior in that sort of corner case.
> If we're prepared to discard the LCM behavior, this seems at least
> an order of magnitude less likely to be worth worrying about.

I think it's fine, and potentially less confusing.

> Would it be worth detecting SRFs below CASE/COALESCE/etc and throwing
> an error? It would be easier to sell throwing an error than silently
> changing behavior, I think.

Hm. We could, but I think the new behaviour would actually make sense in the long run. Interpreting the coalesce to run on the output of the SRF doesn't seem bad to me.

I found another edge case which we need to make a decision about. 'record'-returning SRFs can't be transformed easily into a ROWS FROM. Consider e.g. the following from the regression tests:

create function array_to_set(anyarray) returns setof record as $$
  select i AS "index", $1[i] AS "value" from generate_subscripts($1, 1) i
$$ language sql strict immutable;

select array_to_set(array['one', 'two']);
┌──────────────┐
│ array_to_set │
├──────────────┤
│ (1,one)      │
│ (2,two)      │
└──────────────┘
(2 rows)

which currently works. That can't be modeled as ROWS FROM() directly, because that desperately wants to return the columns as columns, which we can't do for 'record'-returning things, because they don't have defined columns. For composite-returning SRFs I've currently implemented that by generating a ROWS() expression, but that doesn't work for record. So it seems we need some, not necessarily user-exposed, way of making nodeFunctionscan.c return the return value as one datum. One way, as suggested by Andrew G. on IRC, was to interpret empty column
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
Robert Haas writes:
> On Tue, Aug 2, 2016 at 6:03 AM, Dilip Kumar wrote:
>> There are many more such exposed functions, which can throw cache lookup
>> failure error if we pass wrong value.
>>
>> i.e.
>> record_in
>> domain_in
>> fmgr_c_validator
>> plpgsql_validator
>> pg_procedure_is_visible
>>
>> Are we planning to change these also..

> I think if they are SQL-callable functions that users can invoke from
> a SQL prompt, they shouldn't throw "cache lookup failed" errors or,
> for that matter, any other error that is reported with elog().

FWIW, I would restrict that to "functions that users are *intended* to call from a SQL prompt". If you are fooling around calling internal functions with garbage input, you should not be surprised to get an internal error back. So of the above list, only pg_procedure_is_visible seems particularly worth changing to me. And for it, returning NULL (ie, "unknown") seems reasonable enough.

Actually, it looks to me like we already fixed that for the built-in is_visible functions:

    # select pg_function_is_visible(0) is null;
     ?column?
    ----------
     t
    (1 row)

There's no such animal as pg_procedure_is_visible. I suspect that's an EDB-ism that hasn't caught up with commit 66bb74dbe8206a35.

			regards, tom lane
Re: [HACKERS] Changed SRF in targetlist handling
Andres Freund writes:
> I've an implementation that

> 1) turns all targetlist SRFs (tSRF from now on) into ROWS FROM
>    expressions. If there's a tSRF in the argument of a tSRF, that becomes
>    a separate, lateral, ROWS FROM expression.

> 2) If grouping/window functions are present, the entire query is wrapped
>    in a subquery RTE, except for the set-returning function. All
>    referenced Var|Aggref|GroupingFunc|WindowFunc|Param nodes in the
>    original targetlist are made to reference that subquery, which gets a
>    TargetEntry for them.

FWIW, I'd be inclined to do the subquery RTE all the time, adding some optimization fence to ensure it doesn't get folded back. That fixes your problem here:

> So far I have one problem without an easy solution: Historically queries
> like
> =# SELECT id, generate_series(1,2) FROM (VALUES(1),(2)) few(id);
> ┌────┬─────────────────┐
> │ id │ generate_series │
> ├────┼─────────────────┤
> │  1 │               1 │
> │  1 │               2 │
> │  2 │               1 │
> │  2 │               2 │
> └────┴─────────────────┘
> have preserved the SRF output ordering. But by turning the SRF into a
> ROWS FROM, there's no guarantee that the cross join between "few" and
> generate_series(1,3) above is implemented in that order.

> Besides that I'm structurally wondering whether turning the original
> query into a subquery is the right thing to do. It requires some kind of
> ugly munching of Query->*, and has the above problem.

It does not seem like it should be that hard, certainly no worse than subquery pullup. Want to show code?

> An alternative approach would be to do this during parse-analysis, but I
> think that might end up being confusing, because stored rules would
> suddenly have a noticeably different structure, and it'd tie us a lot
> more to the details of that transformation than I'd like.

-1 on that; we do not want this transformation visible in stored rules.

> Besides the removal of the least-common-multiple behaviour of tSRF queries,
> there's some other consequences that using function scans have:
> Previously if a tSRF was never evaluated, it didn't cause the number of
> rows from being increased. E.g.
> SELECT id, COALESCE(1, generate_series(1,2)) FROM (VALUES(1),(2)) few(id);
> only produced two rows. But using joins means that a simple
> implementation of using ROWS FROM returns four rows.

Hmm. I don't mind changing behavior in that sort of corner case. If we're prepared to discard the LCM behavior, this seems at least an order of magnitude less likely to be worth worrying about.

Having said that, I do seem to recall a bug report about misbehavior when a SRF was present in just one arm of a CASE statement. That would have the same type of behavior as you describe here, and evidently there's at least one person out there depending on it. Would it be worth detecting SRFs below CASE/COALESCE/etc and throwing an error? It would be easier to sell throwing an error than silently changing behavior, I think.

			regards, tom lane
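For readers following along, the transformation under discussion would turn the query Andres shows above into something roughly like this (a sketch of the intended plan shape, not the literal rewritten parse tree):

    SELECT few.id, gs.g
    FROM (VALUES (1),(2)) few(id),
         LATERAL ROWS FROM (generate_series(1,2)) gs(g);

The ordering question is then whether the join is free to produce gs-major rather than few-major row order, which is exactly what the historical tSRF behavior forbade.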
Re: [HACKERS] Double invocation of InitPostmasterChild in bgworker with -DEXEC_BACKEND
Thomas Munro writes:
> I discovered that if you build with -DEXEC_BACKEND on a Unix system
> and then try to start a background worker, it dies in
> InitializeLatchSupport:
> TRAP: FailedAssertion("!(selfpipe_readfd == -1)", File: "latch.c", Line: 161)
> That's because InitPostmasterChild is called twice. I can
> successfully use regular parallel query workers and bgworkers created
> by extensions if I apply the attached patch.

Confirmed, fix pushed. I wonder if we should have a buildfarm member running the Unix EXEC_BACKEND code ...

			regards, tom lane
Re: [HACKERS] Increasing timeout of poll_query_until for TAP tests
Michael Paquier wrote:

> Here using pg_xlog_replay_resume() is not the correct solution because
> this would cause the node to finish recovery before we want it to, and
> so is recovery_target_action = 'promote'. If we look at the test, it
> is doing the following when getting the TXID that is used as recovery
> target:
> $node_master->safe_psql('postgres',
> 	"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
> my $recovery_txid =
> 	$node_master->safe_psql('postgres', "SELECT txid_current()");
> my $lsn2 =
> 	$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
> What I think we had better do is reverse the calls
> pg_current_xlog_location() and txid_current() so as we are sure that
> the LSN we track for replay is lower than the real target LSN. The
> same problem exists when defining the timestamp target.
>
> The patch attached does that,

Why not capture both items in a single select, such as in the attached patch?

--
Álvaro Herrera  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/test/recovery/t/003_recovery_targets.pl b/src/test/recovery/t/003_recovery_targets.pl
index 20b878e..3864e60 100644
--- a/src/test/recovery/t/003_recovery_targets.pl
+++ b/src/test/recovery/t/003_recovery_targets.pl
@@ -66,17 +66,16 @@ $node_master->backup('my_backup');
 # target TXID.
 $node_master->safe_psql('postgres',
 	"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
-my $recovery_txid =
-  $node_master->safe_psql('postgres', "SELECT txid_current()");
-my $lsn2 =
-  $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
+my $ret =
+  $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location(), txid_current();");
+my ($lsn2, $recovery_txid) = split /\|/, $ret;
 
 # More data, with recovery target timestamp
 $node_master->safe_psql('postgres',
 	"INSERT INTO tab_int VALUES (generate_series(2001,3000))");
-my $recovery_time = $node_master->safe_psql('postgres', "SELECT now()");
-my $lsn3 =
-  $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
+$ret =
+  $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location(), now();");
+my ($lsn3, $recovery_time) = split /\|/, $ret;
 
 # Even more data, this time with a recovery target name
 $node_master->safe_psql('postgres',
Re: [HACKERS] Fwd: [BUGS] BUG #14247: COMMENT is restored on wrong database
Moving to -hackers since this is getting into details. Bug Report #14247.

On Tue, Aug 2, 2016 at 4:58 PM, Tom Lane wrote:
> "David G. Johnston" writes:
>> Do you have an opinion on the following?
>
> I think the real problem in this area is that the division of labor
> we have between pg_dump and pg_dumpall isn't very good for real-world
> use cases.

I won't disagree here.

> I'm not at all sure what a better answer would look like.
> But I'd rather see us take two steps back and consider the issue
> holistically instead of trying to band-aid individual misbehaviors.
>
> The fact that pg_dump is emitting COMMENT ON DATABASE at all is
> fundamentally wrong given the existing division-of-labor decisions,
> namely that pg_dump is responsible for objects within a database,
> not for database-level properties.

But I have to disagree with this in the presence of the --create flag.

> It's entirely possible that some feature like COMMENT ON CURRENT DATABASE
> would be a necessary component of a holistic solution,
> [ various other ON CURRENT commands elided ]

In reading the code for the public schema bug, I never fully got my head around how dump/restore interacts with multiple databases. Specifically, in RestoreArchive, why we basically continue instead of breaking once we've encountered the "DATABASE" entry and decided that we need to drop the DB due to (createDB && dropSchema).

OTOH, seeing the code for the public schema exception, I'm inclined to think we could add something like the following just after the public schema block just patched:

/*
 * If the user didn't request that a new database be created, skip emitting
 * commands targeting the current database, as it may not have the same
 * name as the database being restored into.
 *
 * XXX This too is ugly, but the treatment of globals is not presently
 * amenable to pretty solutions.
 */
if (!ropt->createDB)
{
    if (strcmp(te->desc, "DATABASE") == 0)  /* this is quite possibly too broad a
                                             * check... I'm always concerned about
                                             * false positives in cases like these */
        return;
}

Though reading it now, that seems a bit like throwing out the baby with the bath water; but then again, if they are not requesting that a database be created and they happen to have a database of the same name already in place, it would be unexpected that it would not have been properly configured.

Assuming I'm not missing something here, it seems like an easy and internally consistent hack/fix for a rare but real concern. However, since the existing code doesn't provoke an error, it doesn't seem like something that should be back-patched (I'm not convinced either way though).

David J.
Re: [HACKERS] Why we lost Uber as a user
Stephen Frost writes:
> With physical replication, there is the concern that a bug in *just* the
> physical (WAL) side of things could cause corruption.

Right. But with logical replication, there's the same risk that the master's state could be fine but a replication bug creates corruption on the slave. Assuming that the logical replication works by issuing valid SQL commands to the slave, one could hope that this sort of "corruption" only extends to having valid data on the slave that fails to match the master. But that's still not a good state to be in. And to the extent that performance concerns lead the implementation to bypass some levels of the SQL engine, you can easily lose that guarantee too.

In short, I think Uber's position that logical replication is somehow more reliable than physical is just wishful thinking. If anything, my money would be on the other way around: there's a lot less mechanism that can go wrong in physical replication. Which is not to say there aren't good reasons to use logical replication; I just do not believe that one.

			regards, tom lane
Re: [HACKERS] parallel.c is not marked as test covered
On Tue, Aug 2, 2016 at 1:17 PM, Peter Eisentraut wrote:
> On 6/19/16 10:00 PM, Robert Haas wrote:
>>> Independent of that, it would help testing things like this if we allowed
>>> setting max_worker_processes to 0, instead of the current minimum 1. Is
>>> there a reason for the minimum of 1?
>> I believe that's pure brain fade on my part. I think the minimum should be 0.
>
> Fixed.

Thank you.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] old_snapshot_threshold allows heap:toast disagreement
On Thu, Jul 28, 2016 at 7:29 PM, Andres Freund wrote:
> I think just iterating through the active snapshots would have been
> fine. Afaics there's no guarantee that the first active snapshot pushed
> is the relevant one - in contrast to registered ones, which are ordered
> by virtue of the heap.

After a bit of scrutiny, I don't think this is a problem, and I don't want to introduce belt-and-suspenders protections against can't-happen conditions that may get cargo-culted into other places. (How's that for cramming as many Tom-Lane-isms as possible into one sentence? More might be done, but it would be some sort of bespoke crock of the first water.)

Anyway, the reason I think it's OK is that PushActiveSnapshot() does not copy the input snapshot unless it's CurrentSnapshot or SecondarySnapshot -- so every snapshot that gets pushed on the active snapshot stack is either one that's already registered or one that was just taken (because CurrentSnapshot will get overwritten on the next call to GetTransactionSnapshot). In the former case, it's included in the registered snapshots, so it doesn't even matter whether we notice that it is also on the active snapshot stack. In the latter case, having just been taken, it must be newer than the oldest registered snapshot, which was necessarily taken earlier.

I think that the only time it matters whether we even look at the active snapshot stack is when there are no registered snapshots whatsoever. In many cases -- like QueryDescs -- we register the snapshot first and then later put it on the active snapshot stack, but there are some things like CLUSTER that only make the snapshot active and don't register it. But if they do that, they can only do it in first-in, first-out fashion.

>> >> But there's a bit of spooky action at a
>> >> distance: if we don't see any snapshots, then we have to assume that
>> >> the scan is being performed with a non-MVCC snapshot and therefore we
>> >> don't need to worry. But there's no way to verify that via an
>> >> assertion, because the connection between the relevant MVCC snapshot
>> >> and the corresponding TOAST snapshot is pretty indirect, mediated only
>> >> by snapmgr.c.
>> >
>> > Hm. Could we perhaps assert that the session has a valid xmin?
>>
>> I don't think so. CLUSTER?
>
> That should have one during any toast lookups afaics - the relevant code is
>
> 	/* Start a new transaction for each relation. */
> 	StartTransactionCommand();
> 	/* functions in indexes may want a snapshot set */
> 	PushActiveSnapshot(GetTransactionSnapshot());
> 	/* Do the job. */
> 	cluster_rel(rvtc->tableOid, rvtc->indexOid, true,
> 				stmt->verbose);
> 	PopActiveSnapshot();
> 	CommitTransactionCommand();
>
> right? And
>
> 	Snapshot
> 	GetSnapshotData(Snapshot snapshot)
> 	{
> 	...
> 		if (!TransactionIdIsValid(MyPgXact->xmin))
> 			MyPgXact->xmin = TransactionXmin = xmin;
>
> sets xmin.

Yeah. Actually, consistent with the above, I discovered that as long as we consult both the active snapshot stack and the pairing heap of registered snapshots, it seems to be fine to just insist that we always get a snapshot back from the snapmgr and use that to initialize the TOAST snapshot. So I did it that way in the attached version.

New patch attached. Barring objections, I will commit this tomorrow. If there are objections, I will send another status update tomorrow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

ost-heap-toast-disagreement-v3.patch (Description: application/download)
Re: [HACKERS] PostgreSQL 10 kick-off
Peter Eisentraut wrote:
> On 8/1/16 1:08 PM, Fabrízio de Royes Mello wrote:
>> What knowledge is expected for a CFM? I would really like to help and
>> also learn more about our development.
>
> If you search the wiki for "commitfest" you will find several pages of
> varying outdatedness that describe the process.

I have cleaned up the wiki a bit, so the authoritative info should be at https://wiki.postgresql.org/wiki/CommitFest_Checklist

Note that what that page describes is not exactly what we do nowadays; if someone wants to edit that page to better reflect reality, be my guest. Also see the old (2009) page where an older workflow was described: https://wiki.postgresql.org/index.php?title=Running_a_CommitFest&oldid=9369

Perhaps it'd be good to resurrect some of that content, in modernized form, into the current page.

Thanks

--
Álvaro Herrera  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] HandleParallelMessages contains CHECK_FOR_INTERRUPTS?
I wrote:
> Alvaro Herrera writes:
>> I notice you just removed the CHECK_FOR_INTERRUPTS in
>> HandleParallelMessages(). Did you notice that HandleParallelMessages
>> calls shm_mq_receive(), which calls shm_mq_receive_bytes(), which
>> contains a CHECK_FOR_INTERRUPTS() call?

After study, I believe that that CHECK_FOR_INTERRUPTS() is unreachable given that HandleParallelMessages passes nowait = true. But it's not unlikely that future changes in shm_mq.c might introduce such calls that are reachable.

> I wonder whether we should make use of HOLD_INTERRUPTS/RESUME_INTERRUPTS
> to avoid the recursion scenario here.

I concluded that that would be good future-proofing, whether or not it's strictly necessary today, so I pushed it.

			regards, tom lane
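For context, the pattern in question looks roughly like this (a sketch; the real code lives in HandleParallelMessages):

    /* prevent CHECK_FOR_INTERRUPTS() from recursing into message handling */
    HOLD_INTERRUPTS();

    /* ... drain the parallel-worker message queue via shm_mq_receive() ... */

    RESUME_INTERRUPTS();

While InterruptHoldoffCount is nonzero, CHECK_FOR_INTERRUPTS() is effectively a no-op, so any such call buried inside shm_mq.c cannot re-enter the handler.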
Re: [HACKERS] Why we lost Uber as a user
* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Aug 2, 2016 at 3:07 PM, Alfred Perlstein wrote:
>> You are quite technical, my feeling is that you will understand it, however
>> it will need to be a self learned lesson.
>
> I don't know what this is supposed to mean, but I think that Geoff's
> point is somewhat valid. No matter how you replicate data, there is
> always the possibility that you will replicate any corruption along
> with the data - or that your copy will be unfaithful to the original.

I believe what Geoff was specifically getting at is probably best demonstrated with an example. Consider a bug in the btree index code which will accept a value but not store it correctly:

INSERT INTO mytable (indexed_column) VALUES (-10);
/* oops, bug, this value gets stored in the wrong place in the btree */

We happily accept the record and insert it into the btree index, but that insert is incorrect and results in the btree being corrupted, because some bug doesn't handle such values correctly. In such a case, either approach to replication (replicating the query statement, or replicating the changes to the btree page exactly) would result in corruption on the replica. The above represents a bug in *just* the btree side of things (the physical replication did its job correctly, even though the result is a corrupted index on the replica).

With physical replication, there is the concern that a bug in *just* the physical (WAL) side of things could cause corruption. That is, we correctly accept and store the value on the primary, but the records generated to send that data to the replica are incorrect and result in an invalid state on the replica.

Of course, a bug in the physical side of things which caused corruption would mean that *crash recovery* would also cause corruption. As I understand it, that same concern exists for MySQL, so moving to logical replication doesn't actually mean you don't need to worry about bugs in the crash recovery side of things, assuming you depend on the database to come back up in a consistent manner after a crash.

Thanks!

Stephen
Re: [HACKERS] No longer possible to query catalogs for index capabilities?
On Mon, Aug 1, 2016 at 10:19:43AM -0400, Tom Lane wrote:
> As far as I understood Andrew's use case, he was specifically *not*
> interested in a complete representation of an index definition, but
> rather about whether it had certain properties that would be of
> interest to query-constructing applications.

Would it be helpful to output an array of strings representing the index definition?

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility
On Tue, Aug 2, 2016 at 2:33 PM, Bruce Momjian wrote:
> On Tue, Jul 26, 2016 at 05:42:43PM -0400, Chapman Flack wrote:
>> Even so, I'd be curious whether it would break anything to have
>> xlp_pageaddr simply set to InvalidXLogRecPtr in the dummy zero
>> pages written to fill out a segment. At least until it's felt
>> that archive_timeout has been so decidedly obsoleted by streaming
>> replication that it is removed, and the log-tail zeroing code
>> with it.
>>
>> That at least would eliminate the risk of anyone else repeating
>> my astonishment. :) I had read that 9.4 added built-in log-zeroing
>> code, and my first reaction was "cool! that may make the compression
>> technique we're using unnecessary, but certainly can't make it worse"
>> only to discover that it did, by ~ 300x, becoming now 3x *worse* than
>> plain gzip, which itself is ~ 100x worse than what we had.
>
> My guess is that the bytes are there to detect problems where a 512-byte
> disk sector is zeroed by a disk failure. I don't see us changing that
> for the use-case you have described.

Is there actually any code that makes such a check? I'm inclined to doubt that was the motivation, though admittedly we're both speculating about the contents of Heikki's brain, a tricky proposition on a good day.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Why we lost Uber as a user
On Tue, Aug 2, 2016 at 3:07 PM, Alfred Perlstein wrote:
> You are quite technical, my feeling is that you will understand it, however
> it will need to be a self learned lesson.

I don't know what this is supposed to mean, but I think that Geoff's point is somewhat valid. No matter how you replicate data, there is always the possibility that you will replicate any corruption along with the data - or that your copy will be unfaithful to the original. The possible advantage of logical replication rather than physical replication is that any errors you replicate will be logical errors rather than physical errors - so if the heap gets out of step with the indexes on the master, the same problem will not necessarily occur on the slave.

On the flip side, despite what Uber found in their environment, physical replication tends to be high-performance because the replay is dead simple. Andres and others have done a good job making our logical decoding facility fast, but I believe it's still slower than plain old physical replication and probably always will be, and the trigger-based logical replication solutions are slower still.

Consequently, I believe that both physical and logical replication have advantages, and that's why we should support both of them. Then, each individual user can make the trade-offs they prefer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility
On 08/02/2016 02:33 PM, Bruce Momjian wrote:
> My guess is that the bytes are there to detect problems where
> a 512-byte disk sector is zeroed by a disk failure.

Does that seem plausible? (a) there is only one such header for every 16 512-byte disk sectors, so it only affords a 6% chance of detecting a zeroed sector, and (b) the header contains other non-zero values in fields other than xlp_pageaddr, so the use of a fixed value for _that field_ in zeroed tail blocks would not prevent (or even reduce the 6% probability of) detecting a sector zeroed by a defect.

-Chap
Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries
Karan Sikka writes:
> Just for the record, I'm leaning to the side of "not worth it". My
> thoughts are, if you are at a scale where you care about this 20%
> speedup, you would want to go all the way to an indexed structure,
> because the gains you would realize would exceed 20%, and 20% may not be
> a sufficient speedup anyway.

If I'm reading your test case correctly, 20% is actually a rather impressive number, because it's 20% *overall* gain on queries that will also involve TOAST fetch and decompress on the source data. (Decompress definitely, and I'm guessing those 5K strings don't compress well enough to avoid getting pushed out-of-line; though it might be worth repeating the test with chunks of 10K or 20K to be sure.) So the percentage improvement in the LIKE test proper must have been a lot more than that.

However, I'm dubious that LIKE patterns with long fixed substrings are a common use-case, so I'm afraid that this might be quite a lot of work for something that won't much benefit most users. I'm also worried that the setup costs might be enough to make it a net loss in many cases. There are probably ways to amortize the setup costs, since typical scenarios involve the same LIKE pattern across many rows, but implementing that would add even more work.

(Having said that, I've had a bee in my bonnet for a long time about removing per-row setup cost for repetitive regex matches, and whatever infrastructure that needs would work for this too. And for strpos' B-M-H setup, it looks like. So this might be something to look into with a suitably wide view of what the problem is.)

Not sure what advice to give you here. I think this is in the grey zone where it's hard to be sure whether it's worth putting work into.

			regards, tom lane
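For readers unfamiliar with the algorithm being discussed, here is a minimal Boyer-Moore-Horspool sketch (a standalone illustration, not the strpos implementation in the backend):

    /* Return offset of first occurrence of needle in hay, or -1. */
    static int
    bmh_search(const unsigned char *hay, int haylen,
               const unsigned char *needle, int ndlen)
    {
        int     skip[256];
        int     i;

        if (ndlen == 0 || ndlen > haylen)
            return (ndlen == 0) ? 0 : -1;

        /* The setup cost Tom mentions: one pass to build the skip table. */
        for (i = 0; i < 256; i++)
            skip[i] = ndlen;
        for (i = 0; i < ndlen - 1; i++)
            skip[needle[i]] = ndlen - 1 - i;

        /* Scan, jumping ahead by the skip distance of the window's last byte. */
        for (i = 0; i + ndlen <= haylen; i += skip[hay[i + ndlen - 1]])
        {
            if (memcmp(hay + i, needle, ndlen) == 0)
                return i;
        }
        return -1;
    }

Longer needles allow bigger skips, which matches the benchmark shape above: clear wins on long substrings, little or nothing on short ones.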
Re: [HACKERS] Slowness of extended protocol
> I really don't get what's problematic with posting a message on a mailing
> list about a potential performance issue, to try to get people's reactions,
> without diving into profiling right away

"Benchmark data is a perfect substitute for benchmarking results. Data is easy to misinterpret, so try not to do that." (see [1], and slide 73 of [2])

The key points are:
0) It is extremely easy to take a wrong turn unless you analyze the benchmark results.
1) If you (or someone else) thinks "ok, the original email did meet its goal, as Vladimir did provide a patch with measurements", then I failed. The only reason for me doing the benchmark and patch was to teach you how to do that.
2) Have you seen the recent discussion "TODO item: Implement Boyer-Moore searching in LIKE queries" on the list? It does include relevant details right from the start. (see https://www.postgresql.org/message-id/CALkFZpcbipVJO%3DxVvNQMZ7uLUgHzBn65GdjtBHdeb47QV4XzLw%40mail.gmail.com )

> I'm not a PostgreSQL developer

Neither am I.

> I have other urgent things to do

So do I.

> and don't even spend most of my programming time in C.

Java and SQL cover 99% of my time.

[1]: https://twitter.com/shipilev/status/760387758246486017
[2]: https://shipilev.net/talks/jvmls-July2014-benchmarking.pdf

Vladimir
Re: [HACKERS] Why we lost Uber as a user
> On Aug 2, 2016, at 2:33 AM, Geoff Winkless wrote:
>
>> On 2 August 2016 at 08:11, Alfred Perlstein wrote:
>>> On 7/2/16 4:39 AM, Geoff Winkless wrote:
>>> I maintain that this is a nonsense argument. Especially since (as you
>>> pointed out and as I missed first time around) the bug actually occurred at
>>> different records on different slaves, so he invalidates his own point.
>
>> Seriously?
>
> No, I make a habit of spouting off random arguments to a list full of
> people whose opinions I massively respect purely for kicks. What do
> you think?
>
>> There's a valid point here, you're sending over commands at the block level,
>> effectively "write to disk at this location" versus "update this record
>> based on PK", obviously this has some drawbacks that are reason for concern.
>
> Writing values directly into file offsets is only problematic if
> something else has failed that has caused the file to be an inexact
> copy. If a different bug occurred that caused the primary key to be
> corrupted on the slave (or indeed the master), PK-based updates would
> exhibit similar propagation errors.
>
> To reiterate my point, Uber's described problem came about because of
> a bug. Every software has bugs at some point in its life; to pretend
> otherwise is simply naive. I'm not trying to excuse the bug, or to
> belittle the impact that such a bug has on data integrity or on Uber
> or indeed on the reputation of PostgreSQL. While I'm prepared to
> accept (because I have a job that requires I spend time on things
> other than digging through obscure reddits and mailing lists to
> understand more fully the exact cause) that in _this particular
> instance_ the bug was propagated because of the replication mechanism
> (although I'm still dubious about that, as per my comment above), that
> does _not_ preclude other bugs propagating in a statement-based
> replication. That's what I said is a nonsense argument, and no-one has
> yet explained in what way that's incorrect.
>
> Geoff

Geoff,

You are quite technical; my feeling is that you will understand it, however it will need to be a self-learned lesson.

-Alfred
Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries
On Tue, Aug 2, 2016 at 1:56 PM, Alvaro Herrera wrote:
>>> How desirable is this feature? To begin answering that question,
>>> I did some initial benchmarking with an English text corpus to find how much
>>> faster BMH (Boyer-Moore-Horspool) would be compared to LIKE, and the results
>>> were as follows: BMH can be up to 20% faster than LIKE, but for short
>>> substrings, it's roughly comparable or slower.
>>>
>>> Here are the results visualized:
>>>
>>> http://ctrl-c.club/~ksikka/pg/like-strpos-short-1469975400.png
>>> http://ctrl-c.club/~ksikka/pg/like-strpos-long-1469975400.png
>>
>> Based on these results, this looks to me like a pretty unexciting
>> thing upon which to spend time.
>
> Uh, a 20% difference is "unexciting" for you? I think it's interesting.
> Now, really, users shouldn't be running LIKE on constant strings all the
> time but rather use some sort of indexed search, but once in a while
> there is a need to run some custom query and you need to LIKE-scan a
> large portion of a table. For those cases an algorithm that performs
> 20% better is surely welcome.

Sure, but an algorithm that performs 20% faster in the best case and worse in some other cases is not the same thing as a 20% across-the-board performance improvement. I guess if we had a way of deciding which algorithm to use in particular cases it might make sense.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries
Yeah, you make a fair point. I was hoping more people would chime in with strong opinions to make the decision easier, but perhaps the lack of noise indicates this is a low-priority feature. I will ask around in my coworker circles to see what they think; could you do the same? Just for the record, I'm leaning to the side of "not worth it". My thoughts are: if you are at a scale where you care about this 20% speedup, you would want to go all the way to an indexed structure, because the gains you would realize would exceed 20%, and 20% may not be a sufficient speedup anyway. On Tue, Aug 2, 2016 at 1:56 PM, Alvaro Herrera wrote: > Robert Haas wrote: > > On Mon, Aug 1, 2016 at 1:19 PM, Karan Sikka wrote: > > > Following the patch to implement strpos with Boyer-Moore-Horspool, > > > it was proposed we bring BMH to LIKE as well. > > > > > > Original thread: > > > https://www.postgresql.org/message-id/flat/27645.1220635769%40sss.pgh.pa.us#27645.1220635...@sss.pgh.pa.us > > > > > > I'm a first time hacker and I found this task on the TODO list. It seemed > > > interesting and isolated enough. But after looking at the code in > > > like_match.c, it's clearly a non-trivial task. > > > > > > How desirable is this feature? To begin answering that question, > > > I did some initial benchmarking with an English text corpus to find how much > > > faster BMH (Boyer-Moore-Horspool) would be compared to LIKE, and the results > > > were as follows: BMH can be up to 20% faster than LIKE, but for short > > > substrings, it's roughly comparable or slower. > > > > > > Here are the results visualized: > > > > > > http://ctrl-c.club/~ksikka/pg/like-strpos-short-1469975400.png > > > http://ctrl-c.club/~ksikka/pg/like-strpos-long-1469975400.png > > > > Based on these results, this looks to me like a pretty unexciting > > thing upon which to spend time. > > Uh, a 20% difference is "unexciting" for you? I think it's interesting. > Now, really, users shouldn't be running LIKE on constant strings all the > time but rather use some sort of indexed search, but once in a while > there is a need to run some custom query and you need to LIKE-scan a > large portion of a table. For those cases an algorithm that performs > 20% better is surely welcome. > > I wouldn't be so quick to dismiss this. > > Of course, it needs to work in all cases, or failing that, be able to > fall back to the original code if it cannot support some corner case. > > -- > Álvaro Herrera http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >
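For readers following the thread, here is a minimal, self-contained C sketch of the Boyer-Moore-Horspool idea being benchmarked; it is illustrative only (bmh_search is a made-up name, and this is not the proposed like_match.c patch). The skip table lets the scan advance by more than one byte on a mismatch, which is where the speedup on longer needles comes from:

#include <stddef.h>
#include <string.h>

/* Illustrative BMH substring search: returns first match or NULL. */
static const char *
bmh_search(const char *haystack, size_t hlen,
           const char *needle, size_t nlen)
{
    size_t      skip[256];
    size_t      i;

    if (nlen == 0)
        return haystack;
    if (nlen > hlen)
        return NULL;

    /* Default shift is the whole needle length. */
    for (i = 0; i < 256; i++)
        skip[i] = nlen;
    /* Needle characters (except the last) allow smaller shifts. */
    for (i = 0; i < nlen - 1; i++)
        skip[(unsigned char) needle[i]] = nlen - 1 - i;

    i = 0;
    while (i + nlen <= hlen)
    {
        if (memcmp(haystack + i, needle, nlen) == 0)
            return haystack + i;
        /* Shift by the skip value of the last byte in the window. */
        i += skip[(unsigned char) haystack[i + nlen - 1]];
    }
    return NULL;
}

For short needles the per-window memcmp dominates and the shifts are small, which is consistent with the benchmark result above that BMH only wins on longer substrings.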
Re: [HACKERS] PostmasterContext survives into parallel workers!?
Alvaro Herrera writes: > Tom Lane wrote: >> It looks to me like the reason for it is simply not having bothered to >> copy the rw->rw_worker data to somewhere that would survive deletion >> of the PostmasterContext. I wonder though if anyone remembers a more >> fundamental reason? Surely the bgworker is not supposed to touch any >> of the rest of the BackgroundWorkerList? > I just checked BDR, which is the more complex code using workers I know > of, and I don't see any reason why this cannot be changed. The attached patch passes "make check-world" for me. Can you check it against BDR? (I'd be hesitant to back-patch it in any case, but I think it's okay for HEAD unless we can easily find something it breaks.) regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 19d11e0..e48703a 100644
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*** do_start_bgworker(RegisteredBgWorker *rw
*** 5529,5537 ****
    /* Close the postmaster's sockets */
    ClosePostmasterPorts(false);

!   /* Do NOT release postmaster's working memory context */

-   MyBgworkerEntry = &rw->rw_worker;
    StartBackgroundWorker();
    break;
  #endif
--- 5529,5547 ----
    /* Close the postmaster's sockets */
    ClosePostmasterPorts(false);

!   /*
!    * Before blowing away PostmasterContext, save this bgworker's
!    * data where it can find it.
!    */
!   MyBgworkerEntry = (BackgroundWorker *)
!       MemoryContextAlloc(TopMemoryContext, sizeof(BackgroundWorker));
!   memcpy(MyBgworkerEntry, &rw->rw_worker, sizeof(BackgroundWorker));
!
!   /* Release postmaster's working memory context */
!   MemoryContextSwitchTo(TopMemoryContext);
!   MemoryContextDelete(PostmasterContext);
!   PostmasterContext = NULL;
    StartBackgroundWorker();
    break;
  #endif
Re: [HACKERS] Increasing timeout of poll_query_until for TAP tests
Michael Paquier wrote: > On Tue, Aug 2, 2016 at 10:28 AM, Michael Paquier wrote: > > There is still an issue with pg_basebackup when testing stream mode > > and replication slots. I am digging into this one now.. > > After 5 hours running this test in a row and 30 attempts torturing > hamster with a script running make check in an infinite loop with set > -e, I am giving up on that for the time being... I have added the patch > to make the tests more stable to the next CF so that it is not forgotten: > https://commitfest.postgresql.org/10/693/ Great, thanks for the effort, will push. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility
On Tue, Jul 26, 2016 at 05:42:43PM -0400, Chapman Flack wrote: > Even so, I'd be curious whether it would break anything to have > xlp_pageaddr simply set to InvalidXLogRecPtr in the dummy zero > pages written to fill out a segment. At least until it's felt > that archive_timeout has been so decidedly obsoleted by streaming > replication that it is removed, and the log-tail zeroing code > with it. > > That at least would eliminate the risk of anyone else repeating > my astonishment. :) I had read that 9.4 added built-in log-zeroing > code, and my first reaction was "cool! that may make the compression > technique we're using unnecessary, but certainly can't make it worse" > only to discover that it did, by ~ 300x, becoming now 3x *worse* than > plain gzip, which itself is ~ 100x worse than what we had. My guess is that the bytes are there to detect problems where a 512-byte disk sector is zeroed by a disk failure. I don't see us changing that for the use-case you have described. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [HACKERS] PostmasterContext survives into parallel workers!?
Tom Lane wrote: > Robert Haas writes: > > On Mon, Aug 1, 2016 at 4:18 PM, Tom Lane wrote: > >> Now, I'm undecided whether to flush that context only in parallel workers, > >> or to try to make it go away for all bgworkers of any stripe. The latter > >> seems a little better from a security standpoint, but I wonder if anyone > >> has a use-case where that'd be a bad idea? > > > I think it would be better to get rid of it in all bgworkers. > > I looked into this, and immediately found this in the spot in postmaster.c > that would be the obvious place to kill the PostmasterContext: > > /* Do NOT release postmaster's working memory context */ > > MyBgworkerEntry = &rw->rw_worker; > StartBackgroundWorker(); > > This comment was in Alvaro's original commit adding bgworkers (da07a1e8). Hm, I don't have the development branch on this laptop. I might find some evidence in the old one, but I won't be able to reach it till tonight. > It looks to me like the reason for it is simply not having bothered to > copy the rw->rw_worker data to somewhere that would survive deletion > of the PostmasterContext. I wonder though if anyone remembers a more > fundamental reason? Surely the bgworker is not supposed to touch any > of the rest of the BackgroundWorkerList? I just checked BDR, which is the more complex code using workers I know of, and I don't see any reason why this cannot be changed. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] PostgreSQL 10 kick-off
On 8/1/16 1:08 PM, Fabrízio de Royes Mello wrote: > What knowledge is expected for a CFM? I really would like to help and > also learn more about our development. If you search the wiki for "commitfest" you will find several pages of varying outdatedness that describe the process. Mainly, you will need to be nagging a lot of people. ;-) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] PostmasterContext survives into parallel workers!?
Robert Haas writes: > On Mon, Aug 1, 2016 at 4:18 PM, Tom Lane wrote: >> Now, I'm undecided whether to flush that context only in parallel workers, >> or to try to make it go away for all bgworkers of any stripe. The latter >> seems a little better from a security standpoint, but I wonder if anyone >> has a use-case where that'd be a bad idea? > I think it would be better to get rid of it in all bgworkers. I looked into this, and immediately found this in the spot in postmaster.c that would be the obvious place to kill the PostmasterContext: /* Do NOT release postmaster's working memory context */ MyBgworkerEntry = &rw->rw_worker; StartBackgroundWorker(); This comment was in Alvaro's original commit adding bgworkers (da07a1e8). It looks to me like the reason for it is simply not having bothered to copy the rw->rw_worker data to somewhere that would survive deletion of the PostmasterContext. I wonder though if anyone remembers a more fundamental reason? Surely the bgworker is not supposed to touch any of the rest of the BackgroundWorkerList? regards, tom lane
Re: [HACKERS] Slowness of extended protocol
On Mon, Aug 1, 2016 at 12:12 PM, Vladimir Sitnikov < sitnikov.vladi...@gmail.com> wrote: > The attached patch passes `make check` and it gains 31221 -> 33547 improvement for "extended pgbench of SELECT 1". > > The same version gains 35682 in "simple" mode, and "prepared" mode achieves 46367 (just in case). That's great, thanks for looking into it! I hope your patch gets merged. > Shay, why don't you use a profiler? Seriously. > I'm afraid "iterate the per-message loop in PostgresMain five times not once" /"just discussing what may or may not be a problem..." is just hand-waving. > > Come on, it is not that hard. I really don't get what's problematic with posting a message on a mailing list about a potential performance issue, to try to get people's reactions, without diving into profiling right away. I'm not a PostgreSQL developer, have other urgent things to do and don't even spend most of my programming time in C.
Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries
Robert Haas wrote: > On Mon, Aug 1, 2016 at 1:19 PM, Karan Sikka wrote: > > Following the patch to implement strpos with Boyer-Moore-Horspool, > > it was proposed we bring BMH to LIKE as well. > > > > Original thread: > > https://www.postgresql.org/message-id/flat/27645.1220635769%40sss.pgh.pa.us#27645.1220635...@sss.pgh.pa.us > > > > I'm a first time hacker and I found this task on the TODO list. It seemed > > interesting and isolated enough. But after looking at the code in > > like_match.c, it's clearly a non-trivial task. > > > > How desirable is this feature? To begin answering that question, > > I did some initial benchmarking with an English text corpus to find how much > > faster BMH (Boyer-Moore-Horspool) would be compared to LIKE, and the results > > were as follows: BMH can be up to 20% faster than LIKE, but for short > > substrings, it's roughly comparable or slower. > > > > Here are the results visualized: > > > > http://ctrl-c.club/~ksikka/pg/like-strpos-short-1469975400.png > > http://ctrl-c.club/~ksikka/pg/like-strpos-long-1469975400.png > > Based on these results, this looks to me like a pretty unexciting > thing upon which to spend time. Uh, a 20% difference is "unexciting" for you? I think it's interesting. Now, really, users shouldn't be running LIKE on constant strings all the time but rather use some sort of indexed search, but once in a while there is a need to run some custom query and you need to LIKE-scan a large portion of a table. For those cases an algorithm that performs 20% better is surely welcome. I wouldn't be so quick to dismiss this. Of course, it needs to work in all cases, or failing that, be able to fall back to the original code if it cannot support some corner case. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] MSVC pl-perl error message is not verbose enough
On Mon, Aug 1, 2016 at 9:39 PM, Michael Paquier wrote: > On Tue, Aug 2, 2016 at 2:08 AM, Robert Haas wrote: > > Did you add this to the next CommitFest? > > I have added it here: > https://commitfest.postgresql.org/10/691/ > John, it would be good if you could get a community account and add > your name to this patch as its author. I could not find you. > Done. If there's anything else that I should do, please let me know. I'd be glad to help out. Thanks! -John
Re: [HACKERS] parallel.c is not marked as test covered
On 6/19/16 10:00 PM, Robert Haas wrote: >> Independent of that, it would help testing things like this if we allowed >> setting max_worker_processes to 0, instead of the current minimum 1. Is >> there a reason for the minimum of 1? > I believe that's pure brain fade on my part. I think the minimum should be 0. Fixed. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] pg_size_pretty, SHOW, and spaces
On Tue, Aug 2, 2016 at 11:29:01AM +0200, Christoph Berg wrote: > > The issue is that we output "10 bytes", not "10bytes", but for units we > > use "977KB". That seems inconsistent, but it is the normal policy > > people use. I think this is because "977KB" is really "977K bytes", but > > we just append the "B" after the "K" for brevity. > > It's the other way round: > > https://en.wikipedia.org/wiki/International_System_of_Units#General_rules > > | The value of a quantity is written as a number followed by a space > | (representing a multiplication sign) and a unit symbol; e.g., 2.21 kg > [...] > > I'd opt to omit the space anywhere where the value is supposed to be > fed back into the config (SHOW, --parameters), but use the "pretty" > format with space everywhere otherwise (documentation, memory counts > in explain output, pg_size_pretty() etc.) Yes, that's a strong argument for using a space. I have adjusted the patch to use spaces in all reasonable places. Patch attached, which I have gzipped because it was 133 KB. (Ah, see what I did there?) :-) I am thinking of leaving the 9.6 docs alone as I have already made them consistent (no space) with minimal changes. We can make it consistent the other way in PG 10. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription + kilo2.diff.gz
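For context, a toy C sketch of the kind of formatting pg_size_pretty does, written with the SI-style space discussed above ("977 kB" rather than "977kB"); the thresholds and truncating division here are simplified assumptions of mine, not the server's exact rounding logic:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Toy size formatter: scale down by 1024 while the magnitude stays
 * large, then print "<number> <unit>" with a space, per the SI rule
 * quoted above.
 */
static void
size_pretty(int64_t size, char *buf, size_t buflen)
{
    static const char *units[] = {"bytes", "kB", "MB", "GB", "TB"};
    int         u = 0;

    while ((size >= 10240 || size <= -10240) && u < 4)
    {
        size /= 1024;           /* no rounding; real code rounds */
        u++;
    }
    snprintf(buf, buflen, "%lld %s", (long long) size, units[u]);
}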
Re: [HACKERS] Wanting to learn about pgsql design decision
> "Tom" == Tom Lanewrites: >> - Why to read from a table, both a usage permission on the schema >> and a read access permission on the table is needed? Tom> Because the SQL standard says so. You'd think, but in fact it doesn't; the spec (at least 2008 and the 2011 drafts) has no concept of grantable permissions on schemas, and ties table ownership and schema ownership together. (See the definition of to see that there's nothing there for schemas, and the definition of for the fact that it's the schema owner who also owns the table and gets the initial grants on it, and and to confirm that only the schema owner can alter or drop the table. The access rules for only require permission on a table column, no mention of schemas.) -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] PATCH: two slab-like memory allocators
Hi, Back in the bug #14231 thread [1], dealing with performance issues in reorderbuffer due to an excessive number of expensive free() calls, I proposed to resolve that with a custom slab-like memory allocator, suitable for fixed-size allocations. I'd like to put this into the next CF, and it's probably too invasive a change to count as a bugfix anyway. [1] https://www.postgresql.org/message-id/flat/20160706185502.1426.28143%40wrigleys.postgresql.org This patch actually includes two new memory allocators (not one). Very brief summary (for more detailed explanation of the ideas, see comments at the beginning of slab.c and genslab.c): Slab * suitable for fixed-length allocations (other pallocs fail) * much simpler than AllocSet (no global freelist management etc.) * free space is tracked per block (using a simple bitmap) * which allows freeing the block once all chunks are freed (AllocSet will hold the memory forever, in the hope of reusing it) GenSlab --- * suitable for non-fixed-length allocations, but with chunks of mostly the same size (initially unknown, the context will tune itself) * a combination of AllocSet and Slab (or a sequence of Slab allocators) * the goal is to do most allocations in the Slab context * there's always a single 'current' Slab context, and every now and then it's replaced with a new generation (with the chunk size computed from recent requests) * the AllocSet context is used for chunks too large for the current Slab So none of this is meant as a universal replacement of AllocSet, but in the suitable cases the results seem really promising. For example for the simple test query in [1], the performance improvement is this:

  N  | master    | patched
   1 |     100ms |   100ms
   5 |   15000ms |   350ms
  10 |  146000ms |   700ms
  20 |         ? |  1400ms

That's a fairly significant improvement, and the submitted version of the patches should perform even better (~2x, IIRC). There's a bunch of TODOs - e.g. handling of realloc() calls in the GenSlab, and probably things I haven't thought about. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 0001-simple-slab-allocator-fixed-size-allocations.patch 0002-generational-slab-auto-tuning-allocator.patch
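To make the per-block free-space tracking concrete, here is a toy C sketch under my own naming (SlabBlockSketch and friends are hypothetical, not taken from the patch): each block holds a fixed number of equal-size chunks, a bitmap records which chunks are free, and once every chunk is free the whole block can be returned to the OS, which is exactly what AllocSet never does.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-block bookkeeping for a slab of fixed-size chunks. */
typedef struct SlabBlockSketch
{
    int      nchunks;       /* chunks per block */
    int      nfree;         /* currently free chunks */
    uint8_t *freemap;       /* one bit per chunk, 1 = free */
    char    *chunks;        /* the chunk storage itself */
} SlabBlockSketch;

static void
slab_mark_free(SlabBlockSketch *block, int chunkno)
{
    block->freemap[chunkno / 8] |= (uint8_t) (1 << (chunkno % 8));
    block->nfree++;
}

static bool
slab_block_is_empty(const SlabBlockSketch *block)
{
    /* Once every chunk is free, the whole block can be free()d. */
    return block->nfree == block->nchunks;
}

static int
slab_alloc_chunk(SlabBlockSketch *block)
{
    int i;

    for (i = 0; i < block->nchunks; i++)
    {
        if (block->freemap[i / 8] & (1 << (i % 8)))
        {
            block->freemap[i / 8] &= (uint8_t) ~(1 << (i % 8));
            block->nfree--;
            return i;       /* caller uses block->chunks + i * chunksize */
        }
    }
    return -1;              /* block is full */
}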
Re: [HACKERS] Why we lost Uber as a user
On 2 August 2016 at 15:27, Robert Haas wrote: > On Tue, Aug 2, 2016 at 5:51 AM, Amit Kapila wrote: >> Why do we need to add a record in all indexes if only the key >> corresponding to one of the indexes is updated? Basically, if the tuple >> can fit on the same page, why can't we consider it as HOT (or HPT - heap >> partial tuple or something like that), unless it updates the keys >> for all the indexes? Now, we can't consider such tuple versions for >> pruning as we do for HOT. The downside of this could be that we might >> need to retain some of the line pointers for more time (as we won't be >> able to reuse the line pointer till it is used in any one of the >> indexes, and those could be reused once we make the next non-HOT update). >> However, this should allow us not to update the indexes for which the >> corresponding column in the tuple is not updated. I think it is a basic >> premise that if any index column is updated then the update will be >> considered as non-HOT, so there is a good chance that I might be >> missing something here. > > Well, I think that the biggest advantage of a HOT update is the fact > that it enables HOT pruning. In other words, we're not primarily > trying to minimize index traffic; we're trying to make cleanup of the > heap cheaper. So this could certainly be done, but I'm not sure it > would buy us enough to be worth the engineering effort involved. (Hi, just back from leave and catching up on emails.) The above suggested design is something I've been working on for the last few days. In my design I referred to "intermediate root" tuples. I've got a detailed design for it and it works, yay!... but Pavan has managed to shoot it down with some accurate observations about it leading to an annoying accumulation of root pointers and complex logic to remove them. So I'm not pursuing it further at this stage. I'm writing up my conclusions around what we should do now, so should post later today. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Wanting to learn about pgsql design decision
Tal Walter writes: > - Why in the roles system, users are actually roles with a login attribute > and not a separate entity. Groups and users used to be separate concepts, actually, a long time ago. We got rid of that because it was a PITA; in particular, grants to groups had to be represented separately from grants to individual users. Looking at the git history, that happened in mid-2005, so you might trawl the pgsql-hackers archives from around that time for discussion. > - Why, to read from a table, are both a usage permission on the schema and a > read access permission on the table needed? Because the SQL standard says so. You might want to get a copy. While the "official" releases cost lots o' money, draft versions are freely available on the net, and are generally close enough. regards, tom lane
Re: [HACKERS] PostmasterContext survives into parallel workers!?
Robert Haas writes: > On Mon, Aug 1, 2016 at 7:42 PM, Tom Lane wrote: >> ... This is what makes me dubious that getting rid >> of doing work in the postmaster's signal handlers is really going >> to add any noticeable increment of safety. It might make the >> code look cleaner, but I'm afraid it's going to be a lot of churn >> for not much gain. > It's not just a cosmetic issue. > See > https://www.postgresql.org/message-id/CA+Tgmobr+Q2WgWeasdbDNefVwJkAGALxA=-vtegntqgl1v2...@mail.gmail.com > and d0410d66037c2f3f9bee45e0a2db9e47eeba2bb4. I do not think that patch particularly bears on this question. We have had logic bugs in the postmaster in the current coding structure, and we undoubtedly would still have logic bugs in the postmaster if it were rewritten to avoid using nontrivial signal handlers. They'd just be in different places. Also, the message you mention does a fine job of summarizing why getting out of the signal-handler-based design will be fraught with a bunch of new portability hazards. But what I was responding to here is the claim that we have portability hazards built into the current code as a consequence of doing work in the signal handlers. AFAICT, that's just FUD, and is sufficiently disproven by the many years of reliable service we've gotten out of the current design. Anyway, if someone is really motivated to rewrite the postmaster, I won't stand in the way. I'm just opining that it will be a lot of work that would be better expended elsewhere. regards, tom lane
Re: [HACKERS] Reviewing freeze map code
On Tue, Aug 02, 2016 at 02:10:29PM +0530, Amit Kapila wrote: > On Tue, Aug 2, 2016 at 11:19 AM, Noah Misch wrote: > > On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote: > >> On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund wrote: > >> > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote: > >> >> Consider the below scenario. > >> >> > >> >> Vacuum > >> >> a. acquires a cleanup lock for page - 10 > >> >> b. busy in checking visibility of tuples > >> >> --assume, here it takes some time and in the meantime Session-1 > >> >> performs steps (a) and (b) and starts waiting in step (c) > >> >> c. marks the page as all-visible (PageSetAllVisible) > >> >> d. unlock and release the buffer > >> >> > >> >> Session-1 > >> >> a. In heap_lock_tuple(), read buffer for page-10 > >> >> b. check PageIsAllVisible(), found page is not all-visible, so didn't > >> >> acquire the visibilitymap_pin > >> >> c. LockBuffer in ExclusiveMode - here it will wait for vacuum to > >> >> release the lock > >> >> d. Got the lock, but now the page is marked as all-visible, so ideally > >> >> need to recheck the page and acquire the visibilitymap_pin > >> > > >> > So, I've tried pretty hard to reproduce that. While the theory above is > >> > sound, I believe the relevant code-path is essentially dead for SQL > >> > callable code, because we'll always hold a buffer pin before even > >> > entering heap_update/heap_lock_tuple. > >> > > >> > >> It is possible that we don't hold any buffer pin before entering > >> heap_update() and or heap_lock_tuple(). For heap_update(), it is > >> possible when it enters via the simple_heap_update() path. For > >> heap_lock_tuple(), it is possible for an ON CONFLICT DO UPDATE statement > >> and maybe others as well. > > > > This is currently listed as a 9.6 open item. Is it indeed a regression in > > 9.6, or do released versions have the same defect? If it is a 9.6 regression, > > do you happen to know which commit, or at least which feature, caused it? > > > > Commit eca0f1db is the reason for this specific issue. [Action required within 72 hours. This is a generic notification.] The above-described topic is currently a PostgreSQL 9.6 open item. Andres, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a 9.6 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within 72 hours of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed in advance of shipping 9.6rc1 next week. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] http://www.postgresql.org/message-id/20160527025039.ga447...@tornado.leadboat.com
[HACKERS] PostgreSQL 10 Roadmaps
https://wiki.postgresql.org/wiki/PostgreSQL10_Roadmap -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Why we lost Uber as a user
On Tue, Aug 2, 2016 at 5:51 AM, Amit Kapila wrote: > Why do we need to add a record in all indexes if only the key > corresponding to one of the indexes is updated? Basically, if the tuple > can fit on the same page, why can't we consider it as HOT (or HPT - heap > partial tuple or something like that), unless it updates the keys > for all the indexes? Now, we can't consider such tuple versions for > pruning as we do for HOT. The downside of this could be that we might > need to retain some of the line pointers for more time (as we won't be > able to reuse the line pointer till it is used in any one of the > indexes, and those could be reused once we make the next non-HOT update). > However, this should allow us not to update the indexes for which the > corresponding column in the tuple is not updated. I think it is a basic > premise that if any index column is updated then the update will be > considered as non-HOT, so there is a good chance that I might be > missing something here. Well, I think that the biggest advantage of a HOT update is the fact that it enables HOT pruning. In other words, we're not primarily trying to minimize index traffic; we're trying to make cleanup of the heap cheaper. So this could certainly be done, but I'm not sure it would buy us enough to be worth the engineering effort involved. Personally, I think that incremental surgery on our current heap format to try to fix this is not going to get very far. If you look at the history of this, 8.3 was a huge release for timely cleanup of dead tuples. There was also significant progress in 8.4 as a result of 5da9da71c44f27ba48fdad08ef263bf70e43e689. As far as I can recall, we then made no progress at all in 9.0 - 9.4. We made a very small improvement in 9.5 with 94028691609f8e148bd4ce72c46163f018832a5b, but that's pretty niche. In 9.6, we have "snapshot too old", which I'd argue is potentially a large improvement, but it was big and invasive and will no doubt pose code maintenance hazards in the years to come; also, many people won't be able to use it or won't realize that they should use it. I think it is likely that further incremental improvements here will be quite hard to find, and the amount of effort will be large relative to the amount of benefit. I think we need a new storage format where the bloat is cleanly separated from the data rather than intermingled with it; every other major RDBMS works that way. Perhaps this is a case of "the grass is greener on the other side of the fence", but I don't think so. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] New version numbering practices
Greg Stark writes: > On Tue, Aug 2, 2016 at 2:10 AM, Alvaro Herrera > wrote: >> That said, I'm not opposed to REL_10 and so on. In 89 years there will >> be a problem with sorting REL_100 but I'm sure they can find a solution >> then, if computers still need humans to write programs for them. > It would be nice if there was a consistent way of referring to a > version regardless of how old it was. > There would be nothing stopping us from going back and adding tags for > existing versions. The discussion here is about branches, not tags. I don't know of any way to have an alias for a branch (though I'm no git expert). > It would also give a convenient chance > to fix the inconsistencies in how some of the older branches were > tagged. I thought we'd pretty much done that cleanup during the cvs->git conversion? regards, tom lane
Re: [HACKERS] Tracking wait event for latches
On Wed, Jun 8, 2016 at 10:11 AM, Michael Paquier wrote: > Yeah, that's as well my line of thought on the matter since the > beginning: keep it simple and done. What is written just after those > words is purely hand-waving and I have no way to prove it, but my > instinctive guess is that more than 90% of the real use cases where we > need to track the latch waits in pgstat would be covered without the > need of this extra shared memory infrastructure for extensions. So, I have done some extra tests with my patch to see, if I move the event reporting down to WaitEventSetWait, what the effect is on secure_read and secure_write. And the conclusion is that I am seeing no difference, so I changed the patch to the way suggested by Thomas. In this test, what I did was use the following pgbench script with -C via an SSL connection: \set id random(1,1000) As the script file is not empty, a connection to the server is opened, so this goes through secure_read at minimal cost on the backend. Also, I have changed the event ID notation as follows to be consistent with the routine names: WAL -> Wal PG -> Pg I also found that LATCH_RECOVERY_WAL_ALL was reporting "XLogAfterAvailable" as its name, which was incorrect. Finally, I have changed the patch so that it does not track "Latch" events, but "EventSet" events instead, per the suggestion of Thomas. "WaitEventSet" is too close to wait_event for my taste, so I shortened the suggestion. There were also some conflicts caused by the recent commit 887feefe, which are fixed. Attached is an updated patch. -- Michael wait-event-set-v2.patch
Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans
> -----Original Message----- > From: Etsuro Fujita [mailto:fujita.ets...@lab.ntt.co.jp] > Sent: Tuesday, August 02, 2016 9:36 PM > To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans > > On 2016/08/01 20:15, Etsuro Fujita wrote: > > I thought about the Relations line a bit more and noticed that there are > > cases where the table reference names for joining relations in the > > Relations line are printed incorrectly. Here is an example: > > > > postgres=# explain verbose select * from (select t1.a, t2.a from ft1 t1, > > ft2 t2 where t1.a = t2.a union select t1.a, t2.a from ft1 t1, ft2 t2 > > where t1.a = t2.a) as t(t1a, t2a); > > QUERY PLAN > > -------------------------------------------------------------------------- > > Unique (cost=204.12..204.13 rows=2 width=8) > > Output: t1.a, t2.a > > -> Sort (cost=204.12..204.12 rows=2 width=8) > > Output: t1.a, t2.a > > Sort Key: t1.a, t2.a > > -> Append (cost=100.00..204.11 rows=2 width=8) > > -> Foreign Scan (cost=100.00..102.04 rows=1 width=8) > > Output: t1.a, t2.a > > Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2) > > Remote SQL: SELECT r1.a, r2.a FROM (public.t1 r1 > > INNER JOIN public.t2 r2 ON (((r1.a = r2.a > > -> Foreign Scan (cost=100.00..102.04 rows=1 width=8) > > Output: t1_1.a, t2_1.a > > Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2) > > Remote SQL: SELECT r1.a, r2.a FROM (public.t1 r1 > > INNER JOIN public.t2 r2 ON (((r1.a = r2.a > > (14 rows) > > > > The table reference names for ft1 and ft2 in the Relations line for the > > second Foreign Scan should be t1_1 and t2_1 respectively. > > > > Another concern about the Relations line is that it represents just an > > internal representation of a pushed-down join, so it would not > > necessarily match a deparsed query shown in the Remote SQL line. Here > > is an example, which I found when working on supporting pushing down > > full outer join a lot more, by improving the deparsing logic so that > > postgres_fdw can build a remote query that involves subqueries [1], > > which I'll work on for 10.0: > > > > + -- full outer join with restrictions on the joining relations > > + EXPLAIN (COSTS false, VERBOSE) > > + SELECT t1.c1, t2.c1 FROM (SELECT c1 FROM ft4 WHERE c1 BETWEEN 50 AND > > 60) t1 FULL JOIN (SELECT c1 FROM ft5 WHERE c1 BETWEEN 50 AND 60) t2 ON > > (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1; > > + QUERY PLAN > > + ---------------------------------------------------------------------- > > + Foreign Scan > > + Output: ft4.c1, ft5.c1 > > + Relations: (public.ft4) FULL JOIN (public.ft5) > > + Remote SQL: SELECT ss1.c1, ss2.c1 FROM ((SELECT c1 FROM "S 1"."T 3" > > WHERE ((c1 >= 50)) AND ((c1 <= 60))) ss1(c1) FULL JOIN (SELECT c1 FROM > > "S 1"."T 4" WHERE ((c1 >= 50)) AND ((c1 <= 60))) ss2(c1) ON (((ss1.c1 = > > ss2.c1 ORDER BY ss1.c1 ASC NULLS LAST, ss2.c1 ASC NULLS LAST > > + (4 rows) > > > > "(public.ft4) FULL JOIN (public.ft5)" in the Relations line does not > > exactly match the deparsed query in the Remote SQL line, which I think > > would be rather confusing for users. (We may be able to print more > > exact information in the Relations line so as to match the deparsed > > query, but I think that would make the Relations line redundant.) > > > > Would we really need the Relations line? If joining relations are > > printed by core like "Foreign Join on public.ft1 t1_1, public.ft2 t2_1" > > as proposed upthread, we can see those relations from that, not the > > Relations line. Also we can see the join tree structure from the > > deparsed query in the Remote SQL line. The Relations line seems to be > > not that useful anymore, then. What do you think about that? > > I removed the Relations line. Here is an updated version of the patch. > > * As I said upthread, I left the upper-relation handling for another patch. Currently, the patch prints "Foreign Scan" with no relations in that case. > > * I kept custom joins as-is. We would need discussions about how to choose the relations we print in EXPLAIN, so I'd also like to leave that for yet another patch. > Please don't rely on the fs_relids bitmap to pick up relations to be printed. It just holds a set of underlying relations, but it does not mean all of these relations are actually scanned inside of the ForeignScan. You
Re: [HACKERS] Wrong defeinition of pq_putmessage_noblock since 9.5
On Fri, Jul 29, 2016 at 11:58 AM, Tom Lane wrote: > Maybe the real problem here is that the abstraction layer is so incomplete > and apparently haphazard, so that it's not very obvious where you should > use a pq_xxx name and where you should refer to socket_xxx. For some of > the functions in pqcomm.c, such as internal_flush, it's impossible to tell > which side of the abstraction boundary they're supposed to be on. > (I suspect that that particular function has simply been ignored on the > undocumented assumption that a bgworker could never call it, but that's > the kind of half-baked work that I don't find acceptable.) > > I think the core work that needs to be done here is to clearly identify > each function in pqcomm.c as either above or below the PqCommMethods > abstraction boundary. Maybe we should split one set or the other out > to a new source file. In any case, the naming convention ought to > reflect that hierarchy. > > I withdraw the idea that we should try to revert all these functions back > to their old names, since that would obscure the layering distinction > between socket-specific and general-usage functions. I now think that > the problem is that that distinction hasn't been drawn sharply enough. If memory serves, there was some discussion of this at the time the patch that changed this was originally submitted. The original patch, IIRC, just installed one or two hooks and it was argued that I should instead create some kind of abstraction layer. The present coding is the result of my attempt to do just that. I have to admit that I wasn't very eager to churn the contents of this file more than necessary. It seems like old, crufty code to me. I don't object if you want to refactor it, but I'm not sure what problem it solves. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans
On 2016/08/01 20:15, Etsuro Fujita wrote: I thought about the Relations line a bit more and noticed that there are cases where the table reference names for joining relations in the Relations line are printed incorrectly. Here is an example: postgres=# explain verbose select * from (select t1.a, t2.a from ft1 t1, ft2 t2 where t1.a = t2.a union select t1.a, t2.a from ft1 t1, ft2 t2 where t1.a = t2.a) as t(t1a, t2a); QUERY PLAN -------------------------------------------------------------------------- Unique (cost=204.12..204.13 rows=2 width=8) Output: t1.a, t2.a -> Sort (cost=204.12..204.12 rows=2 width=8) Output: t1.a, t2.a Sort Key: t1.a, t2.a -> Append (cost=100.00..204.11 rows=2 width=8) -> Foreign Scan (cost=100.00..102.04 rows=1 width=8) Output: t1.a, t2.a Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2) Remote SQL: SELECT r1.a, r2.a FROM (public.t1 r1 INNER JOIN public.t2 r2 ON (((r1.a = r2.a -> Foreign Scan (cost=100.00..102.04 rows=1 width=8) Output: t1_1.a, t2_1.a Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2) Remote SQL: SELECT r1.a, r2.a FROM (public.t1 r1 INNER JOIN public.t2 r2 ON (((r1.a = r2.a (14 rows) The table reference names for ft1 and ft2 in the Relations line for the second Foreign Scan should be t1_1 and t2_1 respectively. Another concern about the Relations line is that it represents just an internal representation of a pushed-down join, so it would not necessarily match a deparsed query shown in the Remote SQL line. Here is an example, which I found when working on supporting pushing down full outer join a lot more, by improving the deparsing logic so that postgres_fdw can build a remote query that involves subqueries [1], which I'll work on for 10.0: + -- full outer join with restrictions on the joining relations + EXPLAIN (COSTS false, VERBOSE) + SELECT t1.c1, t2.c1 FROM (SELECT c1 FROM ft4 WHERE c1 BETWEEN 50 AND 60) t1 FULL JOIN (SELECT c1 FROM ft5 WHERE c1 BETWEEN 50 AND 60) t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1; + QUERY PLAN + ---------------------------------------------------------------------- + Foreign Scan + Output: ft4.c1, ft5.c1 + Relations: (public.ft4) FULL JOIN (public.ft5) + Remote SQL: SELECT ss1.c1, ss2.c1 FROM ((SELECT c1 FROM "S 1"."T 3" WHERE ((c1 >= 50)) AND ((c1 <= 60))) ss1(c1) FULL JOIN (SELECT c1 FROM "S 1"."T 4" WHERE ((c1 >= 50)) AND ((c1 <= 60))) ss2(c1) ON (((ss1.c1 = ss2.c1 ORDER BY ss1.c1 ASC NULLS LAST, ss2.c1 ASC NULLS LAST + (4 rows) "(public.ft4) FULL JOIN (public.ft5)" in the Relations line does not exactly match the deparsed query in the Remote SQL line, which I think would be rather confusing for users. (We may be able to print more exact information in the Relations line so as to match the deparsed query, but I think that would make the Relations line redundant.) Would we really need the Relations line? If joining relations are printed by core like "Foreign Join on public.ft1 t1_1, public.ft2 t2_1" as proposed upthread, we can see those relations from that, not the Relations line. Also we can see the join tree structure from the deparsed query in the Remote SQL line. The Relations line seems to be not that useful anymore, then. What do you think about that? I removed the Relations line. Here is an updated version of the patch. * As I said upthread, I left the upper-relation handling for another patch. Currently, the patch prints "Foreign Scan" with no relations in that case. * I kept custom joins as-is. We would need discussions about how to choose the relations we print in EXPLAIN, so I'd also like to leave that for yet another patch.
Best regards, Etsuro Fujita explain-for-foreign-join-pushdown.patch
Re: [HACKERS] old_snapshot_threshold allows heap:toast disagreement
On Tue, Aug 2, 2016 at 9:10 AM, Robert Haas wrote: > On Sat, Jul 30, 2016 at 8:17 AM, Amit Kapila wrote: >> On Fri, Jul 29, 2016 at 1:10 AM, Robert Haas wrote: >>> On Wed, Jul 27, 2016 at 7:26 PM, Andres Freund wrote: >>> >>> New version attached. >> >> +static inline void >> +InitToastSnapshot(Snapshot snapshot, XLogRecPtr lsn) >> +{ >> + snapshot->satisfies = HeapTupleSatisfiesToast; >> + snapshot->lsn = lsn; >> +} >> >> Here, don't you need to initialize whenTaken as well, since that is also used in >> TestForOldSnapshot_impl() to report the "snapshot too old" error? > > Hmm, yeah. This is actually a bit confusing. We want the "oldest" > snapshot, but there are three different notions of "oldest": > > 1. Smallest LSN. > 2. Smallest whenTaken. > 3. Smallest xmin. > > Which one do we use? > Whichever notion we choose, I think we should use the LSN and whenTaken from the same snapshot. I think we can choose the snapshot with the smallest xmin and then use the LSN and whenTaken from it. I think the xmin comparison makes more sense because you are already retrieving the smallest-xmin snapshot from the registered snapshots. Whatever value we choose here, I think we can't guarantee that it will be in sync with the value used for the heap. So there is a chance that we might spuriously raise the "snapshot too old" error. I think we should mention this as a caveat in the docs [1] where we explain the behaviour of old_snapshot_threshold. [1] - https://www.postgresql.org/docs/9.6/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-ASYNC-BEHAVIOR -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
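For concreteness, a sketch of the quoted initializer extended to carry whenTaken as well, so that TestForOldSnapshot_impl() has a valid timestamp to compare against; whether the lsn/whenTaken pair should come from the smallest-xmin snapshot, as suggested above, is the open question, so treat the parameter choice here as an assumption rather than the final fix:

/*
 * Sketch only: take all three fields from the same registered snapshot
 * (e.g. the one with the smallest xmin), per the discussion above.
 */
static inline void
InitToastSnapshot(Snapshot snapshot, XLogRecPtr lsn, TimestampTz whenTaken)
{
    snapshot->satisfies = HeapTupleSatisfiesToast;
    snapshot->lsn = lsn;
    snapshot->whenTaken = whenTaken;
}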
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Tue, Aug 2, 2016 at 6:03 AM, Dilip Kumar wrote: > There are many more such exposed functions, which can throw a cache lookup > failure error if we pass a wrong value. > > i.e. > record_in > domain_in > fmgr_c_validator > plpgsql_validator > pg_procedure_is_visible > > Are we planning to change these also? I think if they are SQL-callable functions that users can invoke from a SQL prompt, they shouldn't throw "cache lookup failed" errors or, for that matter, any other error that is reported with elog(). Now, I don't necessarily think that these functions should return NULL in that case the way we did with the others; that's not a sensible behavior for a type-input function AFAIK. But we should emit a better error. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] New version numbering practices
On Tue, Aug 2, 2016 at 2:10 AM, Alvaro Herrera wrote: > That said, I'm not opposed to REL_10 and so on. In 89 years there will > be a problem with sorting REL_100 but I'm sure they can find a solution > then, if computers still need humans to write programs for them. It would be nice if there was a consistent way of referring to a version regardless of how old it was. There would be nothing stopping us from going back and adding tags for existing versions. We could add REL_09_5 back to REL_06_5 if we wanted to. Then we could easily refer to any version without special cases or rules about pre-10 vs post-10. It would also give a convenient chance to fix the inconsistencies in how some of the older branches were tagged. -- greg
Re: [HACKERS] Broken order-of-operations in parallel query latch manipulation
On Mon, Aug 1, 2016 at 8:45 PM, Tom Lane wrote: > Amit Kapila writes: >> On Mon, Aug 1, 2016 at 1:58 AM, Tom Lane wrote: >>> I believe this is wrong and the CHECK_FOR_INTERRUPTS needs to be before >>> or after the two latch operations. As-is, if the reason somebody set >>> our latch was to get us to notice that a CHECK_FOR_INTERRUPTS condition >>> happened, there's a race condition where we'd fail to realize that. > >> I could see that in nodeGather.c, it might fail to notice the SetLatch >> done by a worker process or be spuriously woken up due to a SetLatch for some >> unrelated reason. However, I don't see what problem it can cause >> apart from one extra loop cycle where it will try to process the tuple >> when actually there is no tuple in the queue. > > Consider the following sequence of events: > > 1. gather_readnext reaches the WaitLatch, and is allowed to pass through > it for some unrelated reason, perhaps some long-since-handled SIGUSR1 > from a worker process. > > 2. gather_readnext does CHECK_FOR_INTERRUPTS(), and sees nothing pending. > > 3. A SIGINT is received. StatementCancelHandler sets QueryCancelPending > and does SetLatch(MyLatch). > > 4. gather_readnext does ResetLatch(MyLatch). > > 5. gather_readnext runs through its loop again, finds nothing to do, and > reaches the WaitLatch. > > 6. The process is now sleeping on its latch, and might sit there a long > time before noticing the pending query cancel. > > Obviously the window for this race condition is pretty tight --- there's > not many instructions between steps 2 and 4. But it can happen. If > memory serves, we've had actual field reports for race condition bugs > where the window that was being hit wasn't much more than a single > instruction. > > Also, it's entirely possible that the bug could be masked, if there was > another CHECK_FOR_INTERRUPTS lurking anywhere in the code called within > the loop. That doesn't excuse this coding practice, though. > Right, and thanks for the detailed explanation. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
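For readers following along, a sketch of the safe ordering the sequence above implies: ResetLatch() comes first, and CHECK_FOR_INTERRUPTS() runs after it, so a SetLatch() done by a signal handler after the interrupt check leaves the latch set and the next WaitLatch() returns immediately instead of sleeping. gather_try_read_tuple() is a hypothetical stand-in for the real loop body in gather_readnext():

for (;;)
{
    ResetLatch(MyLatch);

    /*
     * Any SetLatch() that happens after this check leaves the latch
     * set, so the WaitLatch() below cannot sleep past it.
     */
    CHECK_FOR_INTERRUPTS();

    if (gather_try_read_tuple())    /* hypothetical loop body */
        break;

    WaitLatch(MyLatch, WL_LATCH_SET, 0);
}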
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Tue, Aug 2, 2016 at 3:33 PM, Dilip Kumar wrote: > There are many more such exposed functions, which can throw a cache lookup > failure error if we pass a wrong value. > > i.e. > record_in > domain_in > fmgr_c_validator > edb_get_objecttypeheaddef > plpgsql_validator > pg_procedure_is_visible > > Are we planning to change these also? > > The below query is one example (from sqlsmith). > > ERROR: cache lookup failed for procedure 0 > > postgres=# select > > postgres-# pg_catalog.record_in( > > postgres(# cast(pg_catalog.numerictypmodout( > > postgres(# cast(98 as integer)) as cstring), > > postgres(# cast(1 as oid), > > postgres(# cast(35 as integer)); > > ERROR: cache lookup failed for type 1 > By mistake I mentioned the edb_get_objecttypeheaddef function as well; please ignore that. So the actual list is as below: record_in domain_in fmgr_c_validator plpgsql_validator pg_procedure_is_visible -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Sat, Jul 30, 2016 at 4:27 AM, Michael Paquier wrote: > OK for me. Thanks for the commit. There are many more such exposed functions, which can throw a cache lookup failure error if we pass a wrong value, i.e.: record_in domain_in fmgr_c_validator edb_get_objecttypeheaddef plpgsql_validator pg_procedure_is_visible Are we planning to change these also? The below query is one example (from sqlsmith). ERROR: cache lookup failed for procedure 0 postgres=# select postgres-# pg_catalog.record_in( postgres(# cast(pg_catalog.numerictypmodout( postgres(# cast(98 as integer)) as cstring), postgres(# cast(1 as oid), postgres(# cast(35 as integer)); ERROR: cache lookup failed for type 1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Why we lost Uber as a user
On Sat, Jul 30, 2016 at 12:06 AM, Stephen Frost wrote: > * Bruce Momjian (br...@momjian.us) wrote: >> On Fri, Jul 29, 2016 at 10:44:29AM -0400, Stephen Frost wrote: >> > Another thought that was kicking around in my head related to that is if >> > we could have indexes that only provide page-level information (similar >> > to BRIN, but maybe a btree) and which also would allow HOT updates. >> > Those indexes would typically be used in a bitmap index scan where we're >> > going to be doing a bitmap heap scan with a recheck, of course, though I >> > wonder if we could come up with a way to do an in-order bitmap index >> > scan where we sort the tuples on the page and then perform some kind of >> > mergejoin recheck (or just pull out whatever the lowest-not-seen each >> > time we sort the tuples on the page). >> >> So allow HOT updates if the updated row is on the same page, even if the >> indexed column was changed, by scanning the page --- got it. > > The idea I had was to allow creation of indexes which *only* include the > page ID. Your rephrase seems to imply that we'd have a regular index > but then be able to figure out if a given tuple had any HOT updates > performed on it and, if so, scan the entire page. As I understand it, > it's more complicated than that because we must involve an index when > updating a tuple in some cases (UNIQUE?) and therefore we don't perform > HOT in the case where any indexed column is being changed. > Why do we need to add a record in all indexes if only the key corresponding to one of the indexes is updated? Basically, if the tuple can fit on the same page, why can't we consider it as HOT (or HPT - heap partial tuple or something like that), unless it updates the keys for all the indexes? Now, we can't consider such tuple versions for pruning as we do for HOT. The downside of this could be that we might need to retain some of the line pointers for more time (as we won't be able to reuse the line pointer till it is used in any one of the indexes, and those could be reused once we make the next non-HOT update). However, this should allow us not to update the indexes for which the corresponding column in the tuple is not updated. I think it is a basic premise that if any index column is updated then the update will be considered as non-HOT, so there is a good chance that I might be missing something here. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Why we lost Uber as a user
On 2 August 2016 at 08:11, Alfred Perlstein wrote: > On 7/2/16 4:39 AM, Geoff Winkless wrote: > > I maintain that this is a nonsense argument. Especially since (as you > > pointed out and as I missed first time around) the bug actually occurred at > > different records on different slaves, so he invalidates his own point. > Seriously? No, I make a habit of spouting off random arguments to a list full of people whose opinions I massively respect purely for kicks. What do you think? > There's a valid point here, you're sending over commands at the block level, > effectively "write to disk at this location" versus "update this record based > on PK", obviously this has some drawbacks that are reason for concern. Writing values directly into file offsets is only problematic if something else has failed that has caused the file to be an inexact copy. If a different bug occurred that caused the primary key to be corrupted on the slave (or indeed the master), PK-based updates would exhibit similar propagation errors. To reiterate my point, Uber's described problem came about because of a bug. Every piece of software has bugs at some point in its life; to pretend otherwise is simply naive. I'm not trying to excuse the bug, or to belittle the impact that such a bug has on data integrity or on Uber or indeed on the reputation of PostgreSQL. While I'm prepared to accept (because I have a job that requires I spend time on things other than digging through obscure reddits and mailing lists to understand more fully the exact cause) that in _this particular instance_ the bug was propagated because of the replication mechanism (although I'm still dubious about that, as per my comment above), that does _not_ preclude other bugs propagating in statement-based replication. That's what I said is a nonsense argument, and no-one has yet explained in what way that's incorrect. Geoff
Re: [HACKERS] pg_size_pretty, SHOW, and spaces
Re: Peter Eisentraut 2016-08-01 > > PostgreSQL uses the spaces inconsistently, though. pg_size_pretty uses > > spaces: > > > > # select pg_size_pretty((2^20)::bigint); > > pg_size_pretty > > > > 1024 kB > > because it's "pretty" :) :) > > SHOW does not: > > > > # show work_mem; > > work_mem > > ── > > 1MB > > The original idea might have been to allow that value to be passed back > into the settings system, without having to quote the space. I'm not > sure, but I think changing that might cause some annoyance. That's a good argument for keeping it that way, yes. Re: Bruce Momjian 2016-08-01 <20160801162508.ga28...@momjian.us> > Looking at the Wikipedia article I posted earlier, that also doesn't use > spaces: > > https://en.wikipedia.org/wiki/Binary_prefix That article has plenty of occurrences of "10 MB" "528 MB/s" and the like. > I think the only argument _for_ spaces is the output of pg_size_pretty() > now looks odd, e.g.: > >10 | 10 bytes | -10 bytes > 1000 | 1000 bytes | -1000 bytes > 100 | 977KB | -977KB >10 | 954MB | -954MB > 1 | 931GB | -931GB > 1000 | 909TB | -909TB > ^ ^ > > The issue is that we output "10 bytes", not "10bytes", but for units we > use "977KB". That seems inconsistent, but it is the normal policy > people use. I think this is because "977KB" is really "977K bytes", but > we just append the "B" after the "K" for brevity. It's the other way round: https://en.wikipedia.org/wiki/International_System_of_Units#General_rules | The value of a quantity is written as a number followed by a space | (representing a multiplication sign) and a unit symbol; e.g., 2.21 kg [...] I'd opt to omit the space anywhere where the value is supposed to be fed back into the config (SHOW, --parameters), but use the "pretty" format with space everywhere otherwise (documentation, memory counts in explain output, pg_size_pretty() etc.) Christoph
Re: [HACKERS] Reviewing freeze map code
On Tue, Aug 2, 2016 at 11:19 AM, Noah Misch wrote:
> On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote:
>> On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund wrote:
>> > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
>> >> Consider the below scenario.
>> >>
>> >> Vacuum
>> >> a. acquires a cleanup lock for page - 10
>> >> b. busy in checking visibility of tuples
>> >>    --assume it takes some time here, and in the meantime Session-1
>> >>    performs steps (a) and (b) and starts waiting in step (c)
>> >> c. marks the page as all-visible (PageSetAllVisible)
>> >> d. unlocks and releases the buffer
>> >>
>> >> Session-1
>> >> a. In heap_lock_tuple(), read the buffer for page-10
>> >> b. check PageIsAllVisible(); found the page is not all-visible, so
>> >>    didn't acquire the visibilitymap_pin
>> >> c. LockBuffer in exclusive mode - here it will wait for vacuum to
>> >>    release the lock
>> >> d. Got the lock, but now the page is marked as all-visible, so ideally
>> >>    we need to recheck the page and acquire the visibilitymap_pin
>> >
>> > So, I've tried pretty hard to reproduce that. While the theory above is
>> > sound, I believe the relevant code path is essentially dead for
>> > SQL-callable code, because we'll always hold a buffer pin before even
>> > entering heap_update/heap_lock_tuple.
>>
>> It is possible that we don't hold any buffer pin before entering
>> heap_update() and/or heap_lock_tuple(). For heap_update(), it is
>> possible when it enters via the simple_heap_update() path. For
>> heap_lock_tuple(), it is possible for ON CONFLICT DO UPDATE statements
>> and maybe others as well.
>
> This is currently listed as a 9.6 open item. Is it indeed a regression in
> 9.6, or do released versions have the same defect? If it is a 9.6
> regression, do you happen to know which commit, or at least which feature,
> caused it?

Commit eca0f1db is the reason for this specific issue.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
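For the archives, a minimal SQL sketch (names hypothetical) of one statement-level path that reaches heap_lock_tuple() without a prior buffer pin, per the ON CONFLICT note above; reproducing the race itself additionally requires a concurrent VACUUM hitting the same page:

    create table t (k int primary key, v int);
    insert into t values (1, 0);
    -- Run while VACUUM is busy setting all-visible bits on t's pages;
    -- the conflict arm locks the existing tuple via heap_lock_tuple().
    insert into t values (1, 1)
    on conflict (k) do update set v = excluded.v;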
[HACKERS] Wanting to learn about pgsql design decision
Dear list,

I'm interested in pgsql and would like to know more about the design decisions behind its features. Where should I search to learn more about subjects such as:

- Why, in the roles system, users are actually roles with the login attribute and not a separate entity.
- Why, to read from a table, are both a usage permission on the schema and a read permission on the table needed? Alternatively, there could be a usage permission on the schema just for altering the schema itself or adding tables to it, without requiring it for selecting from a table inside it.

And other questions of this sort. Thank you very much!
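On the second question, a minimal sketch (role, schema, and table names are made up) showing that both permissions are indeed required:

    create role reader login;
    create schema app;
    create table app.items (id int);
    grant select on app.items to reader;
    -- As reader, this still fails:
    --   select * from app.items;
    --   ERROR:  permission denied for schema app
    grant usage on schema app to reader;
    -- Only now does the select succeed: USAGE on the schema gates name
    -- resolution inside it, SELECT gates access to the table itself.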
Re: [HACKERS] asynchronous and vectorized execution
Thank you for the comment.

At Mon, 1 Aug 2016 10:44:56 +0530, Amit Khandekar wrote in
> On 21 July 2016 at 15:20, Kyotaro HORIGUCHI wrote:
> >
> > After some consideration, I found that ExecAsyncWaitForNode
> > cannot be reentrant, because that would mean control going into
> > async-unaware nodes while not-ready nodes remain, which is an
> > inconsistent state. To inhibit such reentering, I allocated node
> > identifiers in depth-first order so that the ascendant-descendant
> > relationship can be checked in a simple way (the nested-set model),
> > and call ExecAsyncConfigureWait only for the descendant nodes of the
> > parameter planstate.
>
> We have estate->waiting_nodes containing a mix of async-aware and
> non-async-aware nodes. I was thinking, an asynchrony tree would have only
> async-aware nodes, with possibly multiple asynchrony sub-trees in a tree.
> Somehow, if we restrict the bubbling up of events only up to the root of
> the asynchrony subtree, do you think we can simplify some of the
> complexities?

The current code prohibits registration of nodes outside the current subtree to avoid the reentering disaster. Indeed, leaving a "waiting node" mark or something like it on every root node at the first visit would let the propagation stop at the root of any async subtree. Nevertheless, when an async child under an inactive async root fires, the new tuple is loaded but not consumed, and a subsequent firing on the same child then leads to a deadlock (without result queueing). That can be avoided, however, if ExecAsyncConfigureWait doesn't register nodes in the ready state.

On the other hand, two or more asynchronous nodes can share a synchronization object. For instance, multiple postgres_fdw scan nodes can share one server connection, and only one of them can be in the waitable state at a time. If no async child in the current async subtree is waitable, the subtree must be stuck. So I think it is crucial for ExecAsyncWaitForNode to force at least one child *in the current async subtree* into the waiting state in such a situation. The ascendant-descendant relationship is necessary to do that anyway. Since we need node ids to detect the ascendant-descendant relationship anyway, and finally have to restrict async nodes with them, activating only descendant nodes from the start makes things simpler than avoiding the possible deadlock later, as described above.

# I implemented it as a per-subtree waiting-node list, but it was
# fragile and too ugly..

> For e.g. ExecAsyncWaitForNode() has become a bit complex seemingly because
> it has to handle non-async nodes also, and that's the reason I believe you
> have introduced modes such as ASYNCCONF_FORCE_ADD.

As explained above, ASYNCCONF_FORCE_ADD is not for non-async nodes, but for sets of async nodes that share a synchronization object. We could let ExecAsyncConfigureWait force-acquire the async object from the start, but that in turn causes possibly unnecessary transfers of a sync object among the nodes sharing it.

I hope the above sentences are readable enough, but any questions are welcome, even about the meaning of a sentence.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
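As a toy SQL illustration of the nested-set numbering described above (the table and values are hypothetical): each planstate gets depth-first (lft, rgt) identifiers, and the ascendant-descendant check becomes a simple interval-containment test.

    create table nodes (name text, lft int, rgt int);
    insert into nodes values
      ('Append',        1, 6),
      ('ForeignScan A', 2, 3),
      ('ForeignScan B', 4, 5);
    -- The descendants of a node are exactly the strictly contained intervals.
    select c.name
      from nodes p, nodes c
     where p.name = 'Append'
       and c.lft > p.lft and c.rgt < p.rgt;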
Re: [HACKERS] Why we lost Uber as a user
On 7/26/16 9:54 AM, Joshua D. Drake wrote:
> Hello,
>
> The following article is a very good look at some of our limitations and
> highlights some of the pains many of us have been working "around" since
> we started using the software.
>
> https://eng.uber.com/mysql-migration/
>
> Specifically:
>
> * Inefficient architecture for writes
> * Inefficient data replication
> * Issues with table corruption
> * Poor replica MVCC support
> * Difficulty upgrading to newer releases
>
> It is a very good read and I encourage our hackers to do so with an open
> mind.
>
> Sincerely,
> JD

It was a good read. Having based a high-performance web tracking service as well as a high-performance security appliance on PostgreSQL, I too have been bitten by these issues. I had a few questions that maybe the folks with core knowledge can answer:

1) Would it be possible to create a "star-like" schema to fix this problem? For example, let's say you have a table that is similar to Uber's:

  col0pk, col1, col2, col3, col4, col5

All cols are indexed. Assuming that updates happen to only 1 column at a time, why not figure out some way to encourage or automate the splitting of this table into multiple tables that present themselves as a single table? What I mean is that you would then wind up with the following tables:

  table1: col0pk, col1
  table2: col0pk, col2
  table3: col0pk, col3
  table4: col0pk, col4
  table5: col0pk, col5

Now when you update "col5" on a row, you only have to update the indexes on table5:col5 and table5:col0pk, as opposed to beforehand where you would have to update more indexes. In addition, I believe vacuum overhead would be somewhat mitigated in this case.

2) Why not have a look at how InnoDB does its storage? Would it be possible to do this?

3) For the small-ish table that Uber mentioned, is there a way to "have it in memory" yet provide some level of sync to disk so that it is consistent?

thanks!
-Alfred
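A minimal sketch of the splitting idea in (1), using a view to present the pieces as a single table (all names hypothetical):

    create table t_col1 (col0pk bigint primary key, col1 text);
    create table t_col5 (col0pk bigint primary key, col5 text);
    create view t as
      select c1.col0pk, c1.col1, c5.col5
        from t_col1 c1
        join t_col5 c5 using (col0pk);
    -- Updating col5 now touches only t_col5's heap and its indexes,
    -- not every index of one wide table.
    update t_col5 set col5 = 'x' where col0pk = 42;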
Re: [HACKERS] Why we lost Uber as a user
On 7/28/16 7:08 AM, Merlin Moncure wrote:
> *) postgres may not be the ideal choice for those who want a thin and
> simple database

This is a huge market; addressing it will bring mindshare and more jobs, code, and braintrust to PostgreSQL.

-Alfred
Re: [HACKERS] Why we lost Uber as a user
On 7/28/16 4:39 AM, Geoff Winkless wrote:

On 28 Jul 2016 12:19, "Vitaly Burovoy" wrote:
>
> On 7/28/16, Geoff Winkless wrote:
> > On 27 July 2016 at 17:04, Bruce Momjian wrote:
> >
> >> Well, their big complaint about binary replication is that a bug can
> >> spread from a master to all slaves, which doesn't happen with statement
> >> level replication.
> >
> > I'm not sure that that makes sense to me. If there's a database bug that
> > occurs when you run a statement on the master, it seems there's a decent
> > chance that that same bug is going to occur when you run the same
> > statement on the slave.
> >
> > Obviously it depends on the type of bug and how identical the slave is,
> > but statement-level replication certainly doesn't preclude such a bug
> > from propagating.
> >
> > Geoff
>
> Please, read the article first! The bug is about wrong visibility of
> tuples after applying WAL at slaves.
> For example, you can see two different records selecting from a table
> by a primary key (moreover, their PKs are the same, but other columns
> differ).

I read the article. It affected slaves as well as the master. I quote:

"because of the way replication works, this issue has the potential to spread into all of the databases in a replication hierarchy"

I maintain that this is a nonsense argument. Especially since (as you pointed out and as I missed first time around) the bug actually occurred at different records on different slaves, so he invalidates his own point.

Geoff

Seriously?

There's a valid point here: you're sending over commands at the block level, effectively "write to disk at this location" versus "update this record based on PK"; obviously this has some drawbacks that are reason for concern.

Does it validate the move on its own? NO. Does it add to the reasons to move away? Yes, that much is obvious.

Please read this thread:

https://www.reddit.com/r/programming/comments/4vms8x/why_we_lost_uber_as_a_user_postgresql_mailing_list/d5zx82n

Do I love PostgreSQL? Yes. Have I been bitten by things such as this? Yes. Should the community learn from these things and think of ways to avoid it? Absolutely!

-Alfred
Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans
> -----Original Message-----
> From: Etsuro Fujita [mailto:fujita.ets...@lab.ntt.co.jp]
> Sent: Tuesday, August 02, 2016 2:45 PM
> To: Kaigai Kouhei(海外 浩平); Ashutosh Bapat
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown
> plans
>
> On 2016/08/02 13:32, Kouhei Kaigai wrote:
> > I wrote:
> >> My concern here is EXPLAIN for foreign joins. I think it's another
> >> problem how we handle Foreign Scan plan nodes representing
> >> post-scan/join operations in EXPLAIN, so I'd like to leave that for
> >> another patch.
>
> > What are the post-scan/join operations? Are you saying EPQ recheck and
> > the alternative local join plan?
>
> No. I mean e.g., aggregation, window functions, sorting, or table
> modification. In other words, Foreign Scan plan nodes created from
> ForeignPath paths created from GetForeignUpperPaths.

Why do you need to handle these upper paths individually? The FDW driver knows whether the remote query contains aggregation, window functions, sorting, or table modification, so it can give a proper label. If the remote query contains partial aggregation plus a join of relations, for example, "Partial Agg + Scan" would be a proper label to fill in the "Foreign %s" token. All you need are the following two enhancements:

- Add "const char *explain_label" on the ForeignScanState, or somewhere else the extension can set it. It gives the label to be printed. If a NULL string is set, EXPLAIN shows "Foreign Scan" as the default.
- Add "Bitmapset *explain_rels" on the ForeignScanState, or somewhere else the extension can set it. It indicates the relations to be printed after the "Foreign %s" token. If you want to print the names of all the relations underlying this ForeignScan node, just copy the scanrelids bitmap. If a NULL bitmap is set, EXPLAIN prints nothing as the default.

Please note that the defaults do not change the existing behavior.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei
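To make the proposal concrete, a sketch of how EXPLAIN output might look with the two fields above (hypothetical output; ft1 and ft2 are assumed foreign tables, and neither field exists in any released API):

    explain (costs off)
      select count(*) from ft1 join ft2 using (id);
    -- With explain_label = "Partial Agg + Scan" and explain_rels covering
    -- both relations, the node might print as:
    --   Foreign Partial Agg + Scan on (ft1, ft2)
    -- With both fields left NULL, the default is unchanged:
    --   Foreign Scan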
Re: [HACKERS] Increasing timeout of poll_query_until for TAP tests
On Tue, Aug 2, 2016 at 10:28 AM, Michael Paquier wrote:
> There is still an issue with pg_basebackup when testing stream mode
> and replication slots. I am digging into this one now..

After 5 hours of running this test in a row, and 30 attempts at torturing hamster with a script running make check in an infinite loop under set -e, I am giving up on that for the time being... I have added the patch that makes the tests more stable to the next CF so that it is not forgotten:

https://commitfest.postgresql.org/10/693/

--
Michael