Re: should there be a hard-limit on the number of transactions pending undo?
On Sat, Jul 20, 2019 at 4:17 AM Peter Geoghegan wrote:
> > On Fri, Jul 19, 2019 at 12:52 PM Robert Haas wrote:
> > Right, that's definitely a big part of the concern here, but I don't
> > really believe that retaining locks is absolutely required, or even
> > necessarily desirable. For instance, suppose that I create a table,
> > bulk-load a whole lotta data into it, and then abort. Further suppose
> > that by the time we start trying to process the undo in the
> > background, we can't get the lock. Well, that probably means somebody
> > is performing DDL on the table.
>
> I believe that the primary reason why certain other database systems
> retain locks until rollback completes (or release their locks in
> reverse order, as UNDO processing progresses) is that application code
> will often repeat exactly the same actions on receiving a transient
> error, until the action finally completes successfully. Just like with
> serialization failures, or with manually implemented UPSERT loops that
> must sometimes retry. This is why UNDO is often (or always) processed
> synchronously, blocking progress of the client connection as its xact
> rolls back.
>
> Obviously these other systems could easily hand off the work of
> rolling back the transaction to an asynchronous worker process, and
> return success to the client that encounters an error (or asks to
> abort/roll back) almost immediately. I have to imagine that they
> haven't implemented this straightforward optimization because it makes
> sense that the cost of rolling back the transaction is primarily borne
> by the client that actually rolls back.

It is also possible that there are some other disadvantages or technical challenges in those other systems due to which they decided not to have such a mechanism. I think one such database prepares a consistent copy of pages during read operations based on SCN or something like that.
It might not be as easy for such a system to check if there is some pending undo which needs to be consulted. I am not saying that there are no ways to overcome such things, but those might have incurred much more cost or had some other disadvantages. I am not sure it is straightforward to imagine why some other system does things in some particular way unless there is explicit documentation about it. Having said that, I agree that there are a good number of advantages to performing the actions in the client that actually rolls back, and we should try to do that where it is not a good idea to transfer the work to background workers, such as for short transactions.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: should there be a hard-limit on the number of transactions pending undo?
On Fri, Jul 19, 2019 at 10:58 PM Robert Haas wrote:
>
> One other thing that seems worth noting is that we have to consider
> what happens after a restart. After a crash, and depending on exactly
> how we design it perhaps also after a non-crash restart, we won't
> immediately know how many outstanding transactions need undo; we'll
> have to grovel through the undo logs to find out. If we've got a hard
> cap, we can't allow new undo-using transactions to start until we
> finish that work. It's possible that, at the moment of the crash, the
> maximum number of items had already been pushed into the background,
> and every foreground session was busy trying to undo an abort as well.
> If so, we're already up against the limit. We'll have to scan through
> all of the undo logs and examine each transaction to get a count on
> how many transactions are already in a needs-undo-work state; only
> once we have that value do we know whether it's OK to admit new
> transactions to using the undo machinery, and how many we can admit.
> In typical cases, that won't take long at all, because there won't be
> any pending undo work, or not much, and we'll very quickly read the
> handful of transaction headers that we need to consult and away we go.
> However, if the hard limit is pretty big, and we're pretty close to
> it, counting might take a long time. It seems bothersome to have this
> interval between when we start accepting transactions and when we can
> accept transactions that use undo. Instead of throwing an ERROR, we
> can probably just teach the system to wait for the background process
> to finish doing the counting; that's what Amit's patch does currently.

Yeah; however, we wait for a certain threshold period of time (one minute) for the counting to finish and then error out. We could wait till the counting is finished, but I am not sure that is a good idea, because the user can anyway try again after some time.

> Or, we could not even open for connections until the counting has been
> completed.
>
> When I first thought about this, I was really concerned about the idea
> of a hard limit, but the more I think about it the less problematic it
> seems. I think in the end it boils down to a question of: when things
> break, what behavior would users prefer?

One minor thing I would like to add here is that we are providing some knobs wherein systems having a larger number of rollbacks can be configured with a much higher value of the hard limit, such that it won't be hit on those systems. I know it is not always easy to find the right value, but I guess users can learn from the behavior and then change it to avoid the same in the future.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
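The wait-then-error policy described above (wait up to a threshold, e.g. one minute in Amit's patch, for the discovery pass to finish counting pending-undo transactions, then error out so the client can retry) might look like this in outline. All names, the enum, and the tick-based loop are illustrative, not the patch's actual API:

```c
#include <assert.h>

/* Outcome of trying to admit a new undo-using transaction at startup. */
typedef enum
{
	ADMIT_OK,			/* counting finished; transaction may proceed */
	ADMIT_RETRY_LATER	/* threshold elapsed; report ERROR, client retries */
} admit_result;

/*
 * Wait for the background discovery pass to finish counting transactions
 * in a needs-undo-work state.  ticks_until_counted simulates how long the
 * counting takes; timeout_ticks is the admission threshold (e.g. one
 * minute).  The real code would sleep between checks instead of spinning.
 */
static admit_result
wait_for_undo_discovery(int ticks_until_counted, int timeout_ticks)
{
	for (int waited = 0; waited < timeout_ticks; waited++)
	{
		if (waited >= ticks_until_counted)
			return ADMIT_OK;	/* counting done: admit the transaction */
		/* real code: sleep one tick here */
	}
	return ADMIT_RETRY_LATER;	/* hit the threshold: give up for now */
}
```

The point of the threshold is exactly the trade-off discussed above: a bounded wait keeps startup responsive in the common no-pending-undo case, while a slow count near the hard limit degrades into a retryable error rather than an indefinite stall.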
Re: POC: Cleaning up orphaned files using undo logs
On Fri, Jul 19, 2019 at 6:35 PM Robert Haas wrote:
>
> On Fri, Jul 19, 2019 at 12:10 AM Amit Kapila wrote:
> > We are doing exactly what you have written in the last line of the
> > next paragraph "stop the transaction from writing undo when the hash
> > table is already too full.". So we will never face the problems
> > related to repeated crash recovery. The definition of too full is
> > that we stop allowing the new transactions that can write undo when
> > the hash table already has entries equivalent to
> > (UndoRollbackHashTableSize() - MaxBackends). Does this make sense?
>
> Oops, I was looking in the wrong place. Yes, that makes sense, but:
>
> 1. It looks like the test you added to PrepareUndoInsert has no
> locking, and I don't see how that can be right.

+	if (ProcGlobal->xactsHavingPendingUndo >
+		(UndoRollbackHashTableSize() - MaxBackends))

The actual HARD_LIMIT is UndoRollbackHashTableSize(), but we only allow a new backend to prepare an undo record if there are MaxBackends empty slots in the hash table. This guarantees us at least one slot in the hash table for our current prepare, even if all the backends running their transactions have aborted and inserted entries into the hash table. I think the problem with this check is that it requires MaxBackends empty slots in the hash table (so that any concurrent backend can insert its request) before any backend may prepare an undo record, and this seems too restrictive. Having said that, I think we must ensure MaxBackends*2 empty slots in the hash table, as each backend can enter 2 requests in the hash table.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
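The admission check being debated above, with the 2*MaxBackends headroom Dilip suggests (each backend can enqueue two rollback requests), can be sketched as a pure capacity test. The constants here are illustrative stand-ins for UndoRollbackHashTableSize() and MaxBackends, not the patch's actual values:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the real GUC-derived values. */
#define UNDO_HASH_TABLE_SIZE 1024	/* UndoRollbackHashTableSize() */
#define MAX_BACKENDS 100			/* MaxBackends */

/*
 * A backend may prepare new undo only while enough slots remain that
 * every concurrent backend could still insert its own rollback requests.
 * Since each backend can enter up to two requests, reserve
 * 2 * MaxBackends slots rather than MaxBackends.
 */
static bool
can_prepare_undo(int xacts_having_pending_undo)
{
	return xacts_having_pending_undo + 2 * MAX_BACKENDS
		<= UNDO_HASH_TABLE_SIZE;
}
```

With these numbers the cutoff sits at 824 pending-undo transactions: at 825 the reserved headroom would be violated and new undo-writing transactions are refused. (As Robert notes above, the real check also needs locking, which a pure function like this glosses over.)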
Re: POC: Cleaning up orphaned files using undo logs
On Fri, Jul 19, 2019 at 6:37 PM Robert Haas wrote:
>
> On Fri, Jul 19, 2019 at 7:54 AM Amit Khandekar wrote:
> > + * We just want to mask the cid in the undo record header. So
> > + * only if the partial record in the current page include the undo
> > + * record header then we need to mask the cid bytes in this page.
> > + * Otherwise, directly jump to the next record.
> > Here, I think you mean : "So only if the partial record in the current
> > page includes the *cid* bytes", rather than "includes the undo record
> > header"
> > May be we can say :
> > We just want to mask the cid. So do the partial record masking only if
> > the current page includes the cid bytes from the partial record
> > header.
>
> Hmm, but why is it correct to mask the CID at all? Shouldn't that match?

We don't write the CID in the WAL, because in hot standby or after recovery we don't need the actual CID for visibility. So during REDO, while generating the undo record, we set the CID to 'FirstCommandId', which is different from the DO time. That's the reason we mask it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
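As a toy illustration of why the masking is needed: the DO-time undo record carries the transaction's real CommandId, while the REDO-generated record always carries FirstCommandId, so a consistency comparison must neutralize the cid field first. The struct layout and names below are invented for the example, not zheap's actual undo record header:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FIRST_COMMAND_ID 0		/* stand-in for FirstCommandId */

/* Hypothetical two-field header; the real header has more fields. */
typedef struct
{
	uint32_t	xid;			/* identical in DO and REDO images */
	uint32_t	cid;			/* real CommandId at DO time,
								 * FirstCommandId after REDO */
} demo_undo_header;

/*
 * Compare two header images after overwriting the cid in both with
 * FirstCommandId, the way the masking function neutralizes the one
 * field that legitimately differs between DO and REDO.
 */
static bool
headers_match_after_masking(demo_undo_header a, demo_undo_header b)
{
	a.cid = FIRST_COMMAND_ID;
	b.cid = FIRST_COMMAND_ID;
	return a.xid == b.xid && a.cid == b.cid;
}
```

Without the masking step, a DO-time image with cid=5 would spuriously mismatch the REDO image with cid=0 even though every other field agrees.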
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Jul 19, 2019 at 12:32 PM Peter Geoghegan wrote:
> On Fri, Jul 19, 2019 at 10:53 AM Anastasia Lubennikova wrote:
> > Patch 0002 (must be applied on top of 0001) implements preserving of
> > correct TID order inside posting list when inserting new tuples.
> > This version passes all regression tests including amcheck test.
> > I also used following script to test insertion into the posting list:
>
> Nice!

Hmm. So, the attached test case fails amcheck verification for me with the latest version of the patch:

$ psql -f amcheck-compress-test.sql
DROP TABLE
CREATE TABLE
CREATE INDEX
CREATE EXTENSION
INSERT 0 2001
psql:amcheck-compress-test.sql:6: ERROR: down-link lower bound invariant violated for index "idx_desc_nl"
DETAIL: Parent block=3 child index tid=(2,2) parent page lsn=10/F87A3438.

Note that this test only has an INSERT statement. You have to use bt_index_parent_check() to see the problem -- bt_index_check() will not detect the problem.

-- 
Peter Geoghegan

amcheck-compress-test.sql
Description: Binary data
Re: Compile from source using latest Microsoft Windows SDK
On Fri, Jul 19, 2019 at 08:30:38AM -0400, Andrew Dunstan wrote:
> My tests of the VS2017 stuff used this install mechanism on a fresh
> Windows instance:
>
> choco install -y visualstudio2017-workload-vctools --package-parameters
> "--includeOptional"
>
> This installed Windows Kits 8.1 and 10, among many other things.

So you have bypassed the problem by installing the v8.1 SDK. And if you don't do that and only include the v10 SDK, then you see the problem. Functionally, it also means that with a VS2017 compilation the SDK version is forcibly downgraded; isn't that a bad idea anyway?

-- 
Michael

signature.asc
Description: PGP signature
Re: [PATCH] minor bugfix for pg_basebackup (9.6 ~ )
On Fri, Jul 19, 2019 at 10:40:42PM +0900, Ian Barwick wrote:
> Good point, it does actually fail with an error if an impossible slot name
> is provided, so the escaping is superfluous anyway.

FWIW, ReplicationSlotValidateName() gives the reason behind that restriction:
 * Slot names may consist out of [a-z0-9_]{1,NAMEDATALEN-1} which should allow
 * the name to be used as a directory name on every supported OS.

> I'll take another look at it later as it's not exactly critical, just stuck
> out when I was passing through the code.

This restriction is unlikely to be removed; still, I would rather keep the escaping logic in pg_basebackup. This is the usual, recommended coding pattern, and there is a risk that folks refer to this code block for their own fancy stuff, spreading the problem. The intention behind the code is to use an escaped name as well. For those reasons your patch is fine by me.

-- 
Michael

signature.asc
Description: PGP signature
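The restriction quoted above can be sketched as a standalone check — a simplified shadow of what ReplicationSlotValidateName() enforces; the real function additionally reports the offending character via ereport:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NAMEDATALEN 64			/* PostgreSQL's default */

/*
 * Accept only names matching [a-z0-9_]{1,NAMEDATALEN-1}, so the slot
 * name is safe to use as a directory name on every supported OS.
 */
static bool
slot_name_is_valid(const char *name)
{
	size_t		len = strlen(name);

	if (len == 0 || len >= NAMEDATALEN)
		return false;

	for (const char *p = name; *p; p++)
	{
		if (!((*p >= 'a' && *p <= 'z') ||
			  (*p >= '0' && *p <= '9') ||
			  *p == '_'))
			return false;
	}
	return true;
}
```

This is why Michael's point holds: the server rejects "impossible" names before they ever reach the filesystem, so pg_basebackup's escaping is belt-and-suspenders rather than load-bearing — but worth keeping as the recommended pattern.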
Re: should there be a hard-limit on the number of transactions pending undo?
On Fri, Jul 19, 2019 at 4:14 PM Robert Haas wrote: > I don't think this matters here at all. As long as there's only DML > involved, there won't be any lock conflicts anyway - everybody's > taking RowExclusiveLock or less, and it's all fine. If you update a > row in zheap, abort, and then try to update again before the rollback > happens, we'll do a page-at-a-time rollback in the foreground, and > proceed with the update; when we get around to applying the undo, > we'll notice that page has already been handled and skip the undo > records that pertain to it. To get the kinds of problems I'm on about > here, somebody's got to be taking some more serious locks. If I'm not mistaken, you're tacitly assuming that you'll always be using zheap, or something sufficiently similar to zheap. It'll probably never be possible to UNDO changes to something like a GIN index on a zheap table, because you can never do that with sensible concurrency/deadlock behavior. I don't necessarily have a problem with that. I don't pretend to understand how much of a problem it is. Obviously it partially depends on what your ambitions are for this infrastructure. Still, assuming that I have it right, ISTM that UNDO/zheap/whatever should explicitly own this restriction. -- Peter Geoghegan
Re: should there be a hard-limit on the number of transactions pending undo?
On Fri, Jul 19, 2019 at 6:47 PM Peter Geoghegan wrote: > I believe that the primary reason why certain other database systems > retain locks until rollback completes (or release their locks in > reverse order, as UNDO processing progresses) is that application code > will often repeat exactly the same actions on receiving a transient > error, until the action finally completes successfully. Just like with > serialization failures, or with manually implemented UPSERT loops that > must sometimes retry. This is why UNDO is often (or always) processed > synchronously, blocking progress of the client connection as its xact > rolls back. I don't think this matters here at all. As long as there's only DML involved, there won't be any lock conflicts anyway - everybody's taking RowExclusiveLock or less, and it's all fine. If you update a row in zheap, abort, and then try to update again before the rollback happens, we'll do a page-at-a-time rollback in the foreground, and proceed with the update; when we get around to applying the undo, we'll notice that page has already been handled and skip the undo records that pertain to it. To get the kinds of problems I'm on about here, somebody's got to be taking some more serious locks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
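The page-at-a-time skip behavior Robert describes — when the background undo finally runs, records pertaining to pages already rolled back in the foreground are simply skipped — can be sketched as follows. The page array, the page_done marker, and the record layout are all illustrative, not zheap's actual data structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

/*
 * Demo scenario: four undo records touching pages 1, 2, 2, 3, where the
 * foreground has already performed page-at-a-time rollback on page 2.
 * Returns the number of records actually applied by the background pass.
 */
static int
demo_apply_undo(void)
{
	bool		page_done[NPAGES] = {false};
	int			record_pages[] = {1, 2, 2, 3};
	int			applied = 0;

	page_done[2] = true;		/* foreground already handled page 2 */

	for (size_t i = 0; i < sizeof(record_pages) / sizeof(record_pages[0]); i++)
	{
		if (page_done[record_pages[i]])
			continue;			/* skip: this page was rolled back already */
		/* ...apply the undo record to the page here... */
		applied++;
	}
	return applied;
}
```

The two records for page 2 are skipped and only the records for pages 1 and 3 are applied — which is the property that lets a conflicting foreground update proceed without waiting for the full background rollback.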
Re: Multivariate MCV list vs. statistics target
On Fri, Jul 19, 2019 at 06:12:20AM +, Jamison, Kirk wrote: On Tuesday, July 9, 2019, Tomas Vondra wrote: >apparently v1 of the ALTER STATISTICS patch was a bit confused about >length of the VacAttrStats array passed to statext_compute_stattarget, >causing segfaults. Attached v2 patch fixes that, and it also makes sure >we print warnings about ignored statistics only once. > v3 of the patch, adding pg_dump support - it works just like when you tweak statistics target for a column, for example. When the value stored in the catalog is not -1, pg_dump emits a separate ALTER STATISTICS command setting it (for the already created statistics object). Hi Tomas, I stumbled upon your patch. According to the CF bot, your patch applies cleanly, builds successfully, and passes make world. Meaning, the pg_dump tap test passed, but there was no test for the new SET STATISTICS yet. So you might want to add a regression test for that and integrate it in the existing alter_generic file. Upon quick read-through, the syntax and docs are correct because it's similar to the format of ALTER TABLE/INDEX... SET STATISTICS... : ALTER [ COLUMN ] column_name SET STATISTICS integer + /* XXX What if the target is set to 0? Reset the statistic? */ This also makes me wonder. I haven't looked deeply into the code, but since 0 is a valid value, I believe it should reset the stats. I agree resetting the stats after setting the target to 0 seems quite reasonable. But that's not what we do for attribute stats, because in that case we simply skip the attribute during the future ANALYZE runs - we don't reset the stats, we keep the existing stats. So I've done the same thing here, and I've removed the XXX comment. If we want to change that, I'd do it in a separate patch for both the regular and extended stats. After lookup though, this is how it's tested in ALTER TABLE: /test/regress/sql/stats_ext.sql:-- Ensure things work sanely with SET STATISTICS 0 Well, yeah. 
But that tests that we skip building the extended statistic (because we excluded the column from the ANALYZE run).

I've considered making it part of CREATE STATISTICS itself, but it seems a bit cumbersome and we don't do it for columns either. I'm not against adding it in the future, but at this point I don't see a need.

I agree. Perhaps that's for another patch should you decide to add it in the future.

Right. Attached is v4 of the patch, with a couple more improvements:

1) I've renamed the if_not_exists flag to missing_ok, because that's more consistent with the "IF EXISTS" clause in the grammar (the old flag was kinda the exact opposite), and I've added a NOTICE about the skip.

2) I've renamed ComputeExtStatsTarget to ComputeExtStatsRows, because that's what the function was doing anyway (computing sample size).

3) I've added a couple of regression tests to stats_ext.sql

Aside from that, I've cleaned up a couple of places and improved a bunch of comments. Nothing huge.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/doc/src/sgml/ref/alter_statistics.sgml b/doc/src/sgml/ref/alter_statistics.sgml
index 58c7ed020d..987f9dbc6f 100644
--- a/doc/src/sgml/ref/alter_statistics.sgml
+++ b/doc/src/sgml/ref/alter_statistics.sgml
@@ -26,6 +26,7 @@ PostgreSQL documentation
 ALTER STATISTICS name OWNER TO { new_owner | CURRENT_USER | SESSION_USER }
 ALTER STATISTICS name RENAME TO new_name
 ALTER STATISTICS name SET SCHEMA new_schema
+ALTER STATISTICS name SET STATISTICS new_target
@@ -93,6 +94,22 @@ ALTER STATISTICS name SET SCHEMA
+      new_target
+
+       The statistic-gathering target for this statistic object for subsequent
+        operations.
+       The target can be set in the range 0 to 10000; alternatively, set it
+       to -1 to revert to using the system default statistics
+       target ().
+       For more information on the use of statistics by the
+       PostgreSQL query planner, refer to
+       .
+
+
+
+
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8d633f2585..edbd24f5e9 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -307,7 +307,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	VacAttrStats **vacattrstats;
 	AnlIndexData *indexdata;
 	int			targrows,
-				numrows;
+				numrows,
+				minrows;
 	double		totalrows,
 				totaldeadrows;
 	HeapTuple  *rows;
@@ -491,6 +492,16 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 		}
 	}

+	/*
+	 * Look at extended statistic objects too, as those may define custom
+	 * statistic target. So we may need to
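To make the target-resolution semantics discussed in this thread concrete: -1 in the catalog means "use the system default", while 0 means the statistics object is skipped during ANALYZE and the existing stats are kept, matching per-attribute targets. A hypothetical helper (the names mirror the catalog column and GUC, but this is not the patch's actual code):

```c
#include <assert.h>

/*
 * Resolve the effective statistics target for an extended statistics
 * object: a non-negative stxstattarget wins (0 meaning "skip during
 * ANALYZE, keep existing stats"); -1 falls back to the system-wide
 * default_statistics_target.
 */
static int
effective_stats_target(int stxstattarget, int default_statistics_target)
{
	return (stxstattarget >= 0) ? stxstattarget : default_statistics_target;
}
```

So with default_statistics_target = 100, an object stored with -1 resolves to 100, one stored with 0 resolves to 0 (skipped), and an explicit 500 is used as-is.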
Re: should there be a hard-limit on the number of transactions pending undo?
On Fri, Jul 19, 2019 at 12:52 PM Robert Haas wrote:
> Right, that's definitely a big part of the concern here, but I don't
> really believe that retaining locks is absolutely required, or even
> necessarily desirable. For instance, suppose that I create a table,
> bulk-load a whole lotta data into it, and then abort. Further suppose
> that by the time we start trying to process the undo in the
> background, we can't get the lock. Well, that probably means somebody
> is performing DDL on the table.

I believe that the primary reason why certain other database systems retain locks until rollback completes (or release their locks in reverse order, as UNDO processing progresses) is that application code will often repeat exactly the same actions on receiving a transient error, until the action finally completes successfully. Just like with serialization failures, or with manually implemented UPSERT loops that must sometimes retry. This is why UNDO is often (or always) processed synchronously, blocking progress of the client connection as its xact rolls back.

Obviously these other systems could easily hand off the work of rolling back the transaction to an asynchronous worker process, and return success to the client that encounters an error (or asks to abort/roll back) almost immediately. I have to imagine that they haven't implemented this straightforward optimization because it makes sense that the cost of rolling back the transaction is primarily borne by the client that actually rolls back. And, as I said, because a lot of application code will immediately retry on failure, which needs to not deadlock with an asynchronous roll back process.

> If they just did LOCK TABLE or ALTER
> TABLE SET STATISTICS, we are going to need to execute that same undo
> once the DDL is complete. However, if the DDL is DROP TABLE, we're
> going to find that once we can get the lock, the undo is obsolete, and
> we don't need to worry about it any more. Had we made it 100% certain
> that the DROP TABLE couldn't go through until the undo was performed,
> we could avoid having to worry about the undo having become obsolete
> ... but that's hardly a win. We're better off allowing the drop and
> then just chucking the undo.

I'm sure that there are cases like that. And, I'm pretty sure that at least one of the other database systems that I'm thinking of isn't as naive as I suggest, without being sure of the specifics. The classic approach is to retain the locks, even though that sucks in some cases. That doesn't mean that you have to do it that way, but it's probably a good idea to present your design in a way that compares and contrasts with the classic approach.

I'm pretty sure that this is related to the way in which other systems retain coarse-grained locks when bitmap indexes are used, even though that makes them totally unusable with OLTP apps. It seems like it would help users a lot if their bitmap indexes didn't come with that problem, but it's a price that they continue to have to pay.

> Point being - there's at least some chance that the operations which
> block forward progress also represent progress of another sort.

That's good, provided that there isn't observable lock starvation. I don't think that you need to eliminate the theoretical risk of lock starvation. It deserves careful, ongoing consideration, though. It's difficult to codify exactly what I have in mind, but I can give you an informal definition now: It's probably okay if there is the occasional implementation-level deadlock because the user got unlucky once. However, it's not okay for there to be *continual* deadlocks because the user got unlucky just once. Even if the user had *extraordinarily* bad luck that one time. In short, my sense is that it's never okay for the system as a whole to "get stuck" in a deadlock or livelock loop.

Actually, it might even be okay if somebody had a test case that exhibits "getting stuck" behavior, provided the test case is very delicate, and looks truly adversarial (i.e. it goes beyond being extraordinarily unlucky).

I know that this is all pretty hand-wavy, and I don't expect you to have a definitive response. These are some high level concerns that I have, that may or may not apply to what you're trying to do.

-- 
Peter Geoghegan
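The application-side behavior Peter keeps returning to — blindly re-running the same action after a transient error until it succeeds — is the classic retry loop, sketched here with an invented callback type standing in for "execute the transaction":

```c
#include <assert.h>

typedef enum
{
	ACTION_OK,					/* transaction committed */
	ACTION_TRANSIENT_ERROR		/* e.g. serialization failure: retry */
} action_result;

/*
 * Repeat try_action until it succeeds, up to max_attempts times.
 * Returns the attempt number that succeeded, or -1 on giving up.
 */
static int
run_with_retries(action_result (*try_action)(int attempt), int max_attempts)
{
	for (int attempt = 1; attempt <= max_attempts; attempt++)
	{
		if (try_action(attempt) == ACTION_OK)
			return attempt;
	}
	return -1;
}

/* Demo action: fails transiently twice, succeeds on the third try. */
static action_result
demo_action(int attempt)
{
	return (attempt >= 3) ? ACTION_OK : ACTION_TRANSIENT_ERROR;
}
```

This loop is exactly why the lock-retention question matters: if the aborted transaction's locks are released before its undo is applied, the retrying client can immediately collide with the asynchronous rollback worker, and the retry pattern must not deadlock against it.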
Re: partition routing layering in nodeModifyTable.c
Hi,

On 2019-07-19 17:11:10 -0400, Alvaro Herrera wrote:
> On 2019-Jul-19, Andres Freund wrote:
> > > -			slot = ExecDelete(node, tupleid, oldtuple, planSlot,
> > > -							  &node->mt_epqstate, estate,
> > > +			slot = ExecDelete(node, resultRelInfo, tupleid, oldtuple,
> > > +							  planSlot, &node->mt_epqstate, estate,
> > > 							  true, node->canSetTag,
> > > 							  false /* changingPart */ , NULL, NULL);
> > > 			break;
> >
> > This reminds me of another complaint: ExecDelete and ExecInsert() have
> > gotten more boolean parameters for partition moving, but only one of
> > them is explained with a comment (/* changingPart */) - think we should
> > do that for all.
>
> Maybe change the API to use a flags bitmask?
>
> (IMO the placement of the comment inside the function call, making the
> comma appear preceded with a space, looks ugly. If we want to add
> comments, let's put each param on its own line with the comment beyond
> the comma. That's what we do in other places where this pattern is
> used.)

Well, that's the pre-existing style, so I'd just have gone with that. I'm not sure I buy there's much point in going for a bitmask, as this is file-private code, not code where changing the signature requires modifying multiple files.

> > > 	/* Initialize the executor state. */
> > > -	estate = create_estate_for_relation(rel);
> > > +	estate = create_estate_for_relation(rel, &resultRelInfo);
> >
> > Hm. It kinda seems cleaner if we were to instead return the relevant
> > index, rather than the entire ResultRelInfo, as an output from
> > create_estate_for_relation(). Makes it clearer that it's still in the
> > EState.
>
> Yeah.
>
> > Or perhaps we ought to compute it in a separate step? Then that'd be
> > more amenable to support replicating into partition roots.
>
> I'm not quite seeing the shape that you're imagining this would take.
> I vote not to mess with that for this patch; I bet that we'll have to
> change a few other things in this code when we add better support for
> partitioning in logical replication.

Yea, I think it's fine to do that separately. If we wanted to support partition roots as replication targets, we'd obviously need to do something pretty similar to what ExecInsert()/ExecUpdate() already do. And there we can't just reference an index in EState, as partition children aren't in there. I kind of was wondering whether, if we had a separate function for getting the targeted ResultRelInfo, we'd be able to just extend that function to support replication. But now that I think about it a bit more, that's very much just scratching the surface... We really ought to have the replication "sink" code share more code with nodeModifyTable.c.

Greetings,

Andres Freund
Re: partition routing layering in nodeModifyTable.c
On 2019-Jul-19, Andres Freund wrote:

> On 2019-07-19 17:52:20 +0900, Amit Langote wrote:
> > The first one (0001) deals with reducing the core executor's reliance
> > on es_result_relation_info to access the currently active result
> > relation, in favor of receiving it from the caller as a function
> > argument. So no piece of core code relies on it being correctly set
> > anymore. It still needs to be set correctly for the third-party code
> > such as FDWs.
>
> I'm inclined to just remove it. There's not much code out there relying
> on it, as far as I can tell. Most FDWs don't support the direct modify
> API, and that's afaict the case where one needs to use
> es_result_relation_info?

Yeah, I too agree with removing it; since our code doesn't use it, it seems very likely that it will become slightly out of sync with reality and we'd not notice until some FDW misbehaves weirdly.

> > -			slot = ExecDelete(node, tupleid, oldtuple, planSlot,
> > -							  &node->mt_epqstate, estate,
> > +			slot = ExecDelete(node, resultRelInfo, tupleid, oldtuple,
> > +							  planSlot, &node->mt_epqstate, estate,
> > 							  true, node->canSetTag,
> > 							  false /* changingPart */ , NULL, NULL);
> > 			break;
>
> This reminds me of another complaint: ExecDelete and ExecInsert() have
> gotten more boolean parameters for partition moving, but only one of
> them is explained with a comment (/* changingPart */) - think we should
> do that for all.

Maybe change the API to use a flags bitmask?

(IMO the placement of the comment inside the function call, making the comma appear preceded with a space, looks ugly. If we want to add comments, let's put each param on its own line with the comment beyond the comma. That's what we do in other places where this pattern is used.)

> > 	/* Initialize the executor state. */
> > -	estate = create_estate_for_relation(rel);
> > +	estate = create_estate_for_relation(rel, &resultRelInfo);
>
> Hm. It kinda seems cleaner if we were to instead return the relevant
> index, rather than the entire ResultRelInfo, as an output from
> create_estate_for_relation(). Makes it clearer that it's still in the
> EState.

Yeah.

> Or perhaps we ought to compute it in a separate step? Then that'd be
> more amenable to support replicating into partition roots.

I'm not quite seeing the shape that you're imagining this would take. I vote not to mess with that for this patch; I bet that we'll have to change a few other things in this code when we add better support for partitioning in logical replication.

> > +/*
> > + * ExecMove
> > + *		Move an updated tuple from the input result relation to the
> > + *		new partition of its root parent table
> > + *
> > + * This works by first deleting the tuple from the input result relation
> > + * followed by inserting it into the root parent table, that is,
> > + * mtstate->rootResultRelInfo.
> > + *
> > + * Returns true if it's detected that the tuple we're trying to move has
> > + * been concurrently updated.
> > + */
> > +static bool
> > +ExecMove(ModifyTableState *mtstate, ResultRelInfo *resultRelInfo,
> > +		 ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *planSlot,
> > +		 EPQState *epqstate, bool canSetTag, TupleTableSlot **slot,
> > +		 TupleTableSlot **inserted_tuple)
> > +{
>
> I know that it was one of the names I proposed, but now that I'm
> thinking about it again, it sounds too generic. Perhaps
> ExecCrossPartitionUpdate() wouldn't be a quite so generic name? Since
> there's only one reference the longer name wouldn't be painful.

That name sounds good.

Isn't the return convention backwards? Sounds like "true" should mean that it succeeded.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
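Álvaro's flags-bitmask idea for ExecDelete's growing list of booleans could look like the sketch below. The flag names are illustrative, not actual PostgreSQL symbols; the point is that each call site then names its options instead of passing a row of bare true/false arguments:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical option flags replacing ExecDelete's boolean parameters. */
#define EXEC_DELETE_PROCESS_RETURNING	(1 << 0)
#define EXEC_DELETE_CAN_SET_TAG			(1 << 1)
#define EXEC_DELETE_CHANGING_PART		(1 << 2)

/* Test whether a given option flag is set in a flags word. */
static bool
flag_set(unsigned int flags, unsigned int flag)
{
	return (flags & flag) != 0;
}
```

A call site would then read, e.g., `ExecDelete(..., EXEC_DELETE_CAN_SET_TAG, ...)` — self-documenting without the in-call `/* changingPart */` comments. Andres's counterpoint stands, though: for file-private code, the signature-change cost the bitmask avoids is small to begin with.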
Re: Built-in connection pooler
On 19.07.2019 6:36, Ryan Lambert wrote:
> Here's what I found tonight in your latest patch (9). The output from
> git apply is better, fewer whitespace errors, but still getting the
> following. Ubuntu 18.04 if that helps.
>
> git apply -p1 < builtin_connection_proxy-9.patch
> :79: tab in indent.
> Each proxy launches its own subset of backends.
> :634: indent with spaces.
>     union {
> :635: indent with spaces.
>         struct sockaddr_in inaddr;
> :636: indent with spaces.
>         struct sockaddr addr;
> :637: indent with spaces.
>     } a;
> warning: squelched 54 whitespace errors
> warning: 59 lines add whitespace errors.
>
> A few more minor edits.
>
> In config.sgml:
> "If the max_sessions limit is reached new connection are not accepted."
> Should be "connections".
> "The default value is 10, so up to 10 backends will server each database,"
> "server" should be "serve" and the sentence should end with a period
> instead of a comma.
>
> In postmaster.c:
> /* The socket number we are listening for poolled connections on */
> "poolled" --> "pooled"
> "(errmsg("could not create listen socket for locahost")));"
> "locahost" -> "localhost".
> " * so to support order balancing we should do dome smart work here."
> "dome" should be "some"?
>
> I don't see any tests covering this new functionality. It seems that
> this is significant enough functionality to warrant some sort of tests,
> but I don't know exactly what those would/should be.

Thank you once again for these fixes. Updated patch is attached.

Concerning testing: I do not think that the connection pooler needs some kind of special tests. The idea of the built-in connection pooler is that it should be able to handle all requests normal Postgres can. I have added an extra path with enabled connection proxies to the regression tests. Unfortunately, pg_regress alters some session variables, so the backend becomes tainted and pooling is not actually used (but communication through the proxy is tested). It is also possible to run pgbench with different numbers of connections through the connection pooler.
Thanks, Ryan diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 84341a30e5..f8b93f16ed 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -719,6 +719,123 @@ include_dir 'conf.d' + + max_sessions (integer) + + max_sessions configuration parameter + + + + + The maximum number of client sessions that can be handled by + one connection proxy when session pooling is switched on. + This parameter does not add any memory or CPU overhead, so + specifying a large max_sessions value + does not affect performance. + If the max_sessions limit is reached new connections are not accepted. + + + The default value is 1000. This parameter can only be set at server start. + + + + + + session_pool_size (integer) + + session_pool_size configuration parameter + + + + + Enables session pooling and defines the maximum number of + backends that can be used by client sessions for each database/user combination. + Launched non-tainted backends are never terminated even if there are no active sessions. + Backend is considered as tainted if client updates GUCs, creates temporary table or prepared statements. + Tainted backend can server only one client. + + + The default value is 10, so up to 10 backends will serve each database, + + + + + + proxy_port (integer) + + proxy_port configuration parameter + + + + + Sets the TCP port for the connection pooler. + Clients connected to main "port" will be assigned dedicated backends, + while client connected to proxy port will be connected to backends through proxy which + performs transaction level scheduling. + + + The default value is 6543. + + + + + + connection_proxies (integer) + + connection_proxies configuration parameter + + + + + Sets number of connection proxies. + Postmaster spawns separate worker process for each proxy. Postmaster scatters connections between proxies using one of scheduling policies (round-robin, random, load-balancing). + Each proxy launches its own subset of backends. 
+ So the maximal number of non-tainted backends is session_pool_size*connection_proxies*databases*roles. + + + The default value is 0, so session pooling is disabled. + + + + + + session_schedule (enum) + + session_schedule configuration parameter + + + + + Specifies the scheduling policy for assigning sessions to proxies in case of
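Pulling together the GUCs from the config.sgml additions above, an illustrative postgresql.conf fragment might look like the following. The values are chosen only for the example, not as recommendations:

```ini
# Built-in connection pooler -- illustrative values only
connection_proxies = 2      # > 0 enables session pooling; one worker process per proxy
proxy_port = 6543           # clients connect here to get pooled backends
session_pool_size = 10      # non-tainted backends per database/user combination, per proxy
max_sessions = 1000         # client sessions handled by one proxy
```

Per the documentation above, the worst-case number of non-tainted backends is then session_pool_size * connection_proxies * databases * roles.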
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra wrote: > Now, consider this example: > > create table t (a int, b int, c int); > insert into t select mod(i,100),mod(i,100),i from > generate_series(1,1000) s(i); > create index on t (a); > analyze t; > explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1; > > With 0001+0002+0003 patches, I get a plan like this: > > QUERY PLAN > > Limit (cost=10375.39..10594.72 rows=1 width=16) >-> Incremental Sort (cost=10375.39..2203675.71 rows=1 width=16) > Sort Key: a, b, (sum(c)) > Presorted Key: a, b > -> GroupAggregate (cost=10156.07..2203225.71 rows=1 width=16) >Group Key: a, b >-> Gather Merge (cost=10156.07..2128124.39 rows=1175 > width=12) > Workers Planned: 2 > -> Incremental Sort (cost=9156.04..972856.05 > rows=4166740 width=12) >Sort Key: a, b >Presorted Key: a >-> Parallel Index Scan using t_a_idx on t > (cost=0.43..417690.30 rows=4166740 width=12) > (12 rows) > > > and with 0004, I get this: > > QUERY PLAN > -- > Limit (cost=20443.84..20665.32 rows=1 width=16) >-> Incremental Sort (cost=20443.84..2235250.05 rows=1 width=16) > Sort Key: a, b, (sum(c)) > Presorted Key: a, b > -> GroupAggregate (cost=20222.37..2234800.05 rows=1 width=16) >Group Key: a, b >-> Incremental Sort (cost=20222.37..2159698.74 rows=1175 > width=12) > Sort Key: a, b > Presorted Key: a > -> Index Scan using t_a_idx on t (cost=0.43..476024.65 > rows=1175 width=12) > (10 rows) > > Notice that the cost of the second plan is almost double the first one. That > means 0004 does not even generate the first plan, i.e. there are cases > where we don't try to add the explicit sort before passing the path to > generate_gather_paths(). > > And I think I know why that is - while gather_grouping_paths() tries to > add explicit sort below the gather merge, there are other places that > call generate_gather_paths() that don't do that.
In this case it's > probably apply_scanjoin_target_to_paths() which simply builds > >parallel (seq|index) scan + gather merge > > and that's it. The problem is likely the same - the code does not know > which pathkeys are "interesting" at that point. We probably need to > teach the planner to do this. I've been working on figuring out sample queries for each of the places we're looking at adding create_increment_sort() (starting with the cases enabled by gather-merge nodes). The generate_useful_gather_paths() call in apply_scanjoin_target_to_paths() is required to generate the above preferred plan. But I found that if I set enable_sort=off (with or without the _useful_ variant of generate_gather_paths()) I get a very similar plan that's actually lower cost yet: QUERY PLAN -- Limit (cost=10255.98..10355.77 rows=1 width=16) -> Incremental Sort (cost=10255.98..1008228.67 rows=1 width=16) Sort Key: a, b, (sum(c)) Presorted Key: a, b -> Finalize GroupAggregate (cost=10156.20..1007778.67 rows=1 width=16) Group Key: a, b -> Gather Merge (cost=10156.20..1007528.67 rows=2 width=16) Workers Planned: 2 -> Partial GroupAggregate (cost=9156.18..1004220.15 rows=1 width=16) Group Key: a, b -> Incremental Sort (cost=9156.18..972869.60 rows=4166740 width=12) Sort Key: a, b Presorted Key: a -> Parallel Index Scan using t_a_idx on t (cost=0.43..417703.85 rows=4166740 width=12) Is that something we should consider a bug at this stage? It's also not clear to me (costing aside) which plan is intuitively preferable. James
Re: minimizing pg_stat_statements performance overhead
On Wed, 2019-04-03 at 23:20 +, Raymond Martin wrote: > Hi Christoph, > > > you sent the patch as UTF-16, could you re-send it as plain ascii? > > Apologies. I re-attached the plain ascii version. Committed. Thanks! Regards, Jeff Davis
Re: sepgsql seems rather thoroughly broken on Fedora 30
On Fri, Jul 19, 2019 at 4:29 PM Tom Lane wrote: > > Mike Palmiotto writes: > > We probably need to polish this a bit more, but what do you think > > about something similar to the attached patches? They should hopefully > > reduce some of the complexity of running these regression tests. > > I can confirm that the 0001 patch fixes things on my Fedora 30 box. > So that's good, though I don't know enough to evaluate it for style > or anything like that. I think the policy is in need of review/rewriting anyway. The proper thing to do would be to create a common template for all of the SELinux regtest user domains and create more of a hierarchical policy to reduce redundancy. If you want to wait for more formal policy updates, I can do that in my spare time. Otherwise, the patch I posted should work with the general style of this policy module. > > I don't think I like the 0002 patch very much, because of its putting > all the sudo actions into the script. I'd rather not give a script > root permissions, thanks. Maybe I'm in the minority on that. Definitely not. I cringed a little bit as I was making those additions, but figured it was fine since it's just a test script (and we have to run `sudo` for various other installation items as well). > Also, since the documentation explicitly says that the > /usr/share/selinux/devel/Makefile path is not to be relied on, > why would we hard-wire it into the script? > > A bigger-picture issue is that right now, configuring a cluster for > sepgsql is a very manual process (cf. section F.35.2). I think there's > some advantage in forcing the user to run through that before running > the regression test, namely that they'll get the bugs out of any > misunderstandings or needed local changes. If we had that a bit more > automated then maybe having the test script do-it-for-you would be > sensible. (IOW, the fact that the test process is more like "make > installcheck" than "make check" seems like a feature not a bug.) 
Makes sense to me. Thanks for the feedback! -- Mike Palmiotto Software Engineer Crunchy Data Solutions https://crunchydata.com
Re: sepgsql seems rather thoroughly broken on Fedora 30
Mike Palmiotto writes: > We probably need to polish this a bit more, but what do you think > about something similar to the attached patches? They should hopefully > reduce some of the complexity of running these regression tests. I can confirm that the 0001 patch fixes things on my Fedora 30 box. So that's good, though I don't know enough to evaluate it for style or anything like that. I don't think I like the 0002 patch very much, because of its putting all the sudo actions into the script. I'd rather not give a script root permissions, thanks. Maybe I'm in the minority on that. Also, since the documentation explicitly says that the /usr/share/selinux/devel/Makefile path is not to be relied on, why would we hard-wire it into the script? A bigger-picture issue is that right now, configuring a cluster for sepgsql is a very manual process (cf. section F.35.2). I think there's some advantage in forcing the user to run through that before running the regression test, namely that they'll get the bugs out of any misunderstandings or needed local changes. If we had that a bit more automated then maybe having the test script do-it-for-you would be sensible. (IOW, the fact that the test process is more like "make installcheck" than "make check" seems like a feature not a bug.) regards, tom lane
Re: should there be a hard-limit on the number of transactions pending undo?
On 2019-07-19 15:57:45 -0400, Robert Haas wrote: > On Fri, Jul 19, 2019 at 3:12 PM Andres Freund wrote: > > Isn't that pretty inherently required? How are we otherwise ever going to > > be able to roll back a transaction that holds an AEL on a relation it > > also modifies? I might be standing on my own head here, though. > > I think you are. If a transaction holds an AEL on a relation it also > modifies, we still only need something like RowExclusiveLock to roll > it back. If we retain the transaction's locks until undo is complete, > we will not deadlock, but we'll also hold AccessExclusiveLock for a > long time. If we release the transaction's locks, we can perform the > undo in the background with only RowExclusiveLock, which is full of > win. Even if you insist that the undo task should acquire the same lock > the relation has, which seems entirely excessive to me, that hardly > prevents undo from being applied. Once the original transaction has > released its locks, those locks are released, and the undo system can > acquire those locks the next time the relation isn't busy (or when it > gets to the head of the lock queue). Good morning, Mr Freund. Not sure what you were thinking there.
Re: should there be a hard-limit on the number of transactions pending undo?
On Fri, Jul 19, 2019 at 3:12 PM Andres Freund wrote: > On 2019-07-19 14:50:22 -0400, Robert Haas wrote: > > On Fri, Jul 19, 2019 at 2:04 PM Andres Freund wrote: > > > It doesn't seem that hard - and kind of required for robustness > > > independent of the decision around "completeness" - to find a way to use > > > the locks already held by the prepared transaction. > > > > I'm not wild about finding more subtasks to put on the must-do list, > > but I agree it's doable. > > Isn't that pretty inherently required? How are we otherwise ever going to > be able to roll back a transaction that holds an AEL on a relation it > also modifies? I might be standing on my own head here, though. I think you are. If a transaction holds an AEL on a relation it also modifies, we still only need something like RowExclusiveLock to roll it back. If we retain the transaction's locks until undo is complete, we will not deadlock, but we'll also hold AccessExclusiveLock for a long time. If we release the transaction's locks, we can perform the undo in the background with only RowExclusiveLock, which is full of win. Even if you insist that the undo task should acquire the same lock the relation has, which seems entirely excessive to me, that hardly prevents undo from being applied. Once the original transaction has released its locks, those locks are released, and the undo system can acquire those locks the next time the relation isn't busy (or when it gets to the head of the lock queue). As far as I can see, the only reason why you would care about this is to make the back-pressure system effective against prepared transactions. Different people may want that more or less, but I have a little trouble with the idea that it is a hard requirement. > Well, then perhaps that admin ought not to constantly terminate > connections... I was thinking that new connections wouldn't be forced > to do that if there were still a lot of headroom regarding > #transactions-to-be-rolled-back.
> And if undo workers kept up, you'd also not hit this. Sure, but cascading failure scenarios suck. > > Hmm, that's not a bad idea. So the transactions would have to "count" > the moment they insert their first undo record, which is exactly the > right thing anyway. > > Hmm, but what about transactions that are only touching unlogged tables? > > Wouldn't we throw all that UNDO away in a crash restart? There's no > underlying table data anymore, after all. > > And for proper shutdown checkpoints they could just be included. On thirty seconds' thought, that sounds like it would work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
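For readers following along, Robert's claim rests on PostgreSQL's lock conflict matrix: ROW EXCLUSIVE does not conflict with itself, while ACCESS EXCLUSIVE conflicts with every mode. A toy sketch covering just the modes discussed in this thread (the authoritative table is in the "Explicit Locking" chapter of the docs):

```python
# Simplified conflict check for the two lock modes discussed above.
# ACCESS EXCLUSIVE conflicts with every mode; ROW EXCLUSIVE conflicts
# only with SHARE and stronger, so not with itself.
CONFLICTS = {
    ("RowExclusiveLock", "RowExclusiveLock"): False,    # undo + ordinary DML coexist
    ("RowExclusiveLock", "AccessExclusiveLock"): True,  # undo must wait for DDL
    ("AccessExclusiveLock", "AccessExclusiveLock"): True,
}

def conflicts(a, b):
    """Conflict relation is symmetric; look the pair up in either order."""
    return CONFLICTS.get((a, b), CONFLICTS.get((b, a), True))

# Background undo holding RowExclusiveLock can run alongside other writers...
print(conflicts("RowExclusiveLock", "RowExclusiveLock"))    # False
# ...but queues behind a relation lock like ALTER TABLE's AccessExclusiveLock.
print(conflicts("AccessExclusiveLock", "RowExclusiveLock")) # True
```

This is why releasing the aborted transaction's AEL and re-acquiring only RowExclusiveLock for background undo is "full of win": the undo worker no longer excludes concurrent DML.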
Re: sepgsql seems rather thoroughly broken on Fedora 30
On Fri, Jul 19, 2019 at 11:19 AM Tom Lane wrote: > > I got around to trying this, and lookee here: > > $ sudo sesearch -A -s sepgsql_regtest_user_t -t passwd_file_t > allow domain file_type:blk_file map; [ domain_can_mmap_files ]:True > allow domain file_type:chr_file map; [ domain_can_mmap_files ]:True > allow domain file_type:file map; [ domain_can_mmap_files ]:True > allow domain file_type:lnk_file map; [ domain_can_mmap_files ]:True > > Nothing about passwd_file_t. So *something* is different about the > way the policy is being expanded. Okay, I was finally able to replicate the issue (and fix it). It looks like perhaps the userdom_base_user_template changed and no longer allows reading of passwd_file_t? At any rate, I added some policy to ensure that we have the proper permissions. I also beefed up the test script a bit so it now: - installs the SELinux policy module - spins up a temporary cluster to muddy postgresql.conf and run the setup sql in an isolated environment We probably need to polish this a bit more, but what do you think about something similar to the attached patches? They should hopefully reduce some of the complexity of running these regression tests. 
-- Mike Palmiotto Software Engineer Crunchy Data Solutions https://crunchydata.com From 7f22a9d40ab5ad5334351932bd9010f538ba222a Mon Sep 17 00:00:00 2001 From: Mike Palmiotto Date: Fri, 19 Jul 2019 14:15:23 -0400 Subject: [PATCH 1/2] Make sepgsql-regtest policy module less error-prone --- contrib/sepgsql/sepgsql-regtest.te | 11 +++ 1 file changed, 11 insertions(+) diff --git a/contrib/sepgsql/sepgsql-regtest.te b/contrib/sepgsql/sepgsql-regtest.te index e5d65243e6..5d9af1a0dd 100644 --- a/contrib/sepgsql/sepgsql-regtest.te +++ b/contrib/sepgsql/sepgsql-regtest.te @@ -31,6 +31,9 @@ userdom_base_user_template(sepgsql_regtest_superuser) userdom_manage_home_role(sepgsql_regtest_superuser_r, sepgsql_regtest_superuser_t) userdom_exec_user_home_content_files(sepgsql_regtest_superuser_t) userdom_write_user_tmp_sockets(sepgsql_regtest_superuser_t) + +auth_read_passwd(sepgsql_regtest_superuser_t) + optional_policy(` postgresql_stream_connect(sepgsql_regtest_superuser_t) postgresql_unconfined(sepgsql_regtest_superuser_t) @@ -60,6 +63,9 @@ userdom_base_user_template(sepgsql_regtest_dba) userdom_manage_home_role(sepgsql_regtest_dba_r, sepgsql_regtest_dba_t) userdom_exec_user_home_content_files(sepgsql_regtest_dba_t) userdom_write_user_tmp_sockets(sepgsql_regtest_user_t) + +auth_read_passwd(sepgsql_regtest_dba_t) + optional_policy(` postgresql_admin(sepgsql_regtest_dba_t, sepgsql_regtest_dba_r) postgresql_stream_connect(sepgsql_regtest_dba_t) @@ -98,6 +104,9 @@ userdom_base_user_template(sepgsql_regtest_user) userdom_manage_home_role(sepgsql_regtest_user_r, sepgsql_regtest_user_t) userdom_exec_user_home_content_files(sepgsql_regtest_user_t) userdom_write_user_tmp_sockets(sepgsql_regtest_user_t) + +auth_read_passwd(sepgsql_regtest_user_t) + optional_policy(` postgresql_role(sepgsql_regtest_user_r, sepgsql_regtest_user_t) postgresql_stream_connect(sepgsql_regtest_user_t) @@ -126,6 +135,8 @@ userdom_manage_home_role(sepgsql_regtest_pool_r, sepgsql_regtest_pool_t) 
userdom_exec_user_home_content_files(sepgsql_regtest_pool_t) userdom_write_user_tmp_sockets(sepgsql_regtest_pool_t) +auth_read_passwd(sepgsql_regtest_pool_t) + type sepgsql_regtest_foo_t; type sepgsql_regtest_var_t; type sepgsql_regtest_foo_table_t; -- 2.21.0 From 53fb4c662347311ef6e21ca4c82b64c358f2096b Mon Sep 17 00:00:00 2001 From: Mike Palmiotto Date: Fri, 19 Jul 2019 14:29:22 -0400 Subject: [PATCH 2/2] Add sandboxed cluster for sepgsql regression tests --- contrib/sepgsql/test_sepgsql | 47 1 file changed, 42 insertions(+), 5 deletions(-) diff --git a/contrib/sepgsql/test_sepgsql b/contrib/sepgsql/test_sepgsql index 7530363d2c..d14a1c28bc 100755 --- a/contrib/sepgsql/test_sepgsql +++ b/contrib/sepgsql/test_sepgsql @@ -15,7 +15,43 @@ PG_BINDIR=`pg_config --bindir` # we must move to contrib/sepgsql directory to run pg_regress correctly -cd `dirname $0` +cd `dirname $0` || exit 1 + +# Shut down existing test cluster and delete tmp directory if we have it +if [ -d tmp/ ]; then +# Make sure we don't have a lingering regression test cluster installed +$PG_BINDIR/pg_ctl -D tmp -o "-p 15432" stop + +sudo rm -rf tmp/ +fi + +# Initialize our test environment +if ! $PG_BINDIR/pg_ctl initdb -D tmp; then +echo "test cluster initdb error" +exit 1 +fi + +echo "shared_preload_libraries = 'sepgsql'" >> tmp/postgresql.conf + +for DBNAME in template0 template1 postgres; +do +$PG_BINDIR/postgres --single -F -c exit_on_error=true -p 15432 -D tmp/ $DBNAME \ +< sepgsql.sql > /dev/null +done + +# Reload the policy module +if ! sudo make -f /usr/share/selinux/devel/Makefile reload; then +echo "policy reload error" +echo "" +echo "Unable to build sepgsql-regtest policy
Re: should there be a hard-limit on the number of transactions pending undo?
On Fri, Jul 19, 2019 at 2:57 PM Peter Geoghegan wrote: > On Fri, Jul 19, 2019 at 10:28 AM Robert Haas wrote: > > In scenario #2, the undo work is going to have to be retried in the > > background, and perforce that means reacquiring locks that have been > > released, and so there is a chance of long lock waits and/or deadlock > > that cannot really be avoided. > > I haven't studied the UNDO or zheap stuff in any detail, but I am > concerned about rollbacks that deadlock. I'd feel a lot better about > it if forward progress was guaranteed, somehow. That seems to imply > that locks are retained, which is probably massively inconvenient to > ensure. Not least because it probably requires cooperation from > underlying access methods. Right, that's definitely a big part of the concern here, but I don't really believe that retaining locks is absolutely required, or even necessarily desirable. For instance, suppose that I create a table, bulk-load a whole lotta data into it, and then abort. Further suppose that by the time we start trying to process the undo in the background, we can't get the lock. Well, that probably means somebody is performing DDL on the table. If they just did LOCK TABLE or ALTER TABLE SET STATISTICS, we are going to need to execute that same undo once the DDL is complete. However, if the DDL is DROP TABLE, we're going to find that once we can get the lock, the undo is obsolete, and we don't need to worry about it any more. Had we made it 100% certain that the DROP TABLE couldn't go through until the undo was performed, we could avoid having to worry about the undo having become obsolete ... but that's hardly a win. We're better off allowing the drop and then just chucking the undo. Likely, something like CLUSTER or VACUUM FULL would take care of removing any rows created by aborted transactions along the way, so the undo could be thrown away afterwards without processing it.
Point being - there's at least some chance that the operations which block forward progress also represent progress of another sort. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Fri, Jul 19, 2019 at 10:53 AM Anastasia Lubennikova wrote: > Patch 0002 (must be applied on top of 0001) implements preserving of > correct TID order > inside posting list when inserting new tuples. > This version passes all regression tests including amcheck test. > I also used following script to test insertion into the posting list: Nice! > I suppose it is not the final version of the patch yet, > so I left some debug messages and TODO comments to ease review. I'm fine with leaving them in. I have sometimes distributed a separate patch with debug messages, but now that I think about it, that probably wasn't a good use of time. You will probably want to remove at least some of the debug messages during performance testing. I'm thinking of code that appears in very tight inner loops, such as the _bt_compare() code. > Please, in your review, pay particular attention to usage of > BTreeTupleGetHeapTID. > For posting tuples it returns the first tid from posting list like > BTreeTupleGetMinTID, > but maybe some callers are not ready for that and want > BTreeTupleGetMaxTID instead. > Incorrect usage of these macros may cause some subtle bugs, > which are probably not covered by tests. So, please double-check it. One testing strategy that I plan to use for the patch is to deliberately corrupt a compressed index in a subtle way using pg_hexedit, and then see if amcheck detects the problem. For example, I may swap the order of two TIDs in the middle of a posting list, which is something that is unlikely to produce wrong answers to queries, and won't even be detected by the "heapallindexed" check, but is still wrong. If we can detect very subtle, adversarial corruption like this, then we can detect any real-world problem. Once we have confidence in amcheck's ability to detect problems with posting lists in general, we can use it in many different contexts without much thought. 
For example, we'll probably need to do long running benchmarks to validate the performance of the patch. It's easy to add amcheck testing at the end of each run. Every benchmark is now also a correctness/stress test, for free. > Next week I'm going to check performance and try to find specific > scenarios where this > feature can lead to degradation and measure it, to understand if we need > to make this deduplication optional. Sounds good, though I think it might be a bit too early to decide whether or not it needs to be enabled by default. For one thing, the approach to WAL-logging within _bt_compress_one_page() is probably fairly inefficient, which may be a problem for certain workloads. It's okay to leave it that way for now, because it is not relevant to the core design of the patch. I'm sure that _bt_compress_one_page() can be carefully optimized when the time comes. My current focus is not on the raw performance itself. For now, I am focussed on making sure that the compression works well, and that the resulting indexes "look nice" in general. FWIW, the first few versions of my v12 work on nbtree didn't actually make *anything* go faster. It took a couple of months to fix the more important regressions, and a few more months to fix all of them. I think that the work on this patch may develop in a similar way. I am willing to accept regressions in the unoptimized code during development because it seems likely that you have the right idea about the data structure itself, which is the one thing that I *really* care about. Once you get that right, the remaining problems are very likely to either be fixable with further work on optimizing specific code, or a price that users will mostly be happy to pay to get the benefits. -- Peter Geoghegan
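The ordering invariant Peter wants amcheck to enforce -- heap TIDs strictly ascending within a posting list -- is easy to state as code. A toy checker (not the actual amcheck implementation; TIDs are represented here as (block, offset) pairs):

```python
def posting_list_ok(tids):
    """Check that heap TIDs (block, offset) appear in strictly ascending order.

    Tuples compare lexicographically in Python, which matches TID ordering:
    first by block number, then by offset within the block.
    """
    return all(tids[i] < tids[i + 1] for i in range(len(tids) - 1))

good = [(0, 1), (0, 3), (2, 1)]
# Swapping two middle TIDs -- the subtle corruption described above -- rarely
# changes query answers but still violates the posting list invariant.
bad = [(0, 1), (2, 1), (0, 3)]

print(posting_list_ok(good))  # True
print(posting_list_ok(bad))   # False
```

A check of this shape is what would let amcheck flag the pg_hexedit-style adversarial corruption even when "heapallindexed" cannot.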
Re: should there be a hard-limit on the number of transactions pending undo?
Hi, On 2019-07-19 14:50:22 -0400, Robert Haas wrote: > On Fri, Jul 19, 2019 at 2:04 PM Andres Freund wrote: > > It doesn't seem that hard - and kind of required for robustness > > independent of the decision around "completeness" - to find a way to use > > the locks already held by the prepared transaction. > > I'm not wild about finding more subtasks to put on the must-do list, > but I agree it's doable. Isn't that pretty inherently required? How are we otherwise ever going to be able to roll back a transaction that holds an AEL on a relation it also modifies? I might be standing on my own head here, though. > > You could force new connections to complete the rollback processing of > > the terminated connection, if there's too much pending UNDO. That'd be a > > way of providing back-pressure against such crazy scenarios. Seems > > again that it'd be good to have that pressure, independent of the > > decision on completeness. > > That would definitely provide a whole lot of back-pressure, but it > would also make the system unusable if the undo handler finds a way to > FATAL, or just hangs for some stupid reason (stuck I/O?). It would be > a shame if the administrative action needed to fix the problem were > prevented by the back-pressure mechanism. Well, then perhaps that admin ought not to constantly terminate connections... I was thinking that new connections wouldn't be forced to do that if there were still a lot of headroom regarding #transactions-to-be-rolled-back. And if undo workers kept up, you'd also not hit this. > > Couldn't we record the outstanding transactions in the checkpoint, and > > then recompute the changes to that record during WAL replay? > > Hmm, that's not a bad idea. So the transactions would have to "count" > the moment they insert their first undo record, which is exactly the > right thing anyway. > > Hmm, but what about transactions that are only touching unlogged tables? Wouldn't we throw all that UNDO away in a crash restart?
There's no underlying table data anymore, after all. And for proper shutdown checkpoints they could just be included. Greetings, Andres Freund
Re: should there be a hard-limit on the number of transactions pending undo?
On Fri, Jul 19, 2019 at 10:28 AM Robert Haas wrote: > In scenario #2, the undo work is going to have to be retried in the > background, and perforce that means reacquiring locks that have been > released, and so there is a chance of long lock waits and/or deadlock > that cannot really be avoided. I haven't studied the UNDO or zheap stuff in any detail, but I am concerned about rollbacks that deadlock. I'd feel a lot better about it if forward progress was guaranteed, somehow. That seems to imply that locks are retained, which is probably massively inconvenient to ensure. Not least because it probably requires cooperation from underlying access methods. -- Peter Geoghegan
Re: pgsql: Sync our copy of the timezone library with IANA release tzcode20
I wrote: > Michael Paquier writes: >> This is causing a compilation warning on Windows: > ...so I think your compiler has a point. I shall complain to upstream. The IANA folk want to fix it like this: diff --git a/zic.c b/zic.c index 8bf5628..a84703a 100644 --- a/zic.c +++ b/zic.c @@ -2145,7 +2145,7 @@ writezone(const char *const name, const char *const string, char version, } if (pass == 1 && !want_bloat()) { utcnt = stdcnt = thisleapcnt = 0; - thistimecnt = - locut - hicut; + thistimecnt = - (locut + hicut); thistypecnt = thischarcnt = 1; thistimelim = thistimei; } I'm not quite convinced whether that will silence the warning, but at least it's a bit less unreadable. regards, tom lane
Re: SQL/JSON path issues/questions
I would like to help review this documentation. Can you please point me in the right direction? Thanks Steve On Fri, Jul 19, 2019 at 2:02 AM Alexander Korotkov < a.korot...@postgrespro.ru> wrote: > On Thu, Jul 18, 2019 at 5:08 PM Thom Brown wrote: > > On Tue, 16 Jul 2019 at 19:44, Alexander Korotkov > > wrote: > > > > > > On Tue, Jul 16, 2019 at 9:22 PM Thom Brown wrote: > > > > Now I'm looking at the @? and @@ operators, and getting a bit > > > > confused. This following query returns true, but I can't determine > > > > why: > > > > > > > > # SELECT '{"a":[1,2,3,4,5]}'::jsonb @? '$.b == "hello"'::jsonpath; > > > > ?column? > > > > -- > > > > t > > > > (1 row) > > > > > > > > "b" is not a valid item, so there should be no match. Perhaps it's > my > > > > misunderstanding of how these operators are supposed to work, but the > > > > documentation is quite terse on the behaviour. > > > > > > So, the result of jsonpath evaluation is single value "false". > > > > > > # SELECT jsonb_path_query_array('{"a":[1,2,3,4,5]}'::jsonb, '$.b == > "hello"'); > > > jsonb_path_query_array > > > > > > [false] > > > (1 row) > > > > > > @@ operator checks that result is "true". This is why it returns > "false". > > > > > > @? operator checks if result is not empty. So, it's single "false" > > > value, not empty list. This is why it returns "true". > > > > > > Perhaps, we need to clarify this in docs providing more explanation. > > > > Understood. Thanks. > > > > Also, is there a reason why jsonb_path_query doesn't have an operator > analog? > > The point of existing operator analogues is index support. We > introduced operators for searches we can accelerate using GIN indexes. > > jsonb_path_query() doesn't return bool. So, even if we have an > operator for that, it wouldn't get index support. > > However, we can discuss introduction of operator analogues for other > functions as syntax sugar. 
> > -- > Alexander Korotkov > Postgres Professional: http://www.postgrespro.com > The Russian Postgres Company > > >
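Alexander's distinction between the two operators can be paraphrased in a few lines (a toy model, not the real jsonpath machinery): jsonpath evaluation yields a list of result items; @@ checks whether that result is the single boolean true, while @? checks only whether the result list is non-empty.

```python
def op_at_at(result_items):
    """@@ : does jsonpath evaluation yield the single boolean true?"""
    return result_items == [True]

def op_at_question(result_items):
    """@? : does jsonpath evaluation yield any items at all?"""
    return len(result_items) > 0

# '$.b == "hello"' over {"a": [1,2,3,4,5]} evaluates to [false],
# as shown by jsonb_path_query_array in the message above.
result = [False]
print(op_at_at(result))        # False -- the single result is not "true"
print(op_at_question(result))  # True  -- [false] is non-empty, hence the surprise
```

This makes the original puzzle concrete: @? returns true not because "b" matched, but because the comparison produced a (non-empty) result of [false].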
Re: should there be a hard-limit on the number of transactions pending undo?
On Fri, Jul 19, 2019 at 2:04 PM Andres Freund wrote: > It doesn't seem that hard - and kind of required for robustness > independent of the decision around "completeness" - to find a way to use > the locks already held by the prepared transaction. I'm not wild about finding more subtasks to put on the must-do list, but I agree it's doable. > I'm not following, unfortunately. > > I don't understand what exactly the scenario is you refer to. You say > "session N+1 can just stay in the same transaction", but then you also > reference something taking "probably a five digit number of > transactions". Are those transactions the prepared ones? So you open a bunch of sessions. All but one of them begin a transaction, insert data into a table, and then prepare. The last one begins a transaction and locks the table. Now you roll back all the prepared transactions. Those sessions now begin new transactions, insert data into a second table, and prepare the second set of transactions. The last session, which still has the first table locked, now locks the second table in addition. Now you again roll back all the prepared transactions. At this point you have 2 * max_prepared_transactions transactions that are waiting for undo, all blocked on that last session that holds locks on both tables. So now you go have all of those sessions begin a third transaction, and they all insert into a third table, and prepare. The last session now attempts AEL on that third table, and once it's waiting, you roll back all the prepared transactions, after which that last session successfully picks up its third table lock. You can keep repeating this, locking a new table each time, until you run out of lock table space, by which time you will have roughly max_prepared_transactions * size_of_lock_table transactions waiting for undo processing. > You could force new connections to complete the rollback processing of > the terminated connection, if there's too much pending UNDO.
That'd be a > way of providing back-pressure against such crazy scenarios. Seems > again that it'd be good to have that pressure, independent of the > decision on completeness. That would definitely provide a whole lot of back-pressure, but it would also make the system unusable if the undo handler finds a way to FATAL, or just hangs for some stupid reason (stuck I/O?). It would be a shame if the administrative action needed to fix the problem were prevented by the back-pressure mechanism. One thing I've thought about, which I think would be helpful for a variety of scenarios, is to have a facility that forces a computed delay at each write transaction (when it first writes WAL, or when an XID is assigned), or we could adapt that to this case and say the beginning of each undo-using transaction. So for example if you are about to run out of space in pg_wal, you can slow things down to let the checkpoint complete, or if you are about to run out of XIDs, you can slow things down to let autovacuum complete, or if you are about to run out of undo slots, you can slow things down to let some undo complete. The trick is to make sure that you only wait when it's likely to do some good; if you wait because you're running out of XIDs and the reason you're running out of XIDs is because somebody left a replication slot or a prepared transaction around, the back-pressure is useless. > Couldn't we record the outstanding transactions in the checkpoint, and > then recompute the changes to that record during WAL replay? Hmm, that's not a bad idea. So the transactions would have to "count" the moment they insert their first undo record, which is exactly the right thing anyway. Hmm, but what about transactions that are only touching unlogged tables? > Yea, I think that's what it boils down to... Would be good to have a few > more opinions on this. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
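The computed-delay facility Robert sketches could, for instance, ramp the delay up as a resource (pg_wal space, XIDs, undo slots) approaches exhaustion. A rough sketch -- the function name, the 80% threshold, and the linear ramp are all invented for illustration, not taken from any patch:

```python
def backpressure_delay(used, capacity, start_fraction=0.8, max_delay_ms=1000.0):
    """Return a per-transaction delay in milliseconds that ramps up as a
    resource nears exhaustion: zero below start_fraction of capacity,
    rising linearly to max_delay_ms at 100%."""
    fraction = used / capacity
    if fraction <= start_fraction:
        return 0.0
    # Linear ramp over the remaining headroom between the threshold and the limit.
    return max_delay_ms * (fraction - start_fraction) / (1.0 - start_fraction)

print(backpressure_delay(700, 1000))   # plenty of headroom: no delay
print(backpressure_delay(900, 1000))   # halfway through the danger zone
print(backpressure_delay(1000, 1000))  # at the limit: maximum delay
```

As the message notes, the hard part isn't the curve but deciding when waiting will actually help; a delay keyed to XID consumption does nothing if the real problem is a forgotten replication slot.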
Re: [sqlsmith] Crash in mcv_get_match_bitmap
Tomas Vondra writes: > [ mcv fixes ] These patches look OK to me. regards, tom lane
Re: Tid scan improvements
Hi, On 2019-07-19 13:54:59 +1200, David Rowley wrote: > On Thu, 18 Jul 2019 at 14:30, Andres Freund wrote: > > I think the AM part of this patch might be the wrong approach - it won't > > do anything meaningful for an AM that doesn't directly map the ctid to a > > specific location in a block (e.g. zedstore). To me it seems the > > callback ought to be to get a range of tids, and the tidrange scan > > shouldn't do anything but determine the range of tids the AM should > > return. > > Sounds like that's going to require adding some new fields to > HeapScanDescData, then some callback similar to heap_setscanlimits to > set those fields. > > Then, we'd either need to: > > 1. Make the table_scan_getnextslot() implementations check the tuple > falls within the range, or > 2. add another callback that pays attention to the set TID range. > The problem with #1 is that it would add overhead to normal seqscans, > which seems like a bad idea. > > Did you imagine two additional callbacks, one to set the TID range, > then one to scan it? Duplicating the logic in heapgettup_pagemode() > and heapgettup() looks pretty horrible, but I guess we could add a > wrapper around it that loops until it gets the first tuple and bails > once it scans beyond the final tuple. > > Is that what you had in mind? Yea, I was thinking of something like 2. We already have a few extra types of scan nodes (bitmap heap and sample scan), it'd not be bad to add another. And as you say, they can just share most of the guts: For heap I'd just implement pagemode, and perhaps split heapgettup_pagemode into two parts (one to do the page processing, the other to determine the relevant page). You say that we'd need new fields in HeapScanDescData - not so sure about that, it seems feasible to just provide the boundaries in the call? But I think it'd also just be fine to have the additional fields. Greetings, Andres Freund
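The "split heapgettup_pagemode into two parts" idea can be sketched in miniature. This is an illustrative Python model only (TIDs as (block, offset) pairs, a dict standing in for the heap), not the proposed C API: the outer loop determines which pages fall inside the TID range, and the per-page part clamps offsets so the common seqscan path never has to bounds-check every tuple:

```python
def tidrange_scan(pages, min_tid, max_tid):
    """Yield all TIDs in [min_tid, max_tid], heap-style.

    `pages` maps block number -> number of items on that page; a TID is a
    (block, offset) pair with offsets starting at 1.  The range check is
    confined to the first and last blocks; interior blocks are scanned
    whole, as a plain seqscan page-processing step would.
    """
    min_blk, min_off = min_tid
    max_blk, max_off = max_tid
    for blk in range(min_blk, max_blk + 1):
        nitems = pages.get(blk, 0)
        # Per-page part: clamp offsets only on the boundary blocks.
        first = min_off if blk == min_blk else 1
        last = max_off if blk == max_blk else nitems
        for off in range(first, min(last, nitems) + 1):
            yield (blk, off)
```

This mirrors option 2 from the quoted mail: the range is established once up front, and the scan loop itself stays oblivious except at the two boundary pages.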
Re: Broken defenses against dropping a partitioning column
I wrote: > So I think we're probably stuck with the approach of adding new internal > dependencies. If we go that route, then our options for the released > branches are (1) do nothing, or (2) back-patch the code that adds such > dependencies, but without a catversion bump. That would mean that only > tables created after the next minor releases would have protection against > this problem. That's not ideal but maybe it's okay, considering that we > haven't seen actual field reports of trouble of this kind. Here's a proposed patch for that. It's mostly pretty straightforward, except I had to add some recursion defenses in findDependentObjects that weren't there before. But those seem like a good idea anyway to prevent infinite recursion in case of bogus entries in pg_depend. I also took the liberty of improving some related error messages that I thought were unnecessarily vague and not up to project standards. Per above, I'm envisioning applying this to HEAD and v12 with a catversion bump, and to v11 and v10 with no catversion bump. 
regards, tom lane

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 6315fc4..3356461 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -543,6 +543,7 @@ findDependentObjects(const ObjectAddress *object,
 				ObjectIdGetDatum(object->objectId));
 	if (object->objectSubId != 0)
 	{
+		/* Consider only dependencies of this sub-object */
 		ScanKeyInit(&key[2],
 					Anum_pg_depend_objsubid,
 					BTEqualStrategyNumber, F_INT4EQ,
@@ -550,7 +551,10 @@ findDependentObjects(const ObjectAddress *object,
 		nkeys = 3;
 	}
 	else
+	{
+		/* Consider dependencies of this object and any sub-objects it has */
 		nkeys = 2;
+	}

 	scan = systable_beginscan(*depRel, DependDependerIndexId, true,
 							  NULL, nkeys, key);
@@ -567,6 +571,18 @@ findDependentObjects(const ObjectAddress *object,
 		otherObject.objectId = foundDep->refobjid;
 		otherObject.objectSubId = foundDep->refobjsubid;

+		/*
+		 * When scanning dependencies of a whole object, we may find rows
+		 * linking sub-objects of the object to the object itself.  (Normally,
+		 * such a dependency is implicit, but we must make explicit ones in
+		 * some cases involving partitioning.)  We must ignore such rows to
+		 * avoid infinite recursion.
+		 */
+		if (otherObject.classId == object->classId &&
+			otherObject.objectId == object->objectId &&
+			object->objectSubId == 0)
+			continue;
+
 		switch (foundDep->deptype)
 		{
 			case DEPENDENCY_NORMAL:
@@ -855,6 +871,16 @@ findDependentObjects(const ObjectAddress *object,
 		otherObject.objectSubId = foundDep->objsubid;

 		/*
+		 * If what we found is a sub-object of the current object, just ignore
+		 * it.  (Normally, such a dependency is implicit, but we must make
+		 * explicit ones in some cases involving partitioning.)
+		 */
+		if (otherObject.classId == object->classId &&
+			otherObject.objectId == object->objectId &&
+			object->objectSubId == 0)
+			continue;
+
+		/*
 		 * Must lock the dependent object before recursing to it.
 		 */
 		AcquireDeletionLock(&otherObject, 0);
@@ -1588,8 +1614,10 @@ recordDependencyOnExpr(const ObjectAddress *depender,
  * As above, but only one relation is expected to be referenced (with
  * varno = 1 and varlevelsup = 0).  Pass the relation OID instead of a
  * range table.  An additional frammish is that dependencies on that
- * relation (or its component columns) will be marked with 'self_behavior',
- * whereas 'behavior' is used for everything else.
+ * relation's component columns will be marked with 'self_behavior',
+ * whereas 'behavior' is used for everything else; also, if reverse_self
+ * is true, those dependencies are reversed so that the columns are made
+ * to depend on the table not vice versa.
  *
  * NOTE: the caller should ensure that a whole-table dependency on the
  * specified relation is created separately, if one is needed.  In particular,
@@ -1602,7 +1630,7 @@ recordDependencyOnSingleRelExpr(const ObjectAddress *depender,
 								Node *expr, Oid relId,
 								DependencyType behavior,
 								DependencyType self_behavior,
-								bool ignore_self)
+								bool reverse_self)
 {
 	find_expr_references_context context;
 	RangeTblEntry rte;
@@ -1626,7 +1654,8 @@ recordDependencyOnSingleRelExpr(const ObjectAddress *depender,
 	eliminate_duplicate_dependencies(context.addrs);

 	/* Separate self-dependencies if necessary */
-	if (behavior != self_behavior && context.addrs->numrefs > 0)
+	if ((behavior != self_behavior || reverse_self) &&
+		context.addrs->numrefs > 0)
 	{
 		ObjectAddresses *self_addrs;
 		ObjectAddress *outobj;
@@ -1657,11 +1686,23 @@ recordDependencyOnSingleRelExpr(const ObjectAddress *depender,
 		}
 		context.addrs->numrefs = outrefs;

-		/* Record the self-dependencies */
-		if (!ignore_self)
+		/* Record the self-dependencies with the appropriate direction */
+		if (!reverse_self)
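The recursion defense in the patch can be illustrated with a toy dependency walker. This is a loose Python model of the idea only — objects as (classid, objid, subid) triples and a list of (dependent, referenced) pairs standing in for pg_depend; the catalog numbers below are invented. The key line mirrors the patch: when scanning the dependencies of a whole object (subid == 0), rows that merely link its own sub-objects back to it are skipped, since the sub-objects go away with the object anyway and recursing through them could loop forever:

```python
def find_dependents(target, dep_rows, result=None):
    """Recursively collect objects that depend on `target`.

    dep_rows is a list of (dependent, referenced) pairs; each object is a
    (classid, objid, subid) triple.  Rows connecting `target` to its own
    sub-objects are ignored -- the recursion defense from the patch.
    """
    if result is None:
        result = []
    for dependent, referenced in dep_rows:
        if referenced[:2] != target[:2]:
            continue                      # references some other object
        if target[2] != 0 and referenced[2] != target[2]:
            continue                      # scanning one sub-object only
        # Recursion defense: a sub-object of target depending on target
        # itself is dropped implicitly with it; don't recurse through it.
        if dependent[:2] == target[:2] and target[2] == 0:
            continue
        if dependent not in result:
            result.append(dependent)
            find_dependents(dependent, dep_rows, result)
    return result
```

With an explicit column-depends-on-table row (the kind the partitioning fix adds), dropping the table still finds outside dependents such as a view on the column, while the self-linking row is ignored.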
Re: should there be a hard-limit on the number of transactions pending undo?
Hi, On 2019-07-19 13:28:14 -0400, Robert Haas wrote: > I want to consider three specific scenarios that could cause undo > application to fail, and then offer some observations about them. > > Scenario #1: > > 1. Sessions 1..N each begin a transaction and write a bunch of data to > a table (at least enough that we'll try to perform undo in the > background). > 2. Session N+1 begins a transaction and tries to lock the same table. > It blocks. > 3. Sessions 1..N abort, successfully pushing the undo work into the > background. > 4. Session N+1 now acquires the lock and sits on it. > 5. Optionally, repeat steps 1-4 K times, each time for a different table. > > Scenario #2: > > 1. Any number of sessions begin a transaction, write a bunch of data, > and then abort. > 2. They all try to perform undo in the foreground. > 3. They get killed using pg_terminate_backend(). > > Scenario #3: > > 1. A transaction begins, does some work, and then aborts. > 2. When undo processing occurs, 1% of such transactions fail during > undo apply because of a bug in the table AM. > 3. When undo processing retries after a failure, it fails again > because the bug is triggered by something about the contents of the > undo record, rather than by, say, concurrency. > However, if prepared transactions are in use, we could have a variant > of scenario #1 in which each transaction is first prepared, and then > the prepared transaction is rolled back. Unlike the ordinary case, > this can lead to a nearly-unbounded growth in the number of > transactions that are pending undo, because we don't have a way to > transfer the locks held by PGPROC used for the prepare to some running > session that could perform the undo. It doesn't seem that hard - and kind of required for robustness independent of the decision around "completeness" - to find a way to use the locks already held by the prepared transaction. 
> It's not necessary to have a > large value for max_prepared_transactions; it only has to be greater > than 0, because we can keep reusing the same slots with different > tables. That is, let N = max_prepared_xacts, and let K be anything at > all; session N+1 can just stay in the same transaction and keep on > taking new locks one at a time until the lock table fills up; not sure > exactly how long that will take, but it's probably a five digit number > of transactions, or maybe six. In this case, we can't force undo into > the foreground, so we can exceed the number of transactions that are > supposed to be backgrounded. I'm not following, unfortunately. I don't understand what exactly the scenario is you refer to. You say "session N+1 can just stay in the same transaction", but then you also reference something taking "probably a five digit number of transactions". Are those transactions the prepared ones? Also, if somebody fills up the entire lock table, then the system is effectively down - independent of UNDO, and no meaningful amount of UNDO is going to be written. Perhaps we need some better resource control, but that's really independent of UNDO. Perhaps you can just explain the scenario in a few more words? My comments regarding it probably make no sense, given how little I understand what the scenario is. > In scenario #2, the undo work is going to have to be retried in the > background, and perforce that means reacquiring locks that have been > released, and so there is a chance of long lock waits and/or deadlock > that cannot really be avoided. I think there is basically no way at > all to avoid an unbounded accumulation of transactions requiring undo > in this case, just as in the similar case where the cluster is > repeatedly shut down or repeatedly crashes. Eventually, if you have a > hard cap on the number of transactions requiring undo, you're going to > hit it, and have to start refusing new undo-using transactions.
As > Thomas pointed out, that might still be better than some other systems > which use undo, where the system doesn't open for any transactions at > all after a restart until all undo is retired, and/or where undo is > never processed in the background. But it's a possible concern. On the > other hand, if you don't have a hard cap, the system may just get > further and further behind until it eventually melts, and that's also > a possible concern. You could force new connections to complete the rollback processing of the terminated connection, if there's too much pending UNDO. That'd be a way of providing back-pressure against such crazy scenarios. Seems again that it'd be good to have that pressure, independent of the decision on completeness. > One other thing that seems worth noting is that we have to consider > what happens after a restart. After a crash, and depending on exactly > how we design it perhaps also after a non-crash restart, we won't > immediately know how many outstanding transactions need undo; we'll > have to grovel through the undo logs to find out. If we've got a
Re: pg_receivewal documentation
On Tue, Jul 16, 2019 at 9:38 PM Michael Paquier wrote: > I think we should really document the caveat with priority-based sets > of standbys as much as quorum-based sets. For example if a user sets > synchronous_commit = remote_apply in postgresql.conf, and then sets > s_s_names to '2(pg_receivewal, my_connected_standby)' to get a > priority-based set, then you have the same problem, and pg_receivewal > is not the only synchronous standby in this configuration. The patch > does not cover that case properly. I don't agree with this approach. It seems to me that the original was too precise already, and making it more precise only exacerbates the situation. The point is that synchronous_commit = remote_apply is *categorically* a bad idea for sessions running pg_receivewal. The reason why you're adding all this complexity is to try to distinguish between the case where it's merely a bad idea and the case where it will also completely fail to work. But why is it important to describe the scenarios under which it will altogether fail to work? You could just say something like: Since pg_receivewal does not apply WAL, you should not allow it to become a synchronous standby when synchronous_commit = remote_apply. If it does, it will appear to be a standby which never catches up, which may cause commits to block. To avoid this, you should either configure an appropriate value for synchronous_standby_names, or specify an application_name for pg_receivewal that does not match it, or change the value of synchronous_commit to something other than remote_apply. I think that'd be a lot more useful than enumerating the total-failure scenarios. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
17.07.2019 19:36, Anastasia Lubennikova: There is one major issue left - preserving TID order in posting lists. For a start, I added a sort into the BTreeFormPostingTuple function. It turned out to be not very helpful, because we cannot check this invariant lazily. Now I am working on patching _bt_binsrch_insert() and _bt_insertonpg() to implement insertion into the middle of the posting list. I'll send a new version this week.

Patch 0002 (must be applied on top of 0001) implements preservation of correct TID order inside the posting list when inserting new tuples. This version passes all regression tests, including the amcheck test. I also used the following script to test insertion into the posting list:

set client_min_messages to debug4;
drop table tbl;
create table tbl (i1 int, i2 int);
insert into tbl select 1, i from generate_series(0,1000) as i;
insert into tbl select 1, i from generate_series(0,1000) as i;
create index idx on tbl (i1);
delete from tbl where i2 < 500;
vacuum tbl;
insert into tbl select 1, i from generate_series(1001, 1500) as i;

The last insert triggers several insertions that can be seen in debug messages. I suppose it is not the final version of the patch yet, so I left some debug messages and TODO comments to ease review. Please, in your review, pay particular attention to usage of BTreeTupleGetHeapTID. For posting tuples it returns the first tid from the posting list, like BTreeTupleGetMinTID, but maybe some callers are not ready for that and want BTreeTupleGetMaxTID instead. Incorrect usage of these macros may cause some subtle bugs, which are probably not covered by tests. So, please double-check it. Next week I'm going to check performance and try to find specific scenarios where this feature can lead to degradation and measure it, to understand if we need to make this deduplication optional.
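The invariant patch 0002 enforces — new TIDs must land at the right position inside an existing posting list, not just be appended — is easy to model. This is an illustrative Python sketch only, with TIDs as (block, offset) pairs and a plain list standing in for the posting list:

```python
import bisect

def posting_list_insert(posting, new_tid):
    """Insert new_tid into a sorted posting list, keeping heap TID order.

    posting is a list of (block, offset) pairs sorted ascending; a
    duplicate TID is rejected, since each heap tuple may appear once.
    """
    pos = bisect.bisect_left(posting, new_tid)
    if pos < len(posting) and posting[pos] == new_tid:
        raise ValueError("duplicate TID in posting list")
    posting.insert(pos, new_tid)
    return posting
```

The test script in the mail exercises exactly this case: vacuuming frees low-numbered TIDs, so later inserts produce heap TIDs that sort into the middle of posting lists built earlier.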
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c index 9126c18..2b05b1e 100644 --- a/contrib/amcheck/verify_nbtree.c +++ b/contrib/amcheck/verify_nbtree.c @@ -1033,12 +1033,34 @@ bt_target_page_check(BtreeCheckState *state) { IndexTuple norm; - norm = bt_normalize_tuple(state, itup); - bloom_add_element(state->filter, (unsigned char *) norm, - IndexTupleSize(norm)); - /* Be tidy */ - if (norm != itup) -pfree(norm); + if (BTreeTupleIsPosting(itup)) + { +IndexTuple onetup; +int i; + +/* Fingerprint all elements of posting tuple one by one */ +for (i = 0; i < BTreeTupleGetNPosting(itup); i++) +{ + onetup = BTreeGetNthTupleOfPosting(itup, i); + + norm = bt_normalize_tuple(state, onetup); + bloom_add_element(state->filter, (unsigned char *) norm, + IndexTupleSize(norm)); + /* Be tidy */ + if (norm != onetup) + pfree(norm); + pfree(onetup); +} + } + else + { +norm = bt_normalize_tuple(state, itup); +bloom_add_element(state->filter, (unsigned char *) norm, + IndexTupleSize(norm)); +/* Be tidy */ +if (norm != itup) + pfree(norm); + } } /* diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index 602f884..26ddf32 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -20,6 +20,7 @@ #include "access/tableam.h" #include "access/transam.h" #include "access/xloginsert.h" +#include "catalog/catalog.h" #include "miscadmin.h" #include "storage/lmgr.h" #include "storage/predicate.h" @@ -56,6 +57,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf, static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup, OffsetNumber itup_off); static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel); +static bool insert_itupprev_to_page(Page page, BTCompressState *compressState); +static void _bt_compress_one_page(Relation rel, 
Buffer buffer, Relation heapRel); /* * _bt_doinsert() -- Handle insertion of a single index tuple in the tree. @@ -759,6 +762,12 @@ _bt_findinsertloc(Relation rel, _bt_vacuum_one_page(rel, insertstate->buf, heapRel); insertstate->bounds_valid = false; } + + /* + * If the target page is full, try to compress the page + */ + if (PageGetFreeSpace(page) < insertstate->itemsz) + _bt_compress_one_page(rel, insertstate->buf, heapRel); } else { @@ -806,6 +815,11 @@ _bt_findinsertloc(Relation rel, } /* + * Before considering moving right, try to compress the page + */ + _bt_compress_one_page(rel, insertstate->buf, heapRel); + + /* * Nope, so check conditions (b) and (c) enumerated above * * The earlier _bt_check_unique() call may well have established a @@ -2286,3 +2300,241 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel) * the page. */ } + +/* + * Add new item (compressed or not) to the page,
should there be a hard-limit on the number of transactions pending undo?
On Tue, Jul 16, 2019 at 6:23 PM Andres Freund wrote: > Yea, that seems like a question independent of the "completeness" > requirement. If desirable, it seems trivial to either have > RollbackHashEntry have per-persistence level status (for one entry per > xid), or not (for per-persistence entries). I want to talk more about the "completeness" issue, which is basically a question of whether we should (a) put a hard limit on the number of transactions that have unprocessed undo and that are either aborted or in progress (and thus capable of aborting) as proposed by Andres or (b) not have such a hard limit, as originally proposed by me. I think everyone who is not working on this code has installed an automatic filter rule to send the original thread to /dev/null, so I'm changing the subject line in the hopes of getting some of those people to pay attention. If it doesn't work, at least the concerns will be memorialized in case it comes up later. I originally proposed (b) because if undo application is failing for some reason, it seemed better not to bring the whole system to a screeching halt, but rather just cause incremental performance degradation or something. However, Andres has pointed out this may postpone remedial action and let bugs go undetected, so it might not actually be better. Also, some of the things we need to do, like computing the oldest XID whose undo has not been retired, are tricky/expensive if you don't have complete data in memory, and you can't be sure of having complete data in shared memory without a hard limit. No matter which way we go, failures to apply undo had better be really rare, or we're going to be in really serious trouble, so we're only concerned here with how to handle what is hopefully a very rare scenario, not the common case. I want to consider three specific scenarios that could cause undo application to fail, and then offer some observations about them. Scenario #1: 1. 
Sessions 1..N each begin a transaction and write a bunch of data to a table (at least enough that we'll try to perform undo in the background). 2. Session N+1 begins a transaction and tries to lock the same table. It blocks. 3. Sessions 1..N abort, successfully pushing the undo work into the background. 4. Session N+1 now acquires the lock and sits on it. 5. Optionally, repeat steps 1-4 K times, each time for a different table. Scenario #2: 1. Any number of sessions begin a transaction, write a bunch of data, and then abort. 2. They all try to perform undo in the foreground. 3. They get killed using pg_terminate_backend(). Scenario #3: 1. A transaction begins, does some work, and then aborts. 2. When undo processing occurs, 1% of such transactions fail during undo apply because of a bug in the table AM. 3. When undo processing retries after a failure, it fails again because the bug is triggered by something about the contents of the undo record, rather than by, say, concurrency. In scenario one, the problem is mostly self-correcting. When we decide that we've got too many things queued up for background processing, and start to force undo processing to happen in the foreground, it will start succeeding, because the foreground process will have retained the lock that it took before writing any data and can therefore undo those writes without having to wait for the lock. However, this will do nothing to help the requests that are already in the background, which will just sit there until they can get the lock. I think there is a good argument that they should not actually wait for the lock, or only for a certain time period, and then give up and put the transaction on the error queue for reprocessing at a later time. Otherwise, we're pinning down undo workers, which could easily lead to starvation, just as it does for autovacuum. On the whole, this doesn't sound too bad. 
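The suggestion above — that a background undo worker should not wait on a lock indefinitely, but give up after a while and requeue the request — can be sketched as a small model. This is illustrative Python only; `try_lock` and `apply_undo` are invented stand-ins for the lock manager and the undo machinery, and the timeout value is arbitrary:

```python
import queue
import time

def run_undo_request(xact, try_lock, apply_undo, error_queue, lock_timeout=0.05):
    """One background undo worker step.

    Try to acquire the target lock for at most lock_timeout seconds; on
    success, apply the undo.  On timeout, put the transaction on the
    error queue for later reprocessing instead of pinning the worker,
    which avoids the starvation problem described above.
    """
    deadline = time.monotonic() + lock_timeout
    while True:
        if try_lock(xact):
            apply_undo(xact)
            return True
        if time.monotonic() >= deadline:
            error_queue.put(xact)  # retry later rather than pin the worker
            return False
        time.sleep(0.005)
```

The design point is the same as for autovacuum: a bounded pool of workers must never be wedged behind a single uncooperative lock holder.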
We shouldn't be able to fill up the queue with small transactions, because of the restriction that we only push undo work into the background when the transaction is big enough, and if we fill it up with big transactions, then (1) back-pressure will prevent the problem from crippling the system and (2) eventually the problem will be self-correcting, because when the transaction in session N+1 ends, the undo will all go through and everything will be fine. The only real user impact of this scenario is that unrelated work on the system might notice that large rollbacks are now happening in the foreground rather than the background, and if that causes a problem, the DBA can fix it by terminating session N+1. Even if she doesn't, you shouldn't ever hit the hard cap. However, if prepared transactions are in use, we could have a variant of scenario #1 in which each transaction is first prepared, and then the prepared transaction is rolled back. Unlike the ordinary case, this can lead to a nearly-unbounded growth in the number of transactions that are pending undo, because we don't have a way to transfer the locks held by
Re: global / super barriers (for checksums)
Hi, On 2019-07-10 15:31:11 +0200, Magnus Hagander wrote: > In re-reading this, I notice there are a lot of references to Intterrupt > (with two t). I'm guessing this is just a spelling error, and not something > that actually conveys some meaning? Just a spelling error. I think I wrote the patch in a night after pgconf.eu, to allow you to quickly make progress :P > Can you elaborate on what you mean with: > + /* XXX: need a more principled approach here */ > Is that the thing you refer to above about "checksum internals"? I think I didn't actually mean "checksum" but instead "checkpoint". It does bother me that we have an operation as long-running as BufferSync() commonly is, without a proper way to accept events. There's a hack for doing something similar-ish in CheckpointWriteDelay(), for absorbing fsync requests, but it doesn't trigger for checkpoints not done in checkpointer, nor is it really extensible. > Also in checking we figured it'd be nice to have a wait event for this, > since a process can potentially get stuck in an infinite loop waiting for > some other process if it's misbehaving. Kind of like the attached? Yea, that makes sense. Greetings, Andres Freund
Re: [RFC] Removing "magic" oids
Hi, On 2019-07-07 10:00:35 -0700, Noah Misch wrote: > On Tue, Nov 20, 2018 at 01:20:04AM -0800, Andres Freund wrote: > > On 2018-11-14 21:02:41 -0800, Andres Freund wrote: > > > > The point of the test is to exercise OidGenLock by issuing many parallel > > > > GetNewOidWithIndex() and verifying absence of duplicates. There's > > > > nothing > > > > special about OidGenLock, but it is important to use an operation that > > > > takes a > > > > particular LWLock many times, quickly. If the test query spends too > > > > much time > > > > on things other than taking locks, it will catch locking races too > > > > rarely. > > > > > > Sequences ought to do that, too. And if it's borked, we'd hopefully see > > > unique violations. But it's definitely not a 1:1 replacement. > > > I've tested this on ppc. Neither the old version nor the new version > > stress test spinlocks sufficiently to error out with weakened spinlocks > > (not that surprising, there are no spinlocks in any hot path of either > > workload). Both versions very reliably trigger on weakened lwlocks. So I > > think we're comparatively good on that front. > > I tested this on xlc, the compiler that motivated the OID test, and the v12+ > version of the test didn't catch the bug[1] with xlc 13.1.3. CREATE TYPE > ... AS ENUM generates an OID for each label, so the attached patch makes the > v12+ test have locking behavior similar to its v11 ancestor. Interesting. It's not obvious to me why that works meaningfully better. > [1] > https://postgr.es/m/flat/a72cfcb0-37d0-de2f-b3ec-f38ad8d6a...@postgrespro.ru > diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl > b/src/bin/pgbench/t/001_pgbench_with_server.pl > index dc2c72f..3b097a9 100644 > --- a/src/bin/pgbench/t/001_pgbench_with_server.pl > +++ b/src/bin/pgbench/t/001_pgbench_with_server.pl > @@ -58,27 +58,20 @@ sub pgbench > return; > } > > -# Test concurrent insertion into table with serial column. 
This > -# indirectly exercises LWLock and spinlock concurrency. This test > -# makes a 5-MiB table. > - > -$node->safe_psql('postgres', > - 'CREATE UNLOGGED TABLE insert_tbl (id serial primary key); '); > - > +# Test concurrent OID generation via pg_enum_oid_index. This indirectly > +# exercises LWLock and spinlock concurrency. > +my $labels = join ',', map { "'l$_'" } 1 .. 1000; > pgbench( > '--no-vacuum --client=5 --protocol=prepared --transactions=25', > 0, > [qr{processed: 125/125}], > [qr{^$}], > - 'concurrent insert workload', > + 'concurrent OID generation', > { > '001_pgbench_concurrent_insert' => > - 'INSERT INTO insert_tbl SELECT FROM generate_series(1,1000);' > + "CREATE TYPE pg_temp.e AS ENUM ($labels); DROP TYPE > pg_temp.e;" > }); Hm, perhaps we should just do something stupid like an insert into a catalog table, determining the oid to insert with pg_nextoid? That ought to be a lot faster and thus more "stress testing" than going through a full blown DDL statement? But perhaps that's just too ugly. Greetings, Andres Freund
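What the pgbench test is checking — many workers contending on one lock, with no value ever handed out twice — can be modelled directly. This is an illustrative Python analogue only, with a threading.Lock standing in for OidGenLock and a counter standing in for GetNewOidWithIndex(); the parameters are arbitrary:

```python
import threading

def generate_ids(nthreads=5, per_thread=200):
    """Have nthreads workers each take per_thread values from a shared
    counter under a lock, and return every value handed out.  A broken
    lock would show up as duplicate values, which is exactly the failure
    mode the OID stress test is designed to catch."""
    lock = threading.Lock()
    counter = [0]
    results = []

    def worker():
        local = []
        for _ in range(per_thread):
            with lock:               # stands in for LWLockAcquire(OidGenLock)
                counter[0] += 1
                local.append(counter[0])
        results.append(local)        # list.append is atomic in CPython

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [i for chunk in results for i in chunk]
```

As Noah's point about xlc illustrates, the interesting part is not the happy path but whether a miscompiled or weakened lock lets two workers observe the same counter value; checking for duplicates after the fact is what surfaces that.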
Re: pg_receivewal documentation
Hi, On 7/18/19 9:09 PM, Michael Paquier wrote:

  pg_receivewal -D /tmp/wal -S replica1 --synchronous -h localhost -p 5432 -U repluser -W
  psql -c 'SELECT * FROM pg_stat_replication;' postgres
  psql -c 'SELECT * FROM pg_replication_slots;' postgres
  psql -c 'CREATE DATABASE test' postgres

  In what scenarios do you see 'on' working?

Because the code says so, "on" is an alias for "remote_flush" (which is not user-visible by the way):

src/include/access/xact.h:#define SYNCHRONOUS_COMMIT_ON SYNCHRONOUS_COMMIT_REMOTE_FLUSH

And if you do that it works fine (pg_receivewal --synchronous runs in the background and I created a dummy table):

=# SELECT application_name, sync_state, flush_lsn, replay_lsn FROM pg_stat_replication;
 application_name | sync_state | flush_lsn | replay_lsn
------------------+------------+-----------+------------
 pg_receivewal    | sync       | 0/15E1F88 | null
(1 row)

=# set synchronous_commit to on;
SET
=# insert into aa values (2);
INSERT 0 1

I forgot to use pg_receivewal -d with application_name instead of -h -p -U. Maybe we should have an explicit option for that, but that is a separate thread. Best regards, Jesper
Re: pg_receivewal documentation
Hi, On 7/18/19 9:27 PM, Michael Paquier wrote: The location of the warning is also harder to catch for the reader, so instead let's move it to the top where we have an extra description for --synchronous. I am finishing with the attached that I would be fine to commit and back-patch as needed. Does that sound fine? LGTM. Best regards, Jesper
Re: partition routing layering in nodeModifyTable.c
Hi, On 2019-07-19 17:52:20 +0900, Amit Langote wrote: > Attached are two patches. Awesome. > The first one (0001) deals with reducing the core executor's reliance > on es_result_relation_info to access the currently active result > relation, in favor of receiving it from the caller as a function > argument. So no piece of core code relies on it being correctly set > anymore. It still needs to be set correctly for the third-party code > such as FDWs. I'm inclined to just remove it. There's not much code out there relying on it, as far as I can tell. Most FDWs don't support the direct modify API, and that's afaict the case where one needs to use es_result_relation_info? In fact, I searched through all FDWs listed on https://wiki.postgresql.org/wiki/Foreign_data_wrappers that are on github and in the first few categories (up to and including "file wrappers"), and there was only one reference to es_result_relation_info, and that just in comments in a test: https://github.com/pgspider/griddb_fdw/search?utf8=%E2%9C%93=es_result_relation_info= which I think was just copied from our source code. IOW, we should just change the direct modify calls to get the relevant ResultRelationInfo or something in that vein (perhaps just the relevant RT index?). pglogical also references it, but just because it creates its own EState afaict. > @@ -334,32 +335,50 @@ ExecComputeStoredGenerated(EState *estate, > TupleTableSlot *slot) > * ExecInsert > * > * For INSERT, we have to insert the tuple into the target relation > - * and insert appropriate tuples into the index relations. > + * (or partition thereof) and insert appropriate tuples into the index > + * relations. > * > * Returns RETURNING result if any, otherwise NULL. > + * > + * This may change the currently active tuple conversion map in > + * mtstate->mt_transition_capture, so the callers must take care to > + * save the previous value to avoid losing track of it.
> * > */ > static TupleTableSlot * > ExecInsert(ModifyTableState *mtstate, > +ResultRelInfo *resultRelInfo, > TupleTableSlot *slot, > TupleTableSlot *planSlot, > EState *estate, > bool canSetTag) > { > - ResultRelInfo *resultRelInfo; > RelationresultRelationDesc; > List *recheckIndexes = NIL; > TupleTableSlot *result = NULL; > TransitionCaptureState *ar_insert_trig_tcs; > ModifyTable *node = (ModifyTable *) mtstate->ps.plan; > OnConflictAction onconflict = node->onConflictAction; > + PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing; > + > + /* > + * If the input result relation is a partitioned table, find the leaf > + * partition to insert the tuple into. > + */ > + if (proute) > + { > + ResultRelInfo *partRelInfo; > + > + slot = ExecPrepareTupleRouting(mtstate, estate, proute, > + > resultRelInfo, slot, > + > ); > + resultRelInfo = partRelInfo; > + /* Result relation has changed, so update EState reference too. > */ > + estate->es_result_relation_info = resultRelInfo; > + } I think by removing es_result_relation entirely, this would look cleaner. > @@ -1271,18 +1274,18 @@ lreplace:; > > mtstate->mt_root_tuple_slot); > > /* > - * Prepare for tuple routing, making it look like we're > inserting > - * into the root. > + * ExecInsert() may scribble on > mtstate->mt_transition_capture, > + * so save the currently active map. >*/ > + if (mtstate->mt_transition_capture) > + saved_tcs_map = > mtstate->mt_transition_capture->tcs_map; Wonder if we could remove the need for this somehow, it's still pretty darn ugly. Thomas, perhaps you have some insights? To me the need to modify these ModifyTable wide state on a per-subplan and even per-partition basis indicates that the datastructures are in the wrong place. > @@ -2212,23 +2207,17 @@ ExecModifyTable(PlanState *pstate) > switch (operation) > { > case CMD_INSERT: > - /* Prepare for tuple routing if needed. */ > - if (proute) > - slot = ExecPrepareTupleRouting(node, > estate, proute, > - >
Re: [sqlsmith] Crash in mcv_get_match_bitmap
On Thu, Jul 18, 2019 at 11:16:08AM -0400, Tom Lane wrote: Tomas Vondra writes: I've pushed the fixes listed in the previous message, with the exception of the collation part, because I had some doubts about that. Sorry for being slow on this. Now, for the collation part - after some more thought and looking at code I think the fix I shared before is OK. It uses per-column collations consistently both when building the stats and estimating clauses, which makes it consistent. I was not quite sure if using Var->varcollid is the same thing as pg_attribute.attcollation, but it seems it is - at least for Vars pointing to regular columns (which for extended stats should always be the case). I think you are right, but it could use some comments in the code. The important point here is that if we coerce a Var's collation to something else, that will be represented as a separate CollateExpr (which will be a RelabelType by the time it gets here, I believe). We don't just replace varcollid, the way eval_const_expressions will do to a Const. OK, thanks. I've added a comment about that into mcv_get_match_bitmap (not all the details about RelabelType, because that gets stripped while examining the opexpr, but generally about the collations). While I'm looking at the code --- I don't find this at all convincing: /* * We don't care about isgt in equality, because * it does not matter whether it's (var op const) * or (const op var). */ match = DatumGetBool(FunctionCall2Coll(&opproc, DEFAULT_COLLATION_OID, cst->constvalue, item->values[idx])); It *will* matter if the operator is cross-type. I think there is no good reason to have different code paths for the equality and inequality cases --- just look at isgt and swap or don't swap the arguments. BTW, "isgt" seems like a completely misleading name for that variable. AFAICS, what that is actually telling is whether the var is on the left or right side of the OpExpr.
Everywhere else in the planner with a need for that uses "bool varonleft", and I think you should do likewise here (though note that that isgt, as coded, is more like "varonright"). Yes, you're right in both cases. I've fixed the first issue with equality by simply using the same code as for all operators (which means we don't need to check the oprrest at all in this code, making it simpler). And I've reworked the examine_opclause_expression() to use varonleft naming - I agree it's a much better name. This was one of the remnants of the code I initially copied from somewhere in selfuncs.c and massaged it until it did what I needed. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From 14dc49ad7eefd6e67b751e74ce7253cdc85ca5ad Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Fri, 19 Jul 2019 16:28:28 +0200 Subject: [PATCH 1/2] Rework examine_opclause_expression to use varonleft The examine_opclause_expression function needs to return information on which side of the operator we found the Var, but the variable was called "isgt" which is rather misleading (it assumes the operator is either less-than or greater-than, but it may be equality or something else). Other places in the planner use a variable called "varonleft" for this purpose, so just adopt the same convention here. The code also assumed we don't care about this flag for equality, as (Var = Const) and (Const = Var) should be the same thing. But that does not work for cross-type operators, in which case we need to pass the parameters to the procedure in the right order. So just use the same code for all types of expressions. This means we don't need to care about the selectivity estimation function anymore, at least not in this code. We should only get the supported cases here (thanks to statext_is_compatible_clause). 
Discussion: https://postgr.es/m/8736jdhbhc.fsf%40ansel.ydns.eu Backpatch-to: 12 --- src/backend/statistics/extended_stats.c | 16 ++--- src/backend/statistics/mcv.c | 62 +-- .../statistics/extended_stats_internal.h | 2 +- 3 files changed, 26 insertions(+), 54 deletions(-) diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c index d2346cbe02..23c74f7d90 100644 --- a/src/backend/statistics/extended_stats.c +++ b/src/backend/statistics/extended_stats.c @@ -1196,15 +1196,15 @@ statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid, * returns true, otherwise returns false. * * Optionally returns pointers to the extracted Var/Const nodes, when passed - *
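The argument-ordering point from the discussion above can be illustrated with a small standalone sketch (a toy, not the actual mcv.c code; `eval_clause` and `op_lt` are hypothetical names): since operators need not be commutative, and a cross-type operator additionally expects a specific type on each side, the evaluation must order the call arguments by which side of the OpExpr the Var was found on, with no special "order doesn't matter" path for equality.

```c
#include <stdbool.h>

/* Toy stand-in for an operator procedure. "<" is not commutative, so
 * the answer depends on which argument goes on which side; a
 * cross-type operator would also require specific types per side. */
static bool op_lt(int lhs, int rhs)
{
    return lhs < rhs;
}

/* Sketch of the varonleft convention: pass the Var and Const to the
 * operator in the order they appeared in the original clause. */
static bool eval_clause(bool varonleft, int var_value, int const_value)
{
    if (varonleft)
        return op_lt(var_value, const_value);   /* (Var op Const) */
    else
        return op_lt(const_value, var_value);   /* (Const op Var) */
}
```

With var = 10 and const = 5, (Var < Const) is false while (Const < Var) is true, which is exactly why the evaluation cannot ignore the flag.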
Re: sepgsql seems rather thoroughly broken on Fedora 30
Mike Palmiotto writes: > The sepgsql_regtest_user_t domain should be allowed to read any file > labeled "passwd_file_t". We can check that with the `sesearch` tool, > provided by the "setools-console" package on F30: > % sudo sesearch -A -s sepgsql_regtest_user_t -t passwd_file_t > allow domain file_type:blk_file map; [ domain_can_mmap_files ]:True > allow domain file_type:chr_file map; [ domain_can_mmap_files ]:True > allow domain file_type:file map; [ domain_can_mmap_files ]:True > allow nsswitch_domain passwd_file_t:file { getattr ioctl lock map open read }; I got around to trying this, and lookee here: $ sudo sesearch -A -s sepgsql_regtest_user_t -t passwd_file_t allow domain file_type:blk_file map; [ domain_can_mmap_files ]:True allow domain file_type:chr_file map; [ domain_can_mmap_files ]:True allow domain file_type:file map; [ domain_can_mmap_files ]:True allow domain file_type:lnk_file map; [ domain_can_mmap_files ]:True Nothing about passwd_file_t. So *something* is different about the way the policy is being expanded. regards, tom lane
Re: sepgsql seems rather thoroughly broken on Fedora 30
Mike Palmiotto writes: > On Thu, Jul 18, 2019 at 11:06 PM Tom Lane wrote: >>> $ runcon -t sepgsql_regtest_user_t psql --help >>> psql: fatal: could not look up effective user ID 1000: user does not exist > You can rule out SELinux for this piece by running `sudo setenforce > 0`. If the `runcon ... psql` command works in Permissive we should > look at your audit log to determine what is being denied. audit2allow > will provide a summary of the SELinux denials and is generally a good > starting point: > # grep denied /var/log/audit/audit.log | audit2allow It's definitely SELinux. The grep finds entries like type=AVC msg=audit(1563547268.044:465): avc: denied { read } for pid=10940 comm="psql" name="passwd" dev="sda6" ino=4721184 scontext=unconfined_u:unconfined_r:sepgsql_regtest_user_t:s0-s0:c0.c1023 tcontext=system_u:object_r:passwd_file_t:s0 tclass=file permissive=0 which audit2allow turns into #= sepgsql_regtest_user_t == allow sepgsql_regtest_user_t passwd_file_t:file read; So somehow, my system's interpretation of the test policy file does not include that permission. I tried: * restorecon / ... no effect, which is unsurprising given that /etc/passwd was OK already. * removing container-selinux ... this made the compile warnings go away, as you predicted, but no change in the test results. > FWIW, this appears to be working on my recently-installed F30 VM: > % runcon -t sepgsql_regtest_user_t psql --help &> /dev/null > % echo $? > 0 Well, that's just weird. I've not done anything to the SELinux state on this installation either, so what's different? I am wondering whether maybe the different behavior is a result of some RPM that's present on my system but not yours, or vice versa. 
As a first stab at that, I see: $ rpm -qa | grep selinux | sort cockpit-selinux-198-1.fc30.noarch container-selinux-2.107-1.git453b816.fc30.noarch flatpak-selinux-1.4.2-2.fc30.x86_64 libselinux-2.9-1.fc30.x86_64 libselinux-devel-2.9-1.fc30.x86_64 libselinux-utils-2.9-1.fc30.x86_64 python3-libselinux-2.9-1.fc30.x86_64 rpm-plugin-selinux-4.14.2.1-4.fc30.1.x86_64 selinux-policy-3.14.3-40.fc30.noarch selinux-policy-devel-3.14.3-40.fc30.noarch selinux-policy-targeted-3.14.3-40.fc30.noarch tpm2-abrmd-selinux-2.0.0-4.fc30.noarch regards, tom lane
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Tomas Vondra wrote: > On Fri, Jul 19, 2019 at 12:04:36PM +0200, Antonin Houska wrote: > >Tomas Vondra wrote: > > > >> On Mon, Jul 15, 2019 at 03:42:39PM -0400, Bruce Momjian wrote: > >> >On Sat, Jul 13, 2019 at 11:58:02PM +0200, Tomas Vondra wrote: > >> >> One extra thing we should consider is authenticated encryption. We can't > >> >> just encrypt the pages (no matter which AES mode is used - XTS/CBC/...), > >> >> as that does not provide integrity protection (i.e. can't detect when > >> >> the ciphertext was corrupted due to disk failure or intentionally). And > >> >> we can't quite rely on checksums, because that checksums the plaintext > >> >> and is stored encrypted. > >> > > >> >Uh, if someone modifies a few bytes of the page, we will decrypt it, but > >> >the checksum (per-page or WAL) will not match our decrypted output. How > >> >would they make it match the checksum without already knowing the key. > >> >I read [1] but could not see that explained. > >> > > >> > >> Our checksum is only 16 bits, so perhaps one way would be to just > >> generate 64k of randomly modified pages and hope one of them happens to > >> hit the right checksum value. Not sure how practical such attack is, but > >> it does require just filesystem access. > > > >I don't think you can easily generate 64k of different checksums this way. If > >the data is random, I suppose that each set of 2^(128 - 16) blocks will > >contain the same checksum after decryption. Thus even if you generate 64k of > >different ciphertext blocks that contain the checksum, some (many?) > >checksums > >will be duplicate. Unfortunately the math to describe this problem does not > >seem to be trivial. > > > > I'm not sure what's your point, or why you care about the 128 bits, but I > don't think the math is very complicated (and it's exactly the same with > or without encryption). The probability of checksum collision for randomly > modified page is 1/64k, so p=~0.00153%.
So probability of *not* getting a > collision is (1-p)=99.9985%. So with N pages, the probability of no > collisions is pow((1-p),N) which behaves like this: > > N pow((1-p),N) > > 10000 85% > 20000 73% > 30000 63% > 46000 49% > 200000 4% > > So with 1.6GB relation you have about 96% chance of a checksum collision. I thought your attack proposal was to find valid (encrypted) checksum for a given encrypted page. Instead it seems that you were only trying to say that it's not too hard to generate page with a valid checksum in general. Thus the attacker can try to modify the ciphertext again and again in a way that is not quite random, but the chance to pass the checksum verification may still be relatively high. > >Also note that if you try to generate ciphertext, decryption of which will > >result in particular value of checksum, you can hardly control the other 14 > >bytes of the block, which in turn are used to verify the checksum. > > > > Now, I'm not saying this attack is particularly practical - it would > generate a fair number of checkpoint failures before getting the first > collision. So it'd trigger quite a few alerts, I guess. You probably mean "checksum failures". I agree. And even if the checksum passed the verification, page or tuple headers would probably be incorrect and cause other errors. > >> FWIW our CRC algorithm is not quite HMAC, because it's neither keyed nor > >> a cryptographic hash algorithm. Now, maybe we don't want authenticated > >> encryption (e.g. XTS is not authenticated, unlike GCM/CCM). > > > >I'm also not sure if we should try to guarantee data authenticity / > >integrity. As someone already mentioned elsewhere, page MAC does not help if > >the whole page is replaced. (An extreme case is that old filesystem snapshot > >containing the whole data directory is restored, although that will probably > >make the database crash soon.)
> > > >We can guarantee integrity and authenticity of backup, but that's a separate > >feature: someone may need this although it's o.k. for him to run the cluster > >unencrypted. > > > > Yes, I do agree with that. I think attempts to guarantee data authenticity > and/or integrity at the page level is mostly futile (replay attacks are an > example of why). IMHO we should consider that to be outside the threat > model TDE is expected to address. When writing my previous email I forgot that, besides improving data integrity, the authenticated encryption also tries to detect an attempt to get encryption key via "chosen-ciphertext attack (CCA)". The fact that pages are encrypted / decrypted independent from each other should not be a problem here. We just need to consider if this kind of CCA is the threat we try to protect against. > IMO a better way to handle authenticity/integrity would be based on WAL, > which is essentially an authoritative log of operations. We should be able > to parse WAL, deduce expected state
Re: [HACKERS] advanced partition matching algorithm for partition-wise join
On Thu, Jul 18, 2019 at 2:55 AM Etsuro Fujita wrote: > I.e., partition_bounds_merge() is performed for each pair of input > partitioned relations for a join relation in try_partitionwise_join(). > Since partition_bounds_merge() would need a lot of CPU cycles, I don't > think this is acceptable; ISTM that some redesign is needed to avoid > this. I'm wondering that once we successfully merged partition bounds > from a pair of input partitioned relations for the join relation, by > using the merged partition bounds, we could get the lists of matching > to-be-joined partitions for subsequent pairs of input partitioned > relations for the join relation in a more efficient way than by > performing partition_bounds_merge() as proposed in the patch. I don't know whether partition_bounds_merge() is well-implemented; I haven't looked. But in general I don't see an alternative to doing some kind of merging on each pair of input relations. That's just how planning works, and I don't see why it should need to be prohibitively expensive. I might be missing something, though; do you have an idea? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [PATCH] minor bugfix for pg_basebackup (9.6 ~ )
On 7/19/19 7:45 PM, Sergei Kornilov wrote: Hi Oh. A replication slot name can currently contain only a-z0-9_ characters, so we cannot actually write such a recovery.conf; pg_basebackup will stop before that. But performing escape_quotes on the string and not using the result is an error anyway. Good point, it does actually fail with an error if an impossible slot name is provided, so the escaping is superfluous anyway. I'll take another look at it later as it's not exactly critical, just stuck out when I was passing through the code. Regards Ian Barwick -- Ian Barwick https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
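The character restriction discussed above is the reason the escaping can never matter in practice. A standalone sketch of that rule (a hypothetical helper for illustration, not PostgreSQL's actual slot-name validation code):

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch: accept only the characters a replication slot name may
 * contain per the discussion above (lower-case ASCII letters, digits,
 * underscore), and reject empty names. */
static bool slot_name_is_valid(const char *name)
{
    if (name == NULL || name[0] == '\0')
        return false;

    for (const char *p = name; *p; p++)
    {
        if (!((*p >= 'a' && *p <= 'z') ||
              (*p >= '0' && *p <= '9') ||
              *p == '_'))
            return false;
    }
    return true;
}
```

Since no accepted name can contain a quote character, any quoting/escaping applied to a validated name is indeed a no-op.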
Re: sepgsql seems rather thoroughly broken on Fedora 30
On Thu, Jul 18, 2019 at 11:06 PM Tom Lane wrote: > > Mike Palmiotto writes: > > On Wed, Jul 17, 2019 at 12:32 PM Tom Lane wrote: > >> $ runcon -t sepgsql_regtest_user_t psql --help > >> psql: fatal: could not look up effective user ID 1000: user does not exist You can rule out SELinux for this piece by running `sudo setenforce 0`. If the `runcon ... psql` command works in Permissive we should look at your audit log to determine what is being denied. audit2allow will provide a summary of the SELinux denials and is generally a good starting point: # grep denied /var/log/audit/audit.log | audit2allow If SELinux is indeed the issue here and you want to avoid doing all of this detective work, it may be a good idea to just run a system-wide restorecon (assuming you didn't already do that before) to make sure your labels are in a decent state. FWIW, this appears to be working on my recently-installed F30 VM: % runcon -t sepgsql_regtest_user_t psql --help &> /dev/null % echo $? 0 Hopefully a system-wide `restorecon` just magically fixes this for you. Otherwise, we can start digging into denials. -- Mike Palmiotto Software Engineer Crunchy Data Solutions https://crunchydata.com
Re: POC: Cleaning up orphaned files using undo logs
On Fri, Jul 19, 2019 at 7:54 AM Amit Khandekar wrote: > + * We just want to mask the cid in the undo record header. So > + * only if the partial record in the current page include the undo > + * record header then we need to mask the cid bytes in this page. > + * Otherwise, directly jump to the next record. > Here, I think you mean : "So only if the partial record in the current > page includes the *cid* bytes", rather than "includes the undo record > header" > May be we can say : > We just want to mask the cid. So do the partial record masking only if > the current page includes the cid bytes from the partial record > header. Hmm, but why is it correct to mask the CID at all? Shouldn't that match? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: POC: Cleaning up orphaned files using undo logs
On Fri, Jul 19, 2019 at 12:10 AM Amit Kapila wrote: > We are doing exactly what you have written in the last line of the > next paragraph "stop the transaction from writing undo when the hash > table is already too full.". So we will > never face the problems related to repeated crash recovery. The > definition of too full is that we stop allowing the new transactions > that can write undo when we have the hash table already have entries > equivalent to (UndoRollbackHashTableSize() - MaxBackends). Does this > make sense? Oops, I was looking in the wrong place. Yes, that makes sense, but: 1. It looks like the test you added to PrepareUndoInsert has no locking, and I don't see how that can be right. 2. It seems like this would result in testing for each new undo insertion that gets prepared, whereas surely we would want to only test when first attaching to an undo log. If you've already attached to the undo log, there's no reason not to continue inserting into it, because doing so doesn't increase the number of transactions (or transaction-persistence level combinations) that need undo. 3. I don't think the test itself is correct. It can fire even when there's no problem. It is correct (or would be if it said 2 * MaxBackends) if every other backend in the system is already attached to an undo log (or two). But if they are not, it will block transactions from being started for no reason. For instance, suppose max_connections = 100 and there are another 100 slots for background rollbacks. Now suppose that the system crashes when 101 slots are in use -- 100 pushed into the background plus 1 that was aborted by the crash. On recovery, this test will refuse to let any new transaction start. Actually it is OK for up to 99 transactions to write undo, just not 100. Or, given that you have a slot per persistence level, it's OK to have up to 199 transaction-persistence-level combinations in flight, just not 200.
And that is the difference between the system being unusable after the crash until a rollback succeeds and being almost fully usable immediately. > > I don't follow this. If you have a hash table where the key is XID, > > there is no need to delete and reinsert anything just because you > > discover that the XID has not only permanent undo but also unlogged > > undo, or something of that sort. > > The size of the total undo to be processed will be changed if we find > anyone (permanent or unlogged) later. Based on the size, the entry > location should be changed in size queue. OK, true. But that's not a significant cost, either in runtime or code complexity. I still don't really see any good reason for the hash table key to be anything other than XID, or really, FXID. I mean, sure, the data structure manipulations are a little different, but not in any way that really matters. And it seems to me that there are some benefits, the biggest of which is that the system becomes easier for users to understand. We can simply say that there is a limit on the number of transactions that either (1) are in progress and have written undo or (2) have aborted and not all of the undo has been processed. If the key is XID + persistence level, then it's a limit on the number of transaction-and-persistence-level combinations, which I feel is not so easy to understand. In most but not all scenarios, it means that the limit is about double what you think the limit is, and as the mistake in the current version of the patch makes clear, even the people writing the code can forget about that factor of two. It affects a few other things, too. If you made the key XID and fixed problems (2) and (3) from above, then you'd have a situation where a transaction could fail at only one time: either it bombs the first time it tries to write undo, or it works.
As it is, there is a second failure scenario: you do a bunch of work on permanent (or unlogged) tables and then try to write to an unlogged (or permanent) table and it fails because there are not enough slots. Is that the end of the world? No, certainly not. The situation should be rare. But if we have to fail transactions, it's best to fail them before they've started doing any work, because that minimizes the amount of work we waste by having to retry. Of course, a transaction that fails midway through when it tries to write at a second persistence level is also consuming an undo slot in a situation where we're short of undo slots. Another thing which Andres pointed out to me off-list is that we might want to have a function that takes a transaction ID as an argument and tells you the status of that transaction from the point of view of the undo machinery: does it have any undo, and if so how much? As you have it now, such a function would require searching the whole hash table, because the user won't be able to provide an UndoRecPtr to go with the XID. If the hash table key were in fact <XID, persistence level> rather than <XID>, then you could do it with two lookups; if it were XID alone, you
Re: Compiler warnings with MinGW
On 7/19/19 1:08 AM, Michael Paquier wrote: > Hi all, > > Just browsing through the logs of the buildfarm, I have noticed that > some buildfarm animals complain with warnings (jacana uses MinGW): > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=jacana=2019-07-19%2001%3A45%3A28=make > > There are two of them: > c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/backend/port/win32/mingwcompat.c:60:1: > warning: 'RegisterWaitForSingleObject' redeclared without dllimport > attribute: previous dllimport ignored [-Wattributes] > > c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/bin/pg_basebackup/pg_basebackup.c:1448:8: > warning: variable 'filemode' set but not used > [-Wunused-but-set-variable] > Jul 18 21:59:49 int filemode; > > The first one has been discussed already some time ago and is a cause > of 811be893, but nothing got actually fixed (protagonists in CC): > https://www.postgresql.org/message-id/cabueveyezfuvaymunop3nyrvvrh2up2tstk8sxvapderf8p...@mail.gmail.com To answer Magnus' question in that thread, the Mingw headers on jacana declare the function with WINBASEAPI which in turn is defined as DECLSPEC_IMPORT, as long as _KERNEL32_ isn't defined, and we don't do that (and I don't think anything else does either). So the fix Peter proposed looks like it should be correct. > > The second one is rather obvious to fix, because we don't care about > the file mode on Windows, so the attached should do the work. I am > actually surprised that the Visual Studio compilers don't complain > about that, but let's fix it. > > Thoughts? +1. cheers andrew -- Andrew Dunstan  https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Compile from source using latest Microsoft Windows SDK
On 7/19/19 5:51 AM, Michael Paquier wrote: > >> I'm also very curious on how hamerkop and bowerbird build postgres with >> VS2017. Looks like hamerkop and bowerbird both exist before VS2017 >> and maybe they get SDK v8.1 from previous VS installations. I will >> contact admin of hamerkop and bowerbird and see if that's the case. >> As of now I can still encounter the same issue with fresh installed >> Windows Server 2016 and VS2017, on both azure and google cloud. So >> better to patch the build system anyway. > I guess so but I cannot confirm myself. I am adding Andrew and Hari > in CC to double-check if they have seen this problem or not. Hari has > also worked on porting VS 2017 and 2019 in the MSVC scripts in the > tree. My tests of the VS2017 stuff used this install mechanism on a fresh Windows instance: choco install -y visualstudio2017-workload-vctools --package-parameters "--includeOptional" This installed Windows Kits 8.1 and 10, among many other things. cheers andrew -- Andrew Dunstan  https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Further hacking on SPITupleTable struct
> On 17 Jul 2019, at 22:35, Tom Lane wrote: > > Thinking more about the public/private field distinction we just > specified --- it's always annoyed me that SPITupleTable doesn't > provide a number-of-valid-rows field, so that callers have to > look at the entirely separate SPI_processed variable in order > to make sense of SPI_tuptable. Sorry for being slow to return to this, I see you have already committed it. FWIW, I do agree that this makes a lot more sense. Retroactively +1’ing it. Regarding the core code I agree that no callers directly benefit without some refactoring, but contrib/xml2/xpath.c has one case which seems applicable as per the attached. Now, since contrib/xml2 has been deprecated for a long time it’s probably not worth bothering, but it was the one case I found so I figured I’d record it in this thread. cheers ./daniel xpath_numvals.diff Description: Binary data
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Fri, Jul 19, 2019 at 01:32:01PM +0200, Antonin Houska wrote: Tomas Vondra wrote: On Mon, Jul 15, 2019 at 06:11:41PM -0400, Bruce Momjian wrote: >On Mon, Jul 15, 2019 at 11:05:30PM +0200, Tomas Vondra wrote: >> On Mon, Jul 15, 2019 at 03:42:39PM -0400, Bruce Momjian wrote: >> > On Sat, Jul 13, 2019 at 11:58:02PM +0200, Tomas Vondra wrote: >> > > One extra thing we should consider is authenticated encryption. We can't >> > > just encrypt the pages (no matter which AES mode is used - XTS/CBC/...), >> > > as that does not provide integrity protection (i.e. can't detect when >> > > the ciphertext was corrupted due to disk failure or intentionally). And >> > > we can't quite rely on checksums, because that checksums the plaintext >> > > and is stored encrypted. >> > >> > Uh, if someone modifies a few bytes of the page, we will decrypt it, but >> > the checksum (per-page or WAL) will not match our decrypted output. How >> > would they make it match the checksum without already knowing the key. >> > I read [1] but could not see that explained. >> > >> >> Our checksum is only 16 bits, so perhaps one way would be to just >> generate 64k of randomly modified pages and hope one of them happens to >> hit the right checksum value. Not sure how practical such attack is, but >> it does require just filesystem access. > >Yes, that would work, and opens the question of whether our checksum is >big enough for this, and if it is not, we need to find space for it, >probably with a custom encrypted page format. :-( And that makes >adding encryption offline almost impossible because you potentially have >to move tuples around. Yuck! > Right. We've been working on allowing to disable checksum online, and it would be useful to allow something like that for encryption too I guess. And without some sort of page-level flag that won't be possible, which would be rather annoying. 
Not sure it needs to be in the page itself, though - that's pretty much why I proposed to store metadata (IV, key ID, ...) for encryption in a new fork. That would be a bit more flexible than storing it in the page itself (e.g. different encryption schemes might easily store different amounts of metadata). Maybe a new fork is way too complex solution, not sure. One problem of this new fork would be that contents of its buffer (the MAC values) is not determined until the corresponding buffers of the MAIN fork get encrypted. However encryption is performed by the storage layer (md.c), which is not expected to lock other buffers (such as those of the "MAC fork"), read their pages from disk or insert their WAL records. This is different from the FSM or VM forks whose buffers are only updated above the storage layer. Yes, that seems like a valid issue :-( -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Fri, Jul 19, 2019 at 12:04:36PM +0200, Antonin Houska wrote: Tomas Vondra wrote: On Mon, Jul 15, 2019 at 03:42:39PM -0400, Bruce Momjian wrote: >On Sat, Jul 13, 2019 at 11:58:02PM +0200, Tomas Vondra wrote: >> One extra thing we should consider is authenticated encryption. We can't >> just encrypt the pages (no matter which AES mode is used - XTS/CBC/...), >> as that does not provide integrity protection (i.e. can't detect when >> the ciphertext was corrupted due to disk failure or intentionally). And >> we can't quite rely on checksums, because that checksums the plaintext >> and is stored encrypted. > >Uh, if someone modifies a few bytes of the page, we will decrypt it, but >the checksum (per-page or WAL) will not match our decrypted output. How >would they make it match the checksum without already knowing the key. >I read [1] but could not see that explained. > Our checksum is only 16 bits, so perhaps one way would be to just generate 64k of randomly modified pages and hope one of them happens to hit the right checksum value. Not sure how practical such attack is, but it does require just filesystem access. I don't think you can easily generate 64k of different checksums this way. If the data is random, I suppose that each set of 2^(128 - 16) blocks will contain the same checksum after decryption. Thus even if you generate 64k of different ciphertext blocks that contain the checksum, some (many?) checksums will be duplicate. Unfortunately the math to describe this problem does not seem to be trivial. I'm not sure what's your point, or why you care about the 128 bits, but I don't think the math is very complicated (and it's exactly the same with or without encryption). The probability of checksum collision for randomly modified page is 1/64k, so p=~0.00153%.
So with N pages, the probability of no collisions is pow((1-p),N) which behaves like this: N pow((1-p),N) 10000 85% 20000 73% 30000 63% 46000 49% 200000 4% So with 1.6GB relation you have about 96% chance of a checksum collision. Also note that if you try to generate ciphertext, decryption of which will result in particular value of checksum, you can hardly control the other 14 bytes of the block, which in turn are used to verify the checksum. But we don't care about the 14 bytes. In fact, we want the page header (which includes both the checksums and the other 14B in the block) to remain unchanged - the attack only needs to modify the remaining parts of the 8kB page in a way to generate the same checksum on the plaintext. And that's not that hard to do, IMHO, because the header is stored at the beginning of the page. So we can just randomly modify the last AES block (last 16B on the page) to minimize the corruption to the last block. Now, I'm not saying this attack is particularly practical - it would generate a fair number of checkpoint failures before getting the first collision. So it'd trigger quite a few alerts, I guess. FWIW our CRC algorithm is not quite HMAC, because it's neither keyed nor a cryptographic hash algorithm. Now, maybe we don't want authenticated encryption (e.g. XTS is not authenticated, unlike GCM/CCM). I'm also not sure if we should try to guarantee data authenticity / integrity. As someone already mentioned elsewhere, page MAC does not help if the whole page is replaced. (An extreme case is that old filesystem snapshot containing the whole data directory is restored, although that will probably make the database crash soon.) We can guarantee integrity and authenticity of backup, but that's a separate feature: someone may need this although it's o.k. for him to run the cluster unencrypted. Yes, I do agree with that. I think attempts to guarantee data authenticity and/or integrity at the page level is mostly futile (replay attacks are an example of why).
IMHO we should consider that to be outside the threat model TDE is expected to address. IMO a better way to handle authenticity/integrity would be based on WAL, which is essentially an authoritative log of operations. We should be able to parse WAL, deduce expected state (min LSN, checksums) for each page, and validate the cluster state based on that. I still think having to decrypt the page in order to verify a checksum (because the header is part of the encrypted page, and is computed from the plaintext version) is not great. regards -- Antonin Houska Web: https://www.cybertec-postgresql.com -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
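The collision arithmetic in the message above is easy to reproduce. A quick sketch (Python) of the no-collision probability pow((1-p), N) discussed upthread, for a few page counts up to the roughly 200,000 8kB pages of a 1.6GB relation:

```python
# Probability that N randomly modified pages all miss a fixed 16-bit
# checksum value: pow(1 - p, N) with p = 1/65536.
p = 1.0 / 65536

def no_collision(n):
    return (1.0 - p) ** n

for n in (10000, 20000, 30000, 46000, 200000):
    print(n, no_collision(n))
```

At N = 200,000 the no-collision probability drops below 5%, i.e. close to the ~96% collision chance quoted for a 1.6GB relation.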
Re: POC: Cleaning up orphaned files using undo logs
On Thu, 9 May 2019 at 12:04, Dilip Kumar wrote: > Patches can be applied on top of undo branch [1] commit: > (cb777466d008e656f03771cf16ec7ef9d6f2778b) > > [1] https://github.com/EnterpriseDB/zheap/tree/undo Below are some review points for 0009-undo-page-consistency-checker.patch :

+	/* Calculate the size of the partial record. */
+	partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) +
+		phdr->tuple_len + phdr->payload_len -
+		phdr->record_offset;

There is already an UndoPagePartialRecSize() function which calculates the size of a partial record, which seems to do the same as above. If it is the same, you can omit the above code, and instead down below where you increment next_record, you can do "next_record += UndoPagePartialRecSize()". Also, I see an extra sizeof(uint16) added in UndoPagePartialRecSize(). Not sure which one is correct, and which one wrong, unless I am wrong in assuming that the above calculation and the function definition do the same thing. --

+ * We just want to mask the cid in the undo record header. So
+ * only if the partial record in the current page include the undo
+ * record header then we need to mask the cid bytes in this page.
+ * Otherwise, directly jump to the next record.

Here, I think you mean: "So only if the partial record in the current page includes the *cid* bytes", rather than "includes the undo record header". Maybe we can say: "We just want to mask the cid. So do the partial record masking only if the current page includes the cid bytes from the partial record header."

+	if (phdr->record_offset < (cid_offset + sizeof(CommandId)))
+	{
+		char	   *cid_data;
+		Size		mask_size;
+
+		mask_size = Min(cid_offset - phdr->record_offset,
+						sizeof(CommandId));
+
+		cid_data = next_record + cid_offset - phdr->record_offset;
+		memset(cid_data, MASK_MARKER, mask_size);
+

Here, if record_offset lies *between* cid start and cid end, then cid_offset - phdr->record_offset will be negative, and so will be mask_size. Probably abs() should do the work.
Also, an Assert(cid_data + mask_size <= page_end) would be nice. I know the cid position of a partial record cannot go beyond the page boundary, but it's better to have this Assert to do a sanity check.

+ * Process the undo record of the page and mask their cid filed.

filed => field -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
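For what it's worth, the negative mask_size case flagged above is arguably a clamping problem rather than one abs() can fix: when record_offset falls inside the cid bytes, only the tail of the cid remains on this page. A hypothetical sketch of the intended arithmetic (Python, names assumed; not the patch's actual code), with the returned start relative to the partial record's data on the current page:

```python
CID_SIZE = 4  # assumed sizeof(CommandId)

def cid_mask_range(record_offset, cid_offset):
    """Return (start, length) of the cid bytes that fall on this page,
    or None when the cid was wholly emitted on the previous page.

    record_offset: bytes of the record already written to the prior page.
    cid_offset:    offset of the cid field within the whole record."""
    cid_end = cid_offset + CID_SIZE
    if record_offset >= cid_end:
        return None                             # cid fully on the previous page
    start = max(cid_offset - record_offset, 0)  # clamp, never negative
    length = min(CID_SIZE, cid_end - record_offset)
    return start, length
```

With cid_offset = 10: a fresh record gives (10, 4), a page starting two bytes into the cid gives (0, 2), and a page starting past the cid gives None, whereas abs() would return a wrong size for the mid-cid case.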
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Tomas Vondra wrote: > On Mon, Jul 15, 2019 at 06:11:41PM -0400, Bruce Momjian wrote: > >On Mon, Jul 15, 2019 at 11:05:30PM +0200, Tomas Vondra wrote: > >> On Mon, Jul 15, 2019 at 03:42:39PM -0400, Bruce Momjian wrote: > >> > On Sat, Jul 13, 2019 at 11:58:02PM +0200, Tomas Vondra wrote: > >> > > One extra thing we should consider is authenticated encryption. We > >> > > can't > >> > > just encrypt the pages (no matter which AES mode is used - > >> > > XTS/CBC/...), > >> > > as that does not provide integrity protection (i.e. can't detect when > >> > > the ciphertext was corrupted due to disk failure or intentionally). And > >> > > we can't quite rely on checksums, because that checksums the plaintext > >> > > and is stored encrypted. > >> > > >> > Uh, if someone modifies a few bytes of the page, we will decrypt it, but > >> > the checksum (per-page or WAL) will not match our decrypted output. How > >> > would they make it match the checksum without already knowing the key. > >> > I read [1] but could not see that explained. > >> > > >> > >> Our checksum is only 16 bits, so perhaps one way would be to just > >> generate 64k of randomly modified pages and hope one of them happens to > >> hit the right checksum value. Not sure how practical such attack is, but > >> it does require just filesystem access. > > > >Yes, that would work, and opens the question of whether our checksum is > >big enough for this, and if it is not, we need to find space for it, > >probably with a custom encrypted page format. :-( And that makes > >adding encryption offline almost impossible because you potentially have > >to move tuples around. Yuck! > > > > Right. We've been working on allowing to disable checksum online, and it > would be useful to allow something like that for encryption too I guess. > And without some sort of page-level flag that won't be possible, which > would be rather annoying. 
> > Not sure it needs to be in the page itself, though - that's pretty much > why I proposed to store metadata (IV, key ID, ...) for encryption in a > new fork. That would be a bit more flexible than storing it in the page > itself (e.g. different encryption schemes might easily store different > amounts of metadata). > > Maybe a new fork is way too complex solution, not sure. One problem with this new fork would be that the contents of its buffers (the MAC values) are not determined until the corresponding buffers of the MAIN fork get encrypted. However, encryption is performed by the storage layer (md.c), which is not expected to lock other buffers (such as those of the "MAC fork"), read their pages from disk, or insert their WAL records. This is different from the FSM or VM forks, whose buffers are only updated above the storage layer. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Support for CALL statement in ecpg
Hi, Thanks for the review. Please find my comments in-line. On Fri, Jul 19, 2019 at 8:33 AM Kyotaro Horiguchi wrote: > > Hello. > > > +ECPG: CallStmtCALLfunc_application > > Even though it is the default behavior, but as a written rule > this needs the postfix "block". > Done. > +$$ = cat_str(2,mm_strdup("call"),$2); > > Let's have proper spacing. > > + * Copy input arguments to the result arguments list so that all the > + * host variables gets treated as INOUT params. > I've removed the above comments so this is no longer valid. > This fails for the following usage: > > -- define procedure > create procedure ptest2 (in x int, inout y int) language plpgsql as $$ > begin > y := y + x; > end; > $$; > > -- in .pgc > 14: a = 3; > 15: r = 5; > 16: EXEC SQL call ptest2(:a, :r); > 21: printf("ret = %d, %d\n", a, r); > > > This complains like this: > > > SQL error: too many arguments on line 16 > > ret = 8, 5; > > The result should be "3, 8". This is because the patch requests > two returned values but actually just one is returned. > > I'm not sure how to know that previously on ecpg. Might need to > let users append an INTO clause explicitly. > As the ecpg connector is not aware of the param types of the procedure that it is calling, it becomes the responsibility of end users to ensure that ecpg creates exactly as many out variables as there are fields in the tuple returned by the server, and for that, as you rightly said, they must use the INTO clause with the CALL statement in ecpg.
Considering this approach, now with the attached v2 patch, the CALL statement in ecpg application would be like this: EXEC SQL CALL(:hv1, :hv2) INTO :ret1, ret2; EXEC SQL CALL(:hv1, :hv2) INTO :ret1 :ind1, :ret2, :ind2; In case if INTO clause is not used with the CALL statement then the ecpg compiler would fail with a parse error: "INTO clause is required with CALL statement" -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com diff --git a/src/interfaces/ecpg/preproc/ecpg.addons b/src/interfaces/ecpg/preproc/ecpg.addons index cbffd50..932b7f9 100644 --- a/src/interfaces/ecpg/preproc/ecpg.addons +++ b/src/interfaces/ecpg/preproc/ecpg.addons @@ -555,3 +555,12 @@ ECPG: limit_clauseLIMITselect_limit_value','select_offset_value block } ECPG: SignedIconstIconst rule | civar { $$ = $1; } +ECPG: CallStmtCALLfunc_application block + { + mmerror(PARSE_ERROR, ET_ERROR, "INTO clause is required with CALL statement"); + $$ = EMPTY; + } + | CALL func_application ecpg_into + { + $$ = cat_str(2,mm_strdup("call"),$2); + } diff --git a/src/interfaces/ecpg/test/ecpg_schedule b/src/interfaces/ecpg/test/ecpg_schedule index e034c5a..7e6b5c7 100644 --- a/src/interfaces/ecpg/test/ecpg_schedule +++ b/src/interfaces/ecpg/test/ecpg_schedule @@ -33,6 +33,7 @@ test: preproc/whenever_do_continue test: sql/array test: sql/binary test: sql/bytea +test: sql/call test: sql/code100 test: sql/copystdout test: sql/createtableas diff --git a/src/interfaces/ecpg/test/expected/sql-call.c b/src/interfaces/ecpg/test/expected/sql-call.c new file mode 100644 index 000..813b3b7 --- /dev/null +++ b/src/interfaces/ecpg/test/expected/sql-call.c @@ -0,0 +1,264 @@ +/* Processed by ecpg (regression mode) */ +/* These include files are added by the preprocessor */ +#include +#include +#include +/* End of automatic include section */ +#define ECPGdebug(X,Y) ECPGdebug((X)+100,(Y)) + +#line 1 "call.pgc" +#include +#include + + +#line 1 "sqlca.h" +#ifndef POSTGRES_SQLCA_H +#define POSTGRES_SQLCA_H 
+ +#ifndef PGDLLIMPORT +#if defined(WIN32) || defined(__CYGWIN__) +#define PGDLLIMPORT __declspec (dllimport) +#else +#define PGDLLIMPORT +#endif /* __CYGWIN__ */ +#endif /* PGDLLIMPORT */ + +#define SQLERRMC_LEN 150 + +#ifdef __cplusplus +extern "C" +{ +#endif + +struct sqlca_t +{ + char sqlcaid[8]; + long sqlabc; + long sqlcode; + struct + { + int sqlerrml; + char sqlerrmc[SQLERRMC_LEN]; + } sqlerrm; + char sqlerrp[8]; + long sqlerrd[6]; + /* Element 0: empty */ + /* 1: OID of processed tuple if applicable */ + /* 2: number of rows processed*/ + /* after an INSERT, UPDATE or*/ + /* DELETE statement */ + /* 3: empty */ + /* 4: empty */ + /* 5: empty */ + char sqlwarn[8]; + /* Element 0: set to 'W' if at least one other is 'W' */ + /* 1: if 'W' at least one character string */ + /* value was truncated when it was */ + /* stored into a host variable. */ + + /* + * 2: if 'W' a (hopefully) non-fatal notice occurred + */ /* 3: empty */ + /* 4: empty */ + /* 5: empty */ + /* 6: empty */ + /* 7: empty */ + + char sqlstate[5]; +}; + +struct sqlca_t *ECPGget_sqlca(void); + +#ifndef POSTGRES_ECPG_INTERNAL +#define sqlca (*ECPGget_sqlca()) +#endif + +#ifdef __cplusplus +} +#endif + +#endif + +#line 4 "call.pgc" + + +#line 1 "regression.h" + + + + + + +#line 5 "call.pgc" + + +int +main(void) +{ + /* exec sql begin declare section
Re: [HACKERS] advanced partition matching algorithm for partition-wise join
On Mon, Jul 8, 2019 at 5:03 PM Etsuro Fujita wrote: > On Wed, Jul 3, 2019 at 3:44 PM Etsuro Fujita > wrote: > > On Tue, Jul 2, 2019 at 1:47 PM amul sul wrote: > > > Attached version is rebase atop of the latest master head(c74d49d41c), > thanks. > > > > Thanks! Will review. > > I started reviewing this. Here is my initial review comments: > > * 0001-Hash-partition-bound-equality-refactoring-v22.patch > > First of all, I agree with your view on hash partitioning: > > "3. For hash partitioned tables however, we support partition-wise join > only when the bounds exactly match. For hash partitioning it's unusual > to have missing partitions and hence generic partition matching is not > required." > > which is cited from the commit message for the main patch > "0002-Partition-wise-join-for-1-1-1-0-0-1-partition-matchi-v22.patch". > (I think it would be better if we can extend the partition matching to > the hash-partitioning case where there are missing partitions in > future, though.) However, I don't think it's a good idea to do this > refactoring, because that would lead to duplicating the code to check > whether two given hash bound collections are equal in > partition_bounds_equal() and partition_hash_bounds_merge() that will > be added by the main patch, after all. To avoid that, how about > calling partition_bounds_equal() from partition_hash_bounds_merge() in > the main patch, like the attached? Agree, your changes look good to me, thanks for working on it. Regards, Amul
Re: [PATCH] minor bugfix for pg_basebackup (9.6 ~ )
Hi. Oh, replication slot names currently can contain only a-z0-9_ characters, so we cannot actually write such a recovery.conf; pg_basebackup will stop before that. But performing escape_quotes on a string and then not using the result is an error anyway. regards, Sergei
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Tomas Vondra wrote: > On Mon, Jul 15, 2019 at 03:42:39PM -0400, Bruce Momjian wrote: > >On Sat, Jul 13, 2019 at 11:58:02PM +0200, Tomas Vondra wrote: > >> One extra thing we should consider is authenticated encryption. We can't > >> just encrypt the pages (no matter which AES mode is used - XTS/CBC/...), > >> as that does not provide integrity protection (i.e. can't detect when > >> the ciphertext was corrupted due to disk failure or intentionally). And > >> we can't quite rely on checksums, because that checksums the plaintext > >> and is stored encrypted. > > > >Uh, if someone modifies a few bytes of the page, we will decrypt it, but > >the checksum (per-page or WAL) will not match our decrypted output. How > >would they make it match the checksum without already knowing the key. > >I read [1] but could not see that explained. > > > > Our checksum is only 16 bits, so perhaps one way would be to just > generate 64k of randomly modified pages and hope one of them happens to > hit the right checksum value. Not sure how practical such attack is, but > it does require just filesystem access. I don't think you can easily generate 64k of different checksums this way. If the data is random, I suppose that each set of 2^(128 - 16) blocks will contain the same checksum after decryption. Thus even if you generate 64k of different ciphertext blocks that contain the checksum, some (many?) checksums will be duplicates. Unfortunately the math to describe this problem does not seem to be trivial. Also note that if you try to generate ciphertext, decryption of which will result in a particular checksum value, you can hardly control the other 14 bytes of the block, which in turn are used to verify the checksum. > FWIW our CRC algorithm is not quite HMAC, because it's neither keyed nor > a cryptographic hash algorithm. Now, maybe we don't want authenticated > encryption (e.g. XTS is not authenticated, unlike GCM/CCM). 
I'm also not sure if we should try to guarantee data authenticity / integrity. As someone already mentioned elsewhere, page MAC does not help if the whole page is replaced. (An extreme case is that old filesystem snapshot containing the whole data directory is restored, although that will probably make the database crash soon.) We can guarantee integrity and authenticity of backup, but that's a separate feature: someone may need this although it's o.k. for him to run the cluster unencrypted. -- Antonin Houska Web: https://www.cybertec-postgresql.com
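To make the checksum-vs-MAC contrast in this exchange concrete: with a keyed MAC over the stored page, any bit flip is detected and an attacker without the key cannot forge a matching tag, unlike the 16-bit checksum attack discussed upthread. A minimal, hypothetical sketch using Python's stdlib hmac module (illustration only, not PostgreSQL code):

```python
import hmac, hashlib, os

key = os.urandom(32)                # hypothetical per-cluster MAC key
page = bytearray(os.urandom(8192))  # stand-in for an encrypted 8kB page

def mac(data):
    # Keyed tag over the whole stored page
    return hmac.new(key, bytes(data), hashlib.sha256).digest()

def verify(tag, data):
    return hmac.compare_digest(tag, mac(data))

tag = mac(page)
page[-1] ^= 0x01            # attacker flips one ciphertext bit
tampered_ok = verify(tag, page)
page[-1] ^= 0x01            # restore the original byte
intact_ok = verify(tag, page)
```

Note this still does not address whole-page replacement or replay, which is exactly the limitation raised above.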
Re: Compile from source using latest Microsoft Windows SDK
On Fri, Jul 19, 2019 at 03:39:49PM +0800, Peifeng Qiu wrote: > I updated the patch to only include the WindowsTargetPlatformVersion node > if WindowsSDKVersion is present. I can confirm that this issue no > longer exists for VS2019. So only VS2017 is problematic. (Could you please avoid top-posting?) Ugh. That's one I don't have at hand. > I'm also very curious on how hamerkop and bowerbird build postgres with > VS2017. Looks like hamerkop and bowerbird both exist before VS2017 > and maybe they get SDK v8.1 from previous VS installations. I will > contact admin of hamerkop and bowerbird and see if that's the case. > As of now I can still encounter the same issue with fresh installed > Windows Server 2016 and VS2017, on both azure and google cloud. So > better to patch the build system anyway. I guess so but I cannot confirm myself. I am adding Andrew and Hari in CC to double-check if they have seen this problem or not. Hari has also worked on porting VS 2017 and 2019 in the MSVC scripts in the tree. -- Michael signature.asc Description: PGP signature
Re: SQL/JSON path issues/questions
On Thu, Jul 18, 2019 at 5:08 PM Thom Brown wrote: > On Tue, 16 Jul 2019 at 19:44, Alexander Korotkov > wrote: > > > > On Tue, Jul 16, 2019 at 9:22 PM Thom Brown wrote: > > > Now I'm looking at the @? and @@ operators, and getting a bit > > > confused. This following query returns true, but I can't determine > > > why: > > > > > > # SELECT '{"a":[1,2,3,4,5]}'::jsonb @? '$.b == "hello"'::jsonpath; > > > ?column? > > > -- > > > t > > > (1 row) > > > > > > "b" is not a valid item, so there should be no match. Perhaps it's my > > > misunderstanding of how these operators are supposed to work, but the > > > documentation is quite terse on the behaviour. > > > > So, the result of jsonpath evaluation is single value "false". > > > > # SELECT jsonb_path_query_array('{"a":[1,2,3,4,5]}'::jsonb, '$.b == > > "hello"'); > > jsonb_path_query_array > > > > [false] > > (1 row) > > > > @@ operator checks that result is "true". This is why it returns "false". > > > > @? operator checks if result is not empty. So, it's single "false" > > value, not empty list. This is why it returns "true". > > > > Perhaps, we need to clarify this in docs providing more explanation. > > Understood. Thanks. > > Also, is there a reason why jsonb_path_query doesn't have an operator analog? The point of existing operator analogues is index support. We introduced operators for searches we can accelerate using GIN indexes. jsonb_path_query() doesn't return bool. So, even if we have an operator for that, it wouldn't get index support. However, we can discuss introduction of operator analogues for other functions as syntax sugar. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
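The @? / @@ distinction Alexander describes can be modeled in a few lines (a simplified sketch of the semantics, not the server's implementation): jsonpath evaluation yields a list of items; @? asks whether that list is non-empty, while @@ asks whether its result is the single item true:

```python
def op_exists(results):
    """Models jsonb @? jsonpath: did the path yield any item at all?"""
    return len(results) > 0

def op_predicate(results):
    """Models jsonb @@ jsonpath: is the result the single item true?"""
    return results == [True]

# '$.b == "hello"' over {"a": [1, 2, 3, 4, 5]} evaluates to one item: false
results = [False]
print(op_exists(results), op_predicate(results))
```

So the single false item makes @? return true (the result is non-empty) while @@ returns false, matching the behavior Thom observed.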
Re: POC: Cleaning up orphaned files using undo logs
On Thu, Jul 11, 2019 at 9:17 AM Dilip Kumar wrote: > > On Thu, Jul 11, 2019 at 12:38 AM Robert Haas wrote: > > > > I don't like the fact that undoaccess.c has a new global, > > undo_compression_info. I haven't read the code thoroughly, but do we > > really need that? I think it's never modified (so it could just be > > declared const), > > Actually, this will get modified otherwise across undo record > insertion how we will know what was the values of the common fields in > the first record of the page. Another option could be that every time > we insert the record, read the value from the first complete undo > record on the page but that will be costly because for every new > insertion we need to read the first undo record of the page. > This information won't be shared across transactions, so can't we keep it in top transaction's state? It seems to me that will be better than to maintain it as a global state. Few more comments on this patch: 1. PrepareUndoInsert() { .. + if (logswitched) + { .. + } + else + { .. + resize = true; .. + } + .. + + do + { + bufidx = UndoGetBufferSlot(context, rnode, cur_blk, rbm); .. + rbm = RBM_ZERO; + cur_blk++; + } while (cur_size < size); + + /* + * Set/overwrite compression info if required and also exclude the common + * fields from the undo record if possible. + */ + if (UndoSetCommonInfo(compression_info, urec, urecptr, + context->prepared_undo_buffers[prepared_undo->undo_buffer_idx[0]].buf)) + resize = true; + + if (resize) + size = UndoRecordExpectedSize(urec); I see that in some cases where resize is possible are checked before buffer allocation and some are after. Isn't it better to do all these checks before buffer allocation? Also, isn't it better to even compute changed size before buffer allocation as that might sometimes help in lesser buffer allocations? Can you find a better way to write :context->prepared_undo_buffers[prepared_undo->undo_buffer_idx[0]].buf)? It makes the line too long and difficult to understand. 
Check for similar instances in the patch and if possible, change them as well. 2. +InsertPreparedUndo(UndoRecordInsertContext *context) { .. /* + * Try to insert the record into the current page. If it + * doesn't succeed then recall the routine with the next page. + */ + InsertUndoData(, page, starting_byte); + if (ucontext.stage == UNDO_PACK_STAGE_DONE) + { + MarkBufferDirty(buffer); + break; + } + MarkBufferDirty(buffer); .. } Can't we call MarkBufferDirty(buffer) just before 'if' check? That will avoid calling it twice. 3. + * Later, during insert phase we will write actual records into thse buffers. + */ +struct PreparedUndoBuffer /thse/these 4. + /* + * If we are writing first undo record for the page the we can set the + * compression so that subsequent records from the same transaction can + * avoid including common information in the undo records. + */ + if (first_complete_undo) /page the we/page then we 5. PrepareUndoInsert() { .. After + * allocation We'll only advance by as many bytes as we turn out to need. + */ + UndoRecordSetInfo(urec); Change the beginning of comment as: "After allocation, we'll .." 6. PrepareUndoInsert() { .. * TODO: instead of storing this in the transaction header we can + * have separate undo log switch header and store it there. + */ + prevlogurp = + MakeUndoRecPtr(UndoRecPtrGetLogNo(prevlog_insert_urp), +(UndoRecPtrGetOffset(prevlog_insert_urp) - prevlen)); + I don't think this TODO is valid anymore because now the patch has a separate log-switch header. 7. /* + * If undo log is switched then set the logswitch flag and also reset the + * compression info because we can use same compression info for the new + * undo log. + */ + if (UndoRecPtrIsValid(prevlog_xact_start)) /can/can't -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: partition routing layering in nodeModifyTable.c
On Thu, Jul 18, 2019 at 4:50 PM Amit Langote wrote: > On Thu, Jul 18, 2019 at 2:53 PM Andres Freund wrote: > > On 2019-07-18 14:24:29 +0900, Amit Langote wrote: > > > On Thu, Jul 18, 2019 at 10:09 AM Andres Freund wrote: > > Or perhaps the actually correct fix is to remove es_result_relation_info > > altogether, and just pass it down to the places that need it - we've a lot > > more code setting it than using the value. And it'd not be hard to > > actually pass it to the places that read it. Given all the > > setting/resetting of it it's pretty obvious that a query-global resource > > isn't the right place for it. >> > > > Would you like me to write a patch for some or all items? > > > > Yes, that would be awesome. > > OK, I will try to post a patch soon. Attached are two patches. The first one (0001) deals with reducing the core executor's reliance on es_result_relation_info to access the currently active result relation, in favor of receiving it from the caller as a function argument. So no piece of core code relies on it being correctly set anymore. It still needs to be set correctly for third-party code such as FDWs. Also, because the partition routing related suggestions upthread are closely tied into this, especially those around ExecInsert(), I've included them in the same patch. I chose to keep the function ExecPrepareTupleRouting, even though it's now only called from ExecInsert(), to preserve the readability of the latter. The second patch (0002) implements some rearrangement of the UPDATE tuple movement code, addressing point 2 in the first email of this thread. Mainly, the block of code in ExecUpdate() that implements row movement proper has been moved into a function called ExecMove(). It also contains the cosmetic improvements suggested in point 4. Thanks, Amit v1-0001-Reduce-es_result_relation_info-usage.patch Description: Binary data v1-0002-Rearrange-partition-update-row-movement-code-a-bi.patch Description: Binary data
[PATCH] minor bugfix for pg_basebackup (9.6 ~ )
Hi In pg_basebackup's GenerateRecoveryConf() function, the value for "primary_slot_name" is escaped, but the original, non-escaped value is written. See attached patch. This has been present since the code was added in 9.6 (commit 0dc848b0314). Regards Ian Barwick -- Ian Barwick https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services pg_basebackup-generate-recovery-conf.v1.patch Description: Binary data
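The bug is the classic compute-then-discard pattern: the escaped value is produced but the unescaped original is what gets written. A hypothetical reduction in Python (the real code is C in pg_basebackup's GenerateRecoveryConf()):

```python
def escape_quotes(s):
    # Double single quotes, as needed inside a quoted conf value
    return s.replace("'", "''")

def recovery_line_buggy(slot):
    escaped = escape_quotes(slot)            # escaped value computed...
    return f"primary_slot_name = '{slot}'"   # ...but the original is written

def recovery_line_fixed(slot):
    return f"primary_slot_name = '{escape_quotes(slot)}'"
```

As noted elsewhere in the thread, slot names are currently limited to a-z0-9_ characters, so the buggy path cannot actually emit a malformed recovery.conf today; the fix still removes a latent bug and an unused escape_quotes() result.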
Re: Compile from source using latest Microsoft Windows SDK
Hi Michael. Thanks for your review. I updated the patch to only include the WindowsTargetPlatformVersion node if WindowsSDKVersion is present. I can confirm that this issue no longer exists for VS2019. So only VS2017 is problematic. I'm also very curious on how hamerkop and bowerbird build postgres with VS2017. Looks like hamerkop and bowerbird both exist before VS2017 and maybe they get SDK v8.1 from previous VS installations. I will contact admin of hamerkop and bowerbird and see if that's the case. As of now I can still encounter the same issue with fresh installed Windows Server 2016 and VS2017, on both azure and google cloud. So better to patch the build system anyway. Peifeng Qiu Best regards, On Thu, Jul 18, 2019 at 4:09 PM Michael Paquier wrote: > Hi Peifeng, > > On Fri, Mar 29, 2019 at 12:01:26AM +0900, Peifeng Qiu wrote: > > The current Windows build system supports compiling with Windows SDK up > to > > v8.1. When building with the latest Windows SDK v10 which is the default > > for Visual Studio 2017, we will get the following error: > > > > error MSB8036: The Windows SDK version 8.1 was not found. > > Actually up to 10, no? Sorry for the delay, I have just noticed this > patch registered in the commit fest. And now is review time. > > > When the build system generates projects files for MSBuild to consume, it > > doesn't include a SDK version number. Then MSBuild will assume v8.1 as > > default. > > But if we only install the latest v10 but not v8.1, MSBuild will error > out. > > So... This actually boils down to that behavior: > > https://developercommunity.visualstudio.com/content/problem/140294/windowstargetplatformversion-makes-it-impossible-t.html > > While WindowsSDKVersion seems to be present all the time. I think > that we should be more defensive if the variable is not defined, and > instead rely on the default provided by the system, whatever it may > be. 
In short it seems to me that the tag WindowsTargetPlatformVersion > should be added only if the variable exists, and your patch always > sets it. > > For anything with Postgres on Windows, I have been using Visual Studio > 2015 and 2019 lately to compile Postgres mainly with the Native Tools > command prompt so I have never actually faced this failure even with > the most recent VS 2019. Using just a command prompt causes a failure > when finding out nmake for example as that's not in the default PATH. > Our buildfarm members don't complain either, and there are two animals > using VS 2017: hamerkop and bowerbird. > -- > Michael > compile-latest-win-sdk-v2.patch Description: Binary data
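The MSVC project generators are Perl scripts, but the shape of the fix Michael suggests can be sketched language-neutrally (Python here, with a hypothetical function name): emit the WindowsTargetPlatformVersion tag only when the environment says which SDK is installed, and otherwise let MSBuild use its default:

```python
import os

def platform_version_property():
    # VS command prompts export WindowsSDKVersion, e.g. "10.0.17763.0\"
    sdk = os.environ.get("WindowsSDKVersion")
    if not sdk:
        return ""  # omit the tag; MSBuild falls back to its default (v8.1 on VS2017)
    return ("<WindowsTargetPlatformVersion>%s</WindowsTargetPlatformVersion>"
            % sdk.rstrip("\\"))
```

This matches the v2 patch's behavior as described: always setting the tag would break environments where the variable is absent, while conditional emission is defensive in both directions.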
Re: pg_receivewal documentation
On Fri, 2019-07-19 at 10:27 +0900, Michael Paquier wrote: > On Thu, Jul 18, 2019 at 08:40:36AM -0400, Jesper Pedersen wrote: > > On 7/18/19 1:29 AM, Michael Paquier wrote: > > > Or more simply like that? > > > "Note that while WAL will be flushed with this setting, > > > pg_receivewal never applies it, so synchronous_commit must not be set > > > to remote_apply if pg_receivewal is a synchronous standby, be it a > > > member of a priority-based (FIRST) or a quorum-based (ANY) synchronous > > > replication setup." > > > > Yeah, better. > > I was looking into committing that, and the part about > synchronous_commit = on is not right. The location of the warning is > also harder to catch for the reader, so instead let's move it to the > top where we have an extra description for --synchronous. I am > finishing with the attached that I would be fine to commit and > back-patch as needed. Does that sound fine? It was my first reaction too that this had better be at the top. I'm happy with the patch as it is. Yours, Laurenz Albe
Re: Add parallelism and glibc dependent only options to reindexdb
On Fri, Jul 19, 2019 at 2:35 AM Michael Paquier wrote: > > On Thu, Jul 18, 2019 at 09:45:14AM +0900, Michael Paquier wrote: > > On Wed, Jul 17, 2019 at 07:46:10PM +0200, Julien Rouhaud wrote: > >> Is it ok to call pg_free(slots) and let caller have a pointer pointing > > to freed memory? > > > > The interface has a Setup call which initializes the whole thing, and > > Terminate is the logical end point, so having the free logic within > > the termination looks more consistent to me. We could now have actual > > Init() and Free() but I am not sure that this justifies the move as > > this complicates the scripts using it. > > I have reconsidered this point, moved the pg_free() call out of the > termination logic, and committed the first patch after an extra lookup > and more polishing. Thanks! > For the second patch, could you send a rebase with a fix for the > connection slot when processing the reindex commands? Attached, I also hopefully removed all the now unneeded progname usage. reindex_parallel_v5.diff Description: Binary data
RE: Multivariate MCV list vs. statistics target
On Tuesday, July 9, 2019, Tomas Vondra wrote: > >apparently v1 of the ALTER STATISTICS patch was a bit confused about > >length of the VacAttrStats array passed to statext_compute_stattarget, > >causing segfaults. Attached v2 patch fixes that, and it also makes sure > >we print warnings about ignored statistics only once. > > > > v3 of the patch, adding pg_dump support - it works just like when you tweak > statistics target for a column, for example. When the value stored in the > catalog is not -1, pg_dump emits a separate ALTER STATISTICS command setting > it (for the already created statistics object). > Hi Tomas, I stumbled upon your patch. According to the CF bot, your patch applies cleanly, builds successfully, and passes make world. Meaning, the pg_dump tap test passed, but there was no test for the new SET STATISTICS yet. So you might want to add a regression test for that and integrate it in the existing alter_generic file. Upon quick read-through, the syntax and docs are correct because it's similar to the format of ALTER TABLE/INDEX... SET STATISTICS... : ALTER [ COLUMN ] column_name SET STATISTICS integer + /* XXX What if the target is set to 0? Reset the statistic? */ This also makes me wonder. I haven't looked deeply into the code, but since 0 is a valid value, I believe it should reset the stats. After lookup though, this is how it's tested in ALTER TABLE: /test/regress/sql/stats_ext.sql:-- Ensure things work sanely with SET STATISTICS 0 > I've considered making it part of CREATE STATISTICS itself, but it seems a > bit cumbersome and we don't do it for columns either. I'm not against adding > it in the future, but at this point I don't see a need. I agree. Perhaps that's for another patch should you decide to add it in the future. Regards, Kirk Jamison