Re: [HACKERS] replicating DROP commands across servers
On Mon, Dec 29, 2014 at 2:15 PM, Alvaro Herrera wrote:
> Here's a patch that tweaks the grammar to use TypeName in COMMENT,
> SECURITY LABEL, and DROP for the type and domain cases. The required
> changes in the code are pretty minimal, thankfully. Note the slight
> changes to the new object_address test also.
>
> I think I'm pretty much done with this now, so I intend to push this
> first thing tomorrow and then the rebased getObjectIdentityParts patch
> sometime later.

This commit (3f88672a4e4d8e648d24ccc65) seems to have broken pg_upgrade for pg_trgm.

On a 9.2.9 cluster, freshly initdb'd and started with the default config:

	$ createdb jjanes

in psql:

	create extension pg_trgm;
	create table foo (x text);
	create index on foo using gin (upper(x) gin_trgm_ops);

Then run 9.5's pg_upgrade, and it fails at the stage of restoring the database schema. The relevant log files (pg_upgrade_server.log and pg_upgrade_dump_16384.log) are attached.

Cheers,

Jeff

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
On Fri, Jan 2, 2015 at 8:34 PM, Kevin Grittner wrote:
> Amit Kapila wrote:
>> Notes for Committer -
>> There is one behavioural difference in the handling of the
>> --analyze-in-stages switch: when individual tables (via the -t option)
>> are analyzed using this switch, the patch will process (in the case of
>> concurrent jobs) all the given tables for stage 1, then all of them for
>> stage 2, and so on, whereas the unpatched code processes all three
>> stages table by table (table 1 all three stages, table 2 all three
>> stages, and so on). I think the new behaviour is okay, as the same is
>> done when this utility vacuums a whole database.
>
> IMV, the change is for the better. The whole point of
> --analyze-in-stages is to get minimal statistics so that "good enough"
> plans will be built for most queries to allow a production database to
> be brought back on-line quickly, followed by generating increasing
> granularity (which takes longer but should help ensure "best plan")
> while the database is in use with the initial statistics. Doing all
> three levels for one table before generating the rough statistics for
> the others doesn't help with that, so I see this change as fixing a
> bug. Is it feasible to break that part out as a separate patch?

As the patch currently stands, the fix (or new behaviour) is only implemented for the multiple-jobs option; fixing this in the base code would require a separate patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
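The two orderings under discussion can be sketched as plain nested loops (a minimal illustration of the iteration order only, not vacuumdb's actual code; the table/stage counts are hypothetical):

```c
#define NTABLES 2
#define NSTAGES 3

/* Unpatched behaviour: finish all three stages for one table before
 * moving on to the next table. Records (table, stage) pairs in order. */
static int order_table_major(int out[][2])
{
    int n = 0;
    for (int t = 0; t < NTABLES; t++)
        for (int s = 0; s < NSTAGES; s++)
        {
            out[n][0] = t;
            out[n][1] = s;
            n++;
        }
    return n;
}

/* Patched behaviour: run stage 1 for every table, then stage 2, and so
 * on, so that rough statistics exist for *all* tables as early as
 * possible -- the property Kevin argues for above. */
static int order_stage_major(int out[][2])
{
    int n = 0;
    for (int s = 0; s < NSTAGES; s++)
        for (int t = 0; t < NTABLES; t++)
        {
            out[n][0] = t;
            out[n][1] = s;
            n++;
        }
    return n;
}
```

The only difference is which loop is outermost, but it changes when the first coarse statistics become available for every table.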
Re: [HACKERS] History of WAL_LEVEL (archive vs hot_standby)
On Thu, Mar 27, 2014 at 08:16:13PM -0700, David Johnston wrote:
> Slightly tangential, but are the locking operations associated with the
> recent bugfix generated in both (all?) modes, or only hot_standby?

All modes.

> I thought it strange that transient locking operations were output with
> WAL, though I get it if they are there to support read-only queries.

It is unintuitive. This comment in heap_lock_tuple() attempts to explain:

	/*
	 * XLOG stuff.  You might think that we don't need an XLOG record because
	 * there is no state change worth restoring after a crash.  You would be
	 * wrong however: we have just written either a TransactionId or a
	 * MultiXactId that may never have been seen on disk before, and we need
	 * to make sure that there are XLOG entries covering those ID numbers.
	 * Else the same IDs might be re-used after a crash, which would be
	 * disastrous if this page made it to disk before the crash.  Essentially
	 * we have to enforce the WAL log-before-data rule even in this case.
	 * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
	 * entries for everything anyway.)
	 */

Another reason not mentioned is torn pages. Locking a tuple updates t_xmax, t_infomask2 and t_infomask. It's possible for t_xmax to fall on one side of a page tear and the infomasks to fall on the other side. Writing t_xmax without writing the corresponding infomasks could cause the tuple to be considered deleted, not merely locked, after a crash.
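The torn-page hazard described above can be sketched with a toy layout. The struct below is illustrative only (PostgreSQL's real HeapTupleHeaderData layout differs); the point is that two fields that must stay mutually consistent can land in different disk sectors, and a crash mid-write can persist one sector but not the other:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a tuple header; field names follow the mail,
 * but this is NOT PostgreSQL's actual struct layout. */
typedef struct DemoTupleHeader
{
    uint32_t t_xmin;
    uint32_t t_xmax;
    uint16_t t_infomask2;
    uint16_t t_infomask;
} DemoTupleHeader;

#define SECTOR_SIZE 512

/* Which disk sector does a field land in, for a tuple starting at
 * tuple_offset within the page/file? */
static size_t sector_of(size_t tuple_offset, size_t field_offset)
{
    return (tuple_offset + field_offset) / SECTOR_SIZE;
}

/* A torn write is dangerous when t_xmax and t_infomask straddle a
 * sector boundary: t_xmax may reach disk while the infomask does not,
 * so after a crash the tuple could look deleted rather than locked. */
static int fields_can_tear(size_t tuple_offset)
{
    return sector_of(tuple_offset, offsetof(DemoTupleHeader, t_xmax)) !=
           sector_of(tuple_offset, offsetof(DemoTupleHeader, t_infomask));
}
```

Logging the lock in WAL (full-page image on first touch after a checkpoint) is what protects against exactly this.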
Re: [HACKERS] Detecting backend failures via libpq / DBD::Pg
On 12/30/2014 08:43 AM, Greg Sabino Mullane wrote:
> I am working on enhancing the ping() method of DBD::Pg. The goal of
> that is for a user to be able to determine if the connection to the
> database is still valid.

This entire concept is flawed. IMO the documentation for it should be marked with a warning pointing out that use of this method to validate a connection before use implies a race condition; it is never safe to "validate" a connection and then assume it'll actually be valid and working for the following statement(s). Correctly written applications must instead tolerate failures, trapping errors/exceptions at each statement and retrying the transaction.

> The basic way we do this is to send a simple SELECT via PQexec and
> then check for a valid return value (and when in doubt, we check
> PQstatus).

At the protocol level all that's needed, and all that should be done IMO, is to send a Sync message and wait for the server's ReadyForQuery response. It might be nice to expose a libpq function to do this. It's only safe to do this when there's no in-progress query, but that's true of *any* validation method you might choose.

> After some experimenting, the best solution I found is to send the
> PQexec, and then check if PQresultErrorField(result, 'C') is
> '25P02'. If it is, then all is "well", in that the server is still
> alive. If it is not, then we can assume the backend is bad (for
> example, PQerrorMessage gives a "could not receive data from
> server: Bad file descriptor"). Given that we cannot do a rollback
> before calling the PQexec, is this a decent solution? Can we depend
> on really serious errors always trumping the expected 25P02?

I struggle to understand the utility of this. Your transaction could be in the failed state because of an internal error caused by the server's storage failing. So it's "alive" in the sense of responding to queries, but completely useless.

This is another aspect of why the whole approach is wrong. If you want to find out if the server can run a particular query, run the query. If you get an error, then it couldn't run that query.

--
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] Re: Problems with approach #2 to value locking (INSERT ... ON CONFLICT UPDATE/IGNORE patch)
On Thu, Jan 1, 2015 at 11:17 PM, Peter Geoghegan wrote:
> I've been working on fixing the bugs that Jeff Janes found [1] with
> approach #2 to value locking [2]. Approach #1 was unaffected.

Unfortunately, exclusion constraints (which only value locking approach #2 has support for, for the IGNORE variant) are broken, even with my fix for the largely unrelated issues also described in my opening mail. I now believe that it is inherently impossible to support exclusion constraints with INSERT ... ON CONFLICT IGNORE, if we are insistent that there cannot be livelocks, unprincipled deadlocks, or spurious exclusion violations -- and I think you all know how I feel about those :-) . I think that we should drop support for exclusion constraints, regardless of whatever else happens with value locking.

Previously, my mental model of approach #2 was that it more or less handled B-Trees and exclusion constraint related indexes equivalently. This is untrue, though. Today, in general, exclusion constraints more or less work by physically inserting a heap tuple and an index tuple relating to the exclusion constraint (typically this is with a GiST index). There could well be a conflict due to a concurrent insertion, or even due to the fact that there was a conflicting tuple all along. The implementation tries to find a conclusively committed conflicting tuple. If it does not find such a tuple, it knows that its own already physically inserted tuple is enough of a reason for others to have to worry about it, and so it may proceed and commit the xact -- once it does a whole index scan with only its own tuple found (and not any overlapping tuple from an unfinished xact), it's clear that it can proceed. Exclusion constraints are made to do a pre-check by approach #2 to value locking, to do an IGNORE, but that isn't really significant here.
The reason that #2 is broken for exclusion constraints but not for unique indexes is that unique index insertion reports a conflict in the process of inserting. In contrast, exclusion constraints insert optimistically and then re-check. What's the difference? Well, even though (since the implementation passes UNIQUE_CHECK_PARTIAL for unique index insertion) we still insert when there is a conflict, there is an important distinction: value locking. That is to say, even though we still insert, we also still use buffer locks as value locks in the usual way that unique indexes do (we hold an exclusive buffer lock on the first B-Tree leaf page the value could be on, as always). We still decide that there is no conflict while holding that exclusive buffer lock, which is a sufficient choke point to make that determination conclusive. This implies that even with many concurrent insertions into a newly created table, there will definitely be one inserter that conclusively does *not* have a conflict, while the others might (depending on the status of the one that conclusively had *no* conflict, or his line of successor xacts that may or may not themselves commit).

In a private e-mail to Heikki, I pointed out that he was mistaken when he said that there is a pre-existing problem with exclusion constraints: it is not possible for concurrent inserters to both spuriously get exclusion violations when only one needs to. Heikki accepted this. However, I now realize that Heikki was closer to being right about it than I was; you can't get spurious exclusion violations, but you can get something that's about the same, which is a deadlock. That usually doesn't matter that much, because people usually don't worry about the issue with exclusion constraints. But in a world where they support INSERT ... ON CONFLICT IGNORE, that isn't acceptable, IMV.
I have a test case:

https://github.com/petergeoghegan/upsert/blob/master/exclusion.sh

I see both spurious exclusion violations and unprincipled deadlocks with this simple IGNORE-based testcase, which uses a GiST + btree_gist exclusion constraint. I didn't see livelocks, which would have been really bad, because a mutual need to wait on each other's xact causes a deadlock, and the deadlock detector detects those. I think that the spurious exclusion violations are merely an artifact of the implementation, as opposed to an inherent problem with exclusion constraints (in other words, I don't withdraw my remarks from the last paragraph -- only (arguably) unprincipled deadlocks occur with ordinary exclusion constraint enforcement with lots of concurrency).

--
Peter Geoghegan
Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates
On Fri, Jan 2, 2015 at 7:57 PM, Jim Nasby wrote:
> On 1/2/15, 4:18 PM, Tom Lane wrote:
>> Heikki Linnakangas writes:
>>> On 01/02/2015 11:41 PM, Tom Lane wrote:
>>>> What might be worth trying is establishing a hard-and-fast boundary
>>>> between C land and SQL land, with bitwise names in C and bytewise
>>>> names in SQL. This would mean, for example, that int4pl() would be
>>>> renamed to int32pl() so far as the C function goes, but the
>>>> function's SQL name would remain the same.
>>>
>>> I don't like that. I read int4pl as the function implementing the
>>> plus operator for the SQL-visible int4 datatype, so int4pl makes
>>> perfect sense.
>>
>> I agree with that so far as the SQL name for the function goes, which
>> is part of why I don't think we should rename anything at the SQL
>> level. But right now at the C level, it's unclear how things should be
>> named, and I think we don't really want a situation where the most
>> appropriate name is so unclear and potentially confusing. We're
>> surviving fine with "int32" in C meaning "int4" in SQL so far as the
>> type names go, so why not copy that naming approach for function
>> names?
>
> Realistically, how many non-developers actually use the intXX SQL
> names? I don't think I've ever seen it; the only places I recall seeing
> it done are code snippets on developer blogs. Everyone else uses
> smallint, etc.
>
> I know we're all gun-shy about this after standard_conforming_strings,
> but that affected *everyone*. I believe this change would affect very,
> very few users.
>
> Also, note that I'm not talking about removing anything yet; that would
> come later.

I think it's naive to think the intXX names wouldn't be used just because they're postgres-specific, especially when pgAdmin3 generates them. Many scripts of mine have them because pgAdmin3 generated them; many others have them because I grew accustomed to using them, especially when I don't care about being postgres-specific: float8 vs double precision and stuff like that. Let's not generalize anecdote (me using them); let's just pay attention to the fact that lots of tools expose them (pgAdmin3 among them).
Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates
On 1/2/15, 4:18 PM, Tom Lane wrote:
> Heikki Linnakangas writes:
>> On 01/02/2015 11:41 PM, Tom Lane wrote:
>>> What might be worth trying is establishing a hard-and-fast boundary
>>> between C land and SQL land, with bitwise names in C and bytewise
>>> names in SQL. This would mean, for example, that int4pl() would be
>>> renamed to int32pl() so far as the C function goes, but the
>>> function's SQL name would remain the same.
>>
>> I don't like that. I read int4pl as the function implementing the plus
>> operator for the SQL-visible int4 datatype, so int4pl makes perfect
>> sense.
>
> I agree with that so far as the SQL name for the function goes, which
> is part of why I don't think we should rename anything at the SQL
> level. But right now at the C level, it's unclear how things should be
> named, and I think we don't really want a situation where the most
> appropriate name is so unclear and potentially confusing. We're
> surviving fine with "int32" in C meaning "int4" in SQL so far as the
> type names go, so why not copy that naming approach for function names?

Realistically, how many non-developers actually use the intXX SQL names? I don't think I've ever seen it; the only places I recall seeing it done are code snippets on developer blogs. Everyone else uses smallint, etc.

I know we're all gun-shy about this after standard_conforming_strings, but that affected *everyone*. I believe this change would affect very, very few users.

Also, note that I'm not talking about removing anything yet; that would come later.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Redesigning checkpoint_segments
On 01/02/2015 01:57 AM, Heikki Linnakangas wrote:
> wal_keep_segments does not affect the calculation of CheckPointSegments.
> If you set wal_keep_segments high enough, checkpoint_wal_size will be
> exceeded. The other alternative would be to force a checkpoint earlier,
> i.e. lower CheckPointSegments, so that checkpoint_wal_size would be
> honored. However, if you set wal_keep_segments high enough, higher than
> checkpoint_wal_size, it's impossible to honor checkpoint_wal_size no
> matter how frequently you checkpoint.

So you're saying that wal_keep_segments is part of the max_wal_size total, NOT in addition to it? Just asking for clarification here. I think that's a fine idea; I just want to make sure I understood you.

The importance of wal_keep_segments will be fading as more people use replication slots.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates
Heikki Linnakangas writes:
> On 01/02/2015 11:41 PM, Tom Lane wrote:
>> What might be worth trying is establishing a hard-and-fast boundary
>> between C land and SQL land, with bitwise names in C and bytewise names
>> in SQL. This would mean, for example, that int4pl() would be renamed to
>> int32pl() so far as the C function goes, but the function's SQL name
>> would remain the same.
>
> I don't like that. I read int4pl as the function implementing the plus
> operator for the SQL-visible int4 datatype, so int4pl makes perfect
> sense.

I agree with that so far as the SQL name for the function goes, which is part of why I don't think we should rename anything at the SQL level. But right now at the C level, it's unclear how things should be named, and I think we don't really want a situation where the most appropriate name is so unclear and potentially confusing. We're surviving fine with "int32" in C meaning "int4" in SQL so far as the type names go, so why not copy that naming approach for function names?

>> That would introduce visible inconsistency between such functions'
>> pg_proc.proname and pg_proc.prosrc fields, but at least the
>> inconsistency would follow a very clear pattern. And I doubt that very
>> many user applications are depending on the contents of
>> pg_proc.prosrc.
>
> Someone might be doing
> 	DirectFunctionCall2(int4pl, ...)
> in an extension. Well, probably not too likely for int4pl, as you could
> just use the native C + operator, but it's not inconceivable for
> something like int4recv().

We don't have a lot of hesitation about breaking individual function calls in extensions, so long as (1) you'll get a compile error and (2) it's clear how to update the code. See for instance the multitude of cases where we've added new arguments to existing C functions. So I don't think this objection holds a lot of water, especially when (as you note) there's not that much reason to be calling most of these functions directly from C.

			regards, tom lane
Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates
On 01/02/2015 11:41 PM, Tom Lane wrote:
> What might be worth trying is establishing a hard-and-fast boundary
> between C land and SQL land, with bitwise names in C and bytewise names
> in SQL. This would mean, for example, that int4pl() would be renamed to
> int32pl() so far as the C function goes, but the function's SQL name
> would remain the same.

I don't like that. I read int4pl as the function implementing the plus operator for the SQL-visible int4 datatype, so int4pl makes perfect sense.

> That would introduce visible inconsistency between such functions'
> pg_proc.proname and pg_proc.prosrc fields, but at least the
> inconsistency would follow a very clear pattern. And I doubt that very
> many user applications are depending on the contents of pg_proc.prosrc.

Someone might be doing

	DirectFunctionCall2(int4pl, ...)

in an extension. Well, probably not too likely for int4pl, as you could just use the native C + operator, but it's not inconceivable for something like int4recv().

- Heikki
Re: [HACKERS] Final Patch for GROUPING SETS
On 12/31/14, 3:05 PM, Noah Misch wrote:
> On Wed, Dec 31, 2014 at 05:33:43PM +, Andrew Gierth wrote:
>> "Noah" == Noah Misch writes:
>>
>> Noah> Suppose one node orchestrated all sorting and aggregation.
>>
>> Well, that has the downside of making it into an opaque blob, without
>> actually gaining much.
>
> The opaque-blob criticism is valid. As for not gaining much, well, the
> gain I sought was to break this stalemate. You and Tom have expressed
> willingness to accept the read I/O multiplier of the CTE approach. You
> and I are willing to swallow an architecture disruption, namely a
> tuplestore acting as a side channel between executor nodes. Given your
> NACK, I agree that it fails to move us toward consensus and therefore
> does not gain much. Alas.

I haven't read the full discussion in depth, but is what we'd want here the ability to feed tuples to more than one node simultaneously? That would allow things like

	GroupAggregate
	  --> Sort(a) ----\
	  --> Sort(a,b) ---+--> SeqScan
	  --> Hash(b) ----/

That would allow the planner to trade off things like total memory consumption vs. IO.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates
Jim Nasby writes:
> On 12/31/14, 8:13 AM, Andres Freund wrote:
>> Note that the C datatype has been int32/int64 for a while now, it's
>> just the SQL datatype and the names of its support functions. Given
>> that, afaiu, we're talking about the C datatype, it seems pretty clear
>> that it should be named int128.

I don't think there was any debate about calling the C typedef int128. The question at hand was that there are some existing C functions whose names follow the int2/int4/int8 convention, and it's not very clear what to do with their 128-bit analogues. Having said that, I'm fine with switching to int128 for the C names of the new functions; but it is definitely less than consistent with what we're doing elsewhere.

> Should we start down this road with SQL too, by creating int32, 64 and
> 128 (if needed?), and changing docs as needed?

No. The situation is messy enough without changing our conventions at the SQL level; that will introduce breakage into user applications, for no very good reason because we've never had any inconsistency there.

What might be worth trying is establishing a hard-and-fast boundary between C land and SQL land, with bitwise names in C and bytewise names in SQL. This would mean, for example, that int4pl() would be renamed to int32pl() so far as the C function goes, but the function's SQL name would remain the same. That would introduce visible inconsistency between such functions' pg_proc.proname and pg_proc.prosrc fields, but at least the inconsistency would follow a very clear pattern. And I doubt that very many user applications are depending on the contents of pg_proc.prosrc.

			regards, tom lane
Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates
On 12/31/14, 8:13 AM, Andres Freund wrote:
> On 2015-01-01 03:00:50 +1300, David Rowley wrote:
>> 2. References to int16 meaning 16 bytes. I'm really in two minds about
>> this: it's quite nice to keep the natural flow, int4, int8, int16, but
>> I can't help thinking that this will confuse someone one day.
>>
>> I think it'll be a long time before it confuses anyone if we called it
>> int128 instead, but I'm not that excited about seeing it renamed
>> either. I'd like to hear what others have to say... Is there a chance
>> that some SQL standard in the distant future will have HUGEINT and we
>> might regret not getting the internal names nailed down? Yeah, I think
>> using int16 to mean 16 bytes will be confusing to someone almost
>> immediately.
>>
>> hmm, I think it should be changed to int128 then. Pity int4 was
>> selected as a name instead of int32 back in the day...
>
> Note that the C datatype has been int32/int64 for a while now; it's
> just the SQL datatype and the names of its support functions that keep
> the old convention. Given that, afaiu, we're talking about the C
> datatype, it seems pretty clear that it should be named int128.

Should we start down this road with SQL too, by creating int32, 64 and 128 (if needed?), and changing docs as needed? Presumably that would be best as a separate patch...

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
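As a minimal sketch of why a 128-bit C accumulator matters for sum/avg (assuming a compiler providing GCC/Clang's __int128 extension; this is an illustration, not the patch's actual code): summing int64 values into an int64 can overflow almost immediately, while an __int128 total cannot overflow for any realistic row count.

```c
#include <stdint.h>
#include <stddef.h>

/* Sum an array of int64 values into a 128-bit accumulator.  Even if
 * every input were INT64_MAX, overflowing the 128-bit total would take
 * on the order of 2^64 rows, so no per-row overflow check is needed. */
static __int128 sum_int64(const int64_t *vals, size_t n)
{
    __int128 total = 0;
    for (size_t i = 0; i < n; i++)
        total += vals[i];
    return total;
}
```

This is the performance point of the thread: the 128-bit path replaces the numeric-based accumulation that would otherwise be needed to avoid int64 overflow.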
Re: [HACKERS] Transactions involving multiple postgres foreign servers
Ashutosh Bapat writes:
> While looking at the patch for supporting inheritance on foreign
> tables, I noticed that if a transaction makes changes to more than two
> foreign servers, the current implementation in postgres_fdw doesn't
> make sure that either all of them roll back or all of them commit
> their changes; IOW there is a possibility that some of them commit
> their changes while others roll back theirs.
>
> PFA patch which uses 2PC to solve this problem. In
> pgfdw_xact_callback(), at the XACT_EVENT_PRE_COMMIT event, it prepares
> the transaction at all the foreign postgresql servers, and at the
> XACT_EVENT_COMMIT or XACT_EVENT_ABORT event it commits or aborts those
> transactions resp.

TBH, I think this is a pretty awful idea. In the first place, this does little to improve the actual reliability of a commit occurring across multiple foreign servers; and in the second place it creates a bunch of brand new failure modes, many of which would require manual DBA cleanup.

The core of the problem is that this doesn't have anything to do with 2PC as it's commonly understood: for that, you need a genuine external transaction manager that is aware of all the servers involved in a transaction, and has its own persistent state (or at least a way to reconstruct its own state by examining the per-server states). This patch is not that; in particular it treats the local transaction asymmetrically from the remote ones, which doesn't seem like a great idea -- ie, the local transaction could still abort after committing all the remote ones, leaving you no better off in terms of cross-server consistency.

As far as failure modes go, one basic reason why this cannot work as presented is that the remote servers may not even have prepared transaction support enabled (in fact max_prepared_transactions = 0 is the default in all supported PG versions). So this would absolutely have to be a not-on-by-default option.

But the bigger issue is that leaving it to the DBA to clean up after failures is not a production grade solution, *especially* not for prepared transactions, which are performance killers if not closed out promptly. So I can't imagine anyone wanting to turn this on without a more robust answer than that.

Basically, I think what you'd need for this to be a credible patch would be for it to work by changing the behavior only in the PREPARE TRANSACTION path: rather than punting as we do now, prepare the remote transactions, and report their server identities and gids to an external transaction manager, which would then be responsible for issuing the actual commits (along with the actual commit of the local transaction). I have no idea whether it's feasible to do that without having to assume a particular 2PC transaction manager API/implementation. It'd be interesting to hear from people who are using 2PC in production to find out if this would solve any real-world problems for them, and what the details of the TM interface would need to look like to make it work in practice.

In short, you can't force 2PC technology on people who aren't using it already; while for those who are using it already, this isn't nearly good enough as-is.

			regards, tom lane
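The distinction Tom draws can be sketched as a toy coordinator: a genuine transaction manager treats every participant symmetrically and commits only after *all* of them have prepared. Everything below (the Participant type, the callbacks) is hypothetical illustration of textbook two-phase commit, not postgres_fdw or any real TM API:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical participant in a distributed transaction. */
typedef struct Participant
{
    bool (*prepare)(void *state);   /* PREPARE TRANSACTION; may fail    */
    void (*commit)(void *state);    /* COMMIT PREPARED                  */
    void (*rollback)(void *state);  /* ROLLBACK PREPARED                */
    void *state;
} Participant;

/* Two-phase commit: the commit decision is made only once every
 * participant, local and remote alike, has successfully prepared.
 * A real TM would additionally log the decision durably before phase
 * two, so prepared transactions can be resolved after a coordinator
 * crash -- the persistent state Tom says this patch lacks. */
static bool two_phase_commit(Participant *parts, size_t n)
{
    size_t prepared = 0;

    /* Phase 1: prepare everyone. */
    for (; prepared < n; prepared++)
        if (!parts[prepared].prepare(parts[prepared].state))
            break;

    if (prepared < n)
    {
        /* Someone failed to prepare: roll back those already prepared. */
        for (size_t i = 0; i < prepared; i++)
            parts[i].rollback(parts[i].state);
        return false;
    }

    /* Phase 2: all prepared, so the decision is commit. */
    for (size_t i = 0; i < n; i++)
        parts[i].commit(parts[i].state);
    return true;
}
```

Note how the asymmetry Tom objects to disappears: the local transaction is just one more entry in the participant array, with no special ordering relative to the remote ones.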
Re: [HACKERS] Compression of full-page-writes
* Bruce Momjian (br...@momjian.us) wrote:
> To be specific, desirable in streaming replication scenarios that don't
> use SSL compression. (What percentage is that?) Is it something we
> should mention in the docs for this feature?

Considering how painful the SSL renegotiation problems were, and the CPU overhead, I'd be surprised if many high-write-volume replication environments use SSL at all.

There's a lot of win to be had from compression of FPWs, but it's like most compression in that there are trade-offs and environments where it won't be a win; I believe those cases to be the minority, though.

Thanks,

Stephen
Re: [HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD
On 2014-12-29 16:59:05 -0500, Bruce Momjian wrote:
> diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
> new file mode 100644
> index ff6c713..c4dce5b
> *** a/src/backend/storage/buffer/buf_init.c
> --- b/src/backend/storage/buffer/buf_init.c
> *** InitBufferPool(void)
> *** 67,72 ****
> --- 67,73 ----
>   	bool		foundBufs,
>   				foundDescs;
>
> + 	fprintf(stderr, "Buffer Descriptors size = %ld\n", sizeof(BufferDesc));
>   	BufferDescriptors = (BufferDesc *)
>   		ShmemInitStruct("Buffer Descriptors",
>   						NBuffers * sizeof(BufferDesc),
>   						&foundDescs);
> diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
> new file mode 100644
> index 2ea2216..669c07f
> *** a/src/backend/storage/ipc/shmem.c
> --- b/src/backend/storage/ipc/shmem.c
> *** ShmemInitStruct(const char *name, Size s
> *** 327,332 ****
> --- 327,335 ----
>   	ShmemIndexEnt *result;
>   	void	   *structPtr;
>
> + 	if (strcmp(name, "Buffer Descriptors") == 0)
> + 		size = BUFFERALIGN(size) + 64;
> +
>   	LWLockAcquire(ShmemIndexLock, LW_EXCLUSIVE);
>
>   	if (!ShmemIndex)
> *** ShmemInitStruct(const char *name, Size s
> *** 413,418 ****
> --- 416,432 ----
>   						 " \"%s\" (%zu bytes requested)",
>   						 name, size)));
>   		}
> + 		if (strcmp(name, "Buffer Descriptors") == 0)
> + 		{
> + 			/* align on 32 */
> + 			if ((int64)structPtr % 32 != 0)
> + 				structPtr = (void *)((int64)structPtr + 32 - (int64)structPtr % 32);
> + 			/* align on 32 but not 64 */
> + 			if ((int64)structPtr % 64 == 0)
> + 				structPtr = (void *)((int64)structPtr + 32);
> + 		}
> + 		fprintf(stderr, "shared memory alignment of %s: %ld-byte\n", name,
> + 				(int64)structPtr % 64 == 0 ? 64 : (int64)structPtr % 64);
>   		result->size = size;
>   		result->location = structPtr;
>   	}
>
> diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
> new file mode 100644
> index ff6c713..c4dce5b
> *** a/src/backend/storage/buffer/buf_init.c
> --- b/src/backend/storage/buffer/buf_init.c
> *** InitBufferPool(void)
> *** 67,72 ****
> --- 67,73 ----
>   	bool		foundBufs,
>   				foundDescs;
>
> + 	fprintf(stderr, "Buffer Descriptors size = %ld\n", sizeof(BufferDesc));
>   	BufferDescriptors = (BufferDesc *)
>   		ShmemInitStruct("Buffer Descriptors",
>   						NBuffers * sizeof(BufferDesc),
>   						&foundDescs);
> diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
> new file mode 100644
> index 2ea2216..50f836e
> *** a/src/backend/storage/ipc/shmem.c
> --- b/src/backend/storage/ipc/shmem.c
> *** ShmemInitStruct(const char *name, Size s
> *** 327,332 ****
> --- 327,335 ----
>   	ShmemIndexEnt *result;
>   	void	   *structPtr;
>
> + 	if (strcmp(name, "Buffer Descriptors") == 0)
> + 		size = BUFFERALIGN(size) + 64;
> +
>   	LWLockAcquire(ShmemIndexLock, LW_EXCLUSIVE);
>
>   	if (!ShmemIndex)
> *** ShmemInitStruct(const char *name, Size s
> *** 413,418 ****
> --- 416,429 ----
>   						 " \"%s\" (%zu bytes requested)",
>   						 name, size)));
>   		}
> + 		if (strcmp(name, "Buffer Descriptors") == 0)
> + 		{
> + 			/* align on 64 */
> + 			if ((int64)structPtr % 64 != 0)
> + 				structPtr = (void *)((int64)structPtr + 64 - (int64)structPtr % 64);
> + 		}
> + 		fprintf(stderr, "shared memory alignment of %s: %ld-byte\n", name,
> + 				(int64)structPtr % 64 == 0 ? 64 : (int64)structPtr % 64);
>   		result->size = size;
>   		result->location = structPtr;
>   	}

I can't run tests right now... What exactly do you want to see with these tests? That's essentially what I've already benchmarked, plus some fprintfs.
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
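The alignment arithmetic in the two debug patches above can be pulled out into a standalone sketch. This is a hedged illustration, not the committed code: the patches cast through int64 and adjust a live pointer inside ShmemInitStruct, while this sketch uses uintptr_t on bare addresses.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of 64, as the second patch does. */
static uintptr_t align_up_64(uintptr_t addr)
{
    if (addr % 64 != 0)
        addr += 64 - addr % 64;
    return addr;
}

/* Round addr up to a multiple of 32 that is NOT a multiple of 64 --
 * the deliberately "unlucky" placement the first patch forces, so the
 * two cache-line placements can be benchmarked against each other. */
static uintptr_t align_32_not_64(uintptr_t addr)
{
    if (addr % 32 != 0)
        addr += 32 - addr % 32;    /* align on 32 */
    if (addr % 64 == 0)
        addr += 32;                /* ...but avoid 64 */
    return addr;
}
```

The extra `BUFFERALIGN(size) + 64` padding in the patches exists precisely so this rounding can never run past the end of the allocation.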
Re: [HACKERS] Compression of full-page-writes
On Fri, Jan 2, 2015 at 02:18:12PM -0300, Claudio Freire wrote: > On Fri, Jan 2, 2015 at 2:11 PM, Andres Freund wrote: > >> , I now see the compression patch as something that has negatives, so > >> has to be set by the user, and only wins in certain cases. I am > >> disappointed, and am trying to figure out how this became such a > >> marginal win for 9.5. :-( > > > > I find the notion that a multi digit space reduction is a "marginal win" > > pretty ridiculous and way too narrow focused. Our WAL volume is a > > *significant* problem in the field. And it mostly consists out of FPWs > > spacewise. > > One thing I'd like to point out, is that in cases where WAL I/O is an > issue (ie: WAL archiving), usually people already compress the > segments during archiving. I know I do, and I know it's recommended on > the web, and by some consultants. > > So, I wouldn't want this FPW compression, which is desirable in > replication scenarios if you can spare the CPU cycles (because of > streaming), adversely affecting WAL compression during archiving. To be specific, desirable in streaming replication scenarios that don't use SSL compression. (What percentage is that?) Is that something we should mention in the docs for this feature? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Publish autovacuum informations
On 1/1/15, 4:17 PM, Noah Misch wrote:
>> I'd be all right with putting the data structure declarations in a file
>> named something like autovacuum_private.h, especially if it carried an
>> annotation that "if you depend on this, don't be surprised if we break
>> your code in future".
>
> Such an annotation would be no more true than it is for the majority of
> header files. If including it makes you feel better, I don't object.

We need to be careful with that. Starting to segregate things into _private headers implies that stuff in non-private headers *is* locked down. We'd need to set clear expectations. I do think more clarity would be good here. Right now the only distinction we have is that things like SPI are spelled out in the docs. Other than that, there really isn't anything to indicate how safe it is to rely on what's in the headers. For example, I've got some code that's looking at fcinfo->flinfo->fn_expr, and I have no idea how likely that is to get broken. Since it's a parse node, my guess is "likely", but I'm just guessing. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Compression of full-page-writes
On Fri, Jan 2, 2015 at 2:11 PM, Andres Freund wrote: >> , I now see the compression patch as something that has negatives, so >> has to be set by the user, and only wins in certain cases. I am >> disappointed, and am trying to figure out how this became such a >> marginal win for 9.5. :-( > > I find the notion that a multi digit space reduction is a "marginal win" > pretty ridiculous and way too narrow focused. Our WAL volume is a > *significant* problem in the field. And it mostly consists out of FPWs > spacewise. One thing I'd like to point out is that in cases where WAL I/O is an issue (ie: WAL archiving), usually people already compress the segments during archiving. I know I do, and I know it's recommended on the web, and by some consultants. So, I wouldn't want this FPW compression, which is desirable in replication scenarios if you can spare the CPU cycles (because of streaming), adversely affecting WAL compression during archiving. Has anyone tested the compressibility of WAL segments with FPW compression on? AFAIK, both pglz and lz4 output should still be compressible with deflate, but I've never tried. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Compression of full-page-writes
On Fri, Jan 2, 2015 at 06:11:29PM +0100, Andres Freund wrote: > > My negativity is not that I don't want it, but I want to understand why > > it isn't better than I remembered. You are basically telling me it was > > always a marginal win. :-( Boohoo! > > No, I didn't. I told you that *IN ONE BENCHMARK* wal writes apparently > are not the bottleneck. What I have not seen is any recent benchmarks that show it as a win, while the original email did, so I was confused. I tried to explain exactly how I viewed things --- you can not like it, but that is how I look for upcoming features, and where we should focus our time. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Compression of full-page-writes
On 2015-01-02 12:06:33 -0500, Bruce Momjian wrote: > On Fri, Jan 2, 2015 at 05:55:52PM +0100, Andres Freund wrote: > > On 2015-01-02 11:52:42 -0500, Bruce Momjian wrote: > > > Why are we not seeing the 33% compression and 15% performance > > > improvement he saw? What am I missing here? > > > > To see performance improvements something needs to be the bottleneck. If > > WAL writes/flushes aren't that in the tested scenario, you won't see a > > performance benefit. Amdahl's law and all that. > > > > I don't understand your negativity about the topic. > > I remember the initial post from Masao in August 2013 showing a > performance boost, so I assumed, while we had the concurrent WAL insert > performance improvement in 9.4, this was going to be our 9.5 WAL > improvement. I don't think it makes sense to compare features/improvements that way. > While the WAL insert performance improvement required no tuning and > was never a negative It's actually a negative in some cases. > , I now see the compression patch as something that has negatives, so > has to be set by the user, and only wins in certain cases. I am > disappointed, and am trying to figure out how this became such a > marginal win for 9.5. :-( I find the notion that a multi digit space reduction is a "marginal win" pretty ridiculous and way too narrow focused. Our WAL volume is a *significant* problem in the field. And it mostly consists out of FPWs spacewise. > My negativity is not that I don't want it, but I want to understand why > it isn't better than I remembered. You are basically telling me it was > always a marginal win. :-( Boohoo! No, I didn't. I told you that *IN ONE BENCHMARK* wal writes apparently are not the bottleneck. 
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Compression of full-page-writes
On Fri, Jan 2, 2015 at 05:55:52PM +0100, Andres Freund wrote: > On 2015-01-02 11:52:42 -0500, Bruce Momjian wrote: > > Why are we not seeing the 33% compression and 15% performance > > improvement he saw? What am I missing here? > > To see performance improvements something needs to be the bottleneck. If > WAL writes/flushes aren't that in the tested scenario, you won't see a > performance benefit. Amdahl's law and all that. > > I don't understand your negativity about the topic. I remember the initial post from Masao in August 2013 showing a performance boost, so I assumed, while we had the concurrent WAL insert performance improvement in 9.4, this was going to be our 9.5 WAL improvement. While the WAL insert performance improvement required no tuning and was never a negative, I now see the compression patch as something that has negatives, so has to be set by the user, and only wins in certain cases. I am disappointed, and am trying to figure out how this became such a marginal win for 9.5. :-( My negativity is not that I don't want it, but I want to understand why it isn't better than I remembered. You are basically telling me it was always a marginal win. :-( Boohoo! -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Compression of full-page-writes
On 2015-01-02 11:52:42 -0500, Bruce Momjian wrote: > Why are we not seeing the 33% compression and 15% performance > improvement he saw? What am I missing here? To see performance improvements something needs to be the bottleneck. If WAL writes/flushes aren't that in the tested scenario, you won't see a performance benefit. Amdahl's law and all that. I don't understand your negativity about the topic. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Compression of full-page-writes
On Fri, Jan 2, 2015 at 10:15:57AM -0600, k...@rice.edu wrote: > On Fri, Jan 02, 2015 at 01:01:06PM +0100, Andres Freund wrote: > > On 2014-12-31 16:09:31 -0500, Bruce Momjian wrote: > > > I still don't understand the value of adding WAL compression, given the > > > high CPU usage and minimal performance improvement. The only big > > > advantage is WAL storage, but again, why not just compress the WAL file > > > when archiving. > > > > before: pg_xlog is 800GB > > after: pg_xlog is 600GB. > > > > I'm damned sure that many people would be happy with that, even if the > > *per backend* overhead is a bit higher. And no, compression of archives > > when archiving helps *zap* with that (streaming, wal_keep_segments, > > checkpoint_timeout). As discussed before. > > > > Greetings, > > > > Andres Freund > > > > +1 > > On an I/O constrained system assuming 50:50 table:WAL I/O, in the case > above you can process 100GB of transaction data at the cost of a bit > more CPU. OK, so given your stats, the feature gives a 12.5% reduction in I/O. If that is significant, shouldn't we see a performance improvement? If we don't see a performance improvement, is I/O reduction worthwhile? Is it valuable in that it gives non-database applications more I/O to use? Is that all? I suggest we at least document that this feature is mostly useful for I/O reduction, and maybe say CPU usage and performance might be negatively impacted. OK, here is the email I remember from Fujii Masao in this same thread that showed a performance improvement for WAL compression: http://www.postgresql.org/message-id/CAHGQGwGqG8e9YN0fNCUZqTTT=hnr7ly516kft5ffqf4pp1q...@mail.gmail.com Why are we not seeing the 33% compression and 15% performance improvement he saw? What am I missing here? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. 
+ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
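Bruce's 12.5% figure follows directly from the numbers in the thread: Andres's hypothetical 800GB -> 600GB pg_xlog shrinkage is a 25% WAL reduction, and under Ken's assumed 50:50 table:WAL I/O split only half of the total I/O shrinks. A sketch of that arithmetic:

```c
#include <assert.h>

/* Reduction in total I/O when only the WAL share of the I/O shrinks.
 * Inputs are illustrative figures from the thread, not measurements. */
static double total_io_reduction(double wal_before, double wal_after,
                                 double wal_share_of_io)
{
    double wal_reduction = 1.0 - wal_after / wal_before;  /* 0.25 here */
    return wal_share_of_io * wal_reduction;               /* 0.125 here */
}
```

Under a different I/O mix (say 80:20 table:WAL), the same WAL shrinkage buys only a 5% total reduction, which is one way the "marginal win" and "significant win" readings can both be true.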
Re: [HACKERS] Compression of full-page-writes
On Fri, Jan 02, 2015 at 01:01:06PM +0100, Andres Freund wrote: > On 2014-12-31 16:09:31 -0500, Bruce Momjian wrote: > > I still don't understand the value of adding WAL compression, given the > > high CPU usage and minimal performance improvement. The only big > > advantage is WAL storage, but again, why not just compress the WAL file > > when archiving. > > before: pg_xlog is 800GB > after: pg_xlog is 600GB. > > I'm damned sure that many people would be happy with that, even if the > *per backend* overhead is a bit higher. And no, compression of archives > when archiving helps *zap* with that (streaming, wal_keep_segments, > checkpoint_timeout). As discussed before. > > Greetings, > > Andres Freund > +1 On an I/O constrained system assuming 50:50 table:WAL I/O, in the case above you can process 100GB of transaction data at the cost of a bit more CPU. Regards, Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
Amit Kapila wrote: > Notes for Committer - > There is one behavioural difference in the handling of --analyze-in-stages > switch, when individual tables (by using -t option) are analyzed by > using this switch, patch will process (in case of concurrent jobs) all the > given tables for stage-1 and then for stage-2 and so on whereas in the > unpatched code it will process all the three stages table by table > (table-1 all three stages, table-2 all three stages and so on). I think > the new behaviour is okay as the same is done when this utility does > vacuum for whole database. IMV, the change is for the better. The whole point of --analyze-in-stages is to get minimal statistics so that "good enough" plans will be built for most queries to allow a production database to be brought back on-line quickly, followed by generating increasing granularity (which takes longer but should help ensure "best plan") while the database is in use with the initial statistics. Doing all three levels for one table before generating the rough statistics for the others doesn't help with that, so I see this change as fixing a bug. Is it feasible to break that part out as a separate patch? -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
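The two processing orders under discussion reduce to iterating the same (table, stage) pairs in different nesting orders. A sketch, with hypothetical table names (this helper is not part of vacuumdb):

```c
#include <stdio.h>
#include <string.h>

/* Write the analyze order into out: stage-major (the patched behaviour
 * for concurrent jobs) or table-major (the unpatched behaviour). */
static void analyze_order(int stage_major, char *out, size_t outsz)
{
    const char *tables[] = {"t1", "t2"};
    out[0] = '\0';
    for (int a = 0; a < (stage_major ? 3 : 2); a++)
        for (int b = 0; b < (stage_major ? 2 : 3); b++)
        {
            char step[16];
            int table = stage_major ? b : a;
            int stage = stage_major ? a + 1 : b + 1;
            snprintf(step, sizeof(step), "%s/s%d ", tables[table], stage);
            strncat(out, step, outsz - strlen(out) - 1);
        }
}
```

Stage-major yields `t1/s1 t2/s1 t1/s2 t2/s2 t1/s3 t2/s3`, so every table gets its rough stage-1 statistics before any table moves on, which is Kevin's point about bringing a restored database back on-line quickly.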
Re: [HACKERS] What exactly is our CRC algorithm?
On 01/01/2015 09:17 AM, Abhijit Menon-Sen wrote: Hi. OK, here are the patches with the various suggestions applied. I found that the alignment didn't seem to make much difference for the CRC32* instructions, so I changed to process (len/8)*8 bytes followed by (len%8)*1 bytes, the way the Linux kernel does. Ok. In the slicing-by-8 version, I wonder if it would be better to do single-byte loads to c0-c7, instead of two 4-byte loads and shifts. 4-byte loads are presumably faster than single-byte loads, but then you'd avoid the shifts. And then you could go straight into the 8-bytes-at-a-time loop, without the initial single-byte processing to get the start address aligned. (The Linux implementation doesn't do that, so maybe it's a bad idea, but might be worth testing.) Looking at the Linux implementation, I think it only does the bswap once per call, not inside the hot loop. Would it even make sense to keep the crc variable in a different byte order, and only do the byte-swap once in END_CRC32()? The comments need some work. I note that there is no mention of the slicing-by-8 algorithm anywhere in the comments (in the first patch). Instead of checking for "defined(__GNUC__) || defined(__clang__)", we should add an explicit configure test for __builtin_bswap32(). - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
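The (len/8)*8-bytes-then-(len%8)-bytes traversal Abhijit describes can be sketched with a toy byte hash. This is illustrative only -- it is not PostgreSQL's CRC code and does no slicing-by-8 -- but it shows the chunked traversal visits exactly the same bytes in the same order as a plain byte-at-a-time loop:

```c
#include <assert.h>
#include <stddef.h>

/* Consume (len / 8) eight-byte chunks, then the (len % 8) trailing bytes. */
static unsigned long chunked_hash(const unsigned char *p, size_t len)
{
    unsigned long h = 0;
    size_t nchunks = len / 8;

    while (nchunks-- > 0)
    {
        for (int i = 0; i < 8; i++)
            h = h * 31 + p[i];
        p += 8;
    }
    for (size_t i = 0; i < len % 8; i++)
        h = h * 31 + p[i];
    return h;
}

/* Byte-at-a-time reference for comparison. */
static unsigned long byte_hash(const unsigned char *p, size_t len)
{
    unsigned long h = 0;
    for (size_t i = 0; i < len; i++)
        h = h * 31 + p[i];
    return h;
}
```

In the real patch the payoff is that the eight-byte body can use wide loads or the CRC32* instructions; the equivalence of the traversal is what makes that refactoring safe.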
Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
On Fri, Jan 2, 2015 at 11:47 AM, Dilip kumar wrote: > > On 31 December 2014 18:36, Amit Kapila Wrote, > > >The patch looks good to me. I have done couple of > > >cosmetic changes (spelling mistakes, improve some comments, > > >etc.), check the same once and if you are okay, we can move > > >ahead. > > > > Thanks for review and changes, changes looks fine to me.. > Okay, I have marked this patch as "Ready For Committer" Notes for Committer - There is one behavioural difference in the handling of --analyze-in-stages switch, when individual tables (by using -t option) are analyzed by using this switch, patch will process (in case of concurrent jobs) all the given tables for stage-1 and then for stage-2 and so on whereas in the unpatched code it will process all the three stages table by table (table-1 all three stages, table-2 all three stages and so on). I think the new behaviour is okay as the same is done when this utility does vacuum for whole database. As there was no input from any committer on this point, I thought it is better to get the same rather than waiting more just for one point. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] parallel mode and parallel contexts
On Thu, Dec 18, 2014 at 1:23 AM, Robert Haas wrote: > > > In the meantime, I had a good chat with Heikki on IM yesterday which > gave me some new ideas on how to fix up the transaction handling in > here, which I am working on implementing. So hopefully I will have > that by then. > > I am also hoping Amit will be adapting his parallel seq-scan patch to > this framework. > While working on the parallel seq-scan patch to adapt this framework, I noticed a few things and have questions regarding the same. 1. Currently the parallel worker just attaches to the error queue; for the tuple queue, do you expect it to be done in the same place or in the caller-supplied function? If the latter, then we need the segment address as input to that function to attach the queue to the segment (shm_mq_attach()). Another question I have in this regard is that if we have redirected messages to the error queue by using pq_redirect_to_shm_mq, then how can we set up a tuple queue for the same purpose? Similarly, I think more handling is needed for the tuple queue in the master backend, and the answer to the above will dictate what is the best way to do it. 2. Currently there is no interface for wait_for_workers_to_become_ready() in your patch. Don't you think it is important that before we start fetching tuples, we should make sure all workers are started? What if some worker fails to start? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] compress method for spgist - 2
On 12/23/2014 03:02 PM, Teodor Sigaev wrote: >I think we'll need a separate SpGistTypeDesc for the input type. Or perhaps a >separate SpGistTypeDesc for the reconstructed value and an optional decompress >method to turn the reconstructedValue back into an actual reconstructed input >datum. Or something like that. I suppose that compress and reconstruct are mutual exclusive options. I would rather not assume that. You might well want to store something in the leaf nodes that's different from the original Datum, but nevertheless contains enough information to reconstruct the original Datum. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Compression of full-page-writes
On 2014-12-31 16:09:31 -0500, Bruce Momjian wrote: > I still don't understand the value of adding WAL compression, given the > high CPU usage and minimal performance improvement. The only big > advantage is WAL storage, but again, why not just compress the WAL file > when archiving. before: pg_xlog is 800GB after: pg_xlog is 600GB. I'm damned sure that many people would be happy with that, even if the *per backend* overhead is a bit higher. And no, compression of archives when archiving helps *zap* with that (streaming, wal_keep_segments, checkpoint_timeout). As discussed before. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Seq Scan
On 2 January 2015 at 11:13, Amit Kapila wrote: > > On Fri, Jan 2, 2015 at 4:09 PM, Thom Brown wrote: > > > > On 1 January 2015 at 10:34, Amit Kapila wrote: > >> > >> > Running it again, I get the same issue. This is with > parallel_seqscan_degree set to 8, and the crash occurs with 4 and 2 too. > >> > > >> > This doesn't happen if I set the pgbench scale to 50. I suspect this > is a OOM issue. My laptop has 16GB RAM, the table is around 13GB at scale > 100, and I don't have swap enabled. But I'm concerned it crashes the whole > instance. > >> > > >> > >> Isn't this a backend crash due to OOM? > >> And after that server will restart automatically. > > > > > > Yes, I'm fairly sure it is. I guess what I'm confused about is that 8 > parallel sequential scans in separate sessions (1 per session) don't cause > the server to crash, but in a single session (8 in 1 session), they do. > > > > It could be possible that master backend retains some memory > for longer period which causes it to hit OOM error, by the way > in your test does always master backend hits OOM or is it > random (either master or worker) > Just ran a few tests, and it appears to always be the master that hits OOM, or at least I don't seem to be able to get an example of the worker hitting it. > > > > > Will there be a GUC to influence parallel scan cost? Or does it take > into account effective_io_concurrency in the costs? > > > > And will the planner be able to decide whether or not it'll choose to > use background workers or not? For example: > > > > Yes, we are planing to introduce cost model for parallel > communication (there is some discussion about the same > upthread), but it's still not there and that's why you > are seeing it to choose parallel plan when it shouldn't. > Currently in patch, if you set parallel_seqscan_degree, it > will most probably choose parallel plan only. > Ah, okay. Great. Thanks. Thom
Re: [HACKERS] BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)
On 31.12.2014 at 11:40, Michael Paquier wrote: >> As long as master is fixed, I don't actually care. But I agree with Dennis >> that it's hard to see what's been committed with all the different issues >> found, and if any commits were done, in which branch. I'd like to be able to >> tell my customers: update to this minor release to see if it's fixed, but I >> can't even do that. > This bug does not endanger at all data consistency as only old WAL > files remain on the standby, so I'm fine as well with just a clean fix > on master, and nothing done on back-branches. I’d like to point out that this issue is causing severe disk space bloat over time, as described in my previous posting. I’d assume that, as time goes by, this issue will pop up more and more, as long as back-branches remain unfixed. And as the general recommendation is to never ever fiddle with pg_xlog/ contents, I’m not even sure what to do about it. - D. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Seq Scan
On Fri, Jan 2, 2015 at 4:09 PM, Thom Brown wrote: > > On 1 January 2015 at 10:34, Amit Kapila wrote: >> >> > Running it again, I get the same issue. This is with parallel_seqscan_degree set to 8, and the crash occurs with 4 and 2 too. >> > >> > This doesn't happen if I set the pgbench scale to 50. I suspect this is a OOM issue. My laptop has 16GB RAM, the table is around 13GB at scale 100, and I don't have swap enabled. But I'm concerned it crashes the whole instance. >> > >> >> Isn't this a backend crash due to OOM? >> And after that server will restart automatically. > > > Yes, I'm fairly sure it is. I guess what I'm confused about is that 8 parallel sequential scans in separate sessions (1 per session) don't cause the server to crash, but in a single session (8 in 1 session), they do. > It could be possible that master backend retains some memory for longer period which causes it to hit OOM error, by the way in your test does always master backend hits OOM or is it random (either master or worker) > > Will there be a GUC to influence parallel scan cost? Or does it take into account effective_io_concurrency in the costs? > > And will the planner be able to decide whether or not it'll choose to use background workers or not? For example: > Yes, we are planing to introduce cost model for parallel communication (there is some discussion about the same upthread), but it's still not there and that's why you are seeing it to choose parallel plan when it shouldn't. Currently in patch, if you set parallel_seqscan_degree, it will most probably choose parallel plan only. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel Seq Scan
On 1 January 2015 at 10:34, Amit Kapila wrote: > > Running it again, I get the same issue. This is with > parallel_seqscan_degree set to 8, and the crash occurs with 4 and 2 too. > > > > This doesn't happen if I set the pgbench scale to 50. I suspect this is > a OOM issue. My laptop has 16GB RAM, the table is around 13GB at scale > 100, and I don't have swap enabled. But I'm concerned it crashes the whole > instance. > > > > Isn't this a backend crash due to OOM? > And after that server will restart automatically. > Yes, I'm fairly sure it is. I guess what I'm confused about is that 8 parallel sequential scans in separate sessions (1 per session) don't cause the server to crash, but in a single session (8 in 1 session), they do. > > > I also notice that requesting BUFFERS in a parallel EXPLAIN output > yields no such information. > > -- > > Yeah and the reason for same is that all the work done related > to BUFFERS is done by backend workers, master backend > doesn't read any pages, so it is not able to accumulate this > information. > > > Is that not possible to report? > > It is not impossible to report such information, we can develop some > way to share such information between master backend and workers. > I think we can do this if required once the patch is more stablized. > Ah great, as I think losing such information to this feature would be unfortunate. Will there be a GUC to influence parallel scan cost? Or does it take into account effective_io_concurrency in the costs? And will the planner be able to decide whether or not it'll choose to use background workers or not? 
For example:

# explain (analyse, buffers, verbose) select distinct bid from pgbench_accounts;
                                      QUERY PLAN
---------------------------------------------------------------------------
 HashAggregate  (cost=89584.00..89584.05 rows=5 width=4) (actual time=228.222..228.224 rows=5 loops=1)
   Output: bid
   Group Key: pgbench_accounts.bid
   Buffers: shared hit=83334
   ->  Seq Scan on public.pgbench_accounts  (cost=0.00..88334.00 rows=50 width=4) (actual time=0.008..136.522 rows=50 loops=1)
         Output: bid
         Buffers: shared hit=83334
 Planning time: 0.071 ms
 Execution time: 228.265 ms
(9 rows)

This is a quick plan, but if we tell it that it's allowed 8 background workers:

# set parallel_seqscan_degree = 8;
SET
Time: 0.187 ms

# explain (analyse, buffers, verbose) select distinct bid from pgbench_accounts;
                                      QUERY PLAN
 HashAggregate  (cost=12291.75..12291.80 rows=5 width=4) (actual time=603.042..603.042 rows=1 loops=1)
   Output: bid
   Group Key: pgbench_accounts.bid
   ->  Parallel Seq Scan on public.pgbench_accounts  (cost=0.00..11041.75 rows=50 width=4) (actual time=2.445..529.284 rows=50 loops=1)
         Output: bid
         Number of Workers: 8
         Number of Blocks Per Workers: 10416
 Planning time: 0.049 ms
 Execution time: 663.103 ms
(9 rows)

Time: 663.437 ms

It's significantly slower. I'd hope the planner would anticipate this and decide, "I'm just gonna perform a single scan in this instance as it'll be a lot quicker for this simple case." So at the moment parallel_seqscan_degree seems to mean "You *must* use this many threads if you can parallelise." Ideally we'd be saying "can use up to if necessary".

Thanks

Thom
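The "can use up to N" behaviour Thom asks for is essentially a cost comparison. The sketch below is hypothetical -- it is not PostgreSQL's planner or its actual cost formulas -- but it shows the shape of the decision: divide the scan work across processes, charge a per-worker startup/communication cost, and only go parallel when that nets out cheaper.

```c
#include <assert.h>

/* Hypothetical cost of scanning with a given number of extra workers. */
static double parallel_cost(double serial_cost, int workers,
                            double per_worker_overhead)
{
    if (workers <= 0)
        return serial_cost;
    return serial_cost / (workers + 1) + workers * per_worker_overhead;
}

/* "Can use up to max_workers": pick the cheapest degree, 0 meaning serial. */
static int best_degree(double serial_cost, int max_workers,
                       double per_worker_overhead)
{
    int best = 0;
    for (int w = 1; w <= max_workers; w++)
        if (parallel_cost(serial_cost, w, per_worker_overhead) <
            parallel_cost(serial_cost, best, per_worker_overhead))
            best = w;
    return best;
}
```

With the serial cost from the first EXPLAIN (88334) and cheap workers, the full degree of 8 wins; make the per-worker overhead large enough and the same function correctly falls back to a serial scan, which is what the current patch fails to do.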
Re: [HACKERS] Parallel Seq Scan
On Thu, Jan 1, 2015 at 11:29 PM, Robert Haas wrote: > > On Thu, Jan 1, 2015 at 12:00 PM, Fabrízio de Royes Mello > wrote: > > Can we check the number of free bgworkers slots to set the max workers? > > The real solution here is that this patch can't throw an error if it's > unable to obtain the desired number of background workers. It needs > to be able to smoothly degrade to a smaller number of background > workers, or none at all. I think handling this way can have one side effect which is that if we degrade to smaller number, then the cost of plan (which was decided by optimizer based on number of parallel workers) could be more than non-parallel scan. Ideally before finalizing the parallel plan we should reserve the bgworkers required to execute that plan, but I think as of now we can workout a solution without it. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
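Robert's "smoothly degrade" requirement and Amit's reservation concern can be sketched in one line of policy: clamp the planned degree to the slots actually available, with zero meaning a plain serial scan rather than an error. (free_slots is a hypothetical stand-in for the background-worker bookkeeping; this is not the patch's code.)

```c
#include <assert.h>

/* Ask for the planned number of workers, accept however many slots are
 * free; 0 means fall back to a non-parallel scan instead of erroring. */
static int launch_parallel_workers(int planned, int free_slots)
{
    return planned < free_slots ? planned : free_slots;
}
```

Amit's point is that the clamp alone is not enough: if the plan's cost was computed assuming `planned` workers but execution runs with fewer, the chosen plan may be worse than the serial plan the optimizer rejected, which is why he suggests reserving workers before finalizing the plan.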
Re: [HACKERS] Parallel Seq Scan
On 1 January 2015 at 17:59, Robert Haas wrote: > On Thu, Jan 1, 2015 at 12:00 PM, Fabrízio de Royes Mello > wrote: > > Can we check the number of free bgworkers slots to set the max workers? > > The real solution here is that this patch can't throw an error if it's > unable to obtain the desired number of background workers. It needs > to be able to smoothly degrade to a smaller number of background > workers, or none at all. I think a lot of this work will fall out > quite naturally if this patch is reworked to use the parallel > mode/parallel context stuff, the latest version of which includes an > example of how to set up a parallel scan in such a manner that it can > run with any number of workers. > +1 That sounds like exactly what's needed. Thom
Re: [HACKERS] Redesigning checkpoint_segments
On 01/01/2015 03:24 AM, Josh Berkus wrote: Please remind me because I'm having trouble finding this in the archives: how does wal_keep_segments interact with the new settings? It's not very straightforward. First of all, min_recycle_wal_size has a different effect than wal_keep_segments. Raising min_recycle_wal_size causes more segments to be recycled rather than deleted, while wal_keep_segments causes old segments to be retained as old segments, so that they can be used for streaming replication. If you raise min_recycle_wal_size, it will not do any good for streaming replication. wal_keep_segments does not affect the calculation of CheckPointSegments. If you set wal_keep_segments high enough, checkpoint_wal_size will be exceeded. The other alternative would be to force a checkpoint earlier, i.e. lower CheckPointSegments, so that checkpoint_wal_size would be honored. However, if you set wal_keep_segments high enough, higher than checkpoint_wal_size, it's impossible to honor checkpoint_wal_size no matter how frequently you checkpoint. It's not totally straightforward to calculate what maximum size of WAL a given wal_keep_segments setting will force. wal_keep_segments is taken into account at a checkpoint, when we recycle old WAL segments. For example, imagine that the prior checkpoint started at segment 95, a new checkpoint finishes at segment 100, and wal_keep_segments=10. Because of wal_keep_segments, we have to retain segments 90-95, which could otherwise be recycled. So that forces a WAL size of 10 segments, while otherwise 5 would be enough. However, before we reach the next checkpoint, let's assume it will complete at segment 105, we will consume five more segments, so the actual max WAL size is 15 segments. However, we could start recycling the segments 90-95 before we reach the next checkpoint, because wal_keep_segments stipulates how many segments from the current *insert* location need to be retained, with no regard to checkpoints. 
But we only attempt to recycle segments at checkpoints. So that could be made more straightforward if we recycled old segments in the background, between checkpoints. That might allow merging the wal_keep_segments and min_recycle_wal_size settings, too: instead of renaming all old recyclable segments at a checkpoint, you could keep them around as old segments until they're actually needed for reuse, so they could be used for streaming replication up to that point. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
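Heikki's worked example reduces to simple segment arithmetic. A sketch, with segment numbers standing in for WAL positions (this is an illustration of the example, not the server's actual recycling code):

```c
#include <assert.h>

/* Peak WAL size in segments, reached just before the next checkpoint
 * completes.  Recycling at a checkpoint would normally free everything
 * before the prior checkpoint's start; wal_keep_segments (counted back
 * from the insert position) can hold back even older segments. */
static int peak_wal_size(int prior_ckpt_start, int ckpt_done,
                         int next_ckpt_done, int wal_keep_segments)
{
    int oldest_kept = ckpt_done - wal_keep_segments;

    if (oldest_kept > prior_ckpt_start)
        oldest_kept = prior_ckpt_start;     /* checkpoint retention dominates */
    return next_ckpt_done - oldest_kept;
}
```

Plugging in the example (prior checkpoint at 95, checkpoint done at 100, next at 105, wal_keep_segments=10): segments back to 90 are kept, so 10 segments exist at the checkpoint and 15 at the peak; with wal_keep_segments=0 the peak is only 10.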