Re: [HACKERS] AutoVacuum starvation from sinval messages
On 8 November 2012 23:58, Jeff Janes jeff.ja...@gmail.com wrote:
> On Thu, Nov 8, 2012 at 2:50 PM, Hannu Krosing ha...@2ndquadrant.com wrote:
>> On 11/08/2012 11:40 PM, Simon Riggs wrote:
>>> On 8 November 2012 20:36, Jeff Janes jeff.ja...@gmail.com wrote:
>>>> It does not seem outrageous to me that there would be real-world
>>>> conditions in which invalidations would be sent more than once a
>>>> minute over prolonged periods, so this total starvation seems like
>>>> a bug.
>>>
>>> Yes, it's a bug, but do you really believe the above? In what cases?
>>
>> We see lots of traffic on the mail list about people trying to dump
>> several hundred thousand tables, or they can only create one database
>> every two minutes, or truncating hundreds of tables at a time over
>> and over again gets slow, etc. I know little about the internals of
>> the invalidation code, but I would think doing that kind of thing
>> must generate a lot of them.

OK, so the problem is *any* sinval. I thought you meant one sinval per
object per minute, which seemed much less likely. I agree one sinval
per minute for long periods is actually quite likely.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
On 8 November 2012 23:20, Hannu Krosing ha...@2ndquadrant.com wrote:
> On 11/08/2012 08:51 PM, Simon Riggs wrote:
>> On 8 November 2012 17:07, Robert Haas robertmh...@gmail.com wrote:
>>> On Wed, Nov 7, 2012 at 10:34 AM, Simon Riggs si...@2ndquadrant.com wrote:
>>>> For 9.2 we discussed having COPY set tuples as frozen. Various
>>>> details apply. Earlier threads: "RFC: Making TRUNCATE more
>>>> MVCC-safe", "COPY with hints, rebirth".
>>>>
>>>> I was unhappy with changing the behaviour of TRUNCATE, and still
>>>> am. So the proposal here is to have a specific modifier on the
>>>> TRUNCATE command that makes it MVCC safe by throwing a
>>>> serialization error.
>>>
>>> I don't think I understand the proposal. Under what circumstances
>>> would it throw a serialization error?
>>
>> If you ask for TRUNCATE SERIALIZABLE, then if someone asks for data
>> in the table and has a snapshot that can see earlier data, it will
>> throw a serialization error. So it's a new kind of TRUNCATE that is
>> MVCC safe.
>
> Can't we make it so that the reader with an earlier snapshot sees the
> data from the pre-truncation file?

We could... but that would require keeping a history of relfilenodes
for an object to allow for more than one TRUNCATE event. Tracking all
of that would be hard and I don't personally think it's worth the
effort.

> and we unlink the base file(s) only once nobody has a snapshot that
> can see it?

DELETE does that, if that's the semantics you want.

> or are there some subtler problems (I was under the impression that
> we already did this as described above)?

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] ERROR: lock 9 is not held
9.3devel / AMD FX-8120 8-core / centos 6.2 / Linux 2.6.32-279.11.1.el6.x86_64

Hi,

I made a test setup of 9.3devel synchronous replication (git master as
of a few hours ago): two instances on 1 machine:

  master: 93_1  port 6664
  slave:  93_2  port 6665

I do nothing on either server. I get these logs (note the ERROR lines
in the slave log, every 5 minutes):

$ tail -F /home/aardvark/pg_stuff/pg_installations/pgsql.93_1/data/../logfile.93_1
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
LOG:  received smart shutdown request
LOG:  autovacuum launcher shutting down
LOG:  shutting down
LOG:  database system is shut down
2012-11-09 09:42:49.717 CET 6279 LOG:  database system was shut down at 2012-11-09 09:42:49 CET
2012-11-09 09:42:49.823 CET 6283 LOG:  autovacuum launcher started
2012-11-09 09:42:49.824 CET 6278 LOG:  database system is ready to accept connections
2012-11-09 09:42:53.244 CET 6329 LOG:  standby wal_receiver_01 is now the synchronous standby with priority 1

$ tail -F /home/aardvark/pg_stuff/pg_installations/pgsql.93_2/data/../logfile.93_2
cp: cannot stat `/home/aardvark/pg_stuff/archive_dir/00010004': No such file or directory
2012-11-09 09:42:53.144 CET 6328 LOG:  streaming replication successfully connected to primary
2012-11-09 09:49:01.124 CET 6323 ERROR:  lock 9 is not held
2012-11-09 09:53:02.427 CET 6323 ERROR:  lock 9 is not held
2012-11-09 09:58:03.845 CET 6323 ERROR:  lock 9 is not held
2012-11-09 10:03:05.239 CET 6323 ERROR:  lock 9 is not held
2012-11-09 10:07:51.620 CET 6323 ERROR:  lock 9 is not held
2012-11-09 10:12:53.019 CET 6323 ERROR:  lock 9 is not held
2012-11-09 10:17:54.416 CET 6323 ERROR:  lock 9 is not held
2012-11-09 10:22:55.811 CET 6323 ERROR:  lock 9 is not held

Master configuration:

data_directory = '/home/aardvark/pg_stuff/pg_installations/pgsql.93_1/data'
listen_addresses = '*'
max_connections = 100
shared_buffers = 128MB
wal_level = hot_standby
synchronous_commit = on
checkpoint_segments = 50
archive_mode = on
archive_command = 'cp %p /home/aardvark/pg_stuff/archive_dir/%f < /dev/null'
max_wal_senders = 3
synchronous_standby_names = '*'

Slave configuration:

data_directory = '/home/aardvark/pg_stuff/pg_installations/pgsql.93_2/data'
listen_addresses = '*'
port = 6665
max_connections = 100
shared_buffers = 128MB
wal_level = hot_standby
synchronous_commit = on
checkpoint_segments = 50
archive_mode = on
archive_command = 'cp %p /home/aardvark/pg_stuff/archive_dir/%f < /dev/null'
max_wal_senders = 3
synchronous_standby_names = ''
hot_standby = on
wal_receiver_status_interval = 59

It's just an ERROR line in the log, and it may not be a real problem,
but in any case I'd like to know how it is caused. (I ought to add that
I had not run these build scripts for testing syncrep for a few
weeks/months, so I may have missed some developments in setup/config.)

thanks,

Erik Rijkers
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
On 11/09/2012 09:34 AM, Simon Riggs wrote:
> On 8 November 2012 23:20, Hannu Krosing ha...@2ndquadrant.com wrote:
>> Can't we make it so that the reader with an earlier snapshot sees the
>> data from the pre-truncation file?
>
> We could... but that would require keeping a history of relfilenodes
> for an object to allow for more than one TRUNCATE event.

MVCC does keep a history of old relfilenodes. I thought we were able
to read old MVCC versions in pg_class for this, if the snapshot
required data matching an older pg_class record.

> Tracking all of that would be hard and I don't personally think it's
> worth the effort.
>
>> and we unlink the base file(s) only once nobody has a snapshot that
>> can see it?
>
> DELETE does that, if that's the semantics you want.
>
>> or are there some subtler problems (I was under the impression that
>> we already did this as described above)?
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
On Wed, Nov 7, 2012 at 5:34 PM, Simon Riggs si...@2ndquadrant.com wrote:
> I was unhappy with changing the behaviour of TRUNCATE, and still am.
> So the proposal here is to have a specific modifier on the TRUNCATE
> command that makes it MVCC safe by throwing a serialization error.
> That new behaviour should be requestable by adding the SERIALIZABLE
> keyword, i.e.
>
>   TRUNCATE foo SERIALIZABLE;
>
> This then allows a new style of TRUNCATE, yet without modifying the
> behaviour of earlier programs (ugh!).

Personally I think the behavior should be dictated by the *reader*.
The one doing the truncation may not know about the consistency
requirements of particular readers, especially when you do the
truncate via pg_restore or some other generic tool. And indeed
different readers may have different consistency requirements.

So I'd prefer it to be a GUC variable; readers that accept relaxed
consistency can set truncate_conflict=off.

Granted, making this part of the TRUNCATE statement does give you
table-level granularity. But if a reader can already handle
serialization conflicts, it doesn't really matter which table they
came from.

But making TRUNCATE behave in an MVCC-safe way, as proposed by Hannu,
would be much better.

Regards,
Marti
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
On 9 November 2012 10:28, Marti Raudsepp ma...@juffo.org wrote:
> Personally I think the behavior should be dictated by the *reader*.
> The one doing the truncation may not know about the consistency
> requirements of particular readers, especially when you do the
> truncate via pg_restore or some other generic tool. And indeed
> different readers may have different consistency requirements.
>
> So I'd prefer it to be a GUC variable; readers that accept relaxed
> consistency can set truncate_conflict=off.

I proposed something similar earlier, but Robert said he didn't like
that. The way you've described it here makes more sense, whereas my
parameter name made it seem more arbitrary. So I think that looks like
the way to go.

I'm trying to use this as a way to optimize COPY, so we'd need to make
a Seq Scan return zero rows if the truncate xid is seen as running by
the snapshot, which is the current behaviour. That seems easy enough.

> Granted, making this part of the TRUNCATE statement does give you
> table-level granularity. But if a reader can already handle
> serialization conflicts, it doesn't really matter which table they
> came from.
>
> But making TRUNCATE behave in an MVCC-safe way, as proposed by Hannu,
> would be much better.

Maybe, but one of the reasons for having a separate TRUNCATE command
rather than DELETE is the immediate removal of space. Changing
TRUNCATE so it suddenly holds on to space for longer would force us to
create a new command that acts like the old TRUNCATE.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
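[Editorial note: the Seq Scan condition Simon describes can be sketched as follows. This is a hypothetical Python model, not PostgreSQL source; the function name and the deliberately simplified snapshot test are my own illustration of the rule "return zero rows if the truncate xid is seen as running by the snapshot".]

```python
# Hypothetical model: if the reader's snapshot still regards the truncating
# XID as in progress, the relation must look empty to that reader, matching
# the current TRUNCATE behaviour. Real snapshot visibility has more cases
# (xmax bounds, subtransactions); this keeps only the relevant check.

def scan_returns_rows(truncate_xid, snapshot_xmin, snapshot_running_xids):
    """Return False when a Seq Scan should yield zero rows for this snapshot."""
    if truncate_xid is None:
        return True          # the relation was never truncated
    if truncate_xid >= snapshot_xmin and truncate_xid in snapshot_running_xids:
        return False         # truncate still "in progress" for us: empty scan
    return True              # truncate visible as committed: scan the new data

# A snapshot that saw xid 900 as still running gets zero rows:
assert scan_returns_rows(900, 850, {900, 910}) is False
# A snapshot taken after the truncate committed sees the newly loaded data:
assert scan_returns_rows(900, 950, set()) is True
```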
Re: [HACKERS] [BUG] False indication in pg_stat_replication.sync_state
On Fri, Nov 9, 2012 at 4:06 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
> Fujii Masao escribió:
>> On Fri, Oct 19, 2012 at 10:29 PM, Fujii Masao masao.fu...@gmail.com wrote:
>>> However, I've forgotten to treat three other portions in walsender.c;
>>> syncrep.c also does XLogRecPtrIsInvalid() on an XLogRecPtr which
>>> comes from the WAL receiver. This new patch includes the changes for
>>> them.
>>
>> Good catch. Does any committer want to pick this up?
>
> If not, please add it to the next commitfest so that we don't forget.

Yep, I added this to the next CF. This is just a bug fix, so please
feel free to pick it up even before the CF.

Regards,

-- 
Fujii Masao
Re: [HACKERS] AutoVacuum starvation from sinval messages
On Thu, Nov 8, 2012 at 6:23 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Simon Riggs si...@2ndquadrant.com writes:
>> On 8 November 2012 20:36, Jeff Janes jeff.ja...@gmail.com wrote:
>>> It does not seem outrageous to me that there would be real-world
>>> conditions in which invalidations would be sent more than once a
>>> minute over prolonged periods, so this total starvation seems like
>>> a bug.
>>
>> Yes, it's a bug, but do you really believe the above? In what cases?
>
> It doesn't take a whole lot of DDL to provoke an sinval overrun, if
> the recipient process is just sitting idle and not servicing the
> messages. I think Jeff's concern is entirely valid.

So, do we need a sinval overrun or just a single sinval message to
provoke starvation? The former would be bad, but the latter would be
really, really bad. IIRC the queue has 4K entries, and IIRC a single
DDL operation might provoke a couple of sinvals, but I'm thinking that
somebody would probably have to be creating 1024 temp tables a minute
to overrun the queue, which is very possible but not necessarily
common. OTOH, creating 1 temp table a minute would hit a much broader
swath of users.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
On Fri, Nov 9, 2012 at 8:22 AM, Simon Riggs si...@2ndquadrant.com wrote:
>> Personally I think the behavior should be dictated by the *reader*.
>> The one doing the truncation may not know about the consistency
>> requirements of particular readers, especially when you do the
>> truncate via pg_restore or some other generic tool. And indeed
>> different readers may have different consistency requirements.
>>
>> So I'd prefer it to be a GUC variable; readers that accept relaxed
>> consistency can set truncate_conflict=off.
>
> I proposed something similar earlier, but Robert said he didn't like
> that. The way you've described it here makes more sense, whereas my
> parameter name made it seem more arbitrary. So I think that looks
> like the way to go.

Hmm, I don't remember saying I didn't like that. Maybe I disliked
something about a particular proposed implementation?

Actually, I don't really see the need for this to be customizable at
all. I have to believe that there is vanishingly little application
code that would care about this change in semantics, so why not just
change the behavior and call it good?

I think the question that hasn't really been adequately answered is:
where and how are we going to track conflicts? Your previous patch
involved storing an XID in pg_class, but I think we both found that a
bit grotty - it'd probably need special handling for wraparound, and I
think we came up with some related cases that couldn't be handled in
the same way without adding a bunch more XIDs to various places. I
don't really like the idea of having XIDs floating around in the
system catalogs - it seems like a recipe for bugs, not to mention that
storing ephemeral data in a persistent table seems like a mismatch.

What I've been wondering since this last came up is whether we could
use some variant of the SIREAD locks Kevin introduced for SSI to
handle this case - essentially have the transaction doing the TRUNCATE
make an entry in the lock table that will force a serialization
failure for any backend which accesses the table with a snapshot that
can't see the truncating transaction's XID. The lock table entry would
need some kind of deferred clean-up, so it doesn't go away until the
locker's XID precedes RecentGlobalXmin.

Of course, an extra lock table probe for every table access would be
unacceptable from a concurrency perspective, but we could probably
optimize most of them away by only checking the lock table if the
pg_class row's own xmin is new enough that the other backend's MVCC
snapshot can't see it. A recent update to pg_class doesn't imply the
existence of a lock, but the absence of any recent update to pg_class
does imply that no lock can exist.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
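[Editorial note: the fast-path optimization Robert proposes can be sketched as below. This is plain Python with hypothetical names, not a proposed implementation; the point is only the two-step check: an old pg_class row proves no truncate lock can exist, so the lock table is probed only for recently updated catalog rows.]

```python
# Sketch of the optimization: probe the shared lock table only when the
# pg_class row's xmin is too new for the reader's snapshot to see. The
# common case (old catalog row) skips the probe entirely.

def must_check_truncate_lock(pg_class_xmin, snapshot_xmin):
    # Rows older than anything our snapshot considers in-progress cannot
    # have a pending TRUNCATE behind them: skip the expensive probe.
    return pg_class_xmin >= snapshot_xmin

def access_table(pg_class_xmin, snapshot_xmin, truncate_locks, relid):
    if must_check_truncate_lock(pg_class_xmin, snapshot_xmin):
        if relid in truncate_locks:          # rare lock-table probe
            raise RuntimeError("serialization failure: concurrent TRUNCATE")
    return "scan proceeds"

locks = {16384}                              # relid with a truncate lock entry
# Old pg_class row: probe skipped, even though a lock entry exists elsewhere.
assert access_table(100, 500, locks, 16384) == "scan proceeds"
```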
Re: [HACKERS] AutoVacuum starvation from sinval messages
On 9 November 2012 13:42, Robert Haas robertmh...@gmail.com wrote:
> So, do we need a sinval overrun or just a single sinval message to
> provoke starvation? The former would be bad, but the latter would be
> really, really bad. IIRC the queue has 4K entries, and IIRC a single
> DDL operation might provoke a couple of sinvals, but I'm thinking
> that somebody would probably have to be creating 1024 temp tables a
> minute to overrun the queue, which is very possible but not
> necessarily common. OTOH, creating 1 temp table a minute would hit a
> much broader swath of users.

The point is moot, because latches don't work that way anymore.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] AutoVacuum starvation from sinval messages
On Fri, Nov 9, 2012 at 9:02 AM, Simon Riggs si...@2ndquadrant.com wrote:
>> So, do we need a sinval overrun or just a single sinval message to
>> provoke starvation? [...]
>
> The point is moot, because latches don't work that way anymore.

One of us is confused, because IIUC Tom just fixed this this morning,
and I'm trying to figure out how many users will be affected by it,
and how seriously. Like, do we need an immediate minor release?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
On 9 November 2012 14:01, Robert Haas robertmh...@gmail.com wrote:
> I think the question that hasn't really been adequately answered is:
> where and how are we going to track conflicts? Your previous patch
> involved storing an XID in pg_class, but I think we both found that a
> bit grotty - it'd probably need special handling for wraparound, and
> I think we came up with some related cases that couldn't be handled
> in the same way without adding a bunch more XIDs to various places. I
> don't really like the idea of having XIDs floating around in the
> system catalogs - it seems like a recipe for bugs, not to mention
> that storing ephemeral data in a persistent table seems like a
> mismatch.

Yes, the xid only needs to be transient, not in pg_class.

> What I've been wondering since this last came up is whether we could
> use some variant of the SIREAD locks Kevin introduced for SSI to
> handle this case [...] we could probably optimize most of them away
> by only checking the lock table if the pg_class row's own xmin is new
> enough that the other backend's MVCC snapshot can't see it.

I think the xid should still live in the relcache, per the patch, but
should live in a transient place (and not pg_class). We need a fast
lookup structure that is expandable to accommodate arbitrary numbers
of truncates: a shared hash table, with some form of overflow
mechanism.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
Jeff,

On 11/09/2012 02:01 AM, Jeff Davis wrote:
> For the sake of simplicity (implementation as well as usability), it
> seems like there is agreement that checksums should be enabled or
> disabled for the entire instance, not per-table.

Agreed. I quickly thought about making it a per-database setting, but
what about shared system catalogs... Let's keep it simple and have a
single per-cluster switch for now.

> I don't think a GUC entirely makes sense (in its current form,
> anyway). We basically care about 3 states:
>
> 1. Off: checksums are not written, nor are they verified. Pages that
>    are newly dirtied have the checksum information in the header
>    cleared.
> 2. Enabling: checksums are written for every dirty page, but only
>    verified for pages where the checksum is present (as determined by
>    information in the page header).
> 3. On: checksums are written for every dirty page, and verified for
>    every page that's read. If a page does not have a checksum, it's
>    corrupt.

Sounds sane, yes.

> And the next question is what commands to add to change state. Ideas:
>
>   CHECKSUMS ENABLE;  -- set state to Enabling
>   CHECKSUMS DISABLE; -- set state to Off

Yet another SQL command doesn't feel like the right thing for such a
switch. Quick googling revealed that CHECKSUM is a system function in
MS SQL, and MySQL has a CHECKSUM TABLE command. And you never know
what the committee is coming up with next.

Apart from that, I'd like something more descriptive than just
"checksums". Block checksums? Heap checksums? Data checksums?

Regards

Markus Wanner
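[Editorial note: Jeff's three states can be summarized as a small decision table. This is a toy Python model for illustration only; the state names and function names are mine, not the patch's.]

```python
# Toy model of the Off / Enabling / On states: which dirty pages get a
# checksum written, and which reads are verified, in each state.

OFF, ENABLING, ON = "off", "enabling", "on"

def checksum_written_on_dirty(state):
    # Off clears the checksum info in the page header; both other states
    # write a checksum for every dirtied page.
    return state in (ENABLING, ON)

def read_is_verified(state, page_has_checksum):
    if state == OFF:
        return False                 # never verified
    if state == ENABLING:
        return page_has_checksum     # verify only pages whose header has one
    return True                      # On: every page; a missing checksum is corruption

assert not checksum_written_on_dirty(OFF)
assert checksum_written_on_dirty(ENABLING)
assert read_is_verified(ENABLING, True) and not read_is_verified(ENABLING, False)
assert read_is_verified(ON, False)   # On treats a checksum-less page as corrupt
```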
Re: [HACKERS] AutoVacuum starvation from sinval messages
On 9 November 2012 14:16, Robert Haas robertmh...@gmail.com wrote:
> On Fri, Nov 9, 2012 at 9:02 AM, Simon Riggs si...@2ndquadrant.com wrote:
>> The point is moot, because latches don't work that way anymore.
>
> One of us is confused, because IIUC Tom just fixed this this morning,
> and I'm trying to figure out how many users will be affected by it,
> and how seriously. Like, do we need an immediate minor release?

You asked what provokes starvation, and the answer is nothing anymore,
since Tom's commit. No confusion here...

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On 11/09/2012 06:18 AM, Jesper Krogh wrote:
> I would definitely stuff our system into state = 2 in your
> description if it was available.

Hm, that's an interesting statement. What's probably worst when
switching from Off to On is the VACUUM run that needs to touch every
page (provided you haven't ever turned checksumming on before). Maybe
you want to save that step and still get the additional safety for
newly dirtied pages, right?

A use case worth supporting?

Regards

Markus Wanner
Re: [HACKERS] AutoVacuum starvation from sinval messages
Simon Riggs wrote:
> Robert Haas robertmh...@gmail.com wrote:
>> One of us is confused, because IIUC Tom just fixed this this
>> morning, and I'm trying to figure out how many users will be
>> affected by it, and how seriously. Like, do we need an immediate
>> minor release?
>
> You asked what provokes starvation, and the answer is nothing
> anymore, since Tom's commit. No confusion here...

So for all those end users running production with a build compiled
from HEAD after Tom's commit this morning, there is no issue. I'm
wondering about those using something else in production. What does it
take for their autovacuum to be stalled?

-Kevin
Re: [HACKERS] AutoVacuum starvation from sinval messages
Robert Haas robertmh...@gmail.com writes:
> So, do we need a sinval overrun or just a single sinval message to
> provoke starvation? The former would be bad, but the latter would be
> really, really bad.

The former. Actually, a signal will be sent as soon as a backend is
determined to be unreasonably far behind, which I think is when the
message queue is half full. In the AV launcher's case, the queue
probably never actually overflows; but the launcher has to wake up to
process messages every so often, and that wakeup cycle is breaking the
timeout management in WaitLatch.

> IIRC the queue has 4K entries, and IIRC a single DDL operation might
> provoke a couple of sinvals, but I'm thinking that somebody would
> probably have to be creating 1024 temp tables a minute to overrun the
> queue, which is very possible but not necessarily common.

Well, one DDL statement typically generates multiple messages --- one
for each catalog row added/modified/removed, roughly speaking. When I
run the constant create/drop example Jeff posted, I see the AV
launcher getting a catchup signal every few seconds. I didn't try to
determine exactly how many create/drop cycles that was, but I'm pretty
sure it's a lot less than 1000.

			regards, tom lane
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
On Fri, Nov 9, 2012 at 4:24 PM, Simon Riggs si...@2ndquadrant.com wrote:
> We need a fast lookup structure that is expandable to accommodate
> arbitrary numbers of truncates: a shared hash table, with some form
> of overflow mechanism.

Surely you only need to remember the last completed truncate for each
relation? The latest one also invalidates any snapshots taken before
earlier truncates.

Regards,
Marti
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
Robert Haas wrote:
> What I've been wondering since this last came up is whether we could
> use some variant of the SIREAD locks Kevin introduced for SSI to
> handle this case - essentially have the transaction doing the
> TRUNCATE make an entry in the lock table that will force a
> serialization failure for any backend which accesses the table with a
> snapshot that can't see the truncating transaction's XID.

It seems to me that the goal would be to make this semantically
identical to the behavior users would see if an unqualified DELETE
were run against the table rather than a TRUNCATE. To wit:

(1) Any attempt to read from the truncated table would not block. The
pg_class version included in the transaction's snapshot would
determine which heap and indexes were accessed. If the reading
transaction were SERIALIZABLE, it would generate a read-write conflict
out to the truncating transaction.

(2) Any attempt to write to the truncated table would block until the
end of the transaction which is doing the truncation. If the
truncating transaction rolls back, it proceeds normally against the
old data. Otherwise: if the transaction is READ COMMITTED, follow the
pg_class update links; at stricter isolation levels, generate a write
conflict error.

I'm not sure where any new use of the predicate locking system would
come into play in that, other than properly handling read-write
conflicts when both transactions were SERIALIZABLE.

This seems like a subset of the issues which one might want to address
by making DDL statements behave in a more strictly MVCC fashion. Does
it make sense to pick those off one at a time, or should something
like this be done only in the context of an overall plan to deal with
all of it?

-Kevin
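[Editorial note: Kevin's points (1) and (2) amount to a decision table, sketched below as a toy Python model. The function names and outcome labels are illustrative only, not from any patch.]

```python
# Hypothetical decision table for the DELETE-equivalent TRUNCATE semantics:
# what a reader or writer of the truncated table experiences.

def reader_outcome(isolation):
    # (1) Readers never block; the snapshot's pg_class version selects the
    #     pre-truncation heap. SERIALIZABLE readers also record a
    #     read-write conflict out to the truncating transaction.
    out = "read pre-truncation heap"
    if isolation == "serializable":
        out += " + rw-conflict out"
    return out

def writer_outcome(isolation, truncator_committed):
    # (2) Writers block until the truncating transaction ends, then:
    if not truncator_committed:
        return "proceed against old data"        # truncator rolled back
    if isolation == "read committed":
        return "follow pg_class update links"
    return "write conflict error"                # stricter isolation levels

assert reader_outcome("serializable") == "read pre-truncation heap + rw-conflict out"
assert writer_outcome("read committed", True) == "follow pg_class update links"
assert writer_outcome("repeatable read", True) == "write conflict error"
```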
Re: [HACKERS] AutoVacuum starvation from sinval messages
Tom Lane escribió:
> Robert Haas robertmh...@gmail.com writes:
>> IIRC the queue has 4K entries, and IIRC a single DDL operation might
>> provoke a couple of sinvals, but I'm thinking that somebody would
>> probably have to be creating 1024 temp tables a minute to overrun
>> the queue, which is very possible but not necessarily common.
>
> Well, one DDL statement typically generates multiple messages --- one
> for each catalog row added/modified/removed, roughly speaking. When I
> run the constant create/drop example Jeff posted, I see the AV
> launcher getting a catchup signal every few seconds. I didn't try to
> determine exactly how many create/drop cycles that was, but I'm
> pretty sure it's a lot less than 1000.

Just creating the sequence for the serial column means 16 pg_attribute
tuples. There's also two pg_class entries, one more pg_attribute
entry, two pg_type entries, a bunch of pg_depend entries ... I doubt
it's less than 30 catalog tuples, all things considered. Double that
for the drop. So for a 4k-entry queue that needs to get 50% full,
that's only ~35 temp table creations like that.

I hadn't realized sequences used so many pg_attribute entries.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
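[Editorial note: Alvaro's arithmetic can be checked back-of-the-envelope. The 30-tuples-per-creation figure is his rough estimate, not an exact count.]

```python
# With ~30 sinval-producing catalog tuples per temp-table creation (doubled
# for the drop), and a catchup signal fired once a backend falls half a
# 4096-entry queue behind, only a few dozen create/drop cycles are needed.

QUEUE_SIZE = 4096
TUPLES_PER_CREATE = 30                        # pg_attribute, pg_class, pg_type,
                                              # pg_depend ... (Alvaro's estimate)
messages_per_cycle = 2 * TUPLES_PER_CREATE    # create + drop

cycles_until_catchup_signal = (QUEUE_SIZE // 2) // messages_per_cycle
assert cycles_until_catchup_signal == 34      # i.e. roughly the ~35 cycles quoted
```

This also squares with Tom's observation of a catchup signal every few seconds under Jeff's constant create/drop load: far fewer than 1000 cycles are needed.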
Re: [HACKERS] AutoVacuum starvation from sinval messages
On Fri, Nov 9, 2012 at 10:08 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Tom Lane escribió: Robert Haas robertmh...@gmail.com writes: IIRC the queue has 4K entries, and IIRC a single DDL operation might provoke a couple of sinvals, but I'm thinking that somebody would probably have to be creating 1024 temp tables a minute to overrun the queue, which is very possible but not necessarily common. Well, one DDL typically generates multiple messages --- one for each catalog row added/modified/removed, roughly speaking. When I run the constant create/drop example Jeff posted, I see the AV launcher getting a catchup signal every few seconds. I didn't try to determine exactly how many create/drop cycles that was, but I'm pretty sure it's a lot less than 1000. Just creating the sequence for the serial column means 16 pg_attribute tuples. There's also two pg_class entries, one more pg_attribute, two pg_type entries, a bunch of pg_depend entries ... I doubt it's less than 30 catalog tuples, all things considered. Double that for the drop. So for a 4k entry table that needs to get 50% full, that's only ~35 temp table creations like that. I hadn't realized sequences used so many pg_attribute entries. Hmm. So, are we going to force a minor release for this, or do we think it's not serious enough to warrant that? I'm not expressing an opinion either way, just asking. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
On 9 November 2012 14:55, Marti Raudsepp ma...@juffo.org wrote: On Fri, Nov 9, 2012 at 4:24 PM, Simon Riggs si...@2ndquadrant.com wrote: We need a fast lookup structure that is expandable to accommodate arbitrary numbers of truncates. Shared hash table, with some form of overflow mechanism. Surely you only need to remember the last completed truncate for each relation? Yes. The latest one also invalidates any snapshots before earlier truncates. 1 per table, arbitrary number of tables -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
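A minimal sketch of the bookkeeping being discussed: one remembered truncate per relation, arbitrary number of relations. A plain dict stands in for the shared hash table, and all names, xid comparisons, and the error text are invented for illustration, not PostgreSQL code:

```python
# Toy model: remember only the xid of the last committed TRUNCATE per
# relation. A reader whose snapshot predates that xid would get a
# serialization failure rather than silently seeing an empty table.
latest_truncate_xid = {}   # relation oid -> xid of the last committed TRUNCATE

def record_truncate(rel_oid, xid):
    # The latest truncate also invalidates any snapshot taken before an
    # earlier truncate of the same relation, so overwriting is sufficient.
    latest_truncate_xid[rel_oid] = xid

def check_read(rel_oid, snapshot_xid):
    """Raise if the reader's snapshot predates the last TRUNCATE."""
    xid = latest_truncate_xid.get(rel_oid)
    if xid is not None and snapshot_xid < xid:
        raise RuntimeError("serialization failure: snapshot predates TRUNCATE")

record_truncate(16384, xid=100)
record_truncate(16384, xid=200)      # only the latest entry is kept
check_read(16384, snapshot_xid=300)  # OK: this snapshot can see the truncate
```

The overwrite in `record_truncate` is the point Simon makes: keeping one entry per table is enough, since the newest truncate subsumes all earlier ones.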
Re: [HACKERS] WIP checksums patch
On Thu, Nov 8, 2012 at 9:17 PM, Christopher Browne cbbro...@gmail.com wrote: I see one thing to be concerned about, there... I imagine it would not be a totally happy thing if the only way to switch it on/off was to use Slony or Londiste to replicate into a database with the opposite setting. (e.g. - This implies that built-in replication may only replicate into a database with the identical checksum configuration.) Sure, I agree. I don't think it should stay that way forever, but removing the burden of dealing with this issue from the initial commit would likely allow that commit to happen this release cycle, perhaps even in the next CommitFest. And then we'd have half a loaf, which is better than none, and we could deal with the issues of switching it on and off as a further enhancement. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
On 9 November 2012 14:22, Kevin Grittner kgri...@mail.com wrote: Robert Haas wrote: What I've been wondering since this last came up is whether we could use some variant of the SIREAD locks Kevin introduced for SSI to handle this case - essentially have the transaction doing the TRUNCATE make an entry in the lock table that will force a serialization failure for any backend which accesses the table with a snapshot that can't see the truncating transaction's XID. It seems to me that the goal would be to make this semantically identical to the behavior users would see if an unqualified DELETE were run against the table rather than a TRUNCATE. Unqualified DELETE already runs that way. TRUNCATE is a different command for a reason. Making TRUNCATE like something we already have seems not very useful to me, not least because it breaks existing applications. This seems like a subset of the issues which one might want to address by making DDL statements behave in a more strictly MVCC fashion. Does it make sense to pick those off one at a time, or should something like this be done only in the context of an overall plan to deal with all of it? TRUNCATE is not DDL, plus I have no interest in this other than speeding up COPY. Scope creep just kills features. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
On Fri, Nov 9, 2012 at 8:22 AM, Kevin Grittner kgri...@mail.com wrote: Robert Haas wrote: What I've been wondering since this last came up is whether we could use some variant of the SIREAD locks Kevin introduced for SSI to handle this case - essentially have the transaction doing the TRUNCATE make an entry in the lock table that will force a serialization failure for any backend which accesses the table with a snapshot that can't see the truncating transaction's XID. It seems to me that the goal would be to make this semantically identical to the behavior users would see if an unqualified DELETE were run against the table rather than a TRUNCATE. To wit: but, triggers would not fire, right? merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
Merlin Moncure wrote: Kevin Grittner kgri...@mail.com wrote: Robert Haas wrote: It seems to me that the goal would be to make this semantically idential to the behavior users would see if an unqualified DELETE were run against the table rather than a TRUNCATE. but, triggers would not fire, right? Right. Perhaps identical was too strong a word. I was referring to the aspect under consideration here -- making it serializable in line with other MVCC operations. If we're not talking about making conflicts with other transactions behave just the same as an unqualified DELETE from a user perspective, I'm not sure what the goal is, exactly. Obviously we would be keeping the guts of the implementation the same (swapping in a new, empty heap). -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
On 9 November 2012 15:34, Kevin Grittner kgri...@mail.com wrote: If we're not talking about making conflicts with other transactions behave just the same as an unqualified DELETE from a user perspective, I'm not sure what the goal is, exactly. Reasonable question. My goal is to allow COPY to load frozen tuples without causing MVCC violations. Altering TRUNCATE so it behaves perfectly from an MVCC/Serializable perspective is a much bigger, and completely different goal, as well as something I don't see as desirable anyway for at least 2 good reasons, as explained. IMHO if people want MVCC/Serializable semantics, use DELETE, possibly spending time to make unqualified DELETE do some fancy TRUNCATE-like tricks with relfilenodes. Forcing a tightly scoped proposal into a much wider one will just kill this and leave it blocked. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
On 9 November 2012 15:46, Simon Riggs si...@2ndquadrant.com wrote: Altering TRUNCATE so it behaves perfectly from an MVCC/Serializable perspective is a much bigger, and completely different goal, as well as something I don't see as desirable anyway for at least 2 good reasons, as explained. IMHO if people want MVCC/Serializable semantics, use DELETE, possibly spending time to make unqualified DELETE do some fancy TRUNCATE-like tricks with relfilenodes. We spent a lot of time in 9.2 making TRUNCATE/reload of a table just work, rather than implementing a REPLACE command. ISTM strange to throw away all that effort, changing behaviour of TRUNCATE and thus forcing the need for a REPLACE command after all. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] DEALLOCATE IF EXISTS
On Tue, Oct 9, 2012 at 4:44 PM, Vik Reykja vikrey...@gmail.com wrote: On Tue, Oct 9, 2012 at 4:09 PM, Tom Lane t...@sss.pgh.pa.us wrote: Sébastien Lardière slardi...@hi-media.com writes: Indeed, the brackets were not correct; it's better now (I think), and I corrected some comments. Still wrong ... at the very least you missed copyfuncs/equalfuncs. In general, when adding a field to a struct, it's good practice to grep for all uses of that struct. I don't see Sébastien's message, but I made the same mistake in my patch. Another one is attached with copyfuncs and equalfuncs. I did a grep for DeallocateStmt and I don't believe I have missed anything else. Also, I'm changing the subject so as not to hijack this thread any further. I am taking no comments to mean no objections and have added this to the next commitfest.
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
Simon Riggs wrote: This seems like a subset of the issues which one might want to address by making DDL statements behave in a more strictly MVCC fashion. Does it make sense to pick those off one at a time, or should something like this be done only in the context of an overall plan to deal with all of it? TRUNCATE is not DDL You're right, I should have said utility commands. I have no interest in this other than speeding up COPY. I would love to have that! Scope creep just kills features. Well, I wasn't saying it should all be *done* at the same time, but this is not the only utility command which could benefit from such an effort, and if each one is done with no consideration of what it takes for them all to be done, we could wind up with something that doesn't hang together very coherently. Perhaps this one could serve as a pilot, to identify issues and help develop such a plan. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
Simon Riggs wrote: My goal is to allow COPY to load frozen tuples without causing MVCC violations. OK. That wasn't initially clear to me. Forcing a tightly scoped proposal into a much wider one will just kill this and leave it blocked. The goal is important enough and narrow enough to merit the approach you're talking about, IMV, at least. Sorry I initially misunderstood what you were going for. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
Simon Riggs si...@2ndquadrant.com writes: On 9 November 2012 15:34, Kevin Grittner kgri...@mail.com wrote: If we're not talking about making conflicts with other transactions behave just the same as an unqualified DELETE from a user perspective, I'm not sure what the goal is, exactly. Reasonable question. My goal is to allow COPY to load frozen tuples without causing MVCC violations. If that's the goal, I question why you're insisting on touching TRUNCATE's behavior. We already have the principle that TRUNCATE is like DELETE except not concurrent-safe. Why not just invent a non-concurrent-safe option to COPY that loads prefrozen tuples into a new heap, and call it good? There will be visibility oddness from that definition, sure, but AFAICS there will be visibility oddness from what you're talking about too. You'll just have expended a very great deal of effort to make the weirdness a bit different. Even if the TRUNCATE part of it were perfectly clean, the load prefrozen tuples part won't be --- so I'm not seeing the value of changing TRUNCATE. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
On Fri, Nov 9, 2012 at 9:22 AM, Kevin Grittner kgri...@mail.com wrote: (1) Any attempt to read from the truncated table would not block. The pg_class version included in the transaction's snapshot would determine which heap and indexes were accessed. Well, the thing is, you can't actually do this. When the transaction commits, we truncate the main forks of the old heap and index and remove all of the supplemental forks. The main forks are finally removed for good at the next checkpoint cycle. To make this work, we'd have to keep around the old heap and index until there were no longer any MVCC snapshots that could see them. That might be useful as an option, but it would need a bunch of additional mechanism, and it doesn't seem desirable as a default behavior because it could defer disk space reclamation indefinitely. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
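The "keep the old heap around until no MVCC snapshot can see it" mechanism Robert sketches can be modeled as deferred unlinking keyed on the oldest live snapshot. Everything below (the relfilenode paths, the xid comparisons, the function names) is an illustrative assumption, not how smgr/checkpointer actually do it:

```python
# Toy model of deferring the unlink of a truncated relation's old files
# until every live snapshot postdates the truncating transaction.
pending_unlinks = []            # list of (relfilenode, truncate_xid)
active_snapshots = {500, 520}   # xids at which currently open snapshots were taken

def truncate(relfilenode, xid):
    # TRUNCATE swaps in a new relfilenode; the old one becomes reclaimable
    # only once no snapshot can still see the pre-truncate data.
    pending_unlinks.append((relfilenode, xid))

def reap():
    """Unlink old files whose truncate predates every live snapshot."""
    oldest = min(active_snapshots, default=float("inf"))
    kept, freed = [], []
    for node, xid in pending_unlinks:
        (freed if xid < oldest else kept).append((node, xid))
    pending_unlinks[:] = kept
    return [node for node, _ in freed]

truncate("base/16384/16385", xid=510)
print(reap())                    # []: snapshot 500 may still need the old file
active_snapshots.discard(500)
print(reap())                    # ['base/16384/16385']: safe to unlink now
```

This makes Robert's cost concern concrete: the old file sits in `pending_unlinks` for as long as any pre-truncate snapshot stays open, so disk reclamation can be deferred indefinitely by a single long-running transaction.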
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
On Fri, Nov 9, 2012 at 11:27 AM, Tom Lane t...@sss.pgh.pa.us wrote: My goal is to allow COPY to load frozen tuples without causing MVCC violations. If that's the goal, I question why you're insisting on touching TRUNCATE's behavior. We already have the principle that TRUNCATE is like DELETE except not concurrent-safe. Why not just invent a non-concurrent-safe option to COPY that loads prefrozen tuples into a new heap, and call it good? There will be visibility oddness from that definition, sure, but AFAICS there will be visibility oddness from what you're talking about too. You'll just have expended a very great deal of effort to make the weirdness a bit different. Even if the TRUNCATE part of it were perfectly clean, the load prefrozen tuples part won't be --- so I'm not seeing the value of changing TRUNCATE. I don't object to the idea of giving COPY a way to load prefrozen tuples, but I think you might be missing the point here otherwise. Right now, if you CREATE or TRUNCATE a table, copy a bunch of data into it, and then commit, another transaction that took a snapshot before your commit can subsequently look at that table and it will NOT see your newly-loaded data. What it will see instead is an empty table. This is, of course, wrong: it ought to fail with a serialization error. It is very possible that the table has never been empty at the conclusion of a completed transaction: it might have contained data before the TRUNCATE, and it might again contain data by the time the truncating transaction commits. Yet, we see it as empty, which is not MVCC-compliant. If we were to make COPY pre-freeze the data when the table was created or truncated in the same transaction, it would alter the behavior in this situation, and from an application perspective, only this situation. Now, instead of seeing the table as empty, you'd see the new contents. 
This is also not MVCC-compliant, and I guess the concern when we have talked about this topic before is that changing from one wrong behavior to another, not-backward-compatible wrong behavior might not be the friendliest thing to do. We could decide we don't care and just break it. Or we could try to make it throw a serialization error, as Simon is proposing here, which seems like the tidiest solution. Or we could keep the old heap around until there are no more snapshots that can need it, which is a bit scary since we'd be eating double disk-space in the meantime, but it would certainly be useful to some users, I think. Just having an option to preload frozen tuples dodges all of these issues by throwing our hands up in the air, but it does have the advantage of being more general. Even if we do that I'm not sure it would be a bad thing to try to solve this issue in a somewhat more principled way, but it would surely reduce the urgency. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
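The anomaly Robert walks through can be sketched with a toy visibility model. The stand-in frozen xid and the simplified one-line snapshot test below are assumptions of the sketch, not the real tuple-visibility rules:

```python
FROZEN_XID = 2   # stand-in for FrozenTransactionId: visible to every snapshot

class Tuple:
    def __init__(self, xmin):
        self.xmin = xmin   # xid of the inserting transaction

def visible(tup, snapshot_xmax):
    # Simplified rule: a tuple is visible if its inserting transaction
    # committed before the snapshot was taken, or if it is frozen.
    return tup.xmin == FROZEN_XID or tup.xmin < snapshot_xmax

# Reader takes a snapshot at xid 500; a later transaction (xid 600)
# truncates the table and COPYs new rows into it.
snapshot_xmax = 500

normal_load = Tuple(xmin=600)          # COPY without freezing
frozen_load = Tuple(xmin=FROZEN_XID)   # COPY pre-freezing the tuples

print(visible(normal_load, snapshot_xmax))   # False: old snapshot sees an empty table
print(visible(frozen_load, snapshot_xmax))   # True: old snapshot sees the new data
```

Both outcomes violate MVCC in this scenario: the table was never empty at the end of any completed transaction, and the frozen data postdates the reader's snapshot. Pre-freezing merely trades one wrong answer for a different, not-backward-compatible one, which is exactly the trade-off under discussion.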
Re: [HACKERS] Enabling Checksums
Jeff, I don't think a GUC entirely makes sense (in its current form, anyway). We basically care about 3 states: Huh? Why would a GUC not make sense? How else would you make sure that checksums were on when you started the system? 1. Off: checksums are not written, nor are they verified. Pages that are newly dirtied have the checksum information in the header cleared. 2. Enabling: checksums are written for every dirty page, but only verified for pages where the checksum is present (as determined by information in the page header). 3. On: checksums are written for every dirty page, and verified for every page that's read. If a page does not have a checksum, it's corrupt. Well, large databases would tend to be stuck permanently in Enabling, because the user would never vacuum old cold partitions in order to checksum them. So we need to be prepared for this to be the end state for a lot of databases. In fact, we'd need three settings for the checksum GUC: OFF -- don't checksum anything, equal to state (1) above WRITES -- checksum pages which are being written anyway, but ignore tables which aren't touched. Permanent Enabling state. ALL -- checksum everything you can. particularly, autovacuum would checksum any table which was not already checksummed at the next vacuum of that table. Goal is to get to state 3 above. Does it make sense to store this information in pg_control? That doesn't require adding any new file, and it has the benefit that it's already checksummed. It's available during recovery and can be made available pretty easily in the places where we write data. And the next question is what commands to add to change state. Ideas: CHECKSUMS ENABLE; -- set state to Enabling CHECKSUMS DISABLE; -- set state to Off Don't like this, please make it a GUC. And then to get to the On state, you have to run a system-wide VACUUM while in the Enabling state. Or, if the above syntax causes problems, we can make all of these into VACUUM options. 
As there's no such thing as system-wide vacuum, we're going to have to track whether a table is fully checksummed in the system catalogs. We'll also need: VACUUM ( CHECKSUM ON ) ... which would vacuum an entire table, skipping no pages and writing checksums for every page, unless the table were marked fully checksummed already, in which case it would do a regular vacuum. Once a table was flagged as all checksummed, then the system could start producing errors (or warnings?) whenever a page with a missing checksum was found. Hmmm, better to have a 2nd GUC: checksum_fail_action = WARNING | ERROR ... since some people want the write or read to fail, and others just want to see it in the logs. So, thinking about it, state (3) is never the state of an entire installation; it's always the state of individual tables. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
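The three states and the per-page checksum flag discussed above can be sketched as a small model. The page layout, the placeholder checksum function, and all names below are invented for illustration; only the off/enabling/on semantics come from the messages:

```python
# Toy model of the proposed checksum states. The "page" is just a
# (has_checksum_flag, checksum) pair; real pages would carry this in
# their header, and the checksum algorithm would be much stronger.
OFF, ENABLING, ON = "off", "enabling", "on"

def checksum(data):
    # Placeholder 16-bit checksum, for illustration only.
    return sum(data) & 0xFFFF

def write_page(state, data):
    """Return (has_checksum_flag, checksum) as stored on disk."""
    if state == OFF:
        return (False, 0)          # newly dirtied pages get the flag cleared
    return (True, checksum(data))  # ENABLING and ON both write checksums

def read_page(state, page, data):
    has_cksum, stored = page
    if state == ON and not has_cksum:
        raise ValueError("page has no checksum: corrupt")
    if has_cksum and stored != checksum(data):
        raise ValueError("checksum mismatch")

data = bytes(range(64))
page = write_page(ENABLING, data)
read_page(ENABLING, page, data)        # verified, because the flag is set
read_page(ENABLING, (False, 0), data)  # unchecksummed page: silently accepted
```

The last line is the crux of the "permanent Enabling" concern: while enabling, an unchecksummed page is indistinguishable from a page whose flag was corrupted, so old, never-rewritten data remains unprotected until a full rewrite promotes the cluster (or table) to the On state.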
Re: [HACKERS] Further pg_upgrade analysis for many tables
On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jeff Janes jeff.ja...@gmail.com writes: Are you sure the server you are dumping out of is head? I experimented a bit with dumping/restoring 16000 tables matching Bruce's test case (ie, one serial column apiece). The pg_dump profile seems fairly flat, without any easy optimization targets. But restoring the dump script shows a rather interesting backend profile: samples %image name symbol name 3086139.6289 postgres AtEOXact_RelationCache 9911 12.7268 postgres hash_seq_search ... There are at least three ways we could whack that mole: * Run the psql script in --single-transaction mode, as I was mumbling about the other day. If we were doing AtEOXact_RelationCache only once, rather than once per CREATE TABLE statement, it wouldn't be a problem. Easy but has only a narrow scope of applicability. That is effective when loading into 9.3 (assuming you make max_locks_per_transaction large enough). But when loading into pre-9.3, using --single-transaction will evoke the quadratic behavior in the resource owner/lock table and make things worse rather than better. But there is still the question of how people can start using 9.3 if they can't use pg_upgrade, or use the pg_dump half of the dump/restore, in order to get there. It seems to me that pg_upgrade takes some pains to ensure that no one else attaches to the database during its operation. In that case, is it necessary to run the entire dump in a single transaction in order to get a consistent picture? The attached crude patch allows pg_dump to not use a single transaction (and thus not accumulate a huge number of locks) by using the --pg_upgrade flag. This seems to remove the quadratic behavior of running pg_dump against pre-9.3 servers. It is linear up to 30,000 tables with a single serial column, at about 1.5 msec per table. I have no evidence other than a gut feeling that this is a safe thing to do. 
I've also tested Tatsuo-san's group-LOCK TABLE patch against this case, and it is of minimal help. The problem is that there is no syntax for locking sequences, so they cannot be explicitly locked as a group but rather are implicitly locked one by one and so still suffer from the quadratic behavior. Cheers, Jeff pg_dump_for_upgrade.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
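The quadratic behavior Jeff refers to can be illustrated with a toy model of lock release, counting operations instead of wall time. The data structures below are stand-ins, not the real resource-owner or lock-table code: the point is only that releasing N locks from a structure scanned linearly on each release costs O(N²) total work, while a hashed structure stays linear:

```python
# Toy illustration of why accumulating N locks in a linearly scanned
# structure makes releasing them all O(N^2) -- the behavior a huge
# single-transaction restore runs into on servers without a hashed fix.
def release_all_linear(n):
    locks = list(range(n))
    ops = 0
    while locks:
        # Find-and-remove scans the list on every release.
        for i, lk in enumerate(locks):
            ops += 1
            if lk == locks[-1]:
                locks.pop(i)
                break
    return ops

def release_all_hashed(n):
    locks = set(range(n))
    ops = 0
    while locks:
        locks.pop()   # constant-time removal
        ops += 1
    return ops

print(release_all_linear(1000) > 100 * release_all_hashed(1000))  # True
```

With 16000 tables plus their sequences and indexes, the linear variant's n(n+1)/2 scan cost is what turns a dump restore from minutes into hours, which is why avoiding the single giant transaction (or fixing the structure, as 9.3 did) removes the quadratic term.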
Re: [HACKERS] Enabling Checksums
On Fri, 2012-11-09 at 15:42 +0100, Markus Wanner wrote: On 11/09/2012 06:18 AM, Jesper Krogh wrote: I would definately stuff our system in state = 2 in your description if it was available. Hm.. that's an interesting statement. What's probably worst when switching from OFF to ON is the VACUUM run that needs to touch every page (provided you haven't ever turned checksumming on before). Maybe you want to save that step and still get the additional safety for newly dirtied pages, right? A use case worth supporting? One problem is telling which pages are protected and which aren't. We can have a couple bits in the header indicating that a checksum is present, but it's a little disappointing to have only a few bits protecting a 16-bit checksum. Also, I think that people will want to have a way to protect their old data somehow. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Thu, 2012-11-08 at 23:33 -0300, Alvaro Herrera wrote: There's no such thing as a system-wide VACUUM. The most you can get is a database-wide VACUUM, which means you'd have to store the state per-database somewhere (presumably the pg_database catalog), and perhaps pg_control could have it as a system-wide value that's computed as the minimum of all database states (so it stays enabling until all databases have upgraded to on). That's a good point. Maybe this should be done as an offline operation using a command-line utility? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On 11/09/2012 07:53 PM, Jeff Davis wrote: One problem is telling which pages are protected and which aren't. We can have a couple bits in the header indicating that a checksum is present, but it's a little disappointing to have only a few bits protecting a 16-bit checksum. Given your description of option 2 I was under the impression that each page already has a bit indicating whether or not the page is protected by a checksum. Why do you need more bits than that? Also, I think that people will want to have a way to protect their old data somehow. Well, given that this specific set of users is not willing to go through a rewrite of each and every page of its database, it's hard to see how we can protect their old data better. However, we certainly need to provide the option to go through the rewrite for other users, who are well willing to bite that bullet. From a user's perspective, the trade-off seems to be: if you want your old data to be covered by checksums, you need to go through such an expensive VACUUM run that touches every page in your database. If you don't want to or cannot do that, you can still turn on checksumming for newly written pages. You won't get full protection and it's hard to tell what data is protected and what not, but it's still better than no checksumming at all. Especially for huge databases, that might be a reasonable compromise. One could even argue that this just leads to a prolonged migration and with time, the remaining VACUUM step becomes less and less frightening. Do you see any real foot-guns or other show-stoppers for permanently allowing that in-between-state? Or do we have other viable options that prolong the migration and thus spread the load better over time? Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] My first patch! (to \df output)
On Oct 27, 2012, at 10:45 AM, Pavel Stehule pavel.steh...@gmail.com wrote: Hello 2012/10/27 Jon Erdman postgre...@thewickedtribe.net: Hello Hackers! So, currently the only way to see if a function is security definer or not is to directly query pg_proc. This is both irritating, and I think perhaps dangerous since security definer functions can be so powerful. I thought that rectifying that would make an excellent first patch, and I was bored today here in Prague since pgconf.eu is now over...so here it is. :) This patch adds a column to the output of \df titled Security with values of definer or invoker based on the boolean secdef column from pg_proc. I've also included a small doc patch to match. This patch is against master from git. Comments welcome! I just realized I didn't address regression tests, so I guess this is not actually complete yet. I should have time for that next week after I get back to the states. I would also like to start discussion about perhaps adding a couple more things to \df+, specifically function execution permissions (which are also exposed nowhere outside the catalog to my knowledge), and maybe search_path since that's related to secdef. Thoughts? I prefer to show this in \dt+ as a Security column - and for other functionality maybe a new statement. I'm assuming you meant \df+, and I've changed it accordingly. With this change there is now nothing to change in the regression tests, so please consider this my formal and complete submission. describe.patch Description: Binary data Is there anything else I need to do to get this considered? 
Oh, in case anyone is interested, here's what the query now looks like and the new output:

jerdman=# \df+ public.akeys
********* QUERY **********
SELECT n.nspname as "Schema",
  p.proname as "Name",
  pg_catalog.pg_get_function_result(p.oid) as "Result data type",
  pg_catalog.pg_get_function_arguments(p.oid) as "Argument data types",
 CASE
  WHEN p.proisagg THEN 'agg'
  WHEN p.proiswindow THEN 'window'
  WHEN p.prorettype = 'pg_catalog.trigger'::pg_catalog.regtype THEN 'trigger'
  ELSE 'normal'
 END as "Type",
 CASE
  WHEN prosecdef THEN 'definer'
  ELSE 'invoker'
 END AS "Security",
 CASE
  WHEN p.provolatile = 'i' THEN 'immutable'
  WHEN p.provolatile = 's' THEN 'stable'
  WHEN p.provolatile = 'v' THEN 'volatile'
 END as "Volatility",
  pg_catalog.pg_get_userbyid(p.proowner) as "Owner",
  l.lanname as "Language",
  p.prosrc as "Source code",
  pg_catalog.obj_description(p.oid, 'pg_proc') as "Description"
FROM pg_catalog.pg_proc p
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
     LEFT JOIN pg_catalog.pg_language l ON l.oid = p.prolang
WHERE p.proname ~ '^(akeys)$'
  AND n.nspname ~ '^(public)$'
ORDER BY 1, 2, 4;
**************************

                                              List of functions
 Schema | Name  | Result data type | Argument data types |  Type  | Security | Volatility |  Owner  | Language | Source code  | Description 
--------+-------+------------------+---------------------+--------+----------+------------+---------+----------+--------------+-------------
 public | akeys | text[]           | hstore              | normal | invoker  | immutable  | jerdman | c        | hstore_akeys | 
(1 row)

-- Jon T Erdman Postgresql Zealot -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] My first patch! (to \df output)
Oops! Here it is in the proper diff format. I didn't have my env set up correctly :( describe.patch Description: Binary data -- Jon T Erdman Postgresql Zealot On Nov 9, 2012, at 1:53 PM, Jon Erdman postgre...@thewickedtribe.net wrote: On Oct 27, 2012, at 10:45 AM, Pavel Stehule pavel.steh...@gmail.com wrote: Hello 2012/10/27 Jon Erdman postgre...@thewickedtribe.net: Hello Hackers! So, currently the only way to see if a function is security definer or not is to directly query pg_proc. This is both irritating, and I think perhaps dangerous since security definer functions can be so powerful. I thought that rectifying that would make an excellent first patch, and I was bored today here in Prague since pgconf.eu is now over...so here it is. :) This patch adds a column to the output of \df titled Security with values of definer or invoker based on the boolean secdef column from pg_proc. I've also included a small doc patch to match. This patch is against master from git. Comments welcome! I just realized I didn't address regression tests, so I guess this is not actually complete yet. I should have time for that next week after I get back to the states. I would also like to start discussion about perhaps adding a couple more things to \df+, specifically function execution permissions (which are also exposed nowhere outside the catalog to my knowledge), and maybe search_path since that's related to secdef. Thoughts? I prefer show this in \dt+ for column Security - and for other functionality maybe new statement. I'm assuming you meant \df+, and I've changed it accordingly. With this change there is now nothing to change in the regression tests, so please consider this my formal and complete submission. describe.patch Is there anything else I need to do to get this considered? 
[quoted query and sample output trimmed; identical to the preceding message] -- Jon T Erdman Postgresql Zealot -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
Robert Haas robertmh...@gmail.com writes:
> Just having an option to preload frozen tuples dodges all of these issues by throwing our hands up in the air, but it does have the advantage of being more general. Even if we do that I'm not sure it would be a bad thing to try to solve this issue in a somewhat more principled way, but it would surely reduce the urgency.

Yeah. ISTM the whole point of TRUNCATE is "I don't care about serializability for this operation, give me efficiency instead". So I see nothing wrong with a (non-default) option for COPY that similarly trades away some semantic guarantees for efficiency's sake. There are an awful lot of bulk-load scenarios where people will gladly take that trade, and are not very interested in halfway points either.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
On 9 November 2012 16:27, Tom Lane t...@sss.pgh.pa.us wrote:
> Simon Riggs si...@2ndquadrant.com writes:
>> On 9 November 2012 15:34, Kevin Grittner kgri...@mail.com wrote:
>>> If we're not talking about making conflicts with other transactions behave just the same as an unqualified DELETE from a user perspective, I'm not sure what the goal is, exactly.
>>
>> Reasonable question. My goal is to allow COPY to load frozen tuples without causing MVCC violations.
>
> If that's the goal, I question why you're insisting on touching TRUNCATE's behavior. We already have the principle that TRUNCATE is like DELETE except not concurrent-safe. Why not just invent a non-concurrent-safe option to COPY that loads prefrozen tuples into a new heap, and call it good? There will be visibility oddness from that definition, sure, but AFAICS there will be visibility oddness from what you're talking about too. You'll just have expended a very great deal of effort to make the weirdness a bit different. Even if the TRUNCATE part of it were perfectly clean, the "load prefrozen tuples" part won't be --- so I'm not seeing the value of changing TRUNCATE.

This is a wonderful thought and I wish I'd thought of it. My digression via TRUNCATE now annoys me.

Yes, there are objections, and I've read what Robert has said. An explicit new option is perfectly entitled to introduce new behaviour and won't cause a problem with existing applications. I personally don't care about the serializable stuff here, and nor do most others. They just want a way to load new data quickly. It is important to me that we do things in well-principled ways, but the task at hand is data loading, not perfect visibility guarantees.

So what we're talking about here is a new mode for COPY that, when requested, will pre-freeze tuples when loading into a newly created/truncated table. If the table isn't newly created/truncated then we'll just ignore it and continue. I see no need to throw an error, since that will just cause annoying usability issues.
COPY FREEZE here we come, with extensive docs to explain the trade-off the user is accepting. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further pg_upgrade analysis for many tables
On Thu, Nov 8, 2012 at 08:59:21PM -0800, Jeff Janes wrote:
> On Thu, Nov 8, 2012 at 4:33 PM, Bruce Momjian br...@momjian.us wrote:
>> On Thu, Nov 8, 2012 at 03:46:09PM -0800, Jeff Janes wrote:
>>> On Wed, Nov 7, 2012 at 6:17 PM, Bruce Momjian br...@momjian.us wrote:
>>>> As a followup to Magnus's report that pg_upgrade was slow for many tables, I did some more testing with many tables, e.g.: ... Any ideas? I am attaching my test script.
>>>
>>> Have you reviewed the thread at: http://archives.postgresql.org/pgsql-performance/2012-09/msg3.php ? There is a known N^2 behavior when using pg_dump against pre-9.3 servers.
>>
>> I am actually now dumping git head/9.3, so I assume all the problems we know about should be fixed.
>
> Are you sure the server you are dumping out of is head?

Well, I tested again with 9.2 dumping/loading 9.2 and the same for git head, and got these results:

            pg_dump           restore
    tables  9.2      git      9.2      git
    1       0.13     0.11     0.07     0.07
    1000    4.37     3.98     4.32     5.28
    2000    12.98    12.19    13.64    14.25
    4000    47.85    50.14    61.31    70.97
    8000    210.39   183.00   302.67   294.20
    16000   901.53   769.83   1399.25  1359.09

As you can see, there is very little difference between 9.2 and git head, except maybe at the 16k level for pg_dump. Is there some slowdown with a mismatched version dump/reload? I am attaching my test script.

> Using head's pg_dump, but 9.2.1 server, it takes me 179.11 seconds to dump 16,000 tables (schema only) like your example, and it is definitely quadratic.

Are you using a SERIAL column for the tables? I am, and Peter Eisentraut reported that was a big slowdown.

> But using head's pg_dump to dump tables out of head's server, it only took 24.95 seconds, and the quadratic term is not yet important, things still look linear.

Again, using SERIAL?

> But even the 179.11 seconds is several times faster than your report of 757.8, so I'm not sure what is going on there.
> I don't think my laptop is particularly fast: Intel(R) Pentium(R) CPU B960 @ 2.20GHz

I am using server-grade hardware, Xeon E5620 2.4GHz: http://momjian.us/main/blogs/pgblog/2012.html#January_20_2012

> Is the next value, increment, etc. for a sequence stored in a catalog, or are they stored in the 8kb file associated with each sequence? If they are stored in the file, then it is a shame that pg_dump goes to the effort of extracting that info if pg_upgrade is just going to overwrite it anyway.

Each sequence is stored in its own 1-row 8k table:

    test=> CREATE SEQUENCE seq;
    CREATE SEQUENCE
    test=> SELECT * FROM seq;
    -[ RECORD 1 ]-+--------------------
    sequence_name | seq
    last_value    | 1
    start_value   | 1
    increment_by  | 1
    max_value     | 9223372036854775807
    min_value     | 1
    cache_value   | 1
    log_cnt       | 0
    is_cycled     | f
    is_called     | f

Actually, pg_upgrade needs pg_dump to restore all those sequence values.

--
Bruce Momjian  br...@momjian.us  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ It's impossible for everything to be true. +
traprm

export QUIET=$((QUIET + 1))

> /rtmp/out
for CYCLES in 1 1000 2000 4000 8000 16000
do
	echo $CYCLES >> /rtmp/out
	for DIR in /pgtoprel /pgtop
	do
		echo $DIR >> /rtmp/out
		cd $DIR
		pginstall
		# need for +16k
		pipe sed 's/#max_locks_per_transaction = 64/max_locks_per_transaction = 500/' /u/pg/data/postgresql.conf
		pipe sed 's/shared_buffers = 128MB/shared_buffers = 1GB/' /u/pg/data/postgresql.conf
		pipe sed 's/#work_mem = 1MB/work_mem = 500MB/' /u/pg/data/postgresql.conf
		pipe sed 's/#maintenance_work_mem = 16MB/maintenance_work_mem = 500MB/' /u/pg/data/postgresql.conf
		pgrestart
		sleep 2
		for JOT in $(jot $CYCLES); do echo "CREATE TABLE test$JOT (x SERIAL);"; done |
			PGOPTIONS="-c synchronous_commit=off" sql test
		echo "pg_dump creation" >> /rtmp/out
		/usr/bin/time --output=/rtmp/out --append --format '%e' aspg pg_dump --schema-only test > $TMP/1
		newdb test
		echo "pg_dump restore" >> /rtmp/out
		PGOPTIONS="-c synchronous_commit=off" /usr/bin/time --output=/rtmp/out --append --format '%e' sql test < $TMP/1
		pgstop
		sleep 2
	done
done
bell

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
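As a side note on reading the timings in this thread: if runtime grows like n^k, each doubling of the table count multiplies the time by about 2^k, so the ratio between consecutive rows of a timing table exposes the exponent. A quick sketch in plain Python, using the 9.2-server pg_dump column from the timings quoted above:

```python
# Estimate the scaling exponent from consecutive doublings of the table
# count: if time ~ n^k, then t(2n)/t(n) ~ 2^k, so k = log2(t(2n)/t(n)).
from math import log2

tables = [1000, 2000, 4000, 8000, 16000]
dump_92 = [4.37, 12.98, 47.85, 210.39, 901.53]  # seconds, from the table above

for (n1, t1), (n2, t2) in zip(zip(tables, dump_92), zip(tables[1:], dump_92[1:])):
    print(f"{n1:>5} -> {n2:>5} tables: {t2 / t1:.2f}x time, exponent ~ {log2(t2 / t1):.2f}")
```

The later doublings come out near an exponent of 2, which is the N^2 behavior discussed above; a purely linear process would stay near 1.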
Re: [HACKERS] Enabling Checksums
On Fri, 2012-11-09 at 20:48 +0100, Markus Wanner wrote: Given your description of option 2 I was under the impression that each page already has a bit indicating whether or not the page is protected by a checksum. Why do you need more bits than that? The bit indicating that a checksum is present may be lost due to corruption. However, we certainly need to provide the option to go through the rewrite for other users, who are well willing to bite that bullet. That's the use case that I've been focusing on, but perhaps you are right that it's not the only important one. Do you see any real foot-guns or other show-stoppers for permanently allowing that in-between-state? The biggest problem that I see is a few bits indicating the presence of a checksum may be vulnerable to more kinds of corruption. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
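To illustrate the concern about the flag bit (a toy layout invented for this example, not PostgreSQL's page format): if a per-page flag says whether a checksum is present, corruption that clears that one flag silently disables verification, while corruption anywhere else on the page is caught.

```python
# Toy page: byte 0 is a "has checksum" flag, bytes 1-4 the checksum,
# the rest the payload. Not PostgreSQL's page layout.
import zlib

HAS_CHECKSUM = 0x01

def make_page(data: bytes) -> bytearray:
    page = bytearray([HAS_CHECKSUM, 0, 0, 0, 0]) + bytearray(data)
    page[1:5] = zlib.crc32(page[5:]).to_bytes(4, "big")
    return page

def verify(page: bytearray) -> bool:
    if not page[0] & HAS_CHECKSUM:
        return True          # no checksum claimed, so nothing to verify
    return page[1:5] == zlib.crc32(page[5:]).to_bytes(4, "big")

page = make_page(b"tuple data")
page[7] ^= 0xFF              # corrupt the payload: caught
assert not verify(page)

page = make_page(b"tuple data")
page[7] ^= 0xFF              # corrupt the payload...
page[0] = page[0] & ~HAS_CHECKSUM   # ...and the flag: corruption goes unnoticed
assert verify(page)
```

A single flipped bit is enough to turn "verification failed" into "nothing to verify", which is the extra vulnerability being described.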
Re: [HACKERS] WIP checksums patch
On Mon, 2012-11-05 at 12:19 -0500, Robert Haas wrote: Yeah. I definitely think that we could shed an enormous amount of complexity by deciding that this is, for now, an option that can only be selected at initdb time. That would remove approximately 85% of everything I've ever disliked about this patch - without, I think, precluding the possibility of improving things later. That's certainly true, but it introduces one large problem: upgrading would not work, which (in the past few releases) we've treated as a major showstopper for many features. If there is really no other good way to do it, then that might be reasonable. But it seems within grasp to at least offer an offline way to set checksums. It also occurred to me that another way to reduce the scope of this change would be to have a first version that does CRCs only for SLRU pages. That would be useful for verifying the integrity of some of our most critical data (pg_clog) and be a useful building block toward a more complete implementation. That also breaks upgrade, right? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further pg_upgrade analysis for many tables
On 2012-11-08 12:30:11 -0500, Peter Eisentraut wrote:
> On 11/7/12 9:17 PM, Bruce Momjian wrote:
>> As a followup to Magnus's report that pg_upgrade was slow for many tables, I did some more testing with many tables, e.g.:
>>
>>     CREATE TABLE test991 (x SERIAL);
>>
>> I ran it for 0, 1k, 2k, ... 16k tables, and got these results:
>>
>>     tables  pg_dump   restore   pg_upgrade  (increase)
>>     0       0.30      0.24      11.73       (-)
>>     1000    6.46      6.55      28.79       (2.45x)
>>     2000    29.82     20.96     69.75       (2.42x)
>>     4000    95.70     115.88    289.82      (4.16x)
>>     8000    405.38    505.93    1168.60     (4.03x)
>>     16000   1702.23   2197.56   5022.82     (4.30x)
>
> I can reproduce these numbers, more or less. (Additionally, it ran out of shared memory with the default setting when dumping the 8000 tables.) But this issue seems to be entirely the fault of sequences being present. When I replace the serial column with an int, everything finishes within seconds and scales seemingly linearly.

I don't know the pg_dump code at all but I would guess that without the serial there are no dependencies, so the whole dependency sorting business doesn't need to do very much...

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
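For what it's worth, the guessed-at shape of the problem is easy to demonstrate in miniature (purely illustrative Python, not pg_dump's actual code or data structures): resolving each dependency edge by scanning the full object list is O(N^2) in the number of objects, while building a lookup index first keeps the pass linear. That is the kind of difference that turns seconds into minutes at 16k tables, each of which drags in a sequence dependency.

```python
# Contrast: per-edge linear scan (quadratic overall) vs. a dict index
# built once (linear overall). Hypothetical names; this is not pg_dump.
def resolve_with_scans(objects, deps):
    out = []
    for obj, dep in deps:                 # N dependency edges...
        for candidate in objects:         # ...each scanning up to N objects
            if candidate == dep:
                out.append((obj, candidate))
                break
    return out

def resolve_with_index(objects, deps):
    index = {name: name for name in objects}   # one O(N) pass
    return [(obj, index[dep]) for obj, dep in deps]

tables = [f"test{i}" for i in range(1000)]
seqs = [f"test{i}_x_seq" for i in range(1000)]   # a serial column's sequence
edges = list(zip(tables, seqs))
assert resolve_with_scans(tables + seqs, edges) == resolve_with_index(tables + seqs, edges)
```

Both produce the same answer; only the cost differs, which matches the symptom of an int column (no dependency edges) scaling linearly while SERIAL does not.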
Re: [HACKERS] WIP checksums patch
On Fri, 2012-11-09 at 10:18 -0500, Robert Haas wrote: Sure, I agree. I don't think it should stay that way forever, but removing the burden of dealing with this issue from the initial commit would likely allow that commit to happen this release cycle, perhaps even in the next CommitFest. And then we'd have half a loaf, which is better than none, and we could deal with the issues of switching it on and off as a further enhancement. Just after sending the last email, I realized that it can be separated into separate commits fairly naturally, I think. So, I agree with you that we should focus on an initdb setting for the next commitfest and try for at least an offline migration tool (if not online) later. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Inadequate thought about buffer locking during hot standby replay
During normal running, operations such as btree page splits are extremely careful about the order in which they acquire and release buffer locks, if they're doing something that concurrently modifies multiple pages. During WAL replay, that all goes out the window. Even if an individual WAL-record replay function does things in the right order for standard cases, RestoreBkpBlocks has no idea what it's doing. So if one or more of the referenced pages gets treated as a full-page image, we are left with no guarantee whatsoever about what order the pages are restored in. That never mattered when the code was originally designed, but it sure matters during Hot Standby when other queries might be able to see the intermediate states.

I can't prove that this is the cause of bug #7648, but it's fairly easy to see that it could explain the symptom. You only need to assume that the page-being-split had been handled as a full-page image, and that the new right-hand page had gotten allocated by extending the relation. Then there will be an interval just after RestoreBkpBlocks does its thing where the updated left-hand sibling is in the index and is not locked in any way, but its right-link points off the end of the index. If a few indexscans come along before the replay process gets to continue, you'd get exactly the reported errors.

I'm inclined to think that we need to fix this by getting rid of RestoreBkpBlocks per se, and instead having the per-WAL-record restore routines dictate when each full-page image is restored (and whether or not to release the buffer lock immediately). That's not going to be a small change unfortunately :-(

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
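A toy model of the window being described (invented for illustration, not PostgreSQL code) may help: replay a page split as two independent page restores, with a "hot standby query" allowed to run between them. The failure appears exactly when the left page, the full-page image carrying the new right-link, becomes visible before the new right page exists.

```python
# index_pages maps page number -> right-link (None means rightmost page).
index_pages = {}

def follow_right_link(page_no):
    """A concurrent indexscan chasing right-links; fails on a dangling link."""
    link = index_pages[page_no]
    if link is not None and link not in index_pages:
        raise LookupError(f"right-link {link} points past the end of the index")
    return link

def replay_split(left, right, left_first):
    """Replay a split as two page restores; the generator's pause stands in
    for the window where hot-standby queries can run between them."""
    steps = [(left, right), (right, None)]
    if not left_first:
        steps.reverse()
    for page_no, link in steps:
        index_pages[page_no] = link
        yield page_no

# Unsafe order (what an order-oblivious RestoreBkpBlocks might do): the left
# page is restored and unlocked first, so a reader sees a dangling right-link.
index_pages = {1: None}
replay = replay_split(left=1, right=2, left_first=True)
next(replay)
try:
    follow_right_link(1)
    saw_error = False
except LookupError:
    saw_error = True
assert saw_error

# Safe order: the right page exists before the left page's link is updated,
# so a reader sees either the old rightmost page or a fully linked pair.
index_pages = {1: None}
replay = replay_split(left=1, right=2, left_first=False)
next(replay)
assert follow_right_link(1) is None
```

The fix being proposed amounts to making the replay routines, which know the required ordering, control when each full-page image becomes visible, rather than leaving it to chance.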
Re: [HACKERS] WIP checksums patch
On Mon, 2012-10-01 at 10:22 -0700, Josh Berkus wrote: I think that's OK, because it's still protected by the WAL CRC, and there's no expectation that the checksum is correct in shared buffers, and the correct checksum should be set on the next checkpoint. Just an observation. We'd need to document that emphatically. Otherwise folks running on ZFS and/or FusionIO with atomic writes (and, in the future, BTRFS) will assume that they can turn full_page_writes off and checksums on, and clearly that won't work with the current code. I think that's an acceptable limitation, I just think we need to document it carefully, and maybe throw a warning if people start up in that configuration. What situation are you concerned about here? I think that COW filesystems should still be safe with full_page_writes off, right? The checksum is calculated before every write, and the COW filesystems do atomic writes, so the checksums should always be fine. What am I missing? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
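A sketch of that reasoning (toy page layout invented for this example, not PostgreSQL's): the checksum is computed over the page before each write, so a torn write, where only some sectors of the new page reach disk, fails verification, while an atomic write (as a COW filesystem provides) always leaves a page whose checksum matches its contents.

```python
import zlib

PAGE, SECTOR = 8192, 512

def checksummed(data: bytes) -> bytes:
    """Build a page whose first 4 bytes are a CRC of the rest (toy layout)."""
    page = bytearray(data[:PAGE].ljust(PAGE, b"\0"))
    page[:4] = zlib.crc32(page[4:]).to_bytes(4, "big")
    return bytes(page)

def verify(page: bytes) -> bool:
    return page[:4] == zlib.crc32(page[4:]).to_bytes(4, "big")

def write_page(disk: bytearray, page: bytes, crash_at: int = PAGE) -> None:
    """Write sector by sector; a crash mid-write leaves a torn page."""
    for off in range(0, min(crash_at, PAGE), SECTOR):
        disk[off:off + SECTOR] = page[off:off + SECTOR]

old = checksummed(b"old" * 3000)   # two pages whose contents differ throughout
new = checksummed(b"new" * 3000)

disk = bytearray(old)
write_page(disk, new)                 # complete (atomic) write: verifies
assert verify(bytes(disk))

disk = bytearray(old)
write_page(disk, new, crash_at=4096)  # torn write: half new, half old
assert not verify(bytes(disk))
```

This is why a torn page is only a problem for checksums when full_page_writes is off and the storage does not write the whole page atomically: there is then no full-page image in WAL to restore the page from after the mismatch is detected.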
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On 2012-11-09 18:24:25 -0500, Tom Lane wrote: I can't prove that this is the cause of bug #7648, but it's fairly easy to see that it could explain the symptom. You only need to assume that the page-being-split had been handled as a full-page image, and that the new right-hand page had gotten allocated by extending the relation. Then there will be an interval just after RestoreBkpBlocks does its thing where the updated left-hand sibling is in the index and is not locked in any way, but its right-link points off the end of the index. If a few indexscans come along before the replay process gets to continue, you'd get exactly the reported errors. Sounds plausible. I'm inclined to think that we need to fix this by getting rid of RestoreBkpBlocks per se, and instead having the per-WAL-record restore routines dictate when each full-page image is restored (and whether or not to release the buffer lock immediately). That's not going to be a small change unfortunately :-( I wonder if we couldn't instead fix it by ensuring the backup blocks are in the right order in the backup blocks at the inserting location. That would just need some care about the order of XLogRecData blocks. I am pretty unfamiliar with the nbtree locking but I seem to remember that we should be fine if we always restore from left to right? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On Fri, Nov 9, 2012 at 3:24 PM, Tom Lane t...@sss.pgh.pa.us wrote: During normal running, operations such as btree page splits are extremely careful about the order in which they acquire and release buffer locks, if they're doing something that concurrently modifies multiple pages. During WAL replay, that all goes out the window. Even if an individual WAL-record replay function does things in the right order for standard cases, RestoreBkpBlocks has no idea what it's doing. So if one or more of the referenced pages gets treated as a full-page image, we are left with no guarantee whatsoever about what order the pages are restored in. That never mattered when the code was originally designed, but it sure matters during Hot Standby when other queries might be able to see the intermediate states. I can't prove that this is the cause of bug #7648, (I was the reporter of 7648) To lend slightly more circumstantial evidence in support of this, I also happened to note that the relfile in question was the last segment and it was about a quarter full, so the access attempt was definitely at the extreme outermost edge of the index most generally. -- fdr -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further pg_upgrade analysis for many tables
On Thu, Nov 8, 2012 at 7:25 PM, Bruce Momjian br...@momjian.us wrote:
> I did some more research and realized that I was not using --schema-only like pg_upgrade uses. With that setting, things look like this: ...

For profiling pg_dump in isolation, you should also specify --binary-upgrade. I was surprised that it makes a big difference, slowing it down by about 2-fold.

Cheers,

Jeff

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
Andres Freund and...@anarazel.de writes: On 2012-11-09 18:24:25 -0500, Tom Lane wrote: I'm inclined to think that we need to fix this by getting rid of RestoreBkpBlocks per se, and instead having the per-WAL-record restore routines dictate when each full-page image is restored (and whether or not to release the buffer lock immediately). That's not going to be a small change unfortunately :-( I wonder if we couldn't instead fix it by ensuring the backup blocks are in the right order in the backup blocks at the inserting location. That would just need some care about the order of XLogRecData blocks. I don't think that's a good way to go. In the first place, if we did that the fix would require incompatible changes in the contents of WAL streams. In the second place, there are already severe constraints on the positioning of backup blocks to ensure that WAL records can be uniquely decoded (the section of access/transam/README about WAL coding touches on this) --- I don't think it's a good plan to add still more constraints there. And in the third place, the specific problem we're positing here results from a failure to hold the buffer lock for a full-page image until after we're done restoring a *non* full-page image represented elsewhere in the same WAL record. In general, of the set of pages touched by a WAL record, any arbitrary subset of them might be converted to FPIs during XLogInsert; but the replay-time locking requirements are going to be the same regardless of that. So AFAICS, any design in which RestoreBkpBlocks acts independently of the non-full-page-image updates is just broken. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further pg_upgrade analysis for many tables
On Fri, Nov 9, 2012 at 3:06 PM, Bruce Momjian br...@momjian.us wrote:
> On Thu, Nov 8, 2012 at 08:59:21PM -0800, Jeff Janes wrote:
>> On Thu, Nov 8, 2012 at 4:33 PM, Bruce Momjian br...@momjian.us wrote:
>>> I am actually now dumping git head/9.3, so I assume all the problems we know about should be fixed.
>>
>> Are you sure the server you are dumping out of is head?
>
> Well, I tested again with 9.2 dumping/loading 9.2 and the same for git head, and got these results:
>
>             pg_dump           restore
>     tables  9.2      git      9.2      git
>     1       0.13     0.11     0.07     0.07
>     1000    4.37     3.98     4.32     5.28
>     2000    12.98    12.19    13.64    14.25
>     4000    47.85    50.14    61.31    70.97
>     8000    210.39   183.00   302.67   294.20
>     16000   901.53   769.83   1399.25  1359.09

For pg_dump, there are 4 possible combinations, not just two: you can use 9.2's pg_dump to dump from a 9.2 server, use git's pg_dump to dump from a 9.2 server, use git's pg_dump to dump from a git server, or use 9.2's pg_dump to dump from a git server (although that last one isn't very relevant).

> As you can see, there is very little difference between 9.2 and git head, except maybe at the 16k level for pg_dump. Is there some slowdown with a mismatched version dump/reload? I am attaching my test script.

Sorry, from the script I can't really tell what versions are being used for what.

>> Using head's pg_dump, but 9.2.1 server, it takes me 179.11 seconds to dump 16,000 tables (schema only) like your example, and it is definitely quadratic.
>
> Are you using a SERIAL column for the tables? I am, and Peter Eisentraut reported that was a big slowdown.

Yes, I'm using the same table definition as your example.

>> But using head's pg_dump to dump tables out of head's server, it only took 24.95 seconds, and the quadratic term is not yet important, things still look linear.
>
> Again, using SERIAL?

Yep.

>> Is the next value, increment, etc. for a sequence stored in a catalog, or are they stored in the 8kb file associated with each sequence?
>> If they are stored in the file, then it is a shame that pg_dump goes to the effort of extracting that info if pg_upgrade is just going to overwrite it anyway.
>
> Each sequence is stored in its own 1-row 8k table:
>
>     test=> CREATE SEQUENCE seq;
>     CREATE SEQUENCE
>     test=> SELECT * FROM seq;
>     -[ RECORD 1 ]-+--------------------
>     sequence_name | seq
>     last_value    | 1
>     start_value   | 1
>     increment_by  | 1
>     max_value     | 9223372036854775807
>     min_value     | 1
>     cache_value   | 1
>     log_cnt       | 0
>     is_cycled     | f
>     is_called     | f
>
> Actually, pg_upgrade needs pg_dump to restore all those sequence values.

I did an experiment where I had pg_dump just output dummy values rather than hitting the database. Once pg_upgrade moves the relation files over, the dummy values disappear and are set back to their originals. So I think that pg_upgrade depends on pg_dump only in a trivial way -- the values need to be there, but it doesn't matter what they are.

Cheers,

Jeff

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL
On Thursday, November 08, 2012 7:56 PM Amit Kapila wrote:
> On Thursday, November 08, 2012 1:45 AM Robert Haas wrote:
>> On Wed, Nov 7, 2012 at 2:50 PM, Josh Berkus j...@agliodbs.com wrote:
>>> Well, Magnus' proposed implementation supposed that the existing values *have* been loaded into the current session. I agree that with some locking and yet more code you could implement it without that. But this still doesn't seem to offer any detectable benefit over value-per-file.
>>>
>>> Well, value-per-file is ugly (imagine you've set 40 different variables that way) but dodges a lot of complicated issues. And I suppose ugly doesn't matter, because the whole idea of the auto-generated files is that users aren't supposed to look at them anyway.
>>
>> That's pretty much how I feel about it, too. I think value-per-file is an ugly wimp-out that shouldn't really be necessary to solve this problem. It can't be that hard to rewrite a file where every line is of the form: key = 'value'. I also believe that it should be possible to rewrite a file without loading values into the current session.
>
> One solution, if we assume that the file is of fixed format and each record (key = 'value') is of fixed length, can be:
>
> 1. While writing the .auto file, always assume that the .auto file contains all config parameters. As the .auto file is of fixed format and fixed record size, a given record can be written directly at its particular position.
>
> 2. To handle locking issues, we can follow an approach similar to what git does for editing conf files (using a .lock file):
>    a. copy the latest content of .auto to .auto.lock
>    b. make all the changes to the .auto.lock file
>    c. at the end of the command, rename the .auto.lock file to the .auto file
>    d. otherwise, if the SQL command/function failed in-between, delete the .auto.lock file
>
> 3. For two backends trying to write to the .auto file, we can use .auto.lock as the lock by trying to create it in exclusive mode as the first step of the command. If it already exists then the backend needs to wait.
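A minimal sketch of steps 2 and 3 above (file names like ".auto" are from the proposal, the fixed key = 'value' format is assumed, and Python stands in for the backend C code): exclusive creation of the lock file serves as the inter-writer lock, and an atomic rename publishes the rewritten file.

```python
import os
import tempfile

def set_auto_value(auto_path: str, key: str, value: str) -> None:
    lock_path = auto_path + ".lock"
    # Step 3: O_EXCL makes creation fail if another writer holds the lock.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        # Steps 2a/2b: copy current settings into the lock file, with the
        # requested key replaced (or appended).
        settings = {}
        if os.path.exists(auto_path):
            with open(auto_path) as f:
                for line in f:
                    k, _, v = line.partition(" = ")
                    settings[k] = v.strip().strip("'")
        settings[key] = value
        with os.fdopen(fd, "w") as f:
            for k, v in settings.items():
                f.write(f"{k} = '{v}'\n")
        # Step 2c: rename is atomic, so readers see old or new, never partial.
        os.replace(lock_path, auto_path)
    except BaseException:
        os.unlink(lock_path)  # step 2d: a failed command drops the lock file
        raise

auto = os.path.join(tempfile.mkdtemp(), "postgresql.conf.auto")
set_auto_value(auto, "work_mem", "500MB")
set_auto_value(auto, "work_mem", "1GB")
with open(auto) as f:
    assert f.read() == "work_mem = '1GB'\n"
```

If a second writer calls this while the lock file exists, os.open fails with EEXIST, which corresponds to the "backend needs to wait" case in step 3.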
Please let me know if there are any objections or problems with the above method of implementation; otherwise I can go ahead and prepare the patch for the coming CF. For the initial version I will use a function as the syntax to provide this feature.

With Regards,
Amit Kapila.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers