Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 28 June 2015 at 17:17, Tom Lane t...@sss.pgh.pa.us wrote:
> Simon Riggs si...@2ndquadrant.com writes:
>> On 27 June 2015 at 15:10, Tom Lane t...@sss.pgh.pa.us wrote:
>>> I don't like this too much because it will fail badly if the caller is wrong about the maximum possible page number for the table, which seems not exactly far-fetched. (For instance, remember those kernel bugs we've seen that cause lseek to lie about the EOF position?)
>> If that is true, then our reliance on lseek elsewhere could also cause data loss, for example by failing to scan data during a seq scan.
> The lseek point was a for-example, not the entire universe of possible problem sources for this patch. (Also, underestimating the EOF point in a seqscan is normally not an issue since any rows in a just-added page are by definition not visible to the scan's snapshot. But I digress.)
>> The consequences of failure of lseek in this case are nowhere near as dire, since by definition the data is being destroyed by the user.
> I'm not sure what you consider dire, but missing a dirty buffer belonging to the to-be-destroyed table would result in the system being permanently unable to checkpoint, because attempts to write out the buffer to the no-longer-extant file would fail. You could only get out of the situation via a forced database crash (immediate shutdown), followed by replaying all the WAL since the time of the problem. In production contexts that could be pretty dire.

Yes, it's bad, but we do notice that it has happened. We can also put in code to specifically avoid this error at checkpoint time.

If lseek fails badly then SeqScans would give *silent* data loss, which in my view is worse. Just-added pages aren't the only thing we might miss if lseek is badly wrong.

So I think this patch still has legs.
We can check that the clean-up has been 100% when we do the buffer scan at the start of the checkpoint - that way we do just one scan of the buffer pool and move a time-consuming operation into a background process.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] proposal: condition blocks in psql
On Sun, Jun 28, 2015 at 1:59 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
>>>> \if_ver_eq 9.2 ... What do you think about it?
>>> Couldn't this kind of thing be done directly with PL/pgSQL?
>> you can use PL/pgSQL - but there are some limits:
>> * maintenance of large plpgsql functions
> I agree with large, but that would not necessarily mean complex. Also, some functions could be in SQL, and just the logic with PL/pgSQL.
>> * the plpgsql functions or anonymous blocks create transaction borders - which may not be wanted
> Hmmm... If something fails when installing an extension, a transaction border is probably a good thing? Also, the interaction of \if with possible BEGIN/COMMIT can lead to strange states.

Manual transaction control is the killer feature IMO; not being able to do it forces code out of SQL and into a scripting language. Transaction management in 'psql scripting' is no more or less fragile than in most imperative languages.

Personally, I prefer a server-side solution to this problem (stored procedures) so that the application can leverage this functionality through the protocol. However, psql extensions are probably worth it in their own right.

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.
On Wed, Mar 18, 2015 at 1:59 PM, Michael Paquier michael.paqu...@gmail.com wrote:
> Feedback is of course welcome, but note that I am not seriously expecting any until we get into the 9.6 development cycle, and I am adding this patch to the next CF.

I have moved this patch to CF 2015-09, as I have enough patches to take care of for now... Let's focus on Windows support and improvement of logging for TAP in the first round. That will already be a good step forward.
--
Michael
Re: [HACKERS] anole: assorted stability problems
On 2015-06-29 12:11:08 +0200, Andres Freund wrote:
> On 2015-06-29 00:42:53 -0400, Tom Lane wrote:
>> #define S_UNLOCK(lock) \
>>     do { _Asm_sched_fence(); (*(lock)) = 0; } while (0)
> Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler barrier? Shouldn't this be an _Asm_mf()?

I've pushed a patch replacing the _sched_fence with _mf. That's what we use in the atomics code as well. I did reread http://h21007.www2.hp.com/portal/download/files/unprot/Itanium/inline_assem_ERS.pdf and I do think _Asm_mf is the correct thing.

What I did not change was the mask used for serialization - the default 0x3D3D (c.f. page 23) should be correct, even though slightly suboptimal.

Andres
Re: [HACKERS] UPSERT on partition
On 2015-06-25 AM 09:51, Amit Langote wrote:
> Peter,
> On 2015-06-25 AM 02:35, Peter Geoghegan wrote:
>> Inheritance with triggers is a leaky abstraction, so this kind of thing is always awkward. Still, UPSERT has full support for *inheritance* -- that just doesn't help in this case.
> Could you clarify what UPSERT's support for inheritance entails?

Oh, I see that this stuff has been discussed (-hackers) and written about (wiki). I'll go read about it.

I agree with Fujii-san's concern that any limitations of UPSERT on partitions, given those of the partitioning approach, should likewise be documented; if not under INSERT/UPSERT, then under partitioning; not really sure which.

Thanks,
Amit
Re: [HACKERS] anole: assorted stability problems
On 2015-06-29 00:42:53 -0400, Tom Lane wrote:
> #define S_UNLOCK(lock) \
>     do { _Asm_sched_fence(); (*(lock)) = 0; } while (0)

Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler barrier? Shouldn't this be an _Asm_mf()?

> which immediately raises the question of why omitting the volatile cast is okay.

Should be fine if _Asm_sched_fence() were a proper memory (or even a release) barrier. Which I don't think it is.

> I also wonder if we don't need a second _Asm_sched_fence() after the lock release.

Shouldn't be needed - the only thing that could be reordered is the actual lock release. Which will just impact timing in a minor manner (it can't move into another locked section).

Greetings,

Andres Freund
Re: [HACKERS] Rework the way multixact truncations work
On 2015-06-29 23:54:40 +1200, Thomas Munro wrote:
> On Mon, Jun 22, 2015 at 7:24 AM, Andres Freund and...@anarazel.de wrote:
>> It'd be very welcome to see some wider testing and review on this.
> I started looking at this and doing some testing. Here is some initial feedback:
> Perhaps vac_truncate_clog needs a new name now that it does more, maybe vac_truncate_transaction_logs?

It has done more before, so I don't really see a connection to this patch...

> MultiXactState->sawTruncationCkptCyle: is 'Cyle' supposed to be 'Cycle'?

Oops.

> In the struct xl_multixact_truncate, and also the function WriteMTruncateXlogRec and other places, I think you have confused the typedefs MultiXactOffset and MultiXactId. If I'm not mistaken, startMemb and endMemb have the correct type, but startOff and endOff should be of type MultiXactId despite their names (the *values* stored inside pg_multixact/offsets are indeed offsets (into pg_multixact/members), but their *location* is what a multixact ID represents).

IIRC I did it that way to make clear this is just a 'byte'-type offset, without other meaning. Wasn't a good idea.

> I was trying to understand if there could be any problem caused by setting latest_page_number to the pageno that holds (or will hold) xlrec.endOff in multixact_redo. The only thing that jumps out at me is that the next call to SlruSelectLRUPage will no longer be prevented from evicting the *real* latest page (the most recently created page).

That hasn't changed, unless I'm missing something?

> In SlruDeleteSegment, is it OK to call unlink() while holding the SLRU control lock?

I think it's safer than not doing it, but don't particularly care.

> In find_multixact_start you call SimpleLruFlush before calling SimpleLruDoesPhysicalPageExist. Should we do something like this instead? https://gist.github.com/macdice/8e5b2f0fe3827fdf3d5a

I'm currently slightly inclined to do it my way. The way these functions are used, it doesn't seem like a bad property to ensure things are on disk.
> I think I saw some extra autovacuum activity that I didn't expect in a simple scenario, but I'm not sure and will continue looking tomorrow.

Cool, thanks!

Greetings,

Andres Freund
Re: [HACKERS] pg_trgm version 1.2
On Sat, Jun 27, 2015 at 5:17 PM, Jeff Janes jeff.ja...@gmail.com wrote:
> This patch implements version 1.2 of contrib module pg_trgm.
> This supports the triconsistent function, introduced in version 9.4 of the server, to make it faster to implement indexed queries where some keys are common and some are rare.
> I've included the paths to both upgrade and downgrade between 1.1 and 1.2, although after doing so you must close and restart the session before you can be sure the change has taken effect. There is no change to the on-disk index structure.
> This shows the difference it can make in some cases:
>
> create extension pg_trgm version 1.1;
> create table foo as select
>   md5(random()::text) ||
>   case when random() < 0.05 then 'lmnop' else '123' end ||
>   md5(random()::text) as bar
> from generate_series(1,1000);
> create index on foo using gin (bar gin_trgm_ops);
> --some queries
> alter extension pg_trgm update to 1.2;
> --close, reopen, more queries
> select count(*) from foo where bar like '%12344321lmnabcddd%';
>
> V1.1: Time: 1743.691 ms --- after repeated execution to warm the cache
> V1.2: Time: 2.839 ms --- after repeated execution to warm the cache

Wow! I'm going to test this. I have some data sets for which trigram searching isn't really practical... if the search string touches trigrams with a lot of duplication, the algorithm can have trouble beating brute-force searches. Trigram searching is important: it's currently the only way to search string-encoded structures for partial strings quickly.

merlin
Re: [HACKERS] Rework the way multixact truncations work
On Mon, Jun 22, 2015 at 7:24 AM, Andres Freund and...@anarazel.de wrote:
> It'd be very welcome to see some wider testing and review on this.

I started looking at this and doing some testing. Here is some initial feedback:

Perhaps vac_truncate_clog needs a new name now that it does more, maybe vac_truncate_transaction_logs?

MultiXactState->sawTruncationCkptCyle: is 'Cyle' supposed to be 'Cycle'?

In the struct xl_multixact_truncate, and also the function WriteMTruncateXlogRec and other places, I think you have confused the typedefs MultiXactOffset and MultiXactId. If I'm not mistaken, startMemb and endMemb have the correct type, but startOff and endOff should be of type MultiXactId despite their names (the *values* stored inside pg_multixact/offsets are indeed offsets (into pg_multixact/members), but their *location* is what a multixact ID represents).

I was trying to understand if there could be any problem caused by setting latest_page_number to the pageno that holds (or will hold) xlrec.endOff in multixact_redo. The only thing that jumps out at me is that the next call to SlruSelectLRUPage will no longer be prevented from evicting the *real* latest page (the most recently created page).

In SlruDeleteSegment, is it OK to call unlink() while holding the SLRU control lock?

In find_multixact_start you call SimpleLruFlush before calling SimpleLruDoesPhysicalPageExist. Should we do something like this instead? https://gist.github.com/macdice/8e5b2f0fe3827fdf3d5a

I think I saw some extra autovacuum activity that I didn't expect in a simple scenario, but I'm not sure and will continue looking tomorrow.

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] pg_file_settings view vs. Windows
Sawada Masahiko sawada.m...@gmail.com writes:
> On Mon, Jun 29, 2015 at 12:20 AM, Tom Lane t...@sss.pgh.pa.us wrote:
>> So let me make a radical proposal that both gets rid of the portability problem and, arguably, makes the view more useful than it is today. I think we should define the view as showing, not what was in the config file(s) at the last SIGHUP, but what is in the files *now*. That means you could use it to validate edits before you attempt to apply them, rather than having to pull the trigger and then ask if it worked. And yet it would remain just about as useful as it is now for post-mortem checks when a SIGHUP didn't do what you expected.
> You mean that we parse each GUC parameter in the configuration file, and then pass changeVal=false to set_config_option whenever pg_file_settings is referred to. So this view would be used for checking whether the current configuration file would be applied successfully or not, right?

Well, it would check whether the current contents of the file could be applied successfully.

> Such a feature would also be good, but the main purpose of pg_file_settings was to resolve where each GUC parameter came from, especially in the case of using the ALTER SYSTEM command. I'm not sure that it would be adequate for our original purpose.

I'm not following. Surely pg_settings itself is enough to tell you where the currently-applied GUC value came from.

regards, tom lane
[HACKERS] Reduce ProcArrayLock contention
I have been working on analyzing different ways to reduce the contention around ProcArrayLock. I have evaluated mainly two ideas. The first is to partition ProcArrayLock (the basic idea is to allow multiple clients (equal to the number of ProcArrayLock partitions) to perform ProcArrayEndTransaction and then wait for all of them at GetSnapshotData time), and the second is to have a mechanism to GroupClear the XID during ProcArrayEndTransaction(). The second idea clearly stands out in my tests, so I have prepared a patch for it to discuss further here.

The idea behind the second approach (GroupClear XID) is: first try to get ProcArrayLock conditionally in ProcArrayEndTransaction(); if we get the lock, clear the advertised XID, else set a flag (which indicates that the advertised XID needs to be cleared for this proc) and push this proc onto pendingClearXidList. Except for one proc, all other procs will wait for their XID to be cleared. The one allowed proc will attempt the lock acquisition; after acquiring the lock, it pops all of the requests off the list using compare-and-swap, servicing each one before moving to the next proc, and clearing their XIDs. After servicing all the requests on pendingClearXidList, it releases the lock and once again goes through the saved pendingClearXidList to wake all the processes waiting for their XID to be cleared. To set the appropriate value for ShmemVariableCache->latestCompletedXid, we need to advertise latestXid in case the proc needs to be pushed onto pendingClearXidList. The attached patch implements the above idea.

Performance Data
----------------
RAM - 500GB
8 sockets, 64 cores (hyperthreaded, 128 threads total)

Non-default parameters:
max_connections = 150
shared_buffers = 8GB
min_wal_size = 10GB
max_wal_size = 15GB
checkpoint_timeout = 35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

pgbench setup: scale factor - 300. Data is on magnetic disk and WAL on SSD.
pgbench -M prepared tpc-b

Head: commit 51d0fe5d
Patch-1: group_xid_clearing_at_trans_end_rel_v1

Client Count      1      8     16     32     64    128
HEAD            814   6092  10899  19926  23636  17812
Patch-1        1086   6483  11093  19908  31220  28237

The graph for the data is attached.

Points about the performance data:
1. Gives a good performance improvement at or above 64 clients and a somewhat moderate improvement at lower client counts. The reason is that the contention around ProcArrayLock is mainly seen at higher client counts. I have checked that at higher client counts it starts behaving locklessly (which means performance with the patch is equivalent to just commenting out ProcArrayLock in ProcArrayEndTransaction()).
2. There is some noise in this data (at 1 client, I don't expect much difference).
3. I have done similar tests on a POWER8 machine and found similar gains.
4. The gains are visible when the data fits in shared_buffers; for other workloads I/O starts dominating.
5. I have seen that the effect of the patch is much more visible if we keep autovacuum = off (doing a manual vacuum after each run) and keep wal_writer_delay at a lower value (say 20ms). I have not included that data here, but if somebody is interested, I can do the detailed tests against HEAD with those settings and share the results.

Here are the steps used to take the data (they are repeated for each reading):
1. Start server
2. dropdb postgres
3. createdb postgres
4. pgbench -i -s 300 postgres
5. pgbench -c $threads -j $threads -T 1800 -M prepared postgres
6. checkpoint
7. Stop server

Thanks to Robert Haas for having a discussion (offlist) about the idea and suggestions to improve it, and also Andres Freund for having a discussion and sharing thoughts about this idea at PGCon.

Suggestions?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

group_xid_clearing_at_trans_end_v1.patch
Description: Binary data
Re: [HACKERS] Oh, this is embarrassing: init file logic is still broken
F.Y.I. here are the results I got on my laptop (Ubuntu 14, i7-4600U, 16GB mem, 512GB SSD). Unlike Josh, I used Unix domain sockets. In summary:

9.4.3: 943.439840
9.4.4: 429.953953
9.4 stable as of June 30: 929.804491

So compared with 9.4.3, 9.4.4 is 54% slower, and 9.4-stable is 1.4% slower. I think the difference between 9.4.3 and 9.4-stable is measurement error, but the difference between 9.4.3 and 9.4.4 is real. The good thing is that 9.4-stable fixes the issue as expected. Thanks Tom! Below are the details.

9.4.3:

t-ishii@localhost: pgbench -p 5434 -s 100 -j 2 -c 6 -r -C -S -T 1200 test
Scale option ignored, using pgbench_branches table count = 100
starting vacuum...end.
transaction type: SELECT only
scaling factor: 100
query mode: simple
number of clients: 6
number of threads: 2
duration: 1200 s
number of transactions actually processed: 1132130
latency average: 6.360 ms
tps = 943.439840 (including connections establishing)
tps = 64573.165915 (excluding connections establishing)
statement latencies in milliseconds:
	0.002809	\set naccounts 10 * :scale
	0.001427	\setrandom aid 1 :naccounts
	4.254619	SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

9.4.4:

t-ishii@localhost: pgbench -p 5433 -s 100 -j 2 -c 6 -r -C -S -T 1200 test
Scale option ignored, using pgbench_branches table count = 100
starting vacuum...end.
transaction type: SELECT only
scaling factor: 100
query mode: simple
number of clients: 6
number of threads: 2
duration: 1200 s
number of transactions actually processed: 515946
latency average: 13.955 ms
tps = 429.953953 (including connections establishing)
tps = 56235.664740 (excluding connections establishing)
statement latencies in milliseconds:
	0.003311	\set naccounts 10 * :scale
	0.001653	\setrandom aid 1 :naccounts
	9.320204	SELECT abalance FROM pgbench_accounts WHERE aid = :aid

9.4-stable:

t-ishii@localhost: pgbench -p 5435 -s 100 -j 2 -c 6 -r -C -S -T 1200 test
Scale option ignored, using pgbench_branches table count = 100
starting vacuum...end.
transaction type: SELECT only
scaling factor: 100
query mode: simple
number of clients: 6
number of threads: 2
duration: 1200 s
number of transactions actually processed: 1115767
latency average: 6.453 ms
tps = 929.804491 (including connections establishing)
tps = 64676.863921 (excluding connections establishing)
statement latencies in milliseconds:
	0.002803	\set naccounts 10 * :scale
	0.001395	\setrandom aid 1 :naccounts
	4.316686	SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

> OK, this is pretty bad in its real performance effects. On a workload which is dominated by new connection creation, we've lost about 17% throughput. To test it, I ran pgbench -s 100 -j 2 -c 6 -r -C -S -T 1200 against a database which fits in shared_buffers on two different m3.large instances on AWS (across the network, not on unix sockets).
>
> A typical run on 9.3.6 looks like this:
>
> scaling factor: 100
> query mode: simple
> number of clients: 6
> number of threads: 2
> duration: 1200 s
> number of transactions actually processed: 252322
> tps = 210.267219 (including connections establishing)
> tps = 31958.233736 (excluding connections establishing)
> statement latencies in milliseconds:
> 	0.002515	\set naccounts 10 * :scale
> 	0.000963	\setrandom aid 1 :naccounts
> 	19.042859	SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
>
> Whereas a typical run on 9.3.9 looks like this:
>
> scaling factor: 100
> query mode: simple
> number of clients: 6
> number of threads: 2
> duration: 1200 s
> number of transactions actually processed: 208180
> tps = 173.482259 (including connections establishing)
> tps = 31092.866153 (excluding connections establishing)
> statement latencies in milliseconds:
> 	0.002518	\set naccounts 10 * :scale
> 	0.000988	\setrandom aid 1 :naccounts
> 	23.076961	SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
>
> Numbers are pretty consistent on four runs each on two different instances (+/- 4%), so I don't think this is Amazon variability we're seeing. I think the syscache invalidation is really costing us 17%.
>
> :-(
>
> --
> Josh Berkus
> PostgreSQL Experts Inc.
> http://pgexperts.com
Re: [HACKERS] Reduce ProcArrayLock contention
On June 29, 2015 7:02:10 PM GMT+02:00, Simon Riggs si...@2ndquadrant.com wrote:
> On 29 June 2015 at 16:27, Amit Kapila amit.kapil...@gmail.com wrote:
>> Thanks to Robert Haas for having a discussion (offlist) about the idea and suggestions to improve it, and also Andres Freund for having a discussion and sharing thoughts about this idea at PGCon.
> Weird. This patch is implemented exactly the way I said to implement it publicly at PgCon. Was nobody recording the discussion at the unconference??

Amit presented an earlier version of this at the unconference?
---
Please excuse brevity and formatting - I am writing this on my mobile phone.
Re: [HACKERS] Reduce ProcArrayLock contention
On 29 June 2015 at 16:27, Amit Kapila amit.kapil...@gmail.com wrote:
> Thanks to Robert Haas for having a discussion (offlist) about the idea and suggestions to improve it, and also Andres Freund for having a discussion and sharing thoughts about this idea at PGCon.

Weird. This patch is implemented exactly the way I said to implement it publicly at PgCon. Was nobody recording the discussion at the unconference??

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Support for N synchronous standby servers - take 2
On 06/29/2015 01:01 AM, Michael Paquier wrote:

On Mon, Jun 29, 2015 at 4:20 AM, Josh Berkus j...@agliodbs.com wrote: Right. Well, another reason we should be using a system catalog and not a single GUC ...

I assume that this takes into account the fact that you will still need a SIGHUP to reload properly the new node information from those catalogs and to track if some information has been modified or not.

Well, my hope was NOT to need a sighup, which is something I see as a failing of the current system.

And the fact that a connection to those catalogs will be needed as well, something that we don't have now.

Hmmm? I was envisioning the catalog being used as one on the master. Why do we need an additional connection for that? Don't we already need a connection in order to update pg_stat_replication?

Another barrier to the catalog approach is that catalogs get replicated to the standbys, and I think that we want to avoid that.

Yeah, it occurred to me that that approach has its downside as well as an upside. For example, you wouldn't want a failed-over new master to synchrep to itself. Mostly, I was looking for something reactive, relational, and validated, instead of passing an unvalidated string to pg.conf and hoping that it's accepted on reload. Also some kind of catalog approach would permit incremental changes to the config instead of wholesale replacement. But perhaps you simply meant having an SQL interface with some metadata, right? Perhaps I got confused by the word 'catalog'.

No, that doesn't make any sense.

I'm personally not convinced that quorum and prioritization are compatible. I suggest instead that quorum and prioritization should be exclusive alternatives, that is that a synch set should be either a quorum set (with all members as equals) or a prioritization set (if rep1 fails, try rep2). I can imagine use cases for either mode, but not one which would involve doing both together.
Yep, separating the GUC parameter between prioritization and quorum could also be a good idea.

We're agreed, then ...

Er, I disagree here. Being able to get prioritization and quorum working together is a requirement of this feature in my opinion. Using again the example above with 2 data centers, being able to define a prioritization set on the set of nodes of data center 1, and a quorum set in data center 2, would reduce failure probability by preventing problems where, for example, one or more nodes lag behind (improving performance at the same time).

Well, then *someone* needs to define the desired behavior for all permutations of prioritized synch sets. If it's undefined, then we're far worse off than we are now.

Also I think that we must enable us to decide which server we should promote when the master server is down.

Yes, and probably my biggest issue with this patch is that it makes deciding which server to fail over to *more* difficult (by adding more synchronous options) without giving the DBA any more tools to decide how to fail over. Aside from the fact that we said we'd eventually do it, what real-world problem are we solving with this patch?

Hm. This patch needs to be coupled with improvements to pg_stat_replication to be able to represent a node tree, by basically adding to which group a node is assigned. I can draft that if needed, I am just a bit too lazy now...

Honestly, this is not a matter of tooling. Even today, if a DBA wants to change s_s_names without touching postgresql.conf, you could just run ALTER SYSTEM and then reload parameters.

You're confusing two separate things. The primary manageability problem has nothing to do with altering the parameter. The main problem is: if there is more than one synch candidate, how do we determine *after the master dies* which candidate replica was in synch at the time of failure? Currently there is no way to do that.
This proposal plans to, effectively, add more synch candidate configurations without addressing that core design failure *at all*. That's why I say that this patch decreases overall reliability of the system instead of increasing it.

When I set up synch rep today, I never use more than two candidate synch servers because of that very problem. And even with two I have to check the replay point because I have no way to tell which replica was in sync at the time of failure. Even in the current limited feature, this significantly reduces the utility of synch rep. In your proposal, where I could have multiple synch rep groups in multiple geos, how on Earth could I figure out what to do when the master datacenter dies?

BTW, ALTER SYSTEM is a strong reason to use JSON for the synch rep GUC (assuming it's one parameter) instead of some custom syntax. If it's JSON, we can validate it in psql, whereas if it's some custom syntax we have to wait for the db to reload and fail to figure out that we forgot a comma. Using JSON would also permit us to use jsonb_set and
Re: [HACKERS] Reduce ProcArrayLock contention
On 29 June 2015 at 18:11, Andres Freund and...@anarazel.de wrote:
> On June 29, 2015 7:02:10 PM GMT+02:00, Simon Riggs si...@2ndquadrant.com wrote:
>> On 29 June 2015 at 16:27, Amit Kapila amit.kapil...@gmail.com wrote:
>>> Thanks to Robert Haas for having a discussion (offlist) about the idea and suggestions to improve it, and also Andres Freund for having a discussion and sharing thoughts about this idea at PGCon.
>> Weird. This patch is implemented exactly the way I said to implement it publicly at PgCon. Was nobody recording the discussion at the unconference??
> Amit presented an earlier version of this at the unconference?

Yes, I know. And we all had a long conversation about how to do it without waking up the other procs. Forming a list, like we use for sync rep, and having just a single process walk the queue was the way I suggested then and previously. Weird.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
[HACKERS] Bug in bttext_abbrev_convert()
Commits b181a919 and arguably c79b6413 added bugs to bttext_abbrev_convert() in the process of fixing some others.

In the master branch, bttext_abbrev_convert() can leak memory when the C locale happens to be used and we must detoast, which is unacceptable for about the same reason that it's unacceptable for a standard B-Tree comparator routine. There could also be a use-after-free issue for large strings that are detoasted, because bttext_abbrev_convert() hashes memory which might already be freed (when hashing the authoritative value). The attached patch fixes these issues.

As we all know, the state of automated testing is pretty lamentable. This is the kind of thing that we could catch more easily in the future if better infrastructure were in place. I caught this by eyeballing bttext_abbrev_convert() with slightly fresher eyes than the last time I looked.

--
Peter Geoghegan

From 9aa381b1c136b9c4228b0b965c38fc62234ba251 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan peter.geoghega...@gmail.com
Date: Mon, 29 Jun 2015 16:21:16 -0700
Subject: [PATCH] Fix memory management bugs in text abbreviation

Make sure that a pfree() of an unpacked Datum value (from a toasted Datum) always occurs, including when a text sort uses the C locale. Free unpacked value memory at the latest possible opportunity, ensuring that it is not spuriously referenced after pfree().
---
 src/backend/utils/adt/varlena.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 779729d..2fbbf54 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -2034,13 +2034,9 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
 	}
 
 	/* Just like strcoll(), strxfrm() expects a NUL-terminated string */
-	memcpy(tss->buf1, VARDATA_ANY(authoritative), len);
+	memcpy(tss->buf1, authoritative_data, len);
 	tss->buf1[len] = '\0';
 
-	/* Don't leak memory here */
-	if (PointerGetDatum(authoritative) != original)
-		pfree(authoritative);
-
 	for (;;)
 	{
 #ifdef HAVE_LOCALE_T
@@ -2108,6 +2104,10 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
 
 	addHyperLogLog(&tss->abbr_card, hash);
 
+	/* Don't leak memory here */
+	if (PointerGetDatum(authoritative) != original)
+		pfree(authoritative);
+
 	return res;
 }
-- 
1.9.1
Re: [HACKERS] 9.5 release notes
Andres Freund and...@anarazel.de writes: I'm working on integrating the suggestions I made last week to the release notes. Would anybody mind if I start to add commit ids in comments, similar to what Tom has done for minor releases, to news items? +1. Helps confirm which items are meant to correspond to which commits. In case you didn't realize it already, the stuff I put into the minor release notes is from src/tools/git_changelog. Dunno whether we need to use that exact format for major releases though; there's no need to identify branches in major release notes. regards, tom lane
Re: [HACKERS] PANIC in GIN code
On 06/29/2015 07:20 PM, Jeff Janes wrote: On Mon, Jun 29, 2015 at 1:37 AM, Heikki Linnakangas hlinn...@iki.fi wrote: Another piece of info here that might be relevant. Almost all UPDATE_META_PAGE xlog records other than the last one have two backup blocks. The last UPDATE_META_PAGE record only has one backup block. And the metapage is mostly zeros:

head -c 8192 /tmp/data2_invalid_page/base/16384/16420 | od
000 00 00 161020 073642 00 00 00 00
020 00 00 00 00 053250 00 053250 00
040 006140 00 01 00 01 00 00 00
060 031215 00 000452 00 00 00 00 00
100 025370 00 00 00 02 00 00 00
120 00 00 00 00 00 00 00 00
*
002

Hmm. Looking at ginRedoUpdateMetapage, I think I see the problem: it doesn't initialize the page. It copies the metapage data, but it doesn't touch the page headers. The only way I can see that that would cause trouble is if the index somehow got truncated away or removed in the standby. That could happen in crash recovery, if you drop the index and then crash, but that should be harmless, because crash recovery doesn't try to read the metapage, only update it (by overwriting it), and by the time crash recovery has completed, the index drop is replayed too. But AFAICS that bug is present in earlier versions too. Nope, looking closer, in previous versions the page was always read from disk, even though we're overwriting it. That was made smarter in 9.5, by using the ZERO_OR_LOCK mode, but that means that the page headers indeed need to be initialized. Yes, I did see this error reported previously but it was always after the first appearance of the PANIC, so I assumed it was a sequela to that and didn't investigate it further at that time. Can you reproduce this easily? How? I can reproduce it fairly easily. [instructions to reproduce] I was actually not able to reproduce it that way, but came up with a much simpler method. The problem occurs when the metapage update WAL record is replayed, and the metapage is not in the page cache yet.
Usually it is, as if you do pretty much anything at all with the index, the metapage stays in cache. But VACUUM produces a metapage update record to update the stats, and it's pretty easy to arrange things so that that's the first record after a checkpoint, and to start recovery from that checkpoint:

postgres=# create table foo (t text[]);
CREATE TABLE
postgres=# insert into foo values ('{foo}');
INSERT 0 1
postgres=# create index i_foo on foo using gin (t);
CREATE INDEX
postgres=# vacuum foo;
VACUUM
postgres=# vacuum foo;
VACUUM
postgres=# checkpoint;
CHECKPOINT
postgres=# vacuum foo;
VACUUM

Now kill -9 postmaster, and restart. Voila, the page headers are all zeros:

postgres=# select * from page_header(get_raw_page('i_foo', 0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/1891270 |        0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
(1 row)

postgres=# select * from gin_metapage_info(get_raw_page('i_foo', 0));
ERROR:  input page is not a GIN metapage
DETAIL:  Flags 0189, expected 0008

I just pushed a fix for this, but unfortunately it didn't make it into 9.5alpha1.

- Heikki
Re: [HACKERS] Solaris testers wanted for strxfrm() behavior
Josh Berkus j...@agliodbs.com writes: Joyent confirms that the bug is fixed on SmartOS: The more interesting bit of information would be *when* it was fixed. regards, tom lane
Re: [HACKERS] 9.5 release notes
Andres Freund and...@anarazel.de writes: Haven't yet thought much about the format, maybe in the style of git log --pretty='format:[%h] %aN [%ci]: %s' upstream/master I'd personally like to see the hash and the time, the rest isn't particularly important to me. [ ... plays with that ... ] Hm. To keep down the bulk of the release notes, I'd kind of like to fit this info into single SGML comment lines, ie <!-- info here --> and if that's to fit into 80 columns without wrapping, the info is pretty constrained. I get something like this when running git log in a 70-column window:

[f78329d] Tom Lane [2015-06-29 15:42:18 -0400]: Stamp 9.5alpha1.
[85c25fd] Tom Lane [2015-06-29 15:38:46 -0400]: Desultory review of 9.
[cbc8d65] Tom Lane [2015-06-29 12:42:52 -0400]: Code + docs review for
[07cb8b0] Andres Freund [2015-06-29 14:53:32 +0200]: Replace ia64 S_UN
[c5e5d44] Peter Eisentraut [2015-06-28 23:56:55 -0400]: Translation up
[2bdc51a] Tom Lane [2015-06-28 20:49:35 -0400]: Run the C portions of
[62d16c7] Tom Lane [2015-06-28 18:06:14 -0400]: Improve design and imp
[d661532] Heikki Linnakangas [2015-06-29 00:09:10 +0300]: Also trigger
[6ab4d38] Heikki Linnakangas [2015-06-29 00:01:26 +0300]: Fix markup i
[a32c3ec] Heikki Linnakangas [2015-06-28 22:30:39 +0300]: Promote the
[a45c70a] Heikki Linnakangas [2015-06-28 22:16:21 +0300]: Fix double-X
[b36805f] Heikki Linnakangas [2015-06-28 21:35:51 +0300]: Don't choke
[cb2acb1] Heikki Linnakangas [2015-06-28 21:35:46 +0300]: Add missing_
[cca8ba9] Kevin Grittner [2015-06-28 12:43:59 -0500]: Fix comment for
[527e6d3] Tatsuo Ishii [2015-06-28 18:54:27 +0900]: Fix function decla
[0a52d37] Tom Lane [2015-06-27 17:47:39 -0400]: Avoid passing NULL to
[d47a113] Andres Freund [2015-06-27 19:00:45 +0200]: Fix test_decoding
[604e993] Kevin Grittner [2015-06-27 09:55:06 -0500]: Add opaque decla
[7845db2] Heikki Linnakangas [2015-06-27 10:17:42 +0300]: Fix typo in
[66fbcb0] Simon Riggs [2015-06-27 00:41:47 +0100]: Avoid hot standby c
[7d60b2a] Alvaro Herrera [2015-06-26 18:17:54 -0300]: Fix DDL command
[4028222] Alvaro Herrera [2015-06-26 18:13:05 -0300]: Fix BRIN xlog re
[7c02d48] Robert Haas [2015-06-26 16:04:46 -0400]: Fix grammar.
[8f15f74] Robert Haas [2015-06-26 15:53:13 -0400]: Be more conservativ
[c66bc72] Robert Haas [2015-06-26 14:49:37 -0400]: release notes: Add
[8a8c581] Robert Haas [2015-06-26 14:46:48 -0400]: Remove unnecessary
[31c018e] Robert Haas [2015-06-26 14:20:29 -0400]: release notes: Comb
[9043ef3] Robert Haas [2015-06-26 11:37:32 -0400]: Don't warn about cr

That might be all right but it seems like the timestamp is taking up an undue amount of space. If we could get git log to put out only the date, it would be better IMO, but I don't see a format option that does that :-(. I wonder if it's worth teaching git_changelog to have an option to put out summary lines in the specific format we select. regards, tom lane
Re: [HACKERS] Solaris testers wanted for strxfrm() behavior
On 06/29/2015 02:08 PM, Tom Lane wrote: Josh Berkus j...@agliodbs.com writes: Joyent confirms that the bug is fixed on SmartOS: The more interesting bit of information would be *when* it was fixed. Answer: not certain, but fixed at least 2 years ago. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] 9.5 release notes
On 2015-06-29 17:30:57 -0400, Tom Lane wrote: Andres Freund and...@anarazel.de writes: Haven't yet thought much about the format, maybe in the style of git log --pretty='format:[%h] %aN [%ci]: %s' upstream/master I'd personally like to see the hash and the time, the rest isn't particularly important to me. [ ... plays with that ... ] Hm. To keep down the bulk of the release notes, I'd kind of like to fit this info into single SGML comment lines, ie <!-- info here --> and if that's to fit into 80 columns without wrapping, the info is pretty constrained. I get something like this when running git log in a 70-column window: [f78329d] Tom Lane [2015-06-29 15:42:18 -0400]: Stamp 9.5alpha1. I agree this is a bit long, but I don't particularly care about 70/80 cols. How about:

git log --pretty='format:%cd [%h] %cN: %s' --date=short upstream/master

2015-06-29 [f78329d] Tom Lane: Stamp 9.5alpha1.
2015-06-29 [85c25fd] Tom Lane: Desultory review of 9.5 release notes.
2015-06-29 [cbc8d65] Tom Lane: Code + docs review for escaping of option values (commit 11a020eb6).
2015-06-29 [07cb8b0] Andres Freund: Replace ia64 S_UNLOCK compiler barrier with a full memory barrier.
2015-06-28 [c5e5d44] Peter Eisentraut: Translation updates
2015-06-28 [2bdc51a] Tom Lane: Run the C portions of guc-file.l through pgindent.
2015-06-28 [62d16c7] Tom Lane: Improve design and implementation of pg_file_settings view.

Greetings, Andres Freund
Re: [HACKERS] 9.5 release notes
Hi, On 2015-06-11 00:15:21 -0400, Bruce Momjian wrote: I have committed the first draft of the 9.5 release notes. I'm working on integrating the suggestions I made last week to the release notes. Would anybody mind if I start to add commit ids in comments, similar to what Tom has done for minor releases, to news items? Greetings, Andres Freund
Re: [HACKERS] pg_stat_*_columns?
On 6/26/15 6:09 PM, Joel Jacobson wrote: Can't we just use the infrastructure of PostgreSQL to handle the few megabytes of data we are talking about here? Why not just store the data in a regular table? Why bother with special files and special data structures? If it's just a table we want to produce as output, why can't we just store it in a regular table, in the pg_catalog schema? The problem is the update rate. I've never tried measuring it, but I'd bet that the stats collector can end up with 10s of thousands of updates per second. MVCC would collapse under that kind of load. What might be interesting is setting things up so the collector simply inserted into history tables every X seconds and then had a separate process to prune that data. The big problem with that is I see no way for that to easily allow access to real-time data (which is certainly necessary sometimes). -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] 9.5 release notes
Andres Freund and...@anarazel.de writes: How about: git log --pretty='format:%cd [%h] %cN: %s' --date=short upstream/master Ah yes, didn't see that option. That's definitely better. I'd still vote for restricting it to fit in an 80-column window though. regards, tom lane
Re: [HACKERS] Solaris testers wanted for strxfrm() behavior
All, Joyent confirms that the bug is fixed on SmartOS: This is what I see:

SunOS pkgsrc-pbulk-2014Q4-1.local 5.11 joyent_20141030T081701Z i86pc i386 i86pc
locale is_IS.ISO8859-1: strxfrm returned 212; last modified byte at 58 (0x0)
locale is_IS.ISO8859-1: strxfrm returned 212; last modified byte at 58 (0x0)
locale : strxfrm returned 26; last modified byte at 27 (0x0)

Cheers, -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] 9.5 release notes
On 2015-06-29 17:58:54 -0400, Tom Lane wrote: Yeah we are. The only places you'll find where we aren't formatting to 77 or 78 columns or so are where it would require breaking SGML tags in weird places. Which isn't exactly infrequent... Anyway, how about: format:%cd [%h] %(8,trunc)%cN: %(49,trunc)%s (which you can configure as pretty.pgmajor or so in .gitconfig btw.) (There are two different things to worry about here. One is how you're going to reverse-engineer the annotations into the release notes Bruce already made. Even un-truncated first lines of commit messages will often not be enough for matching up commits to those notes. But I'm thinking more about how we do this in future release cycles, and for that, getting git_changelog to help is clearly the ticket.) I'm not against doing that.
Re: [HACKERS] 9.5 release notes
On 2015-06-29 17:44:06 -0400, Tom Lane wrote: Andres Freund and...@anarazel.de writes: How about: git log --pretty='format:%cd [%h] %cN: %s' --date=short upstream/master Ah yes, didn't see that option. That's definitely better. I'd still vote for restricting it to fit in an 80-column window though. I don't see the benefit of truncating away the end of the commit message - that'll just mean more manual work and harder-to-understand headings... It's also not like we're generally very religious about the line length in sgml?
Re: [HACKERS] 9.5 release notes
Andres Freund and...@anarazel.de writes: On 2015-06-29 17:44:06 -0400, Tom Lane wrote: Ah yes, didn't see that option. That's definitely better. I'd still vote for restricting it to fit in an 80-column window though. I don't see the benefit of truncating away the end of the commit message - that'll just mean more manual work and harder to understand heading... It's also not like we're generally very religious about the line length in sgml? Yeah we are. The only places you'll find where we aren't formatting to 77 or 78 columns or so are where it would require breaking SGML tags in weird places. If we use this format without truncating the log lines then it'll become practically impossible to edit release notes in a window less than about 120 characters wide, and I for one will object to that. It doesn't fit well in my usual screen layout. And it would be very wasteful of screen space because even if you let regular text flow all the way to a 120-character margin, there are enough short lines like <para> that there'd just be huge amounts of white space on your screen. As for manual work to generate the format, we could fix that by getting git_changelog to emit what we want. I'd think the best thing to start from would be the summary lines interspersed with full commit messages anyway; the raw output of git log is going to be near unusable for this purpose, with or without truncation. (There are two different things to worry about here. One is how you're going to reverse-engineer the annotations into the release notes Bruce already made. Even un-truncated first lines of commit messages will often not be enough for matching up commits to those notes. But I'm thinking more about how we do this in future release cycles, and for that, getting git_changelog to help is clearly the ticket.) regards, tom lane
Re: [HACKERS] 9.5 release notes
On 2015-06-29 17:05:59 -0400, Tom Lane wrote: +1. Helps confirm which items are meant to correspond to which commits. That's what made me think of it. In case you didn't realize it already, the stuff I put into the minor release notes is from src/tools/git_changelog. Dunno whether we need to use that exact format for major releases though; there's no need to identify branches in major release notes. I'd recognized the format, but I didn't want to exactly go that way. As you say, the branch information is redundant. Haven't yet thought much about the format, maybe in the style of git log --pretty='format:[%h] %aN [%ci]: %s' upstream/master I'd personally like to see the hash and the time, the rest isn't particularly important to me. Greetings, Andres Freund
Re: [HACKERS] Parallel Seq Scan
[Jumping in without catching up on entire thread. Please let me know if these questions have already been covered.] 1. Can you change the name to something like ParallelHeapScan? Parallel Sequential is a contradiction. (I know this is bikeshedding and I won't protest further if you keep the name.) 2. Where is the speedup coming from? How much of it is CPU and IO overlapping (i.e. not leaving disk or CPU idle while the other is working), and how much from the CPU parallelism? I know this is difficult to answer rigorously, but it would be nice to have some breakdown even if for a specific machine. Regards, Jeff Davis
Re: [HACKERS] Support for N synchronous standby servers - take 2
On Mon, Jun 29, 2015 at 4:20 AM, Josh Berkus j...@agliodbs.com wrote: On 06/28/2015 04:36 AM, Sawada Masahiko wrote: On Sat, Jun 27, 2015 at 3:53 AM, Josh Berkus j...@agliodbs.com wrote: On 06/26/2015 11:32 AM, Robert Haas wrote: I think your proposal is worth considering, but you would need to fill in a lot more details and explain how it works in detail, rather than just via a set of example function calls. The GUC-based syntax proposal covers cases like multi-level rules and, now, prioritization, and it's not clear how those would be reflected in what you propose. So what I'm seeing from the current proposal is:
1. we have several defined synchronous sets
2. each set requires a quorum of k (defined per set)
3. within each set, replicas are arranged in priority order.
One thing which the proposal does not implement is *names* for synchronous sets. I would also suggest that if I lose this battle and we decide to go with a single stringy GUC, that we at least use JSON instead of defining our own, proprietary, syntax? JSON would be more flexible for making synchronous sets, but it would make us change how the configuration file is parsed, to allow a value to contain a newline. Right. Well, another reason we should be using a system catalog and not a single GUC ... I assume that this takes into account the fact that you will still need a SIGHUP to reload properly the new node information from those catalogs and to track if some information has been modified or not. And the fact that a connection to those catalogs will be needed as well, something that we don't have now. Another barrier to the catalog approach is that catalogs get replicated to the standbys, and I think that we want to avoid that. But perhaps you simply meant having an SQL interface with some metadata, right? Perhaps I got confused by the word 'catalog'. I'm personally not convinced that quorum and prioritization are compatible.
I suggest instead that quorum and prioritization should be exclusive alternatives, that is that a synch set should be either a quorum set (with all members as equals) or a prioritization set (if rep1 fails, try rep2). I can imagine use cases for either mode, but not one which would involve doing both together. Yep, separating the GUC parameter between prioritization and quorum could be also good idea. We're agreed, then ... Er, I disagree here. Being able to get prioritization and quorum working together is a requirement of this feature in my opinion. Using again the example above with 2 data centers, being able to define a prioritization set on the set of nodes of data center 1, and a quorum set in data center 2 would reduce failure probability by being able to prevent problems where for example one or more nodes lag behind (improving performance at the same time). Also I think that we must enable us to decide which server we should promote when the master server is down. Yes, and probably my biggest issue with this patch is that it makes deciding which server to fail over to *more* difficult (by adding more synchronous options) without giving the DBA any more tools to decide how to fail over. Aside from because we said we'd eventually do it, what real-world problem are we solving with this patch? Hm. This patch needs to be coupled with improvements to pg_stat_replication to be able to represent a node tree by basically adding to which group a node is assigned. I can draft that if needed, I am just a bit too lazy now... Honestly, this is not a matter of tooling. Even today if a DBA wants to change s_s_names without touching postgresql.conf you could just run ALTER SYSTEM and then reload parameters. It's always been a problem that one can accomplish a de-facto denial-of-service by joining a cluster using the same application_name as the synch standby, moreso because it's far too easy to do that accidentally. 
One needs to simply make the mistake of copying recovery.conf from the synch replica instead of the async replica, and you've created a reliability problem. That's a scripting problem then. There are many ways to do a false manipulation in this area when setting up a standby. application_name value is one, you can do worse by pointing to an incorrect IP as well, miss a firewall filter or point to an incorrect port. Also, the fact that we use application_name for synch_standby groups prevents us from giving the standbys in the group their own names for identification purposes. It's only the fact that synchronous groups are relatively useless in the current feature set that's prevented this from being a real operational problem; if we implement quorum commit, then users are going to want to use groups more often and will want to identify the members of the group, and not just by IP address. Managing groups in the synchronous protocol is adding one level of complexity for the operator, while what I had in mind first was to allow a user to be able
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
Hi KaiGai-san, On 2015/06/27 21:09, Kouhei Kaigai wrote: BTW, if you try newer version of postgres_fdw foreign join patch, please provide me to reproduce the problem/ OK Did you forget to attach the patch, or v17 is in use? Sorry, I made a mistake. The problem was produced using v16 [1]. I'd like to suggest a solution that re-construct remote tuple according to the fdw_scan_tlist on ExecScanFetch, if given scanrelid == 0. It enables to run local qualifier associated with the ForeignScan node, and it will also work for the case when tuple in es_epqTupleSet[] was local heap. Maybe I'm missing something, but I don't think your proposal works properly because we don't have any component ForeignScan state node or subsidiary join state node once we've replaced the entire join with the ForeignScan performing the join remotely, IIUC. So, my image was to have another subplan for EvalPlanQual as well as the ForeignScan, to do the entire join locally for the component test tuples if we are inside an EvalPlanQual recheck. Hmm... Probably, we have two standpoints to tackle the problem. The first standpoint tries to handle the base foreign table as a prime relation for locking. Thus, we have to provide a way to fetch a remote tuple identified with the supplied ctid. The advantage of this approach is the way to fetch tuples from base relation is quite similar to the existing form, however, its disadvantage is another side of the same coin, because the ForeignScan node with scanrelid==0 (that represents remote join query) may have local qualifiers which shall run on the tuple according to fdw_scan_tlist. 
IIUC, I think this approach would also need to evaluate join conditions and remote qualifiers in addition to local qualifiers in the local, for component tuples that were re-fetched from the remote (and remaining component tuples that were copied from whole-row vars, if any), in cases where the re-fetched tuples were updated versions of those tuples rather than the same versions previously obtained. One other standpoint tries to handle a bunch of base foreign tables as a unit. That means, if any of base foreign table is the target of locking, it prompts FDW driver to fetch the latest joined tuple identified by ctid, even if this join contains multiple base relations to be locked. The advantage of this approach is that we can use qualifiers of the ForeignScan node with scanrelid==0 and no need to pay attention of remote qualifier and/or join condition individually. Its disadvantage is, we may extend EState structure to keep the joined tuples, in addition to es_epqTupleSet[]. That is an idea. However, ISTM there is another disadvantage; that is not efficient because that would need to perform another remote join query having such additional conditions during an EvalPlanQual check, as you proposed. I'm inclined to think the latter standpoint works well, because it does not need to reproduce an alternative execution path in local side again, even if a ForeignScan node represents a much more complicated remote query. If we would fetch tuples of individual base relations, we need to reconstruct an identical join path to be executed on the remote side, don't we? Yeah, that was my image for fixing this issue. IIUC, the purpose of EvalPlanQual() is to ensure the tuples to be locked are still visible, so it is not an essential condition to fetch base tuples individually. I think so too, but taking the similarity and/or efficiency of processing into consideration, I would vote for the idea of having an alternative execution path in the local.
That would also allow FDW authors to write the foreign join pushdown functionality in their FDWs with less effort. Best regards, Etsuro Fujita [1] http://www.postgresql.org/message-id/CAEZqfEe9KGy=1_wagh2rgzpg0o4pqgd+iauyaj8wtze+cyj...@mail.gmail.com
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Mon, Jun 29, 2015 at 3:46 AM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/24/2015 09:43 AM, Michael Paquier wrote: Attached is a new set of patches. Except for the last ones that address one issue of pg_rewind (symlink management when streaming PGDATA), all the others introduce if_not_exists options for the functions of genfile.c. The pg_rewind stuff could be more polished though. Feel free to comment. I've committed the additional option to the functions in genfile.c (I renamed it to missing_ok, for clarity), and the pg_rewind changes to use that option. I ended up refactoring the patch quite a bit, so if you could double-check what I committed to make sure I didn't break anything, that would be great. Thanks, the new patch looks far better than what I did. I noticed a couple of typos in the docs though:
- s/behaviour/behavior: behavior is the American English spelling, and it is the one used elsewhere as well, hence I guess that it makes sense to use it.
- s/an non-existent/a non-existent
- pg_proc.h is still using if_not_exists as in my patch (you corrected it to use missing_ok).
Those are fixed as 0001 in the attached set. I didn't commit the tablespace or symlink handling changes yet, will review those separately. Thanks. Attached are rebased versions that take into account your previous changes as well (like the variable renaming, wrapper function usage, etc). I also added missing_ok to pg_readlink for consistency, and rebased the fix of pg_rewind for soft links with 0005. Note that 0005 does not use pg_tablespace_location(), still the patch series includes an implementation of missing_ok for it. Feel free to use it if necessary. I also didn't commit the new regression test yet. It would indeed be nice to have one, but I think it was a few bricks shy of a load. It should work in a freshly initdb'd system, but not necessarily on an existing installation.
First, it relied on the fact that postgresql.auto.conf exists, but a DBA might remove that if he wants to make sure the feature is not used. Secondly, it relied on the fact that pg_twophase is empty, but there is no guarantee of that either. Third, the error messages included in the expected output, e.g. No such file or directory, depend on the operating system and locale. And finally, it'd be nice to test more things, in particular the behaviour of different offsets and lengths to pg_read_binary_file(), although an incomplete test would be better than no test at all. Portability is going to really reduce the test range; the only things that we could test are:
- NULL results returned when missing_ok = true (with a dummy file name/link/directory), as missing_ok = false would choke depending on the platform as you mentioned.
- Ensure that those functions called by users without superuser rights fail properly.
- Path format errors for each function, like this:

=# select pg_ls_dir('..');
ERROR:  42501: path must be in or below the current directory
LOCATION:  convert_and_check_filename, genfile.c:78

For tests on pg_read_binary_file, do you think that there is one file of PGDATA that we could use for scanning? I cannot think of one off the top of my head now (postgresql.conf or any configuration files could be overridden by the user so they are out of scope, PG_VERSION is an additional maintenance burden). Well, I think that those things are still worth testing in any case and I think that you think so as well. If you are fine with that I could wrap up a patch that's better than nothing done for sure. Thoughts? Now we could have a TAP test for this stuff, where we could have a custom PGDATA with some dummy files that we know will be empty for example, still it seems like overkill to me for this purpose, initdb is rather costly in this facility.. And on small machines.
Regards,
--
Michael

From 378bd8af5c27e7e20b038f05b693b95e9871ef0b Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Mon, 29 Jun 2015 13:38:03 +0900
Subject: [PATCH 1/5] Fix some typos and an incorrect naming introduced by cb2acb1

---
 doc/src/sgml/func.sgml        | 4 ++--
 src/include/catalog/pg_proc.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index d49cd43..8b28e29 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17851,7 +17851,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
    <para>
     All of these functions take an optional <parameter>missing_ok</> parameter,
-    which specifies the behaviour when the file or directory does not exist.
+    which specifies the behavior when the file or directory does not exist.
     If <literal>true</literal>, the function returns NULL (except
     <function>pg_ls_dir</>, which returns an empty result set). If
     <literal>false</>, an error is raised. The default is <literal>false</>.
@@ -17867,7 +17867,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
Re: [HACKERS] PANIC in GIN code
On 06/29/2015 01:12 AM, Jeff Janes wrote: Now I'm getting a different error, with or without checksums. ERROR: invalid page in block 0 of relation base/16384/16420 CONTEXT: automatic vacuum of table jjanes.public.foo 16420 is the gin index. I can't even get the page with pageinspect: jjanes=# SELECT * FROM get_raw_page('foo_text_array_idx', 0); ERROR: invalid page in block 0 of relation base/16384/16420 This is the last few gin entries from pg_xlogdump rmgr: Gin len (rec/tot): 0/ 3893, tx: 0, lsn: 0/77270E90, prev 0/77270E68, desc: VACUUM_PAGE , blkref #0: rel 1663/16384/16420 blk 27 FPW rmgr: Gin len (rec/tot): 0/ 3013, tx: 0, lsn: 0/77272080, prev 0/77272058, desc: VACUUM_PAGE , blkref #0: rel 1663/16384/16420 blk 6904 FPW rmgr: Gin len (rec/tot): 0/ 3093, tx: 0, lsn: 0/77272E08, prev 0/77272DE0, desc: VACUUM_PAGE , blkref #0: rel 1663/16384/16420 blk 1257 FPW rmgr: Gin len (rec/tot): 8/ 4662, tx: 318119897, lsn: 0/77A2CF10, prev 0/77A2CEC8, desc: INSERT_LISTPAGE , blkref #0: rel 1663/16384/16420 blk 22184 rmgr: Gin len (rec/tot): 88/ 134, tx: 318119897, lsn: 0/77A2E188, prev 0/77A2E160, desc: UPDATE_META_PAGE , blkref #0: rel 1663/16384/16420 blk 0 And the metapage is mostly zeros: head -c 8192 /tmp/data2_invalid_page/base/16384/16420 | od 000 00 00 161020 073642 00 00 00 00 020 00 00 00 00 053250 00 053250 00 040 006140 00 01 00 01 00 00 00 060 031215 00 000452 00 00 00 00 00 100 025370 00 00 00 02 00 00 00 120 00 00 00 00 00 00 00 00 * 002 Hmm. Looking at ginRedoUpdateMetapage, I think I see the problem: it doesn't initialize the page. It copies the metapage data, but it doesn't touch the page headers. The only way I can see that that would cause trouble is if the index somehow got truncated away or removed in the standby. 
That could happen in crash recovery, if you drop the index and then crash, but that should be harmless, because crash recovery doesn't try to read the metapage, only update it (by overwriting it), and by the time crash recovery has completed, the index drop is replayed too. But AFAICS that bug is present in earlier versions too. Can you reproduce this easily? How? - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Mon, Jun 29, 2015 at 3:53 PM, Vladimir Borodin r...@simply.name wrote: On 28 June 2015 at 21:46, Heikki Linnakangas hlinn...@iki.fi wrote: I've committed the additional option to the functions in genfile.c (I renamed it to missing_ok, for clarity), and the pg_rewind changes to use that option. And since it changes the API it would not be back-ported to 9.4, right? Those API changes are not going to be back-patched to 9.4 as this is basically a new feature. For 9.4's pg_rewind, the problem does not concern this mailing list anyway, so let's discuss it directly on the project page on github. Just mentioning, but I am afraid that we are going to need a set of TRY/CATCH blocks with a wrapper function in plpgsql that does the failure legwork here, or to export those functions in an extension with some copy/paste, as all the necessary routines of genfile.c are not exposed externally, and to be sure that those functions are installed on the source server. I am not sure that I will be able to look at that in the short term unfortunately... Hence patches are welcome there and I will happily review, argue or integrate them if needed. The non-streaming option is not impacted in any case, so it's not like pg_rewind cannot be used at all on 9.4 or 9.3 instances. Regards, -- Michael
Re: [HACKERS] anole: assorted stability problems
On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund and...@anarazel.de wrote: On 2015-06-29 00:42:53 -0400, Tom Lane wrote: #define S_UNLOCK(lock)\ do { _Asm_sched_fence(); (*(lock)) = 0; } while (0) Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler barrier? Shouldn't this be a _Asm_mf()? The point of the commit was to make spinlocks act as compiler barriers as well as CPU barriers. So I was just looking to add a compiler barrier to what was already there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reduce ProcArrayLock contention
On Mon, Jun 29, 2015 at 1:22 PM, Simon Riggs si...@2ndquadrant.com wrote: Yes, I know. And we all had a long conversation about how to do it without waking up the other procs. Forming a list, like we use for sync rep, and having just a single process walk the queue was the way I suggested then and previously. Weird. I am not sure what your point is. Are you complaining that you didn't get a design credit for this patch? If so, I think that's a bit petty. I agree that you mentioned something along these lines at PGCon, but Amit and I have been discussing this every week for over a month, so it's not as if the conversations at PGCon were the only ones, or the first. Nor is there a conspiracy to deprive Simon Riggs of credit for his ideas. I believe that you should assume good faith and take it for granted that Amit credited who he believed that he got his ideas from. The fact that you may have had similar ideas does not mean that he got his from you. It probably does mean that they are good ideas, since we are apparently all thinking in the same way. (I could provide internal EDB emails documenting the timeline of various ideas and showing which ones happened before and after PGCon, and we could debate exactly who thought of what when. But I don't really see the point. I certainly hope that a debate over who deserves how much credit for what isn't going to overshadow the good work Amit has done on this patch.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Solaris testers wanted for strxfrm() behavior
On Mon, Jun 29, 2015 at 6:07 PM, Josh Berkus j...@agliodbs.com wrote: On 06/29/2015 02:08 PM, Tom Lane wrote: Josh Berkus j...@agliodbs.com writes: Joyent confirms that the bug is fixed on SmartOS: The more interesting bit of information would be *when* it was fixed. Answer: not certain, but fixed at least 2 years ago. 2 years is definitely not long enough to assume that no unpatched machines exist in the wild. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] anole: assorted stability problems
On 2015-06-29 22:45:49 -0400, Robert Haas wrote: On Mon, Jun 29, 2015 at 10:32 PM, Andres Freund and...@anarazel.de wrote: On 2015-06-29 22:11:33 -0400, Robert Haas wrote: On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund and...@anarazel.de wrote: On 2015-06-29 00:42:53 -0400, Tom Lane wrote: #define S_UNLOCK(lock)\ do { _Asm_sched_fence(); (*(lock)) = 0; } while (0) Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler barrier? Shouldn't this be a _Asm_mf()? The point of the commit was to make spinlocks act as compiler barriers as well as CPU barriers. So I was just looking to add a compiler barrier to what was already there. You removed a volatile at the same time, and volatile on IA64 has acquire/release semantics. Can you explain what you mean by volatile having acquire/release semantics? I don't see how volatile can create a CPU barrier, but I'm guessing that it somehow can and that you're about to enlighten me. It's an IA64 peculiarity. I think it started with Intel's compiler, but since then at least msvc and gcc copied it. In essence volatile reads implicitly have acquire semantics, and volatile writes release. That's relatively cheap on IA64 because they have 'opcode tags' marking normal moves etc. as having release or acquire semantics (mov.rel etc.). We even have a comment about it that scratches the surface a bit: /* * Intel Itanium, gcc or Intel's compiler. * * Itanium has weak memory ordering, but we rely on the compiler to enforce * strict ordering of accesses to volatile data. In particular, while the * xchg instruction implicitly acts as a memory barrier with 'acquire' * semantics, we do not have an explicit memory fence instruction in the * S_UNLOCK macro. We use a regular assignment to clear the spinlock, and * trust that the compiler marks the generated store instruction with the * .rel opcode. * * Testing shows that assumption to hold on gcc, although I could not find * any explicit statement on that in the gcc manual. 
In Intel's compiler, * the -m[no-]serialize-volatile option controls that, and testing shows that * it is enabled by default. */
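To make the distinction in this thread concrete, here is a minimal sketch of an unlock routine that combines an explicit compiler barrier with a volatile store. It is illustrative only: the names my_slock_t, MY_COMPILER_BARRIER and my_s_unlock are hypothetical, not PostgreSQL's, and the barrier shown is a gcc-style asm clobber. The point is that the barrier alone only constrains the compiler; on IA64 it is the volatile qualifier that makes gcc/icc emit the st.rel store giving release semantics.

```c
#include <assert.h>

/* Hypothetical spinlock type; PostgreSQL's real slock_t is per-platform. */
typedef volatile int my_slock_t;

/* Compiler barrier: stops the compiler from moving memory accesses
 * across it, but emits no CPU fence instruction at all. */
#define MY_COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

static void
my_s_unlock(my_slock_t *lock)
{
    MY_COMPILER_BARRIER();      /* keep protected-section stores above this point */
    *lock = 0;                  /* volatile store; gcc/icc mark it st.rel on IA64 */
}
```

Dropping the volatile from the lock type would leave only the compiler barrier, which is exactly the gap being discussed: correct on strongly ordered hardware, insufficient on IA64 without an explicit fence.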
[HACKERS] LWLock deadlock and gdb advice
I have a 9.5alpha1 cluster which is locked up. All the user backends seem to be waiting on semop, eventually on WALInsertLockAcquire. Is there a way to use gdb to figure out who holds the lock they are waiting for? It is compiled with both debug and cassert. I am hoping someone can give me a recipe similar to the one for "Examining backend memory use" in https://wiki.postgresql.org/wiki/Developer_FAQ example backtrace: #0 0x003dcb6eaf27 in semop () from /lib64/libc.so.6 #1 0x0067190f in PGSemaphoreLock (sema=0x7f28a98b9468) at pg_sema.c:387 #2 0x006d4b0c in LWLockAcquireCommon (l=0x7f28a0e6d600, valptr=0x7f28a0e6d618, val=0) at lwlock.c:1042 #3 LWLockAcquireWithVar (l=0x7f28a0e6d600, valptr=0x7f28a0e6d618, val=0) at lwlock.c:916 #4 0x004f3c4f in WALInsertLockAcquire (rdata=0xc5c130, fpw_lsn=0) at xlog.c:1411 #5 XLogInsertRecord (rdata=0xc5c130, fpw_lsn=0) at xlog.c:948 #6 0x004f7aac in XLogInsert (rmid=13 '\r', info=32 ' ') at xloginsert.c:453 #7 0x0047e0b0 in ginPlaceToPage (btree=0x7fffca9263e0, stack=0x2c94ff8, insertdata=value optimized out, updateblkno=value optimized out, childbuf=0, buildStats=0x0) at ginbtree.c:418 #8 0x0047f3ad in ginInsertValue (btree=0x7fffca9263e0, stack=0x2c94ff8, insertdata=0x7fffca926460, buildStats=0x0) at ginbtree.c:748 #9 0x00475c8b in ginEntryInsert (ginstate=0x7fffca9267e0, attnum=29784, key=1, category=value optimized out, items=0x7f28a0c7b458, nitem=47, buildStats=0x0) at gininsert.c:234 #10 0x00485ecc in ginInsertCleanup (ginstate=0x7fffca9267e0, vac_delay=value optimized out, stats=0x0) at ginfast.c:843 #11 0x00487059 in ginHeapTupleFastInsert (ginstate=0x7fffca9267e0, collector=value optimized out) at ginfast.c:436 #12 0x004760fa in gininsert (fcinfo=value optimized out) at gininsert.c:531 Cheers, Jeff
Re: [HACKERS] anole: assorted stability problems
On 2015-06-29 22:11:33 -0400, Robert Haas wrote: On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund and...@anarazel.de wrote: On 2015-06-29 00:42:53 -0400, Tom Lane wrote: #define S_UNLOCK(lock)\ do { _Asm_sched_fence(); (*(lock)) = 0; } while (0) Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler barrier? Shouldn't this be a _Asm_mf()? The point of the commit was to make spinlocks act as compiler barriers as well as CPU barriers. So I was just looking to add a compiler barrier to what was already there. You removed a volatile at the same time, and volatile on IA64 has acquire/release semantics.
[HACKERS] Refactor to split nodeAgg.c?
I was going to rebase my HashAgg patch, and got some conflicts related to the grouping sets patch. I could probably sort them out, but I think that may be the tipping point where we want to break up nodeAgg.c into nodeSortedAgg.c and nodeHashAgg.c, and probably a common file as well. This would also (I hope) be convenient for Simon and David Rowley, who have been hacking on aggregates in general. Anyone see a reason I shouldn't give this a try? Regards, Jeff Davis
Re: [HACKERS] anole: assorted stability problems
On Mon, Jun 29, 2015 at 10:32 PM, Andres Freund and...@anarazel.de wrote: On 2015-06-29 22:11:33 -0400, Robert Haas wrote: On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund and...@anarazel.de wrote: On 2015-06-29 00:42:53 -0400, Tom Lane wrote: #define S_UNLOCK(lock)\ do { _Asm_sched_fence(); (*(lock)) = 0; } while (0) Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler barrier? Shouldn't this be a _Asm_mf()? The point of the commit was to make spinlocks act as compiler barriers as well as CPU barriers. So I was just looking to add a compiler barrier to what was already there. You removed a volatile at the same time, and volatile on IA64 has acquire/release semantics. Can you explain what you mean by volatile having acquire/release semantics? I don't see how volatile can create a CPU barrier, but I'm guessing that it somehow can and that you're about to enlighten me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] anole: assorted stability problems
Robert Haas robertmh...@gmail.com writes: On Mon, Jun 29, 2015 at 10:32 PM, Andres Freund and...@anarazel.de wrote: You removed a volatile at the same time, and volatile on IA64 has acquire/release semantics. Can you explain what you mean by volatile having acquire/release semantics? I don't see how volatile can create a CPU barrier, but I'm guessing that it somehow can and that you're about to enlighten me. It's late and I'm tired, but: gcc (and, apparently, icc) treats accesses to volatile-qualified objects as cues to emit .acq or .rel memory ordering qualifiers on IA64 instructions, per the comments in s_lock.h. I have not seen any documentation stating specifically that aCC does the same, but the buildfarm evidence is pretty clear that the 9.4 IA64-non-gcc version of S_UNLOCK worked and the up-to-now-9.5 version does not. So personally, I would be inclined to put back the volatile qualifier, independently of any fooling around with _Asm_double_magic_xyzzy calls. Or to put it differently: where is the evidence that removing the volatile qual is a good idea? regards, tom lane
Re: [HACKERS] PANIC in GIN code
On Mon, Jun 29, 2015 at 2:08 PM, Heikki Linnakangas hlinn...@iki.fi wrote: Now kill -9 postmaster, and restart. Voila, the page headers are all zeros: postgres=# select * from page_header(get_raw_page('i_foo', 0)); lsn| checksum | flags | lower | upper | special | pagesize | version | prune_xid ---+--+---+---+---+-+--+-+- -- 0/1891270 |0 | 0 | 0 | 0 | 0 |0 | 0 | 0 (1 row) postgres=# select * from gin_metapage_info(get_raw_page('i_foo', 0)); ERROR: input page is not a GIN metapage DETAIL: Flags 0189, expected 0008 I just pushed a fix for this, but unfortunately it didn't make it into 9.5alpha1. Thanks. I think that that fixed it. It survived for over an hour this time. Then it hit a different problem (apparent deadlock in the LWLock code, but that is probably a topic for another thread; I'm working on reproducing it.) Cheers, Jeff
Re: [HACKERS] anole: assorted stability problems
On Mon, Jun 29, 2015 at 10:53 PM, Andres Freund and...@anarazel.de wrote: On 2015-06-29 22:45:49 -0400, Robert Haas wrote: On Mon, Jun 29, 2015 at 10:32 PM, Andres Freund and...@anarazel.de wrote: On 2015-06-29 22:11:33 -0400, Robert Haas wrote: On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund and...@anarazel.de wrote: On 2015-06-29 00:42:53 -0400, Tom Lane wrote: #define S_UNLOCK(lock)\ do { _Asm_sched_fence(); (*(lock)) = 0; } while (0) Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler barrier? Shouldn't this be a _Asm_mf()? The point of the commit was to make spinlocks act as compiler barriers as well as CPU barriers. So I was just looking to add a compiler barrier to what was already there. You removed a volatile at the same time, and volatile on IA64 has acquire/release semantics. Can you explain what you mean by volatile having acquire/release semantics? I don't see how volatile can create a CPU barrier, but I'm guessing that it somehow can and that you're about to enlighten me. It's a IA64 pecularity. I think it started with intel's compiler, but since then at least msvc and gcc copied it. In essence volatile reads implicitly have acquire semantics, and volatile writes release. That's relatively cheap on IA64 because they have 'opcode tags' marking normal moves etc. as having release or acquire semantics (mov.rel etc.). We even have a comment about it that scratches the surface a bit: /* * Intel Itanium, gcc or Intel's compiler. * * Itanium has weak memory ordering, but we rely on the compiler to enforce * strict ordering of accesses to volatile data. In particular, while the * xchg instruction implicitly acts as a memory barrier with 'acquire' * semantics, we do not have an explicit memory fence instruction in the * S_UNLOCK macro. We use a regular assignment to clear the spinlock, and * trust that the compiler marks the generated store instruction with the * .rel opcode. 
* * Testing shows that assumption to hold on gcc, although I could not find * any explicit statement on that in the gcc manual. In Intel's compiler, * the -m[no-]serialize-volatile option controls that, and testing shows that * it is enabled by default. */ Ah. So my bad, then, removing the volatile. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] LWLock deadlock and gdb advice
On Mon, Jun 29, 2015 at 5:37 PM, Jeff Janes jeff.ja...@gmail.com wrote: Is there a way to use gdb to figure out who holds the lock they are waiting for? Have you considered building with LWLOCK_STATS defined, and LOCK_DEBUG defined? That might do it. Otherwise, I suggest dereferencing the l argument to LWLockAcquireWithVar() or something -- set the frame to 3 in your example backtrace, using f 3. Then, p *l. You'll get the tranche id there, and so on. -- Peter Geoghegan
Re: [HACKERS] Reduce ProcArrayLock contention
On Mon, Jun 29, 2015 at 10:52 PM, Simon Riggs si...@2ndquadrant.com wrote: On 29 June 2015 at 18:11, Andres Freund and...@anarazel.de wrote: On June 29, 2015 7:02:10 PM GMT+02:00, Simon Riggs si...@2ndquadrant.com wrote: On 29 June 2015 at 16:27, Amit Kapila amit.kapil...@gmail.com wrote: Thanks to Robert Haas for having discussion (offlist) about the idea and suggestions to improve it, and also Andres Freund for having discussion and sharing thoughts about this idea at PGCon. Weird. This patch is implemented exactly the way I said to implement it publicly at PgCon. Was nobody recording the discussion at the unconference?? Amit presented an earlier version of this at the unconference? Yes, I know. And we all had a long conversation about how to do it without waking up the other procs. Forming a list, like we use for sync rep, and having just a single process walk the queue was the way I suggested then and previously. Yes, I remember very well the suggestion you gave during the unconference, and I really value that idea/suggestion. Here, the point is that I could get the chance to have in-person discussion with Robert and Andres where they spent more time discussing the problem (Robert has discussed this much more apart from PGCon as well), but that doesn't mean I am not thankful for the ideas or suggestions I got from you and others during the unconference. Thank you for participating in the discussion and giving suggestions, and I really mean it, as I felt good about it even afterwards. Now, I would like to briefly explain how the allow-one-waker idea has helped to improve the patch, as not everybody here was present at that unconference. 
The basic problem I was seeing with the initial version of the patch was that I was using an LWLockAcquireOrWait call for all the clients that were not able to get ProcArrayLock (conditionally in the first attempt); after waking, each proc would check whether its XID was cleared and, if not, would again try to AcquireOrWait for ProcArrayLock. This limited the patch's performance improvement at moderate client counts like 32 or 64, because even retrying the LWLock in exclusive mode again and again was the limiting factor (I tried LWLockAcquireConditional as well instead of LWLockAcquireOrWait; it helped a bit, but not very much). Allowing just one client to become a kind of group leader that clears the other procs' XIDs and wakes them up, with all the other procs waiting after adding themselves to the pendingClearXid list, has helped to reduce the bottleneck around ProcArrayLock significantly. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
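The group-leader scheme described above can be sketched in miniature: waiters push themselves onto a pending list, and whichever backend gets the lock walks the whole list, clearing every waiter's XID in one pass. This is only an illustration of the pattern, single-threaded and without any locking or semaphores; the names (Proc, enqueue_pending, leader_clear_xids, pending_head) are hypothetical, not PostgreSQL's actual structures.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a backend's PGPROC entry. */
typedef struct Proc
{
    int          xid;       /* pretend XID to be cleared at commit */
    struct Proc *next;      /* link in the pending-clear list */
} Proc;

static Proc *pending_head = NULL;

/* A backend that could not get the lock enqueues itself and would
 * then sleep on its semaphore (omitted here). */
static void
enqueue_pending(Proc *p)
{
    p->next = pending_head;
    pending_head = p;
}

/* The one backend that does acquire the lock acts as group leader:
 * it walks the queue, clears each waiter's XID, and "wakes" it. */
static void
leader_clear_xids(void)
{
    Proc *p = pending_head;

    while (p != NULL)
    {
        Proc *next = p->next;

        p->xid = 0;         /* clear the XID on the waiter's behalf */
        p->next = NULL;     /* unlink; a real leader would also wake it */
        p = next;
    }
    pending_head = NULL;
}
```

The payoff, as described in the mail, is that only one process contends for the exclusive lock per batch; the others pay a single enqueue plus a sleep instead of repeated lock retries.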
Re: [HACKERS] Refactor to split nodeAgg.c?
Jeff Davis pg...@j-davis.com writes: I was going to rebase my HashAgg patch, and got some conflicts related to the grouping sets patch. I could probably sort them out, but I think that may be the tipping point where we want to break up nodeAgg.c into nodeSortedAgg.c and nodeHashAgg.c, and probably a common file as well. This would also (I hope) be convenient for Simon and David Rowley, who have been hacking on aggregates in general. Anyone see a reason I shouldn't give this a try? As with the discussion about pgbench, it's hard to opine about this without seeing a concrete refactoring proposal. But if you want to try, now, very early in the dev cycle, would be the best time to try. regards, tom lane
Re: [HACKERS] Bug in bttext_abbrev_convert()
On Mon, Jun 29, 2015 at 7:47 PM, Peter Geoghegan p...@heroku.com wrote: Commits b181a919 and arguably c79b6413 added bugs to bttext_abbrev_convert() in the process of fixing some others. In the master branch, bttext_abbrev_convert() can leak memory when the C locale happens to be used and we must detoast, which is unacceptable for about the same reason that it's unacceptable for a standard B-Tree comparator routine. There could also be a use-after-free issue for large strings that are detoasted, because bttext_abbrev_convert() hashes memory which might already be freed (when hashing the authoritative value). Attached patch fixes these issues. As we all know, the state of automated testing is pretty lamentable. This is the kind of thing that we could catch more easily in the future if better infrastructure were in place. I caught this by eyeballing bttext_abbrev_convert() with slightly fresher eyes than the last time I looked. Committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs si...@2ndquadrant.com wrote: On 28 June 2015 at 17:17, Tom Lane t...@sss.pgh.pa.us wrote: I'm not sure what you consider dire, but missing a dirty buffer belonging to the to-be-destroyed table would result in the system being permanently unable to checkpoint, because attempts to write out the buffer to the no-longer-extant file would fail. You could only get out of the situation via a forced database crash (immediate shutdown), followed by replaying all the WAL since the time of the problem. In production contexts that could be pretty dire. Yes, it's bad, but we do notice that it has happened. We can also put in code to specifically avoid this error at checkpoint time. If lseek fails badly then SeqScans would give *silent* data loss, which in my view is worse. Just-added pages aren't the only thing we might miss if lseek is badly wrong. So for the purpose of this patch, do we need to assume that lseek can give us the wrong size for a file, and should we add preventive checks and other handling for the same? I am okay with changing it that way if we are going to have that as an assumption in our code wherever we are using it or will use it in the future; otherwise we will end up with some preventive checks which are actually not required. Another idea here is to use some other way instead of lseek to find the size of a file. Do you think we can use stat() for this purpose, given that we are already using it in fd.c? So, I think this patch still has legs. We can check that the clean up has been 100% when we do the buffer scan at the start of the checkpoint One way to ensure that is to verify that each buffer header tag is valid (which essentially means checking whether the object exists); do you have something else in mind to accomplish this part if we decide to go this route? - that way we do just one scan of the buffer pool and move a time-consuming operation into a background process. 
Agreed, and if that turns out to be cheap, then we might want to optimize Drop Database and others in the same way. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Mon, Jun 29, 2015 at 5:41 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Sun, Jun 28, 2015 at 9:05 PM, Tom Lane t...@sss.pgh.pa.us wrote: Amit Kapila amit.kapil...@gmail.com writes: On Sat, Jun 27, 2015 at 7:40 PM, Tom Lane t...@sss.pgh.pa.us wrote: I don't like this too much because it will fail badly if the caller is wrong about the maximum possible page number for the table, which seems not exactly far-fetched. (For instance, remember those kernel bugs we've seen that cause lseek to lie about the EOF position?) Considering we already have an exclusive lock while doing this operation, and nobody else can write to this file, won't closing and opening it again avoid such problems? On what grounds do you base that touching faith? Sorry, but I don't get what problem you see in this. On thinking about it again, I think you are worried that if we close and reopen the file, it would break any flush operation that could happen in parallel via checkpoint. Still, I am not clear whether we want to assume that we can't rely on lseek for the size of a file if there can be parallel write activity on the file (even if that write doesn't increase the size of the file)? If yes, then we have the options below: a. Have some protection mechanism for file access (ignore the error for a file not present or accessed during flush) and clean the buffers containing invalid objects, as is being discussed up-thread. b. Use some other API like stat to get the size of the file? Do you prefer either of these, or if you have a better idea, please share it. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
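The stat-versus-lseek question above can be sketched as a cross-check: obtain the file size both ways and treat any disagreement as suspect. This is only an illustration of the idea under discussion, using raw syscalls; PostgreSQL's own code would go through fd.c, and the function name file_size_checked is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the file's size, but only if lseek(SEEK_END) and fstat()
 * agree on where EOF is; return -1 on error or disagreement.  A kernel
 * bug that makes lseek lie about EOF (as mentioned in the thread) would
 * show up here as a mismatch, assuming fstat is unaffected. */
static long
file_size_checked(const char *path)
{
    int         fd = open(path, O_RDONLY);
    off_t       seek_end;
    struct stat st;

    if (fd < 0)
        return -1;

    seek_end = lseek(fd, 0, SEEK_END);
    if (seek_end < 0 || fstat(fd, &st) != 0 || seek_end != st.st_size)
    {
        close(fd);
        return -1;      /* syscall failed, or the two views of EOF disagree */
    }
    close(fd);
    return (long) seek_end;
}
```

Note that both calls usually read the same cached inode size, so this check catches lseek-specific bugs rather than filesystem corruption in general; that is roughly why the thread treats it as a preventive check rather than a guarantee.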
Re: [HACKERS] LWLock deadlock and gdb advice
On Tue, Jun 30, 2015 at 6:25 AM, Peter Geoghegan p...@heroku.com wrote: On Mon, Jun 29, 2015 at 5:37 PM, Jeff Janes jeff.ja...@gmail.com wrote: Is there a way to use gdb to figure out who holds the lock they are waiting for? Have you considered building with LWLOCK_STATS defined, and LOCK_DEBUG defined? That might do it. If you define LOCK_DEBUG, then you can check owner of the lock [1], which will tell you about the Exclusive owner of that lock and can help you in debugging the problem. [1] #ifdef LOCK_DEBUG struct PGPROC *owner; /* last exlusive owner of the lock */ #endif } LWLock; With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 30 June 2015 at 05:02, Amit Kapila amit.kapil...@gmail.com wrote: On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs si...@2ndquadrant.com wrote: On 28 June 2015 at 17:17, Tom Lane t...@sss.pgh.pa.us wrote: I'm not sure what you consider dire, but missing a dirty buffer belonging to the to-be-destroyed table would result in the system being permanently unable to checkpoint, because attempts to write out the buffer to the no-longer-extant file would fail. You could only get out of the situation via a forced database crash (immediate shutdown), followed by replaying all the WAL since the time of the problem. In production contexts that could be pretty dire. Yes, its bad, but we do notice that has happened. We can also put in code to specifically avoid this error at checkpoint time. If lseek fails badly then SeqScans would give *silent* data loss, which in my view is worse. Just added pages aren't the only thing we might miss if lseek is badly wrong. So for the purpose of this patch, do we need to assume that lseek can give us wrong size of file and we should add preventive checks and other handling for the same? I am okay to change that way, if we are going to have that as assumption in out code wherever we are using it or will use it in-future, otherwise we will end with some preventive checks which are actually not required. They're preventative checks. You always hope it is wasted effort. -- Simon Riggshttp://www.2ndQuadrant.com/ http://www.2ndquadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Solaris testers wanted for strxfrm() behavior
On Mon, Jun 29, 2015 at 11:52:26AM +1200, Thomas Munro wrote: On Mon, Jun 29, 2015 at 10:57 AM, Tom Lane t...@sss.pgh.pa.us wrote: Thomas Munro thomas.mu...@enterprisedb.com writes: Just by the way, I wonder if this was that bug: https://illumos.org/issues/1594 Oooh. Might or might not be *same* bug, but it sure looks like it could have the right symptom. If this is indeed inherited from old Solaris, I'm afraid we are totally fooling ourselves if we guess that it's no longer present in the wild. Very interesting. Looks like the illumos strxfrm() came from FreeBSD, not from Solaris; illumos introduced their bug independently: https://illumos.org/issues/2 https://github.com/illumos/illumos-gate/commits/master/usr/src/lib/libc/port/locale/collate.c Also, here is an interesting patch that went into the Apache C++ standard library. Maybe the problem was limited to amd64 system... https://github.com/illumos/illumos-userland/blob/master/components/stdcxx/patches/047-collate.cpp.patch That's a useful data point. Based on Oskari Saarenmaa's report, newer Solaris 10 is not affected. The fix presumably showed up after the 05/08 release and no later than the 01/13 release. On Sun, Jun 28, 2015 at 07:00:14PM -0400, Tom Lane wrote: On Sun, Jun 28, 2015 at 12:58 PM, Josh Berkus j...@agliodbs.com wrote: My perspective is that if both SmartOS and OmniOS pass, it's not our responsibility to support OldSolaris if they won't update libraries. Another idea would be to make a test during postmaster start to see if this bug exists, and fail if so. I'm generally on board with the thought that we don't need to work on systems with such a bad bug, but it would be a good thing if the failure was clean and produced a helpful error message, rather than looking like a Postgres bug. Failing cleanly on unpatched Solaris is adequate, agreed. A check at postmaster start isn't enough, because the postmaster might run in the C locale while individual databases or collations use problem locales. 
The safest thing is to test after every setlocale(LC_COLLATE) and newlocale(LC_COLLATE). That's once at backend start and once per backend per collation used, more frequent than I would like. Hmm. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
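The property such a check would verify is that strxfrm() key comparisons agree with strcoll() under the active LC_COLLATE setting. The real test would of course be written in C inside the backend; this Python sketch (the helper name `strxfrm_is_consistent` is mine, for illustration only) just shows the consistency condition being probed after a locale change:

```python
import locale
from functools import cmp_to_key

def strxfrm_is_consistent(words):
    """Return True if sorting by strxfrm() keys gives the same order as
    pairwise strcoll() comparisons under the current LC_COLLATE locale.
    A buggy strxfrm() (as in the illumos report) breaks this invariant."""
    by_xfrm = sorted(words, key=locale.strxfrm)
    by_coll = sorted(words, key=cmp_to_key(locale.strcoll))
    return by_xfrm == by_coll

# Run the check right after switching the collation locale, mirroring
# the "test after every setlocale(LC_COLLATE)" idea from the thread.
locale.setlocale(locale.LC_COLLATE, "C")
probe = ["zebra", "Apple", "mango", "banana"]
assert strxfrm_is_consistent(probe)
```

On an affected libc, a probe list like this would come back in two different orders, which is exactly the condition a clean startup-time (or per-collation) error check could report.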
Re: [HACKERS] pg_trgm version 1.2
On Mon, Jun 29, 2015 at 7:23 AM, Merlin Moncure mmonc...@gmail.com wrote:
> On Sat, Jun 27, 2015 at 5:17 PM, Jeff Janes jeff.ja...@gmail.com wrote:
>> V1.1: Time: 1743.691 ms --- after repeated execution to warm the cache
>> V1.2: Time: 2.839 ms --- after repeated execution to warm the cache
>
> Wow! I'm going to test this. I have some data sets for which trigram searching isn't really practical... if the search string touches trigrams with a lot of duplication, the algorithm can have trouble beating brute force searches. Trigram searching is important: it's currently the only way to quickly search string-encoded structures for partial strings.

I ran your patch against stock 9.4 and am happy to confirm massive speedups of pg_trgm; runtime reductions of 90% are common. Also, with the new changes it's hard to make the indexed search significantly underperform a brute force search, which is a huge improvement over the stock behavior, something that made me very wary of using these kinds of searches in the past.

data table: 'test2'
rows: ~2 million
heap size: 3.3GB (includes several unrelated fields)
index size: 1GB
(9.4: stock, 9.5: patched)

match 50% of rows, brute force seq scan:
  9.4: 11.5s    9.5: 9.1s
match 50% of rows, indexed (time is quite variable, with 9.4 giving 40-second times):
  9.4: 21.0s    9.5: 11.8s
match 1% of rows, indexed (90% time reduction!):
  9.4: 0.566s   9.5: 0.046s
match 0.1% of rows, one selective and one non-selective search term, selective term first:
  9.4: 0.563s   9.5: 0.028s
match 0.1% of rows, one selective and one non-selective search term, selective term last:
  9.4: 1.014s   9.5: 0.093s

Very nice! Recently I evaluated pg_trgm for an attribute searching system; it failed due to response-time variability and the lack of tools to control it. Had your patch been in place, it would have passed, and that was a 'real world' data set. With this, pg_trgm basically outperforms the SOLR search engine for all the cases we're interested in, whereas before the low-selectivity cases were having all kinds of trouble.
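For readers unfamiliar with the mechanics being benchmarked above: pg_trgm decomposes strings into 3-character "trigrams" and scores matches by trigram overlap. This Python sketch is only an illustrative approximation of that idea, not pg_trgm's actual C implementation (the padding and similarity formula here are simplified):

```python
def trigrams(s):
    """Extract the set of 3-character substrings (trigrams) from a string,
    lowercased and padded at the ends, roughly as pg_trgm does."""
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Jaccard-style trigram similarity in [0, 1]: shared trigrams over
    total distinct trigrams, analogous in spirit to pg_trgm's similarity()."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

This also shows why non-selective search terms hurt the stock index: a term whose trigrams appear in a large fraction of the rows forces many candidate matches to be fetched and rechecked, which is the case the patched version handles so much better.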
merlin