Re: [HACKERS] Get more from indices.
Hello, I've modified the patch to work in such a way. Also, as ISTM the patch is more complicated than what it really does, I've simplified it. I've revised the patch a bit. Please find the patch attached.

Thank you, but it seems to me too simplified. You made two major functional changes. One is, you put the added code for get_relation_info() out of the block for the condition (info->relam == BTREE_AM_OID) (though amcanorder would be preferable..). Anyway, the reason for that placement is to guarantee that a 'full_ordered' index is always orderable. I believe the relation between them is not obvious, so your patch has an opportunity to make a wrong assumption for possible indexes that are unique but not orderable. Going your way, some additional work would be needed to judge whether an index is orderable or not when checking the extensibility of pathkeys. The other is, you changed pathkeys expansion to be an all-or-nothing decision. While this change should simplify the code slightly, it also dismisses the opportunity for partially-extended pathkeys. Could you let me know the reason why you did so? Any thoughts? regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Regression tests failing if not launched on db regression
Hi all, It happens that the following regression tests fail if they are run on a database not named "regression":
- updatable_views
- foreign_data
- sequence
Those tests fail because some relations of information_schema contain information that is database-dependent. Please see the diffs attached. Note that this can easily be reproduced by running pg_regress with a command of this type from src/test/regress:
./pg_regress --inputdir=. --temp-install=./tmp_check \
    --top-builddir=../../.. --dlpath=. \
    --schedule=./parallel_schedule \
    --dbname=foo
IMHO, the regression test suite would gain in consistency and portability if it did not refer to data that is database-dependent. Opinions? A patch fixing that would be trivial to write, and I do not mind writing it. Also, if we consider this a bug, and I think it is, the fix should be back-patched as well. Regards, -- Michael
Attachment: dbname_regressions.diffs
Re: [HACKERS] Regression tests failing if not launched on db regression
On Thu, Dec 5, 2013 at 5:12 PM, Michael Paquier michael.paqu...@gmail.com wrote: Those tests fail because some relations of information_schema contain information that is database-dependent. I forgot to mention that it was our QE team at VMware that reported a portion of this failure to me; I just had to dig a little bit to find the root cause. And for the curious: yes, we run regressions with customized database names. Regards, -- Michael
Re: [HACKERS] Why we are going to have to go DirectIO
(2013/12/04 16:39), Claudio Freire wrote: On Wed, Dec 4, 2013 at 4:28 AM, Tatsuo Ishii is...@postgresql.org wrote: Can we avoid the Linux kernel problem by simply increasing our shared buffer size, say up to 80% of memory? That would make swapping more likely. Is that the case? If the system does not have enough memory, the kernel buffer will be used for other purposes, and the kernel cache will not work very well anyway. In my understanding, the problem is that even if there is enough memory, the kernel's cache does not work as expected. Problem is, Postgres relies on a working kernel cache for checkpoints. Checkpoint logic would have to be heavily reworked to account for an impaired kernel cache. Really, there's no difference between fixing the I/O problems in the kernel(s) vs in postgres. The only difference is, in the kernel(s), everyone profits, and you've got a huge head start. Yes. And using DirectIO efficiently is more difficult than using BufferedIO. If we simply switch PostgreSQL's write() calls to direct IO, it will execute the worst kind of ugly random IO. Communicating more with the kernel (through posix_fadvise, fallocate, aio, iovec, etc...) would probably be good, but it does expose more kernel issues. posix_fadvise, for instance, is a double-edged sword ATM. I do believe, however, that exposing those issues and prompting a fix is far preferable to silently working around them. Agreed. And I believe that well-controlled BufferedIO is faster and easier than perfectly controlled DirectIO. In fact, Oracle's database uses BufferedIO to access small datasets and DirectIO to access big datasets, because that uses the OS file cache more efficiently. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
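To make the posix_fadvise discussion concrete, here is a minimal, hedged sketch of hinting the kernel about a file's access pattern. The helper name hint_access_pattern() is made up for illustration and is not anything from the PostgreSQL tree; only the posix_fadvise() call itself is the real API being discussed:

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <unistd.h>

/*
 * Hint the kernel about caching for a whole file (len = 0 means
 * "to end of file").  POSIX_FADV_WILLNEED asks the kernel to prefetch
 * pages into the cache; POSIX_FADV_DONTNEED asks it to drop them,
 * which is the double-edged behavior the thread complains about.
 * Returns 0 on success, nonzero otherwise.
 */
static int
hint_access_pattern(const char *path, int advice)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = posix_fadvise(fd, 0, 0, advice);
    close(fd);
    return rc;
}
```

Exactly how DONTNEED interacts with pages still referenced by other processes varies across kernel versions, which is part of what makes the call double-edged.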
Re: [HACKERS] Why we are going to have to go DirectIO
On Wed, Dec 4, 2013 at 11:07 AM, Josh Berkus j...@agliodbs.com wrote: I also wasn't exaggerating the reception I got when I tried to talk about IO and PostgreSQL at LinuxCon and other events. The majority of Linux hackers I've talked to simply don't want to be bothered with PostgreSQL's performance needs, and I've heard similar things from my colleagues at the MySQL variants. Greg KH was the only real exception. If so, he is a fairly major exception. But there is at least one other major exception: I met Theodore Ts'o at pgConf.EU (he was in town for some Google thing), and he seemed pretty interested in what we had to say, and encouraged us to reach out to the kernel development community. I suspect that we simply haven't gone about it the right way. But you know what? 2.6, overall, still performs better than any kernel in the 3.X series, at least for Postgres. What about the fseek() scalability issue? -- Peter Geoghegan
Re: [HACKERS] Time-Delayed Standbys
On Thu, Dec 5, 2013 at 1:45 AM, Simon Riggs si...@2ndquadrant.com wrote: On 3 December 2013 18:46, Robert Haas robertmh...@gmail.com wrote: On Tue, Dec 3, 2013 at 12:36 PM, Fabrízio de Royes Mello fabriziome...@gmail.com wrote: On Tue, Dec 3, 2013 at 2:33 PM, Christian Kruse christ...@2ndquadrant.com wrote: Hi Fabrizio, looks good to me. I did some testing on 9.2.4, 9.2.5 and HEAD. It applies and compiles w/o errors or warnings. I set up a master and two hot standbys replicating from the master, one with 5 minutes delay and one without delay. After that I created a new database and generated some test data: CREATE TABLE test (val INTEGER); INSERT INTO test (val) (SELECT * FROM generate_series(0, 100)); The non-delayed standby had the data replicated nearly instantly; the delayed standby was replicated after exactly 5 minutes. I did not notice any problems, errors or warnings. Thanks for your review Christian... So, I proposed this patch previously and I still think it's a good idea, but it got voted down on the grounds that it didn't deal with clock drift. I view that as insufficient reason to reject the feature, but others disagreed. Unless some of those people have changed their minds, I don't think this patch has much future here. I had that objection and others. Since then many people have requested this feature and have persuaded me that this is worth having and that my objections are minor points. I now agree with the need for the feature, almost as written. Not recalling the older thread, but it seems this breaks on clock drift; I think we can fairly easily make that situation good enough. Just have IDENTIFY_SYSTEM return the current timestamp on the master, and refuse to start if the time difference is too great. Yes, that doesn't catch the case when the machines are in perfect sync when they start up and drift *later*, but it will catch the most common cases I bet. 
But I think that's good enough that we can accept the feature, given that *most* people will have ntp, and that it's a very useful feature for those people. But we could help people who run into it because of a simple config error.. Or maybe the suggested patch already does this, in which case ignore that part :) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
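The sanity check Magnus suggests could be as simple as the sketch below. The function name, the integer-seconds units, and the idea of feeding it a master timestamp hypothetically returned by IDENTIFY_SYSTEM are all illustrative; this is not the actual walreceiver code:

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Compare the master's reported clock against the standby's clock and
 * decide whether apparent drift is small enough to start up.  In a
 * real implementation the threshold would presumably be derived from
 * the configured apply delay (e.g. refuse if drift is a significant
 * fraction of it).
 */
static bool
clock_drift_acceptable(long master_ts_sec, long standby_ts_sec,
                       long max_drift_sec)
{
    long drift = labs(master_ts_sec - standby_ts_sec);

    return drift <= max_drift_sec;
}
```

As noted above, this only catches drift present at connection time, not drift that develops later.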
Re: [HACKERS] Parallel Select query performance and shared buffers
- When we increased NUM_BUFFER_PARTITIONS to 1024, the problem disappeared on 8-core machines but came back on 16-core machines on Amazon EC2. Could it be related to PostgreSQL's locking mechanism? If we build with -DLWLOCK_STATS to print locking stats from PostgreSQL, we see tons of contention with the default NUM_BUFFER_PARTITIONS value of 16:
$ tail -f /tmp/logfile | grep lwlock | egrep -v "blk 0"
...
PID 15965 lwlock 0: shacq 0 exacq 33 blk 2
PID 15965 lwlock 34: shacq 14010 exacq 27134 blk 6192
PID 15965 lwlock 35: shacq 14159 exacq 27397 blk 5426
PID 15965 lwlock 36: shacq 14111 exacq 27322 blk 4959
PID 15965 lwlock 37: shacq 14211 exacq 27507 blk 4370
PID 15965 lwlock 38: shacq 14110 exacq 27294 blk 3980
PID 15965 lwlock 39: shacq 13962 exacq 27027 blk 3719
PID 15965 lwlock 40: shacq 14023 exacq 27156 blk 3273
PID 15965 lwlock 41: shacq 14107 exacq 27309 blk 3201
PID 15965 lwlock 42: shacq 14120 exacq 27304 blk 2904
PID 15965 lwlock 43: shacq 14007 exacq 27129 blk 2740
PID 15965 lwlock 44: shacq 13948 exacq 27027 blk 2616
PID 15965 lwlock 45: shacq 14041 exacq 27198 blk 2431
PID 15965 lwlock 46: shacq 14067 exacq 27277 blk 2345
PID 15965 lwlock 47: shacq 14050 exacq 27203 blk 2106
PID 15965 lwlock 48: shacq 13910 exacq 26910 blk 2155
PID 15965 lwlock 49: shacq 14170 exacq 27360 blk 1989
After we increased NUM_BUFFER_PARTITIONS to 1024, lock contention decreased:
...
PID 25220 lwlock 1000: shacq 247 exacq 494 blk 1
PID 25220 lwlock 1001: shacq 198 exacq 394 blk 1
PID 25220 lwlock 1002: shacq 203 exacq 404 blk 1
PID 25220 lwlock 1003: shacq 226 exacq 452 blk 1
PID 25220 lwlock 1004: shacq 235 exacq 470 blk 1
PID 25220 lwlock 1006: shacq 226 exacq 452 blk 2
PID 25220 lwlock 1007: shacq 214 exacq 428 blk 1
PID 25220 lwlock 1008: shacq 225 exacq 448 blk 1
PID 25220 lwlock 1010: shacq 209 exacq 418 blk 1
PID 25220 lwlock 1015: shacq 199 exacq 398 blk 1
PID 25220 lwlock 1016: shacq 214 exacq 426 blk 1
PID 25220 lwlock 1018: shacq 230 exacq 456 blk 1
PID 25220 lwlock 1019: shacq 222 exacq 444 blk 3
PID 25220 lwlock 1023: shacq 262 exacq 524 blk 1
PID 25220 lwlock 1027: shacq 213 exacq 426 blk 1
PID 25220 lwlock 1028: shacq 246 exacq 491 blk 1
PID 25220 lwlock 1029: shacq 226 exacq 452 blk 1
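For readers following along: NUM_BUFFER_PARTITIONS matters because each buffer-mapping hashcode is reduced to one of that many partition locks, so fewer partitions means more backends colliding on the same lock. A toy sketch of the mapping (in the real tree this is the BufTableHashPartition() macro in buf_internals.h):

```c
#include <stdint.h>

/* The default partition count being discussed in this thread. */
#define NUM_BUFFER_PARTITIONS 16

/*
 * Map a buffer-tag hashcode to a buffer-mapping partition.  With a
 * power-of-two partition count the modulo reduces to a bit mask, so
 * raising the count (e.g. to 1024) costs little per lookup but spreads
 * contention across many more locks.
 */
static uint32_t
buffer_hash_partition(uint32_t hashcode)
{
    return hashcode % NUM_BUFFER_PARTITIONS;
}
```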
Re: [HACKERS] Parallel Select query performance and shared buffers
On 2013-12-05 11:15:20 +0200, Metin Doslu wrote: - When we increased NUM_BUFFER_PARTITIONS to 1024, the problem disappeared on 8-core machines but came back on 16-core machines on Amazon EC2. Could it be related to PostgreSQL's locking mechanism? If we build with -DLWLOCK_STATS to print locking stats from PostgreSQL, we see tons of contention with the default NUM_BUFFER_PARTITIONS value of 16: Is your workload bigger than RAM? I think a good bit of the contention you're seeing in that listing is populating shared_buffers - and might actually vanish once you're halfway cached. From what I've seen so far the bigger problem than contention in the lwlocks itself, is the spinlock protecting the lwlocks... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Parallel Select query performance and shared buffers
Is your workload bigger than RAM? RAM is bigger than the workload (by more than a couple of times). I think a good bit of the contention you're seeing in that listing is populating shared_buffers - and might actually vanish once you're halfway cached. From what I've seen so far the bigger problem than contention in the lwlocks itself, is the spinlock protecting the lwlocks... Could you clarify a bit what you mean by "halfway cached" and "spinlock protecting the lwlocks"?
Re: [HACKERS] Parallel Select query performance and shared buffers
On 2013-12-05 11:33:29 +0200, Metin Doslu wrote: Is your workload bigger than RAM? RAM is bigger than the workload (by more than a couple of times). I think a good bit of the contention you're seeing in that listing is populating shared_buffers - and might actually vanish once you're halfway cached. From what I've seen so far the bigger problem than contention in the lwlocks itself, is the spinlock protecting the lwlocks... Could you clarify a bit what you mean by "halfway cached"? Well, your stats showed a) fairly low lock counts overall b) a high percentage of exclusive locks. a) indicates the system wasn't running long. b) tells me there were lots of changes to the buffer mapping - which basically only happens if a buffer is placed in or removed from shared_buffers. If your shared_buffers is big enough to contain most of the data you shouldn't see many exclusive locks in comparison to the number of shared locks. And "spinlock protecting the lwlocks"? Every LWLock has an internal spinlock to protect its state. So whenever somebody does a LWLockAcquire()/Release(), even if only in shared mode, we currently acquire that spinlock, manipulate the LWLock's state, and release the spinlock again. In lots of workloads that internal spinlock is the contention point, not the length over which the lwlock is held. Especially when they are mostly held in shared mode. Makes sense? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
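The structure Andres describes can be modeled in a few lines. This is a toy, not the actual lwlock.c code: the point is that even a shared acquisition must first take the internal spinlock to touch the state, so that spinlock's cache line becomes the contention point under heavy shared use:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of an LWLock: an internal spinlock guarding its state. */
typedef struct
{
    atomic_flag mutex;      /* the internal spinlock */
    int         shared;     /* number of shared holders */
    bool        exclusive;  /* is an exclusive holder present? */
} ToyLWLock;

/*
 * Acquire in shared mode.  Note that we spin on the internal mutex
 * even though shared holders do not conflict with each other; that is
 * exactly the cost being discussed in the thread.
 */
static bool
toy_lwlock_acquire_shared(ToyLWLock *lock)
{
    bool ok;

    while (atomic_flag_test_and_set(&lock->mutex))
        ;                       /* spin */
    ok = !lock->exclusive;      /* shared conflicts only with exclusive */
    if (ok)
        lock->shared++;
    atomic_flag_clear(&lock->mutex);
    return ok;
}
```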
Re: [HACKERS] same-address mappings vs. relative pointers
Hi Robert, On 2013-12-04 23:32:27 -0500, Robert Haas wrote: But I'm also learning painfully that this kind of thing only goes so far. For example, I spent some time looking at what it would take to provide a dynamic shared memory equivalent of palloc/pfree, a facility that I feel fairly sure would attract a few fans. I have to admit, I don't fully see the point in a real palloc() equivalent for shared memory. If you're allocating shared memory in a granular fashion at somewhat high frequency you're doing something *wrong*. And the result is not going to be scalable. Well, for regular palloc, we store a pointer to the memory context before the beginning of the chunk, so that when you call pfree you can find the memory context and stick the chunk on one of its free lists. So there are two pointers there: the pointer to the context is a pointer, of course, but so is the free list. Heh, heh. This is one of those times where life really, really would be easier if we were using C++. Just have a small smart-pointer wrapper, and we wouldn't need to worry too much. I still have mixed feelings about the idea of same-address mappings. On the one hand, on 64-bit architectures, there's a huge amount of address space available. Assuming the OS does something even vaguely sensible in terms of laying out the text, heap, stack, and shared library mappings, there ought to be many petabytes of address space that never gets touched, and it's hard to see why we couldn't find some place in there to stick our stuff. It's actually quite a bit short of petabytes. Most x86-64 CPUs have 48 bits of virtual address space, with the OS eating up a fair bit of that (so it doesn't need to tear down TLBs in system calls). I think cross-platform you end up with something around 8TB, up to 64TB on other OSs. Now, there are some CPUs coming out with bigger virtual memory sizes, but it's going to be longer than I am willing to wait till we are there. 
Now, on the other hand, as far as dynamic shared memory allocation and freeing is concerned, there aren't really THAT many places where I need a pointer, so using Size or uint64 or something to store an offset instead is annoying, but I have an idea how to do this that only uses pointers in a couple of places, so I think it can be made to work. I am not sure how much complaint that will provoke, though. And even if I do it, the first poor slob that wants to store a linked list or so in dynamic shared memory is going to be unhappy if they can't get a same-address mapping. Maybe that's OK; using linked lists in shared memory might not be a great idea in the first place. I'm sure there will be more than one person who wants to do it, though. Just because people want it, doesn't make it worthwhile to provide it ;) Any thoughts on what the least painful compromise is here? I think a reasonable route is having some kind of smart pointer on the C level that abstracts away the offset math and allows the use of pointers locally. Something like void *sptr_deref(sptr *); where the returned pointer can be used as long as it is purely in memory. And sptr_update(sptr *, void *); which allows an sptr to point elsewhere in the same segment. Plus lots of smarts. I continue to believe that a) using pointers in dynamically allocated segments is going to end up with lots of pain, and b) the pain from not having real pointers is manageable. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
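The sptr idea can be sketched in a few lines. Unlike the signatures Andres proposes, this toy version passes the segment base explicitly rather than deriving it from segment bookkeeping, and all of the "smarts" are omitted; the names are taken from the proposal, not from any existing code:

```c
#include <stddef.h>

/*
 * A relative pointer: an offset from the start of a shared memory
 * segment.  Because it stores no absolute address, it stays valid
 * even when the segment is mapped at different addresses in
 * different processes.
 */
typedef struct
{
    size_t off;
} sptr;

/* Turn a relative pointer back into a usable local pointer. */
static void *
sptr_deref(void *segment_base, const sptr *p)
{
    return (char *) segment_base + p->off;
}

/* Point an sptr at another location within the same segment. */
static void
sptr_update(void *segment_base, sptr *p, void *target)
{
    p->off = (size_t) ((char *) target - (char *) segment_base);
}
```

A linked list built this way stores sptr links instead of raw pointers, which is exactly the inconvenience the thread weighs against same-address mappings.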
Re: [HACKERS] Time-Delayed Standbys
On 5 December 2013 08:51, Magnus Hagander mag...@hagander.net wrote: Not recalling the older thread, but it seems this breaks on clock drift; I think we can fairly easily make that situation good enough. Just have IDENTIFY_SYSTEM return the current timestamp on the master, and refuse to start if the time difference is too great. Yes, that doesn't catch the case when the machines are in perfect sync when they start up and drift *later*, but it will catch the most common cases I bet. But I think that's good enough that we can accept the feature, given that *most* people will have ntp, and that it's a very useful feature for those people. But we could help people who run into it because of a simple config error.. Or maybe the suggested patch already does this, in which case ignore that part :) I think the very nature of *this* feature is that it doesn't *require* the clocks to be exactly in sync, even though that is the basis for measurement. The setting of this parameter for sane usage would be a minimum of 5 minutes, but more likely 30 minutes, 1 hour or more. In that case, a few seconds of drift either way makes no real difference to this feature. So IMHO, without prejudice to other features that may be more critically reliant on time synchronisation, we are OK to proceed with this specific feature. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Re: [BUGS] BUG #7873: pg_restore --clean tries to drop tables that don't exist
Hello 2013/12/5 Peter Eisentraut pete...@gmx.net Can someone in this thread clarify the commit fest situation? I see two entries that appear to be the same: https://commitfest.postgresql.org/action/patch_view?id=1174 https://commitfest.postgresql.org/action/patch_view?id=1175 I think the first one is a duplicate or obsolete. No, both are valid, and each solves a different issue. https://commitfest.postgresql.org/action/patch_view?id=1175 implements fully fault-tolerant DROP IF EXISTS statements. This patch is a prerequisite for the second patch, https://commitfest.postgresql.org/action/patch_view?id=1174, which is an implementation of a new pg_dump option, --if-exists. This option ensures that the dump uses fault-tolerant DROP statements. Regards Pavel
Re: [HACKERS] Re: [BUGS] BUG #7873: pg_restore --clean tries to drop tables that don't exist
2013/12/5 Dean Rasheed dean.a.rash...@gmail.com On 5 December 2013 01:33, Peter Eisentraut pete...@gmx.net wrote: Can someone in this thread clarify the commit fest situation? I see two entries that appear to be the same: https://commitfest.postgresql.org/action/patch_view?id=1174 https://commitfest.postgresql.org/action/patch_view?id=1175 I think the first one is a duplicate or obsolete. #1174 looks to be a separate feature. I don't think it's dependent on #1175 from a code standpoint, but it probably needs it to work properly in all situations. I think #1175 is close to being ready for commit. Pavel, will you produce an updated patch based on our last discussion? I'll set this patch to waiting on author. I expected that your version was final. I have no problem doing further enhancement myself, but I don't fully understand your last proposal. Can you specify it in more detail, please? Regards Pavel Regards, Dean
Re: [HACKERS] Changes in Trigger Firing
I read somewhere that the best editor is the one you master (1) :) 1: http://www.postgresql.org/message-id/m2wrs6giyp@hi-media.com Thanks, I am using Eclipse now. Any comments about the utility of this feature? Or is it just me who thinks this can be useful? I think users/developers of trigger-based replication tools can benefit from this. I am not sure how replication tools like Slony handle a scenario where a replicated table is partitioned and I add a new partition to it. Regards Sameer
Re: [HACKERS] Performance optimization of btree binary search
On Wed, Dec 4, 2013 at 5:28 PM, Peter Geoghegan p...@heroku.com wrote: I'm also curious about the impact on insertion into primary key indexes. Presently, we hold an exclusive buffer lock for the duration of a couple of operations when checkUnique != UNIQUE_CHECK_NO. _bt_binsrch() is one such operation. The other one there, _bt_check_unique(), is likely to be a lot cheaper than _bt_binsrch() on average, I think, so I'm cautiously optimistic that it'll be noticeable. I better go and check it out. I did a relatively short variant 'insert' pgbench-tools benchmark, with a serial primary index created on the pgbench_history table. Results: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/insert/ Note that in the interest of increasing the signal to noise ratio, I used unlogged tables. These are considerably shorter two minute runs, of inserts/writes, but even still it's interesting to compare it to my original 'select' benchmark. Checkpoint settings were very high. This looks to be about a wash. I'm not that surprised, because this is all about memory bandwidth, not lock contention. Even with the lock contention on the rightmost btree page, we top out at about the same TPS level as the 'select' benchmark (at least compared to Postgres master). However, we top out at that level much sooner here (growth is much steeper), because the contention is mostly on the same one or two btree leaf pages, which are well cached by CPU L1 cache. Writes are actually quite a bit faster than uniformly-distributed reads, even with the exclusive buffer locking (and lock contention) necessitated by writing. You might even say that my patch makes the first benchmark look more like this one (a less awkward comparison could probably be made between what I've done for uniform distribution read workloads, and a read workload with far more uneven data access, which will naturally have fewer cache misses and less TLB contention). 
I strongly suspect I wouldn't have seen a significant number of cache misses when binary searching on btree pages if I was to instrument this workload. It might still be worth pursuing the SERIAL hinting mechanism, but it's not going to be nearly as big a win as what I've done for reads, I think. It might also be worth looking at a workload with uniformly distributed btree index tuple insertions, where I'd expect similar advantages to those originally shown for reads. -- Peter Geoghegan
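For readers not familiar with _bt_binsrch(): it is essentially a lower-bound binary search over the sorted items on a btree page, and the random probes into the item array are where the cache misses Peter measured come from. A toy version using plain integers (the real routine compares index tuples through the scankey machinery):

```c
/*
 * Find the offset of the first element >= key in a sorted array.
 * Each halving step probes a new, possibly uncached location in the
 * array, which is why page-sized binary searches are sensitive to
 * cache and TLB behavior.
 */
static int
binsrch_lower_bound(const int *items, int nitems, int key)
{
    int low = 0;
    int high = nitems;

    while (low < high)
    {
        int mid = low + (high - low) / 2;

        if (items[mid] < key)
            low = mid + 1;
        else
            high = mid;
    }
    return low;
}
```

On a rightmost leaf page fed by a SERIAL key, the search repeatedly lands on the same few cache lines, which matches the observation that insert throughput tops out from lock contention rather than memory bandwidth.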
Re: [HACKERS] Proposal: variant of regclass
On 2013-12-04 20:25:53 -0500, Tom Lane wrote: Tatsuo Ishii is...@sraoss.co.jp writes: I would like to add a variant of regclass, which is exactly the same as the current regclass except it does not raise an error when the target table is not found. Instead it returns InvalidOid (0). I've sometimes thought we should just make all the reg* input converters act that way. It's not terribly consistent that they'll happily take numeric inputs that don't correspond to any existing OID. And more often than not, I've found the throw-an-error behavior to be annoying rather than helpful. I find that to be a bit of a scary change. I have seen applications check for the existence of tables using the error thrown by ::regclass. Now, they could change that to check for IS NULL, which would be better for them performance-wise, but the likelihood they will notice in time seems small. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proposal: variant of regclass
2013/12/5 Andres Freund and...@2ndquadrant.com On 2013-12-04 20:25:53 -0500, Tom Lane wrote: Tatsuo Ishii is...@sraoss.co.jp writes: I would like to add a variant of regclass, which is exactly the same as the current regclass except it does not raise an error when the target table is not found. Instead it returns InvalidOid (0). I've sometimes thought we should just make all the reg* input converters act that way. It's not terribly consistent that they'll happily take numeric inputs that don't correspond to any existing OID. And more often than not, I've found the throw-an-error behavior to be annoying rather than helpful. I find that to be a bit of a scary change. I have seen applications check for the existence of tables using the error thrown by ::regclass. Now, they could change that to check for IS NULL, which would be better for them performance-wise, but the likelihood they will notice in time seems small. This change can break some applications, but personally I like it. We can introduce some assert polymorphic function, CREATE OR REPLACE FUNCTION notnull(any, message text) RETURNS any, that can be used for checks inside SQL. Regards Pavel Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proposal: variant of regclass
On 2013-12-05 11:54:20 +0100, Pavel Stehule wrote: 2013/12/5 Andres Freund and...@2ndquadrant.com We can introduce some assert polymorphic function CREATE OR REPLACE FUNCTION notnull(any, message text) RETURNS any, that can be used for check inside SQL Uh. How is that going to help applications that upgraded, without having noticed a pretty obscure notice in the release notes? If this were day one, I would agree we should go that way, but today... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
On 5 December 2013 01:55, Peter Eisentraut pete...@gmx.net wrote: On Thu, 2013-11-14 at 12:11 +0530, Amit Kapila wrote: If an application wants to allow these connection parameters to be used, it would need to do PQenableStartServer() first. If it doesn't, those connection parameters will be rejected. Stupid idea: Would it work that we require an environment variable to be set before we allow the standalone_backend connection parameter? That's easy to do, easy to audit, and doesn't require any extra code in the individual clients. I like the idea... should it be in pg_hba.conf? Or should it be next to listen_addresses in postgresql.conf? hba might be less convenient but seems like the correct place. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
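Peter's environment-variable gate could look roughly like the sketch below. The variable name PG_ENABLE_STANDALONE and the helper are hypothetical, chosen purely for illustration; nothing here is from libpq:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Decide whether the (hypothetical) standalone_backend connection
 * parameter may be honored: only if the administrator has explicitly
 * opted in through the environment.  The check is trivially auditable
 * and requires no per-client code, which is the appeal of the idea.
 */
static bool
standalone_param_allowed(void)
{
    const char *v = getenv("PG_ENABLE_STANDALONE");

    return v != NULL && strcmp(v, "") != 0;
}
```

Andres's objection still applies to this sketch: an environment variable set process-wide does not stop a library from forking a server without the application's knowledge.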
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
On 2013-12-04 20:55:08 -0500, Peter Eisentraut wrote: On Thu, 2013-11-14 at 12:11 +0530, Amit Kapila wrote: If an application wants to allow these connection parameters to be used, it would need to do PQenableStartServer() first. If it doesn't, those connection parameters will be rejected. Stupid idea: Would it work that we require an environment variable to be set before we allow the standalone_backend connection parameter? That's easy to do, easy to audit, and doesn't require any extra code in the individual clients. I still don't think it's ok to start forking in arbitrary applications without their knowledge. So I don't think that buys us enough. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proposal: variant of regclass
2013/12/5 Andres Freund and...@2ndquadrant.com On 2013-12-05 11:54:20 +0100, Pavel Stehule wrote: 2013/12/5 Andres Freund and...@2ndquadrant.com We can introduce some assert polymorphic function CREATE OR REPLACE FUNCTION notnull(any, message text) RETURNS any, that can be used for check inside SQL Uh. How is that going to help applications that upgraded, without having noticed a pretty obscure notice in the release notes? This function would not replace an obscure note in the release notes. On the other hand, it is better to throw out an impractically designed feature early than to keep it forever. If there were not so much aversion to a GUC, I would say that a GUC could be a solution for some time. But there is. Regards Pavel If this were day one, I would agree we should go that way, but today... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] better atomics - v0.2
On 2013-11-19 10:37:35 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: The only animal we have that doesn't support quiet inlines today is HP-UX/ac++, and I think - as in patch 1 in the series - we might be able to simply suppress the warning there. Or just not worry about it, if it's only a warning? So, my suggestion on that end is that we remove the requirement for quiet inline now and see if it that has any negative consequences on the buildfarm for a week or so. Imo that's a good idea regardless of us relying on inlining support. Does anyone have anything against that plan? If not, I'll prepare a patch. Or does the warning mean code bloat (lots of useless copies of the inline function)? After thinking on that for a bit, yes that's a possible consequence, but it's quite possible that it happens in cases where we don't get the warning too, so I don't think that argument has too much bearing as I don't recall a complaint about it. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
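For context, "quiet inline" refers to whether a compiler stays silent about a static inline function that a translation unit includes but never uses. The conditional pattern at stake looks roughly like this; PG_USE_INLINE here merely stands in for whatever a configure probe would define, and is illustrative only:

```c
/*
 * Pretend a configure test determined the compiler supports quiet
 * static inlines (no unused-function warning per inclusion site).
 */
#define PG_USE_INLINE 1

#if PG_USE_INLINE
/* Header-style definition: every includer gets an inlinable copy. */
static inline int
add_one(int x)
{
    return x + 1;
}
#else
/* Fallback: declare only, and define once in a single .c file. */
extern int add_one(int x);
#endif
```

Dropping the requirement means always taking the first branch and tolerating (or suppressing) the warning on the one compiler that emits it.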
Re: [HACKERS] Get more from indices.
Kyotaro HORIGUCHI wrote: Thank you, but it seems to me too simplified. You made two major functional changes. Thank you for the comments! One is, you put the added code for get_relation_info() out of the block for the condition (info->relam == BTREE_AM_OID) (though amcanorder would be preferable..) Anyway, the reason for that placement is to guarantee that a 'full_ordered' index is always orderable. I believe the relation between them is not obvious, so your patch has an opportunity to make a wrong assumption for possible indexes which are unique but not orderable. Going your way, some additional work would be needed to judge whether an index is orderable or not when checking the extensibility of pathkeys. By checking the following condition in build_index_paths(), the updated version of the patch guarantees that the result of an index scan is ordered: index_is_ordered = (index->sortopfamily != NULL); Another is, you changed pathkeys expansion to be an all-or-nothing decision. While this change should simplify the code slightly, it also dismisses the opportunity for partially-extended pathkeys. Could you let me know the reason why you did so? At first I considered the partially-extended pathkey list made from query_pathkeys, as you proposed in the original versions of the patch. But I've started to doubt whether it's worth doing that, because the partially-extended pathkey list is merely one example: the original pathkey list can be partially extended in different ways, ie, ISTM the partially-extended pathkey list doesn't necessarily have optimality in anything significant. We might be able to partially extend the original pathkey list optimally in something significant, but that seems like useless complexity to me. So, I modified the patch to make the all-or-nothing decision. Thanks, Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get more from indices.
I wrote: Kyotaro HORIGUCHI wrote: Another is, you changed pathkeys expansion to be an all-or-nothing decision. While this change should simplify the code slightly, it also dismisses the opportunity for partially-extended pathkeys. Could you let me know the reason why you did so? At first I considered the partially-extended pathkey list made from query_pathkeys, as you proposed in the original versions of the patch. But I've started to doubt whether it's worth doing that, because the partially-extended pathkey list is merely one example: the original pathkey list can be partially extended in different ways, ie, ISTM the partially-extended pathkey list doesn't necessarily have optimality in anything significant. We might be able to partially extend the original pathkey list optimally in something significant, but that seems like useless complexity to me. So, I modified the patch to make the all-or-nothing decision. Here I mean the optimality for use in merge joins. Thanks, Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [bug fix] pg_ctl stop times out when it should respond quickly
From: Tom Lane t...@sss.pgh.pa.us If you're going to do a postmaster_is_alive check, why bother with repeated get_pgpid()? As I said yesterday, I removed get_pgpid() calls. I'll add this patch to 2014-1 commitfest this weekend if it is not committed until then. Regards MauMau pg_stop_fail_v2.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance optimization of btree binary search
On 2013-12-04 18:48:44 -0500, Robert Haas wrote: * When a type narrower than Datum is stored in a Datum, we place it in the * low-order bits and are careful that the DatumGetXXX macro for it discards * the unused high-order bits (as opposed to, say, assuming they are zero). * This is needed to support old-style user-defined functions, since depending * on architecture and compiler, the return value of a function returning char * or short may contain garbage when called as if it returned Datum. And record_image_eq does a rather elaborate dance around here, calling the appropriate GET_x_BYTES macro depending on the type-width. If we can really count on the high-order bits to be zero, that's all completely unnecessary tomfoolery. I don't think we can get rid of that dance in record_image_eq - it very well could be used on records originally generated when those bits haven't been guaranteed to be zeroed. Other use cases where the appropriate DatumGetFoo() macros are used don't have a problem with that, since it's cleared again there, but in record_image_eq we can't rely on that. Or am I missing something? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: variant of regclass
Hello, Andres. You wrote: AF On 2013-12-04 20:25:53 -0500, Tom Lane wrote: Tatsuo Ishii is...@sraoss.co.jp writes: I would like to add a variant of regclass, which is exactly the same as the current regclass except it does not raise an error when the target table is not found. Instead it returns InvalidOid (0). I've sometimes thought we should just make all the reg* input converters act that way. It's not terribly consistent that they'll happily take numeric inputs that don't correspond to any existing OID. And more often than not, I've found the throw-an-error behavior to be annoying, not helpful. AF I find that to be a bit of a scary change. I have seen applications check AF for the existence of tables using the error thrown by ::regclass. Now, AF they could change that to check for IS NULL, which would be better for AF them performance-wise, but the likelihood they will notice in time seems AF small. I personally see two approaches: 1. Implement a GUC variable controlling this behaviour per session 2. Introduce new safe reg* variants, e.g. sregclass, sregtype, etc. AF Greetings, AF Andres Freund AF -- AF Andres Freund http://www.2ndQuadrant.com/ AF PostgreSQL Development, 24x7 Support, Training Services -- With best wishes, Pavel mailto:pa...@gf.microolap.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tracking commit timestamps
On 2013-12-02 02:39:55 -0500, Jaime Casanova wrote: === performance === I expected a performance regression with the module turned on because of the new XLOG records and writes of files in pg_committs, but the performance drop is excessive.

Master 437.835674 tps
Patch, guc off 436.4340982 tps
Patch, guc on 0.370524 tps

There clearly is something wrong. The additional amount of xlog records should be nearly unnoticeable because committs piggybacks on commit records. I started the server with the module off, then after some work turned the module on and restarted the server and ran a few benchmarks, then turned it off again, restarted the server and got a message like:

LOG: database system was not properly shut down; automatic recovery in progress
LOG: record with zero length at 0/3112AE58
LOG: redo is not required
FATAL: cannot make new WAL entries during recovery
LOG: startup process (PID 24876) exited with exit code 1
LOG: aborting startup due to startup process failure

Alvaro: That's because of the location where you call StartupCommitts - a) it's called at the beginning of recovery if HS is enabled, b) it's called before StartupXLOG() does LocalSetXLogInsertAllowed(). So I think you need to split StartupCommitts into StartupCommitts() and TrimCommitts(), where only the latter does the trimming if committs is disabled. I also wonder if track_commit_timestamp should be tracked by the XLOG_PARAMETER_CHANGE stuff, so it's not disabled on the standby when it's enabled on the primary? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] same-address mappings vs. relative pointers
On Thu, Dec 5, 2013 at 4:56 AM, Andres Freund and...@2ndquadrant.com wrote: Hi Robert, On 2013-12-04 23:32:27 -0500, Robert Haas wrote: But I'm also learning painfully that this kind of thing only goes so far. For example, I spent some time looking at what it would take to provide a dynamic shared memory equivalent of palloc/pfree, a facility that I feel fairly sure would attract a few fans. I have to admit, I don't fully see the point in a real palloc() equivalent for shared memory. If you're allocating shared memory in a granular fashion at somewhat high frequency you're doing something *wrong*. And the result is not going to be scalable. Why? Lots of people have written lots of programs that do just that. I agree it's overdone, but saying it should never happen seems like an over-rotation in the opposite direction. In any case, the application that led me to this was actually parallel internal sort. Let's suppose we had an implementation of parallel internal sort. You'd load all the tuples you want to sort into dynamic shared memory (just as we now load them into palloc'd chunks), start a bunch of workers, and the original backend and the newly-started workers would cooperate to sort your data. Win. It's fairly clear that, technically speaking, you do NOT need a real allocator for this. You can just shove the first tuple that you store into shared memory, say right up against the end of the region. Then you can place the next tuple immediately before it and the following one just before that, and so on. This is very simple to implement and also extremely memory-efficient, so superficially it's quite an appealing approach. But now let's suppose the input data was estimated to fit in work_mem but in the end did not, and therefore we need to instead do an external sort. That means we're going to start evicting tuples from memory and to make room for new tuples we read from the input stream. 
Well, now our strategy of tight-packing everything does not look so good, because we have no way of tracking which tuples we no longer need and reusing that space for new tuples. If external sort is not something we know how to do in parallel, we can potentially copy all of the tuples out of the dynamic shared memory segment and back to backend-private memory in palloc'd chunks, and then deallocate the dynamic shared memory segment... but that's potentially expensive if work_mem is large, and it temporarily uses double what we're allowed by work_mem. And suppose we want to do a parallel external sort with, say, the worker backend pushing tuples into a heap stored in the dynamic shared memory segment and other backends writing them out to tapes and removing them from the heap. You can't do that without some kind of space management, i.e. an allocator. This is one of those times where life really, really would be easier if we were using C++. Just have a small smart-pointer wrapper, and we wouldn't need to worry too much. Yes, a template class would be perfect here. I still have mixed feelings about the idea of same-address mappings. On the one hand, on 64-bit architectures, there's a huge amount of address space available. Assuming the OS does something even vaguely sensible in terms of laying out the text, heap, stack, and shared library mappings, there ought to be many petabytes of address space that never gets touched, and it's hard to see why we couldn't find some place in there to stick our stuff. It's actually quite a bit away from petabytes. Most x86-64 CPUs have 48 bits of virtual memory, with the OS eating up a fair bit of that (so it doesn't need to tear down TLBs in system calls). I think cross-platform you end up with something around 8TB, up to 64TB on other OSs. Now, there are some CPUs coming out with bigger virtual memory sizes, but it's going to be longer than I am willing to wait till we are there. Oh. Well, that's a shame. 
Any thoughts on what the least painful compromise is here? I think a reasonable route is having some kind of smart pointer at the C level that abstracts away the offset math and allows pointers to be used locally. Something like void *sptr_deref(sptr *); where the returned pointer can be used as long as it is purely in memory. And sptr_update(sptr *, void *); which allows an sptr to point elsewhere in the same segment. + lots of smarts Can you elaborate on this a little bit? I'm not sure I understand what you're getting at here. I continue to believe that a) using pointers in dynamically allocated segments is going to end up with lots of pain, and b) the pain from not having real pointers is manageable. Fair opinion, but I think we will certainly need to pass around memory offsets in some form for certain things. Even for purely internal parallel sort, the Tuplesort objects need to store the offset of the actual tuple within the segment. My first guess as to how to do that was to assume that sizeof(Size) == sizeof(void *) and just
Re: [HACKERS] [bug fix] pg_ctl fails with config-only directory
From: Amit Kapila amit.kapil...@gmail.com On Wed, Dec 4, 2013 at 7:57 PM, MauMau maumau...@gmail.com wrote: * Approach 1 When postgres starts, it removes Administrator privileges from its own process. But is this possible at all? Windows security API is complex and provides many functions. It seems difficult to understand them. I'm afraid it would take a long time to figure out the solution. Is there any good web page to look at? * Approach 2 Do not call check_root() on Windows when -C, --describe-config, or --single is specified when running postgres. This would be easy, and should not be dangerous in terms of security because attackers cannot get into the server process via network. Approach-2 has been discussed previously to resolve it and it doesn't seem to be a good way to handle it. Please refer link: http://www.postgresql.org/message-id/1339601668-sup-4...@alvh.no-ip.org You can go through that mail chain and see if there can be a better solution than Approach-2. Thanks for the info. I understand your feeling, but we need to be practical. I believe we should not leave a bug and inconvenience by worrying about theory too much. In addition to the config-only directory, the DBA with admin privs will naturally want to run postgres -C and postgres --describe-config, because they are useful and so described in the manual. I don't see any (at least big) risk in allowing postgres -C/--describe-config to run with admin privs. In addition, recent Windows versions help to secure the system by revoking admin privs with UAC, don't they? Disabling UAC is not recommended. I couldn't find a way to let postgres delete its token groups from its own primary access token. There doesn't seem to be a reasonably clean and good way. So I had to choose approach 2. Please find attached the patch. This simple and not-complex change worked well. I'd like to add this to 2014-1 commitfest this weekend unless a better approach is proposed. 
Regards MauMau config_dir_win.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v6.7
Hello, Will send the rebased version as soon as I've addressed your comments. Thank you. = 0001: - You assigned HeapTupleGetOid(tuple) to relid, which is read at several points but never modified. Nevertheless, you left HeapTupleGetOid not replaced there. I think 'relid' just for repeated reading has far smaller merit compared to the demerit of lowered readability. You'd better make them uniform either way. It's primarily to get the line lengths halfway under control. Mm. I'm afraid I couldn't catch your words; do you mean that IsSystemClass() or isTempNamespace() could change the NULL bitmap in the tuple? = 0002: - You are identifying the wal_level with the expression 'wal_level >= WAL_LEVEL_LOGICAL', but it seems somewhat improper. Hm. Why? It actually does no harm and is somewhat trifling, so I don't assert you should fix it. The reason for the comment is that greater values for wal_level are undefined at the moment, so strictly speaking, such values should be handled as invalid ones. Although there is a practice of avoiding loop overruns by comparing counters with an expression like (i < CEILING). For instance, I found a macro whose comment reads as follows, and I feel a bit uneasy with it :-) It's nothing more than that. | /* Do we need to WAL-log information required only for Hot Standby? */ | #define XLogStandbyInfoActive() (wal_level >= WAL_LEVEL_HOT_STANDBY) - RelationIsAccessibleInLogicalDecoding and RelationIsLogicallyLogged are identical in code. Together with the name ambiguity, this is quite confusing and a cause of future misuse between these macros, I suppose. Plus, the names seem too long. Hm, don't think they are equivalent, rather the contrary. Note one returns false if IsCatalogRelation(), the other true. Oops, I'm sorry. I understand they are not the same. Then I have other questions. The name of the first one, 'RelationIsAccessibleInLogicalDecoding', doesn't seem to represent what its comment says. 
| /* True if we need to log enough information to have access via | decoding snapshot. */ Making a macro name for this comment directly, I suppose it would be something like 'NeedsAdditionalInfoInLogicalDecoding' or, more directly, 'LogicalDecodingNeedsCids' or so.. - In heap_insert, the information conveyed in rdata[3] seems better placed in rdata[2] because of the scarcity of the additional information. Only XLOG_HEAP_CONTAINS_NEW_TUPLE seems to be needed. The same goes for heap_multi_insert and heap_update. Could you explain a bit more what you mean by that? The reason it's a separate rdata entry is that otherwise a full page write will remove the information. Sorry, I missed the comment 'so that an eventual FPW doesn't remove the tuple's data'. Although given the necessity of removal prevention, rewriting rdata[].buffer, which is required by design (correct?), with InvalidBuffer seems a bit outrageous to me and obfuscates its objective. Another mechanism would be preferable, I suppose. The most straightforward way to do that would be a new flag bit for XLogRecData, say, survive_fpw or something. - In heap_multi_insert, need_cids is referred to only once, so it might be better removed. It's accessed in a loop over potentially quite some items; that's why I moved it into an extra variable. Sorry for bothering you with comments beside the point.. But the scope of need_cids is narrower than it currently is. I think the definition should be moved into the block for 'if (needwal)'. - In heap_delete, at the point of adding the replica identity, as above, rdata[3] could be moved into rdata[2], making a new type like 'xl_heap_replident'. Hm. I don't think that'd be a good idea, because we'd then need special-case decoding code for deletes, because the wal format would be different for inserts/updates and deletes. Hmm. Although one common xl_heap_replident is impractical, splitting a logically single entity into two or more XLogRecDatas also seems not to be a good idea. 
- In heapam_xlog.h, the new type xl_heap_header_len is inadequate both in naming, which is confusing, and in construction, in which the header in xl_heap_header is no longer a header, plus the indecisive member name 't_len'.. The header bit in the name refers to the fact that it contains information about a HeapTuple's header, not that it's a header itself. Do you have a better suggestion than xl_heap_header_len? Sorry, I was confused while writing the comment; the order of members in xl_heap_header_len doesn't matter. I got the reason for the xl_heap_header_len and the whole xlog record image after re-reading the relevant code. The update record came to contain two variable-length data items with this patch, so the length of the tuple body cannot be calculated only from the whole record length and the header lengths. Considering the above, looking
Re: [HACKERS] Performance optimization of btree binary search
On 12/05/2013 07:30 AM, Tom Lane wrote: Peter Eisentraut pete...@gmx.net writes: On Wed, 2013-12-04 at 20:27 -0500, Tom Lane wrote: Lazy people? I'm not in a hurry to drop it; it's not costing us much to just sit there, other than in this connection which we see how to fix. Actually, I think it probably costs a fair portion of extension authors when their initial code crashes because they forgot to declare all their functions V1. I think it might actually be more of a bother to lazy people than a benefit. Hm. We have heard one or two complaints like that, but not a huge number. It happens to me about 75% of the time when I write a new C function. These days, I usually realize it pretty quickly. I wonder how easy it would be to make the compiler produce a warning about it. Or issue a warning in PostgreSQL when you do CREATE FUNCTION and the C function appears to be a V0 function. I'm worried about breaking code that's been working since god-knows-when; but I will concede there's little evidence that there's very much of that out there either. I wouldn't mind just dropping the V0 support. It's not difficult to modify your function for V1 convention, so if there's still anyone out there using V0, it wouldn't be unreasonable to require them to update. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] shared memory message queues
Hi, Planned to look at this for a while... Not a detailed review, just some thoughts. I'll let what I read sink in and possibly comment later. On 2013-10-31 12:21:31 -0400, Robert Haas wrote: The attached patches attempt to rectify some of these problems. Well, I wouldn't call it problems. Just natural, incremental development ;) Patch #1, on-dsm-detach-v1.patch, adds the concept of on_dsm_detach hooks [snip] The part of this patch which I suppose will elicit some controversy is that I've had to rearrange on_shmem_exit a bit. It turns out that during shmem_exit, we do user-level cleanup, like aborting the transaction, first. We expect that will probably release all of our shared-memory resources. Hm. The API change of on_shmem_exit() is going to cause a fair bit of pain. There are a fair number of extensions out there using it and all would need to be littered by version dependent #ifdefs. What about providing a version of on_shmem_exit() that allows to specify the exit phase, and make on_shmem_exit() a backward compatible wrapper? FWIW, I find on_dsm_detach_cancel() to be a confusing name. I thought it might be a variant that is called in different circumstances than on_dsm_detach(). Maybe go with cancel_on_shmem_detach(), like with cancel_shmem_exit()? Patch #2, shm-toc-v1.patch, provides a facility for sizing a dynamic shared memory segment before creation, and for dividing it up into chunks after it's been created. It therefore serves a function quite similar to RequestAddinShmemSpace, except of course that there is only one main shared memory segment created at postmaster startup time, whereas new dynamic shared memory segments can come into existence on the fly; and it serves even more conspicuously the function of ShmemIndex, which enables backends to locate particular data structures within the shared memory segment. 
It is however quite a bit simpler than the ShmemIndex mechanism: we don't need the same level of extensibility here that we do for the main shared memory segment, because a new extension need not piggyback on an existing dynamic shared memory segment, but can create a whole segment of its own. Hm. A couple of questions/comments: * How do you imagine keys to be used/built? * Won't the sequential search over toc entries become problematic? * Some high level overview of how it's supposed to be used wouldn't hurt. * the estimator stuff doesn't seem to belong in the public header? Patch #3, shm-mq-v1.patch, is the heart of this series. It creates an infrastructure for sending and receiving messages of arbitrary length using ring buffers stored in shared memory (presumably dynamic shared memory, but hypothetically the main shared memory segment could be used). The API seems to assume it's in dsm tho? Queues are single-reader and single-writer; they use process latches to implement waiting for the queue to fill (in the case of the reader) or drain (in the case of the writer). Hm. Too bad, I'd hoped for single-reader, multiple-writer :P A non-blocking mode is also available for situations where other options might lead to deadlock. That's cool. I wonder if there's a way to discover, on the writing end, when new data can be sent. For reading there's a comment about waiting for the latch... Couple of questions: * Some introductory comments about the concurrency approach would be good. * shm_mq_attach() is talking about BackgroundWorkerHandles - is there a limitation that a bgworker has to be involved? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Changes in Trigger Firing
I have finally been able to get the right set of files. I am going with the below approach:

1) Add a new column in pg_trigger called tgiscascaded
2) Change pg_trigger.h for this
3) Make changes in trigger.c to insert these values
4) Make corresponding changes in
   - reltrigger.h
   - parsenodes.h
5) Change gram.y to understand a new keyword, CASCADED, in the CREATE TRIGGER statement
   - The new option will be optional and will apply only to non-constraint triggers
   - If the option is specified, the trigger will CASCADE to child tables
6) I compiled the source code (modified with the above changes) and it
   - Added a new column in pg_trigger
   - The new column is able to derive its value from the CREATE TRIGGER statement based on whether CASCADED was specified or not
   - The value is true if the option was specified
   - The value is false if the option was not specified
7) Now I will work on the trigger firing mechanism with the below approach
   - Before firing triggers, check if it's an inherited table
   - First get CASCADED triggers for parents and fire them
   - Proceed as usual

Any suggestions on improving the approach or comments on the need for this feature are welcome. Regards Sameer
Re: [HACKERS] pgsql: Fix a couple of bugs in MultiXactId freezing
Andres Freund wrote: On 2013-12-03 19:55:40 -0300, Alvaro Herrera wrote: I added a new isolation spec to test this specific case, and noticed something that seems curious to me when that test is run in REPEATABLE READ mode: when the UPDATE is aborted, the concurrent FOR UPDATE gets a can't serialize due to concurrent update, but when the UPDATE is committed, FOR UPDATE works fine. Shouldn't it happen pretty much exactly the other way around? That's 247c76a989097f1b4ab6fae898f24e75aa27fc1b . Specifically the DidCommit() branch in test_lockmode_for_conflict(). You forgot something akin to /* locker has finished, all good to go */ if (!ISUPDATE_from_mxstatus(status)) return HeapTupleMayBeUpdated; So I did. Here are two patches, one to fix this issue, and the other to fix the issue above. I intend to apply these two to 9.3 and master, and then apply your freeze fix on top (which I'm cleaning up a bit -- will resend later.) -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services From 48b6f5eead73a41880dd311b7a55d2b88794ca3d Mon Sep 17 00:00:00 2001 From: Alvaro Herrera alvhe...@alvh.no-ip.org Date: Wed, 4 Dec 2013 17:20:14 -0300 Subject: [PATCH 1/4] Avoid resetting Xmax when it's a multi with an aborted update HeapTupleSatisfiesUpdate can very easily forget tuple locks while checking the contents of a multixact and finding it contains an aborted update, by setting the HEAP_XMAX_INVALID bit. This would lead to concurrent transactions not noticing any previous locks held by transactions that might still be running, and thus being able to acquire subsequent locks they wouldn't be normally able to acquire. This bug was introduced in commit 1ce150b7bb; backpatch this fix to 9.3, like that commit. This change reverts the change to the delete-abort-savept isolation test in 1ce150b7bb, because that behavior change was caused by this bug. Noticed by Andres Freund while investigating a different issue reported by Noah Misch. 
---
 src/backend/utils/time/tqual.c                 | 21
 .../isolation/expected/delete-abort-savept.out | 13 +---
 2 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 4d63b1c..f787f2c 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -789,13 +789,26 @@ HeapTupleSatisfiesUpdate(HeapTupleHeader tuple, CommandId curcid,
 			if (TransactionIdDidCommit(xmax))
 				return HeapTupleUpdated;
-			/* no member, even just a locker, alive anymore */
+			/*
+			 * By here, the update in the Xmax is either aborted or crashed, but
+			 * what about the other members?
+			 */
+			if (!MultiXactIdIsRunning(HeapTupleHeaderGetRawXmax(tuple)))
+			{
+				/*
+				 * There's no member, even just a locker, alive anymore, so we can
+				 * mark the Xmax as invalid.
+				 */
 				SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
 							InvalidTransactionId);
-
-			/* it must have aborted or crashed */
-			return HeapTupleMayBeUpdated;
+				return HeapTupleMayBeUpdated;
+			}
+			else
+			{
+				/* There are lockers running */
+				return HeapTupleBeingUpdated;
+			}
 		}

 	if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
diff --git a/src/test/isolation/expected/delete-abort-savept.out b/src/test/isolation/expected/delete-abort-savept.out
index 5b8c444..3420cf4 100644
--- a/src/test/isolation/expected/delete-abort-savept.out
+++ b/src/test/isolation/expected/delete-abort-savept.out
@@ -23,11 +23,12 @@ key value
 step s1svp: SAVEPOINT f;
 step s1d: DELETE FROM foo;
 step s1r: ROLLBACK TO f;
-step s2l: SELECT * FROM foo FOR UPDATE;
+step s2l: SELECT * FROM foo FOR UPDATE; <waiting ...>
+step s1c: COMMIT;
+step s2l: <... completed>
 key value

 1 1
-step s1c: COMMIT;
 step s2c: COMMIT;

 starting permutation: s1l s1svp s1d s1r s2l s2c s1c
@@ -38,12 +39,8 @@ key value
 step s1svp: SAVEPOINT f;
 step s1d: DELETE FROM foo;
 step s1r: ROLLBACK TO f;
-step s2l: SELECT * FROM foo FOR UPDATE;
-key value
-
-1 1
-step s2c: COMMIT;
-step s1c: COMMIT;
+step s2l: SELECT * FROM foo FOR UPDATE; <waiting ...>
+invalid permutation detected

 starting permutation: s1l s1svp s1d s2l s1r s1c s2c
 step s1l: SELECT * FROM foo FOR KEY SHARE;
--
1.7.10.4

From fbe641eeb7eb535c1cac422e3f5962928e3b0362 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera alvhe...@alvh.no-ip.org
Date: Wed, 4 Dec 2013 17:26:07 -0300
Subject: [PATCH 2/4] Fix improper abort during update chain locking

In 247c76a98909, I added some code to do fine-grained checking of MultiXact status of locking/updating transactions when traversing an update chain. There was a thinko in that patch which would have the traversing abort, that is return HeapTupleUpdated, when the
Re: [HACKERS] Why we are going to have to go DirectIO
* Peter Geoghegan (p...@heroku.com) wrote: On Wed, Dec 4, 2013 at 11:07 AM, Josh Berkus j...@agliodbs.com wrote: But you know what? 2.6, overall, still performs better than any kernel in the 3.X series, at least for Postgres. What about the fseek() scalability issue? Not to mention that the 2.6 which I suspect you're referring to (RHEL) isn't exactly 2.6... Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] pgsql: Fix a couple of bugs in MultiXactId freezing
Hi, On 2013-12-05 10:42:35 -0300, Alvaro Herrera wrote: I intend to apply these two to 9.3 and master, and then apply your freeze fix on top (which I'm cleaning up a bit -- will resend later.) I sure hope it gets cleaned up - it's an evening's hack, with a good glass of wine on top. Do you agree with the general direction? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Changes in Trigger Firing
One scenario where I can foresee an issue is when someone uses the table name within the trigger function on which the trigger was fired. I am not sure how and what issue might crop up but this will be one of my test cases. -- View this message in context: http://postgresql.1045698.n5.nabble.com/Hacker-RFC-Changes-in-Trigger-Firing-tp5781566p5781861.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Feature request: Logging SSL connections
Hello, we were really missing the information in our log files if (and which of) our users are using SSL during their connections. The attached patch is a very simple solution to this problem - it just tests if the ssl pointer in Port is null. If not, it adds SSL to the logfile, otherwise it adds NOSSL. Or have I been missing an existing way to reach the same? Regards, Andreas -- - __ Dr. Andreas Kunert / __/ / / / __/ HU-Berlin, ZE Rechenzentrum (CMS) / /_ / / / / __\\ www.hu-berlin.de/~kunert /___/ /_/_/_/ /___/ - --- postgresql-9.1.10/src/backend/utils/init/postinit.c 2013-10-08 05:13:47.0 +0200 +++ /usr/postgres/sources/postgresql-9.1.10/src/backend/utils/init/postinit.c 2013-12-05 14:21:53.180148568 +0100 @@ -225,9 +225,22 @@ (errmsg("replication connection authorized: user=%s", port->user_name))); else +#ifdef USE_SSL + if (port->ssl > 0) { +ereport(LOG, + (errmsg("connection authorized: user=%s database=%s SSL", +port->user_name, port->database_name))); + } + else { +ereport(LOG, + (errmsg("connection authorized: user=%s database=%s NOSSL", +port->user_name, port->database_name))); + } +#else ereport(LOG, (errmsg("connection authorized: user=%s database=%s", port->user_name, port->database_name))); +#endif } set_ps_display("startup", false); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] same-address mappings vs. relative pointers
On 12/05/2013 06:32 AM, Robert Haas wrote: During development of the dynamic shared memory facility, Noah and I spent a lot of time arguing about whether it was practical to ensure that a dynamic shared memory segment got mapped at the same address in every backend that used it. My vote goes for not trying to map at same address. I don't see how you could do that reliably, and I don't see much need for it anyway. That said, it naturally depends on what you're going to use the dynamic shared memory facility for. It's the same problem I have with reviewing the already-committed DSM patch and the message queue patch. The patches look fine as far as they go, but I have the nagging feeling that there are a bunch of big patches coming up later that use the facilities, and I can't tell if the facilities are over-engineered for what's actually needed, or not sufficient. As a side-note, I've been thinking that we don't really need same-address mapping for shared_buffers either. Getting rid of it wouldn't buy us anything right now, but if we wanted e.g to make shared_buffers changeable without a restart, that would be useful. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance optimization of btree binary search
Andres Freund and...@2ndquadrant.com writes: On 2013-12-04 18:48:44 -0500, Robert Haas wrote: And record_image_eq does a rather elaborate dance around here, calling the appropriate GET_x_BYTES macro depending on the type-width. If we can really count on the high-order bits to be zero, that's all completely unnecessary tomfoolery. I don't think we can get rid of that dance in record_image_eq - it very well could be used on records originally generated when those bits haven't been guaranteed to be zeroed. No, you're failing to think about the context here. A Datum is not an on-disk concept, only an in-memory one. In the case of record_image_eq, simplifying the comparison of by-value Datums is unconditionally safe as long as heap_deform_tuple is consistent about using the proper GetDatum macros, which it is. (So actually, that elaborate dance is a 100% waste of effort today, regardless of any changes we're discussing here.) The risk we take by simplifying comparisons in a more general context is that some function somewhere might've been sloppy about doing a native-type-to-Datum conversion on its result. In the case of V0 functions that risk is unavoidable except by adding some explicit cleanup code, but I'm a bit worried that somebody, particularly third-party code, might've sloppily written "return foo" in a V1 function when "return Int32GetDatum(foo)" would be correct. In that case, the resultant Datum might have not-per-spec high-order bits, and if it reaches the fast comparator without ever having been squeezed into a physical tuple, we've got a problem. I would certainly regard this as a bug in that function, but nonetheless it's a hazard that we need to set against any performance improvement that we can buy this way. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] same-address mappings vs. relative pointers
On 2013-12-05 15:57:22 +0200, Heikki Linnakangas wrote: As a side-note, I've been thinking that we don't really need same-address mapping for shared_buffers either. Getting rid of it wouldn't buy us anything right now, but if we wanted e.g to make shared_buffers changeable without a restart, that would be useful. I doubt it's that easy to get rid of atm (at least in !EXEC_BACKEND), but if we ever want to properly support ASLR in EXEC_BACKEND environments, we might need to go there. The hacks Windows does around it are already quite ugly. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance optimization of btree binary search
On 2013-12-05 08:58:55 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: I don't think we can get rid of that dance in record_image_eq - it very well could be used on records originally generated when those bits haven't been guaranteed to be zeroed. No, you're failing to think about the context here. Ah yes. I was completely forgetting that heap_form_tuple()/heap_fill_tuple() will properly take care to only use meaningful parts of the (to-be-)stored data, not random padding. Thanks. The risk we take by simplifying comparisons in a more general context is that some function somewhere might've been sloppy about doing a native-type-to-Datum conversion on its result. In the case of V0 functions that risk is unavoidable except by adding some explicit cleanup code, but I'm a bit worried that somebody, particularly third-party code, might've sloppily written "return foo" in a V1 function when "return Int32GetDatum(foo)" would be correct. In that case, the resultant Datum might have not-per-spec high-order bits, and if it reaches the fast comparator without ever having been squeezed into a physical tuple, we've got a problem. Too bad V1 hasn't insisted on using PG_RETURN_* macros. That would have allowed asserts checking against such cases by setting fcinfo->has_returned = true or such... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Regression tests failing if not launched on db regression
Michael Paquier michael.paqu...@gmail.com writes: It happens that the following regression tests are failing if they are run on a database not named regression: This does not seem like a bug to me, although maybe we'd better update the documentation to specify that you need to use a DB named regression. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [bug fix] pg_ctl always uses the same event source
Hello, I've removed a limitation regarding event log on Windows with the attached patch. I hesitate to admit this is a bug fix and want to regard this as an improvement, but maybe it's a bug fix from users' perspective. Actually, I received problem reports from some users. [Problem] pg_ctl always uses event source PostgreSQL to output messages to the event log. Some example messages are: server starting server started This causes the problems below: 1. Cannot distinguish PostgreSQL instances because they use the same event source. 2. If you use multiple installations of PostgreSQL on the same machine, whether they are the same or different versions, Windows event viewer reports an error "event source not found" in the following sequence. Other system administration software which deals with the event log may show other anomalies. 2-1 Install the first PostgreSQL (e.g. 9.3) into dir_1, register dir_1\lib\pgevent.dll for event source PostgreSQL, then create instance_1. 2-2 Install the second PostgreSQL (e.g. 9.2) into dir_2 as part of some packaged application, register dir_2\lib\pgevent.dll for event source PostgreSQL, then create instance_2. The application uses PostgreSQL as its internal data repository. It sets event_source = 'appname_db' in instance_2's postgresql.conf. 2-3 At this point, there's no problem. 2-4 Uninstall the second PostgreSQL. instance_2 is deleted. The packaged application installer may or may not unregister pgevent.dll with regsvr32.exe /u. Anyway, dir_2\bin\pgevent.dll is deleted. 2-5 instance_1 continues to run and output messages to event log. 2-6 When you view the messages with event viewer, it reports an error because the event source PostgreSQL and/or pgevent.dll was deleted in 2-4. [Fix] Make pg_ctl use event_source setting in postgresql.conf. This eliminates the need for every instance to use the event source PostgreSQL for its pg_ctl. 
Regards MauMau pg_ctl_eventsrc.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Backup throttling
On 12/02/2013 02:23 PM, Boszormenyi Zoltan wrote: Hi, I am reviewing your patch. Thanks. New version attached. * Does it follow the project coding guidelines? Yes. A nitpick: this else branch below might need brackets because there is also a comment in that branch: + /* The 'real data' starts now (header was ignored). */ + throttled_last = GetCurrentIntegerTimestamp(); + } + else + /* Disable throttling. */ + throttling_counter = -1; + Done. * Does it do what it says, correctly? Yes. Although it should be mentioned in the docs that rate limiting applies to walsenders individually, not globally. I tried this on a freshly created database: $ time pg_basebackup -D data2 -r 1M -X stream -h 127.0.0.1 real 0m26.508s user 0m0.054s sys 0m0.360s The source database had 3 WAL files in pg_xlog, one of them was also streamed. The final size of data2 was 43MB or 26MB without pg_xlog, i.e. without the -X stream option. The backup used 2 walsenders in parallel (one for WAL) which is a known feature. Right, if the method is 'stream', a separate WAL sender is used and the transfer is not limited: the client must keep up with the stream unconditionally. I added a note to documentation. But there's no reason not to throttle WAL files if the method is 'fetch'. It's fixed now. Another note. This chunk should be submitted separately as a comment bugfix: diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c index c3c71b7..5736fd8 100644 --- a/src/backend/utils/adt/timestamp.c +++ b/src/backend/utils/adt/timestamp.c @@ -1288,7 +1288,7 @@ GetCurrentTimestamp(void) /* * GetCurrentIntegerTimestamp -- get the current operating system time as int64 * - * Result is the number of milliseconds since the Postgres epoch. If compiled + * Result is the number of microseconds since the Postgres epoch. If compiled * with --enable-integer-datetimes, this is identical to GetCurrentTimestamp(), * and is implemented as a macro. */ Will do. 
// Tony diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml index c379df5..e878592 100644 --- a/doc/src/sgml/ref/pg_basebackup.sgml +++ b/doc/src/sgml/ref/pg_basebackup.sgml @@ -189,6 +189,26 @@ PostgreSQL documentation </varlistentry> <varlistentry> + <term><option>-r</option></term> + <term><option>--max-rate</option></term> + <listitem> + <para> +The maximum amount of data transferred from server per second. +The purpose is to limit impact of <application>pg_basebackup</application> +on a running master server. + </para> + <para> +This option always affects transfer of the data directory. Transfer of +WAL files is only affected if the collection method is <literal>fetch</literal>. + </para> + <para> +Suffixes <literal>k</literal> (kilobytes) and <literal>M</literal> +(megabytes) are accepted. For example: <literal>10M</literal> + </para> + </listitem> + </varlistentry> + + <varlistentry> <term><option>-R</option></term> <term><option>--write-recovery-conf</option></term> <listitem> diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index ba8d173..f194746 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -33,6 +33,7 @@ #include "utils/builtins.h" #include "utils/elog.h" #include "utils/ps_status.h" +#include "utils/timestamp.h" #include "pgtar.h" typedef struct @@ -42,6 +43,7 @@ typedef struct bool fastcheckpoint; bool nowait; bool includewal; + uint32 maxrate; } basebackup_options; @@ -59,6 +61,7 @@ static void perform_base_backup(basebackup_options *opt, DIR *tblspcdir); static void parse_basebackup_options(List *options, basebackup_options *opt); static void SendXlogRecPtrResult(XLogRecPtr ptr, TimeLineID tli); static int compareWalFileNames(const void *a, const void *b); +static void throttle(size_t increment); /* Was the backup currently in-progress initiated in recovery mode? 
*/ static bool backup_started_in_recovery = false; @@ -68,6 +71,41 @@ static bool backup_started_in_recovery = false; */ #define TAR_SEND_SIZE 32768 + +/* + * The maximum amount of data per second - bounds of the user input. + * + * If the maximum should be increased to more than 4 GB, uint64 must + * be introduced for the related variables. However such high values have + * little to do with throttling. + */ +#define MAX_RATE_LOWER 32768 +#define MAX_RATE_UPPER (1024 << 20) + +/* + * Transfer rate is only measured when this number of bytes has been sent. + * (Too frequent checks would impose too high CPU overhead.) + * + * The default value is used unless it'd result in too frequent checks. + */ +#define THROTTLING_SAMPLE_MIN 32768 + +/* The maximum number of checks per
Re: [HACKERS] More legacy code: pg_ctl
On 11/18/13, 8:09 PM, Josh Berkus wrote: a) by default, it returns to the caller without waiting for postgres to actually start/stop/restart. In this mode, it also always returns success regardless of result. The reason for this is that until sometime recently (PQping) we didn't have a reliable way to check whether the server is up. This might be different now, but it would need to be verified. b) by default, it remains attached to the caller's tty for stdout and stderr, even after it has switched to the regular postgres log. Can pg_ctl tell when the postgres log has switched? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More legacy code: pg_ctl
On 11/18/13, 8:20 PM, Josh Berkus wrote: c) that stop defaults to smart mode, instead of fast mode. This has been discussed many times already, so you'd need to check the archives. (I'm not in favor of smart mode either.) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] same-address mappings vs. relative pointers
On 2013-12-05 07:44:27 -0500, Robert Haas wrote: On Thu, Dec 5, 2013 at 4:56 AM, Andres Freund and...@2ndquadrant.com wrote: Hi Robert, On 2013-12-04 23:32:27 -0500, Robert Haas wrote: But I'm also learning painfully that this kind of thing only goes so far. For example, I spent some time looking at what it would take to provide a dynamic shared memory equivalent of palloc/pfree, a facility that I feel fairly sure would attract a few fans. I have to admit, I don't fully see the point in a real palloc() equivalent for shared memory. If you're allocating shared memory in a granular fashion at somewhat high frequency you're doing something *wrong*. And the result is not going to be scalable. Why? Lots of people have written lots of programs that do just that. Well, but we're a database, not a generic programming library ;) I agree it's overdone, but saying it should never happen seems like an over-rotation in the opposite direction. That might be true. But now let's suppose the input data was estimated to fit in work_mem but in the end did not, and therefore we need to instead do an external sort. That means we're going to start evicting tuples from memory and to make room for new tuples we read from the input stream. Well, now our strategy of tight-packing everything does not look so good, because we have no way of tracking which tuples we no longer need and reusing that space for new tuples. If external sort is not something we know how to do in parallel, we can potentially copy all of the tuples out of the dynamic shared memory segment and back to backend-private memory in palloc'd chunks, and then deallocate the dynamic shared memory segment... but that's potentially expensive if work_mem is large, and it temporarily uses double what we're allowed by work_mem. But what's your alternative if you have a shared_palloc() like thingy? 
The overhead is going to be crazy if you allocate so granular that you can easily grow, and if you allocate on a coarse level, this isn't going to buy you much? Essentially, I am not arguing against a facility to dynamically allocate memory from shared memory, but I think the tradeoffs are different enough that I think palloc()/mcxt.c isn't necessarily the thing to model it after. But then, I don't really have an idea how it should look like, not having thought about it much. And suppose we want to do a parallel external sort with, say, the worker backend pushing tuples into a heap stored in the dynamic shared memory segment and other backends writing them out to tapes and removing them from the heap. You can't do that without some kind of space management, i.e. an allocator. Hm. I'd have thought one would implement that by pushing fixed size amounts of tuples to individual workers, let them sort each chunk in memory and then write the result out to disk. Any thoughts on what the least painful compromise is here? I think a reasonable route is having some kind of smart-pointer on C level that abstracts away the offset math and allows to use pointers locally. Something like void *sptr_deref(sptr *); where the returned pointer can be used as long as it is purely in memory. And sptr_update(sptr *, void *); which allows a sptr to point elsewhere in the same segment. + lots of smarts Can you elaborate on this a little bit? I'm not sure I understand what you're getting at here. So, my thought is that we really want something that acts like a pointer, just in shared memory, so we don't have to do too much offset math. And I think for performance and readability we want to use pointers when operating locally in a backend. So, I guess there are situations in which we essentially want pointers: a) pointing somewhere in the same segment. b) pointing at an offset in *another* segment. 
So, for a) what if we, instead of storing plain offsets have something like: typedef struct sptr { Offset sptr_self; Offset sptr_offset; } sptr; and always store that in dynamically allocated shared memory instead of raw offsets. That allows to define a function like: void * sptr_deref(sptr *sp) { return ((char *)sp) - sp->sptr_self + sp->sptr_offset; } to get what we point to. Without having to reference the segment base, at the price of having to store two offsets. To update the pointer we could have: void sptr_set(sptr *sp, void *p) { char *segment_base; segment_base = ((char *)sp) - sp->sptr_self; /* make sure pointer doesn't point below the beginning of the segment */ Assert(segment_base <= p); /* make sure pointer doesn't point beyond the segment */ Assert(p < segment_base + GetSegmentSize(segment_base)); sp->sptr_offset = ((char *)p) - segment_base; } To initialize we'd have: void sptr_init(dsm_segment *segment, sptr *sp, void *p) { sp->sptr_self = ((char *)sp) - (char *)(segment->mapped_address); sptr_set(sp, p); } for b) we could have typedef struct sptr_far {
Re: [HACKERS] Feature request: Logging SSL connections
On 12/5/13, 8:53 AM, Dr. Andreas Kunert wrote: we were really missing the information in our log files if (and which of) our users are using SSL during their connections. The attached patch is a very simple solution to this problem - it just tests if the ssl pointer in Port is null. If no, it adds SSL to the logfile, otherwise it adds NOSSL. That seems useful. Do we need more information, like whether a client certificate was presented, or what ciphers were used? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why we are going to have to go DirectIO
On Thu, Dec 5, 2013 at 8:35 AM, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: Yes. And using something efficiently DirectIO is more difficult than BufferedIO. If we change write() flag with direct IO in PostgreSQL, it will execute hardest ugly randomIO. Using DirectIO presumes you're using libaio or threads to implement prefetching and asynchronous I/O scheduling. I think in the long term there are only two ways to go here. Either a) we use DirectIO and implement an I/O scheduler in Postgres or b) We use mmap and use new system calls to give the kernel all the information Postgres has available to it to control the I/O scheduler. (a) is by far the lower risk option as it's well trodden and doesn't depend on other projects to do anything. The most that would be valuable is if the kernel provided an interface to learn about the hardware properties such as the raid geometry and queue depth for different parts of the devices. (b) is the way more interesting research project though. I don't think anyone's tried it and the kernel interface to provide the kinds of information Postgres needs requires a lot of thought. If it's done right then Postgres wouldn't need a buffer cache manager at all. It would just mmap the entire database and tell the kernel when it's safe to flush buffers and let the kernel decide when based on when it's convenient for the hardware. I don't think it's tenable in the long run to have Postgres manage buffers that are then copied to another buffer in memory which are then flushed to disk based on another scheduler. That it works at all is a testament to the quality of the code in Postgres and Linux but it's implausibly inefficient. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: variant of regclass
Pavel Golub pa...@microolap.com writes: I personally see two approaches: 1. Implement GUC variable controling this behaviour per session 2. Introduce new safe reg* variables, e.g. sregclass, sregtype etc. I don't think new types are a good idea. If we are afraid to change the behavior of the input converters, what we should do is introduce new functions, eg toregclass(text) returns regclass. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] same-address mappings vs. relative pointers
On 2013-12-05 15:44:34 +0100, Andres Freund wrote: On 2013-12-05 07:44:27 -0500, Robert Haas wrote: And then I thought, boy, it sucks not to be able to declare what kind of a thing we're pointing *at* here, but apart from using C++ I see no solution to that problem. I guess we could do something like this: #define relptr(type) Size So the compiler wouldn't enforce anything, but at least notationally we'd know what sort of object we were supposedly referencing. There might be some ugly compiler dependent magic we could do. Depending on how we decide to declare offsets. Like (very, very roughly) #define relptr(type, struct_name, varname) union struct_name##_##varname { \ type relptr_type; \ Offset relptr_off; } And then, for accessing have: #define relptr_access(seg, off) \ typeof(off.relptr_type)* (((char *)seg->base_address) + off.relptr_off) But boy, that's ugly. On second thought - there's probably no reason to name the union, making it somewhat less ugly. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why we are going to have to go DirectIO
On Thu, Dec 5, 2013 at 11:42 AM, Greg Stark st...@mit.edu wrote: (b) is the way more interesting research project though. I don't think anyone's tried it and the kernel interface to provide the kinds of information Postgres needs requires a lot of thought. If it's done right then Postgres wouldn't need a buffer cache manager at all. It would just mmap the entire database and tell the kernel when it's safe to flush buffers and let the kernel decide when based on when it's convenient for the hardware. That's a bad idea in the current state of affairs. MM files haven't been designed for that usage, and getting stable performance out of that will be way too difficult. systemd's journal is finding that out the hard way. It uses mmap too. Having the buffer manager mmap buffers into its shared address space, however, might be an interesting idea to pursue. However, one must not forget that the kernel has similar scalability issues when the number of memory mappings increase arbitrarily. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: variant of regclass
On 12/5/13, 9:41 AM, Tom Lane wrote: Pavel Golub pa...@microolap.com writes: I personally see two approaches: 1. Implement GUC variable controling this behaviour per session 2. Introduce new safe reg* variables, e.g. sregclass, sregtype etc. I don't think new types are a good idea. If we are afraid to change the behavior of the input converters, what we should do is introduce new functions, eg toregclass(text) returns regclass. We could invent some sneaky syntax variants, like 'pg_klass'::regclass errors, but '?pg_klass'::regclass does not. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] same-address mappings vs. relative pointers
On Thu, Dec 5, 2013 at 8:57 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/05/2013 06:32 AM, Robert Haas wrote: During development of the dynamic shared memory facility, Noah and I spent a lot of time arguing about whether it was practical to ensure that a dynamic shared memory segment got mapped at the same address in every backend that used it. My vote goes for not trying to map at same address. I don't see how you could do that reliably, and I don't see much need for it anyway. That said, it naturally depends on what you're going to use the dynamic shared memory facility for. It's the same problem I have with reviewing the already-committed DSM patch and the message queue patch. The patches look fine as far as they go, but I have the nagging feeling that there are a bunch of big patches coming up later that use the facilities, and I can't tell if the facilities are over-engineered for what's actually needed, or not sufficient. Sure, well, that's one of the challenges of doing any sort of large software development project. One rarely knows at the outset all of the problems that one will encounter before finishing said project. For small projects, you can usually predict these things pretty well, but as the scope of the project grows, it becomes more and more difficult to know whether you've made the right initial steps. That having been said, I'm pretty confident in the steps taken thus far - but if you're imagining that I have all the answers completely worked out and am choosing to reveal them only bit by bit, it's not like that. If you want to see the overall vision for this project, see https://wiki.postgresql.org/wiki/Parallel_Sort . Here, I expect to use dynamic shared memory for three purposes. First, I expect to use a shared memory message queue to propagate any error or warning messages generated in the workers back to the user backend. 
That's the point of introducing that infrastructure now, though I hope it will eventually also be suitable for streaming tuples between backends, so that you can run one part of the query tree in one backend and stream the output to a different backend that picks it up and processes it further. We could surely contrive a simpler solution just for error messages, but I viewed that as short-sighted. Second, I expect to store the SortTuple array, or some analogue of it, in dynamic shared memory. Third, I expect to store the tuples themselves in dynamic shared memory. Down the road, I imagine wanting to put hash tables in shared memory, so we can parallelize things like hash joins and hash aggregates. As a side-note, I've been thinking that we don't really need same-address mapping for shared_buffers either. Getting rid of it wouldn't buy us anything right now, but if we wanted e.g to make shared_buffers changeable without a restart, that would be useful. Very true. One major obstacle to that is that changing the size of shared_buffers also means resizing the LWLock array and the buffer descriptor array. If we got rid of the idea of having lwlocks in their own data structure and moved the buffer lwlocks into the buffer descriptors, that would get us down to two segments, but that still feels like one too many. There's also the problem of the buffer mapping hash table, which would need to grow somehow as well. Technically, the size of the fsync absorb queue also depends on shared_buffers, but we could decide not to care about that, I think. The other problem here is that once you do implement all this, a reference to a buffer beyond what your backend has mapped will necessitate an unmap and remap of the shared-buffers segment. If the remap fails, and you hold any buffer pins, you will have to PANIC. 
There could be some performance overhead from inserting bounds checks in all the right places too, although there might not be enough places to matter, since a too-high buffer number can only come from a limited number of places - either a lookup in the buffer mapping table, or a buffer allocation event. I don't mention any of these things to discourage you from working on the problem, but rather because I've thought about it too - and the aforementioned problems are the things that have stumped me so far. If you've got ideas about how to solve them, or even better yet want to implement something, great! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance optimization of btree binary search
Andres Freund and...@2ndquadrant.com writes: On 2013-12-05 08:58:55 -0500, Tom Lane wrote: I'm a bit worried that somebody, particularly third-party code, might've sloppily written return foo in a V1 function when return Int32GetDatum(foo) would be correct. In that case, the resultant Datum might have not-per-spec high-order bits, and if it reaches the fast comparator without ever having been squeezed into a physical tuple, we've got a problem. Too bad V1 hasn't insisted on using PG_RETURN_* macros. That would have allowed asserts checking against such cases by setting fcinfo->has_returned = true or such... [ shrug... ] PG_RETURN_DATUM has no practical way to verify that the given Datum was constructed safely, so I think we'd just be adding overhead with not much real safety gain. In practice, if we were to change Datum to be a signed type (intptr_t not uintptr_t), the most common cases would probably do the right thing anyway, ie an int or short return value would get promoted to Datum with sign-extension. regards, tom lane
Re: [HACKERS] Proposal: variant of regclass
Peter Eisentraut pete...@gmx.net writes: We could invent some sneaky syntax variants, like 'pg_klass'::regclass errors, but '?pg_klass'::regclass does not. Hmm ... cute idea, but shoehorning it into regoperator might be problematic. You'd have to pick a flag character that wasn't a valid operator character, which lets out '?' as well as a lot of the other mnemonically-reasonable choices. regards, tom lane
Re: [HACKERS] Performance optimization of btree binary search
On 2013-12-05 10:02:56 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2013-12-05 08:58:55 -0500, Tom Lane wrote: I'm a bit worried that somebody, particularly third-party code, might've sloppily written return foo in a V1 function when return Int32GetDatum(foo) would be correct. In that case, the resultant Datum might have not-per-spec high-order bits, and if it reaches the fast comparator without ever having been squeezed into a physical tuple, we've got a problem. Too bad V1 hasn't insisted on using PG_RETURN_* macros. That would have allowed asserts checking against such cases by setting fcinfo->has_returned = true or such... [ shrug... ] PG_RETURN_DATUM has no practical way to verify that the given Datum was constructed safely, so I think we'd just be adding overhead with not much real safety gain. I was thinking of doing it for assert only anyway. In practice, if we were to change Datum to be a signed type (intptr_t not uintptr_t), the most common cases would probably do the right thing anyway, ie an int or short return value would get promoted to Datum with sign-extension. I was actually thinking about making Datum (and some other types we have) structs or unions. Currently it's far, far too easy to mix them. We throw away pretty much all of the little typesafety C has by typedef'ing them to integral types with lots of autocasting behaviour. Making Datum a union with all the base-types inside would also get rid of us violating the standard's aliasing rules... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] same-address mappings vs. relative pointers
On Thu, Dec 5, 2013 at 9:44 AM, Andres Freund and...@2ndquadrant.com wrote: Why? Lots of people have written lots of programs that do just that. Well, but we're a database, not a generic programming library ;) I think we're arguably both. But what's your alternative if you have a shared_palloc() like thingy? The overhead is going to be crazy if you allocate so granular that you can easily grow, and if you allocate on a coarse level, this isn't going to buy you much? I don't know if you've noticed, but the overhead of palloc is pretty much insane for large memory contexts right now. Consider a bunch of 100-byte tuples that you want to sort, on a 64-bit machine. IIUC, we're going to round the request up to 128 bytes and then add 16 bytes of overhead for 2 pointers before the beginning of the chunk. That's 44% overhead. For small memory contexts like the ones that store syscache entries it's really hard to do any better, but there are much better algorithms (which I am investigating) for cases where you expect to allocate a large amount of memory. 44% overhead on a 1kB context is no big deal; on 1GB it's a big deal. But the question of how efficient palloc, or a hypothetical shared_palloc(), is, is really separate from whether or not we ought to have one. All allocators have some overhead, and people use them anyway because it drastically simplifies programming. And suppose we want to do a parallel external sort with, say, the worker backend pushing tuples into a heap stored in the dynamic shared memory segment and other backends writing them out to tapes and removing them from the heap. You can't do that without some kind of space management, i.e. an allocator. Hm. I'd have thought one would implement that by pushing fixed size amounts of tuples to individual workers, let them sort each chunk in memory and then write the result out to disk. I haven't really investigated external sort algorithms at this point, so it is possible that you are right. 
However, that sounds like the equivalent of constructing runs by loading a batch of tuples, quicksorting them, writing them out, and then moving on to the next batch, a modification to tuplesort.c which I have tried and found to be significantly inferior to the present algorithm. As Tom has been at pains to point out, our current algorithm produces much longer runs, especially if the input data has some intrinsic ordering, which is not an uncommon situation. So while I don't know for sure, I think it's probably unwise to assume that the algorithm we pick won't require the ability to remove tuples from the dsm incrementally, and I don't want the choice of algorithm to be constrained by overly-rigid architectural decisions from the outset. So, for a) what if we, instead of storing plain offsets, have something like:

typedef struct sptr
{
    Offset sptr_self;
    Offset sptr_offset;
} sptr;

I like the name Offset, but I'm firmly convinced that a relative pointer needs to be the same size as a regular pointer. 8 bytes for a pointer is already painfully large in some contexts; 16 is just too much. And then I thought, boy, it sucks not to be able to declare what kind of a thing we're pointing *at* here, but apart from using C++ I see no solution to that problem. I guess we could do something like this:

#define relptr(type) Size

So the compiler wouldn't enforce anything, but at least notationally we'd know what sort of object we were supposedly referencing. There might be some ugly compiler dependent magic we could do. Depending on how we decide to declare offsets. Like (very, very roughly)

#define relptr(type, struct_name, varname) \
    union struct_name##_##varname { \
        type relptr_type; \
        Offset relptr_off; \
    }

And then, for accessing, have:

#define relptr_access(seg, off) \
    ((typeof(off.relptr_type) *) (((char *) seg->base_address) + off.relptr_off))

But boy, that's ugly. It's clever, though, and probably worth further experimentation. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?
Hello, My customers and colleagues sometimes (or often?) ask about the following message:

FATAL: the database system is starting up

This message is often output dozens of times during a failover or PITR. The users seem to be worried because the message level is FATAL and they don't know why such a severe message is output during a successful failover and recovery. I can't blame the users, because the message is merely a by-product of pg_ctl's internal ping. Similarly, the below message is output when I stop the standby server normally. Why FATAL as a result of a successful operation? I'm afraid DBAs are annoyed by these messages, as system administration software collects ERROR and more severe messages for daily monitoring.

FATAL: terminating walreceiver process due to administrator command

Shouldn't we lower the severity of those messages or avoid sending them to the server log? How about the following measures?

1. FATAL: the database system is starting up
2. FATAL: the database system is shutting down
3. FATAL: the database system is in recovery mode
4. FATAL: sorry, too many clients already

Report these as FATAL to the client because the client wants to know the reason, but don't output them to the server log because they are not necessary for DBAs (4 is subtle.)

5. FATAL: terminating walreceiver process due to administrator command
6. FATAL: terminating background worker "%s" due to administrator command

Don't output these to the server log. Why are they necessary? For troubleshooting purposes? If necessary, the severity should be LOG (but I wonder why other background processes are not reported...) To suppress server log output, I think we can do as follows. I guess ereport(FATAL) is still needed for easily handling both the client report and process termination.

log_min_messages = PANIC;
ereport(FATAL,
        (errcode(ERRCODE_CANNOT_CONNECT_NOW),
         errmsg("the database system is starting up")));

May I hear your opinions? 
Regards MauMau
Re: [HACKERS] Proposal: variant of regclass
On Thu, Dec 5, 2013 at 9:41 AM, Tom Lane t...@sss.pgh.pa.us wrote: Pavel Golub pa...@microolap.com writes: I personally see two approaches: 1. Implement a GUC variable controlling this behaviour per session 2. Introduce new safe reg* variables, e.g. sregclass, sregtype etc. I don't think new types are a good idea. If we are afraid to change the behavior of the input converters, what we should do is introduce new functions, eg toregclass(text) returns regclass. That seems like a pretty reasonable approach. I don't have a strong opinion on whether it's worth the backward-compatibility break that would ensue from just changing this outright. I admit that I've been annoyed by this behavior more than once, but I've also been beaten by customers enough times to know that backward-compatibility has value to other people in some cases where it does not have such value to me. Another advantage of this approach is that, IIUC, type input functions can't return a NULL value. So 'pg_klass'::regclass could return 0, but not NULL. On the other hand, toregclass('pg_klass') *could* return NULL, which seems conceptually cleaner. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] shared memory message queues
Sorry for my late reply. I checked the part-3 (shm-mq-v1.patc) portion as below. Your comments towards part-1 and part-2 are reasonable for me, so I have no argument on this portion. Even though shm_mq_create() expects the address to be aligned, there is no mechanism to ensure it. How about putting an Assert() here? shm_mq_send_bytes() has an Assert() to check the length to be added to mq_bytes_written. Is the MAXALIGN64() check right on a 32-bit architecture? The sendnow value might be Min(ringsize - used, nbytes - sent), and the ringsize came from mq->mq_ring_size being aligned using MAXALIGN(_DOWN), which has 32-bit width.

/*
 * Update count of bytes written, with alignment padding.  Note
 * that this will never actually insert any padding except at the
 * end of a run of bytes, because the buffer size is a multiple of
 * MAXIMUM_ALIGNOF, and each read is as well.
 */
Assert(sent == nbytes || sendnow == MAXALIGN64(sendnow));
shm_mq_inc_bytes_written(mq, MAXALIGN64(sendnow));

What will happen if the sender tries to send a large chunk that needs to be split into multiple sub-chunks and the receiver concurrently detaches itself from the queue during the writes by the sender? It seems to me the sender gets SHM_MQ_DETACHED and only the earlier half of the chunk remains on the queue, even though its total length was already in the message queue. It may eventually lead to an infinite loop on the receiver side if another receiver appears again later and reads the incomplete chunk. Is this a feasible scenario? If so, a solution might be to prohibit enqueuing anything without a receiver, and to reset the queue when a new receiver is attached. The source code comment in shm_mq_wait_internal() is a little bit misleading to me. It says nothing shall be written without attaching the peer process; however, we have no such checks as long as nsend is less than the queue size.

 * If handle != NULL, the queue can be read or written even before the
 * other process has attached.  We'll wait for it to do so if needed.
 * The handle must be for a background worker initialized with bgw_notify_pid
 * equal to our PID.

Right now, that's all I can comment on. I'll do follow-up code reading over the weekend. Thanks, 2013/11/20 Robert Haas robertmh...@gmail.com: On Tue, Nov 19, 2013 at 12:33 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: * on-dsm-detach-v2.patch It reminded me of the hook registration/invocation mechanism in apache/httpd. It defines five levels for invocation order (REALLY_FIRST, FIRST, MIDDLE, LAST, REALLY_LAST), but these are aliases of integer values; in other words, it just kicks the hooks in order of the priority value associated with each function pointer. This flexibility may make sense. We may want to control the order in which resources are released in a more fine-grained way in the future. For example, isn't it a problem if we have only two levels when something in a queue needs to be released earlier than the queue itself? I'm not 100% certain on this suggestion, because on_shmem_exit is a hook that does not host so many callbacks, thus extensions may implement their own way in the SHMEM_EXIT_EARLY or SHMEM_EXIT_LATE stage of course. I don't really see much point in adding more flexibility here until we need it, but I can imagine that we might someday need it, for reasons that are not now obvious. * shm-toc-v1.patch From my experience, it makes sense to put a magic number on the tail of the toc segment to detect shared-memory usage overruns. It helps to find bugs that are hard to find because of concurrent jobs. Probably, shm_toc_freespace() is a point to put the assertion. How many entries is shm_toc_lookup() assumed to chain? It assumes a linear search from the head of the shm_toc segment, and that is a prerequisite for lock-less reference, isn't it? Is it a problem if a shm_toc hosts many entries, like 100 or 1000? Or is that not an expected usage? 
It is not an expected usage. In typical usage, I expect that the number of TOC entries will be about N+K, where K is a small constant (< 10) and N is the number of cooperating parallel workers. It's possible that we'll someday be in a position to leverage 100 or 1000 parallel workers on the same task, but I don't expect it to be soon. And, actually, I doubt that revising the data structure would pay off at N=100. At N=1000, maybe. At N=10000, probably. But we are *definitely* not going to need that kind of scale any time soon, and I don't think it makes sense to design a complex data structure to handle that case when there are so many more basic problems that need to be solved before we can go there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- KaiGai Kohei kai...@kaigai.gr.jp
Re: [HACKERS] Performance optimization of btree binary search
Andres Freund and...@2ndquadrant.com writes: I was actually thinking about making Datum (and some other types we have) structs or unions. Currently it's far, far too easy to mix them. We throw away pretty much all of the little typesafety C has by typedef'ing them to integral types with lots of autocasting behaviour. That's intentional; on many ABIs, making Datum a struct would be catastrophic performance-wise because it would not be eligible for simple register pass or return conventions. In any case, the number of bugs I can remember that such a thing would've prevented is negligible. regards, tom lane
Re: [HACKERS] Proposal: variant of regclass
Robert Haas robertmh...@gmail.com writes: On Thu, Dec 5, 2013 at 9:41 AM, Tom Lane t...@sss.pgh.pa.us wrote: I don't think new types are a good idea. If we are afraid to change the behavior of the input converters, what we should do is introduce new functions, eg toregclass(text) returns regclass. That seems like a pretty reasonable approach. I don't have a strong opinion on whether it's worth the backward-compatibility break that would ensue from just changing this outright. I admit that I've been annoyed by this behavior more than once, but I've also been beaten by customers enough times to know that backward-compatibility has value to other people in some cases where it does not have such value to me. I'm getting less enamored of just-change-the-input-behavior myself. The case that occurred to me is, suppose somebody's got a table containing a regclass or regproc column, and he dumps and reloads it. If the input converter silently replaces unknown names by 0, he's at risk of unexpected data loss, if the reload is done before he's created all the referenced objects. Another advantage of this approach is that, IIUC, type input functions can't return a NULL value. So 'pg_klass'::regclass could return 0, but not NULL. On the other hand, toregclass('pg_klass') *could* return NULL, which seems conceptually cleaner. Yeah, I was thinking of that too. regards, tom lane
Re: [HACKERS] Why we are going to have to go DirectIO
On Thu, Dec 5, 2013 at 2:54 PM, Claudio Freire klaussfre...@gmail.com wrote: That's a bad idea in the current state of affairs. MM files haven't been designed for that usage, and getting stable performance out of that will be way too difficult. I'm talking about long-term goals here. Either of these two routes would require whole new kernel interfaces to work effectively. Without those new kernel interfaces our current approach is possibly the best we can get. I think the way to use mmap would be to mmap very large chunks, possibly whole tables. We would need some way to control page flushes that doesn't involve splitting mappings and can be efficiently controlled without having the kernel storing arbitrarily large tags on page tables or searching through all the page tables to mark pages flushable. -- greg
Re: [HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?
On 12/5/13, 10:25 AM, MauMau wrote: Report these as FATAL to the client because the client wants to know the reason. But don't output them to server log because they are not necessary for DBAs Yeah, this is part of a more general problem, which you have characterized correctly: What is fatal (or error, or warning, ...) to the client isn't necessarily fatal (or error, or warning, ...) to the server or DBA. Fixing this would need a larger enhancement of the logging infrastructure. It's been discussed before, but it's a bit of work.
Re: [HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?
MauMau maumau...@gmail.com writes: Shouldn't we lower the severity of those messages or avoid sending them to the server log? No. They are FATAL so far as the individual session is concerned. Possibly some documentation effort is needed here, but I don't think any change in the code behavior would be an improvement. 1. FATAL: the database system is starting up 2. FATAL: the database system is shutting down 3. FATAL: the database system is in recovery mode 4. FATAL: sorry, too many clients already Report these as FATAL to the client because the client wants to know the reason. But don't output them to server log because they are not necessary for DBAs (4 is subtle.) The notion that a DBA should not be allowed to find out how often #4 is happening is insane. regards, tom lane
Re: [HACKERS] Parallel Select query performance and shared buffers
You could try my lwlock-scalability improvement patches - for some workloads here, the improvements have been rather noticeable. Which version are you testing? I tried your patches from the link below. As you suspected, I didn't see any improvements. I tested it on PostgreSQL 9.2 Stable. http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/REL9_2_STABLE-rwlock-contention On Wed, Dec 4, 2013 at 8:26 PM, Andres Freund and...@2ndquadrant.com wrote: On 2013-12-04 20:19:55 +0200, Metin Doslu wrote: - When we increased NUM_BUFFER_PARTITIONS to 1024, this problem disappeared for 8-core machines and came back with 16-core machines on Amazon EC2. Could it be related to PostgreSQL's locking mechanism? You could try my lwlock-scalability improvement patches - for some workloads here, the improvements have been rather noticeable. Which version are you testing? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] same-address mappings vs. relative pointers
On 2013-12-05 10:17:18 -0500, Robert Haas wrote: On Thu, Dec 5, 2013 at 9:44 AM, Andres Freund and...@2ndquadrant.com wrote: Why? Lots of people have written lots of programs that do just that. Well, but we're a database, not a generic programming library ;) I think we're arguably both. Fair enough. I don't know if you've noticed, but the overhead of palloc is pretty much insane for large memory contexts right now. Yea, I have more than noticed. I actually still have a partial replacement of aset.c lying around. It didn't even have 16 bytes of overhead, although that had required some ugly hacks when wanting to continue allowing several parallel context implementations. Btw, it's even worse if you have lots of allocations above ALLOC_CHUNK_LIMIT; we walk a linear list on reset, and we reset damn frequently - partially because we think it's cheap. Consider a bunch of 100-byte tuples that you want to sort, on a 64-bit machine. IIUC, we're going to round the request up to 128 bytes and then add 16 bytes of overhead for 2 pointers before the beginning of the chunk. I think we add the 16 bytes of overhead before rounding up, but I might misremember. That's 44% overhead. For small memory contexts like the ones that store syscache entries it's really hard to do any better, but there are much better algorithms (which I am investigating) for cases where you expect to allocate a large amount of memory. What I was working on was a slab allocator where, in addition to default sizes like the current ones, you could create specific slabs for certain sizes one knew would be frequently used, which then would a) be faster to allocate from, b) have lower space overhead, c) be guaranteed not to cause fragmentation. While I agree that the memory consumption is one concern, I am more worried about the time it eats up in many workloads. 
So, for a) what if we, instead of storing plain offsets, have something like:

typedef struct sptr
{
    Offset sptr_self;
    Offset sptr_offset;
} sptr;

I like the name Offset, but I'm firmly convinced that a relative pointer needs to be the same size as a regular pointer. 8 bytes for a pointer is already painfully large in some contexts; 16 is just too much. I don't find that all that convincing. In 90% of the cases only a few pointers will be needed; for those, something like the above will be helpful, allow bounds checking and all. For the 10% of other cases, we can still open-code stuff. Quite possibly we would want to avoid using 8 bytes there anyway. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Parallel Select query performance and shared buffers
On 2013-12-05 17:46:44 +0200, Metin Doslu wrote: I tried your patches from the link below. As you suspected, I didn't see any improvements. I tested it on PostgreSQL 9.2 Stable. You tested the correct branch, right? Which commit does git rev-parse HEAD show? But generally, as long as your profile hides all the important information behind the hypervisor's cost, you're going to have a hard time analyzing the problems. You really should try to reproduce the problems on native hardware (as similar to the host hardware as possible), to get accurate data. On CPU bound workloads that information is often transportable to the virtual world. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
On 12/5/13, 6:07 AM, Simon Riggs wrote: On 5 December 2013 01:55, Peter Eisentraut pete...@gmx.net wrote: On Thu, 2013-11-14 at 12:11 +0530, Amit Kapila wrote: If an application wants to allow these connection parameters to be used, it would need to do PQenableStartServer() first. If it doesn't, those connection parameters will be rejected. Stupid idea: Would it work that we require an environment variable to be set before we allow the standalone_backend connection parameter? That's easy to do, easy to audit, and doesn't require any extra code in the individual clients. I like the idea... should it be in pg_hba.conf ? Or should it be next to listen_addresses in postgresql.conf? No, it's an environment variable.
Re: [HACKERS] Proposal: variant of regclass
Robert Haas robertmh...@gmail.com writes: Another advantage of this approach is that, IIUC, type input functions can't return a NULL value. So 'pg_klass'::regclass could return 0, but not NULL. On the other hand, toregclass('pg_klass') *could* return NULL, which seems conceptually cleaner. BTW, another arguable advantage of fixing this via new functions is that users could write equivalent (though no doubt slower) functions for use in pre-9.4 releases, and thus not need to maintain multiple versions of app code that relies on this behavior. regards, tom lane
Re: [HACKERS] Parallel Select query performance and shared buffers
You tested the correct branch, right? Which commit does git rev-parse HEAD show? I applied the last two patches manually on PostgreSQL 9.2 Stable.
Re: [HACKERS] Why we are going to have to go DirectIO
Greg Stark st...@mit.edu writes: I think the way to use mmap would be to mmap very large chunks, possibly whole tables. We would need some way to control page flushes that doesn't involve splitting mappings and can be efficiently controlled without having the kernel storing arbitrarily large tags on page tables or searching through all the page tables to mark pages flushable. I might be missing something, but AFAICS mmap's API is just fundamentally wrong for this. The kernel is allowed to write-back a modified mmap'd page to the underlying file at any time, and will do so if say it's under memory pressure. You can tell the kernel to sync now, but you can't tell it *not* to sync. I suppose you are thinking that some wart could be grafted onto that API to reverse that, but I wouldn't have a lot of confidence in it. Any VM bug that caused the kernel to sometimes write too soon would result in nigh unfindable data consistency hazards. regards, tom lane
Re: [HACKERS] Parallel Select query performance and shared buffers
From what I've seen so far, the bigger problem than contention in the lwlocks itself is the spinlock protecting the lwlocks... Postgres 9.3.1 also reports spindelay; it seems that there is no contention on spinlocks.

PID 21121 lwlock 0: shacq 0 exacq 33 blk 1 spindelay 0
PID 21121 lwlock 33: shacq 7602 exacq 14688 blk 4381 spindelay 0
PID 21121 lwlock 34: shacq 7826 exacq 15113 blk 3786 spindelay 0
PID 21121 lwlock 35: shacq 7792 exacq 15110 blk 3356 spindelay 0
PID 21121 lwlock 36: shacq 7803 exacq 15125 blk 3075 spindelay 0
PID 21121 lwlock 37: shacq 7822 exacq 15177 blk 2756 spindelay 0
PID 21121 lwlock 38: shacq 7694 exacq 14863 blk 2513 spindelay 0
PID 21121 lwlock 39: shacq 7914 exacq 15320 blk 2400 spindelay 0
PID 21121 lwlock 40: shacq 7855 exacq 15203 blk 2220 spindelay 0
PID 21121 lwlock 41: shacq 7942 exacq 15363 blk 1996 spindelay 0
PID 21121 lwlock 42: shacq 7828 exacq 15115 blk 1872 spindelay 0
PID 21121 lwlock 43: shacq 7820 exacq 15159 blk 1833 spindelay 0
PID 21121 lwlock 44: shacq 7709 exacq 14916 blk 1590 spindelay 0
PID 21121 lwlock 45: shacq 7831 exacq 15134 blk 1619 spindelay 0
PID 21121 lwlock 46: shacq 7744 exacq 14989 blk 1559 spindelay 0
PID 21121 lwlock 47: shacq 7808 exacq 15111 blk 1473 spindelay 0
PID 21121 lwlock 48: shacq 7729 exacq 14929 blk 1381 spindelay 0
Re: [HACKERS] Performance optimization of btree binary search
On 2013-12-05 10:34:16 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: I was actually thinking about making Datum (and some other types we have) structs or unions. Currently it's far, far too easy to mix them. We throw away pretty much all of the little typesafety C has by typedef'ing them to integral types with lots of autocasting behaviour. That's intentional; on many ABIs, making Datum a struct would be catastrophic performance-wise because it would not be eligible for simple register pass or return conventions. Unions should behave more sanely in that regard, though? And it'd be fairly easy to make it an optional thing. In any case, the number of bugs I can remember that such a thing would've prevented is negligible. Cases talked about upthread, where a plain datatype is returned as a Datum instead of using FooGetDatum() and the reverse, would be impossible. I don't think those are that infrequent? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Performance optimization of btree binary search
Andres Freund and...@2ndquadrant.com writes: On 2013-12-05 10:34:16 -0500, Tom Lane wrote: In any case, the number of bugs I can remember that such a thing would've prevented is negligible. Cases talked about upthread, where a plain datatype is returned as a Datum instead of using FooGetDatum() and the reverse, would be impossible. I don't think those are that infrequent? [ shrug... ] The performance changes we're talking about here would have the effect of making the compiler's implicit casts be the right thing anyway. In any case, I don't think you'd have accomplished much by forcing people to use FooGetDatum, unless you can force them to use the *right* FooGetDatum. The errors I can remember seeing in this area were more in the line of choosing the wrong macro. regards, tom lane
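The tradeoff being debated above can be sketched in a few lines of standalone C. This is a hypothetical illustration, not PostgreSQL's actual definitions: `PlainDatum`, `StructDatum`, and the conversion helpers are made-up names that only mirror the real `Datum`/`Int32GetDatum` pattern.

```c
#include <stdint.h>

/* With a plain typedef, any integral expression silently converts to a
 * Datum, so forgetting a conversion macro produces no diagnostic. */
typedef uintptr_t PlainDatum;

/* A single-member struct (or union) stays eligible for register
 * pass/return on most ABIs, but rejects implicit integer conversions:
 * "StructDatum d = 42;" would be a compile error. */
typedef struct
{
    uintptr_t   value;
} StructDatum;

static StructDatum
Int32GetStructDatum(int32_t i)
{
    StructDatum d;

    /* zero-extend through uint32_t, analogous to the real macro */
    d.value = (uintptr_t) (uint32_t) i;
    return d;
}

static int32_t
StructDatumGetInt32(StructDatum d)
{
    return (int32_t) (uint32_t) d.value;
}
```

Whether the struct form actually costs anything is exactly the ABI question Tom raises; the sketch only shows the typesafety side of the argument.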
Re: [HACKERS] Dynamic Shared Memory stuff
On 11/20/2013 09:58 PM, Robert Haas wrote: On Wed, Nov 20, 2013 at 8:32 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: How many allocations? What size will they have typically, minimum and maximum? The facility is intended to be general, so the answer could vary widely by application. The testing that I have done so far suggests that for message-passing, relatively small queue sizes (a few kB, perhaps 1 MB at the outside) should be sufficient. However, applications such as parallel sort could require vast amounts of shared memory. Consider a machine with 1TB of memory performing a 512GB internal sort. You're going to need 512GB of shared memory for that. Hmm. Those two use cases are quite different. For message-passing, you want a lot of small queues, but for parallel sort, you want one huge allocation. I wonder if we shouldn't even try a one-size-fits-all solution. For message-passing, there isn't much need to even use dynamic shared memory. You could just assign one fixed-sized, single-reader multiple-writer queue for each backend. For parallel sort, you'll want to utilize all the available memory and all CPUs for one huge sort. So all you really need is a single huge shared memory segment. If one process is already using that 512GB segment to do a sort, you do *not* want to allocate a second 512GB segment. You'll want to wait for the first operation to finish first. Or maybe you'll want to have 3-4 somewhat smaller segments in use at the same time, but not more than that. * As discussed in the Something fishy happening on frogmouth thread, I don't like the fact that the dynamic shared memory segments will be permanently leaked if you kill -9 postmaster and destroy the data directory. Your test elicited different behavior for the dsm code vs 
the main shared memory segment because it involved running a new postmaster with a different data directory but the same port number on the same machine, and expecting that that new - and completely unrelated - postmaster would clean up the resources left behind by the old, now-destroyed cluster. I tend to view that as a defect in your test case more than anything else, but as I suggested previously, we could potentially change the code to use something like 100 + (port * 100) with a forward search for the control segment identifier, instead of using a state file, mimicking the behavior of the main shared memory segment. I'm not sure we ever reached consensus on whether that was overall better than what we have now. I really think we need to do something about it. To use your earlier example of parallel sort, it's not acceptable to permanently leak a 512 GB segment on a system with 1 TB of RAM. One idea is to create the shared memory object with shm_open, and wait until all the worker processes that need it have attached to it. Then, shm_unlink() it, before using it for anything. That way the segment will be automatically released once all the processes close() it, or die. In particular, kill -9 will release it. (This is a variant of my earlier idea to create a small number of anonymous shared memory file descriptors in postmaster startup with shm_open(), and pass them down to child processes with fork()). I think you could use that approach with SysV shared memory as well, by destroying the segment with shmctl(IPC_RMID) immediately after all processes have attached to it. - Heikki
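The create-attach-unlink lifecycle described above can be sketched with plain POSIX shm calls. This is an illustrative sketch, not code from any patch: the function and segment names are hypothetical, and a real implementation would coordinate the attach phase across processes before unlinking.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a POSIX shared memory object, map it, then shm_unlink() the
 * name before the segment is used.  The kernel reclaims the memory once
 * the last process unmaps it or dies, so even kill -9 cannot leak it. */
static void *
create_self_destructing_segment(const char *name, size_t size)
{
    void       *addr;
    int         fd;

    fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return NULL;

    if (ftruncate(fd, (off_t) size) < 0)
    {
        close(fd);
        shm_unlink(name);
        return NULL;
    }

    /* In the real scheme, all workers would shm_open() + mmap() the
     * name at this point, before it is unlinked. */
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);

    /* Remove the name: the segment now lives only as long as some
     * process keeps a mapping of it. */
    shm_unlink(name);

    return (addr == MAP_FAILED) ? NULL : addr;
}
```

The window between shm_open() and shm_unlink() is the remaining leak hazard, which is why the postmaster would want to keep that window short.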
Re: [HACKERS] Feature request: Logging SSL connections
On Thu, Dec 05, 2013 at 09:43:31AM -0500, Peter Eisentraut wrote: On 12/5/13, 8:53 AM, Dr. Andreas Kunert wrote: we were really missing the information in our log files if (and which of) our users are using SSL during their connections. The attached patch is a very simple solution to this problem - it just tests if the ssl pointer in Port is null. If not, it adds SSL to the logfile, otherwise it adds NOSSL. That seems useful. Do we need more information, like whether a client certificate was presented, or what ciphers were used? Yes, please show the ciphersuite and TLS version too. Andreas, you can use my recent \conninfo patch as a template: https://github.com/markokr/postgres/commit/7d1b27ac74643abd15007cc4ec0b56ba92b39d90 Please show the SSL status for walsender connections as well. It's quite important to know whether they are using SSL or not. But I think the 'bits' output is unnecessary, as the cipher strength is implied by the ciphersuite. Perhaps it can be removed from \conninfo too. -- marko
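The shape of the log line being discussed can be shown with a small standalone stand-in. The struct and function names here are hypothetical; in the actual patch under discussion the check is simply whether the `ssl` pointer in the backend's `Port` struct is NULL.

```c
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for the backend's per-connection state. */
typedef struct
{
    void       *ssl;            /* non-NULL once a TLS handshake is done */
    const char *cipher;         /* negotiated ciphersuite name */
    const char *tls_version;    /* e.g. "TLSv1.2" */
} MockPort;

/* Build the connection-log suffix suggested in the thread: SSL status
 * plus ciphersuite and TLS version, with no separate 'bits' field since
 * the ciphersuite already implies the cipher strength. */
static void
format_conn_log(const MockPort *port, char *buf, size_t buflen)
{
    if (port->ssl != NULL)
        snprintf(buf, buflen, "SSL (%s, %s)",
                 port->tls_version ? port->tls_version : "unknown",
                 port->cipher ? port->cipher : "unknown");
    else
        snprintf(buf, buflen, "NOSSL");
}
```

Applying the same formatting to walsender connection log lines would cover the replication case Marko raises.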
[HACKERS] Re: [RFC] Shouldn't we remove annoying FATAL messages from server log?
Tom Lane-2 wrote: MauMau <maumau307@> writes: Shouldn't we lower the severity of, or avoid sending, those messages to the server log? No. They are FATAL so far as the individual session is concerned. Possibly some documentation effort is needed here, but I don't think any change in the code behavior would be an improvement.

1. FATAL: the database system is starting up
2. FATAL: the database system is shutting down
3. FATAL: the database system is in recovery mode
4. FATAL: sorry, too many clients already

Report these as FATAL to the client because the client wants to know the reason. But don't output them to the server log because they are not necessary for DBAs (4 is subtle.) The notion that a DBA should not be allowed to find out how often #4 is happening is insane. Agreed, #4 is definitely DBA territory. ISTM that instituting some level of categorization for messages would be helpful. Then logging and reporting frameworks would be able to identify and segregate the logs in whatever way they and the configuration deem appropriate.

FATAL: [LOGON] too many clients already

I'd make the category output disabled by default for a long while, then eventually enabled by default, but leave the ability to disable it. Calls that do not supply a category get [N/A] output in category mode. David J.
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
I think this proposal is a bit deadlocked now.

- There are technical concerns about launching a server executable from within a client.
- There are conceptual concerns about promoting an embedded database mode.

On the other hand:

- Everyone would like to have a way to use psql (and other basic clients) in stand-alone mode.

The compromise would be to not launch the server from within the client, but have client and server communicate over external mechanisms (e.g., a Unix-domain socket). The concern about that was that it would open up standalone mode to accidental third-party connections. While there are some ways around that (socket in a private directory), they are not easy and not portable. So standalone mode would become less robust and reliable overall. The only solutions I see are:

1. do nothing
2. do everything (i.e., existing terminal mode plus socket mode plus embedded mode), letting the user work out the differences

Pick one. ;-)
Re: [HACKERS] logical changeset generation v6.7
Hi, On 2013-12-05 22:03:51 +0900, Kyotaro HORIGUCHI wrote: - You assigned HeapTupleGetOid(tuple) into relid, read in several points but with no modification. Nevertheless, you left HeapTupleGetOid not replaced there. I think 'relid' just for repeated reading has far smaller merit compared to the demerit of lowering readability. You'd be better off making them uniform either way. It's primarily to get the line lengths halfway under control. Mm. I'm afraid I couldn't catch your words; do you mean that IsSystemClass() or isTempNamespace() could change the NULL bitmap in the tuple? Huh? No. I just meant that the source code lines are longer if you use HeapTupleGetOid(tuple) instead of just relid. Anyway, that patch has since been committed... = 0002: - You are identifying the wal_level with the expr 'wal_level >= WAL_LEVEL_LOGICAL' but it seems somewhat improper. Hm. Why? It actually does no harm and is somewhat trifling, so I don't assert you should fix it. The reason for the comment is that the greater values for wal_level are undefined at the moment, so strictly speaking, such values should be handled as invalid ones. Note that other checks for wal_level are written the same way. Consider how much bigger this patch would be if every usage of wal_level would need to get changed because a new level had been added. - RelationIsAccessibleInLogicalDecoding and RelationIsLogicallyLogged are identical in code. Together with the name ambiguity, this is quite confusing and a cause of future misuse between these macros, I suppose. Plus the names seem too long. Hm, don't think they are equivalent, rather the contrary. Note one returns false if IsCatalogRelation(), the other true. Oops, I'm sorry. I understand they are not the same. Then I have other questions. The name for the first one, 'RelationIsAccessibleInLogicalDecoding', doesn't seem to represent what its comment reads. | /* True if we need to log enough information to have access via | decoding snapshot. 
*/ Making the macro name for this comment directly, I suppose it would be something like 'NeedsAdditionalInfoInLogicalDecoding' or more directly 'LogicalDecodingNeedsCids' or so.. The comment talks about logging enough information that it is accessible - just as the name says. - In heap_insert, the information conveyed in rdata[3] seems better placed in rdata[2] because of the scarcity of the additional information. Only XLOG_HEAP_CONTAINS_NEW_TUPLE seems to be needed. The same goes for heap_multi_insert and heap_update. Could you explain a bit more what you mean by that? The reason it's a separate rdata entry is that otherwise a full page write will remove the information. Sorry, I missed the comment 'so that an eventual FPW doesn't remove the tuple's data'. Although, given the necessity of removal prevention, overwriting rdata[].buffer, which is required by design (correct?), with InvalidBuffer seems a bit outrageous to me and obfuscates the objective of it. Well, it's added in a separate rdata element. Just as in dozens of other places. - In heap_delete, at the point adding replica identity, same as the above, rdata[3] could be moved into rdata[2], making a new type like 'xl_heap_replident'. Hm. I don't think that'd be a good idea, because we'd then need special-case decoding code for deletes, because the WAL format would be different for inserts/updates and deletes. Hmm. Although one common xl_heap_replident is impractical, splitting a logically single entity into two or more XLogRecDatas also seems not to be a good idea. That's done everywhere. There's basically two reasons to use separate rdata elements: * the buffers are different * the data pointer is different The rdata chain elements don't survive in the WAL. Considering the above, looking at heap_xlog_insert(), you marked xlrec.flags with XLOG_HEAP_CONTAINS_NEW_TUPLE to signal the decoder that the record should have tuple data not removed by FPW. This is the same for the update record. So the redoer(?) 
also can distinguish whether the update record contains extra tuple data or not. But it doesn't know the length of the individual records, so knowing there are two doesn't help. On the other hand, the update record after the patch is longer by sizeof(uint16) regardless of whether 'tuple data' is attached or not. I don't know the value of the size in the WAL stream, but the smaller would be better, maybe. I think that'd make things too complicated without too much gain in comparison to the savings. # Of course, it doesn't matter if the object for OLD can # naturally be missing as English. Well, I think the context makes it clear enough. I'm not a native English speaker, as you know:-) Neither am I ;). undefining XLOG_HEAP_CONTAINS_OLD and using separate macros type 1 | if (xlrec->flags & XLOG_HEAP_CONTAINS_OLD_KEY ||
Re: [HACKERS] Extension Templates S03E11
On 12/1/13, 10:47 PM, Stephen Frost wrote: Having a management system for sets of objects is a *great* idea- and one which we already have through schemas. What we don't have is any kind of versioning system built-in or other metadata about it, nor do we have good tooling which leverages such a versioning or similar system. Extensions provide some of that metadata around schemas and object definitions, Schemas can't manage objects that are not in schemas, so that won't work. It would be great if we could take the dependency tracking mechanism in extensions and expose it separately. I would like to be able to say:

    START PACKAGE foo    -- bad name
    bunch of DDL
    STOP PACKAGE

use it, and later:

    DROP PACKAGE foo;

This mechanism already exists in extensions, but it's combined with a bunch of other things. Separating those things (and naming them separately) might clear a few things up.
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
On 2013-12-05 11:39:29 -0500, Peter Eisentraut wrote: I think this proposal is a bit deadlocked now. - There are technical concerns about launching a server executable from within a client. - There are conceptual concerns about promoting an embedded database mode. On the other hand: - Everyone would like to have a way to use psql (and other basic clients) in stand-alone mode. The only solutions I see are: 1. do nothing 2. do everything (i.e., existing terminal mode plus socket mode plus embedded mode), letting the user work out the differences Pick one. ;-) 3) make it an explicit parameter, outside the database DSN, and let the clients contain a tiny bit of explicit code about it. There really aren't that many clients that can use such a mode sensibly. If we ever want to support a real embedded mode, much, much more than this is needed. I don't think we should let that stop us from improving single-user mode. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proposal: variant of regclass
On 12/5/13, 10:08 AM, Tom Lane wrote: Peter Eisentraut pete...@gmx.net writes: We could invent some sneaky syntax variants, like 'pg_klass'::regclass errors, but '?pg_klass'::regclass does not. Hmm ... cute idea, but shoehorning it into regoperator might be problematic. You'd have to pick a flag character that wasn't a valid operator character, which lets out '?' as well as a lot of the other mnemonically-reasonable choices. Well, you could pick any letter, I suppose.
Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3
On Thu, Nov 28, 2013 at 5:15 AM, Andres Freund and...@2ndquadrant.com wrote: Hi, Do you still have the core file around? If so could you 'p *ShmemVariableCache' and 'p *ControlFile'? So sorry, I didn't see this message until just today. Seems it was accidentally archived before hitting my eyeballs. I see that 9.3.2 was released today along with what appears to be some fixes regarding this and similar issues. Sorry if my missing this message held anything up. We still have the core file kicking around, so here's the output at any rate. (gdb) p *ShmemVariableCache $1 = {nextOid = 8036795, oidCount = 6974, nextXid = 10260628, oldestXid = 1673, xidVacLimit = 21673, xidWarnLimit = 2136485320, xidStopLimit = 2146485320, xidWrapLimit = 2147485320, oldestXidDB = 1, latestCompletedXid = 10260621} (gdb) p *ControlFile $2 = {system_identifier = 5942827484423487452, pg_control_version = 937, catalog_version_no = 201306121, state = DB_IN_PRODUCTION, time = 1385234938, checkPoint = 381771294048, prevCheckPoint = 381771293888, checkPointCopy = {redo = 381771293992, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = 1 '\001', nextXidEpoch = 0, nextXid = 10217377, nextOid = 8035577, nextMulti = 13448295, nextMultiOffset = 32161320, oldestXid = 1673, oldestXidDB = 1, oldestMulti = 1, oldestMultiDB = 1, time = 1385234938, oldestActiveXid = 10217377}, unloggedLSN = 1, minRecoveryPoint = 0, minRecoveryPointTLI = 0, backupStartPoint = 0, backupEndPoint = 0, backupEndRequired = 0 '\000', wal_level = 2, MaxConnections = 200, max_prepared_xacts = 0, max_locks_per_xact = 64, maxAlign = 8, floatFormat = 1234567, blcksz = 8192, relseg_size = 131072, xlog_blcksz = 8192, xlog_seg_size = 16777216, nameDataLen = 64, indexMaxKeys = 32, toast_max_chunk_size = 1996, enableIntTimes = 1 '\001', float4ByVal = 1 '\001', float8ByVal = 1 '\001', data_checksum_version = 0, crc = 517534097} (gdb) Cheers? 
Re: [HACKERS] Time-Delayed Standbys
On Thu, Dec 5, 2013 at 7:57 AM, Simon Riggs si...@2ndquadrant.com wrote: On 5 December 2013 08:51, Magnus Hagander mag...@hagander.net wrote: Not recalling the older thread, but it seems it breaks on clock drift. I think we can fairly easily make that situation good enough. Just have IDENTIFY_SYSTEM return the current timestamp on the master, and refuse to start if the time difference is too great. Yes, that doesn't catch the case when the machines are in perfect sync when they start up and drift *later*, but it will catch the most common cases I bet. But I think that's good enough that we can accept the feature, given that *most* people will have ntp, and that it's a very useful feature for those people. But we could help people who run into it because of a simple config error.. Or maybe the suggested patch already does this, in which case ignore that part :) I think the very nature of *this* feature is that it doesn't *require* the clocks to be exactly in sync, even though that is the basis for measurement. The setting of this parameter for sane usage would be a minimum of 5 mins, but more likely 30 mins, 1 hour or more. In that case, a few seconds of drift either way makes no real difference to this feature. So IMHO, without prejudice to other features that may be more critically reliant on time synchronisation, we are OK to proceed with this specific feature. Hi all, I saw the comments from all of you. I'm a bit busy with some customer issues (it has been a crazy week), but I'll reply and/or fix your suggestions later. Thanks for all the reviews, and sorry for the delay in replying. Regards, -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL Timbira: http://www.timbira.com.br Blog sobre TI: http://fabriziomello.blogspot.com Perfil Linkedin: http://br.linkedin.com/in/fabriziomello Twitter: http://twitter.com/fabriziomello
Re: [HACKERS] Proposal: variant of regclass
On Dec 5, 2013, at 7:52 AM, Tom Lane t...@sss.pgh.pa.us wrote: BTW, another arguable advantage of fixing this via new functions is that users could write equivalent (though no doubt slower) functions for use in pre-9.4 releases, and thus not need to maintain multiple versions of app code that relies on this behavior. +1 to this idea. Feels cleanest. David
Re: [HACKERS] Why we are going to have to go DirectIO
On 12/05/2013 07:40 AM, Greg Stark wrote: On Thu, Dec 5, 2013 at 2:54 PM, Claudio Freire klaussfre...@gmail.com wrote: That's a bad idea in the current state of affairs. MM files haven't been designed for that usage, and getting stable performance out of that will be way too difficult. I'm talking about long-term goals here. Either of these two routes would require whole new kernel interfaces to work effectively. Without those new kernel interfaces our current approach is possibly the best we can get. Well, in the long run we'll probably be using persistent RAM. And the geeks who manage that have already said that MMAP is a bad interface for persistent RAM. They haven't defined a good one, though. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Why we are going to have to go DirectIO
On 12/05/2013 05:48 AM, Stephen Frost wrote: * Peter Geoghegan (p...@heroku.com) wrote: On Wed, Dec 4, 2013 at 11:07 AM, Josh Berkus j...@agliodbs.com wrote: But you know what? 2.6, overall, still performs better than any kernel in the 3.X series, at least for Postgres. What about the fseek() scalability issue? Not to mention that the 2.6 which I suspect you're referring to (RHEL) isn't exactly 2.6.. Actually, I've been able to do 35K TPS on commodity hardware on Ubuntu 10.04. I have yet to get above 15K on any Ubuntu running a 3.X kernel. The CPU scheduling on 2.6 just seems to be far better tuned, aside from the IO issues; at 35K TPS, the CPU workload is evenly distributed across cores, whereas on 3.X it lurches from core to core like a drunk in a cathedral. However, the hardware is not identical, and this is on proprietary, not benchmark, workloads, which is why I haven't published anything. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Parallel Select query performance and shared buffers
On Thu, Dec 5, 2013 at 1:03 PM, Metin Doslu me...@citusdata.com wrote: From what I've seen so far, the bigger problem than contention in the lwlocks itself is the spinlock protecting the lwlocks... Postgres 9.3.1 also reports spindelay; it seems that there is no contention on spinlocks. Did you check hugepages?
Re: [HACKERS] Re: [RFC] Shouldn't we remove annoying FATAL messages from server log?
* David Johnston (pol...@yahoo.com) wrote: ISTM that instituting some level of categorization for messages would be helpful. Then logging and reporting frameworks would be able to identify and segregate the logs in whatever way they and the configuration deem appropriate. I've wanted to do that and have even discussed it with folks in the past; the trick is finding enough round tuits, which is difficult when you start to look at the size of the task... Thanks, Stephen
Re: [HACKERS] Re: [RFC] Shouldn't we remove annoying FATAL messages from server log?
On 12/05/2013 10:21 AM, Stephen Frost wrote: * David Johnston (pol...@yahoo.com) wrote: ISTM that instituting some level of categorization for messages would be helpful. Then logging and reporting frameworks would be able to identify and segregate the logs in whatever way they and the configuration deem appropriate. I've wanted to do that and have even discussed it with folks in the past; the trick is finding enough round tuits, which is difficult when you start to look at the size of the task... But ... if we set a firm policy on this, then we could gradually clean up the error messages piecemeal over the next couple of major versions. We could also make sure that any new features complied with the categorization policy. Right now, how to categorize errors is up to each individual patch author, which means that things are all over the place, and get worse with each new feature added. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Problem with displaying wide tables in psql
On Thu, Dec 5, 2013 at 1:09 AM, Sergey Muraviov sergey.k.murav...@gmail.com wrote: And my patch affects the row view only. To help us avoid forgetting about this patch, please add it here: https://commitfest.postgresql.org/action/commitfest_view/open -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Regression tests failing if not launched on db regression
On Thu, Dec 5, 2013 at 9:24 AM, Tom Lane t...@sss.pgh.pa.us wrote: Michael Paquier michael.paqu...@gmail.com writes: It happens that the following regression tests are failing if they are run on a database not named regression: This does not seem like a bug to me, although maybe we'd better update the documentation to specify that you need to use a DB named regression. At the same time, supporting it might not cost anything. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Re: [RFC] Shouldn't we remove annoying FATAL messages from server log?
Josh Berkus j...@agliodbs.com writes: On 12/05/2013 10:21 AM, Stephen Frost wrote: But ... if we set a firm policy on this, then we could gradually clean up the error messages piecemeal over the next couple of major versions. We could also make sure that any new features complied with the categorization policy. Right now, how to categorize errors is up to each individual patch author, which means that things are all over the place, and get worse with each new feature added. I don't think there's that much randomness in is-it-an-ERROR-or-not. What I believe Stephen is talking about is a classification that simply doesn't exist today, namely something around how likely is the message to be of interest to a DBA as opposed to the client application. We currently compose messages almost entirely with the client in mind, and that's as it should be. But we could use some new decoration that's more DBA-oriented to help decide what goes into the postmaster log. Before we could get very far we'd need a better understanding than we have of what cases a DBA might be interested in. To take the specific example that started this thread, there wouldn't be a lot of value IMO in a classification like connection failure messages. I think the OP is probably right that those are often uninteresting --- but as I mentioned, too many clients might become interesting if he's wondering whether he needs to enlarge max_connections. Or password failure cases might become interesting if he starts to suspect breakin attempts. So I'd want to see a design that credibly covers those sorts of needs before we put any large effort into code changes. regards, tom lane
Re: [HACKERS] Regression tests failing if not launched on db regression
Robert Haas robertmh...@gmail.com writes: On Thu, Dec 5, 2013 at 9:24 AM, Tom Lane t...@sss.pgh.pa.us wrote: Michael Paquier michael.paqu...@gmail.com writes: It happens that the following regression tests are failing if they are run on a database not named regression: This does not seem like a bug to me, although maybe we'd better update the documentation to specify that you need to use a DB named regression. At the same time, supporting it might not cost anything. Well, changing these specific tests today might not be terribly expensive, but what is it that prevents more such tests from being committed next week? Perhaps by somebody who feels current_database() should be included in code coverage, for example? More generally, we never have and never can promise that the regression tests pass regardless of environment. If you turn off enable_seqscan, for instance, you'll get a whole lot of not-terribly-exciting diffs. I see no particular benefit to promising that the name of the regression database isn't significant. regards, tom lane
Re: [HACKERS] Re: [RFC] Shouldn't we remove annoying FATAL messages from server log?
On 12/05/2013 10:46 AM, Tom Lane wrote: Before we could get very far we'd need a better understanding than we have of what cases a DBA might be interested in. To take the specific example that started this thread, there wouldn't be a lot of value IMO in a classification like connection failure messages. I think the OP is probably right that those are often uninteresting --- but as I mentioned, too many clients might become interesting if he's wondering whether he needs to enlarge max_connections. Or password failure cases might become interesting if he starts to suspect breakin attempts. So I'd want to see a design that credibly covers those sorts of needs before we put any large effort into code changes. Heck, I'd be happy just to have a class of messages which specifically means "OMG, there's something wrong with the server", that is, a flag for messages which only occur when PostgreSQL encounters a bug, data corruption, or platform error. Right now, I have to suss those out by regex. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com