Re: [HACKERS] Minmax indexes
On Thu, Sep 26, 2013 at 1:46 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
> Amit Kapila escribió:
>> On Sun, Sep 15, 2013 at 5:44 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
>> On Windows, patch gives below compilation errors:
>> src\backend\access\minmax\mmtuple.c(96): error C2057: expected constant expression
>> I have fixed all these compile errors (fix attached).
> Thanks for reporting them. I'll post a new version shortly.

Thanks for fixing it.

In the last few days I have spent some time reading about minmax (or equivalent) indexes in other databases (Netezza and Oracle) and going through some parts of your proposal. It's a fairly big patch and needs much more time, but I would like to share the findings/thoughts I have developed so far.

First, about the interface and use case: as far as I could understand, other databases provide this index automatically rather than via a separate CREATE INDEX command, which may be because such an index is mainly useful when the data is ordered, or distributed in such a way that it benefits repeatedly executed queries. You have proposed it as a command, which means the user needs to take care of it; I find that okay for a first version, and later maybe we can add some optimizations so that it can be created automatically.

For the page range: if I read correctly, you currently use a #define. Do you want to expose it to the user in some way, like a GUC, or maintain it internally and assign the right value based on the performance of different queries?

Operations on this index seem to be very fast - Oracle has this as an in-memory structure, and I read that in Netezza write operations don't carry any significant overhead for zone maps compared to other indexes - so shouldn't we consider making it not WAL-logged? OTOH, because these structures get created automatically in those databases it might be okay there, but if we provide it as a command, the user might be bothered if he doesn't find it after a server restart.

A few questions and observations:

1.
+ When a new heap tuple is inserted in a summarized page range, it is possible to
+ compare the existing index tuple with the new heap tuple. If the heap tuple is
+ outside the minimum/maximum boundaries given by the index tuple for any indexed
+ column (or if the new heap tuple contains null values but the index tuple
+ indicate there are no nulls), it is necessary to create a new index tuple with
+ the new values. To do this, a new index tuple is inserted, and the reverse range
+ map is updated to point to it. The old index tuple is left in place, for later
+ garbage collection.

Is there a reason why we can't directly update the value rather than inserting a new index tuple? As I understand it, for other indexes like btree we do this because we might need to roll back, but here, even if a rollback happens after updating the min or max value, it will not cause any harm (tuple loss).

2.
+ If the reverse range map points to an invalid TID, the corresponding page range
+ is not summarized.

3. It might be better if you could mention when the range map will point to an invalid TID; it's not explained in your proposal, but you have used it to explain some other things.

4. "Reverse range map" is a good terminology, but isn't "range translation map" better? I don't mind either way; it's just a thought that came to my mind while understanding the concept of the reverse range map.

5.
/*
 * As above, except that instead of scanning the complete heap, only the given
 * range is scanned. Scan to end-of-rel can be signalled by passing
 * InvalidBlockNumber as end block number.
 */
double
IndexBuildHeapRangeScan(Relation heapRelation, Relation indexRelation,
                        IndexInfo *indexInfo, bool allow_sync,
                        BlockNumber start_blockno, BlockNumber numblocks,
                        IndexBuildCallback callback, void *callback_state)

In the comment you have used "end block number"; which parameter does it refer to? I can see only start_blockno and numblocks.

6. Currently you are passing 0 as the start block and InvalidBlockNumber as the number of blocks; what's the logic for it?

return IndexBuildHeapRangeScan(heapRelation, indexRelation, indexInfo,
                               allow_sync, 0, InvalidBlockNumber,
                               callback, callback_state);

7. In mmbuildCallback, a tuple is only added to the minmax index if it satisfies the page range; otherwise this can lead to a wasted big scan in case the page range is large (1280 pages, as you mentioned in one of your mails). Why can't we include it at the end of the scan?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
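[For illustration, a rough sketch of the range-summary maintenance logic discussed in point 1 above. The struct and function names are invented here, not the patch's:]

    #include "postgres.h"

    /*
     * Hypothetical per-column summary for one page range.  On insert into a
     * summarized range, the index tuple only needs replacing when the new
     * value falls outside the stored [min, max] bounds (or introduces the
     * first NULL into the range).
     */
    typedef struct MMSummary
    {
        Datum   min;
        Datum   max;
        bool    hasnulls;
    } MMSummary;

    /* Returns true if a new index tuple must be written for this range. */
    static bool
    mm_summary_needs_update(const MMSummary *s, Datum value, bool isnull,
                            int (*cmp) (Datum, Datum))
    {
        if (isnull)
            return !s->hasnulls;    /* first NULL seen in this range */
        return cmp(value, s->min) < 0 || cmp(value, s->max) > 0;
    }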
Re: [HACKERS] Minmax indexes
On Fri, Sep 27, 2013 at 11:49 AM, Amit Kapila amit.kapil...@gmail.com wrote:
> [...]
> 6. Currently you are passing 0 as the start block and InvalidBlockNumber as the number of blocks; what's the logic for it?
>
> return IndexBuildHeapRangeScan(heapRelation, indexRelation, indexInfo,
>                                allow_sync, 0, InvalidBlockNumber,
>                                callback, callback_state);

I got it: I think here it means "scan all the pages".

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] FW: REVIEW: Allow formatting in log_line_prefix
On Fri, Sep 27, 2013 at 10:28 AM, Robert Haas robertmh...@gmail.com wrote:
> On Thu, Sep 26, 2013 at 5:18 PM, Robert Haas robertmh...@gmail.com wrote:
>> On Thu, Sep 26, 2013 at 3:55 PM, David Rowley dgrowle...@gmail.com wrote:
>>> I think I must have forgotten to save it before I emailed it... Here's the correct version.
>>
>> Ah ha. Looks better. I'm working on getting this committed now. Aside from some stylistic things, I've found one serious functional bug, which is that you need to test padding != 0, not padding > 0, when deciding which path to take. No need to send a new patch; I've already fixed it in my local copy...
>
>> Oops, negative padding. My bad. I was focusing too much on the benchmarks, I think.
>
> Committed now. Let me know if I broke anything.

Great, thanks for committing! Thank you Albe for your review too!

Regards

David

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
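[The bug fix above is easy to see with a standalone sketch: negative padding is legal and means left-justification, so the padded path must be taken whenever padding is nonzero, not only when it is positive. This sketch uses plain printf; the actual patch works on a StringInfo.]

    #include <stdio.h>

    static void
    append_padded(const char *value, int padding)
    {
        if (padding != 0)           /* padded path; padding may be negative */
            printf("%*s", padding, value);
        else                        /* fast path: no padding at all */
            printf("%s", value);
    }

    int
    main(void)
    {
        append_padded("db1", 8);    /* prints "     db1" (right-justified) */
        printf("|\n");
        append_padded("db1", -8);   /* prints "db1     " (left-justified)  */
        printf("|\n");
        return 0;
    }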
Re: [HACKERS] pgbench filler columns
On Thu, Sep 26, 2013 at 7:20 PM, Noah Misch n...@leadboat.com wrote:
> On Thu, Sep 26, 2013 at 03:23:30PM +0530, Pavan Deolasee wrote:
>> Should we just fix the comment and say it's applicable for all tables except accounts?
>
> Please do.

How about something like this? Patch attached.

Thanks, Pavan

-- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee

pgbench_filler_column_notes.patch
Re: [HACKERS] Some interesting news about Linux 3.12 OOM
On Wed, Sep 25, 2013 at 8:00 AM, Greg Stark st...@mit.edu wrote:
> On Wed, Sep 25, 2013 at 12:15 AM, Daniel Farina dan...@heroku.com wrote:
>> Enable the memcg OOM killer only for user faults, where it's really the only option available.
>
> Is this really a big deal? I would expect most faults to be user faults. It's certainly a big deal that we need to ensure we can handle ENOMEM from syscalls and library functions we weren't expecting to return it. But I don't expect it to actually reduce the OOM killing sprees by much.

Hmm, I see what you mean. I have been reading through the mechanism: I got too excited about "allocations by system calls", because I thought that might mean brk and friends, except that's not much of an allocation at all, just a reservation, I think.

There is some interesting stuff coming in along with these patches in bringing the user-space memcg OOM handlers up to snuff that may make it profitable to issue SIGTERM to backends when a safety margin is crossed (too bad the error messages will be confusing in that case). I was rather hoping that a regular ENOMEM could be injected by this mechanism the next time a syscall is touched (unknown), but I'm not confident whether this is made easier or not, one way or another. One could imagine the kernel injecting such a fault when the amount of memory being consumed starts to look hairy, but I surmise part of the impetus for userspace handling of that is to avoid getting into that particular heuristics game.

Anyway, I did do some extensive study of cgroups and memcg's implementation in particular and found it not really practical for Postgres use unless one was happy with lots and lots of database restarts, and this work still gives me some hope to try again, even if smaller modifications still seem necessary.
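[A minimal sketch of the defensive pattern implied above: once the kernel surfaces ENOMEM from calls that previously would have triggered the OOM killer instead, the caller has to treat it as a handleable condition. The retry policy here is an assumption for illustration, not a recommendation from the thread.]

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    static ssize_t
    read_handling_enomem(int fd, void *buf, size_t len)
    {
        for (;;)
        {
            ssize_t n = read(fd, buf, len);

            if (n >= 0 || errno != ENOMEM)
                return n;           /* success, or an error we pass through */

            /* ENOMEM: back off and retry instead of dying outright */
            fprintf(stderr, "read: ENOMEM, backing off\n");
            usleep(100 * 1000);
        }
    }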
Re: [HACKERS] Wait free LW_SHARED acquisition
On 27.09.2013 01:55, Andres Freund wrote:
> Hello,
>
> We have had several customers running postgres on bigger machines report problems on busy systems. Most recently one where a fully cached workload completely stalled in s_lock()s due to the *shared* lwlock acquisition in BufferAlloc() around the buffer partition lock.
>
> Increasing the padding to a full cacheline helps making the partitioning of the partition space actually effective (before it's essentially halved on a read-mostly workload), but that still leaves one with very hot spinlocks.
>
> So the goal is to have LWLockAcquire(LW_SHARED) never block unless somebody else holds an exclusive lock. To produce enough appetite for reading the rest of the long mail, here's some numbers on a pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620:
>
> master + padding: tps = 146904.451764
> master + padding + lwlock: tps = 590445.927065

How does that compare with simply increasing NUM_BUFFER_PARTITIONS?

- Heikki
Re: [HACKERS] Wait free LW_SHARED acquisition
Hi,

On 2013-09-27 10:14:46 +0300, Heikki Linnakangas wrote:
> How does that compare with simply increasing NUM_BUFFER_PARTITIONS?

Heaps better. In the case causing this investigation, lots of the pages with hot spinlocks were simply the same ones over and over again; partitioning the lockspace won't help much there. That's not exactly an uncommon scenario, since often enough there's a small amount of data hit very frequently and lots more that is accessed only infrequently. E.g. recently inserted data and such tends to be very hot.

I can run a test on the 4 socket machine if it's unused, but on my 2 socket workstation the benefits, at least for our simulation of the original workload, were marginal after increasing the padding to a full cacheline.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
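[The padding Andres refers to can be pictured as follows - a sketch assuming a 64-byte cacheline, giving each lwlock its own line so that neighbouring locks in the array stop false-sharing:]

    #define CACHE_LINE_SIZE 64          /* an assumption; platform-dependent */

    typedef union LWLockPadded
    {
        LWLock  lock;                   /* the existing lwlock struct */
        char    pad[CACHE_LINE_SIZE];   /* rounds sizeof up to one line,
                                         * assuming sizeof(LWLock) <= 64 */
    } LWLockPadded;

    /* e.g. the buffer mapping locks become an array of padded elements,
     * each starting on its own cacheline when suitably aligned */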
Re: [HACKERS] Wait free LW_SHARED acquisition
On 2013-09-27 09:21:05 +0200, Andres Freund wrote:
> I can run a test on the 4 socket machine if it's unused, but on my 2 socket workstation the benefits, at least for our simulation of the original workload, were marginal after increasing the padding to a full cacheline.

Ok, it was free:

padding + 16 partitions: tps = 147884.648416
padding + 32 partitions: tps = 141777.841125
padding + 64 partitions: tps = 141561.539790
padding + 16 partitions + new lwlocks: tps = 601895.580903 (yeah, still reproduces after some sleep!)

Now, the other numbers were best-of-three and these aren't, but I think it's pretty clear that you're not going to see the same benefits. I am not surprised... The current implementation of lwlocks will frequently block others, both during acquisition and release of locks. What's even worse, trying to fruitlessly acquire a spinlock will often prevent releasing it, because we need the spinlock during release. With the proposed algorithm, even if we need the spinlock during release of an lwlock because there are queued PGPROCs, we will acquire that spinlock only after already having released the lock...

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
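[To make the "never block on shared acquisition" idea concrete, a minimal sketch using GCC's atomic builtins. This is not the patch's code - the real version also handles queuing, release, and error paths - it only shows why the shared-lock fast path needs no spinlock at all:]

    #include <stdint.h>
    #include <stdbool.h>

    #define EXCLUSIVE_BIT (UINT32_C(1) << 31)   /* high bit: exclusive holder */

    typedef struct LWLockState
    {
        uint32_t state;             /* low bits count shared holders */
    } LWLockState;

    static bool
    lwlock_try_shared(LWLockState *lock)
    {
        /* optimistically register as a shared holder: one atomic add */
        uint32_t old = __atomic_fetch_add(&lock->state, 1, __ATOMIC_ACQUIRE);

        if (old & EXCLUSIVE_BIT)
        {
            /* an exclusive holder exists: undo, then queue (not shown) */
            __atomic_fetch_sub(&lock->state, 1, __ATOMIC_RELEASE);
            return false;
        }
        return true;                /* shared lock taken without spinning */
    }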
Re: [HACKERS] Patch for fail-back without fresh backup
On Thu, Sep 26, 2013 at 8:54 PM, Pavan Deolasee pavan.deola...@gmail.com wrote:
> On Thu, Sep 19, 2013 at 4:02 PM, Fujii Masao masao.fu...@gmail.com wrote:
>> Hmm... when synchronous_transfer is set to data_flush, IMO the intuitive behaviors are:
>> (1) synchronous_commit = on: a data flush should wait for the corresponding WAL to be flushed in the standby
>> (2) synchronous_commit = remote_write: a data flush should wait for the corresponding WAL to be written to the OS in the standby
>> (3) synchronous_commit = local
>> (4) synchronous_commit = off: a data flush should wait for the corresponding WAL to be written locally in the master
>
> I thought synchronous_commit and synchronous_transfer are kind of orthogonal to each other. synchronous_commit only controls whether and how to wait for the standby when a transaction commits. synchronous_transfer OTOH tells how to interpret the standbys listed in the synchronous_standbys parameter. If set to "commit" then they are synchronous standbys (like today). If set to "data_flush", they are asynchronous failback-safe standbys, and if set to "all" then they are synchronous failback-safe standbys. Well, it's confusing :-(
>
> So IMHO in the current state of things, the synchronous_transfer GUC cannot be changed at a session/transaction level, since all backends, including background workers, must honor the setting to guarantee failback safety. synchronous_commit still works the same way, but is ignored if synchronous_transfer is set to "data_flush", because that effectively tells us that the standbys listed under synchronous_standbys are really *async* standbys with failback safety.

Thank you for the comment. I think it is a good, simple idea. In your opinion, if synchronous_transfer is set to 'all' and synchronous_commit is set to 'on', the master waits for the data flush even if the user sets synchronous_commit to 'local' or 'off'. For example, when the user wants a transaction to complete quickly, he can't do that. Do we leave such a situation as a constraint?

Regards, Sawada Masahiko
Re: [HACKERS] Wait free LW_SHARED acquisition
On 27.09.2013 10:21, Andres Freund wrote:
> Heaps better. In the case causing this investigation, lots of the pages with hot spinlocks were simply the same ones over and over again; partitioning the lockspace won't help much there. That's not exactly an uncommon scenario, since often enough there's a small amount of data hit very frequently and lots more that is accessed only infrequently. E.g. recently inserted data and such tends to be very hot.

I see. So if only a few buffers are really hot, I'm assuming the problem isn't just the buffer partition lock, but also the spinlock on the buffer header, and the buffer content lwlock. Yeah, improving LWLocks would be a nice wholesale solution to that.

I don't see any fundamental flaw in your algorithm. Nevertheless, I'm going to throw in a couple of other ideas:

* Keep a small 4-5 entry cache of buffer lookups in each backend, holding the most recently accessed buffers. Before searching for a buffer in SharedBufHash, check the local cache.

* To pin a buffer, use an atomic fetch-and-add instruction to increase the refcount. PinBuffer() also increases usage_count, but you could do that without holding a lock; it doesn't need to be accurate.

One problem with your patch is going to be making it also work without the CAS and fetch-and-add instructions. Those are probably present in all the architectures we support, but it'll take some effort to get the architecture-specific code done. Until it's all done, it would be good to be able to fall back to plain spinlocks, which we already have. Also, when someone ports PostgreSQL to a new architecture in the future, it would be helpful if you wouldn't need to write all the architecture-specific code immediately to get it to compile.

Did you benchmark your patch against the compare-and-swap patch I posted earlier? (http://www.postgresql.org/message-id/519a3587.80...@vmware.com). Just on a theoretical level, I would assume your patch scales better, since it uses fetch-and-add instead of compare-and-swap for acquiring a shared lock. But in practice it might be a wash.

- Heikki
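[Heikki's first idea might look roughly like this - a hypothetical sketch, not an implementation. Note Andres's reply below that any hit would still need revalidation after pinning, which is omitted here:]

    #include "postgres.h"
    #include "storage/buf_internals.h"  /* BufferTag, BUFFERTAGS_EQUAL */

    #define LOCAL_BUF_CACHE_SIZE 4

    typedef struct LocalBufCacheEnt
    {
        BufferTag   tag;            /* which page we remembered */
        int         buf_id;         /* where we last saw it */
    } LocalBufCacheEnt;

    static LocalBufCacheEnt local_buf_cache[LOCAL_BUF_CACHE_SIZE];

    /* Returns a buffer id hint, or -1 to fall back to SharedBufHash. */
    static int
    local_buf_lookup(const BufferTag *tag)
    {
        int     i;

        for (i = 0; i < LOCAL_BUF_CACHE_SIZE; i++)
        {
            if (BUFFERTAGS_EQUAL(local_buf_cache[i].tag, *tag))
                return local_buf_cache[i].buf_id;   /* hint only */
        }
        return -1;
    }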
Re: [HACKERS] Wait free LW_SHARED acquisition
On 2013-09-27 09:57:07 +0200, Andres Freund wrote:
> padding + 16 partitions: tps = 147884.648416
> padding + 32 partitions: tps = 141777.841125
> padding + 64 partitions: tps = 141561.539790
> padding + 16 partitions + new lwlocks: tps = 601895.580903

Pgbench numbers for writes on the machine (fsync = off, synchronous_commit = off):

padding + 16 partitions: tps = 8903.532163

vs

padding + 16 partitions + new lwlocks: tps = 13324.080491

So, on bigger machines the advantages seem to be there for writes as well...

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Patch for fail-back without fresh backup
On Fri, Sep 27, 2013 at 1:28 PM, Sawada Masahiko sawada.m...@gmail.com wrote:
> Thank you for the comment. I think it is a good, simple idea. [...] For example, when the user wants a transaction to complete quickly, he can't do that. Do we leave such a situation as a constraint?

No, the user can still override the wait at the transaction commit point. So:

If synchronous_transfer is set to all:
- If synchronous_commit is ON: wait at all points
- If synchronous_commit is OFF: wait only at buffer flush (and other failback-safety related) points

If synchronous_transfer is set to data_flush:
- Whether synchronous_commit is ON or OFF: do not wait at commit points, but wait at all other points

If synchronous_transfer is set to commit:
- If synchronous_commit is ON: wait at commit point
- If synchronous_commit is OFF: do not wait at any point

Thanks, Pavan

-- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
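[Purely for illustration, the matrix above condenses into a decision function like the following; the enum and names are invented, not the patch's:]

    #include <stdbool.h>

    typedef enum SyncTransferMode
    {
        XFER_COMMIT,                /* today's behaviour */
        XFER_DATA_FLUSH,            /* async, but failback-safe */
        XFER_ALL                    /* synchronous and failback-safe */
    } SyncTransferMode;

    static bool
    must_wait_for_standby(SyncTransferMode transfer, bool sync_commit_on,
                          bool at_commit_point)
    {
        switch (transfer)
        {
            case XFER_ALL:
                /* wait everywhere; commit waits only if commit is sync */
                return at_commit_point ? sync_commit_on : true;
            case XFER_DATA_FLUSH:
                /* failback safety only: never wait at commit */
                return !at_commit_point;
            case XFER_COMMIT:
            default:
                return at_commit_point && sync_commit_on;
        }
    }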
Re: [HACKERS] Wait free LW_SHARED acquisition
On 2013-09-27 11:11:56 +0300, Heikki Linnakangas wrote:
> * Keep a small 4-5 entry cache of buffer lookups in each backend, holding the most recently accessed buffers. Before searching for a buffer in SharedBufHash, check the local cache.

I thought about that as well, but you'd either need to revalidate after pinning the buffers, or keep them pinned. I had a very hacky implementation of that, but it just made the buffer content locks the top profile spots. Similar issue there. It might be worthwhile to do this nonetheless - "lock xadd" certainly isn't cheap, even though it's cheaper than a full spinlock - but it's not trivial.

> * To pin a buffer, use an atomic fetch-and-add instruction to increase the refcount. PinBuffer() also increases usage_count, but you could do that without holding a lock; it doesn't need to be accurate.

Yes, Pin/UnpinBuffer() are the primary contention points after this patch. I want to tackle them, but that seems like a separate thing to do. I think we should be able to get rid of most or even all LockBufHdr() calls by a) introducing PinBufferInternal(), which increases pins but not the usage count using an atomic increment. That can replace locking headers in many cases. b) making PinBuffer() increment pin and usagecount using a single 64bit atomic add, if available, and fitting the flags in there as well. Then back off the usagecount if it's too high or even wraps around; that doesn't hurt much, we're pinned at that moment.

> One problem with your patch is going to be making it also work without the CAS and fetch-and-add instructions. Those are probably present in all the architectures we support, but it'll take some effort to get the architecture-specific code done. Until it's all done, it would be good to be able to fall back to plain spinlocks, which we already have. Also, when someone ports PostgreSQL to a new architecture in the future, it would be helpful if you wouldn't need to write all the architecture-specific code immediately to get it to compile.

I think most recent compilers have intrinsics for operations like that. I quite like the API provided by gcc for this kind of stuff; I think we should model an internal wrapper API similarly. I don't see any new platform coming that has a compiler without intrinsics. But yes, you're right, I think we need a spinlock-based fallback for now - even if it's just because nobody among us can verify the implementations on the more obsolete platforms we claim to support. I just didn't see it as a priority in the PoC.

> Did you benchmark your patch against the compare-and-swap patch I posted earlier? (http://www.postgresql.org/message-id/519a3587.80...@vmware.com). Just on a theoretical level, I would assume your patch scales better, since it uses fetch-and-add instead of compare-and-swap for acquiring a shared lock. But in practice it might be a wash.

I've tried compare-and-swap for shared acquisition and it performed worse; I didn't try your patch, though, as you seemed to have concluded it's a wash after doing the unlocked test.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
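[Idea (b) above - a single 64-bit atomic add covering both counters - could be sketched like this; the bit layout is an assumption for illustration:]

    #include <stdint.h>

    #define BUF_REFCOUNT_ONE    UINT64_C(1)             /* bits 0..31  */
    #define BUF_USAGECOUNT_ONE  (UINT64_C(1) << 32)     /* bits 32..47 */
    #define BUF_USAGECOUNT_MASK (UINT64_C(0xFFFF) << 32)
    #define BUF_MAX_USAGE       5

    static void
    pin_buffer(uint64_t *buf_state)
    {
        /* bump refcount and usage_count in one atomic operation */
        uint64_t old = __atomic_fetch_add(buf_state,
                                          BUF_REFCOUNT_ONE | BUF_USAGECOUNT_ONE,
                                          __ATOMIC_ACQUIRE);

        /* usage_count only needs to be approximate: back it off if it was
         * already at the cap; we hold a pin, so brief overshoot is harmless */
        if (((old & BUF_USAGECOUNT_MASK) >> 32) >= BUF_MAX_USAGE)
            __atomic_fetch_sub(buf_state, BUF_USAGECOUNT_ONE, __ATOMIC_RELAXED);
    }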
Re: [HACKERS] Patch for fast gin cache performance improvement
I wrote:
> I had a look over this patch. I think this patch is interesting and very useful. Here are my review points:
>
> 8. I think there are no issues in this patch. However, I have one question: how does this patch work in the case where gin_fast_limit/fast_cache_size = 0? In this case, in my understanding, this patch inserts new entries into the pending list temporarily and immediately moves them to the main GIN data structure using ginInsertCleanup(). Am I right? If so, that is obviously inefficient.

Sorry, that was an incorrect expression. I mean gin_fast_limit > 0 and fast_cache_size = 0.

Although I asked this question, I've reconsidered these parameters, and it seems that they not only make the code rather complex but are also a little confusing to users. So I'd like to propose introducing only one parameter: fast_cache_size. Users that give weight to update performance for the fast update technique would set this parameter to a large value, while users that give weight not only to update performance but also to search performance would set it to a small value. What do you think about this?

Thanks, Best regards, Etsuro Fujita
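[For illustration, the single-knob proposal amounts to a check of roughly this shape around pending-list insertion. The function below and its parameters are hypothetical; ginInsertCleanup() in the real code takes more arguments than shown:]

    #include "postgres.h"
    #include "utils/rel.h"

    static void
    gin_pending_list_check(Relation index, int64 pending_bytes,
                           int64 fast_cache_size_kb)
    {
        /*
         * A small fast_cache_size means frequent cleanup (search-friendly);
         * a large one means rare cleanup (update-friendly).
         */
        if (pending_bytes > fast_cache_size_kb * 1024)
            gin_insert_cleanup(index);  /* stand-in for ginInsertCleanup() */
    }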
Re: [HACKERS] insert throw error when year field len > 4 for timestamptz datatype
Sorry for the delay in reply.

On Tue, Sep 17, 2013 at 6:23 PM, Haribabu kommi haribabu.ko...@huawei.com wrote:
> On Tue, 17 September 2013 14:33 Rushabh Lathia wrote:
> On Mon, Sep 16, 2013 at 7:22 PM, Haribabu kommi haribabu.ko...@huawei.com wrote:
> On 14 August 2013 Rushabh Lathia wrote:
>
> postgres=# create table test ( a timestamptz);
> CREATE TABLE
>
> -- Date with year 1000
> postgres=# insert into test values ( 'Sat Mar 11 23:58:48 1000 IST');
> INSERT 0 1
>
> -- Now try with year 10000; it will return an error
> postgres=# insert into test values ( 'Sat Mar 11 23:58:48 10000 IST');
> ERROR: invalid input syntax for type timestamp with time zone: "Sat Mar 11 23:58:48 10000 IST"
> LINE 1: insert into test values ( 'Sat Mar 11 23:58:48 10000 IST');
>
> Patch applies cleanly to HEAD. As this patch tries to improve inserting a date with a year value of more than 4 digits in length, it didn't solve all the ways to insert a year field of more than 4 digits. Please check the following test:
>
> postgres=# insert into test values ('10001010 10:10:10 IST');
> INSERT 0 1
> postgres=# insert into test values ('100011010 10:10:10 IST');
> ERROR: invalid input syntax for type timestamp with time zone: "100011010 10:10:10 IST" at character 26
> STATEMENT: insert into test values ('100011010 10:10:10 IST');
> ERROR: invalid input syntax for type timestamp with time zone: "100011010 10:10:10 IST"
> LINE 1: insert into test values ('100011010 10:10:10 IST');
>
> I feel it is better to provide the functionality of inserting a year field of more than 4 digits in length in all flows.
>
> +1. Nice catch.
>
> Here is the latest version of the patch which handles the functionality in all flows. Could you test it and share your comments.
>
> I am getting some other failures with the updated patch also; please check the following tests:
>
> select date 'January 8, 19990';
> select timestamptz 'January 8, 199910 01:01:01 IST';
> INSERT INTO TIMESTAMPTZ_TST VALUES(4, '10001 SAT 8 MAR 10:10:10 IST');
>
> You can get the test scripts from the regress test files date.sql, timetz.sql, timestamp.sql and timestamptz.sql, and modify them according to the patch for verification.
>
> I feel changing the year value to accept length > 4 is not simple. In so many places a year length crossing more than 4 is not considered. Search the code and correct all related paths.

Right, changing the year value to accept length > 4 is not simple, because in so many places the year length can cross 4, plus most of the places have an assumption that it will always be 4. I tried to fix the issue in a couple more places, but I don't feel it is always going to be safe to assume that we covered all paths. Still looking and wondering whether we can make the change in one simple place, or whether we can find any other smarter way to fix the issue.

Regards, Hari babu.

-- Rushabh Lathia
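[The class of bug being chased here, shown standalone: a parser that hard-codes a 4-digit year has to be taught to carry the real field length. This is an illustrative sketch, not datetime.c's actual code:]

    #include <stdlib.h>
    #include <string.h>

    /* Returns 0 on success, -1 on a syntax error. */
    static int
    parse_year_field(const char *field, long *year)
    {
        size_t len = strspn(field, "0123456789");

        if (len == 0 || field[len] != '\0')
            return -1;              /* not all digits */

        /*
         * The failure mode under discussion is an implicit or explicit
         * "if (len > 4) return -1;" at this point - every such spot in
         * every parsing path must instead handle the longer field.
         */
        *year = strtol(field, NULL, 10);
        return 0;
    }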
Re: [HACKERS] Patch for fail-back without fresh backup
On Fri, Sep 27, 2013 at 5:18 PM, Pavan Deolasee pavan.deola...@gmail.com wrote:
> No, the user can still override the wait at the transaction commit point. [...]

Thank you for the explanation. Understood: if synchronous_transfer is set to 'all' and the user changes synchronous_commit to 'off' (or 'local') in a transaction, the master server waits at buffer flush, but doesn't wait at commit points. Right?

In the current patch, synchronous_transfer works in cooperation with synchronous_commit. But if the user changes synchronous_commit within a transaction, they no longer cooperate. So your idea might be better than the current behaviour of synchronous_transfer.

Regards, Sawada Masahiko
Re: [HACKERS] insert throw error when year field len > 4 for timestamptz datatype
On 27 September 2013 15:04 Rushabh Lathia wrote:
> On Tue, Sep 17, 2013 at 6:23 PM, Haribabu kommi haribabu.ko...@huawei.com wrote:
>> I feel changing the year value to accept length > 4 is not simple. In so many places a year length crossing more than 4 is not considered. Search the code and correct all related paths.
>
> Right, changing the year value to accept length > 4 is not simple, because in so many places the year length can cross 4, plus most of the places have an assumption that it will always be 4. I tried to fix the issue in a couple more places, but I don't feel it is always going to be safe to assume that we covered all paths. Still looking and wondering whether we can make the change in one simple place, or whether we can find any other smarter way to fix the issue.

If the changes required to deal with all scenarios are very extensive, I feel it is better to do it only in the scenarios where the use case needs it, as long as that does not confuse users. The rest can be documented. Any other opinions/suggestions are welcome.

Regards, Hari babu.
Re: [HACKERS] backup.sgml-cmd-v003.patch
On Thu, Sep 26, 2013 at 1:15 PM, Ivan Lezhnjov IV i...@commandprompt.com wrote:
> Thanks for a detailed response. I attached a patch file that builds on your corrections and introduces some of the edits discussed above.

This patch makes at least five unrelated sets of changes:

1. Attempting to encourage people to consider custom format dumps.
2. Suggesting that superuser access isn't necessary to dump the database.
3. Adding a note that OIDs on user objects are deprecated.
4. Describing the manner in which pg_dump can be used with pg_dumpall to back up all databases in a cluster.
5. Promoting pigz.

It's not a good idea to bundle so many unrelated changes together into a single patch, because if there is disagreement about any item on the list, then the patch doesn't go in. Moreover, we typically try to commit one change at a time, so this leaves the eventual committer with the unenviable task of splitting up the patch for commit.

I think that #3 and #5 are things we shouldn't do at all. There are many compression utilities in the world other than gzip, and everyone has their favorites. I see no reason why we should promote one over the other. Certainly, the fact that you yourself have found pigz useful is not sufficient evidence that everyone should prefer it. And, while it's true that OIDs on user objects are deprecated, I don't think the pg_dump documentation particularly needs to recapitulate this point; hopefully it's documented elsewhere, and if not, this isn't the right place to mention it. If, contrary to my feelings on the matter, we do decide to make a change in either of these areas, it's unrelated to the rest of this patch and needs to be submitted and discussed separately.

I think that #2 is probably a worthwhile change in some form. However, I don't particularly agree with the way you've worded it here. This section is trying to give a general overview of how to back up your whole database; it's not the pg_dump reference page, which exists elsewhere. I would suggest that the relevant caveat here is not so much that you might have a non-superuser account that just happens to have enough privileges to back up the entire database, but rather that we shouldn't give people the impression that the only possible use of pg_dump is for a full backup. Because in my experience, a full-database backup almost always DOES require superuser privileges; it's the selective-dump case where you can often get by without them. I've attached a proposed patch along these lines for your consideration.

The remaining changes (#1 and #4) can probably be blended together in a single patch. However, I think that you need to make more of an effort to use neutral wording. The purpose of the documentation is not to pass judgement on the tools. Examples:

+    The above example creates a plain text file. This type of dump can be used
+    to restore a database in full. However there are more sophisticated
+    <productname>PostgreSQL</> backup formats which allow for far greater
+    control when working with backups. One of these is
+    the <quote>custom</quote> format, which the following more elaborate
+    example creates:

I don't find it helpful to classify the other formats as "more sophisticated" or to say that they allow "far greater control", or to call the example "elaborate". What's important is what you can do, and there's basically one thing: selective restores. So I think you could replace all that, the following example, and the paragraph after that with: "The above example creates a plain text file. pg_dump can also create backups in other formats, such as the custom format. This may be advantageous, since all formats other than the plain text file format make it possible to selectively restore objects from the dump. See <xref linkend="app-pgrestore"> for more details."

+   <para>
+    Unfortunately, <application>pg_dumpall</> can only create plaintext
+    backups. However, it is currently the only way to backup the globals in you

Similarly, let's not label pg_dumpall's behavior as unfortunate. I think it's appropriate to explain the pg_dumpall -g / pg_dump -Fc combination somewhere in the documentation, as it is widely used and very useful. But I don't think it's useful to imply to users that the tools are inadequate; generally, they're not, improvements that we'd like to make notwithstanding.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index ccb76d8..54681a0 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -47,8 +47,13 @@ pg_dump <replaceable class="parameter">dbname</replaceable> &gt; <replaceable cl
 that you can perform this backup procedure from any remote host that has
 access to the database. But remember that <application>pg_dump</> does not
 operate with special permissions. In particular, it must
-have read access to all tables that
Re: [HACKERS] Extra functionality to createuser
On Thu, Sep 26, 2013 at 1:04 PM, Christopher Browne cbbro...@gmail.com wrote:
> Sitting on my todo list for a while has been to consider the idea of adding a bit of additional functionality to createuser.
>
> One of the functions of CREATE ROLE is to associate the role with other roles, thus:
>
>     create role my_new_user nosuperuser nocreatedb login IN ROLE app_readonly_role, app2_writer_role;
>
> That isn't something that I can do using createuser; to do that, I would need to submit two requests separately:
>
>     PGUSER=postgres createuser -D -S -l my_new_user
>     PGUSER=postgres psql -c "grant app_readonly_role, app2_writer_role to my_new_user;"
>
> I could certainly change over to using psql to do all the work, but it would be rather nice if createuser had (say) a -g option which allowed specifying the set of roles that should be assigned. Thus, the above commands might be replaced by:
>
>     PGUSER=postgres createuser -D -S -l -g app_readonly_role,app2_writer_role my_new_user
>
> Would this be worth adding to the ToDo list?

I'd be inclined to favor a patch implementing this.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
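[The proposed -g option would effectively just fold the grant into the generated statement. A toy sketch of the statement createuser might emit - identifier quoting omitted; not actual createuser.c code:]

    #include <stdio.h>

    /* Print the CREATE ROLE statement that a hypothetical
     * "createuser -D -S -l -g <roles> <name>" could send. */
    static void
    print_create_role(const char *name, const char *roles)
    {
        printf("CREATE ROLE %s NOSUPERUSER NOCREATEDB LOGIN", name);
        if (roles != NULL && *roles != '\0')
            printf(" IN ROLE %s", roles);
        printf(";\n");
    }

    int
    main(void)
    {
        print_create_role("my_new_user",
                          "app_readonly_role,app2_writer_role");
        return 0;
    }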
Re: [HACKERS] Re: Request for Patch Feedback: Lag Lead Window Functions Can Ignore Nulls
On Thu, Sep 26, 2013 at 4:20 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
> But how do we ensure that the BMS is allocated in a context? You'd have to switch contexts each time you call bms_add_member. I don't have a good answer to this.

The coding of bms_add_member is pretty funky. Why doesn't it just repalloc() the input argument if it's not big enough? If it did that, the new allocation would be in the same memory context as the original one, and we'd not need to care about the memory context when adding an element to an already non-empty set.

But even if we did decide to switch memory contexts on every call, it would still be much cleaner than this. It's only three lines of code (and about as many instructions) to save and restore CurrentMemoryContext.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
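[The "three lines of code" Robert mentions are the standard PostgreSQL idiom for pinning an allocation to a chosen context - a sketch, with the target context left up to the caller:]

    #include "postgres.h"
    #include "nodes/bitmapset.h"
    #include "utils/memutils.h"

    static Bitmapset *
    add_member_in_context(Bitmapset *bms, int member, MemoryContext cxt)
    {
        MemoryContext oldcontext = MemoryContextSwitchTo(cxt);

        /* any allocation bms_add_member performs now lands in cxt */
        bms = bms_add_member(bms, member);
        MemoryContextSwitchTo(oldcontext);
        return bms;
    }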
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Thu, Sep 26, 2013 at 11:58 PM, Peter Geoghegan p...@heroku.com wrote:
> On Thu, Sep 26, 2013 at 12:15 PM, Robert Haas robertmh...@gmail.com wrote:
>> Well, I think we can rule out value locks that are held for the duration of a transaction right away. That's just not going to fly.
>
> I think I agree with that. I don't think I remember hearing that proposed. I think I might have been unclear - I mean locks that are held for the duration of *another* transaction, not our own, as we wait for that other transaction to commit/abort. I think that earlier remarks from yourself and Andres implied that this would be necessary. Perhaps I'm mistaken. Your most recent design proposal doesn't do this, but I think that that's only because it restricts the user to a single unique index - it would otherwise be necessary to sit on the earlier value locks (index tuples belonging to an unfinished transaction) pending the completion of some other conflicting transaction, which has numerous disadvantages (as described in my "it suits my purposes to have the value locks be held for only an instant" mail to you [1]).

OK, now I understand what you are saying. I don't think I agree with it.

> I don't have another idea either. In fact, I'd go so far as to say that doing any third thing that's better than those two to any reasonable person is obviously impossible. But I'd add that we simply cannot roll back at read committed, so we're just going to have to hold our collective noses and do strange things with visibility.

I don't accept that as a general principle. We're writing the code; we can make it behave any way we think best.

>> This is something that I haven't given remotely enough thought yet, so please take it with a big grain of salt. I doubt that any change to HeapTupleSatisfiesMVCC() will be acceptable. This feature needs to restrain itself to behavior changes that only affect users of this feature, I think.
>
> There is certainly value in considering that, and you're right to take that tack - it is generally valuable to have a patch be minimally invasive. However, ultimately that's just one aspect of any given design, an aspect that needs to be weighed against others where there is a tension. Obviously in this instance I believe, rightly or wrongly, that doing more - adding more infrastructure than might be considered strictly necessary - is the least worst thing. Also, sometimes the apparent similarity of a design to what we have today is illusory - certainly, I think you'd at least agree that the problems that bloating during the interim between value locking and row locking present are qualitatively different to other problems that bloat presents in all existing scenarios.

TBH, no, I don't think I agree with that. See further below.

> Let me try and explain myself better, with reference to a concrete example. Suppose we have a table with a primary key column, A, and a unique constraint column, B, and we lock the pk value first and the unique constraint value second. I'm assuming your design, but allowing for multiple unique indexes, because I don't think doing anything less will be accepted - promise tuples have some of the same problems, as well as some other idiosyncratic ones (see my earlier remarks on recovery/freezing [2] for examples of those).

OK, so far I'm right with you.

> So there is a fairly high probability that the pk value on A will be unique, and a fairly low probability that the unique constraint value on B will be unique, at least in this usage pattern of interest, where the user is mostly going to end up updating. Mostly, we insert a speculative regular index tuple (that points to a speculative heap tuple that we might decide to kill) into the pk column, A, right away, and then maybe block pending the resolution of a conflicting transaction on the unique constraint column B. I don't think we have any reasonable way of not blocking on A - if we go clean it up for the wait, that's going to bloat quite dramatically, *and* we have to WAL-log. In any case you seemed to accept that cleaning up bloat synchronously like that was just going to be too expensive. So I suppose that rules that out. That just leaves sitting on the value lock (that the pk index tuple already inserted effectively is) indefinitely, pending the outcome of the first transaction.

Agreed.

> What are the consequences of sitting on that value lock indefinitely? Well, xacts are going to block on the pk value much more frequently, by simple virtue of the fact that the value locks there are held for a long time - they just needed to hear a "no" answer, which the unique constraint was in most cases happy to immediately give, so this is totally unnecessary. Contention is now in a certain sense almost as bad for every unique index as it is for the weakest link. That's only where the problems begin, though, and it isn't necessary for there to be bad contention on more than one unique
Re: [HACKERS] Wait free LW_SHARED acquisition
On Sep 27, 2013, at 00:55, Andres Freund and...@2ndquadrant.com wrote:
> So the goal is to have LWLockAcquire(LW_SHARED) never block unless somebody else holds an exclusive lock. To produce enough appetite for reading the rest of the long mail, here's some numbers on a pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620:
>
> master + padding: tps = 146904.451764
> master + padding + lwlock: tps = 590445.927065

That's roughly 400%. Interesting. I played with pretty much the same idea two years or so ago. At the time, I compared a few different LWLock implementations. Those were, AFAIR:

A) Vanilla LWLocks
B) A + an atomic-increment fast path, very similar to your proposal
C) B but with a partitioned atomic-increment counter to further reduce cache-line contention
D) A with the spinlock-based queue replaced by a lockless queue

At the time, the improvements seemed to be negligible - they looked great in synthetic benchmarks of just the locking code, but didn't translate to improved TPS numbers. Though I think the only version that ever got tested on more than a handful of cores was C…

My (rather hacked together) benchmarking code can be found here: https://github.com/fgp/lockbench. The different LWLock implementations live in the various pg_lwlock_* subfolders. Here's a pointer into the relevant thread: http://www.postgresql.org/message-id/651002c1-2ec1-4731-9b29-99217cb36...@phlo.org

best regards, Florian Pflug
Re: [HACKERS] backup.sgml-cmd-v003.patch
Hi Robert,

On 09/27/2013 05:56:52 AM, Robert Haas wrote:
> 1. Attempting to encourage people to consider custom format dumps.
> What's important is what you can do...

Your critique seems obvious in retrospect. Sorry you had to step in here and do my job. The above point is particularly salient. I will try to keep it in mind in the future.

Karl k...@meme.com

Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein
Re: [HACKERS] logical changeset generation v6.1
On 27 September 2013 16:14, Andres Freund and...@2ndquadrant.com wrote:
> Hi,
>
> Attached you can find an updated version of the series taking in some of the review comments (the others are queued, not ignored), including:
>
> * split of things from the big "Introduce wal decoding via ..." patch
> * fix for the bug Steve noticed where CreateLogicalDecodingContext was passed the wrong is_init = false where it should have been true
> * a number of smaller bugs I noticed while reviewing
> * renaming of some variables, including guaranteedlyLogged ;)
> * comment improvements in decode.c
> * rename of pg_receivellog to pg_recvlogical
>
> I'll work more on the other points in the next days; so far they are clear of other big stuff.
>
> 0001 wal_decoding: Allow walsender's to connect to a specific database - as before
> 0002 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints are done - as before
> 0003 wal_decoding: Add information about a tables primary key to struct RelationData - as before
> 0004 wal_decoding: Add wal_level = logical and log data required for logical decoding - split-off patch that contains the wal format changes, including the addition of a new wal_level option
> 0005 wal_decoding: Add option to treat additional tables as catalog tables - option to treat a user-defined table as a catalog table, which means it can be accessed during logical decoding from an output plugin
> 0006 wal_decoding: Introduce wal decoding via catalog timetravel - the guts of changeset extraction, without a user interface
> 0007 wal_decoding: logical changeset extraction walsender interface - split-off patch containing the walsender changes, which allow receiving the changeset data in a streaming fashion, supporting sync rep and other fancy things
> 0008 wal_decoding: Only peg the xmin horizon for catalog tables during logical decoding - split-off optimization which reduces the pain 0006 introduces by pegging the xmin horizon to the smallest of the logical decoding slots. Now it's pegged differently for data tables than for catalog tables
> 0009 wal_decoding: test_decoding: Add a simple decoding module in contrib - example output plugin which is also used in tests
> 0010 wal_decoding: pg_recvlogical: Introduce pg_receivexlog equivalent for logical changes - renamed client for the walsender interface
> 0011 wal_decoding: test_logical_decoding: Add extension for easier testing of logical decoding - SQL SRF to get data from a decoding slot, also used as a vehicle for tests
> 0012 wal_decoding: design document v2.4 and snapshot building design doc v0.5

I'm encountering a make error:

install pg_basebackup '/home/thom/Development/psql/bin/pg_basebackup'
install pg_receivexlog '/home/thom/Development/psql/bin/pg_receivexlog'
install pg_recvlogical(X) '/home/thom/Development/psql/bin/pg_receivellog'
/bin/dash: 1: Syntax error: "(" unexpected
make[3]: *** [install] Error 2
make[3]: Leaving directory `/home/thom/Development/postgresql/src/bin/pg_basebackup'
make[2]: *** [install-pg_basebackup-recurse] Error 2

Thom
Re: [HACKERS] logical changeset generation v6.1
On 2013-09-27 16:35:53 +0100, Thom Brown wrote:
> I'm encountering a make error:
> [...]
> /bin/dash: 1: Syntax error: "(" unexpected

Gah. Last-minute changes. Always the same... Updated patch attached.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

0010-wal_decoding-pg_recvlogical-Introduce-pg_receivexlog.patch.gz
[HACKERS] hstore extension version screwup
When adding json support to hstore, I made a major blunder and added the new functionality to the existing sql script instead of bumping the version, renaming the script and adding an update script. This was lazy and there's no real excuse, although I will note that it was a mistake far too easy to make. Perhaps as a warning indicator we should remove write permissions from these files. Anyway, I have had some discussions with Dimitri, and the best idea seems to be that we should do all the above, but in the update script use conditional logic that only adds the functions if they aren't already there and dependent on the extension. In the release notes we should advise anyone who has loaded hstore to run ALTER EXTENSION hstore UPDATE TO '1.2'; The minor downside of this is that the upgrade script will depend on plpgsql being available. We'll need to note that too, although I don't recall the last time I came across a site that didn't have it loaded. See attached for details of what's proposed. cheers andrew

diff --git a/contrib/hstore/hstore--1.1--1.2.sql b/contrib/hstore/hstore--1.1--1.2.sql
new file mode 100644
index 0000000..99b8a16
--- /dev/null
+++ b/contrib/hstore/hstore--1.1--1.2.sql
@@ -0,0 +1,47 @@
+/* contrib/hstore/hstore--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION hstore UPDATE TO '1.2'" to load this file. \quit
+
+
+-- A version of 1.1 was shipped with these objects mistakenly in 9.3.0.
+-- Therefore we only add them if we detect that they aren't already there and
+-- dependent on the extension.
+
+DO LANGUAGE plpgsql
+
+$$
+
+BEGIN
+
+    PERFORM 1
+    FROM pg_proc p
+    JOIN pg_depend d
+      ON p.proname = 'hstore_to_json_loose'
+     AND d.objid = p.oid
+     AND d.refclassid = 'pg_extension'::regclass
+    JOIN pg_extension x
+      ON d.refobjid = x.oid
+     AND x.extname = 'hstore';
+
+    IF NOT FOUND
+    THEN
+
+        CREATE FUNCTION hstore_to_json(hstore)
+        RETURNS json
+        AS 'MODULE_PATHNAME', 'hstore_to_json'
+        LANGUAGE C IMMUTABLE STRICT;
+
+        CREATE CAST (hstore AS json)
+          WITH FUNCTION hstore_to_json(hstore);
+
+        CREATE FUNCTION hstore_to_json_loose(hstore)
+        RETURNS json
+        AS 'MODULE_PATHNAME', 'hstore_to_json_loose'
+        LANGUAGE C IMMUTABLE STRICT;
+
+    END IF;
+
+END;
+
+$$;
diff --git a/contrib/hstore/hstore--1.1.sql b/contrib/hstore/hstore--1.2.sql
similarity index 100%
rename from contrib/hstore/hstore--1.1.sql
rename to contrib/hstore/hstore--1.2.sql
diff --git a/contrib/hstore/hstore.control b/contrib/hstore/hstore.control
index 4104e17..9daf5e2 100644
--- a/contrib/hstore/hstore.control
+++ b/contrib/hstore/hstore.control
@@ -1,5 +1,5 @@
 # hstore extension
 comment = 'data type for storing sets of (key, value) pairs'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/hstore'
 relocatable = true

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
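For a site that already has hstore 1.1 installed, the advertised upgrade path is a single statement; the conditional DO block above makes it safe whether or not the 9.3.0 objects are already present:

    -- Run in each database that has the extension installed.
    ALTER EXTENSION hstore UPDATE TO '1.2';
    -- Verify the installed version:
    SELECT extversion FROM pg_extension WHERE extname = 'hstore';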
Re: [HACKERS] backup.sgml-cmd-v003.patch
On Fri, Sep 27, 2013 at 9:49 AM, Karl O. Pinc k...@meme.com wrote: Hi Robert, On 09/27/2013 05:56:52 AM, Robert Haas wrote: 1. Attempting to encourage people to consider custom format dumps. What's important is what you can do... Your critique seems obvious in retrospect. Sorry you had to step in here and do my job. The above point is particularly salient. I will try to keep it in mind in the future. Hey, your reviewing is much appreciated. I actually didn't notice your reply before I sent mine; I started writing it yesterday afternoon and finished this morning, so I wasn't intending to preempt you in any way. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 9/26/13 2:46 PM, Robert Haas wrote: On Thu, Sep 26, 2013 at 2:58 PM, Jim Nasby j...@nasby.net wrote: Why would we add additional code complexity when forks do the trick? That seems like a step backwards, not forward. Well, they sorta do the trick, but see e.g. commit ece01aae479227d9836294b287d872c5a6146a11. I doubt that's the only code that's poorly optimized for multiple forks; IOW, every time someone adds a new fork, there's a system-wide cost to that, even if that fork is only used in a tiny percentage of the relations that exist in the system. Yeah, we obviously kept things simpler when adding forks in order to get the feature out the door. There are improvements that need to be made. But IMHO that's not a reason to automatically avoid forks; we need to consider the cost of improving them vs what we gain by using them. Of course there's always some added cost, so we shouldn't just blindly use them all over the place without considering the fork cost either... It's tempting to think that we can use the fork mechanism every time we have multiple logical streams of blocks within a relation and don't want to figure out a way to multiplex them onto the same physical file. However, the reality is that the fork mechanism isn't up to the job. I certainly don't want to imply that we shouldn't have gone in that direction - both the fsm and the vm are huge steps forward, and we wouldn't have gotten them in 8.4 without that mechanism. But they haven't been entirely without their own pain, too, and that pain is going to grow the more we push in the direction of relying on forks. Agreed. Honestly, I think we actually need more obfuscation between what happens on the filesystem and the rest of postgres... we're starting to look at areas where that would help. For example, the recent idea of being able to truncate individual relation files and not being limited to only truncating the end of the relation. My concern in that case is that 1GB is a pretty arbitrary size that we happened to pick, so if we're going to go for more efficiency in storage we probably shouldn't just blindly stick with 1G (though of course the initial implementation might do that to reduce complexity, but we'd better still consider where we're headed). If the only complaint about forks is directory traversal, why wouldn't we go with the well-established practice of using multiple directories instead of glomming everything into one place? That's not the only complaint about forks - but I support what you're proposing there anyhow, because it will be helpful to users with lots of relations regardless of what we do or do not decide to do about forks. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extra functionality to createuser
On 9/27/13 6:01 AM, Robert Haas wrote: Thus, the above commands might be replaced by: PGUSER=postgres createuser -D -S -l -g app_readonly_role,app2_writer_role my_new_user Would this be worth adding to the ToDo list? I'd be inclined to favor a patch implementing this. +1 -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Fri, Sep 27, 2013 at 7:22 PM, Jim Nasby j...@nasby.net wrote: Yeah, we obviously kept things simpler when adding forks in order to get the feature out the door. There are improvements that need to be made. But IMHO that's not a reason to automatically avoid forks; we need to consider the cost of improving them vs what we gain by using them. I think this gives short shrift to the decision to introduce forks. If you go back to the discussion at the time, it was a topic of debate, and the argument which won the day is that interleaving different streams of data in one storage system is exactly what the file system is designed to do and we would just be reinventing the wheel if we tried to do it ourselves. I think that makes a lot of sense for things like the fsm or vm which grow indefinitely and are maintained by a different piece of code from the main heap. The tradeoff might be somewhat different for the pieces of a data structure like a bitmap index or gin index where the code responsible for maintaining it is all the same. Honestly, I think we actually need more obfuscation between what happens on the filesystem and the rest of postgres... we're starting to look at areas where that would help. For example, the recent idea of being able to truncate individual relation files and not being limited to only truncating the end of the relation. My concern in that case is that 1GB is a pretty arbitrary size that we happened to pick, so if we're going to go for more efficiency in storage we probably shouldn't just blindly stick with 1G (though of course the initial implementation might do that to reduce complexity, but we'd better still consider where we're headed). The ultimate goal here would be to get the filesystem to issue a TRIM call so an SSD storage system can reuse the underlying blocks. Truncating 1GB files might be a convenient way to do it, especially if we have some new kind of vacuum full that can pack tuples within each 1GB file. But there may be easier ways to achieve the same thing. If we can notify the filesystem that we're not using some of the blocks in the middle of the file we might be able to just leave things where they are and have holes in the files. Or we might be better off not depending on truncate and just look for ways to mark entire 1GB files as deprecated and move tuples out of them until we can just remove that whole file. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
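The "holes in the files" variant is directly supported on Linux: fallocate() with FALLOC_FL_PUNCH_HOLE deallocates a range in the middle of a file without changing its apparent size, and on SSD-backed filesystems the freed extents become eligible for TRIM. A minimal sketch (Linux-specific; error handling omitted and the wrapper name is made up for illustration):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>

    /* Deallocate one block in the middle of a segment file.
     * FALLOC_FL_KEEP_SIZE preserves the file's apparent length, so
     * block numbers computed from the file size remain valid. */
    static int
    punch_block(int fd, off_t blockno, off_t blocksize)
    {
        return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         blockno * blocksize, blocksize);
    }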
Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.
On Fri, Sep 27, 2013 at 7:56 AM, Merlin Moncure mmonc...@gmail.com wrote: On Thu, Sep 26, 2013 at 10:14 PM, Merlin Moncure mmonc...@gmail.com wrote: On Thu, Sep 26, 2013 at 6:08 PM, Andres Freund and...@2ndquadrant.com wrote: On 2013-08-27 12:17:55 -0500, Merlin Moncure wrote: On Tue, Aug 27, 2013 at 10:55 AM, Andres Freund and...@2ndquadrant.com wrote: On 2013-08-27 09:57:38 -0500, Merlin Moncure wrote:

+ bool
+ RecoveryMightBeInProgress(void)
+ {
+     /*
+      * We check shared state each time only until we leave recovery mode. We
+      * can't re-enter recovery, so there's no need to keep checking after the
+      * shared variable has once been seen false.
+      */
+     if (!LocalRecoveryInProgress)
+         return false;
+     else
+     {
+         /* use volatile pointer to prevent code rearrangement */
+         volatile XLogCtlData *xlogctl = XLogCtl;
+
+         /* Intentionally query xlogctl without spinlocking! */
+         LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+
+         return LocalRecoveryInProgress;
+     }
+ }

I don't think it's acceptable to *set* LocalRecoveryInProgress here. That should only be done in the normal routine. quite right -- that was a major error -- you could bypass the initialization call to the xlog with some bad luck. I've seen this in profiles since, so I'd appreciate pushing this forward. roger that -- will push ahead when I get into the office... attached is new version fixing some comment typos. Attached is a simplified patch that replaces the spinlock with a read barrier, based on a suggestion made by Andres offlist. The approach has different performance characteristics -- a barrier call is being issued instead of a non-blocking read. I don't have a performance test case in hand to prove that's better, so I'm going with Andres's approach because it's simpler. Aside: can this truly be the only caller of pg_read_barrier()? Also, moving to -hackers from -performance. merlin recovery5.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 9/27/13 1:43 PM, Greg Stark wrote: Honestly, I think we actually need more obfuscation between what happens on the filesystem and the rest of postgres... we're starting to look at areas where that would help. For example, the recent idea of being able to truncate individual relation files and not being limited to only truncating the end of the relation. My concern in that case is that 1GB is a pretty arbitrary size that we happened to pick, so if we're going to go for more efficiency in storage we probably shouldn't just blindly stick with 1G (though of course the initial implementation might do that to reduce complexity, but we'd better still consider where we're headed). The ultimate goal here would be to get the filesystem to issue a TRIM call so an SSD storage system can reuse the underlying blocks. Truncating 1GB files might be a convenient way to do it, especially if we have some new kind of vacuum full that can pack tuples within each 1GB file. But there may be easier ways to achieve the same thing. If we can notify the filesystem that we're not using some of the blocks in the middle of the file we might be able to just leave things where they are and have holes in the files. Or we might be better off not depending on truncate and just look for ways to mark entire 1GB files as deprecated and move tuples out of them until we can just remove that whole file. Yeah, there's a ton of different things we might do. And dealing with free space is just one example... things like the VM give us the ability to detect areas of the heap that have gone dormant; imagine if we could seamlessly move that data to its own storage, possibly compressing it at the same time. (Yes, I realize there's partitioning and tablespaces and compressing filesystems, but those are a lot more work and will never be as efficient as what the database itself can do). Anyway, I think we're all on the same page. We should stop hijacking Alvaro's thread... ;) -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Request for Patch Feedback: Lag Lead Window Functions Can Ignore Nulls
On 27.09.2013 14:12, Robert Haas wrote: On Thu, Sep 26, 2013 at 4:20 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: But how do we ensure that the BMS is allocated in a context? You'd have to switch contexts each time you call bms_add_member. I don't have a good answer to this. The coding of bms_add_member is pretty funky. Why doesn't it just repalloc() the input argument if it's not big enough? If it did that, the new allocation would be in the same memory context as the original one, and we'd not need to care about the memory context when adding an element to an already non-empty set. But even if we did decide to switch memory contexts on every call, it would still be much cleaner than this. It's only three lines of code (and about as many instructions) to save and restore CurrentMemoryContext. [looks] Huh, yeah, as it stands, bms_add_member() is an accident waiting to happen. It's totally non-obvious that it might cause the BMS to jump from another memory context to CurrentMemoryContext. Same with bms_add_members(). And it would be good to Assert in bms_join that both arguments are in the same MemoryContext. Let's fix that... - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
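A sketch of the repalloc() version being suggested, close to bitmapset.c's actual structure (WORDNUM, BITNUM, BITMAPSET_SIZE and bms_make_singleton are the existing symbols there; this is what the suggestion amounts to, not the committed code):

    /* In src/backend/nodes/bitmapset.c */
    Bitmapset *
    bms_add_member(Bitmapset *a, int x)
    {
        int     wordnum = WORDNUM(x);
        int     bitnum = BITNUM(x);

        if (a == NULL)
            return bms_make_singleton(x);   /* allocates in CurrentMemoryContext */

        if (wordnum >= a->nwords)
        {
            /* repalloc() enlarges the set in its original memory context */
            int     oldnwords = a->nwords;
            int     i;

            a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(wordnum + 1));
            a->nwords = wordnum + 1;
            for (i = oldnwords; i < a->nwords; i++)
                a->words[i] = 0;            /* zero the newly added words */
        }
        a->words[wordnum] |= ((bitmapword) 1 << bitnum);
        return a;
    }

Note the empty-set case still has to allocate somewhere, so a NULL input keeps the CurrentMemoryContext dependency; the repalloc() only removes the surprise for already non-empty sets.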
Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.
On 27.09.2013 22:00, Merlin Moncure wrote: On Fri, Sep 27, 2013 at 7:56 AM, Merlin Moncure mmonc...@gmail.com wrote: On Thu, Sep 26, 2013 at 10:14 PM, Merlin Moncure mmonc...@gmail.com wrote: On Thu, Sep 26, 2013 at 6:08 PM, Andres Freund and...@2ndquadrant.com wrote: On 2013-08-27 12:17:55 -0500, Merlin Moncure wrote: On Tue, Aug 27, 2013 at 10:55 AM, Andres Freund and...@2ndquadrant.com wrote: On 2013-08-27 09:57:38 -0500, Merlin Moncure wrote:

+ bool
+ RecoveryMightBeInProgress(void)
+ {
+     /*
+      * We check shared state each time only until we leave recovery mode. We
+      * can't re-enter recovery, so there's no need to keep checking after the
+      * shared variable has once been seen false.
+      */
+     if (!LocalRecoveryInProgress)
+         return false;
+     else
+     {
+         /* use volatile pointer to prevent code rearrangement */
+         volatile XLogCtlData *xlogctl = XLogCtl;
+
+         /* Intentionally query xlogctl without spinlocking! */
+         LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+
+         return LocalRecoveryInProgress;
+     }
+ }

I don't think it's acceptable to *set* LocalRecoveryInProgress here. That should only be done in the normal routine. quite right -- that was a major error -- you could bypass the initialization call to the xlog with some bad luck. I've seen this in profiles since, so I'd appreciate pushing this forward. roger that -- will push ahead when I get into the office... attached is new version fixing some comment typos. Attached is a simplified patch that replaces the spinlock with a read barrier, based on a suggestion made by Andres offlist. The approach has different performance characteristics -- a barrier call is being issued instead of a non-blocking read. I don't have a performance test case in hand to prove that's better, so I'm going with Andres's approach because it's simpler. Can you think of a concrete example of what might go wrong if we simply removed the spinlock, and did an unprotected read through the volatile pointer? Is there some particular call sequence that might get optimized in an unfortunate way? I'm not suggesting that we do that - better safe than sorry even if we can't think of any particular scenario - but it would help me understand this. I don't think a read-barrier is enough here. See README.barrier: When a memory barrier is being used to separate two loads, use pg_read_barrier(); when it is separating two stores, use pg_write_barrier(); when it is separating a load and a store (in either order), use pg_memory_barrier(). The RecoveryInProgress() function does just one load, to read the variable, and wouldn't even need a barrier by itself. The other load or store that needs to be protected by the barrier happens in the caller, before or after the function, and we can't say for sure if it's a load or a store. So, let's use pg_memory_barrier(). Aside: can this truly be the only caller of pg_read_barrier()? Yep. I only added the first caller of the barriers altogether in the xlog-insertion scaling patch. Robert wrote the infrastructure in 9.3, but it wasn't used until now, in 9.4. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
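Put together, the lock-free read under discussion would look something like this: a sketch only, with pg_memory_barrier() per the reasoning above; the real function also calls InitXLOGAccess() on first seeing false, which is elided here, and the attached patch may differ in detail:

    bool
    RecoveryInProgress(void)
    {
        /*
         * Shared state is only checked until it has once been seen false;
         * we can never re-enter recovery after leaving it.
         */
        if (!LocalRecoveryInProgress)
            return false;
        else
        {
            /* use volatile pointer to prevent code rearrangement */
            volatile XLogCtlData *xlogctl = XLogCtl;

            LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;

            /*
             * Pair with the startup process's spinlock release: keep the
             * caller's subsequent loads and stores from being reordered
             * ahead of this unlocked read.
             */
            if (!LocalRecoveryInProgress)
                pg_memory_barrier();

            return LocalRecoveryInProgress;
        }
    }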
Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.
On Fri, Sep 27, 2013 at 1:28 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Yep. I only added the first caller of the barriers altogether in the xlog-insertion scaling patch. Robert wrote the infrastructure in 9.3, but it wasn't used until now, in 9.4. FWIW, it was actually during 9.2 development that Robert first added the barriers. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v6
On 09/26/2013 02:47 PM, Steve Singer wrote: I've determined that in this test the walsender seems to be hitting this when it is decoding the transactions that are behind the slonik commands to add tables to replication (set add table, set add sequence). This is before the SUBSCRIBE SET is submitted. I've also noticed something else that is strange (but might be unrelated). If I stop my slon process and restart it I get messages like:
WARNING: Starting logical replication from 0/a9321360
ERROR: cannot stream from 0/A9321360, minimum is 0/A9320B00
Where 0/A9321360 was sent in the last packet my slon received from the walsender before the restart. If I force it to restart replication from 0/A9320B00 I see data rows that I appear to have already seen before the restart. I think this happens when my slon had processed the data for 0/A9320B00 but was killed before it sent the feedback message. Is this expected? I've further narrowed this down to something in (or some combination of) what the _disorder_replica.altertableaddTriggers(1) stored function does (i.e. @SLONYNAMESPACE@.altertableaddTriggers(int)). Which is essentially:
* Get an exclusive lock on sl_config_lock
* Get an exclusive lock on the user table in question
* create a trigger (the deny access trigger)
* create a truncate trigger
* create a deny truncate trigger
I am not yet able to replicate the error by issuing the same SQL commands from psql, but I must be missing something. I can replicate this when just using the test_decoding plugin.
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
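For anyone else trying to reproduce this from psql, the sequence described boils down to something like the following. The table and trigger function names here are stand-ins (real Slony does additional bookkeeping), so treat this as a starting point rather than an exact recipe:

    BEGIN;
    -- the locks altertableaddTriggers() takes first
    LOCK TABLE _disorder_replica.sl_config_lock IN EXCLUSIVE MODE;
    LOCK TABLE public.disorder_table IN EXCLUSIVE MODE;

    -- deny-access trigger on the replicated table (hypothetical names)
    CREATE TRIGGER disorder_denyaccess
        BEFORE INSERT OR UPDATE OR DELETE ON public.disorder_table
        FOR EACH ROW EXECUTE PROCEDURE _disorder_replica.denyaccess('_disorder_replica');

    -- truncate trigger and deny-truncate trigger
    CREATE TRIGGER disorder_truncatetrigger
        BEFORE TRUNCATE ON public.disorder_table
        FOR EACH STATEMENT EXECUTE PROCEDURE _disorder_replica.log_truncate(1);
    CREATE TRIGGER disorder_truncatedeny
        BEFORE TRUNCATE ON public.disorder_table
        FOR EACH STATEMENT EXECUTE PROCEDURE _disorder_replica.deny_truncate();
    COMMIT;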
Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.
On 2013-09-27 23:28:37 +0300, Heikki Linnakangas wrote: On 27.09.2013 22:00, Merlin Moncure wrote: Attached is a simplified patch that replaces the spinlock with a read barrier, based on a suggestion made by Andres offlist. The approach has different performance characteristics -- a barrier call is being issued instead of a non-blocking read. I don't have a performance test case in hand to prove that's better, so I'm going with Andres's approach because it's simpler. Can you think of a concrete example of what might go wrong if we simply removed the spinlock, and did an unprotected read through the volatile pointer? Is there some particular call sequence that might get optimized in an unfortunate way? I'm not suggesting that we do that - better safe than sorry even if we can't think of any particular scenario - but it would help me understand this. I think on some architectures the write from the other side might just not immediately be visible, so RecoveryInProgress() might return a spurious true even though recovery has already finished. I haven't checked whether there are any callers that would have a problem with that. I don't think a read-barrier is enough here. See README.barrier: When a memory barrier is being used to separate two loads, use pg_read_barrier(); when it is separating two stores, use pg_write_barrier(); when it is separating a load and a store (in either order), use pg_memory_barrier(). I think the documentation is just worded in an easy-to-misunderstand way. If you look at the following example, we're fine. The two stores, two loads it is talking about are what's happening in one thread of execution. The important part of a read barrier is that it forces a fresh LOAD, separated from earlier LOADs to the same address. The RecoveryInProgress() function does just one load, to read the variable, and wouldn't even need a barrier by itself. The other load or store that needs to be protected by the barrier happens in the caller, before or after the function, and we can't say for sure if it's a load or a store. So, let's use pg_memory_barrier(). The caller uses a spinlock, so it's guaranteed to write out before the spinlock is released. A write barrier (the spinlock in the startup process) should always be paired with a read barrier. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v6
Hi Steve, On 2013-09-27 17:06:59 -0400, Steve Singer wrote: I've determined that in this test the walsender seems to be hitting this when it is decoding the transactions that are behind the slonik commands to add tables to replication (set add table, set add sequence). This is before the SUBSCRIBE SET is submitted. I've also noticed something else that is strange (but might be unrelated). If I stop my slon process and restart it I get messages like:
WARNING: Starting logical replication from 0/a9321360
ERROR: cannot stream from 0/A9321360, minimum is 0/A9320B00
Where 0/A9321360 was sent in the last packet my slon received from the walsender before the restart. Uh, that looks like I fumbled some comparison. Let me check. I've further narrowed this down to something in (or some combination of) what the _disorder_replica.altertableaddTriggers(1) stored function does (i.e. @SLONYNAMESPACE@.altertableaddTriggers(int)). Which is essentially:
* Get an exclusive lock on sl_config_lock
* Get an exclusive lock on the user table in question
* create a trigger (the deny access trigger)
* create a truncate trigger
* create a deny truncate trigger
I am not yet able to replicate the error by issuing the same SQL commands from psql, but I must be missing something. I can replicate this when just using the test_decoding plugin. Thanks. That should get me started with debugging. Unless it's possibly fixed in the latest version; one bug fixed there might cause something like this if the moon stands exactly right? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wait free LW_SHARED acquisition
On 2013-09-27 14:46:50 +0200, Florian Pflug wrote: On Sep27, 2013, at 00:55 , Andres Freund and...@2ndquadrant.com wrote: So the goal is to have LWLockAcquire(LW_SHARED) never block unless somebody else holds an exclusive lock. To produce enough appetite for reading the rest of the long mail, here's some numbers on a pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620:
master + padding: tps = 146904.451764
master + padding + lwlock: tps = 590445.927065
That's roughly 400%. Interesting. I played with pretty much the same idea two years or so ago. At the time, I compared a few different LWLock implementations. Those were AFAIR:
A) Vanilla LWLocks
B) A + an atomic-increment fast path, very similar to your proposal
C) B but with a partitioned atomic-increment counter to further reduce cache-line contention
D) A with the spinlock-based queue replaced by a lockless queue
At the time, the improvements seemed to be negligible - they looked great in synthetic benchmarks of just the locking code, but didn't translate to improved TPS numbers. Though I think the only version that ever got tested on more than a handful of cores was C… I think you really need multi-socket systems to see the big benefits from this. My laptop barely shows any improvements, while my older 2-socket workstation already shows some in workloads that have more contention than pgbench -S. From a quick look, you didn't have any sleeping queueing in at least one of the variants in there? In my tests, that was tremendously important to improve scaling if there was any contention. Which is not surprising in the end, because otherwise you essentially have rw-spinlocks, which really aren't suitable for many of the lwlocks we use. Getting the queueing semantics, including releaseOK, right was what took me a good amount of time; the atomic ops part was pretty quick... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
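The core of the LW_SHARED fast path being discussed, boiled down to a sketch: the real patch's state layout, queueing and releaseOK handling are far more involved, pg_atomic_fetch_add_u32() stands in for whatever atomic primitive the patch actually uses, and the bit assignment here is assumed:

    #define LW_VAL_EXCLUSIVE   ((uint32) 1 << 24)   /* assumed bit layout */

    /* Try to take a shared lock without ever touching the spinlock. */
    static bool
    LWLockAttemptSharedFast(LWLock *lock)
    {
        uint32  oldstate = pg_atomic_fetch_add_u32(&lock->state, 1);

        if (oldstate & LW_VAL_EXCLUSIVE)
        {
            /* An exclusive holder beat us: undo, then queue and sleep. */
            pg_atomic_fetch_sub_u32(&lock->state, 1);
            return false;
        }
        return true;    /* shared count bumped; lock acquired */
    }

The point about sleeping matters here: when the fast path fails, the slow path must enqueue and sleep rather than retry in a loop, otherwise this degenerates into an rw-spinlock under contention, which is exactly what the queueing semantics avoid.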
Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.
Hi, What confuses me is that pg_read_barrier() is just a compiler barrier on x86[-64] in barrier.h. According to my knowledge it needs to be an lfence or the full barrier? The linked papers from Paul McKenney - which are a great read - seem to agree? On 2013-09-27 23:12:17 +0200, Andres Freund wrote: On 2013-09-27 23:28:37 +0300, Heikki Linnakangas wrote: The function RecoveryInProgress() function does just one load, to read the variable, and wouldn't even need a barrier by itself. The other load or store that needs to be protected by the barrier happens in the caller, before or after the function, and we can't say for sure if it's a load or a store. So, let's use pg_memory_barrier(). The caller uses a spinlock, so it's guaranteed to write out before the spinlock is released. A write barrier (the spinlock in the startup process) should always be paired by a read barrier. s/caller/startup process/ Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
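For reference, the definitions in question: on x86, ordinary loads are not reordered with other loads, nor stores with stores (for normal write-back memory), so barrier.h can define the read and write barriers as compiler barriers and reserve a locked instruction for the full barrier. Roughly, and as an approximation rather than a verbatim quote of the header:

    #define pg_compiler_barrier()  __asm__ __volatile__("" : : : "memory")

    /* full fence: a locked read-modify-write on the stack */
    #define pg_memory_barrier()    \
        __asm__ __volatile__("lock; addl $0,0(%%esp)" : : : "memory")

    #define pg_read_barrier()      pg_compiler_barrier()
    #define pg_write_barrier()     pg_compiler_barrier()

lfence constrains load ordering even for the weakly-ordered cases (non-temporal loads and the like) that ordinary Postgres accesses never hit, which is the usual argument for leaving it out on x86.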
Re: [HACKERS] Extra functionality to createuser
Attached is a patch implementing the -g / --roles option for createuser. I'll be attaching it to the open CommitFest shortly. createuser.diff Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
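Presumably the patch just folds the role memberships into the SQL that createuser already generates; either of the following would match the proposed interface (illustrative SQL, not lifted from the patch):

    -- createuser -D -S -l -g app_readonly_role,app2_writer_role my_new_user
    CREATE ROLE my_new_user NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT LOGIN
        IN ROLE app_readonly_role, app2_writer_role;
    -- or, equivalently, separate grants after creation:
    GRANT app_readonly_role TO my_new_user;
    GRANT app2_writer_role TO my_new_user;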
Re: [HACKERS] record identical operator - Review
Bruce Momjian br...@momjian.us wrote: On Thu, Sep 19, 2013 at 06:31:38PM -0400, Steve Singer wrote: On 09/12/2013 06:27 PM, Kevin Grittner wrote: Attached is a patch for a bit of infrastructure I believe to be necessary for correct behavior of REFRESH MATERIALIZED VIEW CONCURRENTLY as well as incremental maintenance of matviews. Here is an attempt at a review of the v1 patch. There has been a heated discussion on list about whether we even want this type of operator. I will try to summarize it here. The arguments against it are:
* that a record_image_identical call on two records that returns false can still return true under the equality operator, and the records might (or might not) appear to be the same to users
* Once we expose this as an operator via SQL someone will find it and use it and then complain when we change things such as the internal representation of a datatype, which might break their queries
I don't see where that is possible unless they count on the order of records sorted with the new operators, which would not be something I recommend. Changing the internal representation cannot break simple use of the alternative equality definition as long as all you cared about was whether the binary storage format of the values was identical.
* The differences between = and record_image_identical might not be understood by everyone who finds the operator, leading to confusion
Either they find it and read the code, or we document it and they read the docs.
* Using a criterion other than equality (the = operator provided by the datatype) might cause problems if we later provide 'on change' triggers for materialized views
I would fight user-defined triggers for matviews tooth and nail. We need to get incremental maintenance working instead.
The arguments for this patch are:
* We want the materialized view to return the same value as would be returned if the query were executed directly. This not only means that the values should be the same according to a datatype's = operator but that they should appear the same 'to the eyeball'.
And to functions the user can run against the values. The example with the null bitmap for arrays being included when not necessary is something that isn't directly apparent to the eye, but queries which use pg_column_size would not get equal results.
* Supporting the materialized view refresh check via SQL makes a lot of sense but doing that requires exposing something via SQL
* If we adequately document what we mean by record_image_identical and the operator we pick for this then users shouldn't be surprised at what they get if they use this
We first need to document the existing record comparison operators. If they read the docs for comparing row_constructors and expect that to be the behavior they get when they compare records, they will be surprised. This is a good summary. I think there are a few things that make this issue difficult to decide: 1. We have to use an operator to give the RMVC (REFRESH MATERIALIZED VIEW CONCURRENTLY) SPI query the ability to optimize this query. If we could do this with an SQL or C function, there would be less concern about the confusion caused by adding this capability. (a) We can't. (b) Why would that be less confusing? Question: If we are comparing based on some primary key, why do we need this to optimize? Because comparing primary keys doesn't tell us whether the old and new values in the row all match. Certainly the primary key index will be used, no?
It currently uses all columns which are part of unique indexes on the matview which are not partial (i.e., they do not use a WHERE clause) and which index only columns (not expressions). It is not possible to define a primary key on a matview. There are two reasons for using these columns: (a) It provides a correctness guarantee against having duplicated rows which have no nulls. This guarantee is what allows us to use the differential technique, which is faster than retail DELETE and re-INSERT of everything. (b) Since we don't support hash joins of records, and it would tend to be pretty slow if we did, it allows faster hash joins on one or more columns which stand a good chance of doing this efficiently. 2. Aggregates are already non-deterministic, Only some of them, and only with certain source data. so some feel that adding this feature doesn't improve much. It allows a materialized view which contains a column of a type without a default btree opclass to *use* concurrent refresh, it makes queries with deterministic results always generate matview results which exactly match the results of a VIEW using the same query run at the same moment, and for nondeterministic queries it provides results consistent with *some* result set which could have been returned by a view at the same time. Unless we do something, none of these things are true. That seems like a lot to me. The counter-argument is that
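The unique-index requirement described above is easy to see end-to-end with plain SQL (table and column names are illustrative):

    CREATE MATERIALIZED VIEW order_totals AS
        SELECT customer_id, sum(amount) AS total
        FROM orders
        GROUP BY customer_id;

    -- CONCURRENTLY requires a unique, non-partial, columns-only index
    -- (and matviews cannot have primary keys):
    CREATE UNIQUE INDEX ON order_totals (customer_id);

    -- Differential refresh: old and new contents are joined on the
    -- indexed column(s), and only changed rows are deleted/inserted,
    -- so readers are never locked out.
    REFRESH MATERIALIZED VIEW CONCURRENTLY order_totals;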
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Fri, Sep 27, 2013 at 5:36 AM, Robert Haas robertmh...@gmail.com wrote: I don't have another idea either. In fact, I'd go so far as to say that doing any third thing that's better than those two to any reasonable person is obviously impossible. But I'd add that we simply cannot roll back at read committed, so we're just going to have to hold our collective noses and do strange things with visibility. I don't accept that as a general principle. We're writing the code; we can make it behave any way we think best. I presume you're referring to the principle that we cannot throw serialization failures at read committed. I'd suggest that letting that happen would upset a lot of people, because it's so totally unprecedented. A large segment of our user base would just consider that to be Postgres randomly throwing errors, and would be totally dismissive of the need to do so, and not without some justification - no one else does the same. The reality is that the majority of our users don't even know what an isolation level is. I'm not just talking about people that use Postgres more casually, such as Heroku customers. I've personally talked to people who didn't even know what a transaction isolation level was, who were in a position where they really, really should have known. I doubt that any change to HeapTupleSatisfiesMVCC() will be acceptable. This feature needs to restrain itself to behavior changes that only affect users of this feature, I think. I agree with the principle of what you're saying, but I'm not aware that those changes to HeapTupleSatisfiesMVCC() imply any behavioral changes for those not using the feature. Certainly, the standard regression tests and isolation tests still pass, for what it's worth. Having said that, I have not thought about it enough to be willing to actually defend that bit of code. Though I must admit that I am a little encouraged by the fact that it passes casual inspection. I am starting to wonder if it's really necessary to have a blessed update that can see the locked, not-otherwise-visible tuple. Doing that certainly has its disadvantages, both in terms of code complexity and in terms of being arbitrarily restrictive. We're going to have to allow the user to see the locked row after it's updated (the new row version that we create will naturally be visible to its creating xact) - is it really any worse that the user can see it before an update (or a delete)? The user could decide to effectively make the update change nothing, and see the same thing anyway. I get why you're averse to doing odd things to visibility - I was too. I just don't see that we have a choice if we want this feature to work acceptably with read committed. In addition, as it happens I just don't see that the general situation is made any worse by the fact that the user might be able to see the row before an update/delete. Isn't it also weird to update or delete something you cannot see? Couldn't EvalPlanQual() be said to be an MVCC violation on similar grounds? It also reaches into the future. Locking a row isn't really that distinct from updating it in terms of the code footprint, but also from a logical perspective. It's probably important to avoid having them keep recreating speculative tuples and then killing them as long as a candidate tuple is available, so that they don't create a dead tuple per iteration. But that seems doable. I'm not so sure.
Realize you can generally only kill the heap tuple *before* you have the row lock, because otherwise a totally innocent non-HOT update (may not update any unique indexed columns at all) will deadlock with your session, which I don't think is defensible, and will probably happen often if allowed to (after all, this is upsert - users are going to want to update their locked rows!). I must be obtuse; I don't see why that would deadlock. If you don't see it, then you aren't being obtuse in asking for clarification. It's really easy to be wrong about this kind of thing. If the non-HOT update updates some random row, changing the key columns, it will lock that random row version. It will then proceed with value locking (i.e. inserting index tuples in the usual way, in this case with entirely new values). It might then block on one of the index tuples we, the upserter, have already inserted (these are our value locks under your scheme). Meanwhile, we (the upserter) might have *already* concluded that the *old* heap row that the regular updater is in the process of rendering invisible is to blame in respect of some other value in some later unique index, and that *it* must be locked. Deadlock. This seems very possible if the key values are somewhat correlated, which is probably generally quite common. The important observation here is that an updater, in effect, locks both the old and new sets of values (for non-HOT updates). And as I've already noted, any practical value locking implementation isn't going to be able
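The interleaving described above, made concrete (schematic; assume both sessions run at read committed, and the upsert syntax stands in for whatever this patch ends up providing):

    -- Setup: two unique constraints, one pre-existing row.
    CREATE TABLE t (a int UNIQUE, b int UNIQUE);
    INSERT INTO t VALUES (5, 2);

    -- S1 (upserter) wants (a=1, b=2): it inserts its a=1 index tuple
    -- first, taking a "value lock" on a=1 in t_a_key.

    -- S2 (ordinary updater, changes both key columns, hence non-HOT):
    UPDATE t SET a = 1, b = 9 WHERE a = 5;
    -- S2 row-locks the old version (5,2), then inserts new index
    -- entries and blocks on S1's a=1 value lock during the unique check.

    -- S1 now reaches t_b_key, finds b=2 taken by S2's *old* row,
    -- decides that row must be locked, and blocks on S2's row lock.
    -- S1 waits on S2's row lock; S2 waits on S1's value lock: deadlock.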
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE - visibility semantics
On Tue, Sep 24, 2013 at 2:14 AM, Andres Freund and...@2ndquadrant.com wrote: Various messages are discussing semantics around visibility. I by now have a hard time keeping track. So let's keep the discussion of the desired semantics to this thread. Yes, it's pretty complicated. I meant to comment on this here, but ended up saying some stuff to Robert about this in the main thread, so I should probably direct you to that. You were probably right to start a new thread, because I think we can usefully discuss this topic in parallel, but that's just what ended up happening. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers