Re: [HACKERS] Shared row locking, revisited
Alvaro Herrera [EMAIL PROTECTED] writes:
> Because we can't reuse MultiXactIds at system crash (else we risk taking an Id which is already stored in some tuple), we need to XLog it. Not at the locking operation, because we don't want to log that one (too expensive.) We can log the current value of the MultiXactId counter once in a while; say, once every 1000 acquisitions. Following a crash, the value is restored to the last one logged + 1000. (I think this can be a problem if we log one acquisition, then write some tuples, and then crash, without flushing the logged acquisition. Maybe a solution is to force a flush after logging an acquisition.)

Does Oid have a similar problem?

Regards,
Qingqing

---(end of broadcast)---
TIP 8: explain analyze is your friend
Re: [HACKERS] Shared row locking, revisited
On Thu, Apr 07, 2005 at 02:34:03PM +0800, Qingqing Zhou wrote:
> Alvaro Herrera [EMAIL PROTECTED] writes:
> > Because we can't reuse MultiXactIds at system crash (else we risk taking an Id which is already stored in some tuple), we need to XLog it. Not at the locking operation, because we don't want to log that one (too expensive.) We can log the current value of the MultiXactId counter once in a while; say, once every 1000 acquisitions. Following a crash, the value is restored to the last one logged + 1000. (I think this can be a problem if we log one acquisition, then write some tuples, and then crash, without flushing the logged acquisition. Maybe a solution is to force a flush after logging an acquisition.)
>
> Does Oid have a similar problem?

Good question. Yes, and in fact the solution is similar; see XLogPutNextOid(). I think we could do the same for MultiXactId.

-- 
Alvaro Herrera ([EMAIL PROTECTED])
La soledad es compañía

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
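[Editorial illustration.] The crash-safety scheme discussed in this exchange — log the counter periodically, and after a crash restore it to the last logged value + 1000, in the spirit of what XLogPutNextOid() does for Oids — can be modeled with a toy simulation. This is plain C with invented names; the `logged` field merely stands in for a flushed WAL record, and the flush-ordering caveat Qingqing raises is ignored here:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplified model of the proposal.  The real code would
 * emit and flush WAL records; here "logged" stands in for the value
 * last written to XLog. */
#define MULTIXACT_LOG_INTERVAL 1000

typedef struct {
    uint32_t next;    /* next MultiXactId to hand out */
    uint32_t logged;  /* last counter value recorded in the (simulated) WAL */
} MultiXactCounter;

/* Allocate one MultiXactId, logging the counter once every 1000
 * acquisitions, as the proposal describes. */
static uint32_t mxid_alloc(MultiXactCounter *c)
{
    uint32_t id = c->next++;
    if (id % MULTIXACT_LOG_INTERVAL == 0)
        c->logged = id;          /* simulate a flushed XLog record */
    return id;
}

/* After a crash, restart from the last logged value + 1000.  Every id
 * handed out before the crash is strictly below that, so no id that may
 * already be stored in a tuple can be reused. */
static uint32_t mxid_recover(const MultiXactCounter *c)
{
    return c->logged + MULTIXACT_LOG_INTERVAL;
}
```

The invariant to check is that the recovered value always exceeds any id allocated before the (simulated) crash.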
[HACKERS] Shared row locking, revisited
Hackers,

Again I come back to this topic. Now I have a better hold on an idea that is hopefully implementable and doesn't have expensive performance penalties.

Forget the idea of using the regular lock manager directly on tuples. It's folly because we can't afford to have that many locks. Instead we will lock on the transactions that are involved in a shared row lock.

We will have a cluster-wide counter, call it MultiXactId for the sake of this explanation. When someone wants to share-lock an (unlocked) tuple, he creates a new MultiXactId, a bit is set in t_infomask (HEAP_XMAX_FOR_SHARE, say), and the tuple's Xmax is set to the new MultiXactId.

We will have two additional SLRU areas; one will contain offsets to the other, so that the second can store variable-length arrays. In the first (call it pg_offsets) we will use the MultiXactId as key, and the last offset of the second as value. The second SLRU area (call it pg_multixact_members) is accessed using the offset from pg_offsets, and contains the members of the respective MultiXactId.

So, returning to the locker: it grabbed the MultiXactId; it records an offset of (previous offset + 1) in pg_offsets, and its own TransactionId in that offset of pg_multixact_members.

This was a locker of an unlocked tuple. But if it finds that the tuple is already locked, it gets the MultiXactId from the tuple, goes to pg_offsets and from there to pg_multixact_members; it then knows what transactions are locking this tuple. It can check which of those are still running and discard those that aren't. So at this point it has a list of TransactionIds, to which it adds itself, and stores it as a new MultiXactId following the procedure outlined above.

Possible optimization: keep a list of the MultiXactIds the current TransactionId is a member of (this can be done in local memory). If at some point it is going to create a new MultiXactId, first check this list to see if the same members can be found, and reuse the MultiXactId if so. Thus we save a MultiXactId (useful trick for when somebody wants to lock a whole table, for example).

Sleeping is done using XactLockTableWait() on each of the members of the MultiXactId. If it wakes after sleeping on all of them only to find that the MultiXactId has changed, it will have to sleep again. This can cause starvation if more shared lockers come while it is sleeping. Not sure how to solve this (and I think it's an important problem to solve.) One way would be locking the MultiXactId itself so that nobody can change it.

Note that we do this only for shared locks. For exclusive locks, using the TransactionId as is done in the current code is OK.

At transaction start, the current value of the MultiXactId counter is saved. For cleanup, we can truncate the pg_offsets table at the MultiXactId previous to the youngest one saved; and pg_multixact_members, at whatever pg_offsets has in the truncating position.

Because we can't reuse MultiXactIds at system crash (else we risk taking an Id which is already stored in some tuple), we need to XLog it. Not at the locking operation, because we don't want to log that one (too expensive.) We can log the current value of the MultiXactId counter once in a while; say, once every 1000 acquisitions. Following a crash, the value is restored to the last one logged + 1000. (I think this can be a problem if we log one acquisition, then write some tuples, and then crash, without flushing the logged acquisition. Maybe a solution is to force a flush after logging an acquisition.)

Comments are welcome.

-- 
Alvaro Herrera ([EMAIL PROTECTED])
El miedo atento y previsor es la madre de la seguridad (E. Burke)

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
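[Editorial illustration.] The pg_offsets / pg_multixact_members layout described above — one area mapping a MultiXactId to the end offset of its member array in the other — can be modeled with flat in-memory arrays. The real proposal would store these in SLRU pages on disk; all sizes and names below are invented, and this is not actual PostgreSQL code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy in-memory model of the two proposed SLRU areas. */
#define MAX_MXACTS  64
#define MAX_MEMBERS 256

static uint32_t pg_offsets[MAX_MXACTS + 1];        /* MultiXactId -> end offset */
static uint32_t pg_multixact_members[MAX_MEMBERS]; /* flat array of member XIDs */
static uint32_t next_mxid = 1;
static uint32_t next_offset = 0;

/* Create a new MultiXactId whose members are the given TransactionIds:
 * append the members, then record the resulting end offset under the
 * new id, as the proposal describes. */
static uint32_t mxid_create(const uint32_t *xids, int n)
{
    uint32_t mxid = next_mxid++;
    memcpy(&pg_multixact_members[next_offset], xids, n * sizeof(uint32_t));
    next_offset += n;
    pg_offsets[mxid] = next_offset;   /* one past our last member */
    return mxid;
}

/* Fetch the members of a MultiXactId into out[]; the member range runs
 * from the previous entry's end offset to this entry's end offset.
 * Returns the member count. */
static int mxid_members(uint32_t mxid, uint32_t *out)
{
    uint32_t start = pg_offsets[mxid - 1];
    uint32_t end   = pg_offsets[mxid];
    memcpy(out, &pg_multixact_members[start], (end - start) * sizeof(uint32_t));
    return (int)(end - start);
}
```

A locker of an already-locked tuple would call mxid_members() on the tuple's Xmax, drop the no-longer-running transactions, add itself, and mxid_create() the result.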
Re: [HACKERS] Shared row locking
On Wed, 29 Dec 2004 19:57:15 -0500, Tom Lane [EMAIL PROTECTED] wrote:
> Manfred Koizar [EMAIL PROTECTED] writes:
> > I don't see too much of a difference between #1 (an on-disk structure buffered in shared memory) and #2 (a shared memory structure spilling to disk).
> If you stand back that far, maybe you can't see a difference ;-) ...

Well, I tried to step back a bit to see the whole picture. You think I went too far this time? :-)

> but the proposal on the table was for an extremely special-purpose on-disk structure. I'd prefer to see first if we can solve a more general problem, namely the fixed size of the existing lock table.

I haven't been digging into the code for some time (:-() -- but isn't this basically a key/value mapping with random access? My concern is that as soon as you start to push entries out of the memory structure you have to implement some kind of on-disk structure to support fast lookups, which sooner or later might lead to something that looks suspiciously like the existing hash or btree code. So the question is whether starting by making nbtree more flexible isn't the lower hanging fruit...

Servus
Manfred

---(end of broadcast)---
TIP 8: explain analyze is your friend
Re: [HACKERS] Shared row locking
Manfred Koizar [EMAIL PROTECTED] writes: So the question is whether starting by making nbtree more flexible isn't the lower hanging fruit... Certainly not; indexes depend on locks, not vice versa. You'd not be able to do that without introducing an infinite recursion into the system design. In any case nbtree is much more heavyweight than we need for this --- the lock table does not want WAL support for example, nor REINDEX/VACUUM, nor support for arbitrary index lookup conditions, nor even multiple datatypes or multiple index columns. regards, tom lane ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Shared row locking
On Thu, 30 Dec 2004 13:36:53 -0500, Tom Lane [EMAIL PROTECTED] wrote:
> Certainly not; indexes depend on locks, not vice versa. You'd not be able to do that without introducing an infinite recursion into the system design.

Wouldn't you have to face the same sort of problems if you spill part of the lock table to disk? While you do I/O you have to hold some lock. In either case there has to be a special class of locks that are pinned in memory.

> In any case nbtree is much more heavyweight than we need for this

Having functionality we don't need is not a showstopper ... unless heavyweight implies slow, which I have to admit may well be the case.

Servus
Manfred

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Shared row locking
Manfred Koizar [EMAIL PROTECTED] writes: On Thu, 30 Dec 2004 13:36:53 -0500, Tom Lane [EMAIL PROTECTED] wrote: Certainly not; indexes depend on locks, not vice versa. You'd not be able to do that without introducing an infinite recursion into the system design. Wouldn't you have to face the same sort of problems if you spill part of the lock table to disk? While you do I/O you have to hold some lock. See LWLocks ... or spinlocks underneath those. But (some) operations on tables and indexes make use of heavyweight locks. regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Shared row locking
On Thu, 16 Dec 2004 21:54:14 -0300, Alvaro Herrera [EMAIL PROTECTED] wrote:
> Else, it will have to wait, using XactLockTableWait, for the first transaction in the array that is still running. We can be sure that no one will try to share-lock the tuple while we check the btree because we hold an exclusive lock on the tuple's heap page's buffer.

Do you really want to XactLockTableWait while holding an exclusive lock on a heap page? Or are you going to release the lock before entering the wait?

> Thanks to tablespaces, it's very easy to create special Relations that can be dealt with by standard buffer and storage manager, etc.

I don't get how tablespaces would help.

> The btree idea: - does not need crash recovery. Maybe we could use a stripped down version of nbtree. This could cause a maintainability nightmare.

It could be a nightmare if you simply duplicate and then modify the code. A better way would be to refactor nbtree to be able to handle btrees with different properties:
. shared/private
. logged/non-logged
. flexible value data type

> - can't hold more than 300 or so simultaneous lockers (because of value length, limited to 1/3 of a page).

I doubt this is a real problem. Or you store each lock as a separate index tuple. If this is expected to be a problem because of many repeating key values, nbtree should be enhanced to support duplicate key compression, which might be a good thing per se.

Servus
Manfred

---(end of broadcast)---
TIP 8: explain analyze is your friend
Re: [HACKERS] Shared row locking
On Mon, 20 Dec 2004 21:44:01 +0100, [EMAIL PROTECTED] wrote: Tom Lane [EMAIL PROTECTED] wrote on 20.12.2004, 19:34:21: #1 could have a pretty serious performance impact, too. For small numbers of FOR UPDATE locks (too few to force spill to disk) I would expect #2 to substantially beat #1. #1 essentially imposes the worst case performance at all times, whereas #2 degrades (at a currently unknown rate) when there are lots and lots of FOR UPDATE locks. I don't see too much of a difference between #1 (an on-disk structure buffered in shared memory) and #2 (a shared memory structure spilling to disk). As long as we can avoid unnecessary writes, #1 has the nice property that it automatically adapts to different usage patterns because it uses the normal shared buffer pool and cache replacement strategy. [My gut feeling would be against another permanent on-disk structure, since this is one more thing for a user to delete to save space etc...] Irrelevant, IMHO. Whoever manually deletes any file under PGDATA deserves whatever this may cause. I haven't seen many programs that use extended SELECT FOR UPDATE logic. AFAICS, SELECT FOR UPDATE is not the primary issue here, because it locks rows exclusively. This thread is about shared locks, which should be used for foreign key checks. Servus Manfred ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Shared row locking
Manfred Koizar [EMAIL PROTECTED] writes: I don't see too much of a difference between #1 (an on-disk structure buffered in shared memory) and #2 (a shared memory structure spilling to disk). If you stand back that far, maybe you can't see a difference ;-) ... but the proposal on the table was for an extremely special-purpose on-disk structure. I'd prefer to see first if we can solve a more general problem, namely the fixed size of the existing lock table. It's certainly possible that that idea won't work out, but if it can be done for a similar time investment as the special-purpose log would take, it'd be a win. regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Shared row locking
In general, I agree with Tom: I haven't seen many programs that use extended SELECT FOR UPDATE logic. However, the ones I have seen have been batch style programs written using a whole-table cursor - these latter ones have been designed for the cursor stability approach. I think if we add shared locks we should by default behave like cursor stability isolation level, that only holds one shared lock for the current cursor row. The semantics are well defined in SQL. If you want repeatable read you need to change isolation level. I know FK checks will need to keep the locks, but I would special case that. Andreas ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Shared row locking
On Mon, Dec 20, 2004 at 03:09:24PM -0300, Alvaro Herrera wrote:
> To solve the problem I want to solve, we have three orthogonal possibilities:
> 1. implement shared row locking using the ideas outlined in the mail starting this thread (pg_clog-like seems to be the winner, details TBD).
> 2. implement shared lock table spill-to-disk mechanism.
> 3. implement lock escalation.
> Some people seem to think 3 is better than 2. What do they think of 1? Some facts: - DB2 implements 3 and some people have problems with deadlocks.

FWIW, I have seen DB2 deadlock on a single row update statement in a database with no one else connected. It was an issue with how they were implementing repeatable read concurrency. What this indicates to me is that they've got some 'issues' with their locking mechanisms, and that #3 shouldn't be thrown out just because DB2 doesn't do it right. AFAIK Sybase and SQL-server also use lock escalation and I've not heard of issues with it.

-- 
Jim C. Nasby, Database Consultant [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828
Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Shared row locking
On Mon, 2004-12-20 at 06:34, Jim C. Nasby wrote: On Sun, Dec 19, 2004 at 11:35:02PM +0200, Heikki Linnakangas wrote: On Sun, 19 Dec 2004, Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: On Sun, 19 Dec 2004, Alvaro Herrera wrote: This is not useful at all, because the objective of this exercise is to downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to shared row locking. Actually it might help in some scenarios. Remember, we're not talking about upgrading shared locks to exclusive locks. We're only talking about locking more rows than necessary (all rows). Nonetheless, it would mean that locks would be taken depending on implementation-dependent, not-visible-to-the-user considerations. Shared locks can still cause deadlocks, and so you would have an unreliable application, which would only be unreliable under load. As I said in connection with the other proposal, weird user-visible semantics should be the last resort not the first. I agree that lock escalation is not a good solution, we run into problems with DB2 lock escalation at work all the time. Does anyone know how Oracle deals with this? They use MVCC like PostgreSQL, so they'd be a better source for inspiration. Oracle only uses MVCC in its widest sense - versioning info is stored in UNDO tablespaces (rollback segments). That implementation is covered by aggressive patent attorneys. Oracle implements locking at row level within each data block. The block header expands dynamically to accommodate a list of transactions that can access, with minimum and maximum sizes settable by the DBA. This works reasonably well. Each SELECT FOR UPDATE is actually a block-write, whether or not the rows are modified, which has some additional code to recover from this without crashing/redo. Later transactions end up cleaning up the lock header info (which later became a problem in Parallel Server). 
https://cwisdb.cc.kuleuven.ac.be/ora10doc/server.101/b10743/consist.htm -- Best Regards, Simon Riggs ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Shared row locking
On Mon, Dec 20, 2004 at 06:23:24PM +1100, Gavin Sherry wrote:
> On Sat, 18 Dec 2004, Bruce Momjian wrote:
> > Agreed. One concern I have about allowing the lock table to spill to disk is that a large number of FOR UPDATE locks could push out lock entries used by other backends, causing very poor performance.
> I think if we allow the lock manager to spill to disk (and I think we do need to allow it) then we should also be able to control the amount of shared memory allocated. There's little point spilling the lock table to disk if we have huge amounts of memory.

This is an interesting idea. Gavin also mentioned to me we should also control the amount of memory the shared inval queue uses. Causing all backends to refill the cache is (I assume) a big performance hit. Maybe we should expose this via new knobs in postgresql.conf, to ease implementation, or maybe not, to spare users from configuring it. As a start, we could have WARNINGs when the lock table is spilled and when a SInvalReset occurs, so the user can know whether he should increase memory use for those settings.

-- 
Alvaro Herrera ([EMAIL PROTECTED])
Thou shalt not follow the NULL pointer, for chaos and madness await thee at its end. (2nd Commandment for C programmers)

---(end of broadcast)---
TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Shared row locking
Gavin Sherry [EMAIL PROTECTED] writes: I think if we allow the lock manager to spill to disk (and I think we do need to allow it) then we should also be able to control the amount of shared memory allocated. You mean like max_locks_per_transaction? regards, tom lane ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] Shared row locking
Alvaro Herrera [EMAIL PROTECTED] writes: Gavin also mentioned to me we should also control the amount of memory the shared inval queue uses. Perhaps, but I've really seen no evidence that there's a need to worry about that. Without demonstrated problems I'd sooner keep that code a bit simpler and faster. regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Shared row locking
Tom Lane wrote:
> Gavin Sherry [EMAIL PROTECTED] writes:
> > I think if we allow the lock manager to spill to disk (and I think we do need to allow it) then we should also be able to control the amount of shared memory allocated.
> You mean like max_locks_per_transaction?

IMO, max_locks_per_transaction could use a better name and a little more documentation. I've mentioned this a couple of times before, but there is at least one type of lock that does not expire when the transaction ends (user locks).

I may be in over my head here, but I think lock spillover is dangerous. In the extreme situations where this would happen, it would be a real performance buster. Personally, I would rather see locks escalate when the lock table gets full, or at least allow this as a configuration parameter.

Merlin

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Shared row locking
Merlin Moncure [EMAIL PROTECTED] writes: I may be over my head here, but I think lock spillover is dangerous. In the extreme situations where this would happen, it would be a real performance buster. Personally, I would rather see locks escalate when the table gets full, or at least allow this as a configuration parameter. To me, performance buster is better than random, unrepeatable deadlock failures. In any case, if we find we *can't* implement this in a non-performance-busting way, then it would be time enough to look at alternatives that force the user to manage the problem for us. regards, tom lane ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Shared row locking
On Mon, Dec 20, 2004 at 11:47:41AM -0500, Tom Lane wrote:
> To me, performance buster is better than random, unrepeatable deadlock failures. In any case, if we find we *can't* implement this in a non-performance-busting way, then it would be time enough to look at alternatives that force the user to manage the problem for us.

I am confused by this discussion. To solve the problem I want to solve, we have three orthogonal possibilities:

1. implement shared row locking using the ideas outlined in the mail starting this thread (pg_clog-like seems to be the winner, details TBD).
2. implement a shared lock table spill-to-disk mechanism.
3. implement lock escalation.

Some people seem to think 3 is better than 2. What do they think of 1?

Some facts:
- DB2 implements 3 and some people have problems with deadlocks.
- 2 could have a performance impact, and we don't even know how to start. For example, what would be an algorithm to decide what locks to send to disk?
- I am interested in implementing 1, maybe 2. Don't know about 3.

-- 
Alvaro Herrera ([EMAIL PROTECTED])
One man's impedance mismatch is another man's layer of abstraction. (Lincoln Yeoh)

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Shared row locking
Alvaro Herrera [EMAIL PROTECTED] writes:
> To solve the problem I want to solve, we have three orthogonal possibilities:
> 1. implement shared row locking using the ideas outlined in the mail starting this thread (pg_clog-like seems to be the winner, details TBD).
> 2. implement shared lock table spill-to-disk mechanism.
> 3. implement lock escalation.

Check.

> - 2 could have a performance impact, and we don't even know how to start. For example, what would be an algorithm to decide what locks to send to disk?

LRU, perhaps? That's all open for investigation still.

#1 could have a pretty serious performance impact, too. For small numbers of FOR UPDATE locks (too few to force spill to disk) I would expect #2 to substantially beat #1. #1 essentially imposes the worst case performance at all times, whereas #2 degrades (at a currently unknown rate) when there are lots and lots of FOR UPDATE locks.

Most of the applications I've seen don't take out that many FOR UPDATE locks at once, so I'm unclear on the rationale for choosing a fixed-but-poor performance curve over one that is fast for few locks and degrades for many locks. Especially when the value of many is user-configurable.

Furthermore, we have also seen issues with too many locks on ordinary objects, which #2 would solve simultaneously.

So I feel that #2 is clearly the approach to try first. If we find that we can't do spill-to-disk without serious performance degradation, then I'd be inclined to try #1 next. I really don't care for the user-visible semantics changes implied by #3 ...

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend
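[Editorial illustration.] Tom's "LRU, perhaps?" — choosing the least recently touched entry as the spill victim when the shared-memory lock table is full — reduces to a scan for the minimum timestamp. The structure and field names below are hypothetical, not the actual shared lock table:

```c
#include <assert.h>
#include <stdint.h>

/* Invented sketch of a lock-table entry carrying a logical clock of its
 * last use; a real implementation would live in shared memory and need
 * concurrency control around this scan. */
typedef struct {
    uint32_t lock_tag;   /* identifies the locked object */
    uint64_t last_used;  /* logical clock value of the last touch */
} LockEntry;

/* Pick the least recently used entry as the one to spill to disk. */
static int pick_victim(const LockEntry *table, int n)
{
    int victim = 0;
    for (int i = 1; i < n; i++)
        if (table[i].last_used < table[victim].last_used)
            victim = i;
    return victim;
}
```

A linear scan is O(n) per eviction; a production design would more likely keep an ordered structure, which is exactly the kind of open investigation Tom alludes to.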
Re: Re: [HACKERS] Shared row locking
Tom Lane [EMAIL PROTECTED] wrote on 20.12.2004, 19:34:21:
> Alvaro Herrera writes:
> > To solve the problem I want to solve, we have three orthogonal possibilities:
> > 1. implement shared row locking using the ideas outlined in the mail starting this thread (pg_clog-like seems to be the winner, details TBD).
> > 2. implement shared lock table spill-to-disk mechanism.
> > 3. implement lock escalation.
> Check.
> > - 2 could have a performance impact, and we don't even know how to start. For example, what would be an algorithm to decide what locks to send to disk?
> LRU, perhaps? That's all open for investigation still. #1 could have a pretty serious performance impact, too. For small numbers of FOR UPDATE locks (too few to force spill to disk) I would expect #2 to substantially beat #1. #1 essentially imposes the worst case performance at all times, whereas #2 degrades (at a currently unknown rate) when there are lots and lots of FOR UPDATE locks.

Agreed. [My gut feeling would be against another permanent on-disk structure, since this is one more thing for a user to delete to save space etc...]

> Most of the applications I've seen don't take out that many FOR UPDATE locks at once, so I'm unclear on the rationale for choosing a fixed-but-poor performance curve over one that is fast for few locks and degrades for many locks. Especially when the value of many is user-configurable. Furthermore, we have also seen issues with too many locks on ordinary objects, which #2 would solve simultaneously. So I feel that #2 is clearly the approach to try first. If we find that we can't do spill-to-disk without serious performance degradation, then

I agree. We need to understand what type of application logic we are coding for. In general, I agree with Tom: I haven't seen many programs that use extended SELECT FOR UPDATE logic. However, the ones I have seen have been batch style programs written using a whole-table cursor - these latter ones have been designed for the cursor stability approach.

It would be much better to analyze one or more representative application suites to understand which option to pick. Without a set of programs, or some driving force, the wrong one could be picked.

Spill-to-disk would not be that bad, since WARNINGs could appear in the log. That's much better than doing a lock escalation, definitely.

Best Regards, Simon Riggs

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Shared row locking
On Sun, 2004-12-19 at 04:04, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian [EMAIL PROTECTED] writes:
> > > You mean all empty/zero rows can be removed? Can we guarantee that on commit we can clean up the bitmap? If not the idea doesn't work.
> > For whatever data structure we use, we may reset the structure to empty during backend-crash recovery. So your objection boils down to what if a backend exits normally but forgets to clean up its locks? Assuming that doesn't happen isn't any worse than assuming a backend will clean up its shared memory state on non-crash exit, so I don't think it's a serious concern.
> > That brings another thought: really what this is all about is working around the fact that the standard lock manager can only cope with a finite number of coexisting locks, because it's working in a fixed-size shared memory arena. Maybe we should instead think about ways to allow the existing lock table to spill to disk when it gets too big. That would eliminate max_locks_per_transaction as a source of hard failures, which would be a nice benefit.
> Agreed. One concern I have about allowing the lock table to spill to disk is that a large number of FOR UPDATE locks could push out lock entries used by other backends, causing very poor performance.

In similar circumstances, DB2 uses these techniques:
- when the locktable is X % full, escalate locks to full table locks; both locktable memory and threshold % are instance parameters
- use a lock mode called Cursor Stability that locks only those rows currently being examined by a cursor, thus maintaining the lock usage of a cursor at a constant level as the cursor moves. The lock mode of Repeatable Read *does* lock all rows read
(these are not actually mutually exclusive)

The first one is a real pain, but the idea might be of use somewhere. The second is a usable, practical alternative that should be considered and might avoid the need to write the spill-to-disk code and then discover it performs very badly.

-- 
Best Regards, Simon Riggs

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
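[Editorial illustration.] The DB2-style rule Simon describes — escalate row locks to a table lock once the lock table passes a configured fill threshold — reduces to a simple predicate. The function and parameter names here are invented; DB2's actual instance parameters are only paraphrased:

```c
#include <assert.h>

/* Hypothetical sketch of threshold-based lock escalation: return true
 * when the lock table's fill ratio has reached the configured
 * percentage, meaning the transaction's row locks should be traded for
 * a single table lock. */
static int should_escalate(int locks_held, int table_capacity,
                           double threshold_pct)
{
    return (double) locks_held / (double) table_capacity * 100.0
           >= threshold_pct;
}
```

The downside discussed throughout this thread applies regardless of how the predicate is tuned: escalation trades memory pressure for coarser locks, which can surface as deadlocks that depend on load rather than on application logic.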
Re: [HACKERS] Shared row locking
On Sun, Dec 19, 2004 at 09:52:01AM +, Simon Riggs wrote:

Simon,

> In similar circumstances, DB2 uses these techniques:
> - when locktable X % full, then escalate locks to full table locks: both locktable memory and threshold% are instance parameters

This is not useful at all, because the objective of this exercise is to downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to shared row locking. Keep in mind that this is for foreign key locking, which is one area where deadlocks are frequently encountered because we use too strong a lock.

> - use a lock mode called Cursor Stability that locks only those rows currently being examined by a cursor, thus maintaining the lock usage of a cursor at a constant level as the cursor moves. The lock mode of Repeatable Read *does* lock all rows read

We can't release the locks until the end of the transaction.

> The second is a usable, practical alternative that should be considered and might avoid the need to write the spill-to-disk code and then discover it performs very badly.

I don't think any of them serves the objective I'm trying to accomplish :-(

Thanks for your ideas anyway. And keep having them!

-- 
Alvaro Herrera ([EMAIL PROTECTED])
Uno puede defenderse de los ataques; contra los elogios se esta indefenso

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Shared row locking
Alvaro Herrera [EMAIL PROTECTED] writes: On Sun, Dec 19, 2004 at 09:52:01AM +, Simon Riggs wrote: - use a lock mode called Cursor Stability that locks only those rows currently being examined by a cursor, those maintaining the lock usage of a cursor at a constant level as the cursor moves. The lock mode of Repeatable Read *does* lock all rows read We can't release the locks until the end of the transaction. Well, what Simon is essentially proposing is that we work around a potential performance issue by exposing a less-clean semantic model to users. I'd prefer to adopt such a thing as a last resort, not a first one. regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] Shared row locking
On Sun, 19 Dec 2004, Alvaro Herrera wrote:
> On Sun, Dec 19, 2004 at 09:52:01AM +, Simon Riggs wrote:
>
> Simon,
>
> > In similar circumstances, DB2 uses these techniques:
> > - when locktable X % full, then escalate locks to full table locks: both locktable memory and threshold% are instance parameters
>
> This is not useful at all, because the objective of this exercise is to downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to shared row locking. Keep in mind that this is for foreign key locking, which is one area where deadlocks are frequently encountered because we use too strong a lock.

Actually it might help in some scenarios. Remember, we're not talking about upgrading shared locks to exclusive locks. We're only talking about locking more rows than necessary (all rows). I believe DB2 does the escalation also for performance. Getting a full table lock is cheaper than individually locking every row.

- Heikki

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Shared row locking
Heikki Linnakangas [EMAIL PROTECTED] writes: On Sun, 19 Dec 2004, Alvaro Herrera wrote: This is not useful at all, because the objective of this exercise is to downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to shared row locking. Actually it might help in some scenarios. Remember, we're not talking about upgrading shared locks to exclusive locks. We're only talking about locking more rows than necessary (all rows). Nonetheless, it would mean that locks would be taken depending on implementation-dependent, not-visible-to-the-user considerations. Shared locks can still cause deadlocks, and so you would have an unreliable application, which would only be unreliable under load. As I said in connection with the other proposal, weird user-visible semantics should be the last resort not the first. regards, tom lane ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Shared row locking
On Sun, 19 Dec 2004, Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: On Sun, 19 Dec 2004, Alvaro Herrera wrote: This is not useful at all, because the objective of this exercise is to downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to shared row locking. Actually it might help in some scenarios. Remember, we're not talking about upgrading shared locks to exclusive locks. We're only talking about locking more rows than necessary (all rows). Nonetheless, it would mean that locks would be taken depending on implementation-dependent, not-visible-to-the-user considerations. Shared locks can still cause deadlocks, and so you would have an unreliable application, which would only be unreliable under load. As I said in connection with the other proposal, weird user-visible semantics should be the last resort not the first. I agree that lock escalation is not a good solution; we run into problems with DB2 lock escalation at work all the time. - Heikki
Re: [HACKERS] Shared row locking
On Sun, 2004-12-19 at 18:31, Alvaro Herrera wrote: Thanks for your ideas anyway. And keep having them! No problem. Just giving some info on what works and doesn't work in other implementations. -- Best Regards, Simon Riggs
Re: [HACKERS] Shared row locking
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: On Sun, 19 Dec 2004, Alvaro Herrera wrote: This is not useful at all, because the objective of this exercise is to downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to shared row locking. Actually it might help in some scenarios. Remember, we're not talking about upgrading shared locks to exclusive locks. We're only talking about locking more rows than necessary (all rows). Nonetheless, it would mean that locks would be taken depending on implementation-dependent, not-visible-to-the-user considerations. Shared locks can still cause deadlocks, and so you would have an unreliable application, which would only be unreliable under load. As I said in connection with the other proposal, weird user-visible semantics should be the last resort not the first. One idea is to stamp the same shared lock counter on multiple rows and just prevent another application from also sharing that lock if too many rows are locked. It would wait for the lock to be released. This prevents the problem of having to store thousands of shared lock counters. There would have to be some flag to say which shared counters are only for a single backend. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073
Re: [HACKERS] Shared row locking
On Sun, Dec 19, 2004 at 11:35:02PM +0200, Heikki Linnakangas wrote: On Sun, 19 Dec 2004, Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: On Sun, 19 Dec 2004, Alvaro Herrera wrote: This is not useful at all, because the objective of this exercise is to downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to shared row locking. Actually it might help in some scenarios. Remember, we're not talking about upgrading shared locks to exclusive locks. We're only talking about locking more rows than necessary (all rows). Nonetheless, it would mean that locks would be taken depending on implementation-dependent, not-visible-to-the-user considerations. Shared locks can still cause deadlocks, and so you would have an unreliable application, which would only be unreliable under load. As I said in connection with the other proposal, weird user-visible semantics should be the last resort not the first. I agree that lock escalation is not a good solution, we run into problems with DB2 lock escalation at work all the time. Does anyone know how Oracle deals with this? They use MVCC like PostgreSQL, so they'd be a better source for inspiration. -- Jim C. Nasby, Database Consultant [EMAIL PROTECTED] Give your computer some brain candy! www.distributed.net Team #1828 Windows: Where do you want to go today? Linux: Where do you want to go tomorrow? FreeBSD: Are you guys coming, or what?
Re: [HACKERS] Shared row locking
On Sat, 18 Dec 2004, Bruce Momjian wrote: Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: You mean all empty/zero rows can be removed? Can we guarantee that on commit we can clean up the bitmap? If not the idea doesn't work. For whatever data structure we use, we may reset the structure to empty during backend-crash recovery. So your objection boils down to what if a backend exits normally but forgets to clean up its locks? Assuming that doesn't happen isn't any worse than assuming a backend will clean up its shared memory state on non-crash exit, so I don't think it's a serious concern. That brings another thought: really what this is all about is working around the fact that the standard lock manager can only cope with a finite number of coexisting locks, because it's working in a fixed-size shared memory arena. Maybe we should instead think about ways to allow the existing lock table to spill to disk when it gets too big. That would eliminate max_locks_per_transaction as a source of hard failures, which would be a nice benefit. Agreed. One concern I have about allowing the lock table to spill to disk is that a large number of FOR UPDATE locks could push out lock entries used by other backends, causing very poor performance. I think if we allow the lock manager to spill to disk (and I think we do need to allow it) then we should also be able to control the amount of shared memory allocated. There's little point spilling the lock table to disk if we have huge amounts of memory. Gavin
Re: [HACKERS] Shared row locking
Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: You mean all empty/zero rows can be removed? Can we guarantee that on commit we can clean up the bitmap? If not the idea doesn't work. For whatever data structure we use, we may reset the structure to empty during backend-crash recovery. So your objection boils down to what if a backend exits normally but forgets to clean up its locks? Assuming that doesn't happen isn't any worse than assuming a backend will clean up its shared memory state on non-crash exit, so I don't think it's a serious concern. That brings another thought: really what this is all about is working around the fact that the standard lock manager can only cope with a finite number of coexisting locks, because it's working in a fixed-size shared memory arena. Maybe we should instead think about ways to allow the existing lock table to spill to disk when it gets too big. That would eliminate max_locks_per_transaction as a source of hard failures, which would be a nice benefit. Agreed. One concern I have about allowing the lock table to spill to disk is that a large number of FOR UPDATE locks could push out lock entries used by other backends, causing very poor performance. -- Bruce Momjian
[HACKERS] Shared row locking
Hi, I've been thinking about how to do shared row locking. Here are some very preliminary ideas on this issue. Please comment; particularly if any part of it sounds unworkable or too incomplete. There are several problems to be solved here: the grammar, the internal SelectStmt representation, how to store and share the info between backends, how to clean up at transaction end, and how to clean up at backend crash. The Grammar === The SQL spec does not say anything in this respect (that I can find). It only talks of FOR UPDATE and FOR READ ONLY. However, because the FK code uses SPI to do the locking, we definitely have to expose the functionality through SQL. So I think we need a new clause, which I propose to be FOR SHARE. The Parser and SelectStmt = The parser uses for_update_clause and opt_for_update_clause nonterminals. I assume it's best to change them to a (new) locking_clause, which can in turn be for_update_clause or a (new) for_share_clause. SelectStmt currently has a forUpdate field (a List of the to-be-locked tables, or an empty list meaning all of them). We could simply add another list, say forShare, or use a common list and a flag saying that it's one or the other. I prefer adding a new list. (Same with the Query node.) How to Store the Info = This is the really interesting part. I have two ideas, one mine (btree) and the other Bruce's (bitmap). Using a B-tree -- When a backend wants to lock a tuple, it sets a bit in its infomask. Then it inserts into a btree in a special tablespace, using RelationId-BlockNumber-OffsetNumber as key, and BackendId-TransactionId as value; actually, an array with a single element containing those two values. When a backend wants to lock a tuple that is already locked, it goes to the btree and inserts itself into the array. To do this, it inserts a new index item (the enlarged array) and deletes the previous one.
No other backend can try to insert simultaneously (which would cause an ugly race condition), because we hold an exclusive lock on the tuple's heap page's buffer. At transaction end, nothing special happens (tuples are not unlocked explicitly). When someone wants to know if the tuple is locked (to mark it FOR UPDATE, or to delete it), it checks the infomask. If it says it's locked, it goes to check the btree. If the array contains only BackendId-TransactionId pairs that are no longer running, then the tuple is not locked and can be deleted/marked (and the btree can be cleaned up). Else, it will have to wait, using XactLockTableWait, for the first transaction in the array that is still running. We can be sure that no one will try to share-lock the tuple while we check the btree because we hold an exclusive lock on the tuple's heap page's buffer. Note that to check whether a transaction is running we need to lock SInvalLock. To minimize the time we hold it, we save the BackendId so we don't have to scan the whole shmInvalBuffer->procState array, only the item that we need to look at. Another possibility would be to use stock TransactionIdIsInProgress and save the extra 4 bytes of storage. At server restart, the btree is created empty (or just deleted). There is one btree per database. Using a Bitmap -- First we create a counter called the shared lock row counter. Then we create a file like pg_clog, and each counter slot has a bit for every backend. When we want to share-lock a row we increment the counter and put that counter value on the row, and set our backend bit in the new file. We also store that counter value in our backend local memory. On commit we go through that local memory list and reset all our bits for those counters. When a row has all zeros, it can be recycled like we do with pg_clog. Problems and random comments There is a possibility of starvation if somebody wants to exclusively lock a tuple and shared lockers keep arriving.
Not sure how to solve this. The wakeup mechanism is not discussed ... is there a special need for something beyond what we can do with XactLockTable{Insert,Wait} ? Thanks to tablespaces, it's very easy to create special Relations that can be dealt with by the standard buffer and storage manager, etc. The btree idea: - does not need crash recovery. Maybe we could use a stripped-down version of nbtree. This could cause a maintainability nightmare. - can't hold more than 300 or so simultaneous lockers (because of value length, limited to 1/3 of a page). I doubt this is a real problem. - could have problems (excessive storage requirements) in the long run because of empty or almost-empty pages. The bitmap idea: - seems to have lower overhead - can use the same lazy cleanup mechanism proposed for the btree idea (in which case we don't need the list in local memory). - What can happen in the presence of large max_connections settings? Is this a real problem? -- Alvaro
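For concreteness, the btree entries described above could look roughly like the following C sketch. All type and field names here are illustrative, not actual PostgreSQL types; the sizes assume the standard 8 kB page and the 1/3-of-a-page index-item limit mentioned in the proposal.

```c
#include <assert.h>
#include <stdint.h>

/* Key: identifies the locked heap tuple. */
typedef struct ShareLockTag {
    uint32_t relId;      /* RelationId  */
    uint32_t blockNum;   /* BlockNumber */
    uint16_t offsetNum;  /* OffsetNumber */
} ShareLockTag;

/* One element of the value array: a current share-locker.  The
 * BackendId is stored so a checker can probe a single procState slot
 * instead of scanning all of shmInvalBuffer->procState under
 * SInvalLock. */
typedef struct ShareLocker {
    int32_t  backendId;  /* BackendId */
    uint32_t xid;        /* TransactionId */
} ShareLocker;

/* An index item may not exceed roughly 1/3 of an 8 kB page, which is
 * where the "300 or so simultaneous lockers" ceiling comes from. */
static int max_lockers(void)
{
    return (8192 / 3) / (int) sizeof(ShareLocker);
}
```

With these sizes, (8192/3)/8 gives 341 lockers, matching the "300 or so" estimate; dropping the BackendId in favor of stock TransactionIdIsInProgress, as suggested, would roughly double that.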
Re: [HACKERS] Shared row locking
Alvaro Herrera wrote: The btree idea: - does not need crash recovery. Maybe we could use a stripped-down version of nbtree. This could cause a maintainability nightmare. Are you saying the btree is an index with no heap? If so, what about the xid's? Are they just in the btree? How does the btree get cleaned up over time? The bitmap idea: - seems to have lower overhead - can use the same lazy cleanup mechanism proposed for the btree idea (in which case we don't need the list in local memory). You mean all empty/zero rows can be removed? Can we guarantee that on commit we can clean up the bitmap? If not, the idea doesn't work. - What can happen in the presence of large max_connections settings? Is this a real problem? I thought about that. 50 backends is 7 bytes, 1000 backends is 128 bytes. For a large number of backends you could just allow X concurrent locks and use space X*4 bytes. I think the basic issue is that the btree can be of variable length while the bitmap has to be of a fixed length. -- Bruce Momjian
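The per-counter storage figures above are just bits-to-bytes arithmetic, sketched below with hypothetical helper names. The raw bit-per-backend count gives the quoted 7 bytes for 50 backends; 1000 backends is 125 bytes raw, so the quoted 128 presumably includes rounding the slot up to 8-byte words.

```c
#include <assert.h>

/* One bit per backend, rounded up to whole bytes:
 * 50 backends -> 7 bytes, 1000 backends -> 125 bytes. */
static int bitmap_bytes(int max_connections)
{
    return (max_connections + 7) / 8;
}

/* Rounding each slot up to 8-byte words accounts for the quoted
 * 128-byte figure at 1000 backends. */
static int bitmap_bytes_aligned(int max_connections)
{
    int raw = (max_connections + 7) / 8;
    return ((raw + 7) / 8) * 8;
}
```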
Re: [HACKERS] Shared row locking
The SQL spec does not say anything in this respect (that I can find). It only talks of FOR UPDATE and FOR READ ONLY. However, because the FK code uses SPI to do the locking, we definitely have to expose the functionality through SQL. So I think we need a new clause, which I propose to be FOR SHARE. MySQL uses LOCK IN SHARE MODE: http://dev.mysql.com/doc/mysql/en/InnoDB_locking_reads.html
Re: [HACKERS] Shared row locking
Alvaro Herrera [EMAIL PROTECTED] writes: Using a B-tree At transaction end, nothing special happens (tuples are not unlocked explicitly). I don't think that works, because there is no guarantee that an entry will get cleaned out before the XID counter wraps around. Worst case, you might think that a tuple is locked when the XID is left over from the previous cycle. (Possibly this could be avoided by cleaning out old XIDs in this table whenever we truncate pg_clog, but that seems a tad messy.) I'm also a bit concerned about how we avoid table bloat if there's no proactive cleanup at transaction end. I think I like the pg_clog-modeled structure a bit better. However it could be objected that that puts a hard limit of 4G share-locked tuples at any one time. In the clog-modeled idea, it wasn't real clear how you decide whether to assign a new counter value to a previously locked row, or reuse its previous counter. You must *not* assign a new value when the existing entry still has bits set, but you probably do want to be aggressive about assigning new values when you can; else it gets tough to be sure that the log can be truncated in a reasonable time. ISTM that your description is conflating several orthogonal issues: how do we identify entries in this data structure (by CTID, or a shared counter that increments each time a new lock is acquired); how do we index the data structure (btree or linear array); and what is stored in each entry (array of XIDs, or bitmap indexed by BackendId). Not all of the eight combinations work, but we do have more alternatives than the two offered, even without coming up with any new ideas ;-) Note that to check whether a transaction is running we need to lock SInvalLock. To minimize the time we hold it, we save the BackendId so we don't have to scan the whole shmInvalBuffer->procState array, only the item that we need to look at. Another possibility would be to use stock TransactionIdIsInProgress and save the extra 4 bytes of storage.
I'm a bit worried about deadlocks and race conditions associated with the conflict between locking a page of this data structure and locking SInvalLock. At server restart, the btree is created empty (or just deleted). There is one btree per database. One per cluster you meant, right? (Else we can't do locking of rows in shared tables.) regards, tom lane
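The wraparound worry is concrete: PostgreSQL compares 32-bit XIDs circularly (in the style of TransactionIdPrecedes), so "older" is only meaningful within a 2^31-transaction window. A sketch of that comparison shows why a never-cleaned entry eventually flips from "in the past" to "in the future":

```c
#include <assert.h>
#include <stdint.h>

/* Circular XID comparison in the style of TransactionIdPrecedes:
 * a is considered older than b iff the signed 32-bit difference is
 * negative, i.e. only within a 2^31 window. */
static int xid_precedes(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}
```

Once the XID counter has advanced more than ~2 billion transactions past a stale entry, that entry's XID no longer compares as old, which is exactly the failure mode described; hence the suggestion to purge old XIDs whenever pg_clog is truncated.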
Re: [HACKERS] Shared row locking
Bruce Momjian [EMAIL PROTECTED] writes: You mean all empty/zero rows can be removed? Can we guarantee that on commit we can clean up the bitmap? If not the idea doesn't work. For whatever data structure we use, we may reset the structure to empty during backend-crash recovery. So your objection boils down to what if a backend exits normally but forgets to clean up its locks? Assuming that doesn't happen isn't any worse than assuming a backend will clean up its shared memory state on non-crash exit, so I don't think it's a serious concern. That brings another thought: really what this is all about is working around the fact that the standard lock manager can only cope with a finite number of coexisting locks, because it's working in a fixed-size shared memory arena. Maybe we should instead think about ways to allow the existing lock table to spill to disk when it gets too big. That would eliminate max_locks_per_transaction as a source of hard failures, which would be a nice benefit. regards, tom lane
Re: [HACKERS] Shared row locking
Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Using a B-tree At transaction end, nothing special happens (tuples are not unlocked explicitly). I don't think that works, because there is no guarantee that an entry will get cleaned out before the XID counter wraps around. Worst case, you might think that a tuple is locked when the XID is left over from the previous cycle. (Possibly this could be avoided by cleaning out old XIDs in this table whenever we truncate pg_clog, but that seems a tad messy.) I'm also a bit concerned about how we avoid table bloat if there's no proactive cleanup at transaction end. I think I like the pg_clog-modeled structure a bit better. However it could be objected that that puts a hard limit of 4G share-locked tuples at any one time. In the clog-modeled idea, it wasn't real clear how you decide whether to assign a new counter value to a previously locked row, or reuse its previous counter. You must *not* assign a new value when the existing entry still has bits set, but you probably do want to be aggressive about assigning new values when you can; else it gets tough to be sure that the log can be truncated in a reasonable time. I assume you check, and if all the bits are zero, you don't reuse it but get a new counter. In fact you shouldn't reuse it in case the log is being truncated while you are looking. :-) ISTM that your description is conflating several orthogonal issues: how do we identify entries in this data structure (by CTID, or a shared counter that increments each time a new lock is acquired); how do we index the data structure (btree or linear array); and what is stored in each entry (array of XIDs, or bitmap indexed by BackendId). Not all of the eight combinations work, but we do have more alternatives than the two offered, even without coming up with any new ideas ;-) True. The only advantage to a bitmap vs.
just a counter of locked backends is that you can clean out your own backend bits from the table without having to record them in your memory. However, because recording your own counters in local memory doesn't require fixed shared memory, we might be better off just recording the shared lock indexes in your local backend memory and using an int4 counter in the pg_clog-like file that we can decrement on backend commit. However, I am unclear whether we can guarantee an exiting backend will do that. Certainly it is cleared on server start. Note that to check whether a transaction is running we need to lock SInvalLock. To minimize the time we hold it, we save the BackendId so we don't have to scan the whole shmInvalBuffer->procState array, only the item that we need to look at. Another possibility would be to use stock TransactionIdIsInProgress and save the extra 4 bytes of storage. I'm a bit worried about deadlocks and race conditions associated with the conflict between locking a page of this data structure and locking SInvalLock. At server restart, the btree is created empty (or just deleted). There is one btree per database. One per cluster you meant, right? (Else we can't do locking of rows in shared tables.) He meant one per database, I think. I suppose we would need another one for global tables or disallow shared locking of them. -- Bruce Momjian
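The counter alternative above reduces each pg_clog-like slot to a plain reference count. A minimal sketch, with illustrative names (not PostgreSQL code), of the acquire/decrement/recycle cycle:

```c
#include <assert.h>
#include <stdint.h>

#define NSLOTS 16  /* stand-in for a page of pg_clog-style counters */

/* Zeroed at server start, matching "it is cleared on server start". */
static int32_t slots[NSLOTS];

/* The backend remembers idx in local memory so it can decrement at
 * commit; a crash simply resets the whole file, so only clean exits
 * need to be trusted to run the decrement. */
static void share_lock(int idx)      { slots[idx]++; }
static void share_unlock(int idx)    { slots[idx]--; }
static int  slot_recyclable(int idx) { return slots[idx] == 0; }
```

The open question in the message stands: nothing in this scheme forces a backend that exits without crashing to run share_unlock, which is why the bitmap variant (where a backend's bits can be cleared by anyone) was attractive.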