Re: [HACKERS] Shared row locking, revisited

2005-04-07 Thread Qingqing Zhou

Alvaro Herrera [EMAIL PROTECTED] writes
 Because we can't reuse MultiXactIds at system crash (else we risk taking
 an Id which is already stored in some tuple), we need to XLog it.  Not
 at the locking operation, because we don't want to log that one (too
 expensive.)  We can log the current value of MultiXactId counter once in
 a while; say, once every 1000 acquisitions.  Following a crash, the
 value is restored to the last one logged + 1000.  (I think this can be a
 problem if we log one acquisition, then write some tuples, and then
 crash, without flushing the acquisition logged.  Maybe a solution is to
 force a flush after logging an acquisition.)


Does Oid have a similar problem?

Regards,
Qingqing



---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Shared row locking, revisited

2005-04-07 Thread Alvaro Herrera
On Thu, Apr 07, 2005 at 02:34:03PM +0800, Qingqing Zhou wrote:
 
 Alvaro Herrera [EMAIL PROTECTED] writes
  Because we can't reuse MultiXactIds at system crash (else we risk taking
  an Id which is already stored in some tuple), we need to XLog it.  Not
  at the locking operation, because we don't want to log that one (too
  expensive.)  We can log the current value of MultiXactId counter once in
  a while; say, once every 1000 acquisitions.  Following a crash, the
  value is restored to the last one logged + 1000.  (I think this can be a
  problem if we log one acquisition, then write some tuples, and then
  crash, without flushing the acquisition logged.  Maybe a solution is to
  force a flush after logging an acquisition.)
 
 Does Oid have a similar problem?

Good question.  Yes, and in fact the solution is similar; see
XLogPutNextOid().  I think we could do the same for MultiXactId.

-- 
Alvaro Herrera ([EMAIL PROTECTED])
La soledad es compañía



[HACKERS] Shared row locking, revisited

2005-04-06 Thread Alvaro Herrera
Hackers,

Again I come back to this topic.  Now I have a better hold on an idea
that is hopefully implementable and doesn't have expensive performance
penalties.

Forget the idea of using the regular lock manager directly on tuples.
It's folly because we can't afford to have that many locks.  Instead we
will lock on the transactions that are involved in a shared row lock.

We will have a cluster-wide counter, call it MultiXactId for the sake of
this explanation.  When someone wants to share lock an (unlocked) tuple,
he creates a new MultiXactId, a bit is set in t_infomask
(HEAP_XMAX_FOR_SHARE, say), and the tuple's Xmax is set to the new
MultiXactId.

We will have two additional SLRU areas; one will contain offsets to the
other, so that the second can store variable length arrays.  In the
first (call it pg_offsets) we will use the MultiXactId as key, and the
last offset of the second as value.  The second SLRU area (call it
pg_multixact_members) is accessed using the offset from pg_offsets, and
contains the members of the respective MultiXactId.

So, returning to the locker: it grabbed the MultiXactId; it records an
offset of (previous offset + 1) in pg_offsets, and its own TransactionId
in that offset of pg_multixact_members.

This was a locker of an unlocked tuple.  But if it finds that the tuple
is already locked, it gets the MultiXactId from the tuple, goes to
pg_offsets and from there to pg_multixact_members; it then knows what
transactions are locking this tuple.  It can check which of those are
still running and discard those that aren't.  So at this point it has a
list of TransactionIds, to which it adds itself, and stores it as a new
MultiXactId following the procedure outlined above.
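The bookkeeping above can be sketched as a toy model (plain Python lists standing in for the two SLRU areas; only the names pg_offsets and pg_multixact_members come from the text, everything else is invented for illustration, and this sketch stores end offsets rather than following the exact offset arithmetic described):

```python
# Toy model of the two SLRU areas.  pg_offsets[mxid] holds the end
# offset of that MultiXactId's member array in pg_multixact_members,
# so the members of mxid occupy the slice
# [pg_offsets[mxid - 1], pg_offsets[mxid]).

pg_offsets = [0]          # index = MultiXactId; slot 0 reserved
pg_multixact_members = [] # flat array of member TransactionIds

def create_multixact(member_xids):
    """Allocate a new MultiXactId whose members are member_xids."""
    pg_multixact_members.extend(member_xids)
    pg_offsets.append(len(pg_multixact_members))
    return len(pg_offsets) - 1   # the new MultiXactId

def get_members(mxid):
    """Return the TransactionIds belonging to mxid."""
    return pg_multixact_members[pg_offsets[mxid - 1]:pg_offsets[mxid]]
```

In these terms, a locker that finds the tuple already locked would call get_members(), discard the transactions no longer running, add its own TransactionId, and store the result with create_multixact().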

Possible optimization: keep a list of the MultiXactIds the current
TransactionId is member of (this can be done in local memory).  If at
some point it is going to create a new MultiXactId first check this list
to see if the same members can be found, and reuse the MultiXactId if
so.  Thus we save a MultiXactId (useful trick for when somebody wants to
lock a whole table, for example).

Sleeping is done using XactLockTableWait() on each of the members of the
MultiXactId.  If it wakes after sleeping on all of them only to find
that the MultiXactId has changed, it will have to sleep again.  This can
cause starvation if more shared lockers come while it is sleeping.  Not
sure how to solve this (and I think it's an important problem to solve.)
One way would be locking the MultiXactId itself so that nobody can
change it.
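The wait-and-retry behaviour can be simulated as follows (a toy sketch, not backend code: the tuple is a plain dict, and xact_lock_table_wait stands in for XactLockTableWait).  The pass count it returns is exactly the starvation risk described above, since new shared lockers can force extra passes indefinitely:

```python
def wait_for_multixact(tuple_ref, get_members, xact_lock_table_wait):
    """Sleep on each member of the tuple's MultiXactId; if the
    MultiXactId changed while we slept, start over.  Returns the
    number of full passes made."""
    passes = 0
    while True:
        passes += 1
        mxid = tuple_ref["xmax"]
        for xid in get_members(mxid):
            xact_lock_table_wait(xid)     # blocks until xid finishes
        if tuple_ref["xmax"] == mxid:     # unchanged: all members done
            return passes
```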


Note that we do this only for shared locks.  For exclusive lock, using
the TransactionId as is done in the current code is OK.


At transaction start, the current value of the current MultiXactId
variable is saved.  For cleanup, we can truncate the pg_offsets table at
the MultiXactId previous to the youngest one saved; and
pg_multixact_members, at whatever pg_offsets has in the truncating
position.

Because we can't reuse MultiXactIds at system crash (else we risk taking
an Id which is already stored in some tuple), we need to XLog it.  Not
at the locking operation, because we don't want to log that one (too
expensive.)  We can log the current value of MultiXactId counter once in
a while; say, once every 1000 acquisitions.  Following a crash, the
value is restored to the last one logged + 1000.  (I think this can be a
problem if we log one acquisition, then write some tuples, and then
crash, without flushing the acquisition logged.  Maybe a solution is to
force a flush after logging an acquisition.)
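In miniature, the log-once-per-N scheme might look like this (a simulation; XLOG_STEP and all names are invented, and the assignment to last_logged stands in for writing and flushing the WAL record, which is the forced flush the parenthesis above calls for):

```python
XLOG_STEP = 1000

class MultiXactCounter:
    def __init__(self):
        self.next_mxid = 1
        self.last_logged = 0   # value most recently logged (and flushed)

    def advance(self):
        """Hand out the next MultiXactId, logging the counter once per
        XLOG_STEP acquisitions rather than on every acquisition."""
        mxid = self.next_mxid
        self.next_mxid += 1
        if mxid >= self.last_logged + XLOG_STEP:
            self.last_logged = mxid   # stands in for XLog + flush
        return mxid

    def recover(self):
        """After a crash, resume past every id possibly already used."""
        self.next_mxid = self.last_logged + XLOG_STEP
```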


Comments are welcome.

-- 
Alvaro Herrera ([EMAIL PROTECTED])
El miedo atento y previsor es la madre de la seguridad (E. Burke)



Re: [HACKERS] Shared row locking

2004-12-30 Thread Manfred Koizar
On Wed, 29 Dec 2004 19:57:15 -0500, Tom Lane [EMAIL PROTECTED] wrote:
Manfred Koizar [EMAIL PROTECTED] writes:
 I don't see too much of a difference between #1 (an on-disk structure
 buffered in shared memory) and #2 (a shared memory structure spilling
 to disk).

If you stand back that far, maybe you can't see a difference ;-) ...

Well, I tried to step back a bit to see the whole picture.  You think I
went too far this time?  :-)

but the proposal on the table was for an extremely special-purpose
on-disk structure.  I'd prefer to see first if we can solve a more
general problem, namely the fixed size of the existing lock table.

I haven't been digging into the code for some time (:-() -- but isn't
this basically a key/value mapping with random access?  My concern is
that as soon as you start to push entries out of the memory structure
you have to implement some kind of on-disk structure to support fast
lookups, which sooner or later might lead to something that looks
suspiciously like the existing hash or btree code.

So the question is whether starting by making nbtree more flexible isn't
the lower hanging fruit...

Servus
 Manfred



Re: [HACKERS] Shared row locking

2004-12-30 Thread Tom Lane
Manfred Koizar [EMAIL PROTECTED] writes:
 So the question is whether starting by making nbtree more flexible isn't
 the lower hanging fruit...

Certainly not; indexes depend on locks, not vice versa.  You'd not be
able to do that without introducing an infinite recursion into the
system design.  In any case nbtree is much more heavyweight than we need
for this --- the lock table does not want WAL support for example, nor
REINDEX/VACUUM, nor support for arbitrary index lookup conditions, nor
even multiple datatypes or multiple index columns.

regards, tom lane



Re: [HACKERS] Shared row locking

2004-12-30 Thread Manfred Koizar
On Thu, 30 Dec 2004 13:36:53 -0500, Tom Lane [EMAIL PROTECTED] wrote:
Certainly not; indexes depend on locks, not vice versa.  You'd not be
able to do that without introducing an infinite recursion into the
system design.

Wouldn't you have to face the same sort of problems if you spill part of
the lock table to disk?  While you do I/O you have to hold some lock.
In either case there has to be a special class of locks that are pinned
in memory.

  In any case nbtree is much more heavyweight than we need
for this

Having functionality we don't need is not a showstopper ... unless
heavyweight implies slow, which I have to admit may well be the case.

Servus
 Manfred



Re: [HACKERS] Shared row locking

2004-12-30 Thread Tom Lane
Manfred Koizar [EMAIL PROTECTED] writes:
 On Thu, 30 Dec 2004 13:36:53 -0500, Tom Lane [EMAIL PROTECTED] wrote:
 Certainly not; indexes depend on locks, not vice versa.  You'd not be
 able to do that without introducing an infinite recursion into the
 system design.

 Wouldn't you have to face the same sort of problems if you spill part of
 the lock table to disk?  While you do I/O you have to hold some lock.

See LWLocks ... or spinlocks underneath those.  But (some) operations on
tables and indexes make use of heavyweight locks.

regards, tom lane



Re: [HACKERS] Shared row locking

2004-12-29 Thread Manfred Koizar
On Thu, 16 Dec 2004 21:54:14 -0300, Alvaro Herrera
[EMAIL PROTECTED] wrote:
  Else, it will have to wait, using XactLockTableWait, for
the first transaction in the array that is still running.  We can be
sure that no one will try to share-lock the tuple while we check the
btree because we hold an exclusive lock on the tuple's heap page's
buffer.

Do you really want to XactLockTableWait while holding an exclusive
lock on a heap page?  Or are you going to release the lock before
entering the wait?

Thanks to tablespaces, it's very easy to create special Relations that
can be dealt with by standard buffer and storage manager, etc.

I don't get how tablespaces would help.

The btree idea:
- does not need crash recovery.  Maybe we could use a stripped down
  version of nbtree.  This could cause a maintainability nightmare.

It could be a nightmare if you simply duplicate and then modify the
code.  A better way would be to refactor nbtree to be able to handle
btrees with different properties:

. shared/private

. logged/non-logged

. flexible value data type.

- can't hold more than 300 or so simultaneous lockers (because of value
  length, limited to 1/3 of a page).  I doubt this is a real problem.

Or you store each lock as a separate index tuple.  If this is expected
to be a problem because of many repeating key values, nbtree should be
enhanced to support duplicate key compression, which might be a good
thing per se.

Servus
 Manfred




Re: [HACKERS] Shared row locking

2004-12-29 Thread Manfred Koizar
On Mon, 20 Dec 2004 21:44:01 +0100, [EMAIL PROTECTED] wrote:
Tom Lane [EMAIL PROTECTED] wrote on 20.12.2004, 19:34:21:
 #1 could have a pretty serious performance impact, too.  For small
 numbers of FOR UPDATE locks (too few to force spill to disk) I would
 expect #2 to substantially beat #1.  #1 essentially imposes the worst
 case performance at all times, whereas #2 degrades (at a currently
 unknown rate) when there are lots and lots of FOR UPDATE locks.

I don't see too much of a difference between #1 (an on-disk structure
buffered in shared memory) and #2 (a shared memory structure spilling
to disk).

As long as we can avoid unnecessary writes, #1 has the nice property
that it automatically adapts to different usage patterns because it
uses the normal shared buffer pool and cache replacement strategy.

[My gut feeling would be against another permanent on-disk structure,
since this is one more thing for a user to delete to save space
etc...]

Irrelevant, IMHO.  Whoever manually deletes any file under PGDATA
deserves whatever this may cause.

I haven't seen many programs that use
extended SELECT FOR UPDATE logic.

AFAICS, SELECT FOR UPDATE is not the primary issue here, because it
locks rows exclusively.  This thread is about shared locks, which
should be used for foreign key checks.

Servus
 Manfred




Re: [HACKERS] Shared row locking

2004-12-29 Thread Tom Lane
Manfred Koizar [EMAIL PROTECTED] writes:
 I don't see too much of a difference between #1 (an on-disk structure
 buffered in shared memory) and #2 (a shared memory structure spilling
 to disk).

If you stand back that far, maybe you can't see a difference ;-) ...
but the proposal on the table was for an extremely special-purpose
on-disk structure.  I'd prefer to see first if we can solve a more
general problem, namely the fixed size of the existing lock table.
It's certainly possible that that idea won't work out, but if it can be
done for a similar time investment as the special-purpose log would
take, it'd be a win.

regards, tom lane



Re: [HACKERS] Shared row locking

2004-12-21 Thread Zeugswetter Andreas DAZ SD

 In general, I agree with Tom: I haven't seen many programs that use
 extended SELECT FOR UPDATE logic. However, the ones I have seen have
 been batch style programs written using a whole-table cursor - these
 latter ones have been designed for the cursor stability approach.

I think if we add shared locks we should by default behave like the
cursor stability isolation level, which holds only one shared lock for
the current cursor row. The semantics are well defined in SQL.
If you want repeatable read you need to change isolation level.
I know FK checks will need to keep the locks, but I would special case 
that.

Andreas



Re: [HACKERS] Shared row locking

2004-12-21 Thread Jim C. Nasby
On Mon, Dec 20, 2004 at 03:09:24PM -0300, Alvaro Herrera wrote:
 To solve the problem I want to solve, we have three orthogonal
 possibilities:
 
 1. implement shared row locking using the ideas outlined in the mail
 starting this thread (pg_clog-like seems to be the winner, details TBD).
 
 2. implement shared lock table spill-to-disk mechanism.
 
 3. implement lock escalation.
 
 
 Some people seem to think 3 is better than 2.  What do they think of 1?
 
 
 Some facts:
 
 - DB2 implements 3 and some people have problems with deadlocks.

FWIW, I have seen DB2 deadlock on a single row update statement in a
database with no one else connected. It was an issue with how they were
implementing repeatable read concurrency. What this indicates to me is
that they've got some 'issues' with their locking mechanisms, and that
#3 shouldn't be thrown out just because DB2 doesn't do it right. AFAIK
Sybase and SQL-server also use lock escalation and I've not heard of
issues with it.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?



Re: [HACKERS] Shared row locking

2004-12-20 Thread Simon Riggs
On Mon, 2004-12-20 at 06:34, Jim C. Nasby wrote:
 On Sun, Dec 19, 2004 at 11:35:02PM +0200, Heikki Linnakangas wrote:
  On Sun, 19 Dec 2004, Tom Lane wrote:
  
  Heikki Linnakangas [EMAIL PROTECTED] writes:
  On Sun, 19 Dec 2004, Alvaro Herrera wrote:
  This is not useful at all, because the objective of this exercise is to
  downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
  shared row locking.
  
  Actually it might help in some scenarios. Remember, we're not talking
  about upgrading shared locks to exclusive locks. We're only talking about
  locking more rows than necessary (all rows).
  
  Nonetheless, it would mean that locks would be taken depending on
  implementation-dependent, not-visible-to-the-user considerations.
  Shared locks can still cause deadlocks, and so you would have an
  unreliable application, which would only be unreliable under load.
  
  As I said in connection with the other proposal, weird user-visible
  semantics should be the last resort not the first.
  
  I agree that lock escalation is not a good solution, we run into problems 
  with DB2 lock escalation at work all the time.
 
 Does anyone know how Oracle deals with this? They use MVCC like
 PostgreSQL, so they'd be a better source for inspiration.

Oracle only uses MVCC in its widest sense - versioning info is stored in
UNDO tablespaces (rollback segments). That implementation is covered by
aggressive patent attorneys.

Oracle implements locking at row level within each data block. The block
header expands dynamically to accommodate a list of transactions that
can access, with minimum and maximum sizes settable by the DBA. This
works reasonably well.

Each SELECT FOR UPDATE is actually a block-write, whether or not the
rows are modified, which has some additional code to recover from this
without crashing/redo. Later transactions end up cleaning up the lock
header info (which later became a problem in Parallel Server).

https://cwisdb.cc.kuleuven.ac.be/ora10doc/server.101/b10743/consist.htm

-- 
Best Regards, Simon Riggs




Re: [HACKERS] Shared row locking

2004-12-20 Thread Alvaro Herrera
On Mon, Dec 20, 2004 at 06:23:24PM +1100, Gavin Sherry wrote:
 On Sat, 18 Dec 2004, Bruce Momjian wrote:

  Agreed. One concern I have about allowing the lock table to spill to
  disk is that a large number of FOR UPDATE locks could push out lock
  entries used by other backends, causing very poor performance.
 
 I think if we allow the lock manager to spill to disk (and I think we do
 need to allow it) then we should also be able to control the amount of
 shared memory allocated. There's little point spilling the lock table to
 disk if we have huge amounts of memory.

This is an interesting idea.

Gavin also mentioned to me we should also control the amount of memory
the shared inval queue uses.  Causing all backends to refill the cache
is (I assume) a big performance hit.

Maybe we should expose this via new knobs in postgresql.conf, to ease
implementation, or maybe not, to spare users from having to configure it.

As a start, we could have WARNINGs when the lock table is spilled and
when a SInvalReset occurs.  So the user can know whether he should
increase memory use for those settings.

-- 
Alvaro Herrera ([EMAIL PROTECTED])
Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end. (2nd Commandment for C programmers)



Re: [HACKERS] Shared row locking

2004-12-20 Thread Tom Lane
Gavin Sherry [EMAIL PROTECTED] writes:
 I think if we allow the lock manager to spill to disk (and I think we do
 need to allow it) then we should also be able to control the amount of
 shared memory allocated.

You mean like max_locks_per_transaction?

regards, tom lane



Re: [HACKERS] Shared row locking

2004-12-20 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Gavin also mentioned to me we should also control the amount of memory
 the shared inval queue uses.

Perhaps, but I've really seen no evidence that there's a need to worry
about that.  Without demonstrated problems I'd sooner keep that code a
bit simpler and faster.

regards, tom lane



Re: [HACKERS] Shared row locking

2004-12-20 Thread Merlin Moncure
Tom Lane wrote:
 Gavin Sherry [EMAIL PROTECTED] writes:
  I think if we allow the lock manager to spill to disk (and I think
  we do need to allow it) then we should also be able to control the
  amount of shared memory allocated.
 
 You mean like max_locks_per_transaction?

IMO, max_locks_per_transaction could use a better name and a little more
documentation.  I've mentioned this a couple of times before, but there
is at least one type of lock that does not expire when the transaction
ends (user locks).

I may be over my head here, but I think lock spillover is dangerous.  In
the extreme situations where this would happen, it would be a real
performance buster.  Personally, I would rather see locks escalate when
the lock table gets full, or at least allow this as a configuration
parameter. 

Merlin




Re: [HACKERS] Shared row locking

2004-12-20 Thread Tom Lane
Merlin Moncure [EMAIL PROTECTED] writes:
 I may be over my head here, but I think lock spillover is dangerous.  In
 the extreme situations where this would happen, it would be a real
 performance buster.  Personally, I would rather see locks escalate when
 the lock table gets full, or at least allow this as a configuration
 parameter. 

To me, "performance buster" is better than random, unrepeatable
deadlock failures.  In any case, if we find we *can't* implement this
in a non-performance-busting way, then it would be time enough to look
at alternatives that force the user to manage the problem for us.

regards, tom lane



Re: [HACKERS] Shared row locking

2004-12-20 Thread Alvaro Herrera
On Mon, Dec 20, 2004 at 11:47:41AM -0500, Tom Lane wrote:

 To me, "performance buster" is better than random, unrepeatable
 deadlock failures.  In any case, if we find we *can't* implement this
 in a non-performance-busting way, then it would be time enough to look
 at alternatives that force the user to manage the problem for us.

I am confused by this discussion.

To solve the problem I want to solve, we have three orthogonal
possibilities:

1. implement shared row locking using the ideas outlined in the mail
starting this thread (pg_clog-like seems to be the winner, details TBD).

2. implement shared lock table spill-to-disk mechanism.

3. implement lock escalation.


Some people seem to think 3 is better than 2.  What do they think of 1?


Some facts:

- DB2 implements 3 and some people have problems with deadlocks.

- 2 could have a performance impact, and we don't even know how to
  start.  For example, what would be an algorithm to decide what locks
  to send to disk?

- I am interested in implementing 1, maybe 2.  Don't know about 3.

-- 
Alvaro Herrera ([EMAIL PROTECTED])
One man's impedance mismatch is another man's layer of abstraction.
(Lincoln Yeoh)



Re: [HACKERS] Shared row locking

2004-12-20 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 To solve the problem I want to solve, we have three orthogonal
 possibilities:

 1. implement shared row locking using the ideas outlined in the mail
 starting this thread (pg_clog-like seems to be the winner, details TBD).

 2. implement shared lock table spill-to-disk mechanism.

 3. implement lock escalation.

Check.

 - 2 could have a performance impact, and we don't even know how to
   start.  For example, what would be an algorithm to decide what locks
   to send to disk?

LRU, perhaps?  That's all open for investigation still.

#1 could have a pretty serious performance impact, too.  For small
numbers of FOR UPDATE locks (too few to force spill to disk) I would
expect #2 to substantially beat #1.  #1 essentially imposes the worst
case performance at all times, whereas #2 degrades (at a currently
unknown rate) when there are lots and lots of FOR UPDATE locks.
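The "LRU, perhaps?" idea above could be sketched as follows (pure illustration; the capacity parameter, the dict-based spill store, and all names are assumptions, not the actual lock manager API):

```python
from collections import OrderedDict

class SpillingLockTable:
    """Fixed-size in-memory lock table that pushes the least recently
    used entries out to a (here, dict-based) spill store when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = OrderedDict()   # lock tag -> holders, LRU order
        self.spilled = {}             # stand-in for the on-disk area

    def touch(self, tag, holders):
        if tag in self.spilled:       # fault the entry back in
            del self.spilled[tag]
        self.memory[tag] = holders
        self.memory.move_to_end(tag)  # mark as most recently used
        while len(self.memory) > self.capacity:
            old_tag, old_holders = self.memory.popitem(last=False)
            self.spilled[old_tag] = old_holders   # evict LRU entry

    def lookup(self, tag):
        return self.memory.get(tag) or self.spilled.get(tag)
```

The point of the sketch is only the eviction policy: lookups stay fast for recently touched locks, and a burst of FOR UPDATE locks degrades gradually instead of hitting a hard max_locks_per_transaction failure.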

Most of the applications I've seen don't take out that many FOR UPDATE
locks at once, so I'm unclear on the rationale for choosing a fixed-but-
poor performance curve over one that is fast for few locks and degrades
for many locks.  Especially when the value of "many" is
user-configurable.

Furthermore, we have also seen issues with too many locks on ordinary
objects, which #2 would solve simultaneously.

So I feel that #2 is clearly the approach to try first.  If we find that
we can't do spill-to-disk without serious performance degradation, then
I'd be inclined to try #1 next.  I really don't care for the
user-visible semantics changes implied by #3 ...

regards, tom lane



Re: Re: [HACKERS] Shared row locking

2004-12-20 Thread simon

Tom Lane [EMAIL PROTECTED] wrote on 20.12.2004, 19:34:21:
 Alvaro Herrera  writes:
  To solve the problem I want to solve, we have three orthogonal
  possibilities:
 
  1. implement shared row locking using the ideas outlined in the mail
  starting this thread (pg_clog-like seems to be the winner, details TBD).
 
  2. implement shared lock table spill-to-disk mechanism.
 
  3. implement lock escalation.
 
 Check.
 
  - 2 could have a performance impact, and we don't even know how to
start.  For example, what would be an algorithm to decide what locks
to send to disk?
 
 LRU, perhaps?  That's all open for investigation still.
 
 #1 could have a pretty serious performance impact, too.  For small
 numbers of FOR UPDATE locks (too few to force spill to disk) I would
 expect #2 to substantially beat #1.  #1 essentially imposes the worst
 case performance at all times, whereas #2 degrades (at a currently
 unknown rate) when there are lots and lots of FOR UPDATE locks.

Agreed.

[My gut feeling would be against another permanent on-disk structure,
since this is one more thing for a user to delete to save space
etc...]

 Most of the applications I've seen don't take out that many FOR UPDATE
 locks at once, so I'm unclear on the rationale for choosing a fixed-but-
 poor performance curve over one that is fast for few locks and degrades
 for many locks.  Especially when the value of "many" is
 user-configurable.
 
 Furthermore, we have also seen issues with too many locks on ordinary
 objects, which #2 would solve simultaneously.
 
 So I feel that #2 is clearly the approach to try first.  If we find that
 we can't do spill-to-disk without serious performance degradation, then

I agree. We need to understand what type of application logic we are
coding for. 

In general, I agree with Tom: I haven't seen many programs that use
extended SELECT FOR UPDATE logic. However, the ones I have seen have
been batch style programs written using a whole-table cursor - these
latter ones have been designed for the cursor stability approach.

It would be much better to analyze one or more representative
application suites to understand which option to pick. Without a set of
programs, or some driving force, the wrong one could be picked.

Spill-to-disk would not be that bad, since WARNINGs could appear in the
log. That's much better than doing a lock escalation, definitely. 

Best Regards, Simon Riggs



Re: [HACKERS] Shared row locking

2004-12-19 Thread Simon Riggs
On Sun, 2004-12-19 at 04:04, Bruce Momjian wrote:
 Tom Lane wrote:
  Bruce Momjian [EMAIL PROTECTED] writes:
   You mean all empty/zero rows can be removed?  Can we guarantee that on
   commit we can clean up the bitmap?  If not the idea doesn't work.
  
  For whatever data structure we use, we may reset the structure to empty
  during backend-crash recovery.  So your objection boils down to "what if
  a backend exits normally but forgets to clean up its locks?"  Assuming
  that doesn't happen isn't any worse than assuming a backend will clean
  up its shared memory state on non-crash exit, so I don't think it's a
  serious concern.
  
  That brings another thought: really what this is all about is working
  around the fact that the standard lock manager can only cope with a
  finite number of coexisting locks, because it's working in a fixed-size
  shared memory arena.  Maybe we should instead think about ways to allow
  the existing lock table to spill to disk when it gets too big.  That
  would eliminate max_locks_per_transaction as a source of hard failures,
  which would be a nice benefit.
 
 Agreed. One concern I have about allowing the lock table to spill to
 disk is that a large number of FOR UPDATE locks could push out lock
 entries used by other backends, causing very poor performance.

In similar circumstances, DB2 uses these techniques:

- when the lock table is X% full, escalate locks to full-table locks; both
lock table memory and the threshold % are instance parameters

- use a lock mode called Cursor Stability that locks only those rows
currently being examined by a cursor, thus maintaining the lock usage
of a cursor at a constant level as the cursor moves. The lock mode of
Repeatable Read *does* lock all rows read

(these are not actually mutually exclusive)

The first one is a real pain, but the idea might be of use somewhere.

The second is a usable, practical alternative that should be considered
and might avoid the need to write the spill-to-disk code and then
discover it performs very badly.
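The DB2-style threshold trigger described above might be sketched as (a simulation; the default threshold and all names are invented):

```python
def maybe_escalate(row_locks, capacity, threshold=0.8):
    """If the per-row lock table is more than `threshold` full,
    collapse each table's row locks into one table-level lock.
    Returns (row_locks, table_locks) after possible escalation."""
    held = sum(len(rows) for rows in row_locks.values())
    if held < threshold * capacity:
        return row_locks, set()          # plenty of room: no change
    return {}, set(row_locks)            # one coarse lock per table
```

This shows why escalation is "a real pain": once the threshold trips, every table with even one locked row suddenly holds a full-table lock, which is where the deadlock complaints come from.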

-- 
Best Regards, Simon Riggs




Re: [HACKERS] Shared row locking

2004-12-19 Thread Alvaro Herrera
On Sun, Dec 19, 2004 at 09:52:01AM +, Simon Riggs wrote:

Simon,

 In similar circumstances, DB2 uses these techniques:
 
 - when the lock table is X% full, escalate locks to full-table locks; both
 lock table memory and the threshold % are instance parameters

This is not useful at all, because the objective of this exercise is to
downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
shared row locking.  Keep in mind that this is for foreign key locking,
which is one area where deadlocks are frequently encountered because we
use too strong a lock.

 - use a lock mode called Cursor Stability that locks only those rows
 currently being examined by a cursor, thus maintaining the lock usage
 of a cursor at a constant level as the cursor moves. The lock mode of
 Repeatable Read *does* lock all rows read

We can't release the locks until the end of the transaction.

 The second is a usable, practical alternative that should be considered
 and might avoid the need to write the spill-to-disk code and then
 discover it performs very badly.

I don't think any of them serves the objective I'm trying to accomplish
:-(

Thanks for your ideas anyway.  And keep having them!

-- 
Alvaro Herrera ([EMAIL PROTECTED])
Uno puede defenderse de los ataques; contra los elogios se esta indefenso



Re: [HACKERS] Shared row locking

2004-12-19 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 On Sun, Dec 19, 2004 at 09:52:01AM +, Simon Riggs wrote:
 - use a lock mode called Cursor Stability that locks only those rows
 currently being examined by a cursor, thus maintaining the lock usage
 of a cursor at a constant level as the cursor moves. The lock mode of
 Repeatable Read *does* lock all rows read

 We can't release the locks until the end of the transaction.

Well, what Simon is essentially proposing is that we work around a
potential performance issue by exposing a less-clean semantic model
to users.  I'd prefer to adopt such a thing as a last resort, not a
first one.

regards, tom lane



Re: [HACKERS] Shared row locking

2004-12-19 Thread Heikki Linnakangas
On Sun, 19 Dec 2004, Alvaro Herrera wrote:
On Sun, Dec 19, 2004 at 09:52:01AM +, Simon Riggs wrote:
Simon,
In similar circumstances, DB2 uses these techniques:
- when locktable X % full, then escalate locks to full table locks: both
locktable memory and threshold% are instance parameters
This is not useful at all, because the objective of this exercise is to
downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
shared row locking.  Keep in mind that this is for foreign key locking,
which is one area where deadlocks are frequently encountered because we
use too strong a lock.
Actually it might help in some scenarios. Remember, we're not talking 
about upgrading shared locks to exclusive locks. We're only talking about 
locking more rows than necessary (all rows).

I believe DB2 does the escalation also for performance. Getting a full 
table lock is cheaper than individually locking every row.

- Heikki


Re: [HACKERS] Shared row locking

2004-12-19 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 On Sun, 19 Dec 2004, Alvaro Herrera wrote:
 This is not useful at all, because the objective of this exercise is to
 downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
 shared row locking.

 Actually it might help in some scenarios. Remember, we're not talking 
 about upgrading shared locks to exclusive locks. We're only talking about 
 locking more rows than necessary (all rows).

Nonetheless, it would mean that locks would be taken depending on
implementation-dependent, not-visible-to-the-user considerations.
Shared locks can still cause deadlocks, and so you would have an
unreliable application, which would only be unreliable under load.

As I said in connection with the other proposal, weird user-visible
semantics should be the last resort not the first.

regards, tom lane



Re: [HACKERS] Shared row locking

2004-12-19 Thread Heikki Linnakangas
On Sun, 19 Dec 2004, Tom Lane wrote:
Heikki Linnakangas [EMAIL PROTECTED] writes:
On Sun, 19 Dec 2004, Alvaro Herrera wrote:
This is not useful at all, because the objective of this exercise is to
downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
shared row locking.

Actually it might help in some scenarios. Remember, we're not talking
about upgrading shared locks to exclusive locks. We're only talking about
locking more rows than necessary (all rows).
Nonetheless, it would mean that locks would be taken depending on
implementation-dependent, not-visible-to-the-user considerations.
Shared locks can still cause deadlocks, and so you would have an
unreliable application, which would only be unreliable under load.
As I said in connection with the other proposal, weird user-visible
semantics should be the last resort not the first.
I agree that lock escalation is not a good solution, we run into problems 
with DB2 lock escalation at work all the time.

- Heikki


Re: [HACKERS] Shared row locking

2004-12-19 Thread Simon Riggs
On Sun, 2004-12-19 at 18:31, Alvaro Herrera wrote:
 Thanks for your ideas anyway.  And keep having them!

No problem. Just giving some info on what works and doesn't work in
other implementations.

-- 
Best Regards, Simon Riggs




Re: [HACKERS] Shared row locking

2004-12-19 Thread Bruce Momjian
Tom Lane wrote:
 Heikki Linnakangas [EMAIL PROTECTED] writes:
  On Sun, 19 Dec 2004, Alvaro Herrera wrote:
  This is not useful at all, because the objective of this exercise is to
  downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
  shared row locking.
 
  Actually it might help in some scenarios. Remember, we're not talking 
  about upgrading shared locks to exclusive locks. We're only talking about 
  locking more rows than necessary (all rows).
 
 Nonetheless, it would mean that locks would be taken depending on
 implementation-dependent, not-visible-to-the-user considerations.
 Shared locks can still cause deadlocks, and so you would have an
 unreliable application, which would only be unreliable under load.
 
 As I said in connection with the other proposal, weird user-visible
 semantics should be the last resort not the first.
 

One idea is to stamp the same shared lock counter on multiple rows and
just prevent another application from also sharing that lock if too many
rows are locked.  It would wait for the lock to be released.  This
prevents the problem of having to store thousands of shared lock
counters.  There would have to be some flag so say which shared counters
are only for a single backend.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] Shared row locking

2004-12-19 Thread Jim C. Nasby
On Sun, Dec 19, 2004 at 11:35:02PM +0200, Heikki Linnakangas wrote:
 On Sun, 19 Dec 2004, Tom Lane wrote:
 
 Heikki Linnakangas [EMAIL PROTECTED] writes:
 On Sun, 19 Dec 2004, Alvaro Herrera wrote:
 This is not useful at all, because the objective of this exercise is to
 downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
 shared row locking.
 
 Actually it might help in some scenarios. Remember, we're not talking
 about upgrading shared locks to exclusive locks. We're only talking about
 locking more rows than necessary (all rows).
 
 Nonetheless, it would mean that locks would be taken depending on
 implementation-dependent, not-visible-to-the-user considerations.
 Shared locks can still cause deadlocks, and so you would have an
 unreliable application, which would only be unreliable under load.
 
 As I said in connection with the other proposal, weird user-visible
 semantics should be the last resort not the first.
 
 I agree that lock escalation is not a good solution, we run into problems 
 with DB2 lock escalation at work all the time.

Does anyone know how Oracle deals with this? They use MVCC like
PostgreSQL, so they'd be a better source for inspiration.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?



Re: [HACKERS] Shared row locking

2004-12-19 Thread Gavin Sherry
On Sat, 18 Dec 2004, Bruce Momjian wrote:

 Tom Lane wrote:
  Bruce Momjian [EMAIL PROTECTED] writes:
   You mean all empty/zero rows can be removed?  Can we guarantee that on
   commit we can clean up the bitmap?  If not the idea doesn't work.
 
  For whatever data structure we use, we may reset the structure to empty
  during backend-crash recovery.  So your objection boils down to what if
  a backend exits normally but forgets to clean up its locks?  Assuming
  that doesn't happen isn't any worse than assuming a backend will clean
  up its shared memory state on non-crash exit, so I don't think it's a
  serious concern.
 
  That brings another thought: really what this is all about is working
  around the fact that the standard lock manager can only cope with a
  finite number of coexisting locks, because it's working in a fixed-size
  shared memory arena.  Maybe we should instead think about ways to allow
  the existing lock table to spill to disk when it gets too big.  That
  would eliminate max_locks_per_transaction as a source of hard failures,
  which would be a nice benefit.

 Agreed. One concern I have about allowing the lock table to spill to
 disk is that a large number of FOR UPDATE locks could push out lock
 entries used by other backends, causing very poor performance.

I think if we allow the lock manager to spill to disk (and I think we do
need to allow it) then we should also be able to control the amount of
shared memory allocated. There's little point spilling the lock table to
disk if we have huge amounts of memory.

Gavin



Re: [HACKERS] Shared row locking

2004-12-18 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  You mean all empty/zero rows can be removed?  Can we guarantee that on
  commit we can clean up the bitmap?  If not the idea doesn't work.
 
 For whatever data structure we use, we may reset the structure to empty
 during backend-crash recovery.  So your objection boils down to what if
 a backend exits normally but forgets to clean up its locks?  Assuming
 that doesn't happen isn't any worse than assuming a backend will clean
 up its shared memory state on non-crash exit, so I don't think it's a
 serious concern.
 
 That brings another thought: really what this is all about is working
 around the fact that the standard lock manager can only cope with a
 finite number of coexisting locks, because it's working in a fixed-size
 shared memory arena.  Maybe we should instead think about ways to allow
 the existing lock table to spill to disk when it gets too big.  That
 would eliminate max_locks_per_transaction as a source of hard failures,
 which would be a nice benefit.

Agreed. One concern I have about allowing the lock table to spill to
disk is that a large number of FOR UPDATE locks could push out lock
entries used by other backends, causing very poor performance.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



[HACKERS] Shared row locking

2004-12-16 Thread Alvaro Herrera
Hi,

I've been thinking about how to do shared row locking.  Here are some very
preliminary ideas on this issue.  Please comment, particularly if any
part of it sounds unworkable or too incomplete.

There are several problems to be solved here: the grammar, the internal
SelectStmt representation, how to store and share the info between
backends, how to clean up at transaction end, and how to clean up at
backend crash.

The Grammar
===

The SQL spec does not say anything on this respect (that I can find).
It only talks of FOR UPDATE and FOR READ ONLY.  However, because the
FK code uses SPI to do the locking, we definitely have to expose the
functionality through SQL.  So I think we need a new clause, which I
propose to be FOR SHARE.


The Parser and SelectStmt
=

The parser uses for_update_clause and opt_for_update_clause
nonterminals.  I assume it's best to change them to (new)
locking_clause, which can in turn be for_update_clause or (new)
for_share_clause.

SelectStmt currently has a forUpdate field (a List to the to-be-locked
tables, or an empty list meaning all of them).  We could simply add
another list, say forShare, or use a common list and a flag saying that
it's one or the other.  I prefer adding a new list.  (Same with the
Query node.)


How to Store the Info
=

This is the really interesting part.  I have two ideas, one mine (btree)
and other Bruce's (bitmap).

Using a B-tree
--
When a backend wants to lock a tuple, it sets a bit in its infomask.
Then it inserts into a btree in a special tablespace, using
RelationId-BlockNumber-OffsetNumber as key, and BackendId-TransactionId
as value; actually, an array with a single element containing those two
values.

When a backend wants to lock a tuple that is already locked, it goes to
the btree and inserts itself into the array.  To do this, it inserts a
new index item (the enlarged array) and deletes the previous one.  No
other backend may want to insert simultaneously (thus causing an ugly
race condition), because we hold an exclusive lock on the tuple's heap
page's buffer.

At transaction end, nothing special happens (tuples are not unlocked
explicitly).

When someone wants to know if the tuple is locked (to mark it FOR
UPDATE, or to delete it), it checks the infomask.  If it says it's
locked, it goes to check the btree.  If the array contains only
BackendId-TransactionId pairs that are no longer running, then the 
tuple is not locked and can be deleted/marked (and the btree can be
cleaned up).  Else, it will have to wait, using XactLockTableWait, for
the first transaction in the array that is still running.  We can be
sure that no one will try to share-lock the tuple while we check the
btree because we hold an exclusive lock on the tuple's heap page's
buffer.

Note that to check whether a transaction is running we need to lock
SInvalLock.  To minimize the time we hold it, we save the BackendId so
we don't have to scan the whole shmInvalBuffer->procState array, only
the item that we need to look at.  Another possibility would be to use
stock TransactionIdIsInProgress and save the extra 4 bytes of storage.

At server restart, the btree is created empty (or just deleted).  There
is one btree per database.


Using a Bitmap
--
First we create a counter called shared lock row counter.  Then we
create a file like pg_clog, and each counter slot has a bit for every
backend.  When we want to shared lock a row we increment the counter and
put that counter value on the row, and set our backend bit in the new
file.  We also store that counter value in our backend local memory.  On
commit we go through that local memory list and reset all our bits for
those counters.  When a counter slot has all zeros, it can be recycled like
with pg_clog.


Problems and random comments


There is possibility of starvation, if somebody wants to lock
exclusively a tuple and shared lockers are coming all the time.  Not
sure how to solve this.

The wakeup mechanism is not discussed ... is there a special need for
something beyond what we can do with XactLockTable{Insert,Wait} ?

Thanks to tablespaces, it's very easy to create special Relations that
can be dealt with by standard buffer and storage manager, etc.

The btree idea:
- does not need crash recovery.  Maybe we could use a stripped down
  version of nbtree.  This could cause a maintainability nightmare.

- can't hold more than 300 or so simultaneous lockers (because of value
  length, limited to 1/3 of a page).  I doubt this is a real problem.

- could have problems (excessive storage requirements) in the long run
  because of empty or almost-empty pages.

The bitmap idea:
- seems to have lower overhead

- can use the same lazy cleanup mechanism exposed for the btree idea (in
  which case we don't need the list in local memory).

- What can happen in presence of large max_connections settings?  Is
  this a real problem?

-- 
Alvaro 

Re: [HACKERS] Shared row locking

2004-12-16 Thread Bruce Momjian
Alvaro Herrera wrote:
 The btree idea:
 - does not need crash recovery.  Maybe we could use a stripped down
   version of nbtree.  This could cause a maintainability nightmare.

Are you saying the btree is an index with no heap?  If so, what about
the xid's?  Are they just in the btree?

How does the btree get cleaned up over time?

 The bitmap idea:
 - seems to have lower overhead
 
 - can use the same lazy cleanup mechanism exposed for the btree idea (in
   which case we don't need the list in local memory).

You mean all empty/zero rows can be removed?  Can we guarantee that on
commit we can clean up the bitmap?  If not the idea doesn't work.

 - What can happen in presence of large max_connections settings?  Is
   this a real problem?

I thought about that.  50 backends is 7 bytes, 1000 backends is 125
bytes.  For a large number of backends you could just allow X concurrent
locks and use space X*4 bytes.

I think the basic issue is that the btree can be of variable length
while the bitmap has to be of a fixed length.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] Shared row locking

2004-12-16 Thread Christopher Kings-Lynne
The SQL spec does not say anything on this respect (that I can find).
It only talks of FOR UPDATE and FOR READ ONLY.  However, because the
FK code uses SPI to do the locking, we definitely have to expose the
funcionality through SQL.  So I think we need a new clause, which I
propose to be FOR SHARE.
MySQL uses LOCK IN SHARE MODE:
http://dev.mysql.com/doc/mysql/en/InnoDB_locking_reads.html


Re: [HACKERS] Shared row locking

2004-12-16 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Using a B-tree

 At transaction end, nothing special happens (tuples are not unlocked
 explicitly).

I don't think that works, because there is no guarantee that an entry
will get cleaned out before the XID counter wraps around.  Worst case,
you might think that a tuple is locked when the XID is left over from
the previous cycle.  (Possibly this could be avoided by cleaning out old
XIDs in this table whenever we truncate pg_clog, but that seems a tad
messy.)  I'm also a bit concerned about how we avoid table bloat if
there's no proactive cleanup at transaction end.

I think I like the pg_clog-modeled structure a bit better.  However it
could be objected that that puts a hard limit of 4G share-locked tuples
at any one time.

In the clog-modeled idea, it wasn't real clear how you decide whether to
assign a new counter value to a previously locked row, or reuse its
previous counter.  You must *not* assign a new value when the existing
entry still has bits set, but you probably do want to be aggressive
about assigning new values when you can; else it gets tough to be sure
that the log can be truncated in a reasonable time.

ISTM that your description is conflating several orthogonal issues:
how do we identify entries in this data structure (by CTID, or a shared
counter that increments each time a new lock is acquired); how do we
index the data structure (btree or linear array); and what is stored in
each entry (array of XIDs, or bitmap indexed by BackendId).  Not all of
the eight combinations work, but we do have more alternatives than the
two offered, even without coming up with any new ideas ;-)

 Note that to check whether a transaction is running we need to lock
 SInvalLock.  To minimize the time we hold it, we save the BackendId so
 we don't have to scan the whole shmInvalBuffer->procState array, only
 the item that we need to look at.  Another possibility would be to use
 stock TransactionIdIsInProgress and save the extra 4 bytes of storage.

I'm a bit worried about deadlocks and race conditions associated with
the conflict between locking a page of this data structure and locking
SInvalLock.

 At server restart, the btree is created empty (or just deleted).  There
 is one btree per database.

One per cluster you meant, right?  (Else we can't do locking of rows in
shared tables.)

regards, tom lane



Re: [HACKERS] Shared row locking

2004-12-16 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 You mean all empty/zero rows can be removed?  Can we guarantee that on
 commit we can clean up the bitmap?  If not the idea doesn't work.

For whatever data structure we use, we may reset the structure to empty
during backend-crash recovery.  So your objection boils down to what if
a backend exits normally but forgets to clean up its locks?  Assuming
that doesn't happen isn't any worse than assuming a backend will clean
up its shared memory state on non-crash exit, so I don't think it's a
serious concern.

That brings another thought: really what this is all about is working
around the fact that the standard lock manager can only cope with a
finite number of coexisting locks, because it's working in a fixed-size
shared memory arena.  Maybe we should instead think about ways to allow
the existing lock table to spill to disk when it gets too big.  That
would eliminate max_locks_per_transaction as a source of hard failures,
which would be a nice benefit.

regards, tom lane



Re: [HACKERS] Shared row locking

2004-12-16 Thread Bruce Momjian
Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  Using a B-tree
 
  At transaction end, nothing special happens (tuples are not unlocked
  explicitly).
 
 I don't think that works, because there is no guarantee that an entry
 will get cleaned out before the XID counter wraps around.  Worst case,
 you might think that a tuple is locked when the XID is left over from
 the previous cycle.  (Possibly this could be avoided by cleaning out old
 XIDs in this table whenever we truncate pg_clog, but that seems a tad
 messy.)  I'm also a bit concerned about how we avoid table bloat if
 there's no proactive cleanup at transaction end.
 
 I think I like the pg_clog-modeled structure a bit better.  However it
 could be objected that that puts a hard limit of 4G share-locked tuples
 at any one time.
 
 In the clog-modeled idea, it wasn't real clear how you decide whether to
 assign a new counter value to a previously locked row, or reuse its
 previous counter.  You must *not* assign a new value when the existing
 entry still has bits set, but you probably do want to be aggressive
 about assigning new values when you can; else it gets tough to be sure
 that the log can be truncated in a reasonable time.

I assume you check, and if all the bits are zero, you don't reuse it but
instead get a new counter.  In fact you shouldn't reuse it, in case the
log is being truncated while you are looking.  :-)

 ISTM that your description is conflating several orthogonal issues:
 how do we identify entries in this data structure (by CTID, or a shared
 counter that increments each time a new lock is acquired); how do we
 index the data structure (btree or linear array); and what is stored in
 each entry (array of XIDs, or bitmap indexed by BackendId).  Not all of
 the eight combinations work, but we do have more alternatives than the
 two offered, even without coming up with any new ideas ;-)

True.  The only advantage of a bitmap over a simple count of locking
backends is that you can clean out your own backend's bits from the table
without having to record them in local memory.  However, since recording
your own counters in local memory doesn't require fixed shared memory, we
might be better off recording the shared lock indexes in local backend
memory and using an int4 counter in the pg_clog-like file that we
decrement on backend commit.  However, I am unclear whether we can
guarantee an exiting backend will do that.  Certainly it is cleared on
server start.

  Note that to check whether a transaction is running we need to lock
  SInvalLock.  To minimize the time we hold it, we save the BackendId so
  we don't have to scan the whole shmInvalBuffer->procState array, only
  the item that we need to look at.  Another possibility would be to use
  stock TransactionIdIsInProgress and save the extra 4 bytes of storage.
 
 I'm a bit worried about deadlocks and race conditions associated with
 the conflict between locking a page of this data structure and locking
 SInvalLock.
 
  At server restart, the btree is created empty (or just deleted).  There
  is one btree per database.
 
 One per cluster you meant, right?  (Else we can't do locking of rows in
 shared tables.)

He meant one per database, I think.  I suppose we would need another one
for global tables or disallow shared locking of them.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073
