Re: [HACKERS] Pre-alloc ListCell's optimization
* Stephen Frost (sfr...@snowman.net) wrote: Finally, sorry it's kind of a fugly patch, it's just a proof of concept and I'd be happy to clean it up if others feel it's worthwhile and a reasonable approach, but I really need to get it out there and take a break from it (I've been a bit obsessive-compulsive about it since PGCon.. :D). Erm, sorry, just to clarify, while it's a P-O-C patch, it does compile cleanly and passes all the regression tests, so it's something that one can play with at least. Not sure if it'd be worth benchmarking it until we feel comfortable that this is a decent approach, but I wouldn't complain if someone decided to... Thanks, Stephen
Re: [HACKERS] Should partial dumps include extensions?
On Tue, May 24, 2011 at 4:44 PM, Tom Lane t...@sss.pgh.pa.us wrote: There's a complaint here http://archives.postgresql.org/pgsql-general/2011-05/msg00714.php about the fact that 9.1 pg_dump always dumps CREATE EXTENSION commands for all loaded extensions. Should we change that? A reasonable compromise might be to suppress extensions in the same cases where we suppress procedural languages, ie if --schema or --table was used (see include_everything switch in pg_dump.c). Making it work like procedural languages seems sensible to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pre-alloc ListCell's optimization
Excerpts from Stephen Frost's message of Tue May 24 22:56:21 -0400 2011: A couple of notes regarding the patch: First, it uses ffs(), which might not be fully portable... We could certainly implement the same thing in userspace and use ffs() when it's available. Err, see RIGHTMOST_ONE in bitmapset.c. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
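For reference, the trick Álvaro points at isolates the lowest set bit with x & (-x) (that is what RIGHTMOST_ONE in bitmapset.c does); combined with a shift loop it yields a portable ffs() fallback. A minimal sketch — my_ffs is a hypothetical name, not the actual macro:

```c
/* Portable ffs() fallback: return the 1-based position of the least
 * significant set bit, or 0 if no bit is set.  The x & (~x + 1) step
 * is the RIGHTMOST_ONE idiom: two's-complement negation, which is
 * well defined for unsigned types, clears every bit except the
 * lowest one that is set. */
static int
my_ffs(unsigned int x)
{
    int pos = 0;

    if (x == 0)
        return 0;
    x &= (~x + 1U);             /* isolate the rightmost 1 bit */
    while (x != 0)
    {
        pos++;
        x >>= 1;
    }
    return pos;
}
```

The shift loop can of course be replaced by ffs() itself, or a De Bruijn multiply, on platforms where those are available or faster.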
Re: [HACKERS] tackling full page writes
Robert Haas wrote: 2. The other fairly obvious alternative is to adjust our existing WAL record types to be idempotent - i.e. to not rely on the existing page contents. For XLOG_HEAP_INSERT, we currently store the target tid and the tuple contents. I'm not sure if there's anything else, but we would obviously need the offset where the new tuple should be written, which we currently infer from reading the existing page contents. For XLOG_HEAP_DELETE, we store just the TID of the target tuple; we would certainly need to store its offset within the block, and maybe the infomask. For XLOG_HEAP_UPDATE, we'd need the old and new offsets and perhaps also the old and new infomasks. Assuming that's all we need and I'm not missing anything (which I won't bet on), that means we'd be adding, say, 4 bytes per insert or delete and 8 bytes per update. So, if checkpoints are spread out widely enough that there will be more than ~2K operations per page between checkpoints, then it makes more sense to just do a full page write and call it good. If not, this idea might have legs. I vote for wal_level = idempotent because so few people will know what idempotent means. ;-) Idempotent does seem like the most promising idea. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
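Robert's ~2K figure follows directly from the numbers in his paragraph: with the default 8192-byte block size, one full-page image costs as much WAL as roughly 8192/4 idempotent inserts' worth of extra offset bytes. A back-of-the-envelope sketch — the per-record overheads are the estimates from the text, not measured costs:

```c
/* Break-even estimate: how many operations touching a page between
 * checkpoints before the per-record overhead of idempotent WAL
 * records exceeds the one-time cost of a full-page write. */
enum
{
    BLOCK_SIZE = 8192,          /* default BLCKSZ */
    EXTRA_PER_INSERT = 4,       /* estimated added bytes per insert/delete */
    EXTRA_PER_UPDATE = 8        /* estimated added bytes per update */
};

static int
breakeven_ops(int extra_bytes_per_op)
{
    return BLOCK_SIZE / extra_bytes_per_op;
}
```

So inserts break even around 2048 operations per page and updates around 1024 — hence "more than ~2K operations per page between checkpoints" as the crossover point.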
[HACKERS] The way to know whether the standby has caught up with the master
Hi, For reliable high-availability, when the master crashes, the clusterware must know whether it can promote the standby safely without any data loss, before actually promoting it. IOW, it must know whether the standby has already caught up with the primary. Otherwise, failover might cause data loss. We can know that from pg_stat_replication on the master. But the problem is that pg_stat_replication is not available since the master is not running at that moment. So that info should be available also on the standby. To achieve that, I'm thinking to change walsender so that, when the standby has caught up with the master, it sends back the message indicating that to the standby. And I'm thinking to add new function (or view like pg_stat_replication) available on the standby, which shows that info. Thought? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] The way to know whether the standby has caught up with the master
On 25.05.2011 07:42, Fujii Masao wrote: For reliable high-availability, when the master crashes, the clusterware must know whether it can promote the standby safely without any data loss, before actually promoting it. IOW, it must know whether the standby has already caught up with the primary. Otherwise, failover might cause data loss. We can know that from pg_stat_replication on the master. But the problem is that pg_stat_replication is not available since the master is not running at that moment. So that info should be available also on the standby. To achieve that, I'm thinking to change walsender so that, when the standby has caught up with the master, it sends back the message indicating that to the standby. And I'm thinking to add new function (or view like pg_stat_replication) available on the standby, which shows that info. By the time the standby has received that message, it might not be caught-up anymore because new WAL might've been generated in the master already. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] The way to know whether the standby has caught up with the master
On Wed, May 25, 2011 at 2:16 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 25.05.2011 07:42, Fujii Masao wrote: For reliable high-availability, when the master crashes, the clusterware must know whether it can promote the standby safely without any data loss, before actually promoting it. IOW, it must know whether the standby has already caught up with the primary. Otherwise, failover might cause data loss. We can know that from pg_stat_replication on the master. But the problem is that pg_stat_replication is not available since the master is not running at that moment. So that info should be available also on the standby. To achieve that, I'm thinking to change walsender so that, when the standby has caught up with the master, it sends back the message indicating that to the standby. And I'm thinking to add new function (or view like pg_stat_replication) available on the standby, which shows that info. By the time the standby has received that message, it might not be caught-up anymore because new WAL might've been generated in the master already. Right. But, thanks to sync rep, until such new WAL has been replicated to the standby, the commit of the transaction is not visible to the client. So, even if there is some WAL not yet replicated to the standby, the clusterware can promote the standby safely without any data loss (from the client's point of view), I think. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
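Fujii's safety argument can be phrased as a comparison of LSNs: under synchronous replication a commit is acknowledged to the client only after its WAL reaches the standby, so promotion loses no client-visible data whenever the standby holds WAL at least up to the last acknowledged commit, even if the master generated more WAL afterwards. A toy sketch — the names and the simplified LSN type are illustrative, not a PostgreSQL API:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* simplified stand-in for an LSN */

/* With sync rep, a commit becomes visible to the client only after its
 * WAL is replicated.  Promotion is therefore lossless (from the
 * client's point of view) whenever the standby has received WAL at
 * least up to the last acknowledged commit, regardless of any WAL the
 * master generated after that point. */
static bool
safe_to_promote(XLogRecPtr standby_received_lsn,
                XLogRecPtr last_acked_commit_lsn)
{
    return standby_received_lsn >= last_acked_commit_lsn;
}
```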
Re: [HACKERS] Foreign memory context read
Indeed I was acting weird there. I had completely forgotten about the bool pointer. Moreover, I actually got confused about the palloc0's return type...whether it was a datum or a pointer to datum. Looked back at the expansion and got it clear. Thanks a lot Mr. Tom. Regards, Vaibhav On Mon, 2011-05-23 at 09:58 -0400, Tom Lane wrote: Vaibhav Kaushal vaibhavkaushal...@gmail.com writes: My mind started wandering after that error. Now, actually, i was trying to do something like this: *last_result = palloc0(sizeof(Datum)); bool *isnnuull = true; *last_result = slot_getattr(slot, num_atts, *isnnuull); This seems utterly confused about data types. The first line thinks that last_result is of type Datum ** (ie, pointer to pointer to Datum), since it's storing a pointer-to-Datum through it. The third line however is treating last_result as of type Datum *, since it's storing a Datum (not pointer to Datum) through it. And the second line is assigning true (a bool value) to a variable declared as pointer to bool, which you then proceed to incorrectly dereference while passing it as the last argument to slot_getattr. The code will certainly crash on that deref, independently of the multiple other bugs here. Recommendation: gcc is your friend. Pay attention to the warnings it gives you. regards, tom lane
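To make Tom's three points concrete, here is a corrected sketch of the fragment, assuming the caller wants a plain Datum plus a null flag. slot_getattr_stub and fetch_last_attr are hypothetical stand-ins, not the real executor functions:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;        /* stand-in for PostgreSQL's Datum */

/* Stand-in for slot_getattr(): returns the attribute value as a Datum
 * and reports null-ness through the isnull out-parameter. */
static Datum
slot_getattr_stub(int attnum, bool *isnull)
{
    *isnull = false;
    return (Datum) (attnum * 10);
}

static Datum
fetch_last_attr(int num_atts, bool *isnull)
{
    /* Pass the address of a real bool -- do not assign true to a
     * bool * and then dereference it, as the original snippet did.
     * No palloc0 is needed either: the Datum is returned by value. */
    return slot_getattr_stub(num_atts, isnull);
}
```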
[HACKERS] Proposal: Another attempt at vacuum improvements
Hi All, Some of the ideas regarding vacuum improvements were discussed here: http://archives.postgresql.org/pgsql-hackers/2008-05/msg00863.php http://archives.postgresql.org/pgsql-patches/2008-06/msg00059.php A recent thread was started by Robert Haas, but I don't know if we logically concluded that either. http://archives.postgresql.org/pgsql-hackers/2011-03/msg00946.php This was once again brought up by Robert Haas in a discussion with Tom and me during PGCon, and we agreed there are a few things we can do to make vacuum more performant. One of the things that Tom mentioned is that vacuum today is not aware of the fact that it's a periodic operation, and there might be ways to utilize that in some way. The biggest gripe today is that vacuum needs two heap scans and each scan dirties the buffer. While the visibility map ensures that not all blocks are read and written during the scan, for a very large table, even a small percentage of blocks can be significant. Further, post-HOT, the second scan of the heap does not really reclaim any significant space, except for dead line pointers. So there is a good reason to avoid that. I wanted to start a discussion just about that. I am proposing one solution below, but I am not married to the idea. So the idea is to separate the index vacuum (removing index pointers to dead tuples) from the heap vacuum. When we do heap vacuum (either by HOT-pruning or using regular vacuum), we can spool the dead line pointers somewhere. To avoid any hot-spots during normal processing, the spooling can be done periodically like the stats collection. One obvious choice for spooling dead line pointers is to use a relation fork. The index vacuum will be kicked off periodically depending on the number of spooled dead line pointers. When that happens, the index vacuum will remove all index pointers pointing to those dead line pointers and forget the spooled line pointers.
The dead line pointers themselves will be removed whenever a heap page is later vacuumed, either as part of HOT pruning or the next heap vacuum. We would need some mechanism, though, to know that the index pointers to the existing dead line pointers have been vacuumed and it's safe to remove them now. Maybe we can track the last operation that generated a dead line pointer in the page using an LSN in the page header, and also keep track of the LSN of the last successful index vacuum. If the index vacuum LSN is greater than the page header vacuum LSN, we can safely remove the existing dead line pointers. I am deliberately not suggesting how to track the index vacuum LSN since my last proposal to do something similar through a pg_class column was shot down by Tom :-) In a nutshell, what I am suggesting is to do heap and index vacuuming independently. The heap will be vacuumed either by HOT pruning or a periodic heap vacuum, and the dead line pointers will be collected. An index vacuum will remove the index pointers to those dead line pointers. And at some later point, the dead line pointers will be removed, either as part of a retail or complete heap vacuum. It's not clear if it's useful, but a single index vacuum can follow multiple heap vacuums or vice versa. Another advantage of this technique would be that we can then support start/stop heap vacuum, or vacuuming a range of blocks at a time, or even vacuuming only those blocks which are already cached in the buffer cache. Just hand-waving at this point, but it seems possible. Suggestions/comments/criticism all welcome, but please don't shoot down the idea on implementation details since I have really not spent time on that, so it will be easy to find holes and corner cases. Those can be worked out if we believe something like this will be useful. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
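The LSN-based safety check Pavan proposes reduces to a single comparison. A sketch under the proposal's assumptions — all names here are illustrative, not an actual PostgreSQL API:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* simplified stand-in for an LSN */

/* Sketch of the proposed rule: dead line pointers on a heap page may
 * be reclaimed only if an index vacuum has completed more recently
 * than the last operation that generated a dead line pointer on that
 * page -- i.e. every index pointer to them is already gone. */
static bool
can_remove_dead_line_pointers(XLogRecPtr page_last_dead_lsn,
                              XLogRecPtr last_index_vacuum_lsn)
{
    return last_index_vacuum_lsn > page_last_dead_lsn;
}
```

The equal-LSN case is treated as unsafe here, since an index vacuum at the same LSN cannot be known to have seen that page's latest dead line pointers.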
Re: [HACKERS] Reducing overhead of frequent table locks
On Mon, May 23, 2011 at 09:15:27PM -0400, Robert Haas wrote: On Fri, May 13, 2011 at 4:16 PM, Noah Misch n...@leadboat.com wrote:

    if (level >= ShareUpdateExclusiveLock)
        ++strong_lock_counts[my_strong_lock_count_partition]
        sfence
        if (strong_lock_counts[my_strong_lock_count_partition] == 1)
            /* marker 1 */
            import_all_local_locks
        normal_LockAcquireEx
    else if (level <= RowExclusiveLock)
        lfence
        if (strong_lock_counts[my_strong_lock_count_partition] == 0)
            /* marker 2 */
            local_only
            /* marker 3 */
        else
            normal_LockAcquireEx
    else
        normal_LockAcquireEx

At marker 1, we need to block until no code is running between markers two and three. You could do that with a per-backend lock (LW_SHARED by the strong locker, LW_EXCLUSIVE by the backend). That would probably still be a win over the current situation, but it would be nice to have something even cheaper. Barring some brilliant idea, or anyway for a first cut, it seems to me that we can adjust the above pseudocode by assuming the use of a LWLock. In addition, two other adjustments: first, the first line should test level > ShareUpdateExclusiveLock, rather than >=, per previous discussion. Second, import_all_local_locks needn't really move everything; just those locks with a matching locktag.
Thus:

    if (level > ShareUpdateExclusiveLock)
        ++strong_lock_counts[my_strong_lock_count_partition]
        sfence
        for each backend
            take per-backend lwlock for target backend
            transfer fast-path entries with matching locktag
            release per-backend lwlock for target backend
        normal_LockAcquireEx
    else if (level <= RowExclusiveLock)
        lfence
        if (strong_lock_counts[my_strong_lock_count_partition] == 0)
            take per-backend lwlock for own backend
            fast-path lock acquisition
            release per-backend lwlock for own backend
        else
            normal_LockAcquireEx
    else
        normal_LockAcquireEx

This drops the part about only transferring fast-path entries once when a strong_lock_counts cell transitions from zero to one. Granted, that itself requires some yet-undiscussed locking. For that matter, we can't have multiple strong lockers completing transfers on the same cell in parallel. Perhaps add a FastPathTransferLock, or an array of per-cell locks, that each strong locker holds for that entire if body and while decrementing the strong_lock_counts cell at lock release. As far as the level of detail of this pseudocode goes, there's no need to hold the per-backend LWLock while transferring the fast-path entries. You just need to hold it sometime between bumping strong_lock_counts and transferring the backend's locks. This ensures that, for example, the backend is not sleeping in the middle of a fast-path lock acquisition for the whole duration of this code. Now, a small fly in the ointment is that we haven't got, with PostgreSQL, a portable library of memory primitives. So there isn't an obvious way of doing that sfence/lfence business. I was thinking that, if the final implementation could benefit from memory barrier interfaces, we should create those interfaces now. Start with only a platform-independent dummy implementation that runs a lock/unlock cycle on a spinlock residing in backend-local memory. I'm 75% sure that would be sufficient on all architectures for which we support spinlocks.
It may turn out that we can't benefit from such interfaces at this time ... Now, it seems to me that in the strong lock case, the sfence isn't really needed anyway, because we're about to start acquiring and releasing an lwlock for every backend, and that had better act as a full memory barrier anyhow, or we're doomed. The weak lock case is more interesting, because we need the fence before we've taken any LWLock. Agreed. But perhaps it'd be sufficient to just acquire the per-backend lwlock before checking strong_lock_counts[]. If, as we hope, we get back a zero, then we do the fast-path lock acquisition, release the lwlock, and away we go. If we get back any other value, then we've wasted an lwlock acquisition cycle. Or actually maybe not: it seems to me that in that case we'd better transfer all of our fast-path entries into the main hash table before trying to acquire any lock the slow way, at least if we don't want the deadlock detector to have to know about the fast-path. So then we get this:

    if (level > ShareUpdateExclusiveLock)
        ++strong_lock_counts[my_strong_lock_count_partition]
        for each backend
            take per-backend lwlock for target backend
            transfer fastpath entries with matching locktag
            release per-backend lwlock for target backend
    else if (level <= RowExclusiveLock)
        take per-backend lwlock for own backend
        if (strong_lock_counts[my_strong_lock_count_partition] == 0)
            fast-path lock acquisition
            done = true
        else
            transfer all fastpath entries
        release per-backend lwlock for own backend
    if (!done)
        normal_LockAcquireEx

Could you elaborate on the last part (the need for else transfer all fastpath entries) and, specifically, how it aids deadlock avoidance? I didn't think this change would have any impact on deadlocks, because all relevant locks will be in the global lock table before any call to normal_LockAcquireEx.
Re: [HACKERS] SSI predicate locking on heap -- tuple or row?
Kevin Grittner wrote: Dan Ports wrote: Does that make sense to you? Makes sense to me. Like the proof I offered, you have shown that there is no cycle which can develop with the locks copied which isn't there anyway if we don't copy the locks. I woke up with the nagging thought that while the above is completely accurate, it deserves some slight elaboration. These proofs show that there is no legitimate cycle which could cause an anomaly which the move from row-based to tuple-based logic will miss. They don't prove that the change will generate all the same serialization failures; and in fact, some false positives are eliminated by the change. That's a good thing. In addition to the benefits mentioned in prior posts, there will be a reduction in the rate of rollbacks (in particular corner cases) from what people see in beta1 without a loss of correctness. -Kevin
[HACKERS] Operator families vs. casts
PostgreSQL 9.1 will implement ALTER TABLE ALTER TYPE operations that use a binary coercion cast without rewriting the table or unrelated indexes. It will always rewrite any indexes and recheck any foreign key constraints that depend on a changing column. This is unnecessary for 100% of core binary coercion casts. In my original design[1], I planned to detect this by comparing the operator families of the old and would-be-new indexes. (This still yields some unnecessary rewrites; oid_ops and int4_ops are actually compatible, for example.) When I implemented[2] it, I found that the contracts[3] for operator families are not strong enough to prove that the existing indexes and constraints remain valid. Specifically, I wished to assume val0 = val1 iff val0::a = val1::b for any val0, val1, a, b such that we resolve both equality operators in the same operator family. The operator family contracts say nothing about consistency with casts. Is there a credible use case for violating that assumption? If not, I'd like to document it as a requirement for operator family implementors. The above covers B-tree and hash operator families. GIN and GiST have no operator family contracts. Here was the comment in my first patch intended to sweep that under the table:

    ! * We do not document a contract for GIN or GiST operator families.  Only the
    ! * GIN operator family array_ops has more than one constituent operator class,
    ! * and only typmod-only changes to arrays can avoid a rewrite.  Preserving a GIN
    ! * index across such a change is safe.  We therefore support GiST and GIN here
    ! * using the same rules as for B-tree and hash indexes, but that is mostly
    ! * academic.  Any forthcoming contract for GiST or GIN operator families should,
    ! * all other things being equal, bolster the validity of this assumption.
    ! *
    ! * Exclusion constraints raise the question: can we trust that the operator has
    ! * the same semantics with the new type?  The operator will fall in the index's
    ! * operator family.  For B-tree or hash, the operator will be = or <>,
    ! * yielding an affirmative answer from contractual requirements.  For GiST and
    ! * GIN, we assume that a similar requirement would fall out of any contract for
    ! * their operator families, should one arise.  We therefore support exclusion
    ! * constraints without any special treatment, but this is again mostly academic.

Any thoughts on what to do here? We could just add basic operator family contracts requiring what we need. Perhaps, instead, the ALTER TABLE code should require an operator family match for B-tree and hash but an operator class match for other access methods. For now, I plan to always rewrite indexes on expressions or having predicates. With effort, we could detect compatible changes there, too. I also had a more mundane design question in the second paragraph of [2]. It can probably wait for the review of the next version of the patch. However, given that it affects a large percentage of the patch, I'd appreciate any early feedback on it. Thanks, nm

[1] http://archives.postgresql.org/message-id/20101229125625.ga27...@tornado.gateway.2wire.net
[2] http://archives.postgresql.org/message-id/20110113230124.ga18...@tornado.gateway.2wire.net
[3] http://www.postgresql.org/docs/9.0/interactive/xindex.html#XINDEX-OPFAMILY
Re: [HACKERS] sepgsql: fix relkind handling on foreign tables
2011/5/23 Robert Haas robertmh...@gmail.com: On Sun, May 22, 2011 at 5:52 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: The attached patch fixes up case handling for foreign tables. Previously, it did not assign a security label to a foreign table at creation time, and did not check access rights in the DML hook. This patch fixes these problems; it allows foreign tables default labeling and access checks as the db_table object class. A foreign table is really more like a view, or a function call. Are you sure you want to handle it like a table? It might be a tentative solution, so I'll want to cancel this patch. Its nature is indeed more similar to a function call than to a table, but it is not a function itself. So, it might be a better idea to define its own object class, such as db_foreign_table, instead of reusing existing object classes. Thanks, -- KaiGai Kohei kai...@kaigai.gr.jp
Re: [HACKERS] sepgsql: fix relkind handling on foreign tables
On Tue, May 24, 2011 at 6:57 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: 2011/5/23 Robert Haas robertmh...@gmail.com: On Sun, May 22, 2011 at 5:52 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: The attached patch fixes up case handling for foreign tables. Previously, it did not assign a security label to a foreign table at creation time, and did not check access rights in the DML hook. This patch fixes these problems; it allows foreign tables default labeling and access checks as the db_table object class. A foreign table is really more like a view, or a function call. Are you sure you want to handle it like a table? It might be a tentative solution, so I'll want to cancel this patch. Its nature is indeed more similar to a function call than to a table, but it is not a function itself. So, it might be a better idea to define its own object class, such as db_foreign_table, instead of reusing existing object classes. Perhaps. Or else use db_view. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] Small patch for GiST: move childoffnum to child
While preparing the patch for my GSoC project I found it reasonable to move childoffnum (in the GISTInsertStack structure) from the parent to the child. This means that the child now holds the childoffnum of the parent's link to that child. It allows maintaining entire parts of the tree in GISTInsertStack structures. It also simplifies the existing code a bit. Heikki advised me that since this change simplifies existing code, it can be considered as a separate patch. -- With best regards, Alexander Korotkov. Attachment: gist_childoffnum.path
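As a rough illustration of the layout change (struct and field names simplified here; the real structure is GISTInsertStack in PostgreSQL's GiST code), each stack entry now records the offset, within its parent's page, of the downlink that points to it — so an entry plus its parent pointer describes a self-contained slice of the tree:

```c
#include <stddef.h>

typedef unsigned short OffsetNumber;
typedef unsigned int BlockNumber;

/* Simplified sketch of the proposed stack entry: the child carries
 * the offset of the parent's downlink to it, rather than the parent
 * carrying the offset of its link to the child. */
typedef struct StackSketch
{
    BlockNumber blkno;            /* block this entry describes */
    OffsetNumber downlinkoffnum;  /* offset of parent's link to this block */
    struct StackSketch *parent;   /* NULL for the root */
} StackSketch;
```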
Re: [HACKERS] 9.1 support for hashing arrays
Robert Haas wrote: On Sun, May 22, 2011 at 11:49 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: I believe, however, that applying this will invalidate the contents of any hash indexes on array types that anyone has built using 9.1beta1. Do we need to do something about that? Like bumping catversion? Sure. Although note that the system catalogs are not actually changing, which goes to someone else's recent point about catversion getting bumped for things other than changes in the things for which the cat in catversion is an abbreviation. I would probably complain about that, except you already did it post-beta1: http://git.postgresql.org/gitweb?p=postgresql.git;a=commitdiff;h=9bb6d9795253bb521f81c626fea49a704a369ca9 Unfortunately, I was unable to make that omelet without breaking some eggs. :-( Possibly Bruce will feel like adding a check to pg_upgrade for the case. I wouldn't bother myself though. It seems quite unlikely that anyone's depending on the feature yet. I'll leave that to you, Bruce, and whoever else wants to weigh in to hammer that one out. Oh, you are worried someone might have stored hash indexes with the old catalog format? Seems like something we might mention in the next beta release announcement, but nothing more. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.
Robert Haas wrote: On Mon, May 23, 2011 at 8:26 AM, Bruce Momjian br...@momjian.us wrote: Sorry, I was unclear. The question is whether the case of the _name_ of the locale is significant, meaning: can you have two locale names that differ only by case and behave differently? That would seem surprising to me, but I really have no idea. There's the other direction, too: two locales that vary by something more than case, but still have identical behavior. Maybe we just decide not to worry about that, but then why worry about this? Well, if we remove the check then people could easily get broken upgrades by upgrading to a server with a different locale. A Google search seems to indicate the locale names are case-sensitive, so I am thinking the problem is that the user didn't have exact locales, and needs that to use pg_upgrade. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Reducing overhead of frequent table locks
On Tue, May 24, 2011 at 5:07 AM, Noah Misch n...@leadboat.com wrote: This drops the part about only transferring fast-path entries once when a strong_lock_counts cell transitions from zero to one. Right: that's because I don't think that's what we want to do. I don't think we want to transfer all per-backend locks to the shared hash table as soon as anyone attempts to acquire a strong lock; instead, I think we want to transfer only those fast-path locks which have the same locktag as the strong lock someone is attempting to acquire. If we do that, then it doesn't matter whether the strong_lock_counts[] cell is transitioning from 0 to 1 or from 6 to 7: we still have to check for strong locks with that particular locktag. Granted, that itself requires some yet-undiscussed locking. For that matter, we can't have multiple strong lockers completing transfers on the same cell in parallel. Perhaps add a FastPathTransferLock, or an array of per-cell locks, that each strong locker holds for that entire if body and while decrementing the strong_lock_counts cell at lock release. I was imagining that the per-backend LWLock would protect the list of fast-path locks. So to transfer locks, you would acquire the per-backend LWLock for the backend which has the lock, and then the lock manager partition LWLock, and then perform the transfer. As far as the level of detail of this pseudocode goes, there's no need to hold the per-backend LWLock while transferring the fast-path entries. You just need to hold it sometime between bumping strong_lock_counts and transferring the backend's locks. This ensures that, for example, the backend is not sleeping in the middle of a fast-path lock acquisition for the whole duration of this code. See above; I'm lost. Now, a small fly in the ointment is that we haven't got, with PostgreSQL, a portable library of memory primitives. So there isn't an obvious way of doing that sfence/lfence business.
I was thinking that, if the final implementation could benefit from memory barrier interfaces, we should create those interfaces now. Start with only a platform-independent dummy implementation that runs a lock/unlock cycle on a spinlock residing in backend-local memory. I'm 75% sure that would be sufficient on all architectures for which we support spinlocks. It may turn out that we can't benefit from such interfaces at this time ... OK. Now, it seems to me that in the strong lock case, the sfence isn't really needed anyway, because we're about to start acquiring and releasing an lwlock for every backend, and that had better act as a full memory barrier anyhow, or we're doomed. The weak lock case is more interesting, because we need the fence before we've taken any LWLock. Agreed. But perhaps it'd be sufficient to just acquire the per-backend lwlock before checking strong_lock_counts[]. If, as we hope, we get back a zero, then we do the fast-path lock acquisition, release the lwlock, and away we go. If we get back any other value, then we've wasted an lwlock acquisition cycle. Or actually maybe not: it seems to me that in that case we'd better transfer all of our fast-path entries into the main hash table before trying to acquire any lock the slow way, at least if we don't want the deadlock detector to have to know about the fast-path. So then we get this: ! if (level ShareUpdateExclusiveLock) ! ++strong_lock_counts[my_strong_lock_count_partition] ! for each backend ! take per-backend lwlock for target backend ! transfer fastpath entries with matching locktag ! release per-backend lwlock for target backend ! else if (level = RowExclusiveLock) ! take per-backend lwlock for own backend ! if (strong_lock_counts[my_strong_lock_count_partition] == 0) ! fast-path lock acquisition ! done = true ! else ! transfer all fastpath entries ! release per-backend lwlock for own backend ! if (!done) ! 
!     normal_LockAcquireEx

Could you elaborate on the last part (the need for "else transfer all fastpath entries") and, specifically, how it aids deadlock avoidance? I didn't think this change would have any impact on deadlocks, because all relevant locks will be in the global lock table before any call to normal_LockAcquireEx. Oh, hmm, maybe you're right. I was concerned about the possibility of a backend which already holds locks going to sleep on a lock wait, and maybe running the deadlock detector, and failing to notice a deadlock. But I guess that can't happen: if any of the locks it holds are relevant to the deadlock detector, the backend attempting to acquire those locks will transfer them before attempting to acquire the lock itself, so it should be OK. To
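The branch structure sketched in the pseudocode above can be modeled as a single-process C program. Everything here is an illustrative assumption — the partition count, slot count, and the stand-in "shared table" — and the real implementation would hold the per-backend LWLock and lock manager partition locks around these steps:

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal single-process sketch of the fast-path pseudocode.  Sizes are
 * arbitrary; "transfer" just moves entries into a stand-in for the
 * shared lock table. */
#define NPARTITIONS 16
#define FP_SLOTS    16

static int strong_lock_counts[NPARTITIONS];

typedef struct { int locktag; bool used; } FastPathSlot;
static FastPathSlot fp[FP_SLOTS];   /* this backend's fast-path queue */
static int shared_table[64];        /* stand-in for the global lock table */
static int shared_count;

static int partition(int locktag) { return locktag % NPARTITIONS; }

static void transfer_matching(int locktag)
{
    for (int i = 0; i < FP_SLOTS; i++)
        if (fp[i].used && fp[i].locktag == locktag)
        {
            shared_table[shared_count++] = fp[i].locktag;
            fp[i].used = false;
        }
}

static void transfer_all(void)
{
    for (int i = 0; i < FP_SLOTS; i++)
        if (fp[i].used)
        {
            shared_table[shared_count++] = fp[i].locktag;
            fp[i].used = false;
        }
}

/* Strong locker: bump the partition count, then pull any matching
 * fast-path entries into the shared table before the slow-path acquire. */
void acquire_strong(int locktag)
{
    ++strong_lock_counts[partition(locktag)];
    transfer_matching(locktag);              /* per-backend lwlock held here */
    shared_table[shared_count++] = locktag;  /* normal_LockAcquireEx */
}

/* Weak locker: fast path only while the partition count is zero;
 * otherwise transfer everything and fall back to the slow path. */
bool acquire_weak(int locktag)
{
    if (strong_lock_counts[partition(locktag)] == 0)
    {
        for (int i = 0; i < FP_SLOTS; i++)
            if (!fp[i].used)
            {
                fp[i].used = true;
                fp[i].locktag = locktag;
                return true;                 /* fast-path acquisition */
            }
    }
    transfer_all();
    shared_table[shared_count++] = locktag;  /* normal_LockAcquireEx */
    return false;
}
```

Note how the strong locker only transfers entries with a matching locktag, while a weak locker that loses the race transfers all of its own entries before going the slow way.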
Re: [HACKERS] Reducing overhead of frequent table locks
On Tue, May 24, 2011 at 08:53:11AM -0400, Robert Haas wrote: On Tue, May 24, 2011 at 5:07 AM, Noah Misch n...@leadboat.com wrote: This drops the part about only transferring fast-path entries once when a strong_lock_counts cell transitions from zero to one. Right: that's because I don't think that's what we want to do. I don't think we want to transfer all per-backend locks to the shared hash table as soon as anyone attempts to acquire a strong lock; instead, I think we want to transfer only those fast-path locks which have the same locktag as the strong lock someone is attempting to acquire. If we do that, then it doesn't matter whether the strong_lock_counts[] cell is transitioning from 0 to 1 or from 6 to 7: we still have to check for strong locks with that particular locktag. Oh, I see. I was envisioning that you'd transfer all locks associated with the strong_lock_counts cell; that is, all the locks that would now go directly to the global lock table when requested going forward. Transferring only exact matches seems fine too, and then I agree with your other conclusions. Granted, that itself requires some yet-undiscussed locking. For that matter, we can't have multiple strong lockers completing transfers on the same cell in parallel. Perhaps add a FastPathTransferLock, or an array of per-cell locks, that each strong locker holds for that entire if body and while decrementing the strong_lock_counts cell at lock release. I was imagining that the per-backend LWLock would protect the list of fast-path locks. So to transfer locks, you would acquire the per-backend LWLock for the backend which has the lock, and then the lock manager partition LWLock, and then perform the transfer. I see later in your description that the transferer will delete each fast-path lock after transferring it. Given that, this does sound adequate. As far as the level of detail of this pseudocode goes, there's no need to hold the per-backend LWLock while transferring the fast-path entries. 
You just need to hold it sometime between bumping strong_lock_counts and transferring the backend's locks. This ensures that, for example, the backend is not sleeping in the middle of a fast-path lock acquisition for the whole duration of this code. See above; I'm lost. It wasn't a particularly useful point. To validate the locking at this level of detail, I think we need to sketch the unlock protocol, too. On each strong lock release, we'll decrement the strong_lock_counts cell. No particular interlock with fast-path lockers should be needed; a stray AccessShareLock needlessly making it into the global lock table is no problem. As mentioned above, we _will_ need an interlock with lock transfer operations. How will transferred fast-path locks get removed from the global lock table? Presumably, the original fast-path locker should do so at transaction end; anything else would contort the life cycle. Then add a way for the backend to know which locks had been transferred as well as an interlock against concurrent transfer operations. Maybe that's all. I'm thinking that the backend can note, in its local-lock table, whether it originally acquired a lock via the fast-path or not. Any lock not originally acquired via the fast-path will be released just as now. For any lock that WAS originally acquired via the fast-path, we'll take our own per-backend lwlock, which protects the fast-path queue, and scan the fast-path queue for a matching entry. If none is found, then we know the lock was transferred, so release the per-backend lwlock and do it the regular way (take lock manager partition lock, etc.). Sounds good. To put it another way: the current system is fair; the chance of hitting lock exhaustion is independent of lock level. The new system would be unfair; lock exhaustion is much more likely to appear for a ShareUpdateExclusiveLock acquisition, through no fault of that transaction. 
I agree this isn't ideal, but it doesn't look to me like an unacceptable weakness. Making lock slots first-come, first-served is inherently unfair; we're not at all set up to justly arbitrate between mutually-hostile lockers competing for slots. The overall situation will get better, not worse, for the admin who wishes to defend against hostile unprivileged users attempting a lock table DOS. Well, it's certainly true that the proposed system is far less likely to bomb out trying to acquire an AccessShareLock than what we have today, since in the common case the AccessShareLock doesn't use up any shared resources. And that should make a lot of people happy. But as to the bad scenario, one needn't presume that the lockers are hostile - it may just be that the system is running on the edge of a full lock table. In the worst case, someone wanting a strong lock on a table may end up transferring a hundred or
Re: [HACKERS] Operator families vs. casts
Noah Misch n...@leadboat.com writes: PostgreSQL 9.1 will implement ALTER TABLE ALTER TYPE operations that use a binary coercion cast without rewriting the table or unrelated indexes. It will always rewrite any indexes and recheck any foreign key constraints that depend on a changing column. This is unnecessary for 100% of core binary coercion casts. In my original design[1], I planned to detect this by comparing the operator families of the old and would-be-new indexes. (This still yields some unnecessary rewrites; oid_ops and int4_ops are actually compatible, for example.) No, they aren't: signed and unsigned comparisons do not yield the same sort order. I think that example may destroy the rest of your argument. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
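Tom's point that signed and unsigned comparisons yield different sort orders is easy to demonstrate outside PostgreSQL. This standalone sketch (not PostgreSQL code) compares the same 32-bit pattern under int4-style (signed) and oid-style (unsigned) semantics:

```c
#include <stdint.h>

/* Compare two 32-bit values as int4 would: reinterpret as signed.
 * (The uint32 -> int32 conversion is two's-complement wraparound on
 * all mainstream platforms.) */
static int cmp_signed(uint32_t a, uint32_t b)
{
    int32_t sa = (int32_t) a, sb = (int32_t) b;
    return (sa < sb) ? -1 : (sa > sb) ? 1 : 0;
}

/* Compare the same values as oid would: plain unsigned comparison. */
static int cmp_unsigned(uint32_t a, uint32_t b)
{
    return (a < b) ? -1 : (a > b) ? 1 : 0;
}
```

Any OID with the high bit set (>= 0x80000000) sorts before small values under signed comparison but after them under unsigned comparison, so a btree built with one ordering is broken under the other.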
Re: [HACKERS] Operator families vs. casts
On Tue, May 24, 2011 at 10:10:34AM -0400, Tom Lane wrote: Noah Misch n...@leadboat.com writes: PostgreSQL 9.1 will implement ALTER TABLE ALTER TYPE operations that use a binary coercion cast without rewriting the table or unrelated indexes. It will always rewrite any indexes and recheck any foreign key constraints that depend on a changing column. This is unnecessary for 100% of core binary coercion casts. In my original design[1], I planned to detect this by comparing the operator families of the old and would-be-new indexes. (This still yields some unnecessary rewrites; oid_ops and int4_ops are actually compatible, for example.) No, they aren't: signed and unsigned comparisons do not yield the same sort order. True; scratch the parenthetical comment. I think that example may destroy the rest of your argument. Not that I'm aware of. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reducing overhead of frequent table locks
On Tue, May 24, 2011 at 10:03 AM, Noah Misch n...@leadboat.com wrote: Let's see if I understand the risk better now: the new system will handle lock load better, but when it does hit a limit, understanding why that happened will be more difficult. Good point. No silver-bullet ideas come to mind for avoiding that. The only idea I can think of is to try to impose some bounds. For example, suppose we track the total number of locks that the system can handle in the shared hash table. We try to maintain the system in a state where the number of locks that actually exist is less than that number, even though some of them may be stored elsewhere. You can imagine a system where backends grab a global mutex to reserve a certain number of slots, and store that in shared memory together with their fast-path list, but another backend which is desperate for space can go through and trim back reservations to actual usage. Will the pg_locks view scan fast-path lock tables? If not, we probably need another view that does. We can then encourage administrators to monitor for fast-path lock counts that get high relative to shared memory capacity. I think pg_locks should probably scan the fast-path tables. Another random idea for optimization: we could have a lock-free array with one entry per backend, indicating whether any fast-path locks are present. Before acquiring its first fast-path lock, a backend writes a 1 into that array and inserts a store fence. After releasing its last fast-path lock, it performs a store fence and writes a 0 into the array. Anyone who needs to grovel through all the per-backend fast-path arrays for whatever reason can perform a load fence and then scan the array. If I understand how this stuff works (and it's very possible that I don't), when the scanning backend sees a 0, it can be assured that the target backend has no fast-path locks and therefore doesn't need to acquire and release that LWLock or scan that fast-path array for entries. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
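Robert's lock-free presence array might look roughly like the following in C11 atomics. This is a sketch, not PostgreSQL code — the array, its size, and the function names are assumptions, and on its own it does not close the race Noah raises in the follow-up message:

```c
#include <stdatomic.h>

/* One "any fast-path locks here?" flag per backend.  Static atomics
 * are zero-initialized. */
#define MAX_BACKENDS 8

static atomic_int fp_present[MAX_BACKENDS];

/* Called before a backend acquires its first fast-path lock. */
void note_first_fastpath_lock(int backend)
{
    atomic_store(&fp_present[backend], 1);
    atomic_thread_fence(memory_order_release);   /* the "store fence" */
}

/* Called after a backend releases its last fast-path lock. */
void note_last_fastpath_release(int backend)
{
    atomic_thread_fence(memory_order_release);
    atomic_store(&fp_present[backend], 0);
}

/* Scanner: load fence, then count only the backends whose flag is set;
 * the others need no LWLock acquisition and no queue scan. */
int count_backends_to_visit(void)
{
    atomic_thread_fence(memory_order_acquire);   /* the "load fence" */
    int n = 0;
    for (int i = 0; i < MAX_BACKENDS; i++)
        if (atomic_load(&fp_present[i]) != 0)
            n++;
    return n;
}
```

The payoff is on the common path: a strong locker scanning for transfers skips every backend whose flag reads 0 instead of taking that backend's LWLock and walking its fast-path array.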
Re: [HACKERS] Adding an example for replication configuration to pg_hba.conf
Magnus Hagander wrote: On Thu, May 19, 2011 at 11:09, Dave Page dp...@pgadmin.org wrote: On Thu, May 19, 2011 at 2:44 PM, Selena Deckelmann sel...@chesnok.com wrote: On Wed, May 18, 2011 at 8:20 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: Excerpts from Greg Smith's message of mié may 18 23:07:13 -0400 2011: Two things that could be changed from this example to make it more useful: -The default database is based on your user name, which is postgres in most packaged builds but not if you compile your own. I don't know whether it's practical to consider substituting that into this file, or if it's just enough to mention that as an additional doc comment. You mean the default username, not the default database, but yeah; so do we need a @default_username@ token to be replaced by initdb with whatever it has as effective_user? (In this case the patch is no longer 2 lines, but still should be trivial enough). That would be nice. So, we just add that token to initdb? Seems simple. I added some explanation of the all vs replication bit in the header comments. Revision attached. Looks good to me. As I mentioned offlist, I'd like it in teal please. Applied with some further minor bikeshedding (remove trailing spaces, rewrap so columns aren't wider than 80 chars, etc) Let me just point out that people who have already run initdb during beta will not see this in their pg_hba.conf, nor in their share/pg_hba.conf.sample, even after they have upgraded to a later beta, unless they run initdb. However, we have bumped the catalog version for something else so they should then get this change. My point is if we change configuration files and then don't bump the catalog version, the share/*.sample files get out of sync with the files in /data, which can be kind of confusing. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. 
+ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] moving toast table to its own tablespace
Robert Haas wrote: On Thu, May 19, 2011 at 3:17 PM, Alvaro Herrera alvhe...@alvh.no-ip.org wrote: Is there a reason we don't allow moving the toast table to a separate tablespace, other than unimplemented feature? If not, I propose such a syntax as ALTER TABLE foo SET TOAST TABLESPACE bar; Off the top of my head, I don't see any reason not to allow that. Added to TODO: Allow toast tables to be moved to a different tablespace * http://archives.postgresql.org/pgsql-hackers/2011-05/msg00980.php -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Adding an example for replication configuration to pg_hba.conf
On Tue, May 24, 2011 at 10:53, Bruce Momjian br...@momjian.us wrote: Magnus Hagander wrote: On Thu, May 19, 2011 at 11:09, Dave Page dp...@pgadmin.org wrote: On Thu, May 19, 2011 at 2:44 PM, Selena Deckelmann sel...@chesnok.com wrote: On Wed, May 18, 2011 at 8:20 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: Excerpts from Greg Smith's message of mié may 18 23:07:13 -0400 2011: Two things that could be changed from this example to make it more useful: -The default database is based on your user name, which is postgres in most packaged builds but not if you compile your own. I don't know whether it's practical to consider substituting that into this file, or if it's just enough to mention that as an additional doc comment. You mean the default username, not the default database, but yeah; so do we need a @default_username@ token to be replaced by initdb with whatever it has as effective_user? (In this case the patch is no longer 2 lines, but still should be trivial enough). That would be nice. So, we just add that token to initdb? Seems simple. I added some explanation of the all vs replication bit in the header comments. Revision attached. Looks good to me. As I mentioned offlist, I'd like it in teal please. Applied with some further minor bikeshedding (remove trailing spaces, rewrap so columns aren't wider than 80 chars, etc) Let me just point out that people who have already run initdb during beta will not see this in their pg_hba.conf, nor in their share/pg_hba.conf.sample, even after they have upgraded to a later beta, unless they run initdb. However, we have bumped the catalog version for something else so they should then get this change. Why would they not see it in their share/pg_hba.conf.sample? It will not affect the existing one in $PGDATA, but why wouldn't the installed .sample change? 
My point is if we change configuration files and then don't bump the catalog version, the share/*.sample files get out of sync with the files in /data, which can be kind of confusing. They would - but what you are saying above is that they would not get out of sync, because the share/*.sample also don't update. Just a mistake in what you said above, or am I missing something? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pull up aggregate subquery
Robert Haas robertmh...@gmail.com writes: On Mon, May 23, 2011 at 4:02 PM, Tom Lane t...@sss.pgh.pa.us wrote: Yeah. For simple scan/join queries it seems likely that we only care about parameterizing indexscans, since the main opportunity for a win is to not scan all of a large table. Restricting things that way would help reduce the number of extra Paths to carry around. But I'm not sure whether the same argument can be made for arbitrary subqueries. I must be misunderstanding you, because index scans are the thing we already *do* parameterize; and what else would make any sense? The point I was trying to make is that the ultimate reason for having a parameterized portion-of-a-plan will be that there's a parameterized indexscan somewhere down at the bottom. I had originally imagined that we might parameterize any old scan; for example consider replacing

Nestloop
  Join Filter: a.x = b.y
  -> Seqscan on a
  -> Seqscan on b

with

Nestloop
  -> Seqscan on a
  -> Seqscan on b
       Filter: b.y = a.x

Although this isn't nearly as useful as if we could be using an index on b.y, there would still be some marginal gain to be had, because we'd be able to reject rows of b without first passing them up to the join node. But I'm afraid that going all-out like that would slow the planner down far too much (too many Paths to consider) to be justified by a marginal runtime gain. So the idea I have at the moment is that we'll still only parameterize indexscans, but then allow those to be joined to unrelated relations while still remaining parameterized. That should reduce the number of parameterized Paths hanging around, because only joinclauses that match indexes will give rise to such Paths. But I think this is all fairly unrelated to the case that Hitoshi is on about. As you said earlier, it seems like we'd have to derive both parameterized and unparameterized plans for the subquery, which seems mighty expensive. 
regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
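Tom's two plan shapes produce the same join result and differ only in where rows of b get rejected. A toy C model of the two evaluation orders (purely illustrative, not planner code) makes the "marginal gain" concrete: pushing the qual into the inner scan means fewer rows are handed up to the join node:

```c
/* Toy model of the two nestloop plans: joining a and b on a.x = b.y.
 * rows_passed_up counts how many b rows reach the join node. */
static int rows_passed_up;

/* Plan 1: every inner row reaches the join node, which applies the
 * join filter itself. */
static int join_filter_at_join(const int *a, int na, const int *b, int nb)
{
    int matches = 0;
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
        {
            rows_passed_up++;            /* b row reaches the join node */
            if (a[i] == b[j])
                matches++;
        }
    return matches;
}

/* Plan 2: the qual is pushed into the (parameterized) inner scan, so
 * non-matching rows are rejected before being passed up. */
static int filter_at_inner_scan(const int *a, int na, const int *b, int nb)
{
    int matches = 0;
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
        {
            if (b[j] != a[i])
                continue;                /* rejected inside the scan */
            rows_passed_up++;
            matches++;
        }
    return matches;
}
```

Both functions do the same total comparisons, which is Tom's point: without an index on b.y the saving is only in per-row overhead between nodes, not in work avoided.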
Re: [HACKERS] 9.2 schedule
On Mon, May 23, 2011 at 10:44:20PM -0400, Greg Smith wrote: At the developer meeting last week: http://wiki.postgresql.org/wiki/PgCon_2011_Developer_Meeting there was an initial schedule for 9.2 hammered out and dutifully transcribed at http://wiki.postgresql.org/wiki/PostgreSQL_9.2_Development_Plan , and the one part I wasn't sure I had written down correctly I see Robert already fixed. There was a suggestion to add some publicity around the schedule for this release. Already started. :) http://www.postgresql.org/community/weeklynews/pwn20110522 There's useful PR value to making it more obvious to people that the main development plan is regular and time-based, even if the release date itself isn't fixed. The right time to make an initial announcement like that is soon, particularly if a goal here is to get more submitted into the first 9.2 CF coming in only a few weeks. Anyone have changes to suggest before this starts working its way toward an announcement? I thought we'd agreed on the timing for the first CF, and that I was to announce it in the PostgreSQL Weekly News, so I did just that. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reducing overhead of frequent table locks
On Tue, May 24, 2011 at 10:35:23AM -0400, Robert Haas wrote: On Tue, May 24, 2011 at 10:03 AM, Noah Misch n...@leadboat.com wrote: Let's see if I understand the risk better now: the new system will handle lock load better, but when it does hit a limit, understanding why that happened will be more difficult. Good point. No silver-bullet ideas come to mind for avoiding that. The only idea I can think of is to try to impose some bounds. For example, suppose we track the total number of locks that the system can handle in the shared hash table. We try to maintain the system in a state where the number of locks that actually exist is less than that number, even though some of them may be stored elsewhere. You can imagine a system where backends grab a global mutex to reserve a certain number of slots, and store that in shared memory together with their fast-path list, but another backend which is desperate for space can go through and trim back reservations to actual usage. Forcing artificial resource exhaustion is a high price to pay. I suppose it's quite like disabling Linux memory overcommit, and some folks would like it. Another random idea for optimization: we could have a lock-free array with one entry per backend, indicating whether any fast-path locks are present. Before acquiring its first fast-path lock, a backend writes a 1 into that array and inserts a store fence. After releasing its last fast-path lock, it performs a store fence and writes a 0 into the array. Anyone who needs to grovel through all the per-backend fast-path arrays for whatever reason can perform a load fence and then scan the array. If I understand how this stuff works (and it's very possible that I don't), when the scanning backend sees a 0, it can be assured that the target backend has no fast-path locks and therefore doesn't need to acquire and release that LWLock or scan that fast-path array for entries. 
I'm probably just missing something, but can't that conclusion become obsolete arbitrarily quickly? What if the scanning backend sees a 0, and the subject backend is currently sleeping just before it would have bumped that value? We need to take the LWLock if there's any chance that the subject backend has not yet seen the scanning backend's strong_lock_counts[] update. nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pull up aggregate subquery
On Tue, May 24, 2011 at 11:11 AM, Tom Lane t...@sss.pgh.pa.us wrote: I must be misunderstanding you, because index scans are the thing we already *do* parameterize; and what else would make any sense? The point I was trying to make is that the ultimate reason for having a parameterized portion-of-a-plan will be that there's a parameterized indexscan somewhere down at the bottom. I had originally imagined that we might parameterize any old scan; for example consider replacing

Nestloop
  Join Filter: a.x = b.y
  -> Seqscan on a
  -> Seqscan on b

with

Nestloop
  -> Seqscan on a
  -> Seqscan on b
       Filter: b.y = a.x

Oh, I see. I have a general gripe with nested loop plans: we already consider too many of them. IIRC, when I last fooled around with this, the number of nested loop paths that we generate far exceeded the number of merge or hash join paths, and most of those paths suck and are a complete waste of time. It strikes me that we ought to be trying to find ways to get rid of some of the paths we're already considering, rather than adding any more. In this particular case, if the second plan is actually faster, it probably won't be by much; we could think about trying to make some kind of ex-post-facto transformation instead of throwing everything into the path machinery. Although this isn't nearly as useful as if we could be using an index on b.y, there would still be some marginal gain to be had, because we'd be able to reject rows of b without first passing them up to the join node. But I'm afraid that going all-out like that would slow the planner down far too much (too many Paths to consider) to be justified by a marginal runtime gain. Agreed. So the idea I have at the moment is that we'll still only parameterize indexscans, but then allow those to be joined to unrelated relations while still remaining parameterized. That should reduce the number of parameterized Paths hanging around, because only joinclauses that match indexes will give rise to such Paths. 
That seems fine, yeah. If anything, we might want to limit it even more, but certainly that's a good place to start, and see how it shakes out. But I think this is all fairly unrelated to the case that Hitoshi is on about. As you said earlier, it seems like we'd have to derive both parameterized and unparameterized plans for the subquery, which seems mighty expensive. That was my first thought, too, but then I wondered if I was getting cheap. Most of the time, the subquery will be something simple, and replanning it twice won't really matter much. If it happens to be something complicated, then it will take longer, but on the other hand that's exactly the sort of byzantine query where you probably want the planner to pull out all the stops. Aggregates tend to feel slow almost invariably, because the amount of data being processed under the hood is much larger than what actually gets emitted, so I think we should at least consider the possibility that users really won't care about a bit of extra work. The case I'm concerned about is where you have several levels of nested aggregates, and the effect starts to multiply. But even if that turns out to be a problem, we could handle it by limiting consideration of the alternate path to the top 1 or 2 levels and handle the rest as we do now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cannot build docs of 9.1 on Windows
On 05/19/2011 06:29 PM, MauMau wrote: From: Andrew Dunstan and...@dunslane.net On Thu, May 19, 2011 10:32 am, Robert Haas wrote: 2011/5/16 MauMau maumau...@gmail.com: Can't open perl script make-errcodes-table.pl: No such file or directory I think this is the root of the problem. We have no script called make-errcodes-table.pl. Can you try changing it to generate-errcodes-table.pl and see if that works? Building docs under Windows in the buildfarm is on my TODO list. We already support it (as of a few weeks ago) for non-Windows build systems. That will help us make sure we don't have this kind of drift. Thank you. I could remove the error Can't open perl script make-errcodes-table.pl: N... by changing make-errcodes-table.pl to generate-errcodes-table.pl, but all other results seem to be the same as before. Andrew, could you announce the commit when you have successfully built docs on Windows? Can I know that fact by watching pgsql-hackers and pgsql-docs? I'll git-fetch the patch. builddoc.bat failed on my system and reading it made my head hurt. So I did what I've done with other bat files and rewrote it in Perl. The result is attached. It works for me, and should be a dropin replacement. Just put it in the src/tools/msvc directory and run perl builddoc.pl. Please test it and if it works for you we'll use it and make builddoc.bat a thin wrapper like build.bat and vcregress.bat. cheers andrew builddoc.pl Description: Perl program -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reducing overhead of frequent table locks
On Tue, May 24, 2011 at 11:38 AM, Noah Misch n...@leadboat.com wrote: Another random idea for optimization: we could have a lock-free array with one entry per backend, indicating whether any fast-path locks are present. Before acquiring its first fast-path lock, a backend writes a 1 into that array and inserts a store fence. After releasing its last fast-path lock, it performs a store fence and writes a 0 into the array. Anyone who needs to grovel through all the per-backend fast-path arrays for whatever reason can perform a load fence and then scan the array. If I understand how this stuff works (and it's very possible that I don't), when the scanning backend sees a 0, it can be assured that the target backend has no fast-path locks and therefore doesn't need to acquire and release that LWLock or scan that fast-path array for entries. I'm probably just missing something, but can't that conclusion become obsolete arbitrarily quickly? What if the scanning backend sees a 0, and the subject backend is currently sleeping just before it would have bumped that value? We need to take the LWLock is there's any chance that the subject backend has not yet seen the scanning backend's strong_lock_counts[] update. Can't we bump strong_lock_counts[] *first*, make sure that change is globally visible, and only then start scanning the array? Once we've bumped strong_lock_counts[] and made sure everyone can see that change, it's still possible for backends to take a fast-path lock in some *other* fast-path partition, but nobody should be able to add any more fast-path locks in the partition we care about after that point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
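Robert's bump-first ordering can be sketched with C11 atomics. The single-cell simplification and the names here are assumptions for illustration; the property being modeled is that when both sides are fenced this way, at least one side must observe the other's write, so the race Noah describes cannot leave both sides unaware of each other:

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int strong_count;   /* one strong_lock_counts[] cell */
static atomic_int present_flag;   /* one backend's fast-path flag  */

/* Weak locker: publish the presence flag, then check for strong
 * lockers.  Returns true if the fast path may be used. */
bool weak_side(void)
{
    atomic_store(&present_flag, 1);          /* publish first */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load(&strong_count) == 0;
}

/* Strong locker: bump the count and make it globally visible, then
 * scan.  Returns true if this backend's queue must be visited. */
bool strong_side(void)
{
    atomic_fetch_add(&strong_count, 1);      /* bump first */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load(&present_flag) != 0;
}
```

Whichever interleaving occurs, either the weak locker sees a nonzero count and falls back to the slow path, or the strong locker sees the flag and visits that backend's queue; the bad case — flag read as 0 while the weak locker also read the count as 0 — is excluded by the ordering.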
Re: [HACKERS] 9.2 schedule
On Tue, May 24, 2011 at 11:33 AM, David Fetter da...@fetter.org wrote: On Mon, May 23, 2011 at 10:44:20PM -0400, Greg Smith wrote: At the developer meeting last week: http://wiki.postgresql.org/wiki/PgCon_2011_Developer_Meeting there was an initial schedule for 9.2 hammered out and dutifully transcribed at http://wiki.postgresql.org/wiki/PostgreSQL_9.2_Development_Plan , and the one part I wasn't sure I had written down correctly I see Robert already fixed. There was a suggestion to add some publicity around the schedule for this release. Already started. :) http://www.postgresql.org/community/weeklynews/pwn20110522 There's useful PR value to making it more obvious to people that the main development plan is regular and time-based, even if the release date itself isn't fixed. The right time to make an initial announcement like that is soon, particularly if a goal here is to get more submitted into the first 9.2 CF coming in only a few weeks. Anyone have changes to suggest before this starts working its way toward an announcement? I thought we'd agreed on the timing for the first CF, and that I was to announce it in the PostgreSQL Weekly News, so I did just that. We talked about doing a separate -announce post just for this item, and there seemed to be some support for that. I'm OK with either way, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Domains versus polymorphic functions, redux
In http://archives.postgresql.org/pgsql-bugs/2011-05/msg00171.php Regina Obe complains that this fails in 9.1, though it worked before:

regression=# CREATE DOMAIN topoelementarray AS integer[];
CREATE DOMAIN
regression=# SELECT array_upper(ARRAY[[1,2], [3,4]]::topoelementarray, 1);
ERROR: function array_upper(topoelementarray, integer) does not exist

This is a consequence of the changes I made to fix bug #5717, particularly the issues around ANYARRAY matching discussed here: http://archives.postgresql.org/pgsql-hackers/2010-10/msg01545.php Regina is the second or third beta tester to complain about domains over arrays no longer matching ANYARRAY, so I think we'd better do something about it. I haven't tried to code anything up yet, but the ideas I'm considering trying to implement go like this: 1. If a domain type is passed to an ANYARRAY argument, automatically downcast it to its base type (which of course had better then be an array). This would include inserting an implicit cast into the expression tree, so that if the function uses get_fn_expr_argtype or similar, it would see the base type. Also, if the function returns ANYARRAY, its result is considered to be of the base type not the domain. 2. If a domain type is passed to an ANYELEMENT argument, automatically downcast it to its base type if there is any ANYARRAY argument, or if the function result type is ANYARRAY, or if any other ANYELEMENT argument is not of the same domain type. The first two cases are necessary since we don't have arrays of domains: the match is guaranteed to fail if we don't do this, since there can be no matching array type for the domain. The third case is meant to handle cases like function(domain-over-int, 42) where the function has two ANYELEMENT arguments: we now fail, but reducing the domain to int would allow success. An alternative rule we could use in place of #2 is just smash domains to base types always, when they're matched to ANYELEMENT. 
That would be simpler and more in keeping with #1, but it might change the behavior in cases where the historical behavior is reasonable (unlike the cases discussed in my message referenced above...) I find this simpler rule tempting from an implementor's standpoint, but am unsure if there'll be complaints. Comments, better ideas? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.2 schedule
On Tue, May 24, 2011 at 11:54:19AM -0400, Robert Haas wrote: On Tue, May 24, 2011 at 11:33 AM, David Fetter da...@fetter.org wrote: On Mon, May 23, 2011 at 10:44:20PM -0400, Greg Smith wrote: At the developer meeting last week: http://wiki.postgresql.org/wiki/PgCon_2011_Developer_Meeting there was an initial schedule for 9.2 hammered out and dutifully transcribed at http://wiki.postgresql.org/wiki/PostgreSQL_9.2_Development_Plan , and the one part I wasn't sure I had written down correctly I see Robert already fixed. There was a suggestion to add some publicity around the schedule for this release. Already started. :) http://www.postgresql.org/community/weeklynews/pwn20110522 There's useful PR value to making it more obvious to people that the main development plan is regular and time-based, even if the release date itself isn't fixed. The right time to make an initial announcement like that is soon, particularly if a goal here is to get more submitted into the first 9.2 CF coming in only a few weeks. Anyone have changes to suggest before this starts working its way toward an announcement? I thought we'd agreed on the timing for the first CF, and that I was to announce it in the PostgreSQL Weekly News, so I did just that. We talked about doing a separate -announce post just for this item, and there seemed to be some support for that. I'm OK with either way, though. For what it's worth, I think there should also be a separate -announce (and -general, and -hackers) post for the item. This is about getting the message out early and broadly so people have the best chance of getting it in time to act on it. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! 
Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] Pull up aggregate subquery
Robert Haas robertmh...@gmail.com writes: On Tue, May 24, 2011 at 11:11 AM, Tom Lane t...@sss.pgh.pa.us wrote: The point I was trying to make is that the ultimate reason for having a parameterized portion-of-a-plan will be that there's a parameterized indexscan somewhere down at the bottom. Oh, I see. I have a general gripe with nested loop plans: we already consider too many of them. IIRC, when I last fooled around with this, the number of nested loop paths that we generate far exceeded the number of merge or hash join paths, and most of those paths suck and are a complete waste of time. Hm, really? My experience is that it's the mergejoin paths that breed like rabbits, because there are so many potential sort orders. But I think this is all fairly unrelated to the case that Hitoshi is on about. As you said earlier, it seems like we'd have to derive both parameterized and unparameterized plans for the subquery, which seems mighty expensive. That was my first thought, too, but then I wondered if I was getting cheap. Yeah, it's certainly possible that we're worrying too much. Usually I only get concerned about added planner logic if it will impact the planning time for simple queries. Simple tends to be in the eye of the beholder, but something with a complicated aggregate subquery is probably not simple by anyone's definition. In this case the sticky point is that there could be multiple possible sets of clauses available to be pushed down, depending on what you assume is the outer relation for the eventual upper-level nestloop. So worst case, you could have not just one parameterized plan to generate in addition to the regular kind, but 2^N of them ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reducing overhead of frequent table locks
On Tue, May 24, 2011 at 11:52:54AM -0400, Robert Haas wrote:
> On Tue, May 24, 2011 at 11:38 AM, Noah Misch n...@leadboat.com wrote:
> > Another random idea for optimization: we could have a lock-free array with one entry per backend, indicating whether any fast-path locks are present. Before acquiring its first fast-path lock, a backend writes a 1 into that array and inserts a store fence. After releasing its last fast-path lock, it performs a store fence and writes a 0 into the array. Anyone who needs to grovel through all the per-backend fast-path arrays for whatever reason can perform a load fence and then scan the array. If I understand how this stuff works (and it's very possible that I don't), when the scanning backend sees a 0, it can be assured that the target backend has no fast-path locks and therefore doesn't need to acquire and release that LWLock or scan that fast-path array for entries.
> >
> > I'm probably just missing something, but can't that conclusion become obsolete arbitrarily quickly? What if the scanning backend sees a 0, and the subject backend is currently sleeping just before it would have bumped that value? We need to take the LWLock if there's any chance that the subject backend has not yet seen the scanning backend's strong_lock_counts[] update.
>
> Can't we bump strong_lock_counts[] *first*, make sure that change is globally visible, and only then start scanning the array? Once we've bumped strong_lock_counts[] and made sure everyone can see that change, it's still possible for backends to take a fast-path lock in some *other* fast-path partition, but nobody should be able to add any more fast-path locks in the partition we care about after that point.

There's a potentially-unbounded delay between when the subject backend reads strong_lock_counts[] and when it sets its fast-path-used flag. (I didn't mean "not yet seen" in the sense that some memory load would not show the latest value. I just meant that the subject backend may still be taking relevant actions based on its previous load of the value.) We could have the subject set its fast-path-used flag before even checking strong_lock_counts[], then clear the flag when strong_lock_counts[] dissuaded it from proceeding. Maybe that's what you had in mind?

That being said, it's a slight extra cost for all fast-path lockers to benefit the strong lockers, so I'm not prepared to guess whether it will pay off.
Re: [HACKERS] Domains versus polymorphic functions, redux
On Tue, May 24, 2011 at 11:12 AM, Tom Lane t...@sss.pgh.pa.us wrote: 1. If a domain type is passed to an ANYARRAY argument, automatically downcast it to its base type (which of course had better then be an array). This would include inserting an implicit cast into the expression tree, so that if the function uses get_fn_expr_argtype or similar, it would see the base type. Also, if the function returns ANYARRAY, its result is considered to be of the base type not the domain. Does that mean that plpgsql %type variable declarations will see the base type (and miss any constraint checks?). I think it's fine either way, but that's worth noting. An alternative rule we could use in place of #2 is just smash domains to base types always, when they're matched to ANYELEMENT. That would be simpler and more in keeping with #1, but it might change the behavior in cases where the historical behavior is reasonable (unlike the cases discussed in my message referenced above...) I find this simpler rule tempting from an implementor's standpoint, but am unsure if there'll be complaints. #2a seems cleaner to me (superficially). Got an example of a behavior you think is changed? In particular, is there a way the new function would fail where it used to not fail? merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Domains versus polymorphic functions, redux
On May 24, 2011, at 9:12 AM, Tom Lane wrote:
> An alternative rule we could use in place of #2 is just smash domains to base types always, when they're matched to ANYELEMENT. That would be simpler and more in keeping with #1, but it might change the behavior in cases where the historical behavior is reasonable (unlike the cases discussed in my message referenced above...) I find this simpler rule tempting from an implementor's standpoint, but am unsure if there'll be complaints.

I'm not sure where the historical behavior manifests, but this certainly seems like it might be the most consistent implementation, as well. Which option is least likely to violate the principle of surprise?

Best,

David
Re: [HACKERS] 9.2 schedule
David Fetter wrote:
> I thought we'd agreed on the timing for the first CF, and that I was to announce it in the PostgreSQL Weekly News, so I did just that.

Yes, and excellent. The other ideas were:

- Publish information about the full schedule to some of the more popular mailing lists.
- Link to this page more obviously from postgresql.org (a fixed redirect URL is probably the right approach) to bless it, and potentially improve its search rank too.

The specific new problem being highlighted to work on here is that the schedule and development process is actually quite good as open-source projects go, but that fact isn't visible at all unless you're already on the inside of the project.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
[HACKERS] Cascade replication (WIP)
Hi,

I'd like to propose a cascade replication feature (i.e., allow the standby to accept a replication connection from another standby) for 9.2. This feature is useful to reduce the overhead of the master, since by using it we can decrease the number of standbys directly connecting to the master.

I attached the WIP patch, which changes walsender so that it starts replication even during recovery. The walsender then attempts to send all WAL that's already been fsync'd to the standby's disk (i.e., it sends WAL up to the bigger of the receive location and the replay location). When the standby is promoted, all walsenders in that standby end, because they cannot continue replication any more in that case because of the timeline mismatch.

The standby must not accept a replication connection from that standby itself. Otherwise, since no new WAL data would ever appear in that standby, replication could not advance. As a safeguard against this, I introduced a new ID to identify each instance. The walsender sends that ID as the fourth field of the reply to IDENTIFY_SYSTEM, and walreceiver then checks whether the IDs are the same on both servers. If they are the same, the standby is connecting to itself, so walreceiver emits an ERROR.

One remaining problem which I'll have to tackle is this: even while walreceiver is not in progress (i.e., the startup process is retrieving WAL files from the archive), the cascading walsender should continuously send new WAL data. This means that the walsender should send the WAL file restored from the archive. The problem is that the name of such a restored WAL file is always RECOVERYXLOG, and walsender cannot currently handle a WAL file with that name. To address this, I'm thinking of making the startup process restore the WAL file under its real name instead of RECOVERYXLOG. Then, as on the master, the walsender can read and send the restored WAL file.
The required WAL file can be recycled before being sent. So we might need to enable wal_keep_segments setting even in the standby. Comments? Objections? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center *** a/doc/src/sgml/protocol.sgml --- b/doc/src/sgml/protocol.sgml *** *** 1357,1362 The commands accepted in walsender mode are: --- 1357,1374 /listitem /varlistentry + varlistentry + term +identificationkey + /term + listitem + para +Identification key. Also useful to check that the standby is +not connecting to that standby itself. + /para + /listitem + /varlistentry + /variablelist /para /listitem *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** *** 9551,9556 GetXLogReplayRecPtr(void) --- 9551,9572 } /* + * Get current standby flush position, ie, the last WAL position + * known to be fsync'd to disk in standby. + */ + XLogRecPtr + GetStandbyFlushRecPtr(void) + { + XLogRecPtr recvptr; + XLogRecPtr redoptr; + + recvptr = GetWalRcvWriteRecPtr(NULL); + redoptr = GetXLogReplayRecPtr(); + + return XLByteLT(recvptr, redoptr) ? 
redoptr : recvptr; + } + + /* * Report the last WAL replay location (same format as pg_start_backup etc) * * This is useful for determining how much of WAL is visible to read-only *** a/src/backend/postmaster/postmaster.c --- b/src/backend/postmaster/postmaster.c *** *** 351,357 static void processCancelRequest(Port *port, void *pkt); static int initMasks(fd_set *rmask); static void report_fork_failure_to_client(Port *port, int errnum); static CAC_state canAcceptConnections(void); - static long PostmasterRandom(void); static void RandomSalt(char *md5Salt); static void signal_child(pid_t pid, int signal); static bool SignalSomeChildren(int signal, int targets); --- 351,356 *** *** 2410,2415 reaper(SIGNAL_ARGS) --- 2409,2423 pmState = PM_RUN; /* + * Kill the cascading walsender to urge the cascaded standby to + * reread the timeline history file, adjust its timeline and + * establish replication connection again. This is required + * because the timeline of cascading standby is not consistent + * with that of cascaded one just after failover. + */ + SignalSomeChildren(SIGUSR2, BACKEND_TYPE_WALSND); + + /* * Crank up the background writer, if we didn't do that already * when we entered consistent recovery state. It doesn't matter * if this fails, we'll just try again later. *** *** 4369,4375 RandomSalt(char *md5Salt) /* * PostmasterRandom */ ! static long PostmasterRandom(void) { /* --- 4377,4383 /* * PostmasterRandom */ ! long PostmasterRandom(void) { /* *** a/src/backend/replication/basebackup.c ---
Re: [HACKERS] Domains versus polymorphic functions, redux
On Tue, May 24, 2011 at 12:12:55PM -0400, Tom Lane wrote: In http://archives.postgresql.org/pgsql-bugs/2011-05/msg00171.php Regina Obe complains that this fails in 9.1, though it worked before: regression=# CREATE DOMAIN topoelementarray AS integer[]; CREATE DOMAIN regression=# SELECT array_upper(ARRAY[[1,2], [3,4]]::topoelementarray, 1); ERROR: function array_upper(topoelementarray, integer) does not exist This is a consequence of the changes I made to fix bug #5717, particularly the issues around ANYARRAY matching discussed here: http://archives.postgresql.org/pgsql-hackers/2010-10/msg01545.php Regina is the second or third beta tester to complain about domains over arrays no longer matching ANYARRAY, so I think we'd better do something about it. I haven't tried to code anything up yet, but the ideas I'm considering trying to implement go like this: 1. If a domain type is passed to an ANYARRAY argument, automatically downcast it to its base type (which of course had better then be an array). This would include inserting an implicit cast into the expression tree, so that if the function uses get_fn_expr_argtype or similar, it would see the base type. Also, if the function returns ANYARRAY, its result is considered to be of the base type not the domain. We discussed this a few weeks ago: http://archives.postgresql.org/message-id/20110511093217.gb26...@tornado.gateway.2wire.net What's to recommend #1 over what I proposed then? Seems like a discard of functionality for little benefit. 2. If a domain type is passed to an ANYELEMENT argument, automatically downcast it to its base type if there is any ANYARRAY argument, or if the function result type is ANYARRAY, or if any other ANYELEMENT argument is not of the same domain type. The first two cases are necessary since we don't have arrays of domains: the match is guaranteed to fail if we don't do this, since there can be no matching array type for the domain. 
The third case is meant to handle cases like function(domain-over-int, 42) where the function has two ANYELEMENT arguments: we now fail, but reducing the domain to int would allow success. This seems generally consistent with other function-resolution rules around domains. On the other hand, existing users have supposedly coped by adding an explicit cast to one or the other argument to get the behavior they want. New applications will quietly get the cast, as it were, on the domain argument(s). I hesitate to say this is so clearly right as to warrant that change. Even if it is right, though, this smells like 9.2 material. nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Domains versus polymorphic functions, redux
David E. Wheeler da...@kineticode.com writes:
> On May 24, 2011, at 9:12 AM, Tom Lane wrote:
>> An alternative rule we could use in place of #2 is just smash domains to base types always, when they're matched to ANYELEMENT. That would be simpler and more in keeping with #1, but it might change the behavior in cases where the historical behavior is reasonable (unlike the cases discussed in my message referenced above...) I find this simpler rule tempting from an implementor's standpoint, but am unsure if there'll be complaints.

> I'm not sure where the historical behavior manifests, but this certainly seems like it might be the most consistent implementation, as well. Which option is least likely to violate the principle of surprise?

Well, the basic issue here is what happens when a function like

	create function noop(anyelement) returns anyelement ...

is applied to a domain argument. Currently, the result is thought to be of the domain type, whereas if we smash to base unconditionally, the result will be thought to be of the domain's base type. You can make an argument for either behavior, but I think the argument for the current behavior hinges on the assumption that such a function isn't doing anything to the argument value, only passing it through as-is.

I should probably also point out the previous discussion of this area from a couple weeks ago, notably here: http://archives.postgresql.org/pgsql-hackers/2011-05/msg00640.php

The example I gave there seems relevant:

create function negate(anyelement) returns anyelement as
$$ select - $1 $$ language sql;
create domain pos as int check (value > 0);
select negate(42::pos);

This example function isn't quite silly --- it will work on any datatype having a unary '-' operator, and you could imagine someone wanting to do something roughly like this in more realistic cases. But if you want to assume that the function returns pos when handed pos, you'd better be prepared to insert a CastToDomain node to recheck the domain constraint. Right now the SQL-function code doesn't support such cases:

regression=# select negate(42::pos);
ERROR:  return type mismatch in function declared to return pos
DETAIL:  Actual return type is integer.
CONTEXT:  SQL function "negate" during inlining

If we smashed to base type then this issue would go away. On the other hand, it feels like we'd be taking yet another step away from allowing domains to be usefully used in function declarations. I can't put my finger on any concrete consequence of that sort, since what we're talking about here is ANYELEMENT/ANYARRAY functions, not functions declared to take domains --- but it sure seems like this would put domains even further away from the status of first-class citizenship in the type system.

			regards, tom lane
Re: [HACKERS] Domains versus polymorphic functions, redux
Merlin Moncure mmonc...@gmail.com writes: On Tue, May 24, 2011 at 11:12 AM, Tom Lane t...@sss.pgh.pa.us wrote: 1. If a domain type is passed to an ANYARRAY argument, automatically downcast it to its base type (which of course had better then be an array). Does that mean that plpgsql %type variable declarations will see the base type (and miss any constraint checks?). No, this has nothing to do with %type. What's at stake is matching to functions/operators that are declared to take ANYARRAY. #2a seems cleaner to me (superficially). Got an example of a behavior you think is changed? See my response to David Wheeler. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] errno not set in case of libm functions (HPUX)
I have found a problem which is specifically related to the HP-UX compiler. All 'libm' functions on HP-UX Integrity servers do not set errno by default. To get errno set, we should compile the code using the +Olibmerrno option, so we should add this option in src/makefiles/Makefile.hpux. Otherwise we cannot expect this code to work properly [float.c]:

Datum
dacos(PG_FUNCTION_ARGS)
{
	...
	errno = 0;
	result = acos(arg1);
	if (errno != 0)
		ereport(ERROR,
				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
				 errmsg("input is out of range")));
	...
}

Because the acos function will not set errno in case of invalid input, the check will not trigger the error message. I have attached a patch to add this option to the HP-UX makefile. BTW, I found the same kind of discussion, without any conclusion, here: http://archives.postgresql.org/pgsql-hackers/2011-05/msg00046.php

--
Ibrar Ahmed

diff --git a/src/makefiles/Makefile.hpux b/src/makefiles/Makefile.hpux
index 1917d61..f2a8f19 100644
--- a/src/makefiles/Makefile.hpux
+++ b/src/makefiles/Makefile.hpux
@@ -43,6 +43,12 @@ else
    CFLAGS_SL = +Z
 endif
 
+
+# HP-UX libm functions on 'Integrity server' do not set errno by default,
+# for errno setting, compile with the +Olibmerrno option.
+
+CFLAGS := +Olibmerrno $(CFLAGS)
+
 # Rule for building a shared library from a single .o file
 %$(DLSUFFIX): %.o
 ifeq ($(GCC), yes)
Re: [HACKERS] Domains versus polymorphic functions, redux
Noah Misch n...@leadboat.com writes:
> On Tue, May 24, 2011 at 12:12:55PM -0400, Tom Lane wrote:
>> This is a consequence of the changes I made to fix bug #5717, particularly the issues around ANYARRAY matching discussed here: http://archives.postgresql.org/pgsql-hackers/2010-10/msg01545.php

> We discussed this a few weeks ago: http://archives.postgresql.org/message-id/20110511093217.gb26...@tornado.gateway.2wire.net
> What's to recommend #1 over what I proposed then? Seems like a discard of functionality for little benefit.

I am unwilling to commit to making #2 work, especially not under time constraints; and you apparently aren't either, since you haven't produced the patch you alluded to at the end of that thread. Even if you had, though, I'd have no confidence that all holes of the sort had been closed. What you're proposing is to ratchet up the implementation requirements for every PL and every C function declared to accept polymorphic types, and there are a lot of members of both classes that we don't control.

> I hesitate to say this is so clearly right as to warrant that change. Even if it is right, though, this smells like 9.2 material.

Well, I'd been hoping to leave it for later too, but it seems like we have to do something about the ANYARRAY case for 9.1. Making ANYARRAY's response to domains significantly inconsistent with ANYELEMENT's response doesn't seem like a good plan.

			regards, tom lane
Re: [HACKERS] Proposal: Another attempt at vacuum improvements
So, first of all, thanks for putting some effort and thought into this. Despite the large number of improvements in this area in 8.3 and 8.4, this is still a pain point, and it would be really nice to find a way to make some further improvements. On Tue, May 24, 2011 at 2:58 AM, Pavan Deolasee pavan.deola...@gmail.com wrote: So the idea is to separate the index vacuum (removing index pointers to dead tuples) from the heap vacuum. When we do heap vacuum (either by HOT-pruning or using regular vacuum), we can spool the dead line pointers somewhere. To avoid any hot-spots during normal processing, the spooling can be done periodically like the stats collection. What happens if the system crashes after a line pointer becomes dead but before the record of its death is safely on disk? The fact that a previous index vacuum has committed is only sufficient justification for reclaiming the dead line pointers if you're positive that the index vacuum killed the index pointers for *every* dead line pointer. I'm not sure we want to go there; any operation that wants to make a line pointer dead will need to be XLOG'd. Instead, I think we should stick with your original idea and just try to avoid the second heap pass. So to do that, as you say, we can have every operation that creates a dead line pointer note the LSN of the operation in the page. But instead of allocating permanent space in the page header, which would both reduce (admittedly only by 8 bytes) the amount of space available for tuples, and more significantly have the effect of breaking on-disk compatibility, I'm wondering if we could get by with making space for that extra LSN only when it's actually present. In other words, when it's present, we set a bit PD_HAS_DEAD_LINE_PTR_LSN or somesuch, increment pd_upper, and use the extra space to store the LSN. There is an alignment problem to worry about there but that shouldn't be a huge issue. When we vacuum, we remember the LSN before we start. 
When we finish, if we scanned the indexes and everything completed without error, then we bump the heap's notion (wherever we store it) of the last successful index vacuum. When we vacuum or do HOT cleanup on a page, if the page has a most-recent-dead-line pointer LSN and it precedes the start-of-last-successful-index-vacuum LSN, then we mark all the LP_DEAD tuples as LP_UNUSED and throw away the most-recent-dead-line-pointer LSN. One downside of this approach is that, if we do something like this, it'll become slightly more complicated to figure out where the item pointer array ends. Another issue is that we might find ourselves wanting to extend the item pointer array to add a new item, and unable to do so easily because this most-recent-dead-line-pointer LSN is in the way. If the LSN stored in the page precedes the start-of-last-successful-index-vacuum LSN, and if, further, we can get a buffer cleanup lock on the page, then we can do a HOT cleanup and life is good. Otherwise, we can either (1) just forget about the most-recent-dead-line-pointer LSN - not ideal but not catastrophic either - or (2) if the start-of-last-successful-vacuum-LSN is old enough, we could overwrite an LP_DEAD line pointer in place. Another issue is that this causes problems for temporary and unlogged tables, because no WAL records are generated and, therefore, the LSN does not advance. This is also a problem for GIST indexes; Heikki fixed temporary GIST indexes by generating fake LSNs off of a backend-local counter. Unlogged GIST indexes are currently not supported. I think what we need to do is create an API to which you can pass a relation and get an LSN. If it's a permanent relation, you get a regular LSN. If it's a temporary relation, you get a fake LSN based on a backend-local counter. If it's an unlogged relation, you get a fake LSN based on a shared-memory counter that is reset on restart. 
If we can encapsulate that properly, it should provide both what we need to make this idea work and allow a somewhat graceful fix for GIST-vs-unlogged problem. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] Error compiling sepgsql in PG9.1
The attached patch makes the configure script abort when it is run with the '--with-selinux' option but libselinux is older than the minimum requirement for SE-PostgreSQL. As the documentation says, it needs at least libselinux-2.0.93, because this and later versions support selabel_lookup(3) for database object classes, used for initial labeling. The current configure script checks for the existence of libselinux, but does no version check. (getpeercon_raw(3) has been a supported API for a long time.) selinux_sepgsql_context_path(3) is a good watermark for libselinux-2.0.93 instead.

Thanks,
--
NEC Europe Ltd, SAP Global Competence Center
KaiGai Kohei kohei.kai...@emea.nec.com

> -----Original Message-----
> From: Devrim GÜNDÜZ [mailto:dev...@gunduz.org]
> Sent: 21. Mai 2011 07:46
> To: Kohei Kaigai
> Cc: Emanuel Calvo; postgresql Forums; KaiGai Kohei
> Subject: Re: [GENERAL] Error compiling sepgsql in PG9.1
>
> On Sat, 2011-05-21 at 02:50 +0100, Kohei Kaigai wrote:
> > As documentation said, it needs libselinux 2.0.93 or higher. This version supports selabel_lookup(3) for database object classes.
>
> AFAICS, we are not checking it during configure. It might be worth to add a libselinux version check in the configure phase.
>
> --
> Devrim GÜNDÜZ
> Principal Systems Engineer @ EnterpriseDB: http://www.enterprisedb.com
> PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
> Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
> http://www.gunduz.org Twitter: http://twitter.com/devrimgunduz

sepgsql-fix-config-version.patch
Description: sepgsql-fix-config-version.patch
Re: [HACKERS] Pull up aggregate subquery
2011/5/25 Tom Lane t...@sss.pgh.pa.us:
> Robert Haas robertmh...@gmail.com writes:
>> That was my first thought, too, but then I wondered if I was getting cheap.
>
> Yeah, it's certainly possible that we're worrying too much. Usually I only get concerned about added planner logic if it will impact the planning time for simple queries. Simple tends to be in the eye of the beholder, but something with a complicated aggregate subquery is probably not simple by anyone's definition. In this case the sticky point is that there could be multiple possible sets of clauses available to be pushed down, depending on what you assume is the outer relation for the eventual upper-level nestloop. So worst case, you could have not just one parameterized plan to generate in addition to the regular kind, but 2^N of them ...

My intention is that a Var can be pushed down only if the join qual matches the subquery Agg's grouping keys, so I'm not worried about exponential growth in the number of paths. And I found the right place to hack: set_subquery_pathlist(), which pushes down some baserestrictinfo clauses. We don't have a Var in the RestrictInfo now, but I guess we can put one in somehow before reaching there. Even if I can do that, the optimization is only effective when the outer side produces a single tuple. As I noted earlier, this optimization would be completed with the executor's cooperation, something like gather-param-values-as-array before starting the Agg scan. So I'm still thinking about which of pulling up and a parameterized scan is better.

Regards,

--
Hitoshi Harada
Re: [HACKERS] 9.2 schedule
Robert, -Publish information about the full schedule to some of the more popular mailing lists I think that posting to pgsql-announce and PostgreSQL.org news, and this list would be sufficient. I'm happy to take care of that. -Link to this page more obviously from postgresql.org (fixed redirect URL is probably the right approach) to bless it, and potentially improve its search rank too. I would suggest instead adding a new page to postgresql.org/developer which lists the development schedule, rather than linking to that wiki page. Maybe on this page? http://www.postgresql.org/developer/roadmap -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reducing overhead of frequent table locks
On Tue, May 24, 2011 at 12:34 PM, Noah Misch n...@leadboat.com wrote: There's a potentially-unbounded delay between when the subject backend reads strong_lock_counts[] and when it sets its fast-path-used flag. (I didn't mean "not yet seen" in the sense that some memory load would not show the latest value. I just meant that the subject backend may still be taking relevant actions based on its previous load of the value.) We could have the subject set its fast-path-used flag before even checking strong_lock_counts[], then clear the flag when strong_lock_counts[] dissuaded it from proceeding. Maybe that's what you had in mind? I'd like to say yes, but actually, no, I just failed to notice the race condition. It's definitely less appealing if we have to do it that way. Another idea would be to only clear the fast-path-used flags lazily. If backend A inspects the fast-path queue for backend B and finds it completely empty, it clears the flag; otherwise it just stays set indefinitely. That being said, it's a slight extra cost for all fast-path lockers to benefit the strong lockers, so I'm not prepared to guess whether it will pay off. Yeah. Basically this entire idea is about trying to make life easier for weak lockers at the expense of making it more difficult for strong lockers. I think that's a good trade-off in general, but we might need to wait until we have an actual implementation to judge whether we've turned the dial too far. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.2 schedule
On Tue, May 24, 2011 at 1:35 PM, Josh Berkus j...@agliodbs.com wrote: Robert, Actually, you're responding to Greg, not me. But +1 for your suggestions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] errno not set in case of libm functions (HPUX)
Ibrar Ahmed ibrar.ah...@gmail.com writes: I have found a problem which is specifically related to HP-UX compiler. All 'libm' functions on HP-UX Integrity server do not set errno by default. For 'errno' setting we should compile the code using +Olibmerrno option. So we should add this option in /src/makefiles/Makefile.hpux. This patch will break things on my admittedly rather ancient HPUX box: $ cc +Olibmerrno cc: warning 450: Unrecognized option +Olibmerrno. As submitted, it would also break gcc-based builds, though that at least wouldn't be hard to fix. If you want to submit a configure patch to test whether the switch is appropriate, we could consider it. BTW, is it really true that HP decided they could make the compiler's default behavior violate the C standard so flagrantly? I could believe offering a switch that you had to specify to save a few cycles at the cost of nonstandard behavior; but if your report is actually correct, their engineering standards have gone way downhill since I worked there. I wonder whether you are inserting some other nonstandard switch that turns on this effect. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.2 schedule
Actually, you're responding to Greg, not me. Sorry. But +1 for your suggestions. Any objections before I post something? Greg? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pull up aggregate subquery
On Tue, May 24, 2011 at 12:34 PM, Tom Lane t...@sss.pgh.pa.us wrote: Oh, I see. I have a general gripe with nested loop plans: we already consider too many of them. IIRC, when I last fooled around with this, the number of nested loop paths that we generate far exceeded the number of merge or hash join paths, and most of those paths suck and are a complete waste of time. Hm, really? My experience is that it's the mergejoin paths that breed like rabbits, because there are so many potential sort orders. *scratches head* Well, I'm pretty sure that's how it looked when I was testing it. I wonder how this could be different for the two of us. Or maybe one of us is confused. Admittedly, I haven't looked at it in a while. But I think this is all fairly unrelated to the case that Hitoshi is on about. As you said earlier, it seems like we'd have to derive both parameterized and unparameterized plans for the subquery, which seems mighty expensive. That was my first thought, too, but then I wondered if I was getting cheap. Yeah, it's certainly possible that we're worrying too much. Usually I only get concerned about added planner logic if it will impact the planning time for simple queries. Simple tends to be in the eye of the beholder, but something with a complicated aggregate subquery is probably not simple by anyone's definition. In this case the sticky point is that there could be multiple possible sets of clauses available to be pushed down, depending on what you assume is the outer relation for the eventual upper-level nestloop. So worst case, you could have not just one parameterized plan to generate in addition to the regular kind, but 2^N of them ... Hmm. Well, 2^N is more than 2. But I bet most of them are boring. Judging by his followup email, Hitoshi Harada seems to think we can just look at the case where we can parameterize on all of the grouping columns. 
The only other case that seems like it might be interesting is parameterizing on any single one of the grouping columns. I can't get excited about pushing down arbitrary subsets. Of course, even O(N) in the number of grouping columns might be too much, but then we could fall back to just all-or-nothing. I think the "all" case by itself would probably extract 90%+ of the benefit, especially since "all" will often mean "the only one there is." -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] errno not set in case of libm functions (HPUX)
On 05/24/2011 01:44 PM, Tom Lane wrote: Ibrar Ahmedibrar.ah...@gmail.com writes: I have found a problem which is specifically related to HP-UX compiler. All 'libm' functions on HP-UX Integrity server do not set errno by default. For 'errno' setting we should compile the code using +Olibmerrno option. So we should add this option in /src/makefiles/Makefile.hpux. This patch will break things on my admittedly rather ancient HPUX box: $ cc +Olibmerrno cc: warning 450: Unrecognized option +Olibmerrno. As submitted, it would also break gcc-based builds, though that at least wouldn't be hard to fix. If you want to submit a configure patch to test whether the switch is appropriate, we could consider it. BTW, is it really true that HP decided they could make the compiler's default behavior violate the C standard so flagrantly? I could believe offering a switch that you had to specify to save a few cycles at the cost of nonstandard behavior; but if your report is actually correct, their engineering standards have gone way downhill since I worked there. I wonder whether you are inserting some other nonstandard switch that turns on this effect. I have been whining for years about the lack of HP-UX support (both for gcc and their compiler) on the buildfarm. I really really wish HP would come to the party and supply some equipment and software. Failing that, some spare cycles being made available on a machine by someone else who runs it would be good. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] errno not set in case of libm functions (HPUX)
On 24.05.2011 20:44, Tom Lane wrote: BTW, is it really true that HP decided they could make the compiler's default behavior violate the C standard so flagrantly? I could believe offering a switch that you had to specify to save a few cycles at the cost of nonstandard behavior; but if your report is actually correct, their engineering standards have gone way downhill since I worked there. I wonder whether you are inserting some other nonstandard switch that turns on this effect. This (http://docs.hp.com/en/B3901-90015/ch02s07.html) says: +O[no]libmerrno Description: This option enables[disables] support for errno in libm functions. The default is +Onolibmerrno. In C++ C-mode, the default is +Olibmerrno with -Aa option. So the default is indeed non-standard. But I wonder if we should use -Aa instead? The documentation I found for -Aa (http://docs.hp.com/en/B3901-90017/ch02s22.html) says: -Aa The -Aa option instructs the compiler to use Koenig lookup and strict ANSI for scope rules. This option is equivalent to specifying -Wc,-koenig_lookup,on and -Wc,-ansi_for_scope,on. The default is off. Refer to -Ae option for C++ C-mode description. The standard features enabled by -Aa are incompatible with earlier C and C++ features. That sounds like what we want. Apparently that description is not complete, and -Aa changes some other behavior to ANSI C compatible as well, like +Olibmerrno. There's also -AC99, which specifies compiling in C99-mode - I wonder if that sets +Olibmerrno too. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Domains versus polymorphic functions, redux
On Tue, May 24, 2011 at 01:28:38PM -0400, Tom Lane wrote: Noah Misch n...@leadboat.com writes: On Tue, May 24, 2011 at 12:12:55PM -0400, Tom Lane wrote: This is a consequence of the changes I made to fix bug #5717, particularly the issues around ANYARRAY matching discussed here: http://archives.postgresql.org/pgsql-hackers/2010-10/msg01545.php We discussed this a few weeks ago: http://archives.postgresql.org/message-id/20110511093217.gb26...@tornado.gateway.2wire.net What's to recommend #1 over what I proposed then? Seems like a discard of functionality for little benefit. I am unwilling to commit to making #2 work, especially not under time constraints; and you apparently aren't either, since you haven't produced the patch you alluded to at the end of that thread. I took your lack of any response as non-acceptance of the plan I outlined. Alas, the wrong conclusion. I'll send a patch this week. Even if you had, though, I'd have no confidence that all holes of the sort had been closed. What you're proposing is to ratchet up the implementation requirements for every PL and every C function declared to accept polymorphic types, and there are a lot of members of both classes that we don't control. True. I will not give you that confidence. Those omissions would have to remain bugs to be fixed as they're found. nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] errno not set in case of libm functions (HPUX)
On 24.05.2011 20:56, Andrew Dunstan wrote: I have been whining for years about the lack of HP-UX support (both for gcc and their compiler) on the buildfarm. I really really wish HP would come to the party and supply some equipment and software. Failing that, some spare cycles being made available on a machine by someone else who runs it would be good. I'm trying to arrange access to a HP-UX box within EnterpriseDB. No luck this far. Hopefully I'll get a buildfarm animal up in the next week or so, but don't hold your breath... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Alignment padding bytes in arrays vs the planner
On Mon, May 23, 2011 at 1:12 AM, Noah Misch n...@leadboat.com wrote: On Tue, Apr 26, 2011 at 11:51:35PM -0400, Noah Misch wrote: On Tue, Apr 26, 2011 at 07:23:12PM -0400, Tom Lane wrote: [input functions aren't the only problematic source of uninitialized datum bytes] We've run into other manifestations of this issue before. Awhile ago I made a push to ensure that datatype input functions didn't leave any ill-defined padding bytes in their results, as a result of similar misbehavior for simple constants. But this example shows that we'd really have to enforce the rule of no ill-defined bytes for just about every user-callable function's results, which is a pretty ugly prospect. FWIW, when I was running the test suite under valgrind, these were the functions that left uninitialized bytes in datums: array_recv, array_set, array_set_slice, array_map, construct_md_array, path_recv. If the test suite covers this well, we're not far off. (Actually, I only had the check in PageAddItem ... probably needed to be in one or two other places to catch as much as possible.) Adding a memory definedness check to printtup() turned up one more culprit: tsquery_and. *squints* OK, I can't see what's broken. Help? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Alignment padding bytes in arrays vs the planner
On Tue, May 24, 2011 at 02:05:33PM -0400, Robert Haas wrote: On Mon, May 23, 2011 at 1:12 AM, Noah Misch n...@leadboat.com wrote: On Tue, Apr 26, 2011 at 11:51:35PM -0400, Noah Misch wrote: On Tue, Apr 26, 2011 at 07:23:12PM -0400, Tom Lane wrote: [input functions aren't the only problematic source of uninitialized datum bytes] We've run into other manifestations of this issue before. Awhile ago I made a push to ensure that datatype input functions didn't leave any ill-defined padding bytes in their results, as a result of similar misbehavior for simple constants. But this example shows that we'd really have to enforce the rule of no ill-defined bytes for just about every user-callable function's results, which is a pretty ugly prospect. FWIW, when I was running the test suite under valgrind, these were the functions that left uninitialized bytes in datums: array_recv, array_set, array_set_slice, array_map, construct_md_array, path_recv. If the test suite covers this well, we're not far off. (Actually, I only had the check in PageAddItem ... probably needed to be in one or two other places to catch as much as possible.) Adding a memory definedness check to printtup() turned up one more culprit: tsquery_and. *squints* OK, I can't see what's broken. Help? QTN2QT() allocates memory for a TSQuery using palloc(). TSQuery contains an array of QueryItem, which contains three bytes of padding between its first and second members. Those bytes don't get initialized, so we have unpredictable content in the resulting datum. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] Error compiling sepgsql in PG9.1
2011/5/24 Kohei Kaigai kohei.kai...@emea.nec.com: The attached patch makes the configure script abort when it is run with the '--with-selinux' option but the installed libselinux is older than the minimum version required by SE-PostgreSQL. As the documentation says, at least libselinux-2.0.93 is needed, because that version and later support selabel_lookup(3) for database object classes, which is used for initial labeling. The current configure script checks for the existence of libselinux, but does no version check. (getpeercon_raw(3) has been a supported API for a long time.) selinux_sepgsql_context_path(3) is a good watermark for libselinux-2.0.93 instead. Looks to me like you need to adjust the wording of the error message. Maybe "libselinux version 2.0.93 or newer is required", or something like that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] errno not set in case of libm functions (HPUX)
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: So the default is indeed non-standard. But I wonder if we should use -Aa instead? Probably not; at least on older HPUX versions, -Aa turns off access to assorted stuff that we do want, eg long long. man cc on my box saith:

-Amode    Specify the compilation standard to be used by the compiler. mode can be one of the following letters:

    c    (Default) Compile in a mode compatible with HP-UX releases prior to 7.0. (See The C Programming Language, First Edition by Kernighan and Ritchie). This option also defines the symbol _HPUX_SOURCE and allows the user to access macros and typedefs provided by the HPUX Operating System. The default compilation mode may change in future releases.

    a    Compile under ANSI mode (ANSI programming language C standard ISO 9899:1990). When compiling under ANSI mode, the header files would define only those names (macros and typedefs) specified by the Standard. To access macros and typedefs that are not defined by the ANSI Standard but are provided by the HPUX Operating System, define the symbol _HPUX_SOURCE; or use the extension option described below.

    e    Extended ANSI mode. Same as -Aa -D_HPUX_SOURCE +e. This would define the names (macros and typedefs) provided by the HPUX Operating System and, in addition, allow the following extensions: $ characters in identifier names, sized enums, sized bit-fields, and 64-bit integral type long long. Additional extensions may be added to this option in the future.

The +e option is elsewhere stated to mean:

    +e    Enables HP value-added features while compiling in ANSI C mode, -Aa. This option is ignored with -Ac because these features are already provided. Features enabled: o Long pointers o Integral type specifiers can appear in enum declarations. o The $ character can appear in identifier names. o Missing parameters on intrinsic calls

which isn't 100% consistent with what it says under -Ae, so maybe some additional experimentation is called for.
But anyway, autoconf appears to think that -Ae is preferable to the combination -Aa -D_HPUX_SOURCE (that choice is coming from autoconf not our own code); so I'm not optimistic that we can get more-standard behavior by overriding that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Alignment padding bytes in arrays vs the planner
On Tue, May 24, 2011 at 2:11 PM, Noah Misch n...@leadboat.com wrote: OK, I can't see what's broken. Help? QTN2QT() allocates memory for a TSQuery using palloc(). TSQuery contains an array of QueryItem, which contains three bytes of padding between its first and second members. Those bytes don't get initialized, so we have unpredictable content in the resulting datum. OK, so I guess this needs to be applied and back-patched to 8.3, then. 8.2 doesn't have this code. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Domains versus polymorphic functions, redux
On May 24, 2011, at 10:11 AM, Tom Lane wrote: regression=# select negate(42::pos); ERROR: return type mismatch in function declared to return pos DETAIL: Actual return type is integer. CONTEXT: SQL function "negate" during inlining If we smashed to base type then this issue would go away. +1 On the other hand it feels like we'd be taking yet another step away from allowing domains to be usefully used in function declarations. I can't put my finger on any concrete consequence of that sort, since what we're talking about here is ANYELEMENT/ANYARRAY functions not functions declared to take domains --- but it sure seems like this would put domains even further away from the status of first-class citizenship in the type system. I agree. It sure seems to me like DOMAINs should act exactly like any other type. I know that has improved over time, and superficially at least, the above change will make them seem more like one than the error does. But maybe it's time to re-think how domains are implemented? (Not for 9.1, mind.) I mean, why *don't* they act like first class types? Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Alignment padding bytes in arrays vs the planner
Robert Haas robertmh...@gmail.com writes: On Tue, May 24, 2011 at 2:11 PM, Noah Misch n...@leadboat.com wrote: QTN2QT() allocates memory for a TSQuery using palloc(). TSQuery contains an array of QueryItem, which contains three bytes of padding between its first and second members. Those bytes don't get initialized, so we have unpredictable content in the resulting datum. OK, so I guess this needs to be applied and back-patched to 8.3, then. Yeah. I'm in process of doing that, actually. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Alignment padding bytes in arrays vs the planner
On Tue, May 24, 2011 at 2:18 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Tue, May 24, 2011 at 2:11 PM, Noah Misch n...@leadboat.com wrote: QTN2QT() allocates memory for a TSQuery using palloc(). TSQuery contains an array of QueryItem, which contains three bytes of padding between its first and second members. Those bytes don't get initialized, so we have unpredictable content in the resulting datum. OK, so I guess this needs to be applied and back-patched to 8.3, then. Yeah. I'm in process of doing that, actually. Excellent. Are you going to look at MauMau's patch for bug #6011 also? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Domains versus polymorphic functions, redux
David E. Wheeler da...@kineticode.com writes: On May 24, 2011, at 10:11 AM, Tom Lane wrote: On the other hand it feels like we'd be taking yet another step away from allowing domains to be usefully used in function declarations. I agree. It sure seems to me like DOMAINs should act exactly like any other type. I know that has improved over time, and superficially at least, the above will make it seem like more like than it does with the error. But maybe it's time to re-think how domains are implemented? (Not for 9.1, mind.) I mean, why *don't* they act like first class types? Well, if they actually were first-class types, they probably wouldn't be born with an implicit cast to some other type to handle 99% of operations on them ;-). I think the hard part here is having that cake and eating it too, ie, supporting domain-specific functions without breaking the implicit use of the base type's functions. I guess that the question that's immediately at hand is sort of a variant of that, because using a polymorphic function declared to take ANYARRAY on a domain-over-array really is using a portion of the base type's functionality. What we've learned from bug #5717 and the subsequent issues is that using that base functionality without immediately abandoning the notion that the domain has some life of its own (ie, immediately casting to the base type) is harder than it looks. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Domains versus polymorphic functions, redux
On May 24, 2011, at 11:30 AM, Tom Lane wrote: Well, if they actually were first-class types, they probably wouldn't be born with an implicit cast to some other type to handle 99% of operations on them ;-). I think the hard part here is having that cake and eating it too, ie, supporting domain-specific functions without breaking the implicit use of the base type's functions. Yeah. I guess that the question that's immediately at hand is sort of a variant of that, because using a polymorphic function declared to take ANYARRAY on a domain-over-array really is using a portion of the base type's functionality. What we've learned from bug #5717 and the subsequent issues is that using that base functionality without immediately abandoning the notion that the domain has some life of its own (ie, immediately casting to the base type) is harder than it looks. Well, in the ANYELEMENT context (or ANYARRAY), what could be lost by abandoning the notion that the domain has some life of its own? Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Alignment padding bytes in arrays vs the planner
Noah Misch n...@leadboat.com writes: Adding a memory definedness check to printtup() turned up one more culprit: tsquery_and. Patch applied, thanks. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Alignment padding bytes in arrays vs the planner
Robert Haas robertmh...@gmail.com writes: Excellent. Are you going to look at MauMau's patch for bug #6011 also? No. I don't do Windows, so I can't test it. (On general principles, I don't think that hacking write_eventlog the way he did is appropriate; such a function should write the log, not editorialize. But that's up to whoever does commit it.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] inconvenient compression options in pg_basebackup
On Sun, 2011-05-22 at 16:43 -0400, Magnus Hagander wrote: On Fri, May 20, 2011 at 17:45, Peter Eisentraut pete...@gmx.net wrote: On Fri, 2011-05-20 at 14:19 -0400, Magnus Hagander wrote: I suggest we add an argument-less option -z that means compress, and then -Z can be relegated to choosing the compression level. We can't just use -Z without a parameter for that? You can't portably have a command-line option with an optional argument. Ugh. In that case, I'm fine with your suggestion. Quick patch for verification. I chose the naming -z/--gzip to mirror GNU tar.

diff --git i/doc/src/sgml/ref/pg_basebackup.sgml w/doc/src/sgml/ref/pg_basebackup.sgml
index 8a7b833..ce7eb52 100644
--- i/doc/src/sgml/ref/pg_basebackup.sgml
+++ w/doc/src/sgml/ref/pg_basebackup.sgml
@@ -169,8 +169,8 @@ PostgreSQL documentation
      </varlistentry>
 
      <varlistentry>
-      <term><option>-Z <replaceable class="parameter">level</replaceable></option></term>
-      <term><option>--compress=<replaceable class="parameter">level</replaceable></option></term>
+      <term><option>-z</option></term>
+      <term><option>--gzip</option></term>
       <listitem>
        <para>
         Enables gzip compression of tar file output. Compression is only
@@ -179,6 +179,18 @@ PostgreSQL documentation
        </para>
       </listitem>
      </varlistentry>
+
+     <varlistentry>
+      <term><option>-Z <replaceable class="parameter">level</replaceable></option></term>
+      <term><option>--compress-level=<replaceable class="parameter">level</replaceable></option></term>
+      <listitem>
+       <para>
+        Sets the compression level when gzip compression is enabled.
+        The default is the default compression level of the zlib
+        library.
+       </para>
+      </listitem>
+     </varlistentry>
    </variablelist>
   </para>
   <para>
@@ -394,11 +406,11 @@ PostgreSQL documentation
   </para>
 
   <para>
-   To create a backup of the local server with one maximum compressed
+   To create a backup of the local server with one compressed
    tar file for each tablespace, and store it in the directory
    <filename>backup</filename>, showing a progress report while running:
<screen>
-<prompt>$</prompt> <userinput>pg_basebackup -D backup -Ft -Z9 -P</userinput>
+<prompt>$</prompt> <userinput>pg_basebackup -D backup -Ft -z -P</userinput>
</screen>
   </para>

diff --git i/src/bin/pg_basebackup/pg_basebackup.c w/src/bin/pg_basebackup/pg_basebackup.c
index 1f31fe0..7c2cb57 100644
--- i/src/bin/pg_basebackup/pg_basebackup.c
+++ w/src/bin/pg_basebackup/pg_basebackup.c
@@ -32,7 +32,10 @@
 char	format = 'p';		/* p(lain)/t(ar) */
 char	*label = "pg_basebackup base backup";
 bool	showprogress = false;
 int		verbose = 0;
-int		compresslevel = 0;
+bool	gzip = false;
+#ifdef HAVE_LIBZ
+int		compresslevel = Z_DEFAULT_COMPRESSION;
+#endif
 bool	includewal = false;
 bool	fastcheckpoint = false;
 char	*dbhost = NULL;
@@ -126,7 +129,8 @@ usage(void)
 	printf(_("  -D, --pgdata=DIRECTORY    receive base backup into directory\n"));
 	printf(_("  -F, --format=p|t          output format (plain, tar)\n"));
 	printf(_("  -x, --xlog                include required WAL files in backup\n"));
-	printf(_("  -Z, --compress=0-9        compress tar output\n"));
+	printf(_("  -z, --gzip                compress tar output with gzip\n"));
+	printf(_("  -Z, --compress-level=0-9  compression level\n"));
 	printf(_("\nGeneral options:\n"));
 	printf(_("  -c, --checkpoint=fast|spread\n"
 	         "                            set fast or spread checkpointing\n"));
@@ -265,7 +269,7 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 	else
 	{
 #ifdef HAVE_LIBZ
-		if (compresslevel > 0)
+		if (gzip)
 		{
 			snprintf(fn, sizeof(fn), "%s/base.tar.gz", basedir);
 			ztarfile = gzopen(fn, "wb");
@@ -289,7 +293,7 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 		 * Specific tablespace
 		 */
 #ifdef HAVE_LIBZ
-		if (compresslevel > 0)
+		if (gzip)
 		{
 			snprintf(fn, sizeof(fn), "%s/%s.tar.gz",
 					 basedir, PQgetvalue(res, rownum, 0));
 			ztarfile = gzopen(fn, "wb");
@@ -309,7 +313,7 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 	}
 
 #ifdef HAVE_LIBZ
-	if (compresslevel > 0)
+	if (gzip)
 	{
 		if (!ztarfile)
 		{
@@ -919,7 +923,8 @@ main(int argc, char **argv)
 		{"format", required_argument, NULL, 'F'},
 		{"checkpoint", required_argument, NULL, 'c'},
 		{"xlog", no_argument, NULL, 'x'},
-		{"compress", required_argument, NULL, 'Z'},
+		{"gzip", no_argument, NULL, 'z'},
+		{"compress-level", required_argument, NULL, 'Z'},
 		{"label", required_argument, NULL, 'l'},
 		{"host", required_argument, NULL, 'h'},
 		{"port", required_argument, NULL, 'p'},
@@ -952,7 +957,7 @@ main(int argc, char **argv)
 		}
 	}
 
-	while ((c = getopt_long(argc, argv, "D:F:l:Z:c:h:p:U:xwWvP",
+	while ((c = getopt_long(argc, argv, "D:F:l:c:h:p:U:xwWvPzZ:",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -978,6 +983,9 @@ main(int argc, char **argv)
 			case 'l':
 				label = xstrdup(optarg);
 				break;
+			case 'z':
+				gzip = true;
+				break;
 			case 'Z':
 				compresslevel =
Re: [HACKERS] Adding an example for replication configuration to pg_hba.conf
Magnus Hagander wrote: As I mentioned offlist, I'd like it in teal please. Applied with some further minor bikeshedding (remove trailing spaces, rewrap so columns aren't wider than 80 chars, etc) Let me just point out that people who have already run initdb during beta will not see this in their pg_hba.conf, nor in their share/pg_hba.conf.sample, even after they have upgraded to a later beta, Oops, yes, I was wrong here. Sorry. unless they run initdb. However, we have bumped the catalog version for something else so they should then get this change. Why would they not see it in their share/pg_hba.conf.sample? It will not affect the existing one in $PGDATA, but why wouldn't the installed .sample change? Yes, the problem is the sample will change, but the $PGDATA will not, so anyone doing a diff of the two files to see the localized changes will see the changes that came in as part of that commit. My point is if we change configuration files and then don't bump the catalog version, the share/*.sample files get out of sync with the files in /data, which can be kind of confusing. They would - but what you are saying above is that they would not get out of sync, because the share/*.sample also don't update. Just a mistake in what you said above, or am I missing something? Yes, my mistake. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Domains versus polymorphic functions, redux
David E. Wheeler da...@kineticode.com writes: On May 24, 2011, at 11:30 AM, Tom Lane wrote: I guess that the question that's immediately at hand is sort of a variant of that, because using a polymorphic function declared to take ANYARRAY on a domain-over-array really is using a portion of the base type's functionality. What we've learned from bug #5717 and the subsequent issues is that using that base functionality without immediately abandoning the notion that the domain has some life of its own (ie, immediately casting to the base type) is harder than it looks. Well, in the ANYELEMENT context (or ANYARRAY), what could be lost by abandoning the notion that the domain has some life of its own? I'm starting to think that maybe we should separate the two cases after all. If we force a downcast for ANYARRAY matching, we will fix the loss of functionality induced by the bug #5717 patch, and it doesn't seem like anyone has a serious objection to that. What to do for ANYELEMENT seems to be a bit more controversial, and at least some of the proposals aren't reasonable to do in 9.1 at this stage. Maybe we should just leave ANYELEMENT as-is for the moment, and reconsider that issue later? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Adding an example for replication configuration to pg_hba.conf
On Tue, May 24, 2011 at 2:48 PM, Bruce Momjian br...@momjian.us wrote: Yes, the problem is the sample will change, but the $PGDATA will not, so anyone doing a diff of the two files to see the localized changes will see the changes that came in as part of that commit. I don't think that's a serious problem. I wouldn't want to make a change like that in a released version, but doing it during beta seems OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Adding an example for replication configuration to pg_hba.conf
Robert Haas robertmh...@gmail.com writes: On Tue, May 24, 2011 at 2:48 PM, Bruce Momjian br...@momjian.us wrote: Yes, the problem is the sample will change, but the $PGDATA will not, so anyone doing a diff of the two files to see the localized changes will see the changes that came in as part of that commit. I don't think that's a serious problem. I wouldn't want to make a change like that in a released version, but doing it during beta seems OK. Given that we've already forced initdb for beta2, it seems like a complete non-issue right now, anyway. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] inconvenient compression options in pg_basebackup
Peter Eisentraut pete...@gmx.net writes: Quick patch for verification. I chose the naming -z/--gzip to mirror GNU tar. I would argue that -Z ought to turn on gzip without my having to write -z as well (at least when the argument is greater than zero; possibly -Z0 should be allowed as meaning no compression). Other than that (and the ensuing docs and help changes), looks fine. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] Error compiling sepgsql in PG9.1
Robert Haas robertmh...@gmail.com writes: 2011/5/24 Kohei Kaigai kohei.kai...@emea.nec.com: The attached patch makes the configure script abort when it is run with the '--with-selinux' option but libselinux is older than the minimum requirement for SE-PostgreSQL. Looks to me like you need to adjust the wording of the error message. Maybe "libselinux version 2.0.93 or newer is required", or something like that. Yeah. Applied with that change. BTW, it's not helpful to include the diff of the generated configure script in such patches. The committer will run autoconf for himself, and from a readability standpoint the generated file is quite useless. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.
Robert Haas wrote: On Mon, May 23, 2011 at 2:57 PM, Bruce Momjian br...@momjian.us wrote: Robert Haas wrote: On Mon, May 23, 2011 at 8:26 AM, Bruce Momjian br...@momjian.us wrote: Sorry, I was unclear. The question is whether the case of the _name_ of the locale is significant, meaning can you have two locale names that differ only by case and behave differently? That would seem surprising to me, but I really have no idea. There's the other direction, too: two locales that vary by something more than case, but still have identical behavior. Maybe we just decide not to worry about that, but then why worry about this? Well, if we remove the check then people could easily get broken upgrades by upgrading to a server with a different locale. A Google search seems to indicate the locale names are case-sensitive so I am thinking the problem is that the user didn't have exact locales, and needs that to use pg_upgrade. I think you misread what I wrote, or I misexplained it, but never mind. Matching locale names case-insensitively sounds reasonable to me, unless someone has reason to believe it will blow up. OK, that's what I needed to hear. I have applied the attached patch, but only to 9.1 because of the risk of breakage. (This was only the first bug report of this, and we aren't 100% certain about the case issue.) -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +

diff --git a/contrib/pg_upgrade/check.c b/contrib/pg_upgrade/check.c
new file mode 100644
index 2117b7f..60c1fbb
*** a/contrib/pg_upgrade/check.c
--- b/contrib/pg_upgrade/check.c
*************** static void
*** 333,345 ****
  check_locale_and_encoding(ControlData *oldctrl, ControlData *newctrl)
  {
! 	if (strcmp(oldctrl->lc_collate, newctrl->lc_collate) != 0)
  		pg_log(PG_FATAL, "old and new cluster lc_collate values do not match\n");
! 	if (strcmp(oldctrl->lc_ctype, newctrl->lc_ctype) != 0)
  		pg_log(PG_FATAL, "old and new cluster lc_ctype values do not match\n");
! 	if (strcmp(oldctrl->encoding, newctrl->encoding) != 0)
  		pg_log(PG_FATAL, "old and new cluster encoding values do not match\n");
  }
--- 333,346 ----
  check_locale_and_encoding(ControlData *oldctrl, ControlData *newctrl)
  {
! 	/* These are often defined with inconsistent case, so use pg_strcasecmp(). */
! 	if (pg_strcasecmp(oldctrl->lc_collate, newctrl->lc_collate) != 0)
  		pg_log(PG_FATAL, "old and new cluster lc_collate values do not match\n");
! 	if (pg_strcasecmp(oldctrl->lc_ctype, newctrl->lc_ctype) != 0)
  		pg_log(PG_FATAL, "old and new cluster lc_ctype values do not match\n");
! 	if (pg_strcasecmp(oldctrl->encoding, newctrl->encoding) != 0)
  		pg_log(PG_FATAL, "old and new cluster encoding values do not match\n");
  }
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Adding an example for replication configuration to pg_hba.conf
Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Tue, May 24, 2011 at 2:48 PM, Bruce Momjian br...@momjian.us wrote: Yes, the problem is the sample will change, but the $PGDATA will not, so anyone doing a diff of the two files to see the localized changes will see the changes that came in as part of that commit. I don't think that's a serious problem. I wouldn't want to make a change like that in a released version, but doing it during beta seems OK. Given that we've already forced initdb for beta2, it seems like a complete non-issue right now, anyway. Yes, agreed. I was just pointing it out because people often don't realize the effect this has. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] New/Revised TODO? Gathering actual read performance data for use by planner
In the TODO list is this item: "Modify the planner to better estimate caching effects" Tom mentioned this in his presentation at PGCon, and I also chatted with Tom about it briefly afterwards. Based on last year's discussion of this TODO item, it seems thoughts have been focused on estimating how much data is being satisfied from PG's shared buffers. However, I think that's only part of the problem. Specifically, read performance is going to be affected by: 1. Reads fulfilled from shared buffers. 2. Reads fulfilled from system cache. 3. Reads fulfilled from disk controller cache. 4. Reads from physical media. #4 is further complicated by the type of physical media for that specific block. For example, reads that can be fulfilled from an SSD are going to be much faster than ones that access hard drives (or even slower types of media). System load is going to impact all of these as well. Therefore, I suggest that an alternative to the above TODO may be to gather performance data without knowing (or more importantly without needing to know) which of the above sources fulfilled the read. This data would probably need to be kept separately for each table or index, as some tables or indexes may be mostly or fully in cache or on faster physical media than others, although in the absence of other data about a specific table or index, data about other relations in the same tablespace might be of some use. Tom mentioned that the cost of doing multiple system time-of-day calls for each block read might be prohibitive; the data may also be too coarse on some systems to be truly useful (e.g., the epoch time in seconds). If this data were available, successive executions of the same query could get significantly different plans (and thus actual performance), based on what has happened recently, so these statistics would have to be relatively short term and updated frequently, but without becoming computational bottlenecks. 
The problem is one I'm interested in working on. -- Mike Nolan
[HACKERS] tackling full page writes
While eating good Indian food and talking about aviation accidents on the last night of PGCon, Greg Stark, Heikki Linnakangas, and I found some time to brainstorm about possible ways to reduce the impact of full_page_writes. I'm not sure that these ideas are much good, but for the sake of posterity: 1. Heikki suggested that instead of doing full page writes, we might try to write only the parts of the page that have changed. For example, if we had 16 bits to play with in the page header (which we don't), then we could imagine the page as being broken up into 16 512-byte chunks, one per bit. Each time we update the page, we write whatever subset of the 512-byte chunks we're actually modifying, except for any that have been written since the last checkpoint. In more detail, when writing a WAL record, if a checkpoint has intervened since the page LSN, then we first clear all 16 bits, set the bits for the chunks we're modifying, and XLOG those chunks. If no checkpoint has intervened, then we set the bits for any chunks that we are modifying and for which the corresponding bits aren't yet set, and XLOG the corresponding chunks. As I think about it a bit more, we'd need to XLOG not only the parts of the page we're actually modifying, but also any parts that the WAL record needs in order to be correct on replay. (It was further suggested that, in our grand tradition of bad naming, we could name this feature "partial full page writes" and enable it either with a setting of full_page_writes=partial, or better yet, add a new GUC partial_full_page_writes. The beauty of the latter is that it's completely ambiguous what happens when full_page_writes=off and partial_full_page_writes=on. Actually, we could invert the sense and call it disable_partial_full_page_writes instead, which would probably remove all hope of understanding. This all seemed completely hilarious when we were talking about it, and we weren't even drunk.) 2. 
The other fairly obvious alternative is to adjust our existing WAL record types to be idempotent - i.e. to not rely on the existing page contents. For XLOG_HEAP_INSERT, we currently store the target tid and the tuple contents. I'm not sure if there's anything else, but we would obviously need the offset where the new tuple should be written, which we currently infer from reading the existing page contents. For XLOG_HEAP_DELETE, we store just the TID of the target tuple; we would certainly need to store its offset within the block, and maybe the infomask. For XLOG_HEAP_UPDATE, we'd need the old and new offsets and perhaps also the old and new infomasks. Assuming that's all we need and I'm not missing anything (which I won't bet on), that means we'd be adding, say, 4 bytes per insert or delete and 8 bytes per update. So, if checkpoints are spread out widely enough that there will be more than ~2K operations per page between checkpoints, then it makes more sense to just do a full page write and call it good. If not, this idea might have legs. 3. Going a bit further, Greg proposed the idea of ripping out our current WAL infrastructure altogether and instead just having one WAL record that says these byte ranges on this page changed to have these new contents. That's elegantly simple, but I'm afraid it would bloat the records quite a bit. For example, as Heikki pointed out, HEAP_XLOG_DELETE relies on the XID in the record header to figure out what to write, and all the heap-modification operations implicitly specify the visibility map change when they specify the heap change. We currently have a flag to indicate whether the visibility map actually requires an update, but it's just one bit. However, one possible application of this concept is that we could add something like this in along with our existing WAL record types. It might be useful, for example, for third-party index AMs, which are currently pretty much out of luck. That's about as far as we got. 
Though I haven't convinced anyone else yet, I still think there's some merit to the idea of just writing the portion of the page that precedes pd_upper. WAL records would have to assume that the tuple data might be clobbered, but they could rely on the early portion of the page to be correct. AFAICT, that would be OK for all of the existing WAL records except for XLOG_HEAP2_CLEAN (i.e. vacuum), with the exception that - prior to the minimum recovery point - they'd need to apply their changes unconditionally rather than considering the page LSN. Tom has argued that won't work, but I'm not sure he's convinced anyone else yet... Anyone else have good ideas? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] about EDITOR_LINENUMBER_SWITCH
On lör, 2011-05-21 at 20:39 -0400, Robert Haas wrote: On Sat, May 21, 2011 at 5:47 PM, Peter Eisentraut pete...@gmx.net wrote: I noticed the 9.1 release notes claim that the new EDITOR_LINENUMBER_SWITCH thing is an environment variable, whereas it is actually a psql variable. It's probably the result of drift between the original patch and what was eventually committed. IIRC, Pavel had it as an environment variable originally, but Tom and I didn't feel the feature was important enough to merit that treatment. I think it's not really a matter of importance, it's a matter of making things work correctly. I have a shell configuration that sets different environment variables, including editor, depending on what directory I'm in. Now I think that all the editors in question use the + syntax, but anyone else with something like that slightly out of the ordinary would be stuck. The other problem is if I change the editor here, I have to change this other piece there. Note that you cannot even specify the editor itself in psqlrc. Another thought is that this whole thing could be done away with if we just allowed people to pass through arbitrary options to the editor, like \edit file.sql +50 -a -b -c For powerusers, this could have interesting possibilities. That's an intriguing possibility. But part of the point of the original feature was to be able to say: \ef somefunc 10 ...and end up on line 10 of somefunc, perhaps in response to an error message complaining about that line. I don't think your proposal would address that. Well, you'd write \ef somefunc +10 instead. Or something else, depending on the editor, but then you'd know what to write, since under the current theory you'd have to have configured it previously. Using the +10 syntax also looks a bit clearer, in my mind. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] about EDITOR_LINENUMBER_SWITCH
On sön, 2011-05-22 at 06:30 +0200, Pavel Stehule wrote: An idea with other options is interesting. It may be more usable to store these options inside a psql variable (to be consistent with the current state). Maybe in EDITOR_OPTIONS? There isn't really a need for that, since if you want to pass options to your editor, you can stick them in the EDITOR variable. The idea would be more to pass options per occasion. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] about EDITOR_LINENUMBER_SWITCH
On Tue, May 24, 2011 at 4:36 PM, Peter Eisentraut pete...@gmx.net wrote: That's an intriguing possibility. But part of the point of the original feature was to be able to say: \ef somefunc 10 ...and end up on line 10 of somefunc, perhaps in response to an error message complaining about that line. I don't think your proposal would address that. Well, you'd write \ef somefunc +10 instead. But that would not put you on line 10 of the function. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] minor patch submission: CREATE CAST ... AS EXPLICIT
On lör, 2011-05-21 at 15:46 +0200, Fabien COELHO wrote: Hello, Please find attached a minor stylish patch. It compiles and the updated test cases work for me. Description: Add AS EXPLICIT to CREATE CAST This gives a name to the default case of CREATE CAST, which creates a cast that must be explicitly invoked. From a language definition perspective, it is helpful to have a name for every case instead of an implicit fallback, without any word to describe it. See for instance CREATE USER CREATEDB/NOCREATEDB or CREATE RULE ... DO ALSO/INSTEAD for similar occurrences of naming default cases. Oddly enough, we did add the DO ALSO syntax much later, and no one complained about that, as far as I recall. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Latch implementation that wakes on postmaster death on both win32 and Unix
Attached is the latest revision of the latch implementation that monitors postmaster death, plus the archiver client that now relies on that new functionality and thereby works well without a tight PostmasterIsAlive() polling loop. On second thought, it is reasonable for the patch to be evaluated with the archiver changes. Any problems that we'll have with latch changes are likely problems that all WL_POSTMASTER_DEATH latch clients will have, so we might as well include the simplest such client initially. Once I have buy-in on the latch changes, the archiver work becomes uncontroversial, I think. The lifesign terminology has been dropped. We now close() the file descriptor that represents ownership - the write end of our anonymous pipe - in each child backend directly in the forking machinery (the thin fork() wrapper for the non-EXEC_BACKEND case), through a call to ReleasePostmasterDeathWatchHandle(). We don't have to do that on Windows, and we don't. I've handled the non-win32 EXEC_BACKEND case, which I understand just exists for testing purposes. I've done the usual BackendParameters stuff. A ReleasePostmasterDeathWatchHandle() call is unnecessary on win32 (the function doesn't exist there - the need to call it on Unix is a result of its implementation). I'd like to avoid having calls to it in each auxiliary process. It should be called in a single sweet spot that doesn't put any burden on child process authors to remember to call it themselves. Disappointingly, and despite a big effort, there doesn't seem to be a way to have the win32 WaitForMultipleObjects() call wake on postmaster death in addition to everything else in the same way that select() does, so there are now two blocking calls, each in a thread of its own (when the latch code is interested in postmaster death - otherwise, it's single threaded as before). 
The threading stuff (in particular, the fact that we used a named pipe in a thread where the name of the pipe comes from the process PID) is inspired by win32 signal emulation, src/backend/port/win32/signal.c. You can easily observe that it works as advertised on Windows by starting Postgres with archiving, using task manager to monitor processes, and doing the following to the postmaster (assuming it has a PID of 1234). This is the Windows equivalent of kill -9:

C:\Users\Peter> taskkill /pid 1234 /F

You'll see that it takes about a second for the archiver to exit. All processes exit. Thoughts? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e71090f..b1d38f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10150,7 +10150,7 @@ retry:
 				/*
 				 * Wait for more WAL to arrive, or timeout to be reached
 				 */
-				WaitLatch(&XLogCtl->recoveryWakeupLatch, 500L);
+				WaitLatch(&XLogCtl->recoveryWakeupLatch, WL_LATCH_SET | WL_TIMEOUT, 500L);
 				ResetLatch(&XLogCtl->recoveryWakeupLatch);
 			}
 			else
diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index 6dae7c9..c60986c 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -94,6 +94,7 @@
 #include "miscadmin.h"
 #include "storage/latch.h"
+#include "storage/pmsignal.h"
 #include "storage/shmem.h"

 /* Are we currently in WaitLatch? The signal handler would like to know. */
@@ -108,6 +109,15 @@ static void initSelfPipe(void);
 static void drainSelfPipe(void);
 static void sendSelfPipeByte(void);

+/*
+ * Constants that represent which of a pair of fds given
+ * to pipe() is watched and owned in the context of
+ * dealing with postmaster death
+ */
+#define POSTMASTER_FD_WATCH 0
+#define POSTMASTER_FD_OWN 1
+
+extern int postmaster_alive_fds[2];

 /*
  * Initialize a backend-local latch.
@@ -188,22 +198,22 @@ DisownLatch(volatile Latch *latch)
  * backend-local latch initialized with InitLatch, or a shared latch
  * associated with the current process by calling OwnLatch.
  *
- * Returns 'true' if the latch was set, or 'false' if timeout was reached.
+ * Returns bit field indicating which condition(s) caused the wake-up.
  */
-bool
-WaitLatch(volatile Latch *latch, long timeout)
+int
+WaitLatch(volatile Latch *latch, int wakeEvents, long timeout)
 {
-	return WaitLatchOrSocket(latch, PGINVALID_SOCKET, false, false, timeout) > 0;
+	return WaitLatchOrSocket(latch, wakeEvents, PGINVALID_SOCKET, timeout);
 }

 /*
  * Like WaitLatch, but will also return when there's data available in
- * 'sock' for reading or writing. Returns 0 if timeout was reached,
- * 1 if the latch was set, 2 if the socket became readable or writable.
+ * 'sock' for reading or writing.
+ *
+ * Returns bit field indicating which condition(s) caused the wake-up.
  */
 int
-WaitLatchOrSocket(volatile Latch *latch, pgsocket sock, bool forRead,
-				  bool forWrite, long timeout)
+WaitLatchOrSocket(volatile
[HACKERS] Should partial dumps include extensions?
There's a complaint here http://archives.postgresql.org/pgsql-general/2011-05/msg00714.php about the fact that 9.1 pg_dump always dumps CREATE EXTENSION commands for all loaded extensions. Should we change that? A reasonable compromise might be to suppress extensions in the same cases where we suppress procedural languages, ie if --schema or --table was used (see include_everything switch in pg_dump.c). regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] eviscerating the parser
Robert Haas wrote: On Sun, May 22, 2011 at 1:38 PM, Joshua Berkus j...@agliodbs.com wrote: Another point is that parsing overhead is quite obviously not the reason for the massive performance gap between one core running simple selects on PostgreSQL and one core running simple selects on MySQL. Even if I had (further) eviscerated the parser to cover only the syntax those queries actually use, it wasn't going to buy more than a couple points. I don't know if you saw Jignesh's presentation, but there seems to be a lot of reason to believe that we are lock-bound on large numbers of concurrent read-only queries. I didn't see Jignesh's presentation, but I'd come to the same conclusion (with some help from Jeff Janes and others): http://archives.postgresql.org/pgsql-hackers/2010-11/msg01643.php http://archives.postgresql.org/pgsql-hackers/2010-11/msg01665.php We did also recently discuss how we might improve the behavior in this case: http://archives.postgresql.org/pgsql-hackers/2011-05/msg00787.php ...and ensuing discussion. However, in this case, there was only one client, so that's not the problem. I don't really see how to get a big win here. If we want to be 4x faster, we'd need to cut time per query by 75%. That might require 75 different optimizations averaging 1% apiece, most likely none of them trivial. I do confess I'm a bit confused as to why prepared statements help so much. That is increasing the throughput by 80%, which is equivalent to decreasing time per query by 45%. That is a surprisingly big number, and I'd like to better understand where all that time is going. Prepared statements are pre-parsed/rewritten/planned, but I can't see how decreasing the parser size would affect those other stages, and certainly not 45%. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. 
+ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.2 schedule
On mån, 2011-05-23 at 22:44 -0400, Greg Smith wrote: Given that work in August is particularly difficult to line up with common summer schedules around the world, having the other one-month gap in the schedule go there makes sense. You might want to add a comment on the schedule page about the June/July/August timing, because it looks like a typo, and the meeting minutes are also inconsistent in how they talk about June and July. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] about EDITOR_LINENUMBER_SWITCH
Robert Haas robertmh...@gmail.com writes: On Tue, May 24, 2011 at 4:36 PM, Peter Eisentraut pete...@gmx.net wrote: That's an intriguing possibility. But part of the point of the original feature was to be able to say: \ef somefunc 10 ...and end up on line 10 of somefunc, perhaps in response to an error message complaining about that line. I don't think your proposal would address that. Well, you'd write \ef somefunc +10 instead. But that would not put you on line 10 of the function. Right. It would also increase the cognitive load on the user to have to remember the command-line go-to-line-number switch for his editor. So I don't particularly want to redesign this feature. However, I can see the possible value of letting EDITOR_LINENUMBER_SWITCH be set from the same place that you set EDITOR, which would suggest that we allow the value to come from an environment variable. I'm not sure whether there is merit in allowing both that source and ~/.psqlrc, though possibly for Windows users it might be easier if ~/.psqlrc worked. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] minor patch submission: CREATE CAST ... AS EXPLICIT
Peter Eisentraut pete...@gmx.net writes: On lör, 2011-05-21 at 15:46 +0200, Fabien COELHO wrote: From a language definition perspective, it is helpful to have a name for every case instead of an implicit fallback, without any word to describe it. See for instance CREATE USER CREATEDB/NOCREATEDB or CREATE RULE ... DO ALSO/INSTEAD for similar occurences of naming default cases. Oddly enough, we did add the DO ALSO syntax much later, and no one complained about that, as far as I recall. Sure, but CREATE RULE is entirely locally-grown syntax, so there is no argument from standards compliance to consider there. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] about EDITOR_LINENUMBER_SWITCH
Robert Haas robertmh...@gmail.com writes: On Sat, May 21, 2011 at 5:47 PM, Peter Eisentraut pete...@gmx.net wrote: I noticed the 9.1 release notes claim that the new EDITOR_LINENUMBER_SWITCH thing is an environment variable, whereas it is actually a psql variable. This is perhaps sort of a Freudian slip. It's probably the result of drift between the original patch and what was eventually committed. IIRC, Pavel had it as an environment variable originally, but Tom and I didn't feel the feature was important enough to merit that treatment. BTW, the above is merest historical revisionism: there was never a version of the patch that did it that way. AFAICS the idea started here: http://archives.postgresql.org/pgsql-hackers/2010-08/msg00089.php to which you immediately asked whether it should be an environmental variable, and I said no on what might be considered thin grounds: http://archives.postgresql.org/pgsql-hackers/2010-08/msg00182.php I can't see any real objection other than complexity to having it look for a psql variable and then an environment variable. Or we could drop the psql variable part of that, if it seems too complicated. Also, while we're on the subject, I'm not real sure why we don't allow the code to provide a default value when EDITOR has a well-known value like vi or emacs. As long as there is a way to override that, where's the harm in a default? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] about EDITOR_LINENUMBER_SWITCH
On Tue, May 24, 2011 at 5:35 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Sat, May 21, 2011 at 5:47 PM, Peter Eisentraut pete...@gmx.net wrote: I noticed the 9.1 release notes claim that the new EDITOR_LINENUMBER_SWITCH thing is an environment variable, whereas it is actually a psql variable. This is perhaps sort of a Freudian slip. It's probably the result of drift between the original patch and what was eventually committed. IIRC, Pavel had it as an environment variable originally, but Tom and I didn't feel the feature was important enough to merit that treatment. BTW, the above is merest historical revisionism: there was never a version of the patch that did it that way. Even if you were correct, that's a snarky way to put it, and the point is trivial anyway. But I don't think I'm imagining the getenv() call in this version of the patch: http://archives.postgresql.org/pgsql-hackers/2010-07/msg01253.php Also, while we're on the subject, I'm not real sure why we don't allow the code to provide a default value when EDITOR has a well-known value like vi or emacs. As long as there is a way to override that, where's the harm in a default? Well, the question is how many people it'll help. Some people might have a full pathname, others might call it vim... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cannot build docs of 9.1 on Windows
Andrew, From: Andrew Dunstan and...@dunslane.net builddoc.bat failed on my system and reading it made my head hurt. So I did what I've done with other bat files and rewrote it in Perl. The result is attached. It works for me, and should be a drop-in replacement. Just put it in the src/tools/msvc directory and run perl builddoc.pl. Please test it and if it works for you we'll use it and make builddoc.bat a thin wrapper like build.bat and vcregress.bat. It worked successfully! The doc\src\sgml\html directory and its contents were created, and the HTML contents appear to be correct. Thank you very much. The output of perl builddoc.pl was as follows: -- perl mk_feature_tables.pl YES ../../../src/backend/catalog/sql_feature_packages.txt ../../../src/backend/catalog/sql_features.txt features-supported.sgml perl mk_feature_tables.pl NO ../../../src/backend/catalog/sql_feature_packages.txt ../../../src/backend/catalog/sql_features.txt features-unsupported.sgml perl generate-errcodes-table.pl ../../../src/backend/utils/errcodes.txt errcodes-table.sgml Running first build... D:\pgdev\doctool/openjade-1.3.1/bin/openjade -V html-index -wall -wno-unused-param -wno-empty -D . -c D:\pgdev\doctool/docbook-dsssl-1.79/catalog -d stylesheet.dsl -i output-html -t sgml postgres.sgml 2>&1 | findstr /V "DTDDECL catalog entries are not supported" Running collateindex... perl D:\pgdev\doctool/docbook-dsssl-1.79/bin/collateindex.pl -f -g -i bookindex -o bookindex.sgml HTML.index Processing HTML.index... 2158 entries loaded... 0 entries ignored... Done. Running second build... D:\pgdev\doctool/openjade-1.3.1/bin/openjade -wall -wno-unused-param -wno-empty -D . -c D:\pgdev\doctool/docbook-dsssl-1.79/catalog -d stylesheet.dsl -t sgml -i output-html -i include-index postgres.sgml 2>&1 | findstr /V "DTDDECL catalog entries are not supported" Docs build complete.
Re: [HACKERS] adding a new column in IDENTIFY_SYSTEM
On Fri, May 20, 2011 at 12:50 PM, Magnus Hagander mag...@hagander.net wrote: Yes. It might be useful to note it, and then ust make an override flag. My pointm, though, was that doing it for walreceiver is more important and a more logical first step. ok, patch attached. -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte y capacitación de PostgreSQL diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index 6be5a14..2235c7f 100644 *** a/doc/src/sgml/protocol.sgml --- b/doc/src/sgml/protocol.sgml *** The commands accepted in walsender mode *** 1315,1321 listitem para Requests the server to identify itself. Server replies with a result ! set of a single row, containing three fields: /para para --- 1315,1321 listitem para Requests the server to identify itself. Server replies with a result ! set of a single row, containing four fields: /para para *** The commands accepted in walsender mode *** 1356,1361 --- 1356,1372 /para /listitem /varlistentry + + varlistentry + term +xlogversion + /term + listitem + para +Current version of xlog page format. + /para + /listitem + /varlistentry /variablelist /para diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c index 0831b1b..ca39654 100644 *** a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c --- b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c *** *** 21,26 --- 21,27 #include libpq-fe.h #include access/xlog.h + #include access/xlog_internal.h #include miscadmin.h #include replication/walreceiver.h #include utils/builtins.h *** libpqrcv_connect(char *conninfo, XLogRec *** 83,88 --- 84,90 char standby_sysid[32]; TimeLineID primary_tli; TimeLineID standby_tli; + uint16 primary_xlp_magic; PGresult *res; char cmd[64]; *** libpqrcv_connect(char *conninfo, XLogRec *** 114,120 the primary server: %s, PQerrorMessage(streamConn; } ! 
if (PQnfields(res) != 3 || PQntuples(res) != 1) { int ntuples = PQntuples(res); int nfields = PQnfields(res); --- 116,122 the primary server: %s, PQerrorMessage(streamConn; } ! if (PQnfields(res) != 4 || PQntuples(res) != 1) { int ntuples = PQntuples(res); int nfields = PQnfields(res); *** libpqrcv_connect(char *conninfo, XLogRec *** 127,133 --- 129,137 } primary_sysid = PQgetvalue(res, 0, 0); primary_tli = pg_atoi(PQgetvalue(res, 0, 1), 4, 0); + primary_xlp_magic = atoi(PQgetvalue(res, 0, 2)); + PQclear(res); /* * Confirm that the system identifier of the primary is the same as ours. */ *** libpqrcv_connect(char *conninfo, XLogRec *** 135,141 GetSystemIdentifier()); if (strcmp(primary_sysid, standby_sysid) != 0) { - PQclear(res); ereport(ERROR, (errmsg(database system identifier differs between the primary and standby), errdetail(The primary's identifier is %s, the standby's identifier is %s., --- 139,144 *** libpqrcv_connect(char *conninfo, XLogRec *** 147,159 * recovery target timeline. */ standby_tli = GetRecoveryTargetTLI(); - PQclear(res); if (primary_tli != standby_tli) ereport(ERROR, (errmsg(timeline %u of the primary does not match recovery target timeline %u, primary_tli, standby_tli))); ThisTimeLineID = primary_tli; /* Start streaming from the point requested by startup process */ snprintf(cmd, sizeof(cmd), START_REPLICATION %X/%X, startpoint.xlogid, startpoint.xrecoff); --- 150,171 * recovery target timeline. 
*/ standby_tli = GetRecoveryTargetTLI(); if (primary_tli != standby_tli) ereport(ERROR, (errmsg(timeline %u of the primary does not match recovery target timeline %u, primary_tli, standby_tli))); ThisTimeLineID = primary_tli; + /* + * Check that the primary has a compatible XLOG_PAGE_MAGIC + */ + if (primary_xlp_magic != XLOG_PAGE_MAGIC) + { + ereport(ERROR, + (errmsg(XLOG pages are not compatible between primary and standby), + errhint(Verify PostgreSQL versions on both, primary and standby.))); + } + /* Start streaming from the point requested by startup process */ snprintf(cmd, sizeof(cmd), START_REPLICATION %X/%X, startpoint.xlogid, startpoint.xrecoff); diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 470e6d1..392cf94 100644 *** a/src/backend/replication/walsender.c --- b/src/backend/replication/walsender.c *** IdentifySystem(void) *** 279,289
Re: [HACKERS] SSI predicate locking on heap -- tuple or row?
On Tue, May 24, 2011 at 04:18:37AM -0500, Kevin Grittner wrote: These proofs show that there is no legitimate cycle which could cause an anomaly which the move from row-based to tuple-based logic will miss. They don't prove that the change will generate all the same serialization failures; and in fact, some false positives are eliminated by the change. Yes, that's correct. That's related to the part in the proof where I claimed T3 couldn't have a conflict out *to some transaction T0 that precedes T1*. I originally tried to show that T3 couldn't have any conflicts out that T2 didn't have, which would mean we got the same set of serialization failures, but that's not true. In fact, it's not too hard to come up with an example where there would be a serialization failure with the row version links, but not without. However, because the rw-conflict can't be pointing to a transaction that precedes T1 in the serial order, it won't create a cycle. In other words, there are serialization failures that won't happen anymore, but they were false positives. Dan -- Dan R. K. Ports MIT CSAILhttp://drkp.net/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.2 schedule
On 05/24/2011 05:03 PM, Peter Eisentraut wrote: On mån, 2011-05-23 at 22:44 -0400, Greg Smith wrote: Given that work in August is particularly difficult to line up with common summer schedules around the world, having the other 1-month gap in the schedule go there makes sense. You might want to add a comment on the schedule page about the June/July/August timing, because it looks like a typo, and the meeting minutes are also inconsistent in how they talk about June and July. Yes, I was planning to (and just did) circle back to the minutes to make everything match up. It's now self-consistent, same dates as the schedule, and explains the rationale better. I'm not sure how to address the feeling of typo you have on the schedule page beyond that. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] adding a new column in IDENTIFY_SYSTEM
On Wed, May 25, 2011 at 8:26 AM, Jaime Casanova ja...@2ndquadrant.com wrote: On Fri, May 20, 2011 at 12:50 PM, Magnus Hagander mag...@hagander.net wrote: Yes. It might be useful to note it, and then ust make an override flag. My pointm, though, was that doing it for walreceiver is more important and a more logical first step. ok, patch attached. Why is the check of WAL version required for streaming replication? As Tom said, if the version is different between two servers, the check of system identifier fails first. No? + primary_xlp_magic = atoi(PQgetvalue(res, 0, 2)); You wrongly get the third field (i.e., current xlog location) as the WAL version. You should call PQgetvalue(res, 0, 3), instead. errdetail(Expected 1 tuple with 3 fields, got %d tuples with %d fields., You need to change the above message. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.2 schedule
On 05/24/2011 01:35 PM, Josh Berkus wrote: I would suggest instead adding a new page to postgresql.org/developer which lists the development schedule, rather than linking to that wiki page. Maybe on this page? http://www.postgresql.org/developer/roadmap Now that I look at the roadmap page again, I think all that would really be needed here is to tweak its wording a bit. If the description on there of the link to the wiki looked like this: General development information A wiki page about various aspects of the PostgreSQL development process, including detailed schedules and submission guidelines I think that's enough info to keep there. Putting more information back onto the main site when it can live happily on the wiki seems counterproductive to me; if there's concerns about things like vandalism, we can always lock the page. I could understand the argument that it looks more professional to have it on the main site, but perception over function only goes so far for me. The idea of adding a link back to the wiki from the https://commitfest.postgresql.org/ page would complete being able to navigate among the three major sites here, no matter which people started at. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] adding a new column in IDENTIFY_SYSTEM
On Tue, May 24, 2011 at 8:52 PM, Fujii Masao masao.fu...@gmail.com wrote: + primary_xlp_magic = atoi(PQgetvalue(res, 0, 2)); You wrongly get the third field (i.e., current xlog location) as the WAL version. You should call PQgetvalue(res, 0, 3), instead. errdetail(Expected 1 tuple with 3 fields, got %d tuples with %d fields., You need to change the above message. Fixed. About you comments on the check... if you read the thread, you will find that the whole reason for the field is future improvement, but everyone wanted some use of the field now... so i made a patch to use it in pg_basebackup before the transfer starts and avoid time and bandwith waste but Magnus prefer this in walreceiver... -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte y capacitación de PostgreSQL diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index 6be5a14..2235c7f 100644 *** a/doc/src/sgml/protocol.sgml --- b/doc/src/sgml/protocol.sgml *** The commands accepted in walsender mode *** 1315,1321 listitem para Requests the server to identify itself. Server replies with a result ! set of a single row, containing three fields: /para para --- 1315,1321 listitem para Requests the server to identify itself. Server replies with a result ! set of a single row, containing four fields: /para para *** The commands accepted in walsender mode *** 1356,1361 --- 1356,1372 /para /listitem /varlistentry + + varlistentry + term +xlogversion + /term + listitem + para +Current version of xlog page format. 
+ /para + /listitem + /varlistentry /variablelist /para diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c index 0831b1b..c3f3571 100644 *** a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c --- b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c *** *** 21,26 --- 21,27 #include libpq-fe.h #include access/xlog.h + #include access/xlog_internal.h #include miscadmin.h #include replication/walreceiver.h #include utils/builtins.h *** libpqrcv_connect(char *conninfo, XLogRec *** 83,88 --- 84,90 char standby_sysid[32]; TimeLineID primary_tli; TimeLineID standby_tli; + uint16 primary_xlp_magic; PGresult *res; char cmd[64]; *** libpqrcv_connect(char *conninfo, XLogRec *** 114,120 the primary server: %s, PQerrorMessage(streamConn; } ! if (PQnfields(res) != 3 || PQntuples(res) != 1) { int ntuples = PQntuples(res); int nfields = PQnfields(res); --- 116,122 the primary server: %s, PQerrorMessage(streamConn; } ! if (PQnfields(res) != 4 || PQntuples(res) != 1) { int ntuples = PQntuples(res); int nfields = PQnfields(res); *** libpqrcv_connect(char *conninfo, XLogRec *** 122,133 PQclear(res); ereport(ERROR, (errmsg(invalid response from primary server), ! errdetail(Expected 1 tuple with 3 fields, got %d tuples with %d fields., ntuples, nfields))); } primary_sysid = PQgetvalue(res, 0, 0); primary_tli = pg_atoi(PQgetvalue(res, 0, 1), 4, 0); /* * Confirm that the system identifier of the primary is the same as ours. */ --- 124,137 PQclear(res); ereport(ERROR, (errmsg(invalid response from primary server), ! errdetail(Expected 1 tuple with 4 fields, got %d tuples with %d fields., ntuples, nfields))); } primary_sysid = PQgetvalue(res, 0, 0); primary_tli = pg_atoi(PQgetvalue(res, 0, 1), 4, 0); + primary_xlp_magic = atoi(PQgetvalue(res, 0, 3)); + PQclear(res); /* * Confirm that the system identifier of the primary is the same as ours. 
*/ *** libpqrcv_connect(char *conninfo, XLogRec *** 135,141 GetSystemIdentifier()); if (strcmp(primary_sysid, standby_sysid) != 0) { - PQclear(res); ereport(ERROR, (errmsg(database system identifier differs between the primary and standby), errdetail(The primary's identifier is %s, the standby's identifier is %s., --- 139,144 *** libpqrcv_connect(char *conninfo, XLogRec *** 147,159 * recovery target timeline. */ standby_tli = GetRecoveryTargetTLI(); - PQclear(res); if (primary_tli != standby_tli) ereport(ERROR, (errmsg(timeline %u of the primary does not match recovery target timeline %u, primary_tli, standby_tli))); ThisTimeLineID = primary_tli; /* Start streaming from the point requested by startup process */ snprintf(cmd, sizeof(cmd), START_REPLICATION %X/%X, startpoint.xlogid, startpoint.xrecoff); ---