Re: [HACKERS] reducing the overhead of frequent table locks - now, with WIP patch

2011-06-05 Thread Heikki Linnakangas

On 06.06.2011 07:12, Robert Haas wrote:

I did some further investigation of this.  It appears that more than
99% of the lock manager lwlock traffic that remains with this patch
applied has locktag_type == LOCKTAG_VIRTUALTRANSACTION.  Every SELECT
statement runs in a separate transaction, and for each new transaction
we run VirtualXactLockTableInsert(), which takes a lock on the vxid of
that transaction, so that other processes can wait for it.  That
requires acquiring and releasing a lock manager partition lock, and we
have to do the same thing a moment later at transaction end to dump
the lock.

A quick grep seems to indicate that the only places where we actually
make use of those VXID locks are in DefineIndex(), when CREATE INDEX
CONCURRENTLY is in use, and during Hot Standby, when max_standby_delay
expires.  Considering that these are not commonplace events, it seems
tremendously wasteful to incur the overhead for every transaction.  It
might be possible to make the lock entry spring into existence "on
demand" - i.e. if a backend wants to wait on a vxid entry, it creates
the LOCK and PROCLOCK objects for that vxid.  That presents a few
synchronization challenges, and plus we have to make sure that the
backend that's just been "given" a lock knows that it needs to release
it, but those seem like they might be manageable problems, especially
given the new infrastructure introduced by the current patch, which
already has to deal with some of those issues.  I'll look into this
further.


Ah, I remember I saw that vxid lock pop up quite high in an oprofile 
profile recently. I think it was the case of executing a lot of very 
simple prepared queries. So it would be nice to address that, even from 
a single CPU point of view.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] storing TZ along timestamps

2011-06-05 Thread Jim Nasby
On Jun 4, 2011, at 3:56 AM, Greg Stark wrote:
> On Thu, Jun 2, 2011 at 8:58 PM, Jim Nasby  wrote:
>> 
>> I'm torn between whether the type should store the original time or the 
>> original time converted to GMT.
> 
> This is the wrong way to think about it. We *never* store time
> "converted to GMT".  When we want to represent a point in time we
> represent it as seconds since the epoch.
Right. Sorry, my bad.

> The question here is how to represent more complex concepts than
> simply points in time. I think the two concepts under discussion are
> a) a composite type representing a point in time and a timezone it
> should be interpreted in for operations and display and b) the
> original input provided which is a text string with the constraint
> that it's a valid input which can be interpreted as a point in time.

My fear with A is that something could change that would make it impossible to 
actually get back to the time that was originally entered. For example, a new 
version of the timezone database could change something. Though, that problem 
also exists for timestamptz today, so presumably if it was much of an issue 
we'd have gotten complaints by now.
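
For concreteness, option (a) could be modeled roughly like the struct below -- a hedged C sketch with illustrative names only, not a proposed catalog type. The zone is resolved through the tz database at use time, which is exactly why a tzdata update can shift the rendered local time, per the concern above.

#include <stdint.h>

typedef struct TimestampWithZone
{
	int64_t		usec_since_epoch;	/* zone-independent point in time,
									 * what timestamptz already stores */
	char		zone[64];			/* e.g. "America/Chicago"; looked up in
									 * the tz database at display time */
} TimestampWithZone;
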
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] heap vacuum & cleanup locks

2011-06-05 Thread Pavan Deolasee
On Sun, Jun 5, 2011 at 8:33 AM, Robert Haas  wrote:
> We've occasionally seen problems with VACUUM getting stuck for failure
> to acquire a cleanup lock due to, for example, a cursor holding a pin
> on the buffer page.  In the worst case, this can cause an undetected
> deadlock, if the backend holding the buffer pin blocks trying to
> acquire a heavyweight lock that is in turn blocked by VACUUM.  A while
> back, someone (Greg Stark? me?) floated the idea of not waiting for
> the cleanup lock.  If we can't get it immediately, or within some
> short period of time, then we just skip the page and continue on.
>

Do we know if this is really a problem, though? The deadlock, for
example, can happen only when a backend tries to get a conflicting
table-level lock while holding the buffer pin, and I am not sure that we
do that.

The contention issue would probably make sense for small tables
because for large to very large tables, the probability that a backend
and vacuum would process the same page would be quite low. With the
current default for vac_threshold, the small tables can get vacuumed
very frequently and if they are also heavily accessed, the cleanup
lock can become a bottleneck.

Another issue that might be worth paying attention to is the single-pass
vacuum that I am currently working on. The design that we agreed upon
assumes that the index vacuum must clear index pointers to all the dead
line pointers. If we skip any page, we must at least collect the
existing dead line pointers and remove those index pointers. If we
create dead line pointers and want to vacuum them later, we store the
LSN in the page, and that may require defragmentation. Of course, we can
work around that, but I think it will be useful to do some tests to
show that the cleanup lock is indeed a major bottleneck.
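
For reference, the proposal being discussed amounts to something like the following standalone sketch. pthread_mutex_trylock() stands in for a conditional attempt at the buffer cleanup lock (PostgreSQL's actual primitive for that is ConditionalLockBufferForCleanup()); all other bookkeeping here is assumed:

#include <pthread.h>
#include <stdio.h>

#define NPAGES 8

static pthread_mutex_t page_pin[NPAGES];	/* one "pin" per heap page */

static void
vacuum_heap(int *skipped)
{
	int			blkno;

	for (blkno = 0; blkno < NPAGES; blkno++)
	{
		if (pthread_mutex_trylock(&page_pin[blkno]) != 0)
		{
			(*skipped)++;		/* page is pinned: skip it, don't wait */
			continue;
		}
		/* ... prune dead tuples and defragment the page here ... */
		pthread_mutex_unlock(&page_pin[blkno]);
	}
}

int
main(void)
{
	int			skipped = 0;
	int			i;

	for (i = 0; i < NPAGES; i++)
		pthread_mutex_init(&page_pin[i], NULL);
	pthread_mutex_lock(&page_pin[3]);	/* a cursor holds a pin on page 3 */
	vacuum_heap(&skipped);
	printf("vacuumed %d pages, skipped %d pinned page(s)\n",
		   NPAGES - skipped, skipped);
	return 0;
}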

Thanks,
Pavan
-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com



Re: [HACKERS] heap vacuum & cleanup locks

2011-06-05 Thread Jim Nasby
On Jun 6, 2011, at 1:00 AM, Robert Haas wrote:
> On Mon, Jun 6, 2011 at 12:19 AM, Itagaki Takahiro
>  wrote:
>> On Sun, Jun 5, 2011 at 12:03, Robert Haas  wrote:
>>> If other buffer pins do exist, then we can't
>>> defragment the page, but that doesn't mean no useful work can be done:
>>> we can still mark used line pointers dead, or dead line pointers
>>> unused.  We cannot defragment, but that can be done either by the next
>>> VACUUM or by a HOT cleanup.
>> 
>> This is just an idea -- Is it possible to have copy-on-write techniques?
>> VACUUM allocates a duplicated page for the pinned page, and copy valid
>> tuples into the new page. Following buffer readers after the VACUUM will
>> see the cloned page instead of the old pinned one.
> 
> Heikki suggested the same thing, and it's not a bad idea, but I think
> it would be more work to implement than what I proposed.  The caller
> would need to be aware that, if it tries to re-acquire a content lock
> on the same page, the offset of the tuple within the page might
> change.  I'm not sure how much work would be required to cope with
> that possibility.

I've had a related idea that I haven't looked into... if you're scanning a 
relation (i.e. an index scan or seq scan), I've wondered if it would be more 
efficient to deal with the entire page at once, possibly by making a copy of 
it. This would reduce the number of times you pin the page (often quite 
dramatically). I realize that means copying the entire page, but I suspect 
that would occur entirely in the L1 cache, which would be fast.

So perhaps instead of copy on write we should try for copy on read on all 
appropriate plan nodes.
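
A minimal standalone model of that copy-on-read idea: pin once, memcpy the page into backend-local memory, unpin, then scan the private copy. The pin/unpin stand-ins are assumptions; real code would go through the buffer manager:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192				/* PostgreSQL's default page size */

static int	pin_count;

static void pin_page(void)		{ pin_count++; }
static void unpin_page(void)	{ }

static void
scan_page_copy(const uint8_t *shared_page)
{
	uint8_t		local[BLCKSZ];

	pin_page();
	memcpy(local, shared_page, BLCKSZ);	/* 8 kB: typically stays in L1 */
	unpin_page();

	/* ... walk all line pointers in 'local', no further pins needed ... */
}

int
main(void)
{
	static uint8_t page[BLCKSZ];

	scan_page_copy(page);
	printf("pins taken for the whole page: %d\n", pin_count);
	return 0;
}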

On a related note, I've also wondered if it would be useful to allow nodes to 
deal with more than one tuple at a time; the idea being that it's better to 
execute a smaller chunk of code over a bigger chunk of data instead of 
dribbling tuples through an entire execution tree one at a time. Perhaps that 
will only be useful if nodes are executing in parallel...
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] heap vacuum & cleanup locks

2011-06-05 Thread Robert Haas
On Mon, Jun 6, 2011 at 12:19 AM, Itagaki Takahiro
 wrote:
> On Sun, Jun 5, 2011 at 12:03, Robert Haas  wrote:
>> If other buffer pins do exist, then we can't
>> defragment the page, but that doesn't mean no useful work can be done:
>> we can still mark used line pointers dead, or dead line pointers
>> unused.  We cannot defragment, but that can be done either by the next
>> VACUUM or by a HOT cleanup.
>
> This is just an idea -- Is it possible to have copy-on-write techniques?
> VACUUM allocates a duplicated page for the pinned page, and copy valid
> tuples into the new page. Following buffer readers after the VACUUM will
> see the cloned page instead of the old pinned one.

Heikki suggested the same thing, and it's not a bad idea, but I think
it would be more work to implement than what I proposed.  The caller
would need to be aware that, if it tries to re-acquire a content lock
on the same page, the offset of the tuple within the page might
change.  I'm not sure how much work would be required to cope with
that possibility.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] SIREAD lock versus ACCESS EXCLUSIVE lock

2011-06-05 Thread Dan Ports
On Sun, Jun 05, 2011 at 12:45:41PM -0500, Kevin Grittner wrote:
> Is this possible?  If a transaction gets its snapshot while OID of N
> is assigned to relation X, can that transaction wind up seeing an OID
> of N as a reference to relation Y?  If not, there aren't any false
> positives possible.

This ought to be possible, assuming that the transaction doesn't try to
read (and take an AccessShareLock on) X.

On the other hand, given all the timing constraints you've noted, I
think it's reasonable to ignore the risk of false positives here. What
you're saying is that we might inadvertently roll back a transaction,
if a table gets dropped and a new table created and assigned the same
OID, and there's an uncommitted transaction that started before the
DROP, *and* certain other conditions hold about the reads and timing of
overlapping transactions.

This combination of conditions seems quite unlikely and I have a hard
time getting too worked up about it. Occasional false positives are
already a fact of life when using SSI.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/



Re: [HACKERS] Range Types and extensions

2011-06-05 Thread Darren Duncan

Pavel Stehule wrote:

2011/6/6 Darren Duncan :

Jeff Davis wrote:

I'd like to take another look at Range Types and whether part of it
should be an extension. Some of these issues relate to extensions in
general, not just range types.

First of all, what are the advantages to being in core?


it should be supported by the FOREACH statement in PL/pgSQL


Yes, absolutely.  I know this feature is loved in Perl.  But this usage would 
only work for a more limited range of data types, namely those over which one 
can build a sequence generator, such as integers, because they have a 
next-value/prev-value function defined.  In other words, while range types in 
general would work for any ordered type, FOREACH would only work for the subset 
of those that are ordinal types.

-- Darren Duncan




Re: [HACKERS] Range Types and extensions

2011-06-05 Thread Pavel Stehule
2011/6/6 Darren Duncan :
> Jeff Davis wrote:
>>
>> I'd like to take another look at Range Types and whether part of it
>> should be an extension. Some of these issues relate to extensions in
>> general, not just range types.
>>
>> First of all, what are the advantages to being in core?

it should be supported by the FOREACH statement in PL/pgSQL

Pavel



Re: [HACKERS] Auto adjust send buffer size to congestion window

2011-06-05 Thread Radosław Smogura
Tom Lane  Monday 06 of June 2011 04:07:41
> Radosław Smogura  writes:
> > I've got an idea to auto-adjust the send buffer size to the size of the TCP
> > congestion window. Will this increase performance, and in which way? I suppose
> > throughput may be increased, but latency decreased. What do you think?
> 
> I think if that makes any difference at all, it would be proof that the
> TCP stack on your machine was implemented by incompetents.  The kernel
> should already be handling such considerations.
> 
>   regards, tom lane
I was thinking about the send buffer size in libpq.

Regards,
Radek



Re: [HACKERS] Range Types and extensions

2011-06-05 Thread Darren Duncan

Jeff Davis wrote:

I'd like to take another look at Range Types and whether part of it
should be an extension. Some of these issues relate to extensions in
general, not just range types.

First of all, what are the advantages to being in core?


I believe that ranges (aka intervals) are widely useful generic types, next after 
relations/tuples/arrays, and they *should* be supported in core, the same as arrays are.


In particular, the usefulness of ranges/intervals is often orthogonal to many 
other things, and for many types including numbers, strings, temporals.


Now assuming that a range/interval value is generally defined in terms of a pair 
of endpoints of some ordered type (that is, a type for which ORDER BY or RANK or 
{<,>,<=,>=} etc or LIMIT makes sense), it will be essential that this value is 
capable of distinguishing open and closed intervals.


For example, a range value can be represented by a tuple with 4 attributes, 
where two of those are the endpoint values, and two of those are booleans saying 
whether each of the endpoints is inside or outside the range/interval.
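
As a sketch, specialized to integers for brevity (any ordered type would do), that 4-attribute representation and a containment test might look like:

#include <stdbool.h>
#include <stdio.h>

typedef struct IntRange
{
	int			lower;
	int			upper;
	bool		lower_inc;		/* true => '[' endpoint is inside */
	bool		upper_inc;		/* true => ']' endpoint is inside */
} IntRange;

static bool
range_contains(const IntRange *r, int v)
{
	if (r->lower_inc ? v < r->lower : v <= r->lower)
		return false;
	if (r->upper_inc ? v > r->upper : v >= r->upper)
		return false;
	return true;
}

int
main(void)
{
	IntRange	r = {1, 10, true, false};	/* 1 ..^ 10, i.e. [1,10) */

	printf("contains 10? %s\n", range_contains(&r, 10) ? "yes" : "no");
	return 0;
}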


Also, if Postgres has some concept of type-generic special values -Inf and +Inf 
(which always sort before or after any other value in the type system), those 
can be used as endpoints to indicate that the interval is unbounded.


Unless you have some other syntax in mind, I suggest lifting the range literal 
syntax from Perl 6, where ".." is an infix operator building a range between its 
arguments, and a "^" on either side means that side is open, I think; so there 
are 4 variants: {..,^..,..^,^..^}.


Now as to general usefulness of intervals ...

Any operation that wants to deal with a range somehow, such as the BETWEEN 
syntax, could instead use a range/interval; for example, both of:


  foo in 1..10

  foo between 1 and 10

... would mean the same thing, but the 1..10 can be replaced by an arbitrary 
value expression or variable reference.


Likewise with:

  date in start ..^ end

  date >= start and date < end

... mean the same thing.

The LIMIT clause could take a range to specify take and skip count at once.

Array slicing can be done using foo[first..last] or such.

A random number generator that takes endpoints can take a range argument.

An array or relation of these ranges can represent ranges with holes, and the 
general results of range union operations.


-- Darren Duncan



Re: [HACKERS] heap vacuum & cleanup locks

2011-06-05 Thread Itagaki Takahiro
On Sun, Jun 5, 2011 at 12:03, Robert Haas  wrote:
> If other buffer pins do exist, then we can't
> defragment the page, but that doesn't mean no useful work can be done:
> we can still mark used line pointers dead, or dead line pointers
> unused.  We cannot defragment, but that can be done either by the next
> VACUUM or by a HOT cleanup.

This is just an idea -- Is it possible to have copy-on-write techniques?
VACUUM allocates a duplicated page for the pinned page, and copy valid
tuples into the new page. Following buffer readers after the VACUUM will
see the cloned page instead of the old pinned one.

Of course, copy-on-write is more complex than skipping pinned pages,
but I wonder whether we might be unable to vacuum some pages at all in
some edge cases with the skipping method.
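
A toy standalone model of the copy-on-write idea, with the buffer manager abstracted away entirely; the page layout and pointer swap below are illustrative assumptions, not a design:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct Page
{
	bool		live[4];		/* is slot i a valid tuple? */
	int			nlive;
} Page;

/* New readers look here; readers already holding a pin keep the old page */
static Page *current_page;

static void
vacuum_copy_on_write(void)
{
	Page	   *clone = calloc(1, sizeof(Page));
	int			i;

	for (i = 0; i < 4; i++)
		if (current_page->live[i])	/* copy only the valid tuples */
		{
			clone->live[i] = true;
			clone->nlive++;
		}
	current_page = clone;		/* swap in the clone; the old page can be
								 * reclaimed once its last pin is dropped */
}

int
main(void)
{
	Page		p = {{true, false, true, false}, 2};

	current_page = &p;
	vacuum_copy_on_write();
	printf("live tuples after COW vacuum: %d\n", current_page->nlive);
	return 0;
}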

-- 
Itagaki Takahiro



Re: [HACKERS] reducing the overhead of frequent table locks - now, with WIP patch

2011-06-05 Thread Robert Haas
On Sun, Jun 5, 2011 at 10:16 PM, Robert Haas  wrote:
> I'm definitely interested in investigating what to do
> about that, but I don't think it's this patch's problem to fix all of
> our lock manager bottlenecks.

I did some further investigation of this.  It appears that more than
99% of the lock manager lwlock traffic that remains with this patch
applied has locktag_type == LOCKTAG_VIRTUALTRANSACTION.  Every SELECT
statement runs in a separate transaction, and for each new transaction
we run VirtualXactLockTableInsert(), which takes a lock on the vxid of
that transaction, so that other processes can wait for it.  That
requires acquiring and releasing a lock manager partition lock, and we
have to do the same thing a moment later at transaction end to dump
the lock.
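
To make that per-transaction cost concrete, here is a standalone C sketch of the bookkeeping just described: build a tag for the vxid, hash it, and take the lwlock for the partition the hash selects, once at transaction start and once at end. The partition count matches PostgreSQL's default NUM_LOCK_PARTITIONS of 16, but the tag layout and hash are simplified stand-ins for the real SET_LOCKTAG_VIRTUALTRANSACTION and LockTagHashCode():

#include <stdint.h>
#include <stdio.h>

#define NUM_LOCK_PARTITIONS 16	/* 1 << LOG2_NUM_LOCK_PARTITIONS */

typedef struct VxidLockTag
{
	uint32_t	backend_id;		/* which backend owns the vxid */
	uint32_t	local_xid;		/* per-backend transaction counter */
} VxidLockTag;

/* Stand-in hash; an assumption, not PostgreSQL's actual tag hash */
static uint32_t
vxid_hash(const VxidLockTag *tag)
{
	return tag->backend_id * 2654435761u ^ tag->local_xid * 2246822519u;
}

/* Every lock table touch must hold the lwlock for this partition */
static int
lock_partition(uint32_t hashcode)
{
	return (int) (hashcode % NUM_LOCK_PARTITIONS);
}

int
main(void)
{
	uint32_t	xid;

	/* one SELECT == one transaction == one insert plus one cleanup */
	for (xid = 1; xid <= 4; xid++)
	{
		VxidLockTag tag = {7, xid};		/* backend 7, next local xid */
		int			part = lock_partition(vxid_hash(&tag));

		printf("txn %u: partition %d lwlock taken twice (insert + release)\n",
			   xid, part);
	}
	return 0;
}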

A quick grep seems to indicate that the only places where we actually
make use of those VXID locks are in DefineIndex(), when CREATE INDEX
CONCURRENTLY is in use, and during Hot Standby, when max_standby_delay
expires.  Considering that these are not commonplace events, it seems
tremendously wasteful to incur the overhead for every transaction.  It
might be possible to make the lock entry spring into existence "on
demand" - i.e. if a backend wants to wait on a vxid entry, it creates
the LOCK and PROCLOCK objects for that vxid.  That presents a few
synchronization challenges, and plus we have to make sure that the
backend that's just been "given" a lock knows that it needs to release
it, but those seem like they might be manageable problems, especially
given the new infrastructure introduced by the current patch, which
already has to deal with some of those issues.  I'll look into this
further.

It's likely that if we lick this problem, the BufFreelistLock and
BufMappingLocks are going to be the next hot spot.  Of course, we're
ignoring the ten-thousand pound gorilla in the corner, which is that
on write workloads we have a pretty bad contention problem with
WALInsertLock, which I fear will not be so easily addressed.  But one
problem at a time, I guess.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Assert failure when rechecking an exclusion constraint

2011-06-05 Thread Tom Lane
Noah Misch  writes:
> On Sun, Jun 05, 2011 at 02:17:00PM -0400, Tom Lane wrote:
>> Attached are two versions of a patch to fix this.  The second one
>> modifies the code that tracks what's "pending" as per the above thought.
>> I'm not entirely sure which one I like better ... any comments?

> +1 for the second approach.  It had bothered me that SetReindexProcessing()
> and ResetReindexProcessing() manipulated one thing, but
> ReindexIsProcessingIndex() checked something else besides.  (That's still
> technically true, but the overall effect seems more natural.)

OK, done that way.

regards, tom lane



Re: [HACKERS] Assert failure when rechecking an exclusion constraint

2011-06-05 Thread Jeff Davis
On Sun, 2011-06-05 at 15:09 -0400, Tom Lane wrote:
> so once we've set the index as the currentlyReindexedIndex, there's
> no need for it still to be in pendingReindexedIndexes.

OK. The second version of the patch looks good to me.

Regards,
Jeff Davis




Re: [HACKERS] reducing the overhead of frequent table locks - now, with WIP patch

2011-06-05 Thread Robert Haas
On Sun, Jun 5, 2011 at 5:46 PM, Robert Haas  wrote:
> Could you compile with LWLOCK_STATS, rerun these tests, total up the
> "blk" numbers by LWLockId, and post the results?  (Actually, totalling
> up the shacq and exacq numbers would be useful as well, if you
> wouldn't mind.)

I did this on the loaner 24-core box from Nate Boley and got the
following results.  This is just the LWLocks that had blk>0.

lwlock 0: shacq 0 exacq 200625 blk 24044
lwlock 4: shacq 80101430 exacq 196 blk 28
lwlock 33: shacq 8333673 exacq 11977 blk 864
lwlock 34: shacq 7092293 exacq 11890 blk 803
lwlock 35: shacq 7893875 exacq 11909 blk 848
lwlock 36: shacq 7567514 exacq 11912 blk 830
lwlock 37: shacq 7427774 exacq 11930 blk 745
lwlock 38: shacq 7120108 exacq 11989 blk 853
lwlock 39: shacq 7584952 exacq 11982 blk 782
lwlock 40: shacq 7949867 exacq 12056 blk 821
lwlock 41: shacq 6612240 exacq 11929 blk 746
lwlock 42: shacq 47512112 exacq 11844 blk 4503
lwlock 43: shacq 7943511 exacq 11871 blk 878
lwlock 44: shacq 7534558 exacq 12033 blk 800
lwlock 45: shacq 7128256 exacq 12045 blk 856
lwlock 46: shacq 7575339 exacq 12015 blk 818
lwlock 47: shacq 6745173 exacq 12094 blk 806
lwlock 48: shacq 8410348 exacq 12104 blk 977
lwlock 49: shacq 0 exacq 5007594 blk 172533
lwlock 50: shacq 0 exacq 5011704 blk 172282
lwlock 51: shacq 0 exacq 5003356 blk 172802
lwlock 52: shacq 0 exacq 5009020 blk 174648
lwlock 53: shacq 0 exacq 5010808 blk 172080
lwlock 54: shacq 0 exacq 5004908 blk 169934
lwlock 55: shacq 0 exacq 5009324 blk 170281
lwlock 56: shacq 0 exacq 5005904 blk 171001
lwlock 57: shacq 0 exacq 5006984 blk 169942
lwlock 58: shacq 0 exacq 5000346 blk 170001
lwlock 59: shacq 0 exacq 5004884 blk 170484
lwlock 60: shacq 0 exacq 5006304 blk 171325
lwlock 61: shacq 0 exacq 5008421 blk 170866
lwlock 62: shacq 0 exacq 5008162 blk 170868
lwlock 63: shacq 0 exacq 5002238 blk 170291
lwlock 64: shacq 0 exacq 5005348 blk 169764
lwlock 307: shacq 0 exacq 2 blk 1
lwlock 315: shacq 0 exacq 3 blk 2
lwlock 337: shacq 0 exacq 4 blk 3
lwlock 345: shacq 0 exacq 2 blk 1
lwlock 349: shacq 0 exacq 2 blk 1
lwlock 231251: shacq 0 exacq 2 blk 1
lwlock 253831: shacq 0 exacq 2 blk 1
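
(The totaling itself is mechanical; a minimal C sketch, assuming the stock "lwlock N: shacq A exacq B blk C" line format -- a PID-prefixed variant of the format would need one more skipped field:)

#include <stdio.h>

#define MAXLOCKS 1000000		/* the lock ids above stay well below this */

static long shacq[MAXLOCKS], exacq[MAXLOCKS], blk[MAXLOCKS];

int
main(int argc, char **argv)
{
	char		line[256];
	FILE	   *f;
	long		id, s, e, b;

	if (argc < 2 || (f = fopen(argv[1], "r")) == NULL)
		return 1;
	while (fgets(line, sizeof line, f))
	{
		if (sscanf(line, "%*s %ld: shacq %ld exacq %ld blk %ld",
				   &id, &s, &e, &b) == 4 && id >= 0 && id < MAXLOCKS)
		{
			shacq[id] += s;		/* sum the per-backend reports by lock id */
			exacq[id] += e;
			blk[id] += b;
		}
	}
	fclose(f);
	for (id = 0; id < MAXLOCKS; id++)
		if (blk[id] > 0)
			printf("lwlock %ld: shacq %ld exacq %ld blk %ld\n",
				   id, shacq[id], exacq[id], blk[id]);
	return 0;
}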

So basically, even with the patch, at 24 cores the lock manager locks
are still under tremendous pressure.  But note that there's a big
difference between what's happening here and what's happening without
the patch.  Here's without the patch:

lwlock 0: shacq 0 exacq 191613 blk 17591
lwlock 4: shacq 21543085 exacq 102 blk 20
lwlock 33: shacq 2237938 exacq 11976 blk 463
lwlock 34: shacq 1907344 exacq 11890 blk 458
lwlock 35: shacq 2125308 exacq 11908 blk 442
lwlock 36: shacq 2038220 exacq 11912 blk 430
lwlock 37: shacq 1998059 exacq 11927 blk 449
lwlock 38: shacq 1916179 exacq 11953 blk 409
lwlock 39: shacq 2042173 exacq 12019 blk 479
lwlock 40: shacq 2140002 exacq 12056 blk 448
lwlock 41: shacq 1776772 exacq 11928 blk 392
lwlock 42: shacq 12777368 exacq 11842 blk 2451
lwlock 43: shacq 2132240 exacq 11869 blk 478
lwlock 44: shacq 2026845 exacq 12031 blk 446
lwlock 45: shacq 1918618 exacq 12045 blk 449
lwlock 46: shacq 2038437 exacq 12011 blk 472
lwlock 47: shacq 1814660 exacq 12089 blk 401
lwlock 48: shacq 2261208 exacq 12105 blk 478
lwlock 49: shacq 0 exacq 1347524 blk 17020
lwlock 50: shacq 0 exacq 1350678 blk 16888
lwlock 51: shacq 0 exacq 1346260 blk 16744
lwlock 52: shacq 0 exacq 1348432 blk 16864
lwlock 53: shacq 0 exacq 22216779 blk 4914363
lwlock 54: shacq 0 exacq 22217309 blk 4525381
lwlock 55: shacq 0 exacq 1348406 blk 13438
lwlock 56: shacq 0 exacq 1345996 blk 13299
lwlock 57: shacq 0 exacq 1347890 blk 13654
lwlock 58: shacq 0 exacq 1343486 blk 13349
lwlock 59: shacq 0 exacq 1346198 blk 13471
lwlock 60: shacq 0 exacq 1346236 blk 13532
lwlock 61: shacq 0 exacq 1343688 blk 13547
lwlock 62: shacq 0 exacq 1350068 blk 13614
lwlock 63: shacq 0 exacq 1345302 blk 13420
lwlock 64: shacq 0 exacq 1348858 blk 13635
lwlock 321: shacq 0 exacq 2 blk 1
lwlock 329: shacq 0 exacq 4 blk 3
lwlock 337: shacq 0 exacq 6 blk 4
lwlock 347: shacq 0 exacq 5 blk 4
lwlock 357: shacq 0 exacq 3 blk 2
lwlock 363: shacq 0 exacq 3 blk 2
lwlock 369: shacq 0 exacq 4 blk 3
lwlock 379: shacq 0 exacq 2 blk 1
lwlock 383: shacq 0 exacq 2 blk 1
lwlock 445: shacq 0 exacq 2 blk 1
lwlock 449: shacq 0 exacq 2 blk 1
lwlock 451: shacq 0 exacq 2 blk 1
lwlock 1023: shacq 0 exacq 2 blk 1
lwlock 11401: shacq 0 exacq 2 blk 1
lwlock 115591: shacq 0 exacq 2 blk 1
lwlock 117177: shacq 0 exacq 2 blk 1
lwlock 362839: shacq 0 exacq 2 blk 1

In the unpatched case, two lock manager locks are getting beaten to
death, and the others are all about equally contended.
portion of the lock manager contention that pertains specifically to
the two heavily trafficked locks, system throughput improves by about
3.5x - and, not surprisingly, traffic on the lock manager locks
increases by approximately the same multiple.  Those locks now become
the contention bottleneck, with about 12x the blocking they had
pre-patch.  I'm definitely 

Re: [HACKERS] SIREAD lock versus ACCESS EXCLUSIVE lock

2011-06-05 Thread Kevin Grittner
"Kevin Grittner" wrote:

> Maybe I should submit a patch without the added complexity of the
> scheduled cleanup and we can discuss that as a separate patch?

Here's a patch which adds the missing support for DDL. Cleanup of
predicate locks at commit time for transactions which ran DROP TABLE
or TRUNCATE TABLE can be added as a separate patch. I consider those
to be optimizations which are of dubious benefit, especially compared
to the complexity of the extra code required.

Also, Heikki pointed out that explicitly releasing the predicate
locks after a DROP DATABASE is an optimization, which on reflection
also seems to be of dubious value compared to the code needed.

Unless someone can find a way in which any of these early cleanups
are needed for correctness, I suggest they are better left as
enhancements in some future release, where there can be some
demonstration that they matter enough for performance to justify the
extra code, and there can be more testing to ensure the new code
doesn't break anything.

I stashed the partial work on the more aggressive cleanup, so if
there's a consensus that we want it, I can post a patch pretty
quickly.
 
In making sure that the new code for this patch was in pgindent
format, I noticed that the ASCII art and bullet lists recently added
to OnConflict_CheckForSerializationFailure() were mangled badly by
pgindent, so I added the dashes to protect those and included a
pgindent form of that function.  That should save someone some
trouble sorting things out after the next global pgindent run.
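
For anyone unfamiliar with the convention Kevin is referring to: pgindent leaves a comment's layout alone when its opening line ends in a run of dashes, so carefully formatted comments are fenced like this (a generic example, not the actual comment from OnConflict_CheckForSerializationFailure):

/*----------
 * pgindent will not reflow the body of this comment, so ASCII art and
 * lists such as the one below survive a global run:
 *	1. first case
 *	2. second case
 *----------
 */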

-Kevin




ssi-ddl-3.patch
Description: Binary data



Re: [HACKERS] Auto adjust send buffer size to congestion window

2011-06-05 Thread Tom Lane
Radosław Smogura  writes:
> I've got an idea to auto-adjust the send buffer size to the size of the TCP
> congestion window. Will this increase performance, and in which way? I suppose
> throughput may be increased, but latency decreased. What do you think?

I think if that makes any difference at all, it would be proof that the
TCP stack on your machine was implemented by incompetents.  The kernel
should already be handling such considerations.

regards, tom lane



Re: [HACKERS] Assert failure when rechecking an exclusion constraint

2011-06-05 Thread Noah Misch
On Sun, Jun 05, 2011 at 02:17:00PM -0400, Tom Lane wrote:
> I wrote:
> > Noah Misch  writes:
> >> Sounds reasonable.  Need to remove the index from pendingReindexedIndexes,
> >> not just call ResetReindexProcessing().
> 
> > [ looks again... ]  Uh, right.  I was thinking that the pending list was
> > just "pending" and not "in progress" indexes.  I wonder if we should
> > rejigger things so that that's actually true, ie, remove an index's OID
> > from the pending list when we mark it as the current one?
> 
> Attached are two versions of a patch to fix this.  The second one
> modifies the code that tracks what's "pending" as per the above thought.
> I'm not entirely sure which one I like better ... any comments?

+1 for the second approach.  It had bothered me that SetReindexProcessing() and
ResetReindexProcessing() manipulated one thing, but ReindexIsProcessingIndex()
checked something else besides.  (That's still technically true, but the overall
effect seems more natural.)

Thanks,
nm



Re: [HACKERS] Review: psql include file using relative path

2011-06-05 Thread Gurjeet Singh
On Sun, Jun 5, 2011 at 1:06 PM, Josh Kupershmidt  wrote:

> On Sun, Jun 5, 2011 at 10:21 AM, Gurjeet Singh 
> wrote:
> > On Sat, May 21, 2011 at 11:59 AM, Josh Kupershmidt 
> > wrote:
>
> > Tweaks applied, but omitted the C variable names as I don't think that
> adds
> > much value.
>
> Your rewordings are fine, but the article "the" is missing in a
> few spots, e.g.
>  * "uses \ir command" -> "uses the \ir command"
>  * "to currently processing file" -> "to the currently processing file"
>  * "same as \i command" -> "same as the \i command"
>
> I think "processing" is better (and consistent with the rest of the
> comments) than "processed" here:
> + * the file from where the currently processed file (if any) is located.
>
> > New version of the patch attached. Thanks for the review.
>
> I think the patch is in pretty good shape now. The memory leak is gone
> AFAICT, and the comments and documentation updates look good.
>

Attached an updated patch.

If you find it ready for committer, please mark it so in the commitfest app.

Thanks,
-- 
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index eaf901d..9bd6d93 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1626,6 +1626,30 @@ Tue Oct 26 21:40:57 CEST 1999
 
 
   
+\ir filename
+
+
+The \ir command is similar to \i, but is useful for files which
+include and execute other files.
+
+
+
+When used within a file loaded via \i, \ir, or
+-f, if the filename
+is specified using a relative path, then the location of the file will be determined
+relative to the currently executing file's location.
+
+
+
+If filename is given as an absolute path,
+or if this command is used in interactive mode, then \ir behaves the same
+as the \i command.
+
+
+  
+
+
+  
 \l (or \list)
 \l+ (or \list+)
 
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 378330b..faa3087 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -785,10 +785,15 @@ exec_command(const char *cmd,
 
 
 	/* \i is include file */
-	else if (strcmp(cmd, "i") == 0 || strcmp(cmd, "include") == 0)
+	/* \ir is include file relative to file currently being processed */
+	else if (strcmp(cmd, "i") == 0 || strcmp(cmd, "include") == 0
+			|| strcmp(cmd, "ir") == 0 || strcmp(cmd, "include_relative") == 0)
 	{
-		char	   *fname = psql_scan_slash_option(scan_state,
-   OT_NORMAL, NULL, true);
+		char   *fname = psql_scan_slash_option(scan_state,
+			   OT_NORMAL, NULL, true);
+
+		bool	include_relative = (strcmp(cmd, "ir") == 0
+	|| strcmp(cmd, "include_relative") == 0);
 
 		if (!fname)
 		{
@@ -798,7 +803,7 @@ exec_command(const char *cmd,
 		else
 		{
 			expand_tilde(&fname);
-			success = (process_file(fname, false) == EXIT_SUCCESS);
+			success = (process_file(fname, false, include_relative) == EXIT_SUCCESS);
 			free(fname);
 		}
 	}
@@ -1969,15 +1974,19 @@ do_edit(const char *filename_arg, PQExpBuffer query_buf,
  * process_file
  *
  * Read commands from filename and then pass them to the main processing loop
- * Handler for \i, but can be used for other things as well.  Returns
+ * Handler for \i and \ir, but can be used for other things as well.  Returns
  * MainLoop() error code.
+ *
+ * If use_relative_path is true and filename is not an absolute path, then open
+ * the file from where the currently processed file (if any) is located.
  */
 int
-process_file(char *filename, bool single_txn)
+process_file(char *filename, bool single_txn, bool use_relative_path)
 {
 	FILE	   *fd;
 	int			result;
 	char	   *oldfilename;
+	char	   *relative_file = NULL;
 	PGresult   *res;
 
 	if (!filename)
@@ -1986,6 +1995,39 @@ process_file(char *filename, bool single_txn)
 	if (strcmp(filename, "-") != 0)
 	{
 		canonicalize_path(filename);
+
+		/*
+		 * If the currently processing file uses the \ir command, and the
+		 * 'filename' parameter to the command is given as a relative file path,
+		 * then we resolve this path relative to the currently processing file.
+		 *
+		 * If the \ir command was executed in interactive mode (i.e. not via \i,
		 * \ir or -f) then we treat it the same as the \i command.
+		 */
+		if (use_relative_path && pset.inputfile)
+		{
+			char	   *last_slash;
+
+			/* find the / that splits the file from its path */
+			last_slash = strrchr(pset.inputfile, '/');
+
+			if (last_slash && !is_absolute_path(filename))
+			{
+				size_t		dir_len = (last_slash - pset.inputfile) + 1;
+				size_t		file_len = strlen(filename);
+
+				relative_file = pg_malloc(dir_len + 1 + file_len + 1);
+
+				relative_file[0] = '\0';
+				strncat(relative_file, pset.inputfile, dir_len);
+				strcat(relative_file, filename);
+
+				canonicalize_path(relative_file);

Re: [HACKERS] reducing the overhead of frequent table locks - now, with WIP patch

2011-06-05 Thread Robert Haas
On Sun, Jun 5, 2011 at 4:01 PM, Stefan Kaltenbrunner
 wrote:
> On 06/05/2011 09:12 PM, Heikki Linnakangas wrote:
>> On 05.06.2011 22:04, Stefan Kaltenbrunner wrote:
>>> and one for the -j80 case(also patched).
>>>
>>>
>>> 485798   48.9667  postgres                 s_lock
>>> 60327     6.0808  postgres                 LWLockAcquire
>>> 57049     5.7503  postgres                 LWLockRelease
>>> 18357     1.8503  postgres                 hash_search_with_hash_value
>>> 17033     1.7169  postgres                 GetSnapshotData
>>> 14763     1.4881  postgres                 base_yyparse
>>> 14460     1.4575  postgres                 SearchCatCache
>>> 13975     1.4086  postgres                 AllocSetAlloc
>>> 6416      0.6467  postgres                 PinBuffer
>>> 5024      0.5064  postgres                 SIGetDataEntries
>>> 4704      0.4741  postgres                 core_yylex
>>> 4625      0.4662  postgres                 _bt_compare
>>
>> Hmm, does that mean that it's spending 50% of the time spinning on a
>> spinlock? That's bad. It's one thing to be contended on a lock, and have
>> a lot of idle time because of that, but it's even worse to spend a lot
>> of time spinning because that CPU time won't be spent on doing more
>> useful work, even if there is some other process on the system that
>> could make use of that CPU time.
>
> well yeah - we are broken right now with only being able to use ~20% of
> CPU on a modern mid-range box, but using 80% CPU (or 4x like in the
> above case) and only getting less than 2x the performance seems wrong as
> well. I also wonder if we are still missing something fundamental -
> because even with the current patch we are quite far away from linear
> scaling and light-years from some of our competitors...

Could you compile with LWLOCK_STATS, rerun these tests, total up the
"blk" numbers by LWLockId, and post the results?  (Actually, totalling
up the shacq and exacq numbers would be useful as well, if you
wouldn't mind.)

Unless I very much miss my guess, we're going to see zero contention
on the new structures introduced by this patch.  Rather, I suspect
what we're going to find is that, with the hideous contention on one
particular lock manager partition lock removed, there's a more
spread-out contention problem, likely involving the lock manager
partition lock, the buffer mapping locks, and possibly other LWLocks
as well.  The fact that the system is busy-waiting rather than just
not using the CPU at all probably means that the remaining contention
is more spread out than that which is removed by this patch.  We don't
actually have everything pile up on a single LWLock (as happens in git
master), but we do spend a lot of time fighting cache lines away from
other CPUs.  Or at any rate, that's my guess: we need some real
numbers to know for sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] BLOB support

2011-06-05 Thread Dimitri Fontaine
Tom Lane  writes:
> Yes.  I think the appropriate problem statement is "provide streaming
> access to large field values, as an alternative to just fetching/storing
> the entire value at once".  I see no good reason to import the entire
> messy notion of LOBS/CLOBS.  (The fact that other databases have done it
> is not a good reason.)

Spent some time in the archive to confirm a certain “déjà vu”
impression.  Couldn't find it.  Had to manually search in closed commit
fests… but here we are, I think:

  https://commitfest.postgresql.org/action/patch_view?id=70
  http://archives.postgresql.org/message-id/17891.1246301...@sss.pgh.pa.us
  http://archives.postgresql.org/message-id/4a4bf87e.7010...@ak.jp.nec.com

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] reducing the overhead of frequent table locks - now, with WIP patch

2011-06-05 Thread Stefan Kaltenbrunner
On 06/05/2011 09:12 PM, Heikki Linnakangas wrote:
> On 05.06.2011 22:04, Stefan Kaltenbrunner wrote:
>> and one for the -j80 case(also patched).
>>
>>
>> 485798   48.9667  postgres s_lock
>> 60327 6.0808  postgres LWLockAcquire
>> 57049 5.7503  postgres LWLockRelease
>> 18357 1.8503  postgres hash_search_with_hash_value
>> 17033 1.7169  postgres GetSnapshotData
>> 14763 1.4881  postgres base_yyparse
>> 14460 1.4575  postgres SearchCatCache
>> 13975 1.4086  postgres AllocSetAlloc
>> 6416  0.6467  postgres PinBuffer
>> 5024  0.5064  postgres SIGetDataEntries
>> 4704  0.4741  postgres core_yylex
>> 4625  0.4662  postgres _bt_compare
> 
> Hmm, does that mean that it's spending 50% of the time spinning on a
> spinlock? That's bad. It's one thing to be contended on a lock, and have
> a lot of idle time because of that, but it's even worse to spend a lot
> of time spinning because that CPU time won't be spent on doing more
> useful work, even if there is some other process on the system that
> could make use of that CPU time.

well yeah - we are broken right now with only being able to use ~20% of
CPU on a modern mid-range box, but using 80% CPU (or 4x like in the
above case) and only getting less than 2x the performance seems wrong as
well. I also wonder if we are still missing something fundamental -
because even with the current patch we are quite far away from linear
scaling and light-years from some of our competitors...


Stefan



[HACKERS] Auto adjust send buffer size to congestion window

2011-06-05 Thread Radosław Smogura
Hi,

I've got an idea to auto-adjust the send buffer size to the size of the TCP 
congestion window. Will this increase performance, and in which way? I suppose 
throughput may be increased, but latency decreased. What do you think?

Regards,
Radek




Re: [HACKERS] reducing the overhead of frequent table locks - now, with WIP patch

2011-06-05 Thread Heikki Linnakangas

On 05.06.2011 22:04, Stefan Kaltenbrunner wrote:

and one for the -j80 case(also patched).


485798   48.9667  postgres s_lock
60327 6.0808  postgres LWLockAcquire
57049 5.7503  postgres LWLockRelease
18357 1.8503  postgres hash_search_with_hash_value
17033 1.7169  postgres GetSnapshotData
14763 1.4881  postgres base_yyparse
14460 1.4575  postgres SearchCatCache
13975 1.4086  postgres AllocSetAlloc
6416  0.6467  postgres PinBuffer
5024  0.5064  postgres SIGetDataEntries
4704  0.4741  postgres core_yylex
4625  0.4662  postgres _bt_compare


Hmm, does that mean that it's spending 50% of the time spinning on a 
spinlock? That's bad. It's one thing to be contended on a lock, and have 
a lot of idle time because of that, but it's even worse to spend a lot 
of time spinning because that CPU time won't be spent on doing more 
useful work, even if there is some other process on the system that 
could make use of that CPU time.


I like the overall improvement on the throughput, of course, but we have 
to find a way to avoid the busy-wait.
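
For context, s_lock() behaves roughly like this standalone model: spin on test-and-set for a bounded number of attempts (PostgreSQL's DEFAULT_SPINS_PER_DELAY is 100), then sleep before trying again. The real implementation adds randomized backoff; this is a simplified sketch:

#include <stdatomic.h>
#include <unistd.h>

#define SPINS_PER_DELAY 100

static atomic_flag slock = ATOMIC_FLAG_INIT;

static void
s_lock_model(void)
{
	int			spins = 0;

	while (atomic_flag_test_and_set_explicit(&slock, memory_order_acquire))
	{
		if (++spins >= SPINS_PER_DELAY)
		{
			usleep(1000);		/* back off so the CPU can do useful work;
								 * the 48.9% above is time burned in this
								 * loop before backing off */
			spins = 0;
		}
	}
}

static void
s_unlock_model(void)
{
	atomic_flag_clear_explicit(&slock, memory_order_release);
}

int
main(void)
{
	s_lock_model();
	s_unlock_model();
	return 0;
}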


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Assert failure when rechecking an exclusion constraint

2011-06-05 Thread Tom Lane
Jeff Davis  writes:
> On Sun, 2011-06-05 at 14:17 -0400, Tom Lane wrote:
>> Attached are two versions of a patch to fix this.  The second one
>> modifies the code that tracks what's "pending" as per the above thought.
>> I'm not entirely sure which one I like better ... any comments?

> I think I'm missing something simple: if it's not in the pending list,
> what prevents systable_beginscan() from using it?

The test that it uses is

/*
 * ReindexIsProcessingIndex
 *  True if index specified by OID is currently being reindexed,
 *  or should be treated as invalid because it is awaiting reindex.
 */
bool
ReindexIsProcessingIndex(Oid indexOid)
{
return indexOid == currentlyReindexedIndex ||
list_member_oid(pendingReindexedIndexes, indexOid);
}

so once we've set the index as the currentlyReindexedIndex, there's
no need for it still to be in pendingReindexedIndexes.

(Arguably, this function should have been renamed when it was changed
to look at pendingReindexedIndexes too ...)

regards, tom lane



Re: [HACKERS] reducing the overhead of frequent table locks - now, with WIP patch

2011-06-05 Thread Stefan Kaltenbrunner
On 06/03/2011 03:17 PM, Robert Haas wrote:
[...]
> 
> As you can see, this works out to a bit more than a 4% improvement on
> this two-core box.  I also got access (thanks to Nate Boley) to a
> 24-core box and ran the same test with scale factor 100 and
> shared_buffers=8GB.  Here are the results of alternating runs without
> and with the patch on that machine:
> 
> tps = 36291.996228 (including connections establishing)
> tps = 129242.054578 (including connections establishing)
> tps = 36704.393055 (including connections establishing)
> tps = 128998.648106 (including connections establishing)
> tps = 36531.208898 (including connections establishing)
> tps = 131341.367344 (including connections establishing)
> 
> That's an improvement of about ~3.5x.  According to the vmstat output,
> when running without the patch, the CPU state was about 40% idle.
> With the patch, it dropped down to around 6%.

nice - but let's see on real hardware...

Testing this on a brand new E7-4850 4 Socket/10cores+HT Box - so 80
hardware threads:

first some numbers with -HEAD (-T 120; runtimes at lower -c counts have
fairly high variation in the results; the first number is the number of
connections/threads):


-j1:    tps = 7928.965493 (including connections establishing)
-j8:    tps = 53610.572347 (including connections establishing)
-j16:   tps = 80835.446118 (including connections establishing)
-j32:   tps = 75666.731883 (including connections establishing)
-j40:   tps = 74628.568388 (including connections establishing)
-j64:   tps = 68268.081973 (including connections establishing)
-c80:   tps = 66704.216166 (including connections establishing)

postgresql is completely lock-limited in this test; anything beyond
around -j10 is basically unable to push the box below 80% idle(!)


and now with the patch applied:

-j1:    tps = 7783.295587 (including connections establishing)
-j8:    tps = 44361.661947 (including connections establishing)
-j16:   tps = 92270.464541 (including connections establishing)
-j24:   tps = 108259.524782 (including connections establishing)
-j32:   tps = 183337.422612 (including connections establishing)
-j40:   tps = 209616.052430 (including connections establishing)
-j48:   tps = 229621.292382 (including connections establishing)
-j56:   tps = 218690.391603 (including connections establishing)
-j64:   tps = 188028.348501 (including connections establishing)
-j80:   tps = 118814.741609 (including connections establishing)


so much better - but I still think there is some headroom left, although
pgbench itself is a CPU hog in those benchmarks, eating up to 10 cores in
the worst-case scenario - I will retest with sysbench, which in the past
showed more reasonable CPU usage for me.



and a profile(patched code) for the -j48(aka fastest) case:

731535   11.8408  postgres s_lock
291878    4.7244  postgres LWLockAcquire
242373    3.9231  postgres AllocSetAlloc
239083    3.8698  postgres LWLockRelease
202341    3.2751  postgres SearchCatCache
190055    3.0763  postgres hash_search_with_hash_value
187148    3.0292  postgres base_yyparse
173265    2.8045  postgres GetSnapshotData
75700     1.2253  postgres core_yylex
74974     1.2135  postgres MemoryContextAllocZeroAligned
61404     0.9939  postgres _bt_compare
57529     0.9312  postgres MemoryContextAlloc


and one for the -j80 case(also patched).


485798   48.9667  postgres s_lock
60327 6.0808  postgres LWLockAcquire
57049 5.7503  postgres LWLockRelease
18357 1.8503  postgres hash_search_with_hash_value
17033 1.7169  postgres GetSnapshotData
14763 1.4881  postgres base_yyparse
14460 1.4575  postgres SearchCatCache
13975 1.4086  postgres AllocSetAlloc
6416  0.6467  postgres PinBuffer
5024  0.5064  postgres SIGetDataEntries
4704  0.4741  postgres core_yylex
4625  0.4662  postgres _bt_compare



Stefan



[HACKERS] Range Types and extensions

2011-06-05 Thread Jeff Davis
I'd like to take another look at Range Types and whether part of it
should be an extension. Some of these issues relate to extensions in
general, not just range types.

First of all, what are the advantages to being in core?

1. ANYRANGE + CREATE TYPE ... AS RANGE
--
This is the most compelling, in my opinion. People can define new range
functions and new range types independently and each one gets the
benefit of the other automatically. Without this, there will be an
explosion of functions and a bunch of inconsistencies like functions
that support most range types but not all (merely because the function
author didn't know that the type existed).

In the several talks that I've given, a common question is related to
"multiranges" (ranges with holes). These get a little complex, and I
don't have a complete answer. However, multiranges can be approximated
with ordered arrays of non-overlapping, non-adjacent ranges. If someone
wants to take it upon themselves to develop a set of operators here,
that would be great -- but without ANYRANGE the operators would be
unmanageable.
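
To illustrate why the approximation is workable, here is a standalone sketch of the core normalization such operators would share: sort by lower bound, then merge overlapping or adjacent ranges (closed integer ranges only, for brevity):

#include <stdio.h>
#include <stdlib.h>

typedef struct Range
{
	int			lo, hi;			/* closed endpoints */
} Range;

static int
cmp_lo(const void *a, const void *b)
{
	return ((const Range *) a)->lo - ((const Range *) b)->lo;
}

/*
 * Normalize in place: sort by lower bound, then fold each range that
 * overlaps or is adjacent to its predecessor.  Returns the new count.
 */
static int
multirange_normalize(Range *r, int n)
{
	int			out = 0;
	int			i;

	qsort(r, n, sizeof(Range), cmp_lo);
	for (i = 0; i < n; i++)
	{
		if (out > 0 && r[i].lo <= r[out - 1].hi + 1)
		{
			if (r[i].hi > r[out - 1].hi)
				r[out - 1].hi = r[i].hi;	/* merge */
		}
		else
			r[out++] = r[i];
	}
	return out;
}

int
main(void)
{
	Range		r[] = {{8, 10}, {1, 3}, {4, 5}, {20, 25}};
	int			n = multirange_normalize(r, 4);
	int			i;

	for (i = 0; i < n; i++)
		printf("[%d,%d] ", r[i].lo, r[i].hi);	/* [1,5] [8,10] [20,25] */
	printf("\n");
	return 0;
}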

2. Documentation and Tests
--
Let's say we take a minimalist view, and only have ANYRANGE and CREATE
TYPE ... AS RANGE in core; and leave the rest as an extension.

What exactly would the documentation say? I think it would be even more
hypothetical and abstract than the documentation for Exclusion
Constraints. So, there is a certain documentation advantage to having at
least enough functionality to allow someone to try out the feature.

And the tests for such a minimalist feature would be a significant
challenge -- what do we do there? Get pg_regress to load the extension
from PGXN?


3. Quality
--
PostgreSQL has a great reputation for quality, and for good reason. But
extensions don't follow the same quality-control standards; and even if
some do, there is no visible stamp of approval. So, to ask someone to
use an extension means that they have to evaluate the quality for
themselves, which is a pretty high barrier.

Since PGXN (thanks David Wheeler) and EXTENSIONs (thanks Dmitri) solve
many of the other issues, quality control is one of the biggest ones
remaining. I still get questions about when the temporal type will be
"in core", and I think this is why.

I don't think this is a good excuse to put it in core though. We need to
solve this problem, and the best way to start is by getting
well-reviewed, high-quality extensions out there.


4. Future work -- RANGE KEY, RANGE FOREIGN KEY, RANGE MERGE JOIN, etc.
-
There are a few aspects of range types that aren't in the first patch,
but are fairly obvious follow-up additions. These will require some
knowledge about ranges in the backend, like finding the "overlaps"
operator for a range. The current patch provides this knowledge by
providing a built-in overlaps operator for ANYRANGE. This would be a
non-issue if we had a good type interface system (that works on
polymorphic types) -- we could just have a built-in "range" interface,
and the range extension could add "&&" as the range interface's overlaps
operator for the type ANYRANGE.

=

So, where on this spectrum should range types fall? I think the most
minimalist would be to only support #1 (and the necessary type IO
functions); and leave all other functions, operators, and opclasses to
an extension. That has a lot of appeal, but I don't think we can ignore
the challenges above.

On the other hand, trying to make it a complete feature in core has
challenges as well. For instance, even with Range Types, Exclusion
Constraints aren't practical out-of-the-box unless we also have
BTree-GiST in core. So there's a snowball effect.

There might also be some middle ground, where its like the minimalist
approach, but with a few very basic constructors and accessors. That
would at least make it easier to test, but then to be actually useful
(with index support, operators, fancy functions, etc.) you'd need the
extension.

Thoughts?

Regards,
Jeff Davis





Re: [HACKERS] Assert failure when rechecking an exclusion constraint

2011-06-05 Thread Jeff Davis
On Sun, 2011-06-05 at 14:17 -0400, Tom Lane wrote:
> Attached are two versions of a patch to fix this.  The second one
> modifies the code that tracks what's "pending" as per the above thought.
> I'm not entirely sure which one I like better ... any comments?

I think I'm missing something simple: if it's not in the pending list,
what prevents systable_beginscan() from using it?

Regards,
Jeff Davis




Re: [HACKERS] Assert failure when rechecking an exclusion constraint

2011-06-05 Thread Tom Lane
I wrote:
> Noah Misch  writes:
>> Sounds reasonable.  Need to remove the index from pendingReindexedIndexes,
>> not just call ResetReindexProcessing().

> [ looks again... ]  Uh, right.  I was thinking that the pending list was
> just "pending" and not "in progress" indexes.  I wonder if we should
> rejigger things so that that's actually true, ie, remove an index's OID
> from the pending list when we mark it as the current one?

Attached are two versions of a patch to fix this.  The second one
modifies the code that tracks what's "pending" as per the above thought.
I'm not entirely sure which one I like better ... any comments?

regards, tom lane

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 53b4c3c59bf78cacf8def18ac4d46940a3a29b71..b046f38ecb8a47538cb12e036a6fca6c3509aec1 100644
*** a/src/backend/catalog/index.c
--- b/src/backend/catalog/index.c
*** static void validate_index_heapscan(Rela
*** 115,120 
--- 115,121 
  		Snapshot snapshot,
  		v_i_state *state);
  static Oid	IndexGetRelation(Oid indexId);
+ static bool ReindexIsCurrentlyProcessingIndex(Oid indexOid);
  static void SetReindexProcessing(Oid heapOid, Oid indexOid);
  static void ResetReindexProcessing(void);
  static void SetReindexPending(List *indexes);
*** index_build(Relation heapRelation,
*** 1758,1776 
  	}
  
  	/*
- 	 * If it's for an exclusion constraint, make a second pass over the heap
- 	 * to verify that the constraint is satisfied.
- 	 */
- 	if (indexInfo->ii_ExclusionOps != NULL)
- 		IndexCheckExclusion(heapRelation, indexRelation, indexInfo);
- 
- 	/* Roll back any GUC changes executed by index functions */
- 	AtEOXact_GUC(false, save_nestlevel);
- 
- 	/* Restore userid and security context */
- 	SetUserIdAndSecContext(save_userid, save_sec_context);
- 
- 	/*
  	 * If we found any potentially broken HOT chains, mark the index as not
  	 * being usable until the current transaction is below the event horizon.
  	 * See src/backend/access/heap/README.HOT for discussion.
--- 1759,1764 
*** index_build(Relation heapRelation,
*** 1824,1831 
  	   InvalidOid,
  	   stats->index_tuples);
  
! 	/* Make the updated versions visible */
  	CommandCounterIncrement();
  }
  
  
--- 1812,1834 
  	   InvalidOid,
  	   stats->index_tuples);
  
! 	/* Make the updated catalog row versions visible */
  	CommandCounterIncrement();
+ 
+ 	/*
+ 	 * If it's for an exclusion constraint, make a second pass over the heap
+ 	 * to verify that the constraint is satisfied.  We must not do this until
+ 	 * the index is fully valid.  (Broken HOT chains shouldn't matter, though;
+ 	 * see comments for IndexCheckExclusion.)
+ 	 */
+ 	if (indexInfo->ii_ExclusionOps != NULL)
+ 		IndexCheckExclusion(heapRelation, indexRelation, indexInfo);
+ 
+ 	/* Roll back any GUC changes executed by index functions */
+ 	AtEOXact_GUC(false, save_nestlevel);
+ 
+ 	/* Restore userid and security context */
+ 	SetUserIdAndSecContext(save_userid, save_sec_context);
  }
  
  
*** IndexCheckExclusion(Relation heapRelatio
*** 2270,2275 
--- 2273,2290 
  	ExprContext *econtext;
  
  	/*
+ 	 * If we are reindexing the target index, mark it as no longer being
+ 	 * reindexed, to forestall an Assert in index_beginscan when we try to
+ 	 * use the index for probes.  This is OK because the index is now
+ 	 * fully valid.
+ 	 */
+ 	if (ReindexIsCurrentlyProcessingIndex(RelationGetRelid(indexRelation)))
+ 	{
+ 		ResetReindexProcessing();
+ 		RemoveReindexPending(RelationGetRelid(indexRelation));
+ 	}
+ 
+ 	/*
  	 * Need an EState for evaluation of index expressions and partial-index
  	 * predicates.	Also a slot to hold the current tuple.
  	 */
*** reindex_relation(Oid relid, int flags)
*** 3030,3036 
   *		System index reindexing support
   *
   * When we are busy reindexing a system index, this code provides support
!  * for preventing catalog lookups from using that index.
   * 
   */
  
--- 3045,3053 
   *		System index reindexing support
   *
   * When we are busy reindexing a system index, this code provides support
!  * for preventing catalog lookups from using that index.  We also make use
!  * of this to catch attempted uses of user indexes during reindexing of
!  * those indexes.
   * 
   */
  
*** ReindexIsProcessingHeap(Oid heapOid)
*** 3049,3054 
--- 3066,3081 
  }
  
  /*
+  * ReindexIsCurrentlyProcessingIndex
+  *		True if index specified by OID is currently being reindexed.
+  */
+ static bool
+ ReindexIsCurrentlyProcessingIndex(Oid indexOid)
+ {
+ 	return indexOid == currentlyReindexedIndex;
+ }
+ 
+ /*
   * ReindexIsProcessingIndex
   *		True if index specified by OID is currently being reindexed,
   *		or should be treated as invalid because it i

[HACKERS] VIP: enhanced errors

2011-06-05 Thread Pavel Stehule
Hello all

I am working on new diagnostics fields in errors - CONSTRAINT_NAME,
SCHEMA_NAME, TABLE_NAME and COLUMN_NAME.

These fields are shown when verbosity mode is active. Currently this
works for table constraints, not-null constraints and for RI
constraints.

postgres=# delete  from xxx;
ERROR:  23503: update or delete on table "xxx" violates foreign key
constraint "yyy_b_fkey" on table "yyy"
DETAIL:  Key (a)=(10) is still referenced from table "yyy".
LOCATION:  ri_ReportViolation, ri_triggers.c:3593
CONSTRAINT:  yyy_b_fkey
SCHEMA:  public
TABLE:  xxx
COLUMN:  a

These fields should be available from the GET DIAGNOSTICS statement too -
it could be the next step after support for stacked diagnostics.

I looked at column name identification for column constraints - and
there is no way to identify the column at present - so I need a new
column in the pg_constraint table, "attrid", that would identify the
column a column constraint applies to.

This patch is just a concept. The final patch will be significantly
longer - we will need to check every "ereport" call, though the most
important constraints are the table and RI constraints.
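
To make the proposed fields concrete, here is a minimal sketch (not
part of the patch itself) of how one of the new ereport auxiliary
functions could look inside elog.c, modeled on existing helpers such
as errhint(); the constraint_name field of ErrorData is an assumed
addition:

/*
 * Sketch only: record the constraint name in the current error entry.
 * errordata and errordata_stack_depth are elog.c's existing statics;
 * constraint_name is a hypothetical new ErrorData field.
 */
int
errconstraintname(const char *name)
{
	ErrorData  *edata = &errordata[errordata_stack_depth];

	CHECK_STACK_DEPTH();
	edata->constraint_name = MemoryContextStrdup(ErrorContext, name);

	return 0;					/* return value does not matter */
}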

Regards

Pavel Stehule
*** ./src/backend/executor/execMain.c.orig	2011-06-05 16:09:09.0 +0200
--- ./src/backend/executor/execMain.c	2011-06-05 19:46:41.469227198 +0200
***
*** 1565,1571 
  ereport(ERROR,
  		(errcode(ERRCODE_NOT_NULL_VIOLATION),
  		 errmsg("null value in column \"%s\" violates not-null constraint",
! 		NameStr(rel->rd_att->attrs[attrChk - 1]->attname))));
  		}
  	}
  
--- 1565,1575 
  ereport(ERROR,
  		(errcode(ERRCODE_NOT_NULL_VIOLATION),
  		 errmsg("null value in column \"%s\" violates not-null constraint",
! 			NameStr(rel->rd_att->attrs[attrChk - 1]->attname)),
! 		 errconstraintname("not_null_constraint"),
! 		 errschemaname(get_namespace_name(RelationGetNamespace(rel))),
! 		 errtablename(RelationGetRelationName(rel)),
! 		 errcolumnname(NameStr(rel->rd_att->attrs[attrChk - 1]->attname))));
  		}
  	}
  
***
*** 1577,1583 
  			ereport(ERROR,
  	(errcode(ERRCODE_CHECK_VIOLATION),
  	 errmsg("new row for relation \"%s\" violates check constraint \"%s\"",
! 			RelationGetRelationName(rel), failed)));
  	}
  }
  
--- 1581,1590 
  			ereport(ERROR,
  	(errcode(ERRCODE_CHECK_VIOLATION),
  	 errmsg("new row for relation \"%s\" violates check constraint \"%s\"",
! 			RelationGetRelationName(rel), failed),
! 	 errconstraintname(failed),
! 	 errschemaname(get_namespace_name(RelationGetNamespace(rel))),
! 	 errtablename(RelationGetRelationName(rel))));
  	}
  }
  
*** ./src/backend/utils/adt/ri_triggers.c.orig	2011-06-05 16:09:10.0 +0200
--- ./src/backend/utils/adt/ri_triggers.c	2011-06-05 19:57:28.489835680 +0200
***
*** 3537,3543 
   errmsg("insert or update on table \"%s\" violates foreign key constraint \"%s\"",
  		RelationGetRelationName(fk_rel), constrname),
   errdetail("No rows were found in \"%s\".",
! 		   RelationGetRelationName(pk_rel))));
  	}
  
  	/* Get printable versions of the keys involved */
--- 3537,3546 
   errmsg("insert or update on table \"%s\" violates foreign key constraint \"%s\"",
  		RelationGetRelationName(fk_rel), constrname),
   errdetail("No rows were found in \"%s\".",
! 		   RelationGetRelationName(pk_rel)),
!  errconstraintname(constrname),
!  errschemaname(get_namespace_name(RelationGetNamespace(fk_rel))),
!  errtablename(RelationGetRelationName(fk_rel))));
  	}
  
  	/* Get printable versions of the keys involved */
***
*** 3570,3576 
  		RelationGetRelationName(fk_rel), constrname),
   errdetail("Key (%s)=(%s) is not present in table \"%s\".",
  		   key_names.data, key_values.data,
! 		   RelationGetRelationName(pk_rel))));
  	else
  		ereport(ERROR,
  (errcode(ERRCODE_FOREIGN_KEY_VIOLATION),
--- 3573,3583 
  		RelationGetRelationName(fk_rel), constrname),
   errdetail("Key (%s)=(%s) is not present in table \"%s\".",
  		   key_names.data, key_values.data,
! 		   RelationGetRelationName(pk_rel)),
!  errconstraintname(constrname),
!  errschemaname(get_namespace_name(RelationGetNamespace(fk_rel))),
!  errtablename(RelationGetRelationName(fk_rel)),
!  errcolumnname(key_names.data)));
  	else
  		ereport(ERROR,
  (errcode(ERRCODE_FOREIGN_KEY_VIOLATION),
***
*** 3579,3585 
  		constrname, RelationGetRelationName(fk_rel)),
  			errdetail("Key (%s)=(%s) is still referenced from table \"%s\".",
  	  key_names.data, key_values.data,
! 	  RelationGetRelationName(fk_rel))));
  }
  
  /* --
--- 3586,3596 
  		constrname, RelationGetRelationName(fk_rel)),
  			errdetail("Key (%s)=(%s) is still referenced from table \"%s\".",
  	  key_names.data, key_values.data,
! 	  RelationGetRelationName(fk_rel)),
!  errconstraintname(constrname),
!  errschem

Re: [HACKERS] SIREAD lock versus ACCESS EXCLUSIVE lock

2011-06-05 Thread Kevin Grittner
> Heikki Linnakangas  wrote:
> On 03.06.2011 23:44, Kevin Grittner wrote:
>> Heikki Linnakangas wrote:
>>
>>> I think you'll need to just memorize the lock deletion command in
>>> a backend-local list, and perform the deletion in a post-commit
>>> function.
>>
>> Hmm. As mentioned earlier in the thread, cleaning these up doesn't
>> actually have any benefit beyond freeing space in the predicate
>> locking collections. I'm not sure that benefit is enough to
>> justify this much new mechanism. Maybe I should just leave them
>> alone and let them get cleaned up in due course with the rest of
>> the locks. Any opinions on that?
>
> Is there a chance of false positives if oid wraparound happens and
> a new table gets the same oid as the old one? It's also possible
> for a heap to get the OID of an old index or vice versa, will that
> confuse things?
 
Some additional thoughts after working on the new code...
 
The risk we're trying to cover is that a table can be dropped while
committed transactions still have predicate locks against it, which
won't be cleared until overlapping transactions which are still
active, and which can form a dangerous structure with the committed
transactions, commit.  If we just leave those predicate locks to be
cleaned up in the normal course of transaction completion, there
could be a new object created with the old OID which, if a
complicated set of conditions are met, could lead to a serialization
failure of a transaction which would otherwise succeed.  We can't
clean the predicate locks up immediately when the DROP TABLE appears
to succeed, because the enclosing transaction (or subtransaction)
could still be rolled back.
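
For context, the deferred cleanup Heikki suggested upthread would
amount to something like this sketch; the function name,
pendingDroppedTargets, and DropAllPredicateLocksForTag are all
hypothetical, not the actual predicate.c API:

static List *pendingDroppedTargets = NIL;	/* of PREDICATELOCKTARGETTAG * */

/*
 * Sketch only: called once the transaction's fate is known.  On
 * commit, drop the predicate locks remembered for dropped tables; on
 * abort, just forget them, since the DROP never took effect.
 */
static void
AtEOXact_DroppedTablePredicateLocks(bool isCommit)
{
	ListCell   *lc;

	if (isCommit)
	{
		foreach(lc, pendingDroppedTargets)
			DropAllPredicateLocksForTag((PREDICATELOCKTARGETTAG *) lfirst(lc));
	}
	list_free_deep(pendingDroppedTargets);
	pendingDroppedTargets = NIL;
}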
 
That last bit seems pretty hard to dodge, because premature predicate
lock cleanup could lead to false negatives -- which would allow data
inconsistencies.  So cleanup before commit is definitely out.  But
some additional time meditating on the dangerous structure diagram
today has me wondering whether we might be OK just ignoring cleanup
prior to what would normally happen as transactions complete.
 
  Tin - - -> Tpivot - - -> Tout
 
As mentioned above, this whole issue is based around concern about
false positives when a transaction has read from a table and
committed before the table is dropped.  We don't need to worry about
that for Tout -- its position in the dangerous structure is just
based around what it *writes*; it can't be responsible for the
read-and-commit before the table is dropped.  To have a problem,
Tpivot can't be the transaction which commits before the DROP TABLE
either -- since it reads what Tout writes and Tout has to be first to
commit, then it cannot read and commit before the DROP TABLE and
still have Tout write afterward.  That means that the transaction
which did the read and commit would have to be Tin, and Tpivot would
need to be the transaction which later wrote to an object with the
same OID as the dropped table.
 
For a problem to manifest itself, this sequence must occur:
 
- The timing of transaction starts doesn't matter (other than the RO
  optimizations) as long as Tpivot overlaps both Tin and Tout.
- Tout commits first
- Any time before Tin commits, it reads from relation X.
- Tin commits second
- Tpivot develops a rw-conflict out to Tout on a relation *other
  than* X, any time after both have started and before Tpivot
  commits.
- Tpivot writes to a relation which has the same OID as relation X,
  but which isn't relation X.  This requires that a new relation be
  created with the same OID which had been used by X, and that this
  new relation is visible to Tpivot -- which acquired its snapshot
  *before X was dropped*.
 
Is this possible?  If a transaction gets its snapshot while OID of N
is assigned to relation X, can that transaction wind up seeing an OID
of N as a reference to relation Y?  If not, there aren't any false
positives possible.
 
Given the quantity and complexity of code which I've needed to
generate to cover this, I'm wondering again whether it is justified
*even if* such a false positive is possible.  It certainly seems like
it would be a *very* infrequent occurrence.  There's also the
question of what shows in pg_locks, but I'm wondering whether even
that is enough to justify this much code.
 
Thoughts?  Maybe I should submit a patch without added complexity of
the scheduled cleanup and we can discuss that as a separate patch?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: psql include file using relative path

2011-06-05 Thread Josh Kupershmidt
On Sun, Jun 5, 2011 at 10:21 AM, Gurjeet Singh  wrote:
> On Sat, May 21, 2011 at 11:59 AM, Josh Kupershmidt 
> wrote:

> Tweaks applied, but omitted the C variable names as I don't think that adds
> much value.

Your rewordings are fine, but the article "the" is missing in a
few spots, e.g.
 * "uses \ir command" -> "uses the \ir command"
 * "to currently processing file" -> "to the currently processing file"
 * "same as \i command" -> "same as the \i command"

I think "processing" is better (and consistent with the rest of the
comments) than "processed" here:
+ * the file from where the currently processed file (if any) is located.

> New version of the patch attached. Thanks for the review.

I think the patch is in pretty good shape now. The memory leak is gone
AFAICT, and the comments and documentation updates look good.
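
For anyone skimming the thread, the behavior under review boils down
to something like this sketch (variable names such as relative_file
and the relative_ref flag follow the snippets quoted in this review
and are assumptions; the actual patch may differ in detail):

	/*
	 * Sketch only: resolve a relative \ir argument against the
	 * directory of the currently executing file, if any.
	 */
	if (relative_ref && pset.inputfile != NULL && !is_absolute_path(filename))
	{
		char	   *last_sep = last_dir_separator(pset.inputfile);

		if (last_sep != NULL)
		{
			size_t		dir_len = last_sep - pset.inputfile + 1;

			relative_file = pg_malloc(dir_len + strlen(filename) + 1);
			memcpy(relative_file, pset.inputfile, dir_len);
			strcpy(relative_file + dir_len, filename);
			filename = relative_file;
		}
	}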

Josh

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Assert failure when rechecking an exclusion constraint

2011-06-05 Thread Tom Lane
Noah Misch  writes:
> On Sat, Jun 04, 2011 at 05:49:31PM -0400, Tom Lane wrote:
>> So in short, I'm thinking move lines 1760-1772 (in HEAD) of index.c to
>> the end of index_build(), then insert a ResetReindexProcessing() call in
>> front of them; or maybe only do ResetReindexProcessing there if we
>> actually do call IndexCheckExclusion.

> Sounds reasonable.  Need to remove the index from pendingReindexedIndexes, not
> just call ResetReindexProcessing().

[ looks again... ]  Uh, right.  I was thinking that the pending list was
just "pending" and not "in progress" indexes.  I wonder if we should
rejigger things so that that's actually true, ie, remove an index's OID
from the pending list when we mark it as the current one?
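
Something like this sketch, perhaps (based on the statics and the
RemoveReindexPending() call visible in the patch upthread; untested):

	static void
	SetReindexProcessing(Oid heapOid, Oid indexOid)
	{
		Assert(OidIsValid(heapOid) && OidIsValid(indexOid));
		/* Reindexing is not re-entrant. */
		if (OidIsValid(currentlyReindexedIndex))
			elog(ERROR, "cannot reindex while reindexing");
		currentlyReindexedHeap = heapOid;
		currentlyReindexedIndex = indexOid;
		/* An index is no longer "pending" once it is "in progress". */
		RemoveReindexPending(indexOid);
	}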

> Also, wouldn't that specific construction
> make the catalog updates fail due to running in the table owner's security
> context?

AFAIR there's no security checks happening below this level.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: psql include file using relative path

2011-06-05 Thread Gurjeet Singh
On Sat, May 21, 2011 at 11:59 AM, Josh Kupershmidt wrote:

> On Fri, May 20, 2011 at 2:35 PM, Gurjeet Singh 
> wrote:
> > On Sat, May 14, 2011 at 5:03 PM, Josh Kupershmidt 
> > wrote:
> > Thanks a lot for the review. My responses are inline below.
>
> Thanks for the fixes. Your updated patch is sent as a
> patch-upon-a-patch, it'll probably be easier for everyone
> (particularly the final committer) if you send inclusive patches
> instead.
>

My bad. I did not intend to do that.


>
> >> == Documentation ==
> >> The patch includes the standard psql help output description for the
> >> new \ir command. I think ./doc/src/sgml/ref/psql-ref.sgml needs to be
> >> patched as well, though.
> >
> > Done.
>
> This is a decent description from a technical standpoint:
>
>
>When used within a script, if the  class="parameter">filename
>uses relative path notation, then the file will be looked up
> relative to currently
>executing file's location.
>
>If the filename
> uses an absolute path
>notation, or if this command is being used in interactive
> mode, then it behaves the
>same as \i command.
>
>
> but I think these paragraphs could be made a little more clear, by
> making a suggestion about why someone would be interested in \ir. How
> about this:
>
>
>The \ir command is similar to \i, but
> is useful for files which
>load in other files.
>
>When used within a file loaded via \i,
> \ir, or
>-f, if the  class="parameter">filename
>is specified with a relative path, then the location of the
> file will be determined
>relative to the currently executing file's location.
>
>
>
>If filename is given as
> an
>absolute path, or if this command is used in interactive mode, then
>\ir behaves the same as the \i command.
>
>
> The sentence "When used within a file ..." is now a little
> clunky/verbose, but I was trying to avoid potential confusion from
> someone trying \ir via 'cat ../filename.sql | psql', which would be
> "used within a script", but \ir wouldn't know that.
>

Although a bit long-winded, I think that sounds much clearer.


>
>
> >> == Code ==
> >> 1.) I have some doubts about whether the memory allocated here:
> >>char *relative_file = pg_malloc(dir_len + 1 + file_len + 1);
> >> is always free()'d, particularly if this condition is hit:
> >>
> >>if (!fd)
> >>{
> >>psql_error("%s: %s\n", filename, strerror(errno));
> >>return EXIT_FAILURE;
> >>}
> >
> > Fixed.
>
> Well, this fix:
>
>if (!fd)
>{
> +   if (relative_path != NULL)
> +   free(relative_path);
> +
>psql_error("%s: %s\n", filename, strerror(errno));
>
> uses the wrong variable name (relative_path instead of relative_file),
> and the subsequent psql_error() call will then reference freed memory,
> since relative_file was assigned to filename.
>
> But even after fixing this snippet to get it to compile, like so:
>
>if (!fd)
>{
>psql_error("%s: %s\n", filename, strerror(errno));
> if (relative_file != NULL)
>free(relative_file);
>
>return EXIT_FAILURE;
>}
>
> I was still seeing memory leaks in valgrind growing with the number of
> \ir calls between files (see valgrind_bad_report.txt attached). I
> think that relative_file needs to be freed even in the non-error case,
> like so:
>
> error:
>if (fd != stdin)
>fclose(fd);
>
>if (relative_file != NULL)
>free(relative_file);
>
>pset.inputfile = oldfilename;
>return result;
>
> At least, that fix seemed to get rid of the ballooning valgrind
> reports for me. I then see a constant-sized < 500 byte leak complaint
> from valgrind, the same as with unpatched psql.
>

Yup, that fixes it. Thanks.



>
> >> 4.) I think the changes to process_file() merit another comment or
> >> two, e.g. describing what relative_file is supposed to be.
>
> > Added.
>
> Some cleanup for this comment:
>
> +   /*
> +* If the currently processing file uses \ir command, and
> the parameter
> +* to the command is a relative file path, then we resolve
> this path
> +* relative to currently processing file.
>
> suggested tweak:
>
>If the currently processing file uses the \ir command, and the filename
>parameter is given as a relative path, then we resolve this path
> relative
>to the currently processing file (pset.inputfile).
>
> +*
> +* If the \ir command was executed in interactive mode
> (i.e. not in a
> +* script) the we treat it the same as \i command.
> +*/
>
> suggested tweak:
>
>If the \ir command was executed in interactive mode (i.e. not in a
>script, and pset.inputfile will be NULL) then we treat the filename
>the same as the \i command does.
>

Tweak

Re: [HACKERS] patch for new feature: Buffer Cache Hibernation

2011-06-05 Thread Mitsuru IWASAKI
Hi,

> On 05/07/2011 03:32 AM, Mitsuru IWASAKI wrote:
> > For 1, I've just finish my work.  The latest patch is available at:
> > http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110507.patch
> >
> 
> Reminder here--we can't accept code based on it being published to a web 
> page.  You'll need to e-mail it to the pgsql-hackers mailing list to be 
> considered for the next PostgreSQL CommitFest, which is starting in a 
> few weeks.  Code submitted to the mailing list is considered a release 
> of it to the project under the PostgreSQL license, which we can't just 
> assume for things when given only a URL to them.

Sorry about that, but I had enough time to revise my patches this weekend.
I attached the patches in this mail, and will update CommitFest page soon.

> Also, you suggested you were out of time to work on this.  If that's the 
> case, we'd like to know that so we don't keep cc'ing you about things in 
> expectation of an answer.  Someone else may pick this up as a project to 
> continue working on.  But it's going to need a fair amount of revision 
> before it matches what people want here, and I'm not sure how much of 
> what you've written is going to end up in any commit that may happen 
> from this idea.

It seems that I don't have enough time to complete this work.
You don't need to keep cc'ing me, and I'm very happy if Postgres becomes
the first DBMS to support a buffer cache hibernation feature.

Thanks!
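
As a usage illustration of the GetControlFile() helper added below:
the idea is that a caller can take a copy of the control data and
compare identifying fields before trusting a saved buffer image.  The
check_hibernation_file() call in this sketch is an invented name for
illustration, not part of the patch:

	ControlFileData ctl;

	/*
	 * Sketch only: make sure a saved hibernation file belongs to this
	 * cluster (same system identifier and block size) before restoring.
	 */
	if (GetControlFile(&ctl))
		check_hibernation_file(ctl.system_identifier, ctl.blcksz);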


diff --git src/backend/access/transam/xlog.c src/backend/access/transam/xlog.c
index b0e4c41..7a3a207 100644
--- src/backend/access/transam/xlog.c
+++ src/backend/access/transam/xlog.c
@@ -4834,6 +4834,19 @@ ReadControlFile(void)
 #endif
 }
 
+bool
+GetControlFile(ControlFileData *controlFile)
+{
+   if (ControlFile == NULL)
+   {
+   return false;
+   }
+
+   memcpy(controlFile, ControlFile, sizeof(ControlFileData));
+
+   return true;
+}
+
 void
 UpdateControlFile(void)
 {
diff --git src/backend/bootstrap/bootstrap.c src/backend/bootstrap/bootstrap.c
index fc093cc..7ecf6bb 100644
--- src/backend/bootstrap/bootstrap.c
+++ src/backend/bootstrap/bootstrap.c
@@ -360,6 +360,15 @@ AuxiliaryProcessMain(int argc, char *argv[])
BaseInit();
 
/*
+* Only StartupProcess can call ResumeBufferCacheHibernation() after
+* InitFileAccess() and smgrinit().
+*/
+   if (auxType == StartupProcess && BufferCacheHibernationLevel > 0)
+   {
+   ResumeBufferCacheHibernation();
+   }
+
+   /*
 * When we are an auxiliary process, we aren't going to do the full
	 * InitPostgres pushups, but there are a couple of things that need to get
 * lit up even in an auxiliary process.
diff --git src/backend/storage/buffer/buf_init.c src/backend/storage/buffer/buf_init.c
index dadb49d..52eb51a 100644
--- src/backend/storage/buffer/buf_init.c
+++ src/backend/storage/buffer/buf_init.c
@@ -127,6 +127,14 @@ InitBufferPool(void)
 
/* Init other shared buffer-management stuff */
StrategyInitialize(!foundDescs);
+
+   if (BufferCacheHibernationLevel > 0)
+   {
+   ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_DESCRIPTORS,
+   (char *)BufferDescriptors, sizeof(BufferDesc), NBuffers);
+   ResisterBufferCacheHibernation(BUFFER_CACHE_HIBERNATION_TYPE_BLOCKS,
+   (char *)BufferBlocks, BLCKSZ, NBuffers);
+   }
 }
 
 /*
diff --git src/backend/storage/buffer/bufmgr.c src/backend/storage/buffer/bufmgr.c
index f96685d..dba8ebf 100644
--- src/backend/storage/buffer/bufmgr.c
+++ src/backend/storage/buffer/bufmgr.c
@@ -31,6 +31,7 @@
 #include "postgres.h"
 
 #include 
+#include 
 #include 
 
 #include "catalog/catalog.h"
@@ -61,6 +62,13 @@
 #define BUF_WRITTEN0x01
 #define BUF_REUSABLE   0x02
 
+/*
+ * Buffer Cache Hibernation stuff.
+ */
+/* enable this to debug buffer cache hibernation. */
+#if 0
+#define DEBUG_BUFFER_CACHE_HIBERNATION
+#endif
 
 /* GUC variables */
 bool   zero_damaged_pages = false;
@@ -765,6 +773,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
 
+#ifdef DEBUG_BUFFER_CACHE_HIBERNATION
+   elog(DEBUG5,
+   "alloc  
[%d]\t%03x,%d,%d,%d,%d\t%08x,%d,%d,%d,%d,%d",
+   buf->buf_id, buf->flags, 
buf->usage_count, buf->refcount,
+   buf->wait_backend_pid, buf->freeNext,
+   newHash, newTag.rnode.spcNode,
+   newTag.rnode.dbNode, 
newTag.rnode.relNode,
+   newTag.forkNum, newTag.blockNum);
+#endif
+
return buf;
}
 
@@ -800,6 +818,16 @@ BufferAlloc(SMgrRelation