Re: [HACKERS] Caching for stable expressions with constant arguments v6

2012-02-04 Thread Marti Raudsepp
On Sat, Feb 4, 2012 at 09:49, Jaime Casanova ja...@2ndquadrant.com wrote:
 i little review...

Thanks! By the way, you should update to the v7 patch.

 first, i notice a change of behaviour... i'm not sure if i can say
 this is good or not.

 if you execute: select *, cached_random() from (select
 generate_series(1, 10) ) i;

Yeah, this is pretty much expected.
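
For reference, a sketch of the behavior change being discussed. (cached_random
here is assumed to be a STABLE wrapper around random() from the patch's test
suite, not a core function.)

```sql
-- Hypothetical definition, for illustration only:
-- CREATE FUNCTION cached_random() RETURNS float8
--   AS 'SELECT random()' LANGUAGE sql STABLE;
SELECT i, cached_random() FROM generate_series(1, 10) AS i;
-- With caching, the stable call is evaluated once per executor scan,
-- so every row shows the same value instead of ten different ones.
```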

 seems you are moving code in simplify_function(), do you think is
 useful to do that independently? at least if it provides some clarity
 to the code

I think so, yeah, but separating it from the patch would be quite a bit of work.

  shared_buffers | 4096

Note that the ts table is 36MB so it doesn't fit in your
shared_buffers now. If you increase shared_buffers, you will see
better gains for tests #1 and #2.

 from what i see there is no too much gain for the amount of complexity
 added... i can see there should be cases which a lot more gain

Yeah, the goal of this test script wasn't to demonstrate the best
cases of the patch, but to display how it behaves in different
scenarios. Here's what they do:

Test case #1 is a typical real world query that benefits from this
patch. With a higher shared_buffers you should see a 6-7x improvement
there, which I think is a compelling speedup for a whole class of
queries.
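
The test script itself isn't quoted here, but as a sketch, the class of query
meant is something like the following (table and column names assumed):

```sql
-- now() is stable, so "now() - interval '1 day'" cannot be constant-folded
-- at plan time; before the patch the subtraction is recomputed for every
-- row, with the patch it is computed once per execution.
SELECT count(*) FROM ts WHERE ts > now() - interval '1 day';
```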

Test #2 doesn't benefit much from the patch since now() is already a
very fast function, but the point is that even saving the function
call overhead helps noticeably.

Tests #3 and #4 show the possible *worst* case of the patch -- it's
processing a complex expression, but there's just one row in the
table, so no opportunity for caching.


Besides stable functions, this patch also improves the performance of
expressions involving placeholder parameters, such as variable
references from PL/pgSQL, since these are not constant-folded. E.g.:

DECLARE
  foo timestamptz = '2012-02-04';
BEGIN
  SELECT * FROM ts WHERE ts < (foo - interval '1 day');

Before this patch, the interval subtraction was executed once per row;
now it's once per execution.

 a benchmark with pgbench scale 20 (average of 3 runs, -T 300, clients
 = 1, except on second run)
 -scale 20

I think the differences here are mostly noise because the queries
pgbench generates aren't affected by caching.

Regards,
Marti

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SKIP LOCKED DATA

2012-02-04 Thread Thomas Munro
On 16 January 2012 21:30, Josh Berkus j...@agliodbs.com wrote:

 Useful, yes.  Harder than it looks, probably.  I tried to mock up a
 version of this years ago for a project where I needed it, and ran into
 all kinds of race conditions.

Can you remember any details about those race conditions?

 Anyway, if it could be made to work, this is extremely useful for any
 application which needs queueing behavior (which is a large plurality, if
 not a majority, of applications).

Ok, based on this feedback I decided to push further and try
implementing this.  See POC/WIP patch attached.  It seems to work
for simple examples but I haven't yet tried to break it or see how it
interacts with more complicated queries or high concurrency levels.
It probably contains at least a few rookie mistakes!  Any feedback
gratefully received.

The approach is described in my original email.  Short version:
heap_lock_tuple now takes an enum called wait_policy instead of a
boolean called nowait, with the following values:

  LockWaitBlock: wait for lock (like nowait = false before),

  LockWaitError: error if not immediately lockable (like nowait = true
 before)

  LockWaitSkip:  give up and return HeapTupleWouldBlock if not
 immediately lockable (this is a new policy)

The rest of the patch is about getting the appropriate value down to
that function call, following the example of the existing nowait
support, and skipping rows if you said SKIP LOCKED DATA and you got
HeapTupleWouldBlock.

Compared to one very popular commercial database's implementation, I
think this is a little bit friendlier for the user who wants to
distribute work.  Let's say you want to lock one row without lock
contention, which this patch allows with FETCH FIRST 1 ROW ONLY FOR
UPDATE SKIP LOCKED DATA in an SQL query.  In that other system,
limiting the number of rows fetched is done in the WHERE clause, so
the N rows are counted *before* checking whether each lock can be
obtained, and users sometimes have to resort to stored procedures so
they can control the FETCH from a cursor imperatively.
In another popular commercial database from Redmond, you can ask for
the top (first) N rows while using the equivalent of SKIP LOCKED DATA
and it has the same effect as this patch as far as I can tell, and
another large blue system is the same.
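
As a sketch of the work-distribution pattern under the proposed syntax
(table and column names are made up for illustration):

```sql
BEGIN;
-- Grab the first pending job whose row lock is free, skipping any rows
-- currently locked by other workers instead of blocking on them:
SELECT id FROM jobs
WHERE status = 'pending'
ORDER BY id
FETCH FIRST 1 ROW ONLY
FOR UPDATE SKIP LOCKED DATA;
-- ... process the job, mark it done ...
COMMIT;
```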

As discussed in another branch of this thread, you can probably get
the same effect with transactional advisory locks.  But I personally
like row skipping better as an explicit feature because:

(1) I think there might be an order-of-evaluation problem with a WHERE
clause containing both lock testing and row filtering expressions (ie
it is undefined right?) which you might need subselects to work around
(ie to be sure to avoid false positive locks, not sure about this).

(2) The advisory technique requires you to introduce an integer
identifier if you didn't already have one and then lock that despite
having already said which rows you want to try to lock in standard row
filtering expressions.

(3) The advisory technique requires all users of the table to
participate in the advisory lock protocol even if they don't want to
use the option to skip lock.

(4) It complements NOWAIT (which could also have been done with
transactional advisory locks, with a function like try_lock_or_fail).

(5) I like the idea of directly porting applications from other
databases with this feature (that is what led me here).
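
For comparison, the advisory-lock workaround being argued against might look
like this (a sketch; it assumes a bigint id column, and that *all* consumers
of the table follow the same lock protocol, per point (3)):

```sql
-- Per point (1): nothing guarantees the status filter is evaluated
-- before the lock-taking function, so extra locks may be acquired.
SELECT id FROM jobs
WHERE status = 'pending'
  AND pg_try_advisory_xact_lock(id)  -- lock held until transaction end
ORDER BY id
LIMIT 1;
```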

I can also imagine some other arguments against SKIP LOCKED DATA: the
type of applications that would use this technique generate too much
dead tuple churn for PostgreSQL anyway (for example, compared to the
update-tuples-in-place systems like DB2 z/OS edition where SKIP LOCKED
DATA is used to distribute work efficiently), and databases shouldn't
be used as queues anyway, for that we have $X, and skipping rows
provides an inconsistent view of the data (ie a result set that never
strictly existed).

Here are some examples of previous requests or discussion of this
feature:

http://archives.postgresql.org/pgsql-general/2008-07/msg00442.php
http://archives.postgresql.org/pgsql-bugs/2003-12/msg00154.php
http://archives.postgresql.org/pgsql-general/2002-07/msg00744.php
http://blog.hydrobiont.com/2011/06/select-for-update-skip-locked-in.html

Thanks for reading!

Thomas
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 01c0104..58adcae 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -45,7 +45,7 @@ SELECT [ ALL | DISTINCT [ ON ( <replaceable class="parameter">expression</replaceable>
 [ LIMIT { <replaceable class="parameter">count</replaceable> | ALL } ]
 [ OFFSET <replaceable class="parameter">start</replaceable> [ ROW | ROWS ] ]
 [ FETCH { FIRST | NEXT } [ <replaceable class="parameter">count</replaceable> ] { ROW | ROWS } ONLY ]
-[ FOR { UPDATE | SHARE } [ OF <replaceable class="parameter">table_name</replaceable> [, ...] ] [ NOWAIT ] 

Re: [HACKERS] Patch: Allow SQL-language functions to reference parameters by parameter name

2012-02-04 Thread Hitoshi Harada
On Thu, Feb 2, 2012 at 3:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 [ working on this patch now ... ]

 Matthew Draper matt...@trebex.net writes:
 On 25/01/12 18:37, Hitoshi Harada wrote:
 Should we throw an error in such ambiguity? Or did you make it happen
 intentionally? If latter, we should also mention the rule in the
 manual.

 I did consider it, and felt it was the most consistent:

 I believe the issue here is that a two-part name A.B has two possible
 interpretations (once we have eliminated table references supplied by
 the current SQL command inside the function): per the comment,

     * A.B        A = record-typed parameter name, B = field name
     *            OR: A = function name, B = parameter name

 If both interpretations are feasible, what should we do?  The patch
 tries them in the above order, but I think the other order would be
 better.  My argument is this: the current behavior doesn't provide any
 out other than changing the function or parameter name.  Now
 presumably, if someone is silly enough to use a parameter name the same
 as the function's name, he's got a good reason to do so and would not
 like to be forced to change it.  If we prefer the function.parameter
 interpretation, then he still has a way to get to a field name: he just
 has to use a three-part name function.parameter.field.  If we prefer the
 parameter.field interpretation, but he needs to refer to a scalar
 parameter, the only way to do it is to use an unqualified reference,
 which might be infeasible (eg, if it also matches a column name exposed
 in the SQL command).

 Another problem with the current implementation is that if A matches a
 parameter name, but ParseFuncOrColumn fails (ie, the parameter is not of
 composite type or doesn't contain a field named B), then the hook just
 fails to resolve anything; it doesn't fall back to consider the
 function-name-first alternative.  So that's another usability black
 mark.  We could probably complicate the code enough so it did consider
 the function.parameter case at that point, but I don't think that there
 is a strong enough argument for this precedence order to justify such
 complication.

 In short, I propose swapping the order in which these cases are tried.

+1 from me.

Thanks,
-- 
Hitoshi Harada



Re: [HACKERS] Finer Extension dependencies

2012-02-04 Thread Hitoshi Harada
On Mon, Jan 23, 2012 at 3:06 AM, Hitoshi Harada umi.tan...@gmail.com wrote:
 On Mon, Jan 23, 2012 at 2:00 AM, Dimitri Fontaine
 dimi...@2ndquadrant.fr wrote:
 Hitoshi Harada umi.tan...@gmail.com writes:
 - What happens if DROP EXTENSION ... CASCADE? Does it work?

 It should, what happens when you try? :)

 I just tried DROP EXTENSION now, and found it broken :(

 db1=# create extension kmeans;
 CREATE EXTENSION
 db1=# drop extension kmeans;
 ERROR:  cannot drop extension kmeans because extension feature kmeans
 requires it
 HINT:  You can drop extension feature kmeans instead.

 Can you provide me the test case you've been using?  That looks like a
 bug I need to fix, indeed (unless the problem lies in the test case,
 which would mean I need to tighten things some more).

 The test case is just above; createdb db1 and create and drop an
 extension. The kmean extension is on pgxn. I tried my small test
 extension named ext1 which contains only one plpgsql function, and
 created it then dropped it, reproduced.


Ping. In case you don't have updates soon, I'll mark Returned with Feedback.

Thanks,
-- 
Hitoshi Harada



Re: [HACKERS] Hot standby fails if any backend crashes

2012-02-04 Thread Simon Riggs
On Fri, Feb 3, 2012 at 4:48 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 I think saner behavior might only require this change:

            /*
             * Any unexpected exit (including FATAL exit) of the startup
             * process is treated as a crash, except that we don't want to
             * reinitialize.
             */
            if (!EXIT_STATUS_0(exitstatus))
            {
 -               RecoveryError = true;
 +               if (!FatalError)
 +                   RecoveryError = true;
                HandleChildCrash(pid, exitstatus,
                                 _("startup process"));
                continue;
            }

 plus suitable comment adjustments of course.  Haven't tested this yet
 though.

Looks good, will test.

 It's a bit disturbing that nobody has reported this from the field yet.
 Seems to imply that hot standby isn't being used much.

There are many people I know using it in production for more than a year now.

Either they haven't seen it or they haven't reported it to us.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] basic pgbench runs with various performance-related patches

2012-02-04 Thread Greg Smith

On 01/24/2012 08:58 AM, Robert Haas wrote:

One somewhat odd thing about these numbers is that, on permanent
tables, all of the patches seemed to show regressions vs. master in
single-client throughput.  That's a slightly difficult result to
believe, though, so it's probably a testing artifact of some kind.


It looks like you may have run the ones against master first, then the 
ones applying various patches.  The one test artifact I have to be very 
careful to avoid in that situation is that later files on the physical 
disk are slower than earlier ones.  There's a 30% difference between 
the fastest part of a regular hard drive, the logical beginning, and its 
end.  Multiple test runs tend to creep forward onto later sections of 
disk, and be biased toward the earlier run in that case.  To eliminate 
that bias when it gets bad, I normally either a) run each test 3 times, 
interleaved, or b) rebuild the filesystem in between each initdb.


I'm not sure that's the problem you're running into, but it's the only 
one I've been hit by that matches the suspicious part of your results.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label

2012-02-04 Thread Kevin Grittner
Tom Lane  wrote:
 
 More to the point, a GUC rollback transition *has to always
 succeed*.  Period.
 
I was about to point out the exception of the transaction_read_only
GUC, which according to the standard must not be changed except at
the beginning of a transaction or a subtransaction, and must not be
changed from on to off in a subtransaction.  Then I noticed that,
while we protect against an explicit change in a prohibited way, we
allow a RESET:
 
test=# begin transaction read only;
BEGIN
test=# select * from x;
 x 
---
 1
(1 row)

test=# set transaction_read_only = off;
ERROR:  transaction read-write mode must be set before any query
test=# rollback;
ROLLBACK
test=# begin transaction read only;
BEGIN
test=# select * from x;
 x 
---
 1
(1 row)

test=# reset transaction_read_only ;
RESET
test=# insert into x VALUES (2);
INSERT 0 1
test=# commit;
COMMIT
 
I think that's a problem.  It could allow back-door violations of
invariants enforced by triggers, and seems to violate the SQL
standard.  I think this should be considered a bug, although I'm not
sure whether it's safe to back-patch, given the change to existing
behavior.
 
Whether such a (required) exception to what you assert above should
open the door to any others is another question.
 
-Kevin



Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label

2012-02-04 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Tom Lane  wrote:
 More to the point, a GUC rollback transition *has to always
 succeed*.  Period.

 [ counterexample showing we should sometimes disallow RESET ]

This actually isn't what I was talking about: a RESET statement is a
commanded adoption of a new value that happens to be gotten off the
stack, and there's not a lot of logical difficulty in letting it fail.
What I was concerned about was the case where GUC is trying to
re-establish an old value during transaction rollback.  For example,
assume we are superuser to start with, and we do

begin;
set role unprivileged_user;
...
rollback;

The rollback needs to transition the role setting back to superuser.
A check based purely on allowed transitions would disallow that, since
it's not visibly different from the unprivileged_user trying to make
himself superuser.  You have to take the context of the state change
into account.

And yeah, I agree it's a bug that you can do what Kevin's example shows.

regards, tom lane



Re: [HACKERS] Memory usage during sorting

2012-02-04 Thread Jeff Janes
On Sat, Jan 21, 2012 at 4:51 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 On 16 January 2012 00:59, Jeff Janes jeff.ja...@gmail.com wrote:
 I think it would be better to pre-deduct the tape overhead amount we
 will need if we decide to switch to tape sort from the availMem before
 we even start reading (and then add it back if we do indeed make that
 switch).  That way we wouldn't over-run the memory in the first place.
  However, that would cause apparent regressions in which sorts that
 previously fit into maintenance_work_mem no longer do.  Boosting
 maintenance_work_mem to a level that was actually being used
 previously would fix those regressions, but pointing out that the
 previous behavior was not optimal doesn't change the fact that people
 are used to it and perhaps tuned to it.  So the attached patch seems
 more backwards-friendly.

 Hmm. Are people really setting maintenance_work_mem such that it is
 exactly large enough to quicksort when building an index in one case
 but not another?

It could also apply to work_mem in other situations.  I'm just using
maintenance_work_mem in my example because index creation is the thing
that led me into this little diversion and so that is what my
test-bed is set up to use.

Some people do have very stable ritualized work-loads and so could
have tuned exactly to them.  I don't know how common that would be.
More common might be synthetic benchmarks which had been tuned, either
intentionally or accidentally, to just barely fit in the (overshot)
memory, so it would be a perception problem that there was a
regression when in fact the tuning merely became out of date.

 Is the difference large enough to warrant avoiding
 pre-deduction?

Switching to an external sort when you could have done it all by quick
sort (by cheating just a little bit on memory usage) makes a pretty
big difference, over 2 fold in my tests.  If the fast-path
optimization for qsort goes in, the difference might get even bigger.
Of course, there will always be some point at which that switch over
must occur, so the real question is do we need to keep that switch
point historically consistent to avoid nasty surprises for people with
excessively fine-tuned *work_mem settings based on old behavior?  And
do such people even exist outside of my imagination?

But a bigger question I now have is if any of this matters.  Since
there is currently no way to know how many connections might be
running how many concurrent sorts, there is really no way to tune
work_mem with much precision.  So, does it matter if we slightly lie
about how much we will actually use?  And is it worth worrying about
whether every last drop of memory is used efficiently?

I was trying to compare the performance I was getting with a
theoretical model of what I should get, and it was confusing to me
that we were using an originally defined size of memory for the
priority heap, and then later reducing that amount of memory.  That is
annoying because it complicates the theoretical model, and even more
annoying when I realized that the freed memory that results from
doing this is too fragmented to even be used for other purposes.  But
theoretical annoyances aside, it is hardly the worst thing about
memory usage in sorting.  It is just one that is standing in the way
of my further study.

So I don't think this patch I proposed is particularly important in
its own right.  I want to see if anyone will point out that I am all
wet because my analysis failed to take foo into account.   I
probably should have explained this when I submitted the patch.


The worst thing about the current memory usage is probably that big
machines can't qsort more than 16,777,216 tuples no matter how much
memory they have, because memtuples has to live entirely within a
single allocation, which is currently limited to 1GB.  It is probably
too late to fix this problem for 9.2. I wish I had started there
rather than here.

This 16,777,216 tuple limitation will get even more unfortunate if the
specializations that speed up qsort but not external sort get
accepted.

Attached is a completely uncommittable proof of concept/work in
progress patch to get around the limitation.  It shows a 2 fold
improvement when indexing an integer column on a 50,000,000 row
randomly ordered table.
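
For concreteness, the kind of benchmark described might look like this
(table and column names assumed; the memory setting matches the follow-up
message in this thread):

```sql
SET maintenance_work_mem = '4GB';  -- large enough to qsort with the patch
CREATE TABLE t AS
  SELECT (random() * 1000000000)::int AS c
  FROM generate_series(1, 50000000);
CREATE INDEX ON t (c);  -- the step reported as ~2x faster with the POC patch
```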

As for what to do to make it committable, it seems like maybe there
should be two different MaxAllocSizes.  One is determined by the fact
that aset.c assumes it can compute twice an allocation's size without
overflow.  I would think that simply testing size_t at compile time would be
enough.  Allowing more than 1 GB allocations would probably be
pointless on 32 bit machines, and 64 bit should be able to compute
twice of a much larger value than 1GB without overflow.

For the varlena stuff, I don't know what to do.  Is the "I swear to
God I won't pass this pointer to one of those functions" promise sufficient?
Or does each function need to test it?

And I'm pretty sure that palloc, not just repalloc, 

Re: [HACKERS] Memory usage during sorting

2012-02-04 Thread Jeff Janes
On Sat, Feb 4, 2012 at 10:06 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 Attached is a completely uncommitable proof of concept/work in
 progress patch to get around the limitation.  It shows a 2 fold
 improvement when indexing an integer column on a 50,000,000 row
 randomly ordered table.

Oops, forgot to say that is with
set maintenance_work_mem=4194304;



Re: [HACKERS] [BUGS] BUG #6425: Bus error in slot_deform_tuple

2012-02-04 Thread Simon Riggs
On Fri, Feb 3, 2012 at 6:45 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 The reason it is relevant
 to our current problem is that even though RestoreBkpBlocks faithfully
 takes exclusive lock on the buffer, *that is not enough to guarantee
 that no one else is touching that buffer*.  Another backend that has
 already located a visible tuple on a page is entitled to keep accessing
 that tuple with only a buffer pin.  So the existing code transiently
 wipes the data from underneath the other backend's pin.

While deciding whether to apply the patch, I'm thinking about whether
we should be doing this at all. We already agreed that backup blocks
were removable from the WAL stream.

The cause here is data changing underneath the user. Your patch solves
the most obvious error, but it still allows other problems if applying
the backup block changes data. If the backup block doesn't do anything
at all then we don't need to apply it either.

So ISTM that we should just skip applying backup blocks over the top
of existing buffers once we have reached consistency.

Patch to do that attached, but the basic part of it is just this...

@@ -3700,8 +3701,21 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
memcpy(&bkpb, blk, sizeof(BkpBlock));
blk += sizeof(BkpBlock);

+   hit = false;
    buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,
-                                   RBM_ZERO);
+                                   RBM_ZERO, &hit);
+
+   /*
+    * If we found the block in shared buffers and we are already
+    * consistent then skip applying the backup block. The block
+    * was already removable anyway, so we can skip without problems.
+    * This avoids us needing to take a cleanup lock in all cases when
+    * we apply backup blocks because of potential effects on user queries,
+    * which expect data on blocks to remain constant while being read.
+    */
+   if (reachedConsistency && hit)
+       continue;
+
    Assert(BufferIsValid(buffer));
    if (cleanup)
        LockBufferForCleanup(buffer);


-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [BUGS] BUG #6425: Bus error in slot_deform_tuple

2012-02-04 Thread Simon Riggs
On Sat, Feb 4, 2012 at 6:37 PM, Simon Riggs si...@2ndquadrant.com wrote:

 Patch to do that attached


-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 99a431a..4758931 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -4590,6 +4590,7 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
 	int			ndead;
 	int			nunused;
 	Size		freespace;
+	bool		hit;
 
 	/*
 	 * We're about to remove tuples. In Hot Standby mode, ensure that there's
@@ -4608,7 +4609,7 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
 	if (record->xl_info & XLR_BKP_BLOCK_1)
 		return;
 
-	buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
+	buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL, &hit);
 	if (!BufferIsValid(buffer))
 		return;
 	LockBufferForCleanup(buffer);
@@ -4664,6 +4665,7 @@ heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
 	TransactionId cutoff_xid = xlrec-cutoff_xid;
 	Buffer		buffer;
 	Page		page;
+	bool		hit;
 
 	/*
 	 * In Hot Standby mode, ensure that there's no queries running which still
@@ -4677,7 +4679,7 @@ heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
 	if (record->xl_info & XLR_BKP_BLOCK_1)
 		return;
 
-	buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
+	buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL, &hit);
 	if (!BufferIsValid(buffer))
 		return;
 	LockBufferForCleanup(buffer);
@@ -4728,6 +4730,7 @@ heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record)
 	xl_heap_visible *xlrec = (xl_heap_visible *) XLogRecGetData(record);
 	Buffer		buffer;
 	Page		page;
+	bool		hit;
 
 	/*
 	 * Read the heap page, if it still exists.  If the heap file has been
@@ -4736,7 +4739,7 @@ heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record)
 	 * will have to be cleared out at the same time.
 	 */
 	buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block,
-	RBM_NORMAL);
+	RBM_NORMAL, &hit);
 	if (!BufferIsValid(buffer))
 		return;
 	page = (Page) BufferGetPage(buffer);
@@ -4806,13 +4809,14 @@ heap_xlog_newpage(XLogRecPtr lsn, XLogRecord *record)
 	xl_heap_newpage *xlrec = (xl_heap_newpage *) XLogRecGetData(record);
 	Buffer		buffer;
 	Page		page;
+	bool		hit;
 
 	/*
 	 * Note: the NEWPAGE log record is used for both heaps and indexes, so do
 	 * not do anything that assumes we are touching a heap.
 	 */
 	buffer = XLogReadBufferExtended(xlrec->node, xlrec->forknum, xlrec->blkno,
-	RBM_ZERO);
+	RBM_ZERO, &hit);
 	Assert(BufferIsValid(buffer));
 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 0f5c113..d10b0b8 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -466,6 +466,7 @@ btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
+	bool		hit;
 
 	xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
 
@@ -491,7 +492,7 @@ btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
 			 * Another simple optimization would be to check if there's any
 			 * backends running; if not, we could just skip this.
 			 */
-			buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL);
+			buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL, &hit);
 			if (BufferIsValid(buffer))
 			{
 LockBufferForCleanup(buffer);
@@ -513,7 +514,7 @@ btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
 	 * Like in btvacuumpage(), we need to take a cleanup lock on every leaf
 	 * page. See nbtree/README for details.
 	 */
-	buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
+	buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL, &hit);
 	if (!BufferIsValid(buffer))
 		return;
 	LockBufferForCleanup(buffer);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cce87a3..3f4842d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3687,6 +3687,7 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
 	BkpBlock	bkpb;
 	char	   *blk;
 	int			i;
+	bool		hit;
 
 	if (!(record->xl_info & XLR_BKP_BLOCK_MASK))
 		return;
@@ -3700,8 +3701,21 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
 		memcpy(&bkpb, blk, sizeof(BkpBlock));
 		blk += sizeof(BkpBlock);
 
+		hit = false;
 		buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,
-		RBM_ZERO);
+		RBM_ZERO, &hit);
+
+		/*
+		 * If we found the block in shared buffers and we are already
+		 * consistent then skip applying the backup block. The block
+		 * was already removable anyway, so we can 

Re: [HACKERS] [BUGS] BUG #6425: Bus error in slot_deform_tuple

2012-02-04 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 The cause here is data changing underneath the user. Your patch solves
 the most obvious error, but it still allows other problems if applying
 the backup block changes data. If the backup block doesn't do anything
 at all then we don't need to apply it either.

This is nonsense.  What applying the backup block does is to apply the
change that the WAL record would otherwise have applied, except we
decided to make it store a full-page image instead.

regards, tom lane



Re: [HACKERS] Finer Extension dependencies

2012-02-04 Thread Dimitri Fontaine
Hitoshi Harada umi.tan...@gmail.com writes:
 Ping. In case you don't have updates soon, I'll mark Returned with Feedback.

Pong.  Sorry about my recent silence. 

I've not been in a position to work on this recently, and am done with
those other duties now. I intend to be posting an updated patch soon
now.  Please wait some more before punting this patch out of the current
commit fest.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-02-04 Thread Jeff Janes
On Wed, Feb 1, 2012 at 9:47 AM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Jan 25, 2012 at 8:49 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jan 24, 2012 at 4:28 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I think we should be working to commit XLogInsert and then Group
 Commit, then come back to the discussion.

 I definitely agree that those two have way more promise than anything
 else on the table.  However, now that I understand how badly we're
 getting screwed by checkpoints, I'm curious to do some more
 investigation of what's going on there.  I can't help thinking that in
 these particular cases the full page writes may be a bigger issue than
 the checkpoint itself.  If that turns out to be the case it's not
 likely to be something we're able to address for 9.2, but I'd like to
 at least characterize it.

 I did another run with full_page_writes=off:

 http://wiki.postgresql.org/wiki/File:Tps-master-nofpw.png

 This is run with the master branch as of commit
 4f42b546fd87a80be30c53a0f2c897acb826ad52, same as the previous tests,
 so that the results are comparable.

 The graph is pretty dramatically different from the earlier one:

 http://wiki.postgresql.org/wiki/File:Tps-master.png

 So are the TPS numbers:

 full page writes on: tps = 2200.848350 (including connections establishing)
 full page writes off: tps = 9494.757916 (including connections establishing)

 It seems pretty clear that, even with full_page_writes=off, the
 checkpoint is affecting things negatively: the first 700 seconds or so
 show much better and more consistent performance than the remaining
 portion of the test.  I'm not sure why that is, but I'm guessing there
 was a checkpoint around that time.

We really need to nail that down.  Could you post the scripts (on the
wiki) you use for running the benchmark and making the graph?  I'd
like to see how much work it would be for me to change it to detect
checkpoints and do something like color the markers blue during
checkpoints and red elsewhen.

Also, I'm not sure how bad that graph really is.  The overall
throughput is more variable, and there are a few latency spikes but
they are few.  The dominant feature is simply that the long-term
average is less than the initial burst.  Of course the goal is to have
a high level of throughput with a smooth latency under sustained
conditions.  But to expect that that long-sustained smooth level of
throughput be identical to the initial burst throughput sounds like
more of a fantasy than a goal.  If we want to accept the lowered
throughput and work on whatever variability/spikes remain, I think
a good approach would be to take the long term TPS average, and dial
the number of clients back until the initial burst TPS matches that
long term average.  Then see if the spikes still exist over the long
term using that dialed back number of clients.



 But the effect is much worse with
 full_page_writes=on: the distinctive parabolic shape of those graphs
 is apparently caused by the gradually decreasing frequency of full
 page writes as the number of transactions processed since the last
 checkpoint grows.

The dilution out of not-yet-full-page-written buffers as the
checkpoint ages is probably the ultimate cause of that pattern, but I
suspect there are intermediate causes that we could tackle.

I don't think the full-page-writes are leading to WALInsert
contention, for example, because that would probably lead to smooth
throughput decline, but not those latency spikes in which those entire
seconds go by without transactions.  I doubt it is leading to general
I/O contention, as the I/O at that point should be pretty much
sequential (the checkpoint has not yet reached the sync stage, and the
WAL is sequential).  So I bet it is caused by fsyncs occurring at xlog
segment switches, and the locking that entails.  If I recall
correctly, we can have a segment which is completely written to the OS
and in the process of being fsynced, and another segment which is
partly in wal_buffers and partly written out to the OS cache; but we
can't start reusing the wal_buffers already written to the OS for that
segment (and therefore theoretically available for reuse by the
upcoming third segment) until the previous segment's fsync has
completed.  So all WALInserts freeze.  Or something like that.  This
code has changed a bit since I last studied it.

Cheers,

Jeff



Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label

2012-02-04 Thread Kevin Grittner
Tom Lane  wrote:
 
 What I was concerned about was the case where GUC is trying to
 re-establish an old value during transaction rollback. For example,
 assume we are superuser to start with, and we do
 
 begin;
 set role unprivileged_user;
 ...
 rollback;
 
 The rollback needs to transition the role setting back to
 superuser.
 
Sorry for missing the point.  Yeah, I totally agree with that.
 
 And yeah, I agree it's a bug that you can do what Kevin's example
 shows.
 
I'll look at it and see if I can pull together a patch.
 
-Kevin



Re: [HACKERS] initdb and fsync

2012-02-04 Thread Florian Weimer
* Tom Lane:

 I wonder whether it wouldn't be sufficient to call sync(2) at the end,
 anyway, rather than cluttering the entire initdb codebase with fsync
 calls.

We tried to do this in the Debian package manager.  It works as
expected on Linux systems, but it can cause a lot of data to hit the
disk, and there are kernel versions where sync(2) never completes if
the system is rather busy.

initdb is much faster with 9.1 than with 8.4.  It's so fast that you
can use it in test suites, instead of reusing an existing cluster.
I think this is a rather desirable property.



Re: [HACKERS] initdb and fsync

2012-02-04 Thread Jeff Davis
On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote:
 Yeah.  Personally I would be sad if initdb got noticeably slower, and
 I've never seen or heard of a failure that this would fix.

I worked up a patch, and it looks like it does about 6 file fsyncs and
a 7th for the PGDATA directory. That degrades the time from about 1.1s
to 1.4s on my workstation.

pg_test_fsync says this about my workstation (one 8kB write):
open_datasync         117.495 ops/sec
fdatasync             117.949 ops/sec
fsync                  25.530 ops/sec
fsync_writethrough        n/a
open_sync              24.666 ops/sec

25 ops/sec means about 40ms per fsync, times 7 is about 280ms, so that
seems like about the right degradation for fsync. I tried with fdatasync
as well to see if it improved things, and I wasn't able to realize any
difference (not sure exactly why).

So, is it worth it? Should we make it an option that can be specified?

 I wonder whether it wouldn't be sufficient to call sync(2) at the end,
 anyway, rather than cluttering the entire initdb codebase with fsync
 calls.

It looks like there are only a few places, so I don't think clutter is
really the problem with the simple patch at this point (unless there is
a portability problem with just calling fsync).

Regards,
Jeff Davis


initdb-fsync.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Patch: Allow SQL-language functions to reference parameters by parameter name

2012-02-04 Thread Tom Lane
Matthew Draper matt...@trebex.net writes:
 [ sql-named-param-refs-v2.patch ]

Applied with some editorialization: I switched the behavior for two-part
names as discussed, and did some other mostly-cosmetic code cleanup,
and did some work on the documentation.

 I'm still not sure whether to just revise (almost) all the SQL function
 examples to use parameter names, and declare them the right choice; as
 it's currently written, named parameters still seem rather second-class.

They're less second-class in the docs as committed, but I left a lot of
examples still using $n for parameters.  I'm not sure how far to go in
that direction.  We should not be too eager to scrub the docs of $n,
because if nothing else people will need to understand the notation when
they see it for a long time to come.  But feel free to submit a
follow-up docs patch if you feel more is warranted.

regards, tom lane



Re: [HACKERS] initdb and fsync

2012-02-04 Thread Noah Misch
On Sat, Feb 04, 2012 at 03:41:27PM -0800, Jeff Davis wrote:
 On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote:
  Yeah.  Personally I would be sad if initdb got noticeably slower, and
  I've never seen or heard of a failure that this would fix.
 
 I worked up a patch, and it looks like it does about 6 file fsync's and
 a 7th for the PGDATA directory. That degrades the time from about 1.1s
 to 1.4s on my workstation.

 So, is it worth it? Should we make it an option that can be specified?

If we add fsync calls to the initdb process, they should cover the entire data
directory tree.  This patch syncs files that initdb.c writes, but we ought to
also sync files that bootstrap-mode backends had written.  An optimization
like the pg_flush_data() call in copy_file() may reduce the speed penalty.

initdb should do these syncs by default and offer an option to disable them.
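A whole-tree sync along these lines might look like the following sketch
(Python, with error handling omitted; the fsync_tree helper is an
assumption, and the real initdb would also need to handle tablespace
symlinks and platforms where a directory cannot be opened for fsync):

```python
import os

def fsync_tree(root: str) -> None:
    """Recursively fsync every file and every directory under root,
    covering files written by bootstrap-mode backends as well as those
    written by initdb.c itself."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            fd = os.open(os.path.join(dirpath, name), os.O_RDONLY)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
        # Sync the directory too, so the entries themselves are durable.
        fd = os.open(dirpath, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
```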



Re: [HACKERS] Review of: explain / allow collecting row counts without timing info

2012-02-04 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
 I suspect we will be unwilling to make such a break with the past.  In
 that case, I think I prefer the originally proposed semantics, even
 though I agree they are somewhat less natural.  ANALYZE is a big flag
 that means "this query will be executed, not just planned".  If we are
 not going to make a major break, but only nibble around the edges,
 then I don't think we should remove the property that the query will
 be executed if and only if ANALYZE is specified.

Yeah, I think we need to preserve that property.  Unexpectedly executing
a query (which may have side-effects) is a very dangerous thing.  People
are used to the idea that ANALYZE == execute, and adding random other
flags that also cause execution is going to burn somebody.

regards, tom lane



Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-04 Thread Bruce Momjian
On Sun, Dec 25, 2011 at 04:25:19PM +0100, Martijn van Oosterhout wrote:
 On Sat, Dec 24, 2011 at 04:01:02PM +, Simon Riggs wrote:
  On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund and...@anarazel.de wrote:
   Why don't you use the same tricks as the former patch and copy the buffer,
   compute the checksum on that, and then write out that copy (you can even 
   do
   both at the same time). I have a hard time believing that the additional 
   copy
   is more expensive than the locking.
  
  ISTM we can't write and copy at the same time because the checksum is
  not a trailer field.
 
 Of course you can.  If the checksum is in the trailer field you get the
 nice property that the whole block has a constant checksum.  However, if
 you store the checksum elsewhere, you just need to change the checking
 algorithm to copy the checksum out, zero those bytes, run the
 checksum, and compare with the extracted checksum.
 
 Not pretty, but I don't think it makes a difference in performance.

Sorry to be late replying to this, but an interesting idea would be to
zero the _hint_ bits before doing the CRC checksum.  That would avoid
the problem of WAL-logging the hint bits.
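The check-with-zeroed-field approach, combined with masking hint bits
before the checksum, can be sketched like so.  This is a toy Python
illustration only: the offsets are made up, CRC-32 stands in for
whatever 16-bit checksum the patch actually uses, and none of the names
below come from the patch:

```python
import zlib

CSUM_OFF, CSUM_LEN = 8, 4      # assumed checksum location in the page
HINT_OFF, HINT_LEN = 16, 2     # assumed hint-bit bytes to mask out

def page_checksum(page: bytes) -> int:
    # Zero the checksum field (so the stored value doesn't feed back
    # into its own computation) and the hint bits (so setting a hint
    # bit doesn't invalidate the checksum, avoiding WAL-logging them).
    buf = bytearray(page)
    buf[CSUM_OFF:CSUM_OFF + CSUM_LEN] = b"\x00" * CSUM_LEN
    buf[HINT_OFF:HINT_OFF + HINT_LEN] = b"\x00" * HINT_LEN
    return zlib.crc32(bytes(buf)) & 0xFFFFFFFF

def set_checksum(page: bytearray) -> None:
    csum = page_checksum(bytes(page))
    page[CSUM_OFF:CSUM_OFF + CSUM_LEN] = csum.to_bytes(CSUM_LEN, "little")

def verify_checksum(page: bytes) -> bool:
    stored = int.from_bytes(page[CSUM_OFF:CSUM_OFF + CSUM_LEN], "little")
    return stored == page_checksum(page)
```

With this scheme a hint-bit flip leaves verification unaffected, while
corruption anywhere else on the page is still detected.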

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-04 Thread Bruce Momjian
On Sat, Dec 24, 2011 at 03:56:58PM +, Simon Riggs wrote:
  Also, as far as I can see this patch usurps the page version field,
  which I find unacceptably short-sighted.  Do you really think this is
  the last page layout change we'll ever make?
 
 No, I don't. I hope and expect the next page layout change to
 reintroduce such a field.
 
 But since we're agreed now that upgrading is important, changing page
 format isn't likely to be happening until we get an online upgrade
 process. So future changes are much less likely. If they do happen, we
 have some flag bits spare that can be used to indicate later versions.
 It's not the prettiest thing in the world, but it's a small ugliness
 in return for an important feature. If there was a way without that, I
 would have chosen it.

Have you considered that the CRC might match a valid page version
number?  Is that safe?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Review of: explain / allow collecting row counts without timing info

2012-02-04 Thread Kevin Grittner
Tom Lane  wrote:
 
 Yeah, I think we need to preserve that property. Unexpectedly
 executing a query (which may have side-effects) is a very dangerous
 thing.  People are used to the idea that ANALYZE == execute, and
 adding random other flags that also cause execution is going to
 burn somebody.
 
+1
 
FWIW, another reason not to use Robert's suggested syntax is that you
get rows=n entries with or without the actual run.  You just don't
get the actual block to compare to the estimate.  So ROWS as an
option would be very ambiguous.
 
-Kevin
