Re: [HACKERS] Online base backup from the hot-standby

2011-06-23 Thread Jun Ishiduka

 1) Today you can do a backup by just calling pg_start_backup('x'); copy
 the data directory and
 pg_stop_backup(); You do not need to use pg_basebackup to create a
 backup. The solution you are proposing would require pg_basebackup to be
 used to build backups from standby servers.

YES.


 2) If I run pg_basebackup but do not specify '-x' then no pg_xlog
 segments are included in the output. The relevant pg_xlog segments are
 in the archive from the master. I can see situations where you are
 already copying the archive to the remote site that the new standby will
 be created in so you don't want to have to copy the pg_xlog segments
 twice over your network.

No, that does not matter, because the behavior is the same as in 9.1.
Please see the "Before" part of the answer below.


 What Heikki is proposing will work both when you aren't using
 pg_basebackup (as long the output of pg_stop_backup() is somehow
 captured in a way that it can be read) and will also work with
 pg_basebackup when '-x' isn't specified.

Receiving this mail made me notice that I had misunderstood what 
Heikki is proposing. 

My understanding:
  Before:
* I thought Heikki was proposing to execute the SQL 
  (pg_start_backup('x'); copy the data directory; pg_stop_backup();) 
  from the standby server against the primary server.
  - I disliked this. 
  Now:
* Heikki is proposing to cover both the case of not using pg_basebackup
  and the case of not specifying -x.
  So, he thinks:
* Use the output of pg_stop_backup().
* Don't use the backup history file.

Right?



Jun Ishizuka
NTT Software Corporation
TEL:045-317-7018
E-Mail: ishizuka@po.ntts.co.jp




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SYNONYMS (again)

2011-06-23 Thread PostgreSQL - Hans-Jürgen Schönig

On Jun 23, 2011, at 12:52 AM, Alvaro Herrera wrote:

 Excerpts from Joshua D. Drake's message of mié jun 22 15:37:17 -0400 2011:
 Per:
 
 http://archives.postgresql.org/pgsql-hackers/2010-11/msg02043.php
 
 It seems we did come up with a use case in the procpid discussion: the 
 ability to change the names of columns/databases etc., to handle the 
 fixing of bad naming decisions made during development over time.
 
 Thoughts?
 
 Let's start with what was discussed and supported in that thread, that
 is, databases.  It seems less clear that columns are widely believed to
 be a good idea to have synonyms for.  Besides, synonyms for databases
 should be reasonably simple to implement, which is not something I would
 say for columns.



yes, implementing synonyms is not too hard.
some time ago (3 or 4 years ago most likely) we already posted a patch 
providing support for synonyms.
it was rejected because synonyms were said to be a bad design pattern which 
encourages app developers to do nasty things.
so, if you want to work on it maybe this patch is the place to start.

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de




Re: [HACKERS] SYNONYMS (again)

2011-06-23 Thread PostgreSQL - Hans-Jürgen Schönig

On Jun 23, 2011, at 12:52 AM, Alvaro Herrera wrote:

 Excerpts from Joshua D. Drake's message of mié jun 22 15:37:17 -0400 2011:
 Per:
 
 http://archives.postgresql.org/pgsql-hackers/2010-11/msg02043.php
 
 It seems we did come up with a use case in the procpid discussion: the 
 ability to change the names of columns/databases etc., to handle the 
 fixing of bad naming decisions made during development over time.
 
 Thoughts?
 
 Let's start with what was discussed and supported in that thread, that
 is, databases.  It seems less clear that columns are widely believed to
 be a good idea to have synonyms for.  Besides, synonyms for databases
 should be reasonably simple to implement, which is not something I would
 say for columns.



sorry, i missed the links:

http://archives.postgresql.org/pgsql-patches/2006-03/msg00085.php

many thanks,

hans


--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de




Re: [HACKERS] Hugetables question

2011-06-23 Thread Martijn van Oosterhout
On Wed, Jun 22, 2011 at 02:31:01PM +0200, Radosław Smogura wrote:
 I strictly disagree with opinion if there is 1% it's worthless. 1%
 here, 1% there, and finally You get 10%, but of course hugepages
 will work quite well if will be used in code that require many
 random jumps. I think this can be reproduced and some not-common
 case may be found to get performance of about 10% (maybe upload
 whole table in shared buffer and randomly peek records one by
 one).

I think the point is not that 1% is worthless, but that it hasn't been
shown that it is a 1% improvement, because the noise is too large.

For benefits this small, what you need to do is run each test 100 times,
check the mean and standard deviation, and see whether the
improvement is real or not.

When the benefit is 10% you only need a handful of runs to prove it,
which is why such patches are accepted more easily.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 Patriotism is when love of your own people comes first; nationalism,
 when hate for people other than your own comes first. 
   - Charles de Gaulle




Re: [HACKERS] Hugetables question

2011-06-23 Thread Radosław Smogura
Martijn van Oosterhout klep...@svana.org Thursday 23 of June 2011 09:10:20
 On Wed, Jun 22, 2011 at 02:31:01PM +0200, Radosław Smogura wrote:
  I strictly disagree with opinion if there is 1% it's worthless. 1%
  here, 1% there, and finally You get 10%, but of course hugepages
  will work quite well if will be used in code that require many
  random jumps. I think this can be reproduced and some not-common
  case may be found to get performance of about 10% (maybe upload
  whole table in shared buffer and randomly peek records one by
  one).
 
 I think the point is not that 1% is worthless, but that it hasn't been
 shown that it is a 1% improvement, because the noise is too large.
 
 For benefits this small, what you need to do is run each test 100 times,
 check the mean and standard deviation, and see whether the
 improvement is real or not.
 
 When the benefit is 10% you only need a handful of runs to prove it,
 which is why such patches are accepted more easily.
 
 Have a nice day,
 
  Patriotism is when love of your own people comes first; nationalism,
  when hate for people other than your own comes first.
  
- Charles de Gaulle
I think the conclusion from this test was "there are much more important 
things to do than a 1% benefit" - not "1% is worthless".

I will try hugepages today, with random peeks.

Regards,
Radek



Re: [HACKERS] [v9.2] DROP Reworks Part.0 - 'missing_ok' support of get_object_address

2011-06-23 Thread Kohei KaiGai
I revised my patch based on your there-is-no-try-v2.patch.
It makes it possible to implement 'missing_ok' support without changing
the order in which the object name is resolved and the relation is locked.

Thanks,

2011/6/22 Robert Haas robertmh...@gmail.com:
 On Wed, Jun 22, 2011 at 12:51 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 Excerpts from Robert Haas's message of mié jun 22 08:56:02 -0400 2011:

 Another option might be to leave heap_openrv() and relation_openrv()
 alone and add a missing_ok argument to try_heap_openrv() and
 try_relation_openrv().  Passing true would give the same behavior as
 presently; passing false would make them behave like the non-try
 version.

 That would be pretty weird, having two functions, one of them sometimes
 doing the same thing as the other one.

 I understand Noah's concern but I think your original proposal was saner
 than both options presented so far.

 I agree with you.  If we had a whole pile of options it might be worth
 having heap_openrv() and heap_openrv_extended() so as not to
 complicate the simple case, but since there's no foreseeable need to
 add anything other than missing_ok, my gut is to just add it and call
 it good.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company




-- 
KaiGai Kohei kai...@kaigai.gr.jp


pgsql-v9.2-drop-reworks-part-0.v4.patch
Description: Binary data



Re: [HACKERS] fixing PQsetvalue()

2011-06-23 Thread Dmitriy Igrishin
2011/6/23 Merlin Moncure mmonc...@gmail.com

 On Jun 6 (
 http://archives.postgresql.org/pgsql-hackers/2011-06/msg00272.php),
 Pavel discovered an issue with PQsetvalue that could cause libpq to
 wander off into unallocated memory that was present in 9.0.x.  A
 fairly uninteresting fix was quickly produced, but Tom indicated
 during subsequent review that he was not happy with the behavior of
 the function.  Everyone was busy with the beta wrap at the time so I
 didn't press the issue.

 A little history here: PQsetvalue's
 (http://www.postgresql.org/docs/9.0/static/libpq-misc.html) main
 reason to exist is to allow building a PGresult from scratch data on
 a result created via PQmakeEmptyPGresult(). This behavior works as
 intended and is not controversial...it was initially done to support
 libpqtypes but has apparently found use in other places like ecpg.

 PQsetvalue *also* allows you to mess with results returned by the
 server using standard query methods for any tuple, not just the one
 you are creating as you iterate through.  This behavior was
 unnecessary in terms of what libpqtypes and friends needed and may (as
 Tom suggested) come back to bite us at some point. As it turns out,
 PQsetvalue's operation on results that weren't created via
 PQmakeEmptyPGresult() was totally busted because of the bug, so we have a
 unique opportunity to tinker with libpq here: you could argue that you
 have a window of opportunity to change the behavior here since we know
 it isn't being usefully used.  I think it's too late for that, but if
 it had to be done, the time is now.

 Pavel actually has been requesting to go further with being able to
 mess around with PGresults (see:
 http://archives.postgresql.org/pgsql-interfaces/2011-06/msg0.php),
 such that the result would start to have some 'recordset' features you
 find in various other libraries.  Maybe it's not the job of libpq to
 do that, but I happen to think it's pretty cool.  Anyways, something
 has to be done -- I hate to see an unpatched known 9.0 issue remain in
 the wild.

+1

Right now I am thinking about using a modifiable
(via PQsetvalue) PGresult instead of std::map in my C++ library,
to store parameters for binding to the command being executed.
I have already designed how to implement it, and I assumed that
PQsetvalue is intended to work with any PGresult, not only
with those created via PQmakeEmptyPGresult...
So, I am absolutely sure that PQsetvalue should work with
any PGresult.


 merlin





-- 
// Dmitriy.


Re: [HACKERS] fixing PQsetvalue()

2011-06-23 Thread Pavel Golub
Hello, Merlin.

You wrote:

MM On Jun 6
MM (http://archives.postgresql.org/pgsql-hackers/2011-06/msg00272.php),
MM Pavel discovered an issue with PQsetvalue that could cause libpq to
MM wander off into unallocated memory that was present in 9.0.x.  A
MM fairly uninteresting fix was quickly produced, but Tom indicated
MM during subsequent review that he was not happy with the behavior of
MM the function.  Everyone was busy with the beta wrap at the time so I
MM didn't press the issue.

MM A little history here: PQsetvalue's
MM (http://www.postgresql.org/docs/9.0/static/libpq-misc.html) main
MM reason to exist is to allow creating a PGresult out of scratch data on
MM a result created via PQmakeEmptyResult(). This behavior works as
MM intended and is not controversial...it was initially done to support
MM libpqtypes but has apparently found use in other places like ecpg.

MM PQsetvalue *also* allows you to mess with results returned by the
MM server using standard query methods for any tuple, not just the one
MM you are creating as you iterate through.  This behavior was
MM unnecessary in terms of what libpqtypes and friends needed and may (as
MM Tom suggested) come back to bite us at some point. As it turns out,
MM PQsetvalue's operation on results that weren't created via
MM PQmakeEmptyResult was totally busted because of the bug, so we have a
MM unique opportunity to tinker with libpq here: you could argue that you
MM have a window of opportunity to change the behavior here since we know
MM it isn't being usefully used.  I think it's too late for that, but if
MM it had to be done, the time is now.

MM Pavel actually has been requesting to go further with being able to
MM mess around with PGresults (see:
MM http://archives.postgresql.org/pgsql-interfaces/2011-06/msg0.php),
MM such that the result would start to have some 'recordset' features you
MM find in various other libraries.  Maybe it's not the job of libpq to
MM do that, but I happen to think it's pretty cool.  Anyways, something
MM has to be done -- I hate to see an unpatched known 9.0 issue remain in
MM the wild.

MM merlin

I confirm my desire to have a delete-tuple routine, and I'm ready to
create/test/modify any sources for this.

-- 
With best wishes,
 Pavel  mailto:pa...@gf.microolap.com




Re: [HACKERS] fixing PQsetvalue()

2011-06-23 Thread Andrew Chernow

you are creating as you iterate through.  This behavior was
unnecessary in terms of what libpqtypes and friends needed and may (as
Tom suggested) come back to bite us at some point. As it turns out,
PQsetvalue's operation on results that weren't created via
PQmakeEmptyResult was totally busted because of the bug, so we have a
unique opportunity to tinker with libpq here: you could argue that you

+1

Right now I am thinking about using a modifiable
(via PQsetvalue) PGresult instead of std::map in my C++ library,
to store parameters for binding to the command being executed.
I have already designed how to implement it, and I assumed that
PQsetvalue is intended to work with any PGresult, not only
with those created via PQmakeEmptyPGresult...
So, I am absolutely sure that PQsetvalue should work with
any PGresult.


All PGresults are created via PQmakeEmptyPGresult, including libpqtypes. 
 Actually, libpqtypes calls PQcopyResult which calls PQmakeEmptyPGresult.


It might be better to say "a server result" vs. "a client result". 
Currently, PQsetvalue is broken when provided a server-generated result.


--
Andrew Chernow
eSilo, LLC
global backup
http://www.esilo.com/



Re: [HACKERS] [COMMITTERS] pgsql: Make the visibility map crash-safe.

2011-06-23 Thread Robert Haas
On Wed, Jun 22, 2011 at 10:23 PM, Robert Haas robertmh...@gmail.com wrote:
 Well, it seems I didn't put nearly enough thought into heap_update().
 The fix for the immediate problem looks simple enough - all the code
 has been refactored to use the new API, so the calls can be easily be
 moved into the critical section (see attached).  But looking at this a
 little more, I see that heap_update() is many bricks short of a load,
 because there are several places where the buffer can be unlocked and
 relocked, and we don't recheck whether the page is all-visible after
 reacquiring the lock.  So I've got some more work to do here.

See what you think of the attached.  I *think* this covers all bases.
It's a little more complicated than I would like, but I don't think
fatally so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


heap-update-vm-fix-v2.patch
Description: Binary data



Re: [HACKERS] fixing PQsetvalue()

2011-06-23 Thread Dmitriy Igrishin
2011/6/23 Andrew Chernow a...@esilo.com

 you are creating as you iterate through.  This behavior was
unnecessary in terms of what libpqtypes and friends needed and may (as
Tom suggested) come back to bite us at some point. As it turns out,
PQsetvalue's operation on results that weren't created via
PQmakeEmptyResult was totally busted because of the bug, so we have a
unique opportunity to tinker with libpq here: you could argue that you

 +1

 Right now I am thinking about using a modifiable
 (via PQsetvalue) PGresult instead of std::map in my C++ library,
 to store parameters for binding to the command being executed.
 I have already designed how to implement it, and I assumed that
 PQsetvalue is intended to work with any PGresult, not only
 with those created via PQmakeEmptyPGresult...
 So, I am absolutely sure that PQsetvalue should work with
 any PGresult.


 All PGresults are created via PQmakeEmptyPGresult, including libpqtypes.
  Actually, libpqtypes calls PQcopyResult which calls PQmakeEmptyPGresult.

 It might be better to say a server result vs. a client result.
 Currently, PQsetvalue is broken when provided a server generated result.

Yeah, yeah. Thanks for the clarification!
I am not a libpq hacker, just a user, so I was thinking that
server-returned results were created by libpq's private functions,
rather than by the public PQmakeEmptyPGresult...

-1 if this feature (which I personally thought already existed,
according to the documentation) makes future optimizations impossible
or hard (as mentioned by Tom). Otherwise - +1.


 --
 Andrew Chernow
 eSilo, LLC
 global backup
 http://www.esilo.com/




-- 
// Dmitriy.


Re: [HACKERS] fixing PQsetvalue()

2011-06-23 Thread Merlin Moncure
On Thu, Jun 23, 2011 at 7:54 AM, Andrew Chernow a...@esilo.com wrote:
    you are creating as you iterate through.  This behavior was
    unnecessary in terms of what libpqtypes and friends needed and may (as
    Tom suggested) come back to bite us at some point. As it turns out,
    PQsetvalue's operation on results that weren't created via
    PQmakeEmptyResult was totally busted because of the bug, so we have a
    unique opportunity to tinker with libpq here: you could argue that you

 +1

 Right now I am thinking about using a modifiable
 (via PQsetvalue) PGresult instead of std::map in my C++ library,
 to store parameters for binding to the command being executed.
 I have already designed how to implement it, and I assumed that
 PQsetvalue is intended to work with any PGresult, not only
 with those created via PQmakeEmptyPGresult...
 So, I am absolutely sure that PQsetvalue should work with
 any PGresult.

 All PGresults are created via PQmakeEmptyPGresult, including libpqtypes.
  Actually, libpqtypes calls PQcopyResult which calls PQmakeEmptyPGresult.

 It might be better to say a server result vs. a client result.
 Currently, PQsetvalue is broken when provided a server generated result.

er, right-- the divergence was in how the tuples were getting added --
thanks for the clarification.  essentially your attached patch fixes
that divergence.

merlin



Re: [HACKERS] Hugetables question

2011-06-23 Thread Robert Haas
On Thu, Jun 23, 2011 at 5:01 AM, Radosław Smogura
rsmog...@softperience.eu wrote:
 I think the conclusion from this test was "there are much more important
 things to do than a 1% benefit" - not "1% is worthless".

 I will try hugepages today, with random peeks.

I think the real conclusion here is "Linux will soon do this for us
automatically, without us needing to do anything", so pretty soon there
will be no possible benefit at all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Table Partitioning

2011-06-23 Thread Robert Haas
On Tue, Jun 21, 2011 at 1:52 PM, David Fetter da...@fetter.org wrote:
 Still, I think a pretty clear
 way forward here is to try to figure out a way to add some explicit
 syntax for range partitions, so that you can say...

 foo_a is for all rows where foo starts with 'a'
 foo_b is for all rows where foo starts with 'b'
 ...
 foo_xyz is for all rows where foo starts with 'xyz'

 If we have that data represented explicitly in the system catalog,
 then we can look at doing built-in INSERT-routing and UPDATE-handling.
  For an added bonus, it's a more natural syntax.

 Does someone else have such a syntax?  Does The Standard™ have
 anything to say?

Yes, and I don't know, respectively.  There have been previous
discussions of this.  You might want to start by reading the
discussion around the previous patch.

https://commitfest.postgresql.org/action/patch_view?id=266

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Online base backup from the hot-standby

2011-06-23 Thread Steve Singer
On 11-06-23 02:41 AM, Jun Ishiduka wrote:
 Receiving this mail made me notice that I had misunderstood what
 Heikki is proposing.

 My understanding:
   Before:
 * I thought Heikki was proposing to execute the SQL 
   (pg_start_backup('x'); copy the data directory; pg_stop_backup();) 
   from the standby server against the primary server.
   - I disliked this. 
   Now:
 * Heikki is proposing to cover both the case of not using pg_basebackup
   and the case of not specifying -x.
   So, he thinks:
 * Use the output of pg_stop_backup().
 * Don't use the backup history file.

 Right?


What I think he is proposing would not require using pg_stop_backup(),
but you could also modify pg_stop_backup() to work as well.

What do you think of this idea?

Do you see how the patch can be reworked to accomplish this?



 
 Jun Ishizuka
 NTT Software Corporation
 TEL:045-317-7018
 E-Mail: ishizuka@po.ntts.co.jp
 







Re: [HACKERS] Table Partitioning

2011-06-23 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 David Fetter da...@fetter.org wrote:
 
 Does The Standard™ have anything to say?
 
 I don't know
 
I dug around in the standard a bit and didn't find anything.  Given
that the standard doesn't even touch the issue of indexes, because
that's a performance-tuning implementation detail, it would seem out
of character for them to define table partitioning.
 
-Kevin



[HACKERS] spinlock contention

2011-06-23 Thread Robert Haas
On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug f...@phlo.org wrote:
 On Jun12, 2011, at 23:39 , Robert Haas wrote:
 So, the majority (60%) of the excess spinning appears to be due to
 SInvalReadLock.  A good chunk are due to ProcArrayLock (25%).

 Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
 However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
 so I guess that two adjacent LWLocks currently share one cache line.

 Currently, the ProcArrayLock has index 4 while SInvalReadLock has
 index 5, so if I'm not mistaken exactly the two locks where you saw
 the largest contention on are on the same cache line...

 Might make sense to try and see if these numbers change if you
 either make LWLockPadded 64bytes or arrange the locks differently...

I fooled around with this a while back and saw no benefit.  It's
possible a more careful test would turn up something, but I think the
only real way forward here is going to be to eliminate some of that
locking altogether.

SInvalReadLock is a good example of a lock which seems barely to be
necessary at all.  Except on extraordinarily rare occasions, it is
only ever taken in share mode.  Only SICleanupQueue() needs to lock
out readers, and in the (probably common) case where all backends are
reasonably close to being caught up, it wouldn't even matter if the
determination of minMsgNum were a little bit fuzzy.  I've been
straining my brain trying to figure out whether there's some way to
rewrite this logic so that it can run concurrently with readers; then
we could get rid of SInvalReadLock altogether.  I have a feeling if I
were 25% smarter I could do it, but I'm not.

So my fallback position is to try to find some way to optimize the
common case where there are no messages to be read.  We're already
allowing readers and writers to run in parallel, so it must not be
important for a reader to see a message that a writer is in the
process of writing, but has not yet finished writing.  I believe the
idea here is that sinval messages should be emitted before releasing
the related locks, so if we've successfully acquired a lock on an
object, then any pertinent sinval messages must already be completely
written.  What I'm thinking we could do is just keep a global counter
indicating how many messages have been written, and each backend could
keep a counter indicating the highest-numbered message it has so far
read.  When a message is written, the global counter is bumped.  When
a backend goes to AcceptInvalidationMessages(), it compares the global
counter to its own counter, and if they're the same, then it doesn't
bother taking SInvalReadLock and just exits quickly.

There is an obvious (potential) memory-ordering problem here.  If it's
possible for a backend doing AcceptInvalidationMessages() to see a
stale version of the counter, then it might fail to read messages from
the queue that it really ought to have seen.  Protecting the global
counter with a spinlock defeats the purpose of doing any of this in
the first place, because the whole problem is too many people fighting
over the spinlock protecting SInvalReadLock.  It should be sufficient,
though, for the writing backend to execute a memory-barrier operation
just after bumping the global counter and the reading backend to
execute one just before.  As it happens, LWLockRelease() is such a
barrier, since it must acquire and release a spinlock, so we should be
able to just bump the counter right before releasing the LWLock and
call it good.  On the reading side, the immediately-preceding lock
acquisition seems like it ought to be sufficient - surely it can't be
possible to acquire a heavyweight lock on an object without seeing
memory updates that some other process made before releasing the same
heavyweight lock.

A possible objection here is that a 32-bit counter might wrap around,
giving a false negative, and a read from a 64-bit counter might not be
atomic.  In practice, the chances of a problem here seem remote, but I
think we can guard against it easily enough.  We can put each
backend's counter of messages read into shared memory, protected by a
per-backend lock (probably the same LWLock the fast relation lock
patch already introduces).  When a backend compares its counter to the
global counter, it does so while holding this lock.  When
SICleanupQueue() resets a backend, it grabs the lock and sets that
backend's counter to a special value (like 0) that we guarantee will
never match the global counter (which we can cause to wrap from 2^32-1
to 1).  So then AcceptInvalidationMessages() - in the common case
where there are no invalidation messages to process - will only need
the per-backend LWLock rather than fighting over SInvalLock.

If anyone has read this far, you have my undying gratitude,
especially if you respond with comments.

ProcArrayLock looks like a tougher nut to crack - there's simply no
way, with the system we have right now, that you can take a snapshot
without locking the list of running processes.  I'm not sure what to
do about that, but we're probably going to have to come up with
something, because it seems clear that once we eliminate the lock
manager LWLock contention, this is a major bottleneck.

Re: [HACKERS] spinlock contention

2011-06-23 Thread Heikki Linnakangas

On 23.06.2011 18:42, Robert Haas wrote:

ProcArrayLock looks like a tougher nut to crack - there's simply no
way, with the system we have right now, that you can take a snapshot
without locking the list of running processes.  I'm not sure what to
do about that, but we're probably going to have to come up with
something, because it seems clear that once we eliminate the lock
manager LWLock contention, this is a major bottleneck.


ProcArrayLock is currently held for a relatively long period of time 
when a snapshot is taken, especially if there are a lot of backends. There 
are three operations on the procarray:


1. Advertising a new xid belonging to my backend.
2. Ending a transaction.
3. Getting a snapshot.

Advertising a new xid is currently done without a lock, assuming that 
setting an xid in shared memory is atomic. To end a transaction, you 
acquire ProcArrayLock in exclusive mode. To get a snapshot, you acquire 
ProcArrayLock in shared mode, and scan the whole procarray.


I wonder if it would be a better tradeoff to keep a materialized 
snapshot in shared memory that's updated every time a new xid is 
assigned or a transaction ends. Getting a snapshot would become much 
cheaper, as you could just memcpy the ready-built snapshot from shared 
memory into local memory.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] spinlock contention

2011-06-23 Thread Florian Pflug
On Jun23, 2011, at 17:42 , Robert Haas wrote:
 On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug f...@phlo.org wrote:
 On Jun12, 2011, at 23:39 , Robert Haas wrote:
 So, the majority (60%) of the excess spinning appears to be due to
 SInvalReadLock.  A good chunk are due to ProcArrayLock (25%).
 
 Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
 However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
 so I guess that two adjacent LWLocks currently share one cache line.
 
 Currently, the ProcArrayLock has index 4 while SInvalReadLock has
 index 5, so if I'm not mistaken exactly the two locks where you saw
 the largest contention on are on the same cache line...
 
 Might make sense to try and see if these numbers change if you
 either make LWLockPadded 64bytes or arrange the locks differently...
 
 I fooled around with this a while back and saw no benefit.  It's
 possible a more careful test would turn up something, but I think the
 only real way forward here is going to be to eliminate some of that
 locking altogether.

It seems hard to believe that there isn't some effect of two locks
sharing a cache line. There are architectures (even some of the
Intel architectures, I believe) where cache lines are 32 bytes, though -
so measuring this certainly isn't easy. Also, I'm absolutely no
expert on this, so it might very well be that I'm missing something
fundamental.

 SInvalReadLock is a good example of a lock which seems barely to be
 necessary at all.  Except on extraordinarily rare occasions, it is
 only ever taken in share mode.

This is the case we should optimize for, I think. Currently, there's
contention even between two SHARE lockers of a LWLock because our
LWLock implementation uses a spinlock underneath. I've googled around
a bit, and found this:

http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git

It's a userspace rwlock implementation based on atomic instructions
and futexes (for blocking) written by Linus Torvalds as a response
to the following thread

http://lwn.net/Articles/284741/

It seems to require only a single atomic increment instruction to
acquire a share lock in the uncontested case, and a single compare-
and-exchange instruction to acquire an exclusive lock (again in
the uncontested case).

The code doesn't seem to have a license attached, and seems to have
been written over a weekend and never touched since, so we might
not want (or be able) to use this directly. But it'd still be
interesting to see if replacing our LWLocks with this implementation
affects throughput.

 Only SICleanupQueue() needs to lock
 out readers, and in the (probably common) case where all backends are
 reasonably close to being caught up, it wouldn't even matter if the
 determination of minMsgNum were a little bit fuzzy.  I've been
 straining my brain trying to figure out whether there's some way to
 rewrite this logic so that it can run concurrently with readers; then
 we could get rid of SInvalReadLock altogether.  I have a feeling if I
 were 25% smarter I could do it, but I'm not.

This sounds like something which could be done with RCU (Read-Copy-Update)
in a lockless way. The idea is to copy the data structure (or parts)
of it before updating it, and to commit the change afterwards by
adjusting a single pointer to point to the new structure (kinda like
we do atomic commits by an atomic update of one bit in the clog).

The hard part is usually to figure when it's OK to clean up or
reuse the old version of the data structure - for that, you need to
know that all previous readers have completed.

Here's a rough description of how that could work for the invalidation
queue. We'd have two queue structures in shared memory, each containing
a version number. We'd also have a current pointer pointing to the
most recent one. Each reader would publish the version number of the
queue he's about to read (the current one) in shared memory
(and issue a write barrier). SICleanupQueue() would check whether the
non-current queue structure is unused by comparing its version to the
published versions of all backends (after issuing a read barrier, and
after taking a lock that prevents concurrent SICleanupQueue runs of 
course). If there still are users of that version, SICleanupQueue()
out-waits them. Afterwards, it simply copies the current structure
over the old one, does the necessary cleanups, increments the version
number to be larger than the one of the current structure,
and *then* updates the current pointer.

 So my fallback position is to try to find some way to optimize the
 common case where there are no messages to be read.  We're already
 allowing readers and writers to run in parallel, so it must not be
 important for a reader to see a message that a writer is in the
 process of writing, but has not yet finished writing.  I believe the
 idea here is that sinval messages should be emitted before releasing
 the related locks, so if we've successfully acquired a lock on 

Re: [HACKERS] SYNONYMS (again)

2011-06-23 Thread Kevin Grittner
Gurjeet Singh singh.gurj...@gmail.com wrote:
 
 Instead of just synonyms of columns, why don't we think about
implementing
 virtual columns (feature as named in other RDBMS). This is the
ability to
 define a column in a table which is derived using an expression
around other
 non-virtual columns.
 
How do you see that working differently from what PostgreSQL can
currently do?
 
test=# create table line_item(id int primary key not null, quantity int
not null, unit_price numeric(13,2));
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
line_item_pkey for table line_item
CREATE TABLE
test=# insert into line_item values (1,15,'12.53'),(2,5,'16.23');
INSERT 0 2
test=# create function line_total(line_item) returns numeric(13,2)
language sql immutable as $$ select ($1.quantity *
$1.unit_price)::numeric(13,2);$$;
CREATE FUNCTION
test=# select li.id, li.line_total from line_item li;
 id | line_total
----+------------
  1 |     187.95
  2 |      81.15
(2 rows)
 
-Kevin



Re: [HACKERS] spinlock contention

2011-06-23 Thread Robert Haas
On Thu, Jun 23, 2011 at 12:19 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 23.06.2011 18:42, Robert Haas wrote:
 ProcArrayLock looks like a tougher nut to crack - there's simply no
 way, with the system we have right now, that you can take a snapshot
 without locking the list of running processes.  I'm not sure what to
 do about that, but we're probably going to have to come up with
 something, because it seems clear that once we eliminate the lock
 manager LWLock contention, this is a major bottleneck.

 ProcArrayLock is currently held for a relatively long period of time when a
 snapshot is taken, especially if there's a lot of backends. There are three
 operations to the procarray:

 1. Advertising a new xid belonging to my backend.
 2. Ending a transaction.
 3. Getting a snapshot.

 Advertising a new xid is currently done without a lock, assuming that
 setting an xid in shared memory is atomic.

The issue isn't just whether it's atomic, but whether some other
backend scanning the ProcArray just afterwards might fail to see the
update due to memory-ordering effects.  But I think even if they do,
there's no real harm done.  Since we're holding XidGenLock, we know
that the XID we are advertising is higher than any XID previously
advertised, and in particular it must be higher than
latestCompletedXid, so GetSnapshotData() will ignore it anyway.

I am rather doubtful that this code is strictly safe:

myproc->subxids.xids[nxids] = xid;
myproc->subxids.nxids = nxids + 1;

In practice, the window for a race condition is minimal, because (a)
in many cases, nxids will be on the same cache line as the xid being
set; (b) even if it's not, we almost immediately release XidGenLock,
which acts as a memory barrier; (c) on many common architectures, such
as x86, stores are guaranteed to become visible in execution order;
(d) if, on some architecture, you managed to see the stores out of
order and thus loaded a garbage XID from the end of the array, it
might happen to be zero (which would be ignored, as a non-normal
transaction ID) or the transaction might happen never to examine a
tuple with that XID anyway.

 To end a transaction, you acquire
 ProcArrayLock in exclusive mode. To get a snapshot, you acquire
 ProcArrayLock in shared mode, and scan the whole procarray.

 I wonder if it would be a better tradeoff to keep a materialized snapshot
 in shared memory that's updated every time a new xid is assigned or a
 transaction ends. Getting a snapshot would become much cheaper, as you could
 just memcpy the ready-built snapshot from shared memory into local memory.

At least in the case of the pgbench -S workload, I doubt that would
help at all.  The problem isn't that the LWLocks are being held for
too long -- they are all being held in shared mode and don't conflict
anyway.  The problem is that everyone has to acquire and release the
spinlock in order to acquire the LWLock, and again to release it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] spinlock contention

2011-06-23 Thread Merlin Moncure
On Thu, Jun 23, 2011 at 1:34 PM, Florian Pflug f...@phlo.org wrote:
 On Jun23, 2011, at 17:42 , Robert Haas wrote:
 On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug f...@phlo.org wrote:
 On Jun12, 2011, at 23:39 , Robert Haas wrote:
 So, the majority (60%) of the excess spinning appears to be due to
 SInvalReadLock.  A good chunk are due to ProcArrayLock (25%).

 Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
 However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
 so I guess that two adjacent LWLocks currently share one cache line.

 Currently, the ProcArrayLock has index 4 while SInvalReadLock has
 index 5, so if I'm not mistaken exactly the two locks where you saw
 the largest contention on are on the same cache line...

 Might make sense to try and see if these numbers change if you
 either make LWLockPadded 64bytes or arrange the locks differently...

 I fooled around with this a while back and saw no benefit.  It's
 possible a more careful test would turn up something, but I think the
 only real way forward here is going to be to eliminate some of that
 locking altogether.

 It seems hard to believe that there isn't some effect of two locks
 sharing a cache line. There are architectures (even some of the
 Intel architectures, I believe) where cache lines are 32 bytes, though -
 so measuring this certainly isn't easy. Also, I'm absolutely no
 expert on this, so it might very well be that I'm missing something
 fundamental.

 SInvalReadLock is a good example of a lock which seems barely to be
 necessary at all.  Except on extraordinarily rare occasions, it is
 only ever taken in share mode.

 This is the case we should optimize for, I think. Currently, there's
 contention even between two SHARE lockers of a LWLock because our
 LWLock implementation uses a spinlock underneath. I've googled around
 a bit, and found this:

 http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git

 It's a userspace rwlock implementation based on atomic instructions
 and futexes (for blocking) written by Linus Torvalds as a response
 to the following thread

 http://lwn.net/Articles/284741/

 It seems to require only a single atomic increment instruction to
 acquire a share lock in the uncontested case, and a single compare-
 and-exchange instruction to acquire an exclusive lock (again in
 the uncontested case).

 The code doesn't seem to have a license attached, and seems to have
 been written over a weekend and never touched since, so we might
 not want (or be able) to use this directly. But it'd still be
 interesting to see if replacing our LWLocks with this implementation
 affects throughput.

 Only SICleanupQueue() needs to lock
 out readers, and in the (probably common) case where all backends are
 reasonably close to being caught up, it wouldn't even matter if the
 determination of minMsgNum were a little bit fuzzy.  I've been
 straining my brain trying to figure out whether there's some way to
 rewrite this logic so that it can run concurrently with readers; then
 we could get rid of SInvalReadLock altogether.  I have a feeling if I
 were 25% smarter I could do it, but I'm not.

 This sounds like something which could be done with RCU (Read-Copy-Update)
 in a lockless way. The idea is to copy the data structure (or parts)
 of it before updating it, and to commit the change afterwards by
 adjusting a single pointer to point to the new structure (kinda like
 we do atomic commits by an atomic update of one bit in the clog).

 The hard part is usually to figure when it's OK to clean up or
 reuse the old version of the data structure - for that, you need to
 know that all previous readers have completed.

 Here's a rough description of how that could work for the invalidation
 queue. We'd have two queue structures in shared memory, each containing
 a version number. We'd also have a current pointer pointing to the
 most recent one. Each reader would publish the version number of the
 queue he's about to read (the current one) in shared memory
 (and issue a write barrier). SICleanupQueue() would check whether the
 non-current queue structure is unused by comparing its version to the
 published versions of all backends (after issuing a read barrier, and
 after taking a lock that prevents concurrent SICleanupQueue runs of
 course). If there still are users of that version, SICleanupQueue()
 out-waits them. Afterwards, it simply copies the current structure
 over the old one, does the necessary cleanups, increments the version
 number to be larger than the one of the current structure,
 and *then* updates the current pointer.

 So my fallback position is to try to find some way to optimize the
 common case where there are no messages to be read.  We're already
 allowing readers and writers to run in parallel, so it must not be
 important for a reader to see a message that a writer is in the
 process of writing, but has not yet finished writing.  I believe the
 idea here is that sinval 

Re: [HACKERS] SYNONYMS (again)

2011-06-23 Thread Gurjeet Singh
On Thu, Jun 23, 2011 at 2:58 PM, Kevin Grittner kevin.gritt...@wicourts.gov
 wrote:

 Gurjeet Singh singh.gurj...@gmail.com wrote:

  Instead of just synonyms of columns, why don't we think about
 implementing
  virtual columns (feature as named in other RDBMS). This is the
 ability to
  define a column in a table which is derived using an expression
 around other
  non-virtual columns.

 How do you see that working differently from what PostgreSQL can
 currently do?

 test=# create table line_item(id int primary key not null, quantity int
 not null, unit_price numeric(13,2));
 NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
 line_item_pkey for table line_item
 CREATE TABLE
 test=# insert into line_item values (1,15,'12.53'),(2,5,'16.23');
 INSERT 0 2
 test=# create function line_total(line_item) returns numeric(13,2)
 language sql immutable as $$ select ($1.quantity *
 $1.unit_price)::numeric(13,2);$$;
 CREATE FUNCTION
 test=# select li.id, li.line_total from line_item li;
  id | line_total
 ----+------------
   1 |     187.95
   2 |      81.15
 (2 rows)


For one, this column is not part of the table, so we can't gather statistics
on it to help the optimizer.

We can't create primary keys on this expression.

Also, say if the query wasn't fetching all the columns and we had just the
line_total call in SELECT list, the executor has to fetch the whole row and
pass it on to the function even though the function uses only part of the
row (2 columns in this case).

Regards,
-- 
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe

2011-06-23 Thread Robert Haas
On Wed, Jun 22, 2011 at 11:26 AM, Simon Riggs si...@2ndquadrant.com wrote:
 For people that still think there is still risk, I've added a
 parameter called relaxed_ddl_locking which defaults to off. That
 way people can use this feature in an informed way without exposing us
 to support risks from its use.

I can't get excited about adding a GUC that says you can turn this
off if you want but don't blame us if it breaks.  That just doesn't
seem like good software engineering to me.

 I think we should be clear that this adds *one* lock/unlock for each
 object for each session, ONCE only after each transaction that
 executed a DDL statement against an object. So with 200 sessions, a
 typical ALTER TABLE statement will cause 400 locks. The current lock
 manager maxes out above 50,000 locks per second, so any comparative
 analysis will show this is a minor blip on even a heavy workload.

I think that using locks to address this problem is the wrong
approach.  I think the right fix is to use something other than
SnapshotNow to do the system catalog scans.  However, if we were going
to use locking, then we'd need to ensure that:

(1) No transaction which has made changes to a system catalog may
commit while some other transaction is in the middle of scanning that
catalog.
(2) No transaction which has made changes to a set of system catalogs
may commit while some other transaction is in the midst of fetching
interrelated data from a subset of those catalogs.

It's important to note, I think, that the problem here doesn't occur
at the time that the table rows are updated, because those rows aren't
visible yet.  The problem happens at commit time.  So unless I'm
missing something, ALTER TABLE would only need to acquire the relation
definition lock after it has finished all of its other work.  In fact,
it doesn't even really need to acquire it then, either.  It could just
queue up a list of relation definition locks to acquire immediately
before commit, which would be advantageous if the transaction went on
to abort, since in that case we'd skip the work of acquiring the locks
at all.

In fact, without doing that, this patch would undo much of the point
of the original ALTER TABLE lock strength reduction:

begin;
alter table foo alter column a set storage plain;
start a new psql session in another window
select * from foo;

In master, the SELECT completes without blocking.  With this patch
applied, the SELECT blocks, just as it did in 9.0, because it can't
rebuild the relcache entry without getting the relation definition
lock, which the alter table will hold until commit.  In fact, the same
thing happens even if foo has been accessed previously in the same
session.  If there is already an open *transaction* that has accessed
foo, then, it seems, it can continue to do so until it commits.  But
as soon as it tries to start a new transaction, it blocks again.  I
don't quite understand why that is - I didn't think we invalidated the
entire relcache on every commit.  But that's what happens.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] spinlock contention

2011-06-23 Thread Robert Haas
On Thu, Jun 23, 2011 at 2:34 PM, Florian Pflug f...@phlo.org wrote:
 It seems hard to believe that there isn't some effect of two locks
 sharing a cache line. There are architectures (even some of the
 Intel architectures, I believe) where cache lines are 32 bytes, though -
 so measuring this certainly isn't easy. Also, I'm absolutely no
 expert on this, so it might very well be that I'm missing something
 fundamental.

Well, I'm sure there is some effect, but my experiments seem to
indicate that it's not a very important one.  Again, please feel free
to provide contrary evidence.  I think the basic issue is that - in
the best possible case - padding the LWLocks so that you don't have
two locks sharing a cache line can reduce contention on the busier
lock by at most 2x.  (The less busy lock may get a larger reduction
but that may not help you much.)  If what you really need is for
the contention to decrease by 1000x, you're just not really moving the
needle.  That's why the basic fast-relation-lock patch helps so much:
it replaces a system where every lock request results in contention
with a system where NONE of them do.

 SInvalReadLock is a good example of a lock which seems barely to be
 necessary at all.  Except on extraordinarily rare occasions, it is
 only ever taken in share mode.

 This is the case we should optimize for, I think. Currently, there's
 contention even between two SHARE lockers of a LWLock because our
 LWLock implementation uses a spinlock underneath. I've googled around
 a bit, and found this:

 http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git

 It's a userspace rwlock implementation based on atomic instructions
 and futexes (for blocking) written by Linus Torvalds as a response
 to the following thread

 http://lwn.net/Articles/284741/

 It seems to require only a single atomic increment instruction to
 acquire a share lock in the uncontested case, and a single compare-
 and-exchange instruction to acquire an exclusive lock (again in
 the uncontested case).

 The code doesn't seem to have a license attached, and seems to have
 been written over a weekend and never touched since, so we might
 not want (or be able) to use this directly. But it'd still be
 interesting to see if replacing our LWLocks with this implementation
 affects throughput.

I tried rewriting the LWLocks using CAS.  It actually seems to make
things slightly worse on the tests I've done so far, perhaps because I
didn't make it respect spins_per_delay.  Perhaps fetch-and-add would
be better, but I'm not holding my breath.  Everything I'm seeing so
far leads me to the belief that we need to get rid of the contention
altogether, not just contend more quickly.

 Only SICleanupQueue() needs to lock
 out readers, and in the (probably common) case where all backends are
 reasonably close to being caught up, it wouldn't even matter if the
 determination of minMsgNum were a little bit fuzzy.  I've been
 straining my brain trying to figure out whether there's some way to
 rewrite this logic so that it can run concurrently with readers; then
 we could get rid of SInvalReadLock altogether.  I have a feeling if I
 were 25% smarter I could do it, but I'm not.

 This sounds like something which could be done with RCU (Read-Copy-Update)
 in a lockless way. The idea is to copy the data structure (or parts)
 of it before updating it, and to commit the change afterwards by
 adjusting a single pointer to point to the new structure (kinda like
 we do atomic commits by an atomic update of one bit in the clog).

 The hard part is usually to figure when it's OK to clean up or
 reuse the old version of the data structure - for that, you need to
 know that all previous readers have completed.

 Here's a rough description of how that could work for the invalidation
 queue. We'd have two queue structures in shared memory, each containing
 a version number. We'd also have a current pointer pointing to the
 most recent one. Each reader would publish the version number of the
 queue he's about to read (the current one) in shared memory
 (and issue a write barrier). SICleanupQueue() would check whether the
 non-current queue structure is unused by comparing its version to the
 published versions of all backends (after issuing a read barrier, and
 after taking a lock that prevents concurrent SICleanupQueue runs of
 course). If there still are users of that version, SICleanupQueue()
 out-waits them. Afterwards, it simply copies the current structure
 over the old one, does the necessary cleanups, increments the version
 number to be larger than the one of the current structure,
 and *then* updates the current pointer.

Interesting idea.

 So my fallback position is to try to find some way to optimize the
 common case where there are no messages to be read.  We're already
 allowing readers and writers to run in parallel, so it must not be
 important for a reader to see a message that a writer is in the
 process of writing, but has 

Re: [HACKERS] spinlock contention

2011-06-23 Thread Florian Pflug
On Jun23, 2011, at 22:15 , Robert Haas wrote:
 On Thu, Jun 23, 2011 at 2:34 PM, Florian Pflug f...@phlo.org wrote:
 It seems hard to believe that there isn't some effect of two locks
 sharing a cache line. There are architectures (even some of the
 Intel architectures, I believe) where cache lines are 32 bytes, though -
 so measuring this certainly isn't easy. Also, I'm absolutely no
 expert on this, so it might very well be that I'm missing something
 fundamental.
 
 Well, I'm sure there is some effect, but my experiments seem to
 indicate that it's not a very important one.  Again, please feel free
 to provide contrary evidence.  I think the basic issue is that - in
 the best possible case - padding the LWLocks so that you don't have
 two locks sharing a cache line can reduce contention on the busier
 lock by at most 2x.  (The less busy lock may get a larger reduction
 but that may not help you much.)  If what you really need is for
 the contention to decrease by 1000x, you're just not really moving the
 needle.

Agreed. OTOH, adding a few dummy entries to the LWLocks array to separate
the most heavily contested LWLocks from the others might still be
worthwhile.

 That's why the basic fast-relation-lock patch helps so much:
 it replaces a system where every lock request results in contention
 with a system where NONE of them do.
 
 I tried rewriting the LWLocks using CAS.  It actually seems to make
 things slightly worse on the tests I've done so far, perhaps because I
 didn't make it respect spins_per_delay.  Perhaps fetch-and-add would
 be better, but I'm not holding my breath.  Everything I'm seeing so
 far leads me to the belief that we need to get rid of the contention
 altogether, not just contend more quickly.

Is there a patch available? How did you do the slow path (i.e. the
case where there's contention and you need to block)? It seems to
me that without some kernel support like futexes it's impossible
to do better than LWLocks already do, because any simpler scheme
like
  while (atomic_inc_post(lock) < 0) {
atomic_dec(lock);
block(lock);
  }
for the shared-locker case suffers from a race condition (the lock
might be released before you actually block()).

 There is an obvious (potential) memory-ordering problem here.  If it's
 possible for a backend doing AcceptInvalidationMessages() to see a
 stale version of the counter, then it might fail to read messages from
 the queue that it really ought to have seen.  Protecting the global
 counter with a spinlock defeats the purpose of doing any of this in
 the first place, because the whole problem is too many people fighting
 over the spinlock protecting SInvalReadLock.  It should be sufficient,
 though, for the writing backend to execute a memory-barrier operation
 just after bumping the global counter and the reading backend to
 execute one just before.  As it happens, LWLockRelease() is such a
 barrier, since it must acquire and release a spinlock, so we should be
 able to just bump the counter right before releasing the LWLock and
 call it good.  On the reading side, the immediately-preceding lock
 acquisition seems like it ought to be sufficient - surely it can't be
 possible to acquire a heavyweight lock on an object without seeing
 memory updates that some other process made before releasing the same
 heavyweight lock.
 
 If we start doing lockless algorithms, I really think we should add
 explicit barrier primitives. We could start out with something
 simple such as
 
 #define MemoryBarrierWrite \
  do { slock_t l; S_UNLOCK(&l); } while (0)
 #define MemoryBarrierRead \
  do { slock_t l; S_INIT_LOCK(&l); S_LOCK(&l); } while (0)
 #define MemoryBarrierReadWrite \
  do { slock_t l; S_INIT_LOCK(&l); S_LOCK(&l); S_UNLOCK(&l); } while (0)
 
 But yeah, it seems that in the case of the sinval queue, the surrounding
 LWLocks actually provide the necessary barriers.
 
 Yes, a set of general memory barrier primitives would be handy.
 Unfortunately, it would probably be quite a bit of work to develop
 this and verify that it works properly on all supported platforms.

The idea would be to start out with something trivial like the above.
Maybe with an #if for compilers which have something like GCC's
__sync_synchronize(). We could then gradually add implementations
for specific architectures, hopefully done by people who actually
own the hardware and can test.

best regards,
Florian Pflug





Re: [HACKERS] spinlock contention

2011-06-23 Thread Robert Haas
On Thu, Jun 23, 2011 at 5:35 PM, Florian Pflug f...@phlo.org wrote:
 Well, I'm sure there is some effect, but my experiments seem to
 indicate that it's not a very important one.  Again, please feel free
 to provide contrary evidence.  I think the basic issue is that - in
 the best possible case - padding the LWLocks so that you don't have
 two locks sharing a cache line can reduce contention on the busier
 lock by at most 2x.  (The less busy lock may get a larger reduction
 but that may not help you much.)  If what you really need is for
 the contention to decrease by 1000x, you're just not really moving the
 needle.

 Agreed. OTOH, adding a few dummy entries to the LWLocks array to separate
 the most heavily contested LWLocks from the others might still be
 worthwhile.

Hey, if we can show that it works, sign me up.

 That's why the basic fast-relation-lock patch helps so much:
 it replaces a system where every lock request results in contention
 with a system where NONE of them do.

 I tried rewriting the LWLocks using CAS.  It actually seems to make
 things slightly worse on the tests I've done so far, perhaps because I
 didn't make it respect spins_per_delay.  Perhaps fetch-and-add would
 be better, but I'm not holding my breath.  Everything I'm seeing so
 far leads me to the belief that we need to get rid of the contention
 altogether, not just contend more quickly.

 Is there a patch available? How did you do the slow path (i.e. the
 case where there's contention and you need to block)? It seems to
 me that without some kernel support like futexes it's impossible
 to do better than LWLocks already do, because any simpler scheme
 like
  while (atomic_inc_post(lock) < 0) {
    atomic_dec(lock);
    block(lock);
  }
 for the shared-locker case suffers from a race condition (the lock
 might be released before you actually block()).

Attached...

 The idea would be to start out with something trivial like the above.
 Maybe with an #if for compilers which have something like GCC's
 __sync_synchronize(). We could then gradually add implementations
 for specific architectures, hopefully done by people who actually
 own the hardware and can test.

Yes.  But if we go that route, then we have to also support a code
path for architectures for which we don't have that support.  That's
going to be more work, so I don't want to do it until we have a case
where there is a good, clear benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


lwlock-v1.patch
Description: Binary data



Re: [HACKERS] SYNONYMS (again)

2011-06-23 Thread Gurjeet Singh
On Wed, Jun 22, 2011 at 3:37 PM, Joshua D. Drake j...@commandprompt.com wrote:

 Per:

 http://archives.postgresql.org/pgsql-hackers/2010-11/msg02043.php

 It seems we did come up with a use case in the procpid discussion. The
 ability to change the names of columns/databases etc, to handle the fixing
 of bad decisions made during development over time.

 Thoughts?


Instead of just synonyms of columns, why don't we think about implementing
virtual columns (feature as named in other RDBMS). This is the ability to
define a column in a table which is derived using an expression around other
non-virtual columns. I agree it would be much more difficult and some may
even argue it is pointless in the presence of views and expression indexes,
but I leave that as an exercise for others.

Regards,
-- 
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] crash-safe visibility map, take five

2011-06-23 Thread Robert Haas
On Wed, Jun 22, 2011 at 10:40 PM, Jeff Davis pg...@j-davis.com wrote:
 1. Torn pages -- not a problem because it's a single bit and idempotent.

Right.

 2. PD_ALL_VISIBLE bit makes it to disk before a WAL record representing
 an action that makes the page all-visible. Depending on what action
 makes a page all-visible:
  a. An old snapshot is released -- not a problem, because if there is a
 crash all snapshots are released.
  b. Cleanup action on the page -- not a problem, because that will
 create a WAL record and update the page's LSN before setting the
 PD_ALL_VISIBLE.

Lazy VACUUM is the only thing that makes a page all visible.  I don't
understand the part about snapshots.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crash-safe visibility map, take five

2011-06-23 Thread Jeff Davis
On Thu, 2011-06-23 at 18:18 -0400, Robert Haas wrote:
 Lazy VACUUM is the only thing that makes a page all visible.  I don't
 understand the part about snapshots.

Lazy VACUUM is the only thing that _marks_ a page with PD_ALL_VISIBLE. 

After an INSERT to a new page, and after all snapshots are released, the
page becomes all-visible, and thus subject to being marked with
PD_ALL_VISIBLE by lazy vacuum without bumping the LSN. Note that there
is no cleanup action that takes place here, so nothing else will bump
the LSN either.

So, let's say that we hypothetically had persistent snapshots, then
you'd have the following problem:

1. INSERT to a new page, marking it with LSN X
2. WAL flushed to LSN Y (Y > X)
2. Some persistent snapshot (that doesn't see the INSERT) is released,
and generates WAL recording that fact with LSN Z (Z > Y)
3. Lazy VACUUM marks the newly all-visible page with PD_ALL_VISIBLE
4. page is written out because LSN is still X
5. crash

Now, the persistent snapshot is still present because LSN Z never made
it to disk; but the page is marked with PD_ALL_VISIBLE.

Sure, if these hypothetical persistent snapshots were transactional, and
if synchronous_commit is on, then LSN Z would be flushed before step 3;
but that's another set of assumptions. That's why I left it simple and
said that the assumption was "snapshots are released if there's a
crash."

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_upgrade defaulting to port 25432

2011-06-23 Thread Bruce Momjian
Tom Lane wrote:
 Christopher Browne cbbro...@gmail.com writes:
  On Wed, Jun 15, 2011 at 5:35 PM, Bruce Momjian br...@momjian.us wrote:
  [ just recommend using a different port number during pg_upgrade ]
 
  +1...  That seems to have lots of nice properties.
 
 Yeah, that seems like an appropriate expenditure of effort.  It's surely
 not bulletproof, since someone could intentionally connect to the actual
 port number, but getting to bulletproof is a lot more work than anyone
 seems to want to do right now.  (And, as Bruce pointed out, no complete
 solution would be back-patchable anyway.)

I have created the following patch which uses 25432 as the default port
number for pg_upgrade.  It also creates two new environment variables,
OLDPGPORT and NEWPGPORT, to control the port values because we don't
want to default to PGPORT anymore.

This will allow most users to use pg_upgrade without needing to specify
port parameters, and will prevent accidental connections.  The user does
have to specify the port number for live checks, but that was always the
case because you have to use different port numbers for old/new in that
case.  I updated the error message to be clearer about this.
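A minimal restatement of the fallback described above (a sketch, with
DEF_PGUPORT standing in for 25432, not the patch itself): each cluster's
port comes from its own environment variable, falling back to the
pg_upgrade-specific default rather than to PGPORT.

```c
#include <stdlib.h>

#define DEF_PGUPORT 25432   /* pg_upgrade's own default, not PGPORT's */

/* Sketch of the patch's getenv() fallback for one cluster's port. */
static int cluster_port(const char *envvar)
{
    const char *val = getenv(envvar);
    return val ? atoi(val) : DEF_PGUPORT;
}
```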

The patch is small and might be possible for 9.1, except it changes
user-visible behavior, which would suggest it should be in 9.2, plus it
doesn't address a serious bug.  Comments?

I will backpatch a doc recommendation to use 25432 as a port number for
versions of pg_upgrade that don't get this patch.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/contrib/pg_upgrade/check.c b/contrib/pg_upgrade/check.c
new file mode 100644
index 1ee2aca..323b5f1
*** a/contrib/pg_upgrade/check.c
--- b/contrib/pg_upgrade/check.c
*** output_check_banner(bool *live_check)
*** 30,37 
  	{
  		*live_check = true;
  		if (old_cluster.port == new_cluster.port)
! 			pg_log(PG_FATAL, "When checking a live server, "
! 				   "the old and new port numbers must be different.\n");
  		pg_log(PG_REPORT, "Performing Consistency Checks on Old Live Server\n");
  		pg_log(PG_REPORT, "\n");
  	}
--- 30,37 
  	{
  		*live_check = true;
  		if (old_cluster.port == new_cluster.port)
! 			pg_log(PG_FATAL, "When checking a live old server, "
! 				   "you must specify the old server's port number.\n");
  		pg_log(PG_REPORT, "Performing Consistency Checks on Old Live Server\n");
  		pg_log(PG_REPORT, "\n");
  	}
diff --git a/contrib/pg_upgrade/option.c b/contrib/pg_upgrade/option.c
new file mode 100644
index 4401a81..7fbfa8c
*** a/contrib/pg_upgrade/option.c
--- b/contrib/pg_upgrade/option.c
*** parseCommandLine(int argc, char *argv[])
*** 58,65 
  	os_info.progname = get_progname(argv[0]);
  
  	/* Process libpq env. variables; load values here for usage() output */
! 	old_cluster.port = getenv("PGPORT") ? atoi(getenv("PGPORT")) : DEF_PGPORT;
! 	new_cluster.port = getenv("PGPORT") ? atoi(getenv("PGPORT")) : DEF_PGPORT;
  
  	os_user_effective_id = get_user_info(os_info.user);
  	/* we override just the database user name;  we got the OS id above */
--- 58,65 
  	os_info.progname = get_progname(argv[0]);
  
  	/* Process libpq env. variables; load values here for usage() output */
! 	old_cluster.port = getenv("OLDPGPORT") ? atoi(getenv("OLDPGPORT")) : DEF_PGUPORT;
! 	new_cluster.port = getenv("NEWPGPORT") ? atoi(getenv("NEWPGPORT")) : DEF_PGUPORT;
  
  	os_user_effective_id = get_user_info(os_info.user);
  	/* we override just the database user name;  we got the OS id above */
*** Options:\n\
*** 231,238 
-G, --debugfile=FILENAME  output debugging activity to file\n\
-k, --linklink instead of copying files to new cluster\n\
-l, --logfile=FILENAMElog session activity to file\n\
!   -p, --old-port=OLDPORTold cluster port number (default %d)\n\
!   -P, --new-port=NEWPORTnew cluster port number (default %d)\n\
-u, --user=NAME   cluster's superuser (default \"%s\")\n\
-v, --verbose enable verbose output\n\
-V, --version display version information, then exit\n\
--- 231,238 
-G, --debugfile=FILENAME  output debugging activity to file\n\
-k, --linklink instead of copying files to new cluster\n\
-l, --logfile=FILENAMElog session activity to file\n\
!   -p, --old-port=OLDPGPORT  old cluster port number (default %d)\n\
!   -P, --new-port=NEWPGPORT  new cluster port number (default %d)\n\
-u, --user=NAME   cluster's superuser (default \"%s\")\n\
-v, --verbose enable verbose output\n\
-V, --version display version information, then exit\n\
diff --git a/contrib/pg_upgrade/pg_upgrade.h 

Re: [HACKERS] crash-safe visibility map, take five

2011-06-23 Thread Robert Haas
On Thu, Jun 23, 2011 at 6:40 PM, Jeff Davis pg...@j-davis.com wrote:
 On Thu, 2011-06-23 at 18:18 -0400, Robert Haas wrote:
 Lazy VACUUM is the only thing that makes a page all visible.  I don't
 understand the part about snapshots.

 Lazy VACUUM is the only thing that _marks_ a page with PD_ALL_VISIBLE.

 After an INSERT to a new page, and after all snapshots are released, the
 page becomes all-visible; and thus subject to being marked with
 PD_ALL_VISIBLE by lazy vacuum without bumping the LSN. Note that there
 is no cleanup action that takes place here, so nothing else will bump
 the LSN either.

 So, let's say that we hypothetically had persistent snapshots, then
 you'd have the following problem:

 1. INSERT to a new page, marking it with LSN X
 2. WAL flushed to LSN Y (Y > X)
 2. Some persistent snapshot (that doesn't see the INSERT) is released,
 and generates WAL recording that fact with LSN Z (Z > Y)
 3. Lazy VACUUM marks the newly all-visible page with PD_ALL_VISIBLE
 4. page is written out because LSN is still X
 5. crash

 Now, the persistent snapshot is still present because LSN Z never made
 it to disk; but the page is marked with PD_ALL_VISIBLE.

 Sure, if these hypothetical persistent snapshots were transactional, and
 if synchronous_commit is on, then LSN Z would be flushed before step 3;
 but that's another set of assumptions. That's why I left it simple and
 said that the assumption was "snapshots are released if there's a
 crash."

I don't really think that's a separate set of assumptions - if we had
some system whereby snapshots could survive a crash, then they'd have
to be WAL-logged (because that's how we make things survive crashes).
And anything that is WAL-logged must obey the WAL-before-data rule.
We have a system already that ensures that when
synchronous_commit=off, CLOG pages can't be flushed before the
corresponding WAL record makes it to disk.  For a system like what
you're describing, you'd need something similar - these
crash-surviving snapshots would have to make sure that no action which
depended on their state hit the disk before the WAL record marking the
state change hit the disk.
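The WAL-before-data rule described above can be sketched as follows (a
simplified model of the idea, not PostgreSQL's actual FlushBuffer code):
before a data page goes to disk, the WAL must first be forced out at least
as far as that page's LSN.

```c
#include <assert.h>

static unsigned wal_flushed = 0;   /* how far the WAL is durable */

/* Pretend to fsync the WAL through 'upto'. */
static void xlog_flush(unsigned upto)
{
    if (upto > wal_flushed)
        wal_flushed = upto;
}

/* WAL-before-data: force the WAL first; only then is the page write safe. */
static void flush_buffer(unsigned page_lsn)
{
    xlog_flush(page_lsn);
    assert(wal_flushed >= page_lsn);
    /* ... the actual page write would happen here ... */
}
```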

I guess the point you are driving at here is that a page can only go
from being all-visible to not-all-visible by virtue of being modified.
 There's no other piece of state (like a persistent snapshot) that can
be lost as part of a crash that would make us need to change our mind and
decide that an all-visible XID is really not all-visible after all.
(The reverse is not true: since snapshots are ephemeral, a crash will
render every row either all-visible or dead.)  I guess I never thought
about documenting that particular aspect of it because (to me) it
seems fairly self-evident.  Maybe I'm wrong...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: Keywords in pg_hba.conf should be field-specific

2011-06-23 Thread Brendan Jurd
On 24 June 2011 03:48, Alvaro Herrera alvhe...@commandprompt.com wrote:
 I have touched next_token() and next_token_expand() a bit more, because
 it seemed to me that they could be simplified further (the bit about
 returning the comma in the token, instead of being a boolean return,
 seemed strange).  Please let me know what you think.

Cool.  I didn't like the way it returned the comma either, but I
thought it would be best if I kept the changes in my patch to a
minimum.

Cheers,
BJ

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade version check improvements and small fixes

2011-06-23 Thread Bruce Momjian
Dan McGee wrote:
 On Wed, Jun 22, 2011 at 8:54 PM, Bruce Momjian br...@momjian.us wrote:
  0003 is what I really wanted to solve, which was my failure with
  pg_upgrade. The call to pg_ctl didn't succeed because the binaries
  didn't match the data directory, thus resulting in this:
 
  The error had nothing to do with trust at all; it was simply that I
  tried to use 9.0 binaries with an 8.4 data directory. My patch checks
  for this and ensures that the -B bindir is the correct version, just
  as the -D datadir has to be the correct version.
 
  I had not thought about testing if the bin and data directory were the
  same major version, but you are right that it generates an odd error if
  they are not.
 
  I changed your sscanf format string because the version can be just
  9.2dev and there is no need for the minor version.
 No problem -- you're going to know way more about this than me, and I
 was just going off what I saw in my two installed versions and a quick
 look over at the code of pg_ctl to make sure that string was never
 translated.

 On a side note I don't think I ever mentioned to everyone else why
 parsing the version from pg_ctl rather than pg_config was done- this
 is so we don't introduce any additional binary requirements. Tom Lane
 noted in an earlier cleanup series that pg_config is not really needed
 at all in the old cluster, so I wanted to keep it that way, but pg_ctl
 has always been required.

Yes, I looked specifically for that.  We could have used pg_controldata
too, but pg_ctl seemed just as good.
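The parsing being discussed might look like this sketch (the exact
version-line format and pg_upgrade's real format string may differ):
scanning only major.minor lets a string like "9.2dev", which has no
patchlevel, still parse.

```c
#include <stdio.h>

/* Turn "pg_ctl (PostgreSQL) 9.0.4" or "... 9.2dev" into one comparable
 * number (900, 902).  Illustrative only; not pg_upgrade's actual code. */
static int parse_pg_ctl_version(const char *line)
{
    int major, minor;

    /* Stop at major.minor so a "dev" suffix doesn't break the scan. */
    if (sscanf(line, "pg_ctl (PostgreSQL) %d.%d", &major, &minor) != 2)
        return -1;
    return major * 100 + minor;
}
```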

  ? I saw no reason
  to test if the binary version matches the pg_upgrade version because we
  already test the cluster version, and we test the cluster version is the
  same as the binary version.
 Duh. That makes sense. Thanks for getting to these so quickly!
 
 To partially toot my own horn but also show where a user like me
 encountered this, after some packaging hacking, anyone running Arch
 Linux should be able to do a pg_upgrade from here on out by installing
 the postgresql-old-upgrade package
 (http://www.archlinux.org/packages/extra/x86_64/postgresql-old-upgrade/)
 and possibly consulting the Arch wiki.

Great.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Online base backup from the hot-standby

2011-06-23 Thread Jun Ishiduka

 What I think he is proposing would not require using pg_stop_backup()
 but you could also modify pg_stop_backup() to work as well.
 
 What do you think of this idea?
 
 Do you see how the patch can be reworked to accomplish this?

Logic that does not use pg_stop_backup() would be difficult,
because pg_stop_backup() is used to identify minRecoveryPoint.


Jun Ishizuka
NTT Software Corporation
TEL:045-317-7018
E-Mail: ishizuka@po.ntts.co.jp




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers