[HACKERS] debugging tools inside postgres

2011-06-23 Thread HuangQi
Hi,
   I'm trying to debug a modification to the query planner, but the data
structure of my planned query seems to be incorrect. I have been printing
the data structures with the "p" command in gdb, which is quite
inconvenient and takes time. Is there any built-in function in
postgres to print out the node data structures, or any other plan-related
data structures? Thanks.
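[For reference, the backend ships node-display helpers in src/backend/nodes/print.c (e.g. pprint()) that can be called straight from gdb, and there is also the debug_print_plan GUC; a sketch of both, assuming a debug build with symbols (pprint() writes to the server's stdout):]

```
(gdb) break subquery_planner
(gdb) run
(gdb) call pprint(parse)      # pretty-prints any Node * tree

-- or, without gdb, from psql (dumps the plan tree to the server log):
SET debug_print_plan = on;
```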

-- 
Best Regards
Huang Qi Victor


Re: [HACKERS] Re: patch review : Add ability to constrain backend temporary file space

2011-06-23 Thread Mark Kirkwood

On 22/06/11 11:13, Mark Kirkwood wrote:

On 21/06/11 02:39, Cédric Villemain wrote:

2011/6/20 Robert Haas:

On Mon, Jun 20, 2011 at 9:15 AM, Cédric Villemain
  wrote:

The feature does not work exactly as expected because the write limit
is only enforced in 8kB increments, since we write before checking. I
believe that if one writes a 1GB file in one pass (instead of repeated
8kB increments) and temp_file_limit is 0, then the server will write the
whole 1GB before aborting.

Can we rearrange things so we check first, and then write?

Probably, but it needs more work to catch corner cases. We may be safe
just documenting that (in the docs and also in the code). The only way I
see so far to get a write larger than 8kB here is a plugin doing the
sort instead of the postgresql core sort algorithm.




Thanks guys - will look at moving the check, and adding some 
documentation about the possible impacts of plugins (or new executor 
methods) that might write in chunks bigger than blocksz.


Maybe a few days - I'm home sick ATM, plus looking after these 
http://www.maftet.co.nz/kittens.html





This version moves the check *before* we write the new buffer, so it 
should take care of issues with really large write buffers, plugins, 
etc. Also I *think* that means there is no need to amend the documentation.


Cheers

Mark

P.S.: Hopefully I've got a context diff this time; just realized that git 
diff -c does *not* produce a context diff (doh).




temp-files-v6.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Online base backup from the hot-standby

2011-06-23 Thread Jun Ishiduka

> What I think he is proposing would not require using pg_stop_backup()
> but you could also modify pg_stop_backup() to work as well.
> 
> What do you think of this idea?
> 
> Do you see how the patch can be reworked to accomplish this?

Avoiding pg_stop_backup() would be difficult,
because pg_stop_backup() is used to identify minRecoveryPoint.


Jun Ishizuka
NTT Software Corporation
TEL:045-317-7018
E-Mail: ishizuka@po.ntts.co.jp






Re: [HACKERS] pg_upgrade version check improvements and small fixes

2011-06-23 Thread Bruce Momjian
Dan McGee wrote:
> On Wed, Jun 22, 2011 at 8:54 PM, Bruce Momjian  wrote:
> >> 0003 is what I really wanted to solve, which was my failure with
> >> pg_upgrade. The call to pg_ctl didn't succeed because the binaries
> >> didn't match the data directory, thus resulting in this:
> >>
> >> The error had nothing to do with "trust" at all; it was simply that I
> >> tried to use 9.0 binaries with an 8.4 data directory. My patch checks
> >> for this and ensures that the -b bindir is the correct version, just
> >> as the -d datadir has to be the correct version.
> >
> > I had not thought about testing if the bin and data directory were the
> > same major version, but you are right that it generates an odd error if
> > they are not.
> >
> > I changed your sscanf format string because the version can be just
> > "9.2dev" and there is no need for the minor version.
> No problem -- you're going to know way more about this than I do, and I
> was just going off what I saw in my two installed versions and a quick
> look over at the code of pg_ctl to make sure that string was never
> translated.

> On a side note I don't think I ever mentioned to everyone else why
> parsing the version from pg_ctl rather than pg_config was done- this
> is so we don't introduce any additional binary requirements. Tom Lane
> noted in an earlier cleanup series that pg_config is not really needed
> at all in the old cluster, so I wanted to keep it that way, but pg_ctl
> has always been required.

Yes, I looked specifically for that.  We could have used pg_controldata
too, but pg_ctl seemed just as good.

> > I saw no reason
> > to test if the binary version matches the pg_upgrade version because we
> > already test the cluster version, and we test the cluster version is the
> > same as the binary version.
> Duh. That makes sense. Thanks for getting to these so quickly!
> 
> To partially toot my own horn, but also to show where a user like me
> encountered this: after some packaging hacking, anyone running Arch
> Linux should be able to do a pg_upgrade from here on out by installing
> the postgresql-old-upgrade package
> (http://www.archlinux.org/packages/extra/x86_64/postgresql-old-upgrade/)
> and possibly consulting the Arch wiki.

Great.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Fwd: Keywords in pg_hba.conf should be field-specific

2011-06-23 Thread Brendan Jurd
On 24 June 2011 03:48, Alvaro Herrera  wrote:
> I have touched next_token() and next_token_expand() a bit more, because
> it seemed to me that they could be simplified further (the bit about
> returning the comma in the token, instead of being a boolean return,
> seemed strange).  Please let me know what you think.

Cool.  I didn't like the way it returned the comma either, but I
thought it would be best if I kept the changes in my patch to a
minimum.

Cheers,
BJ



Re: [HACKERS] crash-safe visibility map, take five

2011-06-23 Thread Robert Haas
On Thu, Jun 23, 2011 at 6:40 PM, Jeff Davis  wrote:
> On Thu, 2011-06-23 at 18:18 -0400, Robert Haas wrote:
>> Lazy VACUUM is the only thing that makes a page all visible.  I don't
>> understand the part about snapshots.
>
> Lazy VACUUM is the only thing that _marks_ a page with PD_ALL_VISIBLE.
>
> After an INSERT to a new page, and after all snapshots are released, the
> page becomes all-visible; and thus subject to being marked with
> PD_ALL_VISIBLE by lazy vacuum without bumping the LSN. Note that there
> is no cleanup action that takes place here, so nothing else will bump
> the LSN either.
>
> So, let's say that we hypothetically had persistent snapshots, then
> you'd have the following problem:
>
> 1. INSERT to a new page, marking it with LSN X
> 2. WAL flushed to LSN Y (Y > X)
> 3. Some persistent snapshot (that doesn't see the INSERT) is released,
> and generates WAL recording that fact with LSN Z (Z > Y)
> 4. Lazy VACUUM marks the newly all-visible page with PD_ALL_VISIBLE
> 5. Page is written out because its LSN is still X
> 6. Crash
>
> Now, the persistent snapshot is still present because LSN Z never made
> it to disk; but the page is marked with PD_ALL_VISIBLE.
>
> Sure, if these hypothetical persistent snapshots were transactional, and
> if synchronous_commit is on, then LSN Z would be flushed before step 3;
> but that's another set of assumptions. That's why I left it simple and
> said that the assumption was "snapshots are released if there's a
> crash".

I don't really think that's a separate set of assumptions - if we had
some system whereby snapshots could survive a crash, then they'd have
to be WAL-logged (because that's how we make things survive crashes).
And anything that is WAL-logged must obey the WAL-before-data rule.
We have a system already that ensures that when
synchronous_commit=off, CLOG pages can't be flushed before the
corresponding WAL record makes it to disk.  For a system like what
you're describing, you'd need something similar - these
crash-surviving snapshots would have to make sure that no action which
depended on their state hit the disk before the WAL record marking the
state change hit the disk.

I guess the point you are driving at here is that a page can only go
from being all-visible to not-all-visible by virtue of being modified.
There's no other piece of state (like a persistent snapshot) that can
be lost as part of a crash that would make us need to change our mind and
decide that an all-visible XID is really not all-visible after all.
(The reverse is not true: since snapshots are ephemeral, a crash will
render every row either all-visible or dead.)  I guess I never thought
about documenting that particular aspect of it because (to me) it
seems fairly self-evident.  Maybe I'm wrong...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] pg_upgrade defaulting to port 25432

2011-06-23 Thread Bruce Momjian
Tom Lane wrote:
> Christopher Browne  writes:
> > On Wed, Jun 15, 2011 at 5:35 PM, Bruce Momjian  wrote:
> >> [ just recommend using a different port number during pg_upgrade ]
> 
> > +1...  That seems to have lots of nice properties.
> 
> Yeah, that seems like an appropriate expenditure of effort.  It's surely
> not bulletproof, since someone could intentionally connect to the actual
> port number, but getting to bulletproof is a lot more work than anyone
> seems to want to do right now.  (And, as Bruce pointed out, no complete
> solution would be back-patchable anyway.)

I have created the following patch which uses 25432 as the default port
number for pg_upgrade.  It also creates two new environment variables,
OLDPGPORT and NEWPGPORT, to control the port values because we don't
want to default to PGPORT anymore.

This will allow most users to use pg_upgrade without needing to specify
port parameters, and will prevent accidental connections.  The user does
have to specify the port number for live checks, but that was always the
case because you have to use different port numbers for old/new in that
case.  I updated the error message to be clearer about this.

The patch is small and might be possible for 9.1, except that it changes
user-visible behavior, which suggests it should wait for 9.2; also, it
doesn't address a serious bug.  Comments?

I will backpatch a doc recommendation to use 25432 as the port number for
versions of pg_upgrade that don't get this patch.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/contrib/pg_upgrade/check.c b/contrib/pg_upgrade/check.c
new file mode 100644
index 1ee2aca..323b5f1
*** a/contrib/pg_upgrade/check.c
--- b/contrib/pg_upgrade/check.c
*** output_check_banner(bool *live_check)
*** 30,37 
  	{
  		*live_check = true;
  		if (old_cluster.port == new_cluster.port)
! 			pg_log(PG_FATAL, "When checking a live server, "
!    "the old and new port numbers must be different.\n");
  		pg_log(PG_REPORT, "Performing Consistency Checks on Old Live Server\n");
  		pg_log(PG_REPORT, "\n");
  	}
--- 30,37 
  	{
  		*live_check = true;
  		if (old_cluster.port == new_cluster.port)
! 			pg_log(PG_FATAL, "When checking a live old server, "
!    "you must specify the old server's port number.\n");
  		pg_log(PG_REPORT, "Performing Consistency Checks on Old Live Server\n");
  		pg_log(PG_REPORT, "\n");
  	}
diff --git a/contrib/pg_upgrade/option.c b/contrib/pg_upgrade/option.c
new file mode 100644
index 4401a81..7fbfa8c
*** a/contrib/pg_upgrade/option.c
--- b/contrib/pg_upgrade/option.c
*** parseCommandLine(int argc, char *argv[])
*** 58,65 
  	os_info.progname = get_progname(argv[0]);
  
  	/* Process libpq env. variables; load values here for usage() output */
! 	old_cluster.port = getenv("PGPORT") ? atoi(getenv("PGPORT")) : DEF_PGPORT;
! 	new_cluster.port = getenv("PGPORT") ? atoi(getenv("PGPORT")) : DEF_PGPORT;
  
  	os_user_effective_id = get_user_info(&os_info.user);
  	/* we override just the database user name;  we got the OS id above */
--- 58,65 
  	os_info.progname = get_progname(argv[0]);
  
  	/* Process libpq env. variables; load values here for usage() output */
! 	old_cluster.port = getenv("OLDPGPORT") ? atoi(getenv("OLDPGPORT")) : DEF_PGUPORT;
! 	new_cluster.port = getenv("NEWPGPORT") ? atoi(getenv("NEWPGPORT")) : DEF_PGUPORT;
  
  	os_user_effective_id = get_user_info(&os_info.user);
  	/* we override just the database user name;  we got the OS id above */
*** Options:\n\
*** 231,238 
-G, --debugfile=FILENAME  output debugging activity to file\n\
-k, --linklink instead of copying files to new cluster\n\
-l, --logfile=FILENAMElog session activity to file\n\
!   -p, --old-port=OLDPORTold cluster port number (default %d)\n\
!   -P, --new-port=NEWPORTnew cluster port number (default %d)\n\
-u, --user=NAME   clusters superuser (default \"%s\")\n\
-v, --verbose enable verbose output\n\
-V, --version display version information, then exit\n\
--- 231,238 
-G, --debugfile=FILENAME  output debugging activity to file\n\
-k, --linklink instead of copying files to new cluster\n\
-l, --logfile=FILENAMElog session activity to file\n\
!   -p, --old-port=OLDPGPORT  old cluster port number (default %d)\n\
!   -P, --new-port=NEWPGPORT  new cluster port number (default %d)\n\
-u, --user=NAME   clusters superuser (default \"%s\")\n\
-v, --verbose enable verbose output\n\
-V, --version display version information, then exit\n\
diff --git a/contrib/pg_upgrade/pg_upgrade.h b/contrib/pg_up

Re: [HACKERS] crash-safe visibility map, take five

2011-06-23 Thread Jeff Davis
On Thu, 2011-06-23 at 18:18 -0400, Robert Haas wrote:
> Lazy VACUUM is the only thing that makes a page all visible.  I don't
> understand the part about snapshots.

Lazy VACUUM is the only thing that _marks_ a page with PD_ALL_VISIBLE. 

After an INSERT to a new page, and after all snapshots are released, the
page becomes all-visible; and thus subject to being marked with
PD_ALL_VISIBLE by lazy vacuum without bumping the LSN. Note that there
is no cleanup action that takes place here, so nothing else will bump
the LSN either.

So, let's say that we hypothetically had persistent snapshots, then
you'd have the following problem:

1. INSERT to a new page, marking it with LSN X
2. WAL flushed to LSN Y (Y > X)
3. Some persistent snapshot (that doesn't see the INSERT) is released,
and generates WAL recording that fact with LSN Z (Z > Y)
4. Lazy VACUUM marks the newly all-visible page with PD_ALL_VISIBLE
5. Page is written out because its LSN is still X
6. Crash

Now, the persistent snapshot is still present because LSN Z never made
it to disk; but the page is marked with PD_ALL_VISIBLE.

Sure, if these hypothetical persistent snapshots were transactional, and
if synchronous_commit is on, then LSN Z would be flushed before step 3;
but that's another set of assumptions. That's why I left it simple and
said that the assumption was "snapshots are released if there's a
crash".

Regards,
Jeff Davis




Re: [HACKERS] crash-safe visibility map, take five

2011-06-23 Thread Robert Haas
On Wed, Jun 22, 2011 at 10:40 PM, Jeff Davis  wrote:
> 1. Torn pages -- not a problem because it's a single bit and idempotent.

Right.

> 2. PD_ALL_VISIBLE bit makes it to disk before a WAL record representing
> an action that makes the page all-visible. Depending on what action
> makes a page all-visible:
>  a. An old snapshot is released -- not a problem, because if there is a
> crash all snapshots are released.
>  b. Cleanup action on the page -- not a problem, because that will
> create a WAL record and update the page's LSN before setting the
> PD_ALL_VISIBLE.

Lazy VACUUM is the only thing that makes a page all visible.  I don't
understand the part about snapshots.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] SYNONYMS (again)

2011-06-23 Thread Gurjeet Singh
On Wed, Jun 22, 2011 at 3:37 PM, Joshua D. Drake wrote:

> Per:
>
> http://archives.postgresql.org/pgsql-hackers/2010-11/msg02043.php
>
> It seems we did come up with a use case in the procpid discussion: the
> ability to change the names of columns/databases etc., to fix
> bad naming decisions made during development over time.
>
> Thoughts?
>

Instead of just synonyms for columns, why don't we think about implementing
virtual columns (as the feature is named in other RDBMSes)? This is the ability
to define a column in a table which is derived using an expression over other
non-virtual columns. I agree it would be much more difficult, and some may
even argue it is pointless in the presence of views and expression indexes,
but I leave that as an exercise for others.

Regards,
-- 
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] spinlock contention

2011-06-23 Thread Robert Haas
On Thu, Jun 23, 2011 at 5:35 PM, Florian Pflug  wrote:
>> Well, I'm sure there is some effect, but my experiments seem to
>> indicate that it's not a very important one.  Again, please feel free
>> to provide contrary evidence.  I think the basic issue is that - in
>> the best possible case - padding the LWLocks so that you don't have
>> two locks sharing a cache line can reduce contention on the busier
>> lock by at most 2x.  (The less busy lock may get a larger reduction
>> but that may not help you much.)  If what you really need is for
>> the contention to decrease by 1000x, you're just not really moving the
>> needle.
>
> Agreed. OTOH, adding a few dummy entries to the LWLocks array to separate
> the most heavily contested LWLocks from the others might still be
> worthwhile.

Hey, if we can show that it works, sign me up.

>> That's why the basic fast-relation-lock patch helps so much:
>> it replaces a system where every lock request results in contention
>> with a system where NONE of them do.
>>
>> I tried rewriting the LWLocks using CAS.  It actually seems to make
>> things slightly worse on the tests I've done so far, perhaps because I
>> didn't make it respect spins_per_delay.  Perhaps fetch-and-add would
>> be better, but I'm not holding my breath.  Everything I'm seeing so
>> far leads me to the belief that we need to get rid of the contention
>> altogether, not just contend more quickly.
>
> Is there a patch available? How did you do the slow path (i.e. the
> case where there's contention and you need to block)? It seems to
> me that without some kernel support like futexes it's impossible
> to do better than LWLocks already do, because any simpler scheme
> like
>  while (atomic_inc_post(lock) > 0) {
>    atomic_dec(lock);
>    block(lock);
>  }
> for the shared-locker case suffers from a race condition (the lock
> might be released before you actually block()).

Attached...

> The idea would be to start out with something trivial like the above.
> Maybe with an #if for compilers which have something like GCC's
> __sync_synchronize(). We could then gradually add implementations
> for specific architectures, hopefully done by people who actually
> own the hardware and can test.

Yes.  But if we go that route, then we have to also support a code
path for architectures for which we don't have that support.  That's
going to be more work, so I don't want to do it until we have a case
where there is a good, clear benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


lwlock-v1.patch
Description: Binary data



Re: [HACKERS] spinlock contention

2011-06-23 Thread Florian Pflug
On Jun23, 2011, at 22:15 , Robert Haas wrote:
> On Thu, Jun 23, 2011 at 2:34 PM, Florian Pflug  wrote:
>> It seems hard to believe that there isn't some effect of two locks
>> sharing a cache line. There are architectures (even some of the
>> Intel architectures, I believe) where cache lines are 32 bytes, though -
>> so measuring this certainly isn't easy. Also, I'm absolutely no
>> expert on this, so it might very well be that I'm missing something
>> fundamental.
> 
> Well, I'm sure there is some effect, but my experiments seem to
> indicate that it's not a very important one.  Again, please feel free
> to provide contrary evidence.  I think the basic issue is that - in
> the best possible case - padding the LWLocks so that you don't have
> two locks sharing a cache line can reduce contention on the busier
> lock by at most 2x.  (The less busy lock may get a larger reduction
> but that may not help you much.)  If what you really need is for
> the contention to decrease by 1000x, you're just not really moving the
> needle.

Agreed. OTOH, adding a few dummy entries to the LWLocks array to separate
the most heavily contested LWLocks from the others might still be
worthwhile.

> That's why the basic fast-relation-lock patch helps so much:
> it replaces a system where every lock request results in contention
> with a system where NONE of them do.
> 
> I tried rewriting the LWLocks using CAS.  It actually seems to make
> things slightly worse on the tests I've done so far, perhaps because I
> didn't make it respect spins_per_delay.  Perhaps fetch-and-add would
> be better, but I'm not holding my breath.  Everything I'm seeing so
> far leads me to the belief that we need to get rid of the contention
> altogether, not just contend more quickly.

Is there a patch available? How did you do the slow path (i.e. the
case where there's contention and you need to block)? It seems to
me that without some kernel support like futexes it's impossible
to do better than LWLocks already do, because any simpler scheme
like
  while (atomic_inc_post(lock) > 0) {
atomic_dec(lock);
block(lock);
  }
for the shared-locker case suffers from a race condition (the lock
might be released before you actually block()).

>>> There is an obvious (potential) memory-ordering problem here.  If it's
>>> possible for a backend doing AcceptInvalidationMessages() to see a
>>> stale version of the counter, then it might fail to read messages from
>>> the queue that it really ought to have seen.  Protecting the global
>>> counter with a spinlock defeats the purpose of doing any of this in
>>> the first place, because the whole problem is too many people fighting
>>> over the spinlock protecting SInvalReadLock.  It should be sufficient,
>>> though, for the writing backend to execute a memory-barrier operation
>>> just after bumping the global counter and the reading backend to
>>> execute one just before.  As it happens, LWLockRelease() is such a
>>> barrier, since it must acquire and release a spinlock, so we should be
>>> able to just bump the counter right before releasing the LWLock and
>>> call it good.  On the reading side, the immediately-preceding lock
>>> acquisition seems like it ought to be sufficient - surely it can't be
>>> possible to acquire a heavyweight lock on an object without seeing
>>> memory updates that some other process made before releasing the same
>>> heavyweight lock.
>> 
>> If we start doing lockless algorithms, I really think we should add
>> explicit barrier primitives. We could start out with something
>> simple such as
>> 
>> #define MemoryBarrierWrite \
>>  do { slock_t l; S_INIT_LOCK(&l); S_UNLOCK(&l); } while (0)
>> #define MemoryBarrierRead \
>>  do { slock_t l; S_INIT_LOCK(&l); S_LOCK(&l); } while (0)
>> #define MemoryBarrierReadWrite \
>>  do { slock_t l; S_INIT_LOCK(&l); S_LOCK(&l); S_UNLOCK(&l); } while (0)
>> 
>> But yeah, it seems that in the case of the sinval queue, the surrounding
>> LWLocks actually provide the necessary barriers.
> 
> Yes, a set of general memory barrier primitives would be handy.
> Unfortunately, it would probably be quite a bit of work to develop
> this and verify that it works properly on all supported platforms.

The idea would be to start out with something trivial like the above.
Maybe with an #if for compilers which have something like GCC's
__sync_synchronize(). We could then gradually add implementations
for specific architectures, hopefully done by people who actually
own the hardware and can test.

best regards,
Florian Pflug





Re: [HACKERS] spinlock contention

2011-06-23 Thread Robert Haas
On Thu, Jun 23, 2011 at 2:34 PM, Florian Pflug  wrote:
> It seems hard to believe that there isn't some effect of two locks
> sharing a cache line. There are architectures (even some of the
> Intel architectures, I believe) where cache lines are 32 bytes, though -
> so measuring this certainly isn't easy. Also, I'm absolutely no
> expert on this, so it might very well be that I'm missing something
> fundamental.

Well, I'm sure there is some effect, but my experiments seem to
indicate that it's not a very important one.  Again, please feel free
to provide contrary evidence.  I think the basic issue is that - in
the best possible case - padding the LWLocks so that you don't have
two locks sharing a cache line can reduce contention on the busier
lock by at most 2x.  (The less busy lock may get a larger reduction
but that may not help you much.)  If what you really need is for
the contention to decrease by 1000x, you're just not really moving the
needle.  That's why the basic fast-relation-lock patch helps so much:
it replaces a system where every lock request results in contention
with a system where NONE of them do.

>> SInvalReadLock is a good example of a lock which seems barely to be
>> necessary at all.  Except on extraordinarily rare occasions, it is
>> only ever taken in share mode.
>
> This is the case we should optimize for, I think. Currently, there's
> contention even between two SHARE lockers of a LWLock because our
> LWLock implementation uses a spinlock underneath. I've googled around
> a bit, and found this:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git
>
> It's a userspace rwlock implementation based on atomic instructions
> and futexes (for blocking) written by Linus Torvalds as a response
> to the following thread
>
> http://lwn.net/Articles/284741/
>
> It seems to require only a single atomic increment instruction to
> acquire a share lock in the uncontested case, and a single compare-
> and-exchange instruction to acquire an exclusive lock (again in
> the uncontested case).
>
> The code doesn't seem to have a license attached, and seems to have
> been written over a weekend and never touched since, so we might
> not want (or be able to) to use this directly. But it'd still be
> interesting to see if replacing our LWLocks with this implementation
> affects throughput.

I tried rewriting the LWLocks using CAS.  It actually seems to make
things slightly worse on the tests I've done so far, perhaps because I
didn't make it respect spins_per_delay.  Perhaps fetch-and-add would
be better, but I'm not holding my breath.  Everything I'm seeing so
far leads me to the belief that we need to get rid of the contention
altogether, not just contend more quickly.

>> Only SICleanupQueue() needs to lock
>> out readers, and in the (probably common) case where all backends are
>> reasonably close to being caught up, it wouldn't even matter if the
>> determination of minMsgNum were a little bit fuzzy.  I've been
>> straining my brain trying to figure out whether there's some way to
>> rewrite this logic so that it can run concurrently with readers; then
>> we could get rid of SInvalReadLock altogether.  I have a feeling if I
>> were 25% smarter I could do it, but I'm not.
>
> This sounds like something which could be done with RCU (Read-Copy-Update)
> in a lockless way. The idea is to copy the data structure (or parts)
> of it before updating it, and to "commit" the change afterwards by
> adjusting a single pointer to point to the new structure (kinda like
> we do atomic commits by an atomic update of one bit in the clog).
>
> The hard part is usually to figure when it's OK to clean up or
> reuse the old version of the data structure - for that, you need to
> know that all previous readers have completed.
>
> Here's a rough description of how that could work for the invalidation
> queue. We'd have two queue structures in shared memory, each containing
> a version number. We'd also have a "current" pointer pointing to the
> most recent one. Each reader would publish the version number of the
> queue he's about to read (the "current" one) in shared memory
> (and issue a write barrier). SICleanupQueue() would check whether the
> non-current queue structure is unused by comparing its version to the
> published versions of all backends (after issuing a read barrier, and
> after taking a lock that prevents concurrent SICleanupQueue runs of
> course). If there still are users of that version, SICleanupQueue()
> out-waits them. Afterwards, it simply copies the current structure
> over the old one, does the necessary cleanups, increments the version
> number to be larger than the one of the "current" structure,
> and *then* updates the "current" pointer.

Interesting idea.

>> So my fallback position is to try to find some way to optimize the
>> common case where there are no messages to be read.  We're already
>> allowing readers and writers to run in parallel, so it must not be
>> important f

Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe

2011-06-23 Thread Robert Haas
On Wed, Jun 22, 2011 at 11:26 AM, Simon Riggs  wrote:
> For people that still think there is still risk, I've added a
> parameter called "relaxed_ddl_locking" which defaults to "off". That
> way people can use this feature in an informed way without exposing us
> to support risks from its use.

I can't get excited about adding a GUC that says "you can turn this
off if you want but don't blame us if it breaks".  That just doesn't
seem like good software engineering to me.

> I think we should be clear that this adds *one* lock/unlock for each
> object for each session, ONCE only after each transaction that
> executed a DDL statement against an object. So with 200 sessions, a
> typical ALTER TABLE statement will cause 400 locks. The current lock
> manager maxes out above 50,000 locks per second, so any comparative
> analysis will show this is a minor blip on even a heavy workload.

I think that using locks to address this problem is the wrong
approach.  I think the right fix is to use something other than
SnapshotNow to do the system catalog scans.  However, if we were going
to use locking, then we'd need to ensure that:

(1) No transaction which has made changes to a system catalog may
commit while some other transaction is in the middle of scanning that
catalog.
(2) No transaction which has made changes to a set of system catalogs
may commit while some other transaction is in the midst of fetching
interrelated data from a subset of those catalogs.

It's important to note, I think, that the problem here doesn't occur
at the time that the table rows are updated, because those rows aren't
visible yet.  The problem happens at commit time.  So unless I'm
missing something, ALTER TABLE would only need to acquire the relation
definition lock after it has finished all of its other work.  In fact,
it doesn't even really need to acquire it then, either.  It could just
queue up a list of relation definition locks to acquire immediately
before commit, which would be advantageous if the transaction went on
to abort, since in that case we'd skip the work of acquiring the locks
at all.

In fact, without doing that, this patch would undo much of the point
of the original ALTER TABLE lock strength reduction:

begin;
alter table foo alter column a set storage plain;

select * from foo;

In master, the SELECT completes without blocking.  With this patch
applied, the SELECT blocks, just as it did in 9.0, because it can't
rebuild the relcache entry without getting the relation definition
lock, which the alter table will hold until commit.  In fact, the same
thing happens even if foo has been accessed previously in the same
session.  If there is already an open *transaction* that has accessed
foo, then, it seems, it can continue to do so until it commits.  But
as soon as it tries to start a new transaction, it blocks again.  I
don't quite understand why that is - I didn't think we invalidated the
entire relcache on every commit.  But that's what happens.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SYNONYMS (again)

2011-06-23 Thread Gurjeet Singh
On Thu, Jun 23, 2011 at 2:58 PM, Kevin Grittner  wrote:

> Gurjeet Singh  wrote:
>
> > Instead of just synonyms of columns, why don't we think about
> > implementing virtual columns (feature as named in other RDBMS). This is
> > the ability to define a column in a table which is derived using an
> > expression around other non-virtual columns.
>
> How do you see that working differently from what PostgreSQL can
> currently do?
>
> test=# create table line_item(id int primary key not null, quantity int
> not null, unit_price numeric(13,2));
> NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
> "line_item_pkey" for table "line_item"
> CREATE TABLE
> test=# insert into line_item values (1,15,'12.53'),(2,5,'16.23');
> INSERT 0 2
> test=# create function line_total(line_item) returns numeric(13,2)
> language sql immutable as $$ select ($1.quantity *
> $1.unit_price)::numeric(13,2);$$;
> CREATE FUNCTION
> test=# select li.id, li.line_total from line_item li;
>  id | line_total
> ----+------------
>  1 | 187.95
>  2 |  81.15
> (2 rows)
>

For one, this column is not part of the table, so we can't gather statistics
on it to help the optimizer.

We can't create primary keys on this expression.

Also, say the query wasn't fetching all the columns and we had just the
line_total call in the SELECT list; the executor still has to fetch the whole
row and pass it on to the function even though the function uses only part of
the row (2 columns in this case).

Regards,
-- 
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] spinlock contention

2011-06-23 Thread Merlin Moncure
On Thu, Jun 23, 2011 at 1:34 PM, Florian Pflug  wrote:
> On Jun23, 2011, at 17:42 , Robert Haas wrote:
>> On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug  wrote:
>>> On Jun12, 2011, at 23:39 , Robert Haas wrote:
>>>> So, the majority (60%) of the excess spinning appears to be due to
>>>> SInvalReadLock.  A good chunk are due to ProcArrayLock (25%).
>>>
>>> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
>>> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
>>> so I guess that two adjacent LWLocks currently share one cache line.
>>>
>>> Currently, the ProcArrayLock has index 4 while SInvalReadLock has
>>> index 5, so if I'm not mistaken exactly the two locks where you saw
>>> the largest contention on are on the same cache line...
>>>
>>> Might make sense to try and see if these numbers change if you
>>> either make LWLockPadded 64bytes or arrange the locks differently...
>>
>> I fooled around with this a while back and saw no benefit.  It's
>> possible a more careful test would turn up something, but I think the
>> only real way forward here is going to be to eliminate some of that
>> locking altogether.
>
> It seems hard to believe that there isn't some effect of two locks
> sharing a cache line. There are architectures (even some of the
> Intel architectures, I believe) where cache lines are 32 bytes, though -
> so measuring this certainly isn't easy. Also, I'm absolutely no
> expert in this, so it might very well be that I'm missing something
> fundamental.
>
>> SInvalReadLock is a good example of a lock which seems barely to be
>> necessary at all.  Except on extraordinarily rare occasions, it is
>> only ever taken in share mode.
>
> This is the case we should optimize for, I think. Currently, there's
> contention even between two SHARE lockers of a LWLock because our
> LWLock implementation uses a spinlock underneath. I've googled around
> a bit, and found this:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git
>
> It's a userspace rwlock implementation based on atomic instructions
> and futexes (for blocking), written by Linus Torvalds as a response
> to the following thread
>
> http://lwn.net/Articles/284741/
>
> It seems to require only a single atomic increment instruction to
> acquire a share lock in the uncontested case, and a single compare-
> and-exchange instruction to acquire an exclusive lock (again in
> the uncontested case).
>
> The code doesn't seem to have a license attached, and seems to have
> been written over a weekend and never touched since, so we might
> not want (or be able to) to use this directly. But it'd still be
> interesting to see if replacing our LWLocks with this implementation
> affects throughput.
>
>> Only SICleanupQueue() needs to lock
>> out readers, and in the (probably common) case where all backends are
>> reasonably close to being caught up, it wouldn't even matter if the
>> determination of minMsgNum were a little bit fuzzy.  I've been
>> straining my brain trying to figure out whether there's some way to
>> rewrite this logic so that it can run concurrently with readers; then
>> we could get rid of SInvalReadLock altogether.  I have a feeling if I
>> were 25% smarter I could do it, but I'm not.
>
> This sounds like something which could be done with RCU (Read-Copy-Update)
> in a lockless way. The idea is to copy the data structure (or parts)
> of it before updating it, and to "commit" the change afterwards by
> adjusting a single pointer to point to the new structure (kinda like
> we do atomic commits by an atomic update of one bit in the clog).
>
> The hard part is usually to figure out when it's OK to clean up or
> reuse the old version of the data structure - for that, you need to
> know that all previous readers have completed.
>
> Here's a rough description of how that could work for the invalidation
> queue. We'd have two queue structures in shared memory, each containing
> a version number. We'd also have a "current" pointer pointing to the
> most recent one. Each reader would publish the version number of the
> queue he's about to read (the "current" one) in shared memory
> (and issue a write barrier). SICleanupQueue() would check whether the
> non-current queue structure is unused by comparing its version to the
> published versions of all backends (after issuing a read barrier, and
> after taking a lock that prevents concurrent SICleanupQueue runs of
> course). If there still are users of that version, SICleanupQueue()
> out-waits them. Afterwards, it simply copies the current structure
> over the old one, does the necessary cleanups, increments the version
> number to be larger than the one of the "current" structure,
> and *then* updates the "current" pointer.
>
>> So my fallback position is to try to find some way to optimize the
>> common case where there are no messages to be read.  We're already
>> allowing readers and writers to run in parallel, so it must not be
>> important for a reader to see a message that a writer is in the
>> process of writing, but has not yet finished writing.

Re: [HACKERS] spinlock contention

2011-06-23 Thread Robert Haas
On Thu, Jun 23, 2011 at 12:19 PM, Heikki Linnakangas
 wrote:
> On 23.06.2011 18:42, Robert Haas wrote:
>> ProcArrayLock looks like a tougher nut to crack - there's simply no
>> way, with the system we have right now, that you can take a snapshot
>> without locking the list of running processes.  I'm not sure what to
>> do about that, but we're probably going to have to come up with
>> something, because it seems clear that once we eliminate the lock
>> manager LWLock contention, this is a major bottleneck.
>
> ProcArrayLock is currently held for a relatively long period of time when a
> snapshot is taken, especially if there's a lot of backends. There are three
> operations to the procarray:
>
> 1. Advertising a new xid belonging to my backend.
> 2. Ending a transaction.
> 3. Getting a snapshot.
>
> Advertising a new xid is currently done without a lock, assuming that
> setting an xid in shared memory is atomic.

The issue isn't just whether it's atomic, but whether some other
backend scanning the ProcArray just afterwards might fail to see the
update due to memory-ordering effects.  But I think even if they do,
there's no real harm done.  Since we're holding XidGenLock, we know
that the XID we are advertising is higher than any XID previously
advertised, and in particular it must be higher than
latestCompletedXid, so GetSnapshotData() will ignore it anyway.

I am rather doubtful that this code is strictly safe:

myproc->subxids.xids[nxids] = xid;
myproc->subxids.nxids = nxids + 1;

In practice, the window for a race condition is minimal, because (a)
in many cases, nxids will be on the same cache line as the xid being
set; (b) even if it's not, we almost immediately release XidGenLock,
which acts as a memory barrier; (c) on many common architectures, such
as x86, stores are guaranteed to become visible in execution order;
(d) if, on some architecture, you managed to see the stores out of
order and thus loaded a garbage XID from the end of the array, it
might happen to be zero (which would be ignored, as a non-normal
transaction ID) or the transaction might happen never to examine a
tuple with that XID anyway.
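To illustrate the ordering hazard and one conventional fix, here is a hedged C11 sketch: the xid must become visible before the incremented count, or a concurrent scanner could read a garbage xid just past the old array end. The struct below loosely mimics the subxids cache but is invented for this example; it is not the real PGPROC definition, and the release/acquire pairing stands in for the XidGenLock barrier discussed above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Invented stand-in for the subxids cache; not the real PGPROC layout. */
#define MAX_CACHED_SUBXIDS 64

typedef struct {
    uint32_t xids[MAX_CACHED_SUBXIDS];
    atomic_int nxids;
} SubXids;

static void add_subxid(SubXids *sx, uint32_t xid)
{
    int n = atomic_load_explicit(&sx->nxids, memory_order_relaxed);
    sx->xids[n] = xid;
    /* release: the xid store above cannot be reordered past this store */
    atomic_store_explicit(&sx->nxids, n + 1, memory_order_release);
}

static int scan_subxids(SubXids *sx, uint32_t xid)
{
    /* acquire: pairs with the release store in add_subxid, so any
     * element below n is guaranteed to be fully written */
    int n = atomic_load_explicit(&sx->nxids, memory_order_acquire);
    for (int i = 0; i < n; i++)
        if (sx->xids[i] == xid)
            return 1;
    return 0;
}
```

With a plain (non-atomic) counter store, points (a) through (d) above explain why the race is rarely observed in practice; the explicit pairing just makes the guarantee unconditional.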

> To end a transaction, you acquire
> ProcArrayLock in exclusive mode. To get a snapshot, you acquire
> ProcArrayLock in shared mode, and scan the whole procarray.
>
> I wonder if it would be a better tradeoff to keep a "materialized" snapshot
> in shared memory that's updated every time a new xid is assigned or a
> transaction ends. Getting a snapshot would become much cheaper, as you could
> just memcpy the ready-built snapshot from shared memory into local memory.

At least in the case of the pgbench -S workload, I doubt that would
help at all.  The problem isn't that the LWLocks are being held for
too long -- they are all being held in shared mode and don't conflict
anyway.  The problem is that everyone has to acquire and release the
spinlock in order to acquire the LWLock, and again to release it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] SYNONYMS (again)

2011-06-23 Thread Kevin Grittner
Gurjeet Singh  wrote:
 
> Instead of just synonyms of columns, why don't we think about
> implementing virtual columns (feature as named in other RDBMS). This is
> the ability to define a column in a table which is derived using an
> expression around other non-virtual columns.
 
How do you see that working differently from what PostgreSQL can
currently do?
 
test=# create table line_item(id int primary key not null, quantity int
not null, unit_price numeric(13,2));
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"line_item_pkey" for table "line_item"
CREATE TABLE
test=# insert into line_item values (1,15,'12.53'),(2,5,'16.23');
INSERT 0 2
test=# create function line_total(line_item) returns numeric(13,2)
language sql immutable as $$ select ($1.quantity *
$1.unit_price)::numeric(13,2);$$;
CREATE FUNCTION
test=# select li.id, li.line_total from line_item li;
 id | line_total
----+------------
  1 | 187.95
  2 |  81.15
(2 rows)
 
-Kevin



Re: [HACKERS] spinlock contention

2011-06-23 Thread Florian Pflug
On Jun23, 2011, at 17:42 , Robert Haas wrote:
> On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug  wrote:
>> On Jun12, 2011, at 23:39 , Robert Haas wrote:
>>> So, the majority (60%) of the excess spinning appears to be due to
>>> SInvalReadLock.  A good chunk are due to ProcArrayLock (25%).
>> 
>> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
>> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
>> so I guess that two adjacent LWLocks currently share one cache line.
>> 
>> Currently, the ProcArrayLock has index 4 while SInvalReadLock has
>> index 5, so if I'm not mistaken exactly the two locks where you saw
>> the largest contention on are on the same cache line...
>> 
>> Might make sense to try and see if these numbers change if you
>> either make LWLockPadded 64bytes or arrange the locks differently...
> 
> I fooled around with this a while back and saw no benefit.  It's
> possible a more careful test would turn up something, but I think the
> only real way forward here is going to be to eliminate some of that
> locking altogether.

It seems hard to believe that there isn't some effect of two locks
sharing a cache line. There are architectures (even some of the
Intel architectures, I believe) where cache lines are 32 bytes, though -
so measuring this certainly isn't easy. Also, I'm absolutely no
expert in this, so it might very well be that I'm missing something
fundamental.

> SInvalReadLock is a good example of a lock which seems barely to be
> necessary at all.  Except on extraordinarily rare occasions, it is
> only ever taken in share mode.

This is the case we should optimize for, I think. Currently, there's
contention even between two SHARE lockers of a LWLock because our
LWLock implementation uses a spinlock underneath. I've googled around
a bit, and found this:

http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git

It's a userspace rwlock implementation based on atomic instructions
and futexes (for blocking), written by Linus Torvalds as a response
to the following thread

http://lwn.net/Articles/284741/

It seems to require only a single atomic increment instruction to
acquire a share lock in the uncontested case, and a single compare-
and-exchange instruction to acquire an exclusive lock (again in
the uncontested case).

The code doesn't seem to have a license attached, and seems to have
been written over a weekend and never touched since, so we might
not want (or be able to) to use this directly. But it'd still be
interesting to see if replacing our LWLocks with this implementation
affects throughput.
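The uncontested fast paths described above can be illustrated with C11 atomics roughly like this. This is a simplified sketch, not the code from the linked repository: the blocking/futex side is omitted, and the names and bit layout are invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified atomics-based rwlock sketch: low bits count readers,
 * one high bit marks a writer.  Contended paths (futex waits) omitted. */
#define WRITER_BIT 0x40000000u

typedef struct { atomic_uint state; } RWLock;

static bool try_share_lock(RWLock *l)
{
    /* a single atomic increment in the uncontested case */
    unsigned prev = atomic_fetch_add(&l->state, 1);
    if (prev & WRITER_BIT) {
        atomic_fetch_sub(&l->state, 1);  /* back out; would block here */
        return false;
    }
    return true;
}

static bool try_exclusive_lock(RWLock *l)
{
    /* a single compare-and-exchange from "no readers, no writer" */
    unsigned expected = 0;
    return atomic_compare_exchange_strong(&l->state, &expected, WRITER_BIT);
}

static void share_unlock(RWLock *l)     { atomic_fetch_sub(&l->state, 1); }
static void exclusive_unlock(RWLock *l) { atomic_fetch_and(&l->state, ~WRITER_BIT); }
```

Note that this avoids the separate spinlock an LWLock takes even for SHARE acquisition, which is exactly the contention point at issue.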

> Only SICleanupQueue() needs to lock
> out readers, and in the (probably common) case where all backends are
> reasonably close to being caught up, it wouldn't even matter if the
> determination of minMsgNum were a little bit fuzzy.  I've been
> straining my brain trying to figure out whether there's some way to
> rewrite this logic so that it can run concurrently with readers; then
> we could get rid of SInvalReadLock altogether.  I have a feeling if I
> were 25% smarter I could do it, but I'm not.

This sounds like something which could be done with RCU (Read-Copy-Update)
in a lockless way. The idea is to copy the data structure (or parts)
of it before updating it, and to "commit" the change afterwards by
adjusting a single pointer to point to the new structure (kinda like
we do atomic commits by an atomic update of one bit in the clog).

The hard part is usually to figure out when it's OK to clean up or
reuse the old version of the data structure - for that, you need to
know that all previous readers have completed.

Here's a rough description of how that could work for the invalidation
queue. We'd have two queue structures in shared memory, each containing
a version number. We'd also have a "current" pointer pointing to the
most recent one. Each reader would publish the version number of the
queue he's about to read (the "current" one) in shared memory
(and issue a write barrier). SICleanupQueue() would check whether the
non-current queue structure is unused by comparing its version to the
published versions of all backends (after issuing a read barrier, and
after taking a lock that prevents concurrent SICleanupQueue runs of 
course). If there still are users of that version, SICleanupQueue()
out-waits them. Afterwards, it simply copies the current structure
over the old one, does the necessary cleanups, increments the version
number to be larger than the one of the "current" structure,
and *then* updates the "current" pointer.
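A minimal single-process sketch of that double-buffered scheme follows. All names (Queue, begin_read, old_queue_in_use) are invented purely for illustration, the payload and the actual waiting logic are omitted, and versions start at 1 so that 0 can mean "not reading".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define MAX_READERS 4

typedef struct {
    uint64_t version;           /* monotonically increasing, never 0 */
    /* ... queue payload would live here ... */
} Queue;

static Queue queues[2];                              /* the two buffers */
static _Atomic(Queue *) current;                     /* most recent one */
static _Atomic uint64_t reader_version[MAX_READERS]; /* 0 = not reading */

/* Reader: publish which version we are about to read, then read it. */
static Queue *begin_read(int reader_id)
{
    Queue *q = atomic_load(&current);
    atomic_store(&reader_version[reader_id], q->version); /* write barrier */
    return q;
}

static void end_read(int reader_id)
{
    atomic_store(&reader_version[reader_id], 0);
}

/* Cleanup side: the non-current buffer may be reused only once no
 * reader still publishes its version (after a read barrier, under a
 * lock preventing concurrent cleanups). */
static int old_queue_in_use(const Queue *old)
{
    for (int i = 0; i < MAX_READERS; i++)
        if (atomic_load(&reader_version[i]) == old->version)
            return 1;
    return 0;
}
```

SICleanupQueue() would poll (or out-wait) old_queue_in_use(), copy the current structure over the old one, bump its version, and only then swing the current pointer.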

> So my fallback position is to try to find some way to optimize the
> common case where there are no messages to be read.  We're already
> allowing readers and writers to run in parallel, so it must not be
> important for a reader to see a message that a writer is in the
> process of writing, but has not yet finished writing.  I believe the
> idea here is that sinval messages should be emitted before releasing
> the related locks, so if we've successfully acquired a lock on an
> object, then any pertinent sinval messages must already be completely
> written.

Re: [HACKERS] spinlock contention

2011-06-23 Thread Heikki Linnakangas

On 23.06.2011 18:42, Robert Haas wrote:

ProcArrayLock looks like a tougher nut to crack - there's simply no
way, with the system we have right now, that you can take a snapshot
without locking the list of running processes.  I'm not sure what to
do about that, but we're probably going to have to come up with
something, because it seems clear that once we eliminate the lock
manager LWLock contention, this is a major bottleneck.


ProcArrayLock is currently held for a relatively long period of time 
when a snapshot is taken, especially if there's a lot of backends. There 
are three operations to the procarray:


1. Advertising a new xid belonging to my backend.
2. Ending a transaction.
3. Getting a snapshot.

Advertising a new xid is currently done without a lock, assuming that 
setting an xid in shared memory is atomic. To end a transaction, you 
acquire ProcArrayLock in exclusive mode. To get a snapshot, you acquire 
ProcArrayLock in shared mode, and scan the whole procarray.


I wonder if it would be a better tradeoff to keep a "materialized" 
snapshot in shared memory that's updated every time a new xid is 
assigned or a transaction ends. Getting a snapshot would become much 
cheaper, as you could just memcpy the ready-built snapshot from shared 
memory into local memory.
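As a rough sketch of the idea (structure and field names invented here, not the actual PGPROC or SnapshotData layout): the write paths maintain the snapshot under the exclusive lock, and the read path collapses to a single memcpy.

```c
#include <assert.h>
#include <string.h>
#include <stdint.h>

#define MAX_XIP 64

/* Invented snapshot layout for illustration only. */
typedef struct {
    uint32_t xmin;
    uint32_t xmax;
    int      xcnt;
    uint32_t xip[MAX_XIP];   /* in-progress xids */
} Snapshot;

static Snapshot shared_snapshot;   /* would live in shared memory */

/* Writer path: run under the exclusive lock whenever an xid is
 * assigned (shown here) or a transaction ends (removal omitted). */
static void add_running_xid(uint32_t xid)
{
    shared_snapshot.xip[shared_snapshot.xcnt++] = xid;
    if (xid >= shared_snapshot.xmax)
        shared_snapshot.xmax = xid + 1;
}

/* Reader path: one memcpy instead of scanning the whole procarray
 * under ProcArrayLock held in shared mode. */
static void get_snapshot(Snapshot *out)
{
    memcpy(out, &shared_snapshot, sizeof(Snapshot));
}
```

The trade-off is that every xid assignment and transaction end now pays to keep the materialized copy current, in exchange for much cheaper snapshot acquisition.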


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



[HACKERS] spinlock contention

2011-06-23 Thread Robert Haas
On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug  wrote:
> On Jun12, 2011, at 23:39 , Robert Haas wrote:
>> So, the majority (60%) of the excess spinning appears to be due to
>> SInvalReadLock.  A good chunk are due to ProcArrayLock (25%).
>
> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
> so I guess that two adjacent LWLocks currently share one cache line.
>
> Currently, the ProcArrayLock has index 4 while SInvalReadLock has
> index 5, so if I'm not mistaken exactly the two locks where you saw
> the largest contention on are on the same cache line...
>
> Might make sense to try and see if these numbers change if you
> either make LWLockPadded 64bytes or arrange the locks differently...

I fooled around with this a while back and saw no benefit.  It's
possible a more careful test would turn up something, but I think the
only real way forward here is going to be to eliminate some of that
locking altogether.

SInvalReadLock is a good example of a lock which seems barely to be
necessary at all.  Except on extraordinarily rare occasions, it is
only ever taken in share mode.  Only SICleanupQueue() needs to lock
out readers, and in the (probably common) case where all backends are
reasonably close to being caught up, it wouldn't even matter if the
determination of minMsgNum were a little bit fuzzy.  I've been
straining my brain trying to figure out whether there's some way to
rewrite this logic so that it can run concurrently with readers; then
we could get rid of SInvalReadLock altogether.  I have a feeling if I
were 25% smarter I could do it, but I'm not.

So my fallback position is to try to find some way to optimize the
common case where there are no messages to be read.  We're already
allowing readers and writers to run in parallel, so it must not be
important for a reader to see a message that a writer is in the
process of writing, but has not yet finished writing.  I believe the
idea here is that sinval messages should be emitted before releasing
the related locks, so if we've successfully acquired a lock on an
object, then any pertinent sinval messages must already be completely
written.  What I'm thinking we could do is just keep a global counter
indicating how many messages have been written, and each backend could
keep a counter indicating the highest-numbered message it has so far
read.  When a message is written, the global counter is bumped.  When
a backend goes to AcceptInvalidationMessages(), it compares the global
counter to its own counter, and if they're the same, then it doesn't
bother taking SInvalReadLock and just exits quickly.

There is an obvious (potential) memory-ordering problem here.  If it's
possible for a backend doing AcceptInvalidationMessages() to see a
stale version of the counter, then it might fail to read messages from
the queue that it really ought to have seen.  Protecting the global
counter with a spinlock defeats the purpose of doing any of this in
the first place, because the whole problem is too many people fighting
over the spinlock protecting SInvalReadLock.  It should be sufficient,
though, for the writing backend to execute a memory-barrier operation
just after bumping the global counter and the reading backend to
execute one just before.  As it happens, LWLockRelease() is such a
barrier, since it must acquire and release a spinlock, so we should be
able to just bump the counter right before releasing the LWLock and
call it good.  On the reading side, the immediately-preceding lock
acquisition seems like it ought to be sufficient - surely it can't be
possible to acquire a heavyweight lock on an object without seeing
memory updates that some other process made before releasing the same
heavyweight lock.

A possible objection here is that a 32-bit counter might wrap around,
giving a false negative, and a read from a 64-bit counter might not be
atomic.  In practice, the chances of a problem here seem remote, but I
think we can guard against it easily enough.  We can put each
backend's counter of messages read into shared memory, protected by a
per-backend lock (probably the same LWLock the fast relation lock
patch already introduces).  When a backend compares its counter to the
global counter, it does so while holding this lock.  When
SICleanupQueue() resets a backend, it grabs the lock and sets that
backend's counter to a special value (like 0) that we guarantee will
never match the global counter (which we can cause to wrap from 2^32-1
to 1).  So then AcceptInvalidationMessages() - in the common case
where there are no invalidation messages to process - will only need
the per-backend LWLock rather than fighting over SInvalReadLock.
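The counter fast path described above can be sketched with C11 atomics. This is a hedged illustration only: the names (msg_counter, need_to_read, and so on) are invented and do not correspond to actual PostgreSQL symbols, and the single-threaded demo relies on explicit release/acquire ordering where the real system would get its barriers from LWLockRelease() and heavyweight lock acquisition.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

/* Global count of sinval messages written (invented name). */
static atomic_uint_fast32_t msg_counter;

typedef struct {
    uint_fast32_t read_upto;  /* messages this backend has consumed */
} BackendState;

/* Writer side: bump the counter after queueing a message.  In the real
 * system the subsequent LWLockRelease() supplies the memory barrier. */
static void send_message(void)
{
    atomic_fetch_add_explicit(&msg_counter, 1, memory_order_release);
}

/* Reader side (AcceptInvalidationMessages fast path): skip taking
 * SInvalReadLock entirely when no new messages exist. */
static bool need_to_read(BackendState *be)
{
    uint_fast32_t global = atomic_load_explicit(&msg_counter,
                                                memory_order_acquire);
    if (global == be->read_upto)
        return false;             /* fast path: nothing new to read */
    be->read_upto = global;       /* real code would scan the queue here */
    return true;
}
```

The wraparound guard described above corresponds to updating read_upto under a per-backend lock and letting SICleanupQueue() reset it to a sentinel the global counter can never equal.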

If anyone has read this far, you have my undying gratitude,
especially if you respond with comments.

ProcArrayLock looks like a tougher nut to crack - there's simply no
way, with the system we have right now, that you can take a snapshot
without locking the list of running processes.  I'm not sure what to
do about that, but we're probably going to have to come up with
something, because it seems clear that once we eliminate the lock
manager LWLock contention, this is a major bottleneck.

Re: [HACKERS] Table Partitioning

2011-06-23 Thread Kevin Grittner
Robert Haas  wrote:
> David Fetter  wrote:
 
>> Does The Standard* have anything to say?
> 
> I don't know
 
I dug around in the standard a bit and didn't find anything.  Given
that the standard doesn't even touch the issue of indexes, because
that's a performance-tuning implementation detail, it would seem out
of character for them to define table partitioning.
 
-Kevin



Re: [HACKERS] Online base backup from the hot-standby

2011-06-23 Thread Steve Singer
On 11-06-23 02:41 AM, Jun Ishiduka wrote:
> I received this mail, and I realize I had misunderstood what
> Heikki is proposing.
>
> My understanding:
>   Before:
> * I thought Heikki was proposing: "Execute SQL (pg_start_backup('x'); copy
>   the data directory; pg_stop_backup();) from the standby server
>   against the primary server".
>   -> I disliked this.
>   Now:
> * Heikki is proposing both not using pg_basebackup and not specifying -x.
>   So, he thinks:
> * Use the output of pg_stop_backup().
> * Don't use the backup history file.
>
> Right?
>

What I think he is proposing would not require using pg_stop_backup(),
but you could also modify pg_stop_backup() to work as well.

What do you think of this idea?

Do you see how the patch can be reworked to accomplish this?



> 
> Jun Ishizuka
> NTT Software Corporation
> TEL:045-317-7018
> E-Mail: ishizuka@po.ntts.co.jp
> 
>
>
>




Re: [HACKERS] Table Partitioning

2011-06-23 Thread Robert Haas
On Tue, Jun 21, 2011 at 1:52 PM, David Fetter  wrote:
>> Still, I think a pretty clear
>> way forward here is to try to figure out a way to add some explicit
>> syntax for range partitions, so that you can say...
>>
>> foo_a is for all rows where foo starts with 'a'
>> foo_b is for all rows where foo starts with 'b'
>> ...
>> foo_xyz is for all rows where foo starts with 'xyz'
>>
>> If we have that data represented explicitly in the system catalog,
>> then we can look at doing built-in INSERT-routing and UPDATE-handling.
>>  For an added bonus, it's a more natural syntax.
>
> Does someone else have such a syntax?  Does The Standard™ have
> anything to say?

Yes, and I don't know, respectively.  There have been previous
discussions of this.  You might want to start by reading the
discussion around the previous patch.

https://commitfest.postgresql.org/action/patch_view?id=266

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hugetables question

2011-06-23 Thread Robert Haas
On Thu, Jun 23, 2011 at 5:01 AM, Radosław Smogura
 wrote:
> I think the conclusion from this test was "There are much more important
> things to do than a 1% benefit" - not "1% is worthless".
>
> I will try today hugepages, with random peeks.

I think the real conclusion here is "Linux will soon do this for us
automatically without us needing to do anything, so pretty soon there
will be no possible benefit at all".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] fixing PQsetvalue()

2011-06-23 Thread Merlin Moncure
On Thu, Jun 23, 2011 at 7:54 AM, Andrew Chernow  wrote:
>>    you are creating as you iterate through.  This behavior was
>>    unnecessary in terms of what libpqtypes and friends needed and may (as
>>    Tom suggested) come back to bite us at some point. As it turns out,
>>    PQsetvalue's operation on results that weren't created via
>>    PQmakeEmptyPGresult was totally busted because of the bug, so we have a
>>    unique opportunity to tinker with libpq here: you could argue that you
>>
>> +1
>>
>> Exactly at this moment I am thinking about using a modifiable
>> (via PQsetvalue) PGresult instead of std::map in my C++ library
>> for storing parameters to bind to an executing command.
>> I have already designed how to implement it, and I supposed that
>> PQsetvalue is intended to work with any PGresult and not only
>> with those which have been created via PQmakeEmptyPGresult...
>> So, I am absolutely sure that PQsetvalue should work with
>> any PGresult.
>
> All PGresults are created via PQmakeEmptyPGresult, including libpqtypes.
>  Actually, libpqtypes calls PQcopyResult which calls PQmakeEmptyPGresult.
>
> It might be better to say a "server" result vs. a "client" result.
> Currently, PQsetvalue is broken when provided a "server" generated result.

er, right-- the divergence was in how the tuples were getting added --
thanks for the clarification.  essentially your attached patch fixes
that divergence.

merlin



Re: [HACKERS] fixing PQsetvalue()

2011-06-23 Thread Dmitriy Igrishin
2011/6/23 Andrew Chernow 

>> you are creating as you iterate through.  This behavior was
>> unnecessary in terms of what libpqtypes and friends needed and may (as
>> Tom suggested) come back to bite us at some point. As it turns out,
>> PQsetvalue's operation on results that weren't created via
>> PQmakeEmptyPGresult was totally busted because of the bug, so we have a
>> unique opportunity to tinker with libpq here: you could argue that you
>>
>> +1
>>
>> Exactly at this moment I am thinking about using a modifiable
>> (via PQsetvalue) PGresult instead of std::map in my C++ library
>> for storing parameters to bind to an executing command.
>> I have already designed how to implement it, and I supposed that
>> PQsetvalue is intended to work with any PGresult and not only
>> with those which have been created via PQmakeEmptyPGresult...
>> So, I am absolutely sure that PQsetvalue should work with
>> any PGresult.
>>
>
> All PGresults are created via PQmakeEmptyPGresult, including libpqtypes.
>  Actually, libpqtypes calls PQcopyResult which calls PQmakeEmptyPGresult.
>
> It might be better to say a "server" result vs. a "client" result.
> Currently, PQsetvalue is broken when provided a "server" generated result.
>
Yeah, yeah. Thanks for the clarification!
I am not a libpq hacker, just a user. So I was thinking that server-returned
results are created by libpq's private functions, rather than the public
PQmakeEmptyPGresult...

-1, though, if this feature (which I personally thought already existed,
according to the documentation) makes future optimizations impossible or
hard (as mentioned by Tom). Otherwise - +1.


> --
> Andrew Chernow
> eSilo, LLC
> global backup
> http://www.esilo.com/
>



-- 
// Dmitriy.


Re: [HACKERS] [COMMITTERS] pgsql: Make the visibility map crash-safe.

2011-06-23 Thread Robert Haas
On Wed, Jun 22, 2011 at 10:23 PM, Robert Haas  wrote:
> Well, it seems I didn't put nearly enough thought into heap_update().
> The fix for the immediate problem looks simple enough - all the code
> has been refactored to use the new API, so the calls can be easily be
> moved into the critical section (see attached).  But looking at this a
> little more, I see that heap_update() is many bricks short of a load,
> because there are several places where the buffer can be unlocked and
> relocked, and we don't recheck whether the page is all-visible after
> reacquiring the lock.  So I've got some more work to do here.

See what you think of the attached.  I *think* this covers all bases.
It's a little more complicated than I would like, but I don't think
fatally so.
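The trap here generalizes: any page state examined before the buffer lock is released must be re-examined once the lock is reacquired, because another backend may have changed it in between. A self-contained, single-threaded C sketch of the recheck pattern (the lock helpers and flag names below are stand-ins I invented for illustration, not PostgreSQL's buffer-manager API):

```c
#include <stdbool.h>

/* Single-threaded stand-ins for LockBuffer()/PageIsAllVisible(); the
 * names are illustrative, not PostgreSQL's actual API. */
static bool buf_locked = false;
static bool page_all_visible = true;  /* bit another backend may change */
static int  vm_clear_count = 0;       /* times we "cleared the VM bit"  */

static void lock_buffer(void)   { buf_locked = true; }
static void unlock_buffer(void) { buf_locked = false; }

/* Simulates the heap_update() hazard: the buffer lock is dropped and
 * reacquired mid-way (e.g. to wait on a tuple lock), so any
 * all-visible check made before the release is stale and must be
 * repeated before the critical section. */
void update_with_recheck(void)
{
    lock_buffer();
    bool saw_all_visible = page_all_visible;  /* first check */
    unlock_buffer();                          /* concurrent window opens */

    /* ...another backend could flip page_all_visible here... */

    lock_buffer();
    /* Recheck under the lock; do NOT trust saw_all_visible. */
    if (page_all_visible)
    {
        page_all_visible = false;             /* clear inside the lock */
        vm_clear_count++;
    }
    (void) saw_all_visible;
    unlock_buffer();
}
```

Calling it twice clears the flag exactly once, which is the invariant the recheck protects.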

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


heap-update-vm-fix-v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fixing PQsetvalue()

2011-06-23 Thread Andrew Chernow

you are creating as you iterate through.  This behavior was
unnecessary in terms of what libpqtypes and friends needed and may (as
Tom suggested) come back to bite us at some point. As it turns out,
PQsetvalue's operation on results that weren't created via
PQmakeEmptyResult was totally busted because of the bug, so we have a
unique opportunity to tinker with libpq here: you could argue that you

+1

Right now I am thinking about using a modifiable
(via PQsetvalue) PGresult instead of std::map in my C++ library
to store parameters for binding to the command being executed.
I have already designed how to implement it, and I assumed that
PQsetvalue is intended to work with any PGresult, not only
with those created via PQmakeEmptyResult...
So I am absolutely sure that PQsetvalue should work with
any PGresult.


All PGresults are created via PQmakeEmptyPGresult, including libpqtypes. 
 Actually, libpqtypes calls PQcopyResult which calls PQmakeEmptyPGresult.


It might be better to say a "server" result vs. a "client" result. 
Currently, PQsetvalue is broken when provided a "server" generated result.
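The crash Pavel found boils down to writing into a tuple table that was never grown for the "server" case. A self-contained toy model of the grow-before-write behavior a fixed setvalue needs for both kinds of result (this is emphatically not libpq's actual code; the type and function names are invented for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* A toy result type -- NOT libpq's PGresult -- just enough to show why
 * setvalue must grow the tuple array before writing into it. */
typedef struct {
    int    ntups;    /* tuples currently present */
    int    tupcap;   /* allocated tuple slots    */
    char **values;   /* one value per tuple (single column for brevity) */
} ToyResult;

ToyResult *toy_make_empty(void)
{
    return calloc(1, sizeof(ToyResult));
}

/* Allows overwriting any existing tuple or appending exactly one past
 * the end, growing the array first -- the buggy path wrote first and
 * wandered into unallocated memory. Returns 1 on success, 0 on error. */
int toy_setvalue(ToyResult *r, int tup_num, const char *value)
{
    if (tup_num > r->ntups)
        return 0;                          /* no holes allowed */
    if (tup_num == r->ntups) {             /* appending: grow, then write */
        if (r->ntups == r->tupcap) {
            int    newcap = r->tupcap ? r->tupcap * 2 : 4;
            char **nv = realloc(r->values, newcap * sizeof(char *));
            if (nv == NULL)
                return 0;
            r->values = nv;
            r->tupcap = newcap;
        }
        r->values[r->ntups++] = NULL;
    }
    free(r->values[tup_num]);
    r->values[tup_num] = NULL;
    if (value != NULL) {
        char *copy = malloc(strlen(value) + 1);
        if (copy == NULL)
            return 0;
        strcpy(copy, value);
        r->values[tup_num] = copy;
    }
    return 1;
}
```

The same capacity check runs whether the result started empty (client) or pre-filled (server), which is the uniformity the fix is after.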


--
Andrew Chernow
eSilo, LLC
global backup
http://www.esilo.com/



Re: [HACKERS] fixing PQsetvalue()

2011-06-23 Thread Pavel Golub
Hello, Merlin.

You wrote:

MM> On Jun 6
MM> (http://archives.postgresql.org/pgsql-hackers/2011-06/msg00272.php),
MM> Pavel discovered an issue with PQsetvalue that could cause libpq to
MM> wander off into unallocated memory that was present in 9.0.x.  A
MM> fairly uninteresting fix was quickly produced, but Tom indicated
MM> during subsequent review that he was not happy with the behavior of
MM> the function.  Everyone was busy with the beta wrap at the time so I
MM> didn't press the issue.

MM> A little history here: PQsetvalue's
MM> (http://www.postgresql.org/docs/9.0/static/libpq-misc.html) main
MM> reason to exist is to allow creating a PGresult out of scratch data on
MM> a result created via PQmakeEmptyResult(). This behavior works as
MM> intended and is not controversial...it was initially done to support
MM> libpqtypes but has apparently found use in other places like ecpg.

MM> PQsetvalue *also* allows you to mess with results returned by the
MM> server using standard query methods for any tuple, not just the one
MM> you are creating as you iterate through.  This behavior was
MM> unnecessary in terms of what libpqtypes and friends needed and may (as
MM> Tom suggested) come back to bite us at some point. As it turns out,
MM> PQsetvalue's operation on results that weren't created via
MM> PQmakeEmptyResult was totally busted because of the bug, so we have a
MM> unique opportunity to tinker with libpq here: you could argue that you
MM> have a window of opportunity to change the behavior here since we know
MM> it isn't being usefully used.  I think it's too late for that, but
MM> if it had to be done, the time is now.

MM> Pavel actually has been requesting to go further with being able to
MM> mess around with PGresults (see:
MM> http://archives.postgresql.org/pgsql-interfaces/2011-06/msg0.php),
MM> such that the result would start to have some 'recordset' features you
MM> find in various other libraries.  Maybe it's not the job of libpq to
MM> do that, but I happen to think it's pretty cool.  Anyways, something
MM> has to be done -- I hate to see an unpatched known 9.0 issue remain in
MM> the wild.

MM> merlin

I confirm my desire to have a delete-tuple routine, and I'm ready to
create/test/modify any sources for this.

-- 
With best wishes,
 Pavel  mailto:pa...@gf.microolap.com




Re: [HACKERS] fixing PQsetvalue()

2011-06-23 Thread Dmitriy Igrishin
2011/6/23 Merlin Moncure 

> On Jun 6 (
> http://archives.postgresql.org/pgsql-hackers/2011-06/msg00272.php),
> Pavel discovered an issue with PQsetvalue that could cause libpq to
> wander off into unallocated memory that was present in 9.0.x.  A
> fairly uninteresting fix was quickly produced, but Tom indicated
> during subsequent review that he was not happy with the behavior of
> the function.  Everyone was busy with the beta wrap at the time so I
> didn't press the issue.
>
> A little history here: PQsetvalue's
> (http://www.postgresql.org/docs/9.0/static/libpq-misc.html) main
> reason to exist is to allow creating a PGresult out of scratch data on
> a result created via PQmakeEmptyResult(). This behavior works as
> intended and is not controversial...it was initially done to support
> libpqtypes but has apparently found use in other places like ecpg.
>
> PQsetvalue *also* allows you to mess with results returned by the
> server using standard query methods for any tuple, not just the one
> you are creating as you iterate through.  This behavior was
> unnecessary in terms of what libpqtypes and friends needed and may (as
> Tom suggested) come back to bite us at some point. As it turns out,
> PQsetvalue's operation on results that weren't created via
> PQmakeEmptyResult was totally busted because of the bug, so we have a
> unique opportunity to tinker with libpq here: you could argue that you
> have a window of opportunity to change the behavior here since we know
> it isn't being usefully used.  I think it's too late for that, but
> if it had to be done, the time is now.
>
> Pavel actually has been requesting to go further with being able to
> mess around with PGresults (see:
> http://archives.postgresql.org/pgsql-interfaces/2011-06/msg0.php),
> such that the result would start to have some 'recordset' features you
> find in various other libraries.  Maybe it's not the job of libpq to
> do that, but I happen to think it's pretty cool.  Anyways, something
> has to be done -- I hate to see an unpatched known 9.0 issue remain in
> the wild.
>
+1

Right now I am thinking about using a modifiable
(via PQsetvalue) PGresult instead of std::map in my C++ library
to store parameters for binding to the command being executed.
I have already designed how to implement it, and I assumed that
PQsetvalue is intended to work with any PGresult, not only
with those created via PQmakeEmptyResult...
So I am absolutely sure that PQsetvalue should work with
any PGresult.


> merlin
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
// Dmitriy.


Re: [HACKERS] [v9.2] DROP Reworks Part.0 - 'missing_ok' support of get_object_address

2011-06-23 Thread Kohei KaiGai
I revised my patch based on your "there-is-no-try-v2.patch".
It made it possible to implement 'missing_ok' support without changing
the order in which the object name is resolved and the relation is locked.
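As a rough illustration of the API shape agreed in the quoted discussion below -- a missing_ok flag on the open function itself, returning NULL rather than raising an error when the object is absent -- here is a self-contained C sketch (toy types and invented names; the real heap_openrv takes a RangeVar and a lock mode and reports errors via ereport):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in; the real code deals in RangeVar and LOCKMODE. */
typedef struct { const char *name; } Relation;

static Relation pg_class_rel = { "pg_class" };

/* One function, one extra flag: with missing_ok = true the caller gets
 * NULL back for an absent relation; with false, absence is a hard
 * error -- here modeled with assert(), ereport(ERROR, ...) in PG. */
Relation *toy_relation_open(const char *name, bool missing_ok)
{
    if (strcmp(name, pg_class_rel.name) == 0)
        return &pg_class_rel;
    if (missing_ok)
        return NULL;
    assert(!"relation does not exist");
    return NULL;
}
```

This keeps a single entry point, reserving a hypothetical `_extended` variant for the day there is a whole pile of options rather than one boolean.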

Thanks,

2011/6/22 Robert Haas :
> On Wed, Jun 22, 2011 at 12:51 PM, Alvaro Herrera
>  wrote:
>> Excerpts from Robert Haas's message of mié jun 22 08:56:02 -0400 2011:
>>
>>> Another option might be to leave heap_openrv() and relation_openrv()
>>> alone and add a missing_ok argument to try_heap_openrv() and
>>> try_relation_openrv().  Passing true would give the same behavior as
>>> presently; passing false would make them behave like the non-try
>>> version.
>>
>> That would be pretty weird, having two functions, one of them sometimes
>> doing the same thing as the other one.
>>
>> I understand Noah's concern but I think your original proposal was saner
>> than both options presented so far.
>
> I agree with you.  If we had a whole pile of options it might be worth
> having heap_openrv() and heap_openrv_extended() so as not to
> complicate the simple case, but since there's no forseeable need to
> add anything other than missing_ok, my gut is to just add it and call
> it good.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



-- 
KaiGai Kohei 


pgsql-v9.2-drop-reworks-part-0.v4.patch
Description: Binary data



Re: [HACKERS] Hugetables question

2011-06-23 Thread Radosław Smogura
Martijn van Oosterhout wrote on Thursday, 23 June 2011, 09:10:20:
> On Wed, Jun 22, 2011 at 02:31:01PM +0200, Radosław Smogura wrote:
> > I strongly disagree with the opinion that a 1% gain is worthless. 1%
> > here, 1% there, and eventually you get 10%. Of course, huge pages
> > will work quite well if used in code that requires many random
> > "jumps". I think this can be reproduced, and some uncommon case may
> > be found with a gain of about 10% (maybe load a whole table into
> > shared buffers and randomly "peek" at records one by one).
> 
> I think the point is not that 1% is worthless, but that it hasn't been
> shown that there is a 1% improvement, because the noise is too large.
> 
> For benefits this small, what you need to do is run each test 100 times
> and check the mean and standard deviation to see whether the
> improvement is real or not.
> 
> When the benefit is 10% you only need a handful of runs to prove it,
> which is why they're accepted more easily.
> 
> Have a nice day,
> 
> > Patriotism is when love of your own people comes first; nationalism,
> > when hate for people other than your own comes first.
> > 
> >   - Charles de Gaulle
I think the conclusion from this test was "there are much more important
things to do than chase a 1% benefit" - not "1% is worthless".

I will try huge pages today, with random peeks.

Regards,
Radek



Re: [HACKERS] Hugetables question

2011-06-23 Thread Martijn van Oosterhout
On Wed, Jun 22, 2011 at 02:31:01PM +0200, Radosław Smogura wrote:
> I strongly disagree with the opinion that a 1% gain is worthless. 1%
> here, 1% there, and eventually you get 10%. Of course, huge pages
> will work quite well if used in code that requires many random
> "jumps". I think this can be reproduced, and some uncommon case may
> be found with a gain of about 10% (maybe load a whole table into
> shared buffers and randomly "peek" at records one by one).

I think the point is not that 1% is worthless, but that it hasn't been
shown that there is a 1% improvement, because the noise is too large.

For benefits this small, what you need to do is run each test 100 times
and check the mean and standard deviation to see whether the
improvement is real or not.

When the benefit is 10% you only need a handful of runs to prove it,
which is why they're accepted more easily.
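That procedure can be sketched concretely. The helper names and the two-standard-errors threshold below are my own illustrative choices (the usual ~95% rule of thumb), not an established pgsql-hackers convention; a real analysis would feed in actual timing samples:

```c
#include <stddef.h>

double mean(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s / (double) n;
}

/* Sample variance (n - 1 denominator). */
double sample_variance(const double *x, size_t n)
{
    double m = mean(x, n), s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += (x[i] - m) * (x[i] - m);
    return s / (double) (n - 1);
}

/* A small speedup is credible only when the difference of means
 * exceeds roughly twice the standard error of that difference;
 * comparing squared quantities avoids needing sqrt(). */
int improvement_is_real(const double *a, const double *b, size_t n)
{
    double d   = mean(a, n) - mean(b, n);
    double se2 = (sample_variance(a, n) + sample_variance(b, n))
                 / (double) n;
    return d * d > 4.0 * se2;
}
```

With 100 runs per side, a genuine 1% shift stands out even against noise of a few percent per run, which is exactly why the many-run protocol is needed for small effects.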

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first. 
>   - Charles de Gaulle




Re: [HACKERS] SYNONYMS (again)

2011-06-23 Thread PostgreSQL - Hans-Jürgen Schönig

On Jun 23, 2011, at 12:52 AM, Alvaro Herrera wrote:

> Excerpts from Joshua D. Drake's message of mié jun 22 15:37:17 -0400 2011:
>> Per:
>> 
>> http://archives.postgresql.org/pgsql-hackers/2010-11/msg02043.php
>> 
>> It seems we did come up with a use case in the procpid discussion: the
>> ability to change the names of columns/databases etc., to handle
>> fixing bad decisions made during development over time.
>> 
>> Thoughts?
> 
> Let's start with what was discussed and supported in that thread, that
> is, databases.  It seems less clear that columns are widely believed to
> be a good idea to have synonyms for.  Besides, synonyms for databases
> should be reasonably simple to implement, which is not something I would
> say for columns.



sorry, i missed the links:

http://archives.postgresql.org/pgsql-patches/2006-03/msg00085.php

many thanks,

hans


--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de




Re: [HACKERS] SYNONYMS (again)

2011-06-23 Thread PostgreSQL - Hans-Jürgen Schönig

On Jun 23, 2011, at 12:52 AM, Alvaro Herrera wrote:

> Excerpts from Joshua D. Drake's message of mié jun 22 15:37:17 -0400 2011:
>> Per:
>> 
>> http://archives.postgresql.org/pgsql-hackers/2010-11/msg02043.php
>> 
>> It seems we did come up with a use case in the procpid discussion: the
>> ability to change the names of columns/databases etc., to handle
>> fixing bad decisions made during development over time.
>> 
>> Thoughts?
> 
> Let's start with what was discussed and supported in that thread, that
> is, databases.  It seems less clear that columns are widely believed to
> be a good idea to have synonyms for.  Besides, synonyms for databases
> should be reasonably simple to implement, which is not something I would
> say for columns.



yes, implementing synonyms is not too hard.
some time ago (3 or 4 years ago most likely) we already posted a patch
providing support for synonyms.
it was rejected because synonyms were said to be a bad design pattern that
encourages app developers to do nasty things.
so, if you want to work on it, maybe this patch is the place to start.

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de

