Re: [HACKERS] error: could not find pg_class tuple for index 2662

2011-07-28 Thread daveg
On Thu, Jul 28, 2011 at 07:45:01PM -0400, Robert Haas wrote:
> On Thu, Jul 28, 2011 at 5:46 PM, daveg  wrote:
> > On Thu, Jul 28, 2011 at 09:46:41AM -0400, Robert Haas wrote:
> >> On Wed, Jul 27, 2011 at 8:28 PM, daveg  wrote:
> >> > My client has been seeing regular instances of the following sort of 
> >> > problem:
> >> On what version of PostgreSQL?
> >
> > 9.0.4.
> >
> > I previously said:
> >> > This occurs on postgresql 9.0.4. on 32 core 512GB Dell boxes. We have
> >> > identical systems still running 8.4.8 that do not have this issue, so I'm
> >> > assuming it is related to the vacuum full work done for 9.0. Oddly, we
> >> > don't see this on the smaller hosts (8 core, 64GB, slower cpus) running
> >> > 9.0.4, so it may be timing related.
> 
> Ah, OK, sorry.  Well, in 9.0, VACUUM FULL is basically CLUSTER, which
> means that a REINDEX is happening as part of the same operation.  In
> 9.0, there's no point in doing VACUUM FULL immediately followed by
> REINDEX.  My guess is that this is happening either right around the
> time the VACUUM FULL commits or right around the time the REINDEX
> commits.  It'd be helpful to know which, if you can figure it out.

I'll update my vacuum script to skip reindexes after vacuum full for 9.0
servers and see if that makes the problem go away. Thanks for reminding
me that they are not needed. However, I suspect it is the vacuum, not the
reindex causing the problem. I'll update when I know.

> If there's not a hardware problem causing those read errors, maybe a
> backend is somehow ending up with a stale or invalid relcache entry.
> I'm not sure exactly how that could be happening, though...

It does not appear to be a hardware problem. I also suspect it is a stale
relcache.

-dg
 
-- 
David Gould   da...@sonic.net  510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.



Re: [HACKERS] sinval synchronization considered harmful

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 11:05 PM, Noah Misch  wrote:
> The comparison I had in mind was (a) master + lazy-vxid + [1]sinval-fastpath
> vs. (b) master + lazy-vxid + [2]sinval-hasmessages.  The only claimed benefit
> of [2] over [1], as far as I can see, is invulnerability to the hazard
> described in [3].  Before selecting [2] over [1], it would be instructive to
> know how much performance we exchange for its benefit.
>
> I did a bit of (relatively unrigorous) poking at this workload with oprofile,
> and I never saw SIInsertDataEntries take more than 0.26% of CPU time.  It's
> perhaps too cheap relative to the workload's other costs to matter.  That's
> good enough for me.

Me, too.  There's another possible benefit, though: in
sinval-fastpath, everybody's got to read maxMsgNum every time through;
whereas in sinval-hasmessages, everyone's reading their own flag.  I'm
not sure how much it costs to have memory that gets read by multiple
readers rather than just a single reader, but I think I've seen some
things that indicate it might not always be free.
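
To make that concrete, here is a toy C sketch (purely illustrative, not the
actual sinvaladt.c code; the struct and function names are invented) of the
difference in what gets read and written on each check:

#include <stdint.h>

#define MAX_BACKENDS 64
#define CACHE_LINE   64

/* sinval-fastpath style: every backend compares its private read position
 * against a shared counter that is advanced by every writer, so that
 * counter's cache line is touched by all readers and all writers. */
typedef struct
{
    volatile uint32_t maxMsgNum;                  /* advanced on every insert */
    uint32_t          nextMsgNum[MAX_BACKENDS];   /* each backend's read point */
} FastpathState;

static int
fastpath_check(FastpathState *s, int my_backend)
{
    return s->nextMsgNum[my_backend] != s->maxMsgNum;   /* shared read */
}

/* sinval-hasmessages style: the writer sets a per-backend flag, padded so
 * each flag sits on its own cache line; a reader normally touches only its
 * own flag and consults the shared counter only when the flag is set. */
typedef struct
{
    volatile int hasMessages;
    char         pad[CACHE_LINE - sizeof(int)];
} BackendFlag;

typedef struct
{
    volatile uint32_t maxMsgNum;
    BackendFlag       flags[MAX_BACKENDS];
} HasMessagesState;

static void
hasmessages_insert(HasMessagesState *s, int nmsgs)
{
    s->maxMsgNum += nmsgs;                        /* once per insert...       */
    for (int i = 0; i < MAX_BACKENDS; i++)
        s->flags[i].hasMessages = 1;              /* ...set every flag        */
}

static int
hasmessages_check(HasMessagesState *s, int my_backend)
{
    if (!s->flags[my_backend].hasMessages)        /* private read, usually 0  */
        return 0;
    s->flags[my_backend].hasMessages = 0;         /* reset, then scan queue   */
    return 1;
}

int
main(void)
{
    static HasMessagesState state;
    static FastpathState    fp;

    hasmessages_insert(&state, 3);
    (void) fastpath_check(&fp, 0);
    return hasmessages_check(&state, 0) ? 0 : 1;
}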

> Interesting.  I had hypothesized that at two clients per core, nearly every
> call to SIGetDataEntries() would find work to do, making the patch a strict
> loss.  A bit of instrumented testing showed otherwise: at two clients per
> core, sinval-hasmessages optimized away 93% of the calls.  At fifty clients
> per core, it still optimized away more than half of the calls.

Wow.  That's better than I expected.  Great idea for a test, too.

Unless someone has another concern here, I think we've got this one nailed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] sinval synchronization considered harmful

2011-07-28 Thread Noah Misch
On Thu, Jul 28, 2011 at 03:03:05PM -0400, Robert Haas wrote:
> On Thu, Jul 28, 2011 at 10:05 AM, Robert Haas  wrote:
> >> I'll also test out creating and dropping some tables.
> >
> > Still need to work on this one.
> 
> And the results are in.  I set up the following sophisticated test
> script for pgbench:
> 
> CREATE TEMP TABLE foo (a int);
> DROP TABLE foo;
> 
> And then did 5-minute test runs with varying numbers of clients on
> Nate Boley's 32-core machine with (a) master, (b) master +
> sinval-hasmessages, (c) master + lazy-vxid, and (d) master + lazy-vxid
> + sinval-hasmessages.

The comparison I had in mind was (a) master + lazy-vxid + [1]sinval-fastpath
vs. (b) master + lazy-vxid + [2]sinval-hasmessages.  The only claimed benefit of
[2] over [1], as far as I can see, is invulnerability to the hazard described in
[3].  Before selecting [2] over [1], it would be instructive to know how much
performance we exchange for its benefit.

I did a bit of (relatively unrigorous) poking at this workload with oprofile,
and I never saw SIInsertDataEntries take more than 0.26% of CPU time.  It's
perhaps too cheap relative to the workload's other costs to matter.  That's good
enough for me.

[1] 
http://archives.postgresql.org/message-id/ca+tgmozc8z_jtj44xvpwpxkqt2jgjb5ygcz3t9u5-qzvdbm...@mail.gmail.com
[2] 
http://archives.postgresql.org/message-id/CA+TgmoZjo1bkP6eX=xox3ahupybmjt9+pkw6qubqzn7sukk...@mail.gmail.com
[3] 
http://archives.postgresql.org/message-id/20110727033537.gb18...@tornado.leadboat.com

> 80L tps = 1192.768020 (including connections establishing)
> 80L tps = 1165.545010 (including connections establishing)
> 80L tps = 1169.776066 (including connections establishing)

> 80LS tps = 1510.778084 (including connections establishing)
> 80LS tps = 1484.423486 (including connections establishing)
> 80LS tps = 1480.692051 (including connections establishing)

> 80 tps = 1154.272515 (including connections establishing)
> 80 tps = 1168.805881 (including connections establishing)
> 80 tps = 1173.971801 (including connections establishing)

> 80S tps = 1483.037788 (including connections establishing)
> 80S tps = 1481.059504 (including connections establishing)
> 80S tps = 1487.215748 (including connections establishing)
> 
> So, apparently, the extra work in SIInsertDataEntries() is more than
> paid for by the speedup in SIGetDataEntries().  I'm guessing that at
> high client counts you win because of reduced spinlock contention, and
> at low client counts you still win a little bit because
> SIGetDataEntries() is called multiple times per transaction, whereas
> SIInsertDataEntries() is only called once.  I could be all wet on the
> reason, but at any rate the numbers look pretty good.

Interesting.  I had hypothesized that at two clients per core, nearly every call
to SIGetDataEntries() would find work to do, making the patch a strict loss.  A
bit of instrumented testing showed otherwise: at two clients per core,
sinval-hasmessages optimized away 93% of the calls.  At fifty clients per core,
it still optimized away more than half of the calls.

-- 
Noah Misch                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 8:12 PM, Ants Aasma  wrote:
> On Fri, Jul 29, 2011 at 2:20 AM, Robert Haas  wrote:
>> Well, again, there are three levels:
>>
>> (A) synchronous_commit=off.  No waiting!
>> (B) synchronous_commit=local transactions, and synchronous_commit=on
>> transactions when sync rep is not in use.  Wait for xlog flush.
>> (C) synchronous_commit=on transactions when sync rep IS in use.  Wait
>> for xlog flush and replication.
> ...
>> So basically, you can't be more asynchronous than the guy in front of
>> you.
>
> (A) still gives a guarantee - transactions that begin after the commit
> returns see the committed transaction. A weaker variant would say that if
> the commit returns, and the server doesn't crash in the meantime, the commit
> would at some point become visible. Maybe even that transactions that begin
> after the commit returns become visible after that commit.

Yeah, you could do that.  But that's such a weak guarantee that I'm
not sure it has much practical utility.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 7:54 PM, Ants Aasma  wrote:
> On Thu, Jul 28, 2011 at 11:54 PM, Kevin Grittner
>  wrote:
>> (4)  We communicate acceptable snapshots to the replica to make the
> >> order of visibility match the master even when that
>> doesn't match the order that transactions returned from commit.
>
> I wonder if some interpretation of 2 phase commit could make Robert's
> original suggestion implement this.
>
> On the master the commit sequence would look something like:
> 1. Insert commit record to the WAL
> 2. Wait for replication
> 3. Get a commit seq nr and mark XIDs visible
> 4. WAL log the seq nr
> 5. Return success to client
>
> When replaying:
> * When replaying commit record, do everything but make
>  the tx visible.
> * When replaying the commit sequence number
>    if there is a gap between last visible commit and current:
>      insert the commit sequence nr. to list of waiting commits.
>    else:
>      mark current and all directly following waiting tx's visible
>
> This would give consistent visibility order on master and slave. Robert
> is right that this would undesirably increase WAL traffic. Delaying this
> traffic would undesirably increase replay lag between master and slave.
> But it seems to me that this could be an optional WAL level on top of
> hot_standby that would only be enabled if consistent visibility on
> slaves is desired.

I think you nailed it.

An additional point to think about: if we were willing to insist on
streaming replication, we could send the commit sequence numbers via a
side channel rather than writing them to WAL, which would be a lot
cheaper.  That might even be a reasonable thing to do, because if
you're doing log shipping, this is all going to be super-not-real-time
anyway.  OTOH, I know we don't want to make WAL shipping anything less
than a first class citizen, so maybe not.

At any rate, we may be getting a little sidetracked here from the
original point of the thread, which was how to make snapshot-taking
cheaper.  Maybe there's some tie-in to when transactions become
visible, but I think it's pretty weak.  The existing system could be
hacked up to avoid making transactions visible out of LSN order, and
the system I proposed could make them visible either in LSN order or
do the same thing we do now.  They are basically independent problems,
AFAICS.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Ants Aasma
On Fri, Jul 29, 2011 at 2:20 AM, Robert Haas  wrote:
> Well, again, there are three levels:
>
> (A) synchronous_commit=off.  No waiting!
> (B) synchronous_commit=local transactions, and synchronous_commit=on
> transactions when sync rep is not in use.  Wait for xlog flush.
> (C) synchronous_commit=on transactions when sync rep IS in use.  Wait
> for xlog flush and replication.
...
> So basically, you can't be more asynchronous than the guy in front of
> you.

(A) still gives a guarantee - transactions that begin after the commit
returns see the committed transaction. A weaker variant would say that if the
commit returns, and the server doesn't crash in the meantime, the commit would
at some point become visible. Maybe even that transactions that begin after
the commit returns become visible after that commit.

--
Ants Aasma



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Ants Aasma
On Thu, Jul 28, 2011 at 11:54 PM, Kevin Grittner
 wrote:
> (4)  We communicate acceptable snapshots to the replica to make the
> order of visibility match the master even when that
> doesn't match the order that transactions returned from commit.

I wonder if some interpretation of 2 phase commit could make Robert's
original suggestion implement this.

On the master the commit sequence would look something like:
1. Insert commit record to the WAL
2. Wait for replication
3. Get a commit seq nr and mark XIDs visible
4. WAL log the seq nr
5. Return success to client

When replaying:
* When replaying commit record, do everything but make
  the tx visible.
* When replaying the commit sequence number
    if there is a gap between last visible commit and current:
      insert the commit sequence nr. to list of waiting commits.
    else:
      mark current and all directly following waiting tx's visible

This would give consistent visibility order on master and slave. Robert
is right that this would undesirably increase WAL traffic. Delaying this
traffic would undesirably increase replay lag between master and slave.
But it seems to me that this could be an optional WAL level on top of
hot_standby that would only be enabled if consistent visibility on
slaves is desired.
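
To make the replay rule above concrete, a rough C sketch (purely
illustrative; the names and the fixed-size pending array are invented, this
is not a proposed implementation):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_PENDING 1024

static uint64_t next_visible_seq = 1;    /* next commit seq we may apply       */
static bool     pending[MAX_PENDING];    /* seqs replayed but not yet visible  */

/* Called when a commit-sequence-number record is replayed.  The commit
 * record itself was replayed earlier and did everything except making the
 * transaction visible. */
static void
replay_commit_seq(uint64_t seq)
{
    if (seq != next_visible_seq)
    {
        /* gap: an earlier commit's seq record has not been replayed yet */
        pending[seq % MAX_PENDING] = true;
        return;
    }

    /* make this commit visible, then drain any directly following ones */
    do
    {
        printf("commit seq %llu becomes visible\n",
               (unsigned long long) next_visible_seq);
        pending[next_visible_seq % MAX_PENDING] = false;
        next_visible_seq++;
    } while (pending[next_visible_seq % MAX_PENDING]);
}

int
main(void)
{
    replay_commit_seq(2);   /* gap: queued                           */
    replay_commit_seq(3);   /* still waiting on 1: queued            */
    replay_commit_seq(1);   /* fills the gap: 1, 2, 3 become visible */
    return 0;
}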

--
Ants Aasma



Re: [HACKERS] error: could not find pg_class tuple for index 2662

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 5:46 PM, daveg  wrote:
> On Thu, Jul 28, 2011 at 09:46:41AM -0400, Robert Haas wrote:
>> On Wed, Jul 27, 2011 at 8:28 PM, daveg  wrote:
>> > My client has been seeing regular instances of the following sort of 
>> > problem:
>> On what version of PostgreSQL?
>
> 9.0.4.
>
> I previously said:
> > This occurs on postgresql 9.0.4. on 32 core 512GB Dell boxes. We have
> > identical systems still running 8.4.8 that do not have this issue, so I'm
> > assuming it is related to the vacuum full work done for 9.0. Oddly, we
> > don't see this on the smaller hosts (8 core, 64GB, slower cpus) running
> > 9.0.4, so it may be timing related.

Ah, OK, sorry.  Well, in 9.0, VACUUM FULL is basically CLUSTER, which
means that a REINDEX is happening as part of the same operation.  In
9.0, there's no point in doing VACUUM FULL immediately followed by
REINDEX.  My guess is that this is happening either right around the
time the VACUUM FULL commits or right around the time the REINDEX
commits.  It'd be helpful to know which, if you can figure it out.

If there's not a hardware problem causing those read errors, maybe a
backend is somehow ending up with a stale or invalid relcache entry.
I'm not sure exactly how that could be happening, though...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 16:42 -0400, Robert Haas wrote:
> On Thu, Jul 28, 2011 at 4:36 PM, Hannu Krosing  wrote:
> > So in case of a stuck slave, the syncrep transaction is committed after
> > the crash, but is not committed before the crash happens?
> 
> Yep.
> 
> > Or will the replay process get stuck again during recovery?
> 
> Nope.

Are you sure? I mean the case when a stuck master comes up but the slave is
still not functional.

How does this behavior currently fit in with ACID and sync guarantees?

> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> 





Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 4:54 PM, Kevin Grittner
 wrote:
> Robert Haas  wrote:
>
>> Having transactions become visible in the same order on the master
>> and the standby is very appealing, but I'm pretty well convinced
>> that allowing commits to become visible before they've been
>> durably committed is throwing the "D" in ACID out the window.  If
>> synchronous_commit is off, sure, but otherwise...
>
> It has been durably committed on the master, but not on the
> supposedly synchronous copy; so it's not so much throwing out the "D"
> in "ACID" as throwing out the "synchronous" in "synchronous
> replication".  :-(

Well, depends.  Currently, the sequence of events is:

1. Insert commit record.
2. Flush commit record, if synchronous_commit in {local, on}.
3. Wait for synchronous replication, if synchronous_commit = on and
synchronous_standby_names is non-empty.
4. Make transaction visible.

If you move (4) before (3), you're throwing out the synchronous in
synchronous replication.  If you move (4) before (2), you're throwing
out the D in ACID.
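
In toy form (stub functions standing in for the real machinery; this is not
the actual commit code), that ordering constraint looks like this:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogPtr;

/* Stubs standing in for the real machinery. */
static XLogPtr insert_commit_record(void)      { return 42; }
static void    flush_wal_through(XLogPtr lsn)  { (void) lsn; }
static void    wait_for_sync_rep(XLogPtr lsn)  { (void) lsn; }
static void    make_transaction_visible(void)  { }

static void
commit_transaction(bool synchronous_commit, bool sync_rep_in_use)
{
    XLogPtr commit_lsn = insert_commit_record();          /* (1) */

    if (synchronous_commit)
        flush_wal_through(commit_lsn);                     /* (2) */

    if (synchronous_commit && sync_rep_in_use)
        wait_for_sync_rep(commit_lsn);                     /* (3) */

    /* (4) Moving this above (3) gives up the "synchronous" in synchronous
     * replication; moving it above (2) gives up the D in ACID. */
    make_transaction_visible();
}

int
main(void)
{
    commit_transaction(true, true);
    return 0;
}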

> Unless I'm missing something we have a choice to make -- I see four
> possibilities (already mentioned on this thread, I think):
>
> (1)  Transactions are visible on the master which won't necessarily
> be there if a meteor takes out the master and you need to resume
> operations on the replica.
>
> (2)  An asynchronous commit must block behind any pending
> synchronous commits if synchronous replication is in use.

Well, again, there are three levels:

(A) synchronous_commit=off.  No waiting!
(B) synchronous_commit=local transactions, and synchronous_commit=on
transactions when sync rep is not in use.  Wait for xlog flush.
(C) synchronous_commit=on transactions when sync rep IS in use.  Wait
for xlog flush and replication.

Under your option #2, if a type-A transaction commits after a type-B
transaction, it will need to wait for the type-B transaction's xlog
flush.  If a type-A transaction commits after a type-C transaction, it
will need to wait for the type-C transaction to flush xlog and
replicate.  And if a type-B transaction commits after a type-C
transaction, there's no additional waiting for xlog flush, because the
type-B transaction would have to wait for that anyway.  But it will
also have to wait for the preceding type-C transaction to replicate.
So basically, you can't be more asynchronous than the guy in front of
you.

Aside from the fact that this behavior isn't too hot from a user
perspective, it might lead to some pretty complicated locking.  Every
time a transaction finishes xlog flush or sync rep, it's got to go
release the transactions that piled up behind it - but not too many,
just up to the next one that still needs to wait on some higher LSN.
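
A very rough sketch in C of that release step (invented names and structure,
not a worked-out design):

#include <stdint.h>
#include <stddef.h>

typedef uint64_t XLogPtr;

/* Each queued committer records how far WAL must be flushed and replicated
 * before it may become visible (0 means no requirement of its own). */
typedef struct Waiter
{
    XLogPtr        need_flushed;      /* xlog flush LSN it must wait for  */
    XLogPtr        need_replicated;   /* sync-rep LSN it must wait for    */
    struct Waiter *next;              /* next committer in commit order   */
} Waiter;

/* Whoever just finished an xlog flush or sync-rep wait releases queued
 * commits in order, stopping at the first one that still needs a higher
 * LSN than has been reached so far. */
static Waiter *
release_waiters(Waiter *queue, XLogPtr flushed, XLogPtr replicated)
{
    while (queue != NULL &&
           queue->need_flushed <= flushed &&
           queue->need_replicated <= replicated)
    {
        /* make this transaction visible and wake its backend (not shown) */
        queue = queue->next;
    }
    return queue;   /* the rest stay blocked behind the guy in front */
}

int
main(void)
{
    Waiter b = { 0, 0, NULL };    /* synchronous_commit=off, no needs of its own */
    Waiter a = { 0, 900, &b };    /* sync-rep commit ahead of it in the queue    */

    /* WAL flushed to 500 but replicated only to 100: a still has to wait, so
     * b waits too even though it has no requirement of its own - you can't
     * be more asynchronous than the guy in front of you. */
    (void) release_waiters(&a, 500, 100);
    return 0;
}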

> (3)  Transactions become visible on the replica in a different order
> than they became visible on the master.
>
> (4)  We communicate acceptable snapshots to the replica to make the
> order of visibility match the master even when that
> doesn't match the order that transactions returned from commit.
>
> I don't see how we can accept (1) and call it synchronous
> replication.  I'm pretty dubious about (3), because we don't even
> have Snapshot Isolation on the replica, really.  Is (3) where we're
> currently at?  An advantage of (4) is that on the replica we would
> get the same SI behavior at Repeatable Read that exists on the
> master, and we could even use the same mechanism for SSI to provide
> Serializable isolation on the replica.
>
> I (predictably) like (4) -- even though it's a lot of work

I think that (4), beyond being a lot of work, will also have pretty
terrible performance.  You're basically talking about emitting two WAL
records for every commit instead of one.  That's not going to be
awesome.  It might be OK for small or relatively lightly loaded
systems, or those with "big" transactions.  But for something like
pgbench or DBT-2, I think it's going to be a big problem.  WAL is
already a major bottleneck for us; we need to find a way to make it
less of one, not more.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread karavelov
- Quote from Hannu Krosing (ha...@2ndquadrant.com), on 28.07.2011 at 22:40 -

>> 
>> Maybe this is why other databases don't offer per backend async commit ?
>> 
> 

Isn't Oracle's

COMMIT WRITE NOWAIT;

basically the same - ad hoc async commit? Though their idea of a backend does
not map exactly to PostgreSQL's. The closest thing is per-session async commit:

ALTER SESSION SET COMMIT_WRITE='NOWAIT';


Best regards

--
Luben Karavelov

Re: [HACKERS] WIP: Fast GiST index build

2011-07-28 Thread Alexander Korotkov
On Fri, Jul 29, 2011 at 1:10 AM, Heikki Linnakangas <
heikki.linnakan...@enterprisedb.com> wrote:

> gistFindCorrectParent() should take care of that.
>
OK, now it seems that I understand. I need to verify the amount of memory
needed for paths, because it seems that they tend to accumulate. Also, I need
to verify the final emptying, because the I/O guarantees of the original paper
are based on a strict descending final emptying.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Kevin Grittner
"Kevin Grittner"  wrote:
 
> to make visibility atomic with commit
 
I meant:
 
to make visibility atomic with WAL-write of the commit record
 
-Kevin



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Kevin Grittner
Jeff Davis  wrote:
 
> Wouldn't the same issue exist if one transaction is waiting for
> sync rep (synchronous_commit=on), and another is waiting for just
> a WAL flush (synchronous_commit=local)? I don't think that a
> synchronous_commit=off is required.
 
I think you're right -- basically, to make visibility atomic with
commit and allow a fast snapshot build based on that order, any new
commit request would need to block behind any pending request,
regardless of that setting.  At least, no way around that is
apparent to me.
 
-Kevin



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Jeff Davis
On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote:
> > Right, but if the visibility order were *defined* as the order in which
> > commit records appear in WAL, that problem neatly goes away.  It's only
> > because we have the implementation artifact that "set my xid to 0 in the
> > ProcArray" is decoupled from inserting the commit record that there's
> > any difference.
> 
> Hmm, interesting idea.  However, consider the scenario where some
> transactions are using synchronous_commit or synchronous replication,
> and others are not.  If a transaction that needs to wait (either just
> for WAL flush, or for WAL flush and synchronous replication) inserts
> its commit record, and then another transaction with
> synchronous_commit=off comes along and inserts its commit record, the
> second transaction will have to block until the first transaction is
> done waiting.  We can't make either transaction visible without making
> both visible, and we certainly can't acknowledge the second
> transaction to the client until we've made it visible.  I'm not going
> to say that's so horrible we shouldn't even consider it, but it
> doesn't seem great, either.

I'm trying to follow along here.

Wouldn't the same issue exist if one transaction is waiting for sync rep
(synchronous_commit=on), and another is waiting for just a WAL flush
(synchronous_commit=local)? I don't think that a synchronous_commit=off
is required.

Regards,
Jeff Davis






Re: [HACKERS] error: could not find pg_class tuple for index 2662

2011-07-28 Thread daveg
On Thu, Jul 28, 2011 at 09:46:41AM -0400, Robert Haas wrote:
> On Wed, Jul 27, 2011 at 8:28 PM, daveg  wrote:
> > My client has been seeing regular instances of the following sort of 
> > problem:
> On what version of PostgreSQL?

9.0.4.

I previously said:
> > This occurs on postgresql 9.0.4. on 32 core 512GB Dell boxes. We have
> > identical systems still running 8.4.8 that do not have this issue, so I'm
> > assuming it is related to the vacuum full work done for 9.0. Oddly, we don't
> > see this on the smaller hosts (8 core, 64GB, slower cpus) running 9.0.4,
> > so it may be timing related.

-dg

-- 
David Gould   da...@sonic.net  510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.



Re: [HACKERS] WIP: Fast GiST index build

2011-07-28 Thread Heikki Linnakangas

On 28.07.2011 23:57, Alexander Korotkov wrote:

On Tue, Jul 26, 2011 at 9:24 PM, Heikki Linnakangas<
heikki.linnakan...@enterprisedb.com>  wrote:


Let me know if you have questions on the approach taken.



I realized that the approach which replaces keeping loaded subtrees is
unclear to me. We save paths between node buffers. But those paths can
become invalid on page splits. It seems to me that approximately the same
volume of code for maintaining parent links should be added to this version
of the patch in order to get it working correctly.


gistFindCorrectParent() should take care of that.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] WIP: Fast GiST index build

2011-07-28 Thread Alexander Korotkov
On Tue, Jul 26, 2011 at 9:24 PM, Heikki Linnakangas <
heikki.linnakan...@enterprisedb.com> wrote:
>
> Let me know if you have questions on the approach taken.


I realized that the approach which replaces keeping loaded subtrees is
unclear to me. We save paths between node buffers. But those paths can
become invalid on page splits. It seems to me that approximately the same
volume of code for maintaining parent links should be added to this version
of the patch in order to get it working correctly.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Kevin Grittner
Robert Haas  wrote:
 
> Having transactions become visible in the same order on the master
> and the standby is very appealing, but I'm pretty well convinced
> that allowing commits to become visible before they've been
> durably committed is throwing the "D" in ACID out the window.  If
> synchronous_commit is off, sure, but otherwise...
 
It has been durably committed on the master, but not on the
supposedly synchronous copy; so it's not so much throwing out the "D"
in "ACID" as throwing out the "synchronous" in "synchronous
replication".  :-(
 
Unless I'm missing something we have a choice to make -- I see four
possibilities (already mentioned on this thread, I think):
 
(1)  Transactions are visible on the master which won't necessarily
be there if a meteor takes out the master and you need to resume
operations on the replica.
 
(2)  An asynchronous commit must block behind any pending
synchronous commits if synchronous replication is in use.
 
(3)  Transactions become visible on the replica in a different order
than they became visible on the master.
 
(4)  We communicate acceptable snapshots to the replica to make the
order of visibility match the master even when that
doesn't match the order that transactions returned from commit.
 
I don't see how we can accept (1) and call it synchronous
replication.  I'm pretty dubious about (3), because we don't even
have Snapshot Isolation on the replica, really.  Is (3) where we're
currently at?  An advantage of (4) is that on the replica we would
get the same SI behavior at Repeatable Read that exists on the
master, and we could even use the same mechanism for SSI to provide
Serializable isolation on the replica.
 
I (predictably) like (4) -- even though it's a lot of work
 
-Kevin



Re: [HACKERS] XMLATTRIBUTES vs. values of type XML

2011-07-28 Thread Peter Eisentraut
On Wed, 2011-07-27 at 23:21 +0200, Florian Pflug wrote:
> On Jul27, 2011, at 23:08 , Peter Eisentraut wrote:
> > Well, offhand I would expect that passing an XML value to XMLATTRIBUTES
> > would behave as in
> > 
> > SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES(XMLSERIALIZE(content '&amp;'::XML
> > AS text) AS "a"))
> 
> With both 9.1 and 9.2 this query returns
> 
>  xmlelement 
> 
>  <t a="&amp;amp;"/>
> 
> i.e. makes the value of "a" represent the *literal* string '&amp;', *not*
> the literal string '&'. Just to be sure there's no misunderstanding here
> - is this what you expect?

Well, I expect it to fail.

> What's the workaround if you have a value of type XML, e.g. '&amp;',
> and want to set an attribute to the value represented by that XML fragment
> (i.e. '&')? Since we have no XMLUNESCAPE function, I don't see an easy
> way to do that. Maybe I'm missing something, though.

It may be worth researching whether the XMLSERIALIZE function is
actually doing what it should, or whether it could/should do some
unescaping.  Unfortunately, in the latest SQL/XML standard the final
answer is nested deep in the three other standards, so I don't have an
answer right now.  But there are plenty of standards in this area, so
I'd hope that one of them can give us the right behavior, instead of us
making something up.





Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 4:36 PM, Hannu Krosing  wrote:
> So in case of a stuck slave, the syncrep transaction is committed after
> the crash, but is not committed before the crash happens?

Yep.

> Or will the replay process get stuck again during recovery?

Nope.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] New partitioning WAS: Check constraints on partition parents only?

2011-07-28 Thread Tom Lane
Aidan Van Dyk  writes:
> On Thu, Jul 28, 2011 at 12:53 PM, Josh Berkus  wrote:
>> Second, the key-based partitioning I described would actually be
>> preferred to what you describe by a lot of users I know, because it's
>> even simpler than what you propose, which means less contract DBA work
>> they have to pay for to set it up.

> But part of the desire for "simple partitioning" is to make sure the
> query planner and executor know about partitions and can exclude
> unnecessary partitions from queries.  If partition knowledge doesn't
> help the query plans, it's not much use except to reduce table size,
> which isn't a hard task with the current inheritance options.

> But if the "partition" selection is an opaque "simple key" type
> function, you haven't given the planner/executor anything better to be
> able to pick partitions for queries, unless the query is an exact "key
> =" type of operation.

Right.  I think the *minimum* requirement for intelligent planning is
that the partitioning be based on ranges of a btree-sortable type.
Single values is a special case of that (for discrete types anyway),
but it doesn't cover enough cases to be the primary definition.

regards, tom lane



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 16:20 -0400, Robert Haas wrote:
> On Thu, Jul 28, 2011 at 3:40 PM, Hannu Krosing  wrote:
> > On Thu, 2011-07-28 at 21:32 +0200, Hannu Krosing wrote:
> >> On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote:
> >>
> >> > Hmm, interesting idea.  However, consider the scenario where some
> >> > transactions are using synchronous_commit or synchronous replication,
> >> > and others are not.  If a transaction that needs to wait (either just
> >> > for WAL flush, or for WAL flush and synchronous replication) inserts
> >> > its commit record, and then another transaction with
> >> > synchronous_commit=off comes along and inserts its commit record, the
> >> > second transaction will have to block until the first transaction is
> >> > done waiting.
> >>
> >> What is the current behavior when the synchronous replication fails (say
> >> the slave breaks down) - will the transaction be rolled back at some
> >> point or will it wait indefinitely , that is until a new slave is
> >> installed ?
> >
> > More importantly, if the master crashes after the commit is written to
> > WAL, will the transaction be rolled back after recovery based on the
> > fact that no confirmation from synchronous slave is received ?
> 
> No.  You can't roll back a transaction once it's committed - ever.

So in case of a stuck slave, the syncrep transaction is committed after the
crash, but is not committed before the crash happens?

Or will the replay process get stuck again during recovery?

> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> 





Re: [HACKERS] New partitioning WAS: Check constraints on partition parents only?

2011-07-28 Thread Martijn van Oosterhout
On Thu, Jul 28, 2011 at 10:20:57AM -0400, Robert Haas wrote:
> What Itagaki Takahiro proposed a year ago was basically something
> where you would say, OK, I want to partition on this column (or maybe
> expression).  And then you say:
> 
> If the value is less than v1, put it in a partition called p1.
> If the value is less than v2, put it in a position called p2.
> 
> If the value is not less than any of the above, put it in a partition
> called poverflow.
> 
> I like that design, not least but also not only because it's similar
> to what one of our competitors does.

FWIW, this seems to me to be the most useful design, because the other
nice use for partitioning is being able to throw away old data without
leaving huge chunks of deleted rows. If the column you partition on
is a timestamp, then the above scheme makes it easy to just drop the
oldest partition when the disk is nearly full.

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 4:12 PM, Kevin Grittner
 wrote:
> Hannu Krosing  wrote:
>> but I still think that it is right semantics to make your commit
>> visible to others, even before you have gotten back the
>> confirmation yourself.
>
> Possibly. That combined with building snapshots based on the order
> of WAL entries of commit records certainly has several appealing
> aspects.  It is hard to get over the fact that you lose an existing
> guarantee, though: right now, if you have one synchronous replica,
> you can never see a transaction's work on the master and then *not*
> see it on the slave -- the slave always has first visibility.  I
> don't see how such a guarantee can exist in *either* direction with
> the semantics you describe.  After seeing a transaction's work on
> one system it would always be unknown whether it was visible on the
> other.  There are situations where that is OK as long as each copy
> has a sane order of visibility, but there are situations where
> losing that guarantee might matter.
>
> On the bright side, it means that transactions would become visible
> on the replica in the same order as on the master, and that blocking
> would be reduced.

Having transactions become visible in the same order on the master and
the standby is very appealing, but I'm pretty well convinced that
allowing commits to become visible before they've been durably
committed is throwing the "D" in ACID out the window.  If
synchronous_commit is off, sure, but otherwise...

...Robert



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 3:40 PM, Hannu Krosing  wrote:
> On Thu, 2011-07-28 at 21:32 +0200, Hannu Krosing wrote:
>> On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote:
>>
>> > Hmm, interesting idea.  However, consider the scenario where some
>> > transactions are using synchronous_commit or synchronous replication,
>> > and others are not.  If a transaction that needs to wait (either just
>> > for WAL flush, or for WAL flush and synchronous replication) inserts
>> > its commit record, and then another transaction with
>> > synchronous_commit=off comes along and inserts its commit record, the
>> > second transaction will have to block until the first transaction is
>> > done waiting.
>>
>> What is the current behavior when the synchronous replication fails (say
>> the slave breaks down) - will the transaction be rolled back at some
>> point or will it wait indefinitely , that is until a new slave is
>> installed ?
>
> More importantly, if the master crashes after the commit is written to
> WAL, will the transaction be rolled back after recovery based on the
> fact that no confirmation from synchronous slave is received ?

No.  You can't roll back a transaction once it's committed - ever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 3:32 PM, Hannu Krosing  wrote:
>> Hmm, interesting idea.  However, consider the scenario where some
>> transactions are using synchronous_commit or synchronous replication,
>> and others are not.  If a transaction that needs to wait (either just
>> for WAL flush, or for WAL flush and synchronous replication) inserts
>> its commit record, and then another transaction with
>> synchronous_commit=off comes along and inserts its commit record, the
>> second transaction will have to block until the first transaction is
>> done waiting.
>
> What is the current behavior when the synchronous replication fails (say
> the slave breaks down) - will the transaction be rolled back at some
> point or will it wait indefinitely , that is until a new slave is
> installed ?

It will wait forever, unless you shut down the database or hit ^C.

>> We can't make either transaction visible without making
>> both visible, and we certainly can't acknowledge the second
>> transaction to the client until we've made it visible.  I'm not going
>> to say that's so horrible we shouldn't even consider it, but it
>> doesn't seem great, either.
>
> Maybe this is why other databases don't offer per backend async commit ?

Yeah, possibly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Kevin Grittner
Hannu Krosing  wrote:
 
> but I still think that it is right semantics to make your commit
> visible to others, even before you have gotten back the
> confirmation yourself.
 
Possibly.  That combined with building snapshots based on the order
of WAL entries of commit records certainly has several appealing
aspects.  It is hard to get over the fact that you lose an existing
guarantee, though: right now, if you have one synchronous replica,
you can never see a transaction's work on the master and then *not*
see it on the slave -- the slave always has first visibility.  I
don't see how such a guarantee can exist in *either* direction with
the semantics you describe.  After seeing a transaction's work on
one system it would always be unknown whether it was visible on the
other.  There are situations where that is OK as long as each copy
has a sane order of visibility, but there are situations where
losing that guarantee might matter.
 
On the bright side, it means that transactions would become visible
on the replica in the same order as on the master, and that blocking
would be reduced.
 
-Kevin



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 15:38 -0400, Tom Lane wrote:
> Hannu Krosing  writes:
> > So the basic design could be "a sparse snapshot", consisting of 'xmin,
> > xmax, running_txids[numbackends] where each backend manages its own slot
> > in running_txids - sets a txid when acquiring one and nulls it at commit,
> > possibly advancing xmin if xmin==mytxid.
> 
> How is that different from what we're doing now?  Basically, what you're
> describing is pulling the xids out of the ProcArray and moving them into
> a separate data structure.  That could be a win I guess if non-snapshot-
> related reasons to take ProcArrayLock represent enough of the contention
> to be worth separating out, but I suspect they don't. 

The idea was to make the txid array small enough to be able to memcpy it
to backend-local memory fast. But I agree it takes testing to see if it
is an overall win.
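
For concreteness, the kind of structure being talked about might look roughly
like this (a toy sketch only; field names and sizing are invented):

#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)
#define MAX_BACKENDS         64

/* One shared array: a slot per backend plus the global bounds.  Each backend
 * writes only its own slot; taking a snapshot means memcpy'ing this small
 * struct into backend-local memory. */
typedef struct
{
    TransactionId xmin;                      /* oldest xid still running     */
    TransactionId xmax;                      /* next xid to be assigned      */
    TransactionId running[MAX_BACKENDS];     /* InvalidTransactionId if idle */
} SparseSnapshotArray;

static void
take_snapshot(const SparseSnapshotArray *shared, SparseSnapshotArray *local)
{
    /* As the quoted objection below points out, without a lock this copy has
     * no consistency guarantee about commit order; this only shows the
     * intended data layout and access pattern. */
    memcpy(local, shared, sizeof(SparseSnapshotArray));
}

static void
set_my_slot(SparseSnapshotArray *shared, int my_slot, TransactionId xid)
{
    shared->running[my_slot] = xid;                     /* on acquiring an xid */
}

static void
clear_my_slot(SparseSnapshotArray *shared, int my_slot)
{
    shared->running[my_slot] = InvalidTransactionId;    /* at commit           */
    /* advancing xmin when xmin == my xid is left out here */
}

int
main(void)
{
    SparseSnapshotArray shared = { 100, 105, { 0 } };
    SparseSnapshotArray snap;

    set_my_slot(&shared, 3, 104);
    take_snapshot(&shared, &snap);
    clear_my_slot(&shared, 3);
    return 0;
}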

>  In particular,
> the data structure you describe above *cannot* be run lock-free, because
> it doesn't provide any consistency guarantees without a lock.  You need
> everyone to have the same ideas about commit order, and random backends
> independently changing array elements without locks won't guarantee
> that.
> 
>   regards, tom lane
> 

-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 15:42 -0400, Tom Lane wrote:
> Hannu Krosing  writes:
> > On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote:
> >> We can't make either transaction visible without making
> >> both visible, and we certainly can't acknowledge the second
> >> transaction to the client until we've made it visible.  I'm not going
> >> to say that's so horrible we shouldn't even consider it, but it
> >> doesn't seem great, either.
> 
> > Maybe this is why other databases don't offer per backend async commit ?
> 
> Yeah, I've always thought that feature wasn't as simple as it appeared.
> It got in only because it was claimed to be cost-free, and it's now
> obvious that it isn't.

I still think it is cost-free if you get the semantics of the COMMIT
contract right. (Of course it is not cost-free as in not wasting
developers' time in discussions ;) )

I'm still with you in claiming that a transaction should be visible to
other backends as committed as soon as the WAL record is inserted.

The main thing to keep in mind is that getting back positive commit
confirmation really means (depending on various sync settings) that your
transaction is on stable storage.

BUT, _not_ getting back confirmation on commit does not guarantee that it
is not committed, just that you need to check. It may well be that it
was committed, written to stable storage _and_ also syncrepped, but then
the confirmation did not come back to you due to some network outage, or
your client computer crashed, or your child spilled black paint over the
monitor, or a thousand other reasons.

Async commit has the contract that you are ready to check the last few
commits after a crash.

But I still think that it is the right semantics to make your commit visible
to others, even before you have gotten back the confirmation yourself.

---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Tom Lane
Hannu Krosing  writes:
> On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote:
>> We can't make either transaction visible without making
>> both visible, and we certainly can't acknowledge the second
>> transaction to the client until we've made it visible.  I'm not going
>> to say that's so horrible we shouldn't even consider it, but it
>> doesn't seem great, either.

> Maybe this is why other databases don't offer per backend async commit ?

Yeah, I've always thought that feature wasn't as simple as it appeared.
It got in only because it was claimed to be cost-free, and it's now
obvious that it isn't.

regards, tom lane



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 21:32 +0200, Hannu Krosing wrote:
> On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote:
>  
> > Hmm, interesting idea.  However, consider the scenario where some
> > transactions are using synchronous_commit or synchronous replication,
> > and others are not.  If a transaction that needs to wait (either just
> > for WAL flush, or for WAL flush and synchronous replication) inserts
> > its commit record, and then another transaction with
> > synchronous_commit=off comes along and inserts its commit record, the
> > second transaction will have to block until the first transaction is
> > done waiting.  
> 
> What is the current behavior when the synchronous replication fails (say
> the slave breaks down) - will the transaction be rolled back at some
> point or will it wait indefinitely , that is until a new slave is
> installed ?

More importantly, if the master crashes after the commit is written to
WAL, will the transaction be rolled back after recovery based on the
fact that no confirmation from synchronous slave is received ?

> Or will the sync rep transaction commit when archive_command returns
> true after copying the WAL segment containing this commit ?
> 
> > We can't make either transaction visible without making
> > both visible, and we certainly can't acknowledge the second
> > transaction to the client until we've made it visible.  I'm not going
> > to say that's so horrible we shouldn't even consider it, but it
> > doesn't seem great, either.
> 
> Maybe this is why other databases don't offer per backend async commit ?
> 

-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Tom Lane
Hannu Krosing  writes:
> So the basic design could be "a sparse snapshot", consisting of 'xmin,
> xmax, running_txids[numbackends] where each backend manages its own slot
> in running_txids - sets a txid when acquiring one and nulls it at commit,
> possibly advancing xmin if xmin==mytxid.

How is that different from what we're doing now?  Basically, what you're
describing is pulling the xids out of the ProcArray and moving them into
a separate data structure.  That could be a win I guess if non-snapshot-
related reasons to take ProcArrayLock represent enough of the contention
to be worth separating out, but I suspect they don't.  In particular,
the data structure you describe above *cannot* be run lock-free, because
it doesn't provide any consistency guarantees without a lock.  You need
everyone to have the same ideas about commit order, and random backends
independently changing array elements without locks won't guarantee
that.

regards, tom lane



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote:
> On Thu, Jul 28, 2011 at 11:57 AM, Tom Lane  wrote:
> > Robert Haas  writes:
> >> On Thu, Jul 28, 2011 at 10:33 AM, Tom Lane  wrote:
> >>> But should we rethink that?  Your point that hot standby transactions on
> >>> a slave could see snapshots that were impossible on the parent was
> >>> disturbing.  Should we look for a way to tie "transaction becomes
> >>> visible" to its creation of a commit WAL record?  I think the fact that
> >>> they are not an indivisible operation is an implementation artifact, and
> >>> not a particularly nice one.
> >
> >> Well, I agree with you that it isn't especially nice, but it seems
> >> like a fairly intractable problem.  Currently, the standby has no way
> >> of knowing in what order the transactions became visible on the
> >> master.
> >
> > Right, but if the visibility order were *defined* as the order in which
> > commit records appear in WAL, that problem neatly goes away.  It's only
> > because we have the implementation artifact that "set my xid to 0 in the
> > ProcArray" is decoupled from inserting the commit record that there's
> > any difference.
> 
> Hmm, interesting idea.  However, consider the scenario where some
> transactions are using synchronous_commit or synchronous replication,
> and others are not.  If a transaction that needs to wait (either just
> for WAL flush, or for WAL flush and synchronous replication) inserts
> its commit record, and then another transaction with
> synchronous_commit=off comes along and inserts its commit record, the
> second transaction will have to block until the first transaction is
> done waiting.  

What is the current behavior when the synchronous replication fails (say
the slave breaks down) - will the transaction be rolled back at some
point or will it wait indefinitely , that is until a new slave is
installed ?

Or will the sync rep transaction commit when archive_command returns
true after copying the WAL segment containing this commit ?

> We can't make either transaction visible without making
> both visible, and we certainly can't acknowledge the second
> transaction to the client until we've made it visible.  I'm not going
> to say that's so horrible we shouldn't even consider it, but it
> doesn't seem great, either.

Maybe this is why other databases don't offer per backend async commit ?

-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] sinval synchronization considered harmful

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 10:05 AM, Robert Haas  wrote:
>> I'll also test out creating and dropping some tables.
>
> Still need to work on this one.

And the results are in.  I set up the following sophisticated test
script for pgbench:

CREATE TEMP TABLE foo (a int);
DROP TABLE foo;

And then did 5-minute test runs with varying numbers of clients on
Nate Boley's 32-core machine with (a) master, (b) master +
sinval-hasmessages, (c) master + lazy-vxid, and (d) master + lazy-vxid
+ sinval-hasmessages.  I held my breath until the results started
popping out, then felt much better.  In the table below, results with
L are with the lazy-vxid patch, S is with the sinval-hasmessages
patch, LS with both, and results with no letters are neither.  The
numbers are the client count.  Long story short, it seems that the
patch actually makes this test case faster at any client count, though
the improvement at 1 and 8 clients might be in the noise:

01L tps = 514.880290 (including connections establishing)
01L tps = 525.097199 (including connections establishing)
01L tps = 508.319588 (including connections establishing)
08L tps = 1834.259810 (including connections establishing)
08L tps = 1846.846089 (including connections establishing)
08L tps = 1841.402433 (including connections establishing)
32L tps = 1463.822936 (including connections establishing)
32L tps = 1481.169483 (including connections establishing)
32L tps = 1393.780335 (including connections establishing)
80L tps = 1192.768020 (including connections establishing)
80L tps = 1165.545010 (including connections establishing)
80L tps = 1169.776066 (including connections establishing)

01LS tps = 517.624068 (including connections establishing)
01LS tps = 524.507723 (including connections establishing)
01LS tps = 507.847622 (including connections establishing)
08LS tps = 1831.248178 (including connections establishing)
08LS tps = 1873.932133 (including connections establishing)
08LS tps = 1863.048113 (including connections establishing)
32LS tps = 1851.143407 (including connections establishing)
32LS tps = 1754.683356 (including connections establishing)
32LS tps = 1785.926527 (including connections establishing)
80LS tps = 1510.778084 (including connections establishing)
80LS tps = 1484.423486 (including connections establishing)
80LS tps = 1480.692051 (including connections establishing)

01 tps = 511.572832 (including connections establishing)
01 tps = 499.389527 (including connections establishing)
01 tps = 495.697080 (including connections establishing)
08 tps = 1832.762548 (including connections establishing)
08 tps = 1819.884564 (including connections establishing)
08 tps = 1835.608561 (including connections establishing)
32 tps = 1417.168790 (including connections establishing)
32 tps = 1447.478971 (including connections establishing)
32 tps = 1427.489879 (including connections establishing)
80 tps = 1154.272515 (including connections establishing)
80 tps = 1168.805881 (including connections establishing)
80 tps = 1173.971801 (including connections establishing)

01S tps = 519.860218 (including connections establishing)
01S tps = 510.759087 (including connections establishing)
01S tps = 517.159276 (including connections establishing)
08S tps = 1880.179600 (including connections establishing)
08S tps = 1829.693454 (including connections establishing)
08S tps = 1886.168401 (including connections establishing)
32S tps = 1809.950929 (including connections establishing)
32S tps = 1809.474070 (including connections establishing)
32S tps = 1798.620842 (including connections establishing)
80S tps = 1483.037788 (including connections establishing)
80S tps = 1481.059504 (including connections establishing)
80S tps = 1487.215748 (including connections establishing)

So, apparently, the extra work in SIInsertDataEntries() is more than
paid for by the speedup in SIGetDataEntries().  I'm guessing that at
high client counts you win because of reduced spinlock contention, and
at low client counts you still win a little bit because
SIGetDataEntries() is called multiple times per transaction, whereas
SIInsertDataEntries() is only called once.  I could be all wet on the
reason, but at any rate the numbers look pretty good.
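
In outline (simplified, with invented names - the real patch differs in
detail), the mechanism looks something like this: the writer pays one extra
store per backend, while a reader whose own flag is clear returns without
touching the shared counter at all.

#include <stdbool.h>

#define MAX_BACKENDS 64

typedef struct BackendInvalState
{
    volatile bool hasMessages;  /* "there may be messages waiting for you" */
    int           nextMsgNum;   /* next queue position this backend reads */
} BackendInvalState;

typedef struct InvalQueue
{
    int               maxMsgNum;            /* advanced under the write lock */
    BackendInvalState backends[MAX_BACKENDS];
    /* ... circular buffer of invalidation messages ... */
} InvalQueue;

/* Writer side: one extra store per backend while the write lock is held. */
static void
insert_messages(InvalQueue *q, int nmsgs)
{
    q->maxMsgNum += nmsgs;      /* stand-in for copying messages in */
    for (int i = 0; i < MAX_BACKENDS; i++)
        q->backends[i].hasMessages = true;
}

/* Reader side: the common no-work case touches only its own flag. */
static int
get_messages(InvalQueue *q, int my_backend_id)
{
    BackendInvalState *me = &q->backends[my_backend_id];
    int ncopied;

    if (!me->hasMessages)
        return 0;               /* fast path: no lock, no maxMsgNum read */

    me->hasMessages = false;    /* clear first; a concurrent insert sets it
                                 * again and we come back on the next call */
    /* ... take the read lock and copy the real messages here ... */
    ncopied = q->maxMsgNum - me->nextMsgNum;
    me->nextMsgNum = q->maxMsgNum;
    return ncopied;
}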

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 11:36 AM, Hannu Krosing  wrote:
> On Thu, 2011-07-28 at 11:15 -0400, Robert Haas wrote:
>> On Thu, Jul 28, 2011 at 11:10 AM, Hannu Krosing  
>> wrote:
>> > My main point was, that we already do synchronization when writing wal,
>> > why not piggyback on this to also update latest snapshot .
>>
>> Well, one problem is that it would break sync rep.
>
> Can you elaborate, in what way it "breaks" sync rep ?

Well, the point of synchronous replication is that the local machine
doesn't see the effects of the transaction until it's been replicated.
 Therefore, no one can be relying on data that might disappear in the
event the system is crushed by a falling meteor.  It would be easy,
technically speaking, to remove the transaction from the ProcArray and
*then* wait for synchronous replication, but that would offer a much
weaker guarantee than what the current version provides.  We would
still guarantee that the commit wouldn't be acknowledged to the client
which submitted it until it was replicated, but we would no longer be
able to guarantee that no one else relied on data written by the
transaction prior to successful replication.

For example, consider this series of events:

1. User asks ATM "what is my balance?".  ATM inquires of database,
which says $500.
2. User deposits a check for $100.  ATM does an UPDATE to add $100 to
balance and issues a COMMIT.  But the master has become disconnected
from the synchronous standby, so the sync rep wait hangs.
3. ATM eventually times out and tells user "sorry, i can't complete
your transaction right now".
4. User wants to know whether their check got deposited, so they walk
into the bank and ask a teller to check their balance.  Teller's
computer connects to the database and gets $600.  User is happy and
leaves.
5. Master dies.  Failover.
6. User's balance is now back to $500.  When the user finds out much
later, they say "wtf?  you told me before it was $600!".

Right now, when using synchronous replication, this series of events
CANNOT HAPPEN.  If some other transaction interrogates the state of
the database and sees the results of some transaction, it is an
ironclad guarantee that the transaction has been replicated.  If we
start making transactions visible when their WAL record is flushed or
- worse - when it's inserted, then those guarantees go away.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 11:57 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Jul 28, 2011 at 10:33 AM, Tom Lane  wrote:
>>> But should we rethink that?  Your point that hot standby transactions on
>>> a slave could see snapshots that were impossible on the parent was
>>> disturbing.  Should we look for a way to tie "transaction becomes
>>> visible" to its creation of a commit WAL record?  I think the fact that
>>> they are not an indivisible operation is an implementation artifact, and
>>> not a particularly nice one.
>
>> Well, I agree with you that it isn't especially nice, but it seems
>> like a fairly intractable problem.  Currently, the standby has no way
>> of knowing in what order the transactions became visible on the
>> master.
>
> Right, but if the visibility order were *defined* as the order in which
> commit records appear in WAL, that problem neatly goes away.  It's only
> because we have the implementation artifact that "set my xid to 0 in the
> ProcArray" is decoupled from inserting the commit record that there's
> any difference.

Hmm, interesting idea.  However, consider the scenario where some
transactions are using synchronous_commit or synchronous replication,
and others are not.  If a transaction that needs to wait (either just
for WAL flush, or for WAL flush and synchronous replication) inserts
its commit record, and then another transaction with
synchronous_commit=off comes along and inserts its commit record, the
second transaction will have to block until the first transaction is
done waiting.  We can't make either transaction visible without making
both visible, and we certainly can't acknowledge the second
transaction to the client until we've made it visible.  I'm not going
to say that's so horrible we shouldn't even consider it, but it
doesn't seem great, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New partitioning WAS: Check constraints on partition parents only?

2011-07-28 Thread Aidan Van Dyk
On Thu, Jul 28, 2011 at 12:53 PM, Josh Berkus  wrote:
> Robert,
>
>> If the value is less than v1, put it in a partition called p1.
>> If the value is less than v2, put it in a partition called p2.
>> 
>> If the value is not less than any of the above, put it in a partition
>> called poverflow.

> Sure.  I'm just restarting the discussion from the point of "what's the
> very simplest implementation of partitioning we could create and still
> be useful?"

> Second, the key-based partitioning I described would actually be
> preferred to what you describe by a lot of users I know, because it's
> even simpler than what you propose, which means less contract DBA work
> they have to pay for to set it up.

But part of the desire for "simple partitioning" is to make sure the
query planner and executor know about partitions and can exclude
unnecessary partitions from queries.  If partition knowledge doesn't
help the query plans, it's not much use except to reduce table size,
which isn't a hard task with the current inheritance options.

But if the "partition" selection is an opaque "simple key" type
function, you haven't given the planner/executor anything better to be
able to pick partitions for queries, unless the query is an exact "key
=" type of operation.

So I'm failing to see the benefit of that "key based" partitioning,
even if that key-based function was something like date_trunc on a
timestamp...



a.

-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 18:48 +0200, Hannu Krosing wrote:
> On Thu, 2011-07-28 at 18:05 +0200, Hannu Krosing wrote:
> 
> > But it is also possible, that you can get logically consistent snapshots
> > by protecting only some ops. for example, if you protect only insert and
> > get snapshot, then the worst that can happen is that you get a snapshot
> > that is a few commits older than what youd get with full locking and it
> > may well be ok for all real uses.
> 
> Thinking more of it, we should lock commit/remove_txid and get_snapshot
> 
> having a few more running backends does not make a difference, but
> seeing commits in wrong order may.

Sorry, not true, as this may advance xmax to include some running
transactions which were missed during the memcpy.

So we still need some mechanism to either synchronize the copy with
both inserts and removes, or make it atomic even in the presence of
multiple CPUs.

> this will cause contention between commit and get_snapshot, but
> hopefully less than current ProcArray manipulation, as there is just one
> simple C array to lock and copy.
> 

-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New partitioning WAS: Check constraints on partition parents only?

2011-07-28 Thread Josh Berkus
Robert,

> If the value is less than v1, put it in a partition called p1.
> If the value is less than v2, put it in a partition called p2.
> 
> If the value is not less than any of the above, put it in a partition
> called poverflow.
> 
> I like that design, not least but also not only because it's similar
> to what one of our competitors does.

Sure.  I'm just restarting the discussion from the point of "what's the
very simplest implementation of partitioning we could create and still
be useful?"

There's value in simplicity.  First, by having a very simple
implementation it's more likely someone will code it.  If we let
-hackers pile on the "must have X feature" to a new partitioning
implementation, it'll never get built.

Second, the key-based partitioning I described would actually be
preferred to what you describe by a lot of users I know, because it's
even simpler than what you propose, which means less contract DBA work
they have to pay for to set it up.

I'm sure what we eventually implement will be a compromise.  I just want
to push the discussion away from the "must have every feature under the
sun" direction and towards something that might actually work.

Oh, and no question that automatic partitioning will be a PITA and might
not be implemented for years.  But it's a serious user desire.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 18:05 +0200, Hannu Krosing wrote:

> But it is also possible, that you can get logically consistent snapshots
> by protecting only some ops. for example, if you protect only insert and
> get snapshot, then the worst that can happen is that you get a snapshot
> that is a few commits older than what youd get with full locking and it
> may well be ok for all real uses.

Thinking about it more, we should lock commit/remove_txid and get_snapshot.

Having a few more running backends does not make a difference, but
seeing commits in the wrong order may.

this will cause contention between commit and get_snapshot, but
hopefully less than current ProcArray manipulation, as there is just one
simple C array to lock and copy.

-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 11:57 -0400, Tom Lane wrote:
> Robert Haas  writes:
> > On Thu, Jul 28, 2011 at 10:33 AM, Tom Lane  wrote:
> >> But should we rethink that?  Your point that hot standby transactions on
> >> a slave could see snapshots that were impossible on the parent was
> >> disturbing.  Should we look for a way to tie "transaction becomes
> >> visible" to its creation of a commit WAL record?  I think the fact that
> >> they are not an indivisible operation is an implementation artifact, and
> >> not a particularly nice one.
> 
> > Well, I agree with you that it isn't especially nice, but it seems
> > like a fairly intractable problem.  Currently, the standby has no way
> > of knowing in what order the transactions became visible on the
> > master.
> 
> Right, but if the visibility order were *defined* as the order in which
> commit records appear in WAL, that problem neatly goes away.  It's only
> because we have the implementation artifact that "set my xid to 0 in the
> ProcArray" is decoupled from inserting the commit record that there's
> any difference.

Yes, as I explain in another e-mail, the _only_ one for whom the
transaction is not yet committed is the waiting backend itself.  For all
others it should show as committed the moment after the WAL record is
written.

It's kind of a "local 2-phase commit" thing :)

> 
>   regards, tom lane
> 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Wed, 2011-07-27 at 22:51 -0400, Robert Haas wrote:
> On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane  wrote:
> > I wonder whether we could do something involving WAL properties --- the
> > current tuple visibility logic was designed before WAL existed, so it's
> > not exploiting that resource at all.  I'm imagining that the kernel of a
> > snapshot is just a WAL position, ie the end of WAL as of the time you
> > take the snapshot (easy to get in O(1) time).  Visibility tests then
> > reduce to "did this transaction commit with a WAL record located before
> > the specified position?".  You'd need some index datastructure that made
> > it reasonably cheap to find out the commit locations of recently
> > committed transactions, where "recent" means "back to recentGlobalXmin".
> > That seems possibly do-able, though I don't have a concrete design in
> > mind.
> 
> I was mulling this idea over some more (the same ideas keep floating
> back to the top...).  I don't think an LSN can actually work, because
> there's no guarantee that the order in which the WAL records are
> emitted is the same order in which the effects of the transactions
> become visible to new snapshots.  For example:
> 
> 1. Transaction A inserts its commit record, flushes WAL, and begins
> waiting for sync rep.
> 2. A moment later, transaction B sets synchronous_commit=off, inserts
> its commit record, requests a background WAL flush, and removes itself
> from the ProcArray.
> 3. Transaction C takes a snapshot.

It is Transaction A here which is acting badly - it should also remove
itself from the ProcArray right after it inserts its commit record, as for
everybody else except the client app of transaction A it is committed at
this point.  It just can't report back to the client before getting
confirmation that it is actually syncrepped (or locally written to
stable storage).

At least from the point of view of consistent snapshots, the right
sequence should be:

1) insert the commit record into WAL
2) remove yourself from the ProcArray (or use some other means to declare
that your transaction is no longer running)
3) if so configured, wait for the WAL flush to stable storage and/or the
Sync Rep confirmation

Based on this let me suggest a simple snapshot cache mechanism

A simple snapshot cache mechanism
=================================

have an array of running transactions, with one slot per backend

txid running_transactions[max_connections];

there are exactly 3 operations on this array

1. Insert the backend's running transaction id
-----------------------------------------------

This is done at the moment of acquiring your transaction id from the
system, and is synchronized by the same mechanism as getting the
transaction id:

running_transactions[my_backend] = current_transaction_id

2. Remove the backend's running transaction id
-----------------------------------------------

This is done at the moment of committing or aborting the transaction,
again synchronized by the write-commit-record mechanism.

running_transactions[my_backend] = NULL

This should be the first thing done after inserting the WAL commit record.

3. Getting a snapshot
---------------------

memcpy() running_transactions to local memory, then construct a snapshot


It may be that you need to protect all 3 operations with a single
spinlock; if so, then I'd propose the same spinlock used when getting
your transaction id (and placing the array near where the latest
transaction id is stored so they share a cache line).

But it is also possible that you can get logically consistent snapshots
by protecting only some ops.  For example, if you protect only insert and
get-snapshot, then the worst that can happen is that you get a snapshot
that is a few commits older than what you'd get with full locking, and
that may well be OK for all real uses.
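
For concreteness, here is a rough C sketch of the three operations (names
invented, the spinlock primitive left abstract - this is an illustration of
the idea, not a patch):

#include <string.h>

#define MAX_CONNECTIONS 100
#define INVALID_TXID    0

typedef unsigned int txid;

typedef struct SnapshotCache
{
    /* spinlock_t lock;   -- whatever spinlock primitive is available */
    txid running_transactions[MAX_CONNECTIONS];
} SnapshotCache;

/* 1. On transaction start: publish my xid, under the same lock that hands
 * out transaction ids. */
static void
publish_xid(SnapshotCache *c, int my_backend, txid my_xid)
{
    c->running_transactions[my_backend] = my_xid;
}

/* 2. On commit/abort: clear my slot, right after the commit record has
 * been inserted into WAL. */
static void
retire_xid(SnapshotCache *c, int my_backend)
{
    c->running_transactions[my_backend] = INVALID_TXID;
}

/* 3. To take a snapshot: lock, memcpy the whole array, unlock; the
 * snapshot itself is then built from the local copy. */
static void
take_snapshot(SnapshotCache *c, txid *local_copy)
{
    memcpy(local_copy, c->running_transactions,
           sizeof(c->running_transactions));
}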



-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jul 28, 2011 at 10:33 AM, Tom Lane  wrote:
>> But should we rethink that?  Your point that hot standby transactions on
>> a slave could see snapshots that were impossible on the parent was
>> disturbing.  Should we look for a way to tie "transaction becomes
>> visible" to its creation of a commit WAL record?  I think the fact that
>> they are not an indivisible operation is an implementation artifact, and
>> not a particularly nice one.

> Well, I agree with you that it isn't especially nice, but it seems
> like a fairly intractable problem.  Currently, the standby has no way
> of knowing in what order the transactions became visible on the
> master.

Right, but if the visibility order were *defined* as the order in which
commit records appear in WAL, that problem neatly goes away.  It's only
because we have the implementation artifact that "set my xid to 0 in the
ProcArray" is decoupled from inserting the commit record that there's
any difference.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 11:15 -0400, Robert Haas wrote:
> On Thu, Jul 28, 2011 at 11:10 AM, Hannu Krosing  wrote:
> > My main point was, that we already do synchronization when writing wal,
> > why not piggyback on this to also update latest snapshot .
> 
> Well, one problem is that it would break sync rep.

Can you elaborate, in what way it "breaks" sync rep ?

> Another problem is that pretty much the last thing I want to do is
> push more work under WALInsertLock.  Based on the testing I've done so
> far, it seems like WALInsertLock, ProcArrayLock, and CLogControlLock
> are the main bottlenecks here.  I'm focusing on ProcArrayLock and
> CLogControlLock right now, but I am pretty well convinced that
> WALInsertLock is going to be the hardest nut to crack, so putting
> anything more under there seems like it's going in the wrong
> direction. 

Probably it is not just the WALInsertLock, but the fact that we have
just one WAL.  It can become a bottleneck once we have a significant
number of processors fighting to write to a single WAL.

>  IMHO, anyway.
> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 17:10 +0200, Hannu Krosing wrote:
> On Thu, 2011-07-28 at 10:45 -0400, Tom Lane wrote:
> > Hannu Krosing  writes:
> > > On Thu, 2011-07-28 at 10:23 -0400, Robert Haas wrote:
> > >> I'm confused by this, because I don't think any of this can be done
> > >> when we insert the commit record into the WAL stream.
> > 
> > > The update to stored snapshot needs to happen at the moment when the WAL
> > > record is considered to be "on stable storage", so the "current
> > > snapshot" update presumably can be done by the same process which forces
> > > it to stable storage, with the same contention pattern that applies to
> > > writing WAL records, no ?
> > 
> > No.  There is no reason to tie this to fsyncing WAL.  For purposes of
> > other currently-running transactions, the commit can be considered to
> > occur at the instant the commit record is inserted into WAL buffers.
> > If we crash before that makes it to disk, no problem, because nothing
> > those other transactions did will have made it to disk either. 
> 
> Agreed. Actually figured it out right after pushing send :)
> 
> > The
> > advantage of defining it that way is you don't have weirdly different
> > behaviors for sync and async transactions.
> 
> My main point was, that we already do synchronization when writing wal,
> why not piggyback on this to also update latest snapshot .

So the basic design could be "a sparse snapshot", consisting of xmin,
xmax, and running_txids[numbackends], where each backend manages its own
slot in running_txids - it sets a txid when acquiring one and nulls it at
commit, possibly advancing xmin if xmin == mytxid.  As the xmin update
requires a full scan of running_txids, it is also a good time to update
xmax - there is no need to advance xmax when "inserting" your next txid,
so you don't need to lock anything at insert time.

The valid xmax is still computed when getting the snapshot.

Hmm, probably there is no need to store xmin and xmax at all.

It needs some further analysis to figure out whether doing it this way
without any locks can produce any relevantly bad snapshots.

Maybe you still need one spinlock + a memcpy of running_txids to local
memory to get a snapshot.

Also, as the running_txids array is global, it may need to be made even
sparser to minimise cache-line collisions.  This needs to be a tuning
decision between cache conflicts and the speed of the memcpy.
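
As a rough illustration of that trade-off (invented names, and assuming a
64-byte cache line - just a sketch, not a worked-out design), padding each
slot to its own cache line could look like this:

#include <stdint.h>
#include <string.h>

#define MAX_CONNECTIONS 100
#define CACHE_LINE_SIZE 64          /* assumption; platform dependent */

/* Each backend's slot occupies a full cache line, so publishing or
 * clearing a txid never dirties a line another backend is writing.
 * The price is a larger memcpy() when taking a snapshot. */
typedef union TxidSlot
{
    uint32_t txid;                  /* 0 means "no transaction running" */
    char     pad[CACHE_LINE_SIZE];
} TxidSlot;

static TxidSlot running_txids[MAX_CONNECTIONS];

static void
take_snapshot(uint32_t *snapshot_xids, int *nxids)
{
    TxidSlot copy[MAX_CONNECTIONS];
    int      n = 0;

    /* one spinlock + memcpy as above; the lock is elided in this sketch */
    memcpy(copy, running_txids, sizeof(copy));

    for (int i = 0; i < MAX_CONNECTIONS; i++)
        if (copy[i].txid != 0)
            snapshot_xids[n++] = copy[i].txid;

    *nxids = n;
}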

> 
> 
> -- 
> ---
> Hannu Krosing
> PostgreSQL (Infinite) Scalability and Performance Consultant
> PG Admin Book: http://www.2ndQuadrant.com/books/
> 
> 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Netbeans and postgres

2011-07-28 Thread Peter Geoghegan
2011/7/28 pasman pasmański :
> I install netbeans 7. When i try to configure postgresql project, path
> is set up incorectly - gcc not found
>
> How to set the path ?

Install GCC?


-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql: bogus descriptions displayed by \d+

2011-07-28 Thread Kevin Grittner
Robert Haas  wrote:
> Peter Eisentraut  wrote:
 
>> I would like to argue for reverting this.  If you want to
>> word-smith details like this, "relation" doesn't carry any
>> additional meaning.  PG hackers know that internally, a
>> "relation" is a table, view, index, sequence, etc., but for the
>> user, it doesn't mean anything.
> 
> Well, I don't think we're going to do very well trying to get by
> without a generic term of some sort.  Calling it a table is more
> confusing, because the user might easily be forgiven for thinking
> that he knows what the word "table" means and reading no further. 
> If you say relation, then either (a) the user knows what that
> means, or (b) he'll read the text and find out.  I am not very
> excited about the idea of documenting "table_name" as "either a
> table name, or the name of some kind of object that isn't a
> table"; I think that's just weird.
 
+1 on that whole paragraph.
 
"relation" has been a "term of art" since 1969.  A table is a type
of relation variable.  I don't think it makes sense to invent new
terminology, although there's nothing wrong with the docs explaining
terms which might not be familiar to all readers.  Of course, we
don't want to come off as overly pedantic with our use of
terminology, but this one is pretty basic and commonly used.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 11:10 AM, Hannu Krosing  wrote:
> My main point was, that we already do synchronization when writing wal,
> why not piggyback on this to also update latest snapshot .

Well, one problem is that it would break sync rep.

Another problem is that pretty much the last thing I want to do is
push more work under WALInsertLock.  Based on the testing I've done so
far, it seems like WALInsertLock, ProcArrayLock, and CLogControlLock
are the main bottlenecks here.  I'm focusing on ProcArrayLock and
CLogControlLock right now, but I am pretty well convinced that
WALInsertLock is going to be the hardest nut to crack, so putting
anything more under there seems like it's going in the wrong
direction.  IMHO, anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 10:45 -0400, Tom Lane wrote:
> Hannu Krosing  writes:
> > On Thu, 2011-07-28 at 10:23 -0400, Robert Haas wrote:
> >> I'm confused by this, because I don't think any of this can be done
> >> when we insert the commit record into the WAL stream.
> 
> > The update to stored snapshot needs to happen at the moment when the WAL
> > record is considered to be "on stable storage", so the "current
> > snapshot" update presumably can be done by the same process which forces
> > it to stable storage, with the same contention pattern that applies to
> > writing WAL records, no ?
> 
> No.  There is no reason to tie this to fsyncing WAL.  For purposes of
> other currently-running transactions, the commit can be considered to
> occur at the instant the commit record is inserted into WAL buffers.
> If we crash before that makes it to disk, no problem, because nothing
> those other transactions did will have made it to disk either. 

Agreed. Actually figured it out right after pushing send :)

> The
> advantage of defining it that way is you don't have weirdly different
> behaviors for sync and async transactions.

My main point was, that we already do synchronization when writing wal,
why not piggyback on this to also update latest snapshot .


-- 
---
Hannu Krosing
PostgreSQL (Infinite) Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 10:33 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Jul 28, 2011 at 10:17 AM, Hannu Krosing  
>> wrote:
>>> My hope was, that this contention would be the same than simply writing
>>> the WAL buffers currently, and thus largely hidden by the current WAL
>>> writing sync mechanisma.
>>>
>>> It really covers just the part which writes commit records to WAL, as
>>> non-commit WAL records dont participate in snapshot updates.
>
>> I'm confused by this, because I don't think any of this can be done
>> when we insert the commit record into the WAL stream.  It has to be
>> done later, at the time we currently remove ourselves from the
>> ProcArray.  Those things need not happen in the same order, as I noted
>> in my original post.
>
> But should we rethink that?  Your point that hot standby transactions on
> a slave could see snapshots that were impossible on the parent was
> disturbing.  Should we look for a way to tie "transaction becomes
> visible" to its creation of a commit WAL record?  I think the fact that
> they are not an indivisible operation is an implementation artifact, and
> not a particularly nice one.

Well, I agree with you that it isn't especially nice, but it seems
like a fairly intractable problem.  Currently, the standby has no way
of knowing in what order the transactions became visible on the
master.  Unless we want to allow only SR and not log shipping, the
only way to communicate that information would be to WAL log it.
Aside from the expense, what do we do if XLogInsert() fails, given
that we've already committed?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Tom Lane
Hannu Krosing  writes:
> On Thu, 2011-07-28 at 10:23 -0400, Robert Haas wrote:
>> I'm confused by this, because I don't think any of this can be done
>> when we insert the commit record into the WAL stream.

> The update to stored snapshot needs to happen at the moment when the WAL
> record is considered to be "on stable storage", so the "current
> snapshot" update presumably can be done by the same process which forces
> it to stable storage, with the same contention pattern that applies to
> writing WAL records, no ?

No.  There is no reason to tie this to fsyncing WAL.  For purposes of
other currently-running transactions, the commit can be considered to
occur at the instant the commit record is inserted into WAL buffers.
If we crash before that makes it to disk, no problem, because nothing
those other transactions did will have made it to disk either.  The
advantage of defining it that way is you don't have weirdly different
behaviors for sync and async transactions.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 10:23 -0400, Robert Haas wrote:
> On Thu, Jul 28, 2011 at 10:17 AM, Hannu Krosing  wrote:
> > My hope was, that this contention would be the same than simply writing
> > the WAL buffers currently, and thus largely hidden by the current WAL
> > writing sync mechanisma.
> >
> > It really covers just the part which writes commit records to WAL, as
> > non-commit WAL records dont participate in snapshot updates.
> 
> I'm confused by this, because I don't think any of this can be done
> when we insert the commit record into the WAL stream.  It has to be
> done later, at the time we currently remove ourselves from the
> ProcArray.  Those things need not happen in the same order, as I noted
> in my original post.

The update to stored snapshot needs to happen at the moment when the WAL
record is considered to be "on stable storage", so the "current
snapshot" update presumably can be done by the same process which forces
it to stable storage, with the same contention pattern that applies to
writing WAL records, no ?

If the problem is with a backend which requested an "async commit", then
it is free to apply its additional local commit changes from its own
memory if the global latest snapshot disagrees with it.

-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Netbeans and postgres

2011-07-28 Thread pasman pasmański
Hi.

I installed NetBeans 7.  When I try to configure a PostgreSQL project, the
path is set up incorrectly - gcc is not found.

How do I set the path?

-- 

pasman

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jul 28, 2011 at 10:17 AM, Hannu Krosing  wrote:
>> My hope was, that this contention would be the same than simply writing
>> the WAL buffers currently, and thus largely hidden by the current WAL
>> writing sync mechanisma.
>> 
>> It really covers just the part which writes commit records to WAL, as
>> non-commit WAL records dont participate in snapshot updates.

> I'm confused by this, because I don't think any of this can be done
> when we insert the commit record into the WAL stream.  It has to be
> done later, at the time we currently remove ourselves from the
> ProcArray.  Those things need not happen in the same order, as I noted
> in my original post.

But should we rethink that?  Your point that hot standby transactions on
a slave could see snapshots that were impossible on the parent was
disturbing.  Should we look for a way to tie "transaction becomes
visible" to its creation of a commit WAL record?  I think the fact that
they are not an indivisible operation is an implementation artifact, and
not a particularly nice one.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 10:17 AM, Hannu Krosing  wrote:
> My hope was, that this contention would be the same than simply writing
> the WAL buffers currently, and thus largely hidden by the current WAL
> writing sync mechanisma.
>
> It really covers just the part which writes commit records to WAL, as
> non-commit WAL records dont participate in snapshot updates.

I'm confused by this, because I don't think any of this can be done
when we insert the commit record into the WAL stream.  It has to be
done later, at the time we currently remove ourselves from the
ProcArray.  Those things need not happen in the same order, as I noted
in my original post.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New partitioning WAS: Check constraints on partition parents only?

2011-07-28 Thread Robert Haas
On Tue, Jul 26, 2011 at 7:58 PM, Josh Berkus  wrote:
> Jim,
>
>> That's why I'd be opposed to any partitioning scheme that removed the 
>> ability to have different fields in different children. We've found that 
>> ability to be very useful. Likewise, I think we need to have intelligent 
>> plans involving a parent table that's either completely empty or mostly 
>> empty.
>
> Well, I don't think that anyone is proposing making constraint exclusion
> go away.  However, we also need a new version of partitioning which
> happens "below" the table level.  I don't agree that the new
> partitioning needs -- at least at the start -- the level of flexibility
> which CE gives the user.  In order to get simplicity, we have to
> sacrifice flexibility.

Agreed.

> In fact, I'd suggest extreme simplicity for the first version of this,
> with just key partitioning.  That is:
>
> CREATE TABLE  (
>        ... cols ... )
> PARTITION ON 
> [ AUTOMATIC CREATE ];

I think that the automatic create feature is just about impossible to
implement reliably, at least not without autonomous transactions.
There are big problems here in the case of concurrent activity.

What Itagaki Takahiro proposed a year ago was basically something
where you would say, OK, I want to partition on this column (or maybe
expression).  And then you say:

If the value is less than v1, put it in a partition called p1.
If the value is less than v2, put it in a partition called p2.

If the value is not less than any of the above, put it in a partition
called poverflow.

I like that design, not least but also not only because it's similar
to what one of our competitors does.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 09:38 -0400, Robert Haas wrote:
> On Thu, Jul 28, 2011 at 6:50 AM, Hannu Krosing  wrote:
> > On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane  wrote:
> >> > I wonder whether we could do something involving WAL properties --- the
> >> > current tuple visibility logic was designed before WAL existed, so it's
> >> > not exploiting that resource at all.  I'm imagining that the kernel of a
> >> > snapshot is just a WAL position, ie the end of WAL as of the time you
> >> > take the snapshot (easy to get in O(1) time).  Visibility tests then
> >> > reduce to "did this transaction commit with a WAL record located before
> >> > the specified position?".
> >
> > Why not just cache a "reference snapshots" near WAL writer and maybe
> > also save it at some interval in WAL in case you ever need to restore an
> old snapshot at some WAL position for things like time travel.
> >
> > It may be cheaper lock-wise not to update ref. snapshot at each commit,
> > but to keep latest saved snapshot and a chain of transactions
> > committed / aborted since. This means that when reading the snapshot you
> > read the current "saved snapshot" and then apply the list of commits.
> 
> Yeah, interesting idea.  I thought about that.  You'd need not only
> the list of commits but also the list of XIDs that had been published,
> since the commits have to be removed from the snapshot and the
> newly-published XIDs have to be added to it (in case they commit later
> while the snapshot is still in use).
> 
> You can imagine doing this with a pair of buffers.  You write a
> snapshot into the beginning of the first buffer and then write each
> XID that is published or commits into the next slot in the array.
> When the buffer is filled up, the next process that wants to publish
> an XID or commit scans through the array and constructs a new snapshot
> that compacts away all the begin/commit pairs and writes it into the
> second buffer, and all new snapshots are taken there.  When that
> buffer fills up you flip back to the first one.  Of course, you need
> some kind of synchronization to make sure that you don't flip back to
> the first buffer while some laggard is still using it to construct a
> snapshot that he started taking before you flipped to the second one,
> but maybe that could be made light-weight enough not to matter.
> 
> I am somewhat concerned that this approach might lead to a lot of
> contention over the snapshot buffers.  

My hope was that this contention would be the same as simply writing
the WAL buffers currently, and thus largely hidden by the current WAL
writing sync mechanism.

It really covers just the part which writes commit records to WAL, as
non-commit WAL records don't participate in snapshot updates.

Writing WAL is already a single point which needs locks or some other
kind of synchronization.  This will stay with us at least until we start
supporting multiple WAL streams, and even then we will need some
synchronization between those.

> In particular, the fact that
> you have to touch shared cache lines both to advertise a new XID and
> when it gets committed seems less than ideal. 

Every commit record writer should do this as part of writing the commit
record.  And as you mostly want the latest snapshot anyway, why not just
update the snapshot as part of the commit/abort?

Do we need the ability for fast "recent snapshots" at all?

>  One thing that's kind
> of interesting about the "commit sequence number" approach is that -
> as far as I can tell - it doesn't require new XIDs to be advertised
> anywhere at all.  You don't have to worry about overflowing the
> subxids[] array because it goes away altogether.  The commit sequence
> number itself is going to be a contention hotspot, but at least it's
> small and fixed-size.
> 
> Another concern I have with this approach is - how large do you make
> the buffers?  If you make them too small, then you're going to have to
> regenerate the snapshot frequently, which will lead to the same sort
> of lock contention we have today - no one can commit while the
> snapshot is being regenerated.  On the other hand, if you make them
> too big, then deriving a snapshot gets slow.  Maybe there's some way
> to make it work, but I'm afraid it might end up being yet another
> arcane thing the tuning of which will become a black art among
> hackers...
> 

-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Check constraints on partition parents only?

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 10:01 AM, Nikhil Sontakke
 wrote:
> Yeah, in your case too an initdb would be required, so might as well go down
> the route of a new column. Any preferences for the name?
> connoinh
> conisonly
> constatic or confixed

I'd probably pick conisonly from those choices.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: move dumpUserConfig call in dumpRoles function of pg_dumpall.c

2011-07-28 Thread Tom Lane
Robert Haas  writes:
> On Wed, Jul 27, 2011 at 7:51 PM, Tom Lane  wrote:
>> I think pg_dumpall is the very least of your problems if you do
>> something like that.  We probably ought to forbid it entirely.

> Well, we had a long discussion of that on the thread Phil linked to,
> and I don't think there was any consensus that forbidding it was the
> right thing to do.

You're right, I was half-remembering that thread and thinking that
there are a lot of gotchas in doing an ALTER ROLE SET ROLE.  Florian
claimed in the thread that he'd never hit one before, but that doesn't
convince me much.

> Phil appears to be trying to implement the
> proposal you made here:
> http://archives.postgresql.org/pgsql-hackers/2011-01/msg00452.php
> ...although I don't think that what he did quite matches what you asked for.

No, the proposed patch doesn't go nearly far enough to address Florian's
problem.  What I was speculating about was moving all the role (and
database) alters to the *end*, so they'd not take effect until after
we'd completed all the restore actions.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-28 Thread Robert Haas
On Wed, Jul 27, 2011 at 2:29 PM, Robert Haas  wrote:
> The reason the benefit is smaller is, I believe, because the previous
> numbers were generated with the lazy vxid locks patch applied, and
> these were generated against master.  With the lock manager as a
> bottleneck, the sinval stuff doesn't get hit quite as hard, so the
> benefit is less.  I can regenerate the numbers with the lazy vxid
> patch applied; I suspect they'll be similar to what we saw before.

Yep.  Here's with both lazy-vxids and sinval-hasmessages:

01 tps = 4470.133776 (including connections establishing)
01 tps = 4471.450650 (including connections establishing)
01 tps = 4490.833194 (including connections establishing)
32 tps = 191416.960956 (including connections establishing)
32 tps = 190653.742400 (including connections establishing)
32 tps = 191832.231458 (including connections establishing)
80 tps = 189348.509378 (including connections establishing)
80 tps = 191080.641878 (including connections establishing)
80 tps = 191366.728275 (including connections establishing)

And with just lazy vxids:

01 tps = 4458.667013 (including connections establishing)
01 tps = 4526.922638 (including connections establishing)
01 tps = 4480.415099 (including connections establishing)
32 tps = 193273.434028 (including connections establishing)
32 tps = 190661.279391 (including connections establishing)
32 tps = 189526.560031 (including connections establishing)
80 tps = 150572.020250 (including connections establishing)
80 tps = 118643.970622 (including connections establishing)
80 tps = 119211.643930 (including connections establishing)

Same select-only, scale-factor-100 pgbench test, same 32 core machine,
as I've been using for my other recent tests.

> I'll also test out creating and dropping some tables.

Still need to work on this one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Check constraints on partition parents only?

2011-07-28 Thread Nikhil Sontakke
> This approach certainly can't work, because a table can be both an
> inheritance parent and an inheritance child.  It could have an ONLY
> constraint, and also inherit a copy of the same constraint for one or
> more parents.  IOW, the fact that conislocal = true does not mean that
> coninhcount is irrelevant.


Oh I see.


> I think what you probably want to do is
> either (a) add a new column or (b) change conislocal to a char value
> and make it three-valued:
>
> n = inherited constraint, no local definition
> o = defined locally as an "ONLY" constraint
> i = defined locally as a non-ONLY constraint
>
> I think I favor the latter approach as more space-efficient, but I
> hear Tom muttering about backward-compatibility...
>
>
Yeah, in your case too an initdb would be required, so might as well go down
the route of a new column. Any preferences for the name?

connoinh
conisonly
constatic or confixed

Others?

Regards,
Nikhils


Re: [HACKERS] Check constraints on partition parents only?

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 9:43 AM, Nikhil Sontakke
 wrote:
> Alternatively we could bring about the same
> by using a combination of conislocal and coninhcnt. For ONLY constraints, we
> will need to percolate this information down to the point where we define it
> in the code. We can then mark ONLY constraints to have conislocal set to
> TRUE and coninhcnt set to a special value (-1)

This approach certainly can't work, because a table can be both an
inheritance parent and an inheritance child.  It could have an ONLY
constraint, and also inherit a copy of the same constraint for one or
more parents.  IOW, the fact that conislocal = true does not mean that
coninhcount is irrelevant.  I think what you probably want to do is
either (a) add a new column or (b) change conislocal to a char value
and make it three-valued:

n = inherited constraint, no local definition
o = defined locally as an "ONLY" constraint
i = defined locally as a non-ONLY constraint

I think I favor the latter approach as more space-efficient, but I
hear Tom muttering about backward-compatibility...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Check constraints on partition parents only?

2011-07-28 Thread Tom Lane
Nikhil Sontakke  writes:
> What we need is to persist information of a particular constraint to be as
> specified - ONLY for this table. We could do that by adding a new column in
> pg_constraint like 'connoinh' or something, but I guess we would prefer not
> to get into the initdb business.

Uh, why not?  I trust you're not imagining this would get back-patched.

> Alternatively we could bring about the same
> by using a combination of conislocal and coninhcnt.

Ugh.  New column, please.  If you're wondering why, see the flak Robert
has been taking lately for replacing pg_class.relistemp.  Random changes
in the semantics of existing columns are trouble.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] error: could not find pg_class tuple for index 2662

2011-07-28 Thread Robert Haas
On Wed, Jul 27, 2011 at 8:28 PM, daveg  wrote:
> My client has been seeing regular instances of the following sort of problem:

On what version of PostgreSQL?

> If simplicity worked, the world would be overrun with insects.

I thought it was...  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: move dumpUserConfig call in dumpRoles function of pg_dumpall.c

2011-07-28 Thread Robert Haas
On Wed, Jul 27, 2011 at 7:51 PM, Tom Lane  wrote:
> Phil Sorber  writes:
>> Currently if you use 'ALTER ROLE rolename SET ROLE', pg_dumpall will
>> dump an 'ALTER ROLE' out right after the 'CREATE ROLE' statement.
>
> I think pg_dumpall is the very least of your problems if you do
> something like that.  We probably ought to forbid it entirely.

Well, we had a long discussion of that on the thread Phil linked to,
and I don't think there was any consensus that forbidding it was the
right thing to do.  Phil appears to be trying to implement the
proposal you made here:

http://archives.postgresql.org/pgsql-hackers/2011-01/msg00452.php

...although I don't think that what he did quite matches what you asked for.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Check constraints on partition parents only?

2011-07-28 Thread Nikhil Sontakke
> Now that we have coninhcnt, conislocal etc... we can probably support
> ONLY. But I agree with Robert it's probably a bit more than an
> afternoon to crank out :-)
>

Heh, agreed :), I was just looking for some quick and early feedback. So
what we need is basically a way to indicate that a particular constraint is
non-inheritable forever (meaning - even for future children) and that should
work?

Right now, it seems that the "ONLY" usage in the SQL only translates to a
recurse or no-recurse operation. For the parent, a constraint is marked with
conislocal set to true (coninhcount is 0) and for children, coninhcount is
used to indicate inheritance of that constraint with conislocal being set to
false.

What we need is to persist the information that a particular constraint is
as specified - ONLY for this table. We could do that by adding a new column in
pg_constraint like 'connoinh' or something, but I guess we would prefer not
to get into the initdb business. Alternatively we could bring about the same
by using a combination of conislocal and coninhcnt. For ONLY constraints, we
will need to percolate this information down to the point where we define it
in the code. We can then mark ONLY constraints to have conislocal set to
TRUE and coninhcnt set to a special value (-1). So to summarize, what I am
proposing is:

Introduce new column connoinh (boolean) in pg_constraint

OR in existing infrastructure:

Normal constraints:      conislocal (true)    coninhcnt (0)      (inheritable like today)
Inherited constraints:   conislocal (false)   coninhcnt (n > 0)
ONLY constraints:        conislocal (true)    coninhcnt (-1)     (not inheritable)

With this arrangement, pg_dump will be able to easily identify and spit out
ONLY specifications for specific constraints, and then they won't be blindly
passed on to child tables under these new semantics.

Thoughts? Anything missing? Please let me know.

Regards,
Nikhils


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 6:50 AM, Hannu Krosing  wrote:
> On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane  wrote:
>> > I wonder whether we could do something involving WAL properties --- the
>> > current tuple visibility logic was designed before WAL existed, so it's
>> > not exploiting that resource at all.  I'm imagining that the kernel of a
>> > snapshot is just a WAL position, ie the end of WAL as of the time you
>> > take the snapshot (easy to get in O(1) time).  Visibility tests then
>> > reduce to "did this transaction commit with a WAL record located before
>> > the specified position?".
>
> Why not just cache a "reference snapshots" near WAL writer and maybe
> also save it at some interval in WAL in case you ever need to restore an
> old snapshot at some WAL position for things like time travel.
>
> It may be cheaper lock-wise not to update ref. snapshot at each commit,
> but to keep latest saved snapshot and a chain of transactions
> committed / aborted since. This means that when reading the snapshot you
> read the current "saved snapshot" and then apply the list of commits.

Yeah, interesting idea.  I thought about that.  You'd need not only
the list of commits but also the list of XIDs that had been published,
since the commits have to be removed from the snapshot and the
newly-published XIDs have to be added to it (in case they commit later
while the snapshot is still in use).

You can imagine doing this with a pair of buffers.  You write a
snapshot into the beginning of the first buffer and then write each
XID that is published or commits into the next slot in the array.
When the buffer is filled up, the next process that wants to publish
an XID or commit scans through the array and constructs a new snapshot
that compacts away all the begin/commit pairs and writes it into the
second buffer, and all new snapshots are taken there.  When that
buffer fills up you flip back to the first one.  Of course, you need
some kind of synchronization to make sure that you don't flip back to
the first buffer while some laggard is still using it to construct a
snapshot that he started taking before you flipped to the second one,
but maybe that could be made light-weight enough not to matter.
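
As a very rough sketch of the data structure I'm imagining (invented names,
with locking and the actual buffer flip elided):

#include <stdbool.h>
#include <stdint.h>

#define BUFFER_SLOTS 1024

typedef struct XidEvent
{
    uint32_t xid;
    bool     is_commit;         /* false = newly published, true = committed */
} XidEvent;

typedef struct SnapshotBuffer
{
    uint32_t base_xids[BUFFER_SLOTS];   /* xids running when buffer started */
    int      n_base_xids;
    XidEvent events[BUFFER_SLOTS];      /* appended as xids appear/commit */
    int      n_events;
} SnapshotBuffer;

static SnapshotBuffer buffers[2];
static int current_buf = 0;

/* Called both when a transaction publishes its XID and when it commits.
 * A reader builds its snapshot from base_xids plus the events written so
 * far; begin/commit pairs cancel out when the buffer is compacted. */
static void
append_event(uint32_t xid, bool is_commit)
{
    SnapshotBuffer *buf = &buffers[current_buf];

    if (buf->n_events == BUFFER_SLOTS)
    {
        /* Compact base_xids + events into buffers[1 - current_buf] and
         * flip current_buf; synchronization with laggard readers is the
         * hard part and is elided here. */
        return;
    }
    buf->events[buf->n_events].xid = xid;
    buf->events[buf->n_events].is_commit = is_commit;
    buf->n_events++;
}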

I am somewhat concerned that this approach might lead to a lot of
contention over the snapshot buffers.  In particular, the fact that
you have to touch shared cache lines both to advertise a new XID and
when it gets committed seems less than ideal.  One thing that's kind
of interesting about the "commit sequence number" approach is that -
as far as I can tell - it doesn't require new XIDs to be advertised
anywhere at all.  You don't have to worry about overflowing the
subxids[] array because it goes away altogether.  The commit sequence
number itself is going to be a contention hotspot, but at least it's
small and fixed-size.

Another concern I have with this approach is - how large do you make
the buffers?  If you make them too small, then you're going to have to
regenerate the snapshot frequently, which will lead to the same sort
of lock contention we have today - no one can commit while the
snapshot is being regenerated.  On the other hand, if you make them
too big, then deriving a snapshot gets slow.  Maybe there's some way
to make it work, but I'm afraid it might end up being yet another
arcane thing the tuning of which will become a black art among
hackers...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 4:16 AM, Florian Pflug  wrote:
> On Jul28, 2011, at 04:51 , Robert Haas wrote:
>> One fly in the ointment is that 8-byte
>> stores are apparently done as two 4-byte stores on some platforms.
>> But if the counter runs backward, I think even that is OK.  If you
>> happen to read an 8 byte value as it's being written, you'll get 4
>> bytes of the intended value and 4 bytes of zeros.  The value will
>> therefore appear to be less than what it should be.  However, if the
>> value was in the midst of being written, then it's still in the midst
>> of committing, which means that that XID wasn't going to be visible
>> anyway.  Accidentally reading a smaller value doesn't change the
>> answer.
>
> That only works if the update of the most-significant word is guaranteed
> to be visible before the update to the least-significant one, which
> I think you can only enforce if you update the words individually
> (and use a fence on e.g. PPC32). Otherwise you're at the mercy of the
> compiler.
>
> Otherwise, the following might happen (with a 2-byte value instead of an
> 8-byte one, and the assumption that 1-byte stores are atomic while 2-byte
> ones aren't, just to keep the numbers smaller; the machine is assumed to be
> big-endian).
>
> The counter is at 0x0100
> Backend 1 decrements, i.e. does
> (1)  STORE [counter+1], 0xff
> (2)  STORE [counter], 0x00
>
> Backend 2 reads
> (1')  LOAD [counter+1]
> (2')  LOAD [counter]
>
> If the sequence of events is (1), (1'), (2'), (2), backend 2 will read
> 0x01ff, which is higher than it should be.

You're confusing two different things - I agree that you need a
spinlock around reading the counter, unless 8-byte loads and stores
are atomic.

What I'm saying can be done without a lock is reading the commit-order
value for a given XID.  If that's in the middle of being updated, then
the old value was zero, so the scenario you describe can't occur.
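
A minimal sketch of that read path, with invented names and the exact
boundary test glossed over (it depends on how the commit stamp and the
counter decrement are ordered), might look like this:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t CommitSeqNo;   /* counter runs backward; 0 = not committed */

/* shared array of commit-order values, indexed by xid (hypothetical) */
extern volatile CommitSeqNo *commit_seqno_by_xid;

static bool
xid_visible_in_snapshot(TransactionId xid, CommitSeqNo snapshot_csn)
{
    CommitSeqNo csn = commit_seqno_by_xid[xid];     /* plain load, no lock */

    /*
     * If this load is torn (half of the new value plus half of the old
     * zeros), the value we see is no larger than the real one.  But a torn
     * read can only happen while the slot is being written, i.e. while that
     * transaction is committing right now, after our snapshot was taken, so
     * treating it as not-yet-visible gives the right answer anyway.
     */
    if (csn == 0)
        return false;               /* not committed as of our read */

    /*
     * The counter counts down, so transactions that committed before the
     * snapshot was taken carry larger values than the snapshot's.
     */
    return csn > snapshot_csn;
}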

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 3:46 AM, Simon Riggs  wrote:
> Sounds like the right set of thoughts to be having.

Thanks.

> If you do this, you must cover subtransactions and Hot Standby. Work
> in this area takes longer than you think when you take the
> complexities into account, as you must.

Right.  This would replace the KnownAssignedXids stuff (a non-trivial
project, I am sure).

> I think you should take the premise of making snapshots O(1) and look
> at all the ways of doing that. If you grab too early at a solution you
> may grab the wrong one.

Yeah, I'm just brainstorming at this point.  This is, I think, the
best idea I've come up with so far, but it's definitely not
the only approach.

> For example, another approach would be to use a shared hash table.
> Snapshots are O(1), committing is O(k), using the snapshot is O(logN).
> N can be kept small by regularly pruning the hash table. If we crash
> we lose the hash table - no matter. (I'm not suggesting this is
> better, just a different approach that should be judged across
> others).

Sorry, I'm having a hard time understanding what you are describing
here.  What would the keys and values in this hash table be, and what
do k and N refer to here?

> What I'm not sure about in any of these ideas is how to derive a snapshot xmin.

That is a problem.  If we have to scan the ProcArray every time we
take a snapshot just to derive an xmin, we are kind of hosed.  One
thought I had is that we might be able to use a sort of sloppy xmin.
In other words, we keep a cached xmin, and have some heuristic where
we occasionally try to update it.  A snapshot with a too-old xmin
isn't wrong, just possibly slower.  But if xmin is only slightly stale
and xids can be tested relatively quickly, it might not matter very
much.
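
A toy sketch of that heuristic, with invented names (a real version would
keep this state in shared memory with suitable locking, and would need
xid-wraparound-aware comparisons):

#include <stdint.h>

typedef uint32_t TransactionId;

#define XMIN_REFRESH_INTERVAL 64        /* arbitrary heuristic */

/* the slow path we are trying to avoid doing every time */
extern TransactionId compute_xmin_from_procarray(void);

static TransactionId cached_xmin;
static unsigned      snapshots_since_refresh = XMIN_REFRESH_INTERVAL;

static TransactionId
get_snapshot_xmin(void)
{
    /*
     * Only pay for the ProcArray scan occasionally.  In between, reuse the
     * cached value: it can only be older (never newer) than the true xmin,
     * so visibility answers stay correct; the snapshot is merely a little
     * slower to use than it would be with a freshly computed xmin.
     */
    if (++snapshots_since_refresh >= XMIN_REFRESH_INTERVAL)
    {
        cached_xmin = compute_xmin_from_procarray();
        snapshots_since_refresh = 0;
    }
    return cached_xmin;
}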

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How to use CreateFunctionStmt's RETURN TABLE?

2011-07-28 Thread Dickson S. Guedes
2011/7/26 _石头 :
[... cut ...]
>           I do not know how to use the second syntax: RETURNS TABLE '('
> table_func_column_list ')' createfunc_opt_list opt_definition.
>           Could someone help me write a simple example of this syntax?
>  Thank you very much. Looking forward to your help!

(I'm supposing that you are talking about the syntax of 'CREATE
FUNCTION' itself and not about the bison entry in that file.)

See the docs [1] and this post [2]. They could help you.

[1] http://www.postgresql.org/docs/current/static/sql-createfunction.html
[2] http://www.postgresonline.com/journal/archives/201-returns-table.html

Best regards,
-- 
Dickson S. Guedes
mail/xmpp: gue...@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane  wrote:
> > I wonder whether we could do something involving WAL properties --- the
> > current tuple visibility logic was designed before WAL existed, so it's
> > not exploiting that resource at all.  I'm imagining that the kernel of a
> > snapshot is just a WAL position, ie the end of WAL as of the time you
> > take the snapshot (easy to get in O(1) time).  Visibility tests then
> > reduce to "did this transaction commit with a WAL record located before
> > the specified position?".  

Why not just cache a "reference snapshot" near the WAL writer, and maybe
also save it at some interval in WAL in case you ever need to restore an
old snapshot at some WAL position for things like time travel?

It may be cheaper lock-wise not to update the ref. snapshot at each
commit, but to keep the latest saved snapshot and a chain of transactions
committed / aborted since then. This means that when reading the snapshot
you read the current "saved snapshot" and then apply the list of commits.

When moving to a new saved snapshot you really generate a new one and
keep the old snapshot + commit chain around for a little while for those
who may still be processing it. This seems like something that can be
done with no locking.

> > You'd need some index datastructure that made
> > it reasonably cheap to find out the commit locations of recently
> > committed transactions, where "recent" means "back to recentGlobalXmin".
> > That seems possibly do-able, though I don't have a concrete design in
> > mind.

A snapshot + chain of commits is likely as cheap as it gets, unless you
additionally cache the commits in a tighter data structure. This is
because you will need them all anyway to compute the difference from the
ref snapshot.


-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Florian Pflug
On Jul28, 2011, at 04:51 , Robert Haas wrote:
> One fly in the ointment is that 8-byte
> stores are apparently done as two 4-byte stores on some platforms.
> But if the counter runs backward, I think even that is OK.  If you
> happen to read an 8 byte value as it's being written, you'll get 4
> bytes of the intended value and 4 bytes of zeros.  The value will
> therefore appear to be less than what it should be.  However, if the
> value was in the midst of being written, then it's still in the midst
> of committing, which means that that XID wasn't going to be visible
> anyway.  Accidentally reading a smaller value doesn't change the
> answer.

That only works if the update of the most-significant word is guaranteed
to be visible before the update to the least-significant one, which
I think you can only enforce if you update the words individually
(and use a fence on e.g. PPC32). Otherwise you're at the mercy of the
compiler.

Otherwise, the following might happen (with a 2-byte value instead of an
8-byte one, and the assumption that 1-byte stores are atomic while 2-byte
ones aren't, just to keep the numbers smaller; the machine is assumed to be
big-endian).

The counter is at 0x0100
Backend 1 decrements, i.e. does
(1)  STORE [counter+1], 0xff
(2)  STORE [counter], 0x00

Backend 2 reads
(1')  LOAD [counter+1]
(2')  LOAD [counter]

If the sequence of events is (1), (1'), (2'), (2), backend 2 will read
0x01ff, which is higher than it should be.

But we could simply use a spin-lock to protect the read on machines where
we don't know for sure that 64-bit reads and writes are atomic. That'll
only really hurt on machines with 16+ cores or so, and the number of
architectures which support that isn't that high anyway. If we supported
spinlock-less operation on SPARC, x86-64, PPC64 and maybe Itanium, would we
miss any important one?
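
A self-contained sketch of that fallback, using plain pthreads rather than
PostgreSQL's spinlocks and a made-up configure symbol for the "known to be
atomic" case:

#include <pthread.h>
#include <stdint.h>

static pthread_spinlock_t counter_lock;     /* pthread_spin_init() at startup */
static volatile uint64_t  commit_counter;

static uint64_t
read_commit_counter(void)
{
#ifdef HAVE_ATOMIC_64BIT_LOADS              /* hypothetical configure symbol */
    /* platforms where an aligned 8-byte load is known to be atomic */
    return commit_counter;
#else
    /* everywhere else, a short critical section keeps the read untorn */
    uint64_t value;

    pthread_spin_lock(&counter_lock);
    value = commit_counter;
    pthread_spin_unlock(&counter_lock);
    return value;
#endif
}

On the platforms that take the #else branch, whoever updates the counter
would of course have to take the same lock for this to buy anything.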


best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Simon Riggs
On Thu, Jul 28, 2011 at 3:51 AM, Robert Haas  wrote:

> All that having been said, even if I haven't made any severe
> conceptual errors in the above, I'm not sure how well it will work in
> practice.  On the plus side, taking a snapshot becomes O(1) rather
> than O(MaxBackends) - that's good.  On the further plus side, you can
> check both whether an XID has committed and whether it's visible to
> your snapshot in a single, atomic action with no lock - that seems
> really good.  On the minus side, checking an xid against your snapshot
> now has less locality of reference.  And, rolling over into a new
> segment of the array is going to require everyone to map it, and maybe
> cause some disk I/O as a new file gets created.

Sounds like the right set of thoughts to be having.

If you do this, you must cover subtransactions and Hot Standby. Work
in this area takes longer than you think when you take the
complexities into account, as you must.

I think you should take the premise of making snapshots O(1) and look
at all the ways of doing that. If you grab too early at a solution you
may grab the wrong one.

For example, another approach would be to use a shared hash table.
Snapshots are O(1), committing is O(k), using the snapshot is O(logN).
N can be kept small by regularly pruning the hash table. If we crash
we lose the hash table - no matter. (I'm not suggesting this is
better, just a different approach that should be judged across
others).

What I'm not sure about in any of these ideas is how to derive a snapshot xmin.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers