Re: [HACKERS] cheaper snapshots redux

2011-09-13 Thread Robert Haas
On Tue, Sep 13, 2011 at 7:49 AM, Amit Kapila amit.kap...@huawei.com wrote:
Yep, that's pretty much what it does, although xmax is actually
defined as the XID *following* the last one that ended, and I think
xmin needs to also be in xip, so in this case you'd actually end up
with xmin = 15, xmax = 22, xip = { 15, 16, 17, 19 }.  But you've got
the basic idea of it.

 Shouldn't Xmax be 21, since the current check in tuple visibility indicates
 that if an XID is greater than or equal to Xmax then the tuple is not
 visible?

No, that's not OK.  You stipulated 21 as committed, so it had better be visible.
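
To make the boundary concrete, here is a minimal C sketch of the visibility
test being discussed (illustrative only, with simplified types; the real
logic lives in tqual.c and also handles wraparound and subtransactions):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    typedef struct
    {
        TransactionId xmin;   /* every XID < xmin has already finished */
        TransactionId xmax;   /* first XID not yet completed at snapshot time */
        TransactionId *xip;   /* XIDs in [xmin, xmax) still in progress */
        int           xcnt;
    } MiniSnapshot;

    /* Return true if 'xid' counts as still running under this snapshot. */
    static bool
    XidInMiniSnapshot(TransactionId xid, const MiniSnapshot *snap)
    {
        if (xid < snap->xmin)
            return false;       /* finished before the snapshot was taken */
        if (xid >= snap->xmax)
            return true;        /* began after the snapshot: not visible */
        for (int i = 0; i < snap->xcnt; i++)
            if (snap->xip[i] == xid)
                return true;    /* explicitly listed as in progress */
        return false;           /* completed before the snapshot */
    }

Plugging in the example: with xmax = 22 and xip = { 15, 16, 17, 19 }, XID 21
fails both the xmax test and the xip lookup, so the committed transaction is
treated as visible; with xmax = 21 the first test would fire and wrongly
hide it.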

In particular, if someone with proc->xmin = InvalidTransactionId is
taking a snapshot while you're computing RecentGlobalXmin, and then
stores a proc->xmin less than your newly-computed RecentGlobalXmin,
you've got a problem.

 I am assuming that here, by taking a snapshot, you mean it has to be
 updated in shared memory, because otherwise there is no need to refer to
 the procs with your new design.

 Session-1:
 Updating RecentGlobalXmin during GetSnapshotData() using the shared-memory
 copy of the snapshot and the completed transactions, since RecentGlobalXmin
 can be updated once we have an xmin.

 Session-2:
 Getting a snapshot to update in shared memory; for this it needs to go
 through the procarray.
 While going through the procarray under ProcArrayLock, it can happen that
 Session-1's proc has InvalidTransactionId, so we ignore it and go through
 the remaining sessions' procs.
 Normally Session-1's proc should not end up with a lesser xmin than the
 other sessions' procs, but if it got its copy from the shared-memory ring
 buffer before the other sessions' procs did, its xmin can be lower, and
 that can cause a problem.

 It's not one extra read - you'd have to look at every PGPROC.

 If the above explanation is right, then is this the reason that, to update
 RecentGlobalXmin, it has to go through every PGPROC?

Your explanation isn't very clear to me.  But I will post the patch
once I have some of these details sorted out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots redux

2011-09-13 Thread Amit Kapila

Yep, that's pretty much what it does, although xmax is actually
defined as the XID *following* the last one that ended, and I think
xmin needs to also be in xip, so in this case you'd actually end up
with xmin = 15, xmax = 22, xip = { 15, 16, 17, 19 }.  But you've got
the basic idea of it.

Shouldn't Xmax be 21, since the current check in tuple visibility indicates
that if an XID is greater than or equal to Xmax then the tuple is not
visible?

In particular, if someone with proc->xmin = InvalidTransactionId is
taking a snapshot while you're computing RecentGlobalXmin, and then
stores a proc->xmin less than your newly-computed RecentGlobalXmin,
you've got a problem.

I am assuming that here, by taking a snapshot, you mean it has to be
updated in shared memory, because otherwise there is no need to refer to
the procs with your new design.

Session-1:
Updating RecentGlobalXmin during GetSnapshotData() using the shared-memory
copy of the snapshot and the completed transactions, since RecentGlobalXmin
can be updated once we have an xmin.

Session-2:
Getting a snapshot to update in shared memory; for this it needs to go
through the procarray.
While going through the procarray under ProcArrayLock, it can happen that
Session-1's proc has InvalidTransactionId, so we ignore it and go through
the remaining sessions' procs.
Normally Session-1's proc should not end up with a lesser xmin than the
other sessions' procs, but if it got its copy from the shared-memory ring
buffer before the other sessions' procs did, its xmin can be lower, and
that can cause a problem.

 It's not one extra read - you'd have to look at every PGPROC.  

If the above explanation is right, then is this the reason that, to update
RecentGlobalXmin, it has to go through every PGPROC?


-Original Message-
From: Robert Haas [mailto:robertmh...@gmail.com] 
Sent: Monday, September 12, 2011 9:31 PM
To: Amit Kapila
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] cheaper snapshots redux

On Mon, Sep 12, 2011 at 11:07 AM, Amit Kapila amit.kap...@huawei.com
wrote:
If you know what transactions were running the last time a snapshot summary
 was written and what transactions have ended since then, you can work out
 the new xmin on the fly.  I have working code for this and it's actually
 quite simple.

 I believe one method to do the same is as follows:

 Let us assume that at some point in time the snapshot and the completed-XID
 list are as follows:

 Snapshot

 { Xmin = 5, Xip[] = { 8, 10, 12 }, Xmax = 15 }

 Committed XIDs = { 8, 10, 12, 18, 20, 21 }

 So it means 16, 17, 19 are running transactions, and the result will be as
 follows:

 { Xmin = 16, Xmax = 21, Xip[] = { 17, 19 } }

Yep, that's pretty much what it does, although xmax is actually
defined as the XID *following* the last one that ended, and I think
xmin needs to also be in xip, so in this case you'd actually end up
with xmin = 15, xmax = 22, xip = { 15, 16, 17, 19 }.  But you've got
the basic idea of it.

 But if we calculate Xmin the above way, we need to search the existing Xip
 array and the committed-XID array to find Xmin.  Won't this take
 considerable time if Xip and the XID list are large, even though it is
 outside the lock?

Yes, Tom raised this concern earlier.  I can't answer it for sure
without benchmarking, but clearly xip[] can't be allowed to get too
big.

 Because GetSnapshotData() computes a new value for RecentGlobalXmin by
 scanning the ProcArray.  This isn't costing a whole lot extra right now
 because the xmin and xid fields are normally in the same cache line, so
 once you've looked at one of them it doesn't cost that much extra to
 look at the other.  If, on the other hand, you're not looking at (or even
 locking) the ProcArray, then doing so just to recompute RecentGlobalXmin
 sucks.

 Yes, this is more time compared to earlier, but if our approach to
 calculating Xmin is as in the above point, then one extra read outside the
 lock should not matter.  However, if the approach for the above point is
 different, then it will be costlier.

It's not one extra read - you'd have to look at every PGPROC.  And it
is not outside a lock, either.  You definitely need locking around
computing RecentGlobalXmin; see src/backend/access/transam/README.  In
particular, if someone with proc->xmin = InvalidTransactionId is
taking a snapshot while you're computing RecentGlobalXmin, and then
stores a proc->xmin less than your newly-computed RecentGlobalXmin,
you've got a problem.  That can't happen right now because no
transactions can commit while

Re: [HACKERS] cheaper snapshots redux

2011-09-12 Thread Robert Haas
On Sun, Sep 11, 2011 at 11:08 PM, Amit Kapila amit.kap...@huawei.com wrote:
   In the approach mentioned in your idea, it is mentioned that after
 taking a snapshot, only committed XIDs will be updated, and sometimes the
 snapshot itself.

   So when will the xmin be updated according to your idea? As the snapshot
 will not be updated every time, the xmin cannot always be the latest.

If you know what transactions were running the last time a snapshot
summary was written and what transactions have ended since then, you
can work out the new xmin on the fly.  I have working code for this
and it's actually quite simple.

RecentGlobalXmin doesn't need to be completely up to date, and in fact
 recomputing it on every snapshot becomes prohibitively expensive with this
 approach.  I'm still struggling with the best way to handle that.

   RecentGlobalXmin and RecentXmin are mostly updated with the snapshot's
 xmin, and that too outside ProcArrayLock, so why will it be expensive if
 you have an updated xmin?

Because GetSnapshotData() computes a new value for RecentGlobalXmin by
scanning the ProcArray.  This isn't costing a whole lot extra right
now because the xmin and xid fields are normally in the same cache
line, so once you've looked at one of them it doesn't cost that much
extra to look at the other.  If, on the other hand, you're not looking
at (or even locking) the ProcArray, then doing so just to recompute
RecentGlobalXmin sucks.
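
For reference, the per-snapshot work being described is roughly the
following loop - a stripped-down sketch, not the real GetSnapshotData(),
which holds ProcArrayLock and uses TransactionIdPrecedes() to cope with XID
wraparound:

    #include <stdint.h>

    typedef uint32_t TransactionId;
    #define InvalidTransactionId ((TransactionId) 0)

    /* Hypothetical, stripped-down PGPROC with only the fields used here. */
    typedef struct
    {
        TransactionId xid;    /* backend's running XID, or invalid */
        TransactionId xmin;   /* xmin of the backend's snapshot, or invalid */
    } MiniProc;

    /* One pass over every backend, folding each xid and snapshot xmin into
     * the result; this is the cost paid on every snapshot today. */
    static TransactionId
    ComputeGlobalXmin(const MiniProc *procs, int nprocs, TransactionId nextXid)
    {
        TransactionId result = nextXid;

        for (int i = 0; i < nprocs; i++)
        {
            if (procs[i].xid != InvalidTransactionId && procs[i].xid < result)
                result = procs[i].xid;
            if (procs[i].xmin != InvalidTransactionId && procs[i].xmin < result)
                result = procs[i].xmin;
        }
        return result;
    }

The cache-line point is visible here: xid and xmin sit side by side in each
entry, so the second comparison is nearly free once the first has been paid
for.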

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots redux

2011-09-12 Thread Robert Haas
On Mon, Sep 12, 2011 at 11:07 AM, Amit Kapila amit.kap...@huawei.com wrote:
If you know what transactions were running the last time a snapshot summary
 was written and what transactions have ended since then, you can work out
 the new xmin on the fly.  I have working code for this and it's actually
 quite simple.

 I believe one method to do the same is as follows:

 Let us assume that at some point in time the snapshot and the completed-XID
 list are as follows:

 Snapshot

 { Xmin = 5, Xip[] = { 8, 10, 12 }, Xmax = 15 }

 Committed XIDs = { 8, 10, 12, 18, 20, 21 }

 So it means 16, 17, 19 are running transactions, and the result will be as
 follows:

 { Xmin = 16, Xmax = 21, Xip[] = { 17, 19 } }

Yep, that's pretty much what it does, although xmax is actually
defined as the XID *following* the last one that ended, and I think
xmin needs to also be in xip, so in this case you'd actually end up
with xmin = 15, xmax = 22, xip = { 15, 16, 17, 19 }.  But you've got
the basic idea of it.
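
For concreteness, here is one way that derivation could look - an
illustrative sketch, not the patch itself: XID wraparound is ignored, both
arrays are assumed sorted ascending, and 'done' holds every XID that
committed or aborted since the summary was written:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    static bool
    xid_in(TransactionId xid, const TransactionId *list, int n)
    {
        for (int i = 0; i < n; i++)
            if (list[i] == xid)
                return true;
        return false;
    }

    /* Rebuild a current snapshot from an old summary plus the XIDs that
     * have ended since.  With old xmax = 15, old xip = { 8, 10, 12 } and
     * done = { 8, 10, 12, 18, 20, 21 } this yields xmin = 15, xmax = 22,
     * xip = { 15, 16, 17, 19 }, matching the worked example above. */
    static void
    derive_snapshot(TransactionId old_xmax,
                    const TransactionId *old_xip, int old_xcnt,
                    const TransactionId *done, int ndone,
                    TransactionId *xip, int *xcnt,
                    TransactionId *xmin, TransactionId *xmax)
    {
        TransactionId maxdone = 0;

        for (int i = 0; i < ndone; i++)
            if (done[i] > maxdone)
                maxdone = done[i];
        *xmax = (maxdone + 1 > old_xmax) ? maxdone + 1 : old_xmax;

        *xcnt = 0;
        /* old in-progress XIDs that still haven't ended */
        for (int i = 0; i < old_xcnt; i++)
            if (!xid_in(old_xip[i], done, ndone))
                xip[(*xcnt)++] = old_xip[i];
        /* XIDs assigned since the old snapshot that haven't ended either */
        for (TransactionId x = old_xmax; x < *xmax; x++)
            if (!xid_in(x, done, ndone))
                xip[(*xcnt)++] = x;

        /* xip stays sorted, so the new xmin is simply its first entry. */
        *xmin = (*xcnt > 0) ? xip[0] : *xmax;
    }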

 But if we calculate Xmin the above way, we need to search the existing Xip
 array and the committed-XID array to find Xmin.  Won't this take
 considerable time if Xip and the XID list are large, even though it is
 outside the lock?

Yes, Tom raised this concern earlier.  I can't answer it for sure
without benchmarking, but clearly xip[] can't be allowed to get too
big.

 Because GetSnapshotData() computes a new value for RecentGlobalXmin by
 scanning the ProcArray.  This isn't costing a whole lot extra right now
 because the xmin and xid fields are normally in the same cache line, so
 once you've looked at one of them it doesn't cost that much extra to
 look at the other.  If, on the other hand, you're not looking at (or even
 locking) the ProcArray, then doing so just to recompute RecentGlobalXmin
 sucks.

 Yes, this is more time compared to earlier, but if our approach to
 calculating Xmin is as in the above point, then one extra read outside the
 lock should not matter.  However, if the approach for the above point is
 different, then it will be costlier.

It's not one extra read - you'd have to look at every PGPROC.  And it
is not outside a lock, either.  You definitely need locking around
computing RecentGlobalXmin; see src/backend/access/transam/README.  In
particular, if someone with proc->xmin = InvalidTransactionId is
taking a snapshot while you're computing RecentGlobalXmin, and then
stores a proc->xmin less than your newly-computed RecentGlobalXmin,
you've got a problem.  That can't happen right now because no
transactions can commit while RecentGlobalXmin is being computed, but
the point here is precisely to allow those operations to (mostly) run
in parallel.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots redux

2011-09-12 Thread Amit Kapila
If you know what transactions were running the last time a snapshot summary
was written and what transactions have ended since then, you can work out
the new xmin on the fly.  I have working code for this and it's actually
quite simple.

 

I believe one method to do the same is as follows:

Let us assume that at some point in time the snapshot and the completed-XID
list are as follows:

Snapshot

{ Xmin = 5, Xip[] = { 8, 10, 12 }, Xmax = 15 }

Committed XIDs = { 8, 10, 12, 18, 20, 21 }

So it means 16, 17, 19 are running transactions, and the result will be as
follows:

{ Xmin = 16, Xmax = 21, Xip[] = { 17, 19 } }

 

But if we calculate Xmin the above way, we need to search the existing Xip
array and the committed-XID array to find Xmin.  Won't this take
considerable time if Xip and the XID list are large, even though it is
outside the lock?

 

 Because GetSnapshotData() computes a new value for RecentGlobalXmin by
 scanning the ProcArray.  This isn't costing a whole lot extra right now
 because the xmin and xid fields are normally in the same cache line, so
 once you've looked at one of them it doesn't cost that much extra to
 look at the other.  If, on the other hand, you're not looking at (or even
 locking) the ProcArray, then doing so just to recompute RecentGlobalXmin
 sucks.

 

Yes, this is more time compared to earlier, but if our approach to
calculating Xmin is as in the above point, then one extra read outside the
lock should not matter.  However, if the approach for the above point is
different, then it will be costlier.

 



 

 

-Original Message-
From: Robert Haas [mailto:robertmh...@gmail.com] 
Sent: Monday, September 12, 2011 7:39 PM
To: Amit Kapila
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] cheaper snapshots redux

 

On Sun, Sep 11, 2011 at 11:08 PM, Amit Kapila amit.kap...@huawei.com
wrote:
   In the approach mentioned in your idea, it is mentioned that after
 taking a snapshot, only committed XIDs will be updated, and sometimes the
 snapshot itself.

   So when will the xmin be updated according to your idea? As the snapshot
 will not be updated every time, the xmin cannot always be the latest.

If you know what transactions were running the last time a snapshot
summary was written and what transactions have ended since then, you
can work out the new xmin on the fly.  I have working code for this
and it's actually quite simple.

RecentGlobalXmin doesn't need to be completely up to date, and in fact
 recomputing it on every snapshot becomes prohibitively expensive with this
 approach.  I'm still struggling with the best way to handle that.

   RecentGlobalXmin and RecentXmin are mostly updated with the snapshot's
 xmin, and that too outside ProcArrayLock, so why will it be expensive if
 you have an updated xmin?

Because GetSnapshotData() computes a new value for RecentGlobalXmin by
scanning the ProcArray.  This isn't costing a whole lot extra right
now because the xmin and xid fields are normally in the same cache
line, so once you've looked at one of them it doesn't cost that much
extra to look at the other.  If, on the other hand, you're not looking
at (or even locking) the ProcArray, then doing so just to recompute
RecentGlobalXmin sucks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots redux

2011-09-12 Thread Amit Kapila
 4. Won't it affect things if we don't update xmin every time and just note
the committed XIDs? The reason I am asking is that xmin is used in the tuple
visibility check, so with the new idea, in some cases instead of just
returning at the beginning by checking xmin, it has to go through the
committed XID list.

 I understand that there may be few such cases, or the improvement from your
idea may supersede this minimal effect. However, some cases can regress.

The snapshot xmin has to be up to date.  I'm not planning to break that
because it would be wrong.

   In the approach mentioned in your idea, it is mentioned that after taking
a snapshot, only committed XIDs will be updated, and sometimes the snapshot
itself.

   So when will the xmin be updated according to your idea? As the snapshot
will not be updated every time, the xmin cannot always be the latest.

RecentGlobalXmin doesn't need to be completely up to date, and in fact
recomputing it on every snapshot becomes prohibitively expensive with this
approach.  I'm still struggling with the best way to handle that.

   RecentGlobalXmin and RecentXmin are mostly updated with the snapshot's
xmin, and that too outside ProcArrayLock, so why will it be expensive if you
have an updated xmin?



With Regards,

Amit Kapila.




-Original Message-
From: Robert Haas [mailto:robertmh...@gmail.com] 
Sent: Thursday, September 08, 2011 7:50 PM
To: Amit Kapila
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] cheaper snapshots redux

On Tue, Sep 6, 2011 at 11:06 PM, Amit Kapila amit.kap...@huawei.com wrote:
 1. With the above, you want to reduce/remove the concurrency issue between
 GetSnapshotData() [used at the beginning of SQL command execution] and
 ProcArrayEndTransaction() [used at end of transaction]. The concurrency
 issue is mainly ProcArrayLock, which is taken by GetSnapshotData() in
 shared mode and by ProcArrayEndTransaction() in exclusive mode.
 There may be other instances of a similar thing, but this is the main thing
 you want to resolve.

Yep.

 2. You want to resolve it by using a ring buffer such that readers don't
 need to take any lock.

Yep.  Actually, they're still going to need some spinlocks at least in
the first go round, to protect the pointers.  I'm hoping those can
eventually be eliminated on machines with 8-byte atomic reads using
appropriate memory barrier primitives.

 1. Two writers: Won't two different sessions that try to commit at the
 same time get the same write pointer?
    I assume it will be protected, as indicated in one of your replies, if
 I understood correctly?

Yes, commits have to be serialized.  No way around that.  The best
we'll ever be able to do is shorten the critical section.

 2. One reader, one writer: It might be the case that somebody has written
 a new snapshot and advanced the stop pointer, and at that point in time a
 reader came and read between the start pointer and the stop pointer. Now
 the reader will see the following:
   snapshot, a few XIDs, snapshot

    So will it handle this situation such that it will only read the latest
 snapshot?

In my prototype implementation that can't happen because the start and
stop pointers are protected by a single spinlock and are moved
simultaneously.  But I think on machines with 8-byte atomic writes we
can get rid of that and just move the stop pointer first and then the
start pointer.  If you get more than one snapshot in the middle you
just ignore the first part of the data you read and start with the
beginning of the last snapshot.

 3. How will you detect an overwrite?

If the write pointer is greater than the start pointer by more than
the ring size, you've wrapped.

 4. Won't it affect things if we don't update xmin every time and just note
 the committed XIDs? The reason I am asking is that xmin is used in the
 tuple visibility check,
    so with the new idea, in some cases instead of just returning at the
 beginning by checking xmin, it has to go through the committed XID list.
    I understand that there may be few such cases, or the improvement from
 your idea may supersede this minimal effect. However, some cases can
 regress.

The snapshot xmin has to be up to date.  I'm not planning to break
that because it would be wrong.

RecentGlobalXmin doesn't need to be completely up to date, and in fact
recomputing it on every snapshot becomes prohibitively expensive with
this approach.  I'm still struggling with the best way to handle that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] cheaper snapshots redux

2011-09-12 Thread Amit Kapila
 4. Won't it affect things if we don't update xmin every time and just note
 the committed XIDs? The reason I am asking is that xmin is used in the
 tuple visibility check, so with the new idea, in some cases instead of just
 returning at the beginning by checking xmin, it has to go through the
 committed XID list.

 I understand that there may be few such cases, or the improvement from your
 idea may supersede this minimal effect. However, some cases can regress.

The snapshot xmin has to be up to date.  I'm not planning to break that 
because it would be wrong.

   In the approach mentioned in your idea, it is mentioned that after taking
a snapshot, only committed XIDs will be updated, and sometimes the snapshot
itself.

   So when will the xmin be updated according to your idea?

RecentGlobalXmin doesn't need to be completely up to date, and in fact 
recomputing it on every snapshot becomes prohibitively expensive with this 
approach.  I'm still struggling with the best way to handle that.

   RecentGlobalXmin and RecentXmin are mostly updated with the snapshot's
xmin, and that too outside ProcArrayLock, so why will it be expensive if you
have an updated xmin?


With Regards,

Amit Kapila.



Re: [HACKERS] cheaper snapshots redux

2011-09-08 Thread Robert Haas
On Tue, Sep 6, 2011 at 11:06 PM, Amit Kapila amit.kap...@huawei.com wrote:
 1. With the above, you want to reduce/remove the concurrency issue between
 GetSnapshotData() [used at the beginning of SQL command execution] and
 ProcArrayEndTransaction() [used at end of transaction]. The concurrency
 issue is mainly ProcArrayLock, which is taken by GetSnapshotData() in
 shared mode and by ProcArrayEndTransaction() in exclusive mode.
 There may be other instances of a similar thing, but this is the main thing
 you want to resolve.

Yep.

 2. You want to resolve it by using a ring buffer such that readers don't
 need to take any lock.

Yep.  Actually, they're still going to need some spinlocks at least in
the first go round, to protect the pointers.  I'm hoping those can
eventually be eliminated on machines with 8-byte atomic reads using
appropriate memory barrier primitives.

 1. Two writers: Won't two different sessions that try to commit at the
 same time get the same write pointer?
    I assume it will be protected, as indicated in one of your replies, if
 I understood correctly?

Yes, commits have to be serialized.  No way around that.  The best
we'll ever be able to do is shorten the critical section.

 2. One reader, one writer: It might be the case that somebody has written
 a new snapshot and advanced the stop pointer, and at that point in time a
 reader came and read between the start pointer and the stop pointer. Now
 the reader will see the following:
   snapshot, a few XIDs, snapshot

    So will it handle this situation such that it will only read the latest
 snapshot?

In my prototype implementation that can't happen because the start and
stop pointers are protected by a single spinlock and are moved
simultaneously.  But I think on machines with 8-byte atomic writes we
can get rid of that and just move the stop pointer first and then the
start pointer.  If you get more than one snapshot in the middle you
just ignore the first part of the data you read and start with the
beginning of the last snapshot.

 3. How will you detect an overwrite?

If the write pointer is greater than the start pointer by more than
the ring size, you've wrapped.
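
Putting those two answers together, the reader side might look like the
following sketch (hypothetical names and layout: start/stop packed into one
64-bit word so they can be read atomically, positions kept as monotonically
increasing counters so the wrap test is simple modular arithmetic; real
code would also need memory barriers between the loads and the copy):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define RING_SIZE 6500          /* entries, roughly 64 * MaxBackends */

    typedef uint32_t TransactionId;

    typedef struct
    {
        _Atomic uint64_t bounds;    /* start in high 32 bits, stop in low 32 */
        _Atomic uint32_t write;     /* ever-advancing write position */
        TransactionId ring[RING_SIZE];
    } SnapshotRing;

    /* Copy the current summary out of the ring.  Returns false if writers
     * may have lapped us mid-copy, in which case the caller must retry. */
    static bool
    read_summary(SnapshotRing *r, TransactionId *buf, uint32_t *len)
    {
        uint64_t b = atomic_load(&r->bounds);
        uint32_t start = (uint32_t) (b >> 32);
        uint32_t stop = (uint32_t) b;

        for (uint32_t pos = start; pos != stop; pos++)
            buf[pos - start] = r->ring[pos % RING_SIZE];
        *len = stop - start;

        /* The wrap test described above: what we copied is only still
         * valid if the write pointer hasn't advanced past 'start' by more
         * than the ring size. */
        return (uint32_t) (atomic_load(&r->write) - start) <= RING_SIZE;
    }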

 4. Won't it affect things if we don't update xmin every time and just note
 the committed XIDs? The reason I am asking is that xmin is used in the
 tuple visibility check,
    so with the new idea, in some cases instead of just returning at the
 beginning by checking xmin, it has to go through the committed XID list.
    I understand that there may be few such cases, or the improvement from
 your idea may supersede this minimal effect. However, some cases can
 regress.

The snapshot xmin has to be up to date.  I'm not planning to break
that because it would be wrong.

RecentGlobalXmin doesn't need to be completely up to date, and in fact
recomputing it on every snapshot becomes prohibitively expensive with
this approach.  I'm still struggling with the best way to handle that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots redux

2011-09-07 Thread Amit Kapila
I wanted to clarify my understanding and have some doubts.

What I'm thinking about instead is using a ring buffer with three pointers:
a start pointer, a stop pointer, and a write pointer.  When a transaction
ends, we advance the write pointer, write the XIDs or a whole new snapshot
into the buffer, and then advance the stop pointer.  If we wrote a whole
new snapshot, we advance the start pointer to the beginning of the data we
just wrote.  Someone who wants to take a snapshot must read the data
between the start and stop pointers, and must then check that the write
pointer hasn't advanced so far in the meantime that the data they read
might have been overwritten before they finished reading it.  Obviously,
that's a little risky, since we'll have to do the whole thing over if a
wraparound occurs, but if the ring buffer is large enough it shouldn't
happen very often.
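
For concreteness, the commit path of that scheme might look like the sketch
below, reusing the hypothetical SnapshotRing layout from the reader sketch
above (writers are serialized by a lock not shown here, and real code would
need memory barriers between the three steps):

    /* Called at transaction end, with the serializing commit lock held.
     * Publishes either just the ended XIDs or a whole fresh snapshot. */
    static void
    publish_commit(SnapshotRing *r,
                   const TransactionId *xids, uint32_t n,
                   bool whole_snapshot)
    {
        uint64_t b = atomic_load(&r->bounds);
        uint32_t start = (uint32_t) (b >> 32);
        uint32_t stop = (uint32_t) b;

        /* 1. Advance the write pointer first, so a concurrent reader can
         *    tell that the region it is copying may be overwritten. */
        atomic_store(&r->write, stop + n);

        /* 2. Write the XIDs (or the whole snapshot) into the buffer. */
        for (uint32_t i = 0; i < n; i++)
            r->ring[(stop + i) % RING_SIZE] = xids[i];

        /* 3. Advance the stop pointer; if a whole snapshot was written,
         *    move the start pointer to the beginning of what we wrote. */
        if (whole_snapshot)
            start = stop;
        atomic_store(&r->bounds, ((uint64_t) start << 32) | (uint32_t) (stop + n));
    }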
 
Clarification
--
1. With the above, you want to reduce/remove the concurrency issue between
GetSnapshotData() [used at the beginning of SQL command execution] and
ProcArrayEndTransaction() [used at end of transaction]. The concurrency
issue is mainly ProcArrayLock, which is taken by GetSnapshotData() in shared
mode and by ProcArrayEndTransaction() in exclusive mode.
There may be other instances of a similar thing, but this is the main thing
you want to resolve.
 
2. You want to resolve it by using a ring buffer such that readers don't
need to take any lock.
 
Is my above understanding correct?
 
Doubts


1. Two writers: Won't two different sessions that try to commit at the same
time get the same write pointer?
    I assume it will be protected, as indicated in one of your replies, if I
understood correctly?
 
2. One reader, one writer: It might be the case that somebody has written a
new snapshot and advanced the stop pointer, and at that point in time a
reader came and read between the start pointer and the stop pointer. Now the
reader will see the following:
   snapshot, a few XIDs, snapshot

    So will it handle this situation such that it will only read the latest
snapshot?
 
3. How will you detect an overwrite?
 
4. Won't it affect things if we don't update xmin every time and just note
the committed XIDs? The reason I am asking is that xmin is used in the tuple
visibility check,
    so with the new idea, in some cases instead of just returning at the
beginning by checking xmin, it has to go through the committed XID list.
    I understand that there may be few such cases, or the improvement from
your idea may supersede this minimal effect. However, some cases can
regress.
 
 
-- 
With Regards,
Amit Kapila.





-Original Message-
From: pgsql-hackers-ow...@postgresql.org
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
Sent: Sunday, August 28, 2011 7:17 AM
To: Gokulakannan Somasundaram
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] cheaper snapshots redux

On Sat, Aug 27, 2011 at 1:38 AM, Gokulakannan Somasundaram
gokul...@gmail.com wrote:
 First, I respectfully disagree with you on the point of 80MB. I would say
 that it's very rare that a small system (with 1 GB RAM) would have a
 long-running transaction sitting idle while 10 million transactions are
 sitting idle. Should an optimization be left out, for the sake of a very
 small system, when trying to achieve high enterprise workloads?

With the design where you track commit-visibility sequence numbers
instead of snapshots, you wouldn't need 10 million transactions that
were all still running.  You would just need a snapshot that had been
sitting around while 10 million transactions completed meanwhile.

That having been said, I don't necessarily think that design is
doomed.  I just think it's going to be trickier to get working than
the design I'm now hacking on, and a bigger change from what we do
now.  If this doesn't pan out, I might try that one, or something
else.

 Second, if we make use of memory-mapped files, why should we think that
 all 80MB of data will always reside in memory? Won't it get paged out by
 the operating system when it is in need of memory? Or do you have some
 specific OS in mind?

No, I don't think it will all be in memory - but that's part of the
performance calculation.  If you need to check on the status of an XID
and find that you need to read a page of data in from disk, that's
going to be many orders of magnitude slower than anything we do with a
snapshot now.  Now, if you gain enough elsewhere, it could still

Re: [HACKERS] cheaper snapshots redux

2011-08-28 Thread Gokulakannan Somasundaram
 No, I don't think it will all be in memory - but that's part of the
 performance calculation.  If you need to check on the status of an XID
 and find that you need to read a page of data in from disk, that's
 going to be many orders of magnitude slower than anything we do with a
 snapshot now.  Now, if you gain enough elsewhere, it could still be a
 win, but I'm not going to just assume that.

 I was just suggesting this because memory costs have come down a lot (as
you may know) and people can afford to buy more memory in enterprise
scenarios. We may not need to worry about MBs of memory, especially with
cloud computing being widely adopted, once we get scalability.

Gokul.


Re: [HACKERS] cheaper snapshots redux

2011-08-28 Thread Robert Haas
On Sun, Aug 28, 2011 at 4:33 AM, Gokulakannan Somasundaram
gokul...@gmail.com wrote:
 No, I don't think it will all be in memory - but that's part of the
 performance calculation.  If you need to check on the status of an XID
 and find that you need to read a page of data in from disk, that's
 going to be many orders of magnitude slower than anything we do with a
 snapshot now.  Now, if you gain enough elsewhere, it could still be a
 win, but I'm not going to just assume that.

 I was just suggesting this because memory costs have come down a lot (as
 you may know) and people can afford to buy more memory in enterprise
 scenarios. We may not need to worry about MBs of memory, especially with
 cloud computing being widely adopted, once we get scalability.

The proof of the pudding is in the eating, so let me finish coding up
this approach and see how it works.  Then we can decide where to go
next...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots redux

2011-08-27 Thread Robert Haas
On Sat, Aug 27, 2011 at 1:38 AM, Gokulakannan Somasundaram
gokul...@gmail.com wrote:
 First, I respectfully disagree with you on the point of 80MB. I would say
 that it's very rare that a small system (with 1 GB RAM) would have a
 long-running transaction sitting idle while 10 million transactions are
 sitting idle. Should an optimization be left out, for the sake of a very
 small system, when trying to achieve high enterprise workloads?

With the design where you track commit-visibility sequence numbers
instead of snapshots, you wouldn't need 10 million transactions that
were all still running.  You would just need a snapshot that had been
sitting around while 10 million transactions completed meanwhile.

That having been said, I don't necessarily think that design is
doomed.  I just think it's going to be trickier to get working than
the design I'm now hacking on, and a bigger change from what we do
now.  If this doesn't pan out, I might try that one, or something
else.

 Second, if we make use of memory-mapped files, why should we think that all
 80MB of data will always reside in memory? Won't it get paged out by the
 operating system when it is in need of memory? Or do you have some specific
 OS in mind?

No, I don't think it will all be in memory - but that's part of the
performance calculation.  If you need to check on the status of an XID
and find that you need to read a page of data in from disk, that's
going to be many orders of magnitude slower than anything we do with a
snapshot now.  Now, if you gain enough elsewhere, it could still be a
win, but I'm not going to just assume that.

As I play with this, I'm coming around to the conclusion that, in
point of fact, the thing that's hard about snapshots has a great deal
more to do with memory than it does with CPU time.  Sure, using the
snapshot has to be cheap.  But it already IS cheap.  We don't need to
fix that problem; we just need to not break it.  What's not cheap is
constructing the snapshot - principally because of ProcArrayLock, and
secondarily because we're grovelling through fairly large amounts of
shared memory to get all the XIDs we need.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots redux

2011-08-26 Thread Robert Haas
On Thu, Aug 25, 2011 at 6:24 PM, Jim Nasby j...@nasby.net wrote:
 On Aug 25, 2011, at 8:24 AM, Robert Haas wrote:
 My hope (and it might turn out that I'm an optimist) is that even with
 a reasonably small buffer it will be very rare for a backend to
 experience a wraparound condition.  For example, consider a buffer
 with ~6500 entries, approximately 64 * MaxBackends, the approximate
 size of the current subxip arrays taken in aggregate.  I hypothesize
 that a typical snapshot on a running system is going to be very small
 - a handful of XIDs at most - because, on the average, transactions
 are going to commit in *approximately* increasing XID order and, if
 you take the regression tests as representative of a real workload,
 only a small fraction of transactions will have more than one XID.  So

 BTW, there's a way to actually gather some data on this by using PgQ (part of 
 Skytools and used by Londiste). PgQ works by creating ticks at regular 
 intervals, where a tick is basically just a snapshot of committed XIDs. 
 Presumably Slony does something similar.

 I can provide you with sample data from our production systems if you're 
 interested.

Yeah, that would be great.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots redux

2011-08-26 Thread Robert Haas
On Thu, Aug 25, 2011 at 6:29 PM, Jim Nasby j...@nasby.net wrote:
 Actually, I wasn't thinking about the system dynamically sizing shared memory 
 on its own... I was only thinking of providing the ability for a user to 
 change something like shared_buffers and allow that change to take effect 
 with a SIGHUP instead of requiring a full restart.

I agree.  That would be awesome.  Sadly, I don't have time to work on it.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots redux

2011-08-26 Thread Gokulakannan Somasundaram
On Tue, Aug 23, 2011 at 5:25 AM, Robert Haas robertmh...@gmail.com wrote:

 I've been giving this quite a bit more thought, and have decided to
 abandon the scheme described above, at least for now.  It has the
 advantage of avoiding virtually all locking, but it's extremely
 inefficient in its use of memory in the presence of long-running
 transactions.  For example, if there's an open transaction that's been
 sitting around for 10 million transactions or so and has an XID
 assigned, any new snapshot is going to need to probe into the big
 array for any XID in that range.  At 8 bytes per entry, that means
 we're randomly accessing about ~80MB of memory-mapped data.  That
 seems problematic both in terms of blowing out the cache and (on small
 machines) possibly even blowing out RAM.  Nor is that the worst case
 scenario: a transaction could sit open for 100 million transactions.

 First, I respectfully disagree with you on the point of 80MB. I would say
that it's very rare that a small system (with 1 GB RAM) would have a
long-running transaction sitting idle while 10 million transactions are
sitting idle. Should an optimization be left out, for the sake of a very
small system, when trying to achieve high enterprise workloads?

Second, if we make use of memory-mapped files, why should we think that all
80MB of data will always reside in memory? Won't it get paged out by the
operating system when it is in need of memory? Or do you have some specific
OS in mind?

Thanks,
Gokul.


Re: [HACKERS] cheaper snapshots redux

2011-08-25 Thread Robert Haas
On Thu, Aug 25, 2011 at 1:55 AM, Markus Wanner mar...@bluegap.ch wrote:
 One difference with snapshots is that only the latest snapshot is of
 any interest.

 Theoretically, yes.  But as far as I understood, you proposed the
 backends copy that snapshot to local memory.  And copying takes some
 amount of time, possibly being interrupted by other backends which add
 newer snapshots...  Or do you envision the copying to restart whenever a
 new snapshot arrives?

My hope (and it might turn out that I'm an optimist) is that even with
a reasonably small buffer it will be very rare for a backend to
experience a wraparound condition.  For example, consider a buffer
with ~6500 entries, approximately 64 * MaxBackends, the approximate
size of the current subxip arrays taken in aggregate.  I hypothesize
that a typical snapshot on a running system is going to be very small
- a handful of XIDs at most - because, on the average, transactions
are going to commit in *approximately* increasing XID order and, if
you take the regression tests as representative of a real workload,
only a small fraction of transactions will have more than one XID.  So
it seems believable to think that the typical snapshot on a machine
with max_connections=100 might only be ~10 XIDs, even if none of the
backends are read-only.  So the backend taking a snapshot only needs
to be able to copy  ~64 bytes of information from the ring buffer
before other backends write ~27k of data into that buffer, likely
requiring hundreds of other commits.  That seems vanishingly unlikely;
memcpy() is very fast.  If it does happen, you can recover by
retrying, but it should be a once-in-a-blue-moon kind of thing.  I
hope.
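
(To spell out the arithmetic: ~10 XIDs at 4 bytes each, plus a little
framing, is on the order of 64 bytes, while overwriting all ~6500 4-byte
entries means writing 6500 x 4 = 26,000 bytes - the ~27k above - which at a
handful of XIDs per commit indeed takes hundreds of commits.)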

Now, as the size of the snapshot gets bigger, things will eventually
become less good.  For example if you had a snapshot with 6000 XIDs in
it then every commit would need to write over the previous snapshot
and things would quickly deteriorate.  But you can cope with that
situation using the same mechanism we already use to handle big
snapshots: toss out all the subtransaction IDs, mark the snapshot as
overflowed, and just keep the toplevel XIDs.  Now you've got at most
~100 XIDs to worry about, so you're back in the safety zone.  That's
not ideal in the sense that you will cause more pg_subtrans lookups,
but that's the price you pay for having a gazillion subtransactions
floating around, and any system is going to have to fall back on some
sort of mitigation strategy at some point.  There's no useful limit on
the number of subxids a transaction can have, so unless you're
prepared to throw an unbounded amount of memory at the problem you're
going to eventually have to punt.
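
In sketch form, that mitigation might look like this (hypothetical names;
today's analogue is the suboverflowed handling keyed off
PGPROC_MAX_CACHED_SUBXIDS):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    #define XIP_KEEP_LIMIT 100      /* roughly "at most ~100 toplevel XIDs" */

    typedef struct
    {
        TransactionId xip[6500];
        int           xcnt;
        bool          overflowed;   /* subxids dropped: consult pg_subtrans */
    } BigSnapshot;

    /* Once xip[] grows too large, throw out subtransaction XIDs, keep only
     * toplevel ones, and mark the snapshot overflowed so visibility checks
     * fall back to pg_subtrans lookups for anything not listed. */
    static void
    compact_snapshot(BigSnapshot *snap, bool (*is_toplevel)(TransactionId))
    {
        int kept = 0;

        if (snap->xcnt <= XIP_KEEP_LIMIT)
            return;
        for (int i = 0; i < snap->xcnt; i++)
            if (is_toplevel(snap->xip[i]))
                snap->xip[kept++] = snap->xip[i];
        snap->xcnt = kept;
        snap->overflowed = true;
    }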

It seems to me that the problem case is when you are just on the edge.
 Say you have 1400 XIDs in the snapshot.  If you compact the snapshot
down to toplevel XIDs, most of those will go away and you won't have
to worry about wraparound - but you will pay a performance penalty in
pg_subtrans lookups.  On the other hand, if you don't compact the
snapshot, it's not that hard to imagine a wraparound occurring - four
snapshot rewrites could wrap the buffer.  You would still hope that
memcpy() could finish in time, but if you're rewriting 1400 XIDs with
any regularity, it might not take that many commits to throw a spanner
into the works.  If the system is badly overloaded and the backend
trying to take a snapshot gets descheduled for a long time at just the
wrong moment, it doesn't seem hard to imagine a wraparound happening.

Now, it's not hard to recover from a wraparound.  In fact, we can
pretty easily guarantee that any given attempt to take a snapshot will
suffer a wraparound at most once.  The writers (who are committing)
have to be serialized anyway, so anyone who suffers a wraparound can
just grab the same lock in shared mode and retry its snapshot.  Now
concurrency decreases significantly, because no one else is allowed to
commit until that guy has got his snapshot, but right now that's true
*every time* someone wants to take a snapshot, so falling back to that
strategy occasionally doesn't seem prohibitively bad.  However, you
don't want it to happen very often, because even leaving aside the
concurrency hit, it's double work: you have to try to take a snapshot,
realize you've had a wraparound, and then retry.   It seems pretty
clear that with a big enough ring buffer the wraparound problem will
become so infrequent as to be not worth worrying about.  I'm
theorizing that even with a quite small ring buffer the problem will
still be infrequent enough not to worry about, but that might be
optimistic.  I think I'm going to need some kind of test case that
generates very large, frequently changing snapshots.
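
The recovery path just described fits in a few lines; this sketch reuses
read_summary() from the ring-buffer sketch earlier in the thread, with
LWLockAcquire/LWLockRelease standing in for the real lwlock calls and
CommitLock a hypothetical lock that committers hold exclusively:

    /* Take a snapshot: optimistic lock-free read first; on wraparound,
     * take the commit lock in shared mode so no one can commit - and thus
     * no one can overwrite the buffer - while we re-read.  This bounds any
     * snapshot attempt to at most one wraparound. */
    static void
    take_snapshot(SnapshotRing *r, TransactionId *buf, uint32_t *len)
    {
        if (read_summary(r, buf, len))
            return;                         /* common case: no wraparound */

        LWLockAcquire(CommitLock, LW_SHARED);
        (void) read_summary(r, buf, len);   /* cannot fail: commits blocked */
        LWLockRelease(CommitLock);
    }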

Of course even if wraparound turns out not to be a problem there are
other things that could scuttle this whole approach, but I think the
idea has enough potential to be worth testing.  If the whole thing
crashes and burns I hope I'll at least learn enough along the 

Re: [HACKERS] cheaper snapshots redux

2011-08-25 Thread Markus Wanner
Robert,

On 08/25/2011 03:24 PM, Robert Haas wrote:
 My hope (and it might turn out that I'm an optimist) is that even with
 a reasonably small buffer it will be very rare for a backend to
 experience a wraparound condition.

It certainly seems less likely than with the ring-buffer for imessages, yes.

Note, however, that for imessages, I've also had the policy in place
that a backend *must* consume its message before sending any.  And that
I took great care for all receivers to consume their messages as early
as possible.  Nonetheless, I kept incrementing the buffer size (to
multiple megabytes) to make this work.  Maybe I'm overcautious because
of that experience.

 - a handful of XIDs at most - because, on the average, transactions
 are going to commit in *approximately* increasing XID order

This assumption quickly turns false if you happen to have just one
long-running transaction, I think.  Or in general, if transaction
duration varies a lot.

 So the backend taking a snapshot only needs
 to be able to copy  ~64 bytes of information from the ring buffer
 before other backends write ~27k of data into that buffer, likely
 requiring hundreds of other commits.

You said earlier that only the latest snapshot is required.  It takes
only a single commit for such a snapshot to no longer be the latest.

Instead, if you keep around older snapshots for some time - as what your
description here implies - readers are free to copy from those older
snapshots while other backends are able to make progress concurrently
(writers or readers of other snapshots).

However, that either requires keeping track of readers of a certain
snapshot (reference counting) or - as I understand your description -
you simply invalidate all concurrent readers upon wrap-around, or something.

 That seems vanishingly unlikely;

Agreed.

 Now, as the size of the snapshot gets bigger, things will eventually
 become less good.

Also keep configurations with increased max_connections in mind.  With
that, not only do the snapshots get bigger, but more processes have to
share CPU time, on average making memcpy slower for a single process.

 Of course even if wraparound turns out not to be a problem there are
 other things that could scuttle this whole approach, but I think the
 idea has enough potential to be worth testing.  If the whole thing
 crashes and burns I hope I'll at least learn enough along the way to
 design something better...

That's always a good motivation.  In that sense: happy hacking!

Regards

Markus Wanner



Re: [HACKERS] cheaper snapshots redux

2011-08-25 Thread Robert Haas
On Thu, Aug 25, 2011 at 10:19 AM, Markus Wanner mar...@bluegap.ch wrote:
 Note, however, that for imessages, I've also had the policy in place
 that a backend *must* consume its message before sending any.  And that
 I took great care for all receivers to consume their messages as early
 as possible.  None the less, I kept incrementing the buffer size (to
 multiple megabytes) to make this work.  Maybe I'm overcautious because
 of that experience.

What's a typical message size for imessages?

 - a handful of XIDs at most - because, on the average, transactions
 are going to commit in *approximately* increasing XID order

 This assumption quickly turns false, if you happen to have just one
 long-running transaction, I think.  Or in general, if transaction
 duration varies a lot.

Well, one long-running transaction that only has a single XID is not
really a problem: the snapshot is still small.  But one very old
transaction that also happens to have a large number of
subtransactions all of which have XIDs assigned might be a good way to
stress the system.

 So the backend taking a snapshot only needs
 to be able to copy  ~64 bytes of information from the ring buffer
 before other backends write ~27k of data into that buffer, likely
 requiring hundreds of other commits.

 You said earlier, that only the latest snapshot is required.  It takes
 only a single commit for such a snapshot to not be the latest anymore.

 Instead, if you keep around older snapshots for some time - as what your
 description here implies - readers are free to copy from those older
 snapshots while other backends are able to make progress concurrently
 (writers or readers of other snapshots).

 However, that either requires keeping track of readers of a certain
 snapshot (reference counting) or - as I understand your description -
 you simply invalidate all concurrent readers upon wrap-around, or something.

Each reader decides which data he needs to copy from the buffer, and
then copies it, and then checks whether any of it got overwritten
before the copy was completed.  So there's a lively possibility that
the snapshot that was current when the reader began copying it will no
longer be current by the time he finishes copying it, because a commit
has intervened.  That's OK: it just means that, effectively, the
snapshot is taken at the moment the start and stop pointers are read,
and won't take into account any commits that happen later, which is
exactly what a snapshot is supposed to do anyway.

There is a hopefully quite small possibility that by the time the
reader finishes copying it so much new data will have been written to
the buffer that it will have wrapped around and clobbered the portion
the reader was interested in.  That needs to be rare.

 Now, as the size of the snapshot gets bigger, things will eventually
 become less good.

 Also keep configurations with increased max_connections in mind.  With
 that, not only do the snapshots get bigger, but more processes have to
 share CPU time, on average making memcpy slower for a single process.

Right.  I'm imagining making the default buffer size proportional to
max_connections.

 Of course even if wraparound turns out not to be a problem there are
 other things that could scuttle this whole approach, but I think the
 idea has enough potential to be worth testing.  If the whole thing
 crashes and burns I hope I'll at least learn enough along the way to
 design something better...

 That's always a good motivation.  In that sense: happy hacking!

Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots redux

2011-08-25 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Well, one long-running transaction that only has a single XID is not
 really a problem: the snapshot is still small.  But one very old
 transaction that also happens to have a large number of
 subtransactions all of which have XIDs assigned might be a good way to
 stress the system.

That's a good point.  If the ring buffer size creates a constraint on
the maximum number of sub-XIDs per transaction, you're going to need a
fallback path of some sort.

regards, tom lane



Re: [HACKERS] cheaper snapshots redux

2011-08-25 Thread Markus Wanner
Robert,

On 08/25/2011 04:48 PM, Robert Haas wrote:
 What's a typical message size for imessages?

Most message types in Postgres-R are just a couple bytes in size.
Others, especially change sets, can be up to 8k.

However, I think you'll have an easier job guaranteeing that backends
consume their portions of the ring buffer in time.  Plus, wrap-around
isn't that much of a problem in your case.  (I couldn't drop imessages,
but had to let senders wait.)

 Well, one long-running transaction that only has a single XID is not
 really a problem: the snapshot is still small.  But one very old
 transaction that also happens to have a large number of
 subtransactions all of which have XIDs assigned might be a good way to
 stress the system.

Ah, right, that's why it's a list of transactions in progress and not a
list of completed transactions in SnapshotData... good.

 Each reader decides which data he needs to copy from the buffer, and
 then copies it, and then checks whether any of it got overwritten
 before the copy was completed.  So there's a lively possibility that
 the snapshot that was current when the reader began copying it will no
 longer be current by the time he finishes copying it, because a commit
 has intervened.  That's OK: it just means that, effectively, the
 snapshot is taken at the moment the start and stop pointers are read,
 and won't take into account any commits that happen later, which is
 exactly what a snapshot is supposed to do anyway.

Agreed, that makes sense.  Thanks for explaining.

 There is a hopefully quite small possibility that by the time the
 reader finishes copying it so much new data will have been written to
 the buffer that it will have wrapped around and clobbered the portion
 the reader was interested in.  That needs to be rare.

Yeah.

Regards

Markus Wanner



Re: [HACKERS] cheaper snapshots redux

2011-08-25 Thread Markus Wanner
Tom,

On 08/25/2011 04:59 PM, Tom Lane wrote:
 That's a good point.  If the ring buffer size creates a constraint on
 the maximum number of sub-XIDs per transaction, you're going to need a
 fallback path of some sort.

I think Robert envisions the same fallback path we already have:
subxids.overflowed.

Regards

Markus Wanner



Re: [HACKERS] cheaper snapshots redux

2011-08-25 Thread Robert Haas
On Thu, Aug 25, 2011 at 11:15 AM, Markus Wanner mar...@bluegap.ch wrote:
 On 08/25/2011 04:59 PM, Tom Lane wrote:
 That's a good point.  If the ring buffer size creates a constraint on
 the maximum number of sub-XIDs per transaction, you're going to need a
 fallback path of some sort.

 I think Robert envisions the same fallback path we already have:
 subxids.overflowed.

I have a slightly more nuanced idea, but basically yes.  The trouble
is that if you're keeping the snapshot around and updating it (rather
than scanning the ProcArray each time) you need some sort of mechanism
for the snapshot to eventually un-overflow.  Otherwise, the first
overflow leaves you in the soup for the entire lifetime of the
cluster.

What I have in mind is to store the highest subxid that has been
removed from the snapshot, or InvalidTransactionId if we know the
snapshot is complete.  Whenever the highest removed subxid falls
behind xmin, we can reset it to InvalidTransactionId.

It would be sensible for clients to store the exact value of
highest_removed_subxid in their snapshots as well, instead of just a
Boolean flag.  A pg_subtrans lookup is needed only for XIDs which are
greater than xmin and less than or equal to highest_removed_subxid.
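
As a sketch, the client-side test then becomes (names hypothetical,
wraparound-safe comparisons elided):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;
    #define InvalidTransactionId ((TransactionId) 0)

    /* Per the rule above: a pg_subtrans lookup is needed only for XIDs
     * greater than xmin and no greater than the highest subxid removed
     * from the snapshot; everywhere else the snapshot is authoritative. */
    static bool
    needs_subtrans_lookup(TransactionId xid,
                          TransactionId snapshot_xmin,
                          TransactionId highest_removed_subxid)
    {
        return highest_removed_subxid != InvalidTransactionId &&
               xid > snapshot_xmin &&
               xid <= highest_removed_subxid;
    }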

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots redux

2011-08-25 Thread Jim Nasby
On Aug 25, 2011, at 8:24 AM, Robert Haas wrote:
 My hope (and it might turn out that I'm an optimist) is that even with
 a reasonably small buffer it will be very rare for a backend to
 experience a wraparound condition.  For example, consider a buffer
 with ~6500 entries, approximately 64 * MaxBackends, the approximate
 size of the current subxip arrays taken in aggregate.  I hypothesize
 that a typical snapshot on a running system is going to be very small
 - a handful of XIDs at most - because, on the average, transactions
 are going to commit in *approximately* increasing XID order and, if
 you take the regression tests as representative of a real workload,
 only a small fraction of transactions will have more than one XID.  So

BTW, there's a way to actually gather some data on this by using PgQ (part of 
Skytools and used by Londiste). PgQ works by creating ticks at regular 
intervals, where a tick is basically just a snapshot of committed XIDs. 
Presumably Slony does something similar.

I can provide you with sample data from our production systems if you're 
interested.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] cheaper snapshots redux

2011-08-25 Thread Jim Nasby
On Aug 22, 2011, at 6:22 PM, Robert Haas wrote:
 With respect to a general-purpose shared memory allocator, I think
 that there are cases where that would be useful to have, but I don't
 think there are as many of them as many people seem to think.  I
 wouldn't choose to implement this using a general-purpose allocator
 even if we had it, both because it's undesirable to allow this or any
 subsystem to consume an arbitrary amount of memory (nor can it fail...
 especially in the abort path) and because a ring buffer is almost
 certainly faster than a general-purpose allocator.  We have enough
 trouble with palloc overhead already.  That having been said, I do
 think there are cases where it would be nice to have... and it
 wouldn't surprise me if I end up working on something along those
 lines in the next year or so.  It turns out that memory management is
 a major issue in lock-free programming; you can't assume that it's
 safe to recycle an object once the last pointer to it has been removed
 from shared memory - because someone may have fetched the pointer just
 before you removed it and still be using it to examine the object.  An
 allocator with some built-in capabilities for handling such problems
 seems like it might be very useful.

Actually, I wasn't thinking about the system dynamically sizing shared memory 
on its own... I was only thinking of providing the ability for a user to 
change something like shared_buffers and allow that change to take effect with 
a SIGHUP instead of requiring a full restart.

I agree that we'd have to be very careful with allowing the code to start 
changing shared memory size on its own...
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] cheaper snapshots redux

2011-08-24 Thread Markus Wanner
Hello Dimitri,

On 08/23/2011 06:39 PM, Dimitri Fontaine wrote:
 I'm far from familiar with the detailed concepts here, but allow me to
 comment.  I have two open questions:
 
  - is it possible to use a distributed algorithm to produce XIDs,
something like Vector Clocks?
 
Then each backend is able to create a snapshot (well, and XID) on its
own, and any backend is still able to compare its snapshot to any
other snapshot (well, XID)

Creation of snapshots and XID assignment are not as related as you imply
here.  Keep in mind that a read-only transaction has a snapshot, but no
XID.  (Not sure if it's possible for a transaction to have an XID, but
no snapshot.  If it only touches system catalogs with SnapshotNow,
maybe?  Don't think we support that, ATM).

  - is it possible to cache the production of the next snapshots so that
generating an XID only means getting the next in a pre-computed
vector?

The way I look at it, what Robert proposed can be thought of as "cache
the production of the next snapshot", with a bit of a stretch of what a
cache is, perhaps.  I'd rather call it "early snapshot creation", maybe
"look-ahead" something.

ATM backends all scan ProcArray to generate their snapshot.  Instead,
what Robert proposes would - sometimes, somewhat - move that work from
snapshot creation time to commit time.

As Tom points out, the difficulty lies in the question of when it's
worth doing that:  if you have lots of commits in a row, and no
transaction ever uses the (pre-generated) snapshots of the point in time
in between, then those were wasted.  OTOH, if there are just very few
COMMITs spread across lots of writes, the read-only backends will
re-create the same snapshots, over and over again.  Seems wasteful as
well (as GetSnapshotData popping up high on profiles confirms somewhat).

Hope to have cleared up things a bit.

Regards

Markus



Re: [HACKERS] cheaper snapshots redux

2011-08-24 Thread Markus Wanner
Robert, Jim,

thanks for thinking out loud about dynamic allocation of shared memory.
 Very much appreciated.

On 08/23/2011 01:22 AM, Robert Haas wrote:
 With respect to a general-purpose shared memory allocator, I think
 that there are cases where that would be useful to have, but I don't
 think there are as many of them as many people seem to think.  I
 wouldn't choose to implement this using a general-purpose allocator
 even if we had it, both because it's undesirable to allow this or any
 subsystem to consume an arbitrary amount of memory (nor can it fail...
 especially in the abort path) and because a ring buffer is almost
 certainly faster than a general-purpose allocator.

I'm in respectful disagreement regarding the ring-buffer approach and
think that dynamic allocation can actually be more efficient if done
properly, because there don't need to be head and tail pointers, which
might turn into a point of contention.

As a side note: I've been there with imessages.  Those were first
organized as a ring buffer.  The major problem with that approach was
that imessages were consumed with varying delay.  If an imessage was
left there for a longer amount of time, it blocked creation of new
imessages, because the ring buffer cycled around once and its head
arrived back at the unconsumed imessage.

IIUC (which might not be the case) the same issue applies for snapshots.

Regards

Markus Wanner



Re: [HACKERS] cheaper snapshots redux

2011-08-24 Thread Robert Haas
On Wed, Aug 24, 2011 at 4:30 AM, Markus Wanner mar...@bluegap.ch wrote:
 I'm in respectful disagreement regarding the ring-buffer approach and
 think that dynamic allocation can actually be more efficient if done
 properly, because there don't need to be head and tail pointers, which
 might turn into a point of contention.

True; although there are some other complications.  With a
sufficiently sophisticated allocator you can avoid mutex contention
when allocating chunks, but then you have to store a pointer to the
chunk somewhere or other, and that then requires some kind of
synchronization.

 As a side note: I've been there with imessages.  Those were first
 organized as a ring buffer.  The major problem with that approach was
 that imessages were consumed with varying delay.  If an imessage was
 left there for a longer amount of time, it blocked creation of new
 imessages, because the ring buffer cycled around once and its head
 arrived back at the unconsumed imessage.

 IIUC (which might not be the case) the same issue applies for snapshots.

One difference with snapshots is that only the latest snapshot is of
any interest.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots redux

2011-08-24 Thread Markus Wanner
Robert,

On 08/25/2011 04:59 AM, Robert Haas wrote:
 True; although there are some other complications.  With a
 sufficiently sophisticated allocator you can avoid mutex contention
 when allocating chunks, but then you have to store a pointer to the
 chunk somewhere or other, and that then requires some kind of
 synchronization.

Hm.. right.

 One difference with snapshots is that only the latest snapshot is of
 any interest.

Theoretically, yes.  But as far as I understood, you proposed the
backends copy that snapshot to local memory.  And copying takes some
amount of time, possibly being interrupted by other backends which add
newer snapshots...  Or do you envision the copying to restart whenever a
new snapshot arrives?

Regards

Markus



Re: [HACKERS] cheaper snapshots redux

2011-08-23 Thread Simon Riggs
On Mon, Aug 22, 2011 at 10:25 PM, Robert Haas robertmh...@gmail.com wrote:

 I've been giving this quite a bit more thought, and have decided to
 abandon the scheme described above, at least for now.

I liked your goal of O(1) snapshots and think you should go for that.

I didn't realise you were still working on this, and had some thoughts
at the weekend which I recorded just now. Different tack entirely.

 Heikki has made the suggestion a few times (and a few other people
 have since made somewhat similar suggestions in different words) of
 keeping an-up-to-date snapshot in shared memory such that transactions
 that need a snapshot can simply copy it.  I've since noted that in Hot
 Standby mode, that's more or less what the KnownAssignedXids stuff
 already does.  I objected that, first, the overhead of updating the
 snapshot for every commit would be too great, and second, it didn't
 seem to do a whole lot to reduce the size of the critical section, and
 therefore probably wouldn't improve performance that much.  But I'm
 coming around to the view that these might be solvable problems rather
 than reasons to give up on the idea altogether.

Sounds easy enough to just link up KnownAssignedXids and see...

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] cheaper snapshots redux

2011-08-23 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 With respect to the first problem, what I'm imagining is that we not
 do a complete rewrite of the snapshot in shared memory on every
 commit.  Instead, when a transaction ends, we'll decide whether to (a)
 write a new snapshot or (b) just record the XIDs that ended.  If we do
 (b), then any backend that wants a snapshot will need to copy from
 shared memory both the most recently written snapshot and the XIDs
 that have subsequently ended.  From there, it can figure out which
 XIDs are still running.  Of course, if the list of recently-ended XIDs
 gets too long, then taking a snapshot will start to get expensive, so
 we'll need to periodically do (a) instead.  There are other ways that
 this could be done as well; for example, the KnownAssignedXids stuff
 just flags XIDs that should be ignored and then periodically compacts
 away the ignored entries.

I'm a bit concerned that this approach is trying to optimize the heavy
contention situation at the cost of actually making things worse anytime
that you're not bottlenecked by contention for access to this shared
data structure.  In particular, given the above design, then every
reader of the data structure has to duplicate the work of eliminating
subsequently-ended XIDs from the latest stored snapshot.  Maybe that's
relatively cheap, but if you do it N times it's not going to be so cheap
anymore.  In fact, it looks to me like that cost would scale about as
O(N^2) in the number of transactions you allow to elapse before storing
a new snapshot, so you're not going to be able to let very many go by
before you do that.

I don't say this can't be made to work, but I don't want to blow off
performance for single-threaded applications in pursuit of scalability
that will only benefit people running massively parallel applications
on big iron.

regards, tom lane



Re: [HACKERS] cheaper snapshots redux

2011-08-23 Thread Robert Haas
On Tue, Aug 23, 2011 at 12:13 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'm a bit concerned that this approach is trying to optimize the heavy
 contention situation at the cost of actually making things worse anytime
 that you're not bottlenecked by contention for access to this shared
 data structure.  In particular, given the above design, then every
 reader of the data structure has to duplicate the work of eliminating
 subsequently-ended XIDs from the latest stored snapshot.  Maybe that's
 relatively cheap, but if you do it N times it's not going to be so cheap
 anymore.  In fact, it looks to me like that cost would scale about as
 O(N^2) in the number of transactions you allow to elapse before storing
 a new snapshot, so you're not going to be able to let very many go by
 before you do that.

That's certainly a fair concern, and it might even be worse than
O(n^2).  On the other hand, the current approach involves scanning the
entire ProcArray for every snapshot, even if nothing has changed and
90% of the backends are sitting around playing tiddlywinks, so I don't
think I'm giving up something for nothing except perhaps in the case
where there is only one active backend in the entire system.  On the
other hand, you could be entirely correct that the current
implementation wins in the uncontended case.  Without testing it, I
just don't know...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots redux

2011-08-23 Thread Dimitri Fontaine
Robert Haas robertmh...@gmail.com writes:
 I think the real trick is figuring out a design that can improve
 concurrency.

I'm far from familiar with the detailed concepts here, but allow me to
comment.  I have two open questions:

 - is it possible to use a distributed algorithm to produce XIDs,
   something like Vector Clocks?

   Then each backend is able to create a snapshot (well, and XID) on its
   own, and any backend is still able to compare its snapshot to any
   other snapshot (well, XID)

 - is it possible to cache the production of the next snapshots so that
   generating an XID only means getting the next in a pre-computed
   vector?

   My guess by reading the emails here is that we need to add some
   information at snapshot generation time, it's not just about getting
   a 32 bit sequence number.

I'm not sure I'm being that helpful here, but sometimes stating the
obviously impossible ideas allows one to think about some new design, so I
figured I would still try :)

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] cheaper snapshots redux

2011-08-23 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 That's certainly a fair concern, and it might even be worse than
 O(n^2).  On the other hand, the current approach involves scanning the
 entire ProcArray for every snapshot, even if nothing has changed and
 90% of the backends are sitting around playing tiddlywinks, so I don't
 think I'm giving up something for nothing except perhaps in the case
 where there is only one active backend in the entire system.  On the
 other hand, you could be entirely correct that the current
 implementation wins in the uncontended case.  Without testing it, I
 just don't know...

Sure.  Like I said, I don't know that this can't be made to work.
I'm just pointing out that we have to keep an eye on the single-backend
case as well as the many-backends case.

regards, tom lane



[HACKERS] cheaper snapshots redux

2011-08-22 Thread Robert Haas
On Wed, Jul 27, 2011 at 10:51 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wonder whether we could do something involving WAL properties --- the
 current tuple visibility logic was designed before WAL existed, so it's
 not exploiting that resource at all.  I'm imagining that the kernel of a
 snapshot is just a WAL position, ie the end of WAL as of the time you
 take the snapshot (easy to get in O(1) time).  Visibility tests then
 reduce to "did this transaction commit with a WAL record located before
 the specified position?".  You'd need some index datastructure that made
 it reasonably cheap to find out the commit locations of recently
 committed transactions, where "recent" means back to RecentGlobalXmin.
 That seems possibly do-able, though I don't have a concrete design in
 mind.

 [discussion of why I don't think an LSN will work]

 But having said that an LSN can't work, I don't see why we can't just
 use a 64-bit counter.  In fact, the predicate locking code already
 does something much like this, using an SLRU, for serializable
 transactions only.  In more detail, what I'm imagining is an array
 with 4 billion entries, one per XID, probably broken up into files of
 say 16MB each with 2 million entries per file.  Each entry is a 64-bit
 value.  It is 0 if the XID has not yet started, is still running, or
 has aborted.  Otherwise, it is the commit sequence number of the
 transaction.

I've been giving this quite a bit more thought, and have decided to
abandon the scheme described above, at least for now.  It has the
advantage of avoiding virtually all locking, but it's extremely
inefficient in its use of memory in the presence of long-running
transactions.  For example, if there's an open transaction that's been
sitting around for 10 million transactions or so and has an XID
assigned, any new snapshot is going to need to probe into the big
array for any XID in that range.  At 8 bytes per entry, that means
we're randomly accessing about ~80MB of memory-mapped data.  That
seems problematic both in terms of blowing out the cache and (on small
machines) possibly even blowing out RAM.  Nor is that the worst case
scenario: a transaction could sit open for 100 million transactions.

Heikki has made the suggestion a few times (and a few other people
have since made somewhat similar suggestions in different words) of
keeping an-up-to-date snapshot in shared memory such that transactions
that need a snapshot can simply copy it.  I've since noted that in Hot
Standby mode, that's more or less what the KnownAssignedXids stuff
already does.  I objected that, first, the overhead of updating the
snapshot for every commit would be too great, and second, it didn't
seem to do a whole lot to reduce the size of the critical section, and
therefore probably wouldn't improve performance that much.  But I'm
coming around to the view that these might be solvable problems rather
than reasons to give up on the idea altogether.

With respect to the first problem, what I'm imagining is that we not
do a complete rewrite of the snapshot in shared memory on every
commit.  Instead, when a transaction ends, we'll decide whether to (a)
write a new snapshot or (b) just record the XIDs that ended.  If we do
(b), then any backend that wants a snapshot will need to copy from
shared memory both the most recently written snapshot and the XIDs
that have subsequently ended.  From there, it can figure out which
XIDs are still running.  Of course, if the list of recently-ended XIDs
gets too long, then taking a snapshot will start to get expensive, so
we'll need to periodically do (a) instead.  There are other ways that
this could be done as well; for example, the KnownAssignedXids stuff
just flags XIDs that should be ignored and then periodically compacts
away the ignored entries.
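
As a rough standalone sketch of option (b), a reader would subtract the recently-ended XIDs from the last stored snapshot (structures invented here; locking, XID wraparound, and the "periodically do (a)" path are omitted):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /*
     * Derive a backend-local snapshot from the most recently written
     * snapshot plus the XIDs that have ended since it was written.  Every
     * stored XID without a matching "ended" entry is still running.
     */
    static size_t
    derive_snapshot(const TransactionId *stored, size_t nstored,
                    const TransactionId *ended, size_t nended,
                    TransactionId *out)
    {
        size_t n = 0;

        for (size_t i = 0; i < nstored; i++)
        {
            bool ended_since = false;

            for (size_t j = 0; j < nended; j++)
            {
                if (stored[i] == ended[j])
                {
                    ended_since = true;
                    break;
                }
            }
            if (!ended_since)
                out[n++] = stored[i];
        }
        return n;
    }

The nested loop is what makes the O(N^2) concern raised elsewhere in the thread concrete: every reader repeats this subtraction until option (a) is next taken.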

I think the real trick is figuring out a design that can improve
concurrency.  If you keep a snapshot in shared memory and periodically
overwrite it in place, I don't think you're going to gain much.
Everyone who wants a snapshot still needs a share-lock and everyone
who wants to commit still needs an exclusive-lock, and while you might
be able to make the critical section a bit shorter, I think it's still
going to be hard to make big gains that way.  What I'm thinking about
instead is using a ring buffer with three pointers: a start pointer, a
stop pointer, and a write pointer.  When a transaction ends, we
advance the write pointer, write the XIDs or a whole new snapshot into
the buffer, and then advance the stop pointer.  If we wrote a whole
new snapshot, we advance the start pointer to the beginning of the
data we just wrote.

Someone who wants to take a snapshot must read the data between the
start and stop pointers, and must then check that the write pointer
hasn't advanced so far in the meantime that the data they read might
have been overwritten before they finished reading it.  Obviously,
that's a little risky, since we'll have to do the whole thing over if
a wraparound occurs, but if the ring buffer is large enough it
shouldn't happen very often.  And a typical snapshot is pretty small
unless massive numbers of subxids are in use, so it seems like it
might not be too bad.  Of course, it's pretty hard to know for sure
without coding it up and testing it.
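
A standalone sketch of that reader protocol (the layout and names are invented, memory barriers are glossed over, and the writer side is omitted):

    #include <stdatomic.h>
    #include <stdint.h>

    #define RING_SLOTS 6592     /* ~64 * MaxBackends, the sizing used in this thread */

    typedef struct SnapshotRing
    {
        _Atomic uint64_t start;   /* first slot of the newest complete snapshot */
        _Atomic uint64_t stop;    /* one past the last valid slot */
        _Atomic uint64_t write;   /* advanced by a writer before it overwrites data */
        uint32_t slots[RING_SLOTS];   /* XIDs; absolute positions, mod RING_SLOTS */
    } SnapshotRing;

    /*
     * Copy the XIDs between start and stop into out[] (sized RING_SLOTS by
     * the caller), then verify that no writer lapped us while we copied.
     * Real code also needs barriers around the copy.  Returns the count.
     */
    static uint64_t
    read_snapshot(SnapshotRing *ring, uint32_t *out)
    {
        for (;;)
        {
            uint64_t start = atomic_load(&ring->start);
            uint64_t stop = atomic_load(&ring->stop);

            if (stop - start > RING_SLOTS)
                continue;                       /* torn start/stop pair: retry */

            for (uint64_t i = start; i < stop; i++)
                out[i - start] = ring->slots[i % RING_SLOTS];

            /* If the write pointer is still within one lap of our start,
             * nothing we copied can have been overwritten. */
            if (atomic_load(&ring->write) <= start + RING_SLOTS)
                return stop - start;
            /* wraparound occurred mid-copy: do the whole thing over */
        }
    }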

Re: [HACKERS] cheaper snapshots redux

2011-08-22 Thread Jim Nasby
On Aug 22, 2011, at 4:25 PM, Robert Haas wrote:
 What I'm thinking about
 instead is using a ring buffer with three pointers: a start pointer, a
 stop pointer, and a write pointer.  When a transaction ends, we
 advance the write pointer, write the XIDs or a whole new snapshot into
 the buffer, and then advance the stop pointer.  If we wrote a whole
 new snapshot, we advance the start pointer to the beginning of the
 data we just wrote.
 
 Someone who wants to take a snapshot must read the data between the
 start and stop pointers, and must then check that the write pointer
 hasn't advanced so far in the meantime that the data they read might
 have been overwritten before they finished reading it.  Obviously,
 that's a little risky, since we'll have to do the whole thing over if
 a wraparound occurs, but if the ring buffer is large enough it
 shouldn't happen very often.  And a typical snapshot is pretty small
 unless massive numbers of subxids are in use, so it seems like it
 might not be too bad.  Of course, it's pretty hard to know for sure
 without coding it up and testing it.

Something that would be really nice to fix is our reliance on a fixed size of 
shared memory, and I'm wondering if this could be an opportunity to start in a 
new direction. My thought is that we could maintain two distinct shared memory 
snapshots and alternate between them. That would allow us to actually resize 
them as needed. We would still need something like what you suggest to allow 
for adding to the list without locking, but with this scheme we wouldn't need 
to worry about extra locking when taking a snapshot since we'd be doing that in 
a new segment that no one is using yet.

The downside is such a scheme does add non-trivial complexity on top of what 
you proposed. I suspect it would be much better if we had a separate mechanism 
for dealing with shared memory requirements (shalloc?). But if it's just not 
practical to make a generic shared memory manager it would be good to start 
thinking about ways we can work around fixed shared memory size issues.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] cheaper snapshots redux

2011-08-22 Thread Robert Haas
On Mon, Aug 22, 2011 at 6:45 PM, Jim Nasby j...@nasby.net wrote:
 Something that would be really nice to fix is our reliance on a fixed size of 
 shared memory, and I'm wondering if this could be an opportunity to start in 
 a new direction. My thought is that we could maintain two distinct shared 
 memory snapshots and alternate between them. That would allow us to actually 
 resize them as needed. We would still need something like what you suggest to 
 allow for adding to the list without locking, but with this scheme we 
 wouldn't need to worry about extra locking when taking a snapshot since we'd 
 be doing that in a new segment that no one is using yet.

 The downside is such a scheme does add non-trivial complexity on top of what 
 you proposed. I suspect it would be much better if we had a separate 
 mechanism for dealing with shared memory requirements (shalloc?). But if it's 
 just not practical to make a generic shared memory manager it would be good 
 to start thinking about ways we can work around fixed shared memory size 
 issues.

Well, the system I'm proposing is actually BETTER than having two
distinct shared memory snapshots.  For example, right now we cache up
to 64 subxids per backend.  I'm imagining that going away and using
that memory for the ring buffer.  Out of the box, that would imply a
ring buffer of 64 * 103 = 6592 slots.  If the average snapshot lists
100 XIDs, you could rewrite the snapshot dozens of times before
the buffer wraps around, which is obviously a lot more than two.  Even
if subtransactions are being heavily used and each snapshot lists 1000
XIDs, you still have enough space to rewrite the snapshot several
times over before wraparound occurs.  Of course, at some point the
snapshot gets too big and you have to switch to retaining only the
toplevel XIDs, which is more or less the equivalent of what happens
under the current implementation when any single transaction's subxid
cache overflows.

With respect to a general-purpose shared memory allocator, I think
that there are cases where that would be useful to have, but I don't
think there are as many of them as many people seem to think.  I
wouldn't choose to implement this using a general-purpose allocator
even if we had it, both because it's undesirable to allow this or any
subsystem to consume an arbitrary amount of memory (nor can it fail...
especially in the abort path) and because a ring buffer is almost
certainly faster than a general-purpose allocator.  We have enough
trouble with palloc overhead already.  That having been said, I do
think there are cases where it would be nice to have... and it
wouldn't surprise me if I end up working on something along those
lines in the next year or so.  It turns out that memory management is
a major issue in lock-free programming; you can't assume that it's
safe to recycle an object once the last pointer to it has been removed
from shared memory - because someone may have fetched the pointer just
before you removed it and still be using it to examine the object.  An
allocator with some built-in capabilities for handling such problems
seems like it might be very useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-30 Thread Simon Riggs
On Thu, Jul 28, 2011 at 8:32 PM, Hannu Krosing ha...@2ndquadrant.com wrote:

 Maybe this is why other databases don't offer per backend async commit ?

Oracle has async commit but very few people know about it.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] cheaper snapshots

2011-07-29 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 (4)  We communicate acceptable snapshots to the replica to make
 the order of visibility match the master even when
 that doesn't match the order that transactions returned from
 commit.
 
  I (predictably) like (4) -- even though it's a lot of work
 
 I think that (4), beyond being a lot of work, will also have
 pretty terrible performance.  You're basically talking about
 emitting two WAL records for every commit instead of one.
 
Well, I can think of a great many other ways this could be done,
each with its own trade-offs of various types of overhead against
how close the replica is to current.  At one extreme you could do
what you describe, at the other you could generate a new snapshot on
the replica once every few minutes.
 
Then there are more clever ways; in discussions a few months ago I
suggested that adding two new bit flags to the commit record would
suffice, and I don't remember anyone blowing holes in that idea.  Of
course, that was to achieve serializable behavior on the replica,
based on some assumption that the current hot standby already
supported repeatable read.  We might need another bit or two to
solve the problems with that which have surfaced on this thread.
 
-Kevin



Re: [HACKERS] cheaper snapshots

2011-07-29 Thread Hannu Krosing
On Thu, 2011-07-28 at 20:14 -0400, Robert Haas wrote:
 On Thu, Jul 28, 2011 at 7:54 PM, Ants Aasma ants.aa...@eesti.ee wrote:
  On Thu, Jul 28, 2011 at 11:54 PM, Kevin Grittner
  kevin.gritt...@wicourts.gov wrote:
  (4)  We communicate acceptable snapshots to the replica to make the
  order of visibility match the master even when that
  doesn't match the order that transactions returned from commit.
 
  I wonder if some interpretation of 2 phase commit could make Robert's
  original suggestion implement this.
 
  On the master the commit sequence would look something like:
  1. Insert commit record to the WAL
  2. Wait for replication
  3. Get a commit seq nr and mark XIDs visible
  4. WAL log the seq nr
  5. Return success to client
 
  When replaying:
  * When replaying commit record, do everything but make
   the tx visible.
  * When replaying the commit sequence number
 if there is a gap between last visible commit and current:
   insert the commit sequence nr. to list of waiting commits.
 else:
   mark current and all directly following waiting tx's visible
 
  This would give consistent visibility order on master and slave. Robert
  is right that this would undesirably increase WAL traffic. Delaying this
  traffic would undesirably increase replay lag between master and slave.
  But it seems to me that this could be an optional WAL level on top of
  hot_standby that would only be enabled if consistent visibility on
  slaves is desired.
 
 I think you nailed it.

Agreed, this would keep current semantics on master and same visibility
order on master and slave.
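
A sketch of the replay rule quoted above (structures invented for illustration; overflow of the pending list and the actual make-visible action are only hinted at in comments):

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_PENDING 1024            /* overflow handling omitted in this sketch */

    static uint64_t last_visible_csn;   /* highest contiguously released CSN */
    static uint64_t pending[MAX_PENDING];
    static int npending;

    static bool
    pending_take(uint64_t csn)          /* remove csn if present, report success */
    {
        for (int i = 0; i < npending; i++)
        {
            if (pending[i] == csn)
            {
                pending[i] = pending[--npending];
                return true;
            }
        }
        return false;
    }

    /* Replay of a commit-seq-nr record: release visibility strictly in CSN
     * order, so the standby matches the master. */
    static void
    replay_commit_seqnr(uint64_t csn)
    {
        if (csn != last_visible_csn + 1)
        {
            pending[npending++] = csn;  /* gap: park it until the gap closes */
            return;
        }
        last_visible_csn = csn;         /* mark this tx visible here ... */
        while (pending_take(last_visible_csn + 1))
            last_visible_csn++;         /* ... and any directly following ones */
    }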

 An additional point to think about: if we were willing to insist on
 streaming replication, we could send the commit sequence numbers via a
 side channel rather than writing them to WAL, which would be a lot
 cheaper. 

Why do you think that side channel is cheaper than main WAL ?

How would you handle synchronising the two ?

 That might even be a reasonable thing to do, because if
 you're doing log shipping, this is all going to be super-not-real-time
 anyway. 

But perhaps you still may want to preserve visibility order to be able
to do PITR to exact transaction commit, no ?

  OTOH, I know we don't want to make WAL shipping anything less
 than a first class citizen, so maybe not.
 
 At any rate, we may be getting a little sidetracked here from the
 original point of the thread, which was how to make snapshot-taking
 cheaper.  Maybe there's some tie-in to when transactions become
 visible, but I think it's pretty weak.  The existing system could be
 hacked up to avoid making transactions visible out of LSN order, and
 the system I proposed could make them visible either in LSN order or
 do the same thing we do now.  They are basically independent problems,
 AFAICS.

Agreed.


-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] cheaper snapshots

2011-07-29 Thread Robert Haas
On Fri, Jul 29, 2011 at 10:20 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
 An additional point to think about: if we were willing to insist on
 streaming replication, we could send the commit sequence numbers via a
 side channel rather than writing them to WAL, which would be a lot
 cheaper.

 Why do you think that side channel is cheaper than main WAL ?

You don't have to flush it to disk, and you can use some other lock
that isn't as highly contended as WALInsertLock to synchronize it.

 That might even be a reasonable thing to do, because if
 you're doing log shipping, this is all going to be super-not-real-time
 anyway.

 But perhaps you still may want to preserve visibility order to be able
 to do PITR to exact transaction commit, no ?

Maybe.  In practice, I suspect most people won't be willing to pay the
price a feature like this would exact.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-29 Thread Hannu Krosing
On Fri, 2011-07-29 at 10:23 -0400, Robert Haas wrote:
 On Fri, Jul 29, 2011 at 10:20 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
  An additional point to think about: if we were willing to insist on
  streaming replication, we could send the commit sequence numbers via a
  side channel rather than writing them to WAL, which would be a lot
  cheaper.
 
  Why do you think that side channel is cheaper than main WAL ?
 
 You don't have to flush it to disk, 

You can probably write the "I became visible" WAL record without forcing
a flush and still get the same visibility order.

 and you can use some other lock
 that isn't as highly contended as WALInsertLock to synchronize it.

but you will need to synchronise it with WAL replay on slave anyway. It
seems easiest to just insert it in the WAL stream and be done with it. 

  That might even be a reasonable thing to do, because if
  you're doing log shipping, this is all going to be super-not-real-time
  anyway.
 
  But perhaps you still may want to preserve visibility order to be able
  to do PITR to exact transaction commit, no ?
 
 Maybe.  In practice, I suspect most people won't be willing to pay the
 price a feature like this would exact.

Unless we find some really bad problems with different visibility orders
on master and slave(s) you are probably right.

-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Simon Riggs
On Thu, Jul 28, 2011 at 3:51 AM, Robert Haas robertmh...@gmail.com wrote:

 All that having been said, even if I haven't made any severe
 conceptual errors in the above, I'm not sure how well it will work in
 practice.  On the plus side, taking a snapshot becomes O(1) rather
 than O(MaxBackends) - that's good.  On the further plus side, you can
 check both whether an XID has committed and whether it's visible to
 your snapshot in a single, atomic action with no lock - that seems
 really good.  On the minus side, checking an xid against your snapshot
 now has less locality of reference.  And, rolling over into a new
 segment of the array is going to require everyone to map it, and maybe
 cause some disk I/O as a new file gets created.

Sounds like the right set of thoughts to be having.

If you do this, you must cover subtransactions and Hot Standby. Work
in this area takes longer than you think when you take the
complexities into account, as you must.

I think you should take the premise of making snapshots O(1) and look
at all the ways of doing that. If you grab too early at a solution you
may grab the wrong one.

For example, another approach would be to use a shared hash table.
Snapshots are O(1), committing is O(k), using the snapshot is O(logN).
N can be kept small by regularly pruning the hash table. If we crash
we lose the hash table - no matter. (I'm not suggesting this is
better, just a different approach that should be judged across
others).

What I'm not sure about in any of these ideas is how to derive a snapshot xmin.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Florian Pflug
On Jul 28, 2011, at 04:51, Robert Haas wrote:
 One fly in the ointment is that 8-byte
 stores are apparently done as two 4-byte stores on some platforms.
 But if the counter runs backward, I think even that is OK.  If you
 happen to read an 8 byte value as it's being written, you'll get 4
 bytes of the intended value and 4 bytes of zeros.  The value will
 therefore appear to be less than what it should be.  However, if the
 value was in the midst of being written, then it's still in the midst
 of committing, which means that that XID wasn't going to be visible
 anyway.  Accidentally reading a smaller value doesn't change the
 answer.

That only works if the update of the most-significant word is guaranteed
to be visible before the update to the least-significant one. Which
I think you can only enforce if you update the words individually
(and use a fence on e.g. PPC32). Otherwise you're at the mercy of the
compiler.

Otherwise, the following might happen (with a 2-byte value instead of an
8-byte one, and the assumption that 1-byte stores are atomic while 2-byte
ones aren't, just to keep the numbers smaller. The machine is assumed to be
big-endian)

The counter is at 0x0100
Backend 1 decrements, i.e. does
(1)  STORE [counter+1], 0xff
(2)  STORE [counter], 0x00

Backend 2 reads
(1')  LOAD [counter+1]
(2')  LOAD [counter]

If the sequence of events is (1), (1'), (2'), (2), backend 2 will read
0x01ff, which is higher than it should be.

But we could simply use a spin-lock to protect the read on machines where
we don't know for sure that 64-bit reads and write are atomic. That'll
only really hurt on machines with 16+ cores or so, and the number of
architectures which support that isn't that high anyway. If we supported
spinlock-less operation on SPARC, x86-64, PPC64 and maybe Itanium, would we
miss any important one?
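
Restated with today's C11 atomics (which did not exist when this was written), the counter access might look like this sketch; on the platforms listed above the load is a single plain instruction, and elsewhere the compiler supplies a lock-based fallback:

    #include <stdatomic.h>
    #include <stdint.h>

    /* C11 atomic_load is atomic even where the hardware lacks native 8-byte
     * loads; the implementation falls back to a hidden lock there. */
    static _Atomic uint64_t commit_counter = UINT64_MAX;   /* runs backward */

    static uint64_t
    read_commit_counter(void)
    {
        return atomic_load_explicit(&commit_counter, memory_order_acquire);
    }

    static uint64_t
    next_commit_seqno(void)
    {
        /* fetch_sub returns the pre-decrement value; release ordering makes
         * the commit's other stores visible before the counter moves */
        return atomic_fetch_sub_explicit(&commit_counter, 1,
                                         memory_order_release) - 1;
    }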


best regards,
Florian Pflug




Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  I wonder whether we could do something involving WAL properties --- the
  current tuple visibility logic was designed before WAL existed, so it's
  not exploiting that resource at all.  I'm imagining that the kernel of a
  snapshot is just a WAL position, ie the end of WAL as of the time you
  take the snapshot (easy to get in O(1) time).  Visibility tests then
  reduce to "did this transaction commit with a WAL record located before
  the specified position?".

Why not just cache a reference snapshot near the WAL writer and maybe
also save it at some interval in WAL in case you ever need to restore an
old snapshot at some WAL position for things like time travel.

It may be cheaper lock-wise not to update ref. snapshot at each commit,
but to keep latest saved snapshot and a chain of transactions
committed / aborted since. This means that when reading the snapshot you
read the current saved snapshot and then apply the list of commits.

when moving to a new saved snapshot you really generate a new one and
keep the old snapshot + commit chain around for a little while for those
who may still be processing it. Seems like this is something that can be
done with no locking.

 You'd need some index datastructure that made
  it reasonably cheap to find out the commit locations of recently
  committed transactions, where "recent" means back to RecentGlobalXmin.
  That seems possibly do-able, though I don't have a concrete design in
  mind.

snapshot + chain of commits is likely as cheap as it gets, unless you
additionally cache the commits in a tighter data structure.  This is
because you will need them all anyway to compute the difference from the
ref snapshot.


-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 3:46 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Sounds like the right set of thoughts to be having.

Thanks.

 If you do this, you must cover subtransactions and Hot Standby. Work
 in this area takes longer than you think when you take the
 complexities into account, as you must.

Right.  This would replace the KnownAssignedXids stuff (a non-trivial
project, I am sure).

 I think you should take the premise of making snapshots O(1) and look
 at all the ways of doing that. If you grab too early at a solution you
 may grab the wrong one.

Yeah, I'm just brainstorming at this point.  This is, I think, the
best of the ideas I've come up with so far, but it's definitely not
the only approach.

 For example, another approach would be to use a shared hash table.
 Snapshots are O(1), committing is O(k), using the snapshot is O(logN).
 N can be kept small by regularly pruning the hash table. If we crash
 we lose the hash table - no matter. (I'm not suggesting this is
 better, just a different approach that should be judged across
 others).

Sorry, I'm having a hard time understanding what you are describing
here.  What would the keys and values in this hash table be, and what
do k and N refer to here?

  What I'm not sure about in any of these ideas is how to derive a snapshot xmin.

That is a problem.  If we have to scan the ProcArray every time we
take a snapshot just to derive an xmin, we are kind of hosed.  One
thought I had is that we might be able to use a sort of sloppy xmin.
In other words, we keep a cached xmin, and have some heuristic where
we occasionally try to update it.  A snapshot with a too-old xmin
isn't wrong, just possibly slower.  But if xmin is only slightly stale
and xids can be tested relatively quickly, it might not matter very
much.
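
A sketch of such a heuristic (names and the refresh interval are invented; the stub stands in for a real ProcArray scan):

    #include <stdint.h>

    typedef uint32_t TransactionId;

    #define XMIN_REFRESH_INTERVAL 64       /* recompute every N snapshots; tunable */

    /* Stand-in for the real ProcArray scan (the slow path). */
    static TransactionId
    compute_xmin_from_procarray(void)
    {
        return 1000;                        /* dummy value for this sketch */
    }

    static TransactionId cached_xmin = 0;   /* 0 = not yet computed */
    static unsigned snaps_since_refresh = 0;

    /*
     * A stale xmin is never wrong, just conservative: some XIDs that are in
     * fact older than the true xmin get tested the slow way.
     */
    static TransactionId
    get_sloppy_xmin(void)
    {
        if (cached_xmin == 0 || ++snaps_since_refresh >= XMIN_REFRESH_INTERVAL)
        {
            cached_xmin = compute_xmin_from_procarray();
            snaps_since_refresh = 0;
        }
        return cached_xmin;
    }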

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 4:16 AM, Florian Pflug f...@phlo.org wrote:
 On Jul 28, 2011, at 04:51, Robert Haas wrote:
 One fly in the ointment is that 8-byte
 stores are apparently done as two 4-byte stores on some platforms.
 But if the counter runs backward, I think even that is OK.  If you
 happen to read an 8 byte value as it's being written, you'll get 4
 bytes of the intended value and 4 bytes of zeros.  The value will
 therefore appear to be less than what it should be.  However, if the
 value was in the midst of being written, then it's still in the midst
 of committing, which means that that XID wasn't going to be visible
 anyway.  Accidentally reading a smaller value doesn't change the
 answer.

 That only works if the update of the most-significant word is guaranteed
 to be visible before the update to the least-significant one. Which
 I think you can only enforce if you update the words individually
 (and use a fence on e.g. PPC32). Otherwise you're at the mercy of the
 compiler.

 Otherwise, the following might happen (with a 2-byte value instead of an
 8-byte one, and the assumption that 1-byte stores are atomic while 2-byte
 ones aren't, just to keep the numbers smaller. The machine is assumed to be
 big-endian)

 The counter is at 0x0100
 Backend 1 decrements, i.e. does
 (1)  STORE [counter+1], 0xff
 (2)  STORE [counter], 0x00

 Backend 2 reads
 (1')  LOAD [counter+1]
 (2')  LOAD [counter]

 If the sequence of events is (1), (1'), (2'), (2), backend 2 will read
 0x01ff, which is higher than it should be.

You're confusing two different things - I agree that you need a
spinlock around reading the counter, unless 8-byte loads and stores
are atomic.

What I'm saying can be done without a lock is reading the commit-order
value for a given XID.  If that's in the middle of being updated, then
the old value was zero, so the scenario you describe can't occur.
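
Putting the two posts together, the per-XID test might look like this sketch (names invented; the strictly-greater comparison is one consistent convention, not something the thread pins down):

    #include <stdbool.h>
    #include <stdint.h>

    /* Toy stand-in for the mmap'ed commit-order array described upthread:
     * csn_array[xid] is 0 until the transaction commits, then holds its
     * commit sequence number.  The real thing spans ~4 billion entries. */
    static uint64_t csn_array[4096];

    static uint64_t
    csn_for_xid(uint32_t xid)
    {
        /*
         * A torn 8-byte read here can only yield a value smaller than the
         * true CSN (half the intended value, half zeros).  With a counter
         * that runs backward, "smaller" means "committed later", so the XID
         * is correctly judged invisible -- matching the argument above.
         */
        return csn_array[xid % 4096];
    }

    static bool
    xid_visible(uint32_t xid, uint64_t snapshot_csn)
    {
        uint64_t csn = csn_for_xid(xid);

        /* visible iff it committed, and committed before the snapshot --
         * "before" meaning a larger CSN, since the counter counts down */
        return csn != 0 && csn > snapshot_csn;
    }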

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 6:50 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
 On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  I wonder whether we could do something involving WAL properties --- the
  current tuple visibility logic was designed before WAL existed, so it's
  not exploiting that resource at all.  I'm imagining that the kernel of a
  snapshot is just a WAL position, ie the end of WAL as of the time you
  take the snapshot (easy to get in O(1) time).  Visibility tests then
  reduce to "did this transaction commit with a WAL record located before
  the specified position?".

 Why not just cache a reference snapshot near the WAL writer and maybe
 also save it at some interval in WAL in case you ever need to restore an
 old snapshot at some WAL position for things like time travel.

 It may be cheaper lock-wise not to update ref. snapshot at each commit,
 but to keep latest saved snapshot and a chain of transactions
 committed / aborted since. This means that when reading the snapshot you
 read the current saved snapshot and then apply the list of commits.

Yeah, interesting idea.  I thought about that.  You'd need not only
the list of commits but also the list of XIDs that had been published,
since the commits have to be removed from the snapshot and the
newly-published XIDs have to be added to it (in case they commit later
while the snapshot is still in use).

You can imagine doing this with a pair of buffers.  You write a
snapshot into the beginning of the first buffer and then write each
XID that is published or commits into the next slot in the array.
When the buffer is filled up, the next process that wants to publish
an XID or commit scans through the array and constructs a new snapshot
that compacts away all the begin/commit pairs and writes it into the
second buffer, and all new snapshots are taken there.  When that
buffer fills up you flip back to the first one.  Of course, you need
some kind of synchronization to make sure that you don't flip back to
the first buffer while some laggard is still using it to construct a
snapshot that he started taking before you flipped to the second one,
but maybe that could be made light-weight enough not to matter.
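
A sketch of that compaction step (the event encoding is invented for illustration):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    typedef struct XidEvent
    {
        TransactionId xid;
        bool is_commit;     /* false = XID newly published, true = XID committed */
    } XidEvent;

    static bool
    commits_later(TransactionId xid, const XidEvent *ev, size_t from, size_t nev)
    {
        for (size_t j = from; j < nev; j++)
            if (ev[j].is_commit && ev[j].xid == xid)
                return true;
        return false;
    }

    /*
     * Compact "base snapshot + event tail" into a fresh snapshot in the
     * other buffer: keep base XIDs that have not committed, plus published
     * XIDs whose commit has not arrived -- begin/commit pairs cancel out.
     */
    static size_t
    compact_snapshot(const TransactionId *base, size_t nbase,
                     const XidEvent *ev, size_t nev,
                     TransactionId *out)
    {
        size_t n = 0;

        for (size_t i = 0; i < nbase; i++)
            if (!commits_later(base[i], ev, 0, nev))
                out[n++] = base[i];
        for (size_t i = 0; i < nev; i++)
            if (!ev[i].is_commit && !commits_later(ev[i].xid, ev, i + 1, nev))
                out[n++] = ev[i].xid;
        return n;
    }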

I am somewhat concerned that this approach might lead to a lot of
contention over the snapshot buffers.  In particular, the fact that
you have to touch shared cache lines both to advertise a new XID and
when it gets committed seems less than ideal.  One thing that's kind
of interesting about the commit sequence number approach is that -
as far as I can tell - it doesn't require new XIDs to be advertised
anywhere at all.  You don't have to worry about overflowing the
subxids[] array because it goes away altogether.  The commit sequence
number itself is going to be a contention hotspot, but at least it's
small and fixed-size.

Another concern I have with this approach is - how large do you make
the buffers?  If you make them too small, then you're going to have to
regenerate the snapshot frequently, which will lead to the same sort
of lock contention we have today - no one can commit while the
snapshot is being regenerated.  On the other hand, if you make them
too big, then deriving a snapshot gets slow.  Maybe there's some way
to make it work, but I'm afraid it might end up being yet another
arcane thing the tuning of which will become a black art among
hackers...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 09:38 -0400, Robert Haas wrote:
 On Thu, Jul 28, 2011 at 6:50 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
  On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane t...@sss.pgh.pa.us wrote:
   I wonder whether we could do something involving WAL properties --- the
   current tuple visibility logic was designed before WAL existed, so it's
   not exploiting that resource at all.  I'm imagining that the kernel of a
   snapshot is just a WAL position, ie the end of WAL as of the time you
   take the snapshot (easy to get in O(1) time).  Visibility tests then
   reduce to "did this transaction commit with a WAL record located before
   the specified position?".
 
  Why not just cache a reference snapshot near the WAL writer and maybe
  also save it at some interval in WAL in case you ever need to restore an
  old snapshot at some WAL position for things like time travel.
 
  It may be cheaper lock-wise not to update ref. snapshot at each commit,
  but to keep latest saved snapshot and a chain of transactions
  committed / aborted since. This means that when reading the snapshot you
  read the current saved snapshot and then apply the list of commits.
 
 Yeah, interesting idea.  I thought about that.  You'd need not only
 the list of commits but also the list of XIDs that had been published,
 since the commits have to be removed from the snapshot and the
 newly-published XIDs have to be added to it (in case they commit later
 while the snapshot is still in use).
 
 You can imagine doing this with a pair of buffers.  You write a
 snapshot into the beginning of the first buffer and then write each
 XID that is published or commits into the next slot in the array.
 When the buffer is filled up, the next process that wants to publish
 an XID or commit scans through the array and constructs a new snapshot
 that compacts away all the begin/commit pairs and writes it into the
 second buffer, and all new snapshots are taken there.  When that
 buffer fills up you flip back to the first one.  Of course, you need
 some kind of synchronization to make sure that you don't flip back to
 the first buffer while some laggard is still using it to construct a
 snapshot that he started taking before you flipped to the second one,
 but maybe that could be made light-weight enough not to matter.
 
 I am somewhat concerned that this approach might lead to a lot of
 contention over the snapshot buffers.  

My hope was that this contention would be the same as simply writing
the WAL buffers currently, and thus largely hidden by the current WAL
writing sync mechanisms.

It really covers just the part which writes commit records to WAL, as
non-commit WAL records don't participate in snapshot updates.

Writing WAL is already a single point which needs locks or other kinds of
synchronization. This will stay with us at least until we start
supporting multiple WAL streams, and even then we will need some
synchronisation between those.

 In particular, the fact that
 you have to touch shared cache lines both to advertise a new XID and
 when it gets committed seems less than ideal. 

Every commit record writer should do this as part of writing the commit
record.  And since mostly you still want the latest snapshot, why not just
update the snapshot as part of the commit/abort?

Do we need the ability for fast recent snapshots at all?

  One thing that's kind
 of interesting about the commit sequence number approach is that -
 as far as I can tell - it doesn't require new XIDs to be advertised
 anywhere at all.  You don't have to worry about overflowing the
 subxids[] array because it goes away altogether.  The commit sequence
 number itself is going to be a contention hotspot, but at least it's
 small and fixed-size.
 
 Another concern I have with this approach is - how large do you make
 the buffers?  If you make them too small, then you're going to have to
 regenerate the snapshot frequently, which will lead to the same sort
 of lock contention we have today - no one can commit while the
 snapshot is being regenerated.  On the other hand, if you make them
 too big, then deriving a snapshot gets slow.  Maybe there's some way
 to make it work, but I'm afraid it might end up being yet another
 arcane thing the tuning of which will become a black art among
 hackers...
 

-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 10:17 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
 My hope was that this contention would be the same as simply writing
 the WAL buffers currently, and thus largely hidden by the current WAL
 writing sync mechanisms.

 It really covers just the part which writes commit records to WAL, as
 non-commit WAL records don't participate in snapshot updates.

I'm confused by this, because I don't think any of this can be done
when we insert the commit record into the WAL stream.  It has to be
done later, at the time we currently remove ourselves from the
ProcArray.  Those things need not happen in the same order, as I noted
in my original post.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Jul 28, 2011 at 10:17 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
 My hope was that this contention would be the same as simply writing
 the WAL buffers currently, and thus largely hidden by the current WAL
 writing sync mechanisms.

 It really covers just the part which writes commit records to WAL, as
 non-commit WAL records don't participate in snapshot updates.

 I'm confused by this, because I don't think any of this can be done
 when we insert the commit record into the WAL stream.  It has to be
 done later, at the time we currently remove ourselves from the
 ProcArray.  Those things need not happen in the same order, as I noted
 in my original post.

But should we rethink that?  Your point that hot standby transactions on
a slave could see snapshots that were impossible on the parent was
disturbing.  Should we look for a way to tie "transaction becomes
visible" to its creation of a commit WAL record?  I think the fact that
they are not an indivisible operation is an implementation artifact, and
not a particularly nice one.

regards, tom lane



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 10:23 -0400, Robert Haas wrote:
 On Thu, Jul 28, 2011 at 10:17 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
  My hope was that this contention would be the same as simply writing
  the WAL buffers currently, and thus largely hidden by the current WAL
  writing sync mechanisms.
 
  It really covers just the part which writes commit records to WAL, as
  non-commit WAL records don't participate in snapshot updates.
 
 I'm confused by this, because I don't think any of this can be done
 when we insert the commit record into the WAL stream.  It has to be
 done later, at the time we currently remove ourselves from the
 ProcArray.  Those things need not happen in the same order, as I noted
 in my original post.

The update to the stored snapshot needs to happen at the moment when the WAL
record is considered to be on stable storage, so the "current
snapshot" update presumably can be done by the same process which forces
it to stable storage, with the same contention pattern that applies to
writing WAL records, no?

If the problem is with a backend which requested an async commit, then
it is free to apply its additional local commit changes from its own
memory if the global latest snapshot disagrees with it.

-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Tom Lane
Hannu Krosing ha...@2ndquadrant.com writes:
 On Thu, 2011-07-28 at 10:23 -0400, Robert Haas wrote:
 I'm confused by this, because I don't think any of this can be done
 when we insert the commit record into the WAL stream.

 The update to the stored snapshot needs to happen at the moment when the WAL
 record is considered to be on stable storage, so the "current
 snapshot" update presumably can be done by the same process which forces
 it to stable storage, with the same contention pattern that applies to
 writing WAL records, no?

No.  There is no reason to tie this to fsyncing WAL.  For purposes of
other currently-running transactions, the commit can be considered to
occur at the instant the commit record is inserted into WAL buffers.
If we crash before that makes it to disk, no problem, because nothing
those other transactions did will have made it to disk either.  The
advantage of defining it that way is you don't have weirdly different
behaviors for sync and async transactions.

regards, tom lane



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 10:33 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Thu, Jul 28, 2011 at 10:17 AM, Hannu Krosing ha...@2ndquadrant.com 
 wrote:
 My hope was that this contention would be the same as simply writing
 the WAL buffers currently, and thus largely hidden by the current WAL
 writing sync mechanism.

 It really covers just the part which writes commit records to WAL, as
 non-commit WAL records don't participate in snapshot updates.

 I'm confused by this, because I don't think any of this can be done
 when we insert the commit record into the WAL stream.  It has to be
 done later, at the time we currently remove ourselves from the
 ProcArray.  Those things need not happen in the same order, as I noted
 in my original post.

 But should we rethink that?  Your point that hot standby transactions on
 a slave could see snapshots that were impossible on the parent was
 disturbing.  Should we look for a way to tie "transaction becomes
 visible" to its creation of a commit WAL record?  I think the fact that
 they are not an indivisible operation is an implementation artifact, and
 not a particularly nice one.

Well, I agree with you that it isn't especially nice, but it seems
like a fairly intractable problem.  Currently, the standby has no way
of knowing in what order the transactions became visible on the
master.  Unless we want to allow only SR and not log shipping, the
only way to communicate that information would be to WAL log it.
Aside from the expense, what do we do if XLogInsert() fails, given
that we've already committed?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 10:45 -0400, Tom Lane wrote:
 Hannu Krosing ha...@2ndquadrant.com writes:
  On Thu, 2011-07-28 at 10:23 -0400, Robert Haas wrote:
  I'm confused by this, because I don't think any of this can be done
  when we insert the commit record into the WAL stream.
 
  The update to the stored snapshot needs to happen at the moment when the WAL
  record is considered to be on stable storage, so the "current
  snapshot" update presumably can be done by the same process which forces
  it to stable storage, with the same contention pattern that applies to
  writing WAL records, no?
 
 No.  There is no reason to tie this to fsyncing WAL.  For purposes of
 other currently-running transactions, the commit can be considered to
 occur at the instant the commit record is inserted into WAL buffers.
 If we crash before that makes it to disk, no problem, because nothing
 those other transactions did will have made it to disk either. 

Agreed. Actually figured it out right after pushing send :)

 The
 advantage of defining it that way is you don't have weirdly different
 behaviors for sync and async transactions.

My main point was that we already do synchronization when writing WAL,
so why not piggyback on this to also update the latest snapshot.


-- 
---
Hannu Krosing
PostgreSQL (Infinite) Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 11:10 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
 My main point was that we already do synchronization when writing WAL,
 so why not piggyback on this to also update the latest snapshot.

Well, one problem is that it would break sync rep.

Another problem is that pretty much the last thing I want to do is
push more work under WALInsertLock.  Based on the testing I've done so
far, it seems like WALInsertLock, ProcArrayLock, and CLogControlLock
are the main bottlenecks here.  I'm focusing on ProcArrayLock and
CLogControlLock right now, but I am pretty well convinced that
WALInsertLock is going to be the hardest nut to crack, so putting
anything more under there seems like it's going in the wrong
direction.  IMHO, anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 17:10 +0200, Hannu Krosing wrote:
 On Thu, 2011-07-28 at 10:45 -0400, Tom Lane wrote:
  Hannu Krosing ha...@2ndquadrant.com writes:
   On Thu, 2011-07-28 at 10:23 -0400, Robert Haas wrote:
   I'm confused by this, because I don't think any of this can be done
   when we insert the commit record into the WAL stream.
  
   The update to the stored snapshot needs to happen at the moment when the WAL
   record is considered to be on stable storage, so the "current
   snapshot" update presumably can be done by the same process which forces
   it to stable storage, with the same contention pattern that applies to
   writing WAL records, no?
  
  No.  There is no reason to tie this to fsyncing WAL.  For purposes of
  other currently-running transactions, the commit can be considered to
  occur at the instant the commit record is inserted into WAL buffers.
  If we crash before that makes it to disk, no problem, because nothing
  those other transactions did will have made it to disk either. 
 
 Agreed. Actually figured it out right after pushing send :)
 
  The
  advantage of defining it that way is you don't have weirdly different
  behaviors for sync and async transactions.
 
 My main point was that we already do synchronization when writing WAL,
 so why not piggyback on this to also update the latest snapshot.

So the basic design could be a sparse snapshot, consisting of 'xmin,
xmax, running_txids[numbackends]' where each backend manages its own slot
in running_txids - sets a txid when acquiring one and nulls it at commit,
possibly advancing xmin if xmin==mytxid. As the xmin update requires a full
scan of running_txids, it is also a good time to update xmax - no need
to advance xmax when inserting your next txid, so you don't need to
lock anything at insert time.

The valid xmax is still computed when getting the snapshot.

Hmm, probably no need to store xmin and xmax at all.

It needs some further analysis to figure out whether doing it this way
without any locks can produce any relevantly bad snapshots.

Maybe you still need one spinlock + memcpy of running_txids to local
memory to get a snapshot.

Also, as the running_txids array is global, it may need to be made even
sparser to minimise cache-line collisions. It needs to be a tuning decision
between cache conflicts and speed of memcpy.
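
For illustration, here is a minimal C sketch of the sparse array being
described, with each backend's slot padded out to its own cache line; the
names, the slot count, and the cache-line size are assumptions, not actual
PostgreSQL code:

#include <stdint.h>

#define MAX_BACKENDS 100          /* stand-in for max_connections */
#define CACHE_LINE   64           /* assumed cache-line size in bytes */

typedef uint32_t txid;            /* stand-in for a transaction id */

/* One slot per backend; the padding keeps concurrent writers on
 * separate cache lines, trading memcpy volume for fewer collisions. */
typedef struct
{
    volatile txid xid;            /* 0 = no transaction running */
    char          pad[CACHE_LINE - sizeof(txid)];
} RunningXidSlot;

/* Lives in shared memory; backend i writes only running_txids[i].xid. */
static RunningXidSlot running_txids[MAX_BACKENDS];

Making the array sparser this way is exactly the tuning decision mentioned
above: wider slots mean fewer cache-line collisions but a larger memcpy
when taking a snapshot.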

 
 
 -- 
 ---
 Hannu Krosing
 PostgreSQL (Infinite) Scalability and Performance Consultant
 PG Admin Book: http://www.2ndQuadrant.com/books/
 
 





Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 11:15 -0400, Robert Haas wrote:
 On Thu, Jul 28, 2011 at 11:10 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
  My main point was that we already do synchronization when writing WAL,
  so why not piggyback on this to also update the latest snapshot.
 
 Well, one problem is that it would break sync rep.

Can you elaborate on how it breaks sync rep?

 Another problem is that pretty much the last thing I want to do is
 push more work under WALInsertLock.  Based on the testing I've done so
 far, it seems like WALInsertLock, ProcArrayLock, and CLogControlLock
 are the main bottlenecks here.  I'm focusing on ProcArrayLock and
 CLogControlLock right now, but I am pretty well convinced that
 WALInsertLock is going to be the hardest nut to crack, so putting
 anything more under there seems like it's going in the wrong
 direction. 

Probably it is not just the WALInsertLock, but the fact that we have
just one WAL. It can become a bottleneck once we have a significant number
of processors fighting to write to a single WAL.

  IMHO, anyway.
 
 -- 
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company
 





Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Jul 28, 2011 at 10:33 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 But should we rethink that?  Your point that hot standby transactions on
 a slave could see snapshots that were impossible on the parent was
 disturbing.  Should we look for a way to tie "transaction becomes
 visible" to its creation of a commit WAL record?  I think the fact that
 they are not an indivisible operation is an implementation artifact, and
 not a particularly nice one.

 Well, I agree with you that it isn't especially nice, but it seems
 like a fairly intractable problem.  Currently, the standby has no way
 of knowing in what order the transactions became visible on the
 master.

Right, but if the visibility order were *defined* as the order in which
commit records appear in WAL, that problem neatly goes away.  It's only
because we have the implementation artifact that "set my xid to 0 in the
ProcArray" is decoupled from inserting the commit record that there's
any difference.

regards, tom lane



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Wed, 2011-07-27 at 22:51 -0400, Robert Haas wrote:
 On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  I wonder whether we could do something involving WAL properties --- the
  current tuple visibility logic was designed before WAL existed, so it's
  not exploiting that resource at all.  I'm imagining that the kernel of a
  snapshot is just a WAL position, ie the end of WAL as of the time you
  take the snapshot (easy to get in O(1) time).  Visibility tests then
  reduce to "did this transaction commit with a WAL record located before
  the specified position?".  You'd need some index datastructure that made
  it reasonably cheap to find out the commit locations of recently
  committed transactions, where "recent" means back to recentGlobalXmin.
  That seems possibly do-able, though I don't have a concrete design in
  mind.
 
 I was mulling this idea over some more (the same ideas keep floating
 back to the top...).  I don't think an LSN can actually work, because
 there's no guarantee that the order in which the WAL records are
 emitted is the same order in which the effects of the transactions
 become visible to new snapshots.  For example:
 
 1. Transaction A inserts its commit record, flushes WAL, and begins
 waiting for sync rep.
 2. A moment later, transaction B sets synchronous_commit=off, inserts
 its commit record, requests a background WAL flush, and removes itself
 from the ProcArray.
 3. Transaction C takes a snapshot.

It is Transaction A here which is acting badly - it should also remove
itself from the ProcArray right after it inserts its commit record, as for
everybody else except the client app of transaction A it is committed at
this point. It just can't report back to the client before getting
confirmation that it is actually syncrepped (or locally written to
stable storage).

At least from the standpoint of consistent snapshots, the right sequence
should be:

1) insert the commit record into WAL
2) remove yourself from the ProcArray (or use some other means to declare
that your transaction is no longer running)
3) if so configured, wait for WAL flush to stable storage and/or Sync Rep
confirmation

Based on this, let me suggest a simple snapshot cache mechanism.

A simple snapshot cache mechanism
=

Have an array of running transactions, with one slot per backend:

txid running_transactions[max_connections];

There are exactly 3 operations on this array.

1. insert backend's running transaction id
-

This is done at the moment of acquiring your transaction id from the system,
and synchronized by the same mechanism as getting the transaction id:

running_transactions[my_backend] = current_transaction_id

2. remove backend's running transaction id
-

This is done at the moment of committing or aborting the transaction,
again synchronized by the write-commit-record mechanism:

running_transactions[my_backend] = NULL

This should be the first thing done after inserting the WAL commit record.

3. getting a snapshot
-

memcpy() running_transactions to local memory, then construct a snapshot.


It may be that you need to protect all 3 operations with a single
spinlock; if so, then I'd propose the same spinlock used when getting
your transaction id (and placing the array near where the latest transaction
id is stored so they share a cache line).

But it is also possible that you can get logically consistent snapshots
by protecting only some ops. For example, if you protect only insert and
get snapshot, then the worst that can happen is that you get a snapshot
that is a few commits older than what you'd get with full locking, and it
may well be ok for all real uses.
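
As a rough illustration, the three operations might look like this in C,
assuming a single spinlock protects all of them; this is only a sketch
under the assumptions above (hypothetical names, a plain C11 spinlock),
not actual PostgreSQL code:

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define MAX_CONNECTIONS 100
typedef uint32_t txid;

static txid running_transactions[MAX_CONNECTIONS];   /* shared memory */
static atomic_flag snap_lock = ATOMIC_FLAG_INIT;     /* the single spinlock */

static void snap_acquire(void) { while (atomic_flag_test_and_set(&snap_lock)) ; }
static void snap_release(void) { atomic_flag_clear(&snap_lock); }

/* 1. insert: at transaction start, when the transaction id is assigned */
void insert_running_xid(int my_backend, txid my_xid)
{
    snap_acquire();
    running_transactions[my_backend] = my_xid;
    snap_release();
}

/* 2. remove: first thing after the commit (or abort) record is in WAL */
void remove_running_xid(int my_backend)
{
    snap_acquire();
    running_transactions[my_backend] = 0;            /* the "NULL" slot */
    snap_release();
}

/* 3. snapshot: one memcpy into backend-local memory */
void get_snapshot(txid *local_copy)
{
    snap_acquire();
    memcpy(local_copy, running_transactions, sizeof(running_transactions));
    snap_release();
}

Dropping the lock from some of these operations is the "protect only some
ops" variant discussed above, with the consistency caveats raised later in
the thread.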



-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 11:57 -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Thu, Jul 28, 2011 at 10:33 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  But should we rethink that?  Your point that hot standby transactions on
  a slave could see snapshots that were impossible on the parent was
  disturbing.  Should we look for a way to tie "transaction becomes
  visible" to its creation of a commit WAL record?  I think the fact that
  they are not an indivisible operation is an implementation artifact, and
  not a particularly nice one.
 
  Well, I agree with you that it isn't especially nice, but it seems
  like a fairly intractable problem.  Currently, the standby has no way
  of knowing in what order the transactions became visible on the
  master.
 
 Right, but if the visibility order were *defined* as the order in which
 commit records appear in WAL, that problem neatly goes away.  It's only
 because we have the implementation artifact that "set my xid to 0 in the
 ProcArray" is decoupled from inserting the commit record that there's
 any difference.

Yes, as I explain in another e-mail, the _only_ one for whom the
transaction is not yet committed is the waiting backend itself. For all
others it should show as committed the moment after the WAL record is
written.

It's kind of a local two-phase commit thing :)

 
   regards, tom lane
 





Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 18:05 +0200, Hannu Krosing wrote:

 But it is also possible that you can get logically consistent snapshots
 by protecting only some ops. For example, if you protect only insert and
 get snapshot, then the worst that can happen is that you get a snapshot
 that is a few commits older than what you'd get with full locking, and it
 may well be ok for all real uses.

Thinking more about it, we should lock commit/remove_txid and get_snapshot.

Having a few more running backends does not make a difference, but
seeing commits in the wrong order may.

This will cause contention between commit and get_snapshot, but
hopefully less than the current ProcArray manipulation, as there is just one
simple C array to lock and copy.

-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 18:48 +0200, Hannu Krosing wrote:
 On Thu, 2011-07-28 at 18:05 +0200, Hannu Krosing wrote:
 
  But it is also possible that you can get logically consistent snapshots
  by protecting only some ops. For example, if you protect only insert and
  get snapshot, then the worst that can happen is that you get a snapshot
  that is a few commits older than what you'd get with full locking, and it
  may well be ok for all real uses.
 
 Thinking more about it, we should lock commit/remove_txid and get_snapshot.
 
 Having a few more running backends does not make a difference, but
 seeing commits in the wrong order may.

Sorry, not true, as this may advance xmax to include some running
transactions which were missed during memcpy.

So we still need some mechanism to either synchronize the copy with
both inserts and removes, or make it atomic even in the presence of multiple
CPUs.

 This will cause contention between commit and get_snapshot, but
 hopefully less than the current ProcArray manipulation, as there is just one
 simple C array to lock and copy.
 

-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 11:57 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Thu, Jul 28, 2011 at 10:33 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 But should we rethink that?  Your point that hot standby transactions on
 a slave could see snapshots that were impossible on the parent was
 disturbing.  Should we look for a way to tie "transaction becomes
 visible" to its creation of a commit WAL record?  I think the fact that
 they are not an indivisible operation is an implementation artifact, and
 not a particularly nice one.

 Well, I agree with you that it isn't especially nice, but it seems
 like a fairly intractable problem.  Currently, the standby has no way
 of knowing in what order the transactions became visible on the
 master.

 Right, but if the visibility order were *defined* as the order in which
 commit records appear in WAL, that problem neatly goes away.  It's only
 because we have the implementation artifact that "set my xid to 0 in the
 ProcArray" is decoupled from inserting the commit record that there's
 any difference.

Hmm, interesting idea.  However, consider the scenario where some
transactions are using synchronous_commit or synchronous replication,
and others are not.  If a transaction that needs to wait (either just
for WAL flush, or for WAL flush and synchronous replication) inserts
its commit record, and then another transaction with
synchronous_commit=off comes along and inserts its commit record, the
second transaction will have to block until the first transaction is
done waiting.  We can't make either transaction visible without making
both visible, and we certainly can't acknowledge the second
transaction to the client until we've made it visible.  I'm not going
to say that's so horrible we shouldn't even consider it, but it
doesn't seem great, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 11:36 AM, Hannu Krosing ha...@krosing.net wrote:
 On Thu, 2011-07-28 at 11:15 -0400, Robert Haas wrote:
 On Thu, Jul 28, 2011 at 11:10 AM, Hannu Krosing ha...@2ndquadrant.com 
 wrote:
   My main point was that we already do synchronization when writing WAL,
   so why not piggyback on this to also update the latest snapshot.

 Well, one problem is that it would break sync rep.

 Can you elaborate on how it breaks sync rep?

Well, the point of synchronous replication is that the local machine
doesn't see the effects of the transaction until it's been replicated.
 Therefore, no one can be relying on data that might disappear in the
event the system is crushed by a falling meteor.  It would be easy,
technically speaking, to remove the transaction from the ProcArray and
*then* wait for synchronous replication, but that would offer a much
weaker guarantee than what the current version provides.  We would
still guarantee that the commit wouldn't be acknowledged to the client
which submitted it until it was replicated, but we would no longer be
able to guarantee that no one else relied on data written by the
transaction prior to successful replication.

For example, consider this series of events:

1. User asks ATM "what is my balance?".  ATM inquires of database,
which says $500.
2. User deposits a check for $100.  ATM does an UPDATE to add $100 to
balance and issues a COMMIT.  But the master has become disconnected
from the synchronous standby, so the sync rep wait hangs.
3. ATM eventually times out and tells user "sorry, I can't complete
your transaction right now".
4. User wants to know whether their check got deposited, so they walk
into the bank and ask a teller to check their balance.  Teller's
computer connects to the database and gets $600.  User is happy and
leaves.
5. Master dies.  Failover.
6. User's balance is now back to $500.  When the user finds out much
later, they say "wtf?  you told me before it was $600!".

Right now, when using synchronous replication, this series of events
CANNOT HAPPEN.  If some other transaction interrogates the state of
the database and sees the results of some transaction, it is an
ironclad guarantee that the transaction has been replicated.  If we
start making transactions visible when their WAL record is flushed or
- worse - when it's inserted, then those guarantees go away.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote:
 On Thu, Jul 28, 2011 at 11:57 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Robert Haas robertmh...@gmail.com writes:
  On Thu, Jul 28, 2011 at 10:33 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  But should we rethink that?  Your point that hot standby transactions on
  a slave could see snapshots that were impossible on the parent was
  disturbing.  Should we look for a way to tie "transaction becomes
  visible" to its creation of a commit WAL record?  I think the fact that
  they are not an indivisible operation is an implementation artifact, and
  not a particularly nice one.
 
  Well, I agree with you that it isn't especially nice, but it seems
  like a fairly intractable problem.  Currently, the standby has no way
  of knowing in what order the transactions became visible on the
  master.
 
  Right, but if the visibility order were *defined* as the order in which
  commit records appear in WAL, that problem neatly goes away.  It's only
  because we have the implementation artifact that "set my xid to 0 in the
  ProcArray" is decoupled from inserting the commit record that there's
  any difference.
 
 Hmm, interesting idea.  However, consider the scenario where some
 transactions are using synchronous_commit or synchronous replication,
 and others are not.  If a transaction that needs to wait (either just
 for WAL flush, or for WAL flush and synchronous replication) inserts
 its commit record, and then another transaction with
 synchronous_commit=off comes along and inserts its commit record, the
 second transaction will have to block until the first transaction is
 done waiting.  

What is the current behavior when the synchronous replication fails (say
the slave breaks down) - will the transaction be rolled back at some
point, or will it wait indefinitely, that is, until a new slave is
installed?

Or will the sync rep transaction commit when archive_command returns
true after copying the WAL segment containing this commit?

 We can't make either transaction visible without making
 both visible, and we certainly can't acknowledge the second
 transaction to the client until we've made it visible.  I'm not going
 to say that's so horrible we shouldn't even consider it, but it
 doesn't seem great, either.

Maybe this is why other databases don't offer per-backend async commit?

-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Tom Lane
Hannu Krosing ha...@krosing.net writes:
 So the basic design could be a sparse snapshot, consisting of 'xmin,
 xmax, running_txids[numbackends]' where each backend manages its own slot
 in running_txids - sets a txid when acquiring one and nulls it at commit,
 possibly advancing xmin if xmin==mytxid.

How is that different from what we're doing now?  Basically, what you're
describing is pulling the xids out of the ProcArray and moving them into
a separate data structure.  That could be a win I guess if non-snapshot-
related reasons to take ProcArrayLock represent enough of the contention
to be worth separating out, but I suspect they don't.  In particular,
the data structure you describe above *cannot* be run lock-free, because
it doesn't provide any consistency guarantees without a lock.  You need
everyone to have the same ideas about commit order, and random backends
independently changing array elements without locks won't guarantee
that.

regards, tom lane



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 21:32 +0200, Hannu Krosing wrote:
 On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote:
  
  Hmm, interesting idea.  However, consider the scenario where some
  transactions are using synchronous_commit or synchronous replication,
  and others are not.  If a transaction that needs to wait (either just
  for WAL flush, or for WAL flush and synchronous replication) inserts
  its commit record, and then another transaction with
  synchronous_commit=off comes along and inserts its commit record, the
  second transaction will have to block until the first transaction is
  done waiting.  
 
 What is the current behavior when the synchronous replication fails (say
 the slave breaks down) - will the transaction be rolled back at some
 point, or will it wait indefinitely, that is, until a new slave is
 installed?

More importantly, if the master crashes after the commit is written to
WAL, will the transaction be rolled back after recovery based on the
fact that no confirmation from the synchronous slave is received?

 Or will the sync rep transaction commit when archive_command returns
 true after copying the WAL segment containing this commit?
 
  We can't make either transaction visible without making
  both visible, and we certainly can't acknowledge the second
  transaction to the client until we've made it visible.  I'm not going
  to say that's so horrible we shouldn't even consider it, but it
  doesn't seem great, either.
 
  Maybe this is why other databases don't offer per-backend async commit?
 

-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Tom Lane
Hannu Krosing ha...@2ndquadrant.com writes:
 On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote:
 We can't make either transaction visible without making
 both visible, and we certainly can't acknowledge the second
 transaction to the client until we've made it visible.  I'm not going
 to say that's so horrible we shouldn't even consider it, but it
 doesn't seem great, either.

 Maybe this is why other databases don't offer per-backend async commit?

Yeah, I've always thought that feature wasn't as simple as it appeared.
It got in only because it was claimed to be cost-free, and it's now
obvious that it isn't.

regards, tom lane



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 15:42 -0400, Tom Lane wrote:
 Hannu Krosing ha...@2ndquadrant.com writes:
  On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote:
  We can't make either transaction visible without making
  both visible, and we certainly can't acknowledge the second
  transaction to the client until we've made it visible.  I'm not going
  to say that's so horrible we shouldn't even consider it, but it
  doesn't seem great, either.
 
  Maybe this is why other databases don't offer per-backend async commit?
 
 Yeah, I've always thought that feature wasn't as simple as it appeared.
 It got in only because it was claimed to be cost-free, and it's now
 obvious that it isn't.

I still think it is cost-free if you get the semantics of the COMMIT
contract right. (Of course it is not cost-free as in not wasting
developers' time in discussions ;) )

I'm still with you in claiming that a transaction should be visible to
other backends as committed as soon as the WAL record is inserted.

The main thing to keep in mind is that getting back a positive commit
confirmation really means (depending on various sync settings) that your
transaction is on stable storage.

BUT, _not_ getting back confirmation on commit does not guarantee that it
is not committed, just that you need to check. It may well be that it
was committed, written to stable storage _and_ also syncrepped, but then
the confirmation did not come back to you due to some network outage. Or
your client computer crashed. Or your child spilled black paint over the
monitor. Or a thousand other reasons.

Async commit has the contract that you are ready to check the few latest
commits after a crash.

But I still think that it is the right semantics to make your commit visible
to others, even before you have gotten back the confirmation yourself.

---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 15:38 -0400, Tom Lane wrote:
 Hannu Krosing ha...@krosing.net writes:
  So the basic design could be a sparse snapshot, consisting of 'xmin,
  xmax, running_txids[numbackends]' where each backend manages its own slot
  in running_txids - sets a txid when acquiring one and nulls it at commit,
  possibly advancing xmin if xmin==mytxid.
 
 How is that different from what we're doing now?  Basically, what you're
 describing is pulling the xids out of the ProcArray and moving them into
 a separate data structure.  That could be a win I guess if non-snapshot-
 related reasons to take ProcArrayLock represent enough of the contention
 to be worth separating out, but I suspect they don't. 

The idea was to make the txid array small enough to be able to memcpy it
to backend-local memory fast. But I agree it takes testing to see if it
is an overall win.

  In particular,
 the data structure you describe above *cannot* be run lock-free, because
 it doesn't provide any consistency guarantees without a lock.  You need
 everyone to have the same ideas about commit order, and random backends
 independently changing array elements without locks won't guarantee
 that.
 
   regards, tom lane
 

-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Kevin Grittner
Hannu Krosing ha...@2ndquadrant.com wrote:
 
 but I still think that it is the right semantics to make your commit
 visible to others, even before you have gotten back the
 confirmation yourself.
 
Possibly.  That combined with building snapshots based on the order
of WAL entries of commit records certainly has several appealing
aspects.  It is hard to get over the fact that you lose an existing
guarantee, though: right now, if you have one synchronous replica,
you can never see a transaction's work on the master and then *not*
see it on the slave -- the slave always has first visibility.  I
don't see how such a guarantee can exist in *either* direction with
the semantics you describe.  After seeing a transaction's work on
one system it would always be unknown whether it was visible on the
other.  There are situations where that is OK as long as each copy
has a sane order of visibility, but there are situations where
losing that guarantee might matter.
 
On the bright side, it means that transactions would become visible
on the replica in the same order as on the master, and that blocking
would be reduced.
 
-Kevin



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 3:32 PM, Hannu Krosing ha...@2ndquadrant.com wrote:
 Hmm, interesting idea.  However, consider the scenario where some
 transactions are using synchronous_commit or synchronous replication,
 and others are not.  If a transaction that needs to wait (either just
 for WAL flush, or for WAL flush and synchronous replication) inserts
 its commit record, and then another transaction with
 synchronous_commit=off comes along and inserts its commit record, the
 second transaction will have to block until the first transaction is
 done waiting.

 What is the current behavior when the synchronous replication fails (say
 the slave breaks down) - will the transaction be rolled back at some
 point, or will it wait indefinitely, that is, until a new slave is
 installed?

It will wait forever, unless you shut down the database or hit ^C.

 We can't make either transaction visible without making
 both visible, and we certainly can't acknowledge the second
 transaction to the client until we've made it visible.  I'm not going
 to say that's so horrible we shouldn't even consider it, but it
 doesn't seem great, either.

 Maybe this is why other databases don't offer per-backend async commit?

Yeah, possibly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 3:40 PM, Hannu Krosing ha...@2ndquadrant.com wrote:
 On Thu, 2011-07-28 at 21:32 +0200, Hannu Krosing wrote:
 On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote:

  Hmm, interesting idea.  However, consider the scenario where some
  transactions are using synchronous_commit or synchronous replication,
  and others are not.  If a transaction that needs to wait (either just
  for WAL flush, or for WAL flush and synchronous replication) inserts
  its commit record, and then another transaction with
  synchronous_commit=off comes along and inserts its commit record, the
  second transaction will have to block until the first transaction is
  done waiting.

 What is the current behavior when the synchronous replication fails (say
 the slave breaks down) - will the transaction be rolled back at some
 point, or will it wait indefinitely, that is, until a new slave is
 installed?

 More importantly, if the master crashes after the commit is written to
 WAL, will the transaction be rolled back after recovery based on the
 fact that no confirmation from the synchronous slave is received?

No.  You can't roll back a transaction once it's committed - ever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 4:12 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Hannu Krosing ha...@2ndquadrant.com wrote:
 but I still think that it is the right semantics to make your commit
 visible to others, even before you have gotten back the
 confirmation yourself.

 Possibly. That combined with building snapshots based on the order
 of WAL entries of commit records certainly has several appealing
 aspects.  It is hard to get over the fact that you lose an existing
 guarantee, though: right now, if you have one synchronous replica,
 you can never see a transaction's work on the master and then *not*
 see it on the slave -- the slave always has first visibility.  I
 don't see how such a guarantee can exist in *either* direction with
 the semantics you describe.  After seeing a transaction's work on
 one system it would always be unknown whether it was visible on the
 other.  There are situations where that is OK as long as each copy
 has a sane order of visibility, but there are situations where
 losing that guarantee might matter.

 On the bright side, it means that transactions would become visible
 on the replica in the same order as on the master, and that blocking
 would be reduced.

Having transactions become visible in the same order on the master and
the standby is very appealing, but I'm pretty well convinced that
allowing commits to become visible before they've been durably
committed is throwing the D in ACID out the window.  If
synchronous_commit is off, sure, but otherwise...

...Robert



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 16:20 -0400, Robert Haas wrote:
 On Thu, Jul 28, 2011 at 3:40 PM, Hannu Krosing ha...@2ndquadrant.com wrote:
  On Thu, 2011-07-28 at 21:32 +0200, Hannu Krosing wrote:
  On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote:
 
   Hmm, interesting idea.  However, consider the scenario where some
   transactions are using synchronous_commit or synchronous replication,
   and others are not.  If a transaction that needs to wait (either just
   for WAL flush, or for WAL flush and synchronous replication) inserts
   its commit record, and then another transaction with
   synchronous_commit=off comes along and inserts its commit record, the
   second transaction will have to block until the first transaction is
   done waiting.
 
  What is the current behavior when the synchronous replication fails (say
  the slave breaks down) - will the transaction be rolled back at some
  point, or will it wait indefinitely, that is, until a new slave is
  installed?
 
  More importantly, if the master crashes after the commit is written to
  WAL, will the transaction be rolled back after recovery based on the
  fact that no confirmation from the synchronous slave is received?
 
 No.  You can't roll back a transaction once it's committed - ever.

So in case of a stuck slave, the syncrep transaction is committed after a
crash, but is not committed before the crash happens?

Or will the replay process get stuck again during recovery?

 
 -- 
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company
 





Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 4:36 PM, Hannu Krosing ha...@krosing.net wrote:
 So in case of a stuck slave, the syncrep transaction is committed after a
 crash, but is not committed before the crash happens?

Yep.

 Or will the replay process get stuck again during recovery?

Nope.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 Having transactions become visible in the same order on the master
 and the standby is very appealing, but I'm pretty well convinced
 that allowing commits to become visible before they've been
 durably committed is throwing the D in ACID out the window.  If
 synchronous_commit is off, sure, but otherwise...
 
It has been durably committed on the master, but not on the
supposedly synchronous copy; so it's not so much throwing out the D
in ACID as throwing out the "synchronous" in synchronous
replication.  :-(
 
Unless I'm missing something we have a choice to make -- I see four
possibilities (already mentioned on this thread, I think):
 
(1)  Transactions are visible on the master which won't necessarily
be there if a meteor takes out the master and you need to resume
operations on the replica.
 
(2)  An asynchronous commit must block behind any pending
synchronous commits if synchronous replication is in use.
 
(3)  Transactions become visible on the replica in a different order
than they became visible on the master.
 
(4)  We communicate acceptable snapshots to the replica to make the
order of visibility match the master even when that
doesn't match the order that transactions returned from commit.
 
I don't see how we can accept (1) and call it synchronous
replication.  I'm pretty dubious about (3), because we don't even
have Snapshot Isolation on the replica, really.  Is (3) where we're
currently at?  An advantage of (4) is that on the replica we would
get the same SI behavior at Repeatable Read that exists on the
master, and we could even use the same mechanism for SSI to provide
Serializable isolation on the replica.
 
I (predictably) like (4) -- even though it's a lot of work
 
-Kevin



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Jeff Davis
On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote:
  Right, but if the visibility order were *defined* as the order in which
  commit records appear in WAL, that problem neatly goes away.  It's only
  because we have the implementation artifact that "set my xid to 0 in the
  ProcArray" is decoupled from inserting the commit record that there's
  any difference.
 
 Hmm, interesting idea.  However, consider the scenario where some
 transactions are using synchronous_commit or synchronous replication,
 and others are not.  If a transaction that needs to wait (either just
 for WAL flush, or for WAL flush and synchronous replication) inserts
 its commit record, and then another transaction with
 synchronous_commit=off comes along and inserts its commit record, the
 second transaction will have to block until the first transaction is
 done waiting.  We can't make either transaction visible without making
 both visible, and we certainly can't acknowledge the second
 transaction to the client until we've made it visible.  I'm not going
 to say that's so horrible we shouldn't even consider it, but it
 doesn't seem great, either.

I'm trying to follow along here.

Wouldn't the same issue exist if one transaction is waiting for sync rep
(synchronous_commit=on), and another is waiting for just a WAL flush
(synchronous_commit=local)? I don't think that synchronous_commit=off
is required.

Regards,
Jeff Davis






Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Kevin Grittner
Jeff Davis pg...@j-davis.com wrote:
 
 Wouldn't the same issue exist if one transaction is waiting for
 sync rep (synchronous_commit=on), and another is waiting for just
  a WAL flush (synchronous_commit=local)? I don't think that
  synchronous_commit=off is required.
 
I think you're right -- basically, to make visibility atomic with
commit and allow a fast snapshot build based on that order, any new
commit request would need to block behind any pending request,
regardless of that setting.  At least, no way around that is
apparent to me.
 
-Kevin



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Kevin Grittner
Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 
 to make visibility atomic with commit
 
I meant:
 
to make visibility atomic with WAL-write of the commit record
 
-Kevin



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread karavelov
- Quote from Hannu Krosing (ha...@2ndquadrant.com), on 28.07.2011 at 22:40 -

 
  Maybe this is why other databases don't offer per-backend async commit?
 
 

Isn't Oracle's

COMMIT WRITE NOWAIT;

basically the same - ad hoc async commit? Though their idea of a backend does
not map exactly to PostgreSQL's idea. The closest thing is per-session async
commit:

ALTER SESSION SET COMMIT_WRITE='NOWAIT';


Best regards

--
Luben Karavelov

Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 4:54 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Robert Haas robertmh...@gmail.com wrote:

 Having transactions become visible in the same order on the master
 and the standby is very appealing, but I'm pretty well convinced
 that allowing commits to become visible before they've been
 durably committed is throwing the D in ACID out the window.  If
 synchronous_commit is off, sure, but otherwise...

 It has been durably committed on the master, but not on the
 supposedly synchronous copy; so it's not so much throwing out the D
 in ACID as throwing out the "synchronous" in synchronous
 replication.  :-(

Well, depends.  Currently, the sequence of events is:

1. Insert commit record.
2. Flush commit record, if synchronous_commit in {local, on}.
3. Wait for synchronous replication, if synchronous_commit = on and
synchronous_standby_names is non-empty.
4. Make transaction visible.

If you move (4) before (3), you're throwing out the synchronous in
synchronous replication.  If you move (4) before (2), you're throwing
out the D in ACID.
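
To make the ordering concrete, here is a C-like sketch of that sequence;
the routine names are hypothetical stand-ins, not the actual functions:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical stand-ins for the real routines. */
extern XLogRecPtr InsertCommitRecord(void);
extern void FlushWALThrough(XLogRecPtr lsn);
extern void WaitForSyncRep(XLogRecPtr lsn);
extern void MakeTransactionVisible(void);

void commit_sequence(bool flush_wanted, bool sync_rep_wanted)
{
    XLogRecPtr lsn = InsertCommitRecord();   /* 1 */

    if (flush_wanted)
        FlushWALThrough(lsn);                /* 2: the D in ACID */

    if (sync_rep_wanted)
        WaitForSyncRep(lsn);                 /* 3: the "synchronous" part */

    MakeTransactionVisible();                /* 4 */
    /* Hoisting (4) above (3) weakens sync rep; hoisting it above (2)
     * gives up durability. */
}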

 Unless I'm missing something we have a choice to make -- I see four
 possibilities (already mentioned on this thread, I think):

 (1)  Transactions are visible on the master which won't necessarily
 be there if a meteor takes out the master and you need to resume
 operations on the replica.

 (2)  An asynchronous commit must block behind any pending
 synchronous commits if synchronous replication is in use.

Well, again, there are three levels:

(A) synchronous_commit=off.  No waiting!
(B) synchronous_commit=local transactions, and synchronous_commit=on
transactions when sync rep is not in use.  Wait for xlog flush.
(C) synchronous_commit=on transactions when sync rep IS in use.  Wait
for xlog flush and replication.

Under your option #2, if a type-A transaction commits after a type-B
transaction, it will need to wait for the type-B transaction's xlog
flush.  If a type-A transaction commits after a type-C transaction, it
will need to wait for the type-C transaction to flush xlog and
replicate.  And if a type-B transaction commits after a type-C
transaction, there's no additional waiting for xlog flush, because the
type-B transaction would have to wait for that anyway.  But it will
also have to wait for the preceding type-C transaction to replicate.
So basically, you can't be more asynchronous than the guy in front of
you.

Aside from the fact that this behavior isn't too hot from a user
perspective, it might lead to some pretty complicated locking.  Every
time a transaction finishes xlog flush or sync rep, it's got to go
release the transactions that piled up behind it - but not too many,
just up to the next one that still needs to wait on some higher LSN.
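
For illustration, that release step might look roughly like this, assuming
a wait queue kept sorted by each committer's required LSN; the types and
names are hypothetical, not actual PostgreSQL code:

#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

typedef struct Waiter
{
    XLogRecPtr     wait_lsn;   /* LSN that must be flushed/replicated first */
    struct Waiter *next;       /* queue kept sorted by wait_lsn */
} Waiter;

extern void WakeBackend(Waiter *w);   /* hypothetical: set the backend's latch */

/* Called by a committer once its own wait through done_lsn has finished. */
void release_followers(Waiter **queue, XLogRecPtr done_lsn)
{
    while (*queue != NULL && (*queue)->wait_lsn <= done_lsn)
    {
        Waiter *w = *queue;

        *queue = w->next;
        WakeBackend(w);        /* its required LSN is now satisfied */
    }
    /* The first remaining waiter still needs a higher LSN, so it and
     * everyone queued behind it stay blocked. */
}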

 (3)  Transactions become visible on the replica in a different order
 than they became visible on the master.

 (4)  We communicate acceptable snapshots to the replica to make the
 order of visibility match the master even when that
 doesn't match the order that transactions returned from commit.

 I don't see how we can accept (1) and call it synchronous
 replication.  I'm pretty dubious about (3), because we don't even
 have Snapshot Isolation on the replica, really.  Is (3) where we're
 currently at?  An advantage of (4) is that on the replica we would
 get the same SI behavior at Repeatable Read that exists on the
 master, and we could even use the same mechanism for SSI to provide
 Serializable isolation on the replica.

 I (predictably) like (4) -- even though it's a lot of work

I think that (4), beyond being a lot of work, will also have pretty
terrible performance.  You're basically talking about emitting two WAL
records for every commit instead of one.  That's not going to be
awesome.  It might be OK for small or relatively lightly loaded
systems, or those with big transactions.  But for something like
pgbench or DBT-2, I think it's going to be a big problem.  WAL is
already a major bottleneck for us; we need to find a way to make it
less of one, not more.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Hannu Krosing
On Thu, 2011-07-28 at 16:42 -0400, Robert Haas wrote:
 On Thu, Jul 28, 2011 at 4:36 PM, Hannu Krosing ha...@krosing.net wrote:
  So in case of a stuck slave, the syncrep transaction is committed after a
  crash, but is not committed before the crash happens?
 
 Yep.
 
  Or will the replay process get stuck again during recovery?
 
 Nope.

Are you sure? I mean the case when a stuck master comes up but the slave is
still not functional.

How does this behavior currently fit in with ACID and sync guarantees?

 -- 
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company
 





Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Ants Aasma
On Thu, Jul 28, 2011 at 11:54 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 (4)  We communicate acceptable snapshots to the replica to make the
 order of visibility match the master even when that
 doesn't match the order that transactions returned from commit.

I wonder if some interpretation of 2 phase commit could make Robert's
original suggestion implement this.

On the master the commit sequence would look something like:
1. Insert commit record to the WAL
2. Wait for replication
3. Get a commit seq nr and mark XIDs visible
4. WAL log the seq nr
5. Return success to client

When replaying:
* When replaying commit record, do everything but make
  the tx visible.
* When replaying the commit sequence number:
  if there is a gap between last visible commit and current:
    insert the commit sequence nr. into the list of waiting commits.
  else:
    mark current and all directly following waiting tx's visible.

This would give consistent visibility order on master and slave. Robert
is right that this would undesirably increase WAL traffic. Delaying this
traffic would undesirably increase replay lag between master and slave.
But it seems to me that this could be an optional WAL level on top of
hot_standby that would only be enabled if consistent visibility on
slaves is desired.
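
For what it's worth, a rough C rendering of that replay logic, with gaps
tracked in a small ring of pending flags; the names and the bound on the
gap window are assumptions, not actual PostgreSQL code:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t CommitSeqNo;

#define PENDING_SLOTS 1024                 /* assumed bound on the gap window */

static CommitSeqNo last_visible = 0;
static bool pending[PENDING_SLOTS];        /* seq nrs replayed out of order */

extern void MakeXactVisible(CommitSeqNo seq);   /* hypothetical */

void replay_commit_seqno(CommitSeqNo seq)
{
    if (seq != last_visible + 1)
    {
        pending[seq % PENDING_SLOTS] = true;    /* gap: park this commit */
        return;
    }

    MakeXactVisible(seq);
    last_visible = seq;

    /* Drain any directly following commits that were parked earlier. */
    while (pending[(last_visible + 1) % PENDING_SLOTS])
    {
        pending[(last_visible + 1) % PENDING_SLOTS] = false;
        last_visible++;
        MakeXactVisible(last_visible);
    }
}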

--
Ants Aasma



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Ants Aasma
On Fri, Jul 29, 2011 at 2:20 AM, Robert Haas robertmh...@gmail.com wrote:
 Well, again, there are three levels:

 (A) synchronous_commit=off.  No waiting!
 (B) synchronous_commit=local transactions, and synchronous_commit=on
 transactions when sync rep is not in use.  Wait for xlog flush.
 (C) synchronous_commit=on transactions when sync rep IS in use.  Wait
 for xlog flush and replication.
...
 So basically, you can't be more asynchronous than the guy in front of
 you.

(A) still gives a guarantee - transactions that begin after the commit
returns see the committed transaction. A weaker variant would say that if
the commit returns, and the server doesn't crash in the meantime, the commit
would at some point become visible. Maybe even that transactions that begin
after the commit returns become visible after that commit.

--
Ants Aasma



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 7:54 PM, Ants Aasma ants.aa...@eesti.ee wrote:
 On Thu, Jul 28, 2011 at 11:54 PM, Kevin Grittner
 kevin.gritt...@wicourts.gov wrote:
 (4)  We communicate acceptable snapshots to the replica to make the
 order of visibility match the master even when that
 doesn't match the order that transactions returned from commit.

 I wonder if some interpretation of two-phase commit could make Robert's
 original suggestion implement this.

 On the master the commit sequence would look something like:
 1. Insert commit record to the WAL
 2. Wait for replication
 3. Get a commit seq nr and mark XIDs visible
 4. WAL log the seq nr
 5. Return success to client

 When replaying:
 * When replaying commit record, do everything but make
  the tx visible.
 * When replaying the commit sequence number
    if there is a gap between last visible commit and current:
       insert the commit sequence nr. to the list of waiting commits.
    else:
      mark current and all directly following waiting tx's visible

 This would give consistent visibility order on master and slave. Robert
 is right that this would undesirably increase WAL traffic. Delaying this
 traffic would undesirably increase replay lag between master and slave.
 But it seems to me that this could be an optional WAL level on top of
 hot_standby that would only be enabled if consistent visibility on
 slaves is desired.

I think you nailed it.

An additional point to think about: if we were willing to insist on
streaming replication, we could send the commit sequence numbers via a
side channel rather than writing them to WAL, which would be a lot
cheaper.  That might even be a reasonable thing to do, because if
you're doing log shipping, this is all going to be super-not-real-time
anyway.  OTOH, I know we don't want to make WAL shipping anything less
than a first class citizen, so maybe not.

At any rate, we may be getting a little sidetracked here from the
original point of the thread, which was how to make snapshot-taking
cheaper.  Maybe there's some tie-in to when transactions become
visible, but I think it's pretty weak.  The existing system could be
hacked up to avoid making transactions visible out of LSN order, and
the system I proposed could make them visible either in LSN order or
do the same thing we do now.  They are basically independent problems,
AFAICS.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots

2011-07-28 Thread Robert Haas
On Thu, Jul 28, 2011 at 8:12 PM, Ants Aasma ants.aa...@eesti.ee wrote:
 On Fri, Jul 29, 2011 at 2:20 AM, Robert Haas robertmh...@gmail.com wrote:
 Well, again, there are three levels:

 (A) synchronous_commit=off.  No waiting!
 (B) synchronous_commit=local transactions, and synchronous_commit=on
 transactions when sync rep is not in use.  Wait for xlog flush.
 (C) synchronous_commit=on transactions when sync rep IS in use.  Wait
 for xlog flush and replication.
 ...
 So basically, you can't be more asynchronous than the guy in front of
 you.

 (A) still gives a guarantee - transactions that begin after the commit
 returns see the committed transaction. A weaker variant would say that
 if the commit returns, and the server doesn't crash in the meantime, the
 commit will at some point become visible. Maybe even that transactions
 that begin after the commit returns will eventually see that commit.

Yeah, you could do that.  But that's such a weak guarantee that I'm
not sure it has much practical utility.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] cheaper snapshots

2011-07-27 Thread Robert Haas
On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wonder whether we could do something involving WAL properties --- the
 current tuple visibility logic was designed before WAL existed, so it's
 not exploiting that resource at all.  I'm imagining that the kernel of a
 snapshot is just a WAL position, ie the end of WAL as of the time you
 take the snapshot (easy to get in O(1) time).  Visibility tests then
 reduce to "did this transaction commit with a WAL record located before
 the specified position?".  You'd need some index data structure that made
 it reasonably cheap to find out the commit locations of recently
 committed transactions, where "recent" means back to RecentGlobalXmin.
 That seems possibly do-able, though I don't have a concrete design in
 mind.
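
Rendered as toy C (all names hypothetical; the analysis below explains
why the ordering assumption fails), the quoted idea amounts to:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TransactionId;

extern XLogRecPtr GetEndOfWAL(void);                /* hypothetical */
extern XLogRecPtr CommitLSNOf(TransactionId xid);   /* hypothetical lookup
                                                     * into the index data
                                                     * structure; 0 if the
                                                     * XID never committed */

/* Taking a snapshot is O(1): just remember the current WAL position. */
static XLogRecPtr
TakeLsnSnapshot(void)
{
    return GetEndOfWAL();
}

/* Visible iff the commit record lies before the snapshot position. */
static bool
XidVisible(TransactionId xid, XLogRecPtr snapshot_lsn)
{
    XLogRecPtr commit_lsn = CommitLSNOf(xid);

    return commit_lsn != 0 && commit_lsn < snapshot_lsn;
}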

I was mulling this idea over some more (the same ideas keep floating
back to the top...).  I don't think an LSN can actually work, because
there's no guarantee that the order in which the WAL records are
emitted is the same order in which the effects of the transactions
become visible to new snapshots.  For example:

1. Transaction A inserts its commit record, flushes WAL, and begins
waiting for sync rep.
2. A moment later, transaction B sets synchronous_commit=off, inserts
its commit record, requests a background WAL flush, and removes itself
from the ProcArray.
3. Transaction C takes a snapshot.

Sync rep doesn't create this problem; the race exists anyway, because
the order of acquisition for WALInsertLock needn't match that for
ProcArrayLock.  This has the more-than-slightly-odd characteristic
that you could end up with a snapshot on the master that can see B but
not A and a snapshot on the slave that can see A but not B.

But having said that an LSN can't work, I don't see why we can't just
use a 64-bit counter.  In fact, the predicate locking code already
does something much like this, using an SLRU, for serializable
transactions only.  In more detail, what I'm imagining is an array
with 4 billion entries, one per XID, probably broken up into files of
say 16MB each with 2 million entries per file.  Each entry is a 64-bit
value.  It is 0 if the XID has not yet started, is still running, or
has aborted.  Otherwise, it is the commit sequence number of the
transaction.  For reasons I'll explain below, I'm imagining starting
the commit sequence number counter at some very large value and having
it count down from there.  So the basic operations are:

- To take a snapshot, you just read the counter.
- To commit a transaction which has an XID, you read the counter,
stamp all your XIDs with that value, and decrement the counter.
- To find out whether an XID is visible to your snapshot, you look up
the XID in the array and get the counter value.  If the value you read
is greater than your snapshot value, it's visible.  If it's less, it's
not.
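
In toy C (not PostgreSQL code; all locking is deliberately elided here
and discussed below, and the array is sized down), those three
operations amount to:

#include <stdbool.h>
#include <stdint.h>

#define CSN_START UINT64_MAX          /* the counter runs backward */
#define TOY_XIDS  (1 << 20)           /* toy size, not 4 billion */

typedef uint32_t TransactionId;
typedef uint64_t CommitSeqNo;

static CommitSeqNo commit_counter = CSN_START;
static CommitSeqNo csn_array[TOY_XIDS];   /* 0 = not (yet) committed */

/* Taking a snapshot: just read the counter. */
static CommitSeqNo
TakeSnapshot(void)
{
    return commit_counter;
}

/* Committing: stamp all our XIDs with the counter, then decrement it. */
static void
CommitTransaction(const TransactionId *xids, int nxids)
{
    CommitSeqNo csn = commit_counter;

    for (int i = 0; i < nxids; i++)
        csn_array[xids[i]] = csn;
    commit_counter = csn - 1;
}

/*
 * Visibility: a transaction that committed before my snapshot stamped
 * its XIDs with a value strictly greater than my snapshot value (the
 * counter only moves down after stamping).  An equal stamp means the
 * commit was still in progress when the snapshot was taken.
 */
static bool
XidIsVisible(TransactionId xid, CommitSeqNo snapshot_csn)
{
    return csn_array[xid] > snapshot_csn;
}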

Now, is this algorithm any good, and how little locking can we get away with?

It seems to me that if we used an SLRU to store the array, the lock
contention would be even worse than it is under our current system,
wherein everybody fights over ProcArrayLock.  A system like this is
going to involve lots and lots of probes into the array (even if we
build a per-backend cache of some kind) and an SLRU will require at
least one LWLock acquire and release per probe.  Some kind of locking
is pretty much unavoidable, because you have to worry about pages
getting evicted from shared memory.  However, what if we used a set of
files (like SLRU) but mapped them separately into each backend's
address space?  I think this would allow both loads and stores from
the array to be done unlocked.  One fly in the ointment is that 8-byte
stores are apparently done as two 4-byte stores on some platforms.
But if the counter runs backward, I think even that is OK.  If you
happen to read an 8-byte value as it's being written (the slot holds
zero until it's stamped, so the unwritten half still reads as zero),
you'll get 4 bytes of the intended value and 4 bytes of zeros.  The
value will therefore appear to be less than what it should be.  However,
if the
value was in the midst of being written, then it's still in the midst
of committing, which means that that XID wasn't going to be visible
anyway.  Accidentally reading a smaller value doesn't change the
answer.
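
A sketch of that unlocked probe, under the stated assumptions (each
slot goes from zero to its final value exactly once, and a torn 64-bit
read yields one written half plus one zero half); names are
hypothetical, not PostgreSQL's:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t CommitSeqNo;

extern CommitSeqNo csn_array[];     /* the per-backend mapped CSN files */

static bool
XidVisibleUnlocked(TransactionId xid, CommitSeqNo snapshot_csn)
{
    CommitSeqNo csn = csn_array[xid];   /* plain 8-byte load, no lock */

    /*
     * If the load was torn: low half written, high half still zero
     * gives a value below 2^32, far under any real CSN (the counter
     * starts near UINT64_MAX and runs down); high half written, low
     * half zero gives a value <= the true stamp.  Either way we can
     * only under-read, so the test below can only err toward "not
     * visible" -- the right answer for a commit still in flight.
     */
    return csn > snapshot_csn;
}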

Committing will require a lock on the counter.

Taking a snapshot can be done unlocked if (1) 8-byte reads are atomic
and either (2a) the architecture has strong memory ordering (no
store/store reordering) or (2b) you insert a memory fence between
stamping the XIDs and decrementing the counter.  Otherwise, taking a
snapshot will also require a lock on the counter.
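
Under option (2b), the commit path from the earlier sketch would gain a
fence between the stamps and the counter decrement (shown here with the
GCC/Clang builtin; the lock serializing committers on the counter is
still elided):

#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t CommitSeqNo;

extern CommitSeqNo commit_counter;    /* from the earlier sketch */
extern CommitSeqNo csn_array[];

static void
CommitTransactionFenced(const TransactionId *xids, int nxids)
{
    CommitSeqNo csn = commit_counter;

    for (int i = 0; i < nxids; i++)
        csn_array[xids[i]] = csn;             /* stamp the XIDs first... */

    __atomic_thread_fence(__ATOMIC_RELEASE);  /* ...store/store barrier... */

    commit_counter = csn - 1;                 /* ...then publish, so anyone
                                               * who reads the new counter
                                               * value also sees the stamps */
}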

Once a particular XID precedes RecentGlobalXmin, you no longer care
about the associated counter value.  You just need to know that it
committed; the order no longer matters.  So after a crash, assuming
that you have the CLOG bits available, you can just throw away all the
array contents and start the counter over at the highest possible
value.  And, as RecentGlobalXmin advances, you can