Re: [HACKERS] XLogInsert scaling, revisited

2013-07-17 Thread Amit Kapila
On Monday, July 08, 2013 2:47 PM Heikki Linnakangas wrote:  
 Ok, I've committed this patch now. Finally, phew!

A few doubts while reading the code:

1. Why does WALInsertSlotAcquireOne(int slotno) use START_CRIT_SECTION() to
lock out cancel/die interrupts, whereas other places call
HOLD_INTERRUPTS()?

2. In GetXLogBuffer(), why is the logic to wake up waiters different when
expectedEndPtr != endptr? When is the wakeup of waiters done in the case
expectedEndPtr == endptr?

3.
static bool
ReserveXLogSwitch(..)

In the above function's header comment, why are EndPos_p/StartPos_p used
when the function arguments are EndPos/StartPos?


With Regards,
Amit Kapila.





Re: [HACKERS] XLogInsert scaling, revisited

2013-07-17 Thread Andres Freund
On 2013-07-17 15:46:00 +0530, Amit Kapila wrote:
 On Monday, July 08, 2013 2:47 PM Heikki Linnakangas wrote:  
  Ok, I've committed this patch now. Finally, phew!
 
 A few doubts while reading the code:
 
 1. Why does WALInsertSlotAcquireOne(int slotno) use START_CRIT_SECTION()
 to lock out cancel/die interrupts, whereas other places call
 HOLD_INTERRUPTS()?

A crit section does more than just stop interrupts. It also ensures
that errors that occur while inside one get converted to a PANIC. That
seems apt for SlotAcquire/Release, although the comments could possibly
be improved a bit.
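
Roughly, paraphrasing miscadmin.h and elog.c (a sketch, not the exact
source, and the details vary a bit between versions):

#define HOLD_INTERRUPTS()     (InterruptHoldoffCount++)
#define START_CRIT_SECTION()  (CritSectionCount++)

/* Either counter being nonzero makes ProcessInterrupts() defer
 * cancel/die requests. The extra effect of a critical section is in
 * errstart(), which promotes any error raised while inside one: */
if (elevel >= ERROR && CritSectionCount > 0)
    elevel = PANIC;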

 2. In GetXLogBuffer(), why is the logic to wake up waiters different
 when expectedEndPtr != endptr? When is the wakeup of waiters done in
 the case expectedEndPtr == endptr?

I am not sure what you're asking here. We wake up waiters if
expectedEndPtr != endptr because that means the wal buffer page the
'ptr' fits on currently has different content. Which in turn means we've
finished with the last page and progressed to a new one. So we wake up
everyone waiting for us.
WakeupWaiters() doesn't get passed expectedEndPtr but expectedEndPtr -
XLOG_BLCKSZ (up to there we are guaranteed to have inserted
successfully). And we're comparing with the xlogInsertingAt value which
basically measures up to where we've successfully inserted.
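
In rough code, the shape of that path is something like this (a sketch
using the names from this thread, not the committed source verbatim):

static char *
GetXLogBuffer(XLogRecPtr ptr)
{
    int         idx = XLogRecPtrToBufIdx(ptr);
    XLogRecPtr  expectedEndPtr;

    /* The WAL buffer page holding 'ptr' ends at the next block boundary. */
    expectedEndPtr = ptr - ptr % XLOG_BLCKSZ + XLOG_BLCKSZ;

    if (expectedEndPtr != XLogCtl->xlblocks[idx])
    {
        /*
         * The buffer still holds an older page, so we have finished the
         * previous page and moved on to a new one. Everything up to the
         * start of our page is fully inserted by us, so let waiters
         * advance that far before we possibly block replacing the old
         * page.
         */
        WakeupWaiters(expectedEndPtr - XLOG_BLCKSZ);
        AdvanceXLInsertBuffer(ptr, false);
    }
    return XLogCtl->pages + idx * (Size) XLOG_BLCKSZ + ptr % XLOG_BLCKSZ;
}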

 3.
 static bool
 ReserveXLogSwitch(..)
 
 In the above function's header comment, why are EndPos_p/StartPos_p used
 when the function arguments are EndPos/StartPos?

I guess that's bitrot...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] XLogInsert scaling, revisited

2013-07-17 Thread Heikki Linnakangas

On 17.07.2013 02:18, Michael Paquier wrote:

On Tue, Jul 16, 2013 at 2:24 AM, Fujii Masao masao.fu...@gmail.com  wrote:

On Mon, Jul 8, 2013 at 6:16 PM, Heikki Linnakangas
hlinnakan...@vmware.com  wrote:

Ok, I've committed this patch now. Finally, phew!


I found that this patch causes the assertion failure. When I set up simple
replication environment and promoted the standby before executing any
transaction on the master, I got the following assertion failure.

2013-07-16 02:22:06 JST sby1 LOG:  received promote request
2013-07-16 02:22:06 JST sby1 FATAL:  terminating walreceiver process
due to administrator command
2013-07-16 02:22:06 JST sby1 LOG:  redo done at 0/2F0
2013-07-16 02:22:06 JST sby1 LOG:  selected new timeline ID: 2
hrk:head-pgsql postgres$ 2013-07-16 02:22:06 JST sby1 LOG:  archive
recovery complete
TRAP: FailedAssertion("!(readOff == (XLogCtl->xlblocks[firstIdx] -
8192) % ((uint32) (16 * 1024 * 1024)))", File: "xlog.c", Line: 7048)
2013-07-16 02:22:12 JST sby1 LOG:  startup process (PID 37115) was
terminated by signal 6: Abort trap
2013-07-16 02:22:12 JST sby1 LOG:  terminating any other active server processes

Note that this is also reproducible even when trying to recover only
from archives, without streaming replication.


Fixed, thanks for the report. While at it, I slightly refactored the way 
the buffer bookkeeping works. Instead of keeping track of the index of 
the last initialized buffer, keep track of how far the buffer cache has 
been initialized, in an XLogRecPtr variable (called 
XLogCtl->InitializedUpTo). That made the code slightly more readable IMO.


- Heikki




Re: [HACKERS] XLogInsert scaling, revisited

2013-07-17 Thread Heikki Linnakangas

On 17.07.2013 15:24, Andres Freund wrote:

On 2013-07-17 15:46:00 +0530, Amit Kapila wrote:

A few doubts while reading the code:

1. Why does WALInsertSlotAcquireOne(int slotno) use START_CRIT_SECTION()
to lock out cancel/die interrupts, whereas other places call
HOLD_INTERRUPTS()?


A crit section does more than just stop interrupts. It also ensures
that errors that occur while inside one get converted to a PANIC. That
seems apt for SlotAcquire/Release, although the comments could possibly
be improved a bit.


Agreed. The comment was copied from LWLockAcquire(), which only does 
HOLD_INTERRUPTS(). The crucial difference between LWLockAcquire() and 
WALInsertSlotAcquire() is that there is no automatic cleanup mechanism 
on abort for the WAL insertion slots like there is for lwlocks. Added a 
sentence to the comment to mention that.



3.
static bool
ReserveXLogSwitch(..)

In the above function's header comment, why are EndPos_p/StartPos_p used
when the function arguments are EndPos/StartPos?


I guess that's bitrot...


Yep, fixed.

Thanks for the review!

- Heikki




Re: [HACKERS] XLogInsert scaling, revisited

2013-07-16 Thread Michael Paquier
On Tue, Jul 16, 2013 at 2:24 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Mon, Jul 8, 2013 at 6:16 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 Ok, I've committed this patch now. Finally, phew!

 I found that this patch causes the assertion failure. When I set up simple
 replication environment and promoted the standby before executing any
 transaction on the master, I got the following assertion failure.

 2013-07-16 02:22:06 JST sby1 LOG:  received promote request
 2013-07-16 02:22:06 JST sby1 FATAL:  terminating walreceiver process
 due to administrator command
 2013-07-16 02:22:06 JST sby1 LOG:  redo done at 0/2F0
 2013-07-16 02:22:06 JST sby1 LOG:  selected new timeline ID: 2
 hrk:head-pgsql postgres$ 2013-07-16 02:22:06 JST sby1 LOG:  archive
 recovery complete
 TRAP: FailedAssertion("!(readOff == (XLogCtl->xlblocks[firstIdx] -
 8192) % ((uint32) (16 * 1024 * 1024)))", File: "xlog.c", Line: 7048)
 2013-07-16 02:22:12 JST sby1 LOG:  startup process (PID 37115) was
 terminated by signal 6: Abort trap
 2013-07-16 02:22:12 JST sby1 LOG:  terminating any other active server 
 processes
Note that this is also reproducible even when trying to recover only
from archives, without streaming replication.
Regards,
--
Michael




Re: [HACKERS] XLogInsert scaling, revisited

2013-07-15 Thread Fujii Masao
On Mon, Jul 8, 2013 at 6:16 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Ok, I've committed this patch now. Finally, phew!

I found that this patch causes the assertion failure. When I set up simple
replication environment and promoted the standby before executing any
transaction on the master, I got the following assertion failure.

2013-07-16 02:22:06 JST sby1 LOG:  received promote request
2013-07-16 02:22:06 JST sby1 FATAL:  terminating walreceiver process
due to administrator command
2013-07-16 02:22:06 JST sby1 LOG:  redo done at 0/2F0
2013-07-16 02:22:06 JST sby1 LOG:  selected new timeline ID: 2
hrk:head-pgsql postgres$ 2013-07-16 02:22:06 JST sby1 LOG:  archive
recovery complete
TRAP: FailedAssertion("!(readOff == (XLogCtl->xlblocks[firstIdx] -
8192) % ((uint32) (16 * 1024 * 1024)))", File: "xlog.c", Line: 7048)
2013-07-16 02:22:12 JST sby1 LOG:  startup process (PID 37115) was
terminated by signal 6: Abort trap
2013-07-16 02:22:12 JST sby1 LOG:  terminating any other active server processes

Regards,

-- 
Fujii Masao




Re: [HACKERS] XLogInsert scaling, revisited

2013-07-12 Thread Abhijit Menon-Sen
At 2013-07-08 12:16:34 +0300, hlinnakan...@vmware.com wrote:

 Ok, I've committed this patch now. Finally, phew!

Good.

I'd signed up to review this patch, and did spend some considerable time
on it, but although I managed to understand what was going on (which was
my objective), I didn't find anything useful to say about it beyond what
others had brought up already. Sorry.

 Thanks to everyone involved for the review and testing! And if you
 can, please review the patch as committed once more.

…so I read through the patch as committed again, and it looks good.

Thanks.

-- Abhijit




Re: [HACKERS] XLogInsert scaling, revisited

2013-07-09 Thread Andres Freund
On 2013-07-09 08:00:52 +0900, Michael Paquier wrote:
 On Mon, Jul 8, 2013 at 6:16 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
  Ok, I've committed this patch now. Finally, phew!
 +1. Great patch!

And one more: +1

There seem to be quite a few low-hanging fruits for making stuff faster now
that this bottleneck is out of the way. By far the biggest thing visible in
profiles at the moment seems to be the CRC32 computation...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] XLogInsert scaling, revisited

2013-07-08 Thread Heikki Linnakangas

On 01.07.2013 16:40, Andres Freund wrote:

On 2013-06-26 18:52:30 +0300, Heikki Linnakangas wrote:

* Could you document the way slots prevent checkpoints from occurring
   when XLogInsert rechecks for full page writes? I think it's correct -
   but not very obvious at a glance.


There's this in the comment near the top of the file:

  * To update RedoRecPtr or fullPageWrites, one has to make sure that all
  * subsequent inserters see the new value. This is done by reserving all the
  * insertion slots before changing the value. XLogInsert reads RedoRecPtr and
  * fullPageWrites after grabbing a slot, so by holding all the slots
  * simultaneously, you can ensure that all subsequent inserts see the new
  * value.  Those fields change very seldom, so we prefer to be fast and
  * non-contended when they need to be read, and slow when they're changed.

Does that explain it well enough? XLogInsert holds onto a slot while it
rechecks for full page writes.


I am a bit worried about that logic. We're basically reverting to the
old logic where xlog writing was an exclusive affair. We will have to wait
for all the other queued inserters before we're finished. I am afraid
that that will show up latencywise.


A single stall of the xlog-insertion pipeline at a checkpoint is 
hardly going to be a problem. I wish PostgreSQL was real-time enough for 
that to matter, but I think we're very very far from that.


- Heikki




Re: [HACKERS] XLogInsert scaling, revisited

2013-07-08 Thread Heikki Linnakangas

Ok, I've committed this patch now. Finally, phew!

I think I've addressed all your comments about the comments. I moved 
some of the comments around: I split up the large one near the top of 
the file, moving its paragraphs closer to the code where they apply.


Regarding your performance-related worries: you have good thoughts on 
how to improve things, but let's wait until we see some evidence that 
there is a problem, before any further optimizations.


I fixed one bug related to aligning the WAL buffers. The patch assumed 
the WAL buffers to be aligned at a full XLOG_BLCKSZ boundary, but did not 
enforce it. That was already happening on platforms with O_DIRECT, which 
is why I didn't notice it in testing, but it would've failed on others.


I just remembered one detail that I'm not sure has been mentioned on the 
mailing list yet. Per the commit message:



This has one user-visible change: switching to a new WAL segment with
pg_switch_xlog() now fills the remaining unused portion of the
segment with zeros. This potentially adds some overhead, but it has
been a very common practice by DBA's to clear the tail of the
segment with an external pg_clearxlogtail utility anyway, to make the
WAL files compress better. With this patch, it's no longer necessary
to do that.



I simplified the handling of xlogInsertingAt per discussion, and added 
the memory barrier to GetXLogBuffer(). I ran again the pgbench tests I 
did earlier with the now-committed version of the patch (except for some 
comment changes). The results are here:


http://hlinnaka.iki.fi/xloginsert-scaling/xloginsert-scale-26/

I tested three different workloads, with different numbers of slots 
ranging from 1 to 1000. The tests were run on a 32-core machine, in a 
VM. As the baseline, I used a fresh checkout from master branch, with 
this one-line patch: 
http://www.postgresql.org/message-id/519a938a.1070...@vmware.com. That 
patch adjusts the spinlock delay loop, which happens to make a big 
difference on this box. We'll have to revisit and apply that patch 
separately, but I think that's the correct baseline to test this 
xloginsert scaling patch against.


nobranch


This is the pgbench -N workload. Throughput is mainly limited by 
flushing the WAL at commits. The patch makes no big difference here, 
which is good. The point of the test is to demonstrate that the patch 
doesn't make WAL flushing noticeably more expensive.


nobranch-sync-commit-off


Same as above, but with synchronous_commit=off. Here the patch helps somewhat. 
WAL insertion doesn't seem to be the biggest limiting factor in this 
test, but it's nice to see some benefit.


xlogtest


The workload in this test is a single INSERT statement that inserts a 
lot of rows: INSERT INTO foo:client_id SELECT 1 FROM 
generate_series(1,100) a, generate_series(1,100) b. Each client inserts 
to a separate table, to eliminate as much lock contention as possible, 
making the WAL insertion bottleneck as serious as possible (although I'm 
not sure how much difference that makes). This is pretty much a best-case 
scenario for this patch.


This test shows a big gain from the patch, as it should. The peak 
performance goes from about 35 TPS to 100 TPS. With the patch, I suspect 
the test saturates the I/O subsystem at that point. I think it could go 
higher with better I/O hardware.



All in all, I'm satisfied enough with this to commit. The default number 
of insertion slots, 8, seems to work fine for all the workloads on this 
box. We may have to adjust that or other details later, but what it 
needs now is more testing by people with different hardware.


Thanks to everyone involved for the review and testing! And if you can, 
please review the patch as committed once more.


- Heikki




Re: [HACKERS] XLogInsert scaling, revisited

2013-07-08 Thread Heikki Linnakangas

On 08.07.2013 12:16, Heikki Linnakangas wrote:

I just remembered one detail that I'm not sure has been mentioned on the
mailing list yet. Per the commit message:


This has one user-visible change: switching to a new WAL segment with
pg_switch_xlog() now fills the remaining unused portion of the
segment with zeros. This potentially adds some overhead, but it has
been a very common practice by DBA's to clear the tail of the
segment with an external pg_clearxlogtail utility anyway, to make the
WAL files compress better. With this patch, it's no longer necessary
to do that.


Magnus just pointed out over IM that the above also applies to 
xlog-switches caused by archive_timeout, not just pg_switch_xlog(). IOW, 
all xlog-switch WAL records.


- Heikki




Re: [HACKERS] XLogInsert scaling, revisited

2013-07-08 Thread Heikki Linnakangas

On 27.06.2013 20:36, Andres Freund wrote:

On 2013-06-26 18:52:30 +0300, Heikki Linnakangas wrote:

There's this in the comment near the top of the file:


Btw, I find the 'you' used in the comment somewhat irritating. Not badly
so, but reading something like:

  * When you are about to write
  * out WAL, it is important to call WaitXLogInsertionsToFinish() *before*
  * acquiring WALWriteLock, to avoid deadlocks. Otherwise you might get stuck
  * waiting for an insertion to finish (or at least advance to next
  * uninitialized page), while you're holding WALWriteLock.

just seems strange to me. If this were directed at plugin authors, maybe, but
it really isn't...


Agreed, that was bad wording. Fixed.

- Heikki




Re: [HACKERS] XLogInsert scaling, revisited

2013-07-08 Thread Ants Aasma
On Mon, Jul 8, 2013 at 12:16 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Ok, I've committed this patch now. Finally, phew!

Fantastic work!

 I simplified the handling of xlogInsertingAt per discussion, and added the
 memory barrier to GetXLogBuffer(). I ran again the pgbench tests I did
 earlier with the now-committed version of the patch (except for some comment
 changes). The results are here:

I can't see a reason why a full memory barrier is needed at
GetXLogBuffer; we just need to ensure that we read the content of the
page after we check the end pointer from xlblocks. A read barrier is
enough here unless there is some other undocumented interaction. I
don't think it matters for performance, but it seems like good
practice to have the barriers exactly match the documentation.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] XLogInsert scaling, revisited

2013-07-08 Thread Heikki Linnakangas

On 08.07.2013 13:21, Ants Aasma wrote:

On Mon, Jul 8, 2013 at 12:16 PM, Heikki Linnakangas
hlinnakan...@vmware.com  wrote:

Ok, I've committed this patch now. Finally, phew!


Fantastic work!


I simplified the handling of xlogInsertingAt per discussion, and added the
memory barrier to GetXLogBuffer(). I ran again the pgbench tests I did
earlier with the now-committed version of the patch (except for some comment
changes). The results are here:


I can't see a reason why a full memory barrier is needed at
GetXLogBuffer; we just need to ensure that we read the content of the
page after we check the end pointer from xlblocks. A read barrier is
enough here unless there is some other undocumented interaction.


GetXLogBuffer() doesn't read the content of the page - it writes to it 
(or rather, the caller of GetXLogBuffer() does). The barrier is needed 
between reading xlblocks (to check that the buffer contains the right 
page), and writing the WAL data to the buffer. README.barrier says that 
you need a full memory barrier to separate a load and a store.
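
In sketch form, the problematic sequence is a load followed by the
caller's stores:

endptr = XLogCtl->xlblocks[idx];    /* load: does the buffer hold our page? */
if (expectedEndPtr == endptr)
{
    /*
     * Needs to be a full fence rather than pg_read_barrier(): the next
     * access is a store (the caller memcpy()s WAL data onto the page),
     * and a read barrier only orders loads against loads.
     */
    pg_memory_barrier();
    return XLogCtl->pages + idx * (Size) XLOG_BLCKSZ + ptr % XLOG_BLCKSZ;
}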


- Heikki




Re: [HACKERS] XLogInsert scaling, revisited

2013-07-08 Thread Andres Freund
On 2013-07-08 10:45:41 +0300, Heikki Linnakangas wrote:
 On 01.07.2013 16:40, Andres Freund wrote:
 On 2013-06-26 18:52:30 +0300, Heikki Linnakangas wrote:
 * Could you document the way slots prevent checkpoints from occurring
when XLogInsert rechecks for full page writes? I think it's correct -
 but not very obvious at a glance.
 
 There's this in the comment near the top of the file:
 
   * To update RedoRecPtr or fullPageWrites, one has to make sure that all
   * subsequent inserters see the new value. This is done by reserving all 
  the
  * insertion slots before changing the value. XLogInsert reads RedoRecPtr and
  * fullPageWrites after grabbing a slot, so by holding all the slots
   * simultaneously, you can ensure that all subsequent inserts see the new
   * value.  Those fields change very seldom, so we prefer to be fast and
   * non-contended when they need to be read, and slow when they're changed.
 
 Does that explain it well enough? XLogInsert holds onto a slot while it
 rechecks for full page writes.
 
 I am a bit worried about that logic. We're basically reverting to the
 old logic where xlog writing was an exclusive affair. We will have to wait
 for all the other queued inserters before we're finished. I am afraid
 that that will show up latencywise.
 
 A single stall of the xlog-insertion pipeline at a checkpoint is hardly
 going to be a problem. I wish PostgreSQL was real-time enough for that to
 matter, but I think we're very very far from that.

Well, the stall won't necessarily be that short. There might be several
backends piling on every insertion slot and waiting - and thus put to
sleep by the kernel. I am pretty sure it's easy enough to get stalls in
the second range that way.

Sure, there are lots of reasons we don't have all that reliable response
times, but IME the amount of response time jitter is one of the bigger
pain points of postgres. And this feature has a good chance of reducing
that pain noticeably...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] XLogInsert scaling, revisited

2013-07-08 Thread Ants Aasma
On Mon, Jul 8, 2013 at 1:38 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 GetXLogBuffer() doesn't read the content of the page - it writes to it (or
 rather, the caller of GetXLogBuffer() does). The barrier is needed between
 reading xlblocks (to check that the buffer contains the right page), and
 writing the WAL data to the buffer. README.barrier says that you need a full
 memory barrier to separate a load and a store.

Indeed you are right. My bad. I somehow thought that every location in
the WAL buffer is written once while it's actually done twice due to
AdvanceXLInsertBuffer() zeroing the page out.

Now thinking about it, if that memset or the full memory barrier in
GetXLogBuffer() ever prove to be a significant overhead, the
initialization could be optimized to avoid zeroing the page.
AdvanceXLInsertBuffer() would only write the header fields,
CopyXLogRecordToWAL() needs to take care to write out zeroes for
alignment padding and xlog switches and always write out xlp_rem_len
and XLP_FIRST_IS_CONTRECORD bit of xlp_info. xlp_info needs to be
split into two 8-bit variables so XLP_FIRST_IS_CONTRECORD and
XLP_LONG_HEADER/XLP_BKP_REMOVABLE can be written into disjoint memory
locations.  This way all memory locations in the xlog buffer page are
stored exactly once and there is no data race between writes, making it
possible to omit the barrier from GetXLogBuffer.
WaitXLogInsertionsToFinish() takes care of the memory barrier for
XlogWrite().

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] XLogInsert scaling, revisited

2013-07-08 Thread Robert Haas
On Jul 8, 2013, at 4:16 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
 Ok, I've committed this patch now. Finally, phew!

Woohoo!

...Robert



Re: [HACKERS] XLogInsert scaling, revisited

2013-07-08 Thread Michael Paquier
On Mon, Jul 8, 2013 at 6:16 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Ok, I've committed this patch now. Finally, phew!
+1. Great patch!
--
Michael




Re: [HACKERS] XLogInsert scaling, revisited

2013-07-02 Thread Heikki Linnakangas

On 27.06.2013 15:27, Andres Freund wrote:

Hi,

On 2013-06-26 18:52:30 +0300, Heikki Linnakangas wrote:

* Could you document the way slots prevent checkpoints from occurring
   when XLogInsert rechecks for full page writes? I think it's correct -
   but not very obvious at a glance.


There's this in the comment near the top of the file:

  * To update RedoRecPtr or fullPageWrites, one has to make sure that all
  * subsequent inserters see the new value. This is done by reserving all the
  * insertion slots before changing the value. XLogInsert reads RedoRecPtr and
  * fullPageWrites after grabbing a slot, so by holding all the slots
  * simultaneously, you can ensure that all subsequent inserts see the new
  * value.  Those fields change very seldom, so we prefer to be fast and
  * non-contended when they need to be read, and slow when they're changed.

Does that explain it well enough? XLogInsert holds onto a slot while it
rechecks for full page writes.


Yes. Earlier on that was explained in XLogInsert() - maybe point to the
documentation on top of the file? The file is too big to expect everyone
to reread the whole file in a new release...


* The read of Insert->RedoRecPtr while rechecking whether it's out of
   date now is unlocked, is that correct? And why?



Same here, XLogInsert holds the slot while rechecking Insert->RedoRecPtr.


I was wondering whether it's guaranteed that we don't read a cached
value, but I didn't think of the memory barriers due to the spinlock in
Release/AcquireSlot. I think I thought that release wouldn't acquire the
spinlock at all.


* Have you measured whether it works to just keep as many slots as
   max_backends requires around and not bothering with dynamically
   allocating them to inserters?
   That seems to require to keep actually waiting slots in a separate
   list which very well might make that too expensive.

   Correctness-wise the biggest problem for that probably is the exclusive
   acquisition of all slots CreateCheckpoint() does...


Hmm. It wouldn't be much different, each backend would still need to reserve
its own dedicated slot, because it might be held by a backend that
grabbed all the slots. Also, bgwriter and checkpointer need to flush the
WAL, so they'd need slots too.

More slots make WaitXLogInsertionsToFinish() more expensive, as it has to
loop through all of them. IIRC some earlier pgbench tests I ran didn't show
much difference in performance, whether there were 40 slots or 100, as long
as there was enough of them. I can run some more tests with even more slots
to see how it behaves.


In a very quick test I ran previously I did see the slot acquisition in
the profile when using short connections that all did some quick inserts
- which isn't surprising since they tend to all start with the same
slot so they will all fight for the first slots.

Maybe we could ameliorate that by initializing MySlotNo to
(MyProc->pgprocno % num_xloginsert_slots) when uninitialized? That won't
be completely fair since the number of procs won't usually be a multiple
of num_insert_slots, but I doubt that will be problematic.


* What about using some sort of linked list of unused slots for
   WALInsertSlotAcquireOne? Every time we're done we put it to the *end*
   of the list so it's less likely to have been grabbed by somebody else
   so we can reuse it.
   a) To grab a new one go to the head of the linked list spinlock it and
   recheck whether it's still free. If not, restart. Otherwise, remove
   from list.
   b) To reuse a previously used slot

   That way we only have to do the PGSemaphoreLock() dance if there
   really aren't any free slots.


That adds a spinlock acquisition / release into the critical path, to
protect the linked list. I'm trying very hard to avoid serialization points
like that.


Hm. We already have at least one spinlock in that path? Or are you
thinking of a global one protecting the linked list? If so, I think we
should be able to get away with locking individual slots.
IIRC if you never need to delete list elements you can even get away
with a lockless implementation.


While profiling the tests I've done, finding a free slot hasn't shown up
much. So I don't think it's a problem performance-wise as it is, and I don't
think it would make the code simpler.


It sure wouldn't make it simpler. As I said above, I saw the slot
acquisition in a profile when using -C and a short pgbench script (that
just inserts 10 rows).


* The queuing logic around slots seems to lack documentation. It's
   complex enough to warrant that imo.


Yeah, it's complex. I expanded the comments above XLogInsertSlot, to explain
how xlogInsertingAt works. Did that help, or was it some other part of that
that needs more docs?


What I don't understand so far is why we don't reset xlogInsertingAt
during SlotReleaseOne. There are places documenting that we don't do so,
but not why, unless I missed it.


We could reset it in SlotReleaseOne. However, the next inserter that 

Re: [HACKERS] XLogInsert scaling, revisited

2013-07-02 Thread Andres Freund
On 2013-07-02 19:48:40 +0300, Heikki Linnakangas wrote:
 If so, why isn't it sufficient to
 initialize it in ReserveXLogInsertLocation?
 
 It would be, but then ReserveXLogInsertLocation would need to hold the
 slot's spinlock at the same time with insertpos_lck, so that it could
 atomically read the current CurrBytePos value and copy it to
 xlogInsertingAt. It's important to keep ReserveXLogInsertLocation() as
 lightweight as possible, to maximize concurrency.

If you make it so that you always acquire the slot's spinlock first and
insertpos_lck after, the scalability shouldn't be any different from
now? Both the duration during which insertpos_lck is held and the
overall amount of atomic ops should be the same?
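
I.e., something like this sketch (not committed code; the point is just
the fixed lock order, slot spinlock first and insertpos_lck second):

SpinLockAcquire(&MySlot->mutex);
SpinLockAcquire(&Insert->insertpos_lck);

startbytepos = Insert->CurrBytePos;
Insert->CurrBytePos = startbytepos + size;
/* initialized atomically with the reservation */
MySlot->xlogInsertingAt = XLogBytePosToRecPtr(startbytepos);

SpinLockRelease(&Insert->insertpos_lck);
SpinLockRelease(&MySlot->mutex);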

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] XLogInsert scaling, revisited

2013-07-01 Thread Andres Freund
On 2013-06-26 18:52:30 +0300, Heikki Linnakangas wrote:
 * Could you document the way slots prevent checkpoints from occurring
when XLogInsert rechecks for full page writes? I think it's correct -
  but not very obvious at a glance.

 There's this in the comment near the top of the file:

  * To update RedoRecPtr or fullPageWrites, one has to make sure that all
  * subsequent inserters see the new value. This is done by reserving all the
  * insertion slots before changing the value. XLogInsert reads RedoRecPtr and
  * fullPageWrites after grabbing a slot, so by holding all the slots
  * simultaneously, you can ensure that all subsequent inserts see the new
  * value.  Those fields change very seldom, so we prefer to be fast and
  * non-contended when they need to be read, and slow when they're changed.

 Does that explain it well enough? XLogInsert holds onto a slot while it
 rechecks for full page writes.

I am a bit worried about that logic. We're basically reverting to the
old logic where xlog writing was an exclusive affair. We will have to wait
for all the other queued inserters before we're finished. I am afraid
that that will show up latencywise.

I have two ideas to improve on that:
a) Queue the backend that does WALInsertSlotAcquire(true) at the front
of the exclusive waiters in *AcquireOne. That should be fairly easy.
b) Get rid of WALInsertSlotAcquire(true) by not relying on
blocking all slot acquisition. I think with some trickery we can do that
safely:
In CreateCheckpoint() we first acquire the insertpos_lck and read
CurrBytePos as a recptr. Set some shared memory variable, say,
PseudoRedoRecPtr, that's now used to check whether backup blocks need to
be made. Release insertpos_lck. Then acquire each slot once, but without
holding the other slots. That guarantees that all XLogInsert()ing
backends henceforth see our PseudoRedoRecPtr value. Then just proceed in
CreateCheckpoint() as we're currently doing except computing RedoRecPtr
under a spinlock.
If a backend reads PseudoRedoRecPtr before we've set RedoRecPtr
accordingly, all that happens is that we possibly have written a FPI too
early.
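
In rough code (PseudoRedoRecPtr is the new, purely hypothetical shared
field; nothing here is committed):

/* In CreateCheckpoint(): */
SpinLockAcquire(&Insert->insertpos_lck);
XLogCtl->PseudoRedoRecPtr = XLogBytePosToRecPtr(Insert->CurrBytePos);
SpinLockRelease(&Insert->insertpos_lck);

/*
 * Acquire and release each slot once, individually. Afterwards every
 * insertion that started before PseudoRedoRecPtr was set has finished,
 * and every later one is guaranteed to see the new value.
 */
for (int i = 0; i < num_xloginsert_slots; i++)
{
    WALInsertSlotAcquireOne(i);
    WALInsertSlotReleaseOne(i);
}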

Makes sense?

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] XLogInsert scaling, revisited

2013-06-27 Thread Andres Freund
Hi,

On 2013-06-26 18:52:30 +0300, Heikki Linnakangas wrote:
 * Could you document the way slots prevent checkpoints from occurring
when XLogInsert rechecks for full page writes? I think it's correct -
  but not very obvious at a glance.

 There's this in the comment near the top of the file:

  * To update RedoRecPtr or fullPageWrites, one has to make sure that all
  * subsequent inserters see the new value. This is done by reserving all the
  * insertion slots before changing the value. XLogInsert reads RedoRecPtr and
  * fullPageWrites after grabbing a slot, so by holding all the slots
  * simultaneously, you can ensure that all subsequent inserts see the new
  * value.  Those fields change very seldom, so we prefer to be fast and
  * non-contended when they need to be read, and slow when they're changed.

 Does that explain it well enough? XLogInsert holds onto a slot while it
 rechecks for full page writes.

Yes. Earlier on that was explained in XLogInsert() - maybe point to the
documentation on top of the file? The file is too big to expect everyone
to reread the whole file in a new release...

 * The read of Insert->RedoRecPtr while rechecking whether it's out of
date now is unlocked, is that correct? And why?

 Same here, XLogInsert holds the slot while rechecking Insert->RedoRecPtr.

I was wondering whether it's guaranteed that we don't read a cached
value, but I didn't think of the memory barriers due to the spinlock in
Release/AcquireSlot. I think I thought that release wouldn't acquire the
spinlock at all.

 * Have you measured whether it works to just keep as many slots as
max_backends requires around and not bothering with dynamically
allocating them to inserters?
That seems to require to keep actually waiting slots in a separate
list which very well might make that too expensive.
 
Correctness-wise the biggest problem for that probably is the exclusive
acquisition of all slots CreateCheckpoint() does...

 Hmm. It wouldn't be much different, each backend would still need to reserve
 its own dedicated slot, because it might be held by a backend that
 grabbed all the slots. Also, bgwriter and checkpointer need to flush the
 WAL, so they'd need slots too.

 More slots make WaitXLogInsertionsToFinish() more expensive, as it has to
 loop through all of them. IIRC some earlier pgbench tests I ran didn't show
 much difference in performance, whether there were 40 slots or 100, as long
 as there was enough of them. I can run some more tests with even more slots
 to see how it behaves.

In a very quick test I ran previously I did see the slot acquisition in
the profile when using short connections that all did some quick inserts
- which isn't surprising since they tend to all start with the same
slot so they will all fight for the first slots.

Maybe we could ameliorate that by initializing MySlotNo to
(MyProc->pgprocno % num_xloginsert_slots) when uninitialized? That won't
be completely fair since the number of procs won't usually be a multiple
of num_insert_slots, but I doubt that will be problematic.

 * What about using some sort of linked list of unused slots for
 WALInsertSlotAcquireOne? Every time we're done we put it to the *end*
of the list so it's less likely to have been grabbed by somebody else
so we can reuse it.
a) To grab a new one go to the head of the linked list spinlock it and
recheck whether it's still free. If not, restart. Otherwise, remove
from list.
b) To reuse a previously used slot
 
That way we only have to do the PGSemaphoreLock() dance if there
really aren't any free slots.

 That adds a spinlock acquisition / release into the critical path, to
 protect the linked list. I'm trying very hard to avoid serialization points
 like that.

Hm. We already have at least one spinlock in that path? Or are you
thinking of a global one protecting the linked list? If so, I think we
should be able to get away with locking individual slots.
IIRC if you never need to delete list elements you can even get away
with a lockless implementation.

 While profiling the tests I've done, finding a free slot hasn't shown up
 much. So I don't think it's a problem performance-wise as it is, and I don't
 think it would make the code simpler.

It sure wouldn't make it simpler. As I said above, I saw the slot
acquisition in a profile when using -C and a short pgbench script (that
just inserts 10 rows).

 * The queuing logic around slots seems to lack documentation. It's
complex enough to warrant that imo.

 Yeah, it's complex. I expanded the comments above XLogInsertSlot, to explain
 how xlogInsertingAt works. Did that help, or was it some other part of that
 that needs more docs?

What I don't understand so far is why we don't reset xlogInsertingAt
during SlotReleaseOne. There are places documenting that we don't do so,
but not why, unless I missed it.

Do we do this only to have some plausible value for a backend that has been
acquired but 

Re: [HACKERS] XLogInsert scaling, revisited

2013-06-27 Thread Andres Freund
On 2013-06-26 18:52:30 +0300, Heikki Linnakangas wrote:
 There's this in the comment near the top of the file:

Btw, I find the 'you' used in the comment somewhat irritating. Not badly
so, but reading something like:

 * When you are about to write
 * out WAL, it is important to call WaitXLogInsertionsToFinish() *before*
 * acquiring WALWriteLock, to avoid deadlocks. Otherwise you might get stuck
 * waiting for an insertion to finish (or at least advance to next
 * uninitialized page), while you're holding WALWriteLock.

just seems strange to me. If this were directed at plugin authors, maybe, but
it really isn't...

But that's probably a question for a native speaker...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] XLogInsert scaling, revisited

2013-06-26 Thread Heikki Linnakangas

On 24.06.2013 21:01, Andres Freund wrote:

Ok, I started to look at this:


Thanks!


* Could you document the way slots prevent checkpoints from occurring
   when XLogInsert rechecks for full page writes? I think it's correct -
   but not very obvious at a glance.


There's this in the comment near the top of the file:

 * To update RedoRecPtr or fullPageWrites, one has to make sure that all
 * subsequent inserters see the new value. This is done by reserving all the
 * insertion slots before changing the value. XLogInsert reads RedoRecPtr and
 * fullPageWrites after grabbing a slot, so by holding all the slots
 * simultaneously, you can ensure that all subsequent inserts see the new
 * value.  Those fields change very seldom, so we prefer to be fast and
 * non-contended when they need to be read, and slow when they're changed.

Does that explain it well enough? XLogInsert holds onto a slot while it 
rechecks for full page writes.
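
The recheck itself is short; roughly (a sketch, the committed code has 
a bit more detail):

/*
 * Done while holding a slot, so a checkpointer that wants to update
 * RedoRecPtr is locked out and cannot change it under us.
 */
if (RedoRecPtr != Insert->RedoRecPtr)
{
    Assert(RedoRecPtr < Insert->RedoRecPtr);
    RedoRecPtr = Insert->RedoRecPtr;
    /* re-evaluate which buffers need full-page images; if the answer
     * changed, back out and recompute the record */
}
doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);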



* The read of Insert->RedoRecPtr while rechecking whether it's out of
   date now is unlocked, is that correct? And why?


Same here, XLogInsert holds the slot while rechecking Insert->RedoRecPtr.


* Have you measured whether it works to just keep as many slots as
   max_backends requires around and not bothering with dynamically
   allocating them to inserters?
   That seems to require to keep actually waiting slots in a separate
   list which very well might make that too expensive.

   Correctness-wise the biggest problem for that probably is the exclusive
   acquisition of all slots CreateCheckpoint() does...


Hmm. It wouldn't be much different, each backend would still need to 
reserve its own dedicated slot, because it might be held by a 
backend that grabbed all the slots. Also, bgwriter and checkpointer need 
to flush the WAL, so they'd need slots too.


More slots make WaitXLogInsertionsToFinish() more expensive, as it has 
to loop through all of them. IIRC some earlier pgbench tests I ran 
didn't show much difference in performance, whether there were 40 slots 
or 100, as long as there was enough of them. I can run some more tests 
with even more slots to see how it behaves.



* What about using some sort of linked list of unused slots for
   WALInsertSlotAcquireOne? Every time we're done we put it to the *end*
   of the list so it's less likely to have been grabbed by somebody else
   so we can reuse it.
   a) To grab a new one go to the head of the linked list spinlock it and
   recheck whether it's still free. If not, restart. Otherwise, remove
   from list.
   b) To reuse a previously used slot

   That way we only have to do the PGSemaphoreLock() dance if there
   really aren't any free slots.


That adds a spinlock acquisition / release into the critical path, to 
protect the linked list. I'm trying very hard to avoid serialization 
points like that.


While profiling the tests I've done, finding a free slot hasn't shown up 
much. So I don't think it's a problem performance-wise as it is, and I 
don't think it would make the code simpler.



* The queuing logic around slots seems to lack documentation. It's
   complex enough to warrant that imo.


Yeah, it's complex. I expanded the comments above XLogInsertSlot, to 
explain how xlogInsertingAt works. Did that help, or was it some other 
part of that that needs more docs?



* Not a big fan of the holdingAll variable name, for a file-global one
   that seems a bit too generic.


Renamed to holdingAllSlots.


* Could you add a #define or comment for the 64 used in
   XLogInsertSlotPadded? Not everyone might recognize that immediately as
   the most common cacheline size.
   Commenting on the reason we pad in general would be a good idea as
   well.


Copy-pasted and edited the explanation from LWLockPadded for that. I 
also changed the way it's allocated so that it's aligned at 64-byte 
boundary, like we do for lwlocks.
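
So each slot in the array now looks about like this (sketch):

#define XLOG_INSERT_SLOT_PAD_SIZE 64    /* typical cacheline size */

/*
 * Pad each slot to a cacheline, so two slots never share one and
 * backends spinning on different slots don't false-share. Same trick
 * as LWLockPadded.
 */
typedef union XLogInsertSlotPadded
{
    XLogInsertSlot  slot;
    char            pad[XLOG_INSERT_SLOT_PAD_SIZE];
} XLogInsertSlotPadded;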



* Is it correct that WALInsertSlotAcquireOne() resets xlogInsertingAt of
   all slots *before* it has acquired locks in all of them? If yes, why
   haven't we already reset it to something invalid?


I didn't understand this part. Can you elaborate?


* Is GetXLogBuffer()'s unlocked read correct without a read barrier?


Hmm. I thought that it was safe because GetXLogBuffer() handles the case 
that you get a torn read of the 64-bit XLogRecPtr value. But now that 
I think of it, I wonder if it's possible for reads/writes to be 
reordered so that AdvanceXLInsertBuffer() overwrites WAL data that 
another backend has copied onto a page. Something like this:


1. Backend B zeroes a new WAL page, and sets its address in xlblocks
2. Backend A calls GetXLogBuffer(), and gets a pointer to that page
3. Backend A copies the WAL data to the page.

There is no lock acquisition in backend A during those steps, so I think 
in theory the writes from step 3 might be reordered to happen before 
step 1, so that step 1 overwrites the WAL data written in step 3. 
It sounds crazy, but after reading 

Re: [HACKERS] XLogInsert scaling, revisited

2013-06-25 Thread Jeff Janes
On Sat, Jun 22, 2013 at 4:32 AM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 21.06.2013 21:55, Jeff Janes wrote:


 Hmm, it looks like the xlog-switch is trying to wait for itself to
 finish. The concurrent TRUNCATE is just being blocked behind the
 xlog-switch, which is stuck on itself.

 I wasn't able to reproduce exactly that, but I got a PANIC by running
 pgbench and concurrently doing select pg_switch_xlog() many times in psql.

 Attached is a new version that fixes at least the problem I saw. Not sure
 if it fixes what you saw, but it's worth a try. How easily can you
 reproduce that?


With v23, it got stuck both times I tried it, once after 4 hours and once
after 6 hours.

With v24, it has been running for 30 hours so far with no problems.  So
there is a pretty good chance that it is fixed.



  This is using the same testing harness as in the last round of this patch.


 This one? http://www.postgresql.org/message-id/CAMkU=1xoa6fdyoj_4fmlqpiczr1v9gp7clnxjdhu+iggqb6...@mail.gmail.com



Yes.  I have cleaned it up some and added use of checksums; I don't know if
any of those things are needed to invoke the problem.


Cheers,

Jeff


Re: [HACKERS] XLogInsert scaling, revisited

2013-06-24 Thread Andres Freund
On 2013-06-22 14:32:46 +0300, Heikki Linnakangas wrote:
 Attached is a new version that fixes at least the problem I saw. Not sure if
 it fixes what you saw, but it's worth a try. How easily can you reproduce
 that?

Ok, I started to look at this:

* Could you document the way slots prevent checkpoints from occurring
  when XLogInsert rechecks for full page writes? I think it's correct -
  but not very obvious at a glance.

* The read of Insert->RedoRecPtr while rechecking whether it's out of
  date now is unlocked, is that correct? And why?

* Have you measured whether it works to just keep as many slots as
  max_backends requires around and not bothering with dynamically
  allocating them to inserters?
  That seems to require to keep actually waiting slots in a separate
  list which very well might make that too expensive.

  Correctness-wise the biggest problem for that probably is the exclusive
  acquisition of all slots CreateCheckpoint() does...

* What about using some sort of linked list of unused slots for
  WALInsertSlotAcquireOne? Every time we're done we put it to the *end*
  of the list so it's less likely to have been grabbed by somebody else
  so we can reuse it.
  a) To grab a new one go to the head of the linked list spinlock it and
  recheck whether it's still free. If not, restart. Otherwise, remove
  from list.
  b) To reuse a previously used slot

  That way we only have to do the PGSemaphoreLock() dance if there
  really aren't any free slots.

* The queuing logic around slots seems to lack documentation. It's
  complex enough to warrant that imo.

* Not a big fan of the holdingAll variable name, for a file-global one
  that seems a bit too generic.

* Could you add a #define or comment for the 64 used in
  XLogInsertSlotPadded? Not everyone might recognize that immediately as
  the most common cacheline size.
  Commenting on the reason we pad in general would be a good idea as
  well.

* Is it correct that WALInsertSlotAcquireOne() resets xlogInsertingAt of
  all slots *before* it has acquired locks in all of them? If yes, why
  haven't we already reset it to something invalid?

* Is GetXLogBuffer()'s unlocked read correct without a read barrier?

* XLogBytePosToEndRecPtr() seems to have a confusing name to me. At
  least the comment needs to better explain that it's named that way
  because of the way it's used.
  Also, doesn't
seg_offset += fullpages * XLOG_BLCKSZ + bytesleft + SizeOfXLogShortPHD;
  potentially point into the middle of a page?

* I wish we could get rid of the bytepos notion; it - while rather
  clever - complicates things. But that's probably not going to happen
  unless we get rid of short/long page headers :/

Cool stuff!

Greetings,

Andres Freund

PS: Btw, git diff|... -w might be more helpful than not indenting a
block.

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] XLogInsert scaling, revisited

2013-06-22 Thread Heikki Linnakangas

On 21.06.2013 21:55, Jeff Janes wrote:

I think I'm getting an undetected deadlock between the checkpointer and a
user process running a TRUNCATE command.

This is the checkpointer:

#0  0x003a73eeaf37 in semop () from /lib64/libc.so.6
#1  0x005ff847 in PGSemaphoreLock (sema=0x7f8c0a4eb730,
interruptOK=0 '\000') at pg_sema.c:415
#2  0x004b0abf in WaitOnSlot (upto=416178159648) at xlog.c:1775
#3  WaitXLogInsertionsToFinish (upto=416178159648) at xlog.c:2086
#4  0x004b657a in CopyXLogRecordToWAL (write_len=32, isLogSwitch=1
'\001', rdata=0x0, StartPos=<value optimized out>, EndPos=416192397312)
 at xlog.c:1389
#5  0x004b6fb2 in XLogInsert (rmid=0 '\000', info=<value optimized
out>, rdata=0x7fff0020) at xlog.c:1209
#6  0x004b7644 in RequestXLogSwitch () at xlog.c:8748


Hmm, it looks like the xlog-switch is trying to wait for itself to 
finish. The concurrent TRUNCATE is just being blocked behind the 
xlog-switch, which is stuck on itself.


I wasn't able to reproduce exactly that, but I got a PANIC by running 
pgbench and concurrently doing select pg_switch_xlog() many times in psql.


Attached is a new version that fixes at least the problem I saw. Not 
sure if it fixes what you saw, but it's worth a try. How easily can you 
reproduce that?



This is using the same testing harness as in the last round of this patch.


This one? 
http://www.postgresql.org/message-id/CAMkU=1xoa6fdyoj_4fmlqpiczr1v9gp7clnxjdhu+iggqb6...@mail.gmail.com



Is there a way for me to dump the list of held/waiting lwlocks from gdb?


You can print out the held_lwlocks array. Or to make it more friendly, 
write a function that prints it out and call that from gdb. There's no 
easy way to print out who's waiting for what that I know of.


Thanks for the testing!

- Heikki


xloginsert-scale-24.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] XLogInsert scaling, revisited

2013-06-21 Thread Jeff Janes
On Tue, Jun 18, 2013 at 11:28 PM, Heikki Linnakangas 
hlinnakan...@vmware.com wrote:

 On 18.06.2013 21:17, Jeff Janes wrote:

 Hi Heikki,

 I am getting conflicts applying version 22 of this patch to 9.4dev.  Could
 you rebase?


 Here you go.


I think I'm getting an undetected deadlock between the checkpointer and a
user process running a TRUNCATE command.

This is the checkpointer:

#0  0x003a73eeaf37 in semop () from /lib64/libc.so.6
#1  0x005ff847 in PGSemaphoreLock (sema=0x7f8c0a4eb730,
interruptOK=0 '\000') at pg_sema.c:415
#2  0x004b0abf in WaitOnSlot (upto=416178159648) at xlog.c:1775
#3  WaitXLogInsertionsToFinish (upto=416178159648) at xlog.c:2086
#4  0x004b657a in CopyXLogRecordToWAL (write_len=32, isLogSwitch=1
'\001', rdata=0x0, StartPos=<value optimized out>, EndPos=416192397312)
at xlog.c:1389
#5  0x004b6fb2 in XLogInsert (rmid=0 '\000', info=<value optimized
out>, rdata=0x7fff0020) at xlog.c:1209
#6  0x004b7644 in RequestXLogSwitch () at xlog.c:8748
#7  0x00611be3 in CheckArchiveTimeout () at checkpointer.c:622
#8  0x00611d97 in CheckpointWriteDelay (flags=<value optimized
out>, progress=<value optimized out>) at checkpointer.c:705
#9  0x0062ec5a in BufferSync (flags=64) at bufmgr.c:1322
#10 CheckPointBuffers (flags=64) at bufmgr.c:1828
#11 0x004b1393 in CheckPointGuts (checkPointRedo=416178159592,
flags=64) at xlog.c:8365
#12 0x004b8ff3 in CreateCheckPoint (flags=64) at xlog.c:8148
#13 0x006121c3 in CheckpointerMain () at checkpointer.c:502
#14 0x004c4c4a in AuxiliaryProcessMain (argc=2,
argv=0x7fff21c4a5d0) at bootstrap.c:439
#15 0x0060a68c in StartChildProcess (type=CheckpointerProcess) at
postmaster.c:4954
#16 0x0060d1ea in reaper (postgres_signal_arg=<value optimized
out>) at postmaster.c:2571
#17 signal handler called
#18 0x003a73ee14d3 in __select_nocancel () from /lib64/libc.so.6
#19 0x0060efee in ServerLoop (argc=<value optimized out>,
argv=<value optimized out>) at postmaster.c:1537
#20 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>)
at postmaster.c:1246
#21 0x005ad4e0 in main (argc=3, argv=0x179fd00) at main.c:196


And this is the TRUNCATE command.

#0  0x003a73eeaf37 in semop () from /lib64/libc.so.6
#1  0x005ff847 in PGSemaphoreLock (sema=0x7f8c0a4ea8d0,
interruptOK=0 '\000') at pg_sema.c:415
#2  0x004b002d in WALInsertSlotAcquireOne (slotno=-1) at xlog.c:1667
#3  0x004b6d5d in XLogInsert (rmid=0 '\000', info=<value optimized
out>, rdata=0x7fff21c4a5e0) at xlog.c:1115
#4  0x004b8abc in XLogPutNextOid (nextOid=67198981) at xlog.c:8704
#5  0x004a3bc2 in GetNewObjectId () at varsup.c:495
#6  0x004c5195 in GetNewRelFileNode (reltablespace=<value optimized
out>, pg_class=0x0, relpersistence=<value optimized out>) at catalog.c:437
#7  0x0070f52d in RelationSetNewRelfilenode
(relation=0x7f8c019cb2a0, freezeXid=2144367079, minmulti=1) at
relcache.c:2758
#8  0x0055de61 in ExecuteTruncate (stmt=<value optimized out>) at
tablecmds.c:1163
#9  0x00656080 in standard_ProcessUtility (parsetree=0x2058900,
queryString=<value optimized out>, context=<value optimized out>,
params=0x0,
dest=<value optimized out>, completionTag=<value optimized out>) at
utility.c:552
#10 0x00652a87 in PortalRunUtility (portal=0x17bf510,
utilityStmt=0x2058900, isTopLevel=1 '\001', dest=0x2058c40,
completionTag=0x7fff21c4a9a0 "")
at pquery.c:1187
#11 0x006539fd in PortalRunMulti (portal=0x17bf510, isTopLevel=1
'\001', dest=0x2058c40, altdest=0x2058c40, completionTag=0x7fff21c4a9a0 "")
at pquery.c:1318
#12 0x006540b3 in PortalRun (portal=0x17bf510,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x2058c40,
altdest=0x2058c40,
completionTag=0x7fff21c4a9a0 "") at pquery.c:816
#13 0x00650944 in exec_simple_query (query_string=0x2057e90
"truncate foo") at postgres.c:1048
#14 0x00651fe9 in PostgresMain (argc=<value optimized out>,
argv=<value optimized out>, dbname=0x1fc9e98 "jjanes", username=<value
optimized out>)
at postgres.c:3985
#15 0x0060f80b in BackendRun (argc=<value optimized out>,
argv=<value optimized out>) at postmaster.c:3987
#16 BackendStartup (argc=<value optimized out>, argv=<value optimized out>)
at postmaster.c:3676
#17 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at
postmaster.c:1577
#18 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>)
at postmaster.c:1246
#19 0x005ad4e0 in main (argc=3, argv=0x179fd00) at main.c:196

This is using the same testing harness as in the last round of this patch.

Is there a way for me to dump the list of held/waiting lwlocks from gdb?

Cheers,

Jeff


Re: [HACKERS] XLogInsert scaling, revisited

2013-06-19 Thread Heikki Linnakangas

On 18.06.2013 21:17, Jeff Janes wrote:

Hi Heikki,

I am getting conflicts applying version 22 of this patch to 9.4dev.  Could
you rebase?


Here you go.


Does anyone know of an easy way to apply an external patch through git, so
I can get git-style merge conflict markers, rather than getting patch's
reject file?


I've been wishing for that too. You can check out an older version of 
the branch, one that the patch applies cleanly to, and then merge 
forward. But that's cumbersome.


- Heikki


xloginsert-scale-23.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert scaling, revisited

2013-06-18 Thread Jeff Janes
Hi Heikki,

I am getting conflicts applying version 22 of this patch to 9.4dev.  Could
you rebase?

Does anyone know of an easy way to apply an external patch through git, so
I can get git-style merge conflict markers, rather than getting patch's
reject file?

Cheers,

Jeff


Re: [HACKERS] XLogInsert scaling, revisited

2013-05-29 Thread Ants Aasma
On Wed, May 29, 2013 at 8:40 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Thanks for asking :-). I've been fixing bitrot throughout the winter. I just
 started cleaning it up again last week, and also continued with performance
 testing.

 Unfortunately I lost the 8-core box I used earlier to test this, to disk
 failure, so I can't repeat the tests I ran earlier. However, I have access
 to a new 32-core box, the attached results are from that.

The results look great!

Is this 32 physical cores or with hyperthreading? If the former, then
did you profile what is behind the sublinear scaling at concurrency
8?

Shouldn't the pg_write_barrier in AdvanceXLInsertBuffer be
complemented with pg_read_barrier after reading the value of xlblocks
in GetXLogBuffer? It might not be needed if some other action is
providing the barrier, but in that case I think it deserves a comment
why it's not needed so future refactorings don't create a data race.
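
For readers following along, the pairing being asked about is the classic
publish/consume pattern. Here is a minimal sketch of it in C, under stated
assumptions: the names (xlblocks, wal_pages, publish_page, get_page_if_ready)
are invented stand-ins for the real XLogCtl fields, and the two macros merely
emulate PostgreSQL's pg_write_barrier()/pg_read_barrier(); this is not the
committed code.

#include <stdint.h>

typedef uint64_t XLogRecPtr;        /* stand-in for the real typedef */
#define XLOG_BLCKSZ 8192
#define NBUFFERS 8
#define pg_write_barrier() __atomic_thread_fence(__ATOMIC_RELEASE)
#define pg_read_barrier()  __atomic_thread_fence(__ATOMIC_ACQUIRE)

static volatile XLogRecPtr xlblocks[NBUFFERS];  /* end ptr of each buffer's page */
static char wal_pages[NBUFFERS][XLOG_BLCKSZ];   /* the WAL buffer pages */

/* Writer side (cf. AdvanceXLInsertBuffer): initialize the page, then publish. */
static void
publish_page(int idx, XLogRecPtr endptr)
{
    wal_pages[idx][0] = 0;     /* ... initialize the new page header ... */
    pg_write_barrier();        /* page contents become visible before xlblocks */
    xlblocks[idx] = endptr;    /* publish: this buffer now maps the new page */
}

/* Reader side (cf. GetXLogBuffer): check the mapping, then read the page. */
static char *
get_page_if_ready(int idx, XLogRecPtr expected_endptr)
{
    if (xlblocks[idx] != expected_endptr)
        return NULL;           /* buffer holds some other page */
    pg_read_barrier();         /* order the xlblocks load before any page reads */
    return wal_pages[idx];
}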

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert scaling, revisited

2013-05-28 Thread Alvaro Herrera
Heikki Linnakangas wrote:
 I've been slowly continuing the work that I started last winter to
 make XLogInsert scale better. I have tried quite a few different
 approaches since then, and have settled on the attached. This is
 similar but not exactly the same as what I did in the patches I
 posted earlier.

Did this go anywhere?


-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert scaling, revisited

2012-09-28 Thread Fujii Masao
On Fri, Sep 28, 2012 at 12:58 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Hmm, I cannot reproduce this on my Linux laptop. However, I think I see what
 the problem is: the assertion should assert that (*CurrPos* % XLOG_BLCKSZ >=
 SizeOfXLogShortPHD), not currpos. The former is an XLogRecPtr, the latter is
 a pointer. If the WAL buffers are aligned at 8k boundaries, the effect is
 the same, but otherwise the assertion is just wrong. And as it happens, if
 O_DIRECT is defined, we align WAL buffers at XLOG_BLCKSZ. I think that's why
 I don't see this on my laptop. Does Mac OS X not define O_DIRECT?

Yes, AFAIK Mac OS doesn't support O_DIRECT.

 Anyway, attached is a patch with that fixed.

Thanks! With the new patch, initdb completed successfully.

I encountered another strange issue: when I called pg_switch_xlog() while
pgbench -j 1 -c 1 -T 600 was running, both pg_switch_xlog() and all the
pgbench connections got stuck.

Here is the backtrace of stuck pg_switch_xlog():
(gdb) bt
#0  0x7fff8fe13c46 in semop ()
#1  0x000106b97d34 in PGSemaphoreLock ()
#2  0x000106a2e8cf in WaitXLogInsertionsToFinish ()
#3  0x000106a2fe8b in XLogInsert ()
#4  0x000106a30576 in RequestXLogSwitch ()
#5  0x000106a37950 in pg_switch_xlog ()
#6  0x000106b19bd3 in ExecMakeFunctionResult ()
#7  0x000106b14be1 in ExecProject ()
#8  0x000106b2b83d in ExecResult ()
#9  0x000106b14000 in ExecProcNode ()
#10 0x000106b13080 in standard_ExecutorRun ()
#11 0x000106be919f in PortalRunSelect ()
#12 0x000106bea5c9 in PortalRun ()
#13 0x000106be8519 in PostgresMain ()
#14 0x000106ba4ef9 in PostmasterMain ()
#15 0x000106b418f1 in main ()

Here is the backtrace of stuck pgbench connection:
(gdb) bt
#0  0x7fff8fe13c46 in semop ()
#1  0x000106b97d34 in PGSemaphoreLock ()
#2  0x000106bda95e in LWLockAcquireWithCondVal ()
#3  0x000106a25556 in WALInsertLockAcquire ()
#4  0x000106a2fa8a in XLogInsert ()
#5  0x000106a0386d in heap_update ()
#6  0x000106b2a03e in ExecModifyTable ()
#7  0x000106b14010 in ExecProcNode ()
#8  0x000106b13080 in standard_ExecutorRun ()
#9  0x000106be9ceb in ProcessQuery ()
#10 0x000106be9eec in PortalRunMulti ()
#11 0x000106bea71e in PortalRun ()
#12 0x000106be8519 in PostgresMain ()
#13 0x000106ba4ef9 in PostmasterMain ()
#14 0x000106b418f1 in main ()

Though I've not read the patch yet, the lock mechanism in XLogInsert
probably has a bug which causes the above problem.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert scaling, revisited

2012-09-27 Thread Heikki Linnakangas

On 24.09.2012 21:06, Fujii Masao wrote:

The patch applied cleanly and compiled successfully.


Thanks for the testing!


But when I ran initdb, I got the following assertion error:

--
$ initdb -D data --locale=C --encoding=UTF-8
The files belonging to this database system will be owned by user postgres.
This user must also own the server process.

The database cluster will be initialized with locale C.
The default text search configuration will be set to english.

creating directory data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
creating configuration files ... ok
creating template1 database in data/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... TRAP: FailedAssertion("!(((uint64)
currpos) % 8192 >= (((intptr_t) ((sizeof(XLogPageHeaderData))) + ((8)
- 1)) & ~((intptr_t) ((8) - 1))) || rdata_len == 0)", File: "xlog.c",
Line: 1363)
sh: line 1: 29537 Abort trap: 6   /dav/hoge/bin/postgres
--single -F -O -c search_path=pg_catalog -c exit_on_error=true
template1 > /dev/null
child process exited with exit code 134
initdb: removing data directory data
--

I got the above problem on MacOS:


Hmm, I cannot reproduce this on my Linux laptop. However, I think I see 
what the problem is: the assertion should assert that (*CurrPos* % 
XLOG_BLCKSZ >= SizeOfXLogShortPHD), not currpos. The former is an 
XLogRecPtr, the latter is a pointer. If the WAL buffers are aligned at 
8k boundaries, the effect is the same, but otherwise the assertion is 
just wrong. And as it happens, if O_DIRECT is defined, we align WAL 
buffers at XLOG_BLCKSZ. I think that's why I don't see this on my 
laptop. Does Mac OS X not define O_DIRECT?


Anyway, attached is a patch with that fixed.

- Heikki


xloginsert-scale-21.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert scaling, revisited

2012-09-24 Thread Fujii Masao
On Fri, Sep 21, 2012 at 12:29 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I've been slowly continuing the work that I started last winter to make
 XLogInsert scale better. I have tried quite a few different approaches since
 then, and have settled on the attached. This is similar but not exactly the
 same as what I did in the patches I posted earlier.

 The basic idea, like before, is to split WAL insertion into two phases:

 1. Reserve the right amount of WAL. This is done while holding just a
 spinlock. Thanks to the changes I made earlier to the WAL format, the space
 calculations are now much simpler and the critical section boils down to
 almost just CurBytePos += size_of_wal_record. See
 ReserveXLogInsertLocation() function.

 2. Copy the WAL record to the right location in the WAL buffers. This slower
 part can be done mostly in parallel.

 The difficult part is tracking which insertions are currently in progress,
 and being able to wait for an insertion to finish copying the record data in
 place. I'm using a small number (7 at the moment) of WAL insertion slots for
 that. The first thing that XLogInsert does is to grab one of the slots. Each
 slot is protected by a LWLock, and XLogInsert reserves a slot by acquiring
 its lock. It holds the lock until it has completely finished copying the WAL
 record in place. In each slot, there's an XLogRecPtr that indicates how far
 the current inserter has progressed with its insertion. Typically, for a
 short record that fits on a single page, it is updated after the insertion
 is finished, but if the insertion needs to wait for a WAL buffer to become
 available, it updates the XLogRecPtr before sleeping.

 To wait for all insertions up to a point to finish, you scan all the
 insertion slots, and observe that the XLogRecPtrs in them are >= the point
 you're interested in. The number of slots is a tradeoff: more slots allow
 more concurrency in inserting records, but makes it slower to determine how
 far it can be safely flushed.

 I did some performance tests with this, on an 8-core HP Proliant server, in
 a VM running under VMware vSphere 5.1. The tests were performed with Greg
 Smith's pgbench-tools kit, with one of two custom workload scripts:

 1. Insert 1000 rows in each transaction. This is exactly the sort of
 workload where WALInsertLock currently becomes a bottleneck. Without the
 patch, the test scales very badly, with about 420 TPS with a single client,
 peaking only at 520 TPS with two clients. With the patch, it scales up to
 about 1200 TPS, with 7 clients. I believe the test becomes I/O limited at
 that point; looking at iostat output while the test is running shows about
 200MB/s of writes, and that is roughly what the I/O subsystem of this
 machine can do, according to a simple test with 'dd ...; sync'. Or perhaps
 having more insertion slots would allow it to go higher - the patch uses
 exactly 7 slots at the moment.

 http://hlinnaka.iki.fi/xloginsert-scaling/results-1k/

 2. Insert only 10 rows in each transaction. This simulates an OLTP workload
 with fairly small transactions. The patch doesn't make a huge difference
 with that workload. It performs somewhat worse with 4-16 clients, but then
 somewhat better with > 16 clients. The patch adds some overhead to flushing
 the WAL, I believe that's what's causing the slowdown with 4-16 clients. But
 with more clients, the WALInsertLock bottleneck becomes more significant,
 and you start to see a benefit again.

 http://hlinnaka.iki.fi/xloginsert-scaling/results-10/

 Overall, the results look pretty good. I'm going to take a closer look at
 the slowdown in the second test. I think it might be fixable with some
 changes to how WaitInsertionsToFinish() and WALWriteLock work together,
 although I'm not sure how exactly it ought to work.

 Comments, ideas?

Sounds good.

The patch applied cleanly and compiled successfully.
But when I ran initdb, I got the following assertion error:

--
$ initdb -D data --locale=C --encoding=UTF-8
The files belonging to this database system will be owned by user postgres.
This user must also own the server process.

The database cluster will be initialized with locale C.
The default text search configuration will be set to english.

creating directory data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
creating configuration files ... ok
creating template1 database in data/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... TRAP: FailedAssertion("!(((uint64)
currpos) % 8192 >= (((intptr_t) ((sizeof(XLogPageHeaderData))) + ((8)
- 1)) & ~((intptr_t) ((8) - 1))) || rdata_len == 0)", File: "xlog.c",
Line: 1363)
sh: line 1: 29537 Abort trap: 6   /dav/hoge/bin/postgres
--single -F -O -c search_path=pg_catalog -c exit_on_error=true
template1 > /dev/null
child process exited with exit code 134

[HACKERS] XLogInsert scaling, revisited

2012-09-20 Thread Heikki Linnakangas
I've been slowly continuing the work that I started last winter to make 
XLogInsert scale better. I have tried quite a few different approaches 
since then, and have settled on the attached. This is similar but not 
exactly the same as what I did in the patches I posted earlier.


The basic idea, like before, is to split WAL insertion into two phases:

1. Reserve the right amount of WAL. This is done while holding just a 
spinlock. Thanks to the changes I made earlier to the WAL format, the 
space calculations are now much simpler and the critical section boils 
down to almost just CurBytePos += size_of_wal_record. See 
ReserveXLogInsertLocation() function.


2. Copy the WAL record to the right location in the WAL buffers. This 
slower part can be done mostly in parallel.
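
For concreteness, here is a condensed sketch of step 1, under stated
assumptions: the struct and function names are invented for this
illustration, a pthread mutex stands in for the slock_t spinlock the
patch actually uses, and the conversion from byte positions to
XLogRecPtrs is elided.

#include <stdint.h>
#include <pthread.h>

typedef struct
{
    pthread_mutex_t insertpos_lck;  /* protects CurBytePos */
    uint64_t        CurBytePos;     /* next free byte position in the WAL */
} InsertCtlSketch;

/* Reserve 'size' bytes of WAL: the critical section really is just
 * "CurBytePos += size". */
static void
reserve_xlog_insert_location(InsertCtlSketch *ctl, uint32_t size,
                             uint64_t *start, uint64_t *end)
{
    pthread_mutex_lock(&ctl->insertpos_lck);
    *start = ctl->CurBytePos;
    ctl->CurBytePos += size;
    *end = ctl->CurBytePos;
    pthread_mutex_unlock(&ctl->insertpos_lck);
}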


The difficult part is tracking which insertions are currently in 
progress, and being able to wait for an insertion to finish copying the 
record data in place. I'm using a small number (7 at the moment) of WAL 
insertion slots for that. The first thing that XLogInsert does is to 
grab one of the slots. Each slot is protected by a LWLock, and 
XLogInsert reserves a slot by acquiring its lock. It holds the lock 
until it has completely finished copying the WAL record in place. In 
each slot, there's an XLogRecPtr that indicates how far the current 
inserter has progressed with its insertion. Typically, for a short 
record that fits on a single page, it is updated after the insertion is 
finished, but if the insertion needs to wait for a WAL buffer to become 
available, it updates the XLogRecPtr before sleeping.


To wait for all insertions up to a point to finish, you scan all the 
insertion slots, and observe that the XLogRecPtrs in them are >= the 
point you're interested in. The number of slots is a tradeoff: more 
slots allow more concurrency in inserting records, but makes it slower 
to determine how far it can be safely flushed.
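
Schematically, that scan could look like the sketch below; all names are
invented, and the real version in the patch also sleeps on a slot's lock
when it has to wait rather than just reading the pointers.

#include <stdint.h>

typedef uint64_t XLogRecPtr;            /* stand-in for the real typedef */
#define InvalidXLogRecPtr ((XLogRecPtr) 0)
#define NUM_XLOGINSERT_SLOTS 7          /* the patch uses 7 at the moment */

typedef struct
{
    volatile XLogRecPtr xlogInsertingAt;  /* progress of the current inserter */
} XLogInsertSlotSketch;

static XLogInsertSlotSketch slots[NUM_XLOGINSERT_SLOTS];

/* Everything below the returned position is known to be fully copied
 * into the WAL buffers; 'upto' is the point the caller wants flushed. */
static XLogRecPtr
wait_insertions_to_finish_sketch(XLogRecPtr upto)
{
    XLogRecPtr finished = upto;
    int        i;

    for (i = 0; i < NUM_XLOGINSERT_SLOTS; i++)
    {
        XLogRecPtr insertingat = slots[i].xlogInsertingAt;

        if (insertingat != InvalidXLogRecPtr && insertingat < finished)
            finished = insertingat;     /* an insertion is still in progress */
    }
    return finished;
}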


I did some performance tests with this, on an 8-core HP Proliant server, 
in a VM running under VMware vSphere 5.1. The tests were performed with 
Greg Smith's pgbench-tools kit, with one of two custom workload scripts:


1. Insert 1000 rows in each transaction. This is exactly the sort of 
workload where WALInsertLock currently becomes a bottleneck. Without 
the patch, the test scales very badly, with about 420 TPS with a single 
client, peaking only at 520 TPS with two clients. With the patch, it 
scales up to about 1200 TPS, with 7 clients. I believe the test becomes 
I/O limited at that point; looking at iostat output while the test is 
running shows about 200MB/s of writes, and that is roughly what the I/O 
subsystem of this machine can do, according to a simple test with 'dd 
...; sync'. Or perhaps having more insertion slots would allow it to 
go higher - the patch uses exactly 7 slots at the moment.


http://hlinnaka.iki.fi/xloginsert-scaling/results-1k/

2. Insert only 10 rows in each transaction. This simulates an OLTP 
workload with fairly small transactions. The patch doesn't make a huge 
difference with that workload. It performs somewhat worse with 4-16 
clients, but then somewhat better with > 16 clients. The patch adds some 
overhead to flushing the WAL, I believe that's what's causing the 
slowdown with 4-16 clients. But with more clients, the WALInsertLock 
bottleneck becomes more significant, and you start to see a benefit again.


http://hlinnaka.iki.fi/xloginsert-scaling/results-10/

Overall, the results look pretty good. I'm going to take a closer look 
at the slowdown in the second test. I think it might be fixable with 
some changes to how WaitInsertionsToFinish() and WALWriteLock work 
together, although I'm not sure how exactly it ought to work.


Comments, ideas?

- Heikki


xloginsert-scale-20.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert scaling, revisited

2012-09-20 Thread Andres Freund
On Thursday, September 20, 2012 05:37:42 PM Tom Lane wrote:
 Heikki Linnakangas hlinnakan...@vmware.com writes:
  I've been slowly continuing the work that I started last winter to make
  XLogInsert scale better. I have tried quite a few different approaches
  since then, and have settled on the attached. This is similar but not
  exactly the same as what I did in the patches I posted earlier.
Sounds pretty cool from a quick read.

 This sounds pretty good.  I'm a bit bothered by the fact that you've
 settled on 7 parallel-insertion slots after testing on an 8-core
 machine.  I suspect that it's not a coincidence that you're seeing
 a sweet spot for #slots ~= #CPUs.  If that is what's happening, we're
 going to want to be able to configure the #slots at postmaster start.
 Not sure how we'd go about it exactly - is there any reasonably portable
 way to find out how many CPUs the machine has?  Or do we have to use a
 GUC for that?
Several platforms support sysconf(_SC_NPROCESSORS_CONF), although after a quick 
look it doesn't seem to be standardized. A GUC initialized to that, falling 
back to 4 or so?
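
For illustration, a hedged sketch of such a probe; the #ifdef and the
fixed fallback matter because, as noted, _SC_NPROCESSORS_CONF is not
required by POSIX (the function name here is invented):

#include <unistd.h>

static int
guess_cpu_count(void)
{
#ifdef _SC_NPROCESSORS_CONF
    long n = sysconf(_SC_NPROCESSORS_CONF);

    if (n > 0)
        return (int) n;                 /* the platform told us */
#endif
    return 4;                           /* conservative fallback */
}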

Andres
-- 
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert scaling, revisited

2012-09-20 Thread Heikki Linnakangas

On 20.09.2012 18:37, Tom Lane wrote:

I suspect that it's not a coincidence that you're seeing
a sweet spot for #slots ~= #CPUs.


Yeah, quite possible. I did not test with any different number of slots, 
so I don't know if that's the sweet spot, but I wouldn't be surprised if 
it is.



If that is what's happening, we're going to want to be able to
configure the #slots at postmaster start. Not sure how we'd go about
it exactly - is there any reasonably portable way to find out how
many CPUs the machine has?  Or do we have to use a GUC for that?


Detecting the number of CPUs and using that might not be optimal. Even 
with a machine with a lot of CPUs, a workload might not be limited by 
WAL insertion speed. Perhaps we could have a counter of how often you 
have to wait for a slot, and adjust the number of slots on the fly based 
on that. Similar to the way the spinlock delay is adjusted.
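
Purely as a sketch of that idea -- nothing like this exists in the patch,
and every name below is invented: a shared wait counter that a periodic
check drains to decide whether to grow the slot count.

#include <stdatomic.h>

#define MAX_XLOGINSERT_SLOTS 64

static atomic_int slot_wait_count;          /* bumped on every slot wait */
static atomic_int num_active_slots = 7;

static void
note_slot_wait(void)
{
    atomic_fetch_add(&slot_wait_count, 1);  /* called when a backend waits */
}

static void
maybe_grow_slots(void)
{
    int waits = atomic_exchange(&slot_wait_count, 0);
    int cur = atomic_load(&num_active_slots);

    if (waits > 1000 && cur < MAX_XLOGINSERT_SLOTS)
        atomic_store(&num_active_slots, cur + 1);   /* grow under contention */
}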


At the moment, I'm grabbing the lock on a slot before determining which 
blocks need to be backed up because of full_page_writes, and before 
calculating the CRCs. I can try to move that so that the lock is grabbed 
later, more like the WALInsertLock currently works. That would make the 
duration the slot locks are held much shorter, which probably would make 
the number of slots less important.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert scaling, revisited

2012-09-20 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 I've been slowly continuing the work that I started last winter to make 
 XLogInsert scale better. I have tried quite a few different approaches 
 since then, and have settled on the attached. This is similar but not 
 exactly the same as what I did in the patches I posted earlier.

This sounds pretty good.  I'm a bit bothered by the fact that you've
settled on 7 parallel-insertion slots after testing on an 8-core
machine.  I suspect that it's not a coincidence that you're seeing
a sweet spot for #slots ~= #CPUs.  If that is what's happening, we're
going to want to be able to configure the #slots at postmaster start.
Not sure how we'd go about it exactly - is there any reasonably portable
way to find out how many CPUs the machine has?  Or do we have to use a
GUC for that?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert scaling, revisited

2012-09-20 Thread Simon Riggs
On 20 September 2012 16:29, Heikki Linnakangas hlinnakan...@vmware.com wrote:

 1. Insert 1000 rows in each transaction. This is exactly the sort of
 workload where WALInsertLock currently becomes a bottleneck. Without the
 patch, the test scales very badly, with about 420 TPS with a single client,
 peaking only at 520 TPS with two clients. With the patch, it scales up to
 about 1200 TPS, with 7 clients. I believe the test becomes I/O limited at
 that point; looking at iostat output while the test is running shows about
 200MB/s of writes, and that is roughly what the I/O subsystem of this
 machine can do, according to a simple test with 'dd ...; sync'. Or perhaps
 having more insertion slots would allow it to go higher - the patch uses
 exactly 7 slots at the moment.

 http://hlinnaka.iki.fi/xloginsert-scaling/results-1k/

 2. Insert only 10 rows in each transaction. This simulates an OLTP workload
 with fairly small transactions. The patch doesn't make a huge difference
 with that workload. It performs somewhat worse with 4-16 clients, but then
 somewhat better with > 16 clients. The patch adds some overhead to flushing
 the WAL, I believe that's what's causing the slowdown with 4-16 clients. But
 with more clients, the WALInsertLock bottleneck becomes more significant,
 and you start to see a benefit again.

 http://hlinnaka.iki.fi/xloginsert-scaling/results-10/

 Overall, the results look pretty good.

Yes, excellent work.

The results seem sensitive to the use case, so my thoughts immediately
switch to auto-tuning or at least appropriate usage.

I'm a bit worried that it's a narrow use case, since the problem
quickly moves from lock contention to I/O limiting.

It sounds like the use case where this is a win would be parallel data
loading into a server with high I/O bandwidth. Could we do some more
tests/discussion to see how wide the use case is?

I'm also wondering about this from a different perspective. I was
looking to rate-limit WAL inserts from certain operations - would
rate-limiting remove the contention problem, or is that just a
different feature?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert

2009-12-16 Thread Gurjeet Singh
2009/12/15 Greg Smith g...@2ndquadrant.com

 Jaime Casanova wrote:

 So in this extreme case avg tps is just 6 transactions better


 Great job trying to find the spot where the code worked better.  I'm not so
 sure I trust pgbench results where the TPS was so low though.  Which leads
 us right back to exactly how Jeff measured his original results.

 As I said already, I think we need more insight into Jeff's performance
 report, a way to replicate that test, to look a bit at the latency as
 reported by the updated LWLock patch that Pierre submitted.  Tweaking your
 test to give more useful results is a nice second opinion on top of that.
  But we're out of time for now, so this patch is getting returned with
 feedback.  I encourage Jeff to resubmit the same patch or a better one with
 a little more data on performance measurements to our final 8.5 CommitFest
 in hopes we can confirm this an improvement worth committing.



Last week I worked on a FUSE-based filesystem, which I call BlackholeFS. It's
similar to /dev/null, but for directories. Basically it simply returns
success for all the writes, but doesn't do any writes on the files under it.

Would moving the pg_xlog/ (and possibly table data too) to such a filesystem
exercise this patch better?

Best regards,
-- 
Let's call it Postgres

EnterpriseDB  http://www.enterprisedb.com

gurjeet[.sin...@enterprisedb.com

singh.gurj...@{ gmail | hotmail | indiatimes | yahoo }.com
Twitter: singh_gurjeet
Skype: singh_gurjeet

Mail sent from my BlackLaptop device


Re: [HACKERS] XLogInsert

2009-12-16 Thread Andres Freund
On Wednesday 16 December 2009 20:07:07 Gurjeet Singh wrote:
 2009/12/15 Greg Smith g...@2ndquadrant.com
 
  Jaime Casanova wrote:
  So in this extreme case avg tps is just 6 transactions better
 
  Great job trying to find the spot where the code worked better.  I'm not
  so sure I trust pgbench results where the TPS was so low though.  Which
  leads us right back to exactly how Jeff measured his original results.
 
  As I said already, I think we need more insight into Jeff's performance
  report, a way to replicate that test, to look a bit at the latency as
  reported by the updated LWLock patch that Pierre submitted.  Tweaking
  your test to give more useful results is a nice second opinion on top of
  that. But we're out of time for now, so this patch is getting returned
  with feedback.  I encourage Jeff to resubmit the same patch or a better
  one with a little more data on performance measurements to our final 8.5
  CommitFest in hopes we can confirm this an improvement worth committing.
 
  Last week I worked on a FUSE-based filesystem, which I call BlackholeFS.
   It's similar to /dev/null, but for directories. Basically it simply returns
  success for all the writes, but doesn't do any writes on the files under
  it.
I doubt that it will be faster than a tmpfs - the additional context switches 
et al. will probably already hurt.
If you constrain checkpoint_segments to something sensible it shouldn't use 
too much memory.


Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert

2009-12-14 Thread Greg Smith

Jaime Casanova wrote:

So in this extreme case avg tps is just 6 transactions better
  
Great job trying to find the spot where the code worked better.  I'm not 
so sure I trust pgbench results where the TPS was so low though.  Which 
leads us right back to exactly how Jeff measured his original results.


As I said already, I think we need more insight into Jeff's performance 
report, a way to replicate that test, to look a bit at the latency as 
reported by the updated LWLock patch that Pierre submitted.  Tweaking 
your test to give more useful results is a nice second opinion on top of 
that.  But we're out of time for now, so this patch is getting returned 
with feedback.  I encourage Jeff to resubmit the same patch or a better 
one with a little more data on performance measurements to our final 8.5 
CommitFest in hopes we can confirm this an improvement worth committing.


--
Greg Smith   2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert

2009-12-12 Thread Jaime Casanova
On Thu, Dec 10, 2009 at 3:58 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Jaime Casanova wrote:
 On Wed, Dec 9, 2009 at 9:39 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  so I'd like some independent confirmation that it does.
 

 What kind of tests could show that? Or could simply running pgbench
 several times, for 15 minutes each run, show any benefit?

 Isn't the supposed performance improvement in this patch linked strongly
 to backup blocks?  If that's really the case, I think it would show more
 prominently if you had very frequent checkpoints.


Ah! OK, I was only following the logic that it was eliminating the
need to execute a loop twice...
But you are right: while the loop always executes, it only does something
meaningful after a checkpoint, and the for statement only makes 3 passes
each time, because XLR_MAX_BKP_BLOCKS is defined as 3 in
src/include/access/xlog.h.

Looked at that way, it seems like the benefit could be only marginal.

To prove that, I compiled with and without the patch, and changed
checkpoint_segments = 1 and checkpoint_timeout = 1min to force
frequent checkpoints (they actually occurred a few seconds apart).

I initialized the pgbench database in each installation with:
pgbench -i -s 200 -F 90 test

and executed it 6 times with:
pgbench -n -c 50 -j 5 -l -T 900 test

Results are:

             Including connections establishing   Excluding connections establishing
Min (tps)
  Unpatched              133.046069                        133.085274
  Patched                139.567284                        139.591229

Max (tps)
  Unpatched              147.082181                        147.108752
  Patched                151.276416                        151.311745

Avg (tps)
  Unpatched              140.750998                        140.790336
  Patched                146.383735                        146.411039

So in this extreme case avg tps is just 6 transactions better

PS: I'm attaching the files I used for the tests.

-- 
Regards,
Jaime Casanova
PostgreSQL support and training
Systems consulting and development
Guayaquil - Ecuador
Cel. +59387171157


test_patched.sh
Description: Bourne shell script


test_unpatched.sh
Description: Bourne shell script

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert

2009-12-10 Thread Jaime Casanova
On Wed, Dec 9, 2009 at 9:39 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 so I'd like some independent confirmation that it does.


What kind of tests could show that? Or could simply running pgbench
several times, for 15 minutes each run, show any benefit?

-- 
Regards,
Jaime Casanova
PostgreSQL support and training
Systems consulting and development
Guayaquil - Ecuador
Cel. +59387171157

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert

2009-12-10 Thread Alvaro Herrera
Jaime Casanova wrote:
 On Wed, Dec 9, 2009 at 9:39 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  so I'd like some independent confirmation that it does.
 
 
 What kind of tests could show that? Or could simply running pgbench
 several times, for 15 minutes each run, show any benefit?

Isn't the supposed performance improvement in this patch linked strongly
to backup blocks?  If that's really the case, I think it would show more
prominently if you had very frequent checkpoints.

-- 
Alvaro Herrera            http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert

2009-12-09 Thread Tom Lane
Jaime Casanova jcasa...@systemguards.com.ec writes:
 I haven't made any performance tests, but it should gain something :);
 maybe someone can run those tests?

The argument for changing this at all is only that it will improve
performance, so I'd like some independent confirmation that it does.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert

2009-12-09 Thread Greg Smith

Tom Lane wrote:

Jaime Casanova jcasa...@systemguards.com.ec writes:
  

I haven't made any performance tests, but it should gain something :);
maybe someone can run those tests?



The argument for changing this at all is only that it will improve
performance, so I'd like some independent confirmation that it does.

I've done a little review of this myself, and I'm not quite happy with how this patch was 
delivered to us.  The bar for committing something that touches the WAL is really 
high--it needs to be an unquestionable win to touch that code.  The justification of 
"the patch makes the overall code a bit cleaner" is a hard sell on something 
that's hard to debug (triggering bad WAL situations at will isn't easy) and critical to 
the system.  If there's a clear performance improvement, that helps justify why it's 
worth working on.  Here's the original performance justification:

"Using the only XLogInsert-bound test case I know of, parallel COPY into a skinny, 
unindexed table, using 8 parallel copies on a 4 x dual-core x86_64 and with fsync turned 
off (to approximately simulate SSD, which I do not have), I get a speed improvement of 
2-4% with the patch over unpatched head."

That makes sense, and using this spec I could probably come up with the test 
program to reproduce this.  But I'm getting tired of doing that.  It's hard 
enough to reproduce performance changes when someone gives the exact 
configuration and test program they used.  If we're working with a verbal spec 
for how to reproduce the issues, forget about it--that's more than we can 
expect a reviewer to handle, and the odds of that whole thing ending well are 
low.

Jeff:  before we do anything else with your patch, I'd like to see a script of 
some sort that runs the test you describe above, everything changed in the 
postgresql.conf from the defaults, and the resulting raw numbers that come from 
it on your system that prove an improvement there--not just a summary of the 
change.  That's really mandatory for a performance patch.  If any reviewer 
who's interested can't just run something and get a report suggesting whether 
the patch helped or harmed results in five minutes, unless we really, really 
want your patch it's just going to stall at that point.  And unfortunately, in 
the case of something that touches the WAL path, we really don't want to change 
anything unless there's a quite good reason to do so.

I've also realized that the patch "LWlocks instrumentation" at 
http://archives.postgresql.org/message-id/op.uz8sfkxycke...@soyouz should have been 
evaluated as its own patch altogether.  I think that the test program you're suggesting 
also proves its utility though, so for now I'll keep them roped together.

Sorry this ended up so late in this CommitFest; a series of unexpected 
things rippled down to you.  On the bright side, had you submitted this before 
the whole organized CF process started, you could have waited months longer to 
get the same feedback.


--
Greg Smith   2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com



Re: [HACKERS] XLogInsert

2009-12-08 Thread Jaime Casanova
On Sun, Sep 13, 2009 at 10:42 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 I've attached a patch which removes the iteration over the blocks to be
 backed-up from the critical section of XLogInsert.  Now those blocks are
 only looped over in one piece of code which both computes the CRC and builds
 the linked list, rather than having parallel loops.


OK, I started to review this one...

What the patch is doing seems very obvious and doesn't break the
original logic AFAIUI (I could be wrong; this seems like Chinese to me
;)

I made some tests archiving WAL and recovering... crashing the server
and restarting... every test was running the regression tests. I
don't know what else to do; ideas?

I haven't made any performance tests, but it should gain something :);
maybe someone can run those tests?

If not, I will mark the patch as ready for committer some time in the
next few hours...

-- 
Regards,
Jaime Casanova
PostgreSQL support and training
Systems consulting and development
Guayaquil - Ecuador
Cel. +59387171157

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert

2009-09-13 Thread Jeff Janes
On Wed, Aug 19, 2009 at 9:49 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Jeff Janes jeff.ja...@gmail.com writes:
  If I read the code correctly, the only thing that is irrevocable is
  that it writes into
  rdt->next, and if it saved an old copy of rdt first, then it could
  revoke the changes just
  by doing rdt_old->next=NULL.  If that were done, then I think this
  code could be
  moved out of the section holding the WALInsertLock.

 Hmm, I recall that the changes are ... or were ... more complex.
 The tricky case I think is where we have to go back and redo the
 block-backup decisions after discovering that the checkpoint REDO
 pointer has just moved.

 If you can get the work out of the WALInsertLock section for just a
 few more instructions, it would definitely be worth doing.


I've attached a patch which removes the iteration over the blocks to be
backed-up from the critical section of XLogInsert.  Now those blocks are
only looped over in one piece of code which both computes the CRC and builds
the linked list, rather than having parallel loops.

I've used an elog statement (not shown in patch) to demonstrate that the
"goto begin;" after detecting the REDO race actually does get executed under a
standard workload (pgbench -c10).  Two to 4 out of the 10 backends execute
that code path for each checkpoint on my single-CPU machine.  By doing a
kill -9 on a process, to simulate a crash, during the period after the "goto
begin;" is exercised but before the precipitating checkpoint completes, I can
force it to use the written WAL records in recovery.  The database
automatically recovers and the results are self-consistent.

I cannot imagine any other races, rare events, or action at a distance that
could come into play with this code change, so I cannot think of anything
else to test at the moment.

I could not detect a speed difference with pgbench, but as I cannot get
pgbench to be XLogInsert bound, that is not surprising.  Using the only
XLogInsert-bound test case I know of, parallel COPY into a skinny, unindexed
table, using 8 parallel copies on a 4 x dual-core x86_64 and with fsync
turned off (to approximately simulate SSD, which I do not have), I get a
speed improvement of 2-4% with the patch over unpatched head.  Maybe with
more CPUs the benefit would be greater.

That small improvement is probably not very attractive; however, I think the
patch makes the overall code a bit cleaner, so it may be warranted on that
ground.  Indeed, my motivation for working on this is that I kept beating my
head against the complexity of the old code, and thought that simplifying it
would make future work easier.

Cheers,

Jeff
Index: xlog.c
===
RCS file: /home/jjanes/pgrepo/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.352
diff -c -r1.352 xlog.c
*** xlog.c	10 Sep 2009 09:42:10 -	1.352
--- xlog.c	10 Sep 2009 19:27:08 -
***
*** 540,548 
  	bool		dtbuf_bkp[XLR_MAX_BKP_BLOCKS];
  	BkpBlock	dtbuf_xlg[XLR_MAX_BKP_BLOCKS];
  	XLogRecPtr	dtbuf_lsn[XLR_MAX_BKP_BLOCKS];
! 	XLogRecData dtbuf_rdt1[XLR_MAX_BKP_BLOCKS];
! 	XLogRecData dtbuf_rdt2[XLR_MAX_BKP_BLOCKS];
! 	XLogRecData dtbuf_rdt3[XLR_MAX_BKP_BLOCKS];
  	pg_crc32	rdata_crc;
  	uint32		len,
  write_len;
--- 540,550 
  	bool		dtbuf_bkp[XLR_MAX_BKP_BLOCKS];
  	BkpBlock	dtbuf_xlg[XLR_MAX_BKP_BLOCKS];
  	XLogRecPtr	dtbuf_lsn[XLR_MAX_BKP_BLOCKS];
! 	XLogRecData dtbuf_rdt1[XLR_MAX_BKP_BLOCKS];	/*xlog header of backed up block*/
! 	XLogRecData dtbuf_rdt2[XLR_MAX_BKP_BLOCKS];	/*part of block before the hole*/
! 	XLogRecData dtbuf_rdt3[XLR_MAX_BKP_BLOCKS];	/*part of block after the hole*/
! 	XLogRecData dummy_node;	/* head node for back-up block chain*/
! 	XLogRecData *rdt2;	/* tail pointer for back-up block chain*/
  	pg_crc32	rdata_crc;
  	uint32		len,
  write_len;
***
*** 663,696 
  
  	/*
  	 * Now add the backup block headers and data into the CRC
  	 */
  	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
  	{
! 		if (dtbuf_bkp[i])
! 		{
! 			BkpBlock   *bkpb = &(dtbuf_xlg[i]);
! 			char	   *page;
  
! 			COMP_CRC32(rdata_crc,
! 	   (char *) bkpb,
! 	   sizeof(BkpBlock));
! 			page = (char *) BufferGetBlock(dtbuf[i]);
! 			if (bkpb->hole_length == 0)
! 			{
! COMP_CRC32(rdata_crc,
! 		   page,
! 		   BLCKSZ);
! 			}
! 			else
! 			{
! /* must skip the hole */
! COMP_CRC32(rdata_crc,
! 		   page,
! 		   bkpb->hole_offset);
  COMP_CRC32(rdata_crc,
! 		   page + (bkpb->hole_offset + bkpb->hole_length),
! 		   BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
! 			}
  		}
  	}
  
--- 665,740 
  
  	/*
  	 * Now add the backup block headers and data into the CRC
+ 	 * Also make a separate chain of entries for the backup blocks.  
+ 	 * Once we know we do not need to repeat the process due to races,
+ 	 * the two chains are stitched together so that we  don't need 
+ 	 * to special-case them in the write loop.  At the 

[HACKERS] XLogInsert

2009-08-19 Thread Jeff Janes
In XLogInsert (src/backend/access/transam/xlog.c), the part that adds back-up
blocks into the rdata chain is described:

/*
 * Make additional rdata chain entries for the backup blocks, so that we
 * don't need to special-case them in the write loop.  Note that we have
 * now irrevocably changed the input rdata chain.

If I read the code correctly, the only thing that is irrevocable is
that it writes into rdt->next, and if it saved an old copy of rdt
first, then it could revoke the changes just by doing
rdt_old->next = NULL.  If that were done, then I think this code could
be moved out of the section holding the WALInsertLock.  It could
probably be folded into the part of the code that computes the CRC.

I don't think this would be a big performance win, as that part of the
code is pretty fast, but every bit helps in a highly contended lock, and
I think the code would be simpler as well.  Do you think it would be
worth me giving this a try?
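
To make the idea concrete, a toy illustration with a hypothetical struct
standing in for XLogRecData (invented names; not the actual xlog.c code):

#include <stddef.h>

typedef struct RecDataSketch
{
    struct RecDataSketch *next;     /* singly linked chain of data segments */
} RecDataSketch;

/* Append 'extra' after 'tail', returning the old tail so the append can
 * be revoked, e.g. after losing the race against a moving REDO pointer. */
static RecDataSketch *
append_revocably(RecDataSketch *tail, RecDataSketch *extra)
{
    RecDataSketch *saved_tail = tail;

    tail->next = extra;             /* the only state that changes */
    return saved_tail;              /* keep this to be able to undo */
}

static void
revoke_append(RecDataSketch *saved_tail)
{
    saved_tail->next = NULL;        /* chain restored to its original shape */
}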

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert

2009-08-19 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
 If I read the code correctly, the only thing that is irrevocable is
 that it writes into
 rdt->next, and if it saved an old copy of rdt first, then it could
 revoke the changes just
 by doing rdt_old->next=NULL.  If that were done, then I think this
 code could be
 moved out of the section holding the WALInsertLock.

Hmm, I recall that the changes are ... or were ... more complex.
The tricky case I think is where we have to go back and redo the
block-backup decisions after discovering that the checkpoint REDO
pointer has just moved.

If you can get the work out of the WALInsertLock section for just a
few more instructions, it would definitely be worth doing.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers