Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-25 Thread KONDO Mitsumasa

Hi,

Running Heikki's patch helped me understand why my patch is faster than the 
original. His patch executes write() and fsync() for each relation file during 
the write-phase of the checkpoint. I therefore expected the write-phase to be 
slow and the fsync-phase to be fast, because the disk writes would already have 
happened in the write-phase. But the fsync time in PostgreSQL with his patch is 
almost the same as the original. It's very mysterious!


I checked /proc/meminfo and other resources while the benchmark was running. It 
turns out this is caused by the separation of the checkpointer process from the 
writer process. In 9.1 and older, checkpointing and background writing are 
executed serially in the single writer process. In 9.2 and later they run in 
parallel, even while a checkpoint is executing. Therefore a method that issues 
fewer fsyncs on a longer-term schedule, like my patch, is much faster, because 
it reduces wasted disk writes. In the worst case with his patch, the same pages 
are written to disk twice within one checkpoint, and those writes may also be 
random.


By the way, when the amount of dirty buffers always stays under 
dirty_background_ratio * physical memory / 100, the write-phase performs no 
disk writes at all; the fsync-phase then has to write every dirty buffer to 
disk. In that case the write schedule is meaningless. It is very heavy and 
wasteful, and it cannot really be changed through OS or PostgreSQL parameters. 
I set a small dirty_background_ratio, but the result was very miserable...
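To put rough numbers on it (an illustration of mine, not from these tests): 
with 64GB of RAM and dirty_background_ratio = 10, background writeback only 
starts once dirty pages exceed 64GB * 10 / 100 = 6.4GB. A checkpoint whose 
dirty set stays below that threshold therefore does essentially all of its 
physical I/O in the fsync-phase, no matter how carefully the write-phase was 
scheduled.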


I am now confirming my theory with a DBT-2 benchmark run with lru_max_pages = 0. 
And next week a colleague of mine who is a kernel hacker will explain the OS 
background-writing mechanism to me.


What do you think?

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-22 Thread KONDO Mitsumasa

(2013/07/19 22:48), Greg Smith wrote:

On 7/19/13 3:53 AM, KONDO Mitsumasa wrote:

These days, users who consider system availability important run
synchronous replication clusters.


If your argument for why it's OK to ignore bounding crash recovery on the master
is that it's possible to failover to a standby, I don't think that is
acceptable.  PostgreSQL users certainly won't like it.
OK, I will also test recovery time. However, I am now considering a better 
approach, and I will test it with a new patch.



Please look especially at lines 631, 651, and 656.
MAX_WRITEBACK_PAGES is 1024 pages (1024 * 4096 bytes = 4 MB).


You should read http://www.westnet.com/~gsmith/content/linux-pdflush.htm to
realize that everything you're telling me about the writeback code and its congestion
logic I already knew back in 2007.  The situation is even worse than you describe,
because this section of Linux has gone through multiple, major revisions since
then.  You can't just say "here is the writeback source code"; you have to
reference each of the commonly deployed versions of the writeback feature to tell
how this is going to play out if released.  There are four major ones I pay
attention to.  The old kernel style as seen in RHEL5/2.6.18--that's what my 2007
paper discussed--the similar code but with very different defaults in 2.6.22, the
writeback method/tuning in RHEL6/Debian Squeeze/2.6.32, and then there are newer
kernels.  (The newer ones separate out into a few branches too; I haven't mapped
those as carefully yet.)
The part of the writeback source code I referred to is almost the same as in the 
community kernel (2.6.32.61). I also read Linux kernel 3.9.7, and this part is 
almost the same there too. I find fs-writeback.c easier reading than xlog.c; it 
is only 1309 lines. I believe the Linux distributions differ only in tuning 
parameters, not in the program logic. Do you think I need to read the Debian 
kernel source as well? I will read the relevant part of that code, since it is a 
few dozen lines at most.



 There are some examples of what really bad checkpoints look
like in
http://www.2ndquadrant.com/static/2quad/media/pdfs/talks/WriteStuff-PGCon2011.pdf
if you want to see some of them.  That's the talk I did around the same time I
was trying out spreading the database fsync calls out over a longer period.
Does this happen on ext3 or ext4 file systems? I think this is a bug in XFS. If 
an fsync call doesn't return, it means the WAL cannot be written and commits 
cannot be acknowledged, which is a serious problem.

My fsync patch only sleeps after an fsync has returned successfully, and the 
maximum sleep time is capped at 10 seconds. It does not make this problem worse.
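For readers following along, a minimal sketch of the behavior being described 
(the function name and the exact proportional rule are illustrative assumptions, 
not code from the actual patch):

/*
 * Illustrative sketch only, not the submitted patch: time each fsync and,
 * when it returns successfully, sleep in proportion to how long it took
 * (capped at 10 seconds) so a struggling I/O subsystem can catch up.
 */
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

#define MAX_SLEEP_USEC (10L * 1000000L)	/* the 10 second cap */

static void
fsync_with_backoff(int fd)
{
	struct timeval start, end;
	struct timespec ts;
	long elapsed_usec;

	gettimeofday(&start, NULL);
	if (fsync(fd) != 0)
		return;			/* sleep only after a *successful* fsync */
	gettimeofday(&end, NULL);

	elapsed_usec = (end.tv_sec - start.tv_sec) * 1000000L +
		(end.tv_usec - start.tv_usec);

	/* assumed rule: back off for as long as the fsync took, up to the cap */
	if (elapsed_usec > MAX_SLEEP_USEC)
		elapsed_usec = MAX_SLEEP_USEC;
	if (elapsed_usec > 0)
	{
		ts.tv_sec = elapsed_usec / 1000000L;
		ts.tv_nsec = (elapsed_usec % 1000000L) * 1000L;
		nanosleep(&ts, NULL);
	}
}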



When I did that, checkpoints became even less predictable, and that was a major
reason behind why I rejected the approach.  I think your suggestion will have 
the
same problem.  You just aren't generating test cases with really large write
workloads yet to see it.  You also don't seem afraid of how exceeding the
checkpoint timeout is a very bad thing yet.
I think it is important to understand why this problem occurs. We should try to 
find which program has the bug or problem.



In addition, you have said that performance improves if you set a large
checkpoint_timeout or checkpoint_completion_target, but is that true
in all cases?


The timeout, yes.  Throughput is always improved by increasing
checkpoint_timeout.  Less checkpoints per unit of time increases efficiency.
Less writes of the most heavily accessed buffers happen per transaction.  It is 
faster because you are doing less work, which on average is always faster than
doing more work.  And doing less work usually beats doing more work, but doing 
it
smarter.

If you want to see how much work per transaction a test is doing, track the
numbers of buffers written at the beginning/end of your test via
pg_stat_bgwriter.  Tests that delay checkpoints will show a lower total number 
of
writes per transaction.  That seems more efficient, but it's efficiency mainly
gained by ignoring checkpoint_timeout.

OK, I will try that in my next test.


When checkpoint_completion_target is actually enlarged, performance may
fall in some cases. I think this is the last fsync becoming heavy as a
result of writing slowly.


I think you're confusing throughput and latency here.  Increasing the checkpoint
timeout, or to a lesser extent the completion target, on average increases
throughput.  It results in less work, and the more/less work amount is much more
important than worrying about scheduler details.  No matter how efficient a
given write is, whether you've sorted it across elevator horizon boundary A or
boundary B, it's better not to do it at all.
I think an fsync that takes a very long time, or many fsyncs in a row, blocks 
other transactions. And my patch not only improves throughput but also achieves 
stable response times during the fsync phase of a checkpoint.



By the way:  if you have a theory like "the last fsync having become heavy" for
why something is 

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-22 Thread KONDO Mitsumasa

(2013/07/21 4:37), Heikki Linnakangas wrote:

Mitsumasa-san, since you have the test rig ready, could you try the attached
patch please? It scans the buffer cache several times, writing out all the dirty
buffers for segment A first, then fsyncs it, then all dirty buffers for segment
B, and so on. The patch is ugly, but if it proves to be helpful, we can spend 
the
time to clean it up.

Thank you! It is interesting code; I will test it.

By the way, a colleague at my company is helping us with testing. If you have 
other ideas, please send me patches or methods.


Best regards,
--
Mitsumasa KONDO




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-22 Thread Greg Smith

On 7/22/13 4:52 AM, KONDO Mitsumasa wrote:

The part of the writeback source code I referred to is almost the
same as in the community kernel (2.6.32.61). I also read Linux kernel 3.9.7,
and this part is almost the same there too.


The main source code difference comes from going back to the RedHat 5 
kernel, which means 2.6.18.  For many of these versions, you are right 
that it is only the tuning parameters that were changed in newer versions.


Optimizing performance for the old RHEL5 kernel isn't the most important 
thing, but it's helpful to know the things it does very badly.



My fsync patch only sleeps after an fsync has returned successfully, and the
maximum sleep time is capped at 10 seconds. It does not make this problem worse.


It's easy to have hundreds of relations that are getting fsync calls 
during a checkpoint.  If you have 100 relations getting a 10 second 
sleep each, you could potentially delay checkpoints by 17 minutes this 
way.  I regularly see systems where shared_buffers=8GB and there are 200 
to 400 relation segments that need a sync during a checkpoint.


This is the biggest problem with your submission.  Once you give up 
following the checkpoint schedule carefully, it is very easy to end up 
with large checkpoint deadline misses on production servers.  If someone 
thinks they are doing a checkpoint every 5 minutes, but your patch makes 
them take 20 minutes instead, that is bad.  They will not expect that a 
crash might have to replay that much activity before the server is 
useful again.



You also don't seem afraid of how exceeding the
checkpoint timeout is a very bad thing yet.

I think it is important to understand why this problem occurs. We should try
to find which program has the bug or problem.


The checkpointer process is the problem.  There's no filesystem bug or 
complicated issues involved in many of the bad cases.  Here is a simple 
example that shows how the toughest problem cases happen:


-64GB of RAM
-10% dirty_background_ratio = ~6GB of dirty writes = 6144MB
-2MB/s random I/O when concurrent reads are heavy
-6144MB / 2MB/s = 3072 seconds to clear the cache = 51 minutes

That's how you get to an example like the one in my slides:

LOG: checkpoint complete: wrote 33282 buffers (3.2%); 0 transaction log 
file(s) added, 60 removed, 129 recycled; write=228.848 s, sync=4628.879 
s, total=4858.859 s


It's very hard to do better on these, and I don't expect any change to 
help this a lot.  But I don't want to see a change committed that makes 
this sort of checkpoint 17 minutes longer if there's 100 relations 
involved either.



My patch not only improves throughput but also achieves stable
response times during the fsync phase of a checkpoint.


The main reason your patch improves latency and throughput is that it 
makes checkpoints farther apart.  That's why I drew you a graph showing 
how the time between checkpoints lined up perfectly with TPS.  If it was 
only a small problem it would be worth considering, but I think it's 
likely to end up with the 15+ minute delays I've outlined here instead.



And I surveyed the ext3 file system.


I wouldn't worry too much about the problems ext3 has.  Like the old 
RHEL5 kernel I was commenting about above, there are a lot of ext3 
systems out there.  But we can't do a lot about getting good performance 
from them.  It's only important to test that you're not making them a 
lot worse with a change.



My system block size is 4096, but 8192 or more seems better. It would
decrease the number of inodes and give larger sequential disk extents.


I normally increase read-ahead on Linux systems to get faster speed on 
sequential disk throughput.  Changing the block size might work better 
in some cases, but not many people are willing to do that.  Read-ahead 
is very easy to change at any time.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-21 Thread didier
On Sat, Jul 20, 2013 at 6:28 PM, Greg Smith g...@2ndquadrant.com wrote:

 On 7/20/13 4:48 AM, didier wrote:

 With your tests did you try to write the hot buffers first? i.e. buffers
 with a high refcount, either by sorting them on refcount or at least
 sweeping the buffer list in reverse?


 I never tried that version.  After a few rounds of seeing that all changes
 I tried were just rearranging the good and bad cases, I got pretty bored
 with trying new changes in that same style.


  By writing the buffers least likely to be recycled to the OS first, it may
 have less work to do at fsync time; hopefully they have been written by
 the OS background task during the spread and are not re-dirtied by other
 backends.


 That is the theory.  In practice write caches are so large now, there is
 almost no pressure forcing writes to happen until the fsync calls show up.
  It's easily possible to enter the checkpoint fsync phase only to discover
 there are 4GB of dirty writes ahead of you, ones that have nothing to do
 with the checkpoint's I/O.

 Backends are constantly pounding the write cache with new writes in
 situations with checkpoint spikes.  The writes and fsync calls made by the
 checkpoint process are only a fraction of the real I/O going on. The volume
 of data being squeezed out by each fsync call is based on total writes to
 that relation since the checkpoint.  That's connected to the writes to that
 relation happening during the checkpoint, but the checkpoint writes can
 easily be the minority there.

 It is not a coincidence that the next feature I'm working on attempts to
 quantify the total writes to each 1GB relation chunk.  That's the most
 promising path forward on the checkpoint problem I've found.


 --
 Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
 PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-21 Thread didier
Hi,

On Sat, Jul 20, 2013 at 6:28 PM, Greg Smith g...@2ndquadrant.com wrote:

 On 7/20/13 4:48 AM, didier wrote:


 That is the theory.  In practice write caches are so large now, there is
 almost no pressure forcing writes to happen until the fsync calls show up.
  It's easily possible to enter the checkpoint fsync phase only to discover
 there are 4GB of dirty writes ahead of you, ones that have nothing to do
 with the checkpoint's I/O.

 Isn't adding another layer of cache the usual answer?

The best place would be in the OS: a filesystem with a big journal, able to
write a lot of blocks sequentially.

If not, and if you can spare at worst 2 bits in memory per data block, don't
mind preallocated data files (assuming metadata is stable then), and have a
working mmap(MAP_NONBLOCK) and mincore() syscalls, you could have a
checkpoint in bounded time; worst case, you sequentially write the whole
server RAM to a separate disk every checkpoint.
Not sure I would trust such a beast with my data though :)
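For what it's worth, a small self-contained sketch of the mincore() building
block of that idea (the names are mine; this is only the residency probe, not
the bounded-time checkpoint scheme itself):

/*
 * Illustrative sketch: map a data file read-only and use mincore() to see
 * which of its pages are resident in the OS page cache.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void
report_residency(const char *path)
{
	int fd = open(path, O_RDONLY);
	struct stat st;

	if (fd < 0)
		return;
	if (fstat(fd, &st) != 0 || st.st_size == 0)
	{
		close(fd);
		return;
	}

	long pagesize = sysconf(_SC_PAGESIZE);
	size_t npages = (st.st_size + pagesize - 1) / pagesize;
	void *addr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	unsigned char *vec = malloc(npages);

	if (addr != MAP_FAILED && vec != NULL &&
		mincore(addr, st.st_size, vec) == 0)
	{
		size_t resident = 0;
		for (size_t i = 0; i < npages; i++)
			resident += vec[i] & 1;	/* low bit set = page is resident */
		printf("%s: %zu of %zu pages resident\n", path, resident, npages);
	}

	free(vec);
	if (addr != MAP_FAILED)
		munmap(addr, st.st_size);
	close(fd);
}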


Didier


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-20 Thread didier
Hi

With your tests did you try to write the hot buffers first? i.e. buffers with
a high refcount, either by sorting them on refcount or at least sweeping
the buffer list in reverse?

In my understanding there's an 'impedance mismatch' between what PostgreSQL
wants and what the OS offers. When it calls fsync(), PostgreSQL wants a set of
buffers, selected quickly at checkpoint start time, written to disk; but the OS
only offers to write all the dirty buffers of the file at fsync time. Not
exactly the same contract. On a loaded server with checkpoint spreading the
difference can be big: in the worst case the checkpoint wants 8KB written and
fsync writes 1GB.

As a control problem, there's 150 years of math, up to Maxwell himself, behind
it: adding as little energy (packets) as randomly as possible to a control
system whose actuators you cannot measure...

By writing the buffers least likely to be recycled to the OS first, it may
have less work to do at fsync time; hopefully they have been written by the
OS background task during the spread and are not re-dirtied by other
backends.

Didier


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-20 Thread Greg Smith

On 7/20/13 4:48 AM, didier wrote:

With your tests did you try to write the hot buffers first? i.e. buffers
with a high refcount, either by sorting them on refcount or at least
sweeping the buffer list in reverse?


I never tried that version.  After a few rounds of seeing that all 
changes I tried were just rearranging the good and bad cases, I got 
pretty bored with trying new changes in that same style.



By writing the buffers least likely to be recycled to the OS first, it may
have less work to do at fsync time; hopefully they have been written by
the OS background task during the spread and are not re-dirtied by other
backends.


That is the theory.  In practice write caches are so large now, there is 
almost no pressure forcing writes to happen until the fsync calls show 
up.  It's easily possible to enter the checkpoint fsync phase only to 
discover there are 4GB of dirty writes ahead of you, ones that have 
nothing to do with the checkpoint's I/O.


Backends are constantly pounding the write cache with new writes in 
situations with checkpoint spikes.  The writes and fsync calls made by 
the checkpoint process are only a fraction of the real I/O going on. 
The volume of data being squeezed out by each fsync call is based on 
total writes to that relation since the checkpoint.  That's connected to 
the writes to that relation happening during the checkpoint, but the 
checkpoint writes can easily be the minority there.


It is not a coincidence that the next feature I'm working on attempts to 
quantify the total writes to each 1GB relation chunk.  That's the most 
promising path forward on the checkpoint problem I've found.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-20 Thread Heikki Linnakangas

On 20.07.2013 19:28, Greg Smith wrote:

On 7/20/13 4:48 AM, didier wrote:

With your tests did you try to write the hot buffers first? i.e. buffers
with a high refcount, either by sorting them on refcount or at least
sweeping the buffer list in reverse?


I never tried that version. After a few rounds of seeing that all
changes I tried were just rearranging the good and bad cases, I got
pretty bored with trying new changes in that same style.


It doesn't seem like we're getting anywhere with minor changes to the 
existing logic. The reason I brought up sorting the writes in the first 
place is that it allows you to fsync() each segment after it's written, 
rather than doing all the writes first, and then fsyncing all the relations.


Mitsumasa-san, since you have the test rig ready, could you try the 
attached patch please? It scans the buffer cache several times, writing 
out all the dirty buffers for segment A first, then fsyncs it, then all 
dirty buffers for segment B, and so on. The patch is ugly, but if it 
proves to be helpful, we can spend the time to clean it up.


- Heikki
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8079226..14149a9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1210,10 +1210,15 @@ static void
 BufferSync(int flags)
 {
 	int			buf_id;
-	int			num_to_scan;
+	int			buf_id_start;
 	int			num_to_write;
 	int			num_written;
 	int			mask = BM_DIRTY;
+	bool		target_chosen = false;
+	RelFileNode	target_rnode = {0, 0, 0};
+	ForkNumber	target_forknum = 0;
+	int			target_segno = 0;
+	int			target_bufs = 0;
 
 	/* Make sure we can handle the pin inside SyncOneBuffer */
 	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -1275,10 +1280,10 @@ BufferSync(int flags)
 	 * Note that we don't read the buffer alloc count here --- that should be
 	 * left untouched till the next BgBufferSync() call.
 	 */
-	buf_id = StrategySyncStart(NULL, NULL);
-	num_to_scan = NBuffers;
+	buf_id_start = 0;
 	num_written = 0;
-	while (num_to_scan-- > 0)
+	target_chosen = false;
+	for (buf_id = 0; buf_id < NBuffers; buf_id++)
 	{
 		volatile BufferDesc *bufHdr = BufferDescriptors[buf_id];
 
@@ -1294,7 +1299,12 @@ BufferSync(int flags)
 		 * write the buffer though we didn't need to.  It doesn't seem worth
 		 * guarding against this, though.
 		 */
-		if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
+		/* the above reasoning applies to the target checks too */
+		if ((!target_chosen ||
+			 (RelFileNodeEquals(target_rnode, bufHdr->tag.rnode) &&
+			  target_forknum == bufHdr->tag.forkNum &&
+			  target_segno == bufHdr->tag.blockNum / RELSEG_SIZE)) &&
+			bufHdr->flags & BM_CHECKPOINT_NEEDED)
 		{
 			if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
 			{
@@ -1320,11 +1330,34 @@
 				 * Sleep to throttle our I/O rate.
 				 */
 				CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+
+				/* Find other buffers belonging to this segment */
+				if (!target_chosen)
+				{
+					LockBufHdr(bufHdr);
+					target_rnode = bufHdr->tag.rnode;
+					target_forknum = bufHdr->tag.forkNum;
+					target_segno = bufHdr->tag.blockNum / RELSEG_SIZE;
+					UnlockBufHdr(bufHdr);
+					target_bufs = 0;
+					buf_id_start = buf_id;
+					target_chosen = true;
+				}
+				target_bufs++;
 			}
 		}
 
-		if (++buf_id >= NBuffers)
-			buf_id = 0;
+		if (buf_id == NBuffers - 1 && target_chosen)
+		{
+			if (log_checkpoints)
+				elog(LOG, "checkpoint sync: wrote %d buffers for target, syncing",
+					 target_bufs);
+
+			smgrsyncrel(target_rnode, target_forknum, target_segno);
+			target_chosen = false;
+			/* continue the scan where we left before we chose the target */
+			buf_id = buf_id_start;
+		}
 	}
 
 	/*
@@ -1825,10 +1858,11 @@ CheckPointBuffers(int flags)
 {
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
 	CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
+	smgrsyncbegin();
 	BufferSync(flags);
 	CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
-	smgrsync();
+	smgrsyncend();
 	CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..4e4eb03 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -187,6 +187,7 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
 			 BlockNumber blkno, bool skipFsync, ExtensionBehavior behavior);
 static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
 		   MdfdVec *seg);
+static void mdsyncguts(PendingOperationEntry *entry, ForkNumber forknum, int segno);
 
 
 /*
@@ -235,7 +236,8 @@ SetForwardFsyncRequests(void)
 	/* Perform any pending fsyncs we may have queued up, then drop table */
 	if (pendingOpsTable)
 	{
-		mdsync();
+		mdsyncbegin();
+		mdsyncend();
 		hash_destroy(pendingOpsTable);
 	}
 	pendingOpsTable = NULL;
@@ -970,27 +972,18 

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-19 Thread KONDO Mitsumasa

(2013/07/19 0:41), Greg Smith wrote:

On 7/18/13 11:04 AM, Robert Haas wrote:

On a system where fsync is sometimes very very slow, that
might result in the checkpoint overrunning its time budget - but SO
WHAT?


Checkpoints provide a boundary on recovery time.  That is their only purpose.
You can always do better by postponing them, but you've now changed the 
agreement
with the user about how long recovery might take.
These days, users who consider system availability important run synchronous 
replication clusters. And, as Robert says, a user who cannot build a cluster 
system simply will not enable this behavior via the GUC.


When the I/O system becomes busy during fsync() calls, my patch does not add 
further I/O load in fsync(). This is actually the same structure as OS 
writeback. I read the kernel source, fs/fs-writeback.c in 
linux-2.6.32-358.0.1.el6, which is the latest RHEL 6.4 kernel code. 
wb_writeback() controls disk I/O in the OS writeback path. Please see the 
source code below: if the OS thinks I/O is busy, it bails out rather than 
issuing more writes.


fs/fs-writeback.c @wb_writeback()
 623 /*
 624  * For background writeout, stop when we are below the
 625  * background dirty threshold
 626  */
 627 if (work->for_background && !over_bground_thresh())
 628 break;
 629
 630 wbc.more_io = 0;
 631 wbc.nr_to_write = MAX_WRITEBACK_PAGES;
 632 wbc.pages_skipped = 0;
 633
 634 trace_wbc_writeback_start(&wbc, wb->bdi);
 635 if (work->sb)
 636 __writeback_inodes_sb(work->sb, wb, &wbc);
 637 else
 638 writeback_inodes_wb(wb, &wbc);
 639 trace_wbc_writeback_written(&wbc, wb->bdi);
 640 work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 641 wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 642
 643 /*
 644  * If we consumed everything, see if we have more
 645  */
 646 if (wbc.nr_to_write <= 0)
 647 continue;
 648 /*
 649  * Didn't write everything and we don't have more IO, bail
 650  */
 651 if (!wbc.more_io)
 652 break;
 653 /*
 654  * Did we write something? Try for more
 655  */
 656 if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
 657 continue;
 658 /*
 659  * Nothing written. Wait for some inode to
 660  * become available for writeback. Otherwise
 661  * we'll just busyloop.
 662  */
 663 spin_lock(&inode_lock);
 664 if (!list_empty(&wb->b_more_io))  {
 665 inode = list_entry(wb->b_more_io.prev,
 666 struct inode, i_list);
 667 trace_wbc_writeback_wait(&wbc, wb->bdi);
 668 inode_wait_for_writeback(inode);
 669 }
 670 spin_unlock(&inode_lock);
 671 }
 672
 673 return wrote;

Please look especially at lines 631, 651, and 656. MAX_WRITEBACK_PAGES is 1024 
pages (1024 * 4096 bytes = 4 MB). The OS writeback scheduler never writes more 
than MAX_WRITEBACK_PAGES per pass, because writing more than that at once would 
make the I/O system busy. And if it cannot write anything at all, the OS 
concludes that I/O performance needs time to recover. This is the same logic as 
my patch.


In addition, you have said that performance improves if you set a large 
checkpoint_timeout or checkpoint_completion_target, but is that true in all 
cases? Since dirty buffers older than 30 seconds are written out at 5-second 
intervals, the many pauses between writes can turn them into inefficient random 
writes. As a worsening example, if a 200 ms sleep is inserted before each 
write, only 25 pages (200 KB) can be written per 5 seconds, which is very 
inefficient. When checkpoint_completion_target is actually enlarged, 
performance may therefore fall in some cases. I think this is the last fsync 
becoming heavy as a result of writing slowly.


I would like you to give me an itemized list of what would count as proof of my 
patch, because a DBT-2 benchmark takes a lot of time, about 3-4 hours per 
configuration. Of course, I think it is important to obtain your agreement.


Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-19 Thread Greg Smith

On 7/19/13 3:53 AM, KONDO Mitsumasa wrote:

These days, users who consider system availability important run
synchronous replication clusters.


If your argument for why it's OK to ignore bounding crash recovery on 
the master is that it's possible to failover to a standby, I don't think 
that is acceptable.  PostgreSQL users certainly won't like it.



Please look especially at lines 631, 651, and 656.
MAX_WRITEBACK_PAGES is 1024 pages (1024 * 4096 bytes = 4 MB).


You should read http://www.westnet.com/~gsmith/content/linux-pdflush.htm 
to realize that everything you're telling me about the writeback code and its 
congestion logic I already knew back in 2007.  The situation is even worse than 
you describe, because this section of Linux has gone through multiple, 
major revisions since then.  You can't just say "here is the writeback 
source code"; you have to reference each of the commonly deployed 
versions of the writeback feature to tell how this is going to play out 
if released.  There are four major ones I pay attention to.  The old 
kernel style as seen in RHEL5/2.6.18--that's what my 2007 paper 
discussed--the similar code but with very different defaults in 2.6.22, 
the writeback method/tuning in RHEL6/Debian Squeeze/2.6.32, and then 
there are newer kernels.  (The newer ones separate out into a few 
branches too; I haven't mapped those as carefully yet.)


If you tried to model your feature on Linux's approach here, what that 
means is that the odds of an ugly feedback loop here are even higher. 
You're increasing the feedback on what's already a bad situation that 
triggers trouble for people in the field.  When Linux's congestion logic 
causes checkpoint I/O spikes to get worse than they otherwise might be, 
people panic because it seems like they stopped altogether.  There are 
some examples of what really bad checkpoints look like in 
http://www.2ndquadrant.com/static/2quad/media/pdfs/talks/WriteStuff-PGCon2011.pdf 
if you want to see some of them.  That's the talk I did around the same 
time I was trying out spreading the database fsync calls out over a 
longer period.


When I did that, checkpoints became even less predictable, and that was 
a major reason behind why I rejected the approach.  I think your 
suggestion will have the same problem.  You just aren't generating test 
cases with really large write workloads yet to see it.  You also don't 
seem afraid of how exceeding the checkpoint timeout is a very bad thing yet.



In addition, you have said that performance improves if you set a large
checkpoint_timeout or checkpoint_completion_target, but is that true
in all cases?


The timeout, yes.  Throughput is always improved by increasing 
checkpoint_timeout.  Less checkpoints per unit of time increases 
efficiency.  Less writes of the most heavily accessed buffers happen per 
transaction.  It is faster because you are doing less work, which on 
average is always faster than doing more work.  And doing less work 
usually beats doing more work, but doing it smarter.


If you want to see how much work per transaction a test is doing, track 
the numbers of buffers written at the beginning/end of your test via 
pg_stat_bgwriter.  Tests that delay checkpoints will show a lower total 
number of writes per transaction.  That seems more efficient, but it's 
efficiency mainly gained by ignoring checkpoint_timeout.
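A sketch of one way to automate that bookkeeping (an illustration using plain
libpq; the connection string and the choice and summation of columns are
assumptions of mine, not a tool from this thread):

/*
 * Snapshot pg_stat_bgwriter before and after a benchmark run and compare
 * buffer write counts; divide the delta by transactions processed to get
 * writes per transaction.
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

static long
snapshot_buffer_writes(PGconn *conn)
{
	PGresult *res = PQexec(conn,
		"SELECT buffers_checkpoint, buffers_clean, buffers_backend "
		"FROM pg_stat_bgwriter");
	long total = 0;

	if (PQresultStatus(res) == PGRES_TUPLES_OK)
	{
		/* sum all write paths: checkpointer, bgwriter, and backends */
		for (int col = 0; col < 3; col++)
			total += atol(PQgetvalue(res, 0, col));
	}
	PQclear(res);
	return total;
}

int
main(void)
{
	PGconn *conn = PQconnectdb("dbname=postgres");	/* assumed connstr */

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	long before = snapshot_buffer_writes(conn);
	/* ... run the benchmark here ... */
	long after = snapshot_buffer_writes(conn);

	printf("buffers written during test: %ld\n", after - before);

	PQfinish(conn);
	return 0;
}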



When checkpoint_completion_target is actually enlarged,
performance may fall in some cases. I think this is the last fsync
becoming heavy as a result of writing slowly.


I think you're confusing throughput and latency here.  Increasing the 
checkpoint timeout, or to a lesser extent the completion target, on 
average increases throughput.  It results in less work, and the 
more/less work amount is much more important than worrying about 
scheduler details.  No matter how efficient a given write is, whether 
you've sorted it across elevator horizon boundary A or boundary B, it's 
better not to do it at all.


But having less checkpoints makes latency worse sometimes too.  Whether 
latency or throughput is considered the more important thing is very 
complicated.  Having checkpoint_completion_target as the knob to control 
the latency/throughput trade-off hasn't worked out very well.  No one 
has done a really comprehensive look at this trade-off since the 8.3 
development.  I got halfway through it for 9.1, we figured out that the 
fsync queue filling was actually responsible for most of my result 
variation, and then Robert fixed that.  It was a big enough change that 
all my data from before that I had to throw out as no longer relevant.


By the way:  if you have a theory like "the last fsync having become 
heavy" for why something is happening, measure it.  Set log_min_messages 
to debug2 and you'll get details about every single fsync in your logs. 
 I did that for all my tests that led me to conclude fsync delaying on 
its own didn't help that problem.  I was 

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-18 Thread Amit Kapila
On Wednesday, July 17, 2013 6:08 PM Ants Aasma wrote:
 On Wed, Jul 17, 2013 at 2:54 PM, Amit Kapila amit.kap...@huawei.com
 wrote:
  I think Oracle also uses a similar concept for making writes efficient, and
  they also have a patent on this technology, which you can find at the link below:
  http://www.google.com/patents/US7194589?dq=645987&hl=en&sa=X&ei=kn7mUZ-PIsWqrAe99oDgBw&sqi=2&pjf=1&ved=0CEcQ6AEwAw
 
  Although Oracle has a different concept for performing checkpoint writes,
  I thought of sharing the above link with you so that we do not unknowingly
  go down the wrong path.
 
  AFAIK, instead of depending on OS buffers, they use direct I/O, and in fact
  in the patent above they are using a temporary buffer (Claim 3) to sort the
  writes, which is not the same idea as far as I can understand by reading
  the above thread.
 
 They are not even sorting anything, the patent is for
 opportunistically looking for adjacent dirty blocks when writing out a
 dirty buffer to disk. While a useful technique, this has nothing to do
 with sorting checkpoints. 

It is not sorting; rather, it finds consecutive blocks in the buffer cache
using hashing before writing to disk.
I think the patch differs from it in multiple ways.
I had read this patent some time back and thought that you were also trying
to achieve something similar (reduce random I/O),
so I shared it with you.

With Regards,
Amit Kapila.





Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-18 Thread Robert Haas
On Sun, Jul 14, 2013 at 3:13 PM, Greg Smith g...@2ndquadrant.com wrote:
 Accordingly, the current behavior--no delay--is already the best possible
 throughput.  If you apply a write timing change and it seems to increase
 TPS, that's almost certainly because it executed less checkpoint writes.
 It's not a fair comparison.  You have to adjust any delaying to still hit
 the same end point on the checkpoint schedule. That's what my later
 submissions did, and under that sort of controlled condition most of the
 improvements went away.

This is all valid logic, but I don't think it makes the patch a bad
idea.  What KONDO Mitsumasa is proposing (or proposed at one point,
upthread), is that when an fsync takes a long time, we should wait
before issuing the next fsync, and the delay should be proportional to
how long the previous fsync took.  On a system that's behaving well,
where fsyncs are always fast, that's going to make very little
difference.  On a system where fsync is sometimes very very slow, that
might result in the checkpoint overrunning its time budget - but SO
WHAT?

I mean, yes, we want checkpoints to complete in the time specified,
but if the I/O system is completely flogged, I suspect most people
would prefer to overrun the checkpoint's time budget rather than have
all foreground activity grind to a halt until the checkpoint finishes.
 As I'm pretty sure you've pointed out in the past, when this
situation develops, the checkpoint may be doomed to overrun whether we
like it or not.  We should view this as an emergency pressure release
valve; if we think not everyone will want it, then make it a GUC.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-18 Thread Greg Smith
Please stop all this discussion of patents in this area.  Bringing up a 
US patent here makes US list members more likely to be treated as 
willful infringers of that patent: 
http://www.ipwatchdog.com/patent/advanced-patent/willful-patent-infringement/ 
if the PostgreSQL code duplicates that method one day.


The idea of surveying patents in some area so that their methods can be 
avoided in code you develop, that is a reasonable private stance to 
take.  But don't do that on the lists.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-18 Thread Greg Smith

On 7/18/13 11:04 AM, Robert Haas wrote:

On a system where fsync is sometimes very very slow, that
might result in the checkpoint overrunning its time budget - but SO
WHAT?


Checkpoints provide a boundary on recovery time.  That is their only 
purpose.  You can always do better by postponing them, but you've now 
changed the agreement with the user about how long recovery might take.


And if you don't respect the checkpoint boundary, what you can't do is 
then claim better execution performance than something that did.  It's 
always possible to improve throughput by postponing I/O.  SO WHAT? 
If that's OK, you don't need complicated logic to do that.  Increase 
checkpoint_timeout.  The system with checkpoint_timeout at 6 minutes 
will always outperform one where it's 5.


You don't need to introduce a feedback loop--something that has 
significant schedule stability implications if it gets out of 
control--just to spread I/O out further.  I'd like to wander down the 
road of load-sensitive feedback for database operations, especially for 
maintenance work.  But if I build something that works mainly because it 
shifts the right edge of the I/O deadline forward, I am not fooled into 
thinking I did something awesome.  That's cheating, getting better 
performance mainly by throwing out the implied contract with the 
user--the one over their expected recovery time after a crash.  And I'm 
not excited about complicating the PostgreSQL code to add a new way to 
do that, not when checkpoint_timeout is already there with a direct, 
simple control on the exact same trade-off.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-18 Thread Robert Haas
On Thu, Jul 18, 2013 at 11:41 AM, Greg Smith g...@2ndquadrant.com wrote:
 On 7/18/13 11:04 AM, Robert Haas wrote:
 On a system where fsync is sometimes very very slow, that
 might result in the checkpoint overrunning its time budget - but SO
 WHAT?

 Checkpoints provide a boundary on recovery time.  That is their only
 purpose.  You can always do better by postponing them, but you've now
 changed the agreement with the user about how long recovery might take.

 And if you don't respect the checkpoint boundary, what you can't do is then
 claim better execution performance than something that did.  It's always
 possible to improve throughput by postponing I/O.  SO WHAT? If that's
 OK, you don't need complicated logic to do that.  Increase
 checkpoint_timeout.  The system with checkpoint_timeout at 6 minutes will
 always outperform one where it's 5.

 You don't need to introduce a feedback loop--something that has significant
 schedule stability implications if it gets out of control--just to spread
 I/O out further.  I'd like to wander down the road of load-sensitive
 feedback for database operations, especially for maintenance work.  But if I
 build something that works mainly because it shifts the right edge of the
 I/O deadline forward, I am not fooled into thinking I did something awesome.
 That's cheating, getting better performance mainly by throwing out the
 implied contract with the user--the one over their expected recovery time
 after a crash.  And I'm not excited about complicating the PostgreSQL code
 to add a new way to do that, not when checkpoint_timeout is already there
 with a direct, simple control on the exact same trade-off.

That's not the same trade-off.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-18 Thread Alvaro Herrera
Greg Smith escribió:
 On 7/18/13 11:04 AM, Robert Haas wrote:
 On a system where fsync is sometimes very very slow, that
 might result in the checkpoint overrunning its time budget - but SO
 WHAT?
 
 Checkpoints provide a boundary on recovery time.  That is their only
 purpose.  You can always do better by postponing them, but you've
 now changed the agreement with the user about how long recovery
 might take.
 
 And if you don't respect the checkpoint boundary, what you can't do
 is then claim better execution performance than something that did.
 It's always possible to improve throughput by postponing I/O.
 SO WHAT? If that's OK, you don't need complicated logic to do that.
 Increase checkpoint_timeout.  The system with checkpoint_timeout at
 6 minutes will always outperform one where it's 5.

I think the idea is to have a system in which most of the time the
recovery time will be that for checkpoint_timeout=5, but in those
(hopefully rare) cases where checkpoints take a bit longer, the recovery
time will be that for checkpoint_timeout=6.

In any case, if the system crashes past minute 5 after the previous
checkpoint (the worst possible timing), the current checkpoint will have
already started, so recovery will take slightly less time because some
flush work had already been done.

-- 
Álvaro Herrera   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-18 Thread Greg Smith

On 7/18/13 12:00 PM, Alvaro Herrera wrote:

I think the idea is to have a system in which most of the time the
recovery time will be that for checkpoint_timeout=5, but in those
(hopefully rare) cases where checkpoints take a bit longer, the recovery
time will be that for checkpoint_timeout=6.


I understand the implementation.  My point is that if you do that, the 
fair comparison is to benchmark it against a current system where 
checkpoint_timeout=6 minutes.  That is a) simpler, b) requires no code 
change, and c) makes the looser standards the server is now settling for 
transparent to the administrator.  Also, my expectation is that it would 
perform better all of the time, not just during the periods this new 
behavior kicks in.


Right now we have checkpoint_completion_target as a GUC for controlling 
what's called the spread of a checkpoint over time.  That sometimes goes 
over, but that's happening against the best attempts of the server to do 
better.
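(As a concrete illustration, with numbers of my own: checkpoint_timeout = 5min 
and checkpoint_completion_target = 0.5 pace the write phase to finish within 
0.5 * 300 s = 150 s of each interval; "going over" means the checkpoint is 
still running when that schedule says it should already be done.)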


The first word that comes to mind for just disregarding the end time 
is that it's a sloppy checkpoint.  There is all sorts of sloppy behavior 
you might do here, but I've worked under the assumption that ignoring 
the contract with the administrator was frowned on by this project.  If 
people want this sort of behavior in the server, I'm satisfied my 
distaste for the idea and the reasoning behind it is clear now.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-18 Thread Stephen Frost
* Greg Smith (g...@2ndquadrant.com) wrote:
 The first word that comes to mind for just disregarding the end
 time is that it's a sloppy checkpoint.  There is all sorts of sloppy
 behavior you might do here, but I've worked under the assumption
 that ignoring the contract with the administrator was frowned on by
 this project.  If people want this sort of behavior in the server,
 I'm satisfied my distaste for the idea and the reasoning behind it
 is clear now.

For my part, I agree with Greg on this.  While we might want to provide
an option of "go ahead and go past checkpoint_timeout if the server gets
too busy to keep up", I don't think it should be the default.

To be honest, I'm also not convinced that this approach is better than
the existing mechanism where the user can adjust checkpoint_timeout to
be higher if they're ok with recovery taking longer and I share Greg's
concern about this backoff potentially running away and causing
checkpoints which never complete or do so far outside the configured
time.

Thanks,

Stephen




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-17 Thread Greg Smith

On 7/16/13 11:36 PM, Ants Aasma wrote:

As you know running a full suite of write benchmarks takes a very long
time, with results often being inconclusive (noise is greater than
effect we are trying to measure).


I didn't say that.  What I said is that over a full suite of write 
benchmarks, the effect of changes like this has always averaged out to 
zero.  You should try it sometime.  Then we can have a useful discussion 
of non-trivial results instead of you continuing to tell me I don't 
understand things.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-17 Thread Amit Kapila
On Tuesday, July 16, 2013 10:16 PM Ants Aasma wrote:
 On Jul 14, 2013 9:46 PM, Greg Smith g...@2ndquadrant.com wrote:
  I updated and re-reviewed that in 2011:
 http://www.postgresql.org/message-id/4d31ae64.3000...@2ndquadrant.com
 and commented on why I think the improvement was difficult to reproduce
 back then.  The improvement didn't follow for me either.  It would take
 a really amazing bit of data to get me to believe write sorting code is
 worthwhile after that.  On large systems capable of dirtying enough
 blocks to cause a problem, the operating system and RAID controllers
 are already sorting blocks.  And *that* sorting is also considering
 concurrent read requests, which are a lot more important to an
 efficient schedule than anything the checkpoint process knows about.
 The database doesn't have nearly enough information yet to compete
 against OS level sorting.
 
 That reasoning makes no sense. OS level sorting can only see the
 writes in the time window between PostgreSQL write, and being forced
 to disk. Spread checkpoints sprinkles the writes out over a long
 period and the general tuning advice is to heavily bound the amount of
 memory the OS is willing to keep dirty. This makes the probability of
 scheduling adjacent writes together quite low, the merging window
 being limited either by dirty_bytes or dirty_expire_centisecs. The
 checkpointer has the best long term overview of the situation here, OS
 scheduling only has the short term view of outstanding read and write
 requests. By sorting checkpoint writes it is much more likely that
 adjacent blocks are visible to OS writeback at the same time and will
 be issued together.

I think Oracle also uses a similar concept for making writes efficient, and
they also have a patent on this technology, which you can find at the link below:
http://www.google.com/patents/US7194589?dq=645987&hl=en&sa=X&ei=kn7mUZ-PIsWqrAe99oDgBw&sqi=2&pjf=1&ved=0CEcQ6AEwAw

Although Oracle has a different concept for performing checkpoint writes, I
thought of sharing the above link with you, so that we do not unknowingly go
down the wrong path.

AFAIK, instead of depending on OS buffers, they use direct I/O, and in fact in
the patent above they are using a temporary buffer (Claim 3) to sort the
writes, which is not the same idea as far as I can understand by reading the
above thread.

With Regards,
Amit Kapila.





Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-17 Thread Ants Aasma
On Wed, Jul 17, 2013 at 1:54 PM, Greg Smith g...@2ndquadrant.com wrote:
 On 7/16/13 11:36 PM, Ants Aasma wrote:

 As you know running a full suite of write benchmarks takes a very long
 time, with results often being inconclusive (noise is greater than
 effect we are trying to measure).


 I didn't say that.  What I said is that over a full suite of write
 benchmarks, the effect of changes like this has always averaged out to zero.
 You should try it sometime.  Then we can have a useful discussion of
 non-trivial results instead of you continuing to tell me I don't understand
 things.

The fact that other changes have been tradeoffs doesn't change the
point that there is no tradeoff here. I see no way in which writing
blocks to the OS in a logical order is worse than writing them out in
arbitrary order. This is why I considered blindly running write
benchmarks a waste of time at this point - if the worst case is zero
and there are cases where it helps then it can't average out to zero.
It would be better to identify the worst case and design a test for
that.

However I started the full gamut of scale factors and client count
tests just to quiet any fears of unexpected regressions: 4 scales, 6
client loads, 3 tests, 20min per test, 2 versions, the results will be
done in 48h.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-17 Thread Ants Aasma
On Wed, Jul 17, 2013 at 2:54 PM, Amit Kapila amit.kap...@huawei.com wrote:
 I think Oracle also uses a similar concept for making writes efficient, and
 they also have a patent on this technology, which you can find at the link below:
 http://www.google.com/patents/US7194589?dq=645987&hl=en&sa=X&ei=kn7mUZ-PIsWqrAe99oDgBw&sqi=2&pjf=1&ved=0CEcQ6AEwAw

 Although Oracle has a different concept for performing checkpoint writes,
 I thought of sharing the above link with you so that we do not unknowingly
 go down the wrong path.

 AFAIK, instead of depending on OS buffers, they use direct I/O, and in fact
 in the patent above they are using a temporary buffer (Claim 3) to sort the
 writes, which is not the same idea as far as I can understand by reading
 the above thread.

They are not even sorting anything, the patent is for
opportunistically looking for adjacent dirty blocks when writing out a
dirty buffer to disk. While a useful technique, this has nothing to do
with sorting checkpoints. It's also a good example of why the patent
system is stupid. It's an obvious idea that probably has loads of
prior art. I'm no patent lawyer, but the patent also looks like it
would be easy to bypass by doing the equivalent thing in a slightly
different way.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-16 Thread Ants Aasma
On Jul 14, 2013 9:46 PM, Greg Smith g...@2ndquadrant.com wrote:
 I updated and re-reviewed that in 2011: 
 http://www.postgresql.org/message-id/4d31ae64.3000...@2ndquadrant.com and 
 commented on why I think the improvement was difficult to reproduce back 
 then.  The improvement didn't follow for me either.  It would take a really 
 amazing bit of data to get me to believe write sorting code is worthwhile 
 after that.  On large systems capable of dirtying enough blocks to cause a 
 problem, the operating system and RAID controllers are already sorting block. 
  And *that* sorting is also considering concurrent read requests, which are a 
 lot more important to an efficient schedule than anything the checkpoint 
 process knows about.  The database doesn't have nearly enough information yet 
 to compete against OS level sorting.

That reasoning makes no sense. OS level sorting can only see the
writes in the time window between PostgreSQL write, and being forced
to disk. Spread checkpoints sprinkles the writes out over a long
period and the general tuning advice is to heavily bound the amount of
memory the OS is willing to keep dirty. This makes the probability of
scheduling adjacent writes together quite low, the merging window
being limited either by dirty_bytes or dirty_expire_centisecs. The
checkpointer has the best long term overview of the situation here, OS
scheduling only has the short term view of outstanding read and write
requests. By sorting checkpoint writes it is much more likely that
adjacent blocks are visible to OS writeback at the same time and will
be issued together.
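To make the mechanism concrete, here is a minimal sketch of what sorting
checkpoint writes means (BufTag is a simplified stand-in for PostgreSQL's
BufferTag; this is an illustration, not the patch that was benchmarked):

/*
 * Collect the tags of the dirty buffers at checkpoint start and qsort()
 * them into on-disk order before writing, so adjacent blocks reach OS
 * writeback together and can be merged into sequential I/O.
 */
#include <stdlib.h>

typedef struct BufTag
{
	unsigned int relNode;	/* which relation */
	int          forkNum;	/* which fork of the relation */
	unsigned int blockNum;	/* which block within the fork */
} BufTag;

static int
buftag_cmp(const void *a, const void *b)
{
	const BufTag *x = a, *y = b;

	if (x->relNode != y->relNode)
		return x->relNode < y->relNode ? -1 : 1;
	if (x->forkNum != y->forkNum)
		return x->forkNum < y->forkNum ? -1 : 1;
	if (x->blockNum != y->blockNum)
		return x->blockNum < y->blockNum ? -1 : 1;
	return 0;
}

/* sort the checkpoint's to-write list into file order before issuing writes */
static void
sort_checkpoint_writes(BufTag *tags, size_t n)
{
	qsort(tags, n, sizeof(BufTag), buftag_cmp);
}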

I gave the linked patch a shot. I tried it with pgbench scale 100
concurrency 32, postgresql shared_buffers=3GB,
checkpoint_timeout=5min, checkpoint_segments=100,
checkpoint_completion_target=0.5, pgdata was on a 7200RPM HDD, xlog on
Intel 320 SSD, kernel settings: dirty_background_bytes = 32M,
dirty_bytes = 128M.

first checkpoint on master: wrote 209496 buffers (53.7%); 0
transaction log file(s) added, 0 removed, 26 recycled; write=314.444
s, sync=9.614 s, total=324.166 s; sync files=16, longest=9.208 s,
average=0.600 s
IO while checkpointing: about 500 write iops at 5MB/s, 100% utilisation.

first checkpoint with checkpoint sorting applied: wrote 205269 buffers
(52.6%); 0 transaction log file(s) added, 0 removed, 0 recycled;
write=149.049 s, sync=0.386 s, total=149.559 s; sync files=39,
longest=0.255 s, average=0.009 s
IO while checkpointing: about 23 write iops at 12MB/s, 10% utilisation.

Transaction processing rate for a 20min run went from 5200 to 7000.

Looks to me that in this admittedly best case workload the sorting is
working exactly as designed, converting mostly random IO into
sequential. I have seen many real world workloads where this kind of
sorting would have benefited greatly.

I also did an I/O-bound test with scale factor 100 and
checkpoint_timeout 30min. 2hour average tps went from 121 to 135, but
I'm not yet sure if it's repeatable or just noise.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-16 Thread Greg Smith

On 7/16/13 12:46 PM, Ants Aasma wrote:


Spread checkpoints sprinkle the writes out over a long 
period, and the general tuning advice is to heavily bound the amount of 
memory the OS is willing to keep dirty.


That's arguing that you can make this feature be useful if you tune in a 
particular way.  That's interesting, but the goal here isn't to prove 
the existence of some workload that a change is useful for.  You can 
usually find a test case that validates any performance patch as helpful 
if you search for one.  Everyone who has submitted a sorted checkpoint 
patch for example has found some setup where it shows significant gains. 
 We're trying to keep performance stable across a much wider set of 
possibilities though.


Let's talk about default parameters instead, which quickly demonstrates 
where your assumptions fail.  The server I happen to be running pgbench 
tests on today has 72GB of RAM running SL6 with RedHat derived kernel 
2.6.32-358.11.1.  This is a very popular middle grade server 
configuration nowadays.  There dirty_background_ratio and 
dirty_ratio are 10 (percent).  That means that roughly 7GB of 
RAM can be used for write caching.  Note that this is a fairly low write 
cache tuning compared to a survey of systems in the field--lots of 
people have servers with earlier kernels where these numbers can be as 
high as 20 or even 40% instead.


The current feasible tuning for shared_buffers suggests a value of 8GB 
is near the upper limit, beyond which cache related overhead makes 
increases counterproductive.  Your examples are showing 53% of 
shared_buffers dirty at checkpoint time; that's typical.  The 
checkpointer is then writing out just over 4GB of data.


With that background what process here has more data to make decisions with?

-The operating system has 7GB of writes it's trying to optimize.  That 
potentially includes backend, background writer, checkpoint, temp table, 
statistics, log, and WAL data.  The scheduler is also considering read 
operations.


-The checkpointer process has 4GB of writes from rarely written shared 
memory it's trying to optimize.


This is why if you take the opposite approach of yours today--go 
searching for workloads where sorting is counterproductive--those are 
equally easy to find.  Any test of write speed I do starts with about 50 
different scale/client combinations.  Why do I suggest pgbench-tools as 
a way to do performance tests?  It's because an automated sweep of 
client setups like it does is the minimum necessary to create enough 
variation in workload for changing the database's write path.  It's 
really amazing how often doing that shows a proposed change is just 
shuffling the good and bad cases around.  That's been the case for every 
sorting and fsync delay change submitted so far.  I'm not even 
interested in testing today's submission because I tried that particular 
approach for a few months, twice so far, and it fell apart on just as 
many workloads as it helped.



The checkpointer has the best long term overview of the situation here, OS
scheduling only has the short term view of outstanding read and write
requests.


True only if shared_buffers is large compared to the OS write cache, 
which was not the case on the example I generated with all of a minute's 
work.  I regularly see servers where Linux's Dirty area becomes a 
multiple of the dirty buffers written by a checkpoint.  I can usually 
make that happen at will with CLUSTER and VACUUM on big tables.  The 
idea that the checkpointer has a long-term view while the OS has a short 
one, that presumes a setup that I would say is possible but not common.



kernel settings: dirty_background_bytes = 32M,
dirty_bytes = 128M.


You disclaimed this as a best case scenario.  It is a low throughput / 
low latency tuning.  That's fine, but if Postgres optimizes itself 
toward those cases it runs the risk of high throughput servers with 
large caches being detuned.  I've posted examples before showing very 
low write caches like this leading to VACUUM running at 1/2 its normal 
speed or worse, as a simple example of where a positive change in one 
area can backfire badly on another workload.  That particular problem 
was so common I updated pgbench-tools recently to track table 
maintenance time between tests, because that demonstrated an issue even 
when the TPS numbers all looked fine.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-16 Thread Ants Aasma
On Tue, Jul 16, 2013 at 9:17 PM, Greg Smith g...@2ndquadrant.com wrote:
 On 7/16/13 12:46 PM, Ants Aasma wrote:

 Spread checkpoints sprinkle the writes out over a long
 period, and the general tuning advice is to heavily bound the amount of
 memory the OS is willing to keep dirty.


 That's arguing that you can make this feature be useful if you tune in a
 particular way.  That's interesting, but the goal here isn't to prove the
 existence of some workload that a change is useful for.  You can usually
 find a test case that validates any performance patch as helpful if you
 search for one.  Everyone who has submitted a sorted checkpoint patch for
 example has found some setup where it shows significant gains.  We're trying
 to keep performance stable across a much wider set of possibilities though.

 Let's talk about default parameters instead, which quickly demonstrates
 where your assumptions fail.  The server I happen to be running pgbench
 tests on today has 72GB of RAM running SL6 with RedHat derived kernel
 2.6.32-358.11.1.  This is a very popular middle grade server configuration
 nowadays.  There dirty_background_ratio and dirty_ratio are 10
 (percent).  That means that roughly 7GB of RAM can be used for write
 caching.  Note that this is a fairly low write cache tuning compared to a
 survey of systems in the field--lots of people have servers with earlier
 kernels where these numbers can be as high as 20 or even 40% instead.

 The current feasible tuning for shared_buffers suggests a value of 8GB is
 near the upper limit, beyond which cache related overhead makes increases
 counterproductive.  Your examples are showing 53% of shared_buffers dirty at
 checkpoint time; that's typical.  The checkpointer is then writing out just
 over 4GB of data.

 With that background what process here has more data to make decisions with?

 -The operating system has 7GB of writes it's trying to optimize.  That
 potentially includes backend, background writer, checkpoint, temp table,
 statistics, log, and WAL data.  The scheduler is also considering read
 operations.

 -The checkpointer process has 4GB of writes from rarely written shared
 memory it's trying to optimize.

Actually I was arguing that the reasoning that the OS will take care
of the sorting does not apply in reasonably common cases. My point is
that the OS isn't able to optimize the writes because spread
checkpoints trickle the writes out to the OS in random order over a
long time. If OS writeback behavior is left in the default
configuration it will start writing out data before the checkpoint
write phase ends (due to dirty_expire_centisecs); this misses write
combining opportunities that would arise if we sorted the data before
dumping it to the OS dirty buffers. I'm not arguing that we try to
bypass OS I/O scheduling decisions; I'm arguing that by arranging
checkpoint writes in logical order we make pages visible to the
I/O scheduler in a way that leads to more efficient writes.

Also I think that you are overestimating the capabilities of the OS IO
scheduler. At least for Linux, the IO scheduler does not see pages in
the dirty list - only pages for which writeback has been initiated. In
default configuration this means up to 128 read and 128 write I/Os are
considered. The writes are picked by basically doing round robin on
files with dirty pages and doing a clocksweep scan for a chunk of
pages from each. So in reality there is practically no benefit in
having the OS do the reordering, while there is the issue that
flushing a large amount of dirty pages at once does very nasty things
to query latency by overloading all of the I/O queues.

 This is why if you take the opposite approach of yours today--go searching
 for workloads where sorting is counterproductive--those are equally easy to
 find.  Any test of write speed I do starts with about 50 different
 scale/client combinations.  Why do I suggest pgbench-tools as a way to do
 performance tests?  It's because an automated sweep of client setups like it
 does is the minimum necessary to create enough variation in workload for
 changing the database's write path.  It's really amazing how often doing
 that shows a proposed change is just shuffling the good and bad cases
 around.  That's been the case for every sorting and fsync delay change
 submitted so far.  I'm not even interested in testing today's submission
 because I tried that particular approach for a few months, twice so far, and
 it fell apart on just as many workloads as it helped.

As you know, running a full suite of write benchmarks takes a very long
time, with results often being inconclusive (the noise is greater than
the effect we are trying to measure). This is why I'm interested in which
workloads you suspect might fall apart from this patch - because I
can't think of any. Worst case would be that the OS fully absorbs all
checkpoint writes before writing anything out, so the sorting is a
useless waste of CPU and memory. The CPU 

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Greg Smith

On 6/16/13 10:27 AM, Heikki Linnakangas wrote:

Yeah, the checkpoint scheduling logic doesn't take into account the
heavy WAL activity caused by full page images...
Rationalizing a bit, I could even argue to myself that it's a *good*
thing. At the beginning of a checkpoint, the OS write cache should be
relatively empty, as the checkpointer hasn't done any writes yet. So it
might make sense to write a burst of pages at the beginning, to
partially fill the write cache first, before starting to throttle. But
this is just handwaving - I have no idea what the effect is in real life.


That's exactly right.  When a checkpoint finishes the OS write cache is 
clean.  That means all of the full-page writes aren't even hitting disk 
in many cases.  They just pile up in the OS dirty memory, often sitting 
there all the way until when the next checkpoint fsyncs start.  That's 
why I never wandered down the road of changing FPW behavior.  I have 
never seen a benchmark workload hit a write bottleneck until long after 
the big burst of FPW pages is over.


I could easily believe that there are low-memory systems where the FPW 
write pressure becomes a problem earlier.  And slim VMs make sense as 
the place where this behavior is being seen.


I'm a big fan of instrumenting the code around a performance change 
before touching anything, as a companion patch that might make sense to 
commit on its own.  In the case of a change to FPW spacing, I'd want to 
see some diagnostic output in something like pg_stat_bgwriter that 
tracks how many FPW pages are being modified.  A 
pgstat_bgwriter.full_page_writes counter would be perfect here, and then 
graph that data over time as the benchmark runs.
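

To make that concrete, here is a rough illustration of the proposed
counter. The names are hypothetical stand-ins for the pgstat
machinery, not the actual structures or API:

/* hypothetical stats struct, not the real pgstat machinery */
typedef struct FpwStats
{
	long long	full_page_writes;	/* backup blocks emitted to WAL */
} FpwStats;

static FpwStats fpw_stats;

/*
 * Would be called from the WAL insert path whenever a full page
 * image is attached to a record; sample the counter before and after
 * a benchmark run and graph the difference over time.
 */
static void
count_full_page_write(void)
{
	fpw_stats.full_page_writes++;
}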



Another thought is that rather than trying to compensate for that effect
in the checkpoint scheduler, could we avoid the sudden rush of full-page
images in the first place? The current rule for when to write a full
page image is conservative: you don't actually need to write a full page
image when you modify a buffer that's sitting in the buffer cache, if
that buffer hasn't been flushed to disk by the checkpointer yet, because
the checkpointer will write and fsync it later. I'm not sure how much it
would smoothen WAL write I/O, but it would be interesting to try.


There I also think the right way to proceed is instrumenting that area 
first.



A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp.
He posted very promising performance numbers, but it was dropped because
Tom couldn't reproduce the numbers, and because sorting requires
allocating a large array, which has the risk of running out of memory,
which would be bad when you're trying to checkpoint.


I updated and re-reviewed that in 2011: 
http://www.postgresql.org/message-id/4d31ae64.3000...@2ndquadrant.com 
and commented on why I think the improvement was difficult to reproduce 
back then.  The improvement didn't follow for me either.  It would take 
a really amazing bit of data to get me to believe write sorting code is 
worthwhile after that.  On large systems capable of dirtying enough 
blocks to cause a problem, the operating system and RAID controllers are 
already sorting blocks.  And *that* sorting is also considering 
concurrent read requests, which are a lot more important to an efficient 
schedule than anything the checkpoint process knows about.  The database 
doesn't have nearly enough information yet to compete against OS level 
sorting.




Bad point of my patch is longer checkpoint. Checkpoint time was
increased about 10% - 20%. But it can work correctly on schedule-time in
checkpoint_timeout. Please see checkpoint result (http://goo.gl/NsbC6).


For a fair comparison, you should increase the
checkpoint_completion_target of the unpatched test, so that the
checkpoints run for roughly the same amount of time with and without the
patch. Otherwise the benefit you're seeing could be just because of a
more lazy checkpoint.


Heikki has nailed the problem with the submitted dbt-2 results here.  If 
you spread checkpoints out more, you cannot fairly compare the resulting 
TPS or latency numbers anymore.


Simple example:  20 minute long test.  Server A does a checkpoint every 
5 minutes.  Server B has modified parameters or server code such that 
checkpoints happen every 6 minutes.  If you run both to completion, A 
will have hit 4 checkpoints that flush the buffer cache, B only 3.  Of 
course B will seem faster.  It didn't do as much work.


pgbench-tools measures the number of checkpoints during the test, as 
well as the buffer count statistics.  If those numbers are very 
different between two tests, I have to throw them out as unfair.  A lot 
of things that seem promising turn out to have this sort of problem.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via 

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Greg Smith

On 6/27/13 11:08 AM, Robert Haas wrote:

I'm pretty sure Greg Smith tried the fixed-sleep thing before and
it didn't work that well.


That's correct, I spent about a year whipping that particular horse and 
submitted improvements on it to the community. 
http://www.postgresql.org/message-id/4d4f9a3d.5070...@2ndquadrant.com 
and its updates downthread are good ones to compare this current work 
against.


The important thing to realize about just delaying fsync calls is that 
it *cannot* increase TPS throughput.  Not possible in theory, obviously 
doesn't happen in practice.  The most efficient way to write things out 
is to delay those writes as long as possible.  The longer you postpone a 
write, the more elevator sorting and write combining you get out of the 
OS.  This is why operating systems like Linux come tuned for such 
delayed writes in the first place.  Throughput and latency are linked; 
any patch that aims to decrease latency will probably slow throughput.


Accordingly, the current behavior--no delay--is already the best 
possible throughput.  If you apply a write timing change and it seems to 
increase TPS, that's almost certainly because it executed less 
checkpoint writes.  It's not a fair comparison.  You have to adjust any 
delaying to still hit the same end point on the checkpoint schedule. 
That's what my later submissions did, and under that sort of controlled 
condition most of the improvements went away.


Now, I still do really believe that better spacing of fsync calls helps 
latency in the real world.  Far as I know the server that I developed 
that patch for originally in 2010 is still running with that change. 
The result is not a throughput change though; there is a throughput drop 
with a latency improvement.  That is the unbreakable trade-off in this 
area if all you touch is scheduling.


The reason why I was ignoring this discussion and working on pgbench 
throttling until now is that you need to measure latency at a constant 
throughput to advance here on this topic, and that's exactly what the 
new pgbench feature enables.  If we can take the current checkpoint 
scheduler and an altered one, run both at exactly the same rate, and one 
gives lower latency, now we're onto something.  It's possible to do that 
with DBT-2 as well, but I wanted something really simple that people 
could replicate results with in pgbench.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Greg Smith

On 7/3/13 9:39 AM, Andres Freund wrote:

I wonder how much of this could be gained by doing a
sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing
the original checkpoint-pass through the buffers or when fsyncing the
files.


The fsync calls decompose into the queued set of block writes.  If 
they all need to go out eventually to finish a checkpoint, the most 
efficient way from a throughput perspective is to dump them all at once.


I'm not sure sync_file_range targeting checkpoint writes will turn out 
any differently than block sorting.  Let's say the database tries to get 
involved in forcing a particular write order that way.  Right now it's 
going to be making that ordering decision without the benefit of also 
knowing what blocks are being read.  That makes it hard to do better 
than the OS, which knows a different--and potentially more useful in a 
read-heavy environment--set of information about all the pending I/O. 
And it would be very expensive to made all the backends start sharing 
information about what they read to ever pull that logic into the 
database.  It's really easy to wander down the path where you assume you 
must know more than the OS does, which leads to things like direct I/O. 
 I am skeptical of that path in general.  I really don't want Postgres 
to be competing with the innovation rate in Linux kernel I/O if we can 
ride it instead.


One idea I was thinking about that overlaps with a sync_file_range 
refactoring is simply tracking how many blocks have been written to each 
relation.  If there was a rule like fsync any relation that's gotten 
more than 100 8K writes, we'd never build up the sort of backlog that 
causes the worst latency issues.  You really need to start tracking the 
file range there, just to fairly account for multiple writes to the same 
block.  One of the reasons I don't mind all the work I'm planning to put 
into block write statistics is that I think that will make it easier to 
build this sort of facility too.  The original page write and the fsync 
call that eventually flushes it out are very disconnected right now, and 
file range data seems the right missing piece to connect them well.
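

A hedged sketch of that rule, assuming a per-file counter living
wherever the checkpointer issues its writes (these are illustrative
names, not PostgreSQL functions):

#include <unistd.h>

#define FLUSH_AFTER_WRITES 100		/* 100 x 8K writes per file */

typedef struct FileWriteState
{
	int		fd;					/* open FD for the relation segment */
	int		writes_since_sync;	/* 8K writes since the last flush */
} FileWriteState;

static void
note_block_write(FileWriteState *f)
{
	if (++f->writes_since_sync >= FLUSH_AFTER_WRITES)
	{
		/* flush now, before this file's backlog can grow into the
		 * multi-second fsync stalls seen at checkpoint end */
		fsync(f->fd);
		f->writes_since_sync = 0;
	}
}

As noted above, a plain counter over-counts rewrites of the same
block; that is exactly where the file range tracking would come in.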


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread james

On 14/07/2013 20:13, Greg Smith wrote:
The most efficient way to write things out is to delay those writes as 
long as possible.


That doesn't smell right to me.  It might be that delaying allows more 
combining and allows the kernel to see more at once and optimise it, but 
I think the counter-argument is that it is an efficiency loss to have 
either CPU or disk idle waiting on the other.  It cannot make sense from 
a throughput point of view to have disks doing nothing and then become 
overloaded so they are a bottleneck (primarily seeking) and the CPU does 
nothing.


Now I have NOT measured behaviour but I'd observe that we see disks that 
can stream 100MB/s but do only 5% of that if they are doing random IO.  
Some random seeks during sync can't be helped, but if they are done when 
we aren't waiting for sync completion then they are in effect free.  The 
flip side is that we can't really know whether they will get merged with 
adjacent writes later, so it's hard to schedule them early.  But we can 
observe that if we have a bunch of writes to adjacent data then a seek 
to do the write is effectively amortised across them.


So it occurs to me that perhaps we can watch for patterns where we have 
groups of adjacent writes that might stream, and when they form we might 
schedule them to be pushed out early (if not immediately), ideally out 
as far as the drive (but not flushed from its cache) and without forcing 
all other data to be flushed too.  And perhaps we should always look to 
be getting drives dedicated to dbms to do something, even if it turns 
out to have been redundant in the end.


That's not necessarily easy on Linux without using direct unbuffered 
IO, but to me that is Linux's problem.  For a start it's not the only 
target system, and having 'we need' feedback from db and mail system 
groups to the NT kernel devs hasn't hurt, and it never hurt Solaris to 
hear what Oracle and Sybase devs felt they needed either.




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Greg Smith
On 7/11/13 8:29 AM, KONDO Mitsumasa wrote:
 I use a linear combination method to schedule the total checkpoint across
 the write phase and fsync phase. The v3 patch considered only the fsync
 phase, the v4 patch considered the write phase and fsync phase, and the v5
 patch considered only the fsync phase.

Your v5 now looks like my Self-tuning checkpoint sync spread series:
https://commitfest.postgresql.org/action/patch_view?id=514 which I did
after deciding write phase delays didn't help.  It looks to me like
some, maybe all, of your gain is coming from how any added delays spread
out the checkpoints.  The self-tuning part I aimed at was trying to
stay on exactly the same checkpoint end time even with the delays in
place.  I got that part to work, but the performance gain went away once
the schedule was a fair comparison.  You are trying to solve a very hard
problem.

How long are you running your dbt-2 tests for?  I didn't see that listed
anywhere.

 ** Average checkpoint duration (sec) (not including the loading period)
   | write_duration | sync_duration | total
 fsync v3-0.7 | 296.6  | 251.8898  | 548.48 | OK
 fsync v3-0.9 | 292.086| 276.4525  | 568.53 | OK
 fsync v3-0.7_disabled| 303.5706   | 155.6116  | 459.18 | OK
 fsync v4-0.7 | 273.8338   | 355.6224  | 629.45 | OK
 fsync v4-0.9 | 329.0522   | 231.77| 560.82 | OK

I graphed the total times against the resulting NOTPM values and
attached that.  I expect the transaction rate to increase along with the
time between checkpoints, and that's what I see here.  The fsync v4-0.7
result is worse than the rest for some reason, but all the rest line up
nicely.

Notice how fsync v3-0.7_disabled has the lowest total time between
checkpoints, at 459.18.  That is why it has the most I/O and therefore
runs more slowly than the rest.  If you take your fsync v3-0.7_disabled
and increase checkpoint_segments and/or checkpoint_timeout until that
test is averaging about 550 seconds between checkpoints, NOTPM should
also increase.  That's interesting to know, but you don't need any
change to Postgres for that.  That's what always happens when you have
less checkpoints per run.

If you get a checkpoint time table like this where the total duration is
very close--within +/-20 seconds is the sort of noise I would expect
there--at that point I would say you have all your patches on the same
checkpoint schedule.  And then you can compare the NOTPM numbers
usefully.  When the checkpoint times are in a large range like 459.18 to
629.45 in this table, as my graph shows the associated NOTPM numbers are
going to be based on that time.

I would recommend taking a snapshot of pg_stat_bgwriter before and after
the test runs, and then showing the difference between all of those
numbers too.  If the test runs for a while--say 30 minutes--the total
number of checkpoints should be very close too.

 * Test Server
Server: HP Proliant DL360 G7
CPU:Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB(PC3-10600R-9)
Disk:   146GB(15k)*4 RAID1+0
RAID controller: P410i/256MB
(Additionally) Energy-efficient functions were turned off in the BIOS and OS.

Excellent, here I have a DL160 G6 with 2 processors, 72GB of RAM, and
that same P410 controller + 4 disks.  I've been meaning to get DBT-2
running on there usefully, your research gives me a reason to do that.

You seem to be in a rush due to the commitfest schedule.  I have some
bad news for you there.  You're not going to see a change here committed
in this CF based on where it's at, so you might as well think about the
best longer term plan.  I would be shocked if anything came out of this
in less than 3 months really.  That's the shortest amount of time I've
ever done something useful in this area.  Each useful benchmarking run
takes me about 3 days of computer time, it's not a very fast development
cycle.

Even if all of your results were great, we'd need to get someone to
duplicate them on another server, and we'd need to make sure they didn't
make other workloads worse.  DBT-2 is very useful, but no one is going
to get a major change to the write logic in the database committed based
on one benchmark.  Past changes like this have used both DBT-2 and a
large number of pgbench tests to get enough evidence of improvement to
commit.  I can help with that part when you get to something I haven't
tried already.  I am very interested in improving this area; it just
takes a lot of work to do it.

-- 
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
attachment: NOTPM-Checkpoints.png
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Greg Smith

On 7/14/13 5:28 PM, james wrote:

Some random seeks during sync can't be helped, but if they are done when
we aren't waiting for sync completion then they are in effect free.


That happens sometimes, but if you measure you'll find this doesn't 
actually occur usefully in the situation everyone dislikes.  In a write 
heavy environment where the database doesn't fit in RAM, backends and/or 
the background writer are constantly writing data out to the OS.  WAL is 
going out constantly as well, and in many cases that's competing for the 
disks too.  The most popular blocks in the database get high usage 
counts and they never leave shared_buffers except at checkpoint time. 
That's easy to prove to yourself with pg_buffercache.


And once the write cache fills, every I/O operation is now competing. 
There is nothing happening for free.  You're stealing I/O from something 
else any time you force a write out.  The optimal throughput path for 
checkpoints turns out to be delaying every single bit of I/O as long as 
possible, in favor of the [backend|bgwriter] writes and WAL.  Whenever 
you delay a buffer write, you have increased the possibility that 
someone else will write the same block again.  And the buffers being 
written by the checkpointer are, on average, the most popular ones in 
the database.  Writing any of them to disk pre-emptively has high odds 
of writing the same block more than once per checkpoint.  And that 
easy-to-measure waste--it shows as more writes/transaction in 
pg_stat_bgwriter--hurts throughput more than any reduction in seek 
overhead you might otherwise get from early writes.  The big gain isn't 
chasing after cheap seeks.  The best path is the one that decreases the 
total volume of writes.


We played this game with the background writer work for 8.3.  The main 
reason the one committed improved on the original design is that it 
completely eliminated doing work on popular buffers in advance. 
Everything happens at the last possible time, which is the optimal 
throughput situation.  The 8.1/8.2 BGW used to try and write things out 
before they were strictly necessary, in hopes that that I/O would be 
free.  But it rarely was, while there was always a cost to forcing them 
to disk early.  And that cost is highest when you're talking about the 
higher usage blocks the checkpointer tends to write.  When in doubt, 
always delay the write in hopes it will be written to again and you'll 
save work.



So it occurs to me that perhaps we can watch for patterns where we have
groups of adjacent writes that might stream, and when they form we might
schedule them...


Stop here.  I mentioned something upthread that is worth repeating.

The checkpointer doesn't know what concurrent reads are happening.  We 
can't even easily make it know, not without adding a whole new source of 
IPC and locking contention among clients.


Whatever scheduling decision the checkpointer might make with its 
limited knowledge of system I/O is going to be poor.  You might find a 
100% write benchmark that it helps, but those are not representative of 
the real world.  In any mixed read/write case, the operating system is 
likely to do better.  That's why things like sorting blocks sometimes 
seem to help someone, somewhere, with one workload, but then aren't 
repeatable.


We can decide to trade throughput for latency by nudging the OS to deal 
with its queued writes more regularly.  That will result in more total 
writes, which is the reason throughput drops.


But the idea that PostgreSQL is going to do a better global job of I/O 
scheduling, that road is a hard one to walk.  It's only going to happen 
if we pull all of the I/O into the database *and* do a better job on the 
entire process than the existing OS kernel does.  That sort of dream, of 
outperforming the filesystem, it is very difficult to realize.  There's 
a good reason that companies like Oracle stopped pushing so hard on 
recommending raw partitions.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Jeff Janes
On Sunday, July 14, 2013, Greg Smith wrote:

 On 6/27/13 11:08 AM, Robert Haas wrote:

 I'm pretty sure Greg Smith tried the fixed-sleep thing before and
 it didn't work that well.


 That's correct, I spent about a year whipping that particular horse and
 submitted improvements on it to the community.
 http://www.postgresql.org/message-id/4d4f9a3d.5070...@2ndquadrant.com
 and its updates downthread are good ones to compare this current work
 against.

 The important thing to realize about just delaying fsync calls is that it
 *cannot* increase TPS throughput.  Not possible in theory, obviously
 doesn't happen in practice.  The most efficient way to write things out is
 to delay those writes as long as possible.  The longer you postpone a
 write, the more elevator sorting and write combining you get out of the OS.
  This is why operating systems like Linux come tuned for such delayed
 writes in the first place.  Throughput and latency are linked; any patch
 that aims to decrease latency will probably slow throughput.


Do common low level IO benchmarking tools cover this territory?  I've
looked at Bonnie, which seems to be the most famous one, and it doesn't
look like it covers effectiveness of write combining at all.

I've done my own casual benchmarking, and the results were astonishingly
bad for the OS/FS.  If I over-wrote 1024*1024 blocks of 8KB in random order
and then fsynced the 8GB of data (divided into 8x1GB files, in deference to
PG segment size) it took way longer than if I did the overwrite in block
order and then fsynced that.   This was a gift-horse machine not speced out
to be a database server, but the linux kernel is still the kernel
regardless of the hardware it sits on so I don't how much that should
matter.  To be clear, the writes did not take longer, it was the fsyncs
that took longer. All writes were successfully absorbed into memory
promptly.  Alas, I no longer have access to a machine which can absorb 8GB
of writes into RAM without thinking twice and which I can use for casual
experimentation.
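
For anyone who wants to repeat the experiment, here is a rough
reconstruction of it, simplified to a single preallocated 1GB file
rather than the 8x1GB described above; the actual harness was not
posted:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define NBLOCKS	131072L				/* 1GB worth of 8K blocks */
#define BLKSZ	8192

int
main(int argc, char **argv)
{
	int		rnd = (argc > 1 && strcmp(argv[1], "random") == 0);
	int		fd = open("testfile", O_RDWR);	/* preallocated 1GB file */
	char	buf[BLKSZ];
	struct timespec t0, t1;

	if (fd < 0)
		return 1;
	memset(buf, 'x', sizeof buf);
	srandom(42);

	/* dirty every block once, in random or ascending order */
	for (long i = 0; i < NBLOCKS; i++)
	{
		long	blk = rnd ? random() % NBLOCKS : i;

		pwrite(fd, buf, BLKSZ, (off_t) blk * BLKSZ);
	}

	/* RAM absorbs the writes promptly either way; the difference
	 * shows up in how long the fsync takes */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	fsync(fd);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("fsync took %.3f s\n",
		   (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	close(fd);
	return 0;
}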

Cheers,

Jeff


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-14 Thread Jeff Janes
On Sunday, July 14, 2013, Greg Smith wrote:

 On 7/14/13 5:28 PM, james wrote:

 Some random seeks during sync can't be helped, but if they are done when
 we aren't waiting for sync completion then they are in effect free.


 That happens sometimes, but if you measure you'll find this doesn't
 actually occur usefully in the situation everyone dislikes.  In a write
 heavy environment where the database doesn't fit in RAM, backends and/or
 the background writer are constantly writing data out to the OS.  WAL is
 going out constantly as well, and in many cases that's competing for the
 disks too.


While I think it is probably true that many systems don't separate WAL from
non-WAL to different IO controllers, is it true that many systems that are
in need of heavy IO tuning don't do so?  I thought that that would be the
first stop for any DBA of an highly IO-write constrained database.



  The most popular blocks in the database get high usage counts and they
 never leave shared_buffers except at checkpoint time. That's easy to prove
 to yourself with pg_buffercache.

 And once the write cache fills, every I/O operation is now competing.
 There is nothing happening for free.  You're stealing I/O from something
 else any time you force a write out.  The optimal throughput path for
 checkpoints turns out to be delaying every single bit of I/O as long as
 possible, in favor of the [backend|bgwriter] writes and WAL.  Whenever you
 delay a buffer write, you have increased the possibility that someone else
 will write the same block again.   And the buffers being written by the
 checkpointer are, on average, the most popular ones in the database.
  Writing any of them to disk pre-emptively has high odds of writing the
 same block more than once per checkpoint.



Should the checkpointer make multiple passes over the buffer pool, writing
out the high usage_count buffers first, because no one else is going to do
it, and then going back for the low usage_count buffers in the hope they
were already written out?  On the other hand, if the checkpointer writes
out a low-usage buffer, why would anyone else need to write it again soon?
 If it were likely to get dirtied often, it wouldn't be low usage.  If it
was dirtied rarely, it wouldn't be dirty anymore once written.

Cheers,

Jeff


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-11 Thread KONDO Mitsumasa
Hi,

I created fsync v3, v4, and v5 patches and tested them.

* Changes
 - Consider the total checkpoint schedule in the fsync phase (v3 v4 v5)
 - Consider the total checkpoint schedule in the write phase (v4 only)
 - Modify some implementations from v3 (v5 only)


I use a linear combination method to schedule the total checkpoint across the
write phase and fsync phase. The v3 patch considers only the fsync phase, the
v4 patch considers both the write phase and the fsync phase, and the v5 patch
again considers only the fsync phase.
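
The mail does not spell the formula out, but a linear combination here
presumably means something like the following, where the weight is an
assumption for illustration rather than a value from the patch:

/* combined checkpoint progress as a weighted sum of the two phases;
 * WRITE_PHASE_WEIGHT is an assumed coefficient, not from the patch */
#define WRITE_PHASE_WEIGHT	0.8

static double
checkpoint_progress(double write_done, double sync_done)
{
	/* both arguments are fractions in [0,1] */
	return WRITE_PHASE_WEIGHT * write_done +
		(1.0 - WRITE_PHASE_WEIGHT) * sync_done;
}

/* the checkpointer sleeps whenever this combined progress runs ahead
 * of the elapsed fraction of the checkpoint target time */
static int
ahead_of_schedule(double write_done, double sync_done, double elapsed)
{
	return checkpoint_progress(write_done, sync_done) > elapsed;
}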

The test results are below. The benchmark settings and server are the same as in 
the previous test. '-*' shows the checkpoint_completion_target of each test. All 
tests except 'fsync v3-0.7_disabled' set 'checkpointer_fsync_delay_ratio=1' and 
'checkpointer_fsync_delay_threshold=1000'; 'fsync v3-0.7_disabled' sets 
'checkpointer_fsync_delay_ratio=0' and 'checkpointer_fsync_delay_threshold=-1'. 
The v5 patch is still being tested :-), but it should score the same as the v3 
patch.

* Result
** DBT-2 result
 | NOTPM | 90%tile | Average | S.Deviation | Maximum
-+---+-+-+-+
fsync v3-0.7 | 3649.02   | 9.703   | 4.226   | 3.853   | 21.754
fsync v3-0.9 | 3694.41   | 9.897   | 3.874   | 4.016   | 20.774
fsync v3-0.7_disabled| 3583.28   | 10.966  | 4.684   | 4.866   | 31.545
fsync v4-0.7 | 3546.38   | 12.734  | 5.062   | 4.798   | 24.468
fsync v4-0.9 | 3670.81   | 9.864   | 4.130   | 3.665   | 19.236

** Average checkpoint duration (sec) (not including the loading period)
 | write_duration | sync_duration | total  | punctual to checkpoint schedule
-+----------------+---------------+--------+---------------------------------
fsync v3-0.7 | 296.6  | 251.8898  | 548.48 | OK
fsync v3-0.9 | 292.086| 276.4525  | 568.53 | OK
fsync v3-0.7_disabled| 303.5706   | 155.6116  | 459.18 | OK
fsync v4-0.7 | 273.8338   | 355.6224  | 629.45 | OK
fsync v4-0.9 | 329.0522   | 231.77| 560.82 | OK

** Increase of checkpoint duration (%) (Reference point is 'fsync v3-0.7_disabled'.)
 | write_duration | sync_duration | total
-++---+---
fsync v3-0.7 | 97.7%  | 161.9%| 119.4%
fsync v3-0.9 | 96.2%  | 177.7%| 123.8%
fsync v3-0.7_disabled| 100.0% | 100.0%| 100.0%
fsync v4-0.7 | 90.2%  | 228.5%| 137.1%
fsync v4-0.9 | 108.4% | 148.9%| 122.1%


* Examination
** DBT-2 result
The v3 patch shows good results: response time is about 10%-30% faster and NOTPM 
is about 5% higher than with no sleep (fsync v3-0.7_disabled), while the v4 patch 
does not do as well. However, 'fsync v4-0.9' matches the v3 patch's score when 
checkpoint_completion_target is larger. I think that folding both the write phase 
and the fsync phase into the checkpoint schedule makes the IO schedule harsher, 
because the write phase schedule becomes stricter than a normal write phase, 
which then also hurts the fsync phase that follows.

** Average checkpoint duration
All methods keep to the checkpoint schedule. With fsync sleep enabled, the fsync 
time is longer, but the total time is much the same as with no sleep. 
'fsync v4-0.7' shows a very bad sync duration and total time, which indicates 
that changing checkpoint_completion_target is very delicate. It is better not to 
change the write phase scheduling; it should stay the same as it used to be. 
With normal settings the write phase already has enough time to keep to the 
checkpoint schedule. And I think many users want compatibility with the old 
version.

What do you think about these patches?

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..d09fe4f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -143,14 +143,16 @@ static CheckpointerShmemStruct *CheckpointerShmem;
  */
 int			CheckPointTimeout = 300;
 int			CheckPointWarning = 30;
+int			CheckPointerFsyncDelayThreshold = -1;
 double		CheckPointCompletionTarget = 0.5;
+double		CheckPointerFsyncDelayRatio = 0.0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
-static volatile sig_atomic_t shutdown_requested = false;
+extern volatile sig_atomic_t checkpoint_requested = false;
+extern volatile sig_atomic_t shutdown_requested = false;
 
 /*
  * Private state
@@ -168,8 +170,6 @@ static pg_time_t last_xlog_switch_time;
 /* Prototypes for private functions */
 
 static void CheckArchiveTimeout(void);
-static bool IsCheckpointOnSchedule(double progress);
-static bool 

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-08 Thread KONDO Mitsumasa
I created the fsync v2 patch. There's not much time, so I will focus on the fsync 
patch in this commitfest, as advised by Heikki. I'm sorry for diverging from the 
main discussion in this commitfest... Of course, I will continue to try other 
improvements.


* Changes
 - Add ckpt_flag to mdsync() etc., following Heikki's patch. It makes mdsync() 
more controllable during a checkpoint.
 - Sleeping too long in the fsync phase is bad for the checkpoint schedule, so I 
cap the sleep time at 10 seconds (MAX_FSYNC_SLEEP); see the sketch after this 
list. I think a 10 second sleep is a suitable value in various situations. I 
also considered limiting the sleep time by checkpoint progress, but I thought 
md.c should stay simple and robust, so I kept it simple.
 - The maximum checkpoint_fsync_sleep_ratio in guc.c is changed from 1 to 2. 
Because the sleep time is now capped at 10 seconds, we can change it more 
flexibly and still be safe.

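A minimal sketch of the capped sleep, assuming the delay is taken in
proportion to how long the preceding fsync ran once it exceeds the
threshold (the mail implies but does not state this exactly; the names
are illustrative):

#include <unistd.h>

#define MAX_FSYNC_SLEEP_US	(10L * 1000000L)	/* hard 10 second cap */

static void
sleep_after_fsync(long fsync_us, long threshold_us, double ratio)
{
	long	sleep_us;

	if (fsync_us < threshold_us)
		return;					/* fast fsync, no delay needed */
	sleep_us = (long) (fsync_us * ratio);
	if (sleep_us > MAX_FSYNC_SLEEP_US)
		sleep_us = MAX_FSYNC_SLEEP_US;	/* one pathological fsync
										 * cannot blow the schedule */
	usleep(sleep_us);
}
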

And I considered whether the parameters in my fsync patch could be simplified.
 * checkpoint_fsync_delay_threshold
  In general, I think about 1 second is suitable in various environments.
  If we want to adjust the sleep time in the fsync phase, we can change 
checkpoint_fsync_sleep_ratio.


 * checkpoint_fsync_sleep_ratio
  I don't want to omit this parameter, because it is the only way to regulate 
the sleep time in the fsync phase and the checkpoint time.



* Benchmark Result (DBT-2)
                          | NOTPM   | Average | 90%tile | Maximum
 -------------------------+---------+---------+---------+---------
 original_0.7 (baseline)  | 3610.42 | 4.556   | 10.9180 | 23.1326
 fsync v1                 | 3685.51 | 4.036   |  9.2017 | 17.5594
 fsync v2                 | 3748.80 | 3.562   |  8.1871 | 17.5101

I'm not sure about this result; the fsync v2 patch looks almost too good. Of 
course I didn't do anything unusual while executing the benchmark.
Please see checkpoint_time.txt, which records the details of each checkpoint. 
The fsync v2 patch seems to shorten each checkpoint.



* Benchmark Setting
 [postgresql.conf]
  archive_mode = on
  archive_command = '/bin/cp %p /pgdata/pgarch/arc_dbt2/%f'
  synchronous_commit = on
  max_connections = 300
  shared_buffers = 2458MB
  work_mem = 1MB
  fsync = on
  wal_sync_method = fdatasync
  full_page_writes = on
  checkpoint_segments = 300
  checkpoint_timeout = 15min
  checkpoint_completion_target = 0.7
  segsize=1GB(default)

 [patched postgresql.conf (add)]
  checkpointer_fsync_delay_ratio = 1
  checkpointer_fsync_delay_threshold = 1000ms

 [DBT-2 driver settings]
  SESSION:250
  WH:340
  TPW:10
  PRETEST_DURATION: 1800
  TEST_DURATION: 1800


* Test Server
  Server: HP Proliant DL360 G7
  CPU:Xeon E5640 2.66GHz (1P/4C)
  Memory: 18GB(PC3-10600R-9)
  Disk:   146GB(15k)*4 RAID1+0
  RAID controller: P410i/256MB
  (Additionally) Energy-efficient functions were turned off in the BIOS and OS.

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..2b223e9 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -143,14 +143,16 @@ static CheckpointerShmemStruct *CheckpointerShmem;
  */
 int			CheckPointTimeout = 300;
 int			CheckPointWarning = 30;
+int			CheckPointerFsyncDelayThreshold = -1;
 double		CheckPointCompletionTarget = 0.5;
+double		CheckPointerFsyncDelayRatio = 0.0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
-static volatile sig_atomic_t shutdown_requested = false;
+extern volatile sig_atomic_t checkpoint_requested = false;
+extern volatile sig_atomic_t shutdown_requested = false;
 
 /*
  * Private state
@@ -169,7 +171,6 @@ static pg_time_t last_xlog_switch_time;
 
 static void CheckArchiveTimeout(void);
 static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
@@ -643,7 +644,7 @@ CheckArchiveTimeout(void)
  * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
  * there is one pending behind it.)
  */
-static bool
+extern bool
 ImmediateCheckpointRequested(void)
 {
 	if (checkpoint_requested)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8079226..3f02d0b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1828,7 +1828,7 @@ CheckPointBuffers(int flags)
 	BufferSync(flags);
 	CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
-	smgrsync();
+	smgrsync(flags);
 	CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..d762511 100644
--- 

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-05 Thread KONDO Mitsumasa

(2013/07/05 0:35), Joshua D. Drake wrote:

On 07/04/2013 06:05 AM, Andres Freund wrote:

Presumably the smaller segsize is better because we don't
completely stall the system by submitting up to 1GB of io at once. So,
if we were to do it in 32MB chunks and then do a final fsync()
afterwards we might get most of the benefits.

Yes, I will try to test this setting './configure --with-segsize=0.03125' tonight.
I will send you the test result tomorrow.




I did testing on this a few years ago: I tried 2MB segments instead of 16MB,
thinking similarly to you. It failed miserably; performance completely tanked.
Just as you say, the test result was miserable... Too small a segsize is bad for 
performance. It might be improved by separate directories, but too many FDs with 
open() and close() seem to be bad. However, I think this implementation has 
potential to improve IO performance, so we need to test it with several 
methods.


* Performance result in DBT-2 (WH340)
                                  | NOTPM   | 90%tile   | Average | Maximum
 ---------------------------------+---------+-----------+---------+-----------
 original_0.7 (baseline)          | 3474.62 | 18.348328 | 5.739   | 36.977713
 fsync + write                    | 3586.85 | 14.459486 | 4.960   | 27.266958
 fsync + write + segsize=0.25     | 3661.17 |  8.28816  | 4.117   | 17.23191
 fsync + write + segsize=0.03125  | 3309.99 | 10.851245 | 6.759   | 19.500598


(2013/07/04 22:05), Andres Freund wrote:
 1) it breaks pg_upgrade. Which means many of the bigger users won't be
 able to migrate to this and most packagers would carry the old
 segsize around forever.
 Even if we could get pg_upgrade to split files accordingly link mode
 would still be broken.
I think that pg_upgrade is contrib, not the main implementation of Postgres, so 
contrib should not stand in the way of improving the main implementation. 
pg_upgrade users might share this opinion.


 2) It drastically increases the amount of file handles neccessary and by
 extension increases the amount of open/close calls. Those aren't all
 that cheap. And it increases metadata traffic since mtime/atime are
 kept for more files. Also, file creation is rather expensive since it
 requires metadata transaction on the filesystem level.
My test result seemed to show this problem. But my test did not use separate 
directories in base/. I'm not sure which way is best. If you have time to create 
a patch, please send it to us and I will test it in DBT-2.


Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-04 Thread KONDO Mitsumasa

(2013/07/03 22:31), Robert Haas wrote:

On Wed, Jul 3, 2013 at 4:18 AM, KONDO Mitsumasa
kondo.mitsum...@lab.ntt.co.jp wrote:

I tested with segsize=0.25GB, which is the maximum table segment file size (the
default is 1GB), via the configure option (./configure --with-segsize=0.25),
because I thought a small segsize would be good for the fsync phase and for the
OS background disk writes during checkpoint. I got significant improvements in the DBT-2 result!


This is interesting.  Unfortunately, it has a significant downside:
potentially, there will be a lot more files in the data directory.  As
it is, the number of files that exist there today has caused
performance problems for some of our customers.  I'm not sure off-hand
to what degree those problems have been related to overall inode
consumption vs. the number of files in the same directory.
Did you change the max number of FDs per process in the kernel parameters? In the 
default setting, the max number of FDs per process is 1024. I think that might be 
over the limit for a 500GB-class database. Or this problem might be caused by 
_mdfd_getseg() in md.c. In the write phase, dirty buffers don't carry their own 
FD, so for each dirty buffer we seek to find the right FD and check the file. 
That is safe file writing, but it might be too wasteful. I think a BufferTag 
should carry its own FD; that would make checkpoint writing more efficient.



If the problem is mainly with number of of files in the same
directory, we could consider revising our directory layout.  Instead
of:

base/${DBOID}/${RELFILENODE}_{FORK}

We could have:

base/${DBOID}/${FORK}/${RELFILENODE}

That would move all the vm and fsm forks to separate directories,
which would cut down the number of files in the main-fork directory
significantly.  That might be worth doing independently of the issue
you're raising here.  For large clusters, you'd even want one more
level to keep the directories from getting too big:

base/${DBOID}/${FORK}/${X}/${RELFILENODE}

...where ${X} is two hex digits, maybe just the low 16 bits of the
relfilenode number.  But this would be not as good for small clusters
where you'd end up with oodles of little-tiny directories, and I'm not
sure it'd be practical to smoothly fail over from one system to the
other.

It seems a good idea! In general, the base directory is not seen by the user,
so it could use a more efficient arrangement for performance and better suit
large databases.
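
As a sketch, a path under Robert's proposed layout could be built like
this (hypothetical helper, not PostgreSQL code; it assumes ${X} is the
low byte of the relfilenode printed as two hex digits, since the mail
leaves the exact width open):

#include <stdio.h>

static void
rel_path(char *buf, size_t len,
		 unsigned dboid, const char *fork, unsigned relfilenode)
{
	snprintf(buf, len, "base/%u/%s/%02x/%u",
			 dboid, fork, relfilenode & 0xff, relfilenode);
}

For example, rel_path(buf, sizeof buf, 16384, "main", 16420) yields
base/16384/main/24/16420.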

(2013/07/03 22:39), Andres Freund wrote:
 Hm. I wonder how much of this could be gained by doing a
 sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing
 the original checkpoint-pass through the buffers or when fsyncing the
 files.
The sync_file_range system call is interesting. But it is supported only by Linux 
kernel 2.6.22 or later. For postgresql, Robert's idea, which does not depend on 
the kind of OS, suits better.


 Presumably the smaller segsize is better because we don't
 completely stall the system by submitting up to 1GB of io at once. So,
 if we were to do it in 32MB chunks and then do a final fsync()
 afterwards we might get most of the benefits.
Yes, I will try to test this setting './configure --with-segsize=0.03125' tonight.
I will send you the test result tomorrow.

I think the best way to write buffers in a checkpoint is sorted by the buffer's 
FD and block number, with a small segsize setting and appropriate sleep times for 
each. It would realize a genuine sorted checkpoint with sequential disk writing!


Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-04 Thread Andres Freund
On 2013-07-04 21:28:11 +0900, KONDO Mitsumasa wrote:
 That would move all the vm and fsm forks to separate directories,
 which would cut down the number of files in the main-fork directory
 significantly.  That might be worth doing independently of the issue
 you're raising here.  For large clusters, you'd even want one more
 level to keep the directories from getting too big:
 
 base/${DBOID}/${FORK}/${X}/${RELFILENODE}
 
 ...where ${X} is two hex digits, maybe just the low 16 bits of the
 relfilenode number.  But this would be not as good for small clusters
 where you'd end up with oodles of little-tiny directories, and I'm not
 sure it'd be practical to smoothly fail over from one system to the
 other.
 It seems a good idea! In general, the base directory is not seen by the user,
 so it could use a more efficient arrangement for performance and better
 suit large databases.

  Presumably the smaller segsize is better because we don't
  completely stall the system by submitting up to 1GB of io at once. So,
  if we were to do it in 32MB chunks and then do a final fsync()
  afterwards we might get most of the benefits.
 Yes, I will try to test this setting './configure --with-segsize=0.03125' tonight.
 I will send you the test result tomorrow.

I don't like going in this direction at all:
1) it breaks pg_upgrade. Which means many of the bigger users won't be
   able to migrate to this and most packagers would carry the old
   segsize around forever.
   Even if we could get pg_upgrade to split files accordingly link mode
   would still be broken.
2) It drastically increases the amount of file handles neccessary and by
   extension increases the amount of open/close calls. Those aren't all
   that cheap. And it increases metadata traffic since mtime/atime are
   kept for more files. Also, file creation is rather expensive since it
   requires metadata transaction on the filesystem level.
3) It breaks readahead since that usually only works within a single
   file. I am pretty sure that this will significantly slow down
   uncached sequential reads on larger tables.

 (2013/07/03 22:39), Andres Freund wrote:
  Hm. I wonder how much of this could be gained by doing a
  sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing
  the original checkpoint-pass through the buffers or when fsyncing the
  files.
 The sync_file_range system call is interesting. But it is supported only by
 Linux kernel 2.6.22 or later. For postgresql, Robert's idea, which does not
 depend on the kind of OS, suits better.

Well. But it can be implemented without breaking things... Even if we
don't have sync_file_range() we can cope by simply doing fsync()s more
frequently. For every open file keep track of the amount of buffers
dirtied and every 32MB or so issue an fdatasync()/fsync().
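
A sketch of that bookkeeping, counting bytes per open file and nudging
the kernel every 32MB. Calling sync_file_range() with nbytes = 0
flushes from the offset to the end of the file; the function name and
surrounding structure are illustrative:

#define _GNU_SOURCE				/* for sync_file_range() on Linux */
#include <fcntl.h>
#include <unistd.h>

#define FLUSH_CHUNK	(32L * 1024 * 1024)

static void
note_write(int fd, long *bytes_since_flush, long just_written)
{
	*bytes_since_flush += just_written;
	if (*bytes_since_flush < FLUSH_CHUNK)
		return;
#ifdef SYNC_FILE_RANGE_WRITE
	/* start asynchronous writeback of everything queued so far,
	 * without waiting for it to complete */
	sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
#else
	fdatasync(fd);				/* portable fallback; this one waits */
#endif
	*bytes_since_flush = 0;
}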

 I think that best way to write buffers in checkpoint is sorted by buffer's
 FD and block-number with small segsize setting and each property sleep
 times. It will realize genuine sorted checkpoint with sequential disk
 writing!

That would make regular fdatasync()ing even easier.

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-04 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 I don't like going in this direction at all:
 1) it breaks pg_upgrade. Which means many of the bigger users won't be
able to migrate to this and most packagers would carry the old
segsize around forever.
Even if we could get pg_upgrade to split files accordingly link mode
would still be broken.

TBH, I think *any* rearrangement of the on-disk storage files is going
to be rejected.  It seems very unlikely to me that you could demonstrate
a checkpoint performance improvement from that that occurs consistently
across different platforms and filesystems.  And as Andres points out,
the pain associated with it is going to be bad enough that a very high
bar will be set on whether you've proven the change is worthwhile.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-04 Thread Joshua D. Drake


On 07/04/2013 06:05 AM, Andres Freund wrote:


Presumably the smaller segsize is better because we don't
completely stall the system by submitting up to 1GB of io at once. So,
if we were to do it in 32MB chunks and then do a final fsync()
afterwards we might get most of the benefits.

Yes, I will try testing './configure --with-segsize=0.03125' tonight.
I will send you the test result tomorrow.




I did testing on this a few years ago; I tried 2MB segments rather than 
16MB, thinking similarly to you. It failed miserably: performance 
completely tanked.


JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-03 Thread KONDO Mitsumasa
Hi,

I tested with segsize=0.25GB, which is the maximum size of a partitioned
relation file (the default is 1GB), set via the configure option
(./configure --with-segsize=0.25), because I thought a small segsize would be
good for the fsync phase and for the OS's background disk writes during a
checkpoint. I got significant improvements in the DBT-2 result!

* Performance result in DBT-2 (WH340)
                               | NOTPM    90%tile    Average  Maximum
 ------------------------------+-------------------------------------
 original_0.7 (baseline)       | 3474.62  18.348328  5.739    36.977713
 fsync + write                 | 3586.85  14.459486  4.960    27.266958
 fsync + write + segsize=0.25  | 3661.17  8.28816    4.117    17.23191

Changing segsize together with my checkpoint patches improved on the original
by more than 50% in the 90%tile and maximum response times.

However, these tests were not run under the same conditions... I also changed
the SESSION parameter in the DBT-2 driver from 100 to 300. In general, I hear
a good SESSION value is 100, and I do not yet understand the optimal DBT-2
parameters very well. So I will retest my patches and the baseline with
optimized DBT-2 parameters. Please wait for a while.

Best regards,
-- 
Mitsumasa KONDO
NTT Open Source Software Center
diff --git a/configure b/configure
index 7c662c3..6269cb9 100755
--- a/configure
+++ b/configure
@@ -2879,7 +2879,7 @@ $as_echo "$as_me: error: Invalid block size. Allowed values are 1,2,4,8,16,32."
 esac
 { $as_echo "$as_me:$LINENO: result: ${blocksize}kB" >&5
 $as_echo "${blocksize}kB" >&6; }
-
+echo ${blocksize}
 
 cat >>confdefs.h <<_ACEOF
 #define BLCKSZ ${BLCKSZ}
@@ -2917,14 +2917,15 @@ else
   segsize=1
 fi
 
-
 # this expression is set up to avoid unnecessary integer overflow
 # blocksize is already guaranteed to be a factor of 1024
-RELSEG_SIZE=`expr '(' 1024 / ${blocksize} ')' '*' ${segsize} '*' 1024`
-test $? -eq 0 || exit 1
+#RELSEG_SIZE=`expr '(' 1024 / ${blocksize} ')' '*' ${segsize} '*' 1024`
+RELSEG_SIZE=`echo 1024/$blocksize*$segsize*1024 | bc`
+#test $? -eq 0 || exit 1
 { $as_echo "$as_me:$LINENO: result: ${segsize}GB" >&5
 $as_echo "${segsize}GB" >&6; }
-
+echo ${segsize}
+echo ${RELSEG_SIZE}
 
 cat >>confdefs.h <<_ACEOF
 #define RELSEG_SIZE ${RELSEG_SIZE}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-03 Thread Robert Haas
On Wed, Jul 3, 2013 at 4:18 AM, KONDO Mitsumasa
kondo.mitsum...@lab.ntt.co.jp wrote:
 I tested with segsize=0.25GB, which is the maximum size of a partitioned
 relation file (the default is 1GB), set via the configure option
 (./configure --with-segsize=0.25), because I thought a small segsize would
 be good for the fsync phase and for the OS's background disk writes during
 a checkpoint. I got significant improvements in the DBT-2 result!

This is interesting.  Unfortunately, it has a significant downside:
potentially, there will be a lot more files in the data directory.  As
it is, the number of files that exist there today has caused
performance problems for some of our customers.  I'm not sure off-hand
to what degree those problems have been related to overall inode
consumption vs. the number of files in the same directory.

If the problem is mainly with the number of files in the same
directory, we could consider revising our directory layout.  Instead
of:

base/${DBOID}/${RELFILENODE}_{FORK}

We could have:

base/${DBOID}/${FORK}/${RELFILENODE}

That would move all the vm and fsm forks to separate directories,
which would cut down the number of files in the main-fork directory
significantly.  That might be worth doing independently of the issue
you're raising here.  For large clusters, you'd even want one more
level to keep the directories from getting too big:

base/${DBOID}/${FORK}/${X}/${RELFILENODE}

...where ${X} is two hex digits, maybe just the low 16 bits of the
relfilenode number.  But this would be not as good for small clusters
where you'd end up with oodles of little-tiny directories, and I'm not
sure it'd be practical to smoothly fail over from one system to the
other.
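
For illustration only, a hypothetical helper that builds such a fan-out path
(not an existing PostgreSQL function; note that two hex digits cover the low
8 bits, so the low 16 bits mentioned above would need four digits):

#include <stdio.h>

static void
fanout_relpath(char *buf, size_t buflen, unsigned dboid,
               const char *fork, unsigned relfilenode)
{
    /* base/${DBOID}/${FORK}/${X}/${RELFILENODE}, X = low 8 bits in hex */
    snprintf(buf, buflen, "base/%u/%s/%02x/%u",
             dboid, fork, relfilenode & 0xff, relfilenode);
}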

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-03 Thread Andres Freund
On 2013-07-03 17:18:29 +0900, KONDO Mitsumasa wrote:
 Hi,
 
 I tested with segsize=0.25GB, which is the maximum size of a partitioned
 relation file (the default is 1GB), set via the configure option
 (./configure --with-segsize=0.25), because I thought a small segsize would
 be good for the fsync phase and for the OS's background disk writes during
 a checkpoint. I got significant improvements in the DBT-2 result!
 
 * Performance result in DBT-2 (WH340)
                                | NOTPM    90%tile    Average  Maximum
  ------------------------------+-------------------------------------
  original_0.7 (baseline)       | 3474.62  18.348328  5.739    36.977713
  fsync + write                 | 3586.85  14.459486  4.960    27.266958
  fsync + write + segsize=0.25  | 3661.17  8.28816    4.117    17.23191
 
 Changing segsize together with my checkpoint patches improved on the
 original by more than 50% in the 90%tile and maximum response times.

Hm. I wonder how much of this could be gained by doing a
sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing
the original checkpoint-pass through the buffers or when fsyncing the
files. Presumably the smaller segsize is better because we don't
completely stall the system by submitting up to 1GB of io at once. So,
if we were to do it in 32MB chunks and then do a final fsync()
afterwards we might get most of the benefits.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-07-03 Thread Gavin Flower

On 04/07/13 01:31, Robert Haas wrote:

On Wed, Jul 3, 2013 at 4:18 AM, KONDO Mitsumasa
kondo.mitsum...@lab.ntt.co.jp wrote:

I tested and changed segsize=0.25GB which is max partitioned table file size and
default setting is 1GB in configure option (./configure --with-segsize=0.25).
Because I thought that small segsize is good for fsync phase and background disk
write in OS in checkpoint. I got significant improvements in DBT-2 result!

This is interesting.  Unfortunately, it has a significant downside:
potentially, there will be a lot more files in the data directory.  As
it is, the number of files that exist there today has caused
performance problems for some of our customers.  I'm not sure off-hand
to what degree those problems have been related to overall inode
consumption vs. the number of files in the same directory.

If the problem is mainly with the number of files in the same
directory, we could consider revising our directory layout.  Instead
of:

base/${DBOID}/${RELFILENODE}_{FORK}

We could have:

base/${DBOID}/${FORK}/${RELFILENODE}

That would move all the vm and fsm forks to separate directories,
which would cut down the number of files in the main-fork directory
significantly.  That might be worth doing independently of the issue
you're raising here.  For large clusters, you'd even want one more
level to keep the directories from getting too big:

base/${DBOID}/${FORK}/${X}/${RELFILENODE}

...where ${X} is two hex digits, maybe just the low 16 bits of the
relfilenode number.  But this would be not as good for small clusters
where you'd end up with oodles of little-tiny directories, and I'm not
sure it'd be practical to smoothly fail over from one system to the
other.


16 bits == 4 hex digits

Could you perhaps start with 1 hex digit, and automagically increase it 
to 2, 3, .. as needed?  There could be a status file at that level, that 
would indicate the current number of hex digits, plus a temporary 
mapping file when in transition.


Cheers,
Gavin


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-28 Thread KONDO Mitsumasa

(2013/06/28 0:08), Robert Haas wrote:

On Tue, Jun 25, 2013 at 4:28 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
I'm pretty sure Greg Smith tried the fixed-sleep thing before and
it didn't work that well.  I have also tried it and the resulting
behavior was unimpressive.  It makes checkpoints take a long time to
complete even when there's very little data to flush out to the OS,
which is annoying; and when things actually do get ugly, the sleeps
aren't long enough to matter.  See the timings Kondo-san posted
downthread: 100ms delays aren't going let the system recover in any
useful way when the fsync can take 13 s for one file.  On a system
that's badly weighed down by I/O, the fsync times are often
*extremely* long - 13 s is far from the worst you can see.  You have
to give the system a meaningful time to recover from that, allowing
other processes to make meaningful progress before you hit it again,
or system performance just goes down the tubes.  Greg's test, IIRC,
used 3 s sleeps rather than your proposal of 100 ms, but it still
wasn't enough.

Yes. In the write phase, the checkpointer writes numerous 8KB dirty pages,
one per SyncOneBuffer(), so a tiny (100ms) sleep can work well. But in the
fsync phase, the checkpointer flushes dozens of relation files, one per
fsync(), so a tiny sleep cannot work well; a longer sleep is needed for IO
performance to recover. If we want to know the best sleep time, we had better
use the previous fsync time. And if we want to prevent single very long
fsyncs, we had better make the relation file size (1GB maximum by default)
smaller.


Back to the subject. Here are our patches' test results. The fsync + write
patch did not get a good result last time, so I reran the benchmark under the
same conditions. It now seems to perform better than before.


* Performance result in DBT-2 (WH340)
               | TPS      90%tile    Average  Maximum
---------------+-------------------------------------
original_0.7   | 3474.62  18.348328  5.739    36.977713
original_1.0   | 3469.03  18.637865  5.842    41.754421
fsync          | 3525.03  13.872711  5.382    28.062947
write          | 3465.96  19.653667  5.804    40.664066
fsync + write  | 3586.85  14.459486  4.960    27.266958
Heikki's patch | 3504.3   19.731743  5.761    38.33814

* HTML result in DBT-2
http://pgstatsinfo.projects.pgfoundry.org/RESULT/

In the attached text, I also describe each checkpoint's timing. The fsync
patch seemed to make checkpoints take longer than without it. However, the
checkpoints still finish on schedule within checkpoint_timeout and the
allowable time. I think the most important thing in the fsync phase is not
that the checkpoint finishes fast, but that pages are definitely and reliably
written by the end of the checkpoint. So my fsync patch is not working
incorrectly.


My write patch still holds a lot of riddles, so I will try to investigate
objective results and a theory of its effect.


Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
* Performance result
               | TPS      90%tile    Average  Maximum
---------------+-------------------------------------
original_0.7   | 3474.62  18.348328  5.739    36.977713
original_1.0   | 3469.03  18.637865  5.842    41.754421
fsync          | 3525.03  13.872711  5.382    28.062947
write          | 3465.96  19.653667  5.804    40.664066
fsync + write  | 3586.85  14.459486  4.960    27.266958
Heikki's patch | 3504.3   19.731743  5.761    38.33814


* Checkpoint duration
# original_0.7
 instid | start                      | flags | num_buffers | xlog_added | xlog_removed | xlog_recycled | write_duration | sync_duration | total_duration
--------+----------------------------+-------+-------------+------------+--------------+---------------+----------------+---------------+----------------
     14 | 2013-06-19 15:15:24.658+09 | xlog  |         281 |          0 |            0 |             0 |         29.038 |          2.69 |         31.798
     14 | 2013-06-19 15:17:13.212+09 | xlog  |         177 |          0 |            0 |           300 |           17.9 |         0.886 |         18.818
     14 | 2013-06-19 15:18:45.525+09 | xlog  |         306 |          0 |            0 |           300 |          30.72 |         4.011 |          35.11
     14 | 2013-06-19 15:20:26.951+09 | xlog  |         215 |          0 |            0 |           300 |         21.952 |         2.148 |         24.197
     14 | 2013-06-19 15:21:56.425+09 | xlog  |         182 |          0 |            0 |           300 |         18.527 |         6.323 |         25.069
     14 | 2013-06-19 15:27:26.074+09 | xlog  |       15770 |          0 |            0 |           300 |        335.431 |        80.269 |        420.405
     14 | 2013-06-19 15:42:26.272+09 | time  |       82306 |          0 |            0 |           300 |         209.34 |       119.159 |        333.762
     14 | 2013-06-19 15:57:26.025+09 | time  |       88965 |          0 |            0 |           247 |        211.095 |

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-27 Thread Robert Haas
On Tue, Jun 25, 2013 at 4:28 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 The only feedback we have on how bad things are is how long it took
 the last fsync to complete, so I actually think that's a much better
 way to go than any fixed sleep - which will often be unnecessarily
 long on a well-behaved system, and which will often be far too short
 on one that's having trouble. I'm inclined to think Kondo-san
 has got it right.

 Quite possible, I really don't know. I'm inclined to first try the simplest
 thing possible, and only make it more complicated if that's not good enough.
 Kondo-san's patch wasn't very complicated, but nevertheless a fixed sleep
 between every fsync, unless you're behind the schedule, is even simpler.

I'm pretty sure Greg Smith tried the fixed-sleep thing before and
it didn't work that well.  I have also tried it and the resulting
behavior was unimpressive.  It makes checkpoints take a long time to
complete even when there's very little data to flush out to the OS,
which is annoying; and when things actually do get ugly, the sleeps
aren't long enough to matter.  See the timings Kondo-san posted
downthread: 100ms delays aren't going to let the system recover in any
useful way when the fsync can take 13 s for one file.  On a system
that's badly weighed down by I/O, the fsync times are often
*extremely* long - 13 s is far from the worst you can see.  You have
to give the system a meaningful time to recover from that, allowing
other processes to make meaningful progress before you hit it again,
or system performance just goes down the tubes.  Greg's test, IIRC,
used 3 s sleeps rather than your proposal of 100 ms, but it still
wasn't enough.

 In
 particular, it's easier to tie that into the checkpoint scheduler - I'm not
 sure how you'd measure progress or determine how long to sleep unless you
 assume that every fsync is the same.

I think the thing to do is assume that the fsync phase will take 10%
or so of the total checkpoint time, but then be prepared to let the
checkpoint run a bit longer if the fsyncs end up being slow.  As Greg
has pointed out during prior discussions of this, the normal scenario
when things get bad here is that there is no way in hell you're going
to fit the checkpoint into the originally planned time.  Once all of
the write caches between PostgreSQL and the spinning rust are full,
the system is in trouble and things are going to suck.  The hope is
that we can stop beating the horse while it is merely in intensive
care rather than continuing until the corpse is fully skeletized.
Fixed delays don't work because - to push an already-overdone metaphor
a bit further - we have no idea how much of a beating the horse can
take; we need something adaptive so that we respond to what actually
happens rather than making predictions that will almost certainly be
wrong a large fraction of the time.

To put this another way, when we start the fsync() phase, it often
consumes 100% of the available I/O on the machine, completely starving
every other process that might need any.  This is certainly a
deficiency in the Linux I/O scheduler, but as they seem in no hurry to
fix it we'll have to cope with it as best we can.  If you do the
fsyncs in fast succession (and 100ms separation might as well be no
separation at all), then the I/O starvation of the entire system
persists through the entire fsync phase.  If, on the other hand, you
sleep for the same amount of time the previous fsync took, then on the
average, 50% of the machine's I/O capacity will be available for all
other system activity throughout the fsync phase, rather than 0%.
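
A minimal sketch of that pacing idea, assuming a plain list of file
descriptors; this illustrates the principle only and is not the actual
patch, which is quoted downthread:

#include <time.h>
#include <unistd.h>

static void
fsync_with_recovery_pause(const int *fds, int nfds)
{
    int i;

    for (i = 0; i < nfds; i++)
    {
        struct timespec start, end;
        double elapsed;

        clock_gettime(CLOCK_MONOTONIC, &start);
        (void) fsync(fds[i]);           /* error handling omitted */
        clock_gettime(CLOCK_MONOTONIC, &end);

        elapsed = (end.tv_sec - start.tv_sec)
            + (end.tv_nsec - start.tv_nsec) / 1e9;

        /* Pause for as long as the fsync took, leaving roughly half of
         * the I/O capacity to other processes during the fsync phase. */
        usleep((useconds_t) (elapsed * 1e6));
    }
}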

Now, unfortunately, this is still not that good, because it's often
the case that all of the fsyncs except one are reasonably fast, and
there's one monster one that is very slow.  ext3 has a known bad
behavior that dumps all dirty data for the entire *filesystem* when
you fsync, which tends to create these kinds of effects.  But even on
better-behaved filesystem, like ext4, it's fairly common to have one
fsync that is painfully longer than all the others.   So even with
this patch, there are still going to be cases where the whole system
becomes unresponsive.  I don't see any way to to do better without a
better kernel API, or a better I/O scheduler, but that doesn't mean we
shouldn't do at least this much.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-26 Thread KONDO Mitsumasa

Thank you for comments!

 On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas

Hmm, so the write patch doesn't do much, but the fsync patch makes the response
times somewhat smoother. I'd suggest that we drop the write patch for now, and

 focus on the fsyncs.
The write patch is effective for TPS! I think the delay of checkpoint writes
avoids the long fsyncs and heavy load in the fsync phase, because the disk
writes go out slowly during the write phase. Therefore the write patch and
the fsync patch suit each other better than the write patch alone. I think
the amount of WAL written at the beginning of a checkpoint can indicate the
effect of the write patch.


 What checkpointer_fsync_delay_ratio and checkpointer_fsync_delay_threshold 
 settings did you use with the fsync patch? It's disabled by default.

I used these parameters.
  checkpointer_fsync_delay_ratio = 1
  checkpointer_fsync_delay_threshold = 1000ms
In effect, I used a long sleep after slow fsyncs.

And other maintains parameters are here.
  checkpoint_completion_target = 0.7
  checkpoint_smooth_target = 0.3
  checkpoint_smooth_margin = 0.5
  checkpointer_write_delay = 200ms


Attached is a quick patch to implement a fixed, 100ms delay between fsyncs, and 
the
assumption that fsync phase is 10% of the total checkpoint duration. I suspect 
100ms
 is too small to have much effect, but that happens to be what we have 
currently in

CheckpointWriteDelay(). Could you test this patch along with yours? If you can 
test
with different delays (e.g 100ms, 500ms and 1000ms) and different ratios between
the write and fsync phase (e.g 0.5, 0.7, 0.9), to get an idea of how sensitive 
the
test case is to those settings.
It seems an interesting algorithm! I will test it with the same settings and
study the essence of your patch.



(2013/06/26 5:28), Heikki Linnakangas wrote:

On 25.06.2013 23:03, Robert Haas wrote:

On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
hlinnakan...@vmware.com  wrote:

I'm not sure it's a good idea to sleep proportionally to the time it took to
complete the previous fsync. If you have a 1GB cache in the RAID controller,
fsyncing a 1GB segment will fill it up. But since it fits in cache, it
will return immediately. So we proceed fsyncing other files, until the cache
is full and the fsync blocks. But once we fill up the cache, it's likely
that we're hurting concurrent queries. ISTM it would be better to stay under
that threshold, keeping the I/O system busy, but never fill up the cache
completely.


Isn't the behavior implemented by the patch a reasonable approximation
of just that?  When the fsyncs start to get slow, that's when we start
to sleep.   I'll grant that it would be better to sleep when the
fsyncs are *about* to get slow, rather than when they actually have
become slow, but we have no way to know that.


Well, that's the point I was trying to make: you should sleep *before* the 
fsyncs
get slow.
Actually, fsync time changes with the progress of the OS's background disk
writes. We cannot know the progress of background disk writes before the
fsyncs. I think Robert's argument is right. Please see the following log
messages.


* fsync of a file that had already been written out to disk
 DEBUG:  0: checkpoint sync: number=23 file=base/16384/16413.5 time=2.546 msec
 DEBUG:  0: checkpoint sync: number=24 file=base/16384/16413.6 time=3.174 msec
 DEBUG:  0: checkpoint sync: number=25 file=base/16384/16413.7 time=2.358 msec
 DEBUG:  0: checkpoint sync: number=26 file=base/16384/16413.8 time=2.013 msec
 DEBUG:  0: checkpoint sync: number=27 file=base/16384/16413.9 time=1232.535 msec
 DEBUG:  0: checkpoint sync: number=28 file=base/16384/16413_fsm time=0.005 msec

* fsync of a file that had not yet been written out to disk very much
 DEBUG:  0: checkpoint sync: number=54 file=base/16384/16419.8 time=3408.759 msec
 DEBUG:  0: checkpoint sync: number=55 file=base/16384/16419.9 time=3857.075 msec
 DEBUG:  0: checkpoint sync: number=56 file=base/16384/16419.10 time=13848.237 msec
 DEBUG:  0: checkpoint sync: number=57 file=base/16384/16419.11 time=898.836 msec
 DEBUG:  0: checkpoint sync: number=58 file=base/16384/16419_fsm time=0.004 msec
 DEBUG:  0: checkpoint sync: number=59 file=base/16384/16419_vm time=0.002 msec

I think it is wasteful to sleep after every fsync, including short ones, and
fsync performance also changes with the hardware, such as the RAID card, the
kind and number of disks, and the OS. So it is difficult to set a fixed sleep
time. My proposed method will be more adaptive in these cases.



The only feedback we have on how bad things are is how long it took
the last fsync to complete, so I actually think that's a much better
way to go than any fixed sleep - which will often be unnecessarily
long on a well-behaved system, and which will often be far too short
on one that's having trouble. I'm inclined to think Kondo-san
has got it right.


Quite possible, I really don't know. I'm inclined to first try the 

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-26 Thread Heikki Linnakangas

On 26.06.2013 11:37, KONDO Mitsumasa wrote:

On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas

Hmm, so the write patch doesn't do much, but the fsync patch makes
the response
times somewhat smoother. I'd suggest that we drop the write patch
for now, and focus on the fsyncs.


The write patch is effective for TPS!


Your test results don't agree with that. You got 3465.96 TPS with the 
write patch, and 3474.62 and 3469.03 without it. The fsync+write 
combination got slightly more TPS than just the fsync patch, but only by 
about 1%, and then the response times were worse.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-26 Thread KONDO Mitsumasa

(2013/06/26 20:15), Heikki Linnakangas wrote:

On 26.06.2013 11:37, KONDO Mitsumasa wrote:

On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas

Hmm, so the write patch doesn't do much, but the fsync patch makes
the response
times somewhat smoother. I'd suggest that we drop the write patch
for now, and focus on the fsyncs.


The write patch is effective for TPS!


Your test results don't agree with that. You got 3465.96 TPS with the write
patch, and 3474.62 and 3469.03 without it. The fsync+write combination got
slightly more TPS than just the fsync patch, but only by about 1%, and then the
response times were worse.

Please look at the DBT-2 results more carefully. Average latency with fsync +
write improved over the fsync-only patch. The 90%tile and maximum latency are
only part of the DBT-2 result, while the average and TPS cover all of it.
Generally, when TPS gets higher in a benchmark, the checkpointer has to write
more pages; therefore the 90%tile and maximum get worse in such a case, and
the same is commonly seen in other benchmark tests.


Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-25 Thread Heikki Linnakangas

On 21.06.2013 11:29, KONDO Mitsumasa wrote:

I took results for my separate patches and for the original PG.

* Result of DBT-2
| TPS 90%tile Average Maximum
--
original_0.7 | 3474.62 18.348328 5.739 36.977713
original_1.0 | 3469.03 18.637865 5.842 41.754421
fsync | 3525.03 13.872711 5.382 28.062947
write | 3465.96 19.653667 5.804 40.664066
fsync + write | 3564.94 16.31922 5.1 34.530766

- 'original_*' is plain PG 9.2.4 with the given checkpoint_completion_target.
- For the patched versions, checkpoint_completion_target is set to 0.7.
- 'write' has the write patch applied, and 'fsync' the fsync patch.
- 'fsync + write' has both patches applied.


* Investigation of result
- A large checkpoint_completion_target, in the original and with the write
patch, leads to slow latency in benchmark transactions, because slowly
written pages cause long fsync IO at the end of the checkpoint.
- The fsync patch affects the latency of each file's fsync. Consecutive
fsyncs of files cause slow latency, so it is good for latency that the
fsync stage of the checkpoint sleeps after a slow fsync IO.
- The fsync + write patches together seemed to improve TPS. I think the
write patch disturbs transactions doing full-page-write WAL writes less
than the original (plain) PG does.


Hmm, so the write patch doesn't do much, but the fsync patch makes the 
response times somewhat smoother. I'd suggest that we drop the write 
patch for now, and focus on the fsyncs.


What checkpointer_fsync_delay_ratio and 
checkpointer_fsync_delay_threshold settings did you use with the fsync 
patch? It's disabled by default.


This is the interesting part of the patch:


@@ -1171,6 +1174,20 @@ mdsync(void)
 							FilePathName(seg->mdfd_vfd),
 							(double) elapsed / 1000);
 
+					/*
+					 * If this fsync took a long time, sleep for 'fsync-time * checkpoint_fsync_delay_ratio'
+					 * to give priority to executing transactions.
+					 */
+					if (CheckPointerFsyncDelayThreshold >= 0 &&
+						!shutdown_requested &&
+						!ImmediateCheckpointRequested() &&
+						(elapsed / 1000 > CheckPointerFsyncDelayThreshold))
+					{
+						pg_usleep((elapsed / 1000) * CheckPointerFsyncDelayRatio * 1000L);
+						if (log_checkpoints)
+							elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+								 (double) (elapsed / 1000) * CheckPointerFsyncDelayRatio);
+					}
 					break;	/* out of retry loop */
 				}


I'm not sure it's a good idea to sleep proportionally to the time it 
took to complete the previous fsync. If you have a 1GB cache in the RAID 
controller, fsyncing a 1GB segment will fill it up. But since it 
fits in cache, it will return immediately. So we proceed fsyncing other 
files, until the cache is full and the fsync blocks. But once we fill up 
the cache, it's likely that we're hurting concurrent queries. ISTM it 
would be better to stay under that threshold, keeping the I/O system 
busy, but never fill up the cache completely.


This is just a theory, though. I don't have a good grasp on how the OS 
and a typical RAID controller behaves under these conditions.


I'd suggest that we just sleep for a small fixed amount of time between 
every fsync, unless we're running behind the checkpoint schedule. And 
for a first approximation, let's just assume that the fsync phase is e.g 
10% of the whole checkpoint work.
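
A minimal sketch of that policy, with a stub standing in for the real
schedule check (behind_schedule is a hypothetical placeholder for logic
like IsCheckpointOnSchedule()):

#include <stdbool.h>
#include <unistd.h>

#define FSYNC_PHASE_FRACTION 0.10   /* assume fsyncs are 10% of the work */
#define FSYNC_DELAY_USEC     100000 /* fixed 100ms between fsyncs */

static bool
behind_schedule(double progress)
{
    (void) progress;
    return false;               /* stub: pretend we are always on time */
}

static void
fsync_phase(const int *fds, int nfds)
{
    int i;

    for (i = 0; i < nfds; i++)
    {
        /* overall progress: the write phase is the first 90%, and the
         * fsyncs fill the remaining 10% */
        double progress = (1.0 - FSYNC_PHASE_FRACTION)
            + FSYNC_PHASE_FRACTION * ((double) i / nfds);

        (void) fsync(fds[i]);
        if (!behind_schedule(progress))
            usleep(FSYNC_DELAY_USEC);
    }
}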



I will send a more detailed investigation and results next week, and I
will also take results with pgbench. If you would like other benchmark
results or different postgres parameters, please tell me.


Attached is a quick patch to implement a fixed, 100ms delay between 
fsyncs, and the assumption that fsync phase is 10% of the total 
checkpoint duration. I suspect 100ms is too small to have much effect, 
but that happens to be what we have currently in CheckpointWriteDelay(). 
Could you test this patch along with yours? If you can test with 
different delays (e.g 100ms, 500ms and 1000ms) and different ratios 
between the write and fsync phase (e.g 0.5, 0.7, 0.9), to get an idea of 
how sensitive the test case is to those settings.


- Heikki
diff --git 

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-25 Thread Robert Haas
On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I'm not sure it's a good idea to sleep proportionally to the time it took to
 complete the previous fsync. If you have a 1GB cache in the RAID controller,
 fsyncing a 1GB segment will fill it up. But since it fits in cache, it
 will return immediately. So we proceed fsyncing other files, until the cache
 is full and the fsync blocks. But once we fill up the cache, it's likely
 that we're hurting concurrent queries. ISTM it would be better to stay under
 that threshold, keeping the I/O system busy, but never fill up the cache
 completely.

Isn't the behavior implemented by the patch a reasonable approximation
of just that?  When the fsyncs start to get slow, that's when we start
to sleep.   I'll grant that it would be better to sleep when the
fsyncs are *about* to get slow, rather than when they actually have
become slow, but we have no way to know that.  The only feedback we
have on how bad things are is how long it took the last fsync to
complete, so I actually think that's a much better way to go than any
fixed sleep - which will often be unnecessarily long on a well-behaved
system, and which will often be far too short on one that's having
trouble.  I'm inclined to think Kondo-san has got it right.

I like your idea of putting a stake in the ground and assuming that
the fsync phase will turn out to be X% of the checkpoint, but I wonder
if we can be a bit more sophisticated, especially for cases where
checkpoint_segments is small.  When checkpoint_segments is large, then
we know that some of the data will get written back to disk during the
write phase, because the OS cache is only so big.  But when it's
small, the OS will essentially do nothing during the write phase, and
then it's got to write all the data out during the fsync phase.  I'm
not sure we can really model that effect thoroughly, but even
something dumb would be smarter than what we have now - e.g. use 10%,
but when checkpoint_segments < 10, use 1/checkpoint_segments.  Or just
assume the fsync phase will take 30 seconds.  Or ... something.  I'm
not really sure what the right model is here.
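
Expressed as a function, that rough model might look like this (a sketch,
not code from any posted patch):

static double
fsync_phase_fraction(int checkpoint_segments)
{
    /* budget 10% of the checkpoint for fsyncs, but with few segments
     * the write phase flushes almost nothing, so budget more */
    if (checkpoint_segments < 10)
        return 1.0 / checkpoint_segments;
    return 0.10;
}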

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-25 Thread Heikki Linnakangas

On 25.06.2013 23:03, Robert Haas wrote:

On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
hlinnakan...@vmware.com  wrote:

I'm not sure it's a good idea to sleep proportionally to the time it took to
complete the previous fsync. If you have a 1GB cache in the RAID controller,
fsyncing a 1GB segment will fill it up. But since it fits in cache, it
will return immediately. So we proceed fsyncing other files, until the cache
is full and the fsync blocks. But once we fill up the cache, it's likely
that we're hurting concurrent queries. ISTM it would be better to stay under
that threshold, keeping the I/O system busy, but never fill up the cache
completely.


Isn't the behavior implemented by the patch a reasonable approximation
of just that?  When the fsyncs start to get slow, that's when we start
to sleep.   I'll grant that it would be better to sleep when the
fsyncs are *about* to get slow, rather than when they actually have
become slow, but we have no way to know that.


Well, that's the point I was trying to make: you should sleep *before* 
the fsyncs get slow.



The only feedback we have on how bad things are is how long it took
the last fsync to complete, so I actually think that's a much better
way to go than any fixed sleep - which will often be unnecessarily
long on a well-behaved system, and which will often be far too short
on one that's having trouble. I'm inclined to think Kondo-san
has got it right.


Quite possible, I really don't know. I'm inclined to first try the 
simplest thing possible, and only make it more complicated if that's not 
good enough. Kondo-san's patch wasn't very complicated, but nevertheless 
a fixed sleep between every fsync, unless you're behind the schedule, is 
even simpler. In particular, it's easier to tie that into the checkpoint 
scheduler - I'm not sure how you'd measure progress or determine how 
long to sleep unless you assume that every fsync is the same.



I like your idea of putting a stake in the ground and assuming that
the fsync phase will turn out to be X% of the checkpoint, but I wonder
if we can be a bit more sophisticated, especially for cases where
checkpoint_segments is small.  When checkpoint_segments is large, then
we know that some of the data will get written back to disk during the
write phase, because the OS cache is only so big.  But when it's
small, the OS will essentially do nothing during the write phase, and
then it's got to write all the data out during the fsync phase.  I'm
not sure we can really model that effect thoroughly, but even
something dumb would be smarter than what we have now - e.g. use 10%,
but when checkpoint_segments < 10, use 1/checkpoint_segments.  Or just
assume the fsync phase will take 30 seconds.


If checkpoint_segments < 10, there isn't very much dirty data to flush 
out. This isn't really a problem in that case: no matter how stupidly we 
do the writing and fsyncing, the I/O cache can absorb it. It doesn't 
really matter what we do in that case.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-21 Thread KONDO Mitsumasa

Hi,

I took results for my separate patches and for the original PG.

* Result of DBT-2
              | TPS      90%tile    Average  Maximum
--------------+-------------------------------------
original_0.7  | 3474.62  18.348328  5.739    36.977713
original_1.0  | 3469.03  18.637865  5.842    41.754421
fsync         | 3525.03  13.872711  5.382    28.062947
write         | 3465.96  19.653667  5.804    40.664066
fsync + write | 3564.94  16.31922   5.1      34.530766

 - 'original_*' is plain PG 9.2.4 with the given checkpoint_completion_target.
 - For the patched versions, checkpoint_completion_target is set to 0.7.
 - 'write' has the write patch applied, and 'fsync' the fsync patch.
 - 'fsync + write' has both patches applied.


* Investigation of result
 - A large checkpoint_completion_target, in the original and with the write
patch, leads to slow latency in benchmark transactions, because slowly written
pages cause long fsync IO at the end of the checkpoint.
 - The fsync patch affects the latency of each file's fsync. Consecutive
fsyncs of files cause slow latency, so it is good for latency that the fsync
stage of the checkpoint sleeps after a slow fsync IO.
 - The fsync + write patches together seemed to improve TPS. I think the write
patch disturbs transactions doing full-page-write WAL writes less than the
original (plain) PG does.


I will send a more detailed investigation and results next week, and I will
also take results with pgbench. If you would like other benchmark results or
different postgres parameters, please tell me.


Best Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-17 Thread KONDO Mitsumasa

Thank you for the comments, and thanks to my patch reviewer!

(2013/06/16 23:27), Heikki Linnakangas wrote:

On 10.06.2013 13:51, KONDO Mitsumasa wrote:

I have created a patch that improves the checkpoint IO scheduler for
stable transaction responses.

* Problem with checkpoint IO scheduling under heavy transaction load
Under a heavy transaction load, I think the PostgreSQL checkpoint
scheduler has two problems, at the start and at the end of a checkpoint.
One problem is heavy IO when a checkpoint starts. It is caused by
full-page writes, which generate a burst of WAL IO right after the
checkpoint begins writing pages. As a result, the WAL-based checkpoint
scheduler wrongly judges that it is behind schedule because of the
full-page writes, even though it is not. This causes bad transaction
responses. I think the WAL-based scheduler does not behave properly at
the start of a checkpoint.


Yeah, the checkpoint scheduling logic doesn't take into account the heavy WAL
activity caused by full page images. That's an interesting phenomenon, but did
you actually see that causing a problem in your tests?  I couldn't tell from the
results you posted what the impact of that was. Could you repeat the tests
separately with the two separate patches you posted later in this thread?

OK, I will try to test with the two separate patches. My past patch results
indicate high WAL throughput (write_size_per_sec) and a high transaction rate
during checkpoints. Please see the following HTML files, where I have set
anchor links; press the 'checkpoint highlight switch' button.


* With my patched PG
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/patchedPG-report.html#transaction_statistics
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/patchedPG-report.html#wal_statistics

* Plain PG
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/plainPG-report.html#transaction_statistics
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/plainPG-report.html#wal_statistics

In the WAL statistics results, I think the high WAL throughput at checkpoint
start indicates that checkpoint IO does not disturb the IO of other executing
transactions.



Rationalizing a bit, I could even argue to myself that it's a *good* thing. At
the beginning of a checkpoint, the OS write cache should be relatively empty, as
the checkpointer hasn't done any writes yet. So it might make sense to write a
burst of pages at the beginning, to partially fill the write cache first, before
starting to throttle. But this is just handwaving - I have no idea what the
effect is in real life.
Yes, I think so. If we want to change the IO throttling, we change OS
parameters such as '/proc/sys/vm/dirty_background_ratio' or
'/proc/sys/vm/dirty_ratio'. But those parameters affect every application on
the OS, so they are difficult to change and hard to set intuitively. And I
think database tuning should be done through database parameters rather than
OS parameters; that makes tuning a server much clearer.



Another thought is that rather than trying to compensate for that effect in the
checkpoint scheduler, could we avoid the sudden rush of full-page images in the
first place? The current rule for when to write a full page image is
conservative: you don't actually need to write a full page image when you modify
a buffer that's sitting in the buffer cache, if that buffer hasn't been flushed
to disk by the checkpointer yet, because the checkpointer will write and fsync 
it
later. I'm not sure how much it would smoothen WAL write I/O, but it would be
interesting to try.
That would be the most correct method in an ideal implementation, but I don't
have any ideas for how to do it. It seems very difficult...




The second problem is the fsync freeze at the end of a checkpoint.
Normally, checkpoint writes are flushed in the background by the OS's IO
scheduler. But when that does not work correctly, the fsyncs at the end
of the checkpoint cause an IO freeze and slow transactions. Unexpectedly
slow transactions can trigger monitoring errors in an HA cluster and
degrade the user experience of an application service. It is an
especially serious problem in cloud and virtual-server database systems,
which do not have much IO performance. However, we do not have many
postgresql.conf parameters to address this. We prefer fast transaction
responses over a short checkpoint; the checkpoint time is in fact short,
and it becoming a little longer is not a problem. You may think that
checkpoint_segments and checkpoint_timeout can simply be set to larger
values, but a large checkpoint_segments wastes file cache that is never
read, and a large checkpoint_timeout causes long crash recovery.


A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp.
He posted very promising performance numbers, but it was dropped because Tom
couldn't reproduce the numbers, and because sorting requires allocating a large
array, which has the 

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-17 Thread Pavan Deolasee
On Mon, Jun 17, 2013 at 2:18 AM, Andres Freund and...@2ndquadrant.com wrote:

 On 2013-06-16 17:27:56 +0300, Heikki Linnakangas wrote:

  A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
 www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp
 .
  He posted very promising performance numbers, but it was dropped because
 Tom
  couldn't reproduce the numbers, and because sorting requires allocating a
  large array, which has the risk of running out of memory, which would be
 bad
  when you're trying to checkpoint.

 Hm. We could allocate the array early on since the number of buffers
 doesn't change. Sure that would be pessimistic, but that seems fine.

 Alternatively I can very well imagine that it would still be beneficial
 to sort the dirty buffers in shared buffers. I.e. scan till we found 50k
 dirty pages, sort them and only then write them out.


Without knowing that Itagaki had done something similar in the past, a couple
of months back I tried exactly the same thing, i.e. sorting the shared
buffers in chunks and then writing them out at once. But I did not get any
significant performance gain except when the shared buffers were 3/4th (or
some such number) or more of the available RAM. I will see if I can pull
out the patch and the numbers. But if memory serves well, I concluded that
the kernel is already utilising its buffer cache to achieve the same thing
and it does not help beyond a point.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-17 Thread KONDO Mitsumasa
(2013/06/17 5:48), Andres Freund wrote:
 On 2013-06-16 17:27:56 +0300, Heikki Linnakangas wrote:

 If we don't mind scanning the buffer cache several times, we don't
 necessarily even need to sort the writes for that. Just scan the buffer
 cache for all buffers belonging to relation A, then fsync it. Then scan the
 buffer cache again, for all buffers belonging to relation B, then fsync
 that, and so forth.

 That would end up with quite a lot of scans on reasonably sized
 machines. Not to talk of those that have a million+ relations. That
 doesn't seem to be a good idea for bigger shared_buffers. C.f. the stuff
 we did for 9.3 to make it cheaper to drop a bunch of relations at once
 by only scanning shared_buffers once.
As I wrote in my reply to Heikki, I think an exact buffer sort, which has an
expensive cost, is unnecessary. To solve this problem we only need sorting
accurate enough to be optimized by the OS IO scheduler. And we normally have
two optimizing IO scheduler layers: the OS layer and the RAID controller
layer. I think performance will improve if the sort is just accurate enough
for those layers to optimize. I think the computational complexity required
is a single sequential scan of the buffer descriptors for a rough buffer
sort. I will try to study this implementation, too.


Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-16 Thread Heikki Linnakangas

On 10.06.2013 13:51, KONDO Mitsumasa wrote:

I have created a patch that improves the checkpoint IO scheduler for
stable transaction responses.

* Problem with checkpoint IO scheduling under heavy transaction load
Under a heavy transaction load, I think the PostgreSQL checkpoint
scheduler has two problems, at the start and at the end of a checkpoint.
One problem is heavy IO when a checkpoint starts. It is caused by
full-page writes, which generate a burst of WAL IO right after the
checkpoint begins writing pages. As a result, the WAL-based checkpoint
scheduler wrongly judges that it is behind schedule because of the
full-page writes, even though it is not. This causes bad transaction
responses. I think the WAL-based scheduler does not behave properly at
the start of a checkpoint.


Yeah, the checkpoint scheduling logic doesn't take into account the 
heavy WAL activity caused by full page images. That's an interesting 
phenomenon, but did you actually see that causing a problem in your 
tests? I couldn't tell from the results you posted what the impact of 
that was. Could you repeat the tests separately with the two separate 
patches you posted later in this thread?


Rationalizing a bit, I could even argue to myself that it's a *good* 
thing. At the beginning of a checkpoint, the OS write cache should be 
relatively empty, as the checkpointer hasn't done any writes yet. So it 
might make sense to write a burst of pages at the beginning, to 
partially fill the write cache first, before starting to throttle. But 
this is just handwaving - I have no idea what the effect is in real life.


Another thought is that rather than trying to compensate for that effect 
in the checkpoint scheduler, could we avoid the sudden rush of full-page 
images in the first place? The current rule for when to write a full 
page image is conservative: you don't actually need to write a full page 
image when you modify a buffer that's sitting in the buffer cache, if 
that buffer hasn't been flushed to disk by the checkpointer yet, because 
the checkpointer will write and fsync it later. I'm not sure how much it 
would smoothen WAL write I/O, but it would be interesting to try.



The second problem is the fsync freeze at the end of a checkpoint.
Normally, checkpoint writes are flushed in the background by the OS's IO
scheduler. But when that does not work correctly, the fsyncs at the end
of the checkpoint cause an IO freeze and slow transactions. Unexpectedly
slow transactions can trigger monitoring errors in an HA cluster and
degrade the user experience of an application service. It is an
especially serious problem in cloud and virtual-server database systems,
which do not have much IO performance. However, we do not have many
postgresql.conf parameters to address this. We prefer fast transaction
responses over a short checkpoint; the checkpoint time is in fact short,
and it becoming a little longer is not a problem. You may think that
checkpoint_segments and checkpoint_timeout can simply be set to larger
values, but a large checkpoint_segments wastes file cache that is never
read, and a large checkpoint_timeout causes long crash recovery.


A long time ago, Itagaki wrote a patch to sort the checkpoint writes: 
www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp. 
He posted very promising performance numbers, but it was dropped because 
Tom couldn't reproduce the numbers, and because sorting requires 
allocating a large array, which has the risk of running out of memory, 
which would be bad when you're trying to checkpoint.


Apart from the direct performance impact of that patch, sorting the 
writes would allow us to interleave the fsyncs with the writes. You 
would write out all buffers for relation A, then fsync it, then all 
buffers for relation B, then fsync it, and so forth. That would 
naturally spread out the fsyncs.


If we don't mind scanning the buffer cache several times, we don't 
necessarily even need to sort the writes for that. Just scan the buffer 
cache for all buffers belonging to relation A, then fsync it. Then scan 
the buffer cache again, for all buffers belonging to relation B, then 
fsync that, and so forth.



The bad point of my patch is a longer checkpoint: checkpoint time increased
by about 10% - 20%. But it still runs correctly on schedule within
checkpoint_timeout. Please see the checkpoint result (http://goo.gl/NsbC6).


For a fair comparison, you should increase the 
checkpoint_completion_target of the unpatched test, so that the 
checkpoints run for roughly the same amount of time with and without the 
patch. Otherwise the benefit you're seeing could be just because of a 
more lazy checkpoint.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-16 Thread Andres Freund
On 2013-06-16 17:27:56 +0300, Heikki Linnakangas wrote:
 Another thought is that rather than trying to compensate for that effect in
 the checkpoint scheduler, could we avoid the sudden rush of full-page images
 in the first place? The current rule for when to write a full page image is
 conservative: you don't actually need to write a full page image when you
 modify a buffer that's sitting in the buffer cache, if that buffer hasn't
 been flushed to disk by the checkpointer yet, because the checkpointer will
 write and fsync it later. I'm not sure how much it would smoothen WAL write
 I/O, but it would be interesting to try.

Hm. Could you elaborate why that wouldn't open new hazards? I don't see
how that could be safe against crashes in some places. It seems to me
we could end up replaying records like heap_insert or similar onto
pages that are still torn?

 A long time ago, Itagaki wrote a patch to sort the checkpoint writes: 
 www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp.
 He posted very promising performance numbers, but it was dropped because Tom
 couldn't reproduce the numbers, and because sorting requires allocating a
 large array, which has the risk of running out of memory, which would be bad
 when you're trying to checkpoint.

Hm. We could allocate the array early on since the number of buffers
doesn't change. Sure that would be pessimistic, but that seems fine.

Alternatively I can very well imagine that it would still be beneficial
to sort the dirty buffers in shared buffers. I.e. scan till we found 50k
dirty pages, sort them and only then write them out.
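
A sketch of that chunked approach, with a hypothetical dirty_page struct and
a write_buffer() stub standing in for PostgreSQL's buffer machinery:

#include <stdlib.h>

#define CHUNK_SIZE 50000    /* collect this many dirty pages per chunk */

typedef struct dirty_page
{
    unsigned relfilenode;
    unsigned blocknum;
    int      buf_id;
} dirty_page;

static int
dirty_page_cmp(const void *a, const void *b)
{
    const dirty_page *pa = (const dirty_page *) a;
    const dirty_page *pb = (const dirty_page *) b;

    if (pa->relfilenode != pb->relfilenode)
        return (pa->relfilenode < pb->relfilenode) ? -1 : 1;
    if (pa->blocknum != pb->blocknum)
        return (pa->blocknum < pb->blocknum) ? -1 : 1;
    return 0;
}

/* hypothetical stub: flush one shared buffer to the kernel */
static void
write_buffer(int buf_id)
{
    (void) buf_id;
}

static void
write_sorted_chunk(dirty_page *chunk, int n)
{
    int i;

    /* sort one chunk (up to CHUNK_SIZE entries) by (relfilenode, block)
     * and write it out in file order */
    qsort(chunk, n, sizeof(dirty_page), dirty_page_cmp);
    for (i = 0; i < n; i++)
        write_buffer(chunk[i].buf_id);
}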

 Apart from the direct performance impact of that patch, sorting the writes
 would allow us to interleave the fsyncs with the writes. You would write out
 all buffers for relation A, then fsync it, then all buffers for relation B,
 then fsync it, and so forth. That would naturally spread out the
 fsyncs.

I personally think that optionally trying to force the pages to be
written out earlier (say, with sync_file_range) to make the actual
fsync() later on cheaper is likely to be better overall.
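
For illustration, pre-flushing a just-written range could look like this
(Linux-specific; a sketch of the idea, not a posted patch):

#define _GNU_SOURCE             /* sync_file_range() is Linux-only */
#include <fcntl.h>

static void
start_writeback(int fd, off_t offset, off_t nbytes)
{
    /* start asynchronous writeback for the range without waiting, so
     * the eventual fsync() finds little dirty data left */
    (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}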

 If we don't mind scanning the buffer cache several times, we don't
 necessarily even need to sort the writes for that. Just scan the buffer
 cache for all buffers belonging to relation A, then fsync it. Then scan the
 buffer cache again, for all buffers belonging to relation B, then fsync
 that, and so forth.

That would end up with quite a lot of scans on reasonably sized
machines. Not to talk of those that have a million+ relations. That
doesn't seem to be a good idea for bigger shared_buffers. C.f. the stuff
we did for 9.3 to make it cheaper to drop a bunch of relations at once
by only scanning shared_buffers once.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-14 Thread KONDO Mitsumasa

(2013/06/12 23:07), Robert Haas wrote:

On Mon, Jun 10, 2013 at 3:48 PM, Simon Riggs si...@2ndquadrant.com wrote:

On 10 June 2013 11:51, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote:

I create patch which is improvement of checkpoint IO scheduler for stable
transaction responses.


Looks like good results, with good measurements. Should be an
interesting discussion.


+1.

I suspect we want to poke at the algorithms a little here and maybe
see if we can do this without adding new GUCs.  Also, I think this is
probably two separate patches, in the end.  But the direction seems
good to me.

Thank you for the comments!

I have separated my patch into a checkpoint-write part and a
checkpoint-fsync part. As you say, my patch has a lot of new GUCs, and I
don't think they can all be decided automatically. It is difficult for one
checkpoint scheduler to suit every environment, such as virtual servers,
public cloud servers, embedded servers, etc. So the default parameter
settings work the same as before. Setting the parameters is primitive and
difficult, but if we can set them correctly, the scheduler suits a lot of
environments and will not behave in unintended ways.


I will try to think about a version with fewer GUCs. If you have a good
idea, please discuss it here!


Best Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..0c0f215 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -141,9 +141,12 @@ static CheckpointerShmemStruct *CheckpointerShmem;
 /*
  * GUC parameters
  */
+int			CheckPointerWriteDelay = 200;
 int			CheckPointTimeout = 300;
 int			CheckPointWarning = 30;
 double		CheckPointCompletionTarget = 0.5;
+double		CheckPointSmoothTarget = 0.0;
+double		CheckPointSmoothMargin = 0.0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -715,7 +718,7 @@ CheckpointWriteDelay(int flags, double progress)
 		 * Checkpointer and bgwriter are no longer related so take the Big
 		 * Sleep.
 		 */
-		pg_usleep(100000L);
+		pg_usleep(CheckPointerWriteDelay * 1000L);
 	}
 	else if (--absorb_counter = 0)
 	{
@@ -742,14 +745,36 @@ IsCheckpointOnSchedule(double progress)
 {
 	XLogRecPtr	recptr;
 	struct timeval now;
-	double		elapsed_xlogs,
+	double		original_progress,
+			elapsed_xlogs,
 elapsed_time;
 
 	Assert(ckpt_active);
 
-	/* Scale progress according to checkpoint_completion_target. */
-	progress *= CheckPointCompletionTarget;
+	/* This variable is used by smooth checkpoint schedule.*/
+	original_progress = progress * CheckPointCompletionTarget;
 
+	/* Scale progress according to checkpoint_completion_target and checkpoint_smooth_target. */
+	if (progress >= CheckPointSmoothTarget)
+	{
+		/* Normal checkpoint schedule. */
+		progress *= CheckPointCompletionTarget;
+	}
+	else
+	{
+		/*
+		 * Smooth checkpoint schedule.
+		 *
+		 * At the start of a checkpoint, the I/O load average tends to be
+		 * high and transactions execute slowly. This schedule reduces the
+		 * initial burst and improves I/O response. As 'progress' approaches
+		 * CheckPointSmoothTarget, it converges to the normal checkpoint
+		 * schedule. For an even smoother checkpoint schedule, set
+		 * CheckPointSmoothTarget higher.
+		 */
+		progress *= ((CheckPointSmoothTarget - progress) / CheckPointSmoothTarget) *
+				(CheckPointSmoothMargin + 1 - CheckPointCompletionTarget) +
+				CheckPointCompletionTarget;
+	}
 	/*
 	 * Check against the cached value first. Only do the more expensive
 	 * calculations once we reach the target previously calculated. Since
@@ -779,6 +804,14 @@ IsCheckpointOnSchedule(double progress)
 			ckpt_cached_elapsed = elapsed_xlogs;
 			return false;
 		}
+		else if (original_progress < elapsed_xlogs)
+		{
+			ckpt_cached_elapsed = elapsed_xlogs;
+
+			/* smooth checkpoint write */
+			pg_usleep(CheckPointerWriteDelay * 1000L);
+			return false;
+		}
 	}
 
 	/*
@@ -793,6 +826,14 @@ IsCheckpointOnSchedule(double progress)
 		ckpt_cached_elapsed = elapsed_time;
 		return false;
 	}
+	else if (original_progress < elapsed_time)
+	{
+		ckpt_cached_elapsed = elapsed_time;
+
+		/* smooth checkpoint write */
+		pg_usleep(CheckPointerWriteDelay * 1000L);
+		return false;
+	}
 
 	/* It looks like we're on schedule. */
 	return true;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..d41dc17 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"checkpointer_write_delay", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+			gettext_noop("Checkpointer sleep time during dirty buffer writes in a checkpoint."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&CheckPointerWriteDelay,
+		200, 10, 10000,
+		NULL, NULL, NULL
+	},
+
+	{
		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
			gettext_noop("Sets the number of disk-page buffers in shared 
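
To see what the smoothing formula in IsCheckpointOnSchedule() does, here is
a small standalone program (an illustration, not part of the patch) that
evaluates the scaled progress at sample points, assuming
checkpoint_completion_target = 0.5, checkpoint_smooth_target = 0.3 and
checkpoint_smooth_margin = 0.0 (illustrative settings, not recommendations):

#include <stdio.h>

int
main(void)
{
	const double completion_target = 0.5;	/* checkpoint_completion_target */
	const double smooth_target = 0.3;		/* checkpoint_smooth_target */
	const double smooth_margin = 0.0;		/* checkpoint_smooth_margin */
	double		raw;

	for (raw = 0.0; raw <= 1.0001; raw += 0.1)
	{
		double		scaled;

		if (raw >= smooth_target)
			scaled = raw * completion_target;	/* normal schedule */
		else
			scaled = raw * (((smooth_target - raw) / smooth_target) *
							(smooth_margin + 1 - completion_target) +
							completion_target);	/* smoothed schedule */
		printf("progress %.1f -> scaled %.3f\n", raw, scaled);
	}
	return 0;
}

Near progress = 0 the multiplier is close to 1 rather than to
checkpoint_completion_target, so the scheduler judges the checkpoint to be
ahead of schedule and the writes proceed slowly, exactly while the
full-page-write burst is happening. The two branches meet at
progress = checkpoint_smooth_target, where the normal schedule takes over.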

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-12 Thread Robert Haas
On Mon, Jun 10, 2013 at 3:48 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 10 June 2013 11:51, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote:
 I create patch which is improvement of checkpoint IO scheduler for stable
 transaction responses.

 Looks like good results, with good measurements. Should be an
 interesting discussion.

+1.

I suspect we want to poke at the algorithms a little here and maybe
see if we can do this without adding new GUCs.  Also, I think this is
probably two separate patches, in the end.  But the direction seems
good to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-10 Thread KONDO Mitsumasa

Hi,

I have created a patch that improves the checkpoint I/O scheduler for stable
transaction responses.


* Problem in checkpoint I/O scheduling under heavy transaction load
  Under heavy transaction load, I think the PostgreSQL checkpoint scheduler 
has two problems, one at the start and one at the end of a checkpoint. The 
first problem is heavy I/O when a checkpoint starts. It is caused by 
full-page writes: the first modification of each page after a checkpoint 
writes the whole page to WAL, inflating WAL volume just after the checkpoint 
begins. The WAL-based checkpoint scheduler therefore wrongly judges that the 
checkpoint is behind schedule, even though it is not, and hurries its writes. 
This degrades transaction response. In other words, the WAL-based scheduler 
is not appropriate at the start of a checkpoint. The second problem is fsync 
freezes at the end of a checkpoint. Normally, checkpoint writes are flushed 
in the background by the OS I/O scheduler. But when that does not work well, 
the fsyncs at the end of the checkpoint freeze I/O and slow down 
transactions. Unexpectedly slow transactions can trigger monitoring errors in 
an HA cluster and degrade the user experience of the application service. 
This is especially serious for database systems on cloud and virtual servers, 
which have little I/O performance to spare. Yet postgresql.conf offers few 
parameters to address it. We would rather trade a somewhat longer checkpoint 
for fast transaction responses: checkpoints are short in practice, and making 
them a little longer is not a problem. One might suggest setting 
checkpoint_segments and checkpoint_timeout to larger values, but a large 
checkpoint_segments wastes file cache on WAL that is never read back, and a 
large checkpoint_timeout leads to long crash recovery.
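
 To make the first problem concrete with illustrative numbers (mine, not 
measured): suppose a checkpoint is scheduled across 16 WAL segments, and 
full-page writes inflate the WAL so that 4 segments are written while the 
checkpointer has flushed only 10% of the dirty buffers. The WAL-based 
scheduler then computes elapsed progress as 4/16 = 25% against 10% of writes 
done, concludes it is 15 points behind, and hurries the buffer writes at 
exactly the moment the full-page-write I/O load is at its peak.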



* Improvement method of the checkpoint I/O scheduler
1. Mitigating the heavy full-page-write I/O at the start of a checkpoint
 My idea is very simple: at the start of a checkpoint, 
checkpoint_completion_target becomes more relaxed. I added three parameters 
for this: 'checkpoint_smooth_target', 'checkpoint_smooth_margin' and 
'checkpointer_write_delay'. 'checkpoint_smooth_target' is the point in 
checkpoint progress up to which the I/O schedule is smoothed. 
'checkpoint_smooth_margin' makes the schedule smoother still; it is a 
heuristic parameter, but it mitigates this problem effectively. 
'checkpointer_write_delay' is the sleep time used by the smoothed schedule, 
nearly the same as 'bgwriter_delay' in PG 9.1 and older. A sample 
configuration is sketched below.
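
 As a sketch, a configuration exercising these parameters might look like 
this (illustrative values only, not recommendations; tuning depends on the 
system):

# postgresql.conf (illustrative values)
checkpoint_completion_target = 0.5
checkpoint_smooth_target = 0.3		# smooth the first 30% of checkpoint progress
checkpoint_smooth_margin = 0.1		# extra slack at the very start (heuristic)
checkpointer_write_delay = 200ms	# sleep between smoothed checkpoint writes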

 If you want more detail, please see the attached patch.

2. Mitigating the fsync freeze problem at the end of a checkpoint
 When the fsync freeze problem occurs, repeatedly issuing more file fsyncs is 
pointless and stalls transactions. So I think that when an fsync takes a long 
time, the I/O queue is flooded, and we should give I/O priority to 
transactions for fast response times. The patch realizes this by inserting a 
sleep between fsyncs whenever an fsync ran long. This may seem to lengthen 
the checkpoint, but not by much: when an fsync takes long, the I/O queue is 
already packed with other I/O, including the checkpoint writes themselves, so 
the sleep only hands I/O priority to concurrently executing transactions. A 
minimal sketch of this idea appears below.
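
 A minimal sketch of the idea in self-contained C (names and the threshold 
are illustrative assumptions, not the actual patch):

#include <sys/time.h>
#include <unistd.h>

#define LONG_FSYNC_US	1000000L	/* treat a 1-second fsync as "long" (illustrative) */

static void
fsync_with_backoff(int fd)
{
	struct timeval start, end;
	long		elapsed_us;

	gettimeofday(&start, NULL);
	fsync(fd);
	gettimeofday(&end, NULL);

	elapsed_us = (end.tv_sec - start.tv_sec) * 1000000L +
				 (end.tv_usec - start.tv_usec);

	/*
	 * A long fsync means the I/O queue is saturated. Sleeping here hands
	 * I/O bandwidth to concurrently running transactions instead of
	 * piling more checkpoint fsyncs onto the queue.
	 */
	if (elapsed_us > LONG_FSYNC_US)
		usleep(elapsed_us);		/* back off roughly as long as the fsync took */
}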
 I tested my patch with the DBT-2 benchmark; please see the test results. My 
patch achieves higher throughput and faster responses than plain PG. 
Checkpoint time is a little longer than with plain PG, but it is not serious.



* Results of DBT-2 with this patch (compared with original PG 9.2.4)
 I used the DBT-2 benchmark software from OSDL, together with pg_statsinfo 
and pg_stats_reporter.


  - Patched PG (patched 9.2.4)
DBT-2 result: http://goo.gl/1PD3l
statsinfo report: http://goo.gl/UlGAO
settings: http://goo.gl/X4Whu

  - Original PG (9.2.4)
DBT-2 result: http://goo.gl/XVxtj
statsinfo report: http://goo.gl/UT1Li
settings: http://goo.gl/eofmb

 The overall measurement value improved by 4%, 'new-order 90th percentile' by 
20%, 'new-order average' by 18%, 'new-order deviation' by 24%, and 
'new-order maximum' by 27%. In pg_stats_reporter's report I confirmed high 
throughput and sustained WAL I/O while checkpoints were executing. My patch 
realizes highly responsive, non-blocking transaction execution.


 The downside of my patch is longer checkpoints: checkpoint time increased by 
about 10% - 20%. But checkpoints still complete correctly within the 
checkpoint_timeout schedule. Please see the checkpoint results 
(http://goo.gl/NsbC6).


* Test server
  Server: HP Proliant DL360 G7
  CPU:Xeon E5640 2.66GHz (1P/4C)
  Memory: 18GB(PC3-10600R-9)
  Disk:   146GB(15k)*4 RAID1+0
  RAID controller: P410i/256MB


 This is not an advertisement for pg_statsinfo and pg_stats_reporter :-) they 
are free software. If you have comments or other ideas about my patch, please 
send them to me.


Best Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..a66ce36 100644
--- 

Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-10 Thread Simon Riggs
On 10 June 2013 11:51, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote:

 I create patch which is improvement of checkpoint IO scheduler for stable
 transaction responses.

Looks like good results, with good measurements. Should be an
interesting discussion.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers