Re: [HACKERS] Sorted writes in checkpoint
Added to TODO:

* Consider sorting writes during checkpoint
  http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Greg Smith <[EMAIL PROTECTED]> wrote:
>
> > On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > > If the kernel can treat sequential writes better than random writes, is
> > > it worth sorting dirty buffers in block order per file at the start of
> > > checkpoints?
>
> I wrote and tested the attached sorted-writes patch based on Heikki's
> ldc-justwrites-1.patch. There was an obvious performance win on an OLTP
> workload.
>
>   tests                     | pgbench | DBT-2 response time (avg/90%/max)
>  ---------------------------+---------+-----------------------------------
>   LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
>   + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
>   + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
>
>  (*) Don't write buffers that were dirtied after starting the checkpoint.
>
>  machine : 2GB-ram, SCSI*4 RAID-5
>  pgbench : -s400 -t4 -c10 (about 5GB of database)
>  DBT-2   : 60WH (about 6GB of database)
>
> > I think it has the potential to improve things. There are three obvious
> > and one subtle argument against it I can think of:
> >
> > 1) Extra complexity for something that may not help. This would need some
> > good, robust benchmarking improvements to justify its use.
>
> Exactly. I think we need a discussion board for I/O performance issues.
> Can I use the Developers Wiki for this purpose? Performance graphs and
> result tables are important for the discussion, so it might be better
> than the mailing lists, which are text-based.
>
> > 2) Block number ordering may not reflect actual order on disk. While
> > true, it's got to be better correlated with it than writing at random.
> > 3) The OS disk elevator should be dealing with this issue, particularly
> > because it may really know the actual disk ordering.
>
> Yes, both are true. However, I think there is a pretty high correlation
> between those orderings. In addition, we could use the filesystem to help
> those orderings correspond to each other. For example, pre-allocation
> of files might help us, as has often been discussed.
>
> > Here's the subtle thing: by writing in the same order the LRU scan occurs
> > in, you are writing dirty buffers in the optimal fashion to eliminate
> > client backend writes during BufferAlloc. This makes the checkpoint a
> > really effective LRU clearing mechanism. Writing in block order will
> > change that.
>
> The issue will probably go away after we have LDC, because it writes LRU
> buffers during checkpoints.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>
> [ Attachment, skipping... ]
>
> ---(end of broadcast)---
> TIP 2: Don't 'kill -9' the postmaster

--
  Bruce Momjian  <[EMAIL PROTECTED]>        http://momjian.us
  EnterpriseDB                              http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sorted writes in checkpoint
On Fri, 2007-06-15 at 18:33 +0900, ITAGAKI Takahiro wrote:
> "Simon Riggs" <[EMAIL PROTECTED]> wrote:
>
> > > tests                     | pgbench | DBT-2 response time (avg/90%/max)
> > > ---------------------------+---------+---------------------------------
> > > LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
> > > + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
> > > + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
> >
> > I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage
> > of writes has been saved by doing that? How long was the write phase of
> > the checkpoint, and how long between checkpoints?
> >
> > I can see the sorted writes having an effect because the OS may not
> > receive blocks within a sufficient time window to fully optimise them.
> > That effect would grow with increasing sizes of shared_buffers and
> > decrease with size of controller cache. How big was the shared_buffers
> > setting? What OS scheduler are you using? The effect would be greatest
> > when using Deadline.
>
> I didn't tune OS parameters; I used default values.
> In terms of cache amounts, the postgres buffers were larger than the
> kernel write pool and the controller cache. That's why the OS could not
> optimise writes enough at checkpoint, I think.
>
>  - 200MB <- RAM * dirty_background_ratio
>  - 128MB <- controller cache
>  -   2GB <- postgres shared_buffers
>
> I forgot to gather detailed I/O information in the tests.
> I'll retry it and report later.
>
>  RAM                      2GB
>  Controller cache         128MB
>  shared_buffers           1GB
>  checkpoint_timeout       = 15min
>  checkpoint_write_percent = 50.0
>
>  RHEL4 (Linux 2.6.9-42.0.2.EL)
>  vm.dirty_background_ratio    = 10
>  vm.dirty_ratio               = 40
>  vm.dirty_expire_centisecs    = 3000
>  vm.dirty_writeback_centisecs = 500
>  Using cfq io scheduler

Sounds like sorting the buffers before checkpoint is going to be a win
once we go above about ~128MB of shared buffers. We can do a simple test
on NBuffers, rather than have a sort_blocks_at_checkpoint (!) GUC. But it
does seem there is a win for larger settings of shared_buffers.

Does performance go up in the non-sorted case if we make shared_buffers
smaller? Sounds like it might. We should check that first.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Sorted writes in checkpoint
"Simon Riggs" <[EMAIL PROTECTED]> wrote: > > tests| pgbench | DBT-2 response time (avg/90%/max) > > ---+-+--- > > LDC only | 181 tps | 1.12 / 4.38 / 12.13 s > > + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 / 9.26 s > > + Sorted writes | 224 tps | 0.36 / 0.80 / 8.11 s > > I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage > of writes has been saved by doing that? > How long was the write phase of the checkpoint, how long > between checkpoints? > > I can see the sorted writes having an effect because the OS may not > receive blocks within a sufficient time window to fully optimise them. > That effect would grow with increasing sizes of shared_buffers and > decrease with size of controller cache. How big was the shared buffers > setting? What OS scheduler are you using? The effect would be greatest > when using Deadline. I didn't tune OS parameters, used default values. In terms of cache amounts, postgres buffers were larger than kernel write pool and controller cache. that's why the OS could not optimise writes enough in checkpoint, I think. - 200MB <- RAM * dirty_background_ratio - 128MB <- Controller cache - 2GB <- postgres shared_buffers I forget to gather detail I/O information in the tests. I'll retry it and report later. RAM 2GB Controller cache 128MB shared_buffers 1GB checkpoint_timeout = 15min checkpoint_write_percent = 50.0 RHEL4 (Linux 2.6.9-42.0.2.EL) vm.dirty_background_ratio= 10 vm.dirty_ratio = 40 vm.dirty_expire_centisecs= 3000 vm.dirty_writeback_centisecs = 500 Using cfq io scheduler Regards, --- ITAGAKI Takahiro NTT Open Source Software Center ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Sorted writes in checkpoint
> > > tests                     | pgbench | DBT-2 response time (avg/90%/max)
> > > ---------------------------+---------+---------------------------------
> > > LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
> > > + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
> > > + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
> > >
> > > (*) Don't write buffers that were dirtied after starting the checkpoint.
> > >
> > > machine : 2GB-ram, SCSI*4 RAID-5
> > > pgbench : -s400 -t4 -c10 (about 5GB of database)
> > > DBT-2   : 60WH (about 6GB of database)
> >
> > I'm very surprised by the BM_CHECKPOINT_NEEDED results. What
> > percentage of writes has been saved by doing that? We would
> > expect a small percentage of blocks only and so that
> > shouldn't make a significant difference. I thought we

Wouldn't pages that are dirtied during the checkpoint also usually be
rather hot? Thus, if we lock one of those for writing, the chances are
high that a client needs to wait for the lock. A write() call should
usually be very fast, but when the I/O gets bottlenecked it might easily
become slower.

Probably the recent result, that it saves ~53% of the writes, is
sufficient explanation though.

Very nice results :-) Looks like we want all of it, including the sort.

Andreas
Re: [HACKERS] Sorted writes in checkpoint
On Thu, 14 Jun 2007, Gregory Maxwell wrote:

> Linux has some instrumentation that might be useful for this testing,
> echo 1 > /proc/sys/vm/block_dump

That bit was developed for tracking down what was spinning the hard drive
up out of power-saving mode, and I was under the impression that such a
rough feature isn't useful here. I just tried to track down again where I
got that impression from, and I think it was this thread:
http://linux.slashdot.org/comments.pl?sid=231817&cid=18832379

This mentions general issues figuring out who was responsible for a write,
and specifically mentions how you'll have to reconcile two different paths
if fsync is mixed in. Not saying it won't work; it's just obvious that
using the block_dump output isn't a simple job.

(For anyone who would like an intro to this feature, try
http://www.linuxjournal.com/node/7539/print and
http://toadstool.se/journal/2006/05/27/monitoring-filesystem-activity-under-linux-with-block_dump )

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [HACKERS] Sorted writes in checkpoint
On 6/14/07, Simon Riggs <[EMAIL PROTECTED]> wrote:
> On Thu, 2007-06-14 at 16:39 +0900, ITAGAKI Takahiro wrote:
> > Greg Smith <[EMAIL PROTECTED]> wrote:
> >
> > > On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > > > If the kernel can treat sequential writes better than random writes, is
> > > > it worth sorting dirty buffers in block order per file at the start of
> > > > checkpoints?
> >
> > I wrote and tested the attached sorted-writes patch based on Heikki's
> > ldc-justwrites-1.patch. There was an obvious performance win on an OLTP
> > workload.
> >
> >   tests                     | pgbench | DBT-2 response time (avg/90%/max)
> >  ---------------------------+---------+-----------------------------------
> >   LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
> >   + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
> >   + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
> >
> >  (*) Don't write buffers that were dirtied after starting the checkpoint.
> >
> >  machine : 2GB-ram, SCSI*4 RAID-5
> >  pgbench : -s400 -t4 -c10 (about 5GB of database)
> >  DBT-2   : 60WH (about 6GB of database)
>
> I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage
> of writes has been saved by doing that? We would expect a small
> percentage of blocks only, and so that shouldn't make a significant
> difference. I thought we discussed this before, about a year ago. It
> would be easy to get that wrong and avoid writing a block that had been
> re-dirtied after the start of checkpoint, but was already dirty
> beforehand.
>
> How long was the write phase of the checkpoint, and how long between
> checkpoints?
>
> I can see the sorted writes having an effect because the OS may not
> receive blocks within a sufficient time window to fully optimise them.
> That effect would grow with increasing sizes of shared_buffers and
> decrease with size of controller cache. How big was the shared_buffers
> setting? What OS scheduler are you using? The effect would be greatest
> when using Deadline.

Linux has some instrumentation that might be useful for this testing:

  echo 1 > /proc/sys/vm/block_dump

will have the kernel log all physical I/O (disable syslog writing to disk
before turning it on if you don't want the system to blow up).

Certainly the OS elevator should be working well enough not to see that
much of an improvement. Perhaps frequent fsync behavior is having an
unintended interaction with the elevator? ... It might be worthwhile to
contact some Linux kernel developers and see if there is some
misunderstanding.
Re: [HACKERS] Sorted writes in checkpoint
On Thu, 2007-06-14 at 16:39 +0900, ITAGAKI Takahiro wrote:
> Greg Smith <[EMAIL PROTECTED]> wrote:
>
> > On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > > If the kernel can treat sequential writes better than random writes, is
> > > it worth sorting dirty buffers in block order per file at the start of
> > > checkpoints?
>
> I wrote and tested the attached sorted-writes patch based on Heikki's
> ldc-justwrites-1.patch. There was an obvious performance win on an OLTP
> workload.
>
>   tests                     | pgbench | DBT-2 response time (avg/90%/max)
>  ---------------------------+---------+-----------------------------------
>   LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
>   + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
>   + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
>
>  (*) Don't write buffers that were dirtied after starting the checkpoint.
>
>  machine : 2GB-ram, SCSI*4 RAID-5
>  pgbench : -s400 -t4 -c10 (about 5GB of database)
>  DBT-2   : 60WH (about 6GB of database)

I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage
of writes has been saved by doing that? We would expect a small
percentage of blocks only, and so that shouldn't make a significant
difference. I thought we discussed this before, about a year ago. It
would be easy to get that wrong and avoid writing a block that had been
re-dirtied after the start of checkpoint, but was already dirty
beforehand.

How long was the write phase of the checkpoint, and how long between
checkpoints?

I can see the sorted writes having an effect because the OS may not
receive blocks within a sufficient time window to fully optimise them.
That effect would grow with increasing sizes of shared_buffers and
decrease with size of controller cache. How big was the shared_buffers
setting? What OS scheduler are you using? The effect would be greatest
when using Deadline.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Sorted writes in checkpoint
On Thu, 14 Jun 2007, ITAGAKI Takahiro wrote:

> I think we need a discussion board for I/O performance issues.
> Can I use the Developers Wiki for this purpose? Performance graphs and
> result tables are important for the discussion, so it might be better
> than the mailing lists, which are text-based.

I started pushing some of my stuff over to there recently to make it
easier to edit, and so other people can expand it with their expertise.
http://developer.postgresql.org/index.php/Buffer_Cache%2C_Checkpoints%2C_and_the_BGW
is what I've done so far on this particular topic.

What I would like to see on the Wiki first are pages devoted to how to run
the common benchmarks people use for useful performance testing. A recent
thread on one of the lists reminded me how easy it is to get worthless
results out of DBT2 if you don't have any guidance on that. I've already
got a stack of documentation about how to wrestle with pgbench and am
generating more.

The problem with using the Wiki as the main focus is that when you get to
the point that you want to upload detailed test results, that interface
really isn't appropriate for it. For example, in the last day I've
collected up data from about 400 short test runs that generated 800
graphs. It's all organized as HTML so you can drill down into the specific
tests that executed oddly. Heikki's DBT2 results are similar; not as many
files, because he's running longer tests, but the navigation is even more
complicated. There is no way to easily put that type and level of
information into a Wiki page. You really just need a web server to copy
the results onto. Then the main problem you have to be concerned about is
a repeat of the OSDL situation, where all the results just disappear if
their hosting sponsor goes away.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [HACKERS] Sorted writes in checkpoint
ITAGAKI Takahiro wrote:
> Greg Smith <[EMAIL PROTECTED]> wrote:
> > On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > > If the kernel can treat sequential writes better than random writes, is
> > > it worth sorting dirty buffers in block order per file at the start of
> > > checkpoints?
>
> I wrote and tested the attached sorted-writes patch based on Heikki's
> ldc-justwrites-1.patch. There was an obvious performance win on an OLTP
> workload.
>
>   tests                     | pgbench | DBT-2 response time (avg/90%/max)
>  ---------------------------+---------+-----------------------------------
>   LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
>   + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
>   + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
>
>  (*) Don't write buffers that were dirtied after starting the checkpoint.
>
>  machine : 2GB-ram, SCSI*4 RAID-5
>  pgbench : -s400 -t4 -c10 (about 5GB of database)
>  DBT-2   : 60WH (about 6GB of database)

Wow, I didn't expect that much gain from the sorted writes. How was LDC
configured?

> > 3) The OS disk elevator should be dealing with this issue, particularly
> > because it may really know the actual disk ordering.

Yeah, but we don't give the OS that much chance to coalesce writes when
we spread them out.

> > Here's the subtle thing: by writing in the same order the LRU scan occurs
> > in, you are writing dirty buffers in the optimal fashion to eliminate
> > client backend writes during BufferAlloc. This makes the checkpoint a
> > really effective LRU clearing mechanism. Writing in block order will
> > change that.
>
> The issue will probably go away after we have LDC, because it writes LRU
> buffers during checkpoints.

I think so too.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Sorted writes in checkpoint
"ITAGAKI Takahiro" <[EMAIL PROTECTED]> writes: > Exactly. I think we need a discussion board for I/O performance issues. > Can I use Developers Wiki for this purpose? Since performance graphs and > result tables are important for the discussion, so it might be better > than mailing lists, that are text-based. I would suggest keeping the discussion on mail and including links to refer to charts and tables in the wiki. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
[HACKERS] Sorted writes in checkpoint
Greg Smith <[EMAIL PROTECTED]> wrote:

> On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > If the kernel can treat sequential writes better than random writes, is
> > it worth sorting dirty buffers in block order per file at the start of
> > checkpoints?

I wrote and tested the attached sorted-writes patch based on Heikki's
ldc-justwrites-1.patch. There was an obvious performance win on an OLTP
workload.

  tests                     | pgbench | DBT-2 response time (avg/90%/max)
 ---------------------------+---------+-----------------------------------
  LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
  + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
  + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s

 (*) Don't write buffers that were dirtied after starting the checkpoint.

 machine : 2GB-ram, SCSI*4 RAID-5
 pgbench : -s400 -t4 -c10 (about 5GB of database)
 DBT-2   : 60WH (about 6GB of database)

> I think it has the potential to improve things. There are three obvious
> and one subtle argument against it I can think of:
>
> 1) Extra complexity for something that may not help. This would need some
> good, robust benchmarking improvements to justify its use.

Exactly. I think we need a discussion board for I/O performance issues.
Can I use the Developers Wiki for this purpose? Performance graphs and
result tables are important for the discussion, so it might be better
than the mailing lists, which are text-based.

> 2) Block number ordering may not reflect actual order on disk. While
> true, it's got to be better correlated with it than writing at random.
> 3) The OS disk elevator should be dealing with this issue, particularly
> because it may really know the actual disk ordering.

Yes, both are true. However, I think there is a pretty high correlation
between those orderings. In addition, we could use the filesystem to help
those orderings correspond to each other. For example, pre-allocation
of files might help us, as has often been discussed.

> Here's the subtle thing: by writing in the same order the LRU scan occurs
> in, you are writing dirty buffers in the optimal fashion to eliminate
> client backend writes during BufferAlloc. This makes the checkpoint a
> really effective LRU clearing mechanism. Writing in block order will
> change that.

The issue will probably go away after we have LDC, because it writes LRU
buffers during checkpoints.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

sorted-ckpt.patch
Description: Binary data