Re: [HACKERS] Sorted writes in checkpoint
Added to TODO:

* Consider sorting writes during checkpoint
  http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Greg Smith <[EMAIL PROTECTED]> wrote:
>
> > On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > > If the kernel can treat sequential writes better than random writes, is
> > > it worth sorting dirty buffers in block order per file at the start of
> > > checkpoints?
>
> I wrote and tested the attached sorted-writes patch based on Heikki's
> ldc-justwrites-1.patch. There was an obvious performance win on an OLTP
> workload.
>
>   tests                     | pgbench | DBT-2 response time (avg/90%/max)
>  ---------------------------+---------+-----------------------------------
>   LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
>   + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
>   + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
>
>  (*) Don't write buffers that were dirtied after starting the checkpoint.
>
>  machine : 2GB-ram, SCSI*4 RAID-5
>  pgbench : -s400 -t4 -c10 (about 5GB of database)
>  DBT-2   : 60WH (about 6GB of database)
>
> > I think it has the potential to improve things. There are three obvious
> > and one subtle argument against it I can think of:
> >
> > 1) Extra complexity for something that may not help. This would need some
> > good, robust benchmarking improvements to justify its use.
>
> Exactly. I think we need a discussion board for I/O performance issues.
> Can I use the Developers Wiki for this purpose? Performance graphs and
> result tables are important for the discussion, so it might be better
> than the mailing lists, which are text-based.
>
> > 2) Block number ordering may not reflect actual order on disk. While
> > true, it's got to be better correlated with it than writing at random.
> > 3) The OS disk elevator should be dealing with this issue, particularly
> > because it may really know the actual disk ordering.
>
> Yes, both are true. However, I think there is a pretty high correlation
> between those orderings. In addition, we could use the filesystem to help
> those orderings correspond to each other. For example, pre-allocation
> of files might help us, as has often been discussed.
>
> > Here's the subtle thing: by writing in the same order the LRU scan occurs
> > in, you are writing dirty buffers in the optimal fashion to eliminate
> > client backend writes during BufferAlloc. This makes the checkpoint a
> > really effective LRU clearing mechanism. Writing in block order will
> > change that.
>
> The issue will probably go away after we have LDC, because it writes LRU
> buffers during checkpoints.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>
> [ Attachment, skipping... ]
>
> ---(end of broadcast)---
> TIP 2: Don't 'kill -9' the postmaster

--
  Bruce Momjian  <[EMAIL PROTECTED]>        http://momjian.us
  EnterpriseDB                              http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sorted writes in checkpoint
On Fri, 2007-06-15 at 18:33 +0900, ITAGAKI Takahiro wrote:
> "Simon Riggs" <[EMAIL PROTECTED]> wrote:
>
> > > tests                     | pgbench | DBT-2 response time (avg/90%/max)
> > > ---------------------------+---------+---------------------------------
> > > LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
> > > + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
> > > + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
> >
> > I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage
> > of writes has been saved by doing that? How long was the write phase of
> > the checkpoint, and how long between checkpoints?
> >
> > I can see the sorted writes having an effect because the OS may not
> > receive blocks within a sufficient time window to fully optimise them.
> > That effect would grow with increasing sizes of shared_buffers and
> > decrease with size of controller cache. How big was the shared_buffers
> > setting? What OS scheduler are you using? The effect would be greatest
> > when using Deadline.
>
> I didn't tune OS parameters; I used default values.
> In terms of cache amounts, the postgres buffers were larger than the
> kernel write pool and the controller cache. That's why the OS could not
> optimise writes enough at checkpoint, I think.
>
>  - 200MB <- RAM * dirty_background_ratio
>  - 128MB <- controller cache
>  -   2GB <- postgres shared_buffers
>
> I forgot to gather detailed I/O information in the tests.
> I'll retry it and report later.
>
>  RAM                      2GB
>  Controller cache         128MB
>  shared_buffers           1GB
>  checkpoint_timeout       = 15min
>  checkpoint_write_percent = 50.0
>
>  RHEL4 (Linux 2.6.9-42.0.2.EL)
>  vm.dirty_background_ratio    = 10
>  vm.dirty_ratio               = 40
>  vm.dirty_expire_centisecs    = 3000
>  vm.dirty_writeback_centisecs = 500
>  Using cfq io scheduler

Sounds like sorting the buffers before checkpoint is going to be a win
once we go above about ~128MB of shared buffers. We can do a simple test
on NBuffers, rather than have a sort_blocks_at_checkpoint (!) GUC. But it
does seem there is a win for larger settings of shared_buffers.

Does performance go up in the non-sorted case if we make shared_buffers
smaller? Sounds like it might. We should check that first.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Sorted writes in checkpoint
"Simon Riggs" <[EMAIL PROTECTED]> wrote: > > tests| pgbench | DBT-2 response time (avg/90%/max) > > ---+-+--- > > LDC only | 181 tps | 1.12 / 4.38 / 12.13 s > > + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 / 9.26 s > > + Sorted writes | 224 tps | 0.36 / 0.80 / 8.11 s > > I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage > of writes has been saved by doing that? > How long was the write phase of the checkpoint, how long > between checkpoints? > > I can see the sorted writes having an effect because the OS may not > receive blocks within a sufficient time window to fully optimise them. > That effect would grow with increasing sizes of shared_buffers and > decrease with size of controller cache. How big was the shared buffers > setting? What OS scheduler are you using? The effect would be greatest > when using Deadline. I didn't tune OS parameters, used default values. In terms of cache amounts, postgres buffers were larger than kernel write pool and controller cache. that's why the OS could not optimise writes enough in checkpoint, I think. - 200MB <- RAM * dirty_background_ratio - 128MB <- Controller cache - 2GB <- postgres shared_buffers I forget to gather detail I/O information in the tests. I'll retry it and report later. RAM 2GB Controller cache 128MB shared_buffers 1GB checkpoint_timeout = 15min checkpoint_write_percent = 50.0 RHEL4 (Linux 2.6.9-42.0.2.EL) vm.dirty_background_ratio= 10 vm.dirty_ratio = 40 vm.dirty_expire_centisecs= 3000 vm.dirty_writeback_centisecs = 500 Using cfq io scheduler Regards, --- ITAGAKI Takahiro NTT Open Source Software Center ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Sorted writes in checkpoint
> > > tests                     | pgbench | DBT-2 response time (avg/90%/max)
> > > ---------------------------+---------+---------------------------------
> > > LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
> > > + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
> > > + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
> > >
> > > (*) Don't write buffers that were dirtied after starting the checkpoint.
> > >
> > > machine : 2GB-ram, SCSI*4 RAID-5
> > > pgbench : -s400 -t4 -c10 (about 5GB of database)
> > > DBT-2   : 60WH (about 6GB of database)
> >
> > I'm very surprised by the BM_CHECKPOINT_NEEDED results. What
> > percentage of writes has been saved by doing that? We would
> > expect a small percentage of blocks only and so that
> > shouldn't make a significant difference. I thought we

Wouldn't pages that are dirtied during the checkpoint also usually be
rather hot? Thus, if we lock one of those for writing, the chances are
high that a client needs to wait for the lock. A write() call should
usually be very fast, but when the I/O gets bottlenecked it might easily
become slower.

Probably the recent result, that it saves ~53% of the writes, is
sufficient explanation though.

Very nice results :-) Looks like we want all of it, including the sort.

Andreas
Re: [HACKERS] Sorted writes in checkpoint
On Thu, 14 Jun 2007, Gregory Maxwell wrote:

> Linux has some instrumentation that might be useful for this testing,
> echo 1 > /proc/sys/vm/block_dump

That bit was developed for tracking down what was spinning the hard drive
up out of power-saving mode, and I was under the impression that such a
rough feature isn't useful here. I just tried to track down again where I
got that impression from, and I think it was this thread:
http://linux.slashdot.org/comments.pl?sid=231817&cid=18832379

This mentions general issues figuring out who was responsible for a write,
and specifically mentions how you'll have to reconcile two different paths
if fsync is mixed in. Not saying it won't work; it's just obvious that
using the block_dump output isn't a simple job.

(For anyone who would like an intro to this feature, try
http://www.linuxjournal.com/node/7539/print and
http://toadstool.se/journal/2006/05/27/monitoring-filesystem-activity-under-linux-with-block_dump )

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [HACKERS] Sorted writes in checkpoint
On 6/14/07, Simon Riggs <[EMAIL PROTECTED]> wrote:
> On Thu, 2007-06-14 at 16:39 +0900, ITAGAKI Takahiro wrote:
> > Greg Smith <[EMAIL PROTECTED]> wrote:
> >
> > > On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > > > If the kernel can treat sequential writes better than random writes, is
> > > > it worth sorting dirty buffers in block order per file at the start of
> > > > checkpoints?
> >
> > I wrote and tested the attached sorted-writes patch based on Heikki's
> > ldc-justwrites-1.patch. There was an obvious performance win on an OLTP
> > workload.
> >
> >   tests                     | pgbench | DBT-2 response time (avg/90%/max)
> >  ---------------------------+---------+-----------------------------------
> >   LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
> >   + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
> >   + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
> >
> >  (*) Don't write buffers that were dirtied after starting the checkpoint.
> >
> >  machine : 2GB-ram, SCSI*4 RAID-5
> >  pgbench : -s400 -t4 -c10 (about 5GB of database)
> >  DBT-2   : 60WH (about 6GB of database)
>
> I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage
> of writes has been saved by doing that? We would expect a small
> percentage of blocks only, and so that shouldn't make a significant
> difference. I thought we discussed this before, about a year ago. It
> would be easy to get that wrong and avoid writing a block that had been
> re-dirtied after the start of checkpoint, but was already dirty
> beforehand.
>
> How long was the write phase of the checkpoint, and how long between
> checkpoints?
>
> I can see the sorted writes having an effect because the OS may not
> receive blocks within a sufficient time window to fully optimise them.
> That effect would grow with increasing sizes of shared_buffers and
> decrease with size of controller cache. How big was the shared_buffers
> setting? What OS scheduler are you using? The effect would be greatest
> when using Deadline.

Linux has some instrumentation that might be useful for this testing:

  echo 1 > /proc/sys/vm/block_dump

will have the kernel log all physical I/O (disable syslog writing to disk
before turning it on if you don't want the system to blow up).

Certainly the OS elevator should be working well enough not to see that
much of an improvement. Perhaps frequent fsync behavior is having an
unintended interaction with the elevator? ... It might be worthwhile to
contact some Linux kernel developers and see if there is some
misunderstanding.
Re: [HACKERS] Sorted writes in checkpoint
On Thu, 2007-06-14 at 16:39 +0900, ITAGAKI Takahiro wrote:
> Greg Smith <[EMAIL PROTECTED]> wrote:
>
> > On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > > If the kernel can treat sequential writes better than random writes, is
> > > it worth sorting dirty buffers in block order per file at the start of
> > > checkpoints?
>
> I wrote and tested the attached sorted-writes patch based on Heikki's
> ldc-justwrites-1.patch. There was an obvious performance win on an OLTP
> workload.
>
>   tests                     | pgbench | DBT-2 response time (avg/90%/max)
>  ---------------------------+---------+-----------------------------------
>   LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
>   + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
>   + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
>
>  (*) Don't write buffers that were dirtied after starting the checkpoint.
>
>  machine : 2GB-ram, SCSI*4 RAID-5
>  pgbench : -s400 -t4 -c10 (about 5GB of database)
>  DBT-2   : 60WH (about 6GB of database)

I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage
of writes has been saved by doing that? We would expect a small
percentage of blocks only, and so that shouldn't make a significant
difference. I thought we discussed this before, about a year ago. It
would be easy to get that wrong and avoid writing a block that had been
re-dirtied after the start of checkpoint, but was already dirty
beforehand.

How long was the write phase of the checkpoint, and how long between
checkpoints?

I can see the sorted writes having an effect because the OS may not
receive blocks within a sufficient time window to fully optimise them.
That effect would grow with increasing sizes of shared_buffers and
decrease with size of controller cache. How big was the shared_buffers
setting? What OS scheduler are you using? The effect would be greatest
when using Deadline.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Sorted writes in checkpoint
On Thu, 14 Jun 2007, ITAGAKI Takahiro wrote:

> I think we need a discussion board for I/O performance issues.
> Can I use the Developers Wiki for this purpose? Performance graphs and
> result tables are important for the discussion, so it might be better
> than the mailing lists, which are text-based.

I started pushing some of my stuff over to there recently to make it
easier to edit, and so other people can expand it with their expertise.
http://developer.postgresql.org/index.php/Buffer_Cache%2C_Checkpoints%2C_and_the_BGW
is what I've done so far on this particular topic.

What I would like to see on the Wiki first are pages devoted to how to run
the common benchmarks people use for useful performance testing. A recent
thread on one of the lists reminded me how easy it is to get worthless
results out of DBT2 if you don't have any guidance on that. I've already
got a stack of documentation about how to wrestle with pgbench and am
generating more.

The problem with using the Wiki as the main focus is that when you get to
the point that you want to upload detailed test results, that interface
really isn't appropriate for it. For example, in the last day I've
collected up data from about 400 short test runs that generated 800
graphs. It's all organized as HTML so you can drill down into the specific
tests that executed oddly. Heikki's DBT2 results are similar; not as many
files, because he's running longer tests, but the navigation is even more
complicated. There is no way to easily put that type and level of
information into a Wiki page. You really just need a web server to copy
the results onto. Then the main problem you have to be concerned about is
a repeat of the OSDL situation, where all the results just disappear if
their hosting sponsor goes away.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [HACKERS] Sorted writes in checkpoint
ITAGAKI Takahiro wrote:
> Greg Smith <[EMAIL PROTECTED]> wrote:
> > On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > > If the kernel can treat sequential writes better than random writes, is
> > > it worth sorting dirty buffers in block order per file at the start of
> > > checkpoints?
>
> I wrote and tested the attached sorted-writes patch based on Heikki's
> ldc-justwrites-1.patch. There was an obvious performance win on an OLTP
> workload.
>
>   tests                     | pgbench | DBT-2 response time (avg/90%/max)
>  ---------------------------+---------+-----------------------------------
>   LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
>   + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
>   + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
>
>  (*) Don't write buffers that were dirtied after starting the checkpoint.
>
>  machine : 2GB-ram, SCSI*4 RAID-5
>  pgbench : -s400 -t4 -c10 (about 5GB of database)
>  DBT-2   : 60WH (about 6GB of database)

Wow, I didn't expect that much gain from the sorted writes. How was LDC
configured?

> > 3) The OS disk elevator should be dealing with this issue, particularly
> > because it may really know the actual disk ordering.

Yeah, but we don't give the OS that much chance to coalesce writes when
we spread them out.

> > Here's the subtle thing: by writing in the same order the LRU scan occurs
> > in, you are writing dirty buffers in the optimal fashion to eliminate
> > client backend writes during BufferAlloc. This makes the checkpoint a
> > really effective LRU clearing mechanism. Writing in block order will
> > change that.
>
> The issue will probably go away after we have LDC, because it writes LRU
> buffers during checkpoints.

I think so too.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Sorted writes in checkpoint
"ITAGAKI Takahiro" <[EMAIL PROTECTED]> writes: > Exactly. I think we need a discussion board for I/O performance issues. > Can I use Developers Wiki for this purpose? Since performance graphs and > result tables are important for the discussion, so it might be better > than mailing lists, that are text-based. I would suggest keeping the discussion on mail and including links to refer to charts and tables in the wiki. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
[HACKERS] Sorted writes in checkpoint
Greg Smith <[EMAIL PROTECTED]> wrote:

> On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > If the kernel can treat sequential writes better than random writes, is
> > it worth sorting dirty buffers in block order per file at the start of
> > checkpoints?

I wrote and tested the attached sorted-writes patch based on Heikki's
ldc-justwrites-1.patch. There was an obvious performance win on an OLTP
workload.

  tests                     | pgbench | DBT-2 response time (avg/90%/max)
 ---------------------------+---------+-----------------------------------
  LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
  + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
  + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s

 (*) Don't write buffers that were dirtied after starting the checkpoint.

 machine : 2GB-ram, SCSI*4 RAID-5
 pgbench : -s400 -t4 -c10 (about 5GB of database)
 DBT-2   : 60WH (about 6GB of database)

> I think it has the potential to improve things. There are three obvious
> and one subtle argument against it I can think of:
>
> 1) Extra complexity for something that may not help. This would need some
> good, robust benchmarking improvements to justify its use.

Exactly. I think we need a discussion board for I/O performance issues.
Can I use the Developers Wiki for this purpose? Performance graphs and
result tables are important for the discussion, so it might be better
than the mailing lists, which are text-based.

> 2) Block number ordering may not reflect actual order on disk. While
> true, it's got to be better correlated with it than writing at random.
> 3) The OS disk elevator should be dealing with this issue, particularly
> because it may really know the actual disk ordering.

Yes, both are true. However, I think there is a pretty high correlation
between those orderings. In addition, we could use the filesystem to help
those orderings correspond to each other. For example, pre-allocation
of files might help us, as has often been discussed.

> Here's the subtle thing: by writing in the same order the LRU scan occurs
> in, you are writing dirty buffers in the optimal fashion to eliminate
> client backend writes during BufferAlloc. This makes the checkpoint a
> really effective LRU clearing mechanism. Writing in block order will
> change that.

The issue will probably go away after we have LDC, because it writes LRU
buffers during checkpoints.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

sorted-ckpt.patch
Description: Binary data