Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-13 Thread Amit Kapila
Dan, I believe your double-write buffer approach is right, as it has the 
potential to avoid the latency backends incur for full-page writes after a 
checkpoint. Although overall I/O may be higher in this case, if we can make sure 
that in most scenarios a backend never has to do the I/O itself, it can show a 
performance improvement compared to full-page writes as well.

-Original Message-
From: Dan Scales [mailto:sca...@vmware.com] 
Sent: Thursday, February 09, 2012 5:30 AM
To: Amit Kapila
Cc: PG Hackers
Subject: Re: [HACKERS] double writes using double-write buffer approach [WIP]

 Is there any problem if the double-write happens only via the bgwriter or 
 at checkpoint?
 Something like: whenever a backend process has to evict a buffer, it will do 
 the same as you have described, i.e. write into the double-write buffer, but 
 the bgwriter will check this double-write buffer and flush from it.
 Also, whenever a backend sees that the double-write buffer is more than 2/3rds 
 (or some threshold value) full, it will tell the bgwriter to flush the 
 double-write buffer.
 This can ensure very little I/O by any backend.

Yes, I think this is a good idea.  I could make changes so that the backends 
hand off the responsibility for flushing batches of the double-write buffer to 
the bgwriter whenever possible.  This would avoid some long I/O waits in the 
backends, though the backends may of course eventually wait anyway for the 
bgwriter if I/O is not fast enough.  I did write the code so that any process 
can write a completed batch if the batch is not currently being flushed (so as 
to deal with crashes by backends).  Having the backends flush the batches as 
they fill them up was just simpler for a first prototype.
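
For illustration only, here is a minimal standalone sketch in plain C with
pthreads of the handoff being discussed.  All names (dw_ring, dw_add_page,
WAKE_THRESHOLD) and the use of a flusher thread are invented for the sketch;
the real patch works with shared memory and the bgwriter process, not threads.
Backends stash evicted pages in a shared ring and wake a dedicated flusher
once the ring is more than 2/3rds full.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define DW_RING_PAGES   128
#define PAGE_SIZE       8192
#define WAKE_THRESHOLD  ((DW_RING_PAGES * 2) / 3)   /* "2/3rds full" */

typedef struct
{
    char            pages[DW_RING_PAGES][PAGE_SIZE];
    int             nfilled;
    bool            done;
    pthread_mutex_t lock;
    pthread_cond_t  wake;   /* signalled when the ring passes the threshold */
} dw_ring;

static dw_ring ring = { .lock = PTHREAD_MUTEX_INITIALIZER,
                        .wake = PTHREAD_COND_INITIALIZER };

/* Backend side: stash the evicted page, and poke the flusher if 2/3rds full. */
static void
dw_add_page(const char *page)
{
    pthread_mutex_lock(&ring.lock);
    memcpy(ring.pages[ring.nfilled++], page, PAGE_SIZE);
    if (ring.nfilled >= WAKE_THRESHOLD)
        pthread_cond_signal(&ring.wake);
    pthread_mutex_unlock(&ring.lock);
}

/* Stand-in for the bgwriter: wait to be woken, then flush the batch. */
static void *
dw_flusher(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&ring.lock);
    while (!ring.done || ring.nfilled > 0)
    {
        while (ring.nfilled < WAKE_THRESHOLD && !ring.done)
            pthread_cond_wait(&ring.wake, &ring.lock);
        if (ring.nfilled > 0)
        {
            /* Real code would double-write and fsync the batch here. */
            printf("flusher: double-writing a batch of %d pages\n", ring.nfilled);
            ring.nfilled = 0;
        }
    }
    pthread_mutex_unlock(&ring.lock);
    return NULL;
}

int
main(void)
{
    pthread_t   tid;
    char        page[PAGE_SIZE] = {0};

    pthread_create(&tid, NULL, dw_flusher, NULL);
    for (int i = 0; i < 100; i++)       /* pretend 100 buffers get evicted */
        dw_add_page(page);

    pthread_mutex_lock(&ring.lock);     /* tell the flusher to finish up */
    ring.done = true;
    pthread_cond_signal(&ring.wake);
    pthread_mutex_unlock(&ring.lock);
    pthread_join(tid, NULL);
    return 0;
}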

Dan

- Original Message -
From: Amit Kapila amit.kap...@huawei.com
To: Dan Scales sca...@vmware.com, PG Hackers 
pgsql-hackers@postgresql.org
Sent: Tuesday, February 7, 2012 1:08:49 AM
Subject: Re: [HACKERS] double writes using double-write buffer approach [WIP]

 I think it is a good idea, and can help double-writes perform better in the 
 case of lots of backend evictions.
   I don't understand this point, because from the data in your mail it 
appears that when shared buffers are smaller, which means more evictions can 
happen, the performance is worse.

ISTM that the performance is worse when the shared buffers size is smaller, 
because I/O might then be done by the backend process, which can degrade 
performance.
Is there any problem if the double-write happens only via the bgwriter or 
at checkpoint?
Something like: whenever a backend process has to evict a buffer, it will do 
the same as you have described, i.e. write into the double-write buffer, but 
the bgwriter will check this double-write buffer and flush from it.
Also, whenever a backend sees that the double-write buffer is more than 2/3rds 
(or some threshold value) full, it will tell the bgwriter to flush the 
double-write buffer.
This can ensure very little I/O by any backend.


-Original Message-
From: pgsql-hackers-ow...@postgresql.org 
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Dan Scales
Sent: Saturday, January 28, 2012 4:02 AM
To: PG Hackers
Subject: [HACKERS] double writes using double-write buffer approach [WIP]

I've been prototyping the double-write buffer idea that Heikki and Simon had 
proposed (as an alternative to a previous patch that only batched up writes by 
the checkpointer).  I think it is a good idea, and can help double-writes 
perform better in the case of lots of backend evictions.
It also centralizes most of the code change in smgr.c.  However, it is trickier 
to reason about.

The idea is that all page writes generally are copied to a double-write buffer, 
rather than being immediately written.  Note that a full copy of the page is 
required, but it can be folded in with a checksum calculation.
Periodically (e.g. every time a certain-size batch of writes has been added), 
some writes are pushed out using double writes -- the pages are first written 
and fsynced to a double-write file, then written to the data files, which are 
then fsynced.  The double writes then allow torn pages to be fixed, so 
full_page_writes can be turned off (thus greatly reducing the size of the WAL 
log).
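
As an illustration of that ordering, here is a hedged standalone sketch in
plain POSIX C.  The file names and types are invented and this is not code
from the patch; it only shows why the sequence works: the batch is made
durable in the double-write file before any page is written to its real
location, so a torn data-file page can always be repaired from the intact
double-write copy after a crash.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE  8192
#define BATCH_SIZE 64

typedef struct
{
    char  page[PAGE_SIZE];
    off_t data_offset;          /* where this page lives in the data file */
} dw_slot;

/* Write a batch using double writes: dw file first, data file second. */
static void
flush_batch(const dw_slot *batch, int n, const char *dw_path, int datafd)
{
    int dwfd = open(dw_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (dwfd < 0) { perror("open dw file"); exit(1); }

    /* 1. Write the whole batch sequentially into the double-write file. */
    for (int i = 0; i < n; i++)
        if (pwrite(dwfd, batch[i].page, PAGE_SIZE,
                   (off_t) i * PAGE_SIZE) != PAGE_SIZE)
        { perror("pwrite dw file"); exit(1); }

    /* 2. Make the double-write copies durable before touching the data file. */
    if (fsync(dwfd) != 0) { perror("fsync dw file"); exit(1); }
    close(dwfd);

    /* 3. Now write each page to its real location in the data file... */
    for (int i = 0; i < n; i++)
        if (pwrite(datafd, batch[i].page, PAGE_SIZE,
                   batch[i].data_offset) != PAGE_SIZE)
        { perror("pwrite data file"); exit(1); }

    /* 4. ...and fsync it; after this the double-write file can be reused. */
    if (fsync(datafd) != 0) { perror("fsync data file"); exit(1); }
}

int
main(void)
{
    static dw_slot batch[BATCH_SIZE];
    int            datafd = open("demo_datafile", O_WRONLY | O_CREAT, 0600);

    if (datafd < 0) { perror("open data file"); return 1; }
    for (int i = 0; i < BATCH_SIZE; i++)
    {
        memset(batch[i].page, 'x', PAGE_SIZE);
        batch[i].data_offset = (off_t) i * PAGE_SIZE;
    }
    flush_batch(batch, BATCH_SIZE, "demo_doublewrite", datafd);
    close(datafd);
    return 0;
}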

The key changes are conceptually simple:

1.  In smgrwrite(), copy the page to the double-write buffer.  If a big
enough batch has accumulated, then flush the batch using double
writes.  [I don't think I need to intercept calls to smgrextend(),
but I am not totally sure.]

2.  In smgrread(), always look first in the double-write buffer for a
particular page, before going to disk.

3.  At the end of a checkpoint and on shutdown, always make sure that the
current contents of the double-write buffer are flushed.

4.  Pass flags around in some cases to indicate whether a page buffer
needs a double write or not.  (I think eventually this would be an
attribute of the buffer, set when the page is WAL-logged, rather than
a flag passed around.)

5.  Deal with duplicates in the double-write buffer appropriately (very
rarely happens).

Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-08 Thread Dan Scales
 Is there any problem if the double-write happens only via the bgwriter or 
 at checkpoint?
 Something like: whenever a backend process has to evict a buffer, it will do 
 the same as you have described, i.e. write into the double-write buffer, but 
 the bgwriter will check this double-write buffer and flush from it.
 Also, whenever a backend sees that the double-write buffer is more than 2/3rds 
 (or some threshold value) full, it will tell the bgwriter to flush the 
 double-write buffer.
 This can ensure very little I/O by any backend.

Yes, I think this is a good idea.  I could make changes so that the backends 
hand off the responsibility for flushing batches of the double-write buffer to 
the bgwriter whenever possible.  This would avoid some long I/O waits in the 
backends, though the backends may of course eventually wait anyway for the 
bgwriter if I/O is not fast enough.  I did write the code so that any process 
can write a completed batch if the batch is not currently being flushed (so as 
to deal with crashes by backends).  Having the backends flush the batches as 
they fill them up was just simpler for a first prototype.

Dan

- Original Message -
From: Amit Kapila amit.kap...@huawei.com
To: Dan Scales sca...@vmware.com, PG Hackers 
pgsql-hackers@postgresql.org
Sent: Tuesday, February 7, 2012 1:08:49 AM
Subject: Re: [HACKERS] double writes using double-write buffer approach [WIP]

 I think it is a good idea, and can help double-writes perform better in the 
 case of lots of backend evictions.
   I don't understand this point, because from the data in your mail it 
appears that when shared buffers are smaller, which means more evictions can 
happen, the performance is worse.

ISTM that the performance is worse when the shared buffers size is smaller, 
because I/O might then be done by the backend process, which can degrade 
performance.
Is there any problem if the double-write happens only via the bgwriter or 
at checkpoint?
Something like: whenever a backend process has to evict a buffer, it will do 
the same as you have described, i.e. write into the double-write buffer, but 
the bgwriter will check this double-write buffer and flush from it.
Also, whenever a backend sees that the double-write buffer is more than 2/3rds 
(or some threshold value) full, it will tell the bgwriter to flush the 
double-write buffer.
This can ensure very little I/O by any backend.


-Original Message-
From: pgsql-hackers-ow...@postgresql.org 
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Dan Scales
Sent: Saturday, January 28, 2012 4:02 AM
To: PG Hackers
Subject: [HACKERS] double writes using double-write buffer approach [WIP]

I've been prototyping the double-write buffer idea that Heikki and Simon had 
proposed (as an alternative to a previous patch that only batched up writes by 
the checkpointer).  I think it is a good idea, and can help double-writes 
perform better in the case of lots of backend evictions.
It also centralizes most of the code change in smgr.c.  However, it is trickier 
to reason about.

The idea is that all page writes generally are copied to a double-write buffer, 
rather than being immediately written.  Note that a full copy of the page is 
required, but it can be folded in with a checksum calculation.
Periodically (e.g. every time a certain-size batch of writes has been added), 
some writes are pushed out using double writes -- the pages are first written 
and fsynced to a double-write file, then written to the data files, which are 
then fsynced.  The double writes then allow torn pages to be fixed, so 
full_page_writes can be turned off (thus greatly reducing the size of the WAL 
log).

The key changes are conceptually simple:

1.  In smgrwrite(), copy the page to the double-write buffer.  If a big
enough batch has accumulated, then flush the batch using double
writes.  [I don't think I need to intercept calls to smgrextend(),
but I am not totally sure.]

2.  In smgrread(), always look first in the double-write buffer for a
particular page, before going to disk.

3.  At the end of a checkpoint and on shutdown, always make sure that the
current contents of the double-write buffer are flushed.

4.  Pass flags around in some cases to indicate whether a page buffer
needs a double write or not.  (I think eventually this would be an
attribute of the buffer, set when the page is WAL-logged, rather than
a flag passed around.)

5.  Deal with duplicates in the double-write buffer appropriately (very
rarely happens).

To get good performance, I needed to have two double-write buffers, one for the 
checkpointer and one for all other processes.  The double-write buffers are 
circular buffers.  The checkpointer double-write buffer is just a single batch 
of 64 pages; the non-checkpointer double-write buffer is 128 pages, 2 batches 
of 64 pages each.  Each batch goes to a different double-write file, so that 
they can be issued independently as soon as each batch is completed.  Also, I 
need to sort the buffers being checkpointed by file/offset (see ioseq.c), so 
that the checkpointer batches will most likely only have to write and fsync one 
data file.

Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-07 Thread Amit Kapila
 I think it is a good idea, and can help double-writes perform better in the 
 case of lots of backend evictions.
   I don't understand this point, because from the data in your mail it 
appears that when shared buffers are smaller, which means more evictions can 
happen, the performance is worse.

ISTM that the performance is worse when the shared buffers size is smaller, 
because I/O might then be done by the backend process, which can degrade 
performance.
Is there any problem if the double-write happens only via the bgwriter or 
at checkpoint?
Something like: whenever a backend process has to evict a buffer, it will do 
the same as you have described, i.e. write into the double-write buffer, but 
the bgwriter will check this double-write buffer and flush from it.
Also, whenever a backend sees that the double-write buffer is more than 2/3rds 
(or some threshold value) full, it will tell the bgwriter to flush the 
double-write buffer.
This can ensure very little I/O by any backend.


-Original Message-
From: pgsql-hackers-ow...@postgresql.org 
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Dan Scales
Sent: Saturday, January 28, 2012 4:02 AM
To: PG Hackers
Subject: [HACKERS] double writes using double-write buffer approach [WIP]

I've been prototyping the double-write buffer idea that Heikki and Simon had 
proposed (as an alternative to a previous patch that only batched up writes by 
the checkpointer).  I think it is a good idea, and can help double-writes 
perform better in the case of lots of backend evictions.
It also centralizes most of the code change in smgr.c.  However, it is trickier 
to reason about.

The idea is that all page writes generally are copied to a double-write buffer, 
rather than being immediately written.  Note that a full copy of the page is 
required, but it can be folded in with a checksum calculation.
Periodically (e.g. every time a certain-size batch of writes has been added), 
some writes are pushed out using double writes -- the pages are first written 
and fsynced to a double-write file, then written to the data files, which are 
then fsynced.  The double writes then allow torn pages to be fixed, so 
full_page_writes can be turned off (thus greatly reducing the size of the WAL 
log).

The key changes are conceptually simple:

1.  In smgrwrite(), copy the page to the double-write buffer.  If a big
enough batch has accumulated, then flush the batch using double
writes.  [I don't think I need to intercept calls to smgrextend(),
but I am not totally sure.]

2.  In smgrread(), always look first in the double-write buffer for a
particular page, before going to disk.

3.  At the end of a checkpoint and on shutdown, always make sure that the
current contents of the double-write buffer are flushed.

4.  Pass flags around in some cases to indicate whether a page buffer
needs a double write or not.  (I think eventually this would be an
attribute of the buffer, set when the page is WAL-logged, rather than
a flag passed around.)

5.  Deal with duplicates in the double-write buffer appropriately (very
rarely happens).

To get good performance, I needed to have two double-write buffers, one for the 
checkpointer and one for all other processes.  The double-write buffers are 
circular buffers.  The checkpointer double-write buffer is just a single batch 
of 64 pages; the non-checkpointer double-write buffer is 128 pages, 2 batches 
of 64 pages each.  Each batch goes to a different double-write file, so that 
they can be issued independently as soon as each batch is completed.  Also, I 
need to sort the buffers being checkpointed by file/offset (see ioseq.c), so 
that the checkpointer batches will most likely only have to write and fsync one 
data file.
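
As a rough sketch of that layout (type names invented; this is not the patch's
actual structures or its ioseq.c), the two buffers and the file/offset sort
might look something like the standalone program below.  The sort is what lets
a checkpointer batch usually touch and fsync only one data file.

#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE       8192
#define DW_BATCH_PAGES  64

typedef struct
{
    int   relfile;                      /* data file the page belongs to */
    long  offset;                       /* byte offset within that file */
    char  page[PAGE_SIZE];
} dw_page;

typedef struct                          /* checkpointer: one batch */
{
    dw_page batch[DW_BATCH_PAGES];
    int     nfilled;
} dw_checkpoint_buffer;

typedef struct                          /* everyone else: 2 batches, a ring */
{
    dw_page batches[2][DW_BATCH_PAGES];
    int     current;                    /* which batch is being filled */
    int     nfilled;
} dw_shared_buffer;

/* Order pages by (file, offset) so a flushed batch is mostly sequential I/O
 * against a single data file. */
static int
cmp_file_offset(const void *a, const void *b)
{
    const dw_page *pa = a, *pb = b;

    if (pa->relfile != pb->relfile)
        return (pa->relfile > pb->relfile) - (pa->relfile < pb->relfile);
    return (pa->offset > pb->offset) - (pa->offset < pb->offset);
}

int
main(void)
{
    static dw_checkpoint_buffer ckpt;   /* static: too big for the stack */

    ckpt.nfilled = DW_BATCH_PAGES;
    for (int i = 0; i < ckpt.nfilled; i++)
    {
        ckpt.batch[i].relfile = i % 3;
        ckpt.batch[i].offset = (long) (rand() % 1024) * PAGE_SIZE;
    }
    qsort(ckpt.batch, ckpt.nfilled, sizeof(dw_page), cmp_file_offset);
    printf("first page after sort: file %d, offset %ld\n",
           ckpt.batch[0].relfile, ckpt.batch[0].offset);
    return 0;
}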

Interestingly, I find that the plot of tpm for DBT2 is much smoother (though 
still has wiggles) with double writes enabled, since there are no unpredictable 
long fsyncs at the end of (or during) a checkpoint.

Here are performance numbers for the double-write buffer (same configs as 
previous numbers), for 2-processor, 60-minute 50-warehouse DBT2.  On the right 
are shown the size of shared_buffers and the size of the RAM in the virtual 
machine.  FPW stands for full_page_writes, DW for double_writes.  'two disk' 
means the WAL log is on a separate ext3 filesystem from the data files.

           FPW off   FPW on   DW on, FPW off
one disk:  15488     13146    11713     [5G buffers, 8G VM]
two disk:  18833     16703    18013

one disk:  12908     11159    9758      [3G buffers, 6G VM]
two disk:  14258     12694    11229

one disk:  10829     9865     5806      [1G buffers, 8G VM]
two disk:  13605     12694    5682

one disk:  6752      6129     4878
two disk:  7253      6677     5239      [1G buffers, 2G VM]


The performance of DW on the small cache cases (1G shared_buffers) is now much 
better, though still not as good as FPW on.  In the medium cache case (3G 
buffers), where there are significant backend dirty 

Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-07 Thread Greg Smith

On 02/07/2012 12:09 AM, Dan Scales wrote:

So, yes, good point -- double writes cannot replace the functionality of 
full_page_writes for base backup.  If double writes were in use, they might be 
automatically switched over to full page writes for the duration of the base 
backup.  And the double write file should not be part of the base backup.


There is already a check for this sort of problem during the base 
backup.  It forces full_page_writes on for the backup, even if the 
running configuration has it off.  So long as double writes can be 
smoothly turned off and back on again, that same section of code can 
easily be made to handle that, too.


As far as not making the double write file part of the base backup, I 
was assuming that would go into a subdirectory under pg_xlog by 
default.  I would think that people who relocate pg_xlog using one of 
the methods for doing that would want the double write buffer to move as 
well.  And if it's inside pg_xlog, existing base backup scripts won't 
need to be changed--the correct ones already exclude pg_xlog files.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-06 Thread Fujii Masao
On Sat, Jan 28, 2012 at 7:31 AM, Dan Scales sca...@vmware.com wrote:
 Let me know if you have any thoughts/comments, etc.  The patch is
 enclosed, and the README.doublewrites is updated a fair bit.

ISTM that the double-write cannot prevent torn pages in either the double-write
file or the data file during a *base backup*, because both the double-write file
and the data file can be backed up while being written. Is this right? To avoid
the torn-page problem, should we write FPIs to WAL during an online backup even
if the double-write has been committed?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-06 Thread Dan Scales
I don't know a lot about base backup, but it sounds like full_page_writes must 
be turned on for base backup, in order to deal with the inconsistent reads of 
pages (which you might call torn pages) that can happen when you back up the 
data files while the database is running.  The relevant parts of the WAL log 
are then copied separately (and consistently) once the backup of the data files 
is done, and used to recover the database into a consistent state later.

So, yes, good point -- double writes cannot replace the functionality of 
full_page_writes for base backup.  If double writes were in use, they might be 
automatically switched over to full page writes for the duration of the base 
backup.  And the double write file should not be part of the base backup.

Dan

- Original Message -
From: Fujii Masao masao.fu...@gmail.com
To: Dan Scales sca...@vmware.com
Cc: PG Hackers pgsql-hackers@postgresql.org
Sent: Monday, February 6, 2012 3:08:15 AM
Subject: Re: [HACKERS] double writes using double-write buffer approach [WIP]

On Sat, Jan 28, 2012 at 7:31 AM, Dan Scales sca...@vmware.com wrote:
 Let me know if you have any thoughts/comments, etc.  The patch is
 enclosed, and the README.doublewrites is updated a fair bit.

ISTM that the double-write cannot prevent torn pages in either the double-write
file or the data file during a *base backup*, because both the double-write file
and the data file can be backed up while being written. Is this right? To avoid
the torn-page problem, should we write FPIs to WAL during an online backup even
if the double-write has been committed?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-05 Thread Dan Scales
Thanks for the detailed followup.  I do see how Postgres is tuned for
having a bunch of memory available that is not in shared_buffers, both
for the OS buffer cache and other memory allocations.  However, Postgres
seems to run fine in many large shared_memory configurations that I
gave performance numbers for, including 5G shared_buffers for an 8G
machine, 3G shared_buffers for a 6G machine, etc.  There just has to be
sufficient extra memory beyond the shared_buffers cache.

I think the pgbench run is pointing out a problem that this double_writes
implementation has with BULK_WRITEs.  As you point out, the
BufferAccessStrategy for BULK_WRITEs will cause lots of dirty evictions.
I'm not sure if there is a great solution that always works for that
issue.  However, I do notice that BULK_WRITE data isn't WAL-logged unless
archiving/replication is happening.  As I understand it, if the
BULK_WRITE data isn't being WAL-logged, then it doesn't have to be
double-written either.  The BULK_WRITE data is not officially synced and
committed until it is all written, so there doesn't have to be any
torn-page protection for that data, which is why the WAL logging can be
omitted.  The double-write implementation can be improved by marking each
buffer if it doesn't need torn-page protection.  These buffers would be
those new pages that are explicitly not WAL-logged, even when
full_page_writes is enabled.  When such a buffer is eventually synced
(perhaps because of an eviction), it would not be double-written.  This
would often avoid double-writes for BULK_WRITE, etc., especially since
the administrator is often not archiving or doing replication when doing
bulk loads.
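
A minimal sketch of that optimization, assuming a per-buffer flag with invented
names (the patch's actual buffer marking may differ): the flag is set when a
change is WAL-logged, and only flagged buffers go through the double-write path
when they are synced.

#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    int  blockno;
    bool wal_logged;    /* set when the change is WAL-logged; not set for
                         * BULK_WRITE loads without archiving/replication */
} buffer_desc;

static void
sync_buffer(const buffer_desc *buf)
{
    if (buf->wal_logged)
        printf("block %d: double-write, then write in place\n", buf->blockno);
    else
        printf("block %d: plain write, no torn-page protection needed\n",
               buf->blockno);
}

int
main(void)
{
    buffer_desc copy_page = { .blockno = 1, .wal_logged = false }; /* bulk load */
    buffer_desc oltp_page = { .blockno = 2, .wal_logged = true  };

    sync_buffer(&copy_page);
    sync_buffer(&oltp_page);
    return 0;
}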

However, overall, I think the idea is that double writes are an optional
optimization.  The user would only turn it on in existing configurations
where it helps or only slightly hurts performance, and where greatly
reducing the size of the WAL logs is beneficial.  It might also be
especially beneficial when there is a small amount of FLASH or other
kind of fast storage that the double-write files can be stored on.

Thanks,

Dan


- Original Message -
From: Robert Haas robertmh...@gmail.com
To: Dan Scales sca...@vmware.com
Cc: PG Hackers pgsql-hackers@postgresql.org
Sent: Friday, February 3, 2012 1:48:54 PM
Subject: Re: [HACKERS] double writes using double-write buffer approach [WIP]

On Fri, Feb 3, 2012 at 3:14 PM, Dan Scales sca...@vmware.com wrote:
 Thanks for the feedback!  I think you make a good point about the small size 
 of dirty data in the OS cache.  I think what you can say about this 
 double-write patch is that it will not work well for configurations that 
 have a small Postgres cache and a large OS cache, since every write from the 
 Postgres cache requires double-writes and an fsync.

The general guidance for setting shared_buffers these days is 25% of
RAM up to a maximum of 8GB, so the configuration that you're
describing as not optimal for this patch is the one normally used when
running PostgreSQL.  I've run across several cases where larger values
of shared_buffers are a huge win, because the entire working set can
then be accommodated in shared_buffers.  But it's certainly not the
case that all working sets fit.

And in this case, I think that's beside the point anyway.  I had
shared_buffers set to 8GB on a machine with much more memory than
that, but the database created by pgbench -i -s 10 is about 156 MB, so
the problem isn't that there is too little PostgreSQL cache available.
 The entire database fits in shared_buffers, with most of it left
over.  However, because of the BufferAccessStrategy stuff, pages start
to get forced out to the OS pretty quickly.  Of course, we could
disable the BufferAccessStrategy stuff when double_writes is in use,
but bear in mind that the reason we have it in the first place is to
prevent cache thrashing effects.  It would be imprudent of us to throw
that out the window without replacing it with something else that
would provide similar protection.  And even if we did, that would just
delay the day of reckoning.  You'd be able to blast through and dirty
the entirety of shared_buffers at top speed, but then as soon as you
started replacing pages performance would slow to an utter crawl, just
as it did here, only you'd need a bigger scale factor to trigger the
problem.

The more general point here is that there are MANY aspects of
PostgreSQL's design that assume that shared_buffers accounts for a
relatively small percentage of system memory.  Here's another one: we
assume that backends that need temporary memory for sorts and hashes
(i.e. work_mem) can just allocate it from the OS.  If we were to start
recommending setting shared_buffers to large percentages of the
available memory, we'd probably have to rethink that.  Most likely,
we'd need some kind of in-core mechanism for allocating temporary
memory from the shared memory segment.  And here's yet another one: we
assume that it is better to recycle old WAL files and overwrite the contents
rather than create new, empty ones, because we assume that the pages from the
old files may still be present in the OS cache.

Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-05 Thread Robert Haas
On Sun, Feb 5, 2012 at 4:17 PM, Dan Scales sca...@vmware.com wrote:
 Thanks for the detailed followup.  I do see how Postgres is tuned for
 having a bunch of memory available that is not in shared_buffers, both
 for the OS buffer cache and other memory allocations.  However, Postgres
 seems to run fine in many large shared_memory configurations that I
 gave performance numbers for, including 5G shared_buffers for an 8G
 machine, 3G shared_buffers for a 6G machine, etc.  There just has to be
 sufficient extra memory beyond the shared_buffers cache.

I agree that you could probably set shared_buffers to 3GB on a 6GB
machine and get decent performance - but would it be the optimal
performance, and for what workload?  To really figure out whether this
patch is a win, you need to get the system optimally tuned for the
unpatched sources (which we can't tell whether you've done, since you
haven't posted the configuration settings or any comparative figures
for different settings, or any details on which commit you tested
against) and then get the system optimally tuned for the patched
sources with double_writes=on, and then see whether there's a gain.

 I think the pgbench run is pointing out a problem that this double_writes
 implementation has with BULK_WRITEs.  As you point out, the
 BufferAccessStrategy for BULK_WRITEs will cause lots of dirty evictions.

Bulk reads will have the same problem.  Consider loading a bunch of
data into a new table with COPY, and then scanning the table.  The
table scan will be a bulk read, and every page will be dirtied by
setting hint bits.  Another thing to worry about is vacuum, which also
uses a BufferAccessStrategy.  Greg Smith has done some previous
benchmarking showing that when the kernel is too aggressive about
flushing dirty data to disk, vacuum becomes painfully slow.  I suspect
this patch is going to have that problem in spades (but it would be
good to test that).  Checkpoints might be a problem, too, since they
flush a lot of dirty data, and that's going to require a lot of extra
fsyncing with this implementation.  It certainly seems that unless you
have pg_xlog and the data separated, and a battery-backed write cache
for each, checkpoints might be really slow.  I'm not entirely
convinced they'll be fast even if you have all that (but it would be
good to test that, too).

 I'm not sure if there is a great solution that always works for that
 issue.  However, I do notice that BULK_WRITE data isn't WAL-logged unless
 archiving/replication is happening.  As I understand it, if the
 BULK_WRITE data isn't being WAL-logged, then it doesn't have to be
 double-written either.  The BULK_WRITE data is not officially synced and
 committed until it is all written, so there doesn't have to be any
 torn-page protection for that data, which is why the WAL logging can be
 omitted.  The double-write implementation can be improved by marking each
 buffer if it doesn't need torn-page protection.  These buffers would be
 those new pages that are explicitly not WAL-logged, even when
 full_page_writes is enabled.  When such a buffer is eventually synced
 (perhaps because of an eviction), it would not be double-written.  This
 would often avoid double-writes for BULK_WRITE, etc., especially since
 the administrator is often not archiving or doing replication when doing
 bulk loads.

I agree - this optimization seems like a must.  I'm not sure that it's
sufficient, but it certainly seems necessary.  It's not going to help
with VACUUM, though, so I think that case needs some careful looking
at to determine how bad the regression is and what can be done to
mitigate it.  In particular, I note that I suggested an idea that
might help in the final paragraph of my last email.

My general feeling about this patch is that it needs a lot more work
before we should consider committing it.  Your tests so far overlook
quite a few important problem cases (bulk loads, SELECT on large
unhinted tables, vacuum speed, checkpoint duration, and others) and
still mostly show it losing to full_page_writes, sometimes by large
margins.  Even in the one case where you got an 8% speedup, it's not
really clear that the same speedup (or an even bigger one) couldn't
have been gotten by some other kind of tuning.  I think you really
need to spend some more time thinking about how to blunt the negative
impact on the cases where it hurts, and increase the benefit in the
cases where it helps.  The approach seems to have potential, but it
seems way too immature to think about shipping it at this point.  (You
may have been thinking along similar lines since I note that the patch
is marked WIP.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-03 Thread Dan Scales
Hi Robert,

Thanks for the feedback!  I think you make a good point about the small size of 
dirty data in the OS cache.  I think what you can say about this double-write 
patch is that it will not work well for configurations that have a small 
Postgres cache and a large OS cache, since every write from the Postgres cache 
requires double-writes and an fsync.  However, it should work much better for 
configurations with a much larger Postgres cache and a relatively smaller OS cache 
(including the configurations that I've given performance results for).  In 
that case, there is a lot more capacity for dirty pages in the Postgres cache, 
and you won't have nearly as many dirty evictions.  The checkpointer is doing a 
good number of the writes, and this patch sorts the checkpointer's buffers so 
its IO is efficient.

Of course, I can also increase the size of the non-checkpointer ring buffer to 
be much larger, though I wouldn't want to make it too large, since it is 
consuming memory.  If I increase the size of the ring buffers significantly, I 
will probably need to add some data structures so that the ring buffer lookups 
in smgrread() and smgrwrite() are more efficient.
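
For illustration, the lookup amounts to something like the standalone sketch
below (invented names, not the patch): a linear scan is adequate for a 128-page
ring, while a much larger ring would want a hash keyed on (relation file,
block number), which is the data structure change being alluded to.

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 8192

typedef struct
{
    int  relfile;               /* which relation file the page belongs to */
    long blockno;               /* which block within that file */
    char page[PAGE_SIZE];       /* pending, not-yet-double-written contents */
} dw_entry;

/* Return 1 and copy out the page if the block is pending in the ring. */
static int
dw_lookup(const dw_entry *ring, int n, int relfile, long blockno, char *out)
{
    for (int i = n - 1; i >= 0; i--)        /* newest entry wins */
        if (ring[i].relfile == relfile && ring[i].blockno == blockno)
        {
            memcpy(out, ring[i].page, PAGE_SIZE);
            return 1;
        }
    return 0;                               /* caller reads from disk instead */
}

int
main(void)
{
    static dw_entry ring[2] = {{ .relfile = 1, .blockno = 7 },
                               { .relfile = 1, .blockno = 9 }};
    char            page[PAGE_SIZE];

    strcpy(ring[1].page, "newest copy of block 9");
    if (dw_lookup(ring, 2, 1, 9, page))
        printf("served from the double-write ring: %s\n", page);
    return 0;
}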

Can you let me know what the shared_buffers and RAM sizes were for your pgbench 
run?  I can try running the same workload.  If the size of shared_buffers is 
especially small compared to RAM, then we should increase the size of 
shared_buffers when using double_writes.

Thanks,

Dan 


- Original Message -
From: Robert Haas robertmh...@gmail.com
To: Dan Scales sca...@vmware.com
Cc: PG Hackers pgsql-hackers@postgresql.org
Sent: Thursday, February 2, 2012 7:19:47 AM
Subject: Re: [HACKERS] double writes using double-write buffer approach [WIP]

On Fri, Jan 27, 2012 at 5:31 PM, Dan Scales sca...@vmware.com wrote:
 I've been prototyping the double-write buffer idea that Heikki and Simon
 had proposed (as an alternative to a previous patch that only batched up
 writes by the checkpointer).  I think it is a good idea, and can help
 double-writes perform better in the case of lots of backend evictions.
 It also centralizes most of the code change in smgr.c.  However, it is
 trickier to reason about.

This doesn't compile on MacOS X, because there's no writev().

I don't understand how you can possibly get away with such small
buffers.  AIUI, you must retained every page in the double-write
buffer until it's been written and fsync'd to disk.  That means the
most dirty data you'll ever be able to have in the operating system
cache with this implementation is (128 + 64) * 8kB = 1.5MB.  Granted,
we currently have occasional problems with the OS caching too *much*
dirty data, but that seems like it's going way, way too far in the
opposite direction.  That's barely enough for the system to do any
write reordering at all.

I am particularly worried about what happens when a ring buffer is in
use.  I tried running pgbench -i -s 10 with this patch applied,
full_page_writes=off, double_writes=on.  It took 41.2 seconds to
complete.  The same test with the stock code takes 14.3 seconds; and
the actual situation is worse for double-writes than those numbers
might imply, because the index build time doesn't seem to be much
affected, while the COPY takes a small eternity with the patch
compared to the usual way of doing things.  I think the slowdown on
COPY once the double-write buffer fills is on the order of 10x.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-03 Thread Robert Haas
On Fri, Feb 3, 2012 at 3:14 PM, Dan Scales sca...@vmware.com wrote:
 Thanks for the feedback!  I think you make a good point about the small size 
 of dirty data in the OS cache.  I think what you can say about this 
 double-write patch is that it will not work well for configurations that 
 have a small Postgres cache and a large OS cache, since every write from the 
 Postgres cache requires double-writes and an fsync.

The general guidance for setting shared_buffers these days is 25% of
RAM up to a maximum of 8GB, so the configuration that you're
describing as not optimal for this patch is the one normally used when
running PostgreSQL.  I've run across several cases where larger values
of shared_buffers are a huge win, because the entire working set can
then be accommodated in shared_buffers.  But it's certainly not the
case that all working sets fit.

And in this case, I think that's beside the point anyway.  I had
shared_buffers set to 8GB on a machine with much more memory than
that, but the database created by pgbench -i -s 10 is about 156 MB, so
the problem isn't that there is too little PostgreSQL cache available.
 The entire database fits in shared_buffers, with most of it left
over.  However, because of the BufferAccessStrategy stuff, pages start
to get forced out to the OS pretty quickly.  Of course, we could
disable the BufferAccessStrategy stuff when double_writes is in use,
but bear in mind that the reason we have it in the first place is to
prevent cache thrashing effects.  It would be imprudent of us to throw
that out the window without replacing it with something else that
would provide similar protection.  And even if we did, that would just
delay the day of reckoning.  You'd be able to blast through and dirty
the entirety of shared_buffers at top speed, but then as soon as you
started replacing pages performance would slow to an utter crawl, just
as it did here, only you'd need a bigger scale factor to trigger the
problem.

The more general point here is that there are MANY aspects of
PostgreSQL's design that assume that shared_buffers accounts for a
relatively small percentage of system memory.  Here's another one: we
assume that backends that need temporary memory for sorts and hashes
(i.e. work_mem) can just allocate it from the OS.  If we were to start
recommending setting shared_buffers to large percentages of the
available memory, we'd probably have to rethink that.  Most likely,
we'd need some kind of in-core mechanism for allocating temporary
memory from the shared memory segment.  And here's yet another one: we
assume that it is better to recycle old WAL files and overwrite the
contents rather than create new, empty ones, because we assume that
the pages from the old files may still be present in the OS cache.  We
also rely on the fact that an evicted CLOG page can be pulled back in
quickly without (in most cases) a disk access.  We also rely on
shared_buffers not being too large to avoid walloping the I/O
controller too hard at checkpoint time - which is forcing some people
to set shared_buffers much smaller than would otherwise be ideal.  In
other words, even if setting shared_buffers to most of the available
system memory would fix the problem I mentioned, it would create a
whole bunch of new ones, many of them non-trivial.  It may be a good
idea to think about what we'd need to do to work efficiently in that
sort of configuration, but there is going to be a very large amount of
thinking, testing, and engineering that has to be done to make it a
reality.

There's another issue here, too.  The idea that we're going to write
data to the double-write buffer only when we decide to evict the pages
strikes me as a bad one.  We ought to proactively start dumping pages
to the double-write area as soon as they're dirtied, and fsync them
after every N pages, so that by the time we need to evict some page
that requires a double-write, it's already durably on disk in the
double-write buffer, and we can do the real write without having to
wait.  It's likely that, to make this perform acceptably for bulk
loads, you'll need the writes to the double-write buffer and the
fsyncs of that buffer to be done by separate processes, so that one
backend (the background writer, perhaps) can continue spooling
additional pages to the double-write files while some other process (a
new auxiliary process?) fsyncs the ones that are already full.  Along
with that, the page replacement algorithm probably needs to be
adjusted to avoid evicting pages that need an as-yet-unfinished
double-write like the plague, even to the extent of allowing the
BufferAccessStrategy rings to grow if the double-writes can't be
finished before the ring wraps around.
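
A standalone sketch of that spooling idea (invented names and file paths; not
an actual implementation): pages are appended to the double-write area as soon
as they are dirtied, and the area is fsync'd every N pages, so that by the time
an eviction happens the double-write copy is already durable and only the
in-place write remains.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE      8192
#define FSYNC_EVERY_N  64            /* batch size between fsyncs */

static int dw_fd;
static int pages_since_fsync;

/* Call whenever a shared buffer is dirtied: spool a copy right away. */
static void
dw_spool_page(const char *page)
{
    if (write(dw_fd, page, PAGE_SIZE) != PAGE_SIZE)
    { perror("write dw area"); exit(1); }

    if (++pages_since_fsync >= FSYNC_EVERY_N)
    {
        if (fsync(dw_fd) != 0) { perror("fsync dw area"); exit(1); }
        pages_since_fsync = 0;   /* pages up to here are now safe to evict */
    }
}

int
main(void)
{
    char page[PAGE_SIZE];

    memset(page, 'x', PAGE_SIZE);
    dw_fd = open("demo_dw_area", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (dw_fd < 0) { perror("open dw area"); return 1; }

    for (int i = 0; i < 256; i++)    /* pretend 256 buffers get dirtied */
        dw_spool_page(page);

    if (fsync(dw_fd) != 0)           /* flush the final partial batch */
        perror("fsync dw area");
    close(dw_fd);
    return 0;
}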

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-02 Thread Robert Haas
On Fri, Jan 27, 2012 at 5:31 PM, Dan Scales sca...@vmware.com wrote:
 I've been prototyping the double-write buffer idea that Heikki and Simon
 had proposed (as an alternative to a previous patch that only batched up
 writes by the checkpointer).  I think it is a good idea, and can help
 double-writes perform better in the case of lots of backend evictions.
 It also centralizes most of the code change in smgr.c.  However, it is
 trickier to reason about.

This doesn't compile on MacOS X, because there's no writev().

I don't understand how you can possibly get away with such small
buffers.  AIUI, you must retained every page in the double-write
buffer until it's been written and fsync'd to disk.  That means the
most dirty data you'll ever be able to have in the operating system
cache with this implementation is (128 + 64) * 8kB = 1.5MB.  Granted,
we currently have occasional problems with the OS caching too *much*
dirty data, but that seems like it's going way, way too far in the
opposite direction.  That's barely enough for the system to do any
write reordering at all.

I am particularly worried about what happens when a ring buffer is in
use.  I tried running pgbench -i -s 10 with this patch applied,
full_page_writes=off, double_writes=on.  It took 41.2 seconds to
complete.  The same test with the stock code takes 14.3 seconds; and
the actual situation is worse for double-writes than those numbers
might imply, because the index build time doesn't seem to be much
affected, while the COPY takes a small eternity with the patch
compared to the usual way of doing things.  I think the slowdown on
COPY once the double-write buffer fills is on the order of 10x.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
