Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching

2002-10-07 Thread Zeugswetter Andreas SB SD


  Keep in mind that we support platforms without O_DSYNC.  I am not
  sure whether there are any that don't have O_SYNC either, but I am
  fairly sure that we measured O_SYNC to be slower than fsync()s on
  some platforms.

This measurement is quite understandable, since the current software
does 8k writes, and the OS only has a chance to write bigger blocks in the
write+fsync case. In the O_SYNC case you need to group bigger blocks yourself.
(Bigger blocks are essential for maximum I/O throughput.)

I am still convinced that writing bigger blocks would allow the fastest
solution. But reading the recent posts, the solution might simply be to change
the current loop (for each dirty 8k WAL buffer, write 8k) into one or two
large write calls.

Andreas

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large

2002-10-07 Thread Antti Haapala


On 6 Oct 2002, Greg Copeland wrote:

 On Sat, 2002-10-05 at 14:46, Curtis Faith wrote:
 
  2) aio_write vs. normal write.
 
  Since as you and others have pointed out aio_write and write are both
  asynchronous, the issue becomes one of whether the copies to the
  file system buffers happen synchronously or not.

 Actually, I believe that write will be *mostly* asynchronous while
 aio_write will always be asynchronous.  In a buffer poor environment, I
 believe write will degrade into a synchronous operation.  In an ideal
 situation, I think they will prove to be on par with one another with a
 slight bias toward aio_write.  In less than ideal situations where
 buffer space is at a premium, I think aio_write will get the leg up.

Browsed web and came across this piece of text regarding a Linux-KAIO
patch by Silicon Graphics...

The asynchronous I/O (AIO) facility implements interfaces defined by the
POSIX standard, although it has not been through formal compliance
certification. This version of AIO is implemented with support from kernel
modifications, and hence will be called KAIO to distinguish it from AIO
facilities available from newer versions of glibc/librt.  Because of the
kernel support, KAIO is able to perform split-phase I/O to maximize
concurrency of I/O at the device. With split-phase I/O, the initiating
request (such as an aio_read) truly queues the I/O at the device as the
first phase of the I/O request; a second phase of the I/O request,
performed as part of the I/O completion, propagates results of the
request.  The results may include the contents of the I/O buffer on a
read, the number of bytes read or written, and any error status.

Preliminary experience with KAIO has shown over 35% improvement in
database performance tests. Unit tests (which only perform I/O) using KAIO
and Raw I/O have been successful in achieving 93% saturation with 12 disks
hung off 2 X 40 MB/s Ultra-Wide SCSI channels. We believe that these
encouraging results are a direct result of implementing a significant
part of KAIO in the kernel using split-phase I/O while avoiding or
minimizing the use of any globally contended locks.

Well...

 In a worst-case scenario, it seems that aio_write does get a win.

 I personally would at least like to see an aio implementation and would
 be willing to even help benchmark it to benchmark/validate any returns
 in performance.  Surely if testing reflected a performance boost it
 would be considered for baseline inclusion?





Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large

2002-10-07 Thread Greg Copeland

On Mon, 2002-10-07 at 10:38, Antti Haapala wrote:
 Browsed web and came across this piece of text regarding a Linux-KAIO
 patch by Silicon Graphics...
 

Ya, I have read this before.  The problem here is that I'm not aware of
which AIO implementation on Linux is the forerunner, nor do I have any
idea how its implementation or performance details differ from those of
other implementations on other platforms.  I know there are at least two
aio efforts underway for Linux.  There could yet be others.  Attempting
to cite specifics that only pertain to Linux, and then only to a
specific implementation which may or may not be in general use, is
questionable.  Because of this I simply left it as saying that I believe
my analysis is pessimistic.

Anyone have any idea if Red Hat's Advanced Server uses KAIO or what?

 
 Preliminary experience with KAIO has shown over 35% improvement in
 database performance tests. Unit tests (which only perform I/O) using KAIO
 and Raw I/O have been successful in achieving 93% saturation with 12 disks
 hung off 2 X 40 MB/s Ultra-Wide SCSI channels. We believe that these
 encouraging results are a direct result of implementing a significant
 part of KAIO in the kernel using split-phase I/O while avoiding or
 minimizing the use of any globally contended locks.

The problem here is, I have no idea what they are comparing to (worst-case
reads/writes, which we know PostgreSQL *mostly* isn't suffering
from).  If we assume that PostgreSQL's read/write operations are
somewhat optimized (as it currently sounds like they are), I'd seriously
doubt we'd see that big of a difference.  On the other hand, I'm hoping
that if an aio postgresql implementation does get done we'll see
something like a 5%-10% performance boost.  Even still, I have nothing
to pin that on other than hope.  If we do see a notable performance
increase for Linux, I have no idea what it will do for other platforms.

Then, there are all of the issues that Tom brought up about
bloat/uglification and maintainability.  So, while I certainly do keep
those remarks in my mind, I think it's best to simply encourage the
effort (or something like it) and help determine where we really sit by
means of empirical evidence.


Greg






Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large

2002-10-07 Thread Neil Conway

Greg Copeland [EMAIL PROTECTED] writes:
 Ya, I have read this before.  The problem here is that I'm not aware of
 which AIO implementation on Linux is the forerunner, nor do I have any
 idea how its implementation or performance details differ from those of
 other implementations on other platforms.

The implementation of AIO in 2.5 is the one by Ben LaHaise (not
SGI). Not sure what the performance is like -- although it's been
merged into 2.5 already, so someone can do some benchmarking. Can
anyone suggest a good test?

Keep in mind that glibc has had a user-space implementation for a
little while (although I'd guess the performance to be unimpressive),
so AIO would not be *that* kernel-version specific.

 Anyone have any idea if Red Hat's Advanced Server uses KAIO or what?

RH AS uses Ben LaHaise's implemention of AIO, I believe.

Cheers,

Neil

-- 
Neil Conway [EMAIL PROTECTED] || PGP Key ID: DB3C29FC





Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large

2002-10-07 Thread Ken Hirsch


I sent this yesterday, but it seems not to have made it to the list... 


I have a couple of comments orthogonal to the present discussion. 

1) It would be fairly easy to write log records over a network to a 
dedicated process on another system.  If the other system has an 
uninterruptible power supply, this is about as safe as writing to disk. 

This would get rid of the need for any fsync on the log at all.  There 
would be extra code needed on restart to get the end of the log from the 
other system, but it doesn't seem like much. 

I think this would be an attractive option to a lot of people.  Most 
people have at least two systems, and the requirements of the logging 
system would be minimal. 


2) It is also possible, with kernel modifications, to have special 
logging partitions where log records are written wherever the disk head 
happens to be. Tzi-cker Chueh and Lan Huang at Stony Brook 
(http://www.cs.sunysb.edu/~lanhuang/research.htm) have written this, 
although I don't think they have released any code. 

(A similar idea called WADS is mentioned in Gray & Reuter's book.) 

If the people at Red Hat are interested in having some added value for 
using PostgreSQL on Red Hat Linux, this would be one idea.  It could 
also be used to speed up ext3 and other journaling file systems. 








Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large

2002-10-06 Thread Hannu Krosing

On Sun, 2002-10-06 at 04:03, Tom Lane wrote:
 Hannu Krosing [EMAIL PROTECTED] writes:
  Or its solution ;) as instead of predicting we just write all data
  in the log that is ready to be written. If we postpone writing, there will
  be hiccups when we suddenly discover that we need to write a whole lot
  of pages (fsync()) after idling the disk for some period.
 
 This part is exactly the same point that I've been proposing to solve
 with a background writer process.  We don't need aio_write for that.
 The background writer can handle pushing completed WAL pages out to
 disk.  The sticky part is trying to gang the writes for multiple 
 transactions whose COMMIT records would fit into the same WAL page,
 and that WAL page isn't full yet.

I just hoped that the kernel could be used as the background writer process
and in the process also solve the multiple-commits-on-the-same-page
problem.

 The rest of what you wrote seems like wishful thinking about how
 aio_write might behave :-(.  I have no faith in it.

Yeah, and the fact that there are several slightly different
implementations of AIO even on Linux alone does not help.

I have to test the SGI KAIO implementation for conformance with my
wishful thinking ;)

Perhaps you could ask around about AIO in RedHat Advanced Server (is it
the same AIO as SGI's, and how does it behave in the multiple-writes-on-the-
same-page case), as you may have better links to RedHat?

--
Hannu






Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large

2002-10-06 Thread Greg Copeland

On Sat, 2002-10-05 at 14:46, Curtis Faith wrote:
 
 2) aio_write vs. normal write.
 
 Since as you and others have pointed out aio_write and write are both
 asynchronous, the issue becomes one of whether the copies to the
 file system buffers happen synchronously or not.

Actually, I believe that write will be *mostly* asynchronous while
aio_write will always be asynchronous.  In a buffer poor environment, I
believe write will degrade into a synchronous operation.  In an ideal
situation, I think they will prove to be on par with one another with a
slight bias toward aio_write.  In less than ideal situations where
buffer space is at a premium, I think aio_write will get the leg up.

 
 The kernel doesn't need to know anything about platter rotation. It
 just needs to keep the disk write buffers full enough not to cause
 a rotational latency.


Which is why in a buffer poor environment, aio_write is generally
preferred as the write is still queued even if the buffer is full.  That
means it will be ready to begin placing writes into the buffer, all
without the process having to wait. On the other hand, when using write,
the process must wait.

In a worst-case scenario, it seems that aio_write does get a win.

I personally would at least like to see an aio implementation and would
be willing to even help benchmark it to benchmark/validate any returns
in performance.  Surely if testing reflected a performance boost it
would be considered for baseline inclusion?

Greg






Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large

2002-10-06 Thread Tom Lane

Greg Copeland [EMAIL PROTECTED] writes:
 I personally would at least like to see an aio implementation and would
 be willing to even help benchmark it to benchmark/validate any returns
 in performance.  Surely if testing reflected a performance boost it
 would be considered for baseline inclusion?

It'd be considered, but whether it'd be accepted would have to depend
on the size of the performance boost, its portability (how many
platforms/scenarios do you actually get a boost for), and the extent of
bloat/uglification of the code.

I can't personally get excited about something that only helps if your
server is starved for RAM --- who runs servers that aren't fat on RAM
anymore?  But give it a shot if you like.  Perhaps your analysis is
pessimistic.

regards, tom lane




Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large

2002-10-06 Thread Greg Copeland

On Sun, 2002-10-06 at 11:46, Tom Lane wrote:
 I can't personally get excited about something that only helps if your
 server is starved for RAM --- who runs servers that aren't fat on RAM
 anymore?  But give it a shot if you like.  Perhaps your analysis is
 pessimistic.

I do suspect my analysis is somewhat pessimistic too but to what degree,
I have no idea.  You make a good case on your memory argument but please
allow me to further kick it around.  I don't find it far fetched to
imagine situations where people may commit large amounts of memory for
the database yet marginally starve available memory for file system
buffers.  Especially so on heavily I/O-bound systems, or where other types
of non-database file activity may sporadically occur.

Now, while I continue to assure myself that it is not far fetched I
honestly have no idea how often this type of situation will typically
occur.  Of course, that opens the door for simply adding more memory
and/or slightly reducing the amount of memory available to the database
(thus making it available elsewhere).  Now, after all that's said and
done, having something like aio in use would seemingly allow it to be
somewhat more self-tuning from a potential performance perspective.

Greg






Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching

2002-10-05 Thread Tom Lane

Curtis Faith [EMAIL PROTECTED] writes:
 Assume Transaction A which writes a lot of buffers and XLog entries,
 so the Commit forces a relatively lengthy fsynch.

 Transactions B - E block not on the kernel lock from fsync but on
 the WALWriteLock. 

You are confusing WALWriteLock with WALInsertLock.  A
transaction-committing flush operation only holds the former.
XLogInsert only needs the latter --- at least as long as it
doesn't need to write.

Thus, given adequate space in the WAL buffers, transactions B-E do not
get blocked by someone else who is writing/syncing in order to commit.

Now, as the code stands at the moment there is no event other than
commit or full-buffers that prompts a write; that means that we are
likely to run into the full-buffer case more often than is good for
performance.  But a background writer task would fix that.

 Back-end servers would not issue fsync calls. They would simply block
 waiting until the LogWriter had written their record to the disk, i.e.
 until the sync'd block # was greater than the block that contained the
 XLOG_XACT_COMMIT record. The LogWriter could wake up committed back-
 ends after its log write returns.

This will pessimize performance except in the case where WAL traffic
is very heavy, because it means you don't commit until the block
containing your commit record is filled.  What if you are the only
active backend?

My view of this is that backends would wait for the background writer
only when they encounter a full-buffer situation, or indirectly when
they are trying to do a commit write and the background guy has the
WALWriteLock.  The latter serialization is unavoidable: in that
scenario, the background guy is writing/flushing an earlier page of
the WAL log, and we *must* have that down to disk before we can declare
our transaction committed.  So any scheme that tries to eliminate the
serialization of WAL writes will fail.  I do not, however, see any
value in forcing all the WAL writes to be done by a single process;
which is essentially what you're saying we should do.  That just adds
extra process-switch overhead that we don't really need.

 The log file would be opened O_DSYNC, O_APPEND every time.

Keep in mind that we support platforms without O_DSYNC.  I am not
sure whether there are any that don't have O_SYNC either, but I am
fairly sure that we measured O_SYNC to be slower than fsync()s on
some platforms.

 The nice part is that the WALWriteLock semantics could be changed to
 allow the LogWriter to write to disk while WALWriteLocks are acquired
 by back-end servers.

As I said, we already have that; you are confusing WALWriteLock
with WALInsertLock.

 Many transactions would commit on the same fsync (now really a write
 with O_DSYNC) and we would get optimal write throughput for the log
 system.

How are you going to avoid pessimizing the few-transactions case?

regards, tom lane




Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching

2002-10-05 Thread Curtis Faith

 You are confusing WALWriteLock with WALInsertLock.  A
 transaction-committing flush operation only holds the former.
 XLogInsert only needs the latter --- at least as long as it
 doesn't need to write.

Well, that makes things better than I thought. We still end up with
a disk write for each transaction though, and I don't see how this
can ever get better than (Disk RPM)/60 transactions per second,
since commit fsyncs are serialized. Every fsync will have to wait
almost a full revolution to reach the end of the log.

As a practical matter, then, everyone will use commit_delay to
improve this.
 
 This will pessimize performance except in the case where WAL traffic
 is very heavy, because it means you don't commit until the block
 containing your commit record is filled.  What if you are the only
 active backend?

We could handle this using a mechanism analogous to the current
commit delay. If there are more than commit_siblings other processes
running then do the write automatically after commit_delay seconds.

This would make things no more pessimistic than the current
implementation but provide the additional benefit of allowing the
LogWriter to write in optimal sizes if there are many transactions.

The commit_delay method won't be as good in many cases. Consider
an update scenario where a larger commit delay gives better throughput.
A given transaction will flush after the commit_delay interval elapses. The
delay is very unlikely to result in a scenario where the dirty log 
buffers are the optimal size.

As a practical matter I think this would tend to make the writes
larger than they would otherwise have been and this would
unnecessarily delay the commit on the transaction.

 I do not, however, see any
 value in forcing all the WAL writes to be done by a single process;
 which is essentially what you're saying we should do.  That just adds
 extra process-switch overhead that we don't really need.

I don't think that an fsync will ever NOT cause the process to get
switched out, so I don't see how another process doing the write would
result in more overhead. The fsync'ing process will block on the
fsync, so there will always be at least one process switch (probably
many) while waiting for the fsync to complete, since we are talking
many milliseconds for the fsync in every case.

  The log file would be opened O_DSYNC, O_APPEND every time.
 
 Keep in mind that we support platforms without O_DSYNC.  I am not
 sure whether there are any that don't have O_SYNC either, but I am
 fairly sure that we measured O_SYNC to be slower than fsync()s on
 some platforms.

Well there is no reason that the logwriter couldn't be doing fsyncs
instead of O_DSYNC writes in those cases. I'd leave this switchable
using the current flags. Just change the semantics a bit.

- Curtis




Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance

2002-10-05 Thread Curtis Faith

 In particular, it would seriously degrade performance if the WAL file
 isn't on its own spindle but has to share bandwidth with
 data file access.

If the OS is stupid I could see this happening. But if there are
buffers and some sort of elevator algorithm the I/O won't happen
at bad times.

I agree with you though that writing for every single insert probably
does not make sense. There should be some blocking of writes. The
optimal size would have to be derived empirically.

 What we really want, of course, is write on every revolution where
 there's something worth writing --- either we've filled a WAL block
 or there is a commit pending.  But that just gets us back into the
 same swamp of how-do-you-guess-whether-more-commits-will-arrive-soon.
 I don't see how an extra process makes that problem any easier.

The whole point of the extra process handling all the writes is so
that it can write on every revolution, if there is something to
write. It doesn't need to care if more commits will arrive soon.

 BTW, it would seem to me that aio_write() buys nothing over plain write()
 in terms of ability to gang writes.  If we issue the write at time T
 and it completes at T+X, we really know nothing about exactly when in
 that interval the data was read out of our WAL buffers.  We cannot
 assume that commit records that were stored into the WAL buffer during
 that interval got written to disk.

Why would we need to make that assumption? The only thing we'd need to
know is that a given write succeeded meaning that commits before that
write are done.

The advantage to aio_write in this scenario is when writes cross track
boundaries or when the head is in the wrong spot. If we write
in reasonable blocks with aio_write the write might get to the disk
before the head passes the location for the write.

Consider a scenario where:

Head is at file offset 10,000.

Log contains blocks 12,000 - 12,500

..time passes..

Head is now at 12,050

Commit occurs writing block 12,501

In the aio_write case the write would already have been done for blocks  
12,000 to 12,050 and would be queued up for some additional blocks up to
potentially 12,500. So the write for the commit could occur without an
additional rotation delay. We are talking 85 to 200 milliseconds
delay for this rotation on a single disk. I don't know how often this
happens in actual practice but it might occur as often as every other
time.

- Curtis




Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance

2002-10-05 Thread Bruce Momjian

Curtis Faith wrote:
 Back-end servers would not issue fsync calls. They would simply block
 waiting until the LogWriter had written their record to the disk, i.e.
 until the sync'd block # was greater than the block that contained the
 XLOG_XACT_COMMIT record. The LogWriter could wake up committed back-
 ends after its log write returns.
 
 The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter
 would issue writes of the optimal size when enough data was present or
 of smaller chunks if enough time had elapsed since the last write.

So every backend is going to wait around until its fsync gets done by
the backend process?  How is that a win?  This is just another version
of our GUC parameters:

#commit_delay = 0   # range 0-10, in microseconds
#commit_siblings = 5# range 1-1000

which attempt to delay fsync if other backends are nearing commit.  
Pushing things out to another process isn't a win;  figuring out if
someone else is coming for commit is.  Remember, write() is fast, fsync
is slow.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073




Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance

2002-10-05 Thread Bruce Momjian

pgman wrote:
 Curtis Faith wrote:
  Back-end servers would not issue fsync calls. They would simply block
  waiting until the LogWriter had written their record to the disk, i.e.
  until the sync'd block # was greater than the block that contained the
  XLOG_XACT_COMMIT record. The LogWriter could wake up committed back-
  ends after its log write returns.
  
  The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter
  would issue writes of the optimal size when enough data was present or
  of smaller chunks if enough time had elapsed since the last write.
 
 So every backend is going to wait around until its fsync gets done by
 the backend process?  How is that a win?  This is just another version
 of our GUC parameters:
   
   #commit_delay = 0   # range 0-10, in microseconds
   #commit_siblings = 5# range 1-1000
 
 which attempt to delay fsync if other backends are nearing commit.  
 Pushing things out to another process isn't a win;  figuring out if
 someone else is coming for commit is.  Remember, write() is fast, fsync
 is slow.

Let me add to what I just said:

While the above idea doesn't win for normal operation, because each
backend waits for the fsync and we have no good way of determining if
other backends are nearing commit, a background WAL fsync process would
be nice if we wanted an option between fsync on (wait for fsync before
reporting commit) and fsync off (no crash recovery).

We could have a mode where we did an fsync every X milliseconds, so we
issue a COMMIT to the client, but wait a few milliseconds before
fsync'ing.  Many other databases have such a mode, but we don't, and I
always felt it would be valuable.  It may allow us to remove the fsync
option in favor of one that has _some_ crash recovery.
 
-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073




Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large

2002-10-05 Thread Hannu Krosing

Bruce Momjian kirjutas L, 05.10.2002 kell 13:49:
 Curtis Faith wrote:
  Back-end servers would not issue fsync calls. They would simply block
  waiting until the LogWriter had written their record to the disk, i.e.
  until the sync'd block # was greater than the block that contained the
  XLOG_XACT_COMMIT record. The LogWriter could wake up committed back-
  ends after its log write returns.
  
  The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter
  would issue writes of the optimal size when enough data was present or
  of smaller chunks if enough time had elapsed since the last write.
 
 So every backend is going to wait around until its fsync gets done by
 the backend process?  How is that a win?  This is just another version
 of our GUC parameters:
   
   #commit_delay = 0   # range 0-10, in microseconds
   #commit_siblings = 5# range 1-1000
 
 which attempt to delay fsync if other backends are nearing commit.  
 Pushing things out to another process isn't a win;  figuring out if
 someone else is coming for commit is. 

Exactly. If I understand correctly what Curtis is proposing, you don't
have to figure it out under his scheme - you just issue a WALWait
command and the WAL writing process notifies you when your transaction's
WAL is in safe storage. 

If the other committer was able to get his WALWait in before the actual
write took place, it will be notified too; if not, it will be notified
about 1/166th sec. later (for a 10K rpm disk) when its write is done on
the next rev of the disk platters.

The writer process should just issue a continuous stream of
aio_write()'s while there are any waiters and keep track of which waiters
are safe to continue - thus no guessing of who's gonna commit.

If supported by platform this should use zero-copy writes - it should be
safe because WAL is append-only.

---
Hannu





Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching

2002-10-05 Thread Curtis Faith

Bruce Momjian wrote:
 So every backend is going to wait around until its fsync gets done by
 the backend process?  How is that a win?  This is just another version
 of our GUC parameters:
   
   #commit_delay = 0   # range 0-10, in microseconds
   #commit_siblings = 5# range 1-1000
 
 which attempt to delay fsync if other backends are nearing commit.  
 Pushing things out to another process isn't a win;  figuring out if
 someone else is coming for commit is.  

It's not the same at all. My proposal makes two extremely important changes
from a performance perspective.

1) WALWriteLocks are never held by processes for lengthy periods - only
for long enough to copy the log entry into the buffer. This means real
work can be done by other processes while a transaction is waiting for
its commit to finish. I'm sure that blocking on XLogInsert because another
transaction is performing an fsync is extremely common with frequent-update
scenarios.

2) The log is written using optimal write sizes which is much better than
a user-defined guess of the microseconds to delay the fsync. We should be
able to get the bottleneck to be the maximum write throughput of the disk
with the modifications to Tom Lane's scheme I proposed.

 Remember, write() is fast, fsync is slow.

Okay, it's clear I missed the point about Unix write earlier :-)

However, it's not just saving fsyncs that we need to worry about. It's the
unnecessary blocking of other processes that are simply trying to
append some log records in the course of whatever updating or inserting
they are doing. They may be a long way from commit.

fsync being slow is the whole reason for not wanting to have exclusive
locks held for the duration of an fsync.

On an SMP machine this change alone would probably speed things up by
an order of magnitude (assuming there aren't any other similar locks
causing the same problem).

- Curtis




Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching

2002-10-05 Thread Doug McNaught

Tom Lane [EMAIL PROTECTED] writes:

 Curtis Faith [EMAIL PROTECTED] writes:

  The log file would be opened O_DSYNC, O_APPEND every time.
 
 Keep in mind that we support platforms without O_DSYNC.  I am not
 sure whether there are any that don't have O_SYNC either, but I am
 fairly sure that we measured O_SYNC to be slower than fsync()s on
 some platforms.

And don't we preallocate WAL files anyway?  So O_APPEND would be
irrelevant?

-Doug




Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance

2002-10-05 Thread Tom Lane

Hannu Krosing [EMAIL PROTECTED] writes:
 The writer process should just issue a continuous stream of
 aio_write()'s while there are any waiters and keep track which waiters
 are safe to continue - thus no guessing of who's gonna commit.

This recipe sounds like "eat I/O bandwidth whether we need it or not".
It might be optimal in the case where activity is so heavy that we
do actually need a WAL write on every disk revolution, but in any
scenario where we're not maxing out the WAL disk's bandwidth, it will
hurt performance.  In particular, it would seriously degrade performance
if the WAL file isn't on its own spindle but has to share bandwidth with
data file access.

What we really want, of course, is "write on every revolution where
there's something worth writing" --- either we've filled a WAL block
or there is a commit pending.  But that just gets us back into the
same swamp of how-do-you-guess-whether-more-commits-will-arrive-soon.
I don't see how an extra process makes that problem any easier.

BTW, it would seem to me that aio_write() buys nothing over plain write()
in terms of ability to gang writes.  If we issue the write at time T
and it completes at T+X, we really know nothing about exactly when in
that interval the data was read out of our WAL buffers.  We cannot
assume that commit records that were stored into the WAL buffer during
that interval got written to disk.  The only safe assumption is that
only records that were in the buffer at time T are down to disk; and
that means that late arrivals lose.  You can't issue aio_write
immediately after the previous one completes and expect that this
optimizes performance --- you have to delay it as long as you possibly
can in hopes that more commit records arrive.  So it comes down to being
the same problem.

regards, tom lane




Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance

2002-10-05 Thread Bruce Momjian

Curtis Faith wrote:
 The advantage to aio_write in this scenario is when writes cross track
 boundaries or when the head is in the wrong spot. If we write
 in reasonable blocks with aio_write the write might get to the disk
 before the head passes the location for the write.
 
 Consider a scenario where:
 
 Head is at file offset 10,000.
 
 Log contains blocks 12,000 - 12,500
 
 ..time passes..
 
 Head is now at 12,050
 
 Commit occurs writing block 12,501
 
 In the aio_write case the write would already have been done for blocks  
 12,000 to 12,050 and would be queued up for some additional blocks up to
 potentially 12,500. So the write for the commit could occur without an
 additional rotation delay. We are talking roughly 8.5 to 20 milliseconds
 delay for this rotation on a single disk. I don't know how often this
 happens in actual practice but it might occur as often as every other
 time.
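
For reference, the worst-case penalty being weighed here is one full
platter rotation, which is just 60,000 ms divided by the spindle speed
(about 8.3 ms at 7200 RPM, 4 ms at 15K RPM). As a trivial sketch:

```c
/*
 * Worst-case extra latency if the drive's buffer runs dry and the head
 * must wait one full platter rotation before reaching the end of the
 * log again: one rotation = 60,000 ms / RPM.
 */
double
full_rotation_ms(double rpm)
{
	return 60000.0 / rpm;
}
```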

So, you are saying that we may get back aio confirmation quicker than if
we issued our own write/fsync because the OS was able to slip our flush
to disk in as part of someone else's or a general fsync?

I don't buy that because it is possible our write() gets in as part of
someone else's fsync and our fsync becomes a no-op, meaning there aren't
any dirty buffers for that file.  Isn't that also possible?

Also, remember the kernel doesn't know where the platter rotation is
either. Only the SCSI drive can reorder the requests to match this. The
OS can group based on head location, but it doesn't know much about the
platter location, and it doesn't even know where the head is.

Also, does aio return info when the data is in the kernel buffers or
when it is actually on the disk?   

Simply, aio allows us to do the write and get notification when it is
complete.  I don't see how that helps us, and I don't see any other
advantages to aio.  To use aio, we need to find something that _can't_
be solved with more traditional Unix API's, and I haven't seen that yet.

This aio thing is getting out of hand.  It's like we have a hammer, and
everything looks like a nail, or a use for aio.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073




Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance

2002-10-05 Thread Curtis Faith

 So, you are saying that we may get back aio confirmation quicker than if
 we issued our own write/fsync because the OS was able to slip our flush
 to disk in as part of someone else's or a general fsync?
 
 I don't buy that because it is possible our write() gets in as part of
 someone else's fsync and our fsync becomes a no-op, meaning there aren't
 any dirty buffers for that file.  Isn't that also possible?

Separate out the two concepts:

1) Writing of incomplete transactions at the block level by a
background LogWriter. 

I think it doesn't matter whether the write is aio_write or
write, writing blocks when we get them should provide the benefit
I outlined.

Waiting till fsync could miss the opportunity to write before the 
head passes the end of the last durable write because the drive
buffers might empty causing up to a full rotation's delay.

2) aio_write vs. normal write.

Since as you and others have pointed out aio_write and write are both
asynchronous, the issue becomes one of whether or not the copies to the
file system buffers happen synchronously or not.

This is not a big difference but it seems to me that the OS might be
able to avoid some context switches by grouping copying in the case
of aio_write. I've heard anecdotal reports that this is significantly
faster for some things but I don't know for certain.

 
 Also, remember the kernel doesn't know where the platter rotation is
 either. Only the SCSI drive can reorder the requests to match this. The
 OS can group based on head location, but it doesn't know much about the
 platter location, and it doesn't even know where the head is.

The kernel doesn't need to know anything about platter rotation. It
just needs to keep the disk write buffers full enough not to cause
a rotational latency.

It's not so much a matter of reordering as it is of getting the data
into the SCSI drive before the head passes the last write's position.
If the SCSI drive's buffers are kept full it can continue writing at
its full throughput. If the writes stop and the buffers empty
it will need to wait up to a full rotation before it gets to the end 
of the log again

 Also, does aio return info when the data is in the kernel buffers or
 when it is actually on the disk?   
 
 Simply, aio allows us to do the write and get notification when it is
 complete.  I don't see how that helps us, and I don't see any other
 advantages to aio.  To use aio, we need to find something that _can't_
 be solved with more traditional Unix API's, and I haven't seen that yet.
 
 This aio thing is getting out of hand.  It's like we have a hammer, and
 everything looks like a nail, or a use for aio.

Yes, while I think it's probably worth doing and faster, it won't help as
much as just keeping the drive buffers full even if that's by using write
calls.

I still don't understand the opposition to aio_write. Could we just have
the configuration setup determine whether one or the other is used? I 
don't see why we wouldn't use the faster calls if they were present and
reliable on a given system.

- Curtis

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance

2002-10-05 Thread Bruce Momjian

Curtis Faith wrote:
  So, you are saying that we may get back aio confirmation quicker than if
  we issued our own write/fsync because the OS was able to slip our flush
  to disk in as part of someone else's or a general fsync?
  
  I don't buy that because it is possible our write() gets in as part of
  someone else's fsync and our fsync becomes a no-op, meaning there aren't
  any dirty buffers for that file.  Isn't that also possible?
 
 Separate out the two concepts:
 
 1) Writing of incomplete transactions at the block level by a
 background LogWriter. 
 
 I think it doesn't matter whether the write is aio_write or
 write, writing blocks when we get them should provide the benefit
 I outlined.
 
 Waiting till fsync could miss the opportunity to write before the 
 head passes the end of the last durable write because the drive
 buffers might empty causing up to a full rotation's delay.

No question about that!  The sooner we can get stuff to the WAL buffers,
the more likely we will get some other transaction to do our fsync work.
Any ideas on how we can do that?

 2) aio_write vs. normal write.
 
 Since as you and others have pointed out aio_write and write are both
 asynchronous, the issue becomes one of whether or not the copies to the
 file system buffers happen synchronously or not.
 
 This is not a big difference but it seems to me that the OS might be
 able to avoid some context switches by grouping copying in the case
 of aio_write. I've heard anecdotal reports that this is significantly
 faster for some things but I don't know for certain.

I suppose it is possible, but because we spend so much time in fsync, we
want to focus on that.  People have recommended mmap of the WAL file,
and that seems like a much more direct way to handle it rather than aio.
However, we can't control when the stuff gets sent to disk with mmap'ed
WAL; that is, we can't write to the mapping while withholding the writes
to the disk file, so we would need some intermediate step, and then
again, extra steps slow things down too.


  This aio thing is getting out of hand.  It's like we have a hammer, and
  everything looks like a nail, or a use for aio.
 
 Yes, while I think it's probably worth doing and faster, it won't help as
 much as just keeping the drive buffers full even if that's by using write
 calls.

 I still don't understand the opposition to aio_write. Could we just have
 the configuration setup determine whether one or the other is used? I 
 don't see why we wouldn't use the faster calls if they were present and
 reliable on a given system.

We hesitate to add code relying on new features unless it is a
significant win, and in the aio case, we would have different WAL disk
write models with and without aio, so there would clearly be two code
paths, and with two code paths, we can't as easily improve or optimize. 
If we get a 2% boost out of some feature, but it later discourages us
from adding a 5% optimization, it is a loss.  And, in most cases, the 2%
optimization is for a few platforms, while the 5% optimization is for
all.  This code is 15+ years old, so we are looking way down the road,
not just for today's hot feature.

For example, Tom just improved DISTINCT by 25% by optimizing some of the
sorting and function call handling.  If we had more complex threaded
sort code, that may not have been possible, or it may have been possible
for him to optimize only one of the code paths.

I can't tell you how many aio/mmap/fancy feature discussions we have
had, and we obviously discuss them, but in the end, they end up being of
questionable value for the risk/complexity;  but, we keep talking,
hoping we are wrong or some good ideas come out of it.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073




Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large

2002-10-05 Thread Hannu Krosing

On Sat, 2002-10-05 at 20:32, Tom Lane wrote:
 Hannu Krosing [EMAIL PROTECTED] writes:
  The writer process should just issue a continuous stream of
  aio_write()'s while there are any waiters and keep track which waiters
  are safe to continue - thus no guessing of who's gonna commit.
 
 This recipe sounds like "eat I/O bandwidth whether we need it or not".
 It might be optimal in the case where activity is so heavy that we
 do actually need a WAL write on every disk revolution, but in any
 scenario where we're not maxing out the WAL disk's bandwidth, it will
 hurt performance.  In particular, it would seriously degrade performance
 if the WAL file isn't on its own spindle but has to share bandwidth with
 data file access.
 
 What we really want, of course, is "write on every revolution where
 there's something worth writing" --- either we've filled a WAL block
 or there is a commit pending. 

That's what I meant by "while there are any waiters".

 But that just gets us back into the
 same swamp of how-do-you-guess-whether-more-commits-will-arrive-soon.
 I don't see how an extra process makes that problem any easier.

I still think that we could get gang writes automatically, if we just
ask for aio_write at completion of each WAL file page and keep track of
those that are written. We could also keep track of write position
inside the WAL page for

1. end of last write() of each process

2. WAL files write position at each aio_write()

Then we can safely(?) assume that each backend wants only its own
write()'s to be on disk before it can assume the trx has committed. If
the fsync()-like request comes in at a time when the aio_write for that
process's last position has completed, we can let that process continue
without even a context switch.
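
The bookkeeping described here could be as small as one shared flush
pointer: a backend's commit is durable as soon as the flush pointer has
passed the end of that backend's own last write. A hypothetical sketch
(all names invented):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogPos;		/* byte offset in the WAL stream */

/*
 * Hypothetical sketch: the LogWriter advances 'flushed' each time an
 * aio_write (or write+fsync) completes; each backend remembers the end
 * position of its own last WAL write.
 */
typedef struct WalFlushState
{
	XLogPos		flushed;		/* WAL is durably on disk up to here */
} WalFlushState;

/*
 * True if this backend's commit record is already on disk, so it can
 * proceed without waiting -- or even a context switch.
 */
bool
commit_is_durable(const WalFlushState *st, XLogPos backend_write_end)
{
	return st->flushed >= backend_write_end;
}
```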

In the above scenario I assume that kernel can do the right thing by
doing multiple aio_write requests for the same page in one sweep and not
doing one physical write for each aio_write.

 BTW, it would seem to me that aio_write() buys nothing over plain write()
 in terms of ability to gang writes.  If we issue the write at time T
 and it completes at T+X, we really know nothing about exactly when in
 that interval the data was read out of our WAL buffers. 

Yes, most likely. If we do several writes of the same pages, they will
hit the physical disk in the same physical write.

 We cannot
 assume that commit records that were stored into the WAL buffer during
 that interval got written to disk.  The only safe assumption is that
 only records that were in the buffer at time T are down to disk; and
 that means that late arrivals lose. 

I assume that if each commit record issues an aio_write, then all of
those which actually reached the disk will be notified. 

IOW the first aio_write orders the write, but all the latecomers which
arrive before actual write will also get written and notified.

 You can't issue aio_write
 immediately after the previous one completes and expect that this
 optimizes performance --- you have to delay it as long as you possibly
 can in hopes that more commit records arrive. 

I guess we have quite different cases for different hardware
configurations - if we have a separate disk subsystem for WAL, we may
want to keep the log flowing to disk as fast as it is ready, including
the writing of last, partial page as often as new writes to it are done
- as we possibly can't write more than ~ 250 times/sec (with 15K drives,
no battery RAM) we will always have at least two context switches
between writes (for a 500Hz context switch clock), and many more if
processes background themselves while waiting for small transactions to
commit.

 So it comes down to being the same problem.

Or its solution ;) as instead of predicting we just write all data
in the log that is ready to be written. If we postpone writing, there
will be hiccups when we suddenly discover that we need to write a whole
lot of pages (fsync()) after idling the disk for some period.

---
Hannu




Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance

2002-10-05 Thread Curtis Faith

 No question about that!  The sooner we can get stuff to the WAL buffers,
 the more likely we will get some other transaction to do our fsync work.
 Any ideas on how we can do that?

More like the sooner we get stuff out of the WAL buffers and into the
disk's buffers, whether by write or aio_write.

It doesn't do any good to have information in the XLog unless it
gets written to the disk buffers before they empty.

 We hesitate to add code relying on new features unless it is a
 significant win, and in the aio case, we would have different WAL disk
 write models for with/without aio, so it clearly could be two code
 paths, and with two code paths, we can't as easily improve or optimize. 
 If we get 2% boost out of some feature,  but it later discourages us
 from adding a 5% optimization, it is a loss.  And, in most cases, the 2%
 optimization is for a few platform, while the 5% optimization is for
 all.  This code is +15 years old, so we are looking way down the road,
 not just for today's hot feature.

I'll just have to implement it and see if it's as easy and isolated as I
think it might be and would allow the same algorithm for aio_write or
write.

 I can't tell you how many aio/mmap/fancy feature discussions we have
 had, and we obviously discuss them, but in the end, they end up being of
 questionable value for the risk/complexity;  but, we keep talking,
 hoping we are wrong or some good ideas come out of it.

I'm all in favor of keeping clean designs. I'm very pleased with how
easy PostgreSQL is to read and understand given how much it does.

- Curtis




Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance

2002-10-05 Thread Bruce Momjian

Curtis Faith wrote:
  No question about that!  The sooner we can get stuff to the WAL buffers,
  the more likely we will get some other transaction to do our fsync work.
  Any ideas on how we can do that?
 
 More like the sooner we get stuff out of the WAL buffers and into the
 disk's buffers whether by write or aio_write.

Does aio_write just write, or write _and_ fsync()?

 It doesn't do any good to have information in the XLog unless it
 gets written to the disk buffers before they empty.

Just for clarification, we have two issues in this thread:

WAL memory buffers fill up, forcing WAL write
multiple commits at the same time force too many fsync's

I just wanted to throw that out.

  I can't tell you how many aio/mmap/fancy feature discussions we have
  had, and we obviously discuss them, but in the end, they end up being of
  questionable value for the risk/complexity;  but, we keep talking,
  hoping we are wrong or some good ideas come out of it.
 
 I'm all in favor of keeping clean designs. I'm very pleased with how
 easy PostreSQL is to read and understand given how much it does.

Glad you see the situation we are in.  ;-)

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073




Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance

2002-10-05 Thread Tom Lane

Hannu Krosing [EMAIL PROTECTED] writes:
 Or its solution ;) as instead of predicting we just write all data
 in the log that is ready to be written. If we postpone writing, there
 will be hiccups when we suddenly discover that we need to write a whole
 lot of pages (fsync()) after idling the disk for some period.

This part is exactly the same point that I've been proposing to solve
with a background writer process.  We don't need aio_write for that.
The background writer can handle pushing completed WAL pages out to
disk.  The sticky part is trying to gang the writes for multiple 
transactions whose COMMIT records would fit into the same WAL page,
and that WAL page isn't full yet.

The rest of what you wrote seems like wishful thinking about how
aio_write might behave :-(.  I have no faith in it.

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])



[HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching

2002-10-04 Thread Curtis Faith

It appears the fsync problem is pervasive. Here's Linux 2.4.19's
version from fs/buffer.c:

lock -->    down(&inode->i_sem);
            ret = filemap_fdatasync(inode->i_mapping);
            err = file->f_op->fsync(file, dentry, 1);
            if (err && !ret)
                    ret = err;
            err = filemap_fdatawait(inode->i_mapping);
            if (err && !ret)
                    ret = err;
unlock -->  up(&inode->i_sem);

But this is probably not a big factor as you outline below because
the WALWriteLock is causing the same kind of contention.

tom lane wrote:
 This is kind of ugly in general terms but I'm not sure that it really
 hurts Postgres.  In our present scheme, the only files we ever fsync()
 are WAL log files, not data files.  And in normal operation there is
 only one WAL writer at a time, and *no* WAL readers.  So an exclusive
 kernel-level lock on a WAL file while we fsync really shouldn't create
 any problem for us.  (Unless this indirectly blocks other operations
 that I'm missing?)

I hope you're right but I see some very similar contention problems in
the case of many small transactions because of the WALWriteLock.

Assume Transaction A which writes a lot of buffers and XLog entries,
so the Commit forces a relatively lengthy fsync.

Transactions B - E block not on the kernel lock from fsync but on
the WALWriteLock. 

When A finishes the fsync and subsequently releases the WALWriteLock,
B unblocks and gets the WALWriteLock for its fsync for the flush.

C blocks on the WALWriteLock waiting to write its XLOG_XACT_COMMIT.

B Releases and now C writes its XLOG_XACT_COMMIT.

There now seems to be a lot of contention on the WALWriteLock. This
is a shame for a system that has no locking at the logical level and
therefore seems like it could be very, very fast and offer
incredible concurrency.

 As I commented before, I think we could do with an extra process to
 issue WAL writes in places where they're not in the critical path for
 a foreground process.  But that seems to be orthogonal from this issue.
 
It's only orthogonal to the fsync-specific contention issue. We now
have to worry about the WALWriteLock semantics causing the same contention.
Your idea of a separate LogWriter process could very nicely solve this
problem and accomplish a few other things at the same time if we make
a few enhancements.

Back-end servers would not issue fsync calls. They would simply block
waiting until the LogWriter had written their record to the disk, i.e.
until the sync'd block # was greater than the block that contained the
XLOG_XACT_COMMIT record. The LogWriter could wake up committed back-
ends after its log write returns.

The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter
would issue writes of the optimal size when enough data was present or
of smaller chunks if enough time had elapsed since the last write.

The nice part is that the WALWriteLock semantics could be changed to
allow the LogWriter to write to disk while WALWriteLocks are acquired
by back-end servers. WALWriteLocks would only be held for the brief time
needed to copy the entries into the log buffer. The LogWriter would
only need to grab a lock to determine the current end of the log
buffer. Since it would be writing blocks that occur earlier in the
cache than the XLogInsert log writers it won't need to grab a
WALWriteLock before writing the cache buffers.

Many transactions would commit on the same fsync (now really a write
with O_DSYNC) and we would get optimal write throughput for the log
system.
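
Under the assumptions above, the opening step of the proposed LogWriter
could be sketched as follows. This is illustrative only: O_DSYNC
availability varies by platform, as noted elsewhere in the thread (the
fallback macro below is a made-up stand-in for real configure logic),
and the buffer-scanning loop is omitted:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef O_DSYNC
#define O_DSYNC O_SYNC			/* fallback where only O_SYNC exists */
#endif

/*
 * Hypothetical sketch: open the current WAL segment so that every
 * write() returns only once the data blocks are durable, removing the
 * separate fsync() from the commit path.
 */
int
logwriter_open(const char *path)
{
	return open(path, O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0600);
}
```

Each write() the LogWriter issues on this descriptor then plays the
role of the old write+fsync pair, and many backends' commit records can
ride out on the same call.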

This would handle all the issues I had and it doesn't sound like a
huge change. In fact, it ends up being almost semantically identical 
to the aio_write suggestion I made originally, except the
LogWriter is doing the background writing instead of the OS and we
don't have to worry about aio implementations and portability.

- Curtis


