Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching

2002-10-07 Thread Zeugswetter Andreas SB SD


> > Keep in mind that we support platforms without O_DSYNC.  I am not
> > sure whether there are any that don't have O_SYNC either, but I am
> > fairly sure that we measured O_SYNC to be slower than fsync()s on
> > some platforms.

This measurement is quite understandable, since the current software
does 8k writes, and the OS only has a chance to write bigger blocks in the
write+fsync case. In the O_SYNC case you need to group bigger blocks
yourself. (Bigger blocks are essential for maximum I/O throughput.)

I am still convinced that writing bigger blocks would allow the fastest
solution. But reading the recent posts, the solution might simply be to
change the current "loop: foreach dirty 8k WAL buffer, write 8k" into one
or two large write calls, as sketched below.
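
A rough sketch of that change (every name here is hypothetical; the real
code would work against the XLOG buffer cache):

/* Instead of one write() per dirty 8k page, cover the whole
 * contiguous dirty span with a single large write. */
char   *start = wal_buffers + first_dirty_page * 8192;
size_t  len   = n_contiguous_dirty * 8192;

if (write(wal_fd, start, len) != (ssize_t) len)
    elog(STOP, "write of WAL buffers failed");  /* same handling as the 8k loop */

When the dirty range wraps around the end of the buffer ring, a second
write (or a single writev() with two iovecs) covers the tail, which is
why one or two calls always suffice.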

Andreas




Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching

2002-10-05 Thread Curtis Faith

> You are confusing WALWriteLock with WALInsertLock.  A
> transaction-committing flush operation only holds the former.
> XLogInsert only needs the latter --- at least as long as it
> doesn't need to write.

Well, that makes things better than I thought. We still end up with
a disk write for each transaction though, and I don't see how this
can ever get better than (disk RPM) / 60 transactions per second,
since commit fsyncs are serialized. Every fsync will have to wait
almost a full revolution to reach the end of the log.
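
To put numbers on that: a 7,200 RPM disk completes 7200/60 = 120
revolutions per second, so serialized commit fsyncs cap throughput
near 120 transactions per second; even a 15,000 RPM drive tops out
around 250.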

As a practical matter, then, everyone will use commit_delay to
improve this.
 
> This will pessimize performance except in the case where WAL traffic
> is very heavy, because it means you don't commit until the block
> containing your commit record is filled.  What if you are the only
> active backend?

We could handle this using a mechanism analogous to the current
commit delay. If there are more than commit_siblings other processes
running, then do the write automatically after the commit_delay
interval.

This would make things no more pessimistic than the current
implementation but provide the additional benefit of allowing the
LogWriter to write in optimal sizes if there are many transactions.
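
In rough pseudocode (reusing the existing GUC names; the helpers and
the flag are made up):

/* Decide whether the LogWriter should flush now or keep batching;
 * this mirrors the current commit_delay heuristic. */
if (CountActiveBackends() < commit_siblings)
    flush_now = true;   /* nobody to batch with: flush immediately */
else if (time_since_oldest_pending_commit() >= commit_delay)
    flush_now = true;   /* batching window expired: flush what we have */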

The commit_delay method won't be as good in many cases. Consider an
update scenario where a larger commit delay gives better throughput.
A given transaction will flush after the commit_delay interval, but
that delay is very unlikely to leave the dirty log buffers at the
optimal size for the write.

As a practical matter, I think this would tend to make the writes
larger than they would otherwise have been, while unnecessarily
delaying the commit of the transaction.

> I do not, however, see any
> value in forcing all the WAL writes to be done by a single process;
> which is essentially what you're saying we should do.  That just adds
> extra process-switch overhead that we don't really need.

I don't think an fsync will ever NOT cause the process to get
switched out, so I don't see how another process doing the write
would result in more overhead. The fsync'ing process will block on
the fsync, so there will always be at least one process switch
(probably many) while waiting for the fsync to complete, since we
are talking many milliseconds for the fsync in every case.

> > The log file would be opened O_DSYNC, O_APPEND every time.
> 
> Keep in mind that we support platforms without O_DSYNC.  I am not
> sure whether there are any that don't have O_SYNC either, but I am
> fairly sure that we measured O_SYNC to be slower than fsync()s on
> some platforms.

Well, there is no reason the LogWriter couldn't be doing fsyncs
instead of O_DSYNC writes in those cases. I'd leave this switchable
using the current flags; just change the semantics a bit.
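
In sketch form (the open flags are standard POSIX; the wrapper and
variable names are made up):

#include <fcntl.h>

static int need_fsync;   /* set when O_DSYNC is unavailable or disabled */

int open_wal(const char *path)
{
    int flags = O_WRONLY | O_APPEND;
#ifdef O_DSYNC
    flags |= O_DSYNC;    /* write() returns only once the data is on disk */
    need_fsync = 0;
#else
    need_fsync = 1;      /* LogWriter must follow each write() with fsync() */
#endif
    return open(path, flags, 0600);
}

Either way only the single LogWriter process pays the wait; committing
back-ends just sleep until their record is flushed.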

- Curtis




Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching

2002-10-05 Thread Doug McNaught

Tom Lane <[EMAIL PROTECTED]> writes:

> "Curtis Faith" <[EMAIL PROTECTED]> writes:

> > The log file would be opened O_DSYNC, O_APPEND every time.
> 
> Keep in mind that we support platforms without O_DSYNC.  I am not
> sure whether there are any that don't have O_SYNC either, but I am
> fairly sure that we measured O_SYNC to be slower than fsync()s on
> some platforms.

And don't we preallocate WAL files anyway?  So O_APPEND would be
irrelevant?

-Doug




Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching

2002-10-05 Thread Tom Lane

"Curtis Faith" <[EMAIL PROTECTED]> writes:
> Assume Transaction A which writes a lot of buffers and XLog entries,
> so the Commit forces a relatively lengthy fsync.

> Transactions B - E block not on the kernel lock from fsync but on
> the WALWriteLock. 

You are confusing WALWriteLock with WALInsertLock.  A
transaction-committing flush operation only holds the former.
XLogInsert only needs the latter --- at least as long as it
doesn't need to write.

Thus, given adequate space in the WAL buffers, transactions B-E do not
get blocked by someone else who is writing/syncing in order to commit.

Now, as the code stands at the moment there is no event other than
commit or full-buffers that prompts a write; that means that we are
likely to run into the full-buffer case more often than is good for
performance.  But a background writer task would fix that.

> Back-end servers would not issue fsync calls. They would simply block
> waiting until the LogWriter had written their record to the disk, i.e.
> until the sync'd block # was greater than the block that contained the
> XLOG_XACT_COMMIT record. The LogWriter could wake up committed back-
> ends after its log write returns.

This will pessimize performance except in the case where WAL traffic
is very heavy, because it means you don't commit until the block
containing your commit record is filled.  What if you are the only
active backend?

My view of this is that backends would wait for the background writer
only when they encounter a full-buffer situation, or indirectly when
they are trying to do a commit write and the background guy has the
WALWriteLock.  The latter serialization is unavoidable: in that
scenario, the background guy is writing/flushing an earlier page of
the WAL log, and we *must* have that down to disk before we can declare
our transaction committed.  So any scheme that tries to eliminate the
serialization of WAL writes will fail.  I do not, however, see any
value in forcing all the WAL writes to be done by a single process;
which is essentially what you're saying we should do.  That just adds
extra process-switch overhead that we don't really need.

> The log file would be opened O_DSYNC, O_APPEND every time.

Keep in mind that we support platforms without O_DSYNC.  I am not
sure whether there are any that don't have O_SYNC either, but I am
fairly sure that we measured O_SYNC to be slower than fsync()s on
some platforms.

> The nice part is that the WALWriteLock semantics could be changed to
> allow the LogWriter to write to disk while WALWriteLocks are acquired
> by back-end servers.

As I said, we already have that; you are confusing WALWriteLock
with WALInsertLock.

> Many transactions would commit on the same fsync (now really a write
> with O_DSYNC) and we would get optimal write throughput for the log
> system.

How are you going to avoid pessimizing the few-transactions case?

regards, tom lane




[HACKERS] Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching

2002-10-04 Thread Curtis Faith

It appears the fsync problem is pervasive. Here's Linux 2.4.19's
version from fs/buffer.c:

down(&inode->i_sem);                        /* lock: takes the per-inode semaphore */
ret = filemap_fdatasync(inode->i_mapping);  /* start writeback of dirty pages */
err = file->f_op->fsync(file, dentry, 1);
if (err && !ret)
        ret = err;
err = filemap_fdatawait(inode->i_mapping);  /* wait for the writeback to finish */
if (err && !ret)
        ret = err;
up(&inode->i_sem);                          /* unlock: released only afterward */

But this is probably not a big factor as you outline below because
the WALWriteLock is causing the same kind of contention.

Tom Lane wrote:
> This is kind of ugly in general terms but I'm not sure that it really
> hurts Postgres.  In our present scheme, the only files we ever fsync()
> are WAL log files, not data files.  And in normal operation there is
> only one WAL writer at a time, and *no* WAL readers.  So an exclusive
> kernel-level lock on a WAL file while we fsync really shouldn't create
> any problem for us.  (Unless this indirectly blocks other operations
> that I'm missing?)

I hope you're right but I see some very similar contention problems in
the case of many small transactions because of the WALWriteLock.

Assume Transaction A which writes a lot of buffers and XLog entries,
so the Commit forces a relatively lengthy fsync.

Transactions B - E block not on the kernel lock from fsync but on
the WALWriteLock. 

When A finishes the fsync and subsequently releases the WALWriteLock,
B unblocks and acquires the WALWriteLock to fsync its own flush.

C blocks on the WALWriteLock waiting to write its XLOG_XACT_COMMIT.

B Releases and now C writes its XLOG_XACT_COMMIT.

There now seems to be a lot of contention on the WALWriteLock. This
is a shame for a system that has no locking at the logical level and
therefore seems like it could be very, very fast and offer
incredible concurrency.

> As I commented before, I think we could do with an extra process to
> issue WAL writes in places where they're not in the critical path for
> a foreground process.  But that seems to be orthogonal from this issue.
 
It's only orthogonal to the fsync-specific contention issue. We now
have to worry about the WALWriteLock semantics causing the same
contention. Your idea of a separate LogWriter process could very
nicely solve this problem and accomplish a few other things at the
same time if we make a few enhancements.

Back-end servers would not issue fsync calls. They would simply block
waiting until the LogWriter had written their record to the disk, i.e.
until the sync'd block # was greater than the block that contained the
XLOG_XACT_COMMIT record. The LogWriter could wake up committed back-
ends after its log write returns.
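
A minimal sketch of that wake-up protocol (pthreads used purely for
illustration; real back-ends are separate processes and would use
PostgreSQL's own locks and semaphores, and every name below is made up):

#include <pthread.h>
#include <stdint.h>

typedef uint64_t XLogPtr;

static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  flush_cond = PTHREAD_COND_INITIALIZER;
static XLogPtr flushed_up_to = 0;  /* advanced by the LogWriter after each sync'd write */

/* Committing back-end: sleep until the LogWriter has flushed past
 * the block holding our XLOG_XACT_COMMIT record. */
void wait_for_commit_flush(XLogPtr commit_ptr)
{
    pthread_mutex_lock(&flush_lock);
    while (flushed_up_to < commit_ptr)
        pthread_cond_wait(&flush_cond, &flush_lock);
    pthread_mutex_unlock(&flush_lock);
}

/* LogWriter: called after its O_DSYNC write (or fsync) returns. */
void logwriter_advance(XLogPtr new_flushed)
{
    pthread_mutex_lock(&flush_lock);
    flushed_up_to = new_flushed;
    pthread_cond_broadcast(&flush_cond);  /* wake every back-end whose commit is now durable */
    pthread_mutex_unlock(&flush_lock);
}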

The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter
would issue writes of the optimal size when enough data was present or
of smaller chunks if enough time had elapsed since the last write.
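
Continuing that sketch, the LogWriter's main loop might look like this
(thresholds and helpers invented for illustration):

/* Write a big chunk as soon as one accumulates, or whatever is
 * pending once the timeout expires; then wake committed back-ends. */
#define OPTIMAL_CHUNK  (256 * 1024)  /* assumed best transfer size; tune per device */
#define MAX_WAIT_US    10000         /* assumed cap on added commit latency */

for (;;)
{
    size_t pending = pending_wal_bytes();   /* hypothetical helper */

    if (pending >= OPTIMAL_CHUNK ||
        (pending > 0 && usec_since_last_write() >= MAX_WAIT_US))
    {
        /* wal_fd was opened O_DSYNC | O_APPEND, so write() is durable on return */
        ssize_t n = write(wal_fd, pending_wal_start(), pending);

        if (n > 0)
            logwriter_advance(flushed_up_to + (XLogPtr) n);
    }
    else
        usleep(1000);   /* nothing worth writing yet */
}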

The nice part is that the WALWriteLock semantics could be changed to
allow the LogWriter to write to disk while WALWriteLocks are acquired
by back-end servers. WALWriteLocks would only be held for the brief time
needed to copy the entries into the log buffer. The LogWriter would
only need to grab a lock to determine the current end of the log
buffer. Since it would be writing blocks that occur earlier in the
cache than the XLogInsert log writers, it won't need to grab a
WALWriteLock before writing the cache buffers.

Many transactions would commit on the same fsync (now really a write
with O_DSYNC) and we would get optimal write throughput for the log
system.

This would handle all the issues I had and it doesn't sound like a
huge change. In fact, it ends up being almost semantically identical
to the aio_write suggestion I made originally, except that the
LogWriter is doing the background writing instead of the OS, and we
don't have to worry about aio implementations and portability.

- Curtis


