Re: [HACKERS] 500 tpsQL + WAL log implementation

2002-11-12 Thread Curtis Faith
Tom Lane wrote:
 What can you do *without* using a raw partition?

 I dislike that idea for two reasons: portability and security.  The
 portability disadvantages are obvious.  And in ordinary system setups
 Postgres would have to run as root in order to write on a raw partition.

 It occurs to me that the same technique could be used without any raw
 device access.  Preallocate a large WAL file and apply the method within
 it.  You'll have more noise in the measurements due to greater
 variability in the physical positioning of the blocks --- but it's
 rather illusory to imagine that you know the disk geometry with any
 accuracy anyway.  Modern drives play a lot of games under the hood.

A write to a raw disk partition completes immediately, with minimal system
overhead. I'll test a file-based approach using a write followed by an
immediate fdatasync and see whether it approaches the speed of the raw
partition access. I suspect we'll get decent performance, perhaps only 10% to
15% slower.

As you mention, there is nothing exact about the technique, so we should be
able to get similar improvements with a file-based system. Using raw partition
writes, each slightly offset ahead of the other, I've been able to get over
1,500 writes confirmed to disk per second, yet only somewhere between 500 and
650 per second on a sustained basis with the technique I described, because of
the noise in the geometry measurements and the variable timing of the writes
themselves.


 This scares me quite a bit too.  The reason that the existing
 implementation maxes out at one WAL write per rotation is that for small
 transactions it's having to repeatedly write the same disk sector.  You
 could only get around that by writing multiple versions of the same WAL
 page at different disk locations.  Reliably reconstructing what data to
 use is not something that I'm prepared to accept on a handwave...

I'm pretty sure this could be done very reliably, at the cost of slightly
slower reads during redo after a failure.

I figured that whenever a transaction wrote to the log it would advance the log
offset marker so that the next transaction is forced to use a new block. This
would waste some space, which could be partially offset by using writes smaller
than the 8K block size (aligned on disk sector boundaries, 512 bytes for my
disk). This has the advantage of making it fairly easy to ensure that the log
can be reconstructed in order, since there would be no partial block writes to
worry about.

I believe that 4 to 8 full rotations' worth of usable blocks could be
maintained, with blocks written to the lowest-offset tracks first unless no
free blocks of sufficient size remained. This would probably result in 90% to
95% utilization of the blocks (disregarding waste inside the blocks
themselves). When the lowest-offset track filled up sufficiently, another
empty track would be taken off the unused list and added to the usable blocks
list, and the filled track retired.

This would ensure that a read of 4 to 8 tracks (the exact number would be
fixed for any given installation) could reconstruct the order of the WAL log,
since at no time would blocks be out of order beyond that range.

Disk space is much cheaper than CPU and memory, so I think a logging system
that used as much as three or four times the space but ran three or four times
faster would be a worthwhile improvement for systems where update or insert
volume is very heavy. Obviously, this needs to be an option, not the default
configuration.

- Curtis


---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] 500 tpsQL + WAL log implementation

2002-11-12 Thread scott.marlowe

Curtis, have you considered comparing raw writes versus file system writes 
on a raw multi-disk partition?

I always set up my machines to store data on a mirror set (RAID1) or RAID5 
set, and it seems your method should be tested there too.

P.S. Tom, the postgresql user would NOT need to run as root to do this; 
you can just do a 'chown postgres.postgres /dev/sda1' or 'chown 
postgres.postgres /dev/md0' to give PostgreSQL permission to read from and 
write to a raw partition.

That said, I am not sure how much we are likely to gain from this method 
either.  But if no one tries it out we'll never know.





Re: [HACKERS] 500 tpsQL + WAL log implementation

2002-11-12 Thread Justin Clift
Bravo Curtis,

This is all excellent research.

:-)

Regards and best wishes,

Justin Clift


Curtis Faith wrote:
snip
 Disk space is much cheaper than CPU and memory so I think that a logging
 system that used as much as three or four times the space but is three or
 four times faster would be a worthwhile improvement for those systems where
 update or insert volume is very heavy. Obviously, this needs to be an
 option, not the default configuration.
 
 - Curtis

-- 
My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there.
   - Indira Gandhi




Re: [HACKERS] 500 tpsQL + WAL log implementation

2002-11-11 Thread Tom Lane
Curtis Faith [EMAIL PROTECTED] writes:
 Using a raw file partition and a time-based technique for determining the
 optimal write position, I am able to get 8K writes physically written to disk
 synchronously in the range of 500 to 650 writes per second using FreeBSD raw
 device partitions on IDE disks (with write cache disabled).

What can you do *without* using a raw partition?

I dislike that idea for two reasons: portability and security.  The
portability disadvantages are obvious.  And in ordinary system setups
Postgres would have to run as root in order to write on a raw partition.

It occurs to me that the same technique could be used without any raw
device access.  Preallocate a large WAL file and apply the method within
it.  You'll have more noise in the measurements due to greater
variability in the physical positioning of the blocks --- but it's
rather illusory to imagine that you know the disk geometry with any
accuracy anyway.  Modern drives play a lot of games under the hood.

 The obvious problem with the above mechanism is that the WAL log needs to be
 able to read from the log file in transaction order during recovery. This
 could be provided by an abstraction that prepends the logical order number to
 each block written to disk and ensures that every log block contains either a
 valid logical order number or some other marker indicating that the block is
 not in use.

This scares me quite a bit too.  The reason that the existing
implementation maxes out at one WAL write per rotation is that for small
transactions it's having to repeatedly write the same disk sector.  You
could only get around that by writing multiple versions of the same WAL
page at different disk locations.  Reliably reconstructing what data to
use is not something that I'm prepared to accept on a handwave...

regards, tom lane
