Re: [HACKERS] 500 tpsQL + WAL log implementation
Tom Lane wrote:
> What can you do *without* using a raw partition? I dislike that idea
> for two reasons: portability and security. The portability
> disadvantages are obvious. And in ordinary system setups Postgres
> would have to run as root in order to write on a raw partition.
>
> It occurs to me that the same technique could be used without any raw
> device access. Preallocate a large WAL file and apply the method
> within it. You'll have more noise in the measurements due to greater
> variability in the physical positioning of the blocks --- but it's
> rather illusory to imagine that you know the disk geometry with any
> accuracy anyway. Modern drives play a lot of games under the hood.

A write to a raw disk file is immediate and completes with minimal
system overhead. I'll try to test a file-based approach using a write
followed by an immediate fdatasync and see if that approaches the
speed of the raw partition access. I suspect we'll get decent
performance, perhaps only 10% to 15% slower. As you mention, there is
nothing exact about the technique, so we should be able to get similar
improvements with a file-based system.

I've been able to get over 1,500 raw writes confirmed to disk using
raw partition writes, each slightly offset ahead of the other, yet
only somewhere between 500 and 650 on a sustained basis using the
technique I described, because of the noise in the geometry
measurements and the variable timing of the writes themselves.

Tom Lane wrote:
> This scares me quite a bit too. The reason that the existing
> implementation maxes out at one WAL write per rotation is that for
> small transactions it's having to repeatedly write the same disk
> sector. You could only get around that by writing multiple versions
> of the same WAL page at different disk locations. Reliably
> reconstructing what data to use is not something that I'm prepared
> to accept on a handwave...

I'm pretty sure this could be done very reliably, but at the cost of
slightly slower reading after a failure for redo.
I figured that whenever a transaction wrote to the log it would set
the log offset marker for new transactions, forcing the next
transaction to use a new block. This would waste space, which could be
partially offset by using writes smaller than the 8K block size
(aligned on disk block size boundaries, 512 bytes for my disk). This
has the advantage of making it fairly easy to ensure that the log can
be reconstructed in order, since there would be no partial block
writes to worry about.

I believe that 4 to 8 full rotations' worth of usable blocks could be
maintained, with blocks written to the lowest-offset tracks first
unless there were no free blocks of sufficient size. This would
probably result in 90% to 95% utilization of the blocks (disregarding
waste inside the blocks themselves). When the lowest-offset track
filled up sufficiently, another empty track would be added to the
usable blocks list and the lowest-offset track taken off the unused
list. This would ensure that a read of 4 to 8 tracks (which needs to
be a fixed number for any given installation) could reconstruct the
order of the WAL log, since at no time would blocks be out of order
beyond that range.

Disk space is much cheaper than CPU and memory, so I think a logging
system that used as much as three or four times the space but was
three or four times faster would be a worthwhile improvement for those
systems where update or insert volume is very heavy. Obviously, this
needs to be an option, not the default configuration.

- Curtis

---(end of broadcast)---
TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] 500 tpsQL + WAL log implementation
Curtis, have you considered comparing raw writes against file system
writes on a multi-disk partition? I always set up my machines to store
data on a mirror set (RAID1) or RAID5 set, and it seems your method
should be tested there too.

P.S. Tom, the postgresql user would NOT need to run as root to do
this; you can just do a 'chown postgres.postgres /dev/sda1' or
'chown postgres.postgres /dev/md0' to give postgresql permission to
read from and write to a raw partition. That said, I am not sure how
much we are likely to gain from this method either. But if no one
tries it out we'll never know.
Re: [HACKERS] 500 tpsQL + WAL log implementation
Bravo Curtis, this is all excellent research. :-)

Regards and best wishes,
Justin Clift

Curtis Faith wrote:
<snip>
> Disk space is much cheaper than CPU and memory so I think that a
> logging system that used as much as three or four times the space
> but is three or four times faster would be a worthwhile improvement
> for those systems where updates or insert volume are very heavy.
> Obviously, this needs to be an option, not the default configuration.
>
> - Curtis

--
My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there.
- Indira Gandhi
Re: [HACKERS] 500 tpsQL + WAL log implementation
Curtis Faith [EMAIL PROTECTED] writes:
> Using a raw file partition and a time-based technique for determining
> the optimal write position, I am able to get 8K writes physically
> written to disk synchronously in the range of 500 to 650 writes per
> second using FreeBSD raw device partitions on IDE disks (with write
> cache disabled).

What can you do *without* using a raw partition? I dislike that idea
for two reasons: portability and security. The portability
disadvantages are obvious. And in ordinary system setups Postgres
would have to run as root in order to write on a raw partition.

It occurs to me that the same technique could be used without any raw
device access. Preallocate a large WAL file and apply the method
within it. You'll have more noise in the measurements due to greater
variability in the physical positioning of the blocks --- but it's
rather illusory to imagine that you know the disk geometry with any
accuracy anyway. Modern drives play a lot of games under the hood.

> The obvious problem with the above mechanism is that the WAL log
> needs to be read back in transaction order during recovery. This
> could be provided for using an abstraction that prepends the logical
> order to each block written to disk and makes sure that each log
> block contains either a valid logical order number or some other
> marker indicating that the block is not in use.

This scares me quite a bit too. The reason that the existing
implementation maxes out at one WAL write per rotation is that for
small transactions it's having to repeatedly write the same disk
sector. You could only get around that by writing multiple versions
of the same WAL page at different disk locations. Reliably
reconstructing what data to use is not something that I'm prepared to
accept on a handwave...

regards, tom lane