[HACKERS] fsyncing data to disk

2011-09-09 Thread Nulik Nol
Hi,
this is not exactly a PostgreSQL question, but input from a hackers
list like this would be invaluable to me.
I am coding my own database engine, and I decided not to implement a
transaction engine because it would require too much code.
But to achieve the Durability of ACID I need a 100% reliable write to
disk. By design no record in my DB will be larger than 512 bytes, so I
am using a page size of 512 bytes, which matches the size of the disk
block, so every write() I execute, followed by an fdatasync() call,
will be 100% written, is that correct? If I tell it to write 512 bytes
and the power goes off, it won't write only 300 bytes, or will it? I
am going to use the whole partition device for the DB (like /dev/sda1),
so no filesystem code will be used. Also, I am using asynchronous IO
(aio_read and aio_write) and I don't know whether they can be combined
with the fdatasync() syscall.

Will appreciate your comments

Regards

-- 
==
The power of zero is infinite

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fsyncing data to disk

2011-09-09 Thread Florian Pflug
On Sep 9, 2011, at 20:15, Nulik Nol wrote:
> this is not exactly a PostgreSQL question, but input from a hackers
> list like this would be invaluable to me.
> I am coding my own database engine, and I decided not to implement a
> transaction engine because it would require too much code.
> But to achieve the Durability of ACID I need a 100% reliable write to
> disk. By design no record in my DB will be larger than 512 bytes, so I
> am using a page size of 512 bytes,

Beware that there *are* disks with block sizes other than 512 bytes. For
example, at least for 2.5-inch disks, 4096 bytes/block is becoming quite
common these days.
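
A minimal sketch of checking the block size instead of assuming 512, for
Linux only. It reads the sysfs attribute
/sys/block/<dev>/queue/physical_block_size. The device name "sda" and the
helper names read_block_size and pad_to_block are assumptions made up for
illustration; nothing here comes from the original poster's code.

```python
def read_block_size(sysfs_file, default=512):
    """Return the block size in bytes stored in a Linux sysfs attribute file.

    Falls back to `default` if the file is missing or cannot be parsed.
    """
    try:
        with open(sysfs_file) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return default


def pad_to_block(record, block_size):
    """Zero-pad `record` so that it fills exactly one block."""
    if len(record) > block_size:
        raise ValueError("record does not fit in one block")
    return record + b"\x00" * (block_size - len(record))


if __name__ == "__main__":
    # "sda" is a placeholder device name, not taken from a real setup.
    size = read_block_size("/sys/block/sda/queue/physical_block_size")
    page = pad_to_block(b"example record", size)
    print(size, len(page))
```

With a 4096-byte block, a 512-byte page covers only part of a block, so
the "one record per disk block" assumption would no longer hold.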

> which matches the size of the disk
> block, so every write() I execute, followed by an fdatasync() call,
> will be 100% written, is that correct? If I tell it to write 512 bytes
> and the power goes off, it won't write only 300 bytes, or will it?

Since error correction is done per-block, it's very unlikely that you'd see
only 300 of the 512 bytes overwritten - the drive would detect uncorrectable
data corruption and report an error instead. Whether that error is reported back
to the application as an IO error or as a zeroed-out block probably depends on
the OS.

What you actually seem to want is a stronger all-or-nothing guarantee which
precludes the error case. AFAIK, most disk drives kind of do that, because
the various capacitors which stabilize the power supply usually hold enough
charge to complete a write once it's started, and because they stop operating
if the power drops below some threshold. But I doubt that they provide any
hard guarantees in this area; I guess it's more of a best-effort thing.

To get hard guarantees, you'll need to use a RAID controller with a
battery-backed cache. Or use a journal/WAL like postgres (and most filesystems)
do, and protect journal/WAL entries with a checksum to detect partially written
entries.
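
A rough sketch of what "protect journal entries with a checksum" could
look like. The entry layout (4-byte length, 4-byte CRC-32, then the
payload) and the names encode_entry and decode_entry are assumptions for
illustration; this is not the PostgreSQL WAL format.

```python
import struct
import zlib

# Little-endian header: payload length, then CRC-32 of the payload.
HEADER = struct.Struct("<II")


def encode_entry(payload):
    """Prefix `payload` with its length and CRC-32."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def decode_entry(buf):
    """Return the payload, or None if the entry is truncated or corrupt."""
    if len(buf) < HEADER.size:
        return None
    length, crc = HEADER.unpack_from(buf)
    # An all-zero block has length 0 and CRC 0, which is also the CRC-32
    # of empty data, so empty entries are rejected to avoid accepting it.
    if length == 0:
        return None
    payload = buf[HEADER.size:HEADER.size + length]
    if len(payload) != length or zlib.crc32(payload) != crc:
        return None
    return payload
```

On recovery, the reader would stop at the first entry for which
decode_entry returns None and treat everything after it as not written.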

> I am going to use the whole partition device for the DB (like /dev/sda1),
> so no filesystem code will be used. Also, I am using asynchronous IO
> (aio_read and aio_write) and I don't know whether they can be combined
> with the fdatasync() syscall.

Someone else (maybe the POSIX spec?) must answer that as I know very little
about asynchronous IO.

best regards,
Florian Pflug




Re: [HACKERS] fsyncing data to disk

2011-09-09 Thread Greg Stark
On Fri, Sep 9, 2011 at 7:46 PM, Florian Pflug <f...@phlo.org> wrote:
>> I am going to use the whole partition device for the DB (like /dev/sda1),
>> so no filesystem code will be used. Also, I am using asynchronous IO
>> (aio_read and aio_write) and I don't know whether they can be combined
>> with the fdatasync() syscall.
>
> Someone else (maybe the POSIX spec?) must answer that as I know very little
> about asynchronous IO.


There's an aio_fsync as part of the AIO API, but you could also use fsync
or fdatasync. I assume you would have to wait for the aio_write to
finish before you issue the fsync. But if you're going to
fdatasync all your writes right away, you may as well open the device
with O_DSYNC, which is, I gather, exactly how AIO is intended to be used.
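
To make the two options concrete, here is a small sketch using plain
synchronous calls (the Python standard library has no POSIX AIO wrappers,
so aio_write and aio_fsync are not shown). PAGE_SIZE and the helper names
write_page_then_sync and open_dsync are assumptions for illustration.

```python
import os

PAGE_SIZE = 512  # page size from the thread; real devices may differ


def write_page_then_sync(fd, page_no, page):
    """Write one page, then block until the OS reports the data flushed."""
    written = os.pwrite(fd, page, page_no * PAGE_SIZE)
    if written != len(page):
        raise OSError("short write: %d of %d bytes" % (written, len(page)))
    # fdatasync() is not available on every platform; fall back to fsync().
    getattr(os, "fdatasync", os.fsync)(fd)


def open_dsync(path):
    """Open `path` so that each write returns only after the data is flushed.

    Where O_DSYNC is missing the flag falls back to 0, which means no
    synchronous behaviour; callers must then sync explicitly.
    """
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_DSYNC", 0)
    return os.open(path, flags, 0o600)
```

With a descriptor from open_dsync, a bare os.pwrite(fd, page, offset)
should behave roughly like write_page_then_sync without the separate
sync call.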

-- 
greg
