On 11 Oct 2012, at 10:41pm, Nico Williams <n...@cryptonector.com> wrote:

> On Thu, Oct 11, 2012 at 11:59 AM, Simon Slavin <slav...@bigfraud.org> wrote:
>> On 11 Oct 2012, at 5:38pm, Nico Williams <n...@cryptonector.com> wrote:
>>> There is something you can do: use a combination of COW on-disk
>>> formats in such a way that it's possible to detect partially-committed
>>> transactions and rollback to the last good known root
>> 
>> This is actually the problem, not the solution.  Traditional disk drivers 
>> for spinning disks change the order in which they write things to disk.  
>> They will buffer several write commands up and notice that on the way to 
>> moving the write head to do write #1 the write head will pass over the 
>> correct spot to do write #4, so they will do write #4 first.  Many disks 
>> will do this even with the disk jumpers set to enforce in-order writing: 
>> they lie.
> 
> You missed something: because fsync()s are done (in the background),
> you are guaranteed that transactions do eventually make it (in order,
> up to the point of the fsync()) onto disk.

Unfortunately the writes are not guaranteed to be made in order.  fsync()
depends on the storage driver not returning until the hard disk has actually
finished its writes.  In other words ...

1. Your program calls the operating system's fsync() or equivalent
   (Technically speaking you may also want to fsync() the directory
   containing the file.)
2.     Operating system flushes pending writes to the hard disk by
       calling the storage driver
3.          Storage driver receives pending writes
4.          Storage driver converts those writes to instructions to the
            storage hardware
5.              Physical changes are made within the storage hardware
6.          Storage driver waits until they have all actually been made
7.          Once those changes have been made, storage driver reports
            success to operating system
8.     Operating system (fsync()) loops waiting for the storage driver to
       report all writes have been done
9. Operating system returns from fsync() call reporting success
10. Your program can proceed
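
From the program's side, step 1 (including the directory fsync()) can be
sketched roughly like this.  It is only an illustration: durable_write() is a
made-up helper name, and it assumes a POSIX system where a directory can be
opened and passed to fsync().  It shows what the program asks for, not what
the drive actually does.

```python
import os

def durable_write(path, data):
    """Hypothetical helper: write data, then fsync the file and its directory."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Step 1: ask the OS to push this file's pending writes to storage.
        os.fsync(fd)
    finally:
        os.close(fd)

    # The parenthetical in step 1: fsync the containing directory too, so the
    # directory entry for a newly created file is flushed as well (POSIX only).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

When durable_write() returns, the program is at step 10.  Whether steps 5 to 8
really happened in between depends on the driver and the hardware.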

The problem is that a standard desktop computer doesn't do 5 and may not do 8
either.  The storage drivers save a great deal of time by reporting success
before the disk's surface has been changed, instead of waiting for the
physical changes.  And they will immediately accept more changes,
disrespecting the write barrier.  At the driver level, the barrier between
transactions gets lost.
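
One rough way to check whether a particular spinning disk reports success
early is to time a run of small write-plus-fsync() cycles against the same
offset.  A 7200 rpm disk turns 120 times a second, so if every fsync() really
waited for the platter you would expect a rate of roughly that order.  A rate
far higher than that suggests something below the application is
acknowledging writes early.  This is only a heuristic sketch: the function
name and the default round count are made up for illustration, it assumes a
POSIX system, and it says nothing useful about SSDs or about drives behind a
battery-backed cache.

```python
import os
import time

def fsyncs_per_second(path, rounds=50):
    """Hypothetical probe: rate of small write+fsync cycles on one file."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        start = time.perf_counter()
        for i in range(rounds):
            # Rewrite the same 8 bytes at offset 0 each time, then flush.
            os.pwrite(fd, b"%08d" % i, 0)
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return rounds / elapsed
```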

Server-level hardware (not most popular cheap hard disks) can be configured to 
do this properly, usually using mini-switches or jumper settings.  But if you 
try to do that to a standard desktop computer you will find it slows down so 
much it's unusable: type a few characters in Word and you can wait two or three 
seconds to see them on the screen.  That's why the makers of normal computers 
don't do it.  In addition, many operating systems also don't implement fsync() 
properly, for the same reason: to make the computer feel faster.  You can see 
something about this in section 9.2 of

<http://www.sqlite.org/atomiccommit.html>
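
For SQLite itself, asking for the strongest flushing it offers can be
sketched as follows.  PRAGMA synchronous and PRAGMA fullfsync are real SQLite
pragmas (fullfsync only has an effect on Mac OS X); open_durable() is a
hypothetical helper name.  As explained above, these settings only request a
proper flush, and hardware or drivers that report success early can still
defeat them.

```python
import sqlite3

def open_durable(path):
    """Hypothetical helper: open a database with the strictest sync settings."""
    conn = sqlite3.connect(path)
    # Have SQLite call fsync() (or the platform equivalent) at every
    # critical point of a commit.
    conn.execute("PRAGMA synchronous = FULL")
    # On Mac OS X, use F_FULLFSYNC so the drive is asked to flush its own
    # cache too.  On other platforms this setting has no effect.
    conn.execute("PRAGMA fullfsync = ON")
    return conn
```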

Simon.
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
