Re: [sqlite] light weight write barriers

2012-11-28 Thread Vladislav Bolkhovitin


Nico Williams, on 11/26/2012 03:05 PM wrote:

Vlad,

You keep saying that programmers don't understand "barriers".  You've
provided no evidence of this. Meanwhile memory barriers are generally
well understood, and every programmer I know understands that a
"barrier" is a synchronization primitive that says that all operations
of a certain type will have completed prior to the barrier returning
control to its caller.


Well, your understanding of memory barriers is wrong, and you are illustrating 
that the memory barrier concept is not so well understood in practice.


Simplifying, memory barrier instructions are not a "cache flush" of this CPU, as is 
often thought. They set the order in which reads or writes from other CPUs become 
visible on this CPU, and nothing else. Locally, on each CPU, reads and writes are 
always seen in order. So, (1) on a single-CPU system memory barrier instructions don't 
make any sense, and (2) they should be used at least in a pair, one for each CPU 
participating in the interaction; otherwise it's an apparent sign of a mistake.
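
To make the pairing requirement concrete, here is a minimal sketch (not from the
original discussion) using C11 atomics; the release store on the producer CPU only
means something because the consumer CPU pairs it with an acquire load:

/* Minimal sketch of paired barriers with C11 atomics.  A release barrier on
 * the writer side is useless unless the reader side pairs it with an
 * acquire barrier. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int payload;              /* plain data published by the producer */
static atomic_int ready;         /* flag used to publish the payload     */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;                                            /* ordinary write  */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* "write barrier" */
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire)) /* paired "read barrier" */
        ;                                                    /* spin until published */
    printf("payload = %d\n", payload);                       /* guaranteed to see 42 */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

If either side's barrier is dropped, the consumer may legally see ready == 1 and
still read a stale payload, which is the pairing point made above.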


There's nothing similar in storage, because storage has strong consistency 
requirements even when it is distributed. All those clouds and Hadoop-like systems with 
weak consistency requirements are outside of this discussion, although even they don't 
have anything similar to memory barriers.


As I already wrote, the concept of a flat Earth with the Sun revolving around it is 
also very simple to understand. Are you still using that concept?



So just give us a barrier.


As with the flat Earth, I'd strongly suggest you start using an adequate 
concept of what you want to achieve, starting from what I proposed a few e-mails ago 
in this thread.


If you look at it, it offers exactly what you want, only named correctly.

Vlad


Re: [sqlite] light weight write barriers

2012-11-26 Thread Nico Williams
On Mon, Nov 26, 2012 at 6:05 PM, Larry Brasfield wrote:
> Nico Williams emitted:
>
>> You keep saying that programmers don't understand "barriers".  You've
>> provided no evidence of this.  Meanwhile memory barriers are generally
>> well understood, and every programmer I know understands that a
>> "barrier" is a synchronization primitive that says that all operations
>> of a certain type will have completed prior to the barrier returning
>> control to its caller.
>
>
> Well, since you don't know me, this does not contradict you, but ...
>
> My understanding of a memory barrier, formed from close study of the need
> for them and some implementations, is that they cause a partial ordering of
> memory operations, such that all accesses instigated before the barrier is
> created occur before all accesses instigated after the barrier is created.
> This does not mean that the caller of a barrier-creating function (or the
> executor of a barrier-creating instruction) does not get control until all
> prior accesses have been "completed".  The "caller" may well continue
> executing instructions from cache, and other execution units may not be held
> up at all unless they instigate an "after" access.
>
> I will be happy to become differently educated on this subject. (perhaps via
> some evidence ;-)

That's fair, but the effect is still indistinguishable from what
I wrote.  (Well, I suppose one has to be careful about the possibility
of a CPU with I/O ports writes to which are not included in the
concept of a memory barrier, but we have to simplify somewhere, and
the point is that barriers are a simple enough concept that we can
program with it, and this is all the more so in filesystems, where we
don't have to concern ourselves with the nuances of many different
CPUs.  There are nuances in filesystem barriers, particularly writes
to MAP_SHARED mmap()ed regions, but barriers don't create new problems
there.)
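
For readers unfamiliar with that nuance, a hedged illustration: stores into a
MAP_SHARED mapping bypass write(), so today an application has to push them back
explicitly (msync()) before any fsync()-style barrier can cover that data.  The
update_mapped() helper below is invented purely for illustration.

#include <sys/mman.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int update_mapped(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(p, "new contents", 12);   /* dirty the page through the mapping   */
    msync(p, 4096, MS_SYNC);         /* push the dirty page back to the file */
    fsync(fd);                       /* then ask the device to make it real  */

    munmap(p, 4096);
    close(fd);
    return 0;
}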

Nico
--


Re: [sqlite] light weight write barriers

2012-11-26 Thread Larry Brasfield

Nico Williams emitted:

You keep saying that programmers don't understand "barriers".  You've
provided no evidence of this.  Meanwhile memory barriers are generally
well understood, and every programmer I know understands that a
"barrier" is a synchronization primitive that says that all operations
of a certain type will have completed prior to the barrier returning
control to its caller.


Well, since you don't know me, this does not contradict you, but ...

My understanding of a memory barrier, formed from close study of the 
need for them and some implementations, is that they cause a partial 
ordering of memory operations, such that all accesses instigated before 
the barrier is created occur before all accesses instigated after the 
barrier is created.  This does not mean that the caller of a 
barrier-creating function (or the executor of a barrier-creating 
instruction) does not get control until all prior accesses have been 
"completed".  The "caller" may well continue executing instructions from 
cache, and other execution units may not be held up at all unless they 
instigate an "after" access.


I will be happy to become differently educated on this subject. 
(perhaps via some evidence ;-)


Cheers,
--
Larry Brasfield



Re: [sqlite] light weight write barriers

2012-11-26 Thread Nico Williams
Vlad,

You keep saying that programmers don't understand "barriers".  You've
provided no evidence of this.  Meanwhile memory barriers are generally
well understood, and every programmer I know understands that a
"barrier" is a synchronization primitive that says that all operations
of a certain type will have completed prior to the barrier returning
control to its caller.

For some filesystems it is possible to configure fsync() to act as a
barrier: for example, ZFS can be told to perform no synchronous
operations for a given dataset, in which case fsync() devolves into a
simple barrier.  (Cue Simon to tell us that some hardware and some
OSes, and some filesystems simply cannot implement fsync(), with or
without synchronicity.)

So just give us a barrier.  Yes, I know, it's tricky to implement, but
it'd be OK to return EOPNOTSUPP, and let the app do something else
(e.g., call fsync() instead, tell the user to expect instability, tell
the user to get a better system, ...).
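
A hedged sketch of that calling pattern follows.  fbarrier() is a hypothetical
syscall invented here purely for illustration (it does not exist in any kernel);
the point is only the graceful fallback to fsync():

#include <errno.h>
#include <unistd.h>

extern int fbarrier(int fd);   /* hypothetical: order prior writes ahead of later ones */

int write_barrier(int fd)
{
    if (fbarrier(fd) == 0)
        return 0;              /* fast path: ordering without waiting on the media */
    if (errno == EOPNOTSUPP)
        return fsync(fd);      /* fall back to the slower, durable primitive       */
    return -1;                 /* some other error */
}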

As for implementation, it helps to have a journalled or log-structured
filesystem.  It also helps to have hardware synchronization primitives
that don't suck, but these aren't entirely necessary: ZFS, for
example, can recover [*] from N incomplete transactions[**], and still
provides fsync() as a barrier given its on-disk structure and the ZIL.
 Note that ZFS recovery from incomplete transactions should never be
necessary where the HW has proper cache flush support, but the
recovery functionality was added precisely because of lousy hardware.

[*]   At volume import time, such as at boot-time.
[**] Granted, this requires user input, but if the user didn't care it
could be made automatic.

Nico
--


Re: [sqlite] light weight write barriers

2012-11-19 Thread Vladislav Bolkhovitin

Vladislav Bolkhovitin, on 11/17/2012 12:02 AM wrote:

The easiest way to implement this fsync would involve three things:
1. Schedule writes for all dirty pages in the fs cache that belong to
the affected file, wait for the device to report success, issue a cache
flush to the device (or request ordering commands, if available) to make
it tell the truth, and wait for the device to report success. AFAIK this
already happens, but without taking advantage of any request ordering
commands.
2. The requesting thread returns as soon as the kernel has identified
all data that will be written back. This is new, but pretty similar to
what AIO already does.
3. No write is allowed to enqueue any requests at the device that
involve the same file, until all outstanding fsync complete [3]. This is
new.


This sounds interesting as a way to expose some useful semantics to userspace.

I assume we'd need to come up with a new syscall or something since it doesn't
match the behaviour of posix fsync().


This is how I would export cache sync and request ordering abstractions to the
user space:

For async IO (io_submit() and friends) I would extend struct iocb with flags
that would allow setting the required capabilities, i.e. whether this request is
FUA, or a full cache sync, immediate [1] or not, ORDERED or not, or any
combination of these, per iocb.

For the regular read()/write() I would add one more flag to the "flags"
parameter of sync_file_range(): whether this sync is immediate or not.

To enforce ordering rules I would add one more command to fcntl(). It would make
the latest submitted write on this fd ORDERED.


Correction: to avoid possible races, it would be better for the new fcntl() command 
to specify that the N subsequent read()/write()/sync() calls are ORDERED.


For instance, in the simplest case of N=1, the single write() issued after the fcntl() 
would be handled as ORDERED.


(Unfortunately, the old read()/write() interface doesn't seem to have room for a 
more elegant solution.)
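
To make the proposal concrete, here is a hedged sketch of how it might look from an
application.  The F_SET_ORDERED command and the SYNC_FILE_RANGE_IMMEDIATE flag are
hypothetical names invented for illustration; only the SYNC_FILE_RANGE_WRITE flag
below exists today.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define F_SET_ORDERED             1099   /* hypothetical fcntl() command         */
#define SYNC_FILE_RANGE_IMMEDIATE 0x10   /* hypothetical "return ASAP" sync flag */

void ordered_update(int fd, const void *data, size_t len, off_t off)
{
    /* Mark the next 1 write submitted on this fd as ORDERED: it may not be
     * reordered ahead of previously submitted requests. */
    fcntl(fd, F_SET_ORDERED, 1);
    pwrite(fd, data, len, off);

    /* Start pushing the range toward the media but return immediately,
     * instead of blocking until the data is durable. */
    sync_file_range(fd, off, len,
                    SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_IMMEDIATE);
}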


Vlad


Re: [sqlite] light weight write barriers

2012-11-17 Thread Keith Medcalf
> The more money you pay for your storage, the less likely this is to be an
> issue (high end SSD's, enterprise class arrays, etc don't have volatile write
> caches and most SAS drives perform reasonably well with the write cache 
> disabled).

"Performance" without a write cache is a physical property.  It varies 
according to very simple principles related to arial density, rotational speed, 
actuator speed (stepping -- momentum, acceleration and deceleration and 
settling of the read/write heads).  "Performance", without a write cache, has 
absolutely nothing whatsoever to do with the external data transfer method or 
whether the bus is parallel, serial, or via hyperspace, just as it has nothing 
to do with whether the moon is made of green of purple cheese.

Statements such as "most SAS drives perform reasonably well with the write 
cache disabled" demonstrate a very deep seated ignorance of "the way things 
work" that ought to indicate that anything said should be taken as highly 
likely to be incorrect.

---
()  ascii ribbon campaign against html e-mail
/\  www.asciiribbon.org





Re: [sqlite] light weight write barriers

2012-11-17 Thread Vladislav Bolkhovitin


Chris Friesen, on 11/15/2012 05:35 PM wrote:

The easiest way to implement this fsync would involve three things:
1. Schedule writes for all dirty pages in the fs cache that belong to
the affected file, wait for the device to report success, issue a cache
flush to the device (or request ordering commands, if available) to make
it tell the truth, and wait for the device to report success. AFAIK this
already happens, but without taking advantage of any request ordering
commands.
2. The requesting thread returns as soon as the kernel has identified
all data that will be written back. This is new, but pretty similar to
what AIO already does.
3. No write is allowed to enqueue any requests at the device that
involve the same file, until all outstanding fsync complete [3]. This is
new.


This sounds interesting as a way to expose some useful semantics to userspace.

I assume we'd need to come up with a new syscall or something since it doesn't
match the behaviour of posix fsync().


This is how I would export cache sync and request ordering abstractions to the 
user space:


For async IO (io_submit() and friends) I would extend struct iocb with flags that 
would allow setting the required capabilities, i.e. whether this request is FUA, or a 
full cache sync, immediate [1] or not, ORDERED or not, or any combination of these, per 
iocb.


For the regular read()/write() I would add one more flag to the "flags" parameter of 
sync_file_range(): whether this sync is immediate or not.


To enforce ordering rules I would add one more command to fcntl(). It would make 
the latest submitted write on this fd ORDERED.


Taken together, those should provide the requested functionality in a simple, 
effective, unambiguous and backward-compatible manner.


Vlad

1. See my other today's e-mail about what is immediate cache sync.
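
For the io_submit() side, a hedged sketch of what such per-iocb flags might look
like.  The IOCB_FLAG_FUA and IOCB_FLAG_ORDERED bits are hypothetical and invented
for illustration; struct iocb and its aio_flags field are real, but today the kernel
defines only IOCB_FLAG_RESFD there.

#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>

#define IOCB_FLAG_FUA     (1u << 8)   /* hypothetical: write through to the media       */
#define IOCB_FLAG_ORDERED (1u << 9)   /* hypothetical: keep order relative to prior I/O */

int submit_ordered_write(aio_context_t ctx, int fd,
                         void *buf, size_t len, long long off)
{
    struct iocb cb;
    struct iocb *cbs[1] = { &cb };

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes     = fd;
    cb.aio_lio_opcode = IOCB_CMD_PWRITE;
    cb.aio_buf        = (unsigned long)buf;
    cb.aio_nbytes     = len;
    cb.aio_offset     = off;
    cb.aio_flags      = IOCB_FLAG_FUA | IOCB_FLAG_ORDERED;  /* per-iocb capabilities */

    return syscall(SYS_io_submit, ctx, 1, cbs);             /* raw syscall, no libaio */
}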


Re: [sqlite] light weight write barriers

2012-11-17 Thread Vladislav Bolkhovitin

杨苏立 Yang Su Li, on 11/15/2012 11:14 AM wrote:

1. fsync actually does two things at the same time: ordering writes (in a
barrier-like manner), and forcing cached writes to disk. This makes it very
difficult to implement fsync efficiently.


Exactly!


However, logically they are two distinctive functionalities


Exactly!

Those two points are exactly why the concept of barriers must be forgotten for the 
sake of productivity and replaced by finer-grained abstractions, and why barriers 
were removed from the Linux kernel.


Vlad


Re: [sqlite] light weight write barriers

2012-11-17 Thread Vladislav Bolkhovitin

David Lang, on 11/15/2012 07:07 AM wrote:

There's no such thing as "barrier". It is fully artificial abstraction. After
all, at the bottom of your stack, you will have to translate it either to cache
flush, or commands order enforcement, or both.


When people talk about barriers, they are talking about order enforcement.


Not correct. When people talk about barriers, they mean different things. For 
instance, Alan Cox a few e-mails ago meant cache flush.


That's the problem with the barriers concept: barriers are ambiguous. There's no 
barrier which can fit all requirements.



the hardware capabilities are not directly accessible from userspace (and they
probably shouldn't be)


The discussion is not about directly exposing storage hardware capabilities to 
user space. The discussion is about replacing the fully inadequate barrier 
abstraction with a set of other, adequate abstractions.


For instance:

1. Cache flush primitives:

1.1. FUA

1.2. Non-immediate cache flush, i.e. don't return until all data hit non-volatile 
media


1.3. Immediate cache flush, i.e. return ASAP after the cache sync started, 
possibly before all data hit non-volatile media.


2. ORDERED attribute for requests. It provides the following behavior rules:

A.  All requests without this attribute can be executed in parallel and be freely 
reordered.


B. No ORDERED command can complete before any previous command, ORDERED or not, 
has completed.


Those abstractions can naturally fit all storage capabilities. For instance:

 - On simple write-through (WT) cache hardware that doesn't support ordering 
commands, (1) translates to a NOP and (2) to queue draining.


 - On fully featured hardware, both (1) and (2) translate to the corresponding 
storage capabilities.


On FTL storage, (B) can be further optimized by doing the data transfers for ORDERED 
commands in parallel, but committing them in the requested order.
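
A hedged sketch of the translation described above; the flag names, the dev_caps
description of the device, and the io_plan structure are all invented for
illustration:

enum req_flags {
    REQ_F_FUA       = 1 << 0,   /* 1.1: write this request through to the media       */
    REQ_F_FLUSH     = 1 << 1,   /* 1.2: don't complete until the data is on the media */
    REQ_F_FLUSH_IMM = 1 << 2,   /* 1.3: start the flush, complete immediately         */
    REQ_F_ORDERED   = 1 << 3,   /* 2:   may not pass previously queued requests       */
};

struct dev_caps {
    int has_writeback_cache;    /* 0 for simple write-through (WT) devices */
    int has_ordered_commands;   /* e.g. a native ORDERED task attribute    */
};

struct io_plan {
    int send_flush;             /* issue a real cache flush / set FUA on the command */
    int tag_ordered;            /* use the device's native ordered-command attribute */
    int drain_queue;            /* wait for all previously queued requests first     */
};

static struct io_plan plan_request(const struct dev_caps *caps, unsigned flags)
{
    struct io_plan p = { 0, 0, 0 };

    if (flags & (REQ_F_FUA | REQ_F_FLUSH | REQ_F_FLUSH_IMM))
        p.send_flush = caps->has_writeback_cache;  /* WT cache: nothing to flush (NOP) */

    if (flags & REQ_F_ORDERED) {
        if (caps->has_ordered_commands)
            p.tag_ordered = 1;    /* let the device itself preserve the order         */
        else
            p.drain_queue = 1;    /* otherwise enforce it by draining the queue first */
    }
    return p;
}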



barriers keep getting mentioned because they are an easy concept to understand.


Well, the concept of a flat Earth with the Sun rotating around it is also easy to 
understand. So why isn't it used?


Vlad


Re: [sqlite] light weight write barriers

2012-11-17 Thread Ric Wheeler

On 11/16/2012 10:54 AM, Howard Chu wrote:

Ric Wheeler wrote:

On 11/16/2012 10:06 AM, Howard Chu wrote:

David Lang wrote:

barriers keep getting mentioned because they are an easy concept to understand.
"do this set of stuff before doing any of this other set of stuff, but I don't
care when any of this gets done" and they fit well with the requirements of the
users.

Users readily accept that if the system crashes, they will lose the most recent
stuff that they did,


*some* users may accept that. *None* should.


but they get annoyed when things get corrupted to the point
that they lose the entire file.

this includes things like modifying one option and a crash resulting in the
config file being blank. Yes, you can do the 'write to temp file, sync file,
sync directory, rename file" dance, but the fact that to do so the user must sit
and wait for the syncs to take place can be a problem. It would be far better to
be able to say "write to temp file, and after it's on disk, rename the file" and
not have the user wait. The user doesn't really care if the changes hit disk
immediately, or several seconds (or even 10s of seconds) later, as long as there
is not any possibility of the rename hitting disk before the file contents.

The fact that this could be implemented in multiple ways in the existing
hardware does not mean that there need to be multiple ways exposed to userspace,
it just means that the cost of doing the operation will vary depending on the
hardware that you have. This also means that if new hardware introduces a new
way of implementing this, that improvement can be passed on to the users without
needing application changes.


There are a couple industry failures here:

1) the drive manufacturers sell drives that lie, and consumers accept it
because they don't know better. We programmers, who know better, have failed
to raise a stink and demand that this be fixed.
   A) Drives should not lose data on power failure. If a drive accepts a write
request and says "OK, done" then that data should get written to stable
storage, period. Whether it requires capacitors or some other onboard power
supply, or whatever, they should just do it. Keep in mind that today, most of
the difference between enterprise drives and consumer desktop drives is just a
firmware change, that hardware is already identical. Nobody should accept a
product that doesn't offer this guarantee. It's inexcusable.
   B) it should go without saying - drives should reliably report back to the
host, when something goes wrong. E.g., if a write request has been accepted,
cached, and reported complete, but then during the actual write an ECC failure
is detected in the cacheline, the drive needs to tell the host "oh by the way,
block XXX didn't actually make it to disk like I told you it did 10ms ago."

If the entire software industry were to simply state "your shit stinks and
we're not going to take it any more" the hard drive industry would have no
choice but to fix it. And in most cases it would be a zero-cost fix for them.

Once you have drives that are actually trustworthy, actually reliable (which
doesn't mean they never fail, it only means they tell the truth about
successes or failures), most of these other issues disappear. Most of the need
for barriers disappear.



I think that you are arguing a fairly silly point.


Seems to me that you're arguing that we should accept inferior technology. 
Who's really being silly?


No, just suggesting that you either pay for the expensive stuff or learn how to 
use cost effective, high capacity storage like the rest of the world.


I don't disagree that having non-volatile write caches would be nice, but 
everyone has learned how to deal with volatile write caches at the low end of the 
market.





If you want that behaviour, you have had it for more than a decade - simply
disable the write cache on your drive and you are done.


You seem to believe it's nonsensical for someone to want both fast and 
reliable writes, or that it's unreasonable for a storage device to offer the 
same, cheaply. And yet it is clearly trivial to provide all of the above.


I look forward to seeing your products in the market.

Until you have more than "I want" and "I think" on your storage system design 
resume, I suggest you spend the money to get the parts with non-volatile write 
caches or fix your code.


Ric



If you - as a user - want to run faster and use applications that are coded to
handle data integrity properly (fsync, fdatasync, etc), leave the write cache
enabled and use file system barriers.


Applications aren't supposed to need to worry about such details, that's why 
we have operating systems.


Drives should tell the truth. In event of an error detected after the fact, 
the drive should report the error back to the host. There's nothing 
nonsensical there.


When a drive's cache is enabled, the host should maintain a queue of written 
pages, of a length equal to the size of the 

Re: [sqlite] light weight write barriers

2012-11-17 Thread Ric Wheeler

On 11/16/2012 10:06 AM, Howard Chu wrote:

David Lang wrote:

barriers keep getting mentioned because they are an easy concept to understand.
"do this set of stuff before doing any of this other set of stuff, but I don't
care when any of this gets done" and they fit well with the requirements of the
users.

Users readily accept that if the system crashes, they will lose the most recent
stuff that they did,


*some* users may accept that. *None* should.


but they get annoyed when things get corrupted to the point
that they lose the entire file.

this includes things like modifying one option and a crash resulting in the
config file being blank. Yes, you can do the 'write to temp file, sync file,
sync directory, rename file" dance, but the fact that to do so the user must sit
and wait for the syncs to take place can be a problem. It would be far better to
be able to say "write to temp file, and after it's on disk, rename the file" and
not have the user wait. The user doesn't really care if the changes hit disk
immediately, or several seconds (or even 10s of seconds) later, as long as there
is not any possibility of the rename hitting disk before the file contents.

The fact that this could be implemented in multiple ways in the existing
hardware does not mean that there need to be multiple ways exposed to userspace,
it just means that the cost of doing the operation will vary depending on the
hardware that you have. This also means that if new hardware introduces a new
way of implementing this, that improvement can be passed on to the users without
needing application changes.


There are a couple industry failures here:

1) the drive manufacturers sell drives that lie, and consumers accept it 
because they don't know better. We programmers, who know better, have failed 
to raise a stink and demand that this be fixed.
  A) Drives should not lose data on power failure. If a drive accepts a write 
request and says "OK, done" then that data should get written to stable 
storage, period. Whether it requires capacitors or some other onboard power 
supply, or whatever, they should just do it. Keep in mind that today, most of 
the difference between enterprise drives and consumer desktop drives is just a 
firmware change, that hardware is already identical. Nobody should accept a 
product that doesn't offer this guarantee. It's inexcusable.
  B) it should go without saying - drives should reliably report back to the 
host, when something goes wrong. E.g., if a write request has been accepted, 
cached, and reported complete, but then during the actual write an ECC failure 
is detected in the cacheline, the drive needs to tell the host "oh by the way, 
block XXX didn't actually make it to disk like I told you it did 10ms ago."


If the entire software industry were to simply state "your shit stinks and 
we're not going to take it any more" the hard drive industry would have no 
choice but to fix it. And in most cases it would be a zero-cost fix for them.


Once you have drives that are actually trustworthy, actually reliable (which 
doesn't mean they never fail, it only means they tell the truth about 
successes or failures), most of these other issues disappear. Most of the need 
for barriers disappear.




I think that you are arguing a fairly silly point.

If you want that behaviour, you have had it for more than a decade - simply 
disable the write cache on your drive and you are done.


If you - as a user - want to run faster and use applications that are coded to 
handle data integrity properly (fsync, fdatasync, etc), leave the write cache 
enabled and use file system barriers.


Everyone has to trade off cost versus something else and this is a very, very 
long standing trade off that drive manufacturers have made.


The more money you pay for your storage, the less likely this is to be an issue 
(high end SSD's, enterprise class arrays, etc don't have volatile write caches 
and most SAS drives perform reasonably well with the write cache disabled).


Regards,

Ric




Re: [sqlite] light weight write barriers

2012-11-16 Thread David Lang

On Fri, 16 Nov 2012, Howard Chu wrote:


David Lang wrote:
barriers keep getting mentioned because they are an easy concept to understand.
"do this set of stuff before doing any of this other set of stuff, but I don't
care when any of this gets done" and they fit well with the requirements of the
users.

Users readily accept that if the system crashes, they will lose the most recent
stuff that they did,


*some* users may accept that. *None* should.


when users are given a choice between having all their work be very slow, or having it 
be fast but, in the unlikely event of a crash, losing their most recent 
changes, they are willing to lose their most recent changes.


If you think about it, this is not much different from the fact that you lose 
all changes since the last time you saved the thing you are working on. Many 
programs save state periodically so that if the application crashes the user 
hasn't lost everything, but any application that tried to save after every 
single change would be so slow that nobody would use it.


There is always going to be a window after a user hits 'save' where the data can 
be lost, because it's not yet on disk.



There are a couple industry failures here:

1) the drive manufacturers sell drives that lie, and consumers accept it 
because they don't know better. We programmers, who know better, have failed 
to raise a stink and demand that this be fixed.
 A) Drives should not lose data on power failure. If a drive accepts a write 
request and says "OK, done" then that data should get written to stable 
storage, period. Whether it requires capacitors or some other onboard power 
supply, or whatever, they should just do it. Keep in mind that today, most of 
the difference between enterprise drives and consumer desktop drives is just 
a firmware change, that hardware is already identical. Nobody should accept a 
product that doesn't offer this guarantee. It's inexcusable.


This is an option available to you. However, if you have enabled write caching and 
reordering, you have explicitly told the system to be faster at the expense of 
losing data under some conditions. The fact that you then lose data under 
those conditions should not surprise you.


The idea that you must have enough power to write all the pending data to disk 
is problematic as that then severely limits the amount of cache that you have.


 B) it should go without saying - drives should reliably report back to the 
host, when something goes wrong. E.g., if a write request has been accepted, 
cached, and reported complete, but then during the actual write an ECC 
failure is detected in the cacheline, the drive needs to tell the host "oh by 
the way, block XXX didn't actually make it to disk like I told you it did 
10ms ago."


The issue isn't a drive having a write error, it's the system shutting down 
(or crashing) before the data is written; no OS-level tricks will help you here.



The real problem here isn't the drive claiming the data has been written when it 
hasn't; the real problem is that the application has said 'write this data' to 
the OS, and the OS has not done so yet.


The OS delays the writes for many legitimate reasons (the disk may be busy, it 
can get things done more efficiently by combining and reordering the writes, etc.)


Unless the system crashes, this is not a problem: the data will eventually be 
written out, and on system shutdown everything is good.


But if the system crashes, some of this postponed work doesn't get done, and 
that can be a problem.


Applications can do fsync if they want to be sure that their data is safe on 
disk NOW, but they currently have no way of saying "I want to make sure that A 
happens before B, but I don't care if A happens now or 10 seconds from now".


That is the gap that it would be useful to provide a mechanism to deal with, and 
it doesn't matter whether your disk system lies or not; there 
still isn't a way to deal with this today.
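
Today the only portable way to get that "A before B" guarantee is to block on
fsync() in between, which is exactly the wait described above; a short sketch:

#include <unistd.h>

int write_a_before_b(int fd, const char *a, size_t alen,
                             const char *b, size_t blen)
{
    if (write(fd, a, alen) != (ssize_t)alen)
        return -1;
    if (fsync(fd) != 0)           /* forces durability AND ordering, and blocks */
        return -1;
    if (write(fd, b, blen) != (ssize_t)blen)
        return -1;
    return 0;                     /* B can now never reach disk before A        */
}

The mechanism being asked for would keep that ordering guarantee while dropping the
blocking wait in the middle.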


David Lang


Re: [sqlite] light weight write barriers

2012-11-16 Thread Howard Chu

Ric Wheeler wrote:

On 11/16/2012 10:06 AM, Howard Chu wrote:

David Lang wrote:

barriers keep getting mentioned because they are an easy concept to understand.
"do this set of stuff before doing any of this other set of stuff, but I don't
care when any of this gets done" and they fit well with the requirements of the
users.

Users readily accept that if the system crashes, they will lose the most recent
stuff that they did,


*some* users may accept that. *None* should.


but they get annoyed when things get corrupted to the point
that they lose the entire file.

this includes things like modifying one option and a crash resulting in the
config file being blank. Yes, you can do the 'write to temp file, sync file,
sync directory, rename file" dance, but the fact that to do so the user must sit
and wait for the syncs to take place can be a problem. It would be far better to
be able to say "write to temp file, and after it's on disk, rename the file" and
not have the user wait. The user doesn't really care if the changes hit disk
immediately, or several seconds (or even 10s of seconds) later, as long as there
is not any possibility of the rename hitting disk before the file contents.

The fact that this could be implemented in multiple ways in the existing
hardware does not mean that there need to be multiple ways exposed to userspace,
it just means that the cost of doing the operation will vary depending on the
hardware that you have. This also means that if new hardware introduces a new
way of implementing this, that improvement can be passed on to the users without
needing application changes.


There are a couple industry failures here:

1) the drive manufacturers sell drives that lie, and consumers accept it
because they don't know better. We programmers, who know better, have failed
to raise a stink and demand that this be fixed.
   A) Drives should not lose data on power failure. If a drive accepts a write
request and says "OK, done" then that data should get written to stable
storage, period. Whether it requires capacitors or some other onboard power
supply, or whatever, they should just do it. Keep in mind that today, most of
the difference between enterprise drives and consumer desktop drives is just a
firmware change, that hardware is already identical. Nobody should accept a
product that doesn't offer this guarantee. It's inexcusable.
   B) it should go without saying - drives should reliably report back to the
host, when something goes wrong. E.g., if a write request has been accepted,
cached, and reported complete, but then during the actual write an ECC failure
is detected in the cacheline, the drive needs to tell the host "oh by the way,
block XXX didn't actually make it to disk like I told you it did 10ms ago."

If the entire software industry were to simply state "your shit stinks and
we're not going to take it any more" the hard drive industry would have no
choice but to fix it. And in most cases it would be a zero-cost fix for them.

Once you have drives that are actually trustworthy, actually reliable (which
doesn't mean they never fail, it only means they tell the truth about
successes or failures), most of these other issues disappear. Most of the need
for barriers disappear.



I think that you are arguing a fairly silly point.


Seems to me that you're arguing that we should accept inferior technology. 
Who's really being silly?



If you want that behaviour, you have had it for more than a decade - simply
disable the write cache on your drive and you are done.


You seem to believe it's nonsensical for someone to want both fast and 
reliable writes, or that it's unreasonable for a storage device to offer the 
same, cheaply. And yet it is clearly trivial to provide all of the above.



If you - as a user - want to run faster and use applications that are coded to
handle data integrity properly (fsync, fdatasync, etc), leave the write cache
enabled and use file system barriers.


Applications aren't supposed to need to worry about such details, that's why 
we have operating systems.


Drives should tell the truth. In event of an error detected after the fact, 
the drive should report the error back to the host. There's nothing 
nonsensical there.


When a drive's cache is enabled, the host should maintain a queue of written 
pages, of a length equal to the size of the drive's cache. If a drive says 
"hey, block XXX failed" the OS can reissue the write from its own queue. No 
muss, no fuss, no performance bottlenecks. This is what Real Computers did 
before the age of VAX Unix.
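
A hedged sketch of that bookkeeping; the structures and the resubmit callback are
invented for illustration, with the ring sized (as suggested) to the drive's cache:

#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 4096

struct pending_write {
    unsigned long long lba;          /* block address as sent to the drive         */
    unsigned char data[BLOCK_SIZE];  /* copy kept until the drive truly commits it */
};

struct write_log {                   /* ring sized to the drive's cache            */
    struct pending_write *slot;
    size_t nslots, head;
};

/* Remember a write that was just handed to the drive. */
void log_write(struct write_log *log, unsigned long long lba, const void *buf)
{
    struct pending_write *w = &log->slot[log->head];
    log->head = (log->head + 1) % log->nslots;
    w->lba = lba;
    memcpy(w->data, buf, BLOCK_SIZE);
}

/* The drive reported "block lba didn't make it": reissue it from our copy. */
int reissue(struct write_log *log, unsigned long long lba,
            int (*resubmit_block)(unsigned long long lba, const void *data))
{
    for (size_t i = 0; i < log->nslots; i++)
        if (log->slot[i].lba == lba)
            return resubmit_block(lba, log->slot[i].data);
    return -1;                       /* fell out of the window: the data is gone   */
}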



Everyone has to trade off cost versus something else and this is a very, very
long standing trade off that drive manufacturers have made.


With the cost of storage falling as rapidly as it has in recent years, this is 
a stupid tradeoff.


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/

Re: [sqlite] light weight write barriers

2012-11-16 Thread Howard Chu

David Lang wrote:

barriers keep getting mentioned because they are an easy concept to understand.
"do this set of stuff before doing any of this other set of stuff, but I don't
care when any of this gets done" and they fit well with the requirements of the
users.

Users readily accept that if the system crashes, they will lose the most recent
stuff that they did,


*some* users may accept that. *None* should.


but they get annoyed when things get corrupted to the point
that they lose the entire file.

this includes things like modifying one option and a crash resulting in the
config file being blank. Yes, you can do the 'write to temp file, sync file,
sync directory, rename file" dance, but the fact that to do so the user must sit
and wait for the syncs to take place can be a problem. It would be far better to
be able to say "write to temp file, and after it's on disk, rename the file" and
not have the user wait. The user doesn't really care if the changes hit disk
immediately, or several seconds (or even 10s of seconds) later, as long as there
is not any possibility of the rename hitting disk before the file contents.

The fact that this could be implemented in multiple ways in the existing
hardware does not mean that there need to be multiple ways exposed to userspace,
it just means that the cost of doing the operation will vary depending on the
hardware that you have. This also means that if new hardware introduces a new
way of implementing this, that improvement can be passed on to the users without
needing application changes.


There are a couple industry failures here:

1) the drive manufacturers sell drives that lie, and consumers accept it 
because they don't know better. We programmers, who know better, have failed 
to raise a stink and demand that this be fixed.
  A) Drives should not lose data on power failure. If a drive accepts a write 
request and says "OK, done" then that data should get written to stable 
storage, period. Whether it requires capacitors or some other onboard power 
supply, or whatever, they should just do it. Keep in mind that today, most of 
the difference between enterprise drives and consumer desktop drives is just a 
firmware change, that hardware is already identical. Nobody should accept a 
product that doesn't offer this guarantee. It's inexcusable.
  B) it should go without saying - drives should reliably report back to the 
host, when something goes wrong. E.g., if a write request has been accepted, 
cached, and reported complete, but then during the actual write an ECC failure 
is detected in the cacheline, the drive needs to tell the host "oh by the way, 
block XXX didn't actually make it to disk like I told you it did 10ms ago."


If the entire software industry were to simply state "your shit stinks and 
we're not going to take it any more" the hard drive industry would have no 
choice but to fix it. And in most cases it would be a zero-cost fix for them.


Once you have drives that are actually trustworthy, actually reliable (which 
doesn't mean they never fail, it only means they tell the truth about 
successes or failures), most of these other issues disappear. Most of the need 
for barriers disappear.


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/


Re: [sqlite] light weight write barriers

2012-11-16 Thread Chris Friesen

On 11/15/2012 11:06 AM, Ryan Johnson wrote:


The easiest way to implement this fsync would involve three things:
1. Schedule writes for all dirty pages in the fs cache that belong to
the affected file, wait for the device to report success, issue a cache
flush to the device (or request ordering commands, if available) to make
it tell the truth, and wait for the device to report success. AFAIK this
already happens, but without taking advantage of any request ordering
commands.
2. The requesting thread returns as soon as the kernel has identified
all data that will be written back. This is new, but pretty similar to
what AIO already does.
3. No write is allowed to enqueue any requests at the device that
involve the same file, until all outstanding fsync complete [3]. This is
new.


This sounds interesting as a way to expose some useful semantics to 
userspace.


I assume we'd need to come up with a new syscall or something since it 
doesn't match the behaviour of posix fsync().


Chris


Re: [sqlite] light weight write barriers

2012-11-15 Thread Ryan Johnson

On 14/11/2012 8:17 PM, Vladislav Bolkhovitin wrote:

Nico Williams, on 11/13/2012 02:13 PM wrote:

declaring groups of internally-unordered writes where the groups are
ordered with respect to each other... is practically the same as
barriers.


Which barriers? Barriers meaning cache flush or barriers meaning 
commands order, or barriers meaning both?


There's no such thing as "barrier". It is fully artificial 
abstraction. After all, at the bottom of your stack, you will have to 
translate it either to cache flush, or commands order enforcement, or 
both.

Isn't that why we *have* "the stack" in the first place? So apps 
*don't* have to worry about how the OS implements an artificial (= 
high-level and portable) abstraction on a given device?




Are you going to invent 3 types of barriers?

One will do, it just needs to be a good one.

Maybe I'm missing something here, so I'm going to back up a bit and 
recap what I understand.


The filesystem abstracts the concept of encoding patterns of bits in 
some physical media (data), and making it easy to find and retrieve 
those bits later (metadata, incl. file name). When users read(), they 
expect to see whatever they most recently sent to write(). They also 
expect that what they write will still be there later,  in spite of any 
failure that leaves the disk itself intact.


Operating systems cheat by not actually writing to disk -- for 
performance reasons -- and users are (mostly, usually) OK with that, 
because the performance gains are so attractive and things usually work 
out anyway. Disks cheat too, in the same way and for the same reason.


The cheating works great most of the time, but breaks down -- badly -- 
if we actually care about what is on disk after a crash (or if we use a 
network filesystem). Enough people do care that fsync() was added to the 
toolbox. It is defined to transfer "all modified in-core data of the 
file referred to by the file descriptor fd to the disk device" and 
"blocks until the device reports that the transfer has completed" 
(quoting from the fsync(2) man page). Translation: "Stop cheating. Make 
sure the stuff I already wrote actually got written. And tell the disk 
to stop cheating, too."


Problem is, this definition is asymmetric: it says what happens to 
writes issued before the fsync, but nothing about those issued after the 
fsync starts and before it returns [1]. The reader has to assume  
fsync() makes no promises whatsoever about these later writes: making 
fsync capture them exposes callers of fsync() to DoS attacks, and preventing them 
from reaching disk until all outstanding fsync calls complete would add 
complexity the spec doesn't currently demand, leading to understandable 
reluctance by kernel devs to code it up. Unfortunately, we're left with 
the filesystem equivalent of what we in the database world call 
"eventual consistency" -- easy to implement, nice and fast, but very 
difficult to write reliable code against unless you're willing to pay 
the cost of being fully synchronous, all the time. Having tried that for 
a few years, many people are "returning" to better-specified concurrency 
models, trading some amount of performance for comfort that the app will 
at least work predictably when things go wrong in strange and 
unanticipated ways.


The request, then, is to tighten up fsync semantics in two conceptually 
straightforward ways [2]: First, guarantee that later writes to an fd do 
not hit disk until earlier calls to fsync() complete. Second, make the 
call asynchronous. That's all.


Note that both changes are necessary. The improved ordering semantic is 
useless by itself, because it's still not safe to request a blocking 
fsync from one thread and then let other threads continue issuing 
writes: there's a race between broadcasting that fsync has begun and 
issuing the actual syscall that begins it. An asynchronous fsync is also 
useless by itself, because it only benefits uncoordinated writes (which 
evidently don't care what data actually reaches disk anyway).


The easiest way to implement this fsync would involve three things:
1. Schedule writes for all dirty pages in the fs cache that belong to 
the affected file, wait for the device to report success, issue a cache 
flush to the device (or request ordering commands, if available) to make 
it tell the truth, and wait for the device to report success. AFAIK this 
already happens, but without taking advantage of any request ordering 
commands.
2. The requesting thread returns as soon as the kernel has identified 
all data that will be written back. This is new, but pretty similar to 
what AIO already does.
3. No write is allowed to enqueue any requests at the device that 
involve the same file, until all outstanding fsync complete [3]. This is 
new.
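
For comparison, POSIX already offers an asynchronous fsync (aio_fsync()); what it
does not give is step 3, so later writes may still overtake the outstanding sync.
A hedged sketch of how a caller uses it today, and why the wait is still needed:

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int checkpoint_async(int fd, struct aiocb *cb)
{
    /* Ask the kernel to start flushing everything written so far and return
     * to the caller immediately (step 2 above). */
    *cb = (struct aiocb){ .aio_fildes = fd };
    if (aio_fsync(O_SYNC, cb) != 0)
        return -1;

    /* Step 3 is the new part: under the proposed semantics, writes issued
     * from here on could not be reordered ahead of the checkpoint.  Under
     * today's POSIX they can, so a careful caller still has to wait before
     * touching the file again: */
    while (aio_error(cb) == EINPROGRESS)
        usleep(1000);
    return aio_return(cb);
}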


The performance hit for #1 can be reduced significantly if the storage 
hardware at hand happens to support some form of request ordering. The 
amount of reduction could vary greatly depending on 

Re: [sqlite] light weight write barriers

2012-11-15 Thread 杨苏立 Yang Su Li
On Thu, Nov 15, 2012 at 10:29 AM, Simon Slavin  wrote:

>
> On 15 Nov 2012, at 4:14pm, 杨苏立 Yang Su Li  wrote:
>
> > 1. fsync actually does two things at the same time: ordering writes (in a
> > barrier-like manner), and forcing cached writes to disk. This makes it
> very
> > difficult to implement fsync efficiently. However, logically they are two
> > distinctive functionalities, and user might want one but not the other.
> > Particularly, users might want a barrier, but doesn't care about
> durability
> > (much). I have no idea why ordering and durability, which seem quite
> > different, ended up being bundled together in a single fsync call.
> >
> > 2. fsync semantic in POSIX is a bit vague (at least to me) in a
> concurrent
> > setting. What is the expected behavior when more than one thread write to
> > the same file descriptor, or different file descriptor associated with
> the
> > same file?
>
> And, as has been posted many times here and elsewhere, on many systems
> fsync does nothing at all.  It is literally implemented as a 'noop'.  So
> you cannot use it as a basis for barriers.
>

I think it is because it's so difficult to implement fsync efficiently,
some systems just stop trying.

>
> > In modern file system we do all kind of stuff to ensure ordering, and I
> > think I can see how leveraging ordered commands (when it is available
> from
> > hardware) could potentially boost performance.
>
> Similarly, on many hard disk subsystems (the circuit board and firmware
> provided with the hard disk), the 'wait until cache has been written'
> operation does nothing.  So even if you /could/ depend on fsync you still
> couldn't depend on the hardware.  Read the manufacturer's documentation:
> they don't hide it, they boast about it because it makes the hard drive far
> faster.  If you really want this feature to work you have to buy expensive
> server-quality hard drives and set the jumpers in the right positions.
>

When you cannot trust the hardware, there are still some things you can do to
ensure durability and consistency. There is some work from UW-Madison that
talks about how you can do that without trusting the hardware, say, by using
coerced cache eviction. Of course, this is expensive, which is why we want to
decouple ordering and durability even more.

Suli

>
> Simon.


Re: [sqlite] light weight write barriers

2012-11-15 Thread Simon Slavin

On 15 Nov 2012, at 4:14pm, 杨苏立 Yang Su Li  wrote:

> 1. fsync actually does two things at the same time: ordering writes (in a
> barrier-like manner), and forcing cached writes to disk. This makes it very
> difficult to implement fsync efficiently. However, logically they are two
> distinctive functionalities, and user might want one but not the other.
> Particularly, users might want a barrier, but doesn't care about durability
> (much). I have no idea why ordering and durability, which seem quite
> different, ended up being bundled together in a single fsync call.
> 
> 2. fsync semantic in POSIX is a bit vague (at least to me) in a concurrent
> setting. What is the expected behavior when more than one thread write to
> the same file descriptor, or different file descriptor associated with the
> same file?

And, as has been posted many times here and elsewhere, on many systems fsync 
does nothing at all.  It is literally implemented as a 'noop'.  So you cannot 
use it as a basis for barriers.

> In modern file system we do all kind of stuff to ensure ordering, and I
> think I can see how leveraging ordered commands (when it is available from
> hardware) could potentially boost performance.

Similarly, on many hard disk subsystems (the circuit board and firmware 
provided with the hard disk), the 'wait until cache has been written' operation 
does nothing.  So even if you /could/ depend on fsync you still couldn't depend 
on the hardware.  Read the manufacturer's documentation: they don't hide it, 
they boast about it because it makes the hard drive far faster.  If you really 
want this feature to work you have to buy expensive server-quality hard drives 
and set the jumpers in the right positions.

Simon.


Re: [sqlite] light weight write barriers

2012-11-15 Thread 杨苏立 Yang Su Li
On Thu, Nov 15, 2012 at 6:07 AM, David Lang  wrote:

> On Wed, 14 Nov 2012, Vladislav Bolkhovitin wrote:
>
>  Nico Williams, on 11/13/2012 02:13 PM wrote:
>>
>>> declaring groups of internally-unordered writes where the groups are
>>> ordered with respect to each other... is practically the same as
>>> barriers.
>>>
>>
>> Which barriers? Barriers meaning cache flush or barriers meaning commands
>> order, or barriers meaning both?
>>
>> There's no such thing as "barrier". It is fully artificial abstraction.
>> After all, at the bottom of your stack, you will have to translate it
>> either to cache flush, or commands order enforcement, or both.
>>
>
> When people talk about barriers, they are talking about order enforcement.
>
>
>  Your mistake is that you are considering barriers as something real,
>> which can do something real for you, while it is just a artificial
>> abstraction apparently invented by people with limited knowledge how
>> storage works, hence having very foggy vision how barriers supposed to be
>> processed by it. A simple wrong answer.
>>
>> Generally, you can invent any abstraction convenient for you, but farther
>> your abstractions from reality of your hardware => less you will get from
>> it with bigger effort.
>>
>> There are no barriers in Linux and not going to be. Accept it. And start
>> instead thinking about offload capabilities your storage can offer to you.
>>
>
> the hardware capabilities are not directly accessible from userspace (and
> they probably shouldn't be)
>
> barriers keep getting mentioned because they are an easy concept to
> understand. "do this set of stuff before doing any of this other set of
> stuff, but I don't care when any of this gets done" and they fit well with
> the requirements of the users.
>

Well, I think there are two questions to be answered here: what primitive
should be offered to the user by the file system (currently we have fsync);
and what primitive should be offered by the lower level and used by the
file system (currently we have barrier, or flushing and FUA).

I do agree that we should keep what is accessible from user-space simple
and stupid. However if you look into fsync semantics a bit closer, I think
there are two things to be noted:

1. fsync actually does two things at the same time: ordering writes (in a
barrier-like manner), and forcing cached writes to disk. This makes it very
difficult to implement fsync efficiently. However, logically they are two
distinct functionalities, and a user might want one but not the other.
In particular, users might want a barrier but not care about durability
(much). I have no idea why ordering and durability, which seem quite
different, ended up being bundled together in a single fsync call.

2. fsync semantic in POSIX is a bit vague (at least to me) in a concurrent
setting. What is the expected behavior when more than one thread write to
the same file descriptor, or different file descriptor associated with the
same file?

So I do think in the user space, we need some kind of barrier (or other)
primitive which is not tied to durability guarantees; and hopefully this
primitive could be implemented more efficiently than fsync. And of course,
this primitive should be simple and intuitive, abstracting the complexity
out.


On the other hand, we have the question of what the file system should use.
Traditionally the block layer has provided a barrier primitive, and now I think it
is moving to flushing and FUA, or even ordered commands
(http://lwn.net/Articles/400541/).

As to whether the file system should be exposed to the hardware
capability, in this case ordered commands: I personally think it should.
In a modern file system we do all kinds of stuff to ensure ordering, and I
think I can see how leveraging ordered commands (when they are available from
the hardware) could potentially boost performance. And all the complexity of,
say, topological ordering is dealt with within the file system, and is not visible
to the user.

Of course, there are challenges when you want to do ordered writes in the
file system. As Ts'o mentioned, when you have entangled metadata updates,
i.e., you update file A and file B, and file A and B might share
metadata, it could be difficult to get the ordering right without
sacrificing performance. But I personally think it is worth exploring.

Suli


>
> Users readily accept that if the system crashes, they will lose the most
> recent stuff that they did, but they get annoyed when things get corrupted
> to the point that they lose the entire file.
>
> this includes things like modifying one option and a crash resulting in
> the config file being blank. Yes, you can do the 'write to temp file, sync
> file, sync directory, rename file" dance, but the fact that to do so the
> user must sit and wait for the syncs to take place can be a problem. It
> would be far better to be able to say "write to temp file, and after it's
> on disk, rename the file" and not have the user 

Re: [sqlite] light weight write barriers

2012-11-15 Thread Vladislav Bolkhovitin


Nico Williams, on 11/13/2012 02:13 PM wrote:

declaring groups of internally-unordered writes where the groups are
ordered with respect to each other... is practically the same as
barriers.


Which barriers? Barriers meaning cache flush or barriers meaning commands order, 
or barriers meaning both?


There's no such thing as "barrier". It is fully artificial abstraction. After all, 
at the bottom of your stack, you will have to translate it either to cache flush, 
or commands order enforcement, or both.


Are you going to invent 3 types of barriers?


There's a lot to be said for simplicity... as long as the system is
not so simple as to not work at all.

My p.o.v. is that a filesystem write barrier is effectively the same
as fsync() with the ability to return sooner (before writes hit stable
storage) when the filesystem and hardware support on-disk layouts and
primitives which can be used to order writes preceding and succeeding
the barrier.


Your mistake is that you are considering barriers as something real, which can do 
something real for you, while it is just an artificial abstraction, apparently 
invented by people with limited knowledge of how storage works, and hence with a very 
foggy vision of how barriers are supposed to be processed by it. A simple wrong answer.


Generally, you can invent any abstraction convenient for you, but the farther your 
abstractions are from the reality of your hardware, the less you will get from them, 
and with greater effort.


There are no barriers in Linux, and there are not going to be. Accept it, and start 
thinking instead about the offload capabilities your storage can offer you.


Vlad



Re: [sqlite] light weight write barriers

2012-11-15 Thread Vladislav Bolkhovitin


Alan Cox, on 11/13/2012 12:40 PM wrote:

Barriers are pretty much universal as you need them for power off !


I'm afraid, no storage (drives, if you like this term more) at the moment 
supports
barriers and, as far as I know the storage history, has never supported.


The ATA cache flush is a write barrier, and given you have no NV cache
visible to the controller it's the same thing.


A cache flush is a cache flush. You can call it a barrier if you want to continue 
confusing yourself and others.



Instead, what storage does support in this area are:


Yes - the devil is in the detail once you go beyond simple capabilities.


None of those details brings anything unsolvable. For instance, I already 
described in this thread a simple way the requested order of commands can be 
carried through the stack, and implemented that algorithm in SCST.


Vlad


Re: [sqlite] light weight write barriers

2012-11-15 Thread David Lang

On Wed, 14 Nov 2012, Vladislav Bolkhovitin wrote:


Nico Williams, on 11/13/2012 02:13 PM wrote:

declaring groups of internally-unordered writes where the groups are
ordered with respect to each other... is practically the same as
barriers.


Which barriers? Barriers meaning cache flush, barriers meaning command 
ordering, or barriers meaning both?


There's no such thing as a "barrier". It is a purely artificial abstraction. After 
all, at the bottom of your stack, you will have to translate it either to 
a cache flush, or to command order enforcement, or to both.


When people talk about barriers, they are talking about order enforcement.

Your mistake is that you are considering barriers as something real, which 
can do something real for you, while it is just an artificial abstraction, 
apparently invented by people with limited knowledge of how storage works, hence 
with a very foggy vision of how barriers are supposed to be processed by it. A 
simple wrong answer.


Generally, you can invent any abstraction convenient for you, but the farther 
your abstractions are from the reality of your hardware, the less you will get from it, 
and with greater effort.


There are no barriers in Linux, and there are not going to be any. Accept it. And 
instead start thinking about the offload capabilities your storage can offer you.


the hardware capabilities are not directly accessible from userspace (and they 
probably shouldn't be)


barriers keep getting mentioned because they are an easy concept to understand: 
"do this set of stuff before doing any of this other set of stuff, but I don't 
care when any of this gets done". They fit well with the requirements of the 
users.


Users readily accept that if the system crashes, they will lose the most recent 
stuff that they did, but they get annoyed when things get corrupted to the point 
that they lose the entire file.


this includes things like modifying one option and a crash resulting in the 
config file being blank. Yes, you can do the "write to temp file, sync file, 
sync directory, rename file" dance, but the fact that to do so the user must sit 
and wait for the syncs to take place can be a problem. It would be far better to 
be able to say "write to temp file, and after it's on disk, rename the file" and 
not have the user wait. The user doesn't really care if the changes hit disk 
immediately, or several seconds (or even 10s of seconds) later, as long as there 
is no possibility of the rename hitting disk before the file contents.
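
In C, that dance looks roughly like the sketch below (file names made up, error 
handling and retries trimmed); it is this synchronous version, with its two 
fsync() calls on the write path, that makes the user wait:

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Replace "config" with new contents so a crash leaves either the old
   file or the new one, never a half-written mix. */
int replace_config(const char *dir, const char *buf, size_t len)
{
    char tmp[4096], final[4096];
    snprintf(tmp, sizeof(tmp), "%s/config.tmp", dir);
    snprintf(final, sizeof(final), "%s/config", dir);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    /* push the file data to disk before the rename can become visible */
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    close(fd);

    if (rename(tmp, final) < 0)          /* atomically swap in the new name */
        return -1;

    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);                 /* make the rename itself durable */
    close(dfd);
    return rc;
}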


The fact that this could be implemented in multiple ways in the existing 
hardware does not mean that there need to be multiple ways exposed to userspace, 
it just means that the cost of doing the operation will vary depending on the 
hardware that you have. This also means that if new hardware introduces a new 
way of implementing this, that improvement can be passed on to the users without 
needing application changes.


David Lang
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-13 Thread Nico Williams
On Tue, Nov 13, 2012 at 11:40 AM, Alan Cox  wrote:
>> > Barriers are pretty much universal as you need them for power off !
>>
>> I'm afraid, no storage (drives, if you like this term more) at the moment 
>> supports
>> barriers and, as far as I know the storage history, has never supported.
>
> The ATA cache flush is a write barrier, and given you have no NV cache
> visible to the controller it's the same thing.
>
>> Instead, what storage does support in this area are:
>
> Yes - the devil is in the detail once you go beyond simple capabilities.

Right: barriers are trivial to program with.  Ordered writes less so.
One could declare all writes to be ordered with respect to each other,
but this will almost certainly hurt performance (at least with disks,
though probably not SSDs) as opposed to barriers, which order one
group of internally-unordered writes relative to another.  And
declaring groups of internally-unordered writes where the groups are
ordered with respect to each other... is practically the same as
barriers.

There's a lot to be said for simplicity... as long as the system is
not so simple as to not work at all.

My p.o.v. is that a filesystem write barrier is effectively the same
as fsync() with the ability to return sooner (before writes hit stable
storage) when the filesystem and hardware support on-disk layouts and
primitives which can be used to order writes preceding and succeeding
the barrier.
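
To make that concrete, here is a sketch of how an application would use it; 
fbarrier() is purely hypothetical (no such call exists today), and with no 
kernel support the only portable stand-in is a full fsync():

#include <unistd.h>

/* Hypothetical: all writes to fd issued before this call must reach
   stable storage before any write issued after it, but the call may
   return before the earlier writes are actually durable. */
static int fbarrier(int fd)
{
    return fsync(fd);        /* fallback: stronger (and slower) than needed */
}

/* Write a journal record, then a commit record that refers to it. */
void commit(int fd, const char *rec, size_t rlen, const char *cmt, size_t clen)
{
    if (write(fd, rec, rlen) < 0)
        return;
    fbarrier(fd);            /* ordering only; durability may come later */
    if (write(fd, cmt, clen) < 0)
        return;
}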

Nico
--
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-13 Thread Alan Cox
> > Barriers are pretty much universal as you need them for power off !
> 
> I'm afraid, no storage (drives, if you like this term more) at the moment 
> supports 
> barriers and, as far as I know the storage history, has never supported.

The ATA cache flush is a write barrier, and given you have no NV cache
visible to the controller it's the same thing.

> Instead, what storage does support in this area are:

Yes - the devil is in the detail once you go beyond simple capabilities.

Alan
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-13 Thread Vladislav Bolkhovitin

杨苏立 Yang Su Li, on 11/10/2012 11:25 PM wrote:

 SATA's Native Command

Queuing (NCQ) is not equivalent; this allows the drive to reorder
requests (in particular read requests) so they can be serviced more
efficiently, but it does *not* allow the OS to specify a partial,
relative ordering of requests.



And so? If SATA can't do it, does that mean nobody else can do it either? I know
plenty of non-SATA devices which can meet the ordering requirements you need.



I would be very much interested in what kind of devices support this kind of
"topological order", and in what settings they are typically used.

Does modern flash/SSD (esp. which are used on smartphones) support this?

If you could point me to some information about this, that would be very
much appreciated.


I don't think the storage in smartphones can support such advanced functionality, 
because it tends to be the cheapest, hence the simplest.


But many modern enterprise SAS drives can do it, because for those customers 
performance is the key requirement. Unfortunately, I'm not sure I can name exact 
brands and models, because my knowledge comes from NDA'ed docs, so this info may 
also be under NDA.


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-13 Thread Vladislav Bolkhovitin

Richard Hipp, on 11/02/2012 08:24 AM wrote:

SQLite cares.  SQLite is an in-process, transactional, zero-configuration
database that is estimated to be used by over 1 million distinct
applications and to have over 2 billion deployments.  SQLite uses
ordinary disk files in ordinary directories, often selected by the
end-user.  There is no system administrator with SQLite, so there is no
opportunity to use a dedicated filesystem with special mount options.

SQLite uses fsync() as a write barrier to assure consistency following a
power loss.  In addition, we do everything we can to maximize the amount of
time after the fsync() before we actually do another write where order
matters, in the hopes that the writes will still be ordered on platforms
where fsync() is ignored for whatever reason.  Even so, we believe we could
get a significant performance boost and reliability improvement if we had a
reliable write barrier.


I would suggest you forget the word "barrier" for productivity's sake. You don't want 
barriers and the confusion they bring. What you want instead is access to storage-accelerated 
cache sync, command ordering, and atomic attributes/operations. See my other e-mail from 
today about those.


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-13 Thread Vladislav Bolkhovitin


Alan Cox, on 11/02/2012 08:33 AM wrote:

b) most drives will internally re-order requests anyway


They will but only as permitted by the commands queued, so you have some
control depending upon the interface capabilities.


c) cheap drives won't support barriers


Barriers are pretty much universal as you need them for power off !


I'm afraid no storage (drives, if you prefer that term) at the moment supports 
barriers and, as far as I know the history of storage, none ever has.


Instead, what storage does support in this area are:

1. Cache flushing facilities: FUA, SYNCHRONIZE CACHE, etc.

2. Command ordering facilities: command attributes (ORDERED, SIMPLE, etc.), ACA, 
etc.


3. Atomic commands, e.g. scattered writes, which allow writing data to several 
separate, non-adjacent blocks in an atomic manner, i.e. guarantee that either all 
blocks are written or none at all. This is relatively new functionality, natural 
for flash storage with its COW internals.


Obviously, using such atomic write commands, an application or a file system doesn't 
need any journaling anymore. FusionIO reported that after they modified MySQL to 
use them, they saw a 50% performance increase.



Note that those 3 facilities are ORTHOGONAL, i.e. they can be used independently, 
including on the same request. That is the root cause of why the barrier concept is so 
evil. If you specify a barrier, how can you say what actual action you really 
want from the storage: a cache flush? An ordered write? Or both?


This is why the relatively recent removal of barriers from the Linux kernel 
(http://lwn.net/Articles/400541/) was a big step forward. The next logical step 
should be to allow the ORDERED attribute on requests to be accelerated by the 
storage's ORDERED commands, if it supports them; if not, fall back to the existing 
queue draining.


Actually, I'm wondering why the barrier concept is so sticky in the Linux world. A 
simple Google search shows that only Linux uses this concept for storage. And 2 
years have passed since barriers were removed from the kernel, yet people still 
discuss them as if they were here.


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-13 Thread Vladislav Bolkhovitin


Howard Chu, on 11/01/2012 08:38 PM wrote:

Alan Cox wrote:

How about that recently preliminary infrastructure to send ORDERED commands
instead of queue draining was deleted from the kernel, because "there's no
difference where to drain the queue, on the kernel or the storage side"?


Send patches.


Isn't any type of kernel-side ordering an exercise in futility, since
a) the kernel has no knowledge of the disk's actual geometry
b) most drives will internally re-order requests anyway
c) cheap drives won't support barriers


This is why it is so important for performance to use all storage capabilities. 
In particular, ORDERED commands, instead of trying to pretend to be smarter than the 
storage by draining the queue.


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-10 Thread 杨苏立 Yang Su Li
On Fri, Oct 26, 2012 at 8:54 PM, Vladislav Bolkhovitin  wrote:

>
> Theodore Ts'o, on 10/25/2012 01:14 AM wrote:
>
>> On Tue, Oct 23, 2012 at 03:53:11PM -0400, Vladislav Bolkhovitin wrote:
>>
>>> Yes, SCSI has full support for ordered/simple commands designed
>>> exactly for that task: to have steady flow of commands even in case
>>> when some of them are ordered.
>>>
>>
>> SCSI does, yes --- *if* the device actually implements Tagged Command
>> Queuing (TCQ).  Not all devices do.
>>
>> More importantly, SATA drives do *not* have this capability, and when
>> you compare the price of SATA drives to uber-expensive "enterprise
>> drives", it's not surprising that most people don't actually use
>> SCSI/SAS drives that have implemented TCQ.
>>
>
> What is different in our positions is that you are considering storage as
> something you can connect to your desktop, while in my view storage is
> something which stores data and serves it the best possible way with the
> best performance.
>
> Hence, for you the least common denominator of all storage features is the
> most important, while for me getting the best of what is possible from storage
> is the most important.
>
> In my view storage should offload from the host system as much as
> possible: data movements, ordered operations requirements, atomic
> operations, deduplication, snapshots, reliability measures (eg RAIDs), load
> balancing, etc.
>
> It's the same as with 2D/3D video acceleration hardware. If you want the
> best performance from your system, you should offload from it as much as
> possible. In case of video - to the video hardware, in case of storage - to
> the storage. The same as with video, for storage better offload - better
> performance. On hundreds of thousands IOPS it's clearly visible.
>
> Price doesn't matter here, because it's completely different topic.
>
>
>  SATA's Native Command
>> Queuing (NCQ) is not equivalent; this allows the drive to reorder
>> requests (in particular read requests) so they can be serviced more
>> efficiently, but it does *not* allow the OS to specify a partial,
>> relative ordering of requests.
>>
>
> And so? If SATA can't do it, does that mean nobody else can do it either?
> I know plenty of non-SATA devices which can meet the ordering
> requirements you need.
>

I would be very much interested in what kind of devices support this kind of
"topological order", and in what settings they are typically used.

Does modern flash/SSD (esp. which are used on smartphones) support this?

If you could point me to some information about this, that would be very
much appreciated.

Thanks a lot!

Suli

>
> Vlad
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  
> http://vger.kernel.org/**majordomo-info.html
>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-06 Thread Pavel Machek
On Thu 2012-10-25 14:29:48, Theodore Ts'o wrote:
> On Thu, Oct 25, 2012 at 11:03:13AM -0700, da...@lang.hm wrote:
> > I agree, this is why I'm trying to figure out the recommended way to
> > do this without needing to do full commits.
> > 
> > Since in most cases it's acceptable to lose the last few chunks
> > written, if we had some way of specifying ordering, without having
> > to specify "write this NOW", the solution would be pretty obvious.
> 
> Well, using data journalling with ext3/4 may do what you want.  If you
> don't do any fsync, the changes will get written every 5 seconds when
> the automatic journal sync happens (and sub-4k writes will also get

Hmm. But that would need setting journalling mode per-file, no?

Like, make it journal data for all the databases, but keep normal mode
for rest of system...

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-05 Thread Theodore Ts'o
On Mon, Nov 05, 2012 at 05:37:02PM -0500, Richard Hipp wrote:
> 
> Per the docs:  "Only the superuser or a process possessing the
> CAP_SYS_RESOURCE capability can set or clear this attribute."  That
> prevents most applications that run SQLite from being able to take
> advantage of this, since most such applications lack elevated privileges.

If this feature would prove useful to sqlite, that's something we
could address.  I could imagine making this available to processes that
belong to a specific group that would be specified in the superblock
or as a mount option.  (We already have something like that which
allows a specific uid or gid to use the reserved space in the
superblock.)

- Ted
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-05 Thread Richard Hipp
On Mon, Nov 5, 2012 at 5:04 PM, Theodore Ts'o  wrote:

> On Mon, Nov 05, 2012 at 09:03:48PM +0100, Pavel Machek wrote:
> > > Well, using data journalling with ext3/4 may do what you want.  If you
> > > don't do any fsync, the changes will get written every 5 seconds when
> > > the automatic journal sync happens (and sub-4k writes will also get
> >
> > Hmm. But that would need setting journalling mode per-file, no?
> >
> > Like, make it journal data for all the databases, but keep normal mode
> > for rest of system...
>
> You can do that, using "chattr +j file.db".  It's apparently not a
> well known feature of ext3/4
>

Per the docs:  "Only the superuser or a process possessing the
CAP_SYS_RESOURCE capability can set or clear this attribute."  That
prevents most applications that run SQLite from being able to take
advantage of this, since most such applications lack elevated privileges.


>
> - Ted
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>



-- 
D. Richard Hipp
d...@sqlite.org
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-05 Thread Theodore Ts'o
On Mon, Nov 05, 2012 at 09:03:48PM +0100, Pavel Machek wrote:
> > Well, using data journalling with ext3/4 may do what you want.  If you
> > don't do any fsync, the changes will get written every 5 seconds when
> > the automatic journal sync happens (and sub-4k writes will also get
> 
> Hmm. But that would need setting journalling mode per-file, no?
> 
> Like, make it journal data for all the databases, but keep normal mode
> for rest of system...

You can do that, using "chattr +j file.db".  It's apparently not a
well-known feature of ext3/4.
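
For completeness, this is roughly what chattr does under the hood; a sketch, 
Linux-specific, and per the documented restriction on the +j flag it needs root 
or CAP_SYS_RESOURCE to succeed:

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int flags;
    if (argc < 2)
        return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0 || ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
        perror(argv[1]);
        return 1;
    }
    flags |= FS_JOURNAL_DATA_FL;        /* the +j "journal data" attribute */
    if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) {
        perror("FS_IOC_SETFLAGS");      /* EPERM without CAP_SYS_RESOURCE */
        return 1;
    }
    close(fd);
    return 0;
}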

- Ted
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-02 Thread Alan Cox
> Isn't any type of kernel-side ordering an exercise in futility, since
>a) the kernel has no knowledge of the disk's actual geometry
>b) most drives will internally re-order requests anyway

They will but only as permitted by the commands queued, so you have some
control depending upon the interface capabilities.

>c) cheap drives won't support barriers

Barriers are pretty much universal as you need them for power off !

> Even assuming the drives honored all your requests without lying, how would 
> you really want this behavior exposed? From the userland perspective, there 
> are very few apps that care. Probably only transactional databases, really.

And file systems internally sometimes. A file system is after all a
transactional database of sorts.

Alan
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-02 Thread Richard Hipp
On Thu, Nov 1, 2012 at 8:38 PM, Howard Chu  wrote:

> Alan Cox wrote:
>
>> How about that recently preliminary infrastructure to send ORDERED
>>> commands
>>> instead of queue draining was deleted from the kernel, because "there's
>>> no
>>> difference where to drain the queue, on the kernel or the storage side"?
>>>
>>
>> Send patches.
>>
>
> Isn't any type of kernel-side ordering an exercise in futility, since
>   a) the kernel has no knowledge of the disk's actual geometry
>   b) most drives will internally re-order requests anyway
>   c) cheap drives won't support barriers
>
> Even assuming the drives honored all your requests without lying, how
> would you really want this behavior exposed? From the userland perspective,
> there are very few apps that care. Probably only transactional databases,
> really.
>
> As a DB author, I'm not sure I'd be keen on this as an open() or fcntl()
> option. Databases that really care would be on dedicated filesystems and/or
> devices, so per-file control would be tedious. You would most likely want
> to say "all writes to this string of devices should be order-preserving"
> and forget about it. With that guarantee, a careful writer can have
> perfectly intact data structures all the time, without ever slowing down
> for a fsync.
>
>
SQLite cares.  SQLite is an in-process, transactional, zero-configuration
database that is estimated to be used by over 1 million distinct
applications and to have over 2 billion deployments.  SQLite uses
ordinary disk files in ordinary directories, often selected by the
end-user.  There is no system administrator with SQLite, so there is no
opportunity to use a dedicated filesystem with special mount options.

SQLite uses fsync() as a write barrier to assure consistency following a
power loss.  In addition, we do everything we can to maximize the amount of
time after the fsync() before we actually do another write where order
matters, in the hopes that the writes will still be ordered on platforms
where fsync() is ignored for whatever reason.  Even so, we believe we could
get a significant performance boost and reliability improvement if we had a
reliable write barrier.


> --
>   -- Howard Chu
>   CTO, Symas Corp.   http://www.symas.com
>   Director, Highland Sun http://highlandsun.com/hyc/
>   Chief Architect, OpenLDAP  
> http://www.openldap.org/**project/
>
> __**_
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-**bin/mailman/listinfo/sqlite-**users
>



-- 
D. Richard Hipp
d...@sqlite.org
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-02 Thread Vladislav Bolkhovitin


Alan Cox, on 10/31/2012 05:54 AM wrote:

I don't want to flame on this topic, but you are not right here. As far as I can
see, a big chunk of Linux storage and file system developers are/were employed 
by
the "gold-plated storage" manufacturers, starting from FusionIO, SGI and Oracle.

You know, RedHat from recent times also stepped to this market, at least I saw
their advertisement on SDC 2012. So, you can add here all RedHat employees.


Booleans generally should be reserved for logic operators. Most of the
Linux companies work on both low and high end storage. The two are not
mutually exclusive nor do they divide neatly by market. Many big clouds
use cheap low end drives by the crate, some high end desktops are using
SAS although given you can get six 2.5" hotplug drives in a 5.25" bay I'm
not sure personally there is much point


That doesn't contradict the point that high-performance storage vendors are also 
funding Linux kernel storage development.



Send patches with benchmarks demonstrating it is useful. It's really
quite simple. Code talks.


How about the fact that, recently, preliminary infrastructure to send ORDERED commands 
instead of draining the queue was deleted from the kernel, because "there's no 
difference where to drain the queue, on the kernel or the storage side"?


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-01 Thread Howard Chu

Alan Cox wrote:

How about that recently preliminary infrastructure to send ORDERED commands
instead of queue draining was deleted from the kernel, because "there's no
difference where to drain the queue, on the kernel or the storage side"?


Send patches.


Isn't any type of kernel-side ordering an exercise in futility, since
  a) the kernel has no knowledge of the disk's actual geometry
  b) most drives will internally re-order requests anyway
  c) cheap drives won't support barriers

Even assuming the drives honored all your requests without lying, how would 
you really want this behavior exposed? From the userland perspective, there 
are very few apps that care. Probably only transactional databases, really.


As a DB author, I'm not sure I'd be keen on this as an open() or fcntl() 
option. Databases that really care would be on dedicated filesystems and/or 
devices, so per-file control would be tedious. You would most likely want to 
say "all writes to this string of devices should be order-preserving" and 
forget about it. With that guarantee, a careful writer can have perfectly 
intact data structures all the time, without ever slowing down for a fsync.


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-11-01 Thread Alan Cox
> How about that recently preliminary infrastructure to send ORDERED commands 
> instead of queue draining was deleted from the kernel, because "there's no 
> difference where to drain the queue, on the kernel or the storage side"?

Send patches.

Alan
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-31 Thread Simon Slavin

On 30 Oct 2012, at 10:22pm, Vladislav Bolkhovitin  wrote:

> I fully understand your position. But "affordable" and "useful" are 
> completely orthogonal things. The "high end" features are very useful, if you 
> want to get high performance. Then ones, who can afford them, will use them, 
> which might be your favorite bank, for instance, hence they will be 
> indirectly working for you.
> 
> Of course, you don't have to work on those features, especially for free, but 
> you similarly don't have then to call them useless only because they are not 
> affordable to be put in a desktop [1].

The problem is, I think, that no bank should be using SQLite for customer 
records.  At its lowest basic level it is unsuited to high-end, multi-user, 
live-duplication work (for instance, all locking is carried out by locking the 
entire database !), and adding some features wanted by high-level users isn't 
going to change that.  For those users you need a DBMS which provides a
network-access server/client model.  This is clearly laid out in



Think of SQLite as the thing a mobile phone uses to show its messages in 
chronological order, and the thing a TV recorder uses to maintain its list of 
recorded programmes.  Both of which are literally true in my case.  It is not 
suited to a huge institution-level data repository.  The fact that some people 
use it for multi-user live access anyway is merely a sign that its excellent 
design and testing regime let it stand up to use far beyond original intent.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-31 Thread Vladislav Bolkhovitin


Theodore Ts'o, on 10/27/2012 12:44 AM wrote:

On Fri, Oct 26, 2012 at 09:54:53PM -0400, Vladislav Bolkhovitin wrote:

What is different in our positions is that you are considering storage
as something you can connect to your desktop, while in my view
storage is something which stores data and serves it the best
possible way with the best performance.


I don't get paid to make Linux storage work well for gold-plated
storage, and as far as I know, none of the purveyors of said gold
plated software systems are currently employing Linux file system
developers to make Linux file systems work well on said gold-plated
hardware.


I don't want to flame on this topic, but you are not right here. As far as I can 
see, a big chunk of Linux storage and file system developers are/were employed by 
the "gold-plated storage" manufacturers, starting with FusionIO, SGI and Oracle.


You know, RedHat has recently also stepped into this market; at least I saw 
their advertisement at SDC 2012. So you can add all RedHat employees here.



As for what I might do on my own time, for fun, I can't afford said
gold-plated hardware, and personally I get a lot more satisfaction if
I know there will be a large number of people who benefit from my work
(it was really cool when I found out that millions and millions of
Android devices were going to be using ext4 :-), as opposed to a very
small number of people who have paid $$$ to storage vendors who don't
feel it's worthwhile to pay core Linux file system developers to
leverage their hardware.  Earlier, you were bemoaning why Linux file
system developers weren't paying attention to using said fancy SCSI
features.  Perhaps now you'll understand better it's not happening?


Price doesn't matter here, because it's completely different topic.


It matters if you think I'm going to do it on my own time, out of my
own budget.  And if you think my employer is going to choose to use
said hardware, price definitely matters.  I consider engineering to be
the art of making tradeoffs, and price is absolutely one of the things
that we need to trade off against other goals.

It's rare that you get to design something where performance matters
above all else.  Maybe it's that way if you're paid by folks whose job
it is to destablize the world's financial markets by pushing the holes
into the right half plane (i.e., high frequency trading :-).  But for
the rest of the world, price absolutely matters.


I fully understand your position. But "affordable" and "useful" are completely 
orthogonal things. The "high end" features are very useful if you want to get 
high performance. Then those who can afford them will use them, which might be 
your favorite bank, for instance, and hence they will be indirectly working for you.


Of course, you don't have to work on those features, especially for free, but you 
similarly don't then have to call them useless only because they are not 
affordable enough to be put in a desktop [1].


Our discussion started not from "value-for-money", but from a constant demand to 
perform ordered commands without full queue draining, which has been ignored by the 
Linux storage developers for YEARS as not useful, right?


Vlad

[1] If you or somebody else wants to put something supporting all the necessary 
features to perform ORDERED commands, including ACA, into a desktop, you can look at 
modern SAS SSDs. I can't call the price of those devices "high-end".



___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-31 Thread Alan Cox
> I don't want to flame on this topic, but you are not right here. As far as I 
> can 
> see, a big chunk of Linux storage and file system developers are/were 
> employed by 
> the "gold-plated storage" manufacturers, starting from FusionIO, SGI and 
> Oracle.
> 
> You know, RedHat from recent times also stepped to this market, at least I 
> saw 
> their advertisement on SDC 2012. So, you can add here all RedHat employees.

Booleans generally should be reserved for logic operators. Most of the
Linux companies work on both low and high end storage. The two are not
mutually exclusive nor do they divide neatly by market. Many big clouds
use cheap low end drives by the crate, and some high end desktops are using
SAS, although given you can get six 2.5" hotplug drives in a 5.25" bay I'm
not sure personally there is much point.

(and I used to have fibrechannel on my Thinkpad 600 when docked 8))

> Our discussion started not from "value-for-money", but from a constant demand 
> to 
> perform ordered commands without full queue draining, which is ignored by the 
> Linux storage developers for YEARS as not useful, right?

Send patches with benchmarks demonstrating it is useful. It's really
quite simple. Code talks.

Alan
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-30 Thread Nico Williams
Hmm, so sorry I didn't notice the cc'ing of the linux-kernel list,
resulting in so much additional traffic to sqlite-users, which I'll
drop in my replies to the linux-kernel list.

Nico
--
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-27 Thread Vladislav Bolkhovitin


Theodore Ts'o, on 10/25/2012 09:50 AM wrote:

Yeah  I don't buy that.  One, flash is still too expensive.  Two,
the capital costs to build enough Silicon foundries to replace the
current production volume of HDD's is way too expensive for any
company to afford (the cloud providers are buying *huge* numbers of
HDD's) --- and that's assuming companies wouldn't chose to use those
foundries for products with larger margins --- such as, for example,
CPU/GPU chips. :-) And third and finally, if you study the long-term
trends in terms of Data Retention Time (going down), Program and Read
Disturb (going up), and Write Endurance (going down) as a function of
feature size and/or time, you'd be wise to treat flash as nothing more
than short-term cache, and not as a long term stable store.

If end users completely give up on flash, and store all of their
precious family pictures on flash storage, after a couple of years,
they are likely going to be very disappointed

Speaking personally, I wouldn't want to have anything on flash for
more than a few months at *most* before I made sure I had another copy
saved on spinning rust platters for long-term retention.


Here I agree with you.

Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-27 Thread Vladislav Bolkhovitin


Theodore Ts'o, on 10/25/2012 01:14 AM wrote:

On Tue, Oct 23, 2012 at 03:53:11PM -0400, Vladislav Bolkhovitin wrote:

Yes, SCSI has full support for ordered/simple commands designed
exactly for that task: to have steady flow of commands even in case
when some of them are ordered.


SCSI does, yes --- *if* the device actually implements Tagged Command
Queuing (TCQ).  Not all devices do.

More importantly, SATA drives do *not* have this capability, and when
you compare the price of SATA drives to uber-expensive "enterprise
drives", it's not surprising that most people don't actually use
SCSI/SAS drives that have implemented TCQ.


What is different in our positions is that you are considering storage as something 
you can connect to your desktop, while in my view storage is something which 
stores data and serves it the best possible way with the best performance.


Hence, for you the least common denominator of all storage features is the most 
important, while for me getting the best of what is possible from storage is the most 
important.


In my view storage should offload from the host system as much as possible: data 
movement, ordered-operation requirements, atomic operations, deduplication, 
snapshots, reliability measures (e.g. RAID), load balancing, etc.


It's the same as with 2D/3D video acceleration hardware. If you want the best 
performance from your system, you should offload from it as much as possible. In 
the case of video, to the video hardware; in the case of storage, to the storage. As 
with video, for storage better offload means better performance. At hundreds 
of thousands of IOPS it's clearly visible.


Price doesn't matter here, because it's a completely different topic.


SATA's Native Command
Queuing (NCQ) is not equivalent; this allows the drive to reorder
requests (in particular read requests) so they can be serviced more
efficiently, but it does *not* allow the OS to specify a partial,
relative ordering of requests.


And so? If SATA can't do it, does that mean nobody else can do it either? I know 
plenty of non-SATA devices which can meet the ordering requirements you need.


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-27 Thread Vladislav Bolkhovitin


Nico Williams, on 10/24/2012 05:17 PM wrote:

Yes, SCSI has full support for ordered/simple commands designed exactly for
that task: [...]

[...]

But historically, for some reason, Linux storage developers were stuck with the
"barriers" concept, which is obviously not the same as ORDERED commands,
hence had a lot of trouble with its ambiguous semantics. As far as I can tell
the reason for that was a lack of sufficiently deep SCSI understanding
(how to handle errors, the belief that ACA is something legacy from parallel
SCSI times, etc.).


Barriers are a very simple abstraction, so there's that.


It isn't simple at all. If you think for some time about barriers from the storage 
point of view, you will soon realize how bad and ambiguous they are.



Before that happens, people will keep returning again and again with those
simple questions: why must the queue be flushed for any ordered operation?
Isn't it an obvious overkill?


That [cache flushing]


It isn't cache flushing, it's _queue_ flushing. You can call it queue draining, if 
you like.


Often there's a big difference in where it's done: on the system side or on the 
storage side.


Actually, performance improvements from NCQ in many cases come not from the fact that it 
allows the drive to reorder requests, as is commonly thought, but from the fact that it 
allows the drive's internal processing stages to stay busy without any 
idle time. Drives often have a long internal pipeline, hence the need to keep 
every stage of it always busy, and hence why using ORDERED commands is important 
for performance.



is not what's being asked for here. Just a
light-weight barrier.  My proposal works without having to add new
system calls: a) use a COW format, b) have background threads doing
fsync()s, c) in each transaction's root block note the last
known-committed (from a completed fsync()) transaction's root block,
d) have an array of well-known uberblocks large enough to accommodate
as many transactions as possible without having to wait for any one
fsync() to complete, e) do not reclaim space from any one past
transaction until at least one subsequent transaction is fully
committed.  This obtains ACI- transaction semantics (survives power
failures but without durability for the last N transactions at
power-failure time) without requiring changes to the OS at all, and
with support for delayed D (durability) notification.


I believe what you really want is to be able to send to the storage a sequence of 
your favorite operations (FS operations, async IO operations, etc.) like:


Write back caching disabled:

data op11, ..., data op1N, ORDERED data op1, data op21, ..., data op2M, ...

Write back caching enabled:

data op11, ..., data op1N, ORDERED sync cache, ORDERED FUA data op1, data op21, 
..., data op2M, ...


Right?

(ORDERED means it is guaranteed that this command will never, under any 
circumstances, start executing before all previous commands have completed, and that 
no subsequent command will start executing before it has completed.)


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-26 Thread Theodore Ts'o
On Fri, Oct 26, 2012 at 09:54:53PM -0400, Vladislav Bolkhovitin wrote:
> What is different in our positions is that you are considering storage
> as something you can connect to your desktop, while in my view
> storage is something which stores data and serves it the best
> possible way with the best performance.

I don't get paid to make Linux storage work well for gold-plated
storage, and as far as I know, none of the purveyors of said gold
plated software systems are currently employing Linux file system
developers to make Linux file systems work well on said gold-plated
hardware.

As for what I might do on my own time, for fun, I can't afford said
gold-plated hardware, and personally I get a lot more satisfaction if
I know there will be a large number of people who benefit from my work
(it was really cool when I found out that millions and millions of
Android devices were going to be using ext4 :-), as opposed to a very
small number of people who have paid $$$ to storage vendors who don't
feel it's worthwhile to pay core Linux file system developers to
leverage their hardware.  Earlier, you were bemoaning why Linux file
system developers weren't paying attention to using said fancy SCSI
features.  Perhaps now you'll understand better it's not happening?

> Price doesn't matter here, because it's completely different topic.

It matters if you think I'm going to do it on my own time, out of my
own budget.  And if you think my employer is going to choose to use
said hardware, price definitely matters.  I consider engineering to be
the art of making tradeoffs, and price is absolutely one of the things
that we need to trade off against other goals.

It's rare that you get to design something where performance matters
above all else.  Maybe it's that way if you're paid by folks whose job
it is to destabilize the world's financial markets by pushing the holes
into the right half plane (i.e., high frequency trading :-).  But for
the rest of the world, price absolutely matters.

- Ted

P.S.  All of the storage I have access to at home is SATA.  If someone
would like to change that and ship me free hardware, as long as it
doesn't require three-phase power (or require some exotic interconnect
which is ghastly expensive and which you are also not going to provide
me for free), do contact me off-line.  :-)
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-25 Thread Theodore Ts'o
On Thu, Oct 25, 2012 at 11:03:13AM -0700, da...@lang.hm wrote:
> I agree, this is why I'm trying to figure out the recommended way to
> do this without needing to do full commits.
> 
> Since in most cases it's acceptable to lose the last few chunks
> written, if we had some way of specifying ordering, without having
> to specify "write this NOW", the solution would be pretty obvious.

Well, using data journalling with ext3/4 may do what you want.  If you
don't do any fsync, the changes will get written every 5 seconds when
the automatic journal sync happens (and sub-4k writes will also get
coalesced to a 5 second granularity).  Even with plain text files,
it's pretty easy to tell whether or not the final record is
partially written after a crash; just look for a trailing
newline.
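
A sketch of that check (assuming a plain-text log with one record per line):

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the file's last record ends in a newline (it was written
   out completely), 0 if it looks truncated, -1 on error. */
int last_record_complete(const char *path)
{
    char c;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int ok = (lseek(fd, -1, SEEK_END) >= 0 &&
              read(fd, &c, 1) == 1 && c == '\n');
    close(fd);
    return ok;
}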

Better yet, if you are writing to multiple log files with data
journalling, all of the writes will happen at the same time, and they
will be streamed to the file system journal, minimizing random writes
for at least the journal writes.

- Ted
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-25 Thread david

On Thu, 25 Oct 2012, Theodore Ts'o wrote:


Or does rsyslog *really* need to issue an fsync after each log
message?  Or could it batch updates so that every N seconds, it
flushes writes to the disk?


In part this depends on how paranoid the admin is. By default rsyslog 
doesn't do fsyncs, but admins can configure it to do so and can configure 
the batch size.


However, what I'm talking about here is not normal message traffic; it's 
the case where the admin has decided that they don't want to use the 
normal in-memory queues, they want the queues to be on disk so that if 
the system crashes the queued data will still be there to be processed 
after the crash. (In addition, this can be used to cover cases where you 
want queue sizes larger than your available RAM.)


In this case, the extreme, and only at the explicit direction of the 
admin, is to fsync after every message.


The norm is that it's acceptable to lose the last few messages, but 
losing a chunk out of the middle of the queue file can cause a whole lot 
more to be lost, passing the threshold of acceptable.



Sometimes, the answer is not to try to create exotic database like
functionality in the file system --- the answer is to be more
intelligent at the application layer.  Not only will the application
be more portable, it will also in the end be more efficient, since
even with the most exotic database technologies, the most efficient
transactional commit is the unneeded commit that you optimize away at
the application layer.


I agree, this is why I'm trying to figure out the recommended way to do 
this without needing to do full commits.


Since in most cases it's acceptable to lose the last few chunks written, 
if we had some way of specifying ordering, without having to specify 
"write this NOW", the solution would be pretty obvious.


David Lang
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-25 Thread Simon Slavin

On 25 Oct 2012, at 2:04am, da...@lang.hm wrote:

> But unless you are a filesystem, how can you make sure that the message data 
> is written to file1 before you write the metadata about the message to file2?

Wait for long enough for the disk subsystem to clear its backlog of write 
commands.  A few seconds should do it.

> right now it seems that there is no way for an application to do this other 
> than doing a fsync(file1) before writing the metadata to file2

No, as I've posted previously to this thread, you can assume that fsync() 
literally does nothing.  It really is implemented as a 'noop' in many cases.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-25 Thread Alan Cox
> > Hopefully, eventually the storage developers will realize the value
> > behind ordered commands and learn corresponding SCSI facilities to
> > deal with them.
> 
> Eventually, drive manufacturers will realize that trying to price
> gouge people who want advanced features such as TCQ, DIF/DIX, is the
> best way to guarantee that most people won't bother to purchase them,
> and hence the features will remain largely unused

I doubt they care. The profit on high end features from the people who
really need them I would bet far exceeds any other benefit of giving it to
others. Welcome to capitalism 8)

Plus - spinning rust for those end users is on the way out, SATA to flash
is a bit of a hack, and people are already putting a lot of focus onto
things like NVM Express.

Alan
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-25 Thread Theodore Ts'o
On Wed, Oct 24, 2012 at 11:58:49PM -0700, da...@lang.hm wrote:
> The frustrating thing is that when people point out how things like
> sqlite are so horribly slow, the reply seems to be "well, that's
> what you get for doing so many fsyncs, don't do that", when there is
> a 'problem' like the KDE "config loss" problem a few years ago, the
> response is "well, that's what you get for not doing fsync"

Sure... but the answer is to only do the fsync's when you need to.
For example, if GNOME and KDE are rewriting the entire registry file
each time the application changes a single registry key, sure, if
you rewrite the entire registry file, and then fsync after each
rewrite before you replace the file, you will be safe.  And if the
application needs to update dozens or hundreds of registry keys (or
every time the window gets moved or resized), then yes, it will be
slow.  But the application didn't have to do that!  It could have
updated all the registry keys in memory, and then update the registry
file periodically instead.

Similarly, Firefox didn't need to do a sqlite commit after every
single time its history file was written, causing a third of a
megabyte of write traffic each time you clicked on a web page.  It
could have batched its updates to the history file, since most of the
time, you don't care about making sure the web history is written to
stable store before you're allowed to click on a web page and visit
the next web page.

Or does rsyslog *really* need to issue an fsync after each log
message?  Or could it batch updates so that every N seconds, it
flushes writes to the disk?
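
Batching like that is a few lines of code; a sketch (names made up, and a real
daemon would flush from a timer rather than piggybacking on the next write):

#include <time.h>
#include <unistd.h>

#define FLUSH_INTERVAL 5            /* seconds between forced flushes */

/* Append a record, but fsync at most once every FLUSH_INTERVAL seconds
   instead of once per message. */
void log_append(int fd, const char *buf, size_t len)
{
    static time_t last_flush;
    if (write(fd, buf, len) < 0)
        return;
    time_t now = time(NULL);
    if (now - last_flush >= FLUSH_INTERVAL) {
        fsync(fd);                  /* everything queued so far becomes durable */
        last_flush = now;
    }
}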

(And this is a problem with most Android applications as well.
Apparently the framework API's are such that it's easier for an
application to treat each sqlite statement as an atomic update, so
many/most application writers don't use explicit transaction
boundaries, so updates don't get batched even though it would be more
efficient if they did so.)

Sometimes, the answer is not to try to create exotic database like
functionality in the file system --- the answer is to be more
intelligent at the application layer.  Not only will the application
be more portable, it will also in the end be more efficient, since
even with the most exotic database technologies, the most efficient
transactional commit is the unneeded commit that you optimize away at
the application layer.

- Ted
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-25 Thread Theodore Ts'o
On Thu, Oct 25, 2012 at 02:03:25PM +0100, Alan Cox wrote:
> 
> I doubt they care. The profit on high end features from the people who
> really need them I would bet far exceeds any other benefit of giving it to
> others. Welcome to capitalism 8)

Yes, but it's a question of pricing.  If they had priced it just a
wee bit higher, then there would have been incentive to add support
for TCQ so it could actually be used in various Linux file systems,
since there would have been lots of users of it.  But as it is, the
folks who are purchasing huge, vast numbers of these drives --- such as
the large cloud providers: Amazon, Facebook, Rackspace, et al. ---
will choose to purchase large numbers of commodity drives, and then
find ways to work around the missing functionality in userspace.  For
example, DIF/DIX would be nice, and if it were available for cheap, I
could imagine it being used.  But you can accomplish the same thing in
userspace, and in fact at Google I've implemented a special
not-for-mainline patch which spikes out stable writes (required for
DIF/DIX) because it has significant performance overhead, and DIF/DIX
has zero benefit if you're not willing to shell out $$$ for hardware
that supports it.

Maybe the HDD manufacturers have been able to price gouge a small
number of enterprise I/T shops with more dollars than sense, but
personally, I'm not convinced they picked an optimal pricing
strategy.

Put another way, I accept that Toyota should price a Lexus ES more
than a Camry, but if it's priced at say, 3x the price of a Camry
instead of 20%, they might find that precious few people are willing
to pay that kind of money for what is essentially the same car with
minor luxury tweaks added to it.

> Plus - spinning rust for those end users is on the way out, SATA to flash
> is a bit of hack and people are already putting a lot of focus onto
> things like NVM Express.

Yeah  I don't buy that.  One, flash is still too expensive.  Two,
the capital costs to build enough Silicon foundries to replace the
current production volume of HDD's is way too expensive for any
company to afford (the cloud providers are buying *huge* numbers of
HDD's) --- and that's assuming companies wouldn't chose to use those
foundries for products with larger margins --- such as, for example,
CPU/GPU chips. :-) And third and finally, if you study the long-term
trends in terms of Data Retention Time (going down), Program and Read
Disturb (going up), and Write Endurance (going down) as a function of
feature size and/or time, you'd be wise to treat flash as nothing more
than short-term cache, and not as a long term stable store.

If end users completely give up on flash, and store all of their
precious family pictures on flash storage, after a couple of years,
they are likely going to be very disappointed

Speaking personally, I wouldn't want to have anything on flash for
more than a few months at *most* before I made sure I had another copy
saved on spinning rust platters for long-term retention.

  - Ted
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-25 Thread david

On Thu, 25 Oct 2012, Theodore Ts'o wrote:


On Wed, Oct 24, 2012 at 03:03:00PM -0700, da...@lang.hm wrote:

Like what is being described for sqlite, losing the tail end of the
messages is not a big problem under normal conditions. But there is
a need to be sure that what is there is complete up to the point
where it's lost.

this is similar in concept to write-ahead-logs done for databases
(without the absolute durability requirement)


If that's what you require, and you are using ext3/4, using data
journalling might meet your requirements.  It's something you can
enable on a per-file basis, via chattr +j; you don't have to force all
file systems to use data journaling via the data=journalled mount
option.

The potential downsides that you may or may not care about for this
particular application:

(a) This will definitely have a performance impact, especially if you
are doing lots of small (less than 4k) writes, since the data blocks
will get run through the journal, and will only get written to their
final location on disk.

(b) You don't get atomicity if the write spans a 4k block boundary.
All of the bytes before i_size will be written, so you don't have to
worry about "holes"; but the last message written to the log file
might be truncated.

(c) There will be a performance impact, since the contents of data
blocks will be written at least twice (once to the journal, and once
to the final location on disk).  If you do lots of small, sub-4k
writes, the performance might be even worse, since data blocks might
be written multiple times to the journal.


I'll have to dig into this option. In the case of rsyslog it sounds 
like it could work (not as good as a filesystem-independent way of doing 
things, but better than full fsyncs).


Truncated messages are not great, but they are a detectable and 
acceptable risk.


while the average message size is much smaller than 4K (on my network it's 
~250 bytes), the metadata that's broken out expands this somewhat, and we 
can afford to waste disk space if it makes things safer or more efficient.


If we do update-in-place with flags for each message, each message will 
need to be written up to three times (on receipt, being processed, finished 
processing). With high message burst rates, I'm worried that we would fill 
up the journal; is there a good way to deal with this?


I believe that ext4 can put the journal on a different device from the 
filesystem; would this help a lot?


If you were to put the journal for an ext4 filesystem on a ram disk, you 
would lose the data recovery protection of the journal, but could you use 
this trick to get ordered data writes onto the filesystem?


David Lang
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-25 Thread david

On Thu, 25 Oct 2012, Theodore Ts'o wrote:


On Thu, Oct 25, 2012 at 12:18:47AM -0500, Nico Williams wrote:


By trusting fsync().  And if you don't care about immediate Durability
you can run the fsync() in a background thread and mark the associated
transaction as completed in the next transaction to be written after
the fsync() completes.


The challenge is when you have entagled metadata updates.  That is,
you update file A, and file B, and file A and B might share metadata.
In order to sync file A, you also have to update part of the metadata
for the updates to file B, which means calculating the dependencies of
what you have to drag in can get very complicated.  You can keep track
of what bits of the metadata you have to undo and then redo before
writing out the metadata for fsync(A), but that basically means you
have to implement soft updates, and all of the complexity this
implies: http://lwn.net/Articles/339337/

If you can keep all of the metadata separate, this can be somewhat
mitigated, but usually the block allocation records (regardless of
whether you use a tree, or a bitmap, or some other data structure)
tend to have entanglement problems.


hmm, two thoughts occur to me.

1. to avoid entanglement, put the two files in separate directories

2. take advantage of entanglement to enforce ordering


thread 1 (repeated): write new message to file 1, spawn new thread to 
fsync


thread 2: write to file 2 that messages 1-5 are being worked on

thread 2 (later): write to file 2 that messages 1-5 are done

when thread 1 spawns the new thread to do the fsync, the system will be 
forced to write the data to file 2 as of the time it does the fsync.


This should make it so that you never have data written to file2 that 
refers to data that hasn't been written to file1 yet.




It certainly is not impossible; RDBMS's have implemented this.  On the
other hand, they generally aren't as fast as file systems for
non-transactional workloads, and people really care about performance
on those sorts of workloads for file systems.


the RDBMS's have implemented stronger guarantees than what we need

A few years ago I was investigating this for logging. With the reliable 
(RDBMS style), but inefficient, disk queue that rsyslog has, writing to a 
high-end fusion-io SSD, ext2 resulted in ~8K logs/sec, ext3 resulted in ~2K 
logs/sec, and JFS/XFS resulted in ~4K logs/sec (ext4 wasn't considered 
stable enough at the time to be tested)



Still, if you want to try to implement such a thing, by all means,
give it a try.  But I think you'll find that creating a file system
that can compete with existing file systems for performance, and
*then* also supports a transactional model, is going to be quite a
challenge.


The question is trying to figure out a way to get ordering right with existing 
filesystems (preferably without using something too tied to a single 
filesystem implementation), not to try and create a new one.


The frustrating thing is that when people point out how things like sqlite 
are so horribly slow, the reply seems to be "well, that's what you get for 
doing so many fsyncs, don't do that"; when there is a 'problem' like the 
KDE "config loss" problem a few years ago, the response is "well, that's 
what you get for not doing fsync"


Both responses are correct, from a purely technical point of view.

But what's missing is any way to get the result of ordered I/O that will 
let you do something pretty fast, but with the guarantee that, if you 
lose data in a crash, the only loss you are risking is that your most 
recent data may be missing. (either for one file, or using multiple files 
if that's what it takes)


Since this topic came up again, I figured I'd poke a bit and try to either 
get educated on how to do this "right" or try and see if there's something 
that could be added to the kernel to make it possible for userspace 
programs to do this.


What I think userspace really needs is something like a barrier function 
call. "for this fd, don't re-order writes as they go down through the 
stack"


If the hardware is going to reorder things once it hits the hardware, this 
is going to hurt performance (how much depends on a lot of stuff)


but the filesystems are able to make their journals work, so there should 
be some way to let userspace do some sort of similar ordering


David Lang
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-25 Thread Theodore Ts'o
On Thu, Oct 25, 2012 at 12:18:47AM -0500, Nico Williams wrote:
> 
> By trusting fsync().  And if you don't care about immediate Durability
> you can run the fsync() in a background thread and mark the associated
> transaction as completed in the next transaction to be written after
> the fsync() completes.

The challenge is when you have entangled metadata updates.  That is,
you update file A, and file B, and file A and B might share metadata.
In order to sync file A, you also have to update part of the metadata
for the updates to file B, which means calculating the dependencies of
what you have to drag in can get very complicated.  You can keep track
of what bits of the metadata you have to undo and then redo before
writing out the metadata for fsync(A), but that basically means you
have to implement soft updates, and all of the complexity this
implies: http://lwn.net/Articles/339337/

If you can keep all of the metadata separate, this can be somewhat
mitigated, but usually the block allocation records (regardless of
whether you use a tree, or a bitmap, or some other data structure)
tend to have entanglement problems.

It certainly is not impossible; RDBMS's have implemented this.  On the
other hand, they generally aren't as fast as file systems for
non-transactional workloads, and people really care about performance
on those sorts of workloads for file systems.  (About a decade ago,
Oracle tried to claim that you could run file system workloads using
an Oracle database as a back-end.  Everyone laughed at them, and the
idea died a quick, merciful death.)

Still, if you want to try to implement such a thing, by all means,
give it a try.  But I think you'll find that creating a file system
that can compete with existing file systems for performance, and
*then* also supports a transactional model, is going to be quite a
challenge.

 - Ted
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-25 Thread Theodore Ts'o
On Wed, Oct 24, 2012 at 03:03:00PM -0700, da...@lang.hm wrote:
> Like what is being described for sqlite, losing the tail end of the
> messages is not a big problem under normal conditions. But there is
> a need to be sure that what is there is complete up to the point
> where it's lost.
> 
> this is similar in concept to write-ahead-logs done for databases
> (without the absolute durability requirement)

If that's what you require, and you are using ext3/4, using data
journalling might meet your requirements.  It's something you can
enable on a per-file basis, via chattr +j; you don't have to force all
file systems to use data journaling via the data=journalled mount
option.
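
As a rough sketch, the same per-file flag that chattr +j sets can also be
turned on from a program via the FS_IOC_SETFLAGS ioctl; this assumes a file
on ext3/4, Linux headers, and the same privileges chattr +j itself needs:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    /* Rough equivalent of "chattr +j FILE": enable per-file data journalling. */
    int set_data_journalling(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return -1; }

        int attrs;
        if (ioctl(fd, FS_IOC_GETFLAGS, &attrs) < 0) {
            perror("FS_IOC_GETFLAGS"); close(fd); return -1;
        }
        attrs |= FS_JOURNAL_DATA_FL;              /* the +j bit */
        if (ioctl(fd, FS_IOC_SETFLAGS, &attrs) < 0) {
            perror("FS_IOC_SETFLAGS"); close(fd); return -1;
        }
        close(fd);
        return 0;
    }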

The potential downsides that you may or may not care about for this
particular application:

(a) This will definitely have a performance impact, especially if you
are doing lots of small (less than 4k) writes, since the data blocks
will get run through the journal, and will only get written to their
final location on disk afterwards.

(b) You don't get atomicity if the write spans a 4k block boundary.
All of the bytes before i_size will be written, so you don't have to
worry about "holes"; but the last message written to the log file
might be truncated.

(c) There will be a performance impact, since the contents of data
blocks will be written at least twice (once to the journal, and once
to the final location on disk).  If you do lots of small, sub-4k
writes, the performance might be even worse, since data blocks might
be written multiple times to the journal.

- Ted
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-25 Thread Theodore Ts'o
On Tue, Oct 23, 2012 at 03:53:11PM -0400, Vladislav Bolkhovitin wrote:
> Yes, SCSI has full support for ordered/simple commands designed
> exactly for that task: to have steady flow of commands even in case
> when some of them are ordered.

SCSI does, yes --- *if* the device actually implements Tagged Command
Queuing (TCQ).  Not all devices do.

More importantly, SATA drives do *not* have this capability, and when
you compare the price of SATA drives to uber-expensive "enterprise
drives", it's not surprising that most people don't actually use
SCSI/SAS drives that have implemented TCQ.  SATA's Native Command
Queuing (NCQ) is not equivalent; this allows the drive to reorder
requests (in particular read requests) so they can be serviced more
efficiently, but it does *not* allow the OS to specify a partial,
relative ordering of requests.

Yes, you can turn off writeback caching, but that has pretty huge
performance costs; and there is the FUA bit, but that's just an
unconditional high priority bypass of the writeback cache, which is
useful in some cases, but which again, does not give the ability for
the OS to specify a partial order, while letting the drive reorder
other requests for efficiency/performance's sake, since the drive has
a lot more information about the optimal way to reorder requests based
on the current location of the drive head and where certain blocks may
have been remapped due to bad block sparing, etc.

> Hopefully, eventually the storage developers will realize the value
> behind ordered commands and learn corresponding SCSI facilities to
> deal with them.

Eventually, drive manufacturers will realize that trying to price
gouge people who want advanced features such as TCQ, DIF/DIX, is the
best way to guarantee that most people won't bother to purchase them,
and hence the features will remain largely unused.

   - Ted
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-25 Thread david

On Wed, 24 Oct 2012, Nico Williams wrote:


On Wed, Oct 24, 2012 at 5:03 PM,   wrote:

I'm doing some work with rsyslog and its disk-based queues and there is a
similar issue there. The good news is that we can have a version that is
linux specific (rsyslog is used on other OSs, but there is an existing queue
implementation that they can use; if the faster one is linux-only but
significantly faster, that's just a win for Linux)

Like what is being described for sqlite, losing the tail end of the
messages is not a big problem under normal conditions. But there is a need
to be sure that what is there is complete up to the point where it's lost.

this is similar in concept to write-ahead-logs done for databases (without
the absolute durability requirement)

[...]

I am not fully understanding how what you are describing (COW, separate
fsync threads, etc) would be implemented on top of existing filesystems.
Most of what you are describing seems like it requires access to the
underlying storage to implement.

could you give a more detailed explanation?


COW is "copy on write", which is actually a bit of a misnomer -- all
COW means is that blocks aren't over-written, instead new blocks are
written.  In particular this means that inodes, indirect blocks, data
blocks, and so on, that are changed are actually written to new
locations, and the on-disk format needs to handle this indirection.


so how can you do this, and keep the writes in order (especially between 
two files) without being the filesystem?



As for fsync() and background threads... fsync() is synchronous, but in
this scheme we want it to happen asynchronously and then we want to
update each transaction with a pointer to the last transaction that is
known stable given an fsync()'s return.


If you could specify ordering between two writes, I could see a process 
along the lines of


Append new message to file1

append tiny status updates to file2

every million messages, move to new files. once the last message has been 
processed for the old set of files, delete them.


since file2 is small, you can reconstruct state fairly cheaply

But unless you are a filesystem, how can you make sure that the message 
data is written to file1 before you write the metadata about the message 
to file2?


right now it seems that there is no way for an application to do this 
other than doing a fsync(file1) before writing the metadata to file2
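
In rough code (with made-up message/metadata arguments rather than anything
from rsyslog, and short writes ignored for brevity), the portable version of
that ordering looks like this:

    #include <string.h>
    #include <unistd.h>

    /* Append a message to the data file, force it to disk, and only then
     * record its metadata -- so the metadata can never describe data that
     * is not on disk yet.  Returns 0 on success, -1 on error. */
    int log_ordered(int data_fd, int meta_fd, const char *msg, const char *meta)
    {
        if (write(data_fd, msg, strlen(msg)) < 0)   return -1;
        if (fsync(data_fd) < 0)                     return -1;  /* the expensive step */
        if (write(meta_fd, meta, strlen(meta)) < 0) return -1;
        return 0;
    }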


And there is no way for the application to tell the filesystem to write 
the data in file2 in order (to make sure that block 3 is not written and 
then have the system crash before block 2 is written), so the application 
needs to do frequent fsync(file2) calls.


If you need complete durability of your data, there are well documented 
ways of enforcing it (including the lwn.net article 
http://lwn.net/Articles/457667/ )


But if you don't need the guarantee that your data is on disk now, and just 
need it ordered so that if you crash you are guaranteed only to 
lose data off of the tail of your file, there doesn't seem to be any way 
to do this other than using the fsync() hammer and waiting for the overhead 
of forcing the data to disk now.



Or, as I type this, it occurs to me that you may be saying that every time 
you want to do an ordering guarantee, spawn a new thread to do the fsync 
and then just keep processing. The fsync will happen at some point, and 
the writes will not be re-ordered across the fsync, but you can keep 
going, writing more data while the fsync's are pending.


Then if you have a filesystem and I/O subsystem that can consolidate the 
fsyncs from all the different threads together into one I/O operation 
without having to flush the entire I/O queue for each one, you can get 
acceptable performance, with ordering. If the system crashes, data that 
hasn't had its fsync() complete will be the only thing that is lost.


David Lang
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-25 Thread david

On Wed, 24 Oct 2012, Nico Williams wrote:


Before that happens, people will keep returning again and again with those
simple questions: why the queue must be flushed for any ordered operation?
Isn't it an obvious overkill?


That [cache flushing] is not what's being asked for here.  Just a
light-weight barrier.  My proposal works without having to add new
system calls: a) use a COW format, b) have background threads doing
fsync()s, c) in each transaction's root block note the last
known-committed (from a completed fsync()) transaction's root block,
d) have an array of well-known uberblocks large enough to accommodate
as many transactions as possible without having to wait for any one
fsync() to complete, e) do not reclaim space from any one past
transaction until at least one subsequent transaction is fully
committed.  This obtains ACI- transaction semantics (survives power
failures but without durability for the last N transactions at
power-failure time) without requiring changes to the OS at all, and
with support for delayed D (durability) notification.


I'm doing some work with rsyslog and its disk-based queues and there is a 
similar issue there. The good news is that we can have a version that is 
linux specific (rsyslog is used on other OSs, but there is an existing 
queue implementation that they can use; if the faster one is linux-only 
but significantly faster, that's just a win for Linux)


Like what is being described for sqlite, losing the tail end of the 
messages is not a big problem under normal conditions. But there is a need 
to be sure that what is there is complete up to the point where it's lost.


this is similar in concept to write-ahead-logs done for databases (without 
the absolute durability requirement)


1. new messages arrive and get added to the end of the queue file.

2. a thread updates the queue to indicate that it is in the process 
of delivering a block of messages


3. the thread updates the queue to indicate that the block of messages has 
been delivered


4. garbage collection happens to delete the old messages to free up space 
(if queues go into files, this can just be to limit the file size, 
spilling to multiple files, and when an old file is completely marked as 
delivered, delete it)


I am not fully understanding how what you are describing (COW, separate 
fsync threads, etc) would be implemented on top of existing filesystems. 
Most of what you are describing seems like it requires access to the 
underlying storage to implement.


could you give a more detailed explanation?

David Lang
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-24 Thread Nico Williams
On Wed, Oct 24, 2012 at 8:04 PM,   wrote:
> On Wed, 24 Oct 2012, Nico Williams wrote:
>> COW is "copy on write", which is actually a bit of a misnomer -- all
>> COW means is that blocks aren't over-written, instead new blocks are
>> written.  In particular this means that inodes, indirect blocks, data
>> blocks, and so on, that are changed are actually written to new
>> locations, and the on-disk format needs to handle this indirection.
>
> so how can you do this, and keep the writes in order (especially between two
> files) without being the filesystem?

By trusting fsync().  And if you don't care about immediate Durability
you can run the fsync() in a background thread and mark the associated
transaction as completed in the next transaction to be written after
the fsync() completes.
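
A sketch of the background-fsync part, where the hypothetical mark_durable()
stands in for "note it in the next transaction's root block":

    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical: record that everything up to txn_id is known durable,
     * so the next transaction written can point at it. */
    extern void mark_durable(long txn_id);

    struct fsync_job { int fd; long txn_id; };

    static void *fsync_worker(void *arg)
    {
        struct fsync_job *job = arg;
        if (fsync(job->fd) == 0)
            mark_durable(job->txn_id);
        free(job);
        return NULL;
    }

    /* Call after the COW blocks of transaction txn_id have been written. */
    int commit_async(int fd, long txn_id)
    {
        struct fsync_job *job = malloc(sizeof *job);
        if (!job) return -1;
        job->fd = fd;
        job->txn_id = txn_id;

        pthread_t t;
        if (pthread_create(&t, NULL, fsync_worker, job) != 0) {
            free(job);
            return -1;
        }
        pthread_detach(&t);
        return 0;
    }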

>> As for fsync() and background threads... fsync() is synchronous, but in
>> this scheme we want it to happen asynchronously and then we want to
>> update each transaction with a pointer to the last transaction that is
>> known stable given an fsync()'s return.
>
> If you could specify ordering between two writes, I could see a process
> along the lines of
>
> [...]

fsync() deals with just one file.  fsync()s of different files are
another story.  That said, as long as the format of the two files is
COW then you can still compose transactions involving two files.  The
key is the file contents itself must be COW-structured.

Incidentally, here's a single-file, bag of b-trees that uses a COW
format: MDB, which can be found in
git://git.openldap.org/openldap.git, in the mdb.master branch.

> Or, as I type this, it occurs to me that you may be saying that every time
> you want to do an ordering guarantee, spawn a new thread to do the fsync and
> then just keep processing. The fsync will happen at some point, and the
> writes will not be re-ordered across the fsync, but you can keep going,
> writing more data while the fsync's are pending.

Yes, but only if the file's format is COWish.

The point is that COW saves the day.  A file-based DB needs to be COW.
 And the filesystem needs to be as well.

Note that write ahead logging approximates COW well enough most of the time.

> Then if you have a filesystem and I/O subsystem that can consolidate the
> fsyncs from all the different threads together into one I/O operation
> without having to flush the entire I/O queue for each one, you can get
> acceptable performance, with ordering. If the system crashes, data that
> hasn't had its fsync() complete will be the only thing that is lost.

With the above caveat, yes.

Nico
--
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-24 Thread Nico Williams
On Wed, Oct 24, 2012 at 7:17 PM, Simon Slavin  wrote:
> A) fsync() doesn't work the way it's meant to on the majority of user 
> platforms.  It effectively does nothing.  Here are typical notes for Windows 
> Server and FreeBSD:

Many systems lie, that's true.  For example: Virtual Box by default
lies about cache flushes.  And consumer hardware typically does as
well.  The systems I'm familiar with implement fsync() correctly as
long as the hardware doesn't lie.  (Nothing much can be done about
lying hardware, especially if the lies go beyond merely not flushing
caches.  Though if cache flushing is the only thing the hardware lies
about then the OS/filesystem can implement a technique for recovery
like the one I described.  Indeed, ZFS does just that.)

But the point is that Richard asked for a light-weight barrier API and
it exists as I described.  Any API explicitly designed for this
purpose could still be implemented incorrectly, or just lie through
its teeth.  SQLite can't help this.  SQLite *can* use available APIs:
when the OS/FS/HW don't lie using these APIs is way better than not
using them, and if the OS/FS/HW lie, well, that's not SQLite's
problem.  At best SQLite could mitigate the lies by... doing what I
suggested: keep around N non-garbage-collected most recent
transactions so that the most recent transaction that can be validated
-meaning its writes hit disk- is taken as the current state of the DB.

Nico

PS: Typically OSes implement fsync(), and all filesystem system calls
via a VFS switch, so the actual implementation of fsync() generally
depends on the actual filesystem in addition to the OS and the
hardware.  A filesystem like a traditional UFS might correctly flush
caches and so on and yet fail to implement fsync() as a Durability
guarantee on account of not having a COW structure on disk, such that
a power failure in the middle of subsequent writes can leave the
filesystem inconsistent.  A filesystem like ZFS doesn't have this
problem.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-24 Thread Nico Williams
On Wed, Oct 24, 2012 at 5:03 PM,   wrote:
> I'm doing some work with rsyslog and its disk-based queues and there is a
> similar issue there. The good news is that we can have a version that is
> linux specific (rsyslog is used on other OSs, but there is an existing queue
> implementation that they can use; if the faster one is linux-only but
> significantly faster, that's just a win for Linux)
>
> Like what is being described for sqlite, losing the tail end of the
> messages is not a big problem under normal conditions. But there is a need
> to be sure that what is there is complete up to the point where it's lost.
>
> this is similar in concept to write-ahead-logs done for databases (without
> the absolute durability requirement)
>
> [...]
>
> I am not fully understanding how what you are describing (COW, separate
> fsync threads, etc) would be implemented on top of existing filesystems.
> Most of what you are describing seems like it requires access to the
> underlying storage to implement.
>
> could you give a more detailed explanation?

COW is "copy on write", which is actually a bit of a misnomer -- all
COW means is that blocks aren't over-written, instead new blocks are
written.  In particular this means that inodes, indirect blocks, data
blocks, and so on, that are changed are actually written to new
locations, and the on-disk format needs to handle this indirection.

As for fsync() and background threads... fsync() is synchronous, but in
this scheme we want it to happen asynchronously and then we want to
update each transaction with a pointer to the last transaction that is
known stable given an fsync()'s return.

Nico
--
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-24 Thread Simon Slavin

On 24 Oct 2012, at 10:17pm, Nico Williams  wrote:

> That [cache flushing] is not what's being asked for here.  Just a
> light-weight barrier.  My proposal works without having to add new
> system calls: a) use a COW format, b) have background threads doing
> fsync()s, c) in each transaction's root block note the last
> known-committed (from a completed fsync()) transaction's root block,

Nico,

A) fsync() doesn't work the way it's meant to on the majority of user 
platforms.  It effectively does nothing.  Here are typical notes for Windows 
Server and FreeBSD:



"fsync shouldn't be a noop on windows"



"The fsync appears to be a noop."

I'm not knocking any particular OS, they're all like that.  Because actually 
implementing fsync() causes massive slow-downs on all disk writes, and makes 
the computer feel unresponsive to users.

B) Your hard disk lies.  Unless it's a server-level (i.e. expensive) hard disk 
sold especially for server use, it does not enforce in-order writing, either at 
the firmware level or the driver level.  Protocols like SCSI have ways to do 
this correctly, but the overwhelming majority of IDE controllers out there will 
just completely ignore it.

Again, I'm not knocking a particular model, they're all like that.  Because 
write-back caching is so much faster and implementing write-through caching as 
well takes additional programming for something most users will never use.

To propose a new utility which rides above the operating system, both of the 
above have to be remedied.  Your call depends on the OS not lying, and the OS 
depends on the hardware not lying.  You cannot fix this at the level you 
propose.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-24 Thread Nico Williams
On Tue, Oct 23, 2012 at 2:53 PM, Vladislav Bolkhovitin
<...@gmail.com> wrote:
>> As most of the time the order we need does not involve too many blocks
>> (certainly a lot less than all the cached blocks in the system or in
>> the disk's cache), that topological order isn't likely to be very
>> complicated, and I imagine it could be implemented efficiently in a
>> modern device, which already has complicated caching/garbage
>> collection/whatever going on internally. Particularly, it seems not
>> too hard to be implemented on top of SCSI's ordered/simple task mode?

If you have multiple layers involved (e.g., SQLite then the
filesystem, and if the filesystem is spread over multiple storage
devices), and if transactions are not bounded, and on top of that if
there are other concurrent writers to the same filesystem (even if not
the same files) then the set of blocks to write and internal ordering
can get complex.  In practice filesystems try to break these up into
large self-consistent chunks and write those -- ZFS does this, for
example -- and this is aided by the lack of transactional semantics in
the filesystem.

For SQLite with a VFS that talks [i]SCSI directly then things could be
much more manageable as there's only one write transaction in progress
at any given time.  But that's not realistic, except, perhaps, in some
embedded systems.

> Yes, SCSI has full support for ordered/simple commands designed exactly for
> that task: [...]
>
> [...]
>
> But historically for some reason Linux storage developers were stuck with
> the "barriers" concept, which is obviously not the same as ORDERED commands,
> hence had a lot of trouble with their ambiguous semantics. As far as I can tell
> the reason for that was some lack of sufficiently deep SCSI understanding
> (how to handle errors, the belief that ACA is something legacy from parallel
> SCSI times, etc.).

Barriers are a very simple abstraction, so there's that.

> Hopefully, eventually the storage developers will realize the value behind
> ordered commands and learn corresponding SCSI facilities to deal with them.
> It's quite easy to demonstrate this value, if you know where to look and
> don't blindly refuse the possibility. I have already tried to explain it a
> couple of times, but was not successful.

Exposing ordering of lower-layer operations to filesystem applications
is a non-starter.  About the only reasonable thing to do with a
filesystem is add barrier operations.  I know, you're talking about
lower layer capabilities, and SQLite could talk to that layer
directly, but let's face it: it's not likely to.

> Before that happens, people will keep returning again and again with those
> simple questions: why the queue must be flushed for any ordered operation?
> Isn't it an obvious overkill?

That [cache flushing] is not what's being asked for here.  Just a
light-weight barrier.  My proposal works without having to add new
system calls: a) use a COW format, b) have background threads doing
fsync()s, c) in each transaction's root block note the last
known-committed (from a completed fsync()) transaction's root block,
d) have an array of well-known uberblocks large enough to accommodate
as many transactions as possible without having to wait for any one
fsync() to complete, e) do not reclaim space from any one past
transaction until at least one subsequent transaction is fully
committed.  This obtains ACI- transaction semantics (survives power
failures but without durability for the last N transactions at
power-failure time) without requiring changes to the OS at all, and
with support for delayed D (durability) notification.

Nico
--
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-24 Thread Vladislav Bolkhovitin

杨苏立 Yang Su Li, on 10/11/2012 12:32 PM wrote:

I am not quite sure whether I should ask this question here, but in terms
of light weight barrier/fsync, could anyone tell me why the device
driver / OS provide the barrier interface rather than some other
abstraction anyway? I am sorry if this sounds like a stupid question
or it has been discussed before

I mean, most of the time, we only need some ordering in writes; not
complete order, but partial, very simple topological order. And a
barrier seems to be a heavyweight solution to achieve this anyway:
you have to finish all writes before the barrier, then start all
writes issued after the barrier. That is some ordering which is much
stronger than what we need, isn't it?

As most of the time the order we need does not involve too many blocks
(certainly a lot less than all the cached blocks in the system or in
the disk's cache), that topological order isn't likely to be very
complicated, and I imagine it could be implemented efficiently in a
modern device, which already has complicated caching/garbage
collection/whatever going on internally. Particularly, it seems not
too hard to be implemented on top of SCSI's ordered/simple task mode?


Yes, SCSI has full support for ordered/simple commands designed exactly for that 
task: to have steady flow of commands even in case when some of them are ordered. 
It also has the necessary facilities to handle command errors without unexpected 
reorders of their subsequent commands (ACA, etc.). Those allow getting full storage 
performance by fully "filling the pipe", using networking terms. I can easily imagine 
real life configs where it can bring 2+ times more performance than with queue 
flushing.


In fact, AFAIK, AIX requires storage to support ordered commands and ACA.

Implementation should be relatively easy as well, because all transports naturally 
have the link as the point of serialization, so all you need in a multithreaded 
environment is to pass some SN from the point when each ORDERED command is created to 
the point when it is sent to the link and make sure that no SIMPLE commands can ever 
cross ORDERED commands. You can see how it is implemented in SCST in an elegant 
and lockless manner (for SIMPLE commands).


But historically for some reason Linux storage developers were stuck with 
the "barriers" concept, which is obviously not the same as ORDERED commands, hence had 
a lot of trouble with their ambiguous semantics. As far as I can tell the reason for 
that was some lack of sufficiently deep SCSI understanding (how to handle errors, 
the belief that ACA is something legacy from parallel SCSI times, etc.).


Hopefully, eventually the storage developers will realize the value behind ordered 
commands and learn corresponding SCSI facilities to deal with them. It's quite 
easy to demonstrate this value, if you know where to look and don't blindly 
refuse the possibility. I have already tried to explain it a couple of times, 
but was not successful.


Before that happens, people will keep returning again and again with those simple 
questions: why the queue must be flushed for any ordered operation? Isn't it an 
obvious overkill?


Vlad
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-12 Thread Nico Williams
On Fri, Oct 12, 2012 at 5:14 PM, Simon Slavin  wrote:
> I think I understand what you're asking for, but I see no point in being 
> informed about D, because I can't see anything useful a program can do if the 
> transaction gets marked 'complete' but D doesn't succeed.  Either you see D 
> as being part of a transaction, or you don't.

A server application might not know which the client wants.  And sure,
the client could tell it.  But I think that to be general and thus
support arbitrary layering, having an option to pass this indication
through is best.

> By all means, use a PRAGMA to say whether you do need D or not.  SQLite 
> already has a dozen ways to do this (though it is very puzzling to try to 
> work out what combination of PRAGMAs to use under your conditions).  But 
> report D (or ignore D) as part of the result of 'COMMIT'.  Don't make an 
> extra status for "committed but not durable".

We disagree then.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-12 Thread Simon Slavin

On 12 Oct 2012, at 10:23pm, Nico Williams  wrote:

> Here's some more examples of where delayed-D ACKs would be nice:
> distributed services.  These are really just a variant of my earlier
> UI example, but still: a server might respond with an ACK as soon as a
> transaction completes with ACI and again when D is achieved, thus
> allowing different clients to choose when to demand D independently of
> each other.

I think I understand what you're asking for, but I see no point in being 
informed about D, because I can't see anything useful a program can do if the 
transaction gets marked 'complete' but D doesn't succeed.  Either you see D as 
being part of a transaction, or you don't.

By all means, use a PRAGMA to say whether you do need D or not.  SQLite already 
has a dozen ways to do this (though it is very puzzling to try to work out what 
combination of PRAGMAs to use under your conditions).  But report D (or ignore 
D) as part of the result of 'COMMIT'.  Don't make an extra status for 
"committed but not durable'.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-12 Thread Nico Williams
On Fri, Oct 12, 2012 at 4:08 PM, Simon Slavin  wrote:
> If all you're doing is showing something on a display that's fine.  But if 
> that's what you're doing I see no point in distinguishing between 'success' 
> and 'durable'.  As far as I can see your program has nothing to do between 
> the two statuses.

D.R. Hipp asks for a write barrier so that ACI (no D) can be
implemented easily.  It turns out that this is possible (as described
earlier).  I also suggest that "delayed D" might be of use.  I don't
have great examples where one might want delayed D, but still, I would
rather have an API for learning when D is achieved.  You propose not
having such an API at all (or at least imply it).  Why wait for
someone to ask for what is clearly a sensible API to have?  Sometimes
the existence of an API can foster development of consumers.
Sometimes an API w/o initial consumers dies on the vine.  It's a
matter of judgement whether to add an API before having definite
consumers, but IMO it's worth doing in this case.

Here's some more examples of where delayed-D ACKs would be nice:
distributed services.  These are really just a variant of my earlier
UI example, but still: a server might respond with an ACK as soon as a
transaction completes with ACI and again when D is achieved, thus
allowing different clients to choose when to demand D independently of
each other.

A logging system might want ACID for critical messages and ACI for
all others.

Durability is generally required in production systems.  So good
examples where it's not necessary are hard to come by.

Yet another variant on my first example would be a mobile note taking
app that uses color (or an icon) to indicate when the current state of
the document is safely committed on disk (flash, really).  The mobile
device environment generally requires D but doesn't require D before
proceeding.  Perhaps this is the best example I have so far.

Nico
--
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-12 Thread Simon Slavin

On 12 Oct 2012, at 10:01pm, Nico Williams  wrote:

> On Fri, Oct 12, 2012 at 3:53 PM, Simon Slavin  wrote:
>> That's an interesting idea.  I have a question.  Suppose your program 
>> received the 'success' result for a transaction and carried on to do other 
>> transactions.  Later you test to see whether the transaction is durable and 
>> find that it isn't. What would be a useful thing to do at that point ?
> 
> It depends.  For most such applications I'd say this is just not
> appropriate.  But you might have an application where this is fine,
> provided you have a way to reflect the D/not-D state of the initial
> transaction in the dependent ones.

To do that properly, you would need a multi-value variable.  Because for every 
operation you actually have three statuses:

1) Operation hasn't been started yet.
2) Operation is just about to start
   ... here the computer actually tries to do the operation
3a) Operation complete and failed
3b) Operation complete and succeeded

The problem with your process as stated is that the operation may be complete 
(correctly written to disk and therefore durable) but the software may not yet 
have set the flag to indicate that.  If your software looks at the status and 
sees 'not durable' it might take 'not durable' actions on something which is 
actually durable.

> What I had in mind was something more like a booking system: let the
> user know that the transaction completed by updating the page so as to
> make the form go away, but leave an indicator (e.g., animated
> ellipsis) to indicate that the transaction is not quite completed.

If all you're doing is showing something on a display that's fine.  But if 
that's what you're doing I see no point in distinguishing between 'success' and 
'durable'.  As far as I can see your program has nothing to do between the two 
statuses.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-12 Thread Nico Williams
On Fri, Oct 12, 2012 at 3:53 PM, Simon Slavin  wrote:
> That's an interesting idea.  I have a question.  Suppose your program 
> received the 'success' result for a transaction and carried on to do other 
> transactions.  Later you test to see whether the transaction is durable and 
> find that it isn't.  What would be a useful thing to do at that point ?

It depends.  For most such applications I'd say this is just not
appropriate.  But you might have an application where this is fine,
provided you have a way to reflect the D/not-D state of the initial
transaction in the dependent ones.

What I had in mind was something more like a booking system: let the
user know that the transaction completed by updating the page so as to
make the form go away, but leave an indicator (e.g., animated
ellipsis) to indicate that the transaction is not quite completed.

Nico
--
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-12 Thread Simon Slavin

On 12 Oct 2012, at 6:00pm, Nico Williams  wrote:

> I do think that applications should be able to request deferred
> durability *and* find out when a given transaction has indeed become
> durable.
> 
> A distinction between success and durability in the API might bleed
> into UIs too.  Imagine a web browser interface where "submit" does
> some XHR that causes a DB transaction to be run and committed, the
> page to be updated to show that the transaction succeeded, and
> finally, another XHR is used to find when the transaction is durable
> and the page is then updated again to reflect as much.

That's an interesting idea.  I have a question.  Suppose your program received 
the 'success' result for a transaction and carried on to do other transactions. 
 Later you test to see whether the transaction is durable and find that it 
isn't.  What would be a useful thing to do at that point ?

Simon
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-12 Thread Nico Williams
On Fri, Oct 12, 2012 at 2:58 AM, Dan Kennedy  wrote:
> On 10/11/2012 11:38 PM, Nico Williams wrote:
>> There is something you can do: [...]
>
> SQLite WAL mode comes close to that if you run your checkpoints
> in the background. [...]

Right.  WAL mode comes close to being a COW on-disk format.

> Omitting the D in ACID changes everything. With the D in, you need to
> fsync() after every transaction. Without it, you need to fsync() before
> reclaiming space (i.e. when overwriting old data with new - you need
> to be sure that the old data will not be required following recovery
> from a power failure, which means an fsync()).

Exactly.  You've put it more succinctly than I.

I do think that applications should be able to request deferred
durability *and* find out when a given transaction has indeed become
durable.

A distinction between success and durability in the API might bleed
into UIs too.  Imagine a web browser interface where "submit" does
some XHR that causes a DB transaction to be run and committed, the
page to be updated to show that the transaction succeeded, and
finally, another XHR is used to find when the transaction is durable
and the page is then updated again to reflect as much.

For a user "success" and "durable" means that the opposite of success
may be that they have to update a form and try again, so knowing that
the transaction succeeded is useful, in many cases even if the
transaction is not yet durable (because D failure is extremely rare).
To the user "durable" means that the transaction is truly complete.
Is this distinction something that users can be expected to
understand?  I believe that they can understand the distinction
intuitively, so the UI presentation needs to be finely tuned, and
having the ability to build such a UI means marking this distinction
in the API.  Is it valuable to expose this distinction to the user?  I
think that depends on the application -- it just shouldn't be
precluded.

Nico
--
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-12 Thread Pavel Ivanov
Well, an article on write barriers published in May 2007 can't
contradict the statement that barriers don't exist these days. :)

Pavel

On Fri, Oct 12, 2012 at 5:38 AM, Black, Michael (IS)
<michael.bla...@ngc.com> wrote:
> There isn't?  Somebody sure wasted their time on this article then...
> http://www.linux-magazine.com/w3/issue/78/Write_Barriers.pdf
>
> Michael D. Black
> Senior Scientist
> Advanced Analytics Directorate
> Advanced GEOINT Solutions Operating Unit
> Northrop Grumman Information Systems
>
> 
> From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] on 
> behalf of Christoph Hellwig [h...@infradead.org]
> Sent: Thursday, October 11, 2012 12:41 PM
> To: ? Yang Su Li
> Cc: linux-fsde...@vger.kernel.org; General Discussion of SQLite Database; 
> linux-ker...@vger.kernel.org; d...@hwaci.com
> Subject: EXT :Re: [sqlite] light weight write barriers
>
> On Thu, Oct 11, 2012 at 11:32:27AM -0500, ? Yang Su Li wrote:
>> I am not quite sure whether I should ask this question here, but in terms
>> of light weight barrier/fsync, could anyone tell me why the device
>> driver / OS provide the barrier interface rather than some other
>> abstraction anyway? I am sorry if this sounds like a stupid question
>> or it has been discussed before
>
> It does not.  Except for the legacy mount option naming there is no such
> thing as a barrier in Linux these days.
>
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-12 Thread Black, Michael (IS)
There isn't?  Somebody sure wasted their time on this article then...
http://www.linux-magazine.com/w3/issue/78/Write_Barriers.pdf

Michael D. Black
Senior Scientist
Advanced Analytics Directorate
Advanced GEOINT Solutions Operating Unit
Northrop Grumman Information Systems


From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] on 
behalf of Christoph Hellwig [h...@infradead.org]
Sent: Thursday, October 11, 2012 12:41 PM
To: ? Yang Su Li
Cc: linux-fsde...@vger.kernel.org; General Discussion of SQLite Database; 
linux-ker...@vger.kernel.org; d...@hwaci.com
Subject: EXT :Re: [sqlite] light weight write barriers

On Thu, Oct 11, 2012 at 11:32:27AM -0500, ? Yang Su Li wrote:
> I am not quite sure whether I should ask this question here, but in terms
> of light weight barrier/fsync, could anyone tell me why the device
> driver / OS provide the barrier interface rather than some other
> abstraction anyway? I am sorry if this sounds like a stupid question
> or it has been discussed before

It does not.  Except for the legacy mount option naming there is no such
thing as a barrier in Linux these days.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-12 Thread Christoph Hellwig
On Thu, Oct 11, 2012 at 11:32:27AM -0500, ? Yang Su Li wrote:
> I am not quite sure whether I should ask this question here, but in terms
> of light weight barrier/fsync, could anyone tell me why the device
> driver / OS provide the barrier interface rather than some other
> abstraction anyway? I am sorry if this sounds like a stupid question
> or it has been discussed before

It does not.  Except for the legacy mount option naming there is no such
thing as a barrier in Linux these days.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-12 Thread Dan Kennedy

On 10/11/2012 11:38 PM, Nico Williams wrote:

On Wed, Oct 10, 2012 at 12:48 PM, Richard Hipp  wrote:

Could you list the requirements of such a light weight barrier?
i.e. what would it need to do minimally, what's different from
fsync/fdatasync ?


For SQLite, the write barrier needs to involve two separate inodes.  The
requirement is this:


...


Note also that when fsync() works as advertised, SQLite transactions are
ACID.  But when fsync() is reduced to a write-barrier, we lose the D
(durable) and transactions are only ACI.  In our experience, nobody really
cares very much about durable across a power-loss.  People are mainly
interested in Atomic, Consistent, and Isolated.  If you take a power loss
and then after reboot you find the 10 seconds of work prior to the power
loss is missing, nobody much cares about that as long as all of the prior
work is still present and consistent.


There is something you can do: use a combination of COW on-disk
formats in such a way that it's possible to detect partially-committed
transactions and rollback to the last good known root, and
backgrounded fsync()s (i.e., in a separate thread, without waiting for
the fsync() to complete).


SQLite WAL mode comes close to that if you run your checkpoints
in the background. Following a power failure, those transactions that
have been checkpointed to the database file are assumed to have been
synced. Then SQLite uses checksums to determine the subset of
transactions in the WAL file that are intact.

I say close, because if you keep on writing to the db while the
checkpoint is running you end up with the WAL file growing indefinitely.
So it doesn't quite work.
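
For what it's worth, the closest an application can get to this trade-off with
stock SQLite today is WAL mode with synchronous=NORMAL: commits don't fsync(),
only checkpoints do, so a power loss can drop the newest transactions but the
database stays consistent. A minimal sketch, with error handling omitted:

    #include <sqlite3.h>

    /* Open a database with relaxed durability: ACI on every commit, D only
     * as of the last completed checkpoint. */
    int open_relaxed(const char *path, sqlite3 **db)
    {
        if (sqlite3_open(path, db) != SQLITE_OK)
            return -1;
        sqlite3_exec(*db, "PRAGMA journal_mode=WAL;",   0, 0, 0);
        sqlite3_exec(*db, "PRAGMA synchronous=NORMAL;", 0, 0, 0);
        return 0;
    }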

Omitting the D in ACID changes everything. With the D in, you need to
fsync() after every transaction. Without it, you need to fsync() before
reclaiming space (i.e. when overwriting old data with new - you need
to be sure that the old data will not be required following recovery
from a power failure, which means an fsync()).

Dan.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-11 Thread Nico Williams
Lying hardware is a different problem.  Richard was asking for something else.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-11 Thread Simon Slavin

On 11 Oct 2012, at 10:41pm, Nico Williams  wrote:

> On Thu, Oct 11, 2012 at 11:59 AM, Simon Slavin  wrote:
>> On 11 Oct 2012, at 5:38pm, Nico Williams  wrote:
>>> There is something you can do: use a combination of COW on-disk
>>> formats in such a way that it's possible to detect partially-committed
>>> transactions and rollback to the last good known root
>> 
>> This is actually the problem, not the solution.  Traditional disk drivers 
>> for spinning disks change the order in which they write things to disk.  
>> They will buffer several write commands up and notice that on the way to 
>> moving the write head to do write #1 the write head will pass over the 
>> correct spot to do write #4, so they will do write #4 first.  Many disks 
>> will do this even with the disk jumpers set to enforce in-order writing: 
>> they lie.
> 
> You missed something: because fsync()s are done (in the background),
> you are guaranteed that transactions do eventually make it (in order,
> up to the point of the fsync()) onto disk. 

Unfortunately they are not guaranteed to be made in order.  fsync() depends on 
the hard disk waiting until its writes are done before the driver call returns. 
 In other words ...

1. Your program calls the operating system's fsync() or equivalent
   (Technically speaking you may also want to fsync() the directory containing 
the file.)
2. Operating system flushes impending writes to the hard disk by calling 
the storage driver
3.  Storage driver receives pending writes
4.  Storage driver converts those changes into instructions for the storage 
hardware
5.  Physical changes are made within the storage hardware
6.  Storage driver waits until they have all actually been made
7.  Once those changes have been made storage driver reports success to 
operating system
8. Operating system (fsync()) loops waiting for the drive's driver to 
report that all writes have been done
9. Operating system returns from fsync() call reporting success
a. Your program can proceed

The problem is that a standard desktop computer doesn't do 5 and may not do 8 
either.  The storage drivers save a great deal of time by reporting success 
before the disk's surface has been changed, rather than waiting for the physical changes. 
 And they will immediately accept more changes, disrespecting the write 
barrier.  At the driver level, the barrier between transactions gets lost.

Server-level hardware (not most popular cheap hard disks) can be configured to 
do this properly, usually using mini-switches or jumper settings.  But if you 
try to do that to a standard desktop computer you will find it slows down so 
much it's unusable: type a few characters in Word and you can wait two or three 
seconds to see them on the screen.  That's why the makers of normal computers 
don't do it.  In addition, many operating systems also don't implement fsync() 
properly, for the same reason: to make the computer feel faster.  You can see 
something about this in section 9.2 of



Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-11 Thread Nico Williams
On Thu, Oct 11, 2012 at 11:59 AM, Simon Slavin  wrote:
> On 11 Oct 2012, at 5:38pm, Nico Williams  wrote:
>> There is something you can do: use a combination of COW on-disk
>> formats in such a way that it's possible to detect partially-committed
>> transactions and rollback to the last good known root
>
> This is actually the problem, not the solution.  Traditional disk drivers for 
> spinning disks change the order in which they write things to disk.  They 
> will buffer several write commands up and notice that on the way to moving 
> the write head to do write #1 the write head will pass over the correct spot 
> to do write #4, so they will do write #4 first.  Many disks will do this even 
> with the disk jumpers set to enforce in-order writing: they lie.

You missed something: because fsync()s are done (in the background),
you are guaranteed that transactions do eventually make it (in order,
up to the point of the fsync()) onto disk.  And this can be marked --
that is, each transaction's root block can name the last transaction
that should be fully on-disk at that point.  Then when you open the DB
you validate all transactions from the latest to the last one known to
have reached disk in toto.

Nico
--
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-11 Thread Nico Williams
To expand a bit, the on-disk format needs to allow the roots of N of
the last transactions to be/remain reachable at all times.  At open
time you look for the latest transaction, verify that it has been
written[0] completely, then use it, else look for the preceding
transaction, verify it, and so on.

N needs to be at least 2: the last and the preceding transactions.  No
blocks should be freed or reused for any transactions still in use or
possible use (e.g., for power failure recovery).  For high read
concurrency you can allow connections to lock a past transaction so
that no blocks are freed that are needed to access the DB at that
state.
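
In outline, the open-time scan might look something like this; every type and
helper below is hypothetical and just stands in for whatever on-disk format is
actually in use:

    struct txn_root;                                   /* one transaction's root block */
    struct txn_root *read_root(int fd, int slot);      /* NULL if the slot is unused   */
    int root_is_intact(int fd, const struct txn_root *r);  /* manifest/checksums ok    */

    /* Walk the N well-known root slots from newest to oldest (assuming slot
     * order matches age order, for simplicity) and return the newest
     * transaction whose blocks were all written completely. */
    struct txn_root *recover_latest(int fd, int n_slots)
    {
        for (int slot = n_slots - 1; slot >= 0; slot--) {
            struct txn_root *r = read_root(fd, slot);
            if (r && root_is_intact(fd, r))
                return r;
        }
        return 0;   /* nothing valid: empty file or unrecoverable */
    }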

This all goes back to 1980s DB and filesystem concepts.  See, for
example, the BSD 4.4 Log-Structured Filesystem.  (I mention this in case
there are concerns about patents, though IANAL and I make no
particular assertions here other than that there is plenty of old
prior art and expired patents that can probably be used to obtain
sufficient certainty as to the patent law risks in the approach
described herein.)

[0] E.g., check a transaction block manifest and check that those
blocks were written correctly; or traverse the tree looking for
differences to the previous transaction; this may require checking
block contents checksums.

Nico
--
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] light weight write barriers

2012-10-11 Thread Nico Williams
On Wed, Oct 10, 2012 at 12:48 PM, Richard Hipp  wrote:
>> Could you list the requirements of such a light weight barrier?
>> i.e. what would it need to do minimally, what's different from
>> fsync/fdatasync ?
>
> For SQLite, the write barrier needs to involve two separate inodes.  The
> requirement is this:

...

> Note also that when fsync() works as advertised, SQLite transactions are
> ACID.  But when fsync() is reduced to a write-barrier, we lose the D
> (durable) and transactions are only ACI.  In our experience, nobody really
> cares very much about durable across a power-loss.  People are mainly
> interested in Atomic, Consistent, and Isolated.  If you take a power loss
> and then after reboot you find the 10 seconds of work prior to the power
> loss is missing, nobody much cares about that as long as all of the prior
> work is still present and consistent.

There is something you can do: use a COW on-disk format arranged so that
partially-committed transactions can be detected and rolled back to the
last known-good root, combined with backgrounded fsync()s (i.e., issued
from a separate thread, without waiting for the fsync() to complete).
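
A minimal sketch of the backgrounded fsync() half, assuming an invented
queue between the commit path and a worker thread (sync_queue_pop() blocks
until a request is available; mark_durable() records the progress that later
commits can name in their root blocks):

    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>

    struct sync_queue;                                /* hypothetical */
    struct sync_req { int fd; uint64_t txn_id; };

    struct sync_req *sync_queue_pop(struct sync_queue *q);  /* hypothetical */
    void mark_durable(uint64_t txn_id);                      /* hypothetical */

    /* The commit path enqueues a request instead of calling fsync() itself;
     * this worker performs the fsync()s in commit order and records how far
     * durability has progressed.  Shaped as a pthread start routine. */
    void *sync_worker(void *arg)
    {
        struct sync_queue *q = arg;
        struct sync_req *r;
        while ((r = sync_queue_pop(q)) != NULL) {
            if (fsync(r->fd) == 0)
                mark_durable(r->txn_id);
            free(r);
        }
        return NULL;
    }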

Nico
--


Re: [sqlite] light weight write barriers

2012-10-11 Thread 杨苏立 Yang Su Li
I am not quite sure whether I should ask this question here, but in terms
of a lightweight barrier/fsync, could anyone tell me why the device
driver / OS provides the barrier interface rather than some other
abstraction?  I am sorry if this sounds like a stupid question or if it
has been discussed before.

I mean, most of the time we only need some ordering in writes; not a
complete order, but a partial, very simple topological order.  And a
barrier seems to be a heavyweight way to achieve this: you have to
finish all writes issued before the barrier, then start all writes
issued after the barrier.  That ordering is much stronger than what we
need, isn't it?

As most of the time the ordering we need does not involve too many blocks
(certainly far fewer than all the cached blocks in the system or in
the disk's cache), that topological order isn't likely to be very
complicated, and I imagine it could be implemented efficiently in a
modern device, which already has complicated caching/garbage
collection/whatever going on internally.  In particular, it seems not
too hard to implement on top of SCSI's ordered/simple task attributes
(I believe Windows does this to an extent, but I am not quite sure).
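
Purely to illustrate the shape of the interface being asked for -- nothing
like this exists in Linux today, and the names are invented -- the ordering
could be expressed as explicit dependencies between writes, leaving the
device free to reorder everything else:

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    typedef uint64_t write_tag_t;

    /* Invented interface: submit a write that may be made durable only after
     * every write named in deps[] is.  Returns a tag that later writes can
     * depend on.  Writes with no dependency relation remain unordered. */
    write_tag_t submit_ordered_write(int fd, const void *buf, size_t len,
                                     off_t off, const write_tag_t *deps,
                                     size_t ndeps);

    /* For a journal-style commit the DAG is tiny:
     *   t1 = journal records           (no dependencies)
     *   t2 = journal commit record     (after t1)
     *   t3 = database page writes      (after t2)
     * Only those two edges are enforced. */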

Thanks a lot

Suli


On Wed, Oct 10, 2012 at 12:17 PM, Andi Kleen  wrote:
> Richard Hipp writes:
>>
>> We would really, really love to have some kind of write-barrier that is
>> lighter than fsync().  If there is some method other than fsync() for
>> forcing a write-barrier on Linux that we don't know about, please enlighten
>> us.
>
> Could you list the requirements of such a light weight barrier?
> i.e. what would it need to do minimally, what's different from
> fsync/fdatasync ?
>
> -Andi
>
> --
> a...@linux.intel.com -- Speaking for myself only


Re: [sqlite] light weight write barriers

2012-10-10 Thread Richard Hipp
On Wed, Oct 10, 2012 at 1:17 PM, Andi Kleen  wrote:

> Richard Hipp writes:
> >
> > We would really, really love to have some kind of write-barrier that is
> > lighter than fsync().  If there is some method other than fsync() for
> > forcing a write-barrier on Linux that we don't know about, please
> enlighten
> > us.
>
> Could you list the requirements of such a light weight barrier?
> i.e. what would it need to do minimally, what's different from
> fsync/fdatasync ?
>

For SQLite, the write barrier needs to involve two separate inodes.  The
requirement is this:

After rebooting from a power loss or hard reset, one or the other of the
following statements must be true for any reader process that examines the
two inodes associated with the write barrier:  (1) it can see the complete
results of every write operation (and unlink) that occurred before the write
barrier, or (2) it can see no results from any write operation (or unlink)
that occurred after the write barrier.

In the case of SQLite, the write-barrier never needs to involve more than
two inodes:  the original database file and the transaction journal (which
might be either a rollback journal or a write-ahead log, depending on how
SQLite is configured).  But I would suppose that a general-purpose write
barrier mechanism should support an arbitrary number of inodes.
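
To make the two-inode case concrete, here is roughly how a rollback-journal
commit might use such a primitive.  write_barrier2() and the step helpers
are invented for illustration, and the sequence is a simplification of (not
a quote from) SQLite's actual commit logic; today each write_barrier2()
call site is where an fsync() goes.

    /* Invented: order all writes issued to either file before this call
     * ahead of all writes issued to either file after it, without waiting
     * for the media. */
    int write_barrier2(int fd_a, int fd_b);

    /* Hypothetical helpers for the individual steps. */
    int write_journal_pages(int journal_fd);   /* original page images    */
    int write_journal_header(int journal_fd);  /* makes the journal valid */
    int write_db_pages(int db_fd);             /* modified database pages */
    int finalize_journal(int journal_fd);      /* delete/zero => commit   */

    int commit(int journal_fd, int db_fd)
    {
        if (write_journal_pages(journal_fd) != 0) return -1;
        if (write_barrier2(journal_fd, db_fd) != 0) return -1;  /* body before header   */
        if (write_journal_header(journal_fd) != 0) return -1;
        if (write_barrier2(journal_fd, db_fd) != 0) return -1;  /* journal before db    */
        if (write_db_pages(db_fd) != 0) return -1;
        if (write_barrier2(journal_fd, db_fd) != 0) return -1;  /* db before finalize   */
        return finalize_journal(journal_fd);
    }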

Fsync() is a very close approximation to a write barrier since (when it
works as advertised) all pending I/O reaches persistent storage before the
fsync() returns.  And since no subsequent I/Os are issued until after the
fsync() returns, the requirements above are clearly satisfied.  But it really
isn't necessary to actually wait for content to reach persistent storage, as
long as we know that content will not reach persistent storage out of order.

Note also that when fsync() works as advertised, SQLite transactions are
ACID.  But when fsync() is reduced to a write-barrier, we lose the D
(durable) and transactions are only ACI.  In our experience, nobody really
cares very much about durability across a power loss.  People are mainly
interested in Atomic, Consistent, and Isolated.  If you take a power loss
and then after reboot you find that the 10 seconds of work prior to the
power loss is missing, nobody much cares about that as long as all of the
prior work is still present and consistent.



>
> -Andi
>
> --
> a...@linux.intel.com -- Speaking for myself only
>



-- 
D. Richard Hipp
d...@sqlite.org