Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous

2013-05-23 Thread Thanumalayan Sankaranarayana Pillai
Hi Marc,

Thanks for your comments! I just got confused that some SQLite webpages (
http://www.sqlite.org/transactional.html,
http://www.sqlite.org/features.html) mention that transactions are durable
after a power loss (the D in ACID); nowhere has it been mentioned that
"immediate durability after a commit" is true only when the journal_mode is
explicitly set to WAL. The documentation, however, was thorough regarding
stuff like bad disks and bad OSes, and how they affect the
integrity/corruption of the database.

My view had been fully biased: I was looking at some applications that use
MySQL and such, acting as nodes in a network, and they require this
"immediate durability" guarantee. I totally understand now that SQLite is
aimed at an entirely different set of applications that don't require
"immediate durability", but would rather that the database is fast.

Thanks again,
Thanu


On Thu, May 23, 2013 at 8:00 AM, Marc L. Allen
wrote:

> Just to throw in my $0.02 as a user
>
> Given the SQL stream of...
>
> 
> COMMIT
> 
>
> Vs.
>
> 
> 
> 
>
> Except in cases where, in the first example, I have time to inform someone
> about the COMMIT before the power loss, there's no functional difference
> between the two events.  I would hate to think I would ever demand that
> SQLite guarantee the commit occurs regardless of how quickly the power loss
> happens after the commit statement.
>
> For a huge majority of the applications I've dealt with (and I say huge
> instead of all only in case there's one I've forgotten about), the timing
> difference doesn't matter.  What matters is that when I come back up I have
> an intact database.
>
> -Original Message-
> From: sqlite-users-boun...@sqlite.org [mailto:
> sqlite-users-boun...@sqlite.org] On Behalf Of thanumalayan mad
> Sent: Wednesday, May 22, 2013 8:31 AM
> To: Richard Hipp
> Cc: General Discussion of SQLite Database
> Subject: Re: [sqlite] Potential bug in crash-recovery code: unlink() and
> friends are not synchronous
>
> I do not observe any loss in durability in WAL mode: it works totally fine.
>
> As for the documentation, http://www.sqlite.org/transactional.html and
> http://www.sqlite.org/features.html claim that SQLite is durable during
> power failures; and DELETE is the default journal_mode. Also, other pages,
> http://www.sqlite.org/pragma.html#pragma_synchronous,
> http://www.sqlite.org/atomiccommit.html, and
> http://www.sqlite.org/faq.html, made me think that rollback journaling
> ensures durability (given a honest disk and a "honest" OS), although those
> pages do not "legally" say that; I'm not sure if others would understand it
> this way though. The usual opinion in blogosphere (and forums) also seems
> to be that SQLite is, by default, durable across power failures, though
> they might be meaning only about a "5 second eventual durability". Finally,
> (I hope I do not come across here as being authoritative), a quick fix
> might be explicitly mentioning somewhere in the documentation that DELETE
> and TRUNCATE modes do not ensure immediate durability after a power loss;
> this would combat any wrongly-understood claims in the rest of the
> documentation.
>
> Also, not to spam, but it would be great if you could answer these
> questions for my research (you might send me a reply directly without going
> through the mailing list): [a] Was it always understood that unlink() and
> ftruncate() are not synchronous, and that SQLite transactions in DELETE
> mode are not immediately-durable in Linux; or had you initially
> misunderstood the semantics of those calls, or left-off the fsync() because
> of a typo error? [b] While designing the crash-tests, were the semantics of
> the calls in Unix understood? What if ftruncate() not being synchronous did
> lead to a consistency-loss? Was it reasoned-out that the non-synchronous
> ftruncate would not produce corruption? [c] How much of a loss in
> durability (what other than 5 seconds) would be "good enough" in most
> cases? ... Again, sorry for the spam; my research is trying to make
> sense of the flushing-mess in the entire storage stack, and feedback would
> be extremely useful.
>
> PS: @Richard Hipp and other developers - thanks for SQLite! Amazing piece
> of software.
>
> --
> Thanumalayan Sankaranarayana Pillai
> (Graduate student at the University of Wisconsin-Madison)
>
>
> On Wed, May 22, 2013 at 5:49 AM, Richard Hipp  wrote:
>
> >
> >
> > On Sat, May 18, 2013 at 4:41 AM, thanumalayan mad  >wrote:
> >
> >>
> >> Expected result: You always find that the transaction had been executed.
> >> Observed result: You sometimes find that the transaction did not
> execute.
> >>
> >
> > The core team has discussed this.  In order to avoid a substantial
> > performance hit against transaction COMMIT, we have chosen to not do
> > fsyncs on the directory when a file is unlinked, and thus to allow
> > loss of durability following 

Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous

2013-05-23 Thread Marc L. Allen
Just to throw in my $0.02 as a user

Given the SQL stream of...


COMMIT


Vs.





Except in cases where, in the first example, I have time to inform someone 
about the COMMIT before the power loss, there's no functional difference 
between the two events.  I would hate to think I would ever demand that SQLite 
guarantee the commit occurs regardless of how quickly the power loss happens 
after the commit statement.

For a huge majority of the applications I've dealt with (and I say huge instead 
of all only in case there's one I've forgotten about), the timing difference 
doesn't matter.  What matters is that when I come back up I have an intact 
database.

-Original Message-
From: sqlite-users-boun...@sqlite.org [mailto:sqlite-users-boun...@sqlite.org] 
On Behalf Of thanumalayan mad
Sent: Wednesday, May 22, 2013 8:31 AM
To: Richard Hipp
Cc: General Discussion of SQLite Database
Subject: Re: [sqlite] Potential bug in crash-recovery code: unlink() and 
friends are not synchronous

I do not observe any loss in durability in WAL mode: it works totally fine.

As for the documentation, http://www.sqlite.org/transactional.html and 
http://www.sqlite.org/features.html claim that SQLite is durable during power 
failures; and DELETE is the default journal_mode. Also, other pages, 
http://www.sqlite.org/pragma.html#pragma_synchronous,
http://www.sqlite.org/atomiccommit.html, and http://www.sqlite.org/faq.html, 
made me think that rollback journaling ensures durability (given a honest disk 
and a "honest" OS), although those pages do not "legally" say that; I'm not 
sure if others would understand it this way though. The usual opinion in 
blogosphere (and forums) also seems to be that SQLite is, by default, durable 
across power failures, though they might be meaning only about a "5 second 
eventual durability". Finally, (I hope I do not come across here as being 
authoritative), a quick fix might be explicitly mentioning somewhere in the 
documentation that DELETE and TRUNCATE modes do not ensure immediate durability 
after a power loss; this would combat any wrongly-understood claims in the rest 
of the documentation.

Also, not to spam, but it would be great if you could answer these questions 
for my research (you might send me a reply directly without going through the 
mailing list): [a] Was it always understood that unlink() and
ftruncate() are not synchronous, and that SQLite transactions in DELETE mode 
are not immediately-durable in Linux; or had you initially misunderstood the 
semantics of those calls, or left-off the fsync() because of a typo error? [b] 
While designing the crash-tests, were the semantics of the calls in Unix 
understood? What if ftruncate() not being synchronous did lead to a 
consistency-loss? Was it reasoned-out that the non-synchronous ftruncate would 
not produce corruption? [c] How much of a loss in durability (what other than 5 
seconds) would be "good enough" in most cases? ... Again, sorry for the 
spam; my research is trying to make sense of the flushing-mess in the entire 
storage stack, and feedback would be extremely useful.

PS: @Richard Hipp and other developers - thanks for SQLite! Amazing piece of 
software.

--
Thanumalayan Sankaranarayana Pillai
(Graduate student at the University of Wisconsin-Madison)


On Wed, May 22, 2013 at 5:49 AM, Richard Hipp  wrote:

>
>
> On Sat, May 18, 2013 at 4:41 AM, thanumalayan mad wrote:
>
>>
>> Expected result: You always find that the transaction had been executed.
>> Observed result: You sometimes find that the transaction did not execute.
>>
>
> The core team has discussed this.  In order to avoid a substantial 
> performance hit against transaction COMMIT, we have chosen to not do 
> fsyncs on the directory when a file is unlinked, and thus to allow 
> loss of durability following a power loss event.  ACI without the D is 
> still guaranteed.  But not the D.  The overwhelming majority of 
> applications care not one wit about durability following power loss.  
> For most applications, it is sufficient that the file is uncorrupted.  
> If recovery gives you a snapshot of the file as it existed 5 seconds 
> prior to the power loss, that's fine.
>
> WAL-mode transactions should be durable across power-loss events.  So 
> if durability is vitally important to you, you can always set PRAGMA 
> journal_mode=WAL.  Are you observing loss of durability following 
> power loss in WAL mode?
>
> Is there any place in the documentation that we have overlooked where 
> SQLite claims to be durable across a power loss in rollback mode?
>
>
>
> --
> D. Richard Hipp
> d...@sqlite.org
>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


This email and any attachments are only for use by the intended recipient(s) 
and may contain legally privileged, confidential, proprietary or otherwise 

Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous

2013-05-23 Thread thanumalayan mad
I do not observe any loss in durability in WAL mode: it works totally fine.

As for the documentation, http://www.sqlite.org/transactional.html and
http://www.sqlite.org/features.html claim that SQLite is durable during
power failures; and DELETE is the default journal_mode. Also, other pages,
http://www.sqlite.org/pragma.html#pragma_synchronous,
http://www.sqlite.org/atomiccommit.html, and
http://www.sqlite.org/faq.html, made
me think that rollback journaling ensures durability (given a honest disk
and a "honest" OS), although those pages do not "legally" say that; I'm not
sure if others would understand it this way though. The usual opinion in
blogosphere (and forums) also seems to be that SQLite is, by default,
durable across power failures, though they might be meaning only about a "5
second eventual durability". Finally, (I hope I do not come across here as
being authoritative), a quick fix might be explicitly mentioning somewhere
in the documentation that DELETE and TRUNCATE modes do not ensure immediate
durability after a power loss; this would combat any wrongly-understood
claims in the rest of the documentation.

Also, not to spam, but it would be great if you could answer these
questions for my research (you might send me a reply directly without going
through the mailing list): [a] Was it always understood that unlink() and
ftruncate() are not synchronous, and that SQLite transactions in DELETE
mode are not immediately-durable in Linux; or had you initially
misunderstood the semantics of those calls, or left-off the fsync() because
of a typo error? [b] While designing the crash-tests, were the semantics of
the calls in Unix understood? What if ftruncate() not being synchronous did
lead to a consistency-loss? Was it reasoned-out that the non-synchronous
ftruncate would not produce corruption? [c] How much of a loss in
durability (what other than 5 seconds) would be "good enough" in most
cases? ... Again, sorry for the spam; my research is trying to make
sense of the flushing-mess in the entire storage stack, and feedback would
be extremely useful.

PS: @Richard Hipp and other developers - thanks for SQLite! Amazing piece
of software.

--
Thanumalayan Sankaranarayana Pillai
(Graduate student at the University of Wisconsin-Madison)


On Wed, May 22, 2013 at 5:49 AM, Richard Hipp  wrote:

>
>
> On Sat, May 18, 2013 at 4:41 AM, thanumalayan mad wrote:
>
>>
>> Expected result: You always find that the transaction had been executed.
>> Observed result: You sometimes find that the transaction did not execute.
>>
>
> The core team has discussed this.  In order to avoid a substantial
> performance hit against transaction COMMIT, we have chosen to not do fsyncs
> on the directory when a file is unlinked, and thus to allow loss of
> durability following a power loss event.  ACI without the D is still
> guaranteed.  But not the D.  The overwhelming majority of applications care
> not one wit about durability following power loss.  For most applications,
> it is sufficient that the file is uncorrupted.  If recovery gives you a
> snapshot of the file as it existed 5 seconds prior to the power loss,
> that's fine.
>
> WAL-mode transactions should be durable across power-loss events.  So if
> durability is vitally important to you, you can always set PRAGMA
> journal_mode=WAL.  Are you observing loss of durability following power
> loss in WAL mode?
>
> Is there any place in the documentation that we have overlooked where
> SQLite claims to be durable across a power loss in rollback mode?
>
>
>
> --
> D. Richard Hipp
> d...@sqlite.org
>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous

2013-05-22 Thread Klaas V
Dear fellow SQLite afficionados,

  Thanumalayan Sankaranarayana Pillai  wrote:
  "I expect it wouldn't be a problem with WAL"
Thé SQLite (not wanting, but cobsidering him at leat  kind of) Force D. Richard 
H.  [who does not know Him don't read this message, you won;t understand enough 
I'm afraid]  answered to   
  There are no configuration options set wrong.  You shouldn't ever have to
set configuration options in order to get SQLite to work right.  I think
instead you are just running up against fundamental limits of modern
consumer-grade hardware.  I don't think there is anything anybody can do
about it" I'd say as a fellow consumer of a wee little bit different hardware 
as well as software
"asking has some kind of answering included sometimes"  I admit since it's not 
my native lingo in imperfect English, I'm brave and arrogant enough to 
encourage mr. Pillai to download and move to a directory within her or his PATH 
the brand new/recommended RVU(P) 3.17.0(±130520) ASAP (i.e. As Soon As Possible 
[just in the very unlikely case y'all did not know yet this geek speak]
To make a long story shorter: Use WAL to open not only yer eyes, but possibly 
you guys's (and gal's) mind as well. Hopefully it's not (yet) banned in your 
part of this wondrtful globe.
 
Just my not extremely humble opinion. Thanks a million for paying eventual 
attention, friends (and 'why not' foes),
 

Cordiali saluti/Vriendelijke groeten/Kind regards,
Klaas "Z4us" V MetaDBA
kla...@innocentisart.eu

P.S.In bocca al lupo, tutti!
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous

2013-05-22 Thread Thanumalayan Sankaranarayana Pillai
No, I have reported everything. The only thing I missed might be that it's
not "5 seconds" always, but rather the configurable commit interval of the
filesystem, which is by default 5 seconds in most desktop Linux distros.

I only read through the source code of test6.c, and misunderstood that
ftruncate() was assumed to be synchronous. Sorry. I still think that
test6.c might assume unlink() to be synchronous, though again, that is just
a "think".

Thanks for the replying to my questions. They are really useful. I am
extremely surprised to learn that unlink() cannot be made durable in some
systems. Do answer [c] too, if that's possible.  I still am examining
SQLite and a bunch of other DBMSs for power-crash scenarios, so might come
back to badger you about "bugs" in the future.

--
Thanumalayan Sankaranarayana Pillai
(Graduate student at the University of Wisconsin-Madison)


On Wed, May 22, 2013 at 8:05 AM, Richard Hipp  wrote:

>
>
> On Wed, May 22, 2013 at 8:31 AM, thanumalayan mad wrote:
>
>>
>> Also, not to spam, but it would be great if you could answer these
>> questions for my research (you might send me a reply directly without going
>> through the mailing list): [a] Was it always understood that unlink() and
>> ftruncate() are not synchronous, and that SQLite transactions in DELETE
>> mode are not immediately-durable in Linux; or had you initially
>> misunderstood the semantics of those calls, or left-off the fsync() because
>> of a typo error?
>>
>
> We are aware of the need to fsync() the directory that contains a file
> after unlink() in order to make the unlink() durable.  We have deliberately
> chosen to avoid that fsync() for performance reasons.
>
> Note that on some systems (ex: AIX and the Chromium sandbox) it is not
> possible to fsync() a directory and therefore it is not possible to make
> unlink() durable.
>
>
>
>
>> [b] While designing the crash-tests, were the semantics of the calls in
>> Unix understood? What if ftruncate() not being synchronous did lead to a
>> consistency-loss? Was it reasoned-out that the non-synchronous ftruncate
>> would not produce corruption?
>>
>
> Over the past 12 years of developing SQLite, we have become acutely aware
> of the semantics of unix, both as published and as commonly implemented,
> which are not necessarily the same thing.
>
> Crash-tests are looking for corruption problems only.  Crash testing does
> not look for durability.
>
> I do not believe that our crash-testing assumes that ftruncate() is
> synchronous.  Have you detected an error that you have not reported?
>
>
>
>>  [c] How much of a loss in durability (what other than 5 seconds) would
>> be "good enough" in most cases?
>>
>
>
>
>
>
> --
> D. Richard Hipp
> d...@sqlite.org
>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous

2013-05-22 Thread Richard Hipp
On Wed, May 22, 2013 at 8:31 AM, thanumalayan mad wrote:

>
> Also, not to spam, but it would be great if you could answer these
> questions for my research (you might send me a reply directly without going
> through the mailing list): [a] Was it always understood that unlink() and
> ftruncate() are not synchronous, and that SQLite transactions in DELETE
> mode are not immediately-durable in Linux; or had you initially
> misunderstood the semantics of those calls, or left-off the fsync() because
> of a typo error?
>

We are aware of the need to fsync() the directory that contains a file
after unlink() in order to make the unlink() durable.  We have deliberately
chosen to avoid that fsync() for performance reasons.

Note that on some systems (ex: AIX and the Chromium sandbox) it is not
possible to fsync() a directory and therefore it is not possible to make
unlink() durable.




> [b] While designing the crash-tests, were the semantics of the calls in
> Unix understood? What if ftruncate() not being synchronous did lead to a
> consistency-loss? Was it reasoned-out that the non-synchronous ftruncate
> would not produce corruption?
>

Over the past 12 years of developing SQLite, we have become acutely aware
of the semantics of unix, both as published and as commonly implemented,
which are not necessarily the same thing.

Crash-tests are looking for corruption problems only.  Crash testing does
not look for durability.

I do not believe that our crash-testing assumes that ftruncate() is
synchronous.  Have you detected an error that you have not reported?



>  [c] How much of a loss in durability (what other than 5 seconds) would be
> "good enough" in most cases?
>





-- 
D. Richard Hipp
d...@sqlite.org
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous

2013-05-22 Thread Richard Hipp
On Sat, May 18, 2013 at 4:41 AM, thanumalayan mad wrote:

>
> Expected result: You always find that the transaction had been executed.
> Observed result: You sometimes find that the transaction did not execute.
>

The core team has discussed this.  In order to avoid a substantial
performance hit against transaction COMMIT, we have chosen to not do fsyncs
on the directory when a file is unlinked, and thus to allow loss of
durability following a power loss event.  ACI without the D is still
guaranteed.  But not the D.  The overwhelming majority of applications care
not one wit about durability following power loss.  For most applications,
it is sufficient that the file is uncorrupted.  If recovery gives you a
snapshot of the file as it existed 5 seconds prior to the power loss,
that's fine.

WAL-mode transactions should be durable across power-loss events.  So if
durability is vitally important to you, you can always set PRAGMA
journal_mode=WAL.  Are you observing loss of durability following power
loss in WAL mode?

Is there any place in the documentation that we have overlooked where
SQLite claims to be durable across a power loss in rollback mode?


-- 
D. Richard Hipp
d...@sqlite.org
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous

2013-05-22 Thread Thanumalayan Sankaranarayana Pillai
Thank you for your replies! I now fully understand (and appreciate) that
the "ACI" part of transactions is the most important. Also, I didn't notice
any of ACI being broken: SQLite guarantees those conditions really well.

However, just to be clear, my "potential bug" affects out-of-the-box Fedora
and Ubuntu, even with costly SCSI drives, and with the OS installed using
all default options. On the other hand, in these environments, SQLite
transactions will be immediately durable if fsync()-s are issued after the
ftruncate() or unlink() calls while resetting the rollback journal; I
verified this by modifying unixDelete() within SQLite, setting dirSync = 1
always. I'm writing the rest of this email assuming that this is an
interesting piece of news (and not something already known). Also,
confession: I am a grad student researching filesystems, and I haven’t
directly used SQLite much; I looked at SQLite because I observed some funny
disk activity. So if I am totally out-of-place with this rigmarole,
imagining things about SQLite that aren’t actually true, please forgive the
spam.

It is entirely true that desktop SATA drives with disk-caches, some bad
filesystems, and crazy OS configurations, all disrespect the entire concept
of flushing. But, the setup I verified used either Fedora 17 and Ubuntu
12.04, and either Ext4, Ext3, or Btrfs. I did my best to make sure that the
effects are neither due to a lying disk, nor because of configuration stuff
like disabling filesystem barriers. Moreover, *the last SQLite
transaction's durability is almost always delayed by 5 seconds* (it's not
something you observe only once in a thousand power crashes), if there are
no other applications flushing to the disk. If other applications do flush
to the disk, then the delay is only till when the next flush (of another
application) happens. The bug probably does not affect Windows and other
OSes - it deals with the "Unix" VFS exclusively.

Not calling fsync() might be based on the assumption that they are
synchronous. As far as I can tell, both the ftruncate() and unlink() system
calls are not synchronous (not immediately **issued to the disk**) in the
Linux distros I used, independent of the disk. I understand that such
semantics (of whether the system calls are synchronous) are often
non-standard; at other times, the documentation is not clear about what the
standards are. However, some manpages (http://linux.die.net/man/2/fsync for
ftruncate, and http://linux.die.net/man/8/mount for unlink: look for
"dirsync") suggest that most Linux distributions require an fsync() for
unlink and ftruncate to be synchronous. Some filesystems or distros might,
however, automatically flush, though: I haven't yet looked into the
behavior of other distributions such as RHEL, other Unix-like OSes such as
FreeBSD, or file systems such as XFS. But, I'll be looking into them soon,
as part of my research, so do let me know if you are interested in their
behavior.

Thanks for reading all the way through! Again, it might be a better
decision to not issue the fsync()-s after unlink and ftruncate: "D" is
probably less important than the extra overhead. I'm just trying to make
sure this is by design, and not an accident.

--
Thanu


On Tue, May 21, 2013 at 11:52 AM, Richard Hipp  wrote:
>
> On Tue, May 21, 2013 at 12:04 PM, Thanumalayan Sankaranarayana Pillai <
> madth...@cs.wisc.edu> wrote:
>
> > Hi all,
> >
> > Did anyone look into this? I might be setting some config option wrong,
> > so it would be great if you sent me a "you did something wrong" reply if
> > you feel that I might have the wrong config (or might be doing something
> > totally idiotic).
> >
>
> There are no configuration options set wrong.  You shouldn't ever have to
> set configuration options in order to get SQLite to work right.   I think
> instead you are just running up against fundamental limits of modern
> consumer-grade hardware.  I don't think there is anything anybody can do
> about it.
>
>
> >
> > I tested with a few other Linux machines and a few different SQLite
> > versions, and found that there is an upto-5-second delay (the default
> > filesystem commit interval) in the transactions getting durable (with
power
> > crashes, that is).
> >
> > Thanks in advance,
> > Thanu
> >
> > On Sat, May 18, 2013 at 3:41 AM, thanumalayan mad  > >wrote:
> >
> > > Hi All,
> > >
> > > I was testing out SQLite with a framework I developed. I believe,
while
> > > running on Linux, transactions might not be durable when a power crash
> > > happens immediately after a commit. I observed this using "SQLite
version
> > > 3.7.16.2 2013-04-12 11:52:43", and kernel "3.8.4-102.fc17.x86_64".
Steps
> > to
> > > reproduce:
> > >
> > > 1. Use a Linux machine with an Ext4 filesystem (default mount
options).
> > > 2. Create a database file, set journal_mode to DELETE, perform a
> > > transaction using "begin transaction ... commit;".
> > > 3. Pull the power plug 

Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous

2013-05-21 Thread Richard Hipp
On Tue, May 21, 2013 at 12:04 PM, Thanumalayan Sankaranarayana Pillai <
madth...@cs.wisc.edu> wrote:

> Hi all,
>
> Did anyone look into this? I might be setting some config option wrong,
> so it would be great if you sent me a "you did something wrong" reply if
> you feel that I might have the wrong config (or might be doing something
> totally idiotic).
>

There are no configuration options set wrong.  You shouldn't ever have to
set configuration options in order to get SQLite to work right.   I think
instead you are just running up against fundamental limits of modern
consumer-grade hardware.  I don't think there is anything anybody can do
about it.


>
> I tested with a few other Linux machines and a few different SQLite
> versions, and found that there is an upto-5-second delay (the default
> filesystem commit interval) in the transactions getting durable (with power
> crashes, that is).
>
> Thanks in advance,
> Thanu
>
> On Sat, May 18, 2013 at 3:41 AM, thanumalayan mad  >wrote:
>
> > Hi All,
> >
> > I was testing out SQLite with a framework I developed. I believe, while
> > running on Linux, transactions might not be durable when a power crash
> > happens immediately after a commit. I observed this using "SQLite version
> > 3.7.16.2 2013-04-12 11:52:43", and kernel "3.8.4-102.fc17.x86_64". Steps
> to
> > reproduce:
> >
> > 1. Use a Linux machine with an Ext4 filesystem (default mount options).
> > 2. Create a database file, set journal_mode to DELETE, perform a
> > transaction using "begin transaction ... commit;".
> > 3. Pull the power plug immediately after "commit;" returns.
> > 4. Put plug back in, power on the machine, open database file with
> SQLite,
> > and examine whether the transaction has been executed.
> >
> > Expected result: You always find that the transaction had been executed.
> > Observed result: You sometimes find that the transaction did not execute.
> > (To increase the chance that you end with the transaction not having
> > executed, create an Ext4 partition in an unmounted hard disk, mount the
> > partition with the "-o commit=30" mount option, and do the experiment
> with
> > the database in that partition).
> >
> > Of course, it's possible that the transactions are durable, and I'm just
> > being totally bonkers, or running SQLite in some really wrong way. If
> not,
> > I believe that the problem lies in SQLite not doing an fsync() on the
> > parent directory after unlink()-ing the rollback journal file. It seems
> > that, in Ext4, unlink() is atomic, but not synchronous: the rollback
> > journal can exist after rebooting if a power crash follows the unlink().
> > Also, ftruncate() doesn't seem to be synchronous either (non-durability
> > even if the journal_mode is set to TRUNCATE).
> >
> > I haven't examined how other filesystems treat unlink(). I haven't also
> > tested with other pragma options and other journal_modes. I expect it
> > wouldn't be a problem with WAL (it should be correct even if the log
> exists
> > after the reboot); however, with things like VACUUM-ing, the database
> size
> > would probably not be reduced if a power crash happened, because of
> > ftruncate() not being synchronous. More generally, I think some of the
> code
> > assumes that all non-write and non-create operations that modify a file
> (in
> > other words, unlink and ftruncate) are synchronous; but that's true in
> > Linux only if the partition is mounted with "-o dirsync". One more random
> > thought: if I'm not wrong, "SQLite crash-tests" currently tests
> > partial-flushing only with write() calls
> >
> > Again, sorry for the spam if I'm just being bonkers.
> >
> > Thanks,
> > Thanu
> >
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>



-- 
D. Richard Hipp
d...@sqlite.org
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous

2013-05-21 Thread Simon Slavin

On 21 May 2013, at 5:04pm, Thanumalayan Sankaranarayana Pillai 
 wrote:

> Did anyone look into this? I might be setting some config option wrong,
> so it would be great if you sent me a "you did something wrong" reply if
> you feel that I might have the wrong config (or might be doing something
> totally idiotic).
> 
> I tested with a few other Linux machines and a few different SQLite
> versions, and found that there is an upto-5-second delay (the default
> filesystem commit interval) in the transactions getting durable (with power
> crashes, that is).

I think it's a reflection of what the operating system thinks 'synchronous' 
should mean.  Flushing changes to disk takes time and slows down your 
application so the operating system doesn't always do it when it says it does.  
With Unix, for example, it's quite common to find that some file systems do 
absolutely nothing for flush API calls, or put a timer on flush() which 
prevents it from running more frequently than every few seconds, which would 
introduce extra lag and thrashing your mass storage subsystem (e.g. wearing out 
the stepper motor on a hard disk).  Also, hard disk subsystems (the metal thing 
with a circuit board you hold in your hand) aimed at desktop computers actively 
lie to the OS about when data is written to disk unless you buy hard disks 
explicitly sold for use in servers (and priced to match).

Tracking down exactly what's going on would probably involve reading the source 
code for ftruncate() in your version of your OS, and then assuming that the 
hard disk unit doesn't lie.  It could be done if you want, but ftruncate() 
differs even for different versions of the same Unix file system, so the result 
may not tell us much.

> One more random
> thought: if I'm not wrong, "SQLite crash-tests" currently tests
> partial-flushing only with write() calls

Immediate and true flushing is not a huge concern of the developer team.  A 
delay of 5 seconds, in conditions relating to a normal running of an 
application, probably isn't going to be a deal-breaker in anything you should 
be using a DBMS for.

The main thing, and one you seem to confirm is working properly, is that 
transactions are always either completely written or not written, never with 
some parts written and others not written.  That's part of the ACID conditions 
SQLite does its best to implement, and any breaking of those conditions would 
get full and eager attention from the developers.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous

2013-05-21 Thread Thanumalayan Sankaranarayana Pillai
Hi all,

Did anyone look into this? I might be setting some config option wrong,
so it would be great if you sent me a "you did something wrong" reply if
you feel that I might have the wrong config (or might be doing something
totally idiotic).

I tested with a few other Linux machines and a few different SQLite
versions, and found that there is an upto-5-second delay (the default
filesystem commit interval) in the transactions getting durable (with power
crashes, that is).

Thanks in advance,
Thanu

On Sat, May 18, 2013 at 3:41 AM, thanumalayan mad wrote:

> Hi All,
>
> I was testing out SQLite with a framework I developed. I believe, while
> running on Linux, transactions might not be durable when a power crash
> happens immediately after a commit. I observed this using "SQLite version
> 3.7.16.2 2013-04-12 11:52:43", and kernel "3.8.4-102.fc17.x86_64". Steps to
> reproduce:
>
> 1. Use a Linux machine with an Ext4 filesystem (default mount options).
> 2. Create a database file, set journal_mode to DELETE, perform a
> transaction using "begin transaction ... commit;".
> 3. Pull the power plug immediately after "commit;" returns.
> 4. Put plug back in, power on the machine, open database file with SQLite,
> and examine whether the transaction has been executed.
>
> Expected result: You always find that the transaction had been executed.
> Observed result: You sometimes find that the transaction did not execute.
> (To increase the chance that you end with the transaction not having
> executed, create an Ext4 partition in an unmounted hard disk, mount the
> partition with the "-o commit=30" mount option, and do the experiment with
> the database in that partition).
>
> Of course, it's possible that the transactions are durable, and I'm just
> being totally bonkers, or running SQLite in some really wrong way. If not,
> I believe that the problem lies in SQLite not doing an fsync() on the
> parent directory after unlink()-ing the rollback journal file. It seems
> that, in Ext4, unlink() is atomic, but not synchronous: the rollback
> journal can exist after rebooting if a power crash follows the unlink().
> Also, ftruncate() doesn't seem to be synchronous either (non-durability
> even if the journal_mode is set to TRUNCATE).
>
> I haven't examined how other filesystems treat unlink(). I haven't also
> tested with other pragma options and other journal_modes. I expect it
> wouldn't be a problem with WAL (it should be correct even if the log exists
> after the reboot); however, with things like VACUUM-ing, the database size
> would probably not be reduced if a power crash happened, because of
> ftruncate() not being synchronous. More generally, I think some of the code
> assumes that all non-write and non-create operations that modify a file (in
> other words, unlink and ftruncate) are synchronous; but that's true in
> Linux only if the partition is mounted with "-o dirsync". One more random
> thought: if I'm not wrong, "SQLite crash-tests" currently tests
> partial-flushing only with write() calls
>
> Again, sorry for the spam if I'm just being bonkers.
>
> Thanks,
> Thanu
>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users