Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous
Hi Marc, Thanks for your comments! I just got confused that some SQLite webpages ( http://www.sqlite.org/transactional.html, http://www.sqlite.org/features.html) mention that transactions are durable after a power loss (the D in ACID); nowhere has it been mentioned that "immediate durability after a commit" is true only when the journal_mode is explicitly set to WAL. The documentation, however, was thorough regarding stuff like bad disks and bad OSes, and how they affect the integrity/corruption of the database. My view had been fully biased: I was looking at some applications that use MySQL and such, acting as nodes in a network, and they require this "immediate durability" guarantee. I totally understand now that SQLite is aimed at an entirely different set of applications that don't require "immediate durability", but would rather that the database is fast. Thanks again, Thanu On Thu, May 23, 2013 at 8:00 AM, Marc L. Allen wrote: > Just to throw in my $0.02 as a user > > Given the SQL stream of... > > > COMMIT > > > Vs. > > > > > > Except in cases where, in the first example, I have time to inform someone > about the COMMIT before the power loss, there's no functional difference > between the two events. I would hate to think I would ever demand that > SQLite guarantee the commit occurs regardless of how quickly the power loss > happens after the commit statement. > > For a huge majority of the applications I've dealt with (and I say huge > instead of all only in case there's one I've forgotten about), the timing > difference doesn't matter. What matters is that when I come back up I have > an intact database. > > -Original Message- > From: sqlite-users-boun...@sqlite.org [mailto: > sqlite-users-boun...@sqlite.org] On Behalf Of thanumalayan mad > Sent: Wednesday, May 22, 2013 8:31 AM > To: Richard Hipp > Cc: General Discussion of SQLite Database > Subject: Re: [sqlite] Potential bug in crash-recovery code: unlink() and > friends are not synchronous > > I do not observe any loss in durability in WAL mode: it works totally fine. > > As for the documentation, http://www.sqlite.org/transactional.html and > http://www.sqlite.org/features.html claim that SQLite is durable during > power failures; and DELETE is the default journal_mode. Also, other pages, > http://www.sqlite.org/pragma.html#pragma_synchronous, > http://www.sqlite.org/atomiccommit.html, and > http://www.sqlite.org/faq.html, made me think that rollback journaling > ensures durability (given a honest disk and a "honest" OS), although those > pages do not "legally" say that; I'm not sure if others would understand it > this way though. The usual opinion in blogosphere (and forums) also seems > to be that SQLite is, by default, durable across power failures, though > they might be meaning only about a "5 second eventual durability". Finally, > (I hope I do not come across here as being authoritative), a quick fix > might be explicitly mentioning somewhere in the documentation that DELETE > and TRUNCATE modes do not ensure immediate durability after a power loss; > this would combat any wrongly-understood claims in the rest of the > documentation. > > Also, not to spam, but it would be great if you could answer these > questions for my research (you might send me a reply directly without going > through the mailing list): [a] Was it always understood that unlink() and > ftruncate() are not synchronous, and that SQLite transactions in DELETE > mode are not immediately-durable in Linux; or had you initially > misunderstood the semantics of those calls, or left-off the fsync() because > of a typo error? [b] While designing the crash-tests, were the semantics of > the calls in Unix understood? What if ftruncate() not being synchronous did > lead to a consistency-loss? Was it reasoned-out that the non-synchronous > ftruncate would not produce corruption? [c] How much of a loss in > durability (what other than 5 seconds) would be "good enough" in most > cases? ... Again, sorry for the spam; my research is trying to make > sense of the flushing-mess in the entire storage stack, and feedback would > be extremely useful. > > PS: @Richard Hipp and other developers - thanks for SQLite! Amazing piece > of software. > > -- > Thanumalayan Sankaranarayana Pillai > (Graduate student at the University of Wisconsin-Madison) > > > On Wed, May 22, 2013 at 5:49 AM, Richard Hipp wrote: > > > > > > > On Sat, May 18, 2013 at 4:41 AM, thanumalayan mad >wrote: > > > >> > >> Expected result: You always find that the transaction had been executed. > >> Ob
Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous
Just to throw in my $0.02 as a user Given the SQL stream of... COMMIT Vs. Except in cases where, in the first example, I have time to inform someone about the COMMIT before the power loss, there's no functional difference between the two events. I would hate to think I would ever demand that SQLite guarantee the commit occurs regardless of how quickly the power loss happens after the commit statement. For a huge majority of the applications I've dealt with (and I say huge instead of all only in case there's one I've forgotten about), the timing difference doesn't matter. What matters is that when I come back up I have an intact database. -Original Message- From: sqlite-users-boun...@sqlite.org [mailto:sqlite-users-boun...@sqlite.org] On Behalf Of thanumalayan mad Sent: Wednesday, May 22, 2013 8:31 AM To: Richard Hipp Cc: General Discussion of SQLite Database Subject: Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous I do not observe any loss in durability in WAL mode: it works totally fine. As for the documentation, http://www.sqlite.org/transactional.html and http://www.sqlite.org/features.html claim that SQLite is durable during power failures; and DELETE is the default journal_mode. Also, other pages, http://www.sqlite.org/pragma.html#pragma_synchronous, http://www.sqlite.org/atomiccommit.html, and http://www.sqlite.org/faq.html, made me think that rollback journaling ensures durability (given a honest disk and a "honest" OS), although those pages do not "legally" say that; I'm not sure if others would understand it this way though. The usual opinion in blogosphere (and forums) also seems to be that SQLite is, by default, durable across power failures, though they might be meaning only about a "5 second eventual durability". Finally, (I hope I do not come across here as being authoritative), a quick fix might be explicitly mentioning somewhere in the documentation that DELETE and TRUNCATE modes do not ensure immediate durability after a power loss; this would combat any wrongly-understood claims in the rest of the documentation. Also, not to spam, but it would be great if you could answer these questions for my research (you might send me a reply directly without going through the mailing list): [a] Was it always understood that unlink() and ftruncate() are not synchronous, and that SQLite transactions in DELETE mode are not immediately-durable in Linux; or had you initially misunderstood the semantics of those calls, or left-off the fsync() because of a typo error? [b] While designing the crash-tests, were the semantics of the calls in Unix understood? What if ftruncate() not being synchronous did lead to a consistency-loss? Was it reasoned-out that the non-synchronous ftruncate would not produce corruption? [c] How much of a loss in durability (what other than 5 seconds) would be "good enough" in most cases? ... Again, sorry for the spam; my research is trying to make sense of the flushing-mess in the entire storage stack, and feedback would be extremely useful. PS: @Richard Hipp and other developers - thanks for SQLite! Amazing piece of software. -- Thanumalayan Sankaranarayana Pillai (Graduate student at the University of Wisconsin-Madison) On Wed, May 22, 2013 at 5:49 AM, Richard Hipp wrote: > > > On Sat, May 18, 2013 at 4:41 AM, thanumalayan mad wrote: > >> >> Expected result: You always find that the transaction had been executed. >> Observed result: You sometimes find that the transaction did not execute. >> > > The core team has discussed this. In order to avoid a substantial > performance hit against transaction COMMIT, we have chosen to not do > fsyncs on the directory when a file is unlinked, and thus to allow > loss of durability following a power loss event. ACI without the D is > still guaranteed. But not the D. The overwhelming majority of > applications care not one wit about durability following power loss. > For most applications, it is sufficient that the file is uncorrupted. > If recovery gives you a snapshot of the file as it existed 5 seconds > prior to the power loss, that's fine. > > WAL-mode transactions should be durable across power-loss events. So > if durability is vitally important to you, you can always set PRAGMA > journal_mode=WAL. Are you observing loss of durability following > power loss in WAL mode? > > Is there any place in the documentation that we have overlooked where > SQLite claims to be durable across a power loss in rollback mode? > > > > -- > D. Richard Hipp > d...@sqlite.org > ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users This
Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous
I do not observe any loss in durability in WAL mode: it works totally fine. As for the documentation, http://www.sqlite.org/transactional.html and http://www.sqlite.org/features.html claim that SQLite is durable during power failures; and DELETE is the default journal_mode. Also, other pages, http://www.sqlite.org/pragma.html#pragma_synchronous, http://www.sqlite.org/atomiccommit.html, and http://www.sqlite.org/faq.html, made me think that rollback journaling ensures durability (given a honest disk and a "honest" OS), although those pages do not "legally" say that; I'm not sure if others would understand it this way though. The usual opinion in blogosphere (and forums) also seems to be that SQLite is, by default, durable across power failures, though they might be meaning only about a "5 second eventual durability". Finally, (I hope I do not come across here as being authoritative), a quick fix might be explicitly mentioning somewhere in the documentation that DELETE and TRUNCATE modes do not ensure immediate durability after a power loss; this would combat any wrongly-understood claims in the rest of the documentation. Also, not to spam, but it would be great if you could answer these questions for my research (you might send me a reply directly without going through the mailing list): [a] Was it always understood that unlink() and ftruncate() are not synchronous, and that SQLite transactions in DELETE mode are not immediately-durable in Linux; or had you initially misunderstood the semantics of those calls, or left-off the fsync() because of a typo error? [b] While designing the crash-tests, were the semantics of the calls in Unix understood? What if ftruncate() not being synchronous did lead to a consistency-loss? Was it reasoned-out that the non-synchronous ftruncate would not produce corruption? [c] How much of a loss in durability (what other than 5 seconds) would be "good enough" in most cases? ... Again, sorry for the spam; my research is trying to make sense of the flushing-mess in the entire storage stack, and feedback would be extremely useful. PS: @Richard Hipp and other developers - thanks for SQLite! Amazing piece of software. -- Thanumalayan Sankaranarayana Pillai (Graduate student at the University of Wisconsin-Madison) On Wed, May 22, 2013 at 5:49 AM, Richard Hipp wrote: > > > On Sat, May 18, 2013 at 4:41 AM, thanumalayan mad wrote: > >> >> Expected result: You always find that the transaction had been executed. >> Observed result: You sometimes find that the transaction did not execute. >> > > The core team has discussed this. In order to avoid a substantial > performance hit against transaction COMMIT, we have chosen to not do fsyncs > on the directory when a file is unlinked, and thus to allow loss of > durability following a power loss event. ACI without the D is still > guaranteed. But not the D. The overwhelming majority of applications care > not one wit about durability following power loss. For most applications, > it is sufficient that the file is uncorrupted. If recovery gives you a > snapshot of the file as it existed 5 seconds prior to the power loss, > that's fine. > > WAL-mode transactions should be durable across power-loss events. So if > durability is vitally important to you, you can always set PRAGMA > journal_mode=WAL. Are you observing loss of durability following power > loss in WAL mode? > > Is there any place in the documentation that we have overlooked where > SQLite claims to be durable across a power loss in rollback mode? > > > > -- > D. Richard Hipp > d...@sqlite.org > ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous
Dear fellow SQLite afficionados, Thanumalayan Sankaranarayana Pillai wrote: "I expect it wouldn't be a problem with WAL" Thé SQLite (not wanting, but cobsidering him at leat kind of) Force D. Richard H. [who does not know Him don't read this message, you won;t understand enough I'm afraid] answered to There are no configuration options set wrong. You shouldn't ever have to set configuration options in order to get SQLite to work right. I think instead you are just running up against fundamental limits of modern consumer-grade hardware. I don't think there is anything anybody can do about it" I'd say as a fellow consumer of a wee little bit different hardware as well as software "asking has some kind of answering included sometimes" I admit since it's not my native lingo in imperfect English, I'm brave and arrogant enough to encourage mr. Pillai to download and move to a directory within her or his PATH the brand new/recommended RVU(P) 3.17.0(±130520) ASAP (i.e. As Soon As Possible [just in the very unlikely case y'all did not know yet this geek speak] To make a long story shorter: Use WAL to open not only yer eyes, but possibly you guys's (and gal's) mind as well. Hopefully it's not (yet) banned in your part of this wondrtful globe. Just my not extremely humble opinion. Thanks a million for paying eventual attention, friends (and 'why not' foes), Cordiali saluti/Vriendelijke groeten/Kind regards, Klaas "Z4us" V MetaDBA kla...@innocentisart.eu P.S.In bocca al lupo, tutti! ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous
No, I have reported everything. The only thing I missed might be that it's not "5 seconds" always, but rather the configurable commit interval of the filesystem, which is by default 5 seconds in most desktop Linux distros. I only read through the source code of test6.c, and misunderstood that ftruncate() was assumed to be synchronous. Sorry. I still think that test6.c might assume unlink() to be synchronous, though again, that is just a "think". Thanks for the replying to my questions. They are really useful. I am extremely surprised to learn that unlink() cannot be made durable in some systems. Do answer [c] too, if that's possible. I still am examining SQLite and a bunch of other DBMSs for power-crash scenarios, so might come back to badger you about "bugs" in the future. -- Thanumalayan Sankaranarayana Pillai (Graduate student at the University of Wisconsin-Madison) On Wed, May 22, 2013 at 8:05 AM, Richard Hipp wrote: > > > On Wed, May 22, 2013 at 8:31 AM, thanumalayan mad wrote: > >> >> Also, not to spam, but it would be great if you could answer these >> questions for my research (you might send me a reply directly without going >> through the mailing list): [a] Was it always understood that unlink() and >> ftruncate() are not synchronous, and that SQLite transactions in DELETE >> mode are not immediately-durable in Linux; or had you initially >> misunderstood the semantics of those calls, or left-off the fsync() because >> of a typo error? >> > > We are aware of the need to fsync() the directory that contains a file > after unlink() in order to make the unlink() durable. We have deliberately > chosen to avoid that fsync() for performance reasons. > > Note that on some systems (ex: AIX and the Chromium sandbox) it is not > possible to fsync() a directory and therefore it is not possible to make > unlink() durable. > > > > >> [b] While designing the crash-tests, were the semantics of the calls in >> Unix understood? What if ftruncate() not being synchronous did lead to a >> consistency-loss? Was it reasoned-out that the non-synchronous ftruncate >> would not produce corruption? >> > > Over the past 12 years of developing SQLite, we have become acutely aware > of the semantics of unix, both as published and as commonly implemented, > which are not necessarily the same thing. > > Crash-tests are looking for corruption problems only. Crash testing does > not look for durability. > > I do not believe that our crash-testing assumes that ftruncate() is > synchronous. Have you detected an error that you have not reported? > > > >> [c] How much of a loss in durability (what other than 5 seconds) would >> be "good enough" in most cases? >> > > > > > > -- > D. Richard Hipp > d...@sqlite.org > ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous
On Wed, May 22, 2013 at 8:31 AM, thanumalayan mad wrote: > > Also, not to spam, but it would be great if you could answer these > questions for my research (you might send me a reply directly without going > through the mailing list): [a] Was it always understood that unlink() and > ftruncate() are not synchronous, and that SQLite transactions in DELETE > mode are not immediately-durable in Linux; or had you initially > misunderstood the semantics of those calls, or left-off the fsync() because > of a typo error? > We are aware of the need to fsync() the directory that contains a file after unlink() in order to make the unlink() durable. We have deliberately chosen to avoid that fsync() for performance reasons. Note that on some systems (ex: AIX and the Chromium sandbox) it is not possible to fsync() a directory and therefore it is not possible to make unlink() durable. > [b] While designing the crash-tests, were the semantics of the calls in > Unix understood? What if ftruncate() not being synchronous did lead to a > consistency-loss? Was it reasoned-out that the non-synchronous ftruncate > would not produce corruption? > Over the past 12 years of developing SQLite, we have become acutely aware of the semantics of unix, both as published and as commonly implemented, which are not necessarily the same thing. Crash-tests are looking for corruption problems only. Crash testing does not look for durability. I do not believe that our crash-testing assumes that ftruncate() is synchronous. Have you detected an error that you have not reported? > [c] How much of a loss in durability (what other than 5 seconds) would be > "good enough" in most cases? > -- D. Richard Hipp d...@sqlite.org ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous
On Sat, May 18, 2013 at 4:41 AM, thanumalayan mad wrote: > > Expected result: You always find that the transaction had been executed. > Observed result: You sometimes find that the transaction did not execute. > The core team has discussed this. In order to avoid a substantial performance hit against transaction COMMIT, we have chosen to not do fsyncs on the directory when a file is unlinked, and thus to allow loss of durability following a power loss event. ACI without the D is still guaranteed. But not the D. The overwhelming majority of applications care not one wit about durability following power loss. For most applications, it is sufficient that the file is uncorrupted. If recovery gives you a snapshot of the file as it existed 5 seconds prior to the power loss, that's fine. WAL-mode transactions should be durable across power-loss events. So if durability is vitally important to you, you can always set PRAGMA journal_mode=WAL. Are you observing loss of durability following power loss in WAL mode? Is there any place in the documentation that we have overlooked where SQLite claims to be durable across a power loss in rollback mode? -- D. Richard Hipp d...@sqlite.org ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous
Thank you for your replies! I now fully understand (and appreciate) that the "ACI" part of transactions is the most important. Also, I didn't notice any of ACI being broken: SQLite guarantees those conditions really well. However, just to be clear, my "potential bug" affects out-of-the-box Fedora and Ubuntu, even with costly SCSI drives, and with the OS installed using all default options. On the other hand, in these environments, SQLite transactions will be immediately durable if fsync()-s are issued after the ftruncate() or unlink() calls while resetting the rollback journal; I verified this by modifying unixDelete() within SQLite, setting dirSync = 1 always. I'm writing the rest of this email assuming that this is an interesting piece of news (and not something already known). Also, confession: I am a grad student researching filesystems, and I haven’t directly used SQLite much; I looked at SQLite because I observed some funny disk activity. So if I am totally out-of-place with this rigmarole, imagining things about SQLite that aren’t actually true, please forgive the spam. It is entirely true that desktop SATA drives with disk-caches, some bad filesystems, and crazy OS configurations, all disrespect the entire concept of flushing. But, the setup I verified used either Fedora 17 and Ubuntu 12.04, and either Ext4, Ext3, or Btrfs. I did my best to make sure that the effects are neither due to a lying disk, nor because of configuration stuff like disabling filesystem barriers. Moreover, *the last SQLite transaction's durability is almost always delayed by 5 seconds* (it's not something you observe only once in a thousand power crashes), if there are no other applications flushing to the disk. If other applications do flush to the disk, then the delay is only till when the next flush (of another application) happens. The bug probably does not affect Windows and other OSes - it deals with the "Unix" VFS exclusively. Not calling fsync() might be based on the assumption that they are synchronous. As far as I can tell, both the ftruncate() and unlink() system calls are not synchronous (not immediately **issued to the disk**) in the Linux distros I used, independent of the disk. I understand that such semantics (of whether the system calls are synchronous) are often non-standard; at other times, the documentation is not clear about what the standards are. However, some manpages (http://linux.die.net/man/2/fsync for ftruncate, and http://linux.die.net/man/8/mount for unlink: look for "dirsync") suggest that most Linux distributions require an fsync() for unlink and ftruncate to be synchronous. Some filesystems or distros might, however, automatically flush, though: I haven't yet looked into the behavior of other distributions such as RHEL, other Unix-like OSes such as FreeBSD, or file systems such as XFS. But, I'll be looking into them soon, as part of my research, so do let me know if you are interested in their behavior. Thanks for reading all the way through! Again, it might be a better decision to not issue the fsync()-s after unlink and ftruncate: "D" is probably less important than the extra overhead. I'm just trying to make sure this is by design, and not an accident. -- Thanu On Tue, May 21, 2013 at 11:52 AM, Richard Hipp wrote: > > On Tue, May 21, 2013 at 12:04 PM, Thanumalayan Sankaranarayana Pillai < > madth...@cs.wisc.edu> wrote: > > > Hi all, > > > > Did anyone look into this? I might be setting some config option wrong, > > so it would be great if you sent me a "you did something wrong" reply if > > you feel that I might have the wrong config (or might be doing something > > totally idiotic). > > > > There are no configuration options set wrong. You shouldn't ever have to > set configuration options in order to get SQLite to work right. I think > instead you are just running up against fundamental limits of modern > consumer-grade hardware. I don't think there is anything anybody can do > about it. > > > > > > I tested with a few other Linux machines and a few different SQLite > > versions, and found that there is an upto-5-second delay (the default > > filesystem commit interval) in the transactions getting durable (with power > > crashes, that is). > > > > Thanks in advance, > > Thanu > > > > On Sat, May 18, 2013 at 3:41 AM, thanumalayan mad > >wrote: > > > > > Hi All, > > > > > > I was testing out SQLite with a framework I developed. I believe, while > > > running on Linux, transactions might not be durable when a power crash > > > happens immediately after a commit. I observed this using "SQLite version > > > 3.7.16.2 2013-04-12 11:52:43", and kernel "3.8.4-102.fc17.x86_64". Steps > > to > > > reproduce: > > > > > > 1. Use a Linux machine with an Ext4 filesystem (default mount options). > > > 2. Create a database file, set journal_mode to DELETE, perform a > > > transaction using "begin transaction ... commit;". > > > 3. Pull the power plug immediately after "commit;" returns. > > >
Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous
On Tue, May 21, 2013 at 12:04 PM, Thanumalayan Sankaranarayana Pillai < madth...@cs.wisc.edu> wrote: > Hi all, > > Did anyone look into this? I might be setting some config option wrong, > so it would be great if you sent me a "you did something wrong" reply if > you feel that I might have the wrong config (or might be doing something > totally idiotic). > There are no configuration options set wrong. You shouldn't ever have to set configuration options in order to get SQLite to work right. I think instead you are just running up against fundamental limits of modern consumer-grade hardware. I don't think there is anything anybody can do about it. > > I tested with a few other Linux machines and a few different SQLite > versions, and found that there is an upto-5-second delay (the default > filesystem commit interval) in the transactions getting durable (with power > crashes, that is). > > Thanks in advance, > Thanu > > On Sat, May 18, 2013 at 3:41 AM, thanumalayan mad >wrote: > > > Hi All, > > > > I was testing out SQLite with a framework I developed. I believe, while > > running on Linux, transactions might not be durable when a power crash > > happens immediately after a commit. I observed this using "SQLite version > > 3.7.16.2 2013-04-12 11:52:43", and kernel "3.8.4-102.fc17.x86_64". Steps > to > > reproduce: > > > > 1. Use a Linux machine with an Ext4 filesystem (default mount options). > > 2. Create a database file, set journal_mode to DELETE, perform a > > transaction using "begin transaction ... commit;". > > 3. Pull the power plug immediately after "commit;" returns. > > 4. Put plug back in, power on the machine, open database file with > SQLite, > > and examine whether the transaction has been executed. > > > > Expected result: You always find that the transaction had been executed. > > Observed result: You sometimes find that the transaction did not execute. > > (To increase the chance that you end with the transaction not having > > executed, create an Ext4 partition in an unmounted hard disk, mount the > > partition with the "-o commit=30" mount option, and do the experiment > with > > the database in that partition). > > > > Of course, it's possible that the transactions are durable, and I'm just > > being totally bonkers, or running SQLite in some really wrong way. If > not, > > I believe that the problem lies in SQLite not doing an fsync() on the > > parent directory after unlink()-ing the rollback journal file. It seems > > that, in Ext4, unlink() is atomic, but not synchronous: the rollback > > journal can exist after rebooting if a power crash follows the unlink(). > > Also, ftruncate() doesn't seem to be synchronous either (non-durability > > even if the journal_mode is set to TRUNCATE). > > > > I haven't examined how other filesystems treat unlink(). I haven't also > > tested with other pragma options and other journal_modes. I expect it > > wouldn't be a problem with WAL (it should be correct even if the log > exists > > after the reboot); however, with things like VACUUM-ing, the database > size > > would probably not be reduced if a power crash happened, because of > > ftruncate() not being synchronous. More generally, I think some of the > code > > assumes that all non-write and non-create operations that modify a file > (in > > other words, unlink and ftruncate) are synchronous; but that's true in > > Linux only if the partition is mounted with "-o dirsync". One more random > > thought: if I'm not wrong, "SQLite crash-tests" currently tests > > partial-flushing only with write() calls > > > > Again, sorry for the spam if I'm just being bonkers. > > > > Thanks, > > Thanu > > > ___ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > -- D. Richard Hipp d...@sqlite.org ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous
On 21 May 2013, at 5:04pm, Thanumalayan Sankaranarayana Pillai wrote: > Did anyone look into this? I might be setting some config option wrong, > so it would be great if you sent me a "you did something wrong" reply if > you feel that I might have the wrong config (or might be doing something > totally idiotic). > > I tested with a few other Linux machines and a few different SQLite > versions, and found that there is an upto-5-second delay (the default > filesystem commit interval) in the transactions getting durable (with power > crashes, that is). I think it's a reflection of what the operating system thinks 'synchronous' should mean. Flushing changes to disk takes time and slows down your application so the operating system doesn't always do it when it says it does. With Unix, for example, it's quite common to find that some file systems do absolutely nothing for flush API calls, or put a timer on flush() which prevents it from running more frequently than every few seconds, which would introduce extra lag and thrashing your mass storage subsystem (e.g. wearing out the stepper motor on a hard disk). Also, hard disk subsystems (the metal thing with a circuit board you hold in your hand) aimed at desktop computers actively lie to the OS about when data is written to disk unless you buy hard disks explicitly sold for use in servers (and priced to match). Tracking down exactly what's going on would probably involve reading the source code for ftruncate() in your version of your OS, and then assuming that the hard disk unit doesn't lie. It could be done if you want, but ftruncate() differs even for different versions of the same Unix file system, so the result may not tell us much. > One more random > thought: if I'm not wrong, "SQLite crash-tests" currently tests > partial-flushing only with write() calls Immediate and true flushing is not a huge concern of the developer team. A delay of 5 seconds, in conditions relating to a normal running of an application, probably isn't going to be a deal-breaker in anything you should be using a DBMS for. The main thing, and one you seem to confirm is working properly, is that transactions are always either completely written or not written, never with some parts written and others not written. That's part of the ACID conditions SQLite does its best to implement, and any breaking of those conditions would get full and eager attention from the developers. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous
Hi all, Did anyone look into this? I might be setting some config option wrong, so it would be great if you sent me a "you did something wrong" reply if you feel that I might have the wrong config (or might be doing something totally idiotic). I tested with a few other Linux machines and a few different SQLite versions, and found that there is an upto-5-second delay (the default filesystem commit interval) in the transactions getting durable (with power crashes, that is). Thanks in advance, Thanu On Sat, May 18, 2013 at 3:41 AM, thanumalayan mad wrote: > Hi All, > > I was testing out SQLite with a framework I developed. I believe, while > running on Linux, transactions might not be durable when a power crash > happens immediately after a commit. I observed this using "SQLite version > 3.7.16.2 2013-04-12 11:52:43", and kernel "3.8.4-102.fc17.x86_64". Steps to > reproduce: > > 1. Use a Linux machine with an Ext4 filesystem (default mount options). > 2. Create a database file, set journal_mode to DELETE, perform a > transaction using "begin transaction ... commit;". > 3. Pull the power plug immediately after "commit;" returns. > 4. Put plug back in, power on the machine, open database file with SQLite, > and examine whether the transaction has been executed. > > Expected result: You always find that the transaction had been executed. > Observed result: You sometimes find that the transaction did not execute. > (To increase the chance that you end with the transaction not having > executed, create an Ext4 partition in an unmounted hard disk, mount the > partition with the "-o commit=30" mount option, and do the experiment with > the database in that partition). > > Of course, it's possible that the transactions are durable, and I'm just > being totally bonkers, or running SQLite in some really wrong way. If not, > I believe that the problem lies in SQLite not doing an fsync() on the > parent directory after unlink()-ing the rollback journal file. It seems > that, in Ext4, unlink() is atomic, but not synchronous: the rollback > journal can exist after rebooting if a power crash follows the unlink(). > Also, ftruncate() doesn't seem to be synchronous either (non-durability > even if the journal_mode is set to TRUNCATE). > > I haven't examined how other filesystems treat unlink(). I haven't also > tested with other pragma options and other journal_modes. I expect it > wouldn't be a problem with WAL (it should be correct even if the log exists > after the reboot); however, with things like VACUUM-ing, the database size > would probably not be reduced if a power crash happened, because of > ftruncate() not being synchronous. More generally, I think some of the code > assumes that all non-write and non-create operations that modify a file (in > other words, unlink and ftruncate) are synchronous; but that's true in > Linux only if the partition is mounted with "-o dirsync". One more random > thought: if I'm not wrong, "SQLite crash-tests" currently tests > partial-flushing only with write() calls > > Again, sorry for the spam if I'm just being bonkers. > > Thanks, > Thanu > ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
[sqlite] Potential bug in crash-recovery code: unlink() and friends are not synchronous
Hi All, I was testing out SQLite with a framework I developed. I believe, while running on Linux, transactions might not be durable when a power crash happens immediately after a commit. I observed this using "SQLite version 3.7.16.2 2013-04-12 11:52:43", and kernel "3.8.4-102.fc17.x86_64". Steps to reproduce: 1. Use a Linux machine with an Ext4 filesystem (default mount options). 2. Create a database file, set journal_mode to DELETE, perform a transaction using "begin transaction ... commit;". 3. Pull the power plug immediately after "commit;" returns. 4. Put plug back in, power on the machine, open database file with SQLite, and examine whether the transaction has been executed. Expected result: You always find that the transaction had been executed. Observed result: You sometimes find that the transaction did not execute. (To increase the chance that you end with the transaction not having executed, create an Ext4 partition in an unmounted hard disk, mount the partition with the "-o commit=30" mount option, and do the experiment with the database in that partition). Of course, it's possible that the transactions are durable, and I'm just being totally bonkers, or running SQLite in some really wrong way. If not, I believe that the problem lies in SQLite not doing an fsync() on the parent directory after unlink()-ing the rollback journal file. It seems that, in Ext4, unlink() is atomic, but not synchronous: the rollback journal can exist after rebooting if a power crash follows the unlink(). Also, ftruncate() doesn't seem to be synchronous either (non-durability even if the journal_mode is set to TRUNCATE). I haven't examined how other filesystems treat unlink(). I haven't also tested with other pragma options and other journal_modes. I expect it wouldn't be a problem with WAL (it should be correct even if the log exists after the reboot); however, with things like VACUUM-ing, the database size would probably not be reduced if a power crash happened, because of ftruncate() not being synchronous. More generally, I think some of the code assumes that all non-write and non-create operations that modify a file (in other words, unlink and ftruncate) are synchronous; but that's true in Linux only if the partition is mounted with "-o dirsync". One more random thought: if I'm not wrong, "SQLite crash-tests" currently tests partial-flushing only with write() calls Again, sorry for the spam if I'm just being bonkers. Thanks, Thanu ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users