Re: [sqlite] Database corruption on Linux ext3
On Thu, Jul 15, 2010 at 10:20:22PM +0200, Florian Weimer wrote: > It's a generic Linux problem, not an ext3-specific issue. Until > recently, the Linux block layer had no concept of a sync operation. > Linux basically assumed that all writes were synchronous and ordered, > which they are not if your hardware has (non-transparent) write > caches. It did not disable write caching by default, either. For quite some time, ext3 in mainline and most distro's defaulted to nobarrier (barrier=0) --- the only exception to this I'm aware of was SLES. Mainline switched to having barriers on by default, I assume the distro's have or will follow. Some other filesystems (XFS for example) have had barriers enabled by default. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Database corruption on Linux ext3
* D. Richard Hipp: > An appliance manufacturer has discovered a database corruption issue > on Linux using ext3. The issue is documented here: > > http://www.sqlite.org/draft/lockingv3.html#ext3-barrier-problem It's a generic Linux problem, not an ext3-specific issue. Until recently, the Linux block layer had no concept of a sync operation. Linux basically assumed that all writes were synchronous and ordered, which they are not if your hardware has (non-transparent) write caches. It did not disable write caching by default, either. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Database corruption on Linux ext3
On Wed, Jul 14, 2010 at 1:35 AM, Roger Binns wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > On 07/13/2010 05:30 PM, Jim Wilcoxson wrote: > > I don't think this would work, because the problem described is that the > > writes aren't making it to disk. If pages don't make it to disk, the > old > > pages will be present, with the old, and valid checksums. > > You are assuming the checksums are stored in the page they checksum. That > would only detect corruption of that page. You could have pages that store > the checksums of numerous other pages, so both the checksum page and the > data page would have to fail to make it to disk. Yes, there are scenarios > where you could still get old apparently valid pages, but those are harder > to happen. > It seems there are several level of checking possible: - checksum on the page itself lets you detect some errors, with no extra I/O - checksum pages for a group of pages lets you detect missing writes within the group, with some extra I/O - checksum of all checksum pages lets you detect missing writes for an entire commit, with even more extra I/O How much extra I/O depends on the size of the db, page size, and how much memory is available for caching checksum pages. Scott mentioned that a detection system without the ability to correct might not be useful, but I think it is useful. Not as good as correction of course, but useful because: - it might prevent the application program from issuing a bogus error message like "the row you asked for isn't in the database"; lots of time could be spent in the weeds chasing down a misleading error - some applications might have backup copies of the database; they could display an error message and revert to a backup Jim ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Database corruption on Linux ext3
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 07/13/2010 05:30 PM, Jim Wilcoxson wrote: > I don't think this would work, because the problem described is that the > writes aren't making it to disk. If pages don't make it to disk, the old > pages will be present, with the old, and valid checksums. You are assuming the checksums are stored in the page they checksum. That would only detect corruption of that page. You could have pages that store the checksums of numerous other pages, so both the checksum page and the data page would have to fail to make it to disk. Yes, there are scenarios where you could still get old apparently valid pages, but those are harder to happen. > SQLite would have to verify the checksum on every page when the > database is opened and a hot journal exists, which could be quite a lot of > overhead for a large database. I would want checksums looked at when reading pages always. For more valuable data I would want a thorough open check. I already use the pragmas (integrity/quick check) on opens as appropriate. > I think a checksum on every page, and maybe even an error-correction code, > is a great idea as an optional feature. It would have to be optional, much like WAL for backwards compatibility issues, and because not everyone would want this. > But it would only detect hardware problems and bit rot. Plus normal errors, such as the undetectable one every 20TB of I/O rate I hope I remembered correctly from the ZFS paper. I don't know what the error rate of Flash is, nor how that interacts with SATA signalling error rates, but all these numbers are non-zero and we keep doing more and more I/O, have larger datasets, and have things around for longer. Plus we do like buying cheap storage :-) > This problem of not doing writes, or doing them in > the wrong order, is a different animal IMO. There will always be loopholes you could construct with that that are very hard to detect. Hopefully checksumming or some sort of similar data integrity scheme would make it easier to detect some of the scenarios. ie things could be made better than they are now. Roger -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw9TJYACgkQmOOfHg372QRnkACfdJp+FZHvxXeLNN0DMCHSNJ1+ 3E8AnR1PXa4PTQ1ridHBaAC7WpSSwnyM =5sbp -END PGP SIGNATURE- ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Database corruption on Linux ext3
On Tue, Jul 13, 2010 at 10:03 PM, Scott Hess wrote: > On Tue, Jul 13, 2010 at 8:24 PM, Simon Slavin wrote: >> It might be useful to figure out whether we're aiming for >> detection or correction. By 'correction' I don't mean recovery >> of all information, I mean restoring the database to some state >> it was in just after a COMMIT took effect. There's no point in >> implementing a detection system if the users consider "This >> database is corrupt" something worth complaining about. On the >> other hand, implementing a correction system may well slow down >> every write operation and perhaps '_open' too. It's not worth >> doing that if slowing down SQLite will decrease usability. > > The best case is a system where corruption cannot happen. Since > that's clearly impossible ... > > Second-best would be an ability to rollback to a priori valid state. [Sigh, did not mean some whizzy technical term, there. Meant "a prior valid state."] -scott ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Database corruption on Linux ext3
On Tue, Jul 13, 2010 at 8:24 PM, Simon Slavin wrote: > It might be useful to figure out whether we're aiming for > detection or correction. By 'correction' I don't mean recovery > of all information, I mean restoring the database to some state > it was in just after a COMMIT took effect. There's no point in > implementing a detection system if the users consider "This > database is corrupt" something worth complaining about. On the > other hand, implementing a correction system may well slow down > every write operation and perhaps '_open' too. It's not worth > doing that if slowing down SQLite will decrease usability. The best case is a system where corruption cannot happen. Since that's clearly impossible ... Second-best would be an ability to rollback to a priori valid state. WAL should provide some of this, because over time the system should tend towards having everything stably written to the disk. You just have to think about how much you're willing to lose. Currently, there is a gray area where you've seen something which is incorrect, but you cannot reason about how extensive the damage is. Checksums can provide you with some ability to detect corruption more quickly, and perhaps a way to reason about it. You could have a function like iscorrupt(rowid) which could be used to salvage uncorrupted rows, and then carefully inspect corrupted rows. Or maybe you could delete the corrupt rows and then the table would be clean. Perhaps there could even be an ON CORRUPTION trigger involved. [Obviously, here I'm not thinking about the case where a DBA sits down and reconstructs things. I'm thinking about what you do when a database in the field has corruption. For instance, Chrome has a history database, and being able to easily delete corrupted rows and reconstruct indices would be an improvement over nuking the database from orbit. Without explicit database support, though, I've always been nervous about attempting such things in an ad-hoc fashion.] -scott ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Database corruption on Linux ext3
On 14 Jul 2010, at 1:30am, Jim Wilcoxson wrote: > This problem of not doing writes, or doing them in > the wrong order, is a different animal IMO. If writes are not happening, or are happening in the wrong order, you're in trouble. It's almost impossible to figure out how to even detect that hardware problem without a time-consuming scan of each unit that should be written which, in SQLite, means reading every page. Since SQLite doesn't run a server process, it has no opportunity to use slack time to check integrity. Under the conditions described on the web page, this problem can happen only because of a power failure or an OS (not an application) crash. Under these conditions, the ext3 file system doesn't support ACID at all: any system that relies on ACID is not going to work. And if the file system doesn't support ACID, the software can't. I don't see any fast way of solving that kind of problem. It might be useful to figure out whether we're aiming for detection or correction. By 'correction' I don't mean recovery of all information, I mean restoring the database to some state it was in just after a COMMIT took effect. There's no point in implementing a detection system if the users consider "This database is corrupt" something worth complaining about. On the other hand, implementing a correction system may well slow down every write operation and perhaps '_open' too. It's not worth doing that if slowing down SQLite will decrease usability. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Database corruption on Linux ext3
On Tue, Jul 13, 2010 at 8:06 PM, Roger Binns wrote: > > On 07/13/2010 04:57 PM, Simon Slavin wrote: > > One on each page and one for the entire file that checksums the page > checksums ? > > One for each page plus one of the header would make the most sense, but the > I don't think this would work, because the problem described is that the writes aren't making it to disk. If pages don't make it to disk, the old pages will be present, with the old, and valid checksums. The only way I can see checksums helping with this problem is if there is a checksum over the entire file (or checksums of checksums of each page). Then if you do any writes, but not all writes, the overall checksum will be invalid. SQLite would have to verify the checksum on every page when the database is opened and a hot journal exists, which could be quite a lot of overhead for a large database. Plus, SQLite would have to keep a list of the checksums for every page, and at commit time, recompute the overall hash/checksum. This could be lots of memory for a large database. A 1GB database for example would require 1M 20-byte SHA1 hashes, so 20MB. If a bit for every page in the database was a scalability problem in earlier versions of SQLite, I'm guessing that 20 bytes for every page would be unworkable. I think a checksum on every page, and maybe even an error-correction code, is a great idea as an optional feature. But it would only detect hardware problems and bit rot. This problem of not doing writes, or doing them in the wrong order, is a different animal IMO. Jim -- HashBackup: easy onsite and offsite Unix backup http://sites.google.com/site/hashbackup ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Database corruption on Linux ext3
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 07/13/2010 04:57 PM, Simon Slavin wrote: > One on each page and one for the entire file that checksums the page > checksums ? One for each page plus one of the header would make the most sense, but the overriding concern would be something that is as backwards and forwards compatible as possible. The journal would also need to be covered, as well as some indication linking the database and the journal. Currently if you wrote a stray zero somewhere in the file then the chances of it being detected are virtually zero. It may even be possible to have entire sectors/512 bytes revert to all zeroes or 0xff without it being detected. Roger -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw8/48ACgkQmOOfHg372QTZ8ACfVIvqoavqysTwII+c8N13gLp2 7XUAoIa/D3j8+DmuYtT/6I9TAjOBGgP7 =5lT2 -END PGP SIGNATURE- ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Database corruption on Linux ext3
On 13 Jul 2010, at 9:26pm, Roger Binns wrote: > On 07/13/2010 12:59 PM, D. Richard Hipp wrote: >> You are encouraged to submit comments, insights, criticism, and >> analysis to this mailing list. Thanks. > > Have you considered adding internal checksums to SQLite files so that at the > very least corruption can be detected? One on each page and one for the entire file that checksums the page checksums ? I've seen that done elsewhere. Perhaps instead one for each thing that needs its own pages (e.g. one for each table and index). Or just one for each table which includes the index data in with it. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Database corruption on Linux ext3
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 07/13/2010 12:59 PM, D. Richard Hipp wrote: > You are encouraged to submit comments, insights, criticism, and > analysis to this mailing list. Thanks. Have you considered adding internal checksums to SQLite files so that at the very least corruption can be detected? I think this is one of the best features of the current crop of version control systems. IIRC one of the original papers describing ZFS claimed an undetected error rate of once every 20TB of activity with hard drives. That is not an unreasonable amount of disk access in one day now. Errors could of course instead be detected by using better filesystems, drives, protocols etc, but we often don't have the luxury of dictating what systems SQLite databases will be used on. Roger -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkw8y/IACgkQmOOfHg372QTWfgCgjcPCvOLQuPv1Xqc6XM3TkXLY X80An2TFiktmQxOY0sMxO1VLD74YzYpX =tJ9V -END PGP SIGNATURE- ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
[sqlite] Database corruption on Linux ext3
An appliance manufacturer has discovered a database corruption issue on Linux using ext3. The issue is documented here: http://www.sqlite.org/draft/lockingv3.html#ext3-barrier-problem You are encouraged to submit comments, insights, criticism, and analysis to this mailing list. Thanks. D. Richard Hipp d...@hwaci.com ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users