Re: [sqlite] Database corruption on Linux ext3

2010-07-17 Thread Chris Wedgwood
On Thu, Jul 15, 2010 at 10:20:22PM +0200, Florian Weimer wrote:

> It's a generic Linux problem, not an ext3-specific issue.  Until
> recently, the Linux block layer had no concept of a sync operation.
> Linux basically assumed that all writes were synchronous and ordered,
> which they are not if your hardware has (non-transparent) write
> caches.  It did not disable write caching by default, either.

For quite some time, ext3 in mainline and in most distros defaulted to
nobarrier (barrier=0); the only exception to this I'm aware of was
SLES.

Mainline has since switched to having barriers on by default; I assume the
distros have followed or will follow.

Some other filesystems (XFS for example) have had barriers enabled by
default.
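
If you want to be explicit about it rather than rely on the default, a
minimal sketch (device and mount point are hypothetical; barrier=1 is the
relevant ext3 mount option):

  # /etc/fstab entry with write barriers explicitly enabled
  /dev/sda1   /var/lib/mydata   ext3   defaults,barrier=1   0 2

  # or at runtime, without editing fstab
  mount -o remount,barrier=1 /var/lib/mydata

Whether the barrier actually reaches the platter still depends on the rest
of the storage stack, so this is necessary but not always sufficient.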


Re: [sqlite] Database corruption on Linux ext3

2010-07-15 Thread Florian Weimer
* D. Richard Hipp:

> An appliance manufacturer has discovered a database corruption issue  
> on Linux using ext3.   The issue is documented here:
>
>  http://www.sqlite.org/draft/lockingv3.html#ext3-barrier-problem

It's a generic Linux problem, not an ext3-specific issue.  Until
recently, the Linux block layer had no concept of a sync operation.
Linux basically assumed that all writes were synchronous and ordered,
which they are not if your hardware has (non-transparent) write
caches.  It did not disable write caching by default, either.
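
As a practical aside, the drive-level write cache can be inspected and, if
need be, turned off with hdparm (device name hypothetical):

  hdparm -W /dev/sda       # report whether the drive write cache is enabled
  hdparm -W0 /dev/sda      # disable it, at a real cost in write throughput

Disabling the cache is the blunt alternative when barriers or cache flushes
cannot be relied on.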


Re: [sqlite] Database corruption on Linux ext3

2010-07-14 Thread Jim Wilcoxson
On Wed, Jul 14, 2010 at 1:35 AM, Roger Binns  wrote:

> On 07/13/2010 05:30 PM, Jim Wilcoxson wrote:
> > I don't think this would work, because the problem described is that the
> > writes aren't making it to disk.  If pages don't make it to disk, the old
> > pages will be present, with the old, still-valid checksums.
>
> You are assuming the checksums are stored in the page they checksum.  That
> would only detect corruption of that page.  You could have pages that store
> the checksums of numerous other pages, so both the checksum page and the
> data page would have to fail to make it to disk for the corruption to go
> unnoticed.  Yes, there are scenarios where you could still get old but
> apparently valid pages, but those are harder to hit.
>

It seems there are several levels of checking possible:

- a checksum on the page itself lets you detect some errors, with no extra I/O
- a checksum page covering a group of pages lets you detect missing writes
within the group, with some extra I/O
- a checksum of all the checksum pages lets you detect missing writes for an
entire commit, with even more extra I/O

How much extra I/O depends on the size of the db, page size, and how much
memory is available for caching checksum pages.
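
To make the three levels concrete, here is a rough sketch in C.  The page
and group sizes and the FNV-1a placeholder checksum are my own choices for
illustration, not anything SQLite actually does:

#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE  1024   /* hypothetical page size */
#define GROUP_SIZE 256    /* data pages covered by one checksum page */

/* Placeholder 32-bit checksum (FNV-1a); a real scheme would likely use
 * something stronger. */
static uint32_t page_checksum(const uint8_t *p, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 16777619u; }
    return h;
}

/* Level 1: verify a data page against its own stored checksum
 * (detects in-page corruption, no extra I/O). */
int verify_page(const uint8_t *page, uint32_t stored)
{
    return page_checksum(page, PAGE_SIZE) == stored;
}

/* Level 2: a checksum page holds the checksums of GROUP_SIZE data pages
 * (detects a data page whose new contents never reached disk, at the cost
 * of reading and updating the checksum page). */
int verify_group(const uint8_t pages[GROUP_SIZE][PAGE_SIZE],
                 const uint32_t *cksum_page)
{
    for (int i = 0; i < GROUP_SIZE; i++)
        if (page_checksum(pages[i], PAGE_SIZE) != cksum_page[i])
            return 0;
    return 1;
}

/* Level 3: one root checksum over all the checksum pages
 * (detects a checksum page that itself failed to reach disk). */
uint32_t root_checksum(const uint32_t *all_cksums, size_t n_pages)
{
    return page_checksum((const uint8_t *)all_cksums,
                         n_pages * sizeof(uint32_t));
}

A missing write at any level then shows up as a mismatch one level up.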

Scott mentioned that a detection system without the ability to correct might
not be useful, but I think it is.  Not as good as correction, of course, but
worthwhile because:

- it might prevent the application program from issuing a bogus error
message like "the row you asked for isn't in the database"; lots of time
could be spent in the weeds chasing down a misleading error

- some applications might have backup copies of the database; they could
display an error message and revert to a backup

Jim


Re: [sqlite] Database corruption on Linux ext3

2010-07-13 Thread Roger Binns

On 07/13/2010 05:30 PM, Jim Wilcoxson wrote:
> I don't think this would work, because the problem described is that the
> writes aren't making it to disk.   If pages don't make it to disk, the old
> pages will be present, with the old, still-valid checksums.

You are assuming the checksums are stored in the page they checksum.  That
would only detect corruption of that page.  You could have pages that store
the checksums of numerous other pages, so both the checksum page and the
data page would have to fail to make it to disk for the corruption to go
unnoticed.  Yes, there are scenarios where you could still get old but
apparently valid pages, but those are harder to hit.

> SQLite would have to verify the checksum on every page when the
> database is opened and a hot journal exists, which could be quite a lot of
> overhead for a large database.

I would want checksums verified whenever pages are read.  For more valuable
data I would also want a thorough check at open time; I already run the
integrity_check/quick_check pragmas on open as appropriate.
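
For reference, that open-time check is just the existing pragma; a minimal
sketch in C of running it at open time (hypothetical database path, error
handling trimmed):

#include <stdio.h>
#include <string.h>
#include <sqlite3.h>

/* Run PRAGMA quick_check when opening a database and report anything it
 * flags.  quick_check is cheaper than integrity_check because it skips the
 * verification that index content matches table content. */
static int check_on_open(const char *path)
{
    sqlite3 *db = 0;
    sqlite3_stmt *stmt = 0;
    int ok = 0;

    if (sqlite3_open(path, &db) != SQLITE_OK) return 0;
    if (sqlite3_prepare_v2(db, "PRAGMA quick_check;", -1, &stmt, 0) == SQLITE_OK) {
        ok = 1;
        while (sqlite3_step(stmt) == SQLITE_ROW) {
            const char *msg = (const char *)sqlite3_column_text(stmt, 0);
            if (!msg || strcmp(msg, "ok") != 0) {
                ok = 0;
                fprintf(stderr, "quick_check: %s\n", msg ? msg : "(null)");
            }
        }
        sqlite3_finalize(stmt);
    }
    sqlite3_close(db);
    return ok;   /* 1 = database looks consistent */
}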

> I think a checksum on every page, and maybe even an error-correction code,
> is a great idea as an optional feature.

It would have to be optional, much like WAL, both for backwards-compatibility
reasons and because not everyone would want it.

> But it would only detect hardware problems and bit rot.

Plus normal errors, such as the undetected error once per roughly 20TB of I/O
that I hope I'm remembering correctly from the ZFS paper.  I don't know what
the error rate of flash is, nor how that interacts with SATA signalling error
rates, but all these numbers are non-zero, and we keep doing more and more
I/O, have larger datasets, and keep data around for longer.  Plus we do like
buying cheap storage :-)

> This problem of not doing writes, or doing them in
> the wrong order, is a different animal IMO.

There will always be failure scenarios you can construct that are very hard
to detect.  Hopefully checksumming, or some similar data-integrity scheme,
would make it easier to detect at least some of them; i.e., things could be
made better than they are now.

Roger


Re: [sqlite] Database corruption on Linux ext3

2010-07-13 Thread Scott Hess
On Tue, Jul 13, 2010 at 10:03 PM, Scott Hess  wrote:
> On Tue, Jul 13, 2010 at 8:24 PM, Simon Slavin  wrote:
>> It might be useful to figure out whether we're aiming for
>> detection or correction.  By 'correction' I don't mean recovery
>> of all information, I mean restoring the database to some state
>> it was in just after a COMMIT took effect.  There's no point in
>> implementing a detection system if the users consider "This
>> database is corrupt" something worth complaining about.  On the
>> other hand, implementing a correction system may well slow down
>> every write operation and perhaps '_open' too.  It's not worth
>> doing that if slowing down SQLite will decrease usability.
>
> The best case is a system where corruption cannot happen.  Since
> that's clearly impossible ...
>
> Second-best would be an ability to rollback to a priori valid state.

[Sigh, did not mean some whizzy technical term, there.  Meant "a prior
valid state."]

-scott


Re: [sqlite] Database corruption on Linux ext3

2010-07-13 Thread Scott Hess
On Tue, Jul 13, 2010 at 8:24 PM, Simon Slavin  wrote:
> It might be useful to figure out whether we're aiming for
> detection or correction.  By 'correction' I don't mean recovery
> of all information, I mean restoring the database to some state
> it was in just after a COMMIT took effect.  There's no point in
> implementing a detection system if the users consider "This
> database is corrupt" something worth complaining about.  On the
> other hand, implementing a correction system may well slow down
> every write operation and perhaps '_open' too.  It's not worth
> doing that if slowing down SQLite will decrease usability.

The best case is a system where corruption cannot happen.  Since
that's clearly impossible ...

Second-best would be an ability to rollback to a priori valid state.
WAL should provide some of this, because over time the system should
tend towards having everything stably written to the disk.  You just
have to think about how much you're willing to lose.

Currently, there is a gray area where you've seen something which is
incorrect, but you cannot reason about how extensive the damage is.
Checksums can provide you with some ability to detect corruption more
quickly, and perhaps a way to reason about it.  You could have a
function like iscorrupt(rowid) which could be used to salvage
uncorrupted rows, and then carefully inspect corrupted rows.  Or maybe
you could delete the corrupt rows and then the table would be clean.
Perhaps there could even be an ON CORRUPTION trigger involved.

[Obviously, here I'm not thinking about the case where a DBA sits down
and reconstructs things.  I'm thinking about what you do when a
database in the field has corruption.  For instance, Chrome has a
history database, and being able to easily delete corrupted rows and
reconstruct indices would be an improvement over nuking the database
from orbit.  Without explicit database support, though, I've always
been nervous about attempting such things in an ad-hoc fashion.]
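
As a rough sketch of what that ad-hoc approach looks like today, assuming a
hypothetical "history" table and assuming SQLite stays usable enough to
return SQLITE_CORRUPT row by row rather than failing outright:

#include <sqlite3.h>

/* Ad-hoc salvage sketch: walk the rowids of a (hypothetical) "history"
 * table, delete any row whose contents can no longer be read back, then
 * rebuild the indices.  Best-effort only -- a badly corrupted file may
 * fail long before this loop finishes. */
static void salvage_history(sqlite3 *db)
{
    sqlite3_stmt *ids = 0, *get = 0, *del = 0;

    sqlite3_prepare_v2(db, "SELECT rowid FROM history;", -1, &ids, 0);
    sqlite3_prepare_v2(db, "SELECT * FROM history WHERE rowid = ?;", -1, &get, 0);
    sqlite3_prepare_v2(db, "DELETE FROM history WHERE rowid = ?;", -1, &del, 0);

    while (sqlite3_step(ids) == SQLITE_ROW) {
        sqlite3_int64 id = sqlite3_column_int64(ids, 0);

        sqlite3_bind_int64(get, 1, id);
        if (sqlite3_step(get) != SQLITE_ROW) {      /* e.g. SQLITE_CORRUPT */
            sqlite3_bind_int64(del, 1, id);
            sqlite3_step(del);
            sqlite3_reset(del);
        }
        sqlite3_reset(get);
    }

    sqlite3_finalize(ids);
    sqlite3_finalize(get);
    sqlite3_finalize(del);

    /* Rebuild indices from whatever rows survived. */
    sqlite3_exec(db, "REINDEX;", 0, 0, 0);
}

Without checksums there is no way for a pass like this to know whether the
rows it did read back are actually intact, which is exactly the gap an
iscorrupt()-style function would fill.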

-scott


Re: [sqlite] Database corruption on Linux ext3

2010-07-13 Thread Simon Slavin

On 14 Jul 2010, at 1:30am, Jim Wilcoxson wrote:

> This problem of not doing writes, or doing them in
> the wrong order, is a different animal IMO.

If writes are not happening, or are happening in the wrong order, you're in 
trouble.  It's almost impossible to figure out how to even detect that hardware 
problem without a time-consuming scan of each unit that should have been 
written, which, in SQLite, means reading every page.  Since SQLite doesn't run 
a server process, it has no opportunity to use slack time to check integrity.

Under the conditions described on the web page, this problem can happen only 
because of a power failure or an OS (not an application) crash.  Under these 
conditions, the ext3 file system doesn't support ACID at all: any system that 
relies on ACID is not going to work.  And if the file system doesn't support 
ACID, the software can't.  I don't see any fast way of solving that kind of 
problem.

It might be useful to figure out whether we're aiming for detection or 
correction.  By 'correction' I don't mean recovery of all information, I mean 
restoring the database to some state it was in just after a COMMIT took effect. 
 There's no point in implementing a detection system if the users consider 
"This database is corrupt" something worth complaining about.  On the other 
hand, implementing a correction system may well slow down every write operation 
and perhaps '_open' too.  It's not worth doing that if slowing down SQLite will 
decrease usability.

Simon.


Re: [sqlite] Database corruption on Linux ext3

2010-07-13 Thread Jim Wilcoxson
On Tue, Jul 13, 2010 at 8:06 PM, Roger Binns  wrote:

>
> On 07/13/2010 04:57 PM, Simon Slavin wrote:
> > One on each page and one for the entire file that checksums the page
> checksums ?
>
> One for each page plus one of the header would make the most sense, but the
>

I don't think this would work, because the problem described is that the
writes aren't making it to disk.  If pages don't make it to disk, the old
pages will be present, with the old, still-valid checksums.

The only way I can see checksums helping with this problem is if there is a
checksum over the entire file (or checksums of the checksums of each page).
Then if some writes reach the disk but not all of them, the overall checksum
will be invalid.  SQLite would have to verify the checksum on every page when the
database is opened and a hot journal exists, which could be quite a lot of
overhead for a large database.

Plus, SQLite would have to keep a list of the checksums for every page and, at
commit time, recompute the overall hash/checksum.  This could be a lot of
memory for a large database.  A 1GB database at the default 1KB page size, for
example, would require 1M 20-byte SHA-1 hashes, so 20MB.  If a bit for every
page in the database was a scalability problem in earlier versions of SQLite,
I'm guessing that 20 bytes for every page would be unworkable.
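
Spelling the arithmetic out (the 1KB default page size is the assumption
here; the CRC32 line is just a hypothetical cheaper alternative):

  1 GB database / 1 KB per page           = ~1,048,576 pages
  1,048,576 pages x 20 bytes per SHA-1    = ~20 MB of checksum state
  1,048,576 pages x  4 bytes per CRC32    = ~4 MB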

I think a checksum on every page, and maybe even an error-correction code,
is a great idea as an optional feature.  But it would only detect hardware
problems and bit rot.  This problem of not doing writes, or doing them in
the wrong order, is a different animal IMO.

Jim
--
HashBackup: easy onsite and offsite Unix backup
http://sites.google.com/site/hashbackup


Re: [sqlite] Database corruption on Linux ext3

2010-07-13 Thread Roger Binns

On 07/13/2010 04:57 PM, Simon Slavin wrote:
> One on each page and one for the entire file that checksums the page 
> checksums ?

One for each page plus one of the header would make the most sense, but the
overriding concern would be something that is as backwards and forwards
compatible as possible.  The journal would also need to be covered, as well
as some indication linking the database and the journal.

Currently, if a stray zero gets written somewhere in the file, the chances
of it being detected are virtually nil.  It may even be possible for entire
512-byte sectors to revert to all zeroes or all 0xff without being detected.

Roger



Re: [sqlite] Database corruption on Linux ext3

2010-07-13 Thread Simon Slavin

On 13 Jul 2010, at 9:26pm, Roger Binns wrote:

> On 07/13/2010 12:59 PM, D. Richard Hipp wrote:
>> You are encouraged to submit comments, insights, criticism, and  
>> analysis to this mailing list.  Thanks.
> 
> Have you considered adding internal checksums to SQLite files so that at the
> very least corruption can be detected?

One on each page and one for the entire file that checksums the page checksums?
I've seen that done elsewhere.  Perhaps instead one for each thing that needs 
its own pages (e.g. one for each table and index).  Or just one for each table, 
with the index data included in it.

Simon.


Re: [sqlite] Database corruption on Linux ext3

2010-07-13 Thread Roger Binns

On 07/13/2010 12:59 PM, D. Richard Hipp wrote:
> You are encouraged to submit comments, insights, criticism, and  
> analysis to this mailing list.  Thanks.

Have you considered adding internal checksums to SQLite files so that at the
very least corruption can be detected?  I think this is one of the best
features of the current crop of version control systems.

IIRC one of the original papers describing ZFS claimed an undetected error
rate of once every 20TB of activity with hard drives.  That is not an
unreasonable amount of disk access in one day now.

Errors could of course instead be detected by using better filesystems,
drives, protocols etc, but we often don't have the luxury of dictating what
systems SQLite databases will be used on.

Roger


[sqlite] Database corruption on Linux ext3

2010-07-13 Thread D. Richard Hipp
An appliance manufacturer has discovered a database corruption issue  
on Linux using ext3.   The issue is documented here:

 http://www.sqlite.org/draft/lockingv3.html#ext3-barrier-problem

You are encouraged to submit comments, insights, criticism, and  
analysis to this mailing list.  Thanks.

D. Richard Hipp
d...@hwaci.com


