[chromium-dev] Re: Paper about DRAM error rates
I think it would be an interesting experiment as well. The paper provided a lot of interesting information on DRAM failure rates, but I don't think you can infer failure rates for the general population solely from examining DRAM failures in a very isolated (though large) population such as Google's data centers. Google data centers introduce unique environmental factors (custom motherboards, power supplies, network linking, etc.) which can all affect failure rates. While the correlations drawn between failure rate and things such as utilization and temperature may be applicable, I think the overall failure rate is not.

-Shane

On Oct 7, 12:37 pm, Scott Hess sh...@chromium.org wrote:
> I think it would be a very interesting experiment, because it would firewall the SQLite code from Chrome memory stompers (which IMHO are a much more likely danger than DRAM errors).
>
> -scott

--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com
View archives, change email options, or unsubscribe: http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---
[chromium-dev] Re: Paper about DRAM error rates
On Tue, Oct 6, 2009 at 6:49 PM, John Abd-El-Malek j...@chromium.org wrote:
> It's about getting rid of nasty problems like the browser process crashing every startup because of a corrupt database and decreasing browser process crashes in general.

I am pretty sure that the sqlite wrapper and sanity checking work that's going on is a better fix than moving to a different process. Not only is it lower overhead, but we'd have to write error-handling code _anyway_ since the sqlite process could crash/fail, so IMO it's extra overhead that doesn't buy us anything.

Not that you're not welcome to try doing it, but I wouldn't waste the time.

PK
[chromium-dev] Re: Paper about DRAM error rates
It will be helpful to get our own measurement on database failures. Carlos just added something like that.

Huan

On Tue, Oct 6, 2009 at 3:49 PM, John Abd-El-Malek j...@chromium.org wrote:
> Saw this on slashdot: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
>
> The conclusion is an average of 25,000–75,000 FIT (failures in time per billion hours of operation) per Mbit. On my machine the browser process is usually 100MB, so that averages out to 176 to 493 errors per year, with those numbers having big variance depending on the machine. Most users don't have ECC, which means this will lead to corruption. Sqlite is a heavy user of memory, so even if it's 1/4 of the 100MB, that means we'll see an average of 40-120 errors per year naturally because of faulty DIMMs.
>
> Given that sqlite corruption means (repeated) crashing of the browser process, it seems this data heavily suggests we should separate sqlite code into a separate process. The IPC overhead is negligible compared to disk access. My hunch is that the complexity is also not that high, since the code that deals with it is already asynchronous since we don't use sqlite on the UI/IO threads.
>
> What do others think?
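The arithmetic behind the quoted estimate can be reproduced with a rough sketch. Two assumptions here are not stated in the original message: 1 MB is counted as 8 Mbit, and the machine is powered on for all 8,760 hours of a year. Under those assumptions, 25,000 FIT per Mbit gives about 175 errors per year for 100MB, close to the quoted 176. The quoted upper figure of 493 corresponds to roughly 70,000 FIT per Mbit; 75,000 would give about 526 under the same assumptions.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: the machine is powered on all year
MBIT_PER_MB = 8            # assumption: 1 MB is counted as 8 Mbit


def errors_per_year(fit_per_mbit, megabytes):
    """Expected errors per year for a FIT rate (failures per 1e9 hours per Mbit)."""
    mbits = megabytes * MBIT_PER_MB
    return fit_per_mbit * mbits * HOURS_PER_YEAR / 1e9


if __name__ == "__main__":
    for fit in (25_000, 70_000, 75_000):
        whole = errors_per_year(fit, 100)   # whole browser process
        quarter = errors_per_year(fit, 25)  # the 1/4 share attributed to sqlite
        print(f"{fit} FIT/Mbit: {whole:.1f}/year (100 MB), {quarter:.1f}/year (25 MB)")
```

The 25 MB column is the "1/4 of the 100MB" case from the message, which is where the 40-120 range comes from.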
[chromium-dev] Re: Paper about DRAM error rates
I'm not sure how Carlos is doing it. Will we know if something is corrupt just on load/save? I imagine there's no way we can know when corruption happens in steady state and the next query leads to some other browser memory (or another database) getting corrupted?

On Tue, Oct 6, 2009 at 3:58 PM, Huan Ren hu...@google.com wrote:
> It will be helpful to get our own measurement on database failures. Carlos just added something like that.
>
> Huan
[chromium-dev] Re: Paper about DRAM error rates
On Tue, Oct 6, 2009 at 4:30 PM, Carlos Pizano c...@google.com wrote:
> On Tue, Oct 6, 2009 at 4:14 PM, John Abd-El-Malek j...@chromium.org wrote:
> > I'm not sure how Carlos is doing it? Will we know if something is corrupt just on load/save?
>
> Many sqlite calls can return sqlite_corrupt, for example a query or an insert. We just check for error codes 1 to 26, with 5 or 6 of them being serious errors such as sqlite_corrupt.
>
> I am sure that random bit flips in memory and on disk are the cause of some crashes; this is probably the 'limit' factor of how low the crash rate of a perfect program deployed on millions of computers can go.

The point I was trying to make is that the 'limit' factor, as you put it, is proportional to memory usage. Given our large memory consumption in the browser process, the numbers from the paper imply dozens of corruptions just in sqlite memory per user. Even if only a small fraction of these are harmful, spread over millions of users that's a lot of corruption.

> But I am unsure how to calculate it. For example, a random bit flip in the backingstores (which add up to at least 10M on most machines) does not hurt, nor does one in the middle of a cache entry or in the data part of some structure.
>
> > I imagine there's no way we can know when corruption happens in steady state and the next query leads to some other browser memory (or another database) getting corrupted?
[chromium-dev] Re: Paper about DRAM error rates
Our use of exclusive locking and page-cache preloading may open us up more to this kind of shenanigans. Basically SQLite will trust those pages which we faulted into memory days ago. We could mitigate that somewhat, but this problem reaches into areas we cannot materially impact, such as filesystem caches. And don't even begin to imagine that there are not similar issues with commodity disk drives and controllers.

That said, I don't think this is an incremental addition of any kind. I've pointed it out before: there are things in the woods which corrupt databases. We could MAYBE reduce occurrences to a suitable minimum using checksumming or something of the sort, but in the end we still have to detect corruption and decide what course to take from there.

-scott

On Tue, Oct 6, 2009 at 4:59 PM, John Abd-El-Malek j...@chromium.org wrote:
> The point I was trying to make is that the 'limit' factor as you put it is proportional to memory usage. Given our large memory consumption in the browser process, the numbers from the paper imply dozens of corruptions just in sqlite memory per user. Even if only a small fraction of these are harmful, spread over millions of users that's a lot of corruption.
[chromium-dev] Re: Paper about DRAM error rates
On Tue, Oct 6, 2009 at 4:59 PM, John Abd-El-Malek j...@chromium.org wrote:
> The point I was trying to make is that the 'limit' factor as you put it is proportional to memory usage. Given our large memory consumption in the browser process, the numbers from the paper imply dozens of corruptions just in sqlite memory per user. Even if only a small fraction of these are harmful, spread over millions of users that's a lot of corruption.

For what it's worth: This makes sense to me. It seems like pulling SQLite into its own process would be helpful for the reasons you laid out. I wonder if the only reason no one else has chimed in on this thread is that no one wants to have to implement it. :-)
[chromium-dev] Re: Paper about DRAM error rates
On Tue, Oct 6, 2009 at 5:09 PM, Jeremy Orlow jor...@chromium.org wrote:
> For what it's worth: This makes sense to me. It seems like pulling SQLite into its own process would be helpful for the reasons you laid out. I wonder if the only reason no one else has chimed in on this thread is that no one wants to have to implement it. :-)

Chase is going to start investigating it (i.e. figure out what the cost of doing it is, how much change it requires, and ways of measuring the benefit).
[chromium-dev] Re: Paper about DRAM error rates
On Tue, Oct 6, 2009 at 5:09 PM, Jeremy Orlow jor...@chromium.org wrote:
> For what it's worth: This makes sense to me. It seems like pulling SQLite into its own process would be helpful for the reasons you laid out. I wonder if the only reason no one else has chimed in on this thread is that no one wants to have to implement it. :-)

I don't understand how this paper makes it very useful to pull SQLite into a separate process. I can see how it would make it useful to minimize how much in-memory data SQLite keeps, regardless of where SQLite lives. I can also see how this effect would increase our incidence of memory stompers, but in a lot of cases it just won't matter. If it corrupts a piece of data which we pass to SQLite, passing it directly or via IPC won't change that. Most memory stompers won't manifest by hitting SQLite anyhow, unless there's reason to believe specific bits will be munged.

-scott
[chromium-dev] Re: Paper about DRAM error rates
On Tue, Oct 6, 2009 at 5:10 PM, Scott Hess sh...@chromium.org wrote:
> Our use of exclusive locking and page-cache preloading may open us up more to this kind of shenanigans. Basically SQLite will trust those pages which we faulted into memory days ago. We could mitigate against that somewhat, but this problem reaches into areas we cannot materially impact, such as filesystem caches. And don't even begin to imagine that there are not similar issues with commodity disk drives and controllers.
>
> That said, I don't think this is an incremental addition of any kind. I've pointed it out before, there are things in the woods which corrupt databases. We could MAYBE reduce occurrences to a suitable minimum using check-summing or something of the sort, but in the end we still have to detect corruption and decide what course to take from there.

I do think these are two separate problems. Personally, I don't care as much if my history or any other database is corrupted and I start from scratch. But random crashes that I can't isolate are something else.
[chromium-dev] Re: Paper about DRAM error rates
On Tue, Oct 6, 2009 at 5:16 PM, Scott Hess sh...@chromium.org wrote:
> I don't understand how this paper makes it very useful to pull SQLite into a separate process. I can see how it would make it useful to minimize how much in-memory data SQLite keeps, regardless of where SQLite lives. I can also see how this effect would increase our incidence of memory stompers, but in a lot of cases it just won't matter. If it corrupts a piece of data which we pass to SQLite, passing it direct or via IPC won't change that. Most memory stompers won't manifest by hitting SQLite anyhow, unless there's reason to believe specific bits will be munged.

You would only pass POD over IPC, the same data that is passed using Task objects across threads. So the only corruption you'll get over IPC would be data corruption, but you ensure that crashes due to corruption are isolated to the sqlite process.
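The isolation argument (only plain data crosses the process boundary, and a crash stays in the worker) can be sketched outside Chromium. The example below is an illustration under assumptions, not Chrome's IPC: it launches a throwaway Python worker per query, sends the request as JSON on stdin, and treats a non-zero exit code as "the sqlite process died". The names `query_in_worker` and `simulate_crash` are hypothetical.

```python
import json
import subprocess
import sys

# Hypothetical worker: handles one request per process and replies with JSON.
_WORKER = r"""
import json, os, sqlite3, sys
request = json.load(sys.stdin)
if request.get("simulate_crash"):
    os._exit(70)  # stand-in for a crash caused by corrupted state
conn = sqlite3.connect(request["db"])
rows = conn.execute(request["sql"], request["params"]).fetchall()
json.dump(rows, sys.stdout)
"""


def query_in_worker(db, sql, params=(), simulate_crash=False):
    """Run one query in a separate process; return rows, or None if the worker died."""
    request = {
        "db": db,
        "sql": sql,
        "params": list(params),
        "simulate_crash": simulate_crash,
    }
    proc = subprocess.run(
        [sys.executable, "-c", _WORKER],
        input=json.dumps(request),
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # The caller survives and can decide how to recover.
        return None
    return json.loads(proc.stdout)
```

Calling `query_in_worker(":memory:", "SELECT 1 + ?", [41])` returns the row as plain lists, while passing `simulate_crash=True` returns `None` without taking the caller down. As PK notes elsewhere in the thread, the caller still needs error-handling code for the `None` case.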
[chromium-dev] Re: Paper about DRAM error rates
On Tue, Oct 6, 2009 at 4:59 PM, John Abd-El-Malek j...@chromium.org wrote:
> The point I was trying to make is that the 'limit' factor as you put it is proportional to memory usage. Given our large memory consumption in the browser process, the numbers from the paper imply dozens of corruptions just in sqlite memory per user.

The paper (which is great!) says that error rates are proportional to active use. How busy is the average user's system? I'll bet it's idle 99% of the time. So instead of 4000 or 8000 memory errors per year per machine, one might have 40 or 80. That's still a pretty scary number.

Conclusions:
1) zfs was right: checksums are a good idea. Can we add them to sqlite?
2) isolating sqlite into its own process seems like a good idea anyway if it crashes a lot
3) congress should pass a law requiring all personal computers to use error correcting memory and mandating free replacement of DIMMs that have uncorrectable errors :-)
[chromium-dev] Re: Paper about DRAM error rates
On Tue, Oct 6, 2009 at 3:49 PM, John Abd-El-Malek j...@chromium.org wrote:
> Given that sqlite corruption means (repeated) crashing of the browser process, it seems this data heavily suggests we should separate sqlite code into a separate process.

What does this mean for cases like the in-memory URL database, which is sqlite code running on a completely in-memory dataset? This is currently used synchronously inside the UI thread to do things like inline autocomplete. We _cannot_ make it async or move it to another thread. So are you proposing to move _some_ sqlite accesses elsewhere?

(Personally, I don't think this paper is evidence that we should move sqlite into a separate process, for reasons similar to Scott's.)

PK
[chromium-dev] Re: Paper about DRAM error rates
On Tue, Oct 6, 2009 at 5:24 PM, Dan Kegel d...@kegel.com wrote:
> 1) zfs was right: checksums are a good idea. Can we add them to sqlite?

I believe so, but I'm still working through the details. And by working I mean thinking. The challenge is in finding places to tuck things away where they won't break compatibility. I _think_ we could tuck page-level checksums into unused space w/in the page (and arrange for unused space to exist). Then row-level checksums to handle overflow pages (which I don't think allow the same unused-space trick). I haven't dug in to figure out the free list.

Note, though, that checksums only get you so far; the real challenge might be in finding all the places to check the checksums. Checking them at read time is unsatisfactory, given this research!

-scott
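The page-level checksum idea can be illustrated with a toy sketch. It does not use SQLite's actual page layout; the page size, the 4-byte reserved tail, and the choice of CRC32 are all assumptions made for illustration, and the function names are hypothetical.

```python
import zlib

PAGE_SIZE = 4096    # hypothetical page size
CHECKSUM_BYTES = 4  # hypothetical reserved space at the end of each page


def seal_page(payload: bytes) -> bytes:
    """Pad the payload to a full page and store a CRC32 in the reserved tail."""
    usable = PAGE_SIZE - CHECKSUM_BYTES
    if len(payload) > usable:
        raise ValueError("payload does not fit in one page")
    body = payload.ljust(usable, b"\x00")
    return body + zlib.crc32(body).to_bytes(CHECKSUM_BYTES, "big")


def page_is_intact(page: bytes) -> bool:
    """Recompute the CRC32 over the body and compare it with the stored value."""
    if len(page) != PAGE_SIZE:
        return False
    body, stored = page[:-CHECKSUM_BYTES], page[-CHECKSUM_BYTES:]
    return zlib.crc32(body).to_bytes(CHECKSUM_BYTES, "big") == stored


def flip_bit(page: bytes, bit_index: int) -> bytes:
    """Return a copy of the page with one bit inverted, simulating a DRAM error."""
    data = bytearray(page)
    data[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(data)
```

A single flipped bit anywhere in the page makes `page_is_intact` return False. The sketch also shows the limitation Scott raises: the check only helps at the moments it is actually run, so a page verified at read time can still be damaged later while it sits in memory.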
[chromium-dev] Re: Paper about DRAM error rates
Out of curiosity, does Firefox do anything special w/ regard to how they manage Sqlite? I can't imagine this problem is totally unique to us.

Kind Regards,

Anthony Laforge
Technical Program Manager
Mountain View, CA
[chromium-dev] Re: Paper about DRAM error rates
On Tue, Oct 6, 2009 at 5:30 PM, Peter Kasting pkast...@google.com wrote:
> What does this mean for cases like the in-memory URL database, which is sqlite code running on a completely in-memory dataset? This is currently used synchronously inside the UI thread to do things like inline autocomplete. We _cannot_ make it async or move it to another thread. So are you proposing to move _some_ sqlite accesses elsewhere?

I'm not sure; I think this is one of the questions that needs to be explored more by whoever investigates this. If corruption in the in-memory URL database doesn't survive a crash (i.e. because it's recreated each time), then perhaps it's ok to keep it in the browser process. However, if it's not, I'm not sure I understand why moving it to another process is unworkable? Chrome code is littered with many examples of how we turned synchronous operations into asynchronous ones. But if worst comes to worst and we can't, you can always do sync IPC calls to another process. The overhead is in the microseconds so it won't be noticeable.
[chromium-dev] Re: Paper about DRAM error rates
On Tue, Oct 6, 2009 at 5:53 PM, John Abd-El-Malek j...@chromium.org wrote:
> If corruption in the in-memory URL database doesn't survive a crash (i.e. because it's recreated each time),

It doesn't.

> I'm not sure I understand why moving it to another process is unworkable? Chrome code is littered with many examples of how we turned synchronous operations into asynchronous ones.

The user-visible behavior cannot be async. Inline autocomplete must never race the user's action or it goes from awesome to hellishly annoying and awful.

> But if worst comes to worst and we can't, you can always do sync IPC calls to another process. The overhead is in the microseconds so it won't be noticeable.

I'd far prefer to keep it in the UI thread even if it were subject to these corruption issues. Doing sync IPC from the UI thread to a sqlite thread sounds like a recipe for Jank.

PK
[chromium-dev] Re: Paper about DRAM error rates
On Tue, Oct 6, 2009 at 5:56 PM, Peter Kasting pkast...@google.com wrote:
> On Tue, Oct 6, 2009 at 5:53 PM, John Abd-El-Malek j...@chromium.org wrote:
>> If corruption in the in-memory URL database doesn't survive a crash (i.e. because it's recreated each time),
>
> It doesn't.
>
>> I'm not sure I understand why moving it to another process is unworkable? Chrome code is littered with many examples of how we turned synchronous operations into asynchronous ones.
>
> The user-visible behavior cannot be async. Inline autocomplete must never race the user's action or it goes from awesome to hellishly annoying and awful.
>
>> But if worst comes to worst and we can't, you can always do sync IPC calls to another process. The overhead is in the microseconds so it won't be noticeable.
>
> I'd far prefer to keep it in the UI thread even if it were subject to these corruption issues. Doing sync IPC from the UI thread to a sqlite thread sounds like a recipe for Jank.

Note, I don't know what the right answer is for this specific case (I think more investigation is necessary), but I do want to point out that I don't think moving it to another process will introduce jank. If it's currently not done on the same thread as other databases, then if it were moved to another process, it would have to be done on a separate thread as well. Our overhead for sync IPCs is in the microseconds. Of course, actually implementing this and comparing histograms of in-process and out-of-process delays is the fool-proof way of proving this :)

> PK
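The histogram comparison suggested above could be sketched along these lines. This is a generic timing harness, not Chrome's histogram machinery: time_calls_us and histogram are hypothetical names, the bucket edges are arbitrary, and the callable you pass in (an in-process query, or a blocking round trip to another process) is up to whoever runs the experiment.

```python
import time
from bisect import bisect_right


def time_calls_us(fn, iterations=1000):
    """Call fn repeatedly and return each call's duration in microseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e6)
    return samples


def histogram(samples_us, edges_us):
    """Count samples per bucket.

    edges_us must be sorted. The result has len(edges_us) + 1 counts:
    counts[0] is everything below edges_us[0], counts[i] covers
    edges_us[i-1] <= x < edges_us[i], and the last count is the overflow.
    """
    counts = [0] * (len(edges_us) + 1)
    for sample in samples_us:
        counts[bisect_right(edges_us, sample)] += 1
    return counts
```

Running histogram(time_calls_us(in_process_lookup), edges) and histogram(time_calls_us(out_of_process_lookup), edges) with the same edges gives two directly comparable distributions. For jank the tail buckets matter more than the mean, so the edges should extend well past the expected microsecond range.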
[chromium-dev] Re: Paper about DRAM error rates
I am with the others who don't see moving sqlite to another process as a natural outcome of this thread. If using more memory is the concern, another process uses more memory. sqlite is not crashing *that* much; yes, it was the top crasher for a while, but it was a data race.
[chromium-dev] Re: Paper about DRAM error rates
This isn't about decreasing memory usage. It's about getting rid of nasty problems like the browser process crashing every startup because of a corrupt database and decreasing browser process crashes in general. I think it's fair that not everyone shares the same opinion, but I do hope that an experiment is run and we can compare numbers to see the effects on crash rates. If Chase or others do it, that's great; if not, I'll try to do it after the jank task force.

On Tue, Oct 6, 2009 at 6:20 PM, cpu c...@chromium.org wrote:
> I am with the others who don't see moving sqlite to another process as a natural outcome of this thread. If using more memory is the concern, another process uses more memory. sqlite is not crashing *that* much; yes, it was the top crasher for a while, but it was a data race.
[chromium-dev] Re: Paper about DRAM error rates
On Tue, Oct 6, 2009 at 6:49 PM, John Abd-El-Malek j...@chromium.org wrote:
> This isn't about decreasing memory usage. It's about getting rid of nasty problems like the browser process crashing every startup because of a corrupt database and decreasing browser process crashes in general. I think it's fair that not everyone shares the same opinion, but I do hope that an experiment is run and we can compare numbers to see the effects on crash rates. If Chase or others do it, that's great; if not, I'll try to do it after the jank task force.

Didn't find another bug on file for this already, so I filed crbug.com/24061.

Chase

> On Tue, Oct 6, 2009 at 6:20 PM, cpu c...@chromium.org wrote:
>> I am with the others who don't see moving sqlite to another process as a natural outcome of this thread. If using more memory is the concern, another process uses more memory. sqlite is not crashing *that* much; yes, it was the top crasher for a while, but it was a data race.