[chromium-dev] Re: Paper about DRAM error rates

2009-10-12 Thread Shane Harrelson

I think it would be an interesting experiment as well.

The paper provided a lot of interesting information on DRAM failure
rates, but I don't think you can infer failure rates for the
general population solely from examining the DRAM failures in a very
isolated (though large) population such as Google's data centers.
Google's data centers introduce unique environmental factors (such as
custom motherboards, power supplies, networking, etc.) which can
all affect failure rates.  While the correlations drawn between things
such as utilization, temperature, and failure rate may be applicable, I
don't think the overall failure rate is.

-Shane

On Oct 7, 12:37 pm, Scott Hess sh...@chromium.org wrote:
 I think it would be a very interesting experiment, because it would
 firewall the SQLite code from Chrome memory stompers (which IMHO are a
 much more likely danger than DRAM errors).

 -scott

--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-07 Thread Peter Kasting
On Tue, Oct 6, 2009 at 6:49 PM, John Abd-El-Malek j...@chromium.org wrote:

 It's about getting rid of nasty problems like the browser process crashing
 every startup because of a corrupt database and decreasing browser process
 crashes in general.


I am pretty sure that the sqlite wrapper and sanity checking work that's
going on is a better fix than moving to a different process.  Not only is it
lower overhead, but we'd have to write error-handling code _anyway_ since
the sqlite process could crash/fail, so IMO it's extra overhead that doesn't
buy us anything.

Not that you're not welcome to try doing it, but I wouldn't waste the time.

PK

--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread Huan Ren

It will be helpful to get our own measurement on database failures.
Carlos just added something like that.

Huan

On Tue, Oct 6, 2009 at 3:49 PM, John Abd-El-Malek j...@chromium.org wrote:
 Saw this on
 slashdot: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
 The conclusion is an average of 25,000–70,000 FIT (failures in time, i.e.
 failures per billion hours of operation) per Mbit.
 On my machine the browser process is usually over 100MB, so that averages
 out to 176 to 493 errors per year, with those numbers having big variance
 depending on the machine.  Since most users don't have ECC, this will lead
 to corruption.  Sqlite is a heavy user of memory, so even if it's 1/4 of
 the 100MB, that means we'll see an average of 40-120 errors per year
 naturally because of faulty DIMMs.
 Given that sqlite corruption means (repeated) crashing of the browser
 process, it seems this data strongly suggests we should separate the sqlite
 code into a separate process.  The IPC overhead is negligible compared to
 disk access.  My hunch is that the complexity is also not that high, since
 the code that deals with it is already asynchronous, because we don't use
 sqlite on the UI/IO threads.
 What do others think?
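
For reference, the arithmetic above works out roughly like this (a
back-of-the-envelope sketch; the 100MB figure and the FIT range are the
assumptions stated in the mail, not measurements):

  // Soft-error estimate: FIT = failures per billion device-hours, quoted
  // per Mbit of DRAM in the SIGMETRICS'09 paper.
  #include <cstdio>

  int main() {
    const double kHoursPerYear = 24 * 365.25;
    const double kMbits = 100 * 8;  // ~100MB of browser-process memory.
    const double kFitPerMbit[] = { 25000.0, 70000.0 };
    for (int i = 0; i < 2; ++i) {
      double errors_per_year = kFitPerMbit[i] * 1e-9 * kMbits * kHoursPerYear;
      printf("%g FIT/Mbit -> ~%.0f errors/year\n",
             kFitPerMbit[i], errors_per_year);
    }
    return 0;
  }
  // Prints roughly 175 and 491 errors/year for the whole process; a quarter
  // of that memory gives the 40-120/year range quoted for sqlite.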
 


--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread John Abd-El-Malek
I'm not sure how Carlos is doing it.  Will we know if something is corrupt
just on load/save?  I imagine there's no way we can know when corruption
happens in steady state and the next query leads to some other browser memory
(or another database) getting corrupted?

On Tue, Oct 6, 2009 at 3:58 PM, Huan Ren hu...@google.com wrote:

 It will be helpful to get our own measurement on database failures.
 Carlos just added something like that.

 Huan

 On Tue, Oct 6, 2009 at 3:49 PM, John Abd-El-Malek j...@chromium.org
 wrote:
  Saw this on
  slashdot: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
  The conclusion is an average of 25,000–75,000 FIT (failures in time per
  billion hours of operation) per Mbit.
  On my machine the browser process is usually  100MB, so that averages
 out
  to 176 to 493 error per year, with those numbers having big variance
  depending on the machine.  Since most users don't have ECC, which means
 this
  will lead to corruption.  Sqlite is a heavy user of memory, so even if
 it's
  1/4 of the 100MB, that means we'll see an average of 40-120 errors
 naturally
  because of faulty DIMMs.
  Given that sqlite corruption means (repeated) crashing of the browser
  process, it seems this data heavily suggests we should separate sqlite
 code
  into a separate process.  The IPC overhead is negligible compared to disk
  access.  My hunch is that the complexity is also not that high, since the
  code that deals with it is already asynchronous since we don't use sqlite
 on
  the UI/IO threads.
  What do others think?
   
 


--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread John Abd-El-Malek
On Tue, Oct 6, 2009 at 4:30 PM, Carlos Pizano c...@google.com wrote:

 On Tue, Oct 6, 2009 at 4:14 PM, John Abd-El-Malek j...@chromium.org
 wrote:
  I'm not sure how Carlos is doing it?  Will we know if something is
 corrupt
  just on load/save?

 Many sqlite calls can return SQLITE_CORRUPT, for example a query or an
 insert.  We just check for error codes 1 to 26, with 5 or 6 of them being
 serious errors such as SQLITE_CORRUPT.

 I am sure that random bit flips in memory and on disk are the cause of
 some crashes; this is probably the 'limit' factor on how low the crash
 rate of a perfect program deployed on millions of computers can go.
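
For illustration, a minimal sketch of the kind of check Carlos describes
above, written against the public SQLite C API (the particular set of codes
treated as serious here is an assumption, not Chromium's actual list):

  #include "sqlite3.h"

  // True for result codes that suggest the database itself is damaged or
  // unreadable, as opposed to transient conditions like SQLITE_BUSY.
  bool IsSeriousSqliteError(int code) {
    switch (code) {
      case SQLITE_CORRUPT:   // The database disk image is malformed.
      case SQLITE_NOTADB:    // File opened is not a SQLite database.
      case SQLITE_IOERR:     // Some kind of disk I/O error occurred.
      case SQLITE_CANTOPEN:  // Unable to open the database file.
        return true;
      default:
        return false;
    }
  }

  // Typical use after stepping a statement:
  //   int rc = sqlite3_step(stmt);
  //   if (rc != SQLITE_ROW && rc != SQLITE_DONE && IsSeriousSqliteError(rc))
  //     ReportDatabaseFailure(rc);  // hypothetical reporting hook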


The point I was trying to make is that the 'limit' factor as you put it is
proportional to memory usage.  Given our large memory consumption in the
browser process, the numbers from the paper imply dozens of corruptions just
in sqlite memory per user.  Even if only a small fraction of these are
harmful, spread over millions of users that's a lot of corruption.


 But I am unsure how to calculate whether, for example, a random bit flip
 in the backing stores (which add up to at least 10MB on most machines)
 does not hurt, or one in the middle of a cache entry, or in the data part
 of some structure.


   I imagine there's no way we can know when corruption
  happen in steady-state and the next query leads to some other browser
 memory
  (or another database) getting corrupted?
 
  On Tue, Oct 6, 2009 at 3:58 PM, Huan Ren hu...@google.com wrote:
 
  It will be helpful to get our own measurement on database failures.
  Carlos just added something like that.
 
  Huan
 
  On Tue, Oct 6, 2009 at 3:49 PM, John Abd-El-Malek j...@chromium.org
  wrote:
   Saw this on
   slashdot: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
   The conclusion is an average of 25,000–75,000 FIT (failures in time
 per
   billion hours of operation) per Mbit.
   On my machine the browser process is usually  100MB, so that averages
   out
   to 176 to 493 error per year, with those numbers having big variance
   depending on the machine.  Since most users don't have ECC, which
 means
   this
   will lead to corruption.  Sqlite is a heavy user of memory, so even if
   it's
   1/4 of the 100MB, that means we'll see an average of 40-120 errors
   naturally
   because of faulty DIMMs.
   Given that sqlite corruption means (repeated) crashing of the browser
   process, it seems this data heavily suggests we should separate sqlite
   code
   into a separate process.  The IPC overhead is negligible compared to
   disk
   access.  My hunch is that the complexity is also not that high, since
   the
   code that deals with it is already asynchronous since we don't use
   sqlite on
   the UI/IO threads.
   What do others think?
 
  
 
 


--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread Scott Hess

Our use of exclusive locking and page-cache preloading may open us up
more to this kind of shenanigans.  Basically SQLite will trust those
pages which we faulted into memory days ago.  We could mitigate
against that somewhat, but this problem reaches into areas we cannot
materially impact, such as filesystem caches.  And don't even begin to
imagine that there are not similar issues with commodity disk drives
and controllers.

That said, I don't think this is an incremental addition of any kind.
As I've pointed out before, there are things in the woods which
corrupt databases.  We could MAYBE reduce occurrences to a suitable
minimum using check-summing or something of the sort, but in the end
we still have to detect corruption and decide what course to take from
there.

-scott


On Tue, Oct 6, 2009 at 4:59 PM, John Abd-El-Malek j...@chromium.org wrote:


 On Tue, Oct 6, 2009 at 4:30 PM, Carlos Pizano c...@google.com wrote:

 On Tue, Oct 6, 2009 at 4:14 PM, John Abd-El-Malek j...@chromium.org
 wrote:
  I'm not sure how Carlos is doing it?  Will we know if something is
  corrupt
  just on load/save?

 Many sqlite calls can return sqlite_corrupt. For example a query or an
 insert
 We just check for error codes 1 to 26 with 5 or 6 of them being
 serious error such as sqlite_corrupt

 I am sure that random bit flip in memory and on disk is the cause of
 some crashes, this is probably the 'limit' factor of how low the crash
 rate of a perfect program deployed in millions of computers can go.

 The point I was trying to make is that the 'limit' factor as you put it is
 proportional to memory usage.  Given our large memory consumption in the
 browser process, the numbers from the paper imply dozens of corruptions just
 in sqlite memory per user.  Even if only a small fraction of these are
 harmful, spread over millions of users that's a lot of corruption.

 But I am unsure how to calculate, for example a random bit flip on the
 backingstores, which add to at least 10M on most machines does not
 hurt, or in the middle of a cache entry, or in the data part of some
 structure.


   I imagine there's no way we can know when corruption
  happen in steady-state and the next query leads to some other browser
  memory
  (or another database) getting corrupted?
 
  On Tue, Oct 6, 2009 at 3:58 PM, Huan Ren hu...@google.com wrote:
 
  It will be helpful to get our own measurement on database failures.
  Carlos just added something like that.
 
  Huan
 
  On Tue, Oct 6, 2009 at 3:49 PM, John Abd-El-Malek j...@chromium.org
  wrote:
   Saw this on
   slashdot: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
   The conclusion is an average of 25,000–75,000 FIT (failures in time
   per
   billion hours of operation) per Mbit.
   On my machine the browser process is usually  100MB, so that
   averages
   out
   to 176 to 493 error per year, with those numbers having big variance
   depending on the machine.  Since most users don't have ECC, which
   means
   this
   will lead to corruption.  Sqlite is a heavy user of memory, so even
   if
   it's
   1/4 of the 100MB, that means we'll see an average of 40-120 errors
   naturally
   because of faulty DIMMs.
   Given that sqlite corruption means (repeated) crashing of the browser
   process, it seems this data heavily suggests we should separate
   sqlite
   code
   into a separate process.  The IPC overhead is negligible compared to
   disk
   access.  My hunch is that the complexity is also not that high, since
   the
   code that deals with it is already asynchronous since we don't use
   sqlite on
   the UI/IO threads.
   What do others think?

  
 
 


 


--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread Jeremy Orlow
On Tue, Oct 6, 2009 at 4:59 PM, John Abd-El-Malek j...@chromium.org wrote:



 On Tue, Oct 6, 2009 at 4:30 PM, Carlos Pizano c...@google.com wrote:

  On Tue, Oct 6, 2009 at 4:14 PM, John Abd-El-Malek j...@chromium.org
 wrote:
  I'm not sure how Carlos is doing it?  Will we know if something is
 corrupt
  just on load/save?

 Many sqlite calls can return sqlite_corrupt. For example a query or an
 insert
 We just check for error codes 1 to 26 with 5 or 6 of them being
 serious error such as sqlite_corrupt

 I am sure that random bit flip in memory and on disk is the cause of
 some crashes, this is probably the 'limit' factor of how low the crash
 rate of a perfect program deployed in millions of computers can go.


 The point I was trying to make is that the 'limit' factor as you put it is
 proportional to memory usage.  Given our large memory consumption in the
 browser process, the numbers from the paper imply dozens of corruptions just
 in sqlite memory per user.  Even if only a small fraction of these are
 harmful, spread over millions of users that's a lot of corruption.


For what it's worth:  This makes sense to me.  It seems like pulling SQLite
into its own process would be helpful for the reasons you laid out.  I
wonder if the only reason no one else has chimed in on this thread is that
no one wants to have to implement it.  :-)


 But I am unsure how to calculate, for example a random bit flip on the
 backingstores, which add to at least 10M on most machines does not
 hurt, or in the middle of a cache entry, or in the data part of some
 structure.



   I imagine there's no way we can know when corruption
  happen in steady-state and the next query leads to some other browser
 memory
  (or another database) getting corrupted?
 
  On Tue, Oct 6, 2009 at 3:58 PM, Huan Ren hu...@google.com wrote:
 
  It will be helpful to get our own measurement on database failures.
  Carlos just added something like that.
 
  Huan
 
  On Tue, Oct 6, 2009 at 3:49 PM, John Abd-El-Malek j...@chromium.org
  wrote:
   Saw this on
   slashdot: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
   The conclusion is an average of 25,000–75,000 FIT (failures in time
 per
   billion hours of operation) per Mbit.
   On my machine the browser process is usually  100MB, so that
 averages
   out
   to 176 to 493 error per year, with those numbers having big variance
   depending on the machine.  Since most users don't have ECC, which
 means
   this
   will lead to corruption.  Sqlite is a heavy user of memory, so even
 if
   it's
   1/4 of the 100MB, that means we'll see an average of 40-120 errors
   naturally
   because of faulty DIMMs.
   Given that sqlite corruption means (repeated) crashing of the browser
   process, it seems this data heavily suggests we should separate
 sqlite
   code
   into a separate process.  The IPC overhead is negligible compared to
   disk
   access.  My hunch is that the complexity is also not that high, since
   the
   code that deals with it is already asynchronous since we don't use
   sqlite on
   the UI/IO threads.
   What do others think?

  
 
 



 


--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread John Abd-El-Malek
On Tue, Oct 6, 2009 at 5:09 PM, Jeremy Orlow jor...@chromium.org wrote:

 On Tue, Oct 6, 2009 at 4:59 PM, John Abd-El-Malek j...@chromium.orgwrote:



 On Tue, Oct 6, 2009 at 4:30 PM, Carlos Pizano c...@google.com wrote:

  On Tue, Oct 6, 2009 at 4:14 PM, John Abd-El-Malek j...@chromium.org
 wrote:
  I'm not sure how Carlos is doing it?  Will we know if something is
 corrupt
  just on load/save?

 Many sqlite calls can return sqlite_corrupt. For example a query or an
 insert
 We just check for error codes 1 to 26 with 5 or 6 of them being
 serious error such as sqlite_corrupt

 I am sure that random bit flip in memory and on disk is the cause of
 some crashes, this is probably the 'limit' factor of how low the crash
 rate of a perfect program deployed in millions of computers can go.


 The point I was trying to make is that the 'limit' factor as you put it is
 proportional to memory usage.  Given our large memory consumption in the
 browser process, the numbers from the paper imply dozens of corruptions just
 in sqlite memory per user.  Even if only a small fraction of these are
 harmful, spread over millions of users that's a lot of corruption.


 For what it's worth:  This makes sense to me.  It seems like pulling SQLite
 into its own process would be helpful for the reasons you laid out.  I
 wonder if the only reason no one else has chimed in on this thread is that
 no one wants to have to implement it.  :-)


Chase is going to start investigating it (i.e. figure out what the cost of
doing it is, how much change it requires, and ways of measuring the benefit).




 But I am unsure how to calculate, for example a random bit flip on the
 backingstores, which add to at least 10M on most machines does not
 hurt, or in the middle of a cache entry, or in the data part of some
 structure.



   I imagine there's no way we can know when corruption
  happen in steady-state and the next query leads to some other browser
 memory
  (or another database) getting corrupted?
 
  On Tue, Oct 6, 2009 at 3:58 PM, Huan Ren hu...@google.com wrote:
 
  It will be helpful to get our own measurement on database failures.
  Carlos just added something like that.
 
  Huan
 
  On Tue, Oct 6, 2009 at 3:49 PM, John Abd-El-Malek j...@chromium.org
  wrote:
   Saw this on
   slashdot: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
   The conclusion is an average of 25,000–75,000 FIT (failures in time
 per
   billion hours of operation) per Mbit.
   On my machine the browser process is usually  100MB, so that
 averages
   out
   to 176 to 493 error per year, with those numbers having big variance
   depending on the machine.  Since most users don't have ECC, which
 means
   this
   will lead to corruption.  Sqlite is a heavy user of memory, so even
 if
   it's
   1/4 of the 100MB, that means we'll see an average of 40-120 errors
   naturally
   because of faulty DIMMs.
   Given that sqlite corruption means (repeated) crashing of the
 browser
   process, it seems this data heavily suggests we should separate
 sqlite
   code
   into a separate process.  The IPC overhead is negligible compared to
   disk
   access.  My hunch is that the complexity is also not that high,
 since
   the
   code that deals with it is already asynchronous since we don't use
   sqlite on
   the UI/IO threads.
   What do others think?

  
 
 



 



--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread Scott Hess

On Tue, Oct 6, 2009 at 5:09 PM, Jeremy Orlow jor...@chromium.org wrote:
 On Tue, Oct 6, 2009 at 4:59 PM, John Abd-El-Malek j...@chromium.org wrote:
 On Tue, Oct 6, 2009 at 4:30 PM, Carlos Pizano c...@google.com wrote:
 On Tue, Oct 6, 2009 at 4:14 PM, John Abd-El-Malek j...@chromium.org wrote:
  I'm not sure how Carlos is doing it?  Will we know if something is
  corrupt
  just on load/save?

 Many sqlite calls can return sqlite_corrupt. For example a query or an
 insert
 We just check for error codes 1 to 26 with 5 or 6 of them being
 serious error such as sqlite_corrupt

 I am sure that random bit flip in memory and on disk is the cause of
 some crashes, this is probably the 'limit' factor of how low the crash
 rate of a perfect program deployed in millions of computers can go.

 The point I was trying to make is that the 'limit' factor as you put it is
 proportional to memory usage.  Given our large memory consumption in the
 browser process, the numbers from the paper imply dozens of corruptions just
 in sqlite memory per user.  Even if only a small fraction of these are
 harmful, spread over millions of users that's a lot of corruption.

 For what it's worth:  This makes sense to me.  It seems like pulling SQLite
 into its own process would be helpful for the reasons you laid out.  I
 wonder if the only reason no one else has chimed in on this thread is that
 no one wants to have to implement it.  :-)

I don't understand how this paper makes it very useful to pull SQLite
into a separate process.  I can see how it would make it useful to
minimize how much in-memory data SQLite keeps, regardless of where
SQLite lives.  I can also see how this effect would increase our
incidence of memory stompers, but in a lot of cases it just won't
matter.  If it corrupts a piece of data which we pass to SQLite,
passing it directly or via IPC won't change that.  Most memory stompers
won't manifest by hitting SQLite anyhow, unless there's reason to
believe specific bits will be munged.

-scott

--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread John Abd-El-Malek
On Tue, Oct 6, 2009 at 5:10 PM, Scott Hess sh...@chromium.org wrote:

 Our use of exclusive locking and page-cache preloading may open us up
 more to this kind of shenanigans.  Basically SQLite will trust those
 pages which we faulted into memory days ago.  We could mitigate
 against that somewhat, but this problem reaches into areas we cannot
 materially impact, such as filesystem caches.  And don't even begin to
 imagine that there are not similar issues with commodity disk drives
 and controllers.

 That said, I don't think this is an incremental addition of any kind.
 I've pointed it out before, there are things in the woods which
 corrupt databases.  We could MAYBE reduce occurrences to a suitable
 minimum using check-summing or something of the sort, but in the end
 we still have to detect corruption and decide what course to take from
 there.


I do think these are two separate problems.  Personally, I don't care as
much if my history or any other database is corrupted and I start from
scratch.  But random crashes that I can't isolate are something else.



 -scott


 On Tue, Oct 6, 2009 at 4:59 PM, John Abd-El-Malek j...@chromium.org
 wrote:
 
 
  On Tue, Oct 6, 2009 at 4:30 PM, Carlos Pizano c...@google.com wrote:
 
  On Tue, Oct 6, 2009 at 4:14 PM, John Abd-El-Malek j...@chromium.org
  wrote:
   I'm not sure how Carlos is doing it?  Will we know if something is
   corrupt
   just on load/save?
 
  Many sqlite calls can return sqlite_corrupt. For example a query or an
  insert
  We just check for error codes 1 to 26 with 5 or 6 of them being
  serious error such as sqlite_corrupt
 
  I am sure that random bit flip in memory and on disk is the cause of
  some crashes, this is probably the 'limit' factor of how low the crash
  rate of a perfect program deployed in millions of computers can go.
 
  The point I was trying to make is that the 'limit' factor as you put it
 is
  proportional to memory usage.  Given our large memory consumption in the
  browser process, the numbers from the paper imply dozens of corruptions
 just
  in sqlite memory per user.  Even if only a small fraction of these are
  harmful, spread over millions of users that's a lot of corruption.
 
  But I am unsure how to calculate, for example a random bit flip on the
  backingstores, which add to at least 10M on most machines does not
  hurt, or in the middle of a cache entry, or in the data part of some
  structure.
 
 
I imagine there's no way we can know when corruption
   happen in steady-state and the next query leads to some other browser
   memory
   (or another database) getting corrupted?
  
   On Tue, Oct 6, 2009 at 3:58 PM, Huan Ren hu...@google.com wrote:
  
   It will be helpful to get our own measurement on database failures.
   Carlos just added something like that.
  
   Huan
  
   On Tue, Oct 6, 2009 at 3:49 PM, John Abd-El-Malek j...@chromium.org
   wrote:
Saw this on
slashdot:
 http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
The conclusion is an average of 25,000–75,000 FIT (failures in
 time
per
billion hours of operation) per Mbit.
On my machine the browser process is usually  100MB, so that
averages
out
to 176 to 493 error per year, with those numbers having big
 variance
depending on the machine.  Since most users don't have ECC, which
means
this
will lead to corruption.  Sqlite is a heavy user of memory, so even
if
it's
1/4 of the 100MB, that means we'll see an average of 40-120 errors
naturally
because of faulty DIMMs.
Given that sqlite corruption means (repeated) crashing of the
 browser
process, it seems this data heavily suggests we should separate
sqlite
code
into a separate process.  The IPC overhead is negligible compared
 to
disk
access.  My hunch is that the complexity is also not that high,
 since
the
code that deals with it is already asynchronous since we don't use
sqlite on
the UI/IO threads.
What do others think?
 
   
  
  
 
 
   
 


--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread John Abd-El-Malek
On Tue, Oct 6, 2009 at 5:16 PM, Scott Hess sh...@chromium.org wrote:

 On Tue, Oct 6, 2009 at 5:09 PM, Jeremy Orlow jor...@chromium.org wrote:
  On Tue, Oct 6, 2009 at 4:59 PM, John Abd-El-Malek j...@chromium.org
 wrote:
  On Tue, Oct 6, 2009 at 4:30 PM, Carlos Pizano c...@google.com wrote:
  On Tue, Oct 6, 2009 at 4:14 PM, John Abd-El-Malek j...@chromium.org
 wrote:
   I'm not sure how Carlos is doing it?  Will we know if something is
   corrupt
   just on load/save?
 
  Many sqlite calls can return sqlite_corrupt. For example a query or an
  insert
  We just check for error codes 1 to 26 with 5 or 6 of them being
  serious error such as sqlite_corrupt
 
  I am sure that random bit flip in memory and on disk is the cause of
  some crashes, this is probably the 'limit' factor of how low the crash
  rate of a perfect program deployed in millions of computers can go.
 
  The point I was trying to make is that the 'limit' factor as you put it
 is
  proportional to memory usage.  Given our large memory consumption in the
  browser process, the numbers from the paper imply dozens of corruptions
 just
  in sqlite memory per user.  Even if only a small fraction of these are
  harmful, spread over millions of users that's a lot of corruption.
 
  For what it's worth:  This makes sense to me.  It seems like pulling
 SQLite
  into its own process would be helpful for the reasons you laid out.  I
  wonder if the only reason no one else has chimed in on this thread is
 that
  no one wants to have to implement it.  :-)

 I don't understand how this paper makes it very useful to pull SQLite
 into a separate process.  I can see how it would make it useful to
 minimize how much in-memory data SQLite keeps, regardless of where
 SQLite lives.  I can also see how this effect would increase our
 incidence of memory stompers, but in a lot of cases it just won't
 matter.  If it corrupts a piece of data which we pass to SQLite,
 passing it direct or via IPC won't change that.  Most memory stompers
 won't manifest by hitting SQLite anyhow, unless there's reason to
 believe specific bits will be munged.


You would only pass POD over IPC, the same as what is passed using Task
objects across threads.  So the only corruption you'd get over IPC would be
data corruption, but you'd ensure that crashes due to corruption are isolated
to the sqlite process.
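
A rough sketch of what "only POD over IPC" could look like (the struct and
field names here are hypothetical, not Chromium's actual IPC message
definitions):

  #include <stdint.h>

  // Only plain-old-data crosses the process boundary; the sqlite3 handle and
  // all of SQLite's in-memory state live solely in the database process.
  struct AddVisitRequest {
    char url[2048];          // fixed-size POD fields only, no pointers
    int64_t visit_time_us;
  };

  struct AddVisitResponse {
    int sqlite_error_code;   // e.g. SQLITE_OK, or SQLITE_CORRUPT on failure
  };

  // The browser process serializes AddVisitRequest, sends it over the pipe,
  // and gets back an AddVisitResponse.  If a corrupt page crashes the
  // database process, the browser only observes a broken pipe / error reply
  // rather than crashing itself.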



 -scott


--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread Dan Kegel

On Tue, Oct 6, 2009 at 4:59 PM, John Abd-El-Malek j...@chromium.org wrote:
 The point I was trying to make is that the 'limit' factor as you put it is
 proportional to memory usage.  Given our large memory consumption in the
 browser process, the numbers from the paper imply dozens of corruptions just
 in sqlite memory per user.

The paper (which is great!) says that
error rates are proportional to active use.
How busy is the average user's system?
I'll bet it's idle 99% of the time.
So instead of 4000 or 8000 memory errors per year per
machine, one might have 40 or 80.
That's still a pretty scary number.

Conclusions:
1) ZFS was right: checksums are a good idea.  Can we add them to sqlite?
2) Isolating sqlite into its own process seems like a good idea anyway
if it crashes a lot.
3) Congress should pass a law requiring all personal computers to use
error-correcting memory and mandating free replacement of DIMMs that have
uncorrectable errors :-)

--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread Peter Kasting
On Tue, Oct 6, 2009 at 3:49 PM, John Abd-El-Malek j...@chromium.org wrote:

 Given that sqlite corruption means (repeated) crashing of the browser
 process, it seems this data heavily suggests we should separate sqlite code
 into a separate process.


What does this mean for cases like the in-memory URL database, which is
sqlite code running on a completely in-memory dataset?  This is currently
used synchronously inside the UI thread to do things like inline
autocomplete.  We _cannot_ make it async or move it to another thread.

So are you proposing to move _some_ sqlite accesses elsewhere?

(Personally, I don't think this paper is evidence that we should move sqlite
into a separate process, for similar reasons as to Scott.)

PK

--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread Scott Hess

On Tue, Oct 6, 2009 at 5:24 PM, Dan Kegel d...@kegel.com wrote:
 1) zfs was right: checksums are a good idea.  Can we add them to sqlite?

I believe so, but I'm still working through the details.  And by
"working" I mean "thinking".  The challenge is in finding places to
tuck things away where they won't break compatibility.  I _think_ we
could tuck page-level checksums into unused space w/in the page (and
arrange for unused space to exist), then row-level checksums to
handle overflow pages (which I don't think allow the same unused-space
trick).  I haven't dug in to figure out the free list.

Note, though, that checksums only get you so far; the real challenge
might be in finding all the places to check the checksums.  Checking
them at read time is unsatisfactory, given this research!
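
As a rough sketch of the page-level idea (where such a checksum would
actually live in the file format is exactly the open question above; the
hash used here is a toy stand-in):

  #include <stddef.h>
  #include <stdint.h>

  // Toy 32-bit checksum over one database page; a real scheme would likely
  // use something stronger, such as CRC32.
  uint32_t PageChecksum(const uint8_t* page, size_t page_size) {
    uint32_t sum = 0;
    for (size_t i = 0; i < page_size; ++i)
      sum = sum * 31 + page[i];
    return sum;
  }

  // Verify a page (e.g. one that has been sitting in the page cache for
  // days) against the checksum previously tucked into its unused space.
  bool PageLooksIntact(const uint8_t* page, size_t page_size,
                       uint32_t stored_checksum) {
    return PageChecksum(page, page_size) == stored_checksum;
  }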

-scott

--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread Anthony LaForge
Out of curiosity, does Firefox do anything special with regard to how they
manage Sqlite?  I can't imagine this problem is totally unique to us.
Kind Regards,

Anthony Laforge
Technical Program Manager
Mountain View, CA


On Tue, Oct 6, 2009 at 5:35 PM, Scott Hess sh...@chromium.org wrote:


 On Tue, Oct 6, 2009 at 5:24 PM, Dan Kegel d...@kegel.com wrote:
  1) zfs was right: checksums are a good idea.  Can we add them to sqlite?

 I believe so, but I'm still working through the details.  And by
 working I mean thinking.  The challenge is in finding places to
 tuck things away where they won't break compatibility.  I _think_ we
 could tuck page-level checksums into unused space w/in the page (and
 arrange for unused space to exist).  Then row-level checksums to
 handle overflow pages (which I don't think allow the same unused-space
 trick).  I haven't dug in to figure out the free list.

 Note, though, that checksums only get you so far, the real challenge
 might be in finding all the places to check the checksums.  Checking
 them at read time is unsatisfactory, given this research!

 -scott

 


--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread John Abd-El-Malek
On Tue, Oct 6, 2009 at 5:30 PM, Peter Kasting pkast...@google.com wrote:

 On Tue, Oct 6, 2009 at 3:49 PM, John Abd-El-Malek j...@chromium.orgwrote:

 Given that sqlite corruption means (repeated) crashing of the browser
 process, it seems this data heavily suggests we should separate sqlite code
 into a separate process.


 What does this mean for cases like the in-memory URL database, which is
 sqlite code running on a completely in-memory dataset?  This is currently
 used synchronously inside the UI threas to do things like inline
 autocomplete.  We _cannot_ make it async or move it to another thread.

 So are you proposing to move _some_ sqlite accesses elsewhere?


I'm not sure; I think this is one of the questions that needs to be explored
more by whoever investigates this.

If corruption in the in-memory URL database doesn't survive a crash (i.e.
because it's recreated each time), then perhaps it's ok to keep it in the
browser process.  However, if it's not, I'm not sure I understand why moving
it to another process is unworkable.  Chrome code is littered with
examples of how we've turned synchronous operations into asynchronous ones.
But if worst comes to worst and we can't, you can always do sync IPC calls
to another process.  The overhead is in the microseconds, so it won't
be noticeable.



 (Personally, I don't think this paper is evidence that we should move
 sqlite into a separate process, for similar reasons as to Scott.)

 PK


--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread Peter Kasting
On Tue, Oct 6, 2009 at 5:53 PM, John Abd-El-Malek j...@chromium.org wrote:

 If corruption in the in-memory URL database doesn't survive a crash (i.e.
 because it's recreated each time),


It doesn't

I'm not sure I understand why moving it to another process is unworkable?
  Chrome code is littered with many examples of how we turned synchronous
 operations into asynchronous ones.


The user-visible behavior cannot be async.  Inline autocomplete must never
race the user's action or it goes from awesome to hellishly annoying and
awful.

But if worst comes to worst and we can't, you can always do sync IPC calls
 to another process.  The overhead is in the microseconds so it won't
 be noticeable.


I'd far prefer to keep it in the UI thread even if it were subject to these
corruption issues.  Doing sync IPC from the UI thread to a sqlite thread
sounds like a recipe for Jank.

PK

--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread John Abd-El-Malek
On Tue, Oct 6, 2009 at 5:56 PM, Peter Kasting pkast...@google.com wrote:

 On Tue, Oct 6, 2009 at 5:53 PM, John Abd-El-Malek j...@chromium.orgwrote:

 If corruption in the in-memory URL database doesn't survive a crash (i.e.
 because it's recreated each time),


 It doesn't

 I'm not sure I understand why moving it to another process is unworkable?
  Chrome code is littered with many examples of how we turned synchronous
 operations into asynchronous ones.


 The user-visible behavior cannot be async.  Inline autocomplete must never
 race the user's action or it goes from awesome to hellishly annoying and
 awful.

 But if worst comes to worst and we can't, you can always do sync IPC calls
 to another process.  The overhead is in the microseconds so it won't
 be noticeable.


 I'd far prefer to keep it in the UI thread even if it were subject to these
 corruption issues.  Doing sync IPC from the UI thread to a sqlite thread
 sounds like a recipe for Jank.


Note, I don't know what the right answer is for this specific case (I think
more investigation is necessary), but I do want to point out that I don't
think moving it to another process would introduce jank.  If it's currently
not done on the same thread as other databases, then if it were moved to
another process, it would have to be done on a separate thread as well.  Our
overhead for sync IPCs is in the microseconds.

Of course, actually implementing this and comparing histograms of in-process
and out-of-process delays is the foolproof way of proving this :)
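
The comparison could be as simple as timing the same operation in both
configurations and recording it to a histogram, along these lines (a sketch;
the header locations and histogram name are assumptions, not existing
Chromium code):

  #include "base/histogram.h"  // for UMA_HISTOGRAM_TIMES
  #include "base/time.h"       // for base::TimeTicks

  void RunHistoryQueryTimed() {
    base::TimeTicks start = base::TimeTicks::Now();
    // ... run the query, either in-process or via sync IPC to a db process ...
    UMA_HISTOGRAM_TIMES("Sqlite.QueryLatency", base::TimeTicks::Now() - start);
  }

  // Shipping both variants and comparing the two histograms (plus crash
  // rates) would give the numbers being asked for here.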


 PK


--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread cpu

I am with the others who don't see "move sqlite to another process"
as a natural outcome of this thread.

If using more memory is the concern, another process uses more memory.
sqlite is not crashing *that* much; yes, it was the top crasher for a
while, but it was a data race.





--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread John Abd-El-Malek
This isn't about decreasing memory usage.  It's about getting rid of nasty
problems like the browser process crashing every startup because of a
corrupt database and decreasing browser process crashes in general.
I think it's fair that not everyone shares the same opinion, but I do hope
that an experiment is run and we can compare numbers to see the effects on
crash rates.  If Chase or others do it, that's great; if not, I'll try to do
it after the jank task force.

On Tue, Oct 6, 2009 at 6:20 PM, cpu c...@chromium.org wrote:


 I am with the others that don't see move sqlite to another process
 as a natural outcome of these thread.

 If using more memory is the concern, another process uses more memory.
 sqlite is not crashing *that* much; yes it was the top crasher for a
 while but it was a data race





 


--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Paper about DRAM error rates

2009-10-06 Thread Chase Phillips
On Tue, Oct 6, 2009 at 6:49 PM, John Abd-El-Malek j...@chromium.org wrote:

 This isn't about decreasing memory usage.  It's about getting rid of nasty
 problems like the browser process crashing every startup because of a
 corrupt database and decreasing browser process crashes in general.
 I think it's fair that not everyone shares the same opinion, but I do hope
 that an experiment is run and we can compare numbers to see the effects on
 crash rates.  If Chase or others do it, that's great, if not, I'll try to do
 it after the jank task force.


I didn't find another bug on file for this already, so I filed crbug.com/24061.

Chase


 On Tue, Oct 6, 2009 at 6:20 PM, cpu c...@chromium.org wrote:


 I am with the others that don't see move sqlite to another process
 as a natural outcome of these thread.

 If using more memory is the concern, another process uses more memory.
 sqlite is not crashing *that* much; yes it was the top crasher for a
 while but it was a data race








 


--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---