Re: Software based ECC ?
On Tue, 21 Aug 2007, Bodo Eggert wrote:

> Folkert van Heusden <[EMAIL PROTECTED]> wrote:
>
>>>> http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
>>>> "SoftECC : A System for Software Memory Integrity Checking"
>>>
>>> Personally, I'd recommend just shelling out the bucks for hardware ECC if
>>> the reliability matters.
>>
>> a question and an idea: Q: is ecc guaranteed to detect all bitflips?
>
> It's guaranteed not to.
>
> Having n extra bits, you can detect n-bit-flips and correct n/2-bit-flips
> (provided you use an optimal code).
>
> These extra bits can flip, too, so if you have m >= 1 data bits and any
> finite number n of extra bits, it's possible to have an undetectable
> n+1-bit-flip.
> --
> If you can't remember, then the claymore IS pointed at you.

Of course common ECC codes detect and correct single bit errors. When used in memory, bits in a word are never adjacent, so a cosmic ray or other stray particle which could upset bits usually results in bits being upset in different words, so they remain correctable.

The MIT paper is noticeably deficient in its ability to do anything useful. It proposes checking things at 100 Hz intervals and trapping each memory access, as though these things happen only once in a while, and, of course, assumes that the code doing the checking will never be corrupted. Further, it ignores the cache(s).

Cheers,
Dick Johnson
Penguin : Linux version 2.6.22.1 on an i686 machine (5588.29 BogoMips).
My book : http://www.AbominableFirebug.com/

The information transmitted in this message is confidential and may be privileged. Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to [EMAIL PROTECTED] - and destroy all copies of this information, including any attachments, without reading or disclosing them. Thank you.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Software based ECC ?
Folkert van Heusden <[EMAIL PROTECTED]> wrote:

>> > http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
>> > "SoftECC : A System for Software Memory Integrity Checking"
>>
>> Personally, I'd recommend just shelling out the bucks for hardware ECC if
>> the reliability matters.
>
> a question and an idea: Q: is ecc guaranteed to detect all bitflips?

It's guaranteed not to.

Having n extra bits, you can detect n-bit-flips and correct n/2-bit-flips (provided you use an optimal code).

These extra bits can flip, too, so if you have m >= 1 data bits and any finite number n of extra bits, it's possible to have an undetectable n+1-bit-flip.
--
If you can't remember, then the claymore IS pointed at you.
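To make the "guaranteed not to" point concrete, here is a minimal sketch using the textbook Hamming(7,4) code. The choice of code and the function names are mine, not something from this thread. The syndrome is the XOR of the positions of all set bits, and a zero syndrome means "no error seen". Flipping the bits at positions 1, 2 and 3 changes the syndrome by 1 ^ 2 ^ 3 = 0, so that 3-bit error turns one valid codeword into another and goes unnoticed.

```c
#include <assert.h>
#include <stdint.h>

/* Syndrome of a 7-bit Hamming(7,4) word: XOR of the 1-based positions
 * of all set bits. Bit i of the byte holds position i + 1.
 * A result of 0 means the word looks like a valid codeword. */
unsigned hamming74_syndrome(uint8_t word)
{
    unsigned syndrome = 0;

    assert(word < 128);
    for (unsigned pos = 1; pos <= 7; pos++)
        if (word & (1u << (pos - 1)))
            syndrome ^= pos;
    return syndrome;
}

/* Encode 4 data bits. Data goes to positions 3, 5, 6, 7; the parity
 * bits at positions 1, 2, 4 are set so that the syndrome becomes 0. */
uint8_t hamming74_encode(uint8_t data)
{
    unsigned word = 0;
    unsigned s;

    assert(data < 16);
    word |= (data & 1u) << 2;          /* position 3 */
    word |= ((data >> 1) & 1u) << 4;   /* position 5 */
    word |= ((data >> 2) & 1u) << 5;   /* position 6 */
    word |= ((data >> 3) & 1u) << 6;   /* position 7 */

    s = hamming74_syndrome((uint8_t)word);
    word |= (s & 1u) << 0;             /* position 1 */
    word |= ((s >> 1) & 1u) << 1;      /* position 2 */
    word |= ((s >> 2) & 1u) << 3;      /* position 4 */
    return (uint8_t)word;
}
```

With `cw = hamming74_encode(0xB)`, the word `cw ^ 0x01` yields a non-zero syndrome (a single flip is seen), while `cw ^ 0x07` yields syndrome 0 again even though three bits are wrong.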
Re: Software based ECC ?
On Sun, 12 Aug 2007 18:51:31 +0200, Folkert van Heusden said:

> a question and an idea: Q: is ecc guaranteed to detect all bitflips?

It depends on the exact ECC function the hardware implements. Usually it provides performance such as: "Correct all 1-bit errors. Detect all 2-bit errors, and most 3 and higher, but not correct". (Of course, "correct all 1 or 2 bit and detect all 3 bit" can be done, it just takes more bits of ECC.)

> Idea: what about a multicore system (3 or more) that runs the same
> processes on 2 cores and a third core verifying that they both do the
> same? As I think it is not only ram that can become faulty.

This is actually done for high-reliability systems (Google for "tell me twice" and "tell me three times"). The problem is that it takes a lot of extra hardware. The G5 and later IBM Z-series mainframe chipsets (not to be confused with the PowerPC G5) implemented dual computation units and a comparator that signals a 'Machine Check' condition if the two CPUs don't end up in the same exact state. As an added bonus, at the end of each instruction on which both *do* compare good, it latches the *entire* state of the CPU out. When they don't compare good, it does the following:

1) Retry the instruction on the same CPU - if it compares correctly, keep going and flag a "soft" error.

2) If it still fails, read out the last "known good" status latch, and load it into a spare CPU, and fire it up, and flag the failing one as bad.

http://www.research.ibm.com/journal/rd/435/spainhower.pdf
http://www.research.ibm.com/journal/rd/435/mueller.pdf

These guys have forgotten more about designing highly reliable systems than most of us will ever know. ;)

Needless to say, not everybody is willing to pay the costs of the hardware overhead of this approach.
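The "correct all 1-bit errors, detect all 2-bit errors" behaviour described above can be sketched with an extended Hamming(8,4) code: a Hamming(7,4) word plus one overall parity bit. This is an illustration under my own assumptions (a 4-bit payload, hypothetical names such as `secded_decode`); real memory controllers use wider codes, and nothing here is taken from a particular chipset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum secded_status { SECDED_OK, SECDED_CORRECTED, SECDED_UNCORRECTABLE };

/* XOR of the 1-based positions (1..7) of the set bits in the low 7 bits. */
static unsigned syndrome7(uint8_t word)
{
    unsigned s = 0;

    for (unsigned pos = 1; pos <= 7; pos++)
        if (word & (1u << (pos - 1)))
            s ^= pos;
    return s;
}

/* Parity of all 8 bits: 0 = even number of set bits, 1 = odd. */
static unsigned parity8(uint8_t word)
{
    unsigned p = word;

    p ^= p >> 4;
    p ^= p >> 2;
    p ^= p >> 1;
    return p & 1u;
}

/* Data at positions 3, 5, 6, 7; Hamming parity at positions 1, 2, 4;
 * overall parity in bit 7 of the byte, making the total parity even. */
uint8_t secded_encode(uint8_t data)
{
    unsigned w = 0, s;

    assert(data < 16);
    w |= (data & 1u) << 2;
    w |= ((data >> 1) & 1u) << 4;
    w |= ((data >> 2) & 1u) << 5;
    w |= ((data >> 3) & 1u) << 6;
    s = syndrome7((uint8_t)w);
    w |= (s & 1u) << 0;
    w |= ((s >> 1) & 1u) << 1;
    w |= ((s >> 2) & 1u) << 3;
    w |= parity8((uint8_t)w) << 7;
    return (uint8_t)w;
}

/* Odd overall parity: assume one flipped bit and repair it.
 * Even overall parity with a non-zero syndrome: two bits flipped,
 * which is detected but cannot be corrected. */
enum secded_status secded_decode(uint8_t word, uint8_t *data_out)
{
    enum secded_status status = SECDED_OK;
    unsigned w = word;
    unsigned s = syndrome7(word);

    assert(data_out != NULL);
    if (parity8(word)) {
        w ^= s ? (1u << (s - 1)) : 0x80u;
        status = SECDED_CORRECTED;
    } else if (s) {
        return SECDED_UNCORRECTABLE;
    }
    *data_out = (uint8_t)(((w >> 2) & 1u) | (((w >> 4) & 1u) << 1) |
                          (((w >> 5) & 1u) << 2) | (((w >> 6) & 1u) << 3));
    return status;
}
```

Every single-bit flip of an encoded word comes back as `SECDED_CORRECTED` with the original data, and every two-bit flip comes back as `SECDED_UNCORRECTABLE`. Three or more flips are outside what this small code promises.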
Re: Software based ECC ?
On 8/12/07, Folkert van Heusden <[EMAIL PROTECTED]> wrote:

> > > http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
> > > "SoftECC : A System for Software Memory Integrity Checking"
> >
> > Personally, I'd recommend just shelling out the bucks for hardware ECC if
> > the reliability matters.
>
> a question and an idea: Q: is ecc guaranteed to detect all bitflips?
>
> Idea: what about a multicore system (3 or more) that runs the same
> processes on 2 cores and a third core verifying that they both do the
> same? As I think it is not only ram that can become faulty.

Such hardware does exist -- for example, Stratus sells systems that run the same OS on two separate boards in lockstep, with a voter to determine what action to take if they ever diverge.
Re: Software based ECC ?
On Aug 12 2007 18:51, Folkert van Heusden wrote:

>> > http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
>> > "SoftECC : A System for Software Memory Integrity Checking"
>>
>> Personally, I'd recommend just shelling out the bucks for hardware ECC if
>> the reliability matters.
>
>a question and an idea: Q: is ecc guaranteed to detect all bitflips?
>
>Idea: what about a multicore system (3 or more) that runs the same
>processes on 2 cores and a third core verifying that they both do the
>same? As I think it is not only ram that can become faulty.

Indeed. And for example BOINC ([EMAIL PROTECTED]) has to consider this. Hence they recalculate each work unit at least three times and then compare the results.

What makes this different from ECC is that the checksum is not calculated on every memory operation, but at the end of a larger block of operations. Of course this may mean that an error can propagate for a while, but the total walltime (including recomputation) is lower. :)

Jan
--
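The "recalculate three times, compare at the end" idea above can be sketched as follows. This is not BOINC's API: `work_fn`, `vote3`, `run_redundant` and `sum_squares` are hypothetical names, and a real deployment would run the three copies on different hosts rather than back to back in one process. The point is only that the comparison happens once per work unit, not once per memory access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical work unit: a pure function of its input. */
typedef uint64_t (*work_fn)(uint64_t input);

/* Store the value that at least two of the three results agree on
 * and return 1; return 0 if all three differ. */
int vote3(uint64_t a, uint64_t b, uint64_t c, uint64_t *out)
{
    assert(out != NULL);
    if (a == b || a == c) {
        *out = a;
        return 1;
    }
    if (b == c) {
        *out = b;
        return 1;
    }
    return 0;
}

/* Compute the same work unit three times and compare only the
 * final results. */
int run_redundant(work_fn fn, uint64_t input, uint64_t *out)
{
    uint64_t r1, r2, r3;

    assert(fn != NULL);
    r1 = fn(input);
    r2 = fn(input);
    r3 = fn(input);
    return vote3(r1, r2, r3, out);
}

/* Example work unit: sum of the squares 1..n. */
uint64_t sum_squares(uint64_t n)
{
    uint64_t total = 0;

    for (uint64_t i = 1; i <= n; i++)
        total += i * i;
    return total;
}
```

A caller would use `run_redundant(sum_squares, 10, &result)` and treat a return value of 0 as "recompute the work unit". One corrupted run out of three is outvoted; an error that hits two runs identically is not caught.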
Re: Software based ECC ?
> > http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
> > "SoftECC : A System for Software Memory Integrity Checking"
>
> Personally, I'd recommend just shelling out the bucks for hardware ECC if
> the reliability matters.

a question and an idea:
Q: is ecc guaranteed to detect all bitflips?

Idea: what about a multicore system (3 or more) that runs the same processes on 2 cores and a third core verifying that they both do the same? As I think it is not only ram that can become faulty.

Folkert van Heusden

--
MultiTail is a flexible tool for monitoring logfiles and commands, with filters, colors, merging, different views, etc.
http://www.vanheusden.com/multitail/
--
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com
Re: Software based ECC ?
On Fri, 10 Aug 2007 23:16:45 +0200, roland said:

> http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
>
> "SoftECC : A System for Software Memory Integrity Checking"
>
> Is it possible to implement something like this within the Linux virtual
> memory subsystem ?

Anything that can be simulated with a Turing machine is *possible*. The question is how many rocket boosters the pig needs for takeoff. Hint: The thesis talks about why he didn't implement it for Linux.

> If it can be done, wouldn`t this be a great feature ?

Read section 5.2 of that thesis, particularly this quote from 5.2.2:

"For random word writes, this implies that SoftECC will need an order of magnitude more compute time than the user-mode code"

Basically, on every single memory page that gets dirtied, we have to then re-checksum the page (blowing away cache lines in the process). If you want to get a feel for it, find the kernel code that recognizes that a page is dirtied, and just add a few lines there:

    int foo = 0, i;
    for (i = 0; i < 1024; i++) {    /* adjust for non-4K pages */
            foo ^= *(page + i);     /* assumes page points at 32-bit words */
    }

and see how much your system crawls.

Personally, I'd recommend just shelling out the bucks for hardware ECC if the reliability matters.
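For anyone who would rather try the per-page checksum in user space first, here is a minimal sketch. It is not the scheme from the thesis, only the plain XOR loop from the message above wrapped in two hypothetical helpers, `page_xor` and `page_intact`, and it assumes a 4 KiB page viewed as 1024 32-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB page seen as 32-bit words; adjust for other sizes. */
#define PAGE_WORDS 1024u

/* XOR all words of the page together. */
uint32_t page_xor(const uint32_t *page)
{
    uint32_t sum = 0;

    assert(page != NULL);
    for (size_t i = 0; i < PAGE_WORDS; i++)
        sum ^= page[i];
    return sum;
}

/* Compare the page against a checksum saved while it was known good. */
int page_intact(const uint32_t *page, uint32_t saved)
{
    return page_xor(page) == saved;
}
```

Saving `page_xor(page)` while the page is clean and calling `page_intact()` later catches any single flipped bit. It also shows the limit discussed elsewhere in this thread: flipping the same bit position in two different words cancels out in the XOR, so that pair of errors passes the check.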
Re: Software based ECC ?
On Fri, 10 Aug 2007 23:16:45 +0200 "roland" <[EMAIL PROTECTED]> wrote:

> Hello !
>
> since ECC (speaking in terms of ram/memory) is some widespread hardware
> technology
> within server/enterprise computing for protection of memory failure, i
> wonder:
>
> Can`t this be done in software, too ?

Only one way to find out. If it interests you, have a go at it.
Software based ECC ?
Hello !

Since ECC (speaking in terms of RAM/memory) is a widespread hardware technology within server/enterprise computing for protection against memory failure, I wonder:

Can't this be done in software, too ?

I didn't find a reference on this list, but I found an interesting paper I'd like to share at:

http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf

"SoftECC : A System for Software Memory Integrity Checking"

Is it possible to implement something like this within the Linux virtual memory subsystem ?

If it can be done, wouldn't this be a great feature ?

regards
Roland K.
system engineer