Re: Software based ECC ?

2007-08-21 Thread linux-os (Dick Johnson)

On Tue, 21 Aug 2007, Bodo Eggert wrote:

> Folkert van Heusden <[EMAIL PROTECTED]> wrote:
>
>>>> http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
>>>> "SoftECC : A System for Software Memory Integrity Checking"
>>>
>>> Personally, I'd recommend just shelling out the bucks for hardware ECC if
>>> the reliability matters.
>>
>> a question and an idea: Q: is ecc guaranteed to detect all bitflips?
>
> It's guaranteed not to.
>
> Having n extra bits, you can detect n-bit-flips and correct n/2-bit-flips
> (provided you use an optimal code).
>
> These extra bits can flip, too, so if you have m >= 1 data bits and any
> finite number n of extra bits, it's possible to have an undetectable
> n+1-bit-flip.
> -- 
> If you can't remember, then the claymore IS pointed at you.
>
Of course common ECC codes detect and correct single-bit errors.
When used in memory, the bits of a word are never physically adjacent,
so a cosmic ray or other stray particle that upsets several bits usually
upsets bits in different words, and they remain correctable.
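
A toy model of that interleaving argument, assuming a hypothetical layout in which WORDS words are bit-interleaved across one physical row (real DRAM layouts vary and nothing in this thread specifies one). Neighbouring cells belong to different words, so a burst of up to WORDS adjacent upsets leaves at most one bad bit per word:

```c
#include <string.h>

#define WORDS 4   /* hypothetical interleave factor */
#define BITS  8   /* bits per word in this toy model */

/* Physical cell p belongs to word (p % WORDS). */
int cell_word(int p)
{
    return p % WORDS;
}

/* Flip `len` physically adjacent cells starting at `start` and return
 * the largest number of flipped bits that land in any single word. */
int worst_word_damage(int start, int len)
{
    int hits[WORDS];
    int p, w, worst = 0;

    memset(hits, 0, sizeof hits);
    for (p = start; p < start + len && p < WORDS * BITS; p++)
        hits[cell_word(p)]++;
    for (w = 0; w < WORDS; w++)
        if (hits[w] > worst)
            worst = hits[w];
    return worst;
}
```

A burst longer than the interleave factor wraps around and puts two bad bits into one word, which single-error correction can no longer repair.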

The MIT paper is noticeably deficient in its ability to do anything
useful. It proposes checking things at 100 Hz intervals and trapping
each memory access, as though these things happen only once in a
while, and, of course, assumes that the code doing the checking will
never be corrupted. Further, it ignores the cache(s).

Cheers,
Dick Johnson
Penguin : Linux version 2.6.22.1 on an i686 machine (5588.29 BogoMips).
My book : http://www.AbominableFirebug.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Software based ECC ?

2007-08-21 Thread Bodo Eggert
Folkert van Heusden <[EMAIL PROTECTED]> wrote:

>> > http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
>> > "SoftECC : A System for Software Memory Integrity Checking"
>> 
>> Personally, I'd recommend just shelling out the bucks for hardware ECC if
>> the reliability matters.
> 
> a question and an idea: Q: is ecc guaranteed to detect all bitflips?

It's guaranteed not to.

Having n extra bits, you can detect n-bit-flips and correct n/2-bit-flips
(provided you use an optimal code).

These extra bits can flip, too, so if you have m >= 1 data bits and any
finite number n of extra bits, it's possible to have an undetectable
n+1-bit-flip.
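
A minimal sketch of the smallest case, n = 1, assuming a single even-parity check bit over one byte (function names are made up for illustration). One flip is caught; two flips, or one data flip plus a flip of the check bit itself, pass unnoticed:

```c
#include <stdint.h>

/* Even parity over 8 data bits: the single extra check bit. */
unsigned parity8(uint8_t d)
{
    d ^= d >> 4;
    d ^= d >> 2;
    d ^= d >> 1;
    return d & 1u;
}

/* Returns 1 if the stored check bit no longer matches the data. */
int parity_detects(uint8_t data, unsigned check)
{
    return parity8(data) != check;
}
```

For example, with data 0x5A and its stored parity, flipping bit 0 is detected, while flipping bits 0 and 1 together is not.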
-- 
If you can't remember, then the claymore IS pointed at you. 



Re: Software based ECC ?

2007-08-12 Thread Valdis . Kletnieks
On Sun, 12 Aug 2007 18:51:31 +0200, Folkert van Heusden said:

> a question and an idea: Q: is ecc guaranteed to detect all bitflips?

It depends on the exact ECC function the hardware implements.  Usually it
provides performance such as:

"Correct all 1-bit errors. Detect all 2-bit errors, and most 3 and higher,
but not correct".

(Of course, "correct all 1 or 2 bit and detect all 3 bit" can be done, it
just takes more bits of ECC.)
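
To make "correct all 1-bit, detect all 2-bit" concrete, here is a sketch of the smallest such code, an extended Hamming (8,4) code protecting one nibble. It is an illustration only: real memory controllers use much wider codes, and all names here are made up.

```c
#include <stdint.h>

enum ecc_status { ECC_OK, ECC_CORRECTED, ECC_UNCORRECTABLE };

/* Data bits live at codeword positions 3, 5, 6, 7; check bits at 1, 2, 4;
 * bit 0 holds the overall parity. */
static const int data_pos[4] = {3, 5, 6, 7};

/* XOR of the positions (1..7) of all set bits; 0 for a valid codeword. */
static int syndrome(uint8_t cw)
{
    int pos, syn = 0;

    for (pos = 1; pos <= 7; pos++)
        if ((cw >> pos) & 1)
            syn ^= pos;
    return syn;
}

/* 1 if an odd number of the 8 bits are set. */
static int odd_parity(uint8_t cw)
{
    cw ^= cw >> 4;
    cw ^= cw >> 2;
    cw ^= cw >> 1;
    return cw & 1;
}

uint8_t secded_encode(uint8_t nibble)
{
    uint8_t cw = 0;
    int i, syn;

    for (i = 0; i < 4; i++)
        if ((nibble >> i) & 1)
            cw |= (uint8_t)(1u << data_pos[i]);
    syn = syndrome(cw);
    for (i = 1; i <= 4; i <<= 1)      /* set check bits so syndrome becomes 0 */
        if (syn & i)
            cw |= (uint8_t)(1u << i);
    if (odd_parity(cw))
        cw |= 1u;                     /* make the overall parity even */
    return cw;
}

enum ecc_status secded_decode(uint8_t cw, uint8_t *nibble)
{
    int syn = syndrome(cw), odd = odd_parity(cw), i;
    enum ecc_status st = ECC_OK;

    if (odd) {                        /* odd number of flips: assume one */
        cw ^= (uint8_t)(1u << syn);   /* syn == 0 means bit 0 itself flipped */
        st = ECC_CORRECTED;
    } else if (syn != 0) {
        return ECC_UNCORRECTABLE;     /* even number of flips, at least two */
    }
    *nibble = 0;
    for (i = 0; i < 4; i++)
        if ((cw >> data_pos[i]) & 1)
            *nibble |= (uint8_t)(1u << i);
    return st;
}
```

In this tiny code a three-bit flip looks like a single-bit error and is silently miscorrected; wider codes can catch some of those, but none catch all.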

> Idea: what about a multicore system (3 or more) that runs the same
> processes on 2 cores and a third core verifying that they both do the
> same? As I think it is not only ram that can become faulty.

This is actually done for high-reliability systems (Google for "tell me twice"
and "tell me three times").  The problem is that it takes a lot of extra
hardware.  The G5 and later IBM Z-series mainframe chipsets (not to be confused
with the PowerPC G5) implemented dual computation units and a comparator that
signals a 'Machine Check' condition if the two CPUs don't end up in the
same exact state.  As an added bonus, at the end of each instruction where
both *do* compare good, it latches the *entire* state of the CPU out.  On a
miscompare, it then does the following:

1) Retry the instruction on the same CPU - if it compares correctly, keep
going and flag a "soft" error.

2) If it still fails, read out the last "known good" status latch, and load
it into a spare CPU, and fire it up, and flag the failing one as bad.
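
The control flow of those two steps can be sketched in software. This is only a model, assuming the whole CPU state fits in one 32-bit word and that an "instruction" is a function from state to state; it says nothing about how the IBM hardware is actually built, and every name is hypothetical.

```c
#include <stdint.h>

typedef uint32_t (*exec_fn)(uint32_t state);   /* one "instruction" step */

enum outcome { STEP_OK, STEP_SOFT_ERROR, STEP_FAILED_OVER };

struct cpu {
    exec_fn unit_a, unit_b;   /* the duplicated computation units */
    uint32_t checkpoint;      /* last state on which both units agreed */
};

/* Demo units: a correct one, one that glitches once, and a broken one. */
uint32_t unit_good(uint32_t s)   { return s + 1; }
uint32_t unit_glitch(uint32_t s)
{
    static int first = 1;

    if (first) {
        first = 0;
        return s + 2;         /* transient wrong result */
    }
    return s + 1;
}
uint32_t unit_broken(uint32_t s) { return s ^ 0x80000000u; }

enum outcome lockstep_step(struct cpu *cpu, struct cpu *spare)
{
    uint32_t a = cpu->unit_a(cpu->checkpoint);
    uint32_t b = cpu->unit_b(cpu->checkpoint);

    if (a == b) {                        /* normal case: latch the new state */
        cpu->checkpoint = a;
        return STEP_OK;
    }
    a = cpu->unit_a(cpu->checkpoint);    /* 1) retry on the same CPU */
    b = cpu->unit_b(cpu->checkpoint);
    if (a == b) {
        cpu->checkpoint = a;
        return STEP_SOFT_ERROR;
    }
    spare->checkpoint = cpu->checkpoint; /* 2) hand known-good state to spare */
    return STEP_FAILED_OVER;
}
```

The key design point is that the checkpoint is only overwritten after a successful compare, so there is always a trusted state to retry from or to move elsewhere.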

http://www.research.ibm.com/journal/rd/435/spainhower.pdf
http://www.research.ibm.com/journal/rd/435/mueller.pdf

These guys have forgotten more about designing highly reliable systems than
most of us will ever know. ;)

Needless to say, not everybody is willing to pay the costs of the hardware
overhead of this approach.  





Re: Software based ECC ?

2007-08-12 Thread chibiryuu
On 8/12/07, Folkert van Heusden <[EMAIL PROTECTED]> wrote:
> > > http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
> > > "SoftECC : A System for Software Memory Integrity Checking"
> >
> > Personally, I'd recommend just shelling out the bucks for hardware ECC if
> > the reliability matters.
>
> a question and an idea: Q: is ecc guaranteed to detect all bitflips?
>
> Idea: what about a multicore system (3 or more) that runs the same
> processes on 2 cores and a third core verifying that they both do the
> same? As I think it is not only ram that can become faulty.

Such hardware does exist -- for example, Stratus sells systems that
run the same OS on two separate boards in lockstep, with a voter to
determine what action to take if they ever diverge.


Re: Software based ECC ?

2007-08-12 Thread Jan Engelhardt

On Aug 12 2007 18:51, Folkert van Heusden wrote:
>
>> > http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
>> > "SoftECC : A System for Software Memory Integrity Checking"
>> 
>> Personally, I'd recommend just shelling out the bucks for hardware ECC if
>> the reliability matters.
>
>a question and an idea: Q: is ecc guaranteed to detect all bitflips?
>
>Idea: what about a multicore system (3 or more) that runs the same
>processes on 2 cores and a third core verifying that they both do the
>same? As I think it is not only ram that can become faulty.

Indeed. And for example BOINC ([EMAIL PROTECTED]) have to consider this. Hence
they recalculate each work unit at least three times and then compare the
results. What makes this different from ECC is that the checksum is not calculated
on every memory operation, but at the end of a larger block of operations. Of
course this may mean that an error can propagate for a while, but the total
walltime (including recomputation) is lower. :)
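
A minimal sketch of the compare step for three independently computed results, assuming each work unit reduces to a single 64-bit value. This is not BOINC's actual validator, just the majority-vote idea:

```c
#include <stdint.h>

/* Returns 1 and stores the majority value if at least two of the three
 * results agree; returns 0 if all three differ. */
int vote3(uint64_t a, uint64_t b, uint64_t c, uint64_t *winner)
{
    if (a == b || a == c) {
        *winner = a;
        return 1;
    }
    if (b == c) {
        *winner = b;
        return 1;
    }
    return 0;
}
```

When all three disagree the only safe option is to recompute, which is where the extra walltime mentioned above comes from.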


Jan


Re: Software based ECC ?

2007-08-12 Thread Folkert van Heusden
> > http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
> > "SoftECC : A System for Software Memory Integrity Checking"
> 
> Personally, I'd recommend just shelling out the bucks for hardware ECC if
> the reliability matters.

a question and an idea: Q: is ecc guaranteed to detect all bitflips?

Idea: what about a multicore system (3 or more) that runs the same
processes on 2 cores and a third core verifying that they both do the
same? As I think it is not only ram that can become faulty.



Folkert van Heusden

-- 
MultiTail is a flexible tool for monitoring logfiles and commands.
With filters, colors, merging, different views, etc.
http://www.vanheusden.com/multitail/
--
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com


Re: Software based ECC ?

2007-08-11 Thread Valdis . Kletnieks
On Fri, 10 Aug 2007 23:16:45 +0200, roland said:

> http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
> 
> "SoftECC : A System for Software Memory Integrity Checking"
> 
> Is it possible to implement something like this within the Linux virtual
> memory subsystem ?

Anything that can be simulated with a Turing machine is *possible*.

The question is how many rocket boosters the pig needs for takeoff.

Hint: The thesis talks about why he didn't implement it for Linux.

> If it can be done, wouldn`t this be a great feature ?

Read section 5.2 of that thesis, particularly this quote from 5.2.2:

"For random word writes, this implies that SoftECC will need an order of
magnitude more compute time than the user-mode code"

Basically, on every single memory page that gets dirtied, we have to then
re-checksum the page (blowing away cache lines in the process).  If you want
to get a feel for it, find the kernel code that recognizes that a page is
dirtied, and just add a few lines there:

int foo = 0, i;
for (i = 0; i < 1024; i++) { // 1024 32-bit words = 4K; adjust for non-4K pages
        foo ^= *(page + i);  // page: pointer to the first word of the dirtied page
}

and see how much your system crawls.
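
For anyone who would rather not patch the kernel, a user-space approximation of the same loop, assuming 4K pages and 32-bit words (the function names are made up). It shows the write amplification directly: every single-word store is followed by a pass over all 1024 words of the page.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u   /* assumed page size */
#define PAGE_WORDS (PAGE_BYTES / sizeof(uint32_t))

/* XOR all 32-bit words of one page together. */
uint32_t page_xor(const uint32_t *page)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < PAGE_WORDS; i++)
        sum ^= page[i];
    return sum;
}

/* Store one word, then re-checksum the whole page, as a naive
 * "checksum on every dirtying write" scheme would have to. */
uint32_t store_and_rechecksum(uint32_t *page, size_t idx, uint32_t val)
{
    page[idx] = val;
    return page_xor(page);
}
```

Calling store_and_rechecksum() in a loop over random indices and timing it against plain stores gives a rough feel for the overhead, without the cache and fault-handling effects a real kernel implementation would add.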

Personally, I'd recommend just shelling out the bucks for hardware ECC if
the reliability matters.





Re: Software based ECC ?

2007-08-10 Thread Alan Cox
On Fri, 10 Aug 2007 23:16:45 +0200
"roland" <[EMAIL PROTECTED]> wrote:

> Hello !
> 
> since ECC (speaking in terms of ram/memory) is some widespread hardware
> technology
> within server/enterprise computing for protection of memory failure,  i
> wonder:
> 
> Can`t this be done in software, too ?

Only one way to find out. If it interests you - have a go at it


Software based ECC ?

2007-08-10 Thread roland

Hello !

Since ECC (speaking in terms of RAM/memory) is a widespread hardware
technology in server/enterprise computing for protection against memory
failure, I wonder:

Can't this be done in software, too?

I didn't find a reference on this list, but I found an interesting paper I'd
like to share at:

http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf

"SoftECC : A System for Software Memory Integrity Checking"

Is it possible to implement something like this within the Linux virtual
memory subsystem?
If it can be done, wouldn't this be a great feature?

regards
Roland K.
system engineer



