Re: OS support for fault tolerance

2012-02-24 Thread Dieter BSD
The problem then is how to feed both machines the same inputs, and compare the outputs. Do we need a third machine to supervise? Can we have each machine keep an eye on the other, avoiding the need for a third machine? A pair would work as long as the only failures are obvious (e.g.

Re: OS support for fault tolerance

2012-02-24 Thread Adam Vande More
On Fri, Feb 24, 2012 at 3:10 PM, Dieter BSD dieter...@engineer.com wrote: Depends on what sort of work the machine is doing. If the job is something that can be done again, you could simply try again; if you still get different answers, try a third machine or wade in and start manually

Re: OS support for fault tolerance

2012-02-21 Thread Julian Elischer
On 2/20/12 6:32 AM, Da Rock wrote: On 02/15/12 03:25, Brandon Falk wrote: On 2/14/2012 12:05 PM, Jason Hellenthal wrote: On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote: On 2/14/12 6:23 AM, Maninya M wrote: For multicore desktop computers, suppose one of the cores fails, the

Re: OS support for fault tolerance

2012-02-20 Thread Da Rock
On 02/15/12 03:25, Brandon Falk wrote: On 2/14/2012 12:05 PM, Jason Hellenthal wrote: On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote: On 2/14/12 6:23 AM, Maninya M wrote: For multicore desktop computers, suppose one of the cores fails, the FreeBSD OS crashes. My question is

Re: OS support for fault tolerance

2012-02-20 Thread Dieter BSD
Rayson writes: The question is, are we planning to handle 95% of the errors for 99% of the hardware we run on, or are we really planning to spend years trying to design something that would require special hardware support? I assume this started as: Oh look, most CPUs have multiple cores

Re: OS support for fault tolerance

2012-02-20 Thread perryh
Dieter BSD dieter...@engineer.com wrote: The problem then is how to feed both machines the same inputs, and compare the outputs. Do we need a third machine to supervise? Can we have each machine keep an eye on the other, avoiding the need for a third machine? A pair would work as long as

OS support for fault tolerance

2012-02-14 Thread Maninya M
For multicore desktop computers, suppose one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to checkpoint the state of each core at specific intervals of time in main memory. Once a core fails, its previous state

Re: OS support for fault tolerance

2012-02-14 Thread Julian Elischer
On 2/14/12 6:23 AM, Maninya M wrote: For multicore desktop computers, suppose one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to checkpoint the state of each core at specific intervals of time in main

Re: OS support for fault tolerance

2012-02-14 Thread Jason Hellenthal
On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote: On 2/14/12 6:23 AM, Maninya M wrote: For multicore desktop computers, suppose one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to

Re: OS support for fault tolerance

2012-02-14 Thread mdf
On Tue, Feb 14, 2012 at 8:57 AM, Julian Elischer jul...@freebsd.org wrote: On 2/14/12 6:23 AM, Maninya M wrote: For multicore desktop computers, suppose one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to

Re: OS support for fault tolerance

2012-02-14 Thread Joshua Isom
On 2/14/2012 10:57 AM, Julian Elischer wrote: On 2/14/12 6:23 AM, Maninya M wrote: For multicore desktop computers, suppose one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to checkpoint the state of each

Re: OS support for fault tolerance

2012-02-14 Thread Brandon Falk
On 2/14/2012 12:05 PM, Jason Hellenthal wrote: On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote: On 2/14/12 6:23 AM, Maninya M wrote: For multicore desktop computers, suppose one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate

Re: OS support for fault tolerance (re-send)

2012-02-14 Thread Rayson Ho
(The email below did not show up on the online archive - resending...) -- Forwarded message -- From: Rayson Ho raysonlo...@gmail.com Date: Tue, Feb 14, 2012 at 12:27 PM Subject: Re: OS support for fault tolerance On Tue, Feb 14, 2012 at 11:57 AM, Julian Elischer jul

Re: OS support for fault tolerance

2012-02-14 Thread Rayson Ho
On Tue, Feb 14, 2012 at 11:57 AM, Julian Elischer jul...@freebsd.org wrote: but I'm interested in any answers people may have The way other OSes handle this is by detecting an abnormal number of faults (sometimes it's not the fault of the hardware - e.g. when a particle from outer space hits

Re: OS support for fault tolerance

2012-02-14 Thread Eitan Adler
On Tue, Feb 14, 2012 at 12:05 PM, Jason Hellenthal jh...@dataix.net wrote: How about core redundancy? Effectively this would cut the number of available cores in half if you spread a process to run on two cores at the same time but with an option to adjust this per process etc... I don't

Re: OS support for fault tolerance

2012-02-14 Thread Uffe Jakobsen
On 2012-02-14 18:13, Joshua Isom wrote: On 2/14/2012 10:57 AM, Julian Elischer wrote: On 2/14/12 6:23 AM, Maninya M wrote: For multicore desktop computers, suppose one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The

Re: OS support for fault tolerance

2012-02-14 Thread Julian Elischer
On 2/14/12 9:27 AM, Rayson Ho wrote: On Tue, Feb 14, 2012 at 11:57 AM, Julian Elischer jul...@freebsd.org wrote: but I'm interested in any answers people may have The way other OSes handle this is by detecting any abnormal amounts of faults (sometimes it's not the fault of the hardware - e.g.

Re: OS support for fault tolerance

2012-02-14 Thread Jan Mikkelsen
On 15/02/2012, at 3:57 AM, Julian Elischer wrote: On 2/14/12 6:23 AM, Maninya M wrote: For multicore desktop computers, suppose one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to checkpoint the state of

RE: OS support for fault tolerance

2012-02-14 Thread Devin Teske
-Original Message- From: owner-freebsd-hack...@freebsd.org [mailto:owner-freebsd-hack...@freebsd.org] On Behalf Of Julian Elischer Sent: Tuesday, February 14, 2012 3:02 PM To: Rayson Ho Cc: Maninya M; freebsd-hackers@freebsd.org Subject: Re: OS support for fault tolerance On 2

Re: OS support for fault tolerance

2012-02-14 Thread Rayson Ho
On Tue, Feb 14, 2012 at 6:01 PM, Julian Elischer jul...@freebsd.org wrote: True, but you can't guarantee that a CPU is going to fail in a way that you can detect like that. What if the clock just stops? The question is, are we planning to handle 95% of the errors for 99% of the hardware we run

Re: OS support for fault tolerance

2012-02-14 Thread Jim Bryant
Brandon Falk wrote: On 2/14/2012 12:05 PM, Jason Hellenthal wrote: On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote: On 2/14/12 6:23 AM, Maninya M wrote: For multicore desktop computers, suppose one of the cores fails, the FreeBSD OS crashes. My question is

Re: OS support for fault tolerance

2012-02-14 Thread Jim Bryant
Mirrored SMP? Even NonStops require a supervisory CPU subsystem to manage what is working or not. SMP itself would have to be totally rethought. My suggestion is to study the examples of NonStop and Guardian-90. Julian Elischer wrote: On 2/14/12 6:23 AM, Maninya M wrote: For multicore

Re: OS support for fault tolerance

2012-02-14 Thread Julian Elischer
On 2/14/12 3:51 PM, Jan Mikkelsen wrote: Coming back to the multicore issue: The problem when a core fails is that it has affected more than its own state. It will be holding locks on shared resources and may have corrupted shared memory or asked a device to do the wrong thing. By the time