Re: [ClusterLabs] Introducing the Anvil! Intelligent Availability platform

2017-07-10 Thread Kristoffer Grönlund
Digimer writes:

> Hi all,
>
>   I suspect by now, many of you here have heard me talk about the Anvil!
> intelligent availability platform. Today, I am proud to announce that it
> is ready for general use!
>
> https://github.com/ClusterLabs/striker/releases/tag/v2.0.0
>

Cool, congratulations!

Cheers,
Kristoffer

>
>   Now, time to start working full time on version 3!
>
> -- 
> Digimer
> Papers and Projects: https://alteeve.com/w/
> "I am, somehow, less interested in the weight and convolutions of
> Einstein’s brain than in the near certainty that people of equal talent
> have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
>

-- 
// Kristoffer Grönlund
// kgronl...@suse.com

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Introducing the Anvil! Intelligent Availability platform

2017-07-06 Thread Christine Caulfield
On 05/07/17 14:55, Ken Gaillot wrote:
> Wow! I'm looking forward to the September summit talk.
> 



Me too! Congratulations on the release :)

Chrissie



> On 07/05/2017 01:52 AM, Digimer wrote:
>> Hi all,
>>
>>   I suspect by now, many of you here have heard me talk about the Anvil!
>> intelligent availability platform. Today, I am proud to announce that it
>> is ready for general use!
>>
>> https://github.com/ClusterLabs/striker/releases/tag/v2.0.0
>> [...]

Re: [ClusterLabs] Introducing the Anvil! Intelligent Availability platform

2017-07-05 Thread Ken Gaillot
Wow! I'm looking forward to the September summit talk.

On 07/05/2017 01:52 AM, Digimer wrote:
> Hi all,
> 
>   I suspect by now, many of you here have heard me talk about the Anvil!
> intelligent availability platform. Today, I am proud to announce that it
> is ready for general use!
> 
> https://github.com/ClusterLabs/striker/releases/tag/v2.0.0
> 
> [...]

[ClusterLabs] Introducing the Anvil! Intelligent Availability platform

2017-07-05 Thread Digimer
Hi all,

  I suspect by now, many of you here have heard me talk about the Anvil!
intelligent availability platform. Today, I am proud to announce that it
is ready for general use!

https://github.com/ClusterLabs/striker/releases/tag/v2.0.0

  I started five years ago with an idea of building an "Availability
Appliance". A single machine where any part could be failed, removed and
replaced without needing a maintenance window. A system with no single
point of failure anywhere wrapped behind a very simple interface.

  The underlying architecture that provides this redundancy was laid
down years ago as an early tutorial and has been field tested all over
North America and around the world in the years since. In that time, the
Anvil! platform has demonstrated over 99.% availability!

  Starting back then, the goal was to write the web interface that made
it easy to use the Anvil! platform. Then, about two years ago, I decided
that an Anvil! could be much, much more than just an appliance.

  It could think for itself.

  Today, I would like to announce version 2.0.0. This release
introduces the ScanCore "decision engine". ScanCore can be thought of as
a sort of "Layer 3" availability platform. Where Corosync provides
membership and communications, with Pacemaker (and rgmanager) sitting on
top, monitoring applications and handling fault detection and recovery,
ScanCore sits on top of both, gathering disparate data, analyzing it and
making "big picture" decisions on how to best protect the hosted servers.

  Examples:

1. All servers are on node 1, and node 1 suffers a cooling fan failure.
ScanCore compares against node 2's health, waits a period of time in
case it is a transient fault, and then autonomously live-migrates the
servers to node 2. Later, node 2 suffers a drive failure, degrading the
underlying RAID array. ScanCore can then compare the relative risks of a
failed fan versus a degraded RAID array, determine that the failed fan
is less risky and automatically migrate the servers back to node 1. If a
hot-spare kicks in and the array returns to an Optimal state, ScanCore
will again migrate the servers back to node 2. When node 1's fan failure
is finally repaired, the servers stay on node 2, as there is no benefit
to migrating now that both nodes are equally healthy.

2. Input power is lost to one UPS, but not the second UPS. ScanCore
knows that good power is available and, so, doesn't react in any way. If
input power is lost to both UPSes, however, then ScanCore will decide
that the greatest risk to server availability is no longer unexpected
component failure, but instead depleting the batteries. Given this, it
will decide that the best option to protect the hosted servers is to
shed load and maximize run time. If the power stays out for too long,
then ScanCore will determine that a hard power-off is imminent, and
decide to gracefully shut down all hosted servers, withdraw and power
off. Later,
when power returns, the Striker dashboards will monitor the charge rate
of the UPSes and as soon as it is safe to do so, restart the nodes and
restore full redundancy.

3. Similar to case 2, ScanCore can gather temperature data from multiple
sources and use this data to distinguish localized cooling failures from
environmental cooling failures, like the loss of an HVAC or AC system.
In the former case, ScanCore will migrate servers off and, if critical
temperatures are reached, shut down systems before hardware damage can
occur. In the latter case, ScanCore will decide that minimizing thermal
output is the best way to protect hosted servers and, so, will shed load
to accomplish this. If necessary to avoid damage, ScanCore will perform
a full shutdown. Once ScanCore (on the low-powered Striker dashboards)
determines thermal levels are safe again, it will restart the nodes and
restore full redundancy.
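
To give a feel for the shape of this decision logic, here is a tiny,
purely illustrative sketch (Python, for readability). The weights,
thresholds and function names are all invented for this post; none of
it is ScanCore's actual code. The core principle, though, is the same:
distinguish local faults from shared ones, and only act when the
action leaves the hosted servers measurably safer.

# Purely illustrative sketch of the arbitration in examples 1-3.
# Every name, weight and threshold is invented; this is NOT
# ScanCore's actual code or API.

# Higher score = less healthy. A degraded array outweighs a failed
# fan, as in example 1.
FAULT_WEIGHTS = {
    "fan_failed":     1,
    "raid_degraded":  2,
    "ups_on_battery": 3,
    "over_temp":      4,
}

TRANSIENT_GRACE = 120  # seconds to wait out a possibly-transient fault

def health_score(faults):
    """Sum the weights of a node's active faults (0 = fully healthy)."""
    return sum(FAULT_WEIGHTS.get(f, 1) for f in faults)

def decide(node1_faults, node2_faults, fault_age, hosting="node1"):
    """Return the action that best protects the hosted servers."""
    if fault_age < TRANSIENT_GRACE:
        return "wait"                     # may be transient; don't react
    shared = node1_faults & node2_faults  # faults hitting *both* nodes
    if "ups_on_battery" in shared or "over_temp" in shared:
        # Environmental problem (examples 2 and 3): migrating cannot
        # help, so shed load (and shut down gracefully if it persists).
        return "shed_load"
    peer = "node2" if hosting == "node1" else "node1"
    scores = {"node1": health_score(node1_faults),
              "node2": health_score(node2_faults)}
    if scores[peer] < scores[hosting]:
        return "migrate_to_" + peer       # peer is measurably healthier
    return "stay"                         # tie or worse: no benefit to moving

# Example 1 in miniature: node 1 loses a fan while node 2 is healthy.
print(decide({"fan_failed"}, set(), fault_age=300))  # -> migrate_to_node2
# Both UPSes on battery: migration is pointless, shed load instead.
print(decide({"ups_on_battery"}, {"ups_on_battery"}, fault_age=300))
# -> shed_load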

  All of this intelligence is of little use, of course, if it is hard to
build and maintain an Anvil! system. Perhaps the greatest lesson learned
from our old tutorial was that the barrier to entry had to be reduced
dramatically.

https://www.alteeve.com/w/Build_an_m2_Anvil!

  So, this release also dramatically simplifies going from bare iron
to provisioned, protected servers. Even with no experience in
availability at all, a tech should be able to go from iron in boxes to
provisioned servers in one or two days. Almost all steps have
been automated, which serves the core goal of maximum reliability by
minimizing the chances for human error.

  This version also introduces the ability to run entirely offline.
The Anvil! is now entirely self-contained, with internal repositories,
making it possible to fully manage an Anvil! with no access to the
outside world, including rebuilding Striker
dashboards or Anvil! nodes after a major fault and building new Anvil!
node pairs.
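
For instance (a purely hypothetical illustration; the real file name,
repository id and address will differ on an actual Anvil!), each node
might carry a repository definition pointing at a Striker dashboard's
internal package mirror:

# /etc/yum.repos.d/striker-local.repo -- invented example, not the
# actual file shipped with the Anvil!
[striker-local]
name=Striker dashboard internal package mirror (example)
baseurl=http://striker01.example.com/repo/
enabled=1
gpgcheck=0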

  There is so much more that the Anvil! platform can do, but this
announcement is already quite long, so I'll stop here.

  I'm