Re: [ClusterLabs] Introducing the Anvil! Intelligent Availability platform
Digimer writes:
> Hi all,
>
> I suspect by now, many of you here have heard me talk about the Anvil!
> intelligent availability platform. Today, I am proud to announce that it
> is ready for general use!
>
> https://github.com/ClusterLabs/striker/releases/tag/v2.0.0

Cool, congratulations!

Cheers,
Kristoffer

> Now, time to start working full time on version 3!

--
// Kristoffer Grönlund
// kgronl...@suse.com
Re: [ClusterLabs] Introducing the Anvil! Intelligent Availability platform
On 05/07/17 14:55, Ken Gaillot wrote:
> Wow! I'm looking forward to the September summit talk.

Me too! Congratulations on the release :)

Chrissie
Re: [ClusterLabs] Introducing the Anvil! Intelligent Availability platform
Wow! I'm looking forward to the September summit talk.
[ClusterLabs] Introducing the Anvil! Intelligent Availability platform
Hi all,

I suspect by now, many of you here have heard me talk about the Anvil! intelligent availability platform. Today, I am proud to announce that it is ready for general use!

https://github.com/ClusterLabs/striker/releases/tag/v2.0.0

I started five years ago with an idea of building an "Availability Appliance": a single machine where any part could be failed, removed and replaced without needing a maintenance window. A system with no single point of failure anywhere, wrapped behind a very simple interface.

The underlying architecture that provides this redundancy was laid down years ago as an early tutorial and has been field tested all over North America and around the world in the years since. In that time, the Anvil! platform has demonstrated over 99.% availability!

Starting back then, the goal was to write the web interface that made it easy to use the Anvil! platform. Then, about two years ago, I decided that an Anvil! could be much, much more than just an appliance.

It could think for itself.

Today, I would like to announce version 2.0.0. This release introduces the ScanCore "decision engine". ScanCore can be thought of as a sort of "Layer 3" availability platform. Where Corosync provides membership and communications, with Pacemaker (and rgmanager) sitting on top monitoring applications and handling fault detection and recovery, ScanCore sits on top of both, gathering disparate data, analyzing it and making "big picture" decisions on how to best protect the hosted servers.

Examples:

1. All servers are on node 1, and node 1 suffers a cooling fan failure. ScanCore compares against node 2's health, waits a period of time in case it is a transient fault, and then autonomously live-migrates the servers to node 2. Later, node 2 suffers a drive failure, degrading the underlying RAID array. ScanCore can then compare the relative risks of a failed fan versus a degraded RAID array, determine that the failed fan is less risky, and automatically migrate the servers back to node 1. If a hot spare kicks in and the array returns to an Optimal state, ScanCore will again migrate the servers back to node 2. When node 1's fan failure is finally repaired, the servers stay on node 2, as there is no benefit to migrating now that both nodes are equally healthy.

2. Input power is lost to one UPS, but not the second UPS. ScanCore knows that good power is available and, so, doesn't react in any way. If input power is lost to both UPSes, however, then ScanCore will decide that the greatest risk to server availability is no longer unexpected component failure, but instead depleting batteries. Given this, it will decide that the best option to protect the hosted servers is to shed load and maximize run time. If the power stays out for too long, then ScanCore will determine that a hard power-off is imminent, and decide to gracefully shut down all hosted servers, withdraw and power off. Later, when power returns, the Striker dashboards will monitor the charge rate of the UPSes and, as soon as it is safe to do so, restart the nodes and restore full redundancy.

3. Similar to case 2, ScanCore can gather temperature data from multiple sources and use this data to distinguish localized cooling failures from environmental cooling failures, like the loss of an HVAC or AC system. In the former case, ScanCore will migrate servers off and, if critical temperatures are reached, shut down systems before hardware damage can occur. In the latter case, ScanCore will decide that minimizing thermal output is the best way to protect hosted servers and, so, will shed load to accomplish this. If necessary to avoid damage, ScanCore will perform a full shutdown. Once ScanCore (on the low-powered Striker dashboards) determines thermal levels are safe again, it will restart the nodes and restore full redundancy.
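[The announcement describes the behaviour of these decisions rather than the mechanism. Purely to make the pattern in example 1 concrete, here is a minimal sketch in Python; the fault names, severity weights, grace period, and the read_faults/migrate callables are all hypothetical stand-ins, not ScanCore's actual code.

    # Hypothetical sketch of example 1's decision pattern: score each
    # node's health, wait out possible transients, and keep servers on
    # the healthier node. Not ScanCore's real implementation.
    import time

    # Invented severity weights; a degraded RAID array is treated as
    # riskier than a failed cooling fan, matching the reasoning above.
    FAULT_WEIGHTS = {"fan_failed": 1, "raid_degraded": 2}
    TRANSIENT_GRACE_SECS = 120  # grace period, in case the fault clears

    def health_score(faults):
        """Lower is healthier: sum the weights of all active faults."""
        return sum(FAULT_WEIGHTS.get(f, 1) for f in faults)

    def choose_host(active, peer, read_faults, migrate):
        """One pass of the loop: return the node servers should run on."""
        if health_score(read_faults(active)) <= health_score(read_faults(peer)):
            return active  # equal or better health: no benefit to migrating

        # The active node is less healthy. Wait out a possible transient
        # fault, then re-check before live-migrating.
        time.sleep(TRANSIENT_GRACE_SECS)
        if health_score(read_faults(active)) > health_score(read_faults(peer)):
            migrate(source=active, target=peer)
            return peer
        return active

The two details worth noting are the grace period, which absorbs transient faults, and the "<=" comparison, which is why the servers stay put once both nodes are equally healthy again, exactly as described in example 1.]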
All of this intelligence is of little use, of course, if it is hard to build and maintain an Anvil! system. Perhaps the greatest lesson learned from our old tutorial was that the barrier to entry had to be reduced dramatically.

https://www.alteeve.com/w/Build_an_m2_Anvil!

So, this release also dramatically simplifies going from bare iron to provisioned, protected servers. Even with no experience in availability at all, a tech should be able to go from iron in boxes to provisioned servers in one or two days. Almost all steps have been automated, which serves the core goal of maximum reliability by minimizing the chances for human error.

This version also introduces the ability to run entirely offline. This version of the Anvil! is entirely self-contained, with internal repositories, making it possible to fully manage an Anvil! with no external access to the outside world, including rebuilding Striker dashboards or Anvil! nodes after a major fault and building new Anvil! node pairs.

There is so much more that the Anvil! platform can do, but this announcement is already quite long, so I'll stop here. I'm
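[The announcement doesn't spell out how those internal repositories are laid out. Purely as an illustration of the idea, and assuming an RPM-based node, a local repository definition pointing at a mirror hosted on a Striker dashboard might look like the following; every name and path here is hypothetical.

    # /etc/yum.repos.d/anvil-local.repo -- hypothetical illustration only;
    # the real repository names and paths are not given in the announcement.
    # Nodes resolve all packages from a mirror on the local Striker
    # dashboard, so rebuilds need no access to the outside world.
    [anvil-local]
    name=Anvil! internal package mirror (illustrative)
    baseurl=http://striker.example/repo/
    enabled=1
    gpgcheck=0

With all packages served locally, rebuilding a dashboard or node after a major fault never depends on an internet connection, which is the point of the self-contained design.]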