For primary storage, does NexentaStor provide you with HA?
On Fri, Jul 19, 2013 at 12:09 PM, David Ortiz <dpor...@outlook.com> wrote:
> Dean,
>
> We didn't really have a recovery plan in place at the time. Fortunately for
> us, this was just before we went live for other users to hit our system, so
> what ended up happening was that I was able to compare the MySQL database
> entries for volumes with the list of files still present on primary
> storage. From there I could figure out which VMs were missing root disks
> and delete/rebuild them as needed; for the data volumes that were missing,
> we were able to simply recreate them, then go into the instances to
> reformat and do any other configuration. Fortunately we had created all the
> VMs that went down, and I had created base templates for each basic system
> type we were using (e.g. Hadoop node, web server, etc.), so recovery was
> pretty straightforward.
>
> We have now been taking snapshots of our VMs and vendor VMs so we can
> restore from those if things get corrupted. We are also using NexentaStor
> for our shared storage, which I believe lets you snapshot the entire shared
> filesystem as well.
>
> Thanks,
> Dave
>
> > Date: Mon, 15 Jul 2013 17:27:24 -0400
> > Subject: RE: outage feedback and questions
> > From: dean.kam...@gmail.com
> > To: users@cloudstack.apache.org
> >
> > Just wondering if you had a recovery plan?
> > Would you please share your experience with us.
> >
> > Thank you
> >
> > On Jul 15, 2013 4:47 PM, "David Ortiz" <dpor...@outlook.com> wrote:
> >
> > > Laurent,
> > >
> > > We too had some issues where we lost VMs after a switch went down. We
> > > are also using GFS2 over iSCSI for our primary storage. Once I got the
> > > cluster back up, fsck found a lot of corruption on the GFS2
> > > filesystem, which resulted in probably 6 of the 25 VMs we had needing
> > > to have volumes rebuilt, or having to be rebuilt completely. I would
> > > guess this is what happened in your case as well.
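The comparison David describes (MySQL volume records versus files actually present on primary storage) can be sketched roughly as below. The `cloud.volumes` table, its `path`/`removed` columns, and the `/mnt/primary` mount point are assumptions, not details taken from this thread; check them against your own schema. Sample data stands in for the real queries so the logic is visible end to end.

```shell
#!/bin/sh
# Sketch: find volumes the CloudStack database thinks exist but whose
# backing files are gone from primary storage. Table/column names and the
# mount point are assumptions -- verify against your installation.
#
# On a real management server the two lists would come from something like:
#   mysql -N -e "SELECT path FROM cloud.volumes WHERE removed IS NULL" | sort > /tmp/db_paths.txt
#   (cd /mnt/primary && ls -1 | sort) > /tmp/present_files.txt
#
# Sample data standing in for those queries:
printf '%s\n' vol-aaa vol-bbb vol-ccc | sort > /tmp/db_paths.txt
printf '%s\n' vol-aaa vol-ccc         | sort > /tmp/present_files.txt

# comm -23 keeps lines only in the first file, i.e. volumes whose backing
# file is missing and which therefore need rebuilding.
comm -23 /tmp/db_paths.txt /tmp/present_files.txt   # prints: vol-bbb
```

From that list of orphaned database entries you can decide, as David did, which VMs need a root-disk rebuild and which data volumes can simply be recreated.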
> > > Thanks,
> > > David Ortiz
> > >
> > > > From: dean.kam...@gmail.com
> > > > Date: Tue, 9 Jul 2013 19:35:52 -0400
> > > > Subject: Re: outage feedback and questions
> > > > To: users@cloudstack.apache.org
> > > >
> > > > Courtesy to geoff.higginbottom@shapeblue.com for answering this
> > > > question first.
> > > >
> > > > On Tue, Jul 9, 2013 at 7:33 PM, Dean Kamali <dean.kam...@gmail.com> wrote:
> > > >
> > > > > Well, I asked on the mailing list some time ago about CloudStack's
> > > > > behaviour when it loses connectivity to primary storage and the
> > > > > hypervisors then start rebooting randomly.
> > > > >
> > > > > I believe this is very similar to what happened in your case.
> > > > >
> > > > > This is actually 'by design'. The logic is that if the storage goes
> > > > > offline, then all VMs must have also failed, and a 'forced' reboot
> > > > > of the Host 'might' automatically fix things.
> > > > >
> > > > > This is great if you only have one Primary Storage, but typically
> > > > > you have more than one, so whilst the reboot might fix the failed
> > > > > storage, it will also kill off all the perfectly good VMs which
> > > > > were still happily running.
> > > > >
> > > > > The answer I got was for XenServer, not KVM; it involved removing
> > > > > the "reboot -f" calls from a script.
> > > > >
> > > > > The fix for XenServer Hosts is to:
> > > > >
> > > > > 1. Modify /opt/xensource/bin/xenheartbeat.sh on all your Hosts,
> > > > >    commenting out the two entries which have "reboot -f"
> > > > > 2. Identify the PID of the script - pidof -x xenheartbeat.sh
> > > > > 3. Restart the script - kill <pid>
> > > > > 4.
> > > > >    Force reconnect the Host from the UI; the script will then
> > > > >    re-launch on reconnect.
> > > > >
> > > > > On Tue, Jul 9, 2013 at 7:08 PM, Laurent Steff <laurent.st...@inria.fr> wrote:
> > > > >
> > > > >> Hi Dean,
> > > > >>
> > > > >> And thanks for your answer.
> > > > >>
> > > > >> Yes, the network troubles led to an issue with the main storage
> > > > >> on the clusters (iSCSI).
> > > > >>
> > > > >> So is it a fact that if the main storage is lost on KVM, VMs are
> > > > >> stopped and their domains destroyed?
> > > > >>
> > > > >> It was a hypothesis, as I found traces in
> > > > >>
> > > > >> apache-cloudstack-4.0.2-src/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHABase.java
> > > > >>
> > > > >> which "kills -9 qemu processes" if the main storage is not found,
> > > > >> but I was not sure when the function was called. It's in the
> > > > >> function checkingMountPoint, which calls destroyVMs if the mount
> > > > >> point is not found.
> > > > >>
> > > > >> Regards,
> > > > >>
> > > > >> ----- Original message -----
> > > > >> > From: "Dean Kamali" <dean.kam...@gmail.com>
> > > > >> > To: users@cloudstack.apache.org
> > > > >> > Sent: Monday, 8 July 2013 16:34:04
> > > > >> > Subject: Re: outage feedback and questions
> > > > >> >
> > > > >> > Surviving VMs are on the same KVM/GFS2 cluster. The SSVM is
> > > > >> > one of them. Messages on its console indicate it was
> > > > >> > temporarily in read-only mode.
> > > > >> >
> > > > >> > Do you have an issue with storage?
> > > > >> >
> > > > >> > I wouldn't expect a switch failure to cause all of this; it
> > > > >> > will cause loss of network connectivity, but it shouldn't
> > > > >> > cause your VMs to go down.
> > > > >> >
> > > > >> > This behavior usually happens when you lose your primary
> > > > >> > storage.
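The XenServer workaround described in the numbered steps above (commenting out the "reboot -f" lines in xenheartbeat.sh) can be scripted. The sketch below applies the edit to a sample fragment in /tmp so the sed expression can be seen working; on a real host the target would be /opt/xensource/bin/xenheartbeat.sh, followed by the pidof/kill/reconnect steps from the thread.

```shell
#!/bin/sh
# Sketch of step 1 of the XenServer fix: comment out every "reboot -f"
# line. Demonstrated on a sample fragment; on a real host you would run
# the same sed against /opt/xensource/bin/xenheartbeat.sh (the .bak backup
# is kept), then restart the script:
#   pid=$(pidof -x xenheartbeat.sh); kill "$pid"
# and force-reconnect the host from the UI so it relaunches.
cat > /tmp/xenheartbeat-sample.sh <<'EOF'
echo "heartbeat lost"
reboot -f
logger "storage check failed"
reboot -f
EOF

# Prefix a '#' to any line invoking "reboot -f"; keep a backup copy.
sed -i.bak '/reboot -f/s/^/#/' /tmp/xenheartbeat-sample.sh

# Show the now-disabled lines with their line numbers.
grep -n 'reboot -f' /tmp/xenheartbeat-sample.sh
```

The sed leaves every other line untouched, so the heartbeat script keeps reporting storage failures; only the forced reboot of the host is disabled.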
> > > > >> >
> > > > >> > On Mon, Jul 8, 2013 at 8:39 AM, Laurent Steff <laurent.st...@inria.fr> wrote:
> > > > >> >
> > > > >> > > Hello,
> > > > >> > >
> > > > >> > > CloudStack is used in our company as a core component of a
> > > > >> > > "Continuous Integration" service.
> > > > >> > >
> > > > >> > > We are mostly happy with it, for a lot of reasons too long to
> > > > >> > > describe. :)
> > > > >> > >
> > > > >> > > We recently had a major service outage on CloudStack, mainly
> > > > >> > > linked to bad practices on our side, and the aim of this post
> > > > >> > > is to:
> > > > >> > >
> > > > >> > > - ask questions about things we haven't understood yet
> > > > >> > > - gather some practical best practices we missed
> > > > >> > > - if the problems we hit are still present in CloudStack 4.x,
> > > > >> > >   help robustify CloudStack with our feedback
> > > > >> > >
> > > > >> > > We know that the 3.x version is not supported and plan to move
> > > > >> > > to 4.x ASAP.
> > > > >> > >
> > > > >> > > It's quite a long mail, and it may be badly directed (dev
> > > > >> > > mailing list? multiple bugs?)
> > > > >> > >
> > > > >> > > Any response is appreciated ;)
> > > > >> > >
> > > > >> > > Regards,
> > > > >> > >
> > > > >> > > -------------------- long part ----------------------------------------
> > > > >> > >
> > > > >> > > Architecture :
> > > > >> > > --------------
> > > > >> > >
> > > > >> > > Old, non-Apache CloudStack 3.0.2 release
> > > > >> > > 1 zone, 1 physical network, 1 pod
> > > > >> > > 1 VirtualRouter VM, 1 SSVM
> > > > >> > > 4 CentOS 6.3 KVM clusters, primary storage GFS2 on iSCSI storage
> > > > >> > > Management Server on a VMware virtual machine
> > > > >> > >
> > > > >> > > Incidents :
> > > > >> > > -----------
> > > > >> > >
> > > > >> > > Day 1 : Management Server DoSed by internal synchronization
> > > > >> > > scripts (LDAP to CloudStack)
> > > > >> > > Day 3 : DoS corrected; Management Server RAM and CPU upgraded,
> > > > >> > > then rebooted (it had not been rebooted in more than a year).
> > > > >> > > CloudStack is running normally again (VM
> > > > >> > > creation/stop/start/console/...)
> > > > >> > > Day 4 : (weekend) Network outage on the core datacenter
> > > > >> > > switch. Network unstable for 2 days.
> > > > >> > >
> > > > >> > > Symptoms :
> > > > >> > > ----------
> > > > >> > >
> > > > >> > > Day 7 : The network is operational, but most VMs have been
> > > > >> > > down (250 of 300) since Day 4. Libvirt configuration
> > > > >> > > (/etc/libvirt.d/qemu/VMuid.xml) erased.
> > > > >> > >
> > > > >> > > The VirtualRouter VM was one of them. Filesystem corruption
> > > > >> > > prevented it from rebooting normally.
> > > > >> > >
> > > > >> > > Surviving VMs are all on the same KVM/GFS2 cluster. The SSVM
> > > > >> > > is one of them.
> > > > >> > > Messages on its console indicate it was temporarily in
> > > > >> > > read-only mode.
> > > > >> > >
> > > > >> > > Hard way to revival (actions) :
> > > > >> > > -------------------------------
> > > > >> > >
> > > > >> > > 1. The VirtualRouter VM was destroyed by an administrator, to
> > > > >> > > let CloudStack recreate it from the template.
> > > > >> > >
> > > > >> > > BUT :)
> > > > >> > >
> > > > >> > > the SystemVM KVM template is not available. Its status in the
> > > > >> > > GUI is "CONNECTION REFUSED". The URL it was downloaded from
> > > > >> > > during install is no longer valid (an old, unavailable
> > > > >> > > internal mirror server instead of http://download.cloud.com)
> > > > >> > >
> > > > >> > > => we are unable to start the stopped VMs again or create new
> > > > >> > > ones
> > > > >> > >
> > > > >> > > 2. Manual download of the template on the Management Server,
> > > > >> > > as in a fresh install :
> > > > >> > >
> > > > >> > > ---
> > > > >> > > /usr/lib64/cloud/agent/scripts/storage/secondary/cloud-install-sys-tmplt \
> > > > >> > >   -m /mnt/secondary/ \
> > > > >> > >   -u http://ourworkingmirror/repository/cloudstack-downloads/acton-systemvm-02062012.qcow2.bz2 \
> > > > >> > >   -h kvm -F
> > > > >> > > ---
> > > > >> > >
> > > > >> > > It's not sufficient: the MySQL table template_host_ref does
> > > > >> > > not change, even when changing the URL in the MySQL tables.
> > > > >> > > We still have "CONNECTION REFUSED" as the template status in
> > > > >> > > MySQL and in the GUI.
> > > > >> > >
> > > > >> > > 3.
> > > > >> > > After analysis, we needed to alter the MySQL tables manually
> > > > >> > > (the template_id of the KVM systemVM was x) :
> > > > >> > >
> > > > >> > > ---
> > > > >> > > update template_host_ref set download_state='DOWNLOADED' where template_id=x;
> > > > >> > > update template_host_ref set job_id='NULL' where template_id=x;  <= may be useless
> > > > >> > > ---
> > > > >> > >
> > > > >> > > 4. As in MySQL, the status in the GUI is now DOWNLOADED.
> > > > >> > >
> > > > >> > > 5. On power-on of a stopped VM, CloudStack builds a new
> > > > >> > > VirtualRouter VM, and we can let users manually start their
> > > > >> > > stopped VMs.
> > > > >> > >
> > > > >> > > Questions :
> > > > >> > > -----------
> > > > >> > >
> > > > >> > > 1. What stopped and destroyed the libvirt domains of our
> > > > >> > > VMs ? There is some code that could do this, but I'm not
> > > > >> > > sure.
> > > > >> > >
> > > > >> > > 2. Is it possible that CloudStack autonomously triggered the
> > > > >> > > re-download of the systemVM template, or does it have to be a
> > > > >> > > human interaction ?
> > > > >> > >
> > > > >> > > 3. In 4.x, is the risk of a corrupted systemVM template, or
> > > > >> > > one with a bad status, still present ? Is there any warning
> > > > >> > > beyond a simple "connection refused" that is not really
> > > > >> > > visible as an alert ?
> > > > >> > >
> > > > >> > > 4. Does CloudStack retry by default to restart VMs that
> > > > >> > > should be up, or do we need configuration for this ?
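The manual database repair in step 3 above could be wrapped in a small script so the statements are generated consistently for review before being applied. The template id defaults to the placeholder "x" exactly as in the thread (the real id was not given), and note one hedged correction: an unquoted NULL is probably what was intended for job_id, since 'NULL' in quotes stores a literal four-character string rather than SQL NULL.

```shell
#!/bin/sh
# Sketch: generate the repair SQL from step 3 for a given template id.
# The id defaults to the placeholder "x" from the thread -- look up the
# real one in the templates table first. job_id uses unquoted NULL, since
# 'NULL' in quotes would store the literal string "NULL".
TEMPLATE_ID="${1:-x}"

cat > /tmp/fix-systemvm-template.sql <<EOF
UPDATE template_host_ref SET download_state='DOWNLOADED' WHERE template_id=${TEMPLATE_ID};
UPDATE template_host_ref SET job_id=NULL WHERE template_id=${TEMPLATE_ID};
-- verify the result before touching anything else
SELECT template_id, download_state, job_id FROM template_host_ref WHERE template_id=${TEMPLATE_ID};
EOF

# Review the generated file, then apply it with something like:
#   mysql cloud < /tmp/fix-systemvm-template.sql
cat /tmp/fix-systemvm-template.sql
```

Generating the SQL into a file first makes it easy to eyeball the WHERE clauses before running them against the production database, which matters when the fix is this ad hoc.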
> > > > >> > >
> > > > >> > > -------------------- end of long part ----------------------------------------
> > > > >> > >
> > > > >> > > --
> > > > >> > > Laurent Steff
> > > > >> > >
> > > > >> > > DSI/SESI
> > > > >> > > http://www.inria.fr/
> > > > >>
> > > > >> --
> > > > >> Laurent Steff
> > > > >>
> > > > >> DSI/SESI
> > > > >> INRIA
> > > > >> Tel.: +33 1 39 63 50 81
> > > > >> Mobile: +33 6 87 66 77 85
> > > > >> http://www.inria.fr/