Laurent,

We too had some issues where we lost VMs after a switch went down. We are also using GFS2 over iSCSI for our primary storage. Once I got the cluster back up, fsck found a lot of corruption on the GFS2 filesystem, which meant that roughly 6 of the 25 VMs we had needed their volumes rebuilt, or had to be rebuilt completely. I would guess this is what happened in your case as well.
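In case it helps, the recovery on our side amounted to roughly the following (a sketch only; the device, mount point and service names below are placeholders for whatever your setup uses):

---
# Sketch only: device, mount point and service names are placeholders.
# fsck.gfs2 requires the filesystem to be unmounted on EVERY cluster node.

service cloud-agent stop                 # 'cloudstack-agent' on 4.x installs
umount /mnt/primary                      # repeat on all nodes in the cluster

fsck.gfs2 -y /dev/mapper/iscsi-primary   # run from a single node only

mount /mnt/primary                       # remount everywhere, then restart
service cloud-agent start                # the agent
---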
Thanks,
David Ortiz

> From: dean.kam...@gmail.com
> Date: Tue, 9 Jul 2013 19:35:52 -0400
> Subject: Re: outage feedback and questions
> To: users@cloudstack.apache.org
>
> Courtesy to geoff.higginbottom@shapeblue.com for answering this question first.
>
> On Tue, Jul 9, 2013 at 7:33 PM, Dean Kamali <dean.kam...@gmail.com> wrote:
>
> > Well, I asked on the mailing list some time ago about CloudStack's
> > behaviour when connectivity to primary storage is lost and the
> > hypervisors start rebooting randomly.
> >
> > I believe this is very similar to what happened in your case.
> >
> > This is actually 'by design'. The logic is that if the storage goes
> > offline, then all VMs must have also failed, and a 'forced' reboot of
> > the Host 'might' automatically fix things.
> >
> > This is great if you only have one Primary Storage, but typically you
> > have more than one, so whilst the reboot might fix the failed storage,
> > it will also kill off all the perfectly good VMs which were still
> > happily running.
> >
> > The answer I got was for XenServer, not KVM; it involved removing the
> > "reboot -f" calls from a script.
> >
> > The fix for XenServer Hosts is to:
> >
> > 1. Modify /opt/xensource/bin/xenheartbeat.sh on all your Hosts,
> >    commenting out the two entries which have "reboot -f"
> >
> > 2. Identify the PID of the script - pidof -x xenheartbeat.sh
> >
> > 3. Restart the script - kill <pid>
> >
> > 4. Force reconnect the Host from the UI; the script will then
> >    re-launch on reconnect
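(If you need to apply Dean's XenServer workaround, the host-side steps above amount to something like this on each host -- an untested sketch, assuming a stock heartbeat script that has not already been edited:)

---
# Sketch of Dean's steps 1-3; run on each XenServer host and test first.
# Comment out the lines containing "reboot -f" (a .bak copy is kept).
sed -i.bak '/reboot -f/ s/^/#/' /opt/xensource/bin/xenheartbeat.sh

# Kill the running copy; CloudStack relaunches it when the host is
# force-reconnected from the UI (step 4).
kill $(pidof -x xenheartbeat.sh)
---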
> >
> > On Tue, Jul 9, 2013 at 7:08 PM, Laurent Steff <laurent.st...@inria.fr> wrote:
> >
> >> Hi Dean,
> >>
> >> And thanks for your answer.
> >>
> >> Yes, the network troubles led to issues with the main storage on the
> >> clusters (iSCSI).
> >>
> >> So is it a fact that if the main storage is lost on KVM, the VMs are
> >> stopped and their domains destroyed?
> >>
> >> It was a hypothesis, as I found traces in
> >>
> >> apache-cloudstack-4.0.2-src/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHABase.java
> >>
> >> which "kill -9" the qemu processes if the main storage is not found,
> >> but I was not sure when the function was called.
> >>
> >> It's in the function checkingMountPoint, which calls destroyVMs if the
> >> mount point is not found.
> >>
> >> Regards,
> >>
> >> ----- Original message -----
> >> > From: "Dean Kamali" <dean.kam...@gmail.com>
> >> > To: users@cloudstack.apache.org
> >> > Sent: Monday, 8 July 2013 16:34:04
> >> > Subject: Re: outage feedback and questions
> >> >
> >> > Survivor VMs are on the same KVM/GFS2 cluster.
> >> > The SSVM is one of them. Messages on the console indicate it was
> >> > temporarily in read-only mode.
> >> >
> >> > Do you have an issue with storage?
> >> >
> >> > I wouldn't expect a switch failure to cause all of this; it would
> >> > cause a loss of network connectivity, but it shouldn't cause your
> >> > VMs to go down.
> >> >
> >> > This behaviour usually happens when you lose your primary storage.
> >> >
> >> >
> >> > On Mon, Jul 8, 2013 at 8:39 AM, Laurent Steff <laurent.st...@inria.fr> wrote:
> >> >
> >> > > Hello,
> >> > >
> >> > > CloudStack is used in our company as a core component of a
> >> > > "Continuous Integration" service.
> >> > >
> >> > > We are mainly happy with it, for a lot of reasons too long to
> >> > > describe. :)
> >> > >
> >> > > We recently encountered a major service outage on CloudStack,
> >> > > mainly linked to bad practices on our side, and the aim of this
> >> > > post is to:
> >> > >
> >> > > - ask questions about things we didn't understand yet
> >> > > - gather some practical best practices we missed
> >> > > - if the problems detected are still present in CloudStack 4.x,
> >> > >   help make CloudStack more robust with our feedback
> >> > >
> >> > > We know that the 3.x version is not supported and plan to move to
> >> > > 4.x ASAP.
> >> > >
> >> > > It's quite a long mail, and it may be badly directed (dev mailing
> >> > > list? multiple bugs?)
> >> > >
> >> > > Any response is appreciated ;)
> >> > >
> >> > > Regards,
> >> > >
> >> > >
> >> > > --------------------long part----------------------------------------
> >> > >
> >> > > Architecture:
> >> > > --------------
> >> > >
> >> > > Old, non-Apache CloudStack 3.0.2 release
> >> > > 1 zone, 1 physical network, 1 pod
> >> > > 1 Virtual Router VM, 1 SSVM
> >> > > 4 CentOS 6.3 KVM clusters, primary storage GFS2 on iSCSI storage
> >> > > Management Server on a VMware virtual machine
> >> > >
> >> > >
> >> > > Incidents:
> >> > > -----------
> >> > >
> >> > > Day 1: Management Server DoSed by internal synchronization scripts
> >> > > (LDAP to CloudStack)
> >> > > Day 3: DoS corrected, Management Server RAM and CPU upgraded, and
> >> > > rebooted (it had not been rebooted in more than a year). CloudStack
> >> > > is running normally again (VM creation/stop/start/console/...)
> >> > > Day 4: (weekend) Network outage on a core datacenter switch.
> >> > > Network unstable for 2 days.
> >> > >
> >> > > Symptoms:
> >> > > ----------
> >> > >
> >> > > Day 7: The network is operational, but most of the VMs (250 of 300)
> >> > > have been down since Day 4.
> >> > > Libvirt configuration (/etc/libvirt.d/qemu/VMuid.xml) erased.
> >> > >
> >> > > The VirtualRouter VM was one of them: filesystem corruption
> >> > > prevented it from rebooting normally.
> >> > >
> >> > > The surviving VMs are all on the same KVM/GFS2 cluster.
> >> > > The SSVM is one of them. Messages on the console indicate it was
> >> > > temporarily in read-only mode.
> >> > >
> >> > > Hard way to revival (actions):
> >> > > -----------------------------
> >> > >
> >> > > 1. The VirtualRouter VM was destroyed by an administrator, to let
> >> > > CloudStack recreate it from the template.
> >> > >
> >> > > BUT :)
> >> > >
> >> > > The SystemVM KVM template is not available. Its status in the GUI
> >> > > is "CONNECTION REFUSED".
> >> > > The URL it was downloaded from during the install is no longer
> >> > > valid (an old, unavailable internal mirror server instead of
> >> > > http://download.cloud.com)
> >> > >
> >> > > => we are unable to restart the stopped VMs or create new ones
> >> > >
> >> > > 2. Manual download of the template on the Management Server, as in
> >> > > a fresh install:
> >> > >
> >> > > ---
> >> > > /usr/lib64/cloud/agent/scripts/storage/secondary/cloud-install-sys-tmplt
> >> > > -m /mnt/secondary/ -u
> >> > > http://ourworkingmirror/repository/cloudstack-downloads/acton-systemvm-02062012.qcow2.bz2
> >> > > -h kvm -F
> >> > > ---
> >> > >
> >> > > This was not sufficient: the template_host_ref table in MySQL did
> >> > > not change, even when changing the URL in the MySQL tables.
> >> > > We still had "CONNECTION REFUSED" as the template status in MySQL
> >> > > and in the GUI.
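(For anyone else stuck at this "CONNECTION REFUSED" stage: before editing anything by hand, it is worth checking what the management server has actually recorded for the template. A minimal sketch, assuming the default "cloud" database name and whatever MySQL credentials your install uses; replace x with the numeric id of the SystemVM template:)

---
# -p prompts for the MySQL password; "cloud" is the database name here.
# Add further columns (e.g. the download URL) if you want to inspect them.
mysql -u cloud -p cloud -e \
  "SELECT template_id, download_state, job_id
     FROM template_host_ref
    WHERE template_id = x;"
---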
> >> > >
> >> > > 3. After analysis, we needed to manually alter the MySQL tables
> >> > > (the template_id of the SystemVM KVM template was x):
> >> > >
> >> > > ---
> >> > > update template_host_ref set download_state='DOWNLOADED' where
> >> > > template_id=x;
> >> > > update template_host_ref set job_id='NULL' where template_id=x;
> >> > > <= may be useless
> >> > > ---
> >> > >
> >> > > 4. As in MySQL, the status in the GUI is now DOWNLOADED
> >> > >
> >> > > 5. Power on of a stopped VM: CloudStack builds a new VirtualRouter
> >> > > VM and we can let users start their stopped VMs manually
> >> > >
> >> > >
> >> > > Questions:
> >> > > -----------
> >> > >
> >> > > 1. What stopped and destroyed the libvirt domains of our VMs?
> >> > > There is some part of the code that could do this, but I'm not
> >> > > sure.
> >> > >
> >> > > 2. Is it possible that CloudStack autonomously triggered the
> >> > > re-download of the SystemVM template? Or does it have to be a
> >> > > human interaction?
> >> > >
> >> > > 3. In 4.x, is the risk of a corrupted SystemVM template, or one
> >> > > with a bad status, still present? Is there any warning beyond a
> >> > > simple "connection refused" that is not really visible as an
> >> > > alert?
> >> > >
> >> > > 4. Does CloudStack retry by default to restart VMs that should be
> >> > > up, or do we need configuration for this?
> >> > >
> >> > >
> >> > > --------------------end of long part----------------------------------------
> >> > >
> >> > >
> >> > > --
> >> > > Laurent Steff
> >> > >
> >> > > DSI/SESI
> >> > > http://www.inria.fr/
> >> > >
> >> >
> >>
> >> --
> >> Laurent Steff
> >>
> >> DSI/SESI
> >> INRIA
> >> Tel.: +33 1 39 63 50 81
> >> Mobile: +33 6 87 66 77 85
> >> http://www.inria.fr/
> >>
> >
> >
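(Coming back to Laurent's Question 1: the behaviour he found in KVMHABase.java -- checkingMountPoint calling destroyVMs when the primary storage mount point is gone -- is roughly the following, sketched in shell purely for illustration; the real logic is Java inside the KVM agent, and the path below is a placeholder:)

---
# Illustration only -- not the actual agent code. Approximates the HA check
# Laurent describes: if the primary storage mount point has disappeared, the
# agent assumes the VM disks are gone and force-kills the qemu processes.
PRIMARY_MOUNT="/mnt/primary-pool-uuid"   # placeholder path

if ! mountpoint -q "$PRIMARY_MOUNT"; then
    pkill -9 -f qemu-kvm                 # the "kill -9 qemu processes" step
fi
---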