Courtesy to geoff.higginbottom@shapeblue.com for answering this question first.


On Tue, Jul 9, 2013 at 7:33 PM, Dean Kamali <dean.kam...@gmail.com> wrote:

> Well, I asked on the mailing list some time ago about CloudStack's
> behaviour when I lose connectivity to primary storage and the
> hypervisors start rebooting randomly.
>
> I believe this is very similar to what happened in your case.
>
> This is actually 'by design'.  The logic is that if the storage goes
> offline, then all VMs must have also failed, and a 'forced' reboot of the
> Host 'might' automatically fix things.
>
> This is great if you only have one Primary Storage, but typically you
> have more than one, so whilst the reboot might fix the failed storage, it
> will also kill off all the perfectly good VMs which were still happily
> running.
>
> The answer I got was for XenServer, not KVM; it involved removing the
> "reboot -f" calls from a script.
>
>
>
> The fix for XenServer Hosts is to:
>
> 1. Modify /opt/xensource/bin/xenheartbeat.sh on all your Hosts,
> commenting out the two entries which have "reboot -f"
>
> 2. Identify the PID of the script  - pidof -x xenheartbeat.sh
>
> 3. Restart the Script  - kill <pid>
>
> 4. Force reconnect the Host from the UI; the script will then re-launch
> on reconnect
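>
> A minimal sketch of steps 1-3 as shell commands (the script path is the
> stock XenServer one; double-check the commented lines by hand before
> killing the script):
>
> ---
> # comment out every line containing "reboot -f" in the heartbeat script
> sed -i '/reboot -f/s/^/# /' /opt/xensource/bin/xenheartbeat.sh
>
> # find the running instance and kill it; it re-launches when the Host
> # is force-reconnected from the UI
> kill $(pidof -x xenheartbeat.sh)
> ---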
>
>
>
> On Tue, Jul 9, 2013 at 7:08 PM, Laurent Steff <laurent.st...@inria.fr> wrote:
>
>> Hi Dean,
>>
>> And thanks for your answer.
>>
>> Yes, the network troubles led to issues with the primary storage
>> on the clusters (iSCSI).
>>
>> So is it a fact that if the primary storage is lost on KVM, VMs are
>> stopped and their domains destroyed?
>>
>> It was a hypothesis, as I found traces in
>>
>>
>> apache-cloudstack-4.0.2-src/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHABase.java
>>
>> which "kills -9 qemu processes" if main storage is not found, but I was
>> not sure when the function was called.
>>
>> It's in the function checkingMountPoint, which calls destroyVMs if the
>> mount point is not found.
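>>
>> For illustration, a rough shell equivalent of that check (the mount
>> point path is a placeholder, and this is a sketch of the observed
>> behaviour, not the actual Java code path):
>>
>> ---
>> # if the primary storage mount point is gone, the HA code effectively
>> # does the equivalent of a kill -9 on every qemu process
>> if ! mountpoint -q /mnt/<primary-pool-uuid>; then
>>     kill -9 $(pidof qemu-kvm)    # what destroyVMs amounts to
>> fi
>> ---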
>>
>> Regards,
>>
>> ----- Original Message -----
>> > From: "Dean Kamali" <dean.kam...@gmail.com>
>> > To: users@cloudstack.apache.org
>> > Sent: Monday, July 8, 2013 16:34:04
>> > Subject: Re: outage feedback and questions
>> >
>> > The surviving VMs are all on the same KVM/GFS2 cluster.
>> > The SSVM is one of them. Messages on its console indicate it was
>> > temporarily in read-only mode
>> >
>> > Do you have an issue with storage?
>> >
>> > I wouldn't expect a failure in a switch to cause all of this; it
>> > would cause loss of network connectivity, but it shouldn't cause your
>> > VMs to go down.
>> >
>> > This behavior usually happens when you lose your primary storage.
>> >
>> >
>> >
>> >
>> > On Mon, Jul 8, 2013 at 8:39 AM, Laurent Steff
>> > <laurent.st...@inria.fr> wrote:
>> >
>> > > Hello,
>> > >
>> > > CloudStack is used in our company as a core component of a
>> > > "Continuous Integration" service.
>> > >
>> > > We are mainly happy with it, for a lot of reasons too long to
>> > > describe. :)
>> > >
>> > > We recently encountered a major service outage on CloudStack, mainly
>> > > linked to bad practices on our side. The aims of this post are:
>> > >
>> > > - ask questions about things we didn't understand yet
>> > > - gather some practical best practices we missed
>> > > - if the problems detected are still present in CloudStack 4.x, help
>> > > make CloudStack more robust with our feedback
>> > >
>> > > We know that the 3.x version is not supported and plan to move to the
>> > > 4.x version ASAP.
>> > >
>> > > It's quite a long mail, and it may be badly directed (dev mailing
>> > > list? multiple bugs?)
>> > >
>> > > Any response is appreciated ;)
>> > >
>> > > Regards,
>> > >
>> > >
>> > > -------------------- long part ----------------------------------------
>> > >
>> > > Architecture :
>> > > --------------
>> > >
>> > > Old, pre-Apache CloudStack 3.0.2 release
>> > > 1 Zone, 1 physical network, 1 pod
>> > > 1 Virtual Router VM, 1 SSVM
>> > > 4 CentOS 6.3 KVM clusters, GFS2 primary storage on iSCSI
>> > > Management Server on a VMware virtual machine
>> > >
>> > >
>> > >
>> > > Incidents :
>> > > -----------
>> > >
>> > > Day 1 : Management Server DoSed by internal synchronization scripts
>> > > (LDAP to CloudStack)
>> > > Day 3 : DoS corrected, Management Server RAM and CPU upgraded, and
>> > > rebooted (it had not been rebooted in more than a year). CloudStack
>> > > is running again normally (VM creation/stop/start/console/...)
>> > > Day 4 : (weekend) Network outage on a core datacenter switch. Network
>> > > unstable for 2 days.
>> > >
>> > > Symptoms :
>> > > ----------
>> > >
>> > > Day 7 : The network is operational, but most VMs (250 of 300) have
>> > > been down since Day 4.
>> > > The libvirt configurations were erased (/etc/libvirt.d/qemu/VMuid.xml).
>> > >
>> > > The VirtualRouter VM was one of them. Filesystem corruption prevented
>> > > it from rebooting normally.
>> > >
>> > > The surviving VMs are all on the same KVM/GFS2 cluster.
>> > > The SSVM is one of them. Messages on its console indicate it was
>> > > temporarily in read-only mode
>> > >
>> > > Hard way to revival (actions):
>> > > -----------------------------
>> > >
>> > > 1. VirtualRouter VM destroyed by an administrator, to let CloudStack
>> > > recreate it from the template.
>> > >
>> > > BUT :)
>> > >
>> > > the SystemVM KVM template is not available. Its status in the GUI is
>> > > "CONNECTION REFUSED".
>> > > The URL it was downloaded from during install is no longer valid (an
>> > > old, unavailable internal mirror server instead of
>> > > http://download.cloud.com)
>> > >
>> > > => we are unable to restart the stopped VMs or create new ones
>> > >
>> > > 2. Manual download of the template on the Management Server, as in a
>> > > fresh install
>> > >
>> > > ---
>> > > /usr/lib64/cloud/agent/scripts/storage/secondary/cloud-install-sys-tmplt \
>> > >   -m /mnt/secondary/ \
>> > >   -u http://ourworkingmirror/repository/cloudstack-downloads/acton-systemvm-02062012.qcow2.bz2 \
>> > >   -h kvm -F
>> > > ---
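>> > >
>> > > As a quick sanity check (the secondary-storage layout below is our
>> > > assumption about how the installer lays files out; adjust the path to
>> > > your mount point):
>> > >
>> > > ---
>> > > # the installer should leave a template.properties file next to the
>> > > # converted system VM image on secondary storage
>> > > find /mnt/secondary/template -name template.properties
>> > > ---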
>> > >
>> > > Even so, it's not sufficient: the MySQL table template_host_ref does
>> > > not change, even when changing the URL in the MySQL tables.
>> > > We still have "CONNECTION REFUSED" as the template status in MySQL and
>> > > in the GUI
>> > >
>> > > 3. After analysis, we needed to manually alter the MySQL tables (the
>> > > template_id of the systemVM KVM template was x):
>> > >
>> > > ---
>> > > update template_host_ref set download_state='DOWNLOADED' where
>> > > template_id=x;
>> > > update template_host_ref set job_id='NULL' where template_id=x; <= may
>> > > be useless
>> > > ---
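>> > >
>> > > To verify the change took, a query along these lines can help (the
>> > > 'cloud' database and user are the usual CloudStack defaults, the
>> > > column names are per the 3.0 schema as we read it, and x is the
>> > > template id as above):
>> > >
>> > > ---
>> > > mysql -u cloud -p -e "select template_id, download_state, \
>> > >   download_pct, error_str from cloud.template_host_ref \
>> > >   where template_id=x;"
>> > > ---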
>> > >
>> > > 4. As in MySQL, the status in the GUI is now DOWNLOADED
>> > >
>> > > 5. On power-on of a stopped VM, CloudStack builds a new VirtualRouter
>> > > VM, and we can let users manually start their stopped VMs
>> > >
>> > >
>> > > Questions :
>> > > -----------
>> > >
>> > > 1. What stopped and destroyed the libvirt domains of our VMs? There
>> > > is some code that could do this, but I'm not sure
>> > >
>> > > 2. Is it possible that CloudStack autonomously triggered the
>> > > re-download of the systemVM template? Or does it require human
>> > > interaction?
>> > >
>> > > 3. In 4.x, is the risk of a corrupted systemVM template, or one with
>> > > a bad status, still present? Is there any warning beyond a simple
>> > > "connection refused" that is not really visible as an alert?
>> > >
>> > > 4. Does CloudStack retry by default to restart VMs that should be
>> > > up, or do we need configuration for this?
>> > >
>> > >
>> > > -------------------- end of long part ----------------------------------------
>> > >
>> > >
>> > > --
>> > > Laurent Steff
>> > >
>> > > DSI/SESI
>> > > http://www.inria.fr/
>> > >
>> >
>>
>> --
>> Laurent Steff
>>
>> DSI/SESI
>> INRIA
>> Tel.: +33 1 39 63 50 81
>> Mobile: +33 6 87 66 77 85
>> http://www.inria.fr/
>>
>
>
