[Users] Wiki and Mailing Lists Outage -- 2012-11-14
We experienced an outage today affecting both the wiki and the mailing lists.

* Wiki content was available throughout the outage, but attempts to log in or edit returned an error message saying that cookies must be enabled.
* Mail sent to the mailing lists never appeared on the lists, but no rejection messages were returned either.

Cause:
The host running both of these services ran out of disk space. A temporary workaround was put in place to get both services up and running again.

Action Taken:
* Removed the oldest gerrit backup (600MB)
* Removed some older, non-functional ovirt-node-iso images and RPMs from the releases (the source remains there)

Long-term solution:
Migrate these services away from a single host onto hosted solutions (OpenShift, AlterWay).

Current Situation:
* The wiki is back up and running, and login works as expected.
* The lists are processing the backlog of mail received since the outage began. At this time, it does not appear that any mail was lost due to the outage.

Thanks for your patience and understanding,
Mike

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
Re: [Users] Wiki and Mailing Lists Outage -- 2012-11-14
Thanks, Mike!
I suggest adding a cron job that alerts on low disk space.
Re: [Users] Wiki and Mailing Lists Outage -- 2012-11-14
On Wed, 2012-11-14 at 08:45 -0500, Doron Fediuck wrote:
> Thanks Mike! I suggest to have a cron alerting for no-space issues.

We run logwatch, which is supposed to highlight these issues, but I suspect that no one is actually reading the logwatch report. A separate cron job or monitoring service is also a possibility.

Mike
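Below is a minimal sketch of such a separate cron check. It assumes POSIX `df -P` output, and `check_disk_usage` is a hypothetical helper name, not something that exists on the oVirt hosts. The check reports every filesystem at or above the threshold rather than stopping at the first one. It returns non-zero if any filesystem was found, so the same function could drive either a cron alert or a Jenkins job.

```shell
#!/bin/sh
# check_disk_usage THRESHOLD  (hypothetical helper, shown for illustration)
# Reads `df -P` output on stdin and prints one warning line for every
# filesystem whose Capacity column is at or above THRESHOLD percent.
# Returns 1 if any filesystem was over the threshold, 0 otherwise.
check_disk_usage() {
    threshold=${1:-90}
    awk -v limit="$threshold" '
        # Skip the df header line.
        NR == 1 { next }
        # Ignore memory-backed tmpfs/devtmpfs mounts.
        $1 == "tmpfs" || $1 == "devtmpfs" { next }
        {
            use = $5
            sub(/%$/, "", use)          # strip the trailing percent sign
            if (use + 0 >= limit + 0) {
                printf "Running out of space: %s mounted on %s is %s%% full\n", $1, $6, use
                found = 1
            }
        }
        # The exit status tells the caller whether anything was over the limit.
        END { if (found) exit 1 }
    '
}
```

Usage: `df -P | check_disk_usage 90`.

Wrap this in a small script and run it hourly from cron, for example with a crontab entry like `0 * * * * /usr/local/bin/check-disk.sh` (a hypothetical path). On cron implementations that mail job output to the `MAILTO` address, such as cronie, any warning line then arrives as an email. The alert no longer depends on someone reading the logwatch report.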
Re: [Users] Wiki and Mailing Lists Outage -- 2012-11-14
We are running a disk-space job on the Jenkins slaves:
http://jenkins.ovirt.org/view/system-monitoring/job/check_disk_space_on_jenkins_slaves
It runs a script [1]. I guess we can clone it to check the other infra servers as well.

[1]
#!/bin/sh
# Emit "use% device" for every filesystem, skipping the df header and
# tmpfs, cdrom, loop and file.tlv entries.
df -H | grep -vE '^Filesystem|tmpfs|cdrom|file.tlv|loop' | awk '{ print $5 " " $1 }' | while read output; do
  echo $output
  # Strip the trailing "%" to get a plain integer, e.g. "93%" -> 93
  usep=$(echo $output | awk '{ print $1 }' | cut -d'%' -f1)
  partition=$(echo $output | awk '{ print $2 }')
  if [ $usep -ge 90 ]; then
    echo "Running out of space \"$partition ($usep%)\" on $(hostname) as on $(date)"
    # Stop at the first filesystem at or above 90%; since this loop ends the
    # script's last pipeline, the non-zero status becomes the script's exit status
    exit 1
  fi
done