[Users] Wiki and Mailing Lists Outage -- 2012-11-14

2012-11-14 Thread Mike Burns
We experienced an outage today in both the wiki and the mailing lists.

* Wiki content was available throughout the outage, but attempts to
log in or edit returned an error message about requiring cookies to be
enabled.
* Mail sent to the mailing lists did not show up on the lists, and
senders did not receive rejection messages either.

Cause: 

The host running both of these services ran out of disk space.  A
temporary workaround was put in place to get both services up and
running again.


Action Taken:

* Removed the oldest Gerrit backup (600 MB).
* Removed some older non-functional ovirt-node-iso images and RPMs from
the releases (source remains there).

Long term solution:

Migrate these services away from a single host onto hosted solutions
(OpenShift, AlterWay).

Current Situation:

* The wiki is back up and running, and login works as expected.
* The lists are processing the backlog of emails that built up during
the outage.

At this time, it does not appear that any mail was lost due to the
outage.


Thanks for your patience and understanding.

Mike

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [Users] Wiki and Mailing Lists Outage -- 2012-11-14

2012-11-14 Thread Doron Fediuck
Thanks Mike!
I suggest setting up a cron job that alerts on low disk space.



Re: [Users] Wiki and Mailing Lists Outage -- 2012-11-14

2012-11-14 Thread Mike Burns
On Wed, 2012-11-14 at 08:45 -0500, Doron Fediuck wrote:
> Thanks Mike!
> I suggest setting up a cron job that alerts on low disk space.

We run logwatch which is supposed to highlight these issues, but I
suspect that no one is actually reading the logwatch report.  A separate
cron job or monitoring service is also a possibility.

Mike
 


Re: [Users] Wiki and Mailing Lists Outage -- 2012-11-14

2012-11-14 Thread Eyal Edri
We are running a disk space job on the Jenkins slaves:
http://jenkins.ovirt.org/view/system-monitoring/job/check_disk_space_on_jenkins_slaves

It runs a script [1]. I guess we can clone this job to check the other
infra servers as well.


[1]
#!/bin/sh
# Print usage per filesystem and fail if any filesystem is at 90% or more.
df -H | grep -vE '^Filesystem|tmpfs|cdrom|file.tlv|loop' | awk '{ print $5 " " $1 }' | while read output;
do
  echo "$output"
  # Usage percentage without the trailing '%'
  usep=$(echo "$output" | awk '{ print $1 }' | cut -d'%' -f1)
  partition=$(echo "$output" | awk '{ print $2 }')
  if [ "$usep" -ge 90 ]; then
    echo "Running out of space \"$partition ($usep%)\" on $(hostname) as on $(date)"
    exit 1
  fi
done
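
As a rough sketch of how the check in [1] could be reused on other infra
servers, the threshold test can be wrapped in a function that reads df
output on stdin. The function name check_usage, the use of "df -P"
(POSIX output, one line per filesystem) and the configurable threshold
are assumptions for illustration and not part of the Jenkins job; only
the 90% default comes from the script above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Jenkins job above).
# Reads `df -P` style output on stdin and prints one line for every
# filesystem whose usage is at or above the threshold (default: 90).
# Returns 1 if at least one filesystem is over the threshold, else 0.
check_usage() {
    awk -v limit="${1:-90}" '
        NR == 1 { next }                      # skip the header line
        $1 ~ /^(tmpfs|devtmpfs)$/ { next }    # ignore memory-backed filesystems
        {
            use = $5                          # capacity column, e.g. "95%"
            sub(/%/, "", use)
            if (use + 0 >= limit + 0) {
                printf "%s is %s%% full (mounted on %s)\n", $1, use, $6
                over = 1
            }
        }
        END { exit (over ? 1 : 0) }
    '
}
```

It could then be called as "df -P | check_usage 90" from a Jenkins job
or from cron. Since cron mails any output of a job to its owner (or to
the MAILTO address), a script that prints only when a filesystem is
over the threshold would send mail only when there is a problem.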
