Re: [ceph-users] OSD full - All RBD Volumes stopped responding

2014-04-11 Thread Josef Johansson
Hi, On 11/04/14 07:29, Wido den Hollander wrote: On 11 April 2014 at 7:13, Greg Poirier greg.poir...@opower.com wrote: One thing to note: all of our KVM VMs have to be rebooted. This is something I wasn't expecting. Tried waiting for them to recover on their own, but that's not

Re: [ceph-users] OSD full - All RBD Volumes stopped responding

2014-04-11 Thread Wido den Hollander
On 11 April 2014 at 8:50, Josef Johansson jo...@oderland.se wrote: Hi, On 11/04/14 07:29, Wido den Hollander wrote: On 11 April 2014 at 7:13, Greg Poirier greg.poir...@opower.com wrote: One thing to note: all of our KVM VMs have to be rebooted. This is something I

Re: [ceph-users] OSD full - All RBD Volumes stopped responding

2014-04-11 Thread Wido den Hollander
On 04/11/2014 09:23 AM, Josef Johansson wrote: On 11/04/14 09:07, Wido den Hollander wrote: On 11 April 2014 at 8:50, Josef Johansson jo...@oderland.se wrote: Hi, On 11/04/14 07:29, Wido den Hollander wrote: On 11 April 2014 at 7:13, Greg Poirier greg.poir...@opower.com wrote: One

Re: [ceph-users] OSD full - All RBD Volumes stopped responding

2014-04-11 Thread Greg Poirier
So... our storage problems persisted for about 45 minutes. I gave an entire hypervisor's worth of VMs time to recover (approx. 30 VMs), and none of them recovered on their own. In the end, we had to stop and start every VM (easily done, it was just alarming). Once rebooted, the VMs of course were
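Cycling a hypervisor's worth of guests like this can be scripted against libvirt; a minimal sketch, assuming the VMs are libvirt-managed KVM domains (virsh usage is an assumption, not described in the thread):

  # capture the running domains first, then hard-cycle each one;
  # 'virsh shutdown' is no use here since the guests are hung on I/O
  for vm in $(virsh list --name); do
      virsh destroy "$vm"   # hard power-off of the hung guest
      virsh start "$vm"
  done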

Re: [ceph-users] OSD full - All RBD Volumes stopped responding

2014-04-11 Thread Greg Poirier
So, setting pgp_num to 2048 to match pg_num had a more serious impact than I expected. The cluster is rebalancing quite substantially (8.5% of objects being rebalanced)... which makes sense... Disk utilization is evening out fairly well, which is encouraging. We are a little stumped as to why a
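For context, the placement-group split being described would have been issued with something like the following (the pool name 'rbd' is a stand-in; the thread does not name the pool):

  ceph osd pool set rbd pg_num 2048    # create the new placement groups
  ceph osd pool set rbd pgp_num 2048   # let data remap onto them; this starts the rebalance
  ceph -w                              # watch the recovery/backfill percentage fall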

Re: [ceph-users] OSD full - All RBD Volumes stopped responding

2014-04-11 Thread Wido den Hollander
On 04/11/2014 02:45 PM, Greg Poirier wrote: So... our storage problems persisted for about 45 minutes. I gave an entire hypervisor's worth of VMs time to recover (approx. 30 VMs), and none of them recovered on their own. In the end, we had to stop and start every VM (easily done, it was just

Re: [ceph-users] OSD full - All RBD Volumes stopped responding

2014-04-10 Thread Sage Weil
On Thu, 10 Apr 2014, Greg Poirier wrote: Hi, I have about 200 VMs with a common RBD volume as their root filesystem and a number of additional filesystems on Ceph. All of them have stopped responding. One of the OSDs in my cluster is marked full. I tried stopping that OSD to force things to
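Identifying the offending OSD is usually a matter of the following (osd.12 and the mount path below are illustrative, not taken from the thread):

  ceph health detail                  # reports e.g. 'osd.12 is full at 95%'
  ceph osd tree                       # map that OSD id to its host
  df -h /var/lib/ceph/osd/ceph-12     # confirm utilization on the host itself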

Re: [ceph-users] OSD full - All RBD Volumes stopped responding

2014-04-10 Thread Greg Poirier
Going to try increasing the full ratio. Disk utilization wasn't really growing at an unreasonable pace. I'm going to keep an eye on it for the next couple of hours and down/out the OSDs if necessary. We have four more machines that we're in the process of adding (which doubles the number of
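On Ceph releases of that era, the full threshold could be raised at runtime roughly as below (syntax from pre-Luminous releases; 0.97 is an illustrative value against the 0.95 default, and this only buys temporary headroom until the new capacity is in):

  ceph pg set_full_ratio 0.97
  # to persist across monitor restarts, set in ceph.conf:
  #   [mon]
  #   mon osd full ratio = .97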

Re: [ceph-users] OSD full - All RBD Volumes stopped responding

2014-04-10 Thread Greg Poirier
One thing to note: all of our KVM VMs have to be rebooted. This is something I wasn't expecting. Tried waiting for them to recover on their own, but that's not happening. Rebooting them restores service immediately. :/ Not ideal. On Thu, Apr 10, 2014 at 10:12 PM, Greg Poirier

Re: [ceph-users] OSD full - All RBD Volumes stopped responding

2014-04-10 Thread Wido den Hollander
On 11 April 2014 at 7:13, Greg Poirier greg.poir...@opower.com wrote: One thing to note: all of our KVM VMs have to be rebooted. This is something I wasn't expecting. Tried waiting for them to recover on their own, but that's not happening. Rebooting them restores service