Re: [ceph-users] Weird cluster restart behavior
On Tue, Mar 31, 2015 at 7:50 AM, Quentin Hartman qhart...@direwolfdigital.com wrote:
> I'm working on redeploying a 14-node cluster. I'm running giant 0.87.1.
> Last Friday I got everything deployed and all was working well, and I set
> noout and shut all the OSD nodes down over the weekend. Yesterday when I
> spun it back up, the OSDs were behaving very strangely, incorrectly
> marking each other down because of missed heartbeats even though they
> were up. It looked like some kind of low-level networking problem, but I
> couldn't find one.
>
> After much work, I narrowed the apparent source of the problem down to
> the OSDs running on the first host I started in the morning. They were
> the ones that logged the most messages about not being able to ping
> other OSDs, and the other OSDs were mostly complaining about them. After
> running out of other ideas to try, I restarted them, and then everything
> started working. It's still working happily this morning.
>
> It seems as though when that set of OSDs started they got stale OSD map
> information from the MON boxes, which then failed to be updated as the
> other OSDs came up. Does that make sense? I still don't consider myself
> an expert on ceph architecture and would appreciate any corrections or
> other possible interpretations of events (I'm happy to provide whatever
> additional information I can) so I can get a deeper understanding of
> things. If my interpretation of events is correct, it seems that might
> point at a bug.

I can't find the ticket now, but I think we did indeed have a bug around
heartbeat failures when restarting nodes. This has been fixed in other
branches but might have been missed for giant. (Did you by any chance set
the nodown flag as well as noout?)

In general Ceph isn't very happy with being shut down completely like
that and its behaviors there aren't validated, so nothing will go
seriously wrong but you might find little irritants like this. It's
particularly likely when you're prohibiting state changes with the
noout/nodown flags.
-Greg
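[For anyone checking the same thing: the flags Greg asks about are visible
in the osdmap. A minimal sketch of the relevant CLI; the commands are
standard ceph tooling, but the sample output line is illustrative only.]

    # Show which cluster-wide flags are currently set.
    ceph osd dump | grep flags
    # flags noout          <- illustrative output

    # noout only prevents down OSDs from being marked "out" (which would
    # trigger rebalancing); nodown additionally prevents peers' heartbeat
    # failure reports from marking OSDs "down" at all.
    ceph osd set nodown
    ceph osd unset nodown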
Re: [ceph-users] Weird cluster restart behavior
Thanks for the extra info, Gregory. I did not also set nodown. I expect
that I will very rarely be shutting everything down in the normal course
of things, but it has come up a couple of times when having to do some
physical reorganizing of racks.

Little irritants like this aren't a big deal if people know to expect
them, but as it is I lost quite a lot of time troubleshooting a
non-existent problem. What's the best way to get notes to that effect
added to the docs? It seems something in
http://ceph.com/docs/master/rados/operations/operating/ would save some
people some headache. I'm happy to propose edits, but a quick look
doesn't reveal a process for submitting that sort of thing.

My understanding is that the right method to take an entire cluster
offline is to set noout and then shut everything down. Is there a better
way?

QH

On Tue, Mar 31, 2015 at 1:35 PM, Gregory Farnum g...@gregs42.com wrote:
> (Did you by any chance set the nodown flag as well as noout?)
> [...]
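[In command form, the procedure Quentin describes looks roughly like the
following. This is a sketch assuming a giant-era package install; the
exact service invocation depends on your distro and init system.]

    # Before the planned outage: keep down OSDs from being marked out,
    # which would otherwise kick off rebalancing as hosts disappear.
    ceph osd set noout

    # Stop the Ceph daemons on each node, OSD hosts first, monitors last.
    # (sysvinit form shown; upstart and systemd installs differ.)
    sudo service ceph stop

    # After power-up, start monitors first, then OSD hosts. Once all OSDs
    # are up and in again, clear the flag and confirm health:
    ceph osd unset noout
    ceph -s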
Re: [ceph-users] Weird cluster restart behavior
On Tue, Mar 31, 2015 at 2:05 PM, Gregory Farnum g...@gregs42.com wrote:
> Github pull requests. :)

Ah, well that's easy: https://github.com/ceph/ceph/pull/4237

QH
Re: [ceph-users] Weird cluster restart behavior
On Tue, Mar 31, 2015 at 3:05 PM, Gregory Farnum g...@gregs42.com wrote:
> On Tue, Mar 31, 2015 at 12:56 PM, Quentin Hartman wrote:
>> My understanding is that the right method to take an entire cluster
>> offline is to set noout and then shut everything down. Is there a
>> better way?
>
> That's probably the best way to do it. Like I said, there was also a bug
> here that I think is fixed for Hammer but that might not have been
> backported to Giant. Unfortunately I don't remember the right keywords,
> as I wasn't involved in the fix.

I'd hope that the complete shutdown scenario would get some more testing
in the future... I know that Ceph is targeted more at enterprise
situations where things like generators and properly sized battery
backups aren't extravagant luxuries, but there are probably a lot of
clusters out there that will get shut down completely, planned or
unplanned.

--
Jeff Ollie
Re: [ceph-users] Weird cluster restart behavior
On Tue, Mar 31, 2015 at 12:56 PM, Quentin Hartman qhart...@direwolfdigital.com wrote:
> Thanks for the extra info, Gregory. I did not also set nodown. I expect
> that I will very rarely be shutting everything down in the normal course
> of things, but it has come up a couple of times when having to do some
> physical reorganizing of racks.
>
> Little irritants like this aren't a big deal if people know to expect
> them, but as it is I lost quite a lot of time troubleshooting a
> non-existent problem. What's the best way to get notes to that effect
> added to the docs? It seems something in
> http://ceph.com/docs/master/rados/operations/operating/ would save some
> people some headache. I'm happy to propose edits, but a quick look
> doesn't reveal a process for submitting that sort of thing.

Github pull requests. :)

> My understanding is that the right method to take an entire cluster
> offline is to set noout and then shut everything down. Is there a better
> way?

That's probably the best way to do it. Like I said, there was also a bug
here that I think is fixed for Hammer but that might not have been
backported to Giant. Unfortunately I don't remember the right keywords,
as I wasn't involved in the fix.
-Greg
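[A closing note for anyone who hits the same symptom: the stale-osdmap
theory from the first message can be tested directly rather than inferred
from behavior. A rough sketch; osd.3, the log path, and the upstart
restart form are placeholders for your own OSD IDs and init system.]

    # The cluster's current map epoch, reported as "osdmap eNNN".
    ceph osd stat

    # On the suspect host, ask the daemon itself which maps it holds; a
    # newest_map well behind the cluster epoch supports the stale-map theory.
    ceph daemon osd.3 status

    # Heartbeat complaints in the OSD log show who cannot ping whom.
    grep "heartbeat_check: no reply" /var/log/ceph/ceph-osd.3.log

    # Restarting the daemon forces a fresh map fetch from the monitors,
    # which is what cleared things up in this thread.
    sudo restart ceph-osd id=3    # upstart form; adjust for your init system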