Re: [ceph-users] Weird cluster restart behavior

2015-03-31 Thread Gregory Farnum
On Tue, Mar 31, 2015 at 7:50 AM, Quentin Hartman
qhart...@direwolfdigital.com wrote:
 I'm working on redeploying a 14-node cluster. I'm running giant 0.87.1. Last
 Friday I got everything deployed and all was working well, and I set noout
 and shut all the OSD nodes down over the weekend. Yesterday when I spun it
 back up, the OSDs were behaving very strangely, incorrectly marking each
 other down because of missed heartbeats, even though they were up. It looked
 like some kind of low-level networking problem, but I couldn't find one.

 After much work, I narrowed the apparent source of the problem down to the
 OSDs running on the first host I started in the morning. They were the ones
 that logged the most messages about not being able to ping other OSDs, and
 the other OSDs were mostly complaining about them. After running out of
 other ideas to try, I restarted them, and then everything started working.
 It's still working happily this morning. It seems as though when that set of
 OSDs started, they got stale OSD map information from the MON boxes, which
 failed to be updated as the other OSDs came up. Does that make sense? I
 still don't consider myself an expert on Ceph architecture and would
 appreciate any corrections or other possible interpretations of events (I'm
 happy to provide whatever additional information I can) so I can get a
 deeper understanding of things. If my interpretation of events is correct,
 it seems that might point at a bug.
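
(The state Quentin describes above shows up in the standard status commands;
a quick sketch of what to look at, plus the kind of restart that resolved it.
The OSD id is only an example, and the restart syntax is the upstart form
used on Ubuntu around the giant release; use your init system's equivalent.)

    ceph -s              # overall health and how many OSDs are up/in
    ceph health detail   # which OSDs are currently marked down
    ceph osd tree        # per-OSD up/down state, grouped by host

    # restart a suspect OSD (upstart form; id 3 is only an example)
    sudo restart ceph-osd id=3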

I can't find the ticket now, but I think we did indeed have a bug
around heartbeat failures when restarting nodes. This has been fixed
in other branches but might have been missed for giant. (Did you by
any chance set the nodown flag as well as noout?)

In general Ceph isn't very happy about being shut down completely like
that, and its behavior in that scenario isn't well validated; nothing will
go seriously wrong, but you might run into little irritants like this. It's
particularly likely when you're prohibiting state changes with the
noout/nodown flags.
-Greg
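
(For reference, the flags Greg mentions are managed with the standard ceph
CLI; a quick sketch, noting that the exact wording of the flags line in the
output varies a little between releases:)

    ceph osd set noout          # don't mark stopped OSDs out (no rebalancing)
    ceph osd set nodown         # optionally: don't mark unresponsive OSDs down
    ceph osd dump | grep flags  # e.g. "flags noout,nodown" while they are set
    ceph osd unset nodown       # clear them once everything is back up
    ceph osd unset noout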
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird cluster restart behavior

2015-03-31 Thread Quentin Hartman
Thanks for the extra info, Gregory. I did not also set nodown.

I expect that I will be very rarely shutting everything down in the normal
course of things, but it has come up a couple of times when having to do
some physical re-organizing of racks. Little irritants like this aren't a
big deal if people know to expect them, but as it is I lost quite a lot of
time troubleshooting a non-existent problem. What's the best way to get
notes to that effect added to the docs? It seems something in
http://ceph.com/docs/master/rados/operations/operating/ would save some
people some headache. I'm happy to propose edits, but a quick look doesn't
reveal a process for submitting that sort of thing.

My understanding is that the right method to take an entire cluster
offline is to set noout and then shut everything down. Is there a
better way?

QH
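
(A rough sketch of the shutdown procedure Quentin is describing, assuming
upstart-managed services as on Ubuntu at the time; substitute your init
system's commands. The ordering is the important part.)

    ceph osd set noout        # prevent OSDs being marked out while they're down
    # on each OSD node:
    sudo stop ceph-osd-all
    # on each monitor node, last:
    sudo stop ceph-mon-all
    # ...then power the machines off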




Re: [ceph-users] Weird cluster restart behavior

2015-03-31 Thread Quentin Hartman
On Tue, Mar 31, 2015 at 2:05 PM, Gregory Farnum g...@gregs42.com wrote:

 Github pull requests. :)


Ah, well that's easy:

https://github.com/ceph/ceph/pull/4237


QH


Re: [ceph-users] Weird cluster restart behavior

2015-03-31 Thread Jeffrey Ollie
On Tue, Mar 31, 2015 at 3:05 PM, Gregory Farnum g...@gregs42.com wrote:

 On Tue, Mar 31, 2015 at 12:56 PM, Quentin Hartman
 qhart...@direwolfdigital.com wrote:

  My understanding is that the right method to take an entire cluster
  offline is to set noout and then shut everything down. Is there a
  better way?

 That's probably the best way to do it. Like I said, there was also a
 bug here that I think is fixed for Hammer but that might not have been
 backported to Giant. Unfortunately I don't remember the right keywords
 as I wasn't involved in the fix.

I'd hope that the complete shutdown scenario would get some more testing in
the future...  I know that Ceph is targeted more at enterprise situations
where things like generators and properly sized battery backups aren't
extravagant luxuries, but there are probably a lot of clusters out there
that will get shut down completely, planned or unplanned.

-- 
Jeff Ollie


Re: [ceph-users] Weird cluster restart behavior

2015-03-31 Thread Gregory Farnum
On Tue, Mar 31, 2015 at 12:56 PM, Quentin Hartman
qhart...@direwolfdigital.com wrote:
 Thanks for the extra info, Gregory. I did not also set nodown.

 I expect that I will be very rarely shutting everything down in the normal
 course of things, but it has come up a couple of times when having to do
 some physical re-organizing of racks. Little irritants like this aren't a
 big deal if people know to expect them, but as it is I lost quite a lot of
 time troubleshooting a non-existent problem. What's the best way to get
 notes to that effect added to the docs? It seems something in
 http://ceph.com/docs/master/rados/operations/operating/ would save some
 people some headache. I'm happy to propose edits, but a quick look doesn't
 reveal a process for submitting that sort of thing.

Github pull requests. :)
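
(Concretely, that workflow looks something like the sketch below. The branch
name is arbitrary, and the doc source path is an assumption inferred from the
docs URL Quentin mentioned; the SubmittingPatches file in the repo describes
the expected commit conventions.)

    # fork ceph/ceph on GitHub, then:
    git clone https://github.com/<your-username>/ceph.git
    cd ceph
    git checkout -b doc-cluster-shutdown
    # edit the page (assumed path, inferred from the docs URL):
    #   doc/rados/operations/operating.rst
    git commit -as -m "doc: note caveats when shutting a whole cluster down"
    git push origin doc-cluster-shutdown
    # then open a pull request against ceph/ceph in the GitHub UI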


 My understanding is that the right method to take an entire cluster
 offline is to set noout and then shut everything down. Is there a
 better way?

That's probably the best way to do it. Like I said, there was also a
bug here that I think is fixed for Hammer but that might not have been
backported to Giant. Unfortunately I don't remember the right keywords
as I wasn't involved in the fix.
-Greg
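
(And the startup side of the same procedure, under the same assumptions:
monitors first, then OSDs, then clear the flag once everything reports up.)

    # on each monitor node first:
    sudo start ceph-mon-all
    # then on each OSD node:
    sudo start ceph-osd-all
    ceph osd tree         # wait for all OSDs to report up
    ceph osd unset noout  # re-enable normal out-marking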



