Re: [ceph-users] Cascading Failure of OSDs

2015-04-09 Thread Carl-Johan Schenström
Francois Lafont wrote: Just in case it could be useful, I have noticed the -s option (on my Ubuntu) which offers output that is probably easier to parse: # column -t is just to make it nice for human eyes. ifconfig -s | column -t Since ifconfig is deprecated, one should use iproute2
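For reference, a rough iproute2 equivalent of the one-liner above (assuming a recent ip utility with the -s statistics flag, and reusing the interface name p2p1 from elsewhere in the thread) would be:

    # per-interface RX/TX packet and error counters for a single NIC
    ip -s link show dev p2p1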

Re: [ceph-users] Cascading Failure of OSDs

2015-04-09 Thread HEWLETT, Paul (Paul)** CTR **
-users-boun...@lists.ceph.com] on behalf of Carl-Johan Schenström [carl-johan.schenst...@gu.se] Sent: 09 April 2015 07:34 To: Francois Lafont; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Cascading Failure of OSDs Francois Lafont wrote: Just in case it could be useful, I have noticed the -s

Re: [ceph-users] Cascading Failure of OSDs

2015-04-08 Thread Francois Lafont
Hi, 01/04/2015 17:28, Quentin Hartman wrote: Right now we're just scraping the output of ifconfig: ifconfig p2p1 | grep -e 'RX\|TX' | grep packets | awk '{print $3}' It's clunky, but it works. I'm sure there's a cleaner way, but this was expedient. QH Ok, thx for the information

Re: [ceph-users] Cascading Failure of OSDs

2015-04-01 Thread Quentin Hartman
Right now we're just scraping the output of ifconfig: ifconfig p2p1 | grep -e 'RX\|TX' | grep packets | awk '{print $3}' It's clunky, but it works. I'm sure there's a cleaner way, but this was expedient. QH On Tue, Mar 31, 2015 at 5:05 PM, Francois Lafont flafdiv...@free.fr wrote: Hi,
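A cleaner alternative to scraping ifconfig, sketched under the assumption that the interface is still p2p1, is to read the kernel's per-interface counters straight from sysfs:

    # RX/TX packet counts without any text parsing
    cat /sys/class/net/p2p1/statistics/rx_packets
    cat /sys/class/net/p2p1/statistics/tx_packets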

Re: [ceph-users] Cascading Failure of OSDs

2015-03-31 Thread Francois Lafont
Hi, Quentin Hartman wrote: Since I have been in ceph-land today, it reminded me that I needed to close the loop on this. I was finally able to isolate this problem down to a faulty NIC on the ceph cluster network. It worked, but it was accumulating a huge number of Rx errors. My best guess

Re: [ceph-users] Cascading Failure of OSDs

2015-03-26 Thread Quentin Hartman
Since I have been in ceph-land today, it reminded me that I needed to close the loop on this. I was finally able to isolate this problem down to a faulty NIC on the ceph cluster network. It worked, but it was accumulating a huge number of Rx errors. My best guess is some receive buffer cache
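For anyone chasing a similar problem, a quick sketch of how such Rx errors can be spotted (assuming the cluster-network NIC is p2p1 and ethtool is available on the box):

    # kernel view of the accumulated receive errors
    cat /sys/class/net/p2p1/statistics/rx_errors
    # driver/firmware counters, usually more detailed
    ethtool -S p2p1 | grep -i err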

Re: [ceph-users] Cascading Failure of OSDs

2015-03-07 Thread Quentin Hartman
Now that I have a better understanding of what's happening, I threw together a little one-liner to create a report of the errors that the OSDs are seeing. Lots of missing / corrupted pg shards: https://gist.github.com/qhartman/174cc567525060cb462e I've experimented with exporting / importing the
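The export/import mentioned here is done with the objectstore tool while the OSD is stopped. A minimal sketch, with the PG ID and OSD numbers as placeholders taken from elsewhere in the thread (the binary name and flags have varied across releases, so verify against the local --help before running anything):

    # export a PG from a stopped OSD's filestore
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-24 \
        --journal-path /var/lib/ceph/osd/ceph-24/journal \
        --pgid 3.5d3 --op export --file /tmp/3.5d3.export
    # import it into another stopped OSD
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        --op import --file /tmp/3.5d3.export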

Re: [ceph-users] Cascading Failure of OSDs

2015-03-07 Thread Quentin Hartman
So I'm not sure what has changed, but in the last 30 minutes the errors, which were all over the place, have finally settled down to this: http://pastebin.com/VuCKwLDp The only thing I can think of is that I also set the noscrub flag in addition to the nodeep-scrub when I first got here, and that
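For completeness, the flags referred to here are set and cleared cluster-wide with:

    # disable scrubbing while debugging
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # re-enable once things are stable again
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub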

[ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
So I'm in the middle of trying to triage a problem with my ceph cluster running 0.80.5. I have 24 OSDs spread across 8 machines. The cluster has been running happily for about a year. This last weekend, something caused the box running the MDS to seize hard, and when we came in on Monday, several

Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Sage Weil
It looks like you may be able to work around the issue for the moment with ceph osd set nodeep-scrub as it looks like it is scrub that is getting stuck? sage On Fri, 6 Mar 2015, Quentin Hartman wrote: Ceph health detail - http://pastebin.com/5URX9SsQ pg dump summary (with active+clean pgs

Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
Alright, tried a few suggestions for repairing this state, but I don't seem to have any PG replicas that have good copies of the missing / zero-length shards. What do I do now? Telling the PGs to repair doesn't seem to help anything. I can deal with data loss if I can figure out which images
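One way to map damaged objects back to images, sketched under the assumption that the pool is named rbd: every rbd_data.prefix object carries its image's block_name_prefix, which rbd info reports per image, so the prefixes can be matched against the bad shards.

    # print each image's object prefix for comparison with the corrupted object names
    for img in $(rbd ls rbd); do
        printf '%s: ' "$img"
        rbd info rbd/"$img" | grep block_name_prefix
    done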

Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
Thanks for the response. Is this the post you are referring to? http://ceph.com/community/incomplete-pgs-oh-my/ For what it's worth, this cluster was running happily for the better part of a year until the event from this weekend that I described in my first post, so I doubt it's configuration

Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
Ceph health detail - http://pastebin.com/5URX9SsQ pg dump summary (with active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ an osd crash log (in github gist because it was too big for pastebin) - https://gist.github.com/qhartman/cb0e290df373d284cfb5 And now I've got four OSDs that are

Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
Thanks for the suggestion, but that doesn't seem to have made a difference. I've shut the entire cluster down and brought it back up, and my config management system seems to have upgraded ceph to 0.80.8 during the reboot. Everything seems to have come back up, but I am still seeing the crash

Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
Finally found an error that seems to provide some direction: -1 2015-03-07 02:52:19.378808 7f175b1cf700 0 log [ERR] : scrub 3.18e e08a418e/rbd_data.3f7a2ae8944a.16c8/7//3 on disk size (0) does not match object info size (4120576) adjusted for ondisk to (4120576) I'm diving into
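A rough way to locate that zero-length shard on disk, assuming a filestore OSD under the default /var/lib/ceph path (the OSD number is a placeholder; run it on each OSD in the PG's acting set):

    # look for zero-length object files from the reported rbd prefix in pg 3.18e
    find /var/lib/ceph/osd/ceph-24/current/3.18e_head -type f -size 0 -name '*3f7a2ae8944a*'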

Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Gregory Farnum
This might be related to the backtrace assert, but that's the problem you need to focus on. In particular, both of these errors are caused by the scrub code, which Sage suggested temporarily disabling — if you're still getting these messages, you clearly haven't done so successfully. That said,

Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
Here's more information I have been able to glean: pg 3.5d3 is stuck inactive for 917.471444, current state incomplete, last acting [24] pg 3.690 is stuck inactive for 11991.281739, current state incomplete, last acting [24] pg 4.ca is stuck inactive for 15905.499058, current state incomplete,
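To see why a PG is stuck incomplete (which OSDs it is probing and what is blocking peering), the standard starting point is a PG query, for example:

    # peering state, probing OSDs and blocked_by information for one of the stuck PGs
    ceph pg 3.5d3 query
    # summary of all problem PGs
    ceph health detail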