Re: [ceph-users] Cascading Failure of OSDs

2015-04-09 Thread Carl-Johan Schenström
Francois Lafont wrote:

 Just in case it could be useful, I have noticed the -s option (on my
 Ubuntu) that offers an output that is probably easier to parse:
 
 # column -t is just to make it nicer for human eyes.
 ifconfig -s | column -t

Since ifconfig is deprecated, one should use iproute2 instead.

ip -s link show p2p1 | awk '/(RX|TX):/{getline; print $3;}'

However, the sysfs interface is probably a better alternative. See 
https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-class-net-statistics
 and https://www.kernel.org/doc/Documentation/ABI/README.
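
For example, something along these lines should work (an untested sketch;
p2p1 is just a placeholder interface name):

# print a handful of per-interface counters straight from sysfs
for stat in rx_bytes tx_bytes rx_errors tx_errors rx_dropped tx_dropped; do
    printf '%s: %s\n' "$stat" "$(cat /sys/class/net/p2p1/statistics/$stat)"
done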

-- 
Carl-Johan Schenström
Driftansvarig / System Administrator
Språkbanken & Svensk nationell datatjänst /
The Swedish Language Bank & Swedish National Data Service
Göteborgs universitet / University of Gothenburg
carl-johan.schenst...@gu.se / +46 709 116769
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-04-09 Thread HEWLETT, Paul (Paul)** CTR **

I use the following:

cat /sys/class/net/em1/statistics/rx_bytes

for the em1 interface.

All of the other statistics are available in the same directory.
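
If you want all of them at once, something like this does the trick (a quick
sketch, not tested here):

for f in /sys/class/net/em1/statistics/*; do
    echo "$(basename "$f"): $(cat "$f")"
done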

Paul Hewlett
Senior Systems Engineer
Velocix, Cambridge
Alcatel-Lucent
t: +44 1223 435893 m: +44 7985327353




From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Carl-Johan 
Schenström [carl-johan.schenst...@gu.se]
Sent: 09 April 2015 07:34
To: Francois Lafont; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cascading Failure of OSDs

Francois Lafont wrote:

 Just in case it could be useful, I have noticed the -s option (on my
 Ubuntu) that offers an output that is probably easier to parse:

 # column -t is just to make it nicer for human eyes.
 ifconfig -s | column -t

Since ifconfig is deprecated, one should use iproute2 instead.

ip -s link show p2p1 | awk '/(RX|TX):/{getline; print $3;}'

However, the sysfs interface is probably a better alternative. See 
https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-class-net-statistics
 and https://www.kernel.org/doc/Documentation/ABI/README.

--
Carl-Johan Schenström
Driftansvarig / System Administrator
Språkbanken & Svensk nationell datatjänst /
The Swedish Language Bank & Swedish National Data Service
Göteborgs universitet / University of Gothenburg
carl-johan.schenst...@gu.se / +46 709 116769
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-04-08 Thread Francois Lafont
Hi,

01/04/2015 17:28, Quentin Hartman wrote:

 Right now we're just scraping the output of ifconfig:
 
 ifconfig p2p1 | grep -e 'RX\|TX' | grep packets | awk '{print $3}'
 
 It's clunky, but it works. I'm sure there's a cleaner way, but this was
 expedient.
 
 QH

Ok, thx for the information Quentin.
Just in case it could be useful, I have noticed the -s option (on my
Ubuntu) that offers an output that is probably easier to parse:

# column -t is just to make it nicer for human eyes.
ifconfig -s | column -t
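
If you only care about the error counters, an awk filter on top of that
output is enough. A rough sketch (untested; it looks the columns up by their
header names, which should survive differences between net-tools versions):

ifconfig -s | awk 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i }
                   NR > 1  { print $1, $(col["RX-ERR"]), $(col["TX-ERR"]) }'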

Bye.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-04-01 Thread Quentin Hartman
Right now we're just scraping the output of ifconfig:

ifconfig p2p1 | grep -e 'RX\|TX' | grep packets | awk '{print $3}'

It's clunky, but it works. I'm sure there's a cleaner way, but this was
expedient.

QH


On Tue, Mar 31, 2015 at 5:05 PM, Francois Lafont flafdiv...@free.fr wrote:

 Hi,

 Quentin Hartman wrote:

  Since I have been in ceph-land today, it reminded me that I needed to
 close
  the loop on this. I was finally able to isolate this problem down to a
  faulty NIC on the ceph cluster network. It worked, but it was
  accumulating a huge number of Rx errors. My best guess is some receive
  buffer cache failed? Anyway, having a NIC go weird like that is totally
  consistent with all the weird problems I was seeing, the corrupted PGs,
 and
  the inability for the cluster to settle down.
 
  As a result we've added NIC error rates to our monitoring suite on the
  cluster so we'll hopefully see this coming if it ever happens again.

 Good for you. ;)

 Could you post here the command that you use to get NIC error rates?

 --
 François Lafont
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-03-31 Thread Francois Lafont
Hi,

Quentin Hartman wrote:

 Since I have been in ceph-land today, it reminded me that I needed to close
 the loop on this. I was finally able to isolate this problem down to a
 faulty NIC on the ceph cluster network. It worked, but it was
 accumulating a huge number of Rx errors. My best guess is some receive
 buffer cache failed? Anyway, having a NIC go weird like that is totally
 consistent with all the weird problems I was seeing, the corrupted PGs, and
 the inability for the cluster to settle down.
 
 As a result we've added NIC error rates to our monitoring suite on the
 cluster so we'll hopefully see this coming if it ever happens again.

Good for you. ;)

Could you post here the command that you use to get NIC error rates?

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-03-26 Thread Quentin Hartman
Since I have been in ceph-land today, it reminded me that I needed to close
the loop on this. I was finally able to isolate this problem down to a
faulty NIC on the ceph cluster network. It worked, but it was
accumulating a huge number of Rx errors. My best guess is some receive
buffer cache failed? Anyway, having a NIC go weird like that is totally
consistent with all the weird problems I was seeing, the corrupted PGs, and
the inability for the cluster to settle down.

As a result we've added NIC error rates to our monitoring suite on the
cluster so we'll hopefully see this coming if it ever happens again.
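
(Not our exact check, but the general shape of it is something like the
sketch below. The interface name and threshold are placeholders, and a real
check should compare against the previous sample rather than the absolute
counter value.)

IFACE=p2p1
THRESHOLD=100
for stat in rx_errors rx_dropped tx_errors tx_dropped; do
    val=$(cat /sys/class/net/$IFACE/statistics/$stat)
    if [ "$val" -gt "$THRESHOLD" ]; then
        echo "WARNING: $IFACE $stat=$val"
    fi
done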

QH

On Sat, Mar 7, 2015 at 11:36 AM, Quentin Hartman 
qhart...@direwolfdigital.com wrote:

 So I'm not sure what has changed, but in the last 30 minutes the errors
 which were all over the place, have finally settled down to this:
 http://pastebin.com/VuCKwLDp

 The only thing I can think of is that I also set the noscrub flag in
 addition to the nodeep-scrub when I first got here, and that finally
 took. Anyway, they've been stable there for some time now, and I've been
 able to get a couple VMs to come up and behave reasonably well. At this
 point I'm prepared to wipe the entire cluster and start over if I have to
 to get it truly consistent again, since my efforts to zap pg 3.75b haven't
 borne fruit. However, if anyone has a less nuclear option they'd like to
 suggest, I'm all ears.

 I've tried to export/re-import the pg and do a force_create. The import
 failed, and the force_create just reverted back to being incomplete after
 creating for a few minutes.

 QH

 On Sat, Mar 7, 2015 at 9:29 AM, Quentin Hartman 
 qhart...@direwolfdigital.com wrote:

 Now that I have a better understanding of what's happening, I threw
 together a little one-liner to create a report of the errors that the OSDs
 are seeing. Lots of missing  / corrupted pg shards:
 https://gist.github.com/qhartman/174cc567525060cb462e

 I've experimented with exporting / importing the broken pgs with
 ceph_objectstore_tool, and while they seem to export correctly, the tool
 crashes when trying to import:

 root@node12:/var/lib/ceph/osd# ceph_objectstore_tool --op import
 --data-path /var/lib/ceph/osd/ceph-19/ --journal-path
 /var/lib/ceph/osd/ceph-19/journal --file ~/3.75b.export
 Importing pgid 3.75b
 Write 2672075b/rbd_data.2bce2ae8944a.1509/head//3
 Write 3473075b/rbd_data.1d6172ae8944a.0001636a/head//3
 Write f2e4075b/rbd_data.c816f2ae8944a.0208/head//3
 Write f215075b/rbd_data.c4a892ae8944a.0b6b/head//3
 Write c086075b/rbd_data.42a742ae8944a.02fb/head//3
 Write 6f9d075b/rbd_data.1d6172ae8944a.5ac3/head//3
 Write dd9f075b/rbd_data.1d6172ae8944a.0001127d/head//3
 Write f9f075b/rbd_data.c4a892ae8944a.f056/head//3
 Write 4d71175b/rbd_data.c4a892ae8944a.9e51/head//3
 Write bcc3175b/rbd_data.2bce2ae8944a.133f/head//3
 Write 1356175b/rbd_data.3f862ae8944a.05d6/head//3
 Write d327175b/rbd_data.1d6172ae8944a.0001af85/head//3
 Write 7388175b/rbd_data.2bce2ae8944a.1353/head//3
 Write 8cda175b/rbd_data.c4a892ae8944a.b585/head//3
 Write 6b3c175b/rbd_data.c4a892ae8944a.00018e91/head//3
 Write d37f175b/rbd_data.1d6172ae8944a.3a90/head//3
 Write 4590275b/rbd_data.2bce2ae8944a.1f67/head//3
 Write fe51275b/rbd_data.c4a892ae8944a.e917/head//3
 Write 3402275b/rbd_data.3f5c2ae8944a.1252/6//3
 osd/SnapMapper.cc: In function 'void SnapMapper::add_oid(const hobject_t&,
 const std::set<snapid_t>&, MapCacher::Transaction<std::basic_string<char>,
 ceph::buffer::list>*)' thread 7fba67ff3900 time 2015-03-07 16:21:57.921820
 osd/SnapMapper.cc: 228: FAILED assert(r == -2)
  ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x8b) [0xb94fbb]
  2: (SnapMapper::add_oid(hobject_t const&, std::set<snapid_t,
 std::less<snapid_t>, std::allocator<snapid_t> > const&,
 MapCacher::Transaction<std::string, ceph::buffer::list>*)+0x63e) [0x7b719e]
  3: (get_attrs(ObjectStore*, coll_t, ghobject_t,
 ObjectStore::Transaction*, ceph::buffer::list&, OSDriver&,
 SnapMapper&)+0x67c) [0x661a1c]
  4: (get_object(ObjectStore*, coll_t, ceph::buffer::list&)+0x3e5)
 [0x661f85]
  5: (do_import(ObjectStore*, OSDSuperblock&)+0xd61) [0x665be1]
  6: (main()+0x2208) [0x63f178]
  7: (__libc_start_main()+0xf5) [0x7fba627b2ec5]
  8: ceph_objectstore_tool() [0x659577]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
 to interpret this.
 terminate called after throwing an instance of 'ceph::FailedAssertion'
 *** Caught signal (Aborted) **
  in thread 7fba67ff3900
  ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
  1: ceph_objectstore_tool() [0xab1cea]
  2: (()+0x10340) [0x7fba66a95340]
  3: (gsignal()+0x39) [0x7fba627c7cc9]
  4: (abort()+0x148) [0x7fba627cb0d8]
  5: 

Re: [ceph-users] Cascading Failure of OSDs

2015-03-07 Thread Quentin Hartman
Now that I have a better understanding of what's happening, I threw
together a little one-liner to create a report of the errors that the OSDs
are seeing. Lots of missing  / corrupted pg shards:
https://gist.github.com/qhartman/174cc567525060cb462e
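
(Roughly speaking it just boils the [ERR] lines out of the OSD logs --
something along these lines, assuming default log locations, though not the
exact command from the gist:)

grep -h '\[ERR\]' /var/log/ceph/ceph-osd.*.log \
    | sed 's/^.*\[ERR\] : //' | sort | uniq -c | sort -rn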

I've experimented with exporting / importing the broken pgs with
ceph_objectstore_tool, and while they seem to export correctly, the tool
crashes when trying to import:

root@node12:/var/lib/ceph/osd# ceph_objectstore_tool --op import
--data-path /var/lib/ceph/osd/ceph-19/ --journal-path
/var/lib/ceph/osd/ceph-19/journal --file ~/3.75b.export
Importing pgid 3.75b
Write 2672075b/rbd_data.2bce2ae8944a.1509/head//3
Write 3473075b/rbd_data.1d6172ae8944a.0001636a/head//3
Write f2e4075b/rbd_data.c816f2ae8944a.0208/head//3
Write f215075b/rbd_data.c4a892ae8944a.0b6b/head//3
Write c086075b/rbd_data.42a742ae8944a.02fb/head//3
Write 6f9d075b/rbd_data.1d6172ae8944a.5ac3/head//3
Write dd9f075b/rbd_data.1d6172ae8944a.0001127d/head//3
Write f9f075b/rbd_data.c4a892ae8944a.f056/head//3
Write 4d71175b/rbd_data.c4a892ae8944a.9e51/head//3
Write bcc3175b/rbd_data.2bce2ae8944a.133f/head//3
Write 1356175b/rbd_data.3f862ae8944a.05d6/head//3
Write d327175b/rbd_data.1d6172ae8944a.0001af85/head//3
Write 7388175b/rbd_data.2bce2ae8944a.1353/head//3
Write 8cda175b/rbd_data.c4a892ae8944a.b585/head//3
Write 6b3c175b/rbd_data.c4a892ae8944a.00018e91/head//3
Write d37f175b/rbd_data.1d6172ae8944a.3a90/head//3
Write 4590275b/rbd_data.2bce2ae8944a.1f67/head//3
Write fe51275b/rbd_data.c4a892ae8944a.e917/head//3
Write 3402275b/rbd_data.3f5c2ae8944a.1252/6//3
osd/SnapMapper.cc: In function 'void SnapMapper::add_oid(const hobject_t&,
const std::set<snapid_t>&, MapCacher::Transaction<std::basic_string<char>,
ceph::buffer::list>*)' thread 7fba67ff3900 time 2015-03-07 16:21:57.921820
osd/SnapMapper.cc: 228: FAILED assert(r == -2)
 ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x8b) [0xb94fbb]
 2: (SnapMapper::add_oid(hobject_t const&, std::set<snapid_t,
std::less<snapid_t>, std::allocator<snapid_t> > const&,
MapCacher::Transaction<std::string, ceph::buffer::list>*)+0x63e) [0x7b719e]
 3: (get_attrs(ObjectStore*, coll_t, ghobject_t, ObjectStore::Transaction*,
ceph::buffer::list&, OSDriver&, SnapMapper&)+0x67c) [0x661a1c]
 4: (get_object(ObjectStore*, coll_t, ceph::buffer::list&)+0x3e5) [0x661f85]
 5: (do_import(ObjectStore*, OSDSuperblock&)+0xd61) [0x665be1]
 6: (main()+0x2208) [0x63f178]
 7: (__libc_start_main()+0xf5) [0x7fba627b2ec5]
 8: ceph_objectstore_tool() [0x659577]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
 in thread 7fba67ff3900
 ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
 1: ceph_objectstore_tool() [0xab1cea]
 2: (()+0x10340) [0x7fba66a95340]
 3: (gsignal()+0x39) [0x7fba627c7cc9]
 4: (abort()+0x148) [0x7fba627cb0d8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fba630d26b5]
 6: (()+0x5e836) [0x7fba630d0836]
 7: (()+0x5e863) [0x7fba630d0863]
 8: (()+0x5eaa2) [0x7fba630d0aa2]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x278) [0xb951a8]
 10: (SnapMapper::add_oid(hobject_t const&, std::set<snapid_t,
std::less<snapid_t>, std::allocator<snapid_t> > const&,
MapCacher::Transaction<std::string, ceph::buffer::list>*)+0x63e) [0x7b719e]
 11: (get_attrs(ObjectStore*, coll_t, ghobject_t,
ObjectStore::Transaction*, ceph::buffer::list&, OSDriver&,
SnapMapper&)+0x67c) [0x661a1c]
 12: (get_object(ObjectStore*, coll_t, ceph::buffer::list&)+0x3e5)
[0x661f85]
 13: (do_import(ObjectStore*, OSDSuperblock&)+0xd61) [0x665be1]
 14: (main()+0x2208) [0x63f178]
 15: (__libc_start_main()+0xf5) [0x7fba627b2ec5]
 16: ceph_objectstore_tool() [0x659577]
Aborted (core dumped)


Which I suppose is expected if it's importing from bad pg data. At this
point I'm really most interested in what I can do to get this cluster
consistent as quickly as possible so I can start coping with the data loss
in the VMs and start restoring from backups where needed. Any guidance in
that direction would be appreciated. Something along the lines of give up
on that busted pg is what I'm thinking of, but I haven't noticed anything
that seems to approximate that yet.
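
(For reference, the "give up on the pg" path I keep circling around looks
roughly like the sequence below. This is only a sketch and it is destructive
-- the data in that pg is gone for good -- so please correct me if there is
a better way. Paths, OSD id and pgid are from my setup, and the service
commands are Upstart-style on Ubuntu.)

stop ceph-osd id=19
ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-19 \
    --journal-path /var/lib/ceph/osd/ceph-19/journal \
    --op remove --pgid 3.75b
start ceph-osd id=19
ceph pg force_create_pg 3.75b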

Thanks

QH




On Fri, Mar 6, 2015 at 8:47 PM, Quentin Hartman 
qhart...@direwolfdigital.com wrote:

 Here's more information I have been able to glean:

 pg 3.5d3 is stuck inactive for 917.471444, current state incomplete, last
 acting [24]
 pg 3.690 is stuck inactive for 11991.281739, current state incomplete,
 last acting [24]
 pg 4.ca is stuck inactive for 15905.499058, current state incomplete,
 last acting [24]
 pg 

Re: [ceph-users] Cascading Failure of OSDs

2015-03-07 Thread Quentin Hartman
So I'm not sure what has changed, but in the last 30 minutes the errors
which were all over the place, have finally settled down to this:
http://pastebin.com/VuCKwLDp

The only thing I can think of is that I also set the noscrub flag in
addition to the nodeep-scrub when I first got here, and that finally
took. Anyway, they've been stable there for some time now, and I've been
able to get a couple VMs to come up and behave reasonably well. At this
point I'm prepared to wipe the entire cluster and start over if I have to
to get it truly consistent again, since my efforts to zap pg 3.75b haven't
borne fruit. However, if anyone has a less nuclear option they'd like to
suggest, I'm all ears.
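
(For the record, the flags in question are just the cluster-wide scrub
toggles, listed here for anyone following along:)

ceph osd set noscrub
ceph osd set nodeep-scrub
# and once things are healthy again:
ceph osd unset noscrub
ceph osd unset nodeep-scrub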

I've tried to export/re-import the pg and do a force_create. The import
failed, and the force_create just reverted back to being incomplete after
creating for a few minutes.

QH

On Sat, Mar 7, 2015 at 9:29 AM, Quentin Hartman 
qhart...@direwolfdigital.com wrote:

 Now that I have a better understanding of what's happening, I threw
 together a little one-liner to create a report of the errors that the OSDs
 are seeing. Lots of missing  / corrupted pg shards:
 https://gist.github.com/qhartman/174cc567525060cb462e

 I've experimented with exporting / importing the broken pgs with
 ceph_objectstore_tool, and while they seem to export correctly, the tool
 crashes when trying to import:

 root@node12:/var/lib/ceph/osd# ceph_objectstore_tool --op import
 --data-path /var/lib/ceph/osd/ceph-19/ --journal-path
 /var/lib/ceph/osd/ceph-19/journal --file ~/3.75b.export
 Importing pgid 3.75b
 Write 2672075b/rbd_data.2bce2ae8944a.1509/head//3
 Write 3473075b/rbd_data.1d6172ae8944a.0001636a/head//3
 Write f2e4075b/rbd_data.c816f2ae8944a.0208/head//3
 Write f215075b/rbd_data.c4a892ae8944a.0b6b/head//3
 Write c086075b/rbd_data.42a742ae8944a.02fb/head//3
 Write 6f9d075b/rbd_data.1d6172ae8944a.5ac3/head//3
 Write dd9f075b/rbd_data.1d6172ae8944a.0001127d/head//3
 Write f9f075b/rbd_data.c4a892ae8944a.f056/head//3
 Write 4d71175b/rbd_data.c4a892ae8944a.9e51/head//3
 Write bcc3175b/rbd_data.2bce2ae8944a.133f/head//3
 Write 1356175b/rbd_data.3f862ae8944a.05d6/head//3
 Write d327175b/rbd_data.1d6172ae8944a.0001af85/head//3
 Write 7388175b/rbd_data.2bce2ae8944a.1353/head//3
 Write 8cda175b/rbd_data.c4a892ae8944a.b585/head//3
 Write 6b3c175b/rbd_data.c4a892ae8944a.00018e91/head//3
 Write d37f175b/rbd_data.1d6172ae8944a.3a90/head//3
 Write 4590275b/rbd_data.2bce2ae8944a.1f67/head//3
 Write fe51275b/rbd_data.c4a892ae8944a.e917/head//3
 Write 3402275b/rbd_data.3f5c2ae8944a.1252/6//3
 osd/SnapMapper.cc: In function 'void SnapMapper::add_oid(const hobject_t&,
 const std::set<snapid_t>&, MapCacher::Transaction<std::basic_string<char>,
 ceph::buffer::list>*)' thread 7fba67ff3900 time 2015-03-07 16:21:57.921820
 osd/SnapMapper.cc: 228: FAILED assert(r == -2)
  ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x8b) [0xb94fbb]
  2: (SnapMapper::add_oid(hobject_t const&, std::set<snapid_t,
 std::less<snapid_t>, std::allocator<snapid_t> > const&,
 MapCacher::Transaction<std::string, ceph::buffer::list>*)+0x63e) [0x7b719e]
  3: (get_attrs(ObjectStore*, coll_t, ghobject_t,
 ObjectStore::Transaction*, ceph::buffer::list&, OSDriver&,
 SnapMapper&)+0x67c) [0x661a1c]
  4: (get_object(ObjectStore*, coll_t, ceph::buffer::list&)+0x3e5)
 [0x661f85]
  5: (do_import(ObjectStore*, OSDSuperblock&)+0xd61) [0x665be1]
  6: (main()+0x2208) [0x63f178]
  7: (__libc_start_main()+0xf5) [0x7fba627b2ec5]
  8: ceph_objectstore_tool() [0x659577]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
 to interpret this.
 terminate called after throwing an instance of 'ceph::FailedAssertion'
 *** Caught signal (Aborted) **
  in thread 7fba67ff3900
  ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
  1: ceph_objectstore_tool() [0xab1cea]
  2: (()+0x10340) [0x7fba66a95340]
  3: (gsignal()+0x39) [0x7fba627c7cc9]
  4: (abort()+0x148) [0x7fba627cb0d8]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fba630d26b5]
  6: (()+0x5e836) [0x7fba630d0836]
  7: (()+0x5e863) [0x7fba630d0863]
  8: (()+0x5eaa2) [0x7fba630d0aa2]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x278) [0xb951a8]
  10: (SnapMapper::add_oid(hobject_t const&, std::set<snapid_t,
 std::less<snapid_t>, std::allocator<snapid_t> > const&,
 MapCacher::Transaction<std::string, ceph::buffer::list>*)+0x63e) [0x7b719e]
  11: (get_attrs(ObjectStore*, coll_t, ghobject_t,
 ObjectStore::Transaction*, ceph::buffer::list&, OSDriver&,
 SnapMapper&)+0x67c) [0x661a1c]
  12: (get_object(ObjectStore*, coll_t, ceph::buffer::list&)+0x3e5)
 [0x661f85]
  13: (do_import(ObjectStore*, OSDSuperblock&)+0xd61) [0x665be1]
  14: 

[ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
So I'm in the middle of trying to triage a problem with my ceph cluster
running 0.80.5. I have 24 OSDs spread across 8 machines. The cluster has
been running happily for about a year. This last weekend, something caused
the box running the MDS to seize hard, and when we came in on Monday,
several OSDs were down or unresponsive. I brought the MDS and the OSDs back
online, and managed to get things running again with minimal data loss.
Had to mark a few objects as lost, but things were apparently running fine
at the end of the day on Monday.

This afternoon, I noticed that one of the OSDs was apparently stuck in a
crash/restart loop, and the cluster was unhappy. Performance was in the
tank and ceph status is reporting all manner of problems, as one would
expect if an OSD is misbehaving. I marked the offending OSD out, and the
cluster started rebalancing as expected. However, I noticed a short while
later, another OSD has started into a crash/restart loop. So, I repeat the
process. And it happens again. At this point I notice, that there are
actually two at a time which are in this state.

It's as if there's some toxic chunk of data that is getting passed around,
and when it lands on an OSD it kills it. Contrary to that, however, I tried
just stopping an OSD when it's in a bad state, and once the cluster starts
to try rebalancing with that OSD down and not previously marked out,
another OSD will start crash-looping.

I've investigated the disk of the first OSD I found with this problem, and
it has no apparent corruption on the file system.

I'll follow up to this shortly with links to pastes of log snippets. Any
input would be appreciated. This is turning into a real cascade failure,
and I haven't any idea how to stop it.

QH
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Sage Weil
It looks like you may be able to work around the issue for the moment with

 ceph osd set nodeep-scrub

as it looks like it is scrub that is getting stuck?

sage


On Fri, 6 Mar 2015, Quentin Hartman wrote:

 Ceph health detail - http://pastebin.com/5URX9SsQ
 pg dump summary (with active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
 an osd crash log (in github gist because it was too big for pastebin) -
 https://gist.github.com/qhartman/cb0e290df373d284cfb5
 
 And now I've got four OSDs that are looping.
 
 On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman
 qhart...@direwolfdigital.com wrote:
   So I'm in the middle of trying to triage a problem with my ceph
   cluster running 0.80.5. I have 24 OSDs spread across 8 machines.
   The cluster has been running happily for about a year. This last
   weekend, something caused the box running the MDS to seize hard,
   and when we came in on Monday, several OSDs were down or
   unresponsive. I brought the MDS and the OSDs back online, and
   managed to get things running again with minimal data loss. Had
   to mark a few objects as lost, but things were apparently
   running fine at the end of the day on Monday.
 This afternoon, I noticed that one of the OSDs was apparently stuck in
 a crash/restart loop, and the cluster was unhappy. Performance was in
 the tank and ceph status is reporting all manner of problems, as one
 would expect if an OSD is misbehaving. I marked the offending OSD out,
 and the cluster started rebalancing as expected. However, I noticed a
 short while later, another OSD has started into a crash/restart loop.
 So, I repeat the process. And it happens again. At this point I
 notice, that there are actually two at a time which are in this state.
 
 It's as if there's some toxic chunk of data that is getting passed
 around, and when it lands on an OSD it kills it. Contrary to that,
 however, I tried just stopping an OSD when it's in a bad state, and
 once the cluster starts to try rebalancing with that OSD down and not
 previously marked out, another OSD will start crash-looping.
 
 I've investigated the disk of the first OSD I found with this problem,
 and it has no apparent corruption on the file system.
 
 I'll follow up to this shortly with links to pastes of log snippets.
 Any input would be appreciated. This is turning into a real cascade
 failure, and I haven't any idea how to stop it.
 
 QH
 
 
 
 ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
Alright, tried a few suggestions for repairing this state, but I don't seem
to have any PG replicas that have good copies of the missing / zero length
shards. What do I do now? telling the pg's to repair doesn't seem to help
anything? I can deal with data loss if I can figure out which images might
be damaged, I just need to get the cluster consistent enough that the
things that aren't damaged can be usable.
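
(By "telling the pg's to repair" I mean the usual commands, roughly the
following, with placeholders for the affected pg and OSD ids:)

ceph pg repair <pgid>
ceph pg deep-scrub <pgid>
ceph osd repair <osd-id>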

Also, I'm seeing these similar, but not quite identical, error messages as
well. I assume they are referring to the same root problem:

-1 2015-03-07 03:12:49.217295 7fc8ab343700  0 log [ERR] : 3.69d shard 22:
soid dd85669d/rbd_data.3f7a2ae8944a.19a5/7//3 size 0 != known
size 4194304
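
(Side note, in case it helps anyone in the same spot: the rbd_data prefix in
those object names can be matched back to an image by comparing it against
the block_name_prefix that "rbd info" reports. A rough sketch, with the pool
name taken from my setup:)

for img in $(rbd ls volumes); do
    rbd info volumes/"$img" | grep -q 'rbd_data.3f7a2ae8944a' && echo "$img"
done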



On Fri, Mar 6, 2015 at 7:54 PM, Quentin Hartman 
qhart...@direwolfdigital.com wrote:

 Finally found an error that seems to provide some direction:

 -1 2015-03-07 02:52:19.378808 7f175b1cf700  0 log [ERR] : scrub 3.18e
 e08a418e/rbd_data.3f7a2ae8944a.16c8/7//3 on disk size (0) does
 not match object info size (4120576) adjusted for ondisk to (4120576)

 I'm diving into google now and hoping for something useful. If anyone has
 a suggestion, I'm all ears!

 QH

 On Fri, Mar 6, 2015 at 6:26 PM, Quentin Hartman 
 qhart...@direwolfdigital.com wrote:

 Thanks for the suggestion, but that doesn't seem to have made a
 difference.

 I've shut the entire cluster down and brought it back up, and my config
 management system seems to have upgraded ceph to 0.80.8 during the reboot.
 Everything seems to have come back up, but I am still seeing the crash
 loops, so that seems to indicate that this is definitely something
 persistent, probably tied to the OSD data, rather than some weird transient
 state.


 On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil s...@newdream.net wrote:

 It looks like you may be able to work around the issue for the moment
 with

  ceph osd set nodeep-scrub

 as it looks like it is scrub that is getting stuck?

 sage


 On Fri, 6 Mar 2015, Quentin Hartman wrote:

  Ceph health detail - http://pastebin.com/5URX9SsQ
  pg dump summary (with active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
  an osd crash log (in github gist because it was too big for pastebin) -
  https://gist.github.com/qhartman/cb0e290df373d284cfb5
 
  And now I've got four OSDs that are looping.
 
  On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman
  qhart...@direwolfdigital.com wrote:
So I'm in the middle of trying to triage a problem with my ceph
cluster running 0.80.5. I have 24 OSDs spread across 8 machines.
The cluster has been running happily for about a year. This last
weekend, something caused the box running the MDS to seize hard,
and when we came in on Monday, several OSDs were down or
unresponsive. I brought the MDS and the OSDs back online, and
managed to get things running again with minimal data loss. Had
to mark a few objects as lost, but things were apparently
running fine at the end of the day on Monday.
  This afternoon, I noticed that one of the OSDs was apparently stuck in
  a crash/restart loop, and the cluster was unhappy. Performance was in
  the tank and ceph status is reporting all manner of problems, as one
  would expect if an OSD is misbehaving. I marked the offending OSD out,
  and the cluster started rebalancing as expected. However, I noticed a
  short while later, another OSD has started into a crash/restart loop.
  So, I repeat the process. And it happens again. At this point I
  notice, that there are actually two at a time which are in this state.
 
  It's as if there's some toxic chunk of data that is getting passed
  around, and when it lands on an OSD it kills it. Contrary to that,
  however, I tried just stopping an OSD when it's in a bad state, and
  once the cluster starts to try rebalancing with that OSD down and not
  previously marked out, another OSD will start crash-looping.
 
  I've investigated the disk of the first OSD I found with this problem,
  and it has no apparent corruption on the file system.
 
  I'll follow up to this shortly with links to pastes of log snippets.
  Any input would be appreciated. This is turning into a real cascade
  failure, and I haven't any idea how to stop it.
 
  QH
 
 
 
 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
Thanks for the response. Is this the post you are referring to?
http://ceph.com/community/incomplete-pgs-oh-my/

For what it's worth, this cluster was running happily for the better part
of a year until the event from this weekend that I described in my first
post, so I doubt it's configuration issue. I suppose it could be some
edge-casey thing, that only came up just now, but that seems unlikely. Our
usage of this cluster has been much heavier in the past than it has been
recently.

And yes, I have what looks to be about 8 pg shards on several OSDs that
seem to be in this state, but it's hard to say for sure. It seems like each
time I look at this more problems are popping up.

On Fri, Mar 6, 2015 at 8:19 PM, Gregory Farnum g...@gregs42.com wrote:

 This might be related to the backtrace assert, but that's the problem
 you need to focus on. In particular, both of these errors are caused
 by the scrub code, which Sage suggested temporarily disabling — if
 you're still getting these messages, you clearly haven't done so
 successfully.

 That said, it looks like the problem is that the object and/or object
 info specified here are just totally busted. You probably want to
 figure out what happened there since these errors are normally a
 misconfiguration somewhere (e.g., setting nobarrier on fs mount and
 then losing power). I'm not sure if there's a good way to repair the
 object, but if you can lose the data I'd grab the ceph-objectstore
 tool and remove the object from each OSD holding it that way. (There's
 a walkthrough of using it for a similar situation in a recent Ceph
 blog post.)

 On Fri, Mar 6, 2015 at 7:14 PM, Quentin Hartman
 qhart...@direwolfdigital.com wrote:
  Alright, tried a few suggestions for repairing this state, but I don't
 seem
  to have any PG replicas that have good copies of the missing / zero
 length
  shards. What do I do now? telling the pg's to repair doesn't seem to help
  anything? I can deal with data loss if I can figure out which images
 might
  be damaged, I just need to get the cluster consistent enough that the
 things
  which aren't damaged can be usable.
 
  Also, I'm seeing these similar, but not quite identical, error messages
 as
  well. I assume they are referring to the same root problem:
 
  -1 2015-03-07 03:12:49.217295 7fc8ab343700  0 log [ERR] : 3.69d shard
 22:
  soid dd85669d/rbd_data.3f7a2ae8944a.19a5/7//3 size 0 != known
  size 4194304

 Mmm, unfortunately that's a different object than the one referenced
 in the earlier crash. Maybe it's repairable, or it might be the same
 issue — looks like maybe you've got some widespread data loss.
 -Greg

 
 
 
  On Fri, Mar 6, 2015 at 7:54 PM, Quentin Hartman
  qhart...@direwolfdigital.com wrote:
 
  Finally found an error that seems to provide some direction:
 
  -1 2015-03-07 02:52:19.378808 7f175b1cf700  0 log [ERR] : scrub 3.18e
  e08a418e/rbd_data.3f7a2ae8944a.16c8/7//3 on disk size (0)
 does
 not match object info size (4120576) adjusted for ondisk to (4120576)
 
  I'm diving into google now and hoping for something useful. If anyone
 has
  a suggestion, I'm all ears!
 
  QH
 
  On Fri, Mar 6, 2015 at 6:26 PM, Quentin Hartman
  qhart...@direwolfdigital.com wrote:
 
  Thanks for the suggestion, but that doesn't seem to have made a
  difference.
 
  I've shut the entire cluster down and brought it back up, and my config
  management system seems to have upgraded ceph to 0.80.8 during the
 reboot.
  Everything seems to have come back up, but I am still seeing the crash
  loops, so that seems to indicate that this is definitely something
  persistent, probably tied to the OSD data, rather than some weird
 transient
  state.
 
 
  On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil s...@newdream.net wrote:
 
  It looks like you may be able to work around the issue for the moment
  with
 
   ceph osd set nodeep-scrub
 
  as it looks like it is scrub that is getting stuck?
 
  sage
 
 
  On Fri, 6 Mar 2015, Quentin Hartman wrote:
 
   Ceph health detail - http://pastebin.com/5URX9SsQ
   pg dump summary (with active+clean pgs removed) -
   http://pastebin.com/Y5ATvWDZ
   an osd crash log (in github gist because it was too big for
 pastebin)
   -
   https://gist.github.com/qhartman/cb0e290df373d284cfb5
  
   And now I've got four OSDs that are looping.
  
   On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman
   qhart...@direwolfdigital.com wrote:
 So I'm in the middle of trying to triage a problem with my
 ceph
 cluster running 0.80.5. I have 24 OSDs spread across 8
 machines.
 The cluster has been running happily for about a year. This
 last
 weekend, something caused the box running the MDS to seize
 hard,
 and when we came in on Monday, several OSDs were down or
 unresponsive. I brought the MDS and the OSDs back online,
 and
 managed to get things running again with minimal data loss.
 Had
 to mark a few objects as lost, but things 

Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
Ceph health detail - http://pastebin.com/5URX9SsQ
pg dump summary (with active+clean pgs removed) -
http://pastebin.com/Y5ATvWDZ
an osd crash log (in github gist because it was too big for pastebin) -
https://gist.github.com/qhartman/cb0e290df373d284cfb5

And now I've got four OSDs that are looping.

On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman 
qhart...@direwolfdigital.com wrote:

 So I'm in the middle of trying to triage a problem with my ceph cluster
 running 0.80.5. I have 24 OSDs spread across 8 machines. The cluster has
 been running happily for about a year. This last weekend, something caused
 the box running the MDS to seize hard, and when we came in on Monday,
 several OSDs were down or unresponsive. I brought the MDS and the OSDs back
 online, and managed to get things running again with minimal data loss.
 Had to mark a few objects as lost, but things were apparently running fine
 at the end of the day on Monday.

 This afternoon, I noticed that one of the OSDs was apparently stuck in a
 crash/restart loop, and the cluster was unhappy. Performance was in the
 tank and ceph status is reporting all manner of problems, as one would
 expect if an OSD is misbehaving. I marked the offending OSD out, and the
 cluster started rebalancing as expected. However, I noticed a short while
 later, another OSD has started into a crash/restart loop. So, I repeat the
 process. And it happens again. At this point I notice, that there are
 actually two at a time which are in this state.

 It's as if there's some toxic chunk of data that is getting passed around,
 and when it lands on an OSD it kills it. Contrary to that, however, I tried
 just stopping an OSD when it's in a bad state, and once the cluster starts
 to try rebalancing with that OSD down and not previously marked out,
 another OSD will start crash-looping.

 I've investigated the disk of the first OSD I found with this problem, and
 it has no apparent corruption on the file system.

 I'll follow up to this shortly with links to pastes of log snippets. Any
 input would be appreciated. This is turning into a real cascade failure,
 and I haven't any idea how to stop it.

 QH

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
Thanks for the suggestion, but that doesn't seem to have made a difference.

I've shut the entire cluster down and brought it back up, and my config
management system seems to have upgraded ceph to 0.80.8 during the reboot.
Everything seems to have come back up, but I am still seeing the crash
loops, so that seems to indicate that this is definitely something
persistent, probably tied to the OSD data, rather than some weird transient
state.


On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil s...@newdream.net wrote:

 It looks like you may be able to work around the issue for the moment with

  ceph osd set nodeep-scrub

 as it looks like it is scrub that is getting stuck?

 sage


 On Fri, 6 Mar 2015, Quentin Hartman wrote:

  Ceph health detail - http://pastebin.com/5URX9SsQ
  pg dump summary (with active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
  an osd crash log (in github gist because it was too big for pastebin) -
  https://gist.github.com/qhartman/cb0e290df373d284cfb5
 
  And now I've got four OSDs that are looping.
 
  On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman
  qhart...@direwolfdigital.com wrote:
So I'm in the middle of trying to triage a problem with my ceph
cluster running 0.80.5. I have 24 OSDs spread across 8 machines.
The cluster has been running happily for about a year. This last
weekend, something caused the box running the MDS to seize hard,
and when we came in on Monday, several OSDs were down or
unresponsive. I brought the MDS and the OSDs back online, and
managed to get things running again with minimal data loss. Had
to mark a few objects as lost, but things were apparently
running fine at the end of the day on Monday.
  This afternoon, I noticed that one of the OSDs was apparently stuck in
  a crash/restart loop, and the cluster was unhappy. Performance was in
  the tank and ceph status is reporting all manner of problems, as one
  would expect if an OSD is misbehaving. I marked the offending OSD out,
  and the cluster started rebalancing as expected. However, I noticed a
  short while later, another OSD has started into a crash/restart loop.
  So, I repeat the process. And it happens again. At this point I
  notice, that there are actually two at a time which are in this state.
 
  It's as if there's some toxic chunk of data that is getting passed
  around, and when it lands on an OSD it kills it. Contrary to that,
  however, I tried just stopping an OSD when it's in a bad state, and
  once the cluster starts to try rebalancing with that OSD down and not
  previously marked out, another OSD will start crash-looping.
 
  I've investigated the disk of the first OSD I found with this problem,
  and it has no apparent corruption on the file system.
 
  I'll follow up to this shortly with links to pastes of log snippets.
  Any input would be appreciated. This is turning into a real cascade
  failure, and I haven't any idea how to stop it.
 
  QH
 
 
 
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
Finally found an error that seems to provide some direction:

-1 2015-03-07 02:52:19.378808 7f175b1cf700  0 log [ERR] : scrub 3.18e
e08a418e/rbd_data.3f7a2ae8944a.16c8/7//3 on disk size (0) does
not match object info size (4120576) adjusted for ondisk to (4120576)
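
(One thing that is easy to check by hand is whether that object really is
zero bytes on disk. With the default FileStore layout it lives somewhere
under the pg directory on the acting OSDs -- a rough sketch, since the
on-disk names are escaped and hashed into subdirectories:)

find /var/lib/ceph/osd/ceph-*/current/3.18e_head/ \
    -name '*3f7a2ae8944a*16c8*' -exec ls -l {} \;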

I'm diving into google now and hoping for something useful. If anyone has a
suggestion, I'm all ears!

QH

On Fri, Mar 6, 2015 at 6:26 PM, Quentin Hartman 
qhart...@direwolfdigital.com wrote:

 Thanks for the suggestion, but that doesn't seem to have made a difference.

 I've shut the entire cluster down and brought it back up, and my config
 management system seems to have upgraded ceph to 0.80.8 during the reboot.
 Everything seems to have come back up, but I am still seeing the crash
 loops, so that seems to indicate that this is definitely something
 persistent, probably tied to the OSD data, rather than some weird transient
 state.


 On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil s...@newdream.net wrote:

 It looks like you may be able to work around the issue for the moment with

  ceph osd set nodeep-scrub

 as it looks like it is scrub that is getting stuck?

 sage


 On Fri, 6 Mar 2015, Quentin Hartman wrote:

  Ceph health detail - http://pastebin.com/5URX9SsQ
  pg dump summary (with active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
  an osd crash log (in github gist because it was too big for pastebin) -
  https://gist.github.com/qhartman/cb0e290df373d284cfb5
 
  And now I've got four OSDs that are looping.
 
  On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman
  qhart...@direwolfdigital.com wrote:
So I'm in the middle of trying to triage a problem with my ceph
cluster running 0.80.5. I have 24 OSDs spread across 8 machines.
The cluster has been running happily for about a year. This last
weekend, something caused the box running the MDS to seize hard,
and when we came in on Monday, several OSDs were down or
unresponsive. I brought the MDS and the OSDs back online, and
managed to get things running again with minimal data loss. Had
to mark a few objects as lost, but things were apparently
running fine at the end of the day on Monday.
  This afternoon, I noticed that one of the OSDs was apparently stuck in
  a crash/restart loop, and the cluster was unhappy. Performance was in
  the tank and ceph status is reporting all manner of problems, as one
  would expect if an OSD is misbehaving. I marked the offending OSD out,
  and the cluster started rebalancing as expected. However, I noticed a
  short while later, another OSD has started into a crash/restart loop.
  So, I repeat the process. And it happens again. At this point I
  notice, that there are actually two at a time which are in this state.
 
  It's as if there's some toxic chunk of data that is getting passed
  around, and when it lands on an OSD it kills it. Contrary to that,
  however, I tried just stopping an OSD when it's in a bad state, and
  once the cluster starts to try rebalancing with that OSD down and not
  previously marked out, another OSD will start crash-looping.
 
  I've investigated the disk of the first OSD I found with this problem,
  and it has no apparent corruption on the file system.
 
  I'll follow up to this shortly with links to pastes of log snippets.
  Any input would be appreciated. This is turning into a real cascade
  failure, and I haven't any idea how to stop it.
 
  QH
 
 
 
 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Gregory Farnum
This might be related to the backtrace assert, but that's the problem
you need to focus on. In particular, both of these errors are caused
by the scrub code, which Sage suggested temporarily disabling — if
you're still getting these messages, you clearly haven't done so
successfully.

That said, it looks like the problem is that the object and/or object
info specified here are just totally busted. You probably want to
figure out what happened there since these errors are normally a
misconfiguration somewhere (e.g., setting nobarrier on fs mount and
then losing power). I'm not sure if there's a good way to repair the
object, but if you can lose the data I'd grab the ceph-objectstore
tool and remove the object from each OSD holding it that way. (There's
a walkthrough of using it for a similar situation in a recent Ceph
blog post.)
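
(For reference, removal with the tool looks roughly like the sketch below --
the object-spec syntax varies a bit between versions, so check the tool's
help output first. Stop the OSD before touching its store, and repeat on
every OSD that holds a copy:)

ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-<N> \
    --journal-path /var/lib/ceph/osd/ceph-<N>/journal \
    --op list --pgid <pgid> | grep <rbd_data_prefix>
ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-<N> \
    --journal-path /var/lib/ceph/osd/ceph-<N>/journal \
    '<object-spec-from-list-output>' remove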

On Fri, Mar 6, 2015 at 7:14 PM, Quentin Hartman
qhart...@direwolfdigital.com wrote:
 Alright, tried a few suggestions for repairing this state, but I don't seem
 to have any PG replicas that have good copies of the missing / zero length
 shards. What do I do now? telling the pg's to repair doesn't seem to help
 anything? I can deal with data loss if I can figure out which images might
 be damaged, I just need to get the cluster consistent enough that the things
 which aren't damaged can be usable.

 Also, I'm seeing these similar, but not quite identical, error messages as
 well. I assume they are referring to the same root problem:

 -1 2015-03-07 03:12:49.217295 7fc8ab343700  0 log [ERR] : 3.69d shard 22:
 soid dd85669d/rbd_data.3f7a2ae8944a.19a5/7//3 size 0 != known
 size 4194304

Mmm, unfortunately that's a different object than the one referenced
in the earlier crash. Maybe it's repairable, or it might be the same
issue — looks like maybe you've got some widespread data loss.
-Greg




 On Fri, Mar 6, 2015 at 7:54 PM, Quentin Hartman
 qhart...@direwolfdigital.com wrote:

 Finally found an error that seems to provide some direction:

 -1 2015-03-07 02:52:19.378808 7f175b1cf700  0 log [ERR] : scrub 3.18e
 e08a418e/rbd_data.3f7a2ae8944a.16c8/7//3 on disk size (0) does
 not match object info size (4120576) adjusted for ondisk to (4120576)

 I'm diving into google now and hoping for something useful. If anyone has
 a suggestion, I'm all ears!

 QH

 On Fri, Mar 6, 2015 at 6:26 PM, Quentin Hartman
 qhart...@direwolfdigital.com wrote:

 Thanks for the suggestion, but that doesn't seem to have made a
 difference.

 I've shut the entire cluster down and brought it back up, and my config
 management system seems to have upgraded ceph to 0.80.8 during the reboot.
 Everything seems to have come back up, but I am still seeing the crash
 loops, so that seems to indicate that this is definitely something
 persistent, probably tied to the OSD data, rather than some weird transient
 state.


 On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil s...@newdream.net wrote:

 It looks like you may be able to work around the issue for the moment
 with

  ceph osd set nodeep-scrub

 as it looks like it is scrub that is getting stuck?

 sage


 On Fri, 6 Mar 2015, Quentin Hartman wrote:

  Ceph health detail - http://pastebin.com/5URX9SsQ
  pg dump summary (with active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
  an osd crash log (in github gist because it was too big for pastebin)
  -
  https://gist.github.com/qhartman/cb0e290df373d284cfb5
 
  And now I've got four OSDs that are looping.
 
  On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman
  qhart...@direwolfdigital.com wrote:
So I'm in the middle of trying to triage a problem with my ceph
cluster running 0.80.5. I have 24 OSDs spread across 8 machines.
The cluster has been running happily for about a year. This last
weekend, something caused the box running the MDS to seize hard,
and when we came in on Monday, several OSDs were down or
unresponsive. I brought the MDS and the OSDs back online, and
managed to get things running again with minimal data loss. Had
to mark a few objects as lost, but things were apparently
running fine at the end of the day on Monday.
  This afternoon, I noticed that one of the OSDs was apparently stuck in
  a crash/restart loop, and the cluster was unhappy. Performance was in
  the tank and ceph status is reporting all manner of problems, as one
  would expect if an OSD is misbehaving. I marked the offending OSD out,
  and the cluster started rebalancing as expected. However, I noticed a
  short while later, another OSD has started into a crash/restart loop.
  So, I repeat the process. And it happens again. At this point I
  notice, that there are actually two at a time which are in this state.
 
  It's as if there's some toxic chunk of data that is getting passed
  around, and when it lands on an OSD it kills it. Contrary to that,
  however, I tried just stopping an OSD when it's in a bad state, and
  once the cluster starts 

Re: [ceph-users] Cascading Failure of OSDs

2015-03-06 Thread Quentin Hartman
Here's more information I have been able to glean:

pg 3.5d3 is stuck inactive for 917.471444, current state incomplete, last
acting [24]
pg 3.690 is stuck inactive for 11991.281739, current state incomplete, last
acting [24]
pg 4.ca is stuck inactive for 15905.499058, current state incomplete, last
acting [24]
pg 3.5d3 is stuck unclean for 917.471550, current state incomplete, last
acting [24]
pg 3.690 is stuck unclean for 11991.281843, current state incomplete, last
acting [24]
pg 4.ca is stuck unclean for 15905.499162, current state incomplete, last
acting [24]
pg 3.19c is incomplete, acting [24] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 4.ca is incomplete, acting [24] (reducing pool images min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 5.7a is incomplete, acting [24] (reducing pool backups min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 5.6b is incomplete, acting [24] (reducing pool backups min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.6bf is incomplete, acting [24] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.690 is incomplete, acting [24] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.5d3 is incomplete, acting [24] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
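
(If the min_size hint from the output above turns out to be worth trying, it
is just a pool setting -- remember to put it back to 2 once the pgs have
peered, and repeat for the other pools named in the output:)

ceph osd pool set volumes min_size 1
# ...wait for the incomplete pgs to peer/recover, then:
ceph osd pool set volumes min_size 2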


However, that list of incomplete pgs keeps changing each time I run ceph
health detail | grep incomplete. For example, here is the output
regenerated moments after I created the above:

HEALTH_ERR 34 pgs incomplete; 2 pgs inconsistent; 37 pgs peering; 470 pgs
stale; 13 pgs stuck inactive; 13 pgs stuck unclean; 4 scrub errors; 1/24 in
osds are down; noout,nodeep-scrub flag(s) set
pg 3.da is stuck inactive for 7977.699449, current state incomplete, last
acting [19]
pg 3.1a4 is stuck inactive for 6364.787502, current state incomplete, last
acting [14]
pg 4.c4 is stuck inactive for 8759.642771, current state incomplete, last
acting [14]
pg 3.4fa is stuck inactive for 8173.078486, current state incomplete, last
acting [14]
pg 3.372 is stuck inactive for 6706.018758, current state incomplete, last
acting [14]
pg 3.4ca is stuck inactive for 7121.446109, current state incomplete, last
acting [14]
pg 0.6 is stuck inactive for 8759.591368, current state incomplete, last
acting [14]
pg 3.343 is stuck inactive for 7996.560271, current state incomplete, last
acting [14]
pg 3.453 is stuck inactive for 6420.686656, current state incomplete, last
acting [14]
pg 3.4c1 is stuck inactive for 7049.443221, current state incomplete, last
acting [14]
pg 3.80 is stuck inactive for 7587.105164, current state incomplete, last
acting [14]
pg 3.4a7 is stuck inactive for 5506.691333, current state incomplete, last
acting [14]
pg 3.5ce is stuck inactive for 7153.943506, current state incomplete, last
acting [14]
pg 3.da is stuck unclean for 11816.026865, current state incomplete, last
acting [19]
pg 3.1a4 is stuck unclean for 8759.633093, current state incomplete, last
acting [14]
pg 3.4fa is stuck unclean for 8759.658848, current state incomplete, last
acting [14]
pg 4.c4 is stuck unclean for 8759.642866, current state incomplete, last
acting [14]
pg 3.372 is stuck unclean for 8759.662338, current state incomplete, last
acting [14]
pg 3.4ca is stuck unclean for 8759.603350, current state incomplete, last
acting [14]
pg 0.6 is stuck unclean for 8759.591459, current state incomplete, last
acting [14]
pg 3.343 is stuck unclean for 8759.645236, current state incomplete, last
acting [14]
pg 3.453 is stuck unclean for 8759.643875, current state incomplete, last
acting [14]
pg 3.4c1 is stuck unclean for 8759.606092, current state incomplete, last
acting [14]
pg 3.80 is stuck unclean for 8759.644522, current state incomplete, last
acting [14]
pg 3.4a7 is stuck unclean for 12723.462164, current state incomplete, last
acting [14]
pg 3.5ce is stuck unclean for 10024.882545, current state incomplete, last
acting [14]
pg 3.1a4 is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 4.1a1 is incomplete, acting [14] (reducing pool images min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 4.138 is incomplete, acting [14] (reducing pool images min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.da is incomplete, acting [19] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 4.c4 is incomplete, acting [14] (reducing pool images min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 3.80 is incomplete, acting [14] (reducing pool volumes min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 4.70 is incomplete, acting [19] (reducing pool images min_size from 2
may help; search ceph.com/docs for 'incomplete')
pg 4.76 is incomplete, acting [19] (reducing pool images min_size from 2
may