Re: [ceph-users] Lost OSD - 1000: FAILED assert(r == 0)

2019-05-24 Thread Guillaume Chenuet
Hi,

Thanks for your answers.
I recreated the OSD and I'll monitor the disk health (currently OK).
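
For the monitoring part, a periodic smartctl check on the OSD host should be
enough for now. The sketch below is illustrative only: it assumes smartmontools
is installed on the host, and /dev/sdX, the log path, and the schedule are
placeholders.

  # /etc/cron.d/smart-osd-disk -- illustrative sketch, not from this thread.
  # Record SMART health and attribute counters for the new OSD's disk once a day.
  0 6 * * * root smartctl -H -A /dev/sdX >> /var/log/smart-osd-disk.log 2>&1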

Thanks a lot,
Guillaume

On Fri, 24 May 2019 at 15:56, Igor Fedotov wrote:

> Hi Guillaume,
>
> Could you please set debug-bluefs to 20, restart OSD and collect the whole
> log.
>
>
> Thanks,
>
> Igor

-- 
Guillaume Chenuet
DevOps Engineer Productivity


Re: [ceph-users] Lost OSD - 1000: FAILED assert(r == 0)

2019-05-24 Thread Igor Fedotov

Hi Guillaume,

Could you please set debug-bluefs to 20, restart the OSD, and collect the
whole log?
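
For a one-off start attempt, one option (just a sketch: it reuses the flags
from your startup log, the log file path is only an example, and you may need
to run it inside or instead of the Kolla container) is to add the debug option
on the command line:

  # Same invocation as in the startup log, plus BlueFS debug logging.
  # The daemon log (typically /var/log/ceph/ceph-osd.35.log) gets the same output.
  /usr/bin/ceph-osd -f -i 35 \
      --public-addr 10.106.142.30 --cluster-addr 10.106.142.30 \
      --debug-bluefs 20/20 2>&1 | tee /tmp/ceph-osd.35-bluefs.log

  # Or set it persistently in ceph.conf under [osd] before restarting:
  #   [osd]
  #   debug bluefs = 20/20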



Thanks,

Igor



Re: [ceph-users] Lost OSD - 1000: FAILED assert(r == 0)

2019-05-24 Thread Paul Emmerich
The disk got corrupted; it might be dead. Check the kernel log for I/O errors
and the SMART reallocated sector count and error counters.
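
For example (a sketch only: /dev/sdX stands for the data disk behind the
failed OSD, and smartctl comes from the smartmontools package on the host):

  # Kernel log: look for I/O or medium errors on the device.
  dmesg -T | grep -iE 'i/o error|medium error|sdX'
  # SMART: overall health verdict plus the raw attribute table
  # (watch Reallocated_Sector_Ct, Current_Pending_Sector, UDMA_CRC_Error_Count).
  smartctl -H /dev/sdX
  smartctl -A /dev/sdX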

If the disk is still good, simply re-create the OSD.
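
Re-creating it would look roughly like this on a Luminous BlueStore setup
(a sketch only: Kolla has its own tooling for OSD replacement, the id 35 is
taken from the log above, and /dev/sdX is a placeholder):

  # Sketch only -- adapt to the Kolla deployment and double-check IDs/devices.
  ceph osd out 35                            # drain it, if it is not already out
  # stop the old osd.35 container before wiping the disk
  ceph osd purge 35 --yes-i-really-mean-it   # remove it from CRUSH, the osdmap and auth
  ceph-volume lvm zap /dev/sdX --destroy     # wipe the old BlueStore/LVM metadata
  ceph-volume lvm create --bluestore --data /dev/sdX   # prepare and activate a fresh OSD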


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90




[ceph-users] Lost OSD - 1000: FAILED assert(r == 0)

2019-05-24 Thread Guillaume Chenuet
Hi,

We are running a Ceph cluster with 36 OSDs split across 3 servers (12 OSDs per
server), on Ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee)
luminous (stable).

This cluster backs an OpenStack private cloud and is deployed with
OpenStack Kolla. Each OSD runs in a Docker container on its server, and the
MON, MGR, MDS, and RGW daemons run on 3 other servers.

This week, one OSD crashed and failed to restart, with this stack trace:

 Running command: '/usr/bin/ceph-osd -f --public-addr 10.106.142.30
--cluster-addr 10.106.142.30 -i 35'
+ exec /usr/bin/ceph-osd -f --public-addr 10.106.142.30 --cluster-addr
10.106.142.30 -i 35
starting osd.35 at - osd_data /var/lib/ceph/osd/ceph-35
/var/lib/ceph/osd/ceph-35/journal
/builddir/build/BUILD/ceph-12.2.11/src/os/bluestore/BlueFS.cc: In function
'int BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*,
uint64_t, size_t, ceph::bufferlist*, char*)' thread 7efd088d6d80 time
2019-05-24 05:40:47.799918
/builddir/build/BUILD/ceph-12.2.11/src/os/bluestore/BlueFS.cc: 1000: FAILED
assert(r == 0)
 ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x110) [0x556f7833f8f0]
 2: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned
long, unsigned long, ceph::buffer::list*, char*)+0xca4) [0x556f782b5574]
 3: (BlueFS::_replay(bool)+0x2ef) [0x556f782c82af]
 4: (BlueFS::mount()+0x1d4) [0x556f782cc014]
 5: (BlueStore::_open_db(bool)+0x1847) [0x556f781e0ce7]
 6: (BlueStore::_mount(bool)+0x40e) [0x556f782126ae]
 7: (OSD::init()+0x3bd) [0x556f77dbbaed]
 8: (main()+0x2d07) [0x556f77cbe667]
 9: (__libc_start_main()+0xf5) [0x7efd04fa63d5]
 10: (()+0x4c1f73) [0x556f77d5ef73]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.
*** Caught signal (Aborted) **
 in thread 7efd088d6d80 thread_name:ceph-osd
 ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous
(stable)
 1: (()+0xa63931) [0x556f78300931]
 2: (()+0xf5d0) [0x7efd05f995d0]
 3: (gsignal()+0x37) [0x7efd04fba207]
 4: (abort()+0x148) [0x7efd04fbb8f8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x284) [0x556f7833fa64]
 6: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned
long, unsigned long, ceph::buffer::list*, char*)+0xca4) [0x556f782b5574]
 7: (BlueFS::_replay(bool)+0x2ef) [0x556f782c82af]
 8: (BlueFS::mount()+0x1d4) [0x556f782cc014]
 9: (BlueStore::_open_db(bool)+0x1847) [0x556f781e0ce7]
 10: (BlueStore::_mount(bool)+0x40e) [0x556f782126ae]
 11: (OSD::init()+0x3bd) [0x556f77dbbaed]
 12: (main()+0x2d07) [0x556f77cbe667]
 13: (__libc_start_main()+0xf5) [0x7efd04fa63d5]
 14: (()+0x4c1f73) [0x556f77d5ef73]

The cluster health is OK and Ceph reports this OSD as down.
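
The usual status commands show this, for example:

  ceph -s                        # overall cluster health
  ceph osd tree | grep osd.35    # osd.35 is listed as "down"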

I tried to find more information about this error online, without luck.
Do you have any idea or input about this error?

Thanks,
Guillaume
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com