Re: [ceph-users] [Jewel 10.2.11] OSD Segmentation fault
Jewel is almost EOL. It looks similar to several related issues, one of
which is http://tracker.ceph.com/issues/21826

On Mon, Aug 13, 2018 at 9:19 PM, Alexandru Cucu wrote:
> Hi,
>
> Already tried zapping the disk. Unfortunately the same segfaults keep
> me from adding the OSD back to the cluster.
>
> I wanted to open an issue on tracker.ceph.com but I can't find the
> "new issue" button.
>
> ---
> Alex Cucu
>
> [...]
Re: [ceph-users] [Jewel 10.2.11] OSD Segmentation fault
Hi,

Already tried zapping the disk. Unfortunately the same segfaults keep
me from adding the OSD back to the cluster.

I wanted to open an issue on tracker.ceph.com but I can't find the
"new issue" button.

---
Alex Cucu

On Mon, Aug 13, 2018 at 8:24 AM wrote:
>
> On 3 August 2018 at 12:03:17 CEST, Alexandru Cucu wrote:
> >Hello,
>
> Hello Alex,
>
> >Another OSD started randomly crashing with segmentation fault. Haven't
> >managed to add the last 3 OSDs back to the cluster as the daemons keep
> >crashing.
>
> An idea could be to remove the OSDs completely from the cluster and add
> them again after zapping the disks.
>
> Hth
> - Mehmet
>
> [...]
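For anyone following the thread, the "remove completely and re-add after
zapping" cycle Mehmet describes would, on a Jewel/FileStore cluster, look
roughly like the sketch below. The OSD id (154, taken from the crash log)
and the device path /dev/sdX are placeholders, and the service name
assumes a systemd-managed deployment:

    # stop the daemon and take the OSD out of data placement
    systemctl stop ceph-osd@154
    ceph osd out 154

    # remove it from the CRUSH map, the auth database and the OSD map
    ceph osd crush remove osd.154
    ceph auth del osd.154
    ceph osd rm 154

    # wipe the data disk and create a fresh OSD on it
    ceph-disk zap /dev/sdX
    ceph-disk prepare /dev/sdX
    ceph-disk activate /dev/sdX1   # data partition; udev may do this automatically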
Re: [ceph-users] [Jewel 10.2.11] OSD Segmentation fault
Hello,

Another OSD started randomly crashing with segmentation fault. Haven't
managed to add the last 3 OSDs back to the cluster as the daemons keep
crashing.

---
    -2> 2018-08-03 12:12:52.670076 7f12b6b15700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1533287572670073, "job": 3, "event": "table_file_deletion", "file_number": 4350}
    -1> 2018-08-03 12:12:53.146753 7f12c38d0a80  0 osd.154 89917 load_pgs
     0> 2018-08-03 12:12:57.526910 7f12c38d0a80 -1 *** Caught signal (Segmentation fault) **
 in thread 7f12c38d0a80 thread_name:ceph-osd

 ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
 1: (()+0x9f1c2a) [0x7f12c42ddc2a]
 2: (()+0xf5e0) [0x7f12c1dc85e0]
 3: (()+0x34484) [0x7f12c34a6484]
 4: (rocksdb::BlockBasedTable::NewIndexIterator(rocksdb::ReadOptions const&, rocksdb::BlockIter*, rocksdb::BlockBasedTable::CachableEntry*)+0x466) [0x7f12c41e40d6]
 5: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x297) [0x7f12c41e4b27]
 6: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*, bool, int)+0x2a4) [0x7f12c429ff94]
 7: (rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::Status*, rocksdb::MergeContext*, rocksdb::RangeDelAggregator*, bool*, bool*, unsigned long*)+0x810) [0x7f12c419bb80]
 8: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, bool*)+0x5a4) [0x7f12c424e494]
 9: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*)+0x19) [0x7f12c424ea19]
 10: (rocksdb::DB::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::string*)+0x95) [0x7f12c4252a45]
 11: (rocksdb::DB::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, std::string*)+0x4a) [0x7f12c4251eea]
 12: (RocksDBStore::get(std::string const&, std::string const&, ceph::buffer::list*)+0xff) [0x7f12c415c31f]
 13: (DBObjectMap::_lookup_map_header(DBObjectMap::MapHeaderLock const&, ghobject_t const&)+0x5e4) [0x7f12c4110814]
 14: (DBObjectMap::get_values(ghobject_t const&, std::set, std::allocator > const&, std::map, std::allocator > >*)+0x5f) [0x7f12c41f]
 15: (FileStore::omap_get_values(coll_t const&, ghobject_t const&, std::set, std::allocator > const&, std::map, std::allocator > >*)+0x197) [0x7f12c4031f77]
 16: (PG::_has_removal_flag(ObjectStore*, spg_t)+0x151) [0x7f12c3d8f7c1]
 17: (OSD::load_pgs()+0x5d5) [0x7f12c3cf43e5]
 18: (OSD::init()+0x2086) [0x7f12c3d07096]
 19: (main()+0x2c18) [0x7f12c3c1e088]
 20: (__libc_start_main()+0xf5) [0x7f12c0374c05]
 21: (()+0x3c8847) [0x7f12c3cb4847]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
---

Any help would be appreciated.

Thanks,
Alex Cucu

On Mon, Jul 30, 2018 at 4:55 PM Alexandru Cucu wrote:
> [...]
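To make a trace like this actionable, the frames need to be resolved against
debug symbols, as the NOTE at the end suggests. One way to get a fully
symbolized backtrace, assuming an RPM-based install with the yum-utils
debuginfo helper available and using osd.154 from the log as an example, is
to run the failing daemon in the foreground under gdb until it hits the fault:

    # install debug symbols matching the running 10.2.11 build
    debuginfo-install ceph-osd

    # start the crashing OSD in the foreground under gdb
    gdb --args /usr/bin/ceph-osd -f --cluster ceph --id 154 \
        --setuser ceph --setgroup ceph
    (gdb) run
    # ... wait for the SIGSEGV to be caught ...
    (gdb) bt full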
[ceph-users] [Jewel 10.2.11] OSD Segmentation fault
Hello Ceph users,

We have updated our cluster from 10.2.7 to 10.2.11. A few hours after
the update, 1 OSD crashed.
When trying to add the OSD back to the cluster, 2 other OSDs started
crashing with segmentation fault. Had to mark all 3 OSDs as down as we
had stuck PGs and blocked operations and the cluster status was
HEALTH_ERR.

We have tried various ways to re-add the OSDs to the cluster, but after
a while they start crashing and won't start anymore. After some time
they can be started again and marked in, but once some rebalancing has
happened they start crashing immediately after starting.

Here are some logs:
https://pastebin.com/nCRamgRU

Do you know of any existing bug report that might be related? (I
couldn't find anything.)

I will happily provide any information that would help solving this issue.

Thank you,
Alex Cucu
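For anyone who ends up in the same flapping situation: a common first step is
to stop the cluster from automatically marking the crashing OSDs out (and
rebalancing) on every crash while the problem is investigated. A rough sketch,
with the OSD id as a placeholder for the affected ones:

    # prevent down OSDs from being marked out automatically
    ceph osd set noout

    # inspect the stuck/blocked PGs and overall health
    ceph health detail
    ceph pg dump_stuck unclean

    # once an OSD is given up on, mark it out explicitly
    ceph osd out 154

    # remove the flag again when the cluster is stable
    ceph osd unset noout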