Re: [ceph-users] [Jewel 10.2.11] OSD Segmentation fault

2018-08-13 Thread Brad Hubbard
Jewel is almost EOL.

It looks similar to several related issues, one of which is
http://tracker.ceph.com/issues/21826
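
If a tracker issue does get filed, a fully symbol-resolved backtrace is
usually the first thing asked for. On CentOS/RHEL something along these
lines typically works (a rough sketch only; the debuginfo package name and
the core file path are examples, not values taken from this thread):

  # install debug symbols matching the running ceph version
  debuginfo-install ceph
  # open the core dump left by the crashing OSD (path is an example)
  gdb /usr/bin/ceph-osd /var/lib/systemd/coredump/core.ceph-osd.12345
  # inside gdb, capture backtraces from every thread to a file
  (gdb) set logging file osd-backtrace.txt
  (gdb) set logging on
  (gdb) thread apply all bt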

On Mon, Aug 13, 2018 at 9:19 PM, Alexandru Cucu  wrote:
> Hi,
>
> Already tried zapping the disk. Unfortunately, the same segfaults keep
> me from adding the OSD back to the cluster.
>
> I wanted to open an issue on tracker.ceph.com, but I can't find the
> "new issue" button.
>
> ---
> Alex Cucu
>
> On Mon, Aug 13, 2018 at 8:24 AM  wrote:
>>
>>
>>
>> On 3 August 2018 at 12:03:17 CEST, Alexandru Cucu wrote:
>> >Hello,
>> >
>>
>> Hello Alex,
>>
>> >Another OSD started randomly crashing with a segmentation fault. I haven't
>> >managed to add the last 3 OSDs back to the cluster, as the daemons keep
>> >crashing.
>> >
>>
>> An idea could be to remove the OSDs completely from the cluster and add them
>> again after zapping the disks.
>>
>> Hth
>> - Mehmet
>>
>> >---
>> >
>> >[snip - stack trace identical to the one in the 2018-08-03 message below]
>> >---
>> >
>> >Any help would be appreciated.
>> >
>> >Thanks,
>> >Alex Cucu
>> >
>> >On Mon, Jul 30, 2018 at 4:55 PM Alexandru Cucu  wrote:
>> >>
>> >> [snip - original 2018-07-30 message quoted in full further down the thread]
>> 

Re: [ceph-users] [Jewel 10.2.11] OSD Segmentation fault

2018-08-13 Thread Alexandru Cucu
Hi,

Already tried zapping the disk. Unfortunately, the same segfaults keep
me from adding the OSD back to the cluster.
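
(For the record, the remove/zap/re-create sequence is roughly the usual
Jewel-era procedure - osd.154 and /dev/sdX below are placeholders, not
necessarily the exact IDs and devices involved here:)

  ceph osd out osd.154              # stop new data from being mapped to it
  systemctl stop ceph-osd@154       # stop the daemon
  ceph osd crush remove osd.154     # drop it from the CRUSH map
  ceph auth del osd.154             # remove its cephx key
  ceph osd rm osd.154               # delete the OSD entry
  ceph-disk zap /dev/sdX            # wipe partition table and labels
  ceph-disk prepare /dev/sdX        # re-create a FileStore OSD (Jewel default)
  ceph-disk activate /dev/sdX1      # bring the new OSD up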

I wanted to open an issue on tracker.ceph.com, but I can't find the
"new issue" button.

---
Alex Cucu

On Mon, Aug 13, 2018 at 8:24 AM  wrote:
>
>
>
> On 3 August 2018 at 12:03:17 CEST, Alexandru Cucu wrote:
> >Hello,
> >
>
> Hello Alex,
>
> >Another OSD started randomly crashing with a segmentation fault. I haven't
> >managed to add the last 3 OSDs back to the cluster, as the daemons keep
> >crashing.
> >
>
> An idea could be to remove the OSDs completely from the cluster and add them
> again after zapping the disks.
>
> Hth
> - Mehmet
>
> >---
> >
> >[snip - stack trace identical to the one in the 2018-08-03 message below]
> >---
> >
> >Any help would be appreciated.
> >
> >Thanks,
> >Alex Cucu
> >
> >On Mon, Jul 30, 2018 at 4:55 PM Alexandru Cucu  wrote:
> >>
> >> [snip - original 2018-07-30 message quoted in full further down the thread]
> 

Re: [ceph-users] [Jewel 10.2.11] OSD Segmentation fault

2018-08-03 Thread Alexandru Cucu
Hello,

Another OSD started randomly crashing with a segmentation fault. I haven't
managed to add the last 3 OSDs back to the cluster, as the daemons keep
crashing.

---

-2> 2018-08-03 12:12:52.670076 7f12b6b15700  4 rocksdb:
EVENT_LOG_v1 {"time_micros": 1533287572670073, "job": 3, "event":
"table_file_deletion", "file_number": 4350}
-1> 2018-08-03 12:12:53.146753 7f12c38d0a80  0 osd.154 89917 load_pgs
 0> 2018-08-03 12:12:57.526910 7f12c38d0a80 -1 *** Caught signal
(Segmentation fault) **
 in thread 7f12c38d0a80 thread_name:ceph-osd
 ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
 1: (()+0x9f1c2a) [0x7f12c42ddc2a]
 2: (()+0xf5e0) [0x7f12c1dc85e0]
 3: (()+0x34484) [0x7f12c34a6484]
 4: (rocksdb::BlockBasedTable::NewIndexIterator(rocksdb::ReadOptions
const&, rocksdb::BlockIter*,
rocksdb::BlockBasedTable::CachableEntry*)+0x466)
[0x7f12c41e40d6]
 5: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&,
rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x297)
[0x7f12c41e4b27]
 6: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&,
rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&,
rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*,
bool, int)+0x2a4) [0x7f12c429ff94]
 7: (rocksdb::Version::Get(rocksdb::ReadOptions const&,
rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::Status*,
rocksdb::MergeContext*, rocksdb::RangeDelAggregator*, bool*, bool*,
unsigned long*)+0x810) [0x7f12c419bb80]
 8: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&,
rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&,
rocksdb::PinnableSlice*, bool*)+0x5a4) [0x7f12c424e494]
 9: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&,
rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&,
rocksdb::PinnableSlice*)+0x19) [0x7f12c424ea19]
 10: (rocksdb::DB::Get(rocksdb::ReadOptions const&,
rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&,
std::string*)+0x95) [0x7f12c4252a45]
 11: (rocksdb::DB::Get(rocksdb::ReadOptions const&, rocksdb::Slice
const&, std::string*)+0x4a) [0x7f12c4251eea]
 12: (RocksDBStore::get(std::string const&, std::string const&,
ceph::buffer::list*)+0xff) [0x7f12c415c31f]
 13: (DBObjectMap::_lookup_map_header(DBObjectMap::MapHeaderLock
const&, ghobject_t const&)+0x5e4) [0x7f12c4110814]
 14: (DBObjectMap::get_values(ghobject_t const&, std::set<std::string,
std::less<std::string>, std::allocator<std::string> > const&,
std::map<std::string, ceph::buffer::list, std::less<std::string>,
std::allocator<std::pair<std::string const, ceph::buffer::list> > >*)+0x5f)
[0x7f12c41f]
 15: (FileStore::omap_get_values(coll_t const&, ghobject_t const&,
std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&,
std::map<std::string, ceph::buffer::list, std::less<std::string>,
std::allocator<std::pair<std::string const, ceph::buffer::list> > >*)+0x197) [0x7f12c4031f77]
 16: (PG::_has_removal_flag(ObjectStore*, spg_t)+0x151) [0x7f12c3d8f7c1]
 17: (OSD::load_pgs()+0x5d5) [0x7f12c3cf43e5]
 18: (OSD::init()+0x2086) [0x7f12c3d07096]
 19: (main()+0x2c18) [0x7f12c3c1e088]
 20: (__libc_start_main()+0xf5) [0x7f12c0374c05]
 21: (()+0x3c8847) [0x7f12c3cb4847]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
---
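
(To act on that note: with the matching ceph-osd binary and its debug
symbols installed, the annotated disassembly and the source line behind an
in-binary offset can be pulled out roughly like this - the path and the
example offset are assumptions, not verified against this build:)

  # annotated disassembly; ceph-debuginfo gives source interleaving
  objdump -rdS /usr/bin/ceph-osd > ceph-osd.dis
  # map an offset such as "(()+0x9f1c2a)" to a function and source line,
  # if that frame belongs to the ceph-osd binary itself
  addr2line -Cfe /usr/bin/ceph-osd 0x9f1c2a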

Any help would be appreciated.

Thanks,
Alex Cucu

On Mon, Jul 30, 2018 at 4:55 PM Alexandru Cucu  wrote:
>
> Hello Ceph users,
>
> We have updated our cluster from 10.2.7 to 10.2.11. A few hours after
> the update, one OSD crashed.
> When trying to add that OSD back to the cluster, two other OSDs started
> crashing with segmentation faults. We had to mark all 3 OSDs down, as we
> had stuck PGs and blocked operations and the cluster status was
> HEALTH_ERR.
>
> We have tried various ways to re-add the OSDs to the cluster, but
> after a while they start crashing and won't start anymore. After some
> time they can be started again and marked in, but after some
> rebalancing they begin crashing immediately after starting.
>
> Here are some logs:
> https://pastebin.com/nCRamgRU
>
> Do you know of any existing bug report that might be related? (I
> couldn't find anything).
>
> I will happily provide any information that would help solve this issue.
>
> Thank you,
> Alex Cucu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [Jewel 10.2.11] OSD Segmentation fault

2018-07-30 Thread Alexandru Cucu
Hello Ceph users,

We have updated our cluster from 10.2.7 to 10.2.11. A few hours after
the update, one OSD crashed.
When trying to add that OSD back to the cluster, two other OSDs started
crashing with segmentation faults. We had to mark all 3 OSDs down, as we
had stuck PGs and blocked operations and the cluster status was
HEALTH_ERR.
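
(For anyone in a similar spot: taking a crashing OSD out of service without
the cluster immediately rebalancing around it looks roughly like the
following - osd.154 is a placeholder, not necessarily one of our OSD IDs:)

  ceph osd set noout            # keep down OSDs from being marked out
  systemctl stop ceph-osd@154   # stop the crashing daemon
  ceph osd down osd.154         # mark it down explicitly if it still shows up
  # ... investigate / attempt restarts ...
  ceph osd unset noout          # re-enable normal recovery afterwards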

We have tried various ways to re-add the OSDs to the cluster, but
after a while they start crashing and won't start anymore. After some
time they can be started again and marked in, but after some
rebalancing they begin crashing immediately after starting.

Here are some logs:
https://pastebin.com/nCRamgRU
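
(If more detail is needed for a tracker report, higher OSD debug levels
around the crash are usually requested - the values below are the commonly
suggested ones, not something specific to this issue:)

  # in ceph.conf under [osd], or injected at runtime before restarting:
  debug osd = 20
  debug filestore = 20
  debug ms = 1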

Do you know of any existing bug report that might be related? (I
couldn't find anything).

I will happily provide any information that would help solve this issue.

Thank you,
Alex Cucu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com