Re: [ceph-users] Deleting large pools

2017-11-18 Thread Gregory Farnum
On Wed, Nov 15, 2017 at 6:50 AM David Turner  wrote:

> 2 weeks later and things are still deleting, but getting really close to
> being done.  I tried to use ceph-objectstore-tool to remove one of the
> PGs.  I only tested on 1 PG on 1 OSD, but it's doing something really
> weird.  While it was running, my connection to the DC reset and the command
> died.  Now when I try to run the tool it segfaults and just running the OSD
> it doesn't try to delete the data.  The data in this PG does not matter and
> I figure the worst case scenario is that it just sits there taking up 200GB
> until I redeploy the OSD.
>
> However, I like to learn things about Ceph.  Is there anyone with any
> insight to what is happening with this PG?
>

Well, this isn't supposed to happen, but backtraces like that generally
mean the PG is trying to load an OSDMap that has already been trimmed.

If I were to guess, in this case enough of the PG metadata got cleaned up
that the OSD no longer knows it's there, and it removed the maps. But
trying to remove the PG is pulling them in.
Or, alternatively, there's an issue with removing PGs that have lost their
metadata and it's trying to pull in map epoch 0 or something...
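(If you want to see what the tool still thinks is on that disk before redeploying, a
couple of read-only probes, reusing the paths/pgid from the command below; the OSD
has to be stopped, and --op info may well trip over the same missing-map problem:)

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
      --journal-path /var/lib/ceph/osd/ceph-0/journal --op list-pgs
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
      --journal-path /var/lib/ceph/osd/ceph-0/journal --pgid 97.314s0 --op info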
I'd stick a bug in the tracker in case it comes up in the future or
somebody takes a fancy to it. :)
-Greg


>
> [root@osd1 ~] # ceph-objectstore-tool --data-path
> /var/lib/ceph/osd/ceph-0 --journal-path /var/lib/ceph/osd/ceph-0/journal
> --pgid 97.314s0 --op remove
> SG_IO: questionable sense data, results may be incorrect
> SG_IO: questionable sense data, results may be incorrect
>  marking collection for removal
> mark_pg_for_removal warning: peek_map_epoch reported error
> terminate called after throwing an instance of
> 'ceph::buffer::end_of_buffer'
>   what():  buffer::end_of_buffer
> *** Caught signal (Aborted) **
>  in thread 7f98ab2dc980 thread_name:ceph-objectstor
>  ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>  1: (()+0x95209a) [0x7f98abc4b09a]
>  2: (()+0xf100) [0x7f98a91d7100]
>  3: (gsignal()+0x37) [0x7f98a7d825f7]
>  4: (abort()+0x148) [0x7f98a7d83ce8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f98a86879d5]
>  6: (()+0x5e946) [0x7f98a8685946]
>  7: (()+0x5e973) [0x7f98a8685973]
>  8: (()+0x5eb93) [0x7f98a8685b93]
>  9: (ceph::buffer::list::iterator_impl::copy(unsigned int,
> char*)+0xa5) [0x7f98abd498a5]
>  10: (PG::read_info(ObjectStore*, spg_t, coll_t const&,
> ceph::buffer::list&, pg_info_t&, std::map std::less, std::allocator > >&, unsigned char&)+0x324) [0x7f98ab6d3094]
>  11: (mark_pg_for_removal(ObjectStore*, spg_t,
> ObjectStore::Transaction*)+0x87c) [0x7f98ab66615c]
>  12: (initiate_new_remove_pg(ObjectStore*, spg_t,
> ObjectStore::Sequencer&)+0x131) [0x7f98ab666a51]
>  13: (main()+0x39b7) [0x7f98ab610437]
>  14: (__libc_start_main()+0xf5) [0x7f98a7d6eb15]
>  15: (()+0x363a57) [0x7f98ab65ca57]
> Aborted
>
> On Thu, Nov 2, 2017 at 12:45 PM Gregory Farnum  wrote:
>
>> Deletion is throttled, though I don’t know the configs to change it; you
>> could poke around if you want stuff to go faster.
>>
>> Don’t just remove the directory in the filesystem; you need to clean up
>> the leveldb metadata as well. ;)
>> Removing the PG via ceph-objectstore-tool would work fine, but I’ve seen
>> too many people kill the wrong thing to recommend it.
>> -Greg
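(Regarding the throttling Greg mentions: the exact option names vary between releases,
so rather than guessing, list what your OSDs actually expose and inject a change at
runtime; the option in the second command is a placeholder, not a real knob:)

  ceph daemon osd.0 config show | egrep -i 'remove|delete|sleep'
  ceph tell osd.* injectargs '--some_removal_throttle_option 0'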
>> On Thu, Nov 2, 2017 at 9:40 AM David Turner 
>> wrote:
>>
>>> Jewel 10.2.7; XFS formatted OSDs; no dmcrypt or LVM.  I have a pool that
>>> I deleted 16 hours ago that accounted for about 70% of the available space
>>> on each OSD (averaging 84% full), 370M objects in 8k PGs, ec 4+2 profile.
>>> Based on the rate that the OSDs are freeing up space after deleting the
>>> pool, it will take about a week to finish deleting the PGs from the OSDs.
>>>
>>> Is there anything I can do to speed this process up?  I feel like there
>>> may be a way for me to go through the OSDs and delete the PG folders either
>>> with the objectstore tool or while the OSD is offline.  I'm not sure what
>>> Ceph is doing to delete the pool, but I don't think that an `rm -Rf` of the
>>> PG folder would take nearly this long.
>>>
>>> Thank you all for your help.
>>>


Re: [ceph-users] OSD Random Failures - Latest Luminous

2017-11-18 Thread Ashley Merrick
One thing I have just noticed.

The abort is always on the next thread along; for example, the last PG 6.84s12
was on thread 7f78721ad700, however the assert was listed for 7f78721ae700.

Does this mean PG 6.84s12 crashed, causing the next thread to exit, or was
thread 7f78721ae700 the one that caused the crash and just does not list a PG yet
in the logs?

  -1> 2017-11-18 16:31:23.653898 7f78719ad700 10 osd.33 pg_epoch: 205498 
pg[6.84s12( v 155617'153212 (150667'151572,155617'153212] lb MIN (bitwise) 
local-lis/les=154327/154330 n=0 ec=31534/31534 lis/c 205276/152474 les/c/f 
205277/152489/159786 205496/205496/189316) 
[102,2147483647,2147483647,0,70,72,49,40,53,52,15,18,7]/[102,2147483647,2147483647,84,2147483647,2147483647,49,40,53,52,15,18,28]
 r=-1 lpr=205496 pi=[152474,205496)/2 crt=155617'153212 lcod 0'0 unknown 
NOTIFY] check_recovery_sources no source osds () went down
 0> 2017-11-18 16:31:23.653865 7f78721ae700 -1 *** Caught signal (Aborted) 
**
in thread 7f78721ae700 thread_name:tp_peering

,Ashley
From: David Turner [mailto:drakonst...@gmail.com]
Sent: 18 November 2017 23:19
To: Ashley Merrick 
Cc: Eric Nelson ; ceph-us...@ceph.com
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous


The OSD shouldn't be able to peer while it's down. I think this is good
information to update your ticket with, as it is possibly a different code path
than anticipated.

Did your cluster see the osd as up?

On Sat, Nov 18, 2017, 9:32 AM Ashley Merrick 
> wrote:
Hello,

So seems noup does not help.

Still have the same error :

2017-11-18 14:26:40.982827 7fb4446cd700 -1 *** Caught signal (Aborted) **in 
thread 7fb4446cd700 thread_name:tp_peering

ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
1: (()+0xa0c554) [0x56547f500554]
2: (()+0x110c0) [0x7fb45cabe0c0]
3: (gsignal()+0xcf) [0x7fb45ba85fcf]
4: (abort()+0x16a) [0x7fb45ba873fa]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) 
[0x56547f547f0e]
6: (PG::start_peering_interval(std::shared_ptr, std::vector const&, int, std::vector 
const&, int, ObjectStore::Transaction*)+0x1569) [0x56547f029ad9]
7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x479) [0x56547f02a099]
8: (boost::statechart::simple_state, 
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
const&, void const*)+0x188) [0x56547f06c6d8]
9: (boost::statechart::state_machine::process_event(boost::statechart::event_base
 const&)+0x69) [0x56547f045549]
10: (PG::handle_advance_map(std::shared_ptr, 
std::shared_ptr, std::vector&, int, 
std::vector&, int, PG::RecoveryCtx*)+0x4a7) 
[0x56547f00e837]
11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
PG::RecoveryCtx*, std::set, std::allocator 
>*)+0x2e7) [0x56547ef56e67]
12: (OSD::process_peering_events(std::__cxx11::list > 
const&, ThreadPool::TPHandle&)+0x1e4) [0x56547ef57cb4]
13: (ThreadPool::BatchWorkQueue::_void_process(void*, 
ThreadPool::TPHandle&)+0x2c) [0x56547efc2a0c]
14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x56547f54ef28]
15: (ThreadPool::WorkThread::entry()+0x10) [0x56547f5500c0]
16: (()+0x7494) [0x7fb45cab4494]
17: (clone()+0x3f) [0x7fb45bb3baff]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

I guess even with noup the OSD/PG still has to peer with the other PGs, which
is the stage that causes the failure. Most OSDs seem to stay up for about 30
seconds, and every time it’s a different PG listed on the failure.

,Ashley

From: David Turner [mailto:drakonst...@gmail.com]
Sent: 18 November 2017 22:19
To: Ashley Merrick >
Cc: Eric Nelson >; 
ceph-us...@ceph.com

Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous


Does letting the cluster run with noup for a while until all down disks are
idle, and then letting them come in, help at all?  I don't know your specific
issue and haven't touched bluestore yet, but that is generally sound advice
when they won't start.

Also is there any pattern to the osds that are down? Common PGs, common hosts, 
common ssds, etc?

On Sat, Nov 18, 2017, 7:08 AM Ashley Merrick 

Re: [ceph-users] OSD Random Failures - Latest Luminous

2017-11-18 Thread Ashley Merrick
Will add to ticket.

But no, the cluster does not see the OSD go up; the OSD just fails on the same
assert.

,Ashley

From: David Turner [mailto:drakonst...@gmail.com]
Sent: 18 November 2017 23:19
To: Ashley Merrick 
Cc: Eric Nelson ; ceph-us...@ceph.com
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous


The OSD shouldn't be able to peer while it's down. I think this is good
information to update your ticket with, as it is possibly a different code path
than anticipated.

Did your cluster see the osd as up?

On Sat, Nov 18, 2017, 9:32 AM Ashley Merrick 
> wrote:
Hello,

So seems noup does not help.

Still have the same error :

2017-11-18 14:26:40.982827 7fb4446cd700 -1 *** Caught signal (Aborted) **in 
thread 7fb4446cd700 thread_name:tp_peering

ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
1: (()+0xa0c554) [0x56547f500554]
2: (()+0x110c0) [0x7fb45cabe0c0]
3: (gsignal()+0xcf) [0x7fb45ba85fcf]
4: (abort()+0x16a) [0x7fb45ba873fa]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) 
[0x56547f547f0e]
6: (PG::start_peering_interval(std::shared_ptr, std::vector const&, int, std::vector 
const&, int, ObjectStore::Transaction*)+0x1569) [0x56547f029ad9]
7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x479) [0x56547f02a099]
8: (boost::statechart::simple_state, 
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
const&, void const*)+0x188) [0x56547f06c6d8]
9: (boost::statechart::state_machine::process_event(boost::statechart::event_base
 const&)+0x69) [0x56547f045549]
10: (PG::handle_advance_map(std::shared_ptr, 
std::shared_ptr, std::vector&, int, 
std::vector&, int, PG::RecoveryCtx*)+0x4a7) 
[0x56547f00e837]
11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
PG::RecoveryCtx*, std::set, std::allocator 
>*)+0x2e7) [0x56547ef56e67]
12: (OSD::process_peering_events(std::__cxx11::list > 
const&, ThreadPool::TPHandle&)+0x1e4) [0x56547ef57cb4]
13: (ThreadPool::BatchWorkQueue::_void_process(void*, 
ThreadPool::TPHandle&)+0x2c) [0x56547efc2a0c]
14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x56547f54ef28]
15: (ThreadPool::WorkThread::entry()+0x10) [0x56547f5500c0]
16: (()+0x7494) [0x7fb45cab4494]
17: (clone()+0x3f) [0x7fb45bb3baff]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

I guess even with noup the OSD/PG still has to peer with the other PGs, which
is the stage that causes the failure. Most OSDs seem to stay up for about 30
seconds, and every time it’s a different PG listed on the failure.

,Ashley

From: David Turner [mailto:drakonst...@gmail.com]
Sent: 18 November 2017 22:19
To: Ashley Merrick >
Cc: Eric Nelson >; 
ceph-us...@ceph.com

Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous


Does letting the cluster run with noup for a while until all down disks are
idle, and then letting them come in, help at all?  I don't know your specific
issue and haven't touched bluestore yet, but that is generally sound advice
when they won't start.

Also is there any pattern to the osds that are down? Common PGs, common hosts, 
common ssds, etc?

On Sat, Nov 18, 2017, 7:08 AM Ashley Merrick 
> wrote:
Hello,

Any further suggestions or workarounds from anyone?

The cluster is hard down now with around 2% of PGs offline. On occasion I am able to
get an OSD to start for a bit, but then it will seem to do some peering and again
crash with “*** Caught signal (Aborted) **in thread 7f3471c55700
thread_name:tp_peering”

,Ashley

From: Ashley Merrick
Sent: 16 November 2017 17:27
To: Eric Nelson >
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous


Hello,



Good to hear it's not just me; however, I have a cluster basically offline due to
too many OSDs dropping because of this issue.



Anybody have any suggestions?



,Ashley


From: Eric Nelson >
Sent: 16 November 2017 

Re: [ceph-users] OSD Random Failures - Latest Luminous

2017-11-18 Thread David Turner
The OSD shouldn't be able to peer while it's down. I think this is good
information to update your ticket with, as it is possibly a different code
path than anticipated.

Did your cluster see the osd as up?

On Sat, Nov 18, 2017, 9:32 AM Ashley Merrick  wrote:

> Hello,
>
>
>
> So seems noup does not help.
>
>
>
> Still have the same error :
>
>
>
> 2017-11-18 14:26:40.982827 7fb4446cd700 -1 *** Caught signal (Aborted)
> **in thread 7fb4446cd700 thread_name:tp_peering
>
>
>
> ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous
> (stable)
>
> 1: (()+0xa0c554) [0x56547f500554]
>
> 2: (()+0x110c0) [0x7fb45cabe0c0]
>
> 3: (gsignal()+0xcf) [0x7fb45ba85fcf]
>
> 4: (abort()+0x16a) [0x7fb45ba873fa]
>
> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x28e) [0x56547f547f0e]
>
> 6: (PG::start_peering_interval(std::shared_ptr,
> std::vector const&, int, std::vector std::allocator > const&, int, ObjectStore::Transaction*)+0x1569)
> [0x56547f029ad9]
>
> 7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x479)
> [0x56547f02a099]
>
> 8: (boost::statechart::simple_state PG::RecoveryState::RecoveryMachine, boost::mpl::list mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
> const&, void const*)+0x188) [0x56547f06c6d8]
>
> 9: (boost::statechart::state_machine PG::RecoveryState::Initial, std::allocator,
> boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
> const&)+0x69) [0x56547f045549]
>
> 10: (PG::handle_advance_map(std::shared_ptr,
> std::shared_ptr, std::vector&,
> int, std::vector&, int, PG::RecoveryCtx*)+0x4a7)
> [0x56547f00e837]
>
> 11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&,
> PG::RecoveryCtx*, std::set std::less,
> std::allocator >*)+0x2e7) [0x56547ef56e67]
>
> 12: (OSD::process_peering_events(std::__cxx11::list std::allocator > const&, ThreadPool::TPHandle&)+0x1e4) [0x56547ef57cb4]
>
> 13: (ThreadPool::BatchWorkQueue::_void_process(void*,
> ThreadPool::TPHandle&)+0x2c) [0x56547efc2a0c]
>
> 14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x56547f54ef28]
>
> 15: (ThreadPool::WorkThread::entry()+0x10) [0x56547f5500c0]
>
> 16: (()+0x7494) [0x7fb45cab4494]
>
> 17: (clone()+0x3f) [0x7fb45bb3baff]
>
> NOTE: a copy of the executable, or `objdump -rdS ` is needed
> to interpret this.
>
>
>
> I guess even with noup the OSD/PG still has to peer with the other PGs,
> which is the stage that causes the failure. Most OSDs seem to stay up for
> about 30 seconds, and every time it’s a different PG listed on the failure.
>
>
>
> ,Ashley
>
>
>
> *From:* David Turner [mailto:drakonst...@gmail.com]
>
> *Sent:* 18 November 2017 22:19
> *To:* Ashley Merrick 
>
> *Cc:* Eric Nelson ; ceph-us...@ceph.com
>
>
> *Subject:* Re: [ceph-users] OSD Random Failures - Latest Luminous
>
>
>
> Does letting the cluster run with noup for a while until all down disks
> are idle, and then letting them come in, help at all?  I don't know your
> specific issue and haven't touched bluestore yet, but that is generally
> sound advice when they won't start.
>
> Also is there any pattern to the osds that are down? Common PGs, common
> hosts, common ssds, etc?
>
>
>
> On Sat, Nov 18, 2017, 7:08 AM Ashley Merrick 
> wrote:
>
> Hello,
>
>
>
> Any further suggestions or workarounds from anyone?
>
>
>
> The cluster is hard down now with around 2% of PGs offline. On occasion I am able
> to get an OSD to start for a bit, but then it will seem to do some peering and
> again crash with “*** Caught signal (Aborted) **in thread 7f3471c55700
> thread_name:tp_peering”
>
>
>
> ,Ashley
>
>
>
> *From:* Ashley Merrick
>
> *Sent:* 16 November 2017 17:27
> *To:* Eric Nelson 
>
> *Cc:* ceph-us...@ceph.com
> *Subject:* Re: [ceph-users] OSD Random Failures - Latest Luminous
>
>
>
> Hello,
>
>
>
> Good to hear it's not just me; however, I have a cluster basically offline
> due to too many OSDs dropping because of this issue.
>
>
>
> Anybody have any suggestions?
>
>
>
> ,Ashley
> --
>
> *From:* Eric Nelson 
> *Sent:* 16 November 2017 00:06:14
> *To:* Ashley Merrick
> *Cc:* ceph-us...@ceph.com
> *Subject:* Re: [ceph-users] OSD Random Failures - Latest Luminous
>
>
>
> I've been seeing these as well on our SSD cachetier that's been ravaged by
> disk failures as of late Same tp_peering assert as above even running
> luminous branch from git.
>
>
>
> Let me know if you have 

Re: [ceph-users] OSD Random Failures - Latest Luminous

2017-11-18 Thread Ashley Merrick
Hello,

Added an empty one, and it was fine; I guess it had no peering to do as it had no PGs.

I did also disable nobackfill for a while and some PGs have moved across
fine. I did try to export a PG from an OSD that fails to boot and import it
onto another OSD, but that caused that OSD to hit the same error until I removed the PG
in question.
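For reference, the export/import cycle described above looks roughly like this (a
sketch only; paths, pgid and OSD ids are examples, and the OSDs must be stopped
while the tool runs):

  systemctl stop ceph-osd@12
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
      --pgid 6.2f2s10 --op export --file /root/6.2f2s10.export
  systemctl stop ceph-osd@45
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-45 \
      --op import --file /root/6.2f2s10.export
  systemctl start ceph-osd@45
  # --op remove --pgid 6.2f2s10 --force is what takes the copy back out again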

To me it seems there is some metadata it is running through on the PGs
(past operations), or something like that, which causes the crash. However, as said, it
happens on a random PG, and I have seen examples where the PG it crashes on last
has already gone through the peering stage with no issues.

The other user on the ML that had the issue had theirs start when a disk
failed, which caused PGs to peer during rebuilding, so it seems to be related to
a task in the peering phase which the OSD just loops on due to the crash.

,Ashley

From: Sean Redmond [mailto:sean.redmo...@gmail.com]
Sent: 18 November 2017 22:40
To: Ashley Merrick 
Cc: David Turner ; ceph-users 
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous

Hi,

Is it possible to add new empty osds to your cluster? Or do these also crash 
out?

Thanks

On 18 Nov 2017 14:32, "Ashley Merrick" 
> wrote:
Hello,

So seems noup does not help.

Still have the same error :

2017-11-18 14:26:40.982827 7fb4446cd700 -1 *** Caught signal (Aborted) **in 
thread 7fb4446cd700 thread_name:tp_peering

ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
1: (()+0xa0c554) [0x56547f500554]
2: (()+0x110c0) [0x7fb45cabe0c0]
3: (gsignal()+0xcf) [0x7fb45ba85fcf]
4: (abort()+0x16a) [0x7fb45ba873fa]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) 
[0x56547f547f0e]
6: (PG::start_peering_interval(std::shared_ptr, std::vector const&, int, std::vector 
const&, int, ObjectStore::Transaction*)+0x1569) [0x56547f029ad9]
7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x479) [0x56547f02a099]
8: (boost::statechart::simple_state, 
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
const&, void const*)+0x188) [0x56547f06c6d8]
9: (boost::statechart::state_machine::process_event(boost::statechart::event_base
 const&)+0x69) [0x56547f045549]
10: (PG::handle_advance_map(std::shared_ptr, 
std::shared_ptr, std::vector&, int, 
std::vector&, int, PG::RecoveryCtx*)+0x4a7) 
[0x56547f00e837]
11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
PG::RecoveryCtx*, std::set, std::allocator 
>*)+0x2e7) [0x56547ef56e67]
12: (OSD::process_peering_events(std::__cxx11::list > 
const&, ThreadPool::TPHandle&)+0x1e4) [0x56547ef57cb4]
13: (ThreadPool::BatchWorkQueue::_void_process(void*, 
ThreadPool::TPHandle&)+0x2c) [0x56547efc2a0c]
14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x56547f54ef28]
15: (ThreadPool::WorkThread::entry()+0x10) [0x56547f5500c0]
16: (()+0x7494) [0x7fb45cab4494]
17: (clone()+0x3f) [0x7fb45bb3baff]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

I guess even with noup the OSD/PG still has to peer with the other PGs, which
is the stage that causes the failure. Most OSDs seem to stay up for about 30
seconds, and every time it’s a different PG listed on the failure.

,Ashley

From: David Turner [mailto:drakonst...@gmail.com]
Sent: 18 November 2017 22:19
To: Ashley Merrick >
Cc: Eric Nelson >; 
ceph-us...@ceph.com
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous


Does letting the cluster run with noup for a while until all down disks are
idle, and then letting them come in, help at all?  I don't know your specific
issue and haven't touched bluestore yet, but that is generally sound advice
when they won't start.

Also is there any pattern to the osds that are down? Common PGs, common hosts, 
common ssds, etc?

On Sat, Nov 18, 2017, 7:08 AM Ashley Merrick 
> wrote:
Hello,

Any further suggestions or workarounds from anyone?

The cluster is hard down now with around 2% of PGs offline. On occasion I am able to
get an OSD to start for a bit, but then it will seem to do some 

Re: [ceph-users] OSD Random Failures - Latest Luminous

2017-11-18 Thread Sean Redmond
Hi,

Is it possible to add new empty osds to your cluster? Or do these also
crash out?

Thanks

On 18 Nov 2017 14:32, "Ashley Merrick"  wrote:

> Hello,
>
>
>
> So seems noup does not help.
>
>
>
> Still have the same error :
>
>
>
> 2017-11-18 14:26:40.982827 7fb4446cd700 -1 *** Caught signal (Aborted)
> **in thread 7fb4446cd700 thread_name:tp_peering
>
>
>
> ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous
> (stable)
>
> 1: (()+0xa0c554) [0x56547f500554]
>
> 2: (()+0x110c0) [0x7fb45cabe0c0]
>
> 3: (gsignal()+0xcf) [0x7fb45ba85fcf]
>
> 4: (abort()+0x16a) [0x7fb45ba873fa]
>
> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x28e) [0x56547f547f0e]
>
> 6: (PG::start_peering_interval(std::shared_ptr,
> std::vector const&, int, std::vector std::allocator > const&, int, ObjectStore::Transaction*)+0x1569)
> [0x56547f029ad9]
>
> 7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x479)
> [0x56547f02a099]
>
> 8: (boost::statechart::simple_state PG::RecoveryState::RecoveryMachine, boost::mpl::list mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_
> mode)0>::react_impl(boost::statechart::event_base const&, void
> const*)+0x188) [0x56547f06c6d8]
>
> 9: (boost::statechart::state_machine PG::RecoveryState::Initial, std::allocator, boost::statechart::null_
> exception_translator>::process_event(boost::statechart::event_base
> const&)+0x69) [0x56547f045549]
>
> 10: (PG::handle_advance_map(std::shared_ptr,
> std::shared_ptr, std::vector&,
> int, std::vector&, int, PG::RecoveryCtx*)+0x4a7)
> [0x56547f00e837]
>
> 11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&,
> PG::RecoveryCtx*, std::set std::less, std::allocator > >*)+0x2e7) [0x56547ef56e67]
>
> 12: (OSD::process_peering_events(std::__cxx11::list std::allocator > const&, ThreadPool::TPHandle&)+0x1e4) [0x56547ef57cb4]
>
> 13: (ThreadPool::BatchWorkQueue::_void_process(void*,
> ThreadPool::TPHandle&)+0x2c) [0x56547efc2a0c]
>
> 14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x56547f54ef28]
>
> 15: (ThreadPool::WorkThread::entry()+0x10) [0x56547f5500c0]
>
> 16: (()+0x7494) [0x7fb45cab4494]
>
> 17: (clone()+0x3f) [0x7fb45bb3baff]
>
> NOTE: a copy of the executable, or `objdump -rdS ` is needed
> to interpret this.
>
>
>
> I guess even with noup the OSD/PG still has to peer with the other PGs,
> which is the stage that causes the failure. Most OSDs seem to stay up for
> about 30 seconds, and every time it’s a different PG listed on the failure.
>
>
>
> ,Ashley
>
>
>
> *From:* David Turner [mailto:drakonst...@gmail.com]
> *Sent:* 18 November 2017 22:19
> *To:* Ashley Merrick 
> *Cc:* Eric Nelson ; ceph-us...@ceph.com
> *Subject:* Re: [ceph-users] OSD Random Failures - Latest Luminous
>
>
>
> Does letting the cluster run with noup for a while until all down disks
> are idle, and then letting them come in, help at all?  I don't know your
> specific issue and haven't touched bluestore yet, but that is generally
> sound advice when they won't start.
>
> Also is there any pattern to the osds that are down? Common PGs, common
> hosts, common ssds, etc?
>
>
>
> On Sat, Nov 18, 2017, 7:08 AM Ashley Merrick 
> wrote:
>
> Hello,
>
>
>
> Any further suggestions or workarounds from anyone?
>
>
>
> The cluster is hard down now with around 2% of PGs offline. On occasion I am able
> to get an OSD to start for a bit, but then it will seem to do some peering and
> again crash with “*** Caught signal (Aborted) **in thread 7f3471c55700
> thread_name:tp_peering”
>
>
>
> ,Ashley
>
>
>
> *From:* Ashley Merrick
>
> *Sent:* 16 November 2017 17:27
> *To:* Eric Nelson 
>
> *Cc:* ceph-us...@ceph.com
> *Subject:* Re: [ceph-users] OSD Random Failures - Latest Luminous
>
>
>
> Hello,
>
>
>
> Good to hear it's not just me; however, I have a cluster basically offline
> due to too many OSDs dropping because of this issue.
>
>
>
> Anybody have any suggestions?
>
>
>
> ,Ashley
> --
>
> *From:* Eric Nelson 
> *Sent:* 16 November 2017 00:06:14
> *To:* Ashley Merrick
> *Cc:* ceph-us...@ceph.com
> *Subject:* Re: [ceph-users] OSD Random Failures - Latest Luminous
>
>
>
> I've been seeing these as well on our SSD cachetier that's been ravaged by
> disk failures as of late Same tp_peering assert as above even running
> luminous branch from git.
>
>
>
> Let me know if you have a bug filed I can +1 or have found a workaround.
>
>
>
> E
>
>
>
> On Wed, Nov 15, 2017 at 10:25 AM, Ashley Merrick 

Re: [ceph-users] OSD Random Failures - Latest Luminous

2017-11-18 Thread Ashley Merrick
Hello,

So seems noup does not help.

Still have the same error :

2017-11-18 14:26:40.982827 7fb4446cd700 -1 *** Caught signal (Aborted) **in 
thread 7fb4446cd700 thread_name:tp_peering

ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
1: (()+0xa0c554) [0x56547f500554]
2: (()+0x110c0) [0x7fb45cabe0c0]
3: (gsignal()+0xcf) [0x7fb45ba85fcf]
4: (abort()+0x16a) [0x7fb45ba873fa]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) 
[0x56547f547f0e]
6: (PG::start_peering_interval(std::shared_ptr, std::vector const&, int, std::vector 
const&, int, ObjectStore::Transaction*)+0x1569) [0x56547f029ad9]
7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x479) [0x56547f02a099]
8: (boost::statechart::simple_state, 
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
const&, void const*)+0x188) [0x56547f06c6d8]
9: (boost::statechart::state_machine::process_event(boost::statechart::event_base
 const&)+0x69) [0x56547f045549]
10: (PG::handle_advance_map(std::shared_ptr, 
std::shared_ptr, std::vector&, int, 
std::vector&, int, PG::RecoveryCtx*)+0x4a7) 
[0x56547f00e837]
11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
PG::RecoveryCtx*, std::set, std::allocator 
>*)+0x2e7) [0x56547ef56e67]
12: (OSD::process_peering_events(std::__cxx11::list > 
const&, ThreadPool::TPHandle&)+0x1e4) [0x56547ef57cb4]
13: (ThreadPool::BatchWorkQueue::_void_process(void*, 
ThreadPool::TPHandle&)+0x2c) [0x56547efc2a0c]
14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x56547f54ef28]
15: (ThreadPool::WorkThread::entry()+0x10) [0x56547f5500c0]
16: (()+0x7494) [0x7fb45cab4494]
17: (clone()+0x3f) [0x7fb45bb3baff]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

I guess even with noup the OSD/PG still has to peer with the other PGs, which
is the stage that causes the failure. Most OSDs seem to stay up for about 30
seconds, and every time it’s a different PG listed on the failure.

,Ashley

From: David Turner [mailto:drakonst...@gmail.com]
Sent: 18 November 2017 22:19
To: Ashley Merrick 
Cc: Eric Nelson ; ceph-us...@ceph.com
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous


Does letting the cluster run with noup for a while until all down disks are
idle, and then letting them come in, help at all?  I don't know your specific
issue and haven't touched bluestore yet, but that is generally sound advice
when they won't start.

Also is there any pattern to the osds that are down? Common PGs, common hosts, 
common ssds, etc?

On Sat, Nov 18, 2017, 7:08 AM Ashley Merrick 
> wrote:
Hello,

Any further suggestions or workarounds from anyone?

The cluster is hard down now with around 2% of PGs offline. On occasion I am able to
get an OSD to start for a bit, but then it will seem to do some peering and again
crash with “*** Caught signal (Aborted) **in thread 7f3471c55700
thread_name:tp_peering”

,Ashley

From: Ashley Merrick
Sent: 16 November 2017 17:27
To: Eric Nelson >
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous


Hello,



Good to hear it's not just me; however, I have a cluster basically offline due to
too many OSDs dropping because of this issue.



Anybody have any suggestions?



,Ashley


From: Eric Nelson >
Sent: 16 November 2017 00:06:14
To: Ashley Merrick
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous

I've been seeing these as well on our SSD cachetier that's been ravaged by disk 
failures as of late Same tp_peering assert as above even running luminous 
branch from git.

Let me know if you have a bug filed I can +1 or have found a workaround.

E

On Wed, Nov 15, 2017 at 10:25 AM, Ashley Merrick 
> wrote:

Hello,



After replacing a single OSD disk due to a failed disk, I am now seeing 2-3
OSDs randomly stop and fail to start; they boot-loop, get to load_pgs, and then
fail with the following (I tried setting OSD logs to 5/5 but didn’t get any
extra lines around the error, just more 

Re: [ceph-users] OSD Random Failures - Latest Luminous

2017-11-18 Thread Ashley Merrick
Hello,

Will try with noup now and see if it makes any difference.

It is affecting both BS & FS OSDs and affecting different hosts and different
PGs; there seems to be no form of pattern.
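One way to get more context out of the next crash (a sketch; the osd id is an example,
and 20/20 logging grows very quickly, so turn it back down afterwards):

  # if the daemon stays up long enough, bump logging live:
  ceph tell osd.37 injectargs '--debug_osd 20 --debug_ms 1'
  # otherwise run one OSD in the foreground with verbose logging for a single crash:
  ceph-osd -f --cluster ceph --id 37 --debug_osd 20 --debug_ms 1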

,Ashley

From: David Turner [mailto:drakonst...@gmail.com]
Sent: 18 November 2017 22:19
To: Ashley Merrick 
Cc: Eric Nelson ; ceph-us...@ceph.com
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous


Does letting the cluster run with noup for a while until all down disks are
idle, and then letting them come in, help at all?  I don't know your specific
issue and haven't touched bluestore yet, but that is generally sound advice
when they won't start.

Also is there any pattern to the osds that are down? Common PGs, common hosts, 
common ssds, etc?

On Sat, Nov 18, 2017, 7:08 AM Ashley Merrick 
> wrote:
Hello,

Any further suggestions or workarounds from anyone?

The cluster is hard down now with around 2% of PGs offline. On occasion I am able to
get an OSD to start for a bit, but then it will seem to do some peering and again
crash with “*** Caught signal (Aborted) **in thread 7f3471c55700
thread_name:tp_peering”

,Ashley

From: Ashley Merrick
Sent: 16 November 2017 17:27
To: Eric Nelson >
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous


Hello,



Good to hear it's not just me; however, I have a cluster basically offline due to
too many OSDs dropping because of this issue.



Anybody have any suggestions?



,Ashley


From: Eric Nelson >
Sent: 16 November 2017 00:06:14
To: Ashley Merrick
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous

I've been seeing these as well on our SSD cachetier that's been ravaged by disk 
failures as of late Same tp_peering assert as above even running luminous 
branch from git.

Let me know if you have a bug filed I can +1 or have found a workaround.

E

On Wed, Nov 15, 2017 at 10:25 AM, Ashley Merrick 
> wrote:

Hello,



After replacing a single OSD disk due to a failed disk, I am now seeing 2-3
OSDs randomly stop and fail to start; they boot-loop, get to load_pgs, and then
fail with the following (I tried setting OSD logs to 5/5 but didn’t get any
extra lines around the error, just more information pre-boot).



Could this be a certain PG causing these OSD’s to crash (6.2f2s10 for example)?



-9> 2017-11-15 17:37:14.696229 7fa4ec50f700  1 osd.37 pg_epoch: 161571 
pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] 
local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 
161521/152523/159786 161517/161519/161519) 
[34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647]
 r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY 
m=21] state: transitioning to Stray

-8> 2017-11-15 17:37:14.696239 7fa4ec50f700  5 osd.37 pg_epoch: 161571 
pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] 
local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 
161521/152523/159786 161517/161519/161519) 
[34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647]
 r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY 
m=21] exit Start 0.19 0 0.00

-7> 2017-11-15 17:37:14.696250 7fa4ec50f700  5 osd.37 pg_epoch: 161571 
pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] 
local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 
161521/152523/159786 161517/161519/161519) 
[34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647]
 r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY 
m=21] enter Started/Stray

-6> 2017-11-15 17:37:14.696324 7fa4ec50f700  5 osd.37 pg_epoch: 161571 
pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] 
local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 
161519/160963/159786 161517/161517/108939) 
[96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 
crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] exit Reset 3.363755 2 0.76

-5> 2017-11-15 17:37:14.696337 7fa4ec50f700  5 osd.37 pg_epoch: 161571 
pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] 
local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 
161519/160963/159786 161517/161517/108939) 
[96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 
crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter Started

-4> 2017-11-15 

Re: [ceph-users] OSD Random Failures - Latest Luminous

2017-11-18 Thread David Turner
Does letting the cluster run with noup for a while until all down disks are
idle, and then letting them come in, help at all?  I don't know your
specific issue and haven't touched bluestore yet, but that is generally
sound advice when they won't start.

Also is there any pattern to the osds that are down? Common PGs, common
hosts, common ssds, etc?
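For reference, the noup sequence being suggested is roughly (osd id and unit name are
examples; adjust for your hosts):

  ceph osd set noup                # newly booted OSDs stay marked down, so nothing peers with them
  systemctl start ceph-osd@33      # bring the crashing OSDs back and let them settle
  ceph -s                          # wait until they look idle
  ceph osd unset noup              # then let them be marked up and peer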

On Sat, Nov 18, 2017, 7:08 AM Ashley Merrick  wrote:

> Hello,
>
>
>
> Any further suggestions or workarounds from anyone?
>
>
>
> The cluster is hard down now with around 2% of PGs offline. On occasion I am able
> to get an OSD to start for a bit, but then it will seem to do some peering and
> again crash with “*** Caught signal (Aborted) **in thread 7f3471c55700
> thread_name:tp_peering”
>
>
>
> ,Ashley
>
>
>
> *From:* Ashley Merrick
>
> *Sent:* 16 November 2017 17:27
> *To:* Eric Nelson 
>
> *Cc:* ceph-us...@ceph.com
> *Subject:* Re: [ceph-users] OSD Random Failures - Latest Luminous
>
>
>
> Hello,
>
>
>
> Good to hear it's not just me; however, I have a cluster basically offline
> due to too many OSDs dropping because of this issue.
>
>
>
> Anybody have any suggestions?
>
>
>
> ,Ashley
> --
>
> *From:* Eric Nelson 
> *Sent:* 16 November 2017 00:06:14
> *To:* Ashley Merrick
> *Cc:* ceph-us...@ceph.com
> *Subject:* Re: [ceph-users] OSD Random Failures - Latest Luminous
>
>
>
> I've been seeing these as well on our SSD cachetier that's been ravaged by
> disk failures as of late Same tp_peering assert as above even running
> luminous branch from git.
>
>
>
> Let me know if you have a bug filed I can +1 or have found a workaround.
>
>
>
> E
>
>
>
> On Wed, Nov 15, 2017 at 10:25 AM, Ashley Merrick 
> wrote:
>
> Hello,
>
>
>
> After replacing a single OSD disk due to a failed disk, I am now seeing 2-3
> OSDs randomly stop and fail to start; they boot-loop, get to load_pgs, and
> then fail with the following (I tried setting OSD logs to 5/5 but didn’t
> get any extra lines around the error, just more information pre-boot).
>
>
>
> Could this be a certain PG causing these OSD’s to crash (6.2f2s10 for
> example)?
>
>
>
> -9> 2017-11-15 17:37:14.696229 7fa4ec50f700  1 osd.37 pg_epoch: 161571
> pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209]
> local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474
> les/c/f 161521/152523/159786 161517/161519/161519)
> [34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647]
> r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown
> NOTIFY m=21] state: transitioning to Stray
>
> -8> 2017-11-15 17:37:14.696239 7fa4ec50f700  5 osd.37 pg_epoch: 161571
> pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209]
> local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474
> les/c/f 161521/152523/159786 161517/161519/161519)
> [34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647]
> r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown
> NOTIFY m=21] exit Start 0.19 0 0.00
>
> -7> 2017-11-15 17:37:14.696250 7fa4ec50f700  5 osd.37 pg_epoch: 161571
> pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209]
> local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474
> les/c/f 161521/152523/159786 161517/161519/161519)
> [34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647]
> r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown
> NOTIFY m=21] enter Started/Stray
>
> -6> 2017-11-15 17:37:14.696324 7fa4ec50f700  5 osd.37 pg_epoch: 161571
> pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712]
> local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962
> les/c/f 161519/160963/159786 161517/161517/108939)
> [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570
> pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] exit
> Reset 3.363755 2 0.76
>
> -5> 2017-11-15 17:37:14.696337 7fa4ec50f700  5 osd.37 pg_epoch: 161571
> pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712]
> local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962
> les/c/f 161519/160963/159786 161517/161517/108939)
> [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570
> pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter
> Started
>
> -4> 2017-11-15 17:37:14.696346 7fa4ec50f700  5 osd.37 pg_epoch: 161571
> pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712]
> local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962
> les/c/f 161519/160963/159786 161517/161517/108939)
> [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570
> pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter
> Start
>
> -3> 2017-11-15 

Re: [ceph-users] I/O stalls when doing fstrim on large RBD

2017-11-18 Thread Jason Dillaman
Can you capture a blktrace while performing fstrim to record the discard
operations? A 1TB trim extent would cause a huge impact since it would
translate to approximately 262K IO requests to the OSDs (assuming 4MB
backing files).
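Something along these lines would capture it while keeping each trim pass bounded (a
rough sketch; device, mountpoint and chunk size are examples, adjust for the VM):

  blktrace -d /dev/sdb -o fstrim-trace &          # record block-layer events, including discards
  chunk=$((256*2**30))                            # trim 256 GiB per pass instead of the whole fs
  end=$(df -B1 --output=size /mnt/bigvol | tail -n1)
  for ((off=0; off<end; off+=chunk)); do
      fstrim -v -o "$off" -l "$chunk" /mnt/bigvol
  done
  kill %1                                         # stop blktrace
  blkparse -i fstrim-trace -o fstrim-trace.txt    # turn the binary trace into text for review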

On Fri, Nov 17, 2017 at 6:19 PM, Brendan Moloney  wrote:
> Hi,
>
> I guess this isn't strictly about Ceph, but I feel like other folks here
> must have run into the same issues.
>
> I am trying to keep my thinly provisioned RBD volumes thin.  I use
> virtio-scsi to attach the RBD volumes to my VMs with the "discard=unmap"
> option. The RBD is formatted as XFS and some of them can be quite large
> (16TB+).  I have a cron job that runs "fstrim" commands twice a week in the
> evenings.
>
> The issue is that I see massive I/O stalls on the VM during the fstrim.  To
> the point where I am getting kernel panics from hung tasks and other
> timeouts.  I have tried a number of things to lessen the impact:
>
> - Switching from deadline to CFQ (initially I thought this helped, but
> now I am not convinced)
> - Running fstrim with "ionice -c idle" (this doesn't seem to make a
> difference)
> - Chunking the fstrim with the offset/length options (helps reduce worst
> case, but I can't trim less than 1TB at a time and that can still cause a
> pause for several minutes)
>
> Is there anything else I can do to avoid this issue?
>
> Thanks,
> Brendan
>



-- 
Jason


[ceph-users] Rebuild rgw bucket index

2017-11-18 Thread Milanov, Radoslav Nikiforov
Is there a way to rebuild the contents of the .rgw.buckets.index pool removed by
accident?

Thanks in advance.



Re: [ceph-users] OSD Random Failures - Latest Luminous

2017-11-18 Thread Ashley Merrick
Hello,

Any further suggestions or workarounds from anyone?

The cluster is hard down now with around 2% of PGs offline. On occasion I am able to
get an OSD to start for a bit, but then it will seem to do some peering and again
crash with "*** Caught signal (Aborted) **in thread 7f3471c55700
thread_name:tp_peering"

,Ashley

From: Ashley Merrick
Sent: 16 November 2017 17:27
To: Eric Nelson 
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous


Hello,



Good to hear it's not just me; however, I have a cluster basically offline due to
too many OSDs dropping because of this issue.



Anybody have any suggestions?



,Ashley


From: Eric Nelson >
Sent: 16 November 2017 00:06:14
To: Ashley Merrick
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous

I've been seeing these as well on our SSD cachetier that's been ravaged by disk 
failures as of late Same tp_peering assert as above even running luminous 
branch from git.

Let me know if you have a bug filed I can +1 or have found a workaround.

E

On Wed, Nov 15, 2017 at 10:25 AM, Ashley Merrick 
> wrote:

Hello,



After replacing a single OSD disk due to a failed disk, I am now seeing 2-3
OSDs randomly stop and fail to start; they boot-loop, get to load_pgs, and then
fail with the following (I tried setting OSD logs to 5/5 but didn't get any
extra lines around the error, just more information pre-boot).



Could this be a certain PG causing these OSD's to crash (6.2f2s10 for example)?



-9> 2017-11-15 17:37:14.696229 7fa4ec50f700  1 osd.37 pg_epoch: 161571 
pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] 
local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 
161521/152523/159786 161517/161519/161519) 
[34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647]
 r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY 
m=21] state: transitioning to Stray

-8> 2017-11-15 17:37:14.696239 7fa4ec50f700  5 osd.37 pg_epoch: 161571 
pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] 
local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 
161521/152523/159786 161517/161519/161519) 
[34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647]
 r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY 
m=21] exit Start 0.19 0 0.00

-7> 2017-11-15 17:37:14.696250 7fa4ec50f700  5 osd.37 pg_epoch: 161571 
pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] 
local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 
161521/152523/159786 161517/161519/161519) 
[34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647]
 r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY 
m=21] enter Started/Stray

-6> 2017-11-15 17:37:14.696324 7fa4ec50f700  5 osd.37 pg_epoch: 161571 
pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] 
local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 
161519/160963/159786 161517/161517/108939) 
[96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 
crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] exit Reset 3.363755 2 0.76

-5> 2017-11-15 17:37:14.696337 7fa4ec50f700  5 osd.37 pg_epoch: 161571 
pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] 
local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 
161519/160963/159786 161517/161517/108939) 
[96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 
crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter Started

-4> 2017-11-15 17:37:14.696346 7fa4ec50f700  5 osd.37 pg_epoch: 161571 
pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] 
local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 
161519/160963/159786 161517/161517/108939) 
[96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 
crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter Start

-3> 2017-11-15 17:37:14.696353 7fa4ec50f700  1 osd.37 pg_epoch: 161571 
pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] 
local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 
161519/160963/159786 161517/161517/108939) 
[96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 
crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] state: transitioning to 
Stray

-2> 2017-11-15 17:37:14.696364 7fa4ec50f700  5 osd.37 pg_epoch: 161571 
pg[6.2f2s10( v 161570'157712 lc 161175'157648 

Re: [ceph-users] Bluestore performance 50% of filestore

2017-11-18 Thread Nick Fisk
Just a couple of points.

There is no way you can be writing over 7000 iops to 27x7200rpm disks at a 
replica level of 3. As Mark has suggested, with a 1GB test file, you are only 
touching a tiny area on each physical disk and so you are probably getting a 
combination of short stroking from the disks and Filestore/XFS buffering up 
your writes, coalescing them and actually writing a lot less out to the disks 
than what the benchmark is suggesting. 

I'm not 100% sure how the allocations work in Bluestore, especially when it
comes to overwriting with tiny 4kb objects, but I am wondering if Bluestore is
starting to spread the data out further across the disk, so you lose some
benefit of short stroking? There may be other factors coming into play with the
deferred writes, which were implemented/fixed after the investigation Mark
mentioned. The simple reproducer at the time was to coalesce a stream of small
sequential writes; the scenario where a larger number of small random writes
potentially covering the same small area was not tested.

I would suggest trying to use fio with the librbd engine directly and creating an
RBD of around a TB in size to rule out any disk locality issues first. If that
brings the figures more in line, then that could potentially steer the
investigation towards why Bluestore struggles to coalesce as well as the Linux
FS system does.
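A starting point for that (a sketch; pool/image names and sizes are examples, and it
assumes a client with a working ceph.conf and keyring):

  rbd create bench-pool/fio-test --size 1T
  fio --name=rbd-randwrite-4k --ioengine=rbd --clientname=admin --pool=bench-pool \
      --rbdname=fio-test --rw=randwrite --bs=4k --iodepth=16 --direct=1 \
      --time_based --runtime=180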

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Milanov, Radoslav Nikiforov
> Sent: 17 November 2017 22:56
> To: Mark Nelson ; David Turner
> 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Bluestore performance 50% of filestore
> 
> Here are some more results. I'm reading that 12.2.2 will have performance
> improvements for bluestore and should be released soon?
> 
> Iodepth=not specified
> Filestore
>   write: io=3511.9MB, bw=19978KB/s, iops=4994, runt=180001msec
>   write: io=3525.6MB, bw=20057KB/s, iops=5014, runt=180001msec
>   write: io=3554.1MB, bw=20222KB/s, iops=5055, runt=180016msec
> 
>   read : io=1995.7MB, bw=11353KB/s, iops=2838, runt=180001msec
>   read : io=1824.5MB, bw=10379KB/s, iops=2594, runt=180001msec
>   read : io=1966.5MB, bw=11187KB/s, iops=2796, runt=180001msec
> 
> Bluestore
>   write: io=1621.2MB, bw=9222.3KB/s, iops=2305, runt=180002msec
>   write: io=1576.3MB, bw=8965.6KB/s, iops=2241, runt=180029msec
>   write: io=1531.9MB, bw=8714.3KB/s, iops=2178, runt=180001msec
> 
>   read : io=1279.4MB, bw=7276.5KB/s, iops=1819, runt=180006msec
>   read : io=773824KB, bw=4298.9KB/s, iops=1074, runt=180010msec
>   read : io=1018.5MB, bw=5793.7KB/s, iops=1448, runt=180001msec
> 
> Iodepth=10
> Filestore
>   write: io=5045.1MB, bw=28706KB/s, iops=7176, runt=180001msec
>   write: io=4764.7MB, bw=27099KB/s, iops=6774, runt=180021msec
>   write: io=4626.2MB, bw=26318KB/s, iops=6579, runt=180031msec
> 
>   read : io=1745.3MB, bw=9928.6KB/s, iops=2482, runt=180001msec
>   read : io=1933.7MB, bw=11000KB/s, iops=2749, runt=180001msec
>   read : io=1952.7MB, bw=11108KB/s, iops=2777, runt=180001msec
> 
> Bluestore
>   write: io=1578.8MB, bw=8980.9KB/s, iops=2245, runt=180006msec
>   write: io=1583.9MB, bw=9010.2KB/s, iops=2252, runt=180002msec
>   write: io=1591.5MB, bw=9050.9KB/s, iops=2262, runt=180009msec
> 
>   read : io=412104KB, bw=2289.5KB/s, iops=572, runt=180002msec
>   read : io=718108KB, bw=3989.5KB/s, iops=997, runt=180003msec
>   read : io=968388KB, bw=5379.7KB/s, iops=1344, runt=180009msec
> 
> Iodpeth=20
> Filestore
>   write: io=4671.2MB, bw=26574KB/s, iops=6643, runt=180001msec
>   write: io=4583.4MB, bw=26066KB/s, iops=6516, runt=180054msec
>   write: io=4641.6MB, bw=26347KB/s, iops=6586, runt=180395msec
> 
>   read : io=2094.3MB, bw=11914KB/s, iops=2978, runt=180001msec
>   read : io=1997.6MB, bw=11364KB/s, iops=2840, runt=180001msec
>   read : io=2028.4MB, bw=11539KB/s, iops=2884, runt=180001msec
> 
> Bluestore
>   write: io=1595.8MB, bw=9078.2KB/s, iops=2269, runt=180001msec
>   write: io=1596.2MB, bw=9080.6KB/s, iops=2270, runt=180001msec
>   write: io=1588.3MB, bw=9035.4KB/s, iops=2258, runt=180002msec
> 
>   read : io=1126.9MB, bw=6410.5KB/s, iops=1602, runt=180004msec
>   read : io=1282.4MB, bw=7295.3KB/s, iops=1823, runt=180003msec
>   read : io=1380.9MB, bw=7854.1KB/s, iops=1963, runt=180007msec
> 
> 
> - Rado
> 
> -Original Message-
> From: Mark Nelson [mailto:mnel...@redhat.com]
> Sent: Thursday, November 16, 2017 2:04 PM
> To: Milanov, Radoslav Nikiforov ; David Turner
> 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Bluestore performance 50% of filestore
> 
> It depends on what you expect your typical workload to be like.  Ceph (and
> distributed storage in general) likes high io depths so writes can hit all of 
> the
> drives at the same time.  There are tricks (like journals, writeahead logs,
> centralized caches, etc) that can help mitigate 

Re: [ceph-users] bucket cloning/writable snapshots

2017-11-18 Thread Haomai Wang
On Sat, Nov 18, 2017 at 4:49 PM, Fred Gansevles  wrote:

> Hi,
>
> Currently our company has +/- 50 apps where every app has its own
> data-area on NFS.
> We need to switch to S3, using Ceph, as our new data layer, with
> every app using its own s3-bucket, equivalent to the NFS data-area.
> The sizes of the data-areas, depending on the app, vary from 1.3 GB to
> 358 GB.
>
> In order to test multiple versions of the app, we currently make a
> writable snapshot of
> the data-area to avoid copying or 'polluting' the original (i.e.
> 'production' data).
> Since a snapshot is fast, we can make multiple snapshots easily and discard them
> afterwards.
>
> With Ceph, we would like to do something alike, i.e. 'fast' copying and
> easily discardable.
> The path we are trying to take is: make a 'clone' of the bucket (i.e.
> writable
> snapshots) into a test-bucket.
>

why not S3 multiversion?
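(A sketch of what that looks like against RGW with the aws cli; the endpoint and
bucket name are examples:)

  aws --endpoint-url http://rgw.example.com:7480 s3api put-bucket-versioning \
      --bucket app-data --versioning-configuration Status=Enabled
  # overwrites then keep the previous copy as an older version that can be listed/restored
  aws --endpoint-url http://rgw.example.com:7480 s3api list-object-versions --bucket app-data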

>
> Our design for this is the following:
> - We need to have every app bucket in its own unique pool-set.
> - (on the ceph-nodes) determine the pool-set that is used by the given
> bucket
> - (on the ceph-nodes) make snapshots of the pools and assign these
> snapshots to
>   an newly created (writable!) test-bucket.
> - after the test is finished, the test-bucket can be removed (either from
> the ceph-node
>   or the test system).
>
> The test procedure that is testing the app is aware of its environment,
> i.e. it 'knows' that it has to do special things to get a test-bucket. The app
> itself is not aware of this, and just uses whatever bucket is passed to it.
>
> This way the test procedure can use the test-bucket and does not interfere
> with the
> original data and the app can be run 'as-is' without changing the code.
>
> I have the following question:
> Is this scenario at all possible?
> - if yes: how can I accomplish this?
> - if no: is this a design flaw (it can be done, just not this way)
>  or is it simply not possible?
>
> --
>
> Best regards,
>
> Fred Gansevles
> *Devops Engineer*
> 
> Auke Vleerstraat 140 E *T* +31 (0) 53 48 00 680
> 7547 AN Enschede *E* fgansev...@betterbe.com
> CC no. 08097527
> *M* +31 (0)6 30 262 174 www.betterbe.com
> BetterBe accepts no liability for the content of this email, or for the
> consequences of any actions taken on the basis of the information provided,
> unless that information is subsequently confirmed in writing. If you are
> not the intended recipient you are notified that disclosing, copying,
> distributing or taking any action in reliance on the contents of this
> information is strictly prohibited.
>


[ceph-users] bucket cloning/writable snapshots

2017-11-18 Thread Fred Gansevles
Hi,
 
Currently our company has +/- 50 apps where every app has its own
data-area on NFS.
We need to switch to S3, using Ceph, as our new data layer, with
every app using its own s3-bucket, equivalent to the NFS data-area.
The sizes of the data-areas, depending on the app, vary from 1.3 GB to
358 GB.

In order to test multiple versions of the app, we currently make a
writable snapshot of
the data-area to avoid copying or 'polluting' the original (i.e.
'production' data).
Since a snapshot is fast, we can make multiple snapshots easily and discard them
afterwards.

With Ceph, we would like to do something alike, i.e. 'fast' copying and
easily discardable.
The path we are trying to take is: make a 'clone' of the bucket (i.e.
writable
snapshots) into a test-bucket.

Our design for this is the following:
- We need to have every app bucket in its own unique pool-set.
- (on the ceph-nodes) determine the pool-set that is used by the given
bucket
- (on the ceph-nodes) make snapshots of the pools and assign these
snapshots to
  an newly created (writable!) test-bucket.
- after the test is finished, the test-bucket can be removed (either
from the ceph-node
  or the test system).

The test procedure that is testing the app is aware of its environment,
i.e. it 'knows' that it has to do special things to get a test-bucket. The app
itself is not aware of this, and just uses whatever bucket is passed to it.

This way the test procedure can use the test-bucket and does not
interfere with the
original data and the app can be run 'as-is' without changing the code.

I have the following question:
    Is this scenario at all possible?
    - if yes: how can I accomplish this?
    - if no: is this a design flaw (it can be done, just not this way)
 or is it simply not possible?

-- 

Best regards,

Fred Gansevles
/Devops Engineer/


Auke Vleerstraat 140 E  *T* +31 (0) 53 48 00 680

7547 AN Enschede*E* fgansev...@betterbe.com

CC no. 08097527

*M* +31 (0)6 30 262 174 
www.betterbe.com 
BetterBe accepts no liability for the content of this email, or for the
consequences of any actions taken on the basis of the information
provided, unless that information is subsequently confirmed in writing.
If you are not the intended recipient you are notified that disclosing,
copying, distributing or taking any action in reliance on the contents
of this information is strictly prohibited.
