Re: [ceph-users] Failed to encode map errors

2019-12-03 Thread Martin Verges
Hello,

what versions of Ceph are you running?
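
If the cluster is on Luminous or newer, something like this shows whether the
daemons are running mismatched releases, which is a common cause of the
"failed to encode map ... with expected crc" warning (just a sketch, run from
any admin node):

  ceph versions              # running versions, summarised per daemon type
  ceph tell osd.* version    # per-OSD versions, to spot any stragglers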

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Tue, 3 Dec 2019 at 19:05, John Hearns wrote:

> And me again for the second time in one day.
>
> ceph -w is now showing messages like this:
>
> 2019-12-03 15:17:22.426988 osd.6 [WRN] failed to encode map e28961 with
> expected crc
>
> Any advice please?
>
> *Kheiron Medical Technologies*
>
> kheironmed.com | supporting radiologists with deep learning
>


[ceph-users] Shall host weight auto reduce on hdd failure?

2019-12-03 Thread Milan Kupcevic


On HDD failure, the number of placement groups on the remaining OSDs on the
same host goes up. I would expect the failed placement groups to be
redistributed equally across the cluster, not just within the troubled host.
Should the host weight automatically be reduced whenever an OSD goes out?

Exhibit 1: attached osd-df-tree file. The number of placement groups per OSD
on healthy nodes across the cluster is around 160, see osd050 and osd056. The
number of placement groups per OSD on nodes with HDD failures goes noticeably
up, more so as more HDD failures happen on the same node, see osd051 and
osd053.

This cluster can handle this case at the moment as it has plenty of free
space. I wonder how this is going to play out when we get to 90% usage on the
whole cluster. A single backplane failure in a node takes four drives out at
once; that is 30% of the storage space on a node. The whole cluster would
have enough space to host the failed placement groups, but one node would not.

This cluster is running Nautilus 14.2.0 with default settings deployed
using ceph-ansible.
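
A possible workaround (untested here) would be to zero the failed OSD's CRUSH
weight, so that the host bucket's weight drops with it and backfill spreads
across the whole cluster instead of only the troubled host. A sketch, taking
osd.408 from the attached output as an example:

  # remove the failed drive's weight from the host bucket
  ceph osd crush reweight osd.408 0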


Milan


-- 
Milan Kupcevic
Senior Cyberinfrastructure Engineer at Project NESE
Harvard University
FAS Research Computing



> ceph osd df tree name osd050
ID   CLASS WEIGHT    REWEIGHT SIZE    RAW USE DATA     OMAP    META    AVAIL    %USE  VAR  PGS STATUS TYPE NAME
-130       110.88315        - 111 TiB 6.0 TiB  4.7 TiB 563 MiB  21 GiB 105 TiB  5.39 1.00    -        host osd050
 517   hdd   9.20389  1.0     9.2 TiB 442 GiB  329 GiB  16 KiB 1.7 GiB 8.8 TiB  4.69 0.87  157     up osd.517
 532   hdd   9.20389  1.0     9.2 TiB 465 GiB  352 GiB  32 KiB 1.8 GiB 8.7 TiB  4.94 0.92  170     up osd.532
 544   hdd   9.20389  1.0     9.2 TiB 447 GiB  334 GiB  32 KiB 1.8 GiB 8.8 TiB  4.74 0.88  153     up osd.544
 562   hdd   9.20389  1.0     9.2 TiB 440 GiB  328 GiB  64 KiB 1.5 GiB 8.8 TiB  4.67 0.87  159     up osd.562
 575   hdd   9.20389  1.0     9.2 TiB 479 GiB  366 GiB  88 KiB 1.9 GiB 8.7 TiB  5.08 0.94  175     up osd.575
 592   hdd   9.20389  1.0     9.2 TiB 434 GiB  321 GiB  24 KiB 1.4 GiB 8.8 TiB  4.60 0.85  153     up osd.592
 605   hdd   9.20389  1.0     9.2 TiB 456 GiB  343 GiB     0 B 1.5 GiB 8.8 TiB  4.84 0.90  170     up osd.605
 618   hdd   9.20389  1.0     9.2 TiB 473 GiB  360 GiB  16 KiB 1.6 GiB 8.7 TiB  5.01 0.93  172     up osd.618
 631   hdd   9.20389  1.0     9.2 TiB 461 GiB  348 GiB  44 KiB 1.5 GiB 8.8 TiB  4.89 0.91  165     up osd.631
 644   hdd   9.20389  1.0     9.2 TiB 459 GiB  346 GiB  92 KiB 1.7 GiB 8.8 TiB  4.87 0.90  163     up osd.644
 656   hdd   9.20389  1.0     9.2 TiB 433 GiB  320 GiB  68 KiB 1.4 GiB 8.8 TiB  4.59 0.85  156     up osd.656
 669   hdd   9.20389  1.0     9.2 TiB 1.1 TiB 1019 GiB  36 KiB 2.6 GiB 8.1 TiB 12.01 2.23  169     up osd.669
 682   ssd   0.43649  1.0     447 GiB 3.1 GiB  2.1 GiB 562 MiB 462 MiB 444 GiB  0.69 0.13  168     up osd.682
                      TOTAL   111 TiB 6.0 TiB  4.7 TiB 563 MiB  21 GiB 105 TiB  5.39
MIN/MAX VAR: 0.13/2.23  STDDEV: 2.32

> ceph osd df tree name osd051
ID   CLASS WEIGHT    REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL    %USE  VAR  PGS STATUS TYPE NAME
-148       110.88315        -  83 TiB 4.9 TiB 4.0 TiB 573 MiB  20 GiB  78 TiB  5.94 1.00    -        host osd051
 408   hdd   9.20389        0     0 B     0 B     0 B     0 B     0 B     0 B     0    0    0   down osd.408
 538   hdd   9.20389  1.0     9.2 TiB 542 GiB 429 GiB  24 KiB 2.4 GiB 8.7 TiB  5.75 0.97  212     up osd.538
 552   hdd   9.20389        0     0 B     0 B     0 B     0 B     0 B     0 B     0    0    0   down osd.552
 565   hdd   9.20389        0     0 B     0 B     0 B     0 B     0 B     0 B     0    0    0   down osd.565
 578   hdd   9.20389  1.0     9.2 TiB 557 GiB 444 GiB  56 KiB 2.0 GiB 8.7 TiB  5.91 0.99  213     up osd.578
 590   hdd   9.20389  1.0     9.2 TiB 533 GiB 420 GiB  34 KiB 2.4 GiB 8.7 TiB  5.66 0.95  212     up osd.590
 603   hdd   9.20389  1.0     9.2 TiB 562 GiB 449 GiB  76 KiB 2.2 GiB 8.7 TiB  5.96 1.00  218     up osd.603
 616   hdd   9.20389  1.0     9.2 TiB 553 GiB 440 GiB  16 KiB 2.2 GiB 8.7 TiB  5.86 0.99  217     up osd.616
 629   hdd   9.20389  1.0     9.2 TiB 579 GiB 466 GiB  40 KiB 2.0 GiB 8.6 TiB  6.14 1.03  228     up osd.629
 642   hdd   9.20389  1.0     9.2 TiB 588 GiB 475 GiB  40 KiB 2.6 GiB 8.6 TiB  6.23 1.05  228     up osd.642
 655   hdd   9.20389  1.0     9.2 TiB 583 GiB 470 GiB  32 KiB 2.3 GiB 8.6 TiB  6.18 1.04  223     up osd.655
 668   hdd   9.20389  1.0     9.2 TiB 570 GiB 457 GiB  32 KiB 1.9 GiB 8.6 TiB  6.05 1.02  229     up osd.668
 681   ssd   0.43649  1.0     447 GiB 3.1 GiB 2.1 GiB 573 MiB 451 MiB 444 GiB  0.69 0.12  167     up osd.681
                      TOTAL    83 TiB 4.9 TiB 4.0 TiB 573 MiB

Re: [ceph-users] Revert a CephFS snapshot?

2019-12-03 Thread Luis Henriques
On Tue, Dec 03, 2019 at 02:09:30PM -0500, Jeff Layton wrote:
> On Tue, 2019-12-03 at 07:59 -0800, Robert LeBlanc wrote:
> > On Thu, Nov 14, 2019 at 11:48 AM Sage Weil  wrote:
> > > On Thu, 14 Nov 2019, Patrick Donnelly wrote:
> > > > On Wed, Nov 13, 2019 at 6:36 PM Jerry Lee  
> > > > wrote:
> > > > >
> > > > > On Thu, 14 Nov 2019 at 07:07, Patrick Donnelly  
> > > > > wrote:
> > > > > >
> > > > > > On Wed, Nov 13, 2019 at 2:30 AM Jerry Lee  
> > > > > > wrote:
> > > > > > > Recently, I'm evaluating the snapshot feature of CephFS from the
> > > > > > > kernel client and everything works like a charm.  But it seems
> > > > > > > that reverting a snapshot is not available currently.  Is there
> > > > > > > some reason or technical limitation that the feature is not
> > > > > > > provided?  Any insights or ideas are appreciated.
> > > > > >
> > > > > > Please provide more information about what you tried to do (commands
> > > > > > run) and how it surprised you.
> > > > >
> > > > > The thing I would like to do is to roll back a snapped directory to a
> > > > > previous version of a snapshot.  It looks like the operation can be done
> > > > > by overwriting all the current versions of files/directories from a
> > > > > previous snapshot via cp.  But cp may take lots of time when there are
> > > > > many files and directories in the target directory.  Is there any
> > > > > possibility to achieve the goal much faster from the CephFS internals
> > > > > via a command like "ceph fs   snap rollback
> > > > > " (just an example)?  Thank you!
> > > > 
> > > > RADOS doesn't support rollback of snapshots so it needs to be done
> > > > manually. The best tool to do this would probably be rsync of the
> > > > .snap directory with appropriate options including deletion of files
> > > > that do not exist in the source (snapshot).
> > > 
> > > rsync is the best bet now, yeah.
> > > 
> > > RADOS does have a rollback operation that uses clone where it can, but 
> > > it's a per-object operation, so something still needs to walk the 
> > > hierarchy and roll back each file's content.  The MDS could do this more 
> > > efficiently than rsync given what it knows about the snapped inodes 
> > > (skipping untouched inodes or, eventually, entire subtrees) but it's a 
> > > non-trivial amount of work to implement.
> > > 
> > 
> > Would it make sense to extend CephFS to leverage reflinks for cases like 
> > this? That could be faster than rsync and more space efficient. It would 
> > require some development time though.
> > 
> 
> I think reflink would be hard. Ceph hardcodes the inode number into the
> object name of the backing objects, so sharing between different inode
> numbers is really difficult to do. It could be done, but it means a new
> in-object-store layout scheme.
> 
> That said...I wonder if we could get better performance by just
> converting rsync to use copy_file_range in this situation. That has the
> potential to offload a lot of the actual copying work to the OSDs. 

Just to add my 2 cents, I haven't done any serious performance
measurements with copy_file_range.  However, the very limited
observations I've done surprised me a bit, showing that performance
isn't great.  In fact, when the file's object size is small, using
copy_file_range seems to be slower than a full read+write cycle.

It's still on my TODO list to do some more serious performance analysis
and figure out why.  It didn't seem to be an issue on the client side,
but I don't really have any real evidence.  Once the COPY_FROM2
operation is stable, I plan to spend some time on this.
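
For completeness, the rsync-based rollback suggested earlier in the thread
would look roughly like this (a sketch only; the mount point, directory and
snapshot name are made up):

  # roll /mnt/cephfs/mydir back to the state captured in snapshot "before-change"
  rsync -a --delete /mnt/cephfs/mydir/.snap/before-change/ /mnt/cephfs/mydir/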

Cheers,
--
Luís


Re: [ceph-users] osds way ahead of gateway version?

2019-12-03 Thread Gregory Farnum
Unfortunately RGW doesn't test against extended version differences
like this and I don't think it's compatible across more than one major
release. Basically it's careful to support upgrades between long-term
stable releases but nothing else is expected to work.

That said, getting off of Giant would be good; it's quite old! :)
-Greg

On Tue, Dec 3, 2019 at 3:27 PM Philip Brown  wrote:
>
>
> I'm in a situation where it would be extremely strategically advantageous to
> run some OSDs on Luminous (so we can try out BlueStore) while the gateways
> stay on Giant.
> Is this a terrible, terrible thing, or can we reasonably get away with it?
>
> Points of interest:
> 1. I plan to make a new pool for this and keep it all BlueStore.
> 2. We only use the cluster for RBDs.
>
> --
> Philip Brown| Sr. Linux System Administrator | Medata, Inc.
> 5 Peters Canyon Rd Suite 250
> Irvine CA 92606
> Office 714.918.1310| Fax 714.918.1325
> pbr...@medata.com| www.medata.com


[ceph-users] osds way ahead of gateway version?

2019-12-03 Thread Philip Brown


I'm in a situation where it would be extremely strategically advantageous to run
some OSDs on Luminous (so we can try out BlueStore) while the gateways stay on
Giant.
Is this a terrible, terrible thing, or can we reasonably get away with it?

Points of interest:
1. I plan to make a new pool for this and keep it all BlueStore.
2. We only use the cluster for RBDs.

--
Philip Brown| Sr. Linux System Administrator | Medata, Inc. 
5 Peters Canyon Rd Suite 250 
Irvine CA 92606 
Office 714.918.1310| Fax 714.918.1325 
pbr...@medata.com| www.medata.com 


Re: [ceph-users] RGW performance with low object sizes

2019-12-03 Thread Paul Emmerich
On Tue, Dec 3, 2019 at 6:43 PM Robert LeBlanc  wrote:
>
> On Tue, Dec 3, 2019 at 9:11 AM Ed Fisher  wrote:
>>
>>
>>
>> On Dec 3, 2019, at 10:28 AM, Robert LeBlanc  wrote:
>>
>> Did you make progress on this? We have a ton of < 64K objects as well and 
>> are struggling to get good performance out of our RGW. Sometimes we have RGW 
>> instances that are just gobbling up CPU even when there are no requests to 
>> them, so it seems like things are getting hung up somewhere. There is 
>> nothing in the logs and I haven't had time to do more troubleshooting.
>>
>>
>> There's a bug in the current stable Nautilus release that causes a loop 
>> and/or crash in get_obj_data::flush (you should be able to see it gobbling 
>> up CPU in perf top). This is the related issue: 
>> https://tracker.ceph.com/issues/39660 -- it should be fixed as soon as 
>> 14.2.5 is released (any day now, supposedly).
>
>
> We will try out the new version when it's released and see if it improves 
> things for us.

I can confirm that what you are describing sounds like the issue
linked above; yeah, the issue talks mainly about crashes, but that's
the "good" version of this bug. The bad just hangs the thread in an
infinite loop, I've once debugged this in more detail... the added
locks in linked pull request fixed this.
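
If you want to check whether you're hitting it, something along these lines
should show get_obj_data::flush near the top of the profile for the affected
radosgw process (a sketch; adjust the PID lookup to your setup):

  # perf top accepts a comma-separated PID list
  perf top -p "$(pgrep -d, -x radosgw)"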


Paul

>
> Thanks,
> Robert LeBlanc
>
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


[ceph-users] Failed to encode map errors

2019-12-03 Thread John Hearns
And me again for the second time in one day.

ceph -w is now showing messages like this:

2019-12-03 15:17:22.426988 osd.6 [WRN] failed to encode map e28961 with
expected crc

Any advice please?

--
*Kheiron Medical Technologies*
kheironmed.com | supporting radiologists with deep learning




Re: [ceph-users] RGW performance with low object sizes

2019-12-03 Thread Robert LeBlanc
On Tue, Dec 3, 2019 at 9:11 AM Ed Fisher  wrote:

>
>
> On Dec 3, 2019, at 10:28 AM, Robert LeBlanc  wrote:
>
> Did you make progress on this? We have a ton of < 64K objects as well and
> are struggling to get good performance out of our RGW. Sometimes we have
> RGW instances that are just gobbling up CPU even when there are no requests
> to them, so it seems like things are getting hung up somewhere. There is
> nothing in the logs and I haven't had time to do more troubleshooting.
>
>
> There's a bug in the current stable Nautilus release that causes a loop
> and/or crash in get_obj_data::flush (you should be able to see it gobbling
> up CPU in perf top). This is the related issue:
> https://tracker.ceph.com/issues/39660 -- it should be fixed as soon as
> 14.2.5 is released (any day now, supposedly).
>

We will try out the new version when it's released and see if it improves
things for us.

Thanks,
Robert LeBlanc


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] RGW performance with low object sizes

2019-12-03 Thread Ed Fisher


> On Dec 3, 2019, at 10:28 AM, Robert LeBlanc  wrote:
> 
> Did you make progress on this? We have a ton of < 64K objects as well and are 
> struggling to get good performance out of our RGW. Sometimes we have RGW 
> instances that are just gobbling up CPU even when there are no requests to 
> them, so it seems like things are getting hung up somewhere. There is nothing 
> in the logs and I haven't had time to do more troubleshooting.
> 

There's a bug in the current stable Nautilus release that causes a loop and/or 
crash in get_obj_data::flush (you should be able to see it gobbling up CPU in 
perf top). This is the related issue: https://tracker.ceph.com/issues/39660 
 -- it should be fixed as soon as 14.2.5 
is released (any day now, supposedly).

Hope this helps,
Ed


Re: [ceph-users] RGW performance with low object sizes

2019-12-03 Thread Robert LeBlanc
On Tue, Nov 19, 2019 at 9:34 AM Christian  wrote:

> Hi,
>
> I used https://github.com/dvassallo/s3-benchmark to measure some
>> performance values for the rgws and got some unexpected results.
>> Everything above 64K has excellent performance but below it drops down to
>> a fraction of the speed and responsiveness resulting in even 256K objects
>> being faster than anything below 64K.
>> Does anyone observe similar effects while running this benchmark? Is it
>> the benchmark's fault or are there some options to tweak performance for low
>> object sizes?
>>
>
> I have an update on the performance  issues with reading rgw objects
> smaller than 64K. If I reduce the sample size to a lower number (number of
> my threads in this case) I get good values for the first objects.
> In this example for 32K objects: The more samples I use the worse my
> average and my percentiles get. I interpret this as: The first 32K objects
> are read fast but the consecutive reads are getting slower the more objects
> are read.
>
> Here are 3 examples with 8 threads, 32K and 64K with 8, 12, 16 and 128
> samples to demonstrate how the performance degrades with more samples.
> Notice how the performance is diminished drastically for 32K objects but
> not as badly for 64K objects.
>
> Is there a config option that can be tuned to get better performance for
> 32K objects? Like a cache or buffer setting I didn't find yet...
> Can anyone confirm the same issue?
>
> $ ./s3-benchmark -region us-east-1 -threads-min=8 -threads-max=8
> -payloads-min=6 -payloads-max=7 -samples=8 -endpoint
> https://mymungedrgwendpoint.de -bucket-name=bench -create-bucket=false
> [...section SETUP was deleted...]
> Download performance with 32 KB objects
>
>   Threads: 8 | Throughput: 6.1 MB/s
>   Time to First Byte (ms): avg 8, min 2, p25 2, p50 3, p75 4, p90 4, p99 4, max 41
>   Time to Last Byte  (ms): avg 8, min 2, p25 2, p50 3, p75 4, p90 4, p99 4, max 41
>
> Download performance with 64 KB objects
>
>   Threads: 8 | Throughput: 122.5 MB/s
>   Time to First Byte (ms): avg 3, min 2, p25 3, p50 3, p75 3, p90 4, p99 4, max 4
>   Time to Last Byte  (ms): avg 3, min 3, p25 3, p50 3, p75 3, p90 4, p99 4, max 4
> [...section CLEANUP was deleted...]
>
>
> $./s3-benchmark -region us-east-1 -threads-min=8 -threads-max=8
> -payloads-min=6 -payloads-max=7 -samples=12 -endpoint
> https://mymungedrgwendpoint.de -bucket-name=bench -create-bucket=false
> [...section SETUP was deleted...]
> Download performance with 32 KB objects
>
>  
> +-+
>|Time to First Byte (ms)
>   |Time to Last Byte (ms)  |
>
> +-++++
> | Threads | Throughput |  avg   min   p25   p50   p75   p90   p99
> max |  avg   min   p25   p50   p75   p90   p99   max |
>
> +-++++
> |   8 |   7.8 MB/s |   20 3 3 4414444
>  44 |   20 3 3 441444444 |
>
> +-++++
>
> Download performance with 64 KB objects
>
>  
> +-+
>|Time to First Byte (ms)
>   |Time to Last Byte (ms)  

Re: [ceph-users] Missing Ceph perf-counters in Ceph-Dashboard or Prometheus/InfluxDB...?

2019-12-03 Thread Ernesto Puerta
Thanks, Benjeman! I created this pad
(https://pad.ceph.com/p/perf-counters-to-expose) so we can list them
there.

An alternative approach could also be to allow whitelisting some
perf-counters, so they would be exported regardless of their priority.
This would let users customize which counters they want to expose
without the need for any further code changes.
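
In the meantime, counters below PRIO_USEFUL can still be read directly from
the daemon admin socket on the host where the daemon runs, e.g. (osd.0 is
just an example):

  ceph daemon osd.0 perf dump     # all counters, regardless of priority
  ceph daemon osd.0 perf schema   # per-counter metadata, including priority on recent releases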

Kind Regards,
Ernesto


On Tue, Dec 3, 2019 at 2:50 PM Benjeman Meekhof  wrote:
>
> I'd like to see a few of the cache tier counters exposed.  You get
> some info on cache activity in 'ceph -s' so it makes sense from my
> perspective to have similar availability in exposed counters.
>
> There's a tracker for this request (opened by me a while ago):
> https://tracker.ceph.com/issues/37156
>
> thanks,
> Ben
>
>
>
> On Tue, Dec 3, 2019 at 8:36 AM Ernesto Puerta  wrote:
> >
> > Hi Cephers,
> >
> > As a result of this tracker (https://tracker.ceph.com/issues/42961)
> > Neha and I were wondering if there would be other perf-counters deemed
> > by users/operators as worthy to be exposed via ceph-mgr modules for
> > monitoring purposes.
> >
> > The default behaviour is that only perf-counters with priority
> > PRIO_USEFUL (5) or higher are exposed (via `get_all_perf_counters` API
> > call) to ceph-mgr modules (including Dashboard, DiskPrediction or
> > Prometheus/InfluxDB/Telegraf exporters).
> >
> > While changing that is rather trivial, it could make sense to get
> > users' feedback and come up with a list of missing perf-counters to be
> > exposed.
> >
> > Kind regards,
> > Ernesto
> >


Re: [ceph-users] Revert a CephFS snapshot?

2019-12-03 Thread Robert LeBlanc
On Thu, Nov 14, 2019 at 11:48 AM Sage Weil  wrote:

> On Thu, 14 Nov 2019, Patrick Donnelly wrote:
> > On Wed, Nov 13, 2019 at 6:36 PM Jerry Lee 
> wrote:
> > >
> > > On Thu, 14 Nov 2019 at 07:07, Patrick Donnelly 
> wrote:
> > > >
> > > > On Wed, Nov 13, 2019 at 2:30 AM Jerry Lee 
> wrote:
> > > > > Recently, I'm evaluating the snapshot feature of CephFS from the kernel
> > > > > client and everything works like a charm.  But it seems that reverting
> > > > > a snapshot is not available currently.  Is there some reason or
> > > > > technical limitation that the feature is not provided?  Any insights
> > > > > or ideas are appreciated.
> > > >
> > > > Please provide more information about what you tried to do (commands
> > > > run) and how it surprised you.
> > >
> > > The thing I would like to do is to roll back a snapped directory to a
> > > previous version of a snapshot.  It looks like the operation can be done
> > > by overwriting all the current versions of files/directories from a
> > > previous snapshot via cp.  But cp may take lots of time when there are
> > > many files and directories in the target directory.  Is there any
> > > possibility to achieve the goal much faster from the CephFS internals
> > > via a command like "ceph fs   snap rollback
> > > " (just an example)?  Thank you!
> >
> > RADOS doesn't support rollback of snapshots so it needs to be done
> > manually. The best tool to do this would probably be rsync of the
> > .snap directory with appropriate options including deletion of files
> > that do not exist in the source (snapshot).
>
> rsync is the best bet now, yeah.
>
> RADOS does have a rollback operation that uses clone where it can, but
> it's a per-object operation, so something still needs to walk the
> hierarchy and roll back each file's content.  The MDS could do this more
> efficiently than rsync given what it knows about the snapped inodes
> (skipping untouched inodes or, eventually, entire subtrees) but it's a
> non-trivial amount of work to implement.
>
Would it make sense to extend CephFS to leverage reflinks for cases like
this? That could be faster than rsync and more space efficient. It would
require some development time though.


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] v13.2.7 osds crash in build_incremental_map_msg

2019-12-03 Thread Dan van der Ster
I created https://tracker.ceph.com/issues/43106 and we're downgrading
our osds back to 13.2.6.
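
In case it helps anyone else chasing this: with the crash mgr module enabled
(it has to be switched on explicitly on Mimic, if I remember correctly), the
stored crash reports can be listed and inspected like this (a sketch):

  ceph mgr module enable crash
  ceph crash ls
  ceph crash info <crash-id>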

-- dan

On Tue, Dec 3, 2019 at 4:09 PM Dan van der Ster  wrote:
>
> Hi all,
>
> We're midway through an update from 13.2.6 to 13.2.7 and started
> getting OSDs crashing regularly like this [1].
> Does anyone obviously know what the issue is? (Maybe
> https://github.com/ceph/ceph/pull/26448/files ?)
> Or is it some temporary problem while we still have v13.2.6 and
> v13.2.7 osds running concurrently?
>
> Thanks!
>
> Dan
>
> [1]
>
> 2019-12-03 15:53:51.817 7ff3a3d39700 -1 osd.1384 2758889
> build_incremental_map_msg missing incremental map 2758889
> 2019-12-03 15:53:51.817 7ff3a453a700 -1 osd.1384 2758889
> build_incremental_map_msg missing incremental map 2758889
> 2019-12-03 15:53:51.817 7ff3a453a700 -1 osd.1384 2758889
> build_incremental_map_msg unable to load latest map 2758889
> 2019-12-03 15:53:51.822 7ff3a453a700 -1 *** Caught signal (Aborted) **
>  in thread 7ff3a453a700 thread_name:tp_osd_tp
>
>  ceph version 13.2.7 (71bd687b6e8b9424dd5e5974ed542595d8977416) mimic (stable)
>  1: (()+0xf5f0) [0x7ff3c620b5f0]
>  2: (gsignal()+0x37) [0x7ff3c522b337]
>  3: (abort()+0x148) [0x7ff3c522ca28]
>  4: (OSDService::build_incremental_map_msg(unsigned int, unsigned int,
> OSDSuperblock&)+0x767) [0x555d60e8d797]
>  5: (OSDService::send_incremental_map(unsigned int, Connection*,
> std::shared_ptr&)+0x39e) [0x555d60e8dbee]
>  6: (OSDService::share_map_peer(int, Connection*,
> std::shared_ptr)+0x159) [0x555d60e8eda9]
>  7: (OSDService::send_message_osd_cluster(int, Message*, unsigned
> int)+0x1a5) [0x555d60e8f085]
>  8: (ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&,
> unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t,
> hobject_t, std::vector
> > const&, boost::optional&,
> ReplicatedBackend::InProgressOp*, ObjectStore::Transaction&)+0x452)
> [0x555d6116e522]
>  9: (ReplicatedBackend::submit_transaction(hobject_t const&,
> object_stat_sum_t const&, eversion_t const&,
> std::unique_ptr >&&,
> eversion_t const&, eversion_t const&, std::vector std::allocator > const&,
> boost::optional&, Context*, unsigned long,
> osd_reqid_t, boost::intrusive_ptr)+0x6f5) [0x555d6117ed85]
>  10: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*,
> PrimaryLogPG::OpContext*)+0xd62) [0x555d60ff5142]
>  11: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xf12)
> [0x555d61035902]
>  12: (PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x3679)
> [0x555d610397a9]
>  13: (PrimaryLogPG::do_request(boost::intrusive_ptr&,
> ThreadPool::TPHandle&)+0xc99) [0x555d6103d869]
>  14: (OSD::dequeue_op(boost::intrusive_ptr,
> boost::intrusive_ptr, ThreadPool::TPHandle&)+0x1b7)
> [0x555d60e8e8a7]
>  15: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr&,
> ThreadPool::TPHandle&)+0x62) [0x555d611144c2]
>  16: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x592) [0x555d60eb25f2]
>  17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d3)
> [0x7ff3c929f5b3]
>  18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7ff3c92a01a0]
>  19: (()+0x7e65) [0x7ff3c6203e65]
>  20: (clone()+0x6d) [0x7ff3c52f388d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.


[ceph-users] RGW bucket stats - strange behavior & slow performance requiring RGW restarts

2019-12-03 Thread David Monschein
Hi all,

I've been observing some strange behavior with my object storage cluster
running Nautilus 14.2.4. We currently have around 1800 buckets (A small
percentage of those buckets are actively used), with a total of 13.86M
objects. We have 20 RGWs right now, 10 for regular S3 access, and 10 for
static sites.

When calling $(radosgw-admin bucket stats), it normally comes back within a
few seconds, usually less than five. This returns stats for all buckets in
the cluster, which we use for accounting.

The strange behavior: Lately we've been observing a gradual increase in
runtime for bucket stats, which in extreme cases can take almost 10 minutes
to return. Things start out fine, and over the course of the week, the
runtime increases from a few seconds to almost 10 minutes. Restarting all
of the S3 RGWs seems to fix this problem immediately. If we restart all the
radosgw processes, the runtime for bucket stats drops to 3 seconds.

This is odd behavior, and I've found nothing so far that would indicate why
this is happening. There is nothing suspicious in the RGWs, although a
message about aborted multi-part uploads is in there:

2019-12-02 13:12:52.882 7faa7018f700 0 abort_bucket_multiparts WARNING :
aborted 8553000 incomplete multipart uploads

Otherwise, things look normal. Memory usage is low, CPU load is relatively
low and flat, and the cluster itself is not under heavy load.

Has anyone run into this before?
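
Somewhat tangential to the slowdown itself, but if those ~8.5M incomplete
multipart uploads are unwanted, a bucket lifecycle rule can reap them
automatically, assuming your RGW build supports the
AbortIncompleteMultipartUpload action. A sketch using the aws CLI; the
endpoint and bucket name are made up:

  aws --endpoint-url https://rgw.example.com s3api put-bucket-lifecycle-configuration \
    --bucket my-bucket \
    --lifecycle-configuration '{"Rules":[{"ID":"abort-stale-multiparts","Status":"Enabled",
      "Filter":{"Prefix":""},"AbortIncompleteMultipartUpload":{"DaysAfterInitiation":3}}]}'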


[ceph-users] v13.2.7 osds crash in build_incremental_map_msg

2019-12-03 Thread Dan van der Ster
Hi all,

We're midway through an update from 13.2.6 to 13.2.7 and started
getting OSDs crashing regularly like this [1].
Does anyone obviously know what the issue is? (Maybe
https://github.com/ceph/ceph/pull/26448/files ?)
Or is it some temporary problem while we still have v13.2.6 and
v13.2.7 osds running concurrently?

Thanks!

Dan

[1]

2019-12-03 15:53:51.817 7ff3a3d39700 -1 osd.1384 2758889
build_incremental_map_msg missing incremental map 2758889
2019-12-03 15:53:51.817 7ff3a453a700 -1 osd.1384 2758889
build_incremental_map_msg missing incremental map 2758889
2019-12-03 15:53:51.817 7ff3a453a700 -1 osd.1384 2758889
build_incremental_map_msg unable to load latest map 2758889
2019-12-03 15:53:51.822 7ff3a453a700 -1 *** Caught signal (Aborted) **
 in thread 7ff3a453a700 thread_name:tp_osd_tp

 ceph version 13.2.7 (71bd687b6e8b9424dd5e5974ed542595d8977416) mimic (stable)
 1: (()+0xf5f0) [0x7ff3c620b5f0]
 2: (gsignal()+0x37) [0x7ff3c522b337]
 3: (abort()+0x148) [0x7ff3c522ca28]
 4: (OSDService::build_incremental_map_msg(unsigned int, unsigned int,
OSDSuperblock&)+0x767) [0x555d60e8d797]
 5: (OSDService::send_incremental_map(unsigned int, Connection*,
std::shared_ptr&)+0x39e) [0x555d60e8dbee]
 6: (OSDService::share_map_peer(int, Connection*,
std::shared_ptr)+0x159) [0x555d60e8eda9]
 7: (OSDService::send_message_osd_cluster(int, Message*, unsigned
int)+0x1a5) [0x555d60e8f085]
 8: (ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&,
unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t,
hobject_t, std::vector
> const&, boost::optional&,
ReplicatedBackend::InProgressOp*, ObjectStore::Transaction&)+0x452)
[0x555d6116e522]
 9: (ReplicatedBackend::submit_transaction(hobject_t const&,
object_stat_sum_t const&, eversion_t const&,
std::unique_ptr >&&,
eversion_t const&, eversion_t const&, std::vector > const&,
boost::optional&, Context*, unsigned long,
osd_reqid_t, boost::intrusive_ptr)+0x6f5) [0x555d6117ed85]
 10: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*,
PrimaryLogPG::OpContext*)+0xd62) [0x555d60ff5142]
 11: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xf12)
[0x555d61035902]
 12: (PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x3679)
[0x555d610397a9]
 13: (PrimaryLogPG::do_request(boost::intrusive_ptr&,
ThreadPool::TPHandle&)+0xc99) [0x555d6103d869]
 14: (OSD::dequeue_op(boost::intrusive_ptr,
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x1b7)
[0x555d60e8e8a7]
 15: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr&,
ThreadPool::TPHandle&)+0x62) [0x555d611144c2]
 16: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x592) [0x555d60eb25f2]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d3)
[0x7ff3c929f5b3]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7ff3c92a01a0]
 19: (()+0x7e65) [0x7ff3c6203e65]
 20: (clone()+0x6d) [0x7ff3c52f388d]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.


Re: [ceph-users] HA and data recovery of CEPH

2019-12-03 Thread Wido den Hollander




On 12/3/19 3:07 PM, Aleksey Gutikov wrote:
>> That is true. When an OSD goes down it will take a few seconds for its
>> Placement Groups to re-peer with the other OSDs. During that period
>> writes to those PGs will stall for a couple of seconds.
>>
>> I wouldn't say it's 40s, but it can take ~10s.
>
> Hello,
>
> According to my experience, in case of OSD crashes or kill -9 (any kind of
> abnormal termination), OSD failure handling consists of the following steps:
> 1) The failed OSD's peers detect that it does not respond - it can take up
> to osd_heartbeat_grace + osd_heartbeat_interval seconds.

If a 'Connection Refused' is detected, the OSD will be marked down
right away.

> 2) Peers send reports to the monitor.
> 3) The monitor makes a decision according to options from its own config
> (mon_osd_adjust_heartbeat_grace, osd_heartbeat_grace,
> mon_osd_laggy_halflife, mon_osd_min_down_reporters, ...) and finally marks
> the OSD down in the osdmap.

True.

> 4) The monitor sends the updated OSDMap to OSDs and clients.
> 5) OSDs start peering.
> 5.1) Peering itself is a complicated process; for example, we have
> experienced PGs stuck in an inactive state due to
> osd_max_pg_per_osd_hard_ratio.

I would say that 5.1 isn't relevant for most cases. Yes, it can happen,
but it's rare.

> 6) Peering finishes (PGs' data continue moving) - clients can normally
> access the affected PGs. Clients also have their own timeouts that can
> affect time to recover.
>
> Again, according to my experience, 40s with default settings is possible.

40s is possible in certain scenarios. But I wouldn't say that's the
default for all cases.

Wido






Re: [ceph-users] HA and data recovery of CEPH

2019-12-03 Thread Aleksey Gutikov




That is true. When an OSD goes down it will take a few seconds for its
Placement Groups to re-peer with the other OSDs. During that period
writes to those PGs will stall for a couple of seconds.

I wouldn't say it's 40s, but it can take ~10s.


Hello,

According to my experience, in case of OSD crashes or kill -9 (any kind of
abnormal termination), OSD failure handling consists of the following steps:
1) The failed OSD's peers detect that it does not respond - it can take up
to osd_heartbeat_grace + osd_heartbeat_interval seconds.

2) Peers send reports to the monitor.
3) The monitor makes a decision according to options from its own config
(mon_osd_adjust_heartbeat_grace, osd_heartbeat_grace,
mon_osd_laggy_halflife, mon_osd_min_down_reporters, ...) and finally marks
the OSD down in the osdmap.

4) The monitor sends the updated OSDMap to OSDs and clients.
5) OSDs start peering.
5.1) Peering itself is a complicated process; for example, we have
experienced PGs stuck in an inactive state due to
osd_max_pg_per_osd_hard_ratio.
6) Peering finishes (PGs' data continue moving) - clients can normally
access the affected PGs. Clients also have their own timeouts that can
affect time to recover.


Again, according to my experience, 40s with default settings is possible.
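
For anyone who wants to see what their own cluster uses for these, the
relevant options can be read back like this on Nautilus and newer (a sketch;
on older releases check ceph.conf or the daemons' admin sockets instead):

  ceph config get osd osd_heartbeat_grace
  ceph config get osd osd_heartbeat_interval
  ceph config get mon mon_osd_adjust_heartbeat_grace
  ceph config get mon mon_osd_min_down_reporters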


--

Best regards!
Aleksei Gutikov


Re: [ceph-users] Osd auth del

2019-12-03 Thread John Hearns
Thank you - ceph auth add did work.

I did try ceph auth get-or-create, but this does not read from an input file
- it will generate a new key.
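
For the archives, the recovery boils down to something like this (a sketch,
assuming the default data directory layout for osd.3; the caps shown are the
standard OSD profile, so adjust if your cluster uses something custom):

  # on the host running osd.3 - the keyring survives 'ceph auth del'
  cat /var/lib/ceph/osd/ceph-3/keyring
  # re-import it with the usual OSD caps
  ceph auth add osd.3 mon 'allow profile osd' mgr 'allow profile osd' \
      osd 'allow *' -i /var/lib/ceph/osd/ceph-3/keyring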

On Tue, 3 Dec 2019 at 13:50, Willem Jan Withagen  wrote:

> On 3-12-2019 11:43, Wido den Hollander wrote:
> >
> >
> > On 12/3/19 11:40 AM, John Hearns wrote:
> >> I had a fat fingered moment yesterday
> >> I typed   ceph auth del osd.3
> >> Where osd.3 is an otherwise healthy little osd
> >> I have not set noout or down on  osd.3 yet
> >>
> >> This is a Nautilus cluster.
> >> ceph health reports everything is OK
> >>
> >
> > Fetch the key from the OSD's datastore on the machine itself. On the OSD
> > machine you'll find a file called keyring.
> >
> > Get that file and import it with the proper caps back into cephx. Then
> > all should be fixed!
>
> The magic incantation there would be:
>
> ceph auth add osd.<id> osd 'allow *' mon 'allow rwx' -i <keyring>
>
> --WjW
>
>

--
*Kheiron Medical Technologies*
kheironmed.com | supporting radiologists with deep learning




Re: [ceph-users] Osd auth del

2019-12-03 Thread Willem Jan Withagen

On 3-12-2019 11:43, Wido den Hollander wrote:
> On 12/3/19 11:40 AM, John Hearns wrote:
>> I had a fat fingered moment yesterday
>> I typed                       ceph auth del osd.3
>> Where osd.3 is an otherwise healthy little osd
>> I have not set noout or down on osd.3 yet
>>
>> This is a Nautilus cluster.
>> ceph health reports everything is OK
>
> Fetch the key from the OSD's datastore on the machine itself. On the OSD
> machine you'll find a file called keyring.
>
> Get that file and import it with the proper caps back into cephx. Then
> all should be fixed!

The magic incantation there would be:

ceph auth add osd.<id> osd 'allow *' mon 'allow rwx' -i <keyring>

--WjW



Re: [ceph-users] Missing Ceph perf-counters in Ceph-Dashboard or Prometheus/InfluxDB...?

2019-12-03 Thread Benjeman Meekhof
I'd like to see a few of the cache tier counters exposed.  You get
some info on cache activity in 'ceph -s' so it makes sense from my
perspective to have similar availability in exposed counters.

There's a tracker for this request (opened by me a while ago):
https://tracker.ceph.com/issues/37156

thanks,
Ben



On Tue, Dec 3, 2019 at 8:36 AM Ernesto Puerta  wrote:
>
> Hi Cephers,
>
> As a result of this tracker (https://tracker.ceph.com/issues/42961)
> Neha and I were wondering if there would be other perf-counters deemed
> by users/operators as worthy to be exposed via ceph-mgr modules for
> monitoring purposes.
>
> The default behaviour is that only perf-counters with priority
> PRIO_USEFUL (5) or higher are exposed (via `get_all_perf_counters` API
> call) to ceph-mgr modules (including Dashboard, DiskPrediction or
> Prometheus/InfluxDB/Telegraf exporters).
>
> While changing that is rather trivial, it could make sense to get
> users' feedback and come up with a list of missing perf-counters to be
> exposed.
>
> Kind regards,
> Ernesto
>


[ceph-users] Missing Ceph perf-counters in Ceph-Dashboard or Prometheus/InfluxDB...?

2019-12-03 Thread Ernesto Puerta
Hi Cephers,

As a result of this tracker (https://tracker.ceph.com/issues/42961)
Neha and I were wondering if there would be other perf-counters deemed
by users/operators as worthy to be exposed via ceph-mgr modules for
monitoring purposes.

The default behaviour is that only perf-counters with priority
PRIO_USEFUL (5) or higher are exposed (via `get_all_perf_counters` API
call) to ceph-mgr modules (including Dashboard, DiskPrediction or
Prometheus/InfluxDB/Telegraf exporters).

While changing that is rather trivial, it could make sense to get
users' feedback and come up with a list of missing perf-counters to be
exposed.

Kind regards,
Ernesto



Re: [ceph-users] Osd auth del

2019-12-03 Thread Wido den Hollander



On 12/3/19 11:40 AM, John Hearns wrote:
> I had a fat fingered moment yesterday
> I typed                       ceph auth del osd.3
> Where osd.3 is an otherwise healthy little osd
> I have not set noout or down on osd.3 yet
>
> This is a Nautilus cluster.
> ceph health reports everything is OK

Fetch the key from the OSD's datastore on the machine itself. On the OSD
machine you'll find a file called keyring.

Get that file and import it with the proper caps back into cephx. Then
all should be fixed!


Wido


> However ceph tell osd.* version hangs when it gets to osd.3
> Also the log ceph-osd.3.log is full of these lines:
>
> 2019-12-03 10:33:29.503 7f010adf1700  0 cephx: verify_authorizer could
> not get service secret for service osd secret_id=10281
> 2019-12-03 10:33:29.591 7f010adf1700  0 auth: could not find secret_id=10281
> 2019-12-03 10:33:29.591 7f010adf1700  0 cephx: verify_authorizer could
> not get service secret for service osd secret_id=10281
> 2019-12-03 10:33:29.819 7f010adf1700  0 auth: could not find secret_id=10281
>
> OK, once you have all stopped laughing some advice would be appreciated.




[ceph-users] Osd auth del

2019-12-03 Thread John Hearns
I had a fat fingered moment yesterday
I typed   ceph auth del osd.3
Where osd.3 is an otherwise healthy little osd
I have not set noout or down on  osd.3 yet

This is a Nautilus cluster.
ceph health reports everything is OK

However ceph tell osd.* version hangs when it gets to osd.3
Also the log ceph-osd.3.log is full of these lines:

2019-12-03 10:33:29.503 7f010adf1700  0 cephx: verify_authorizer could not
get service secret for service osd secret_id=10281
2019-12-03 10:33:29.591 7f010adf1700  0 auth: could not find secret_id=10281
2019-12-03 10:33:29.591 7f010adf1700  0 cephx: verify_authorizer could not
get service secret for service osd secret_id=10281
2019-12-03 10:33:29.819 7f010adf1700  0 auth: could not find secret_id=10281

OK, once you have all stopped laughing some advice would be appreciated.

--
*Kheiron Medical Technologies*
kheironmed.com | supporting radiologists with deep learning

