[ceph-users] Re: Increasing QD=1 performance (lowering latency)

2021-02-08 Thread Paul Emmerich
A few things that you can try on the network side to shave off microseconds:

1) 10G Base-T has quite a bit of latency compared to fiber or DAC. I've
measured 2 µs on Base-T vs. 0.3 µs on fiber for one link in one direction,
so that's roughly 8 µs you can save per round-trip if the path is client ->
switch -> osd and back. Note that my measurement was for small packets; I'm
not sure how big that penalty still is with large packets. Some of it comes
from the large block size (~3 kbit IIRC) of the layer 1 encoding, some is
just processing time of that complex encoding.

2) Setting the switch to cut-through instead of store-and-forward can help,
especially on slower links. Serialization time is 0.8 ns per byte at 10
Gbit/s, so ~3.2 µs for a 4 KB packet.

3) Depending on which NIC you use: check if it has some kind of interrupt
throttling feature that you can adjust or disable. If your Base-T NIC is an
Intel NIC, especially one of the older Niantic ones (i.e. X5xx using ixgbe,
probably also X7xx with i40e), that can make a large difference. Try
setting itr=0 for the ixgbe kernel module (see the sketch after this list).
Note that you might want to compile your kernel with
CONFIG_IRQ_TIME_ACCOUNTING when using this option, otherwise CPU usage
statistics will be wildly inaccurate if the driver takes a significant
amount of CPU time (should not be a problem for the setup described here,
but something to be aware of). This may get you up to 100 µs in the best
case. No idea about other NICs.

4) No idea about the state of this in Ceph, but: SO_BUSY_POLL on sockets
does help with latency; I forgot the details, though.

5) Correct NUMA pinning (even a single-socket AMD Epyc system is NUMA) can
reduce tail latency, but it doesn't do anything for average and median
latency; I have no insights specific to Ceph here, though.
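
A rough, untested sketch of the kind of knobs I mean for 3) and 4),
assuming eth0 is the storage-facing interface (the interface name and the
busy-poll values are just placeholders; the ethtool coalescing settings are
roughly the generic equivalent of itr=0, but the exact options depend on
the driver):

$ ethtool -c eth0    # show current interrupt coalescing settings
$ ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
$ sysctl -w net.core.busy_read=50   # busy-poll blocking reads for up to 50 µs
$ sysctl -w net.core.busy_poll=50   # same for poll/select/epoll

As always: benchmark before and after, the interrupt rate can get very high
with coalescing disabled.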


This could get you a few microseconds; I think especially 3) and 4) are
worth trying. Please do report results if you test this; I'm always
interested in hearing stories about low-level performance optimizations :)

Paul



On Tue, Feb 2, 2021 at 10:17 AM Wido den Hollander  wrote:

> Hi,
>
> There are many talks and presentations out there about Ceph's
> performance. Ceph is great when it comes to parallel I/O, large queue
> depths and many applications sending I/O towards Ceph.
>
> One thing where Ceph isn't the fastest are 4k blocks written at Queue
> Depth 1.
>
> Some applications benefit very much from high performance/low latency
> I/O at qd=1, for example Single Threaded applications which are writing
> small files inside a VM running on RBD.
>
> With some tuning you can get to a ~700us latency for a 4k write with
> qd=1 (Replication, size=3)
>
> I benchmark this using fio:
>
> $ fio --ioengine=librbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..
>
> 700us latency means the result will be about ~1500 IOps (1000 / 0.7)
>
> When comparing this to let's say a BSD machine running ZFS that's on the
> low side. With ZFS+NVMe you'll be able to reach about somewhere between
> 7.000 and 10.000 IOps, the latency is simply much lower.
>
> My benchmarking / test setup for this:
>
> - Ceph Nautilus/Octopus (doesn't make a big difference)
> - 3x SuperMicro 1U with:
> - AMD Epyc 7302P 16-core CPU
> - 128GB DDR4
> - 10x Samsung PM983 3,84TB
> - 10Gbit Base-T networking
>
> Things to configure/tune:
>
> - C-State pinning to 1
> - CPU governer to performance
> - Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)
>
> Higher clock speeds (New AMD Epyc coming in March!) help to reduce the
> latency and going towards 25Gbit/100Gbit might help as well.
>
> These are however only very small increments and might help to reduce
> the latency by another 15% or so.
>
> It doesn't bring us anywhere near the 10k IOps other applications can do.
>
> And I totally understand that replication over a TCP/IP network takes
> time and thus increases latency.
>
> The Crimson project [0] is aiming to lower the latency with many things
> like DPDK and SPDK, but this is far from finished and production ready.
>
> In the meantime, am I overseeing some things here? Can we reduce the
> latency further of the current OSDs?
>
> Reaching a ~500us latency would already be great!
>
> Thanks,
>
> Wido
>
>
> [0]: https://docs.ceph.com/en/latest/dev/crimson/crimson/
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


[ceph-users] Re: High ceph_osd_commit_latency_ms on Toshiba MG07ACA14TE HDDs

2020-06-24 Thread Paul Emmerich
Well, what I was saying was "does it hurt to unconditionally run hdparm -W
0 on all disks?"

Which disk would suffer from this? I haven't seen any disk where this would
be a bad idea.
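
Untested sketch of what I mean, assuming all spinners show up as /dev/sd?;
note that this only turns off the volatile cache, as Frank explains below:

$ for dev in /dev/sd?; do hdparm -W 0 "$dev"; done
$ hdparm -W /dev/sda    # verify: should report write-caching = 0 (off)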


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jun 24, 2020 at 5:35 PM Frank Schilder  wrote:

> Yes, non-volatile write cache helps as described in the wiki. When you
> disable write cache with hdparm, it actually only disables volatile write
> cache. That's why SSDs with power loss protection are recommended for ceph.
>
> A SAS/SATA SSD without any write cache will perform poorly no matter what.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: Paul Emmerich 
> Sent: 24 June 2020 17:30:51
> To: Frank R
> Cc: Benoît Knecht; s.pri...@profihost.ag; ceph-users@ceph.io
> Subject: [ceph-users] Re: High ceph_osd_commit_latency_ms on Toshiba
> MG07ACA14TE HDDs
>
> Has anyone ever encountered a drive with a write cache that actually
> *helped*?
> I haven't.
>
> As in: would it be a good idea for the OSD to just disable the write cache
> on startup? Worst case it doesn't do anything, best case it improves
> latency.
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>
> On Wed, Jun 24, 2020 at 3:49 PM Frank R  wrote:
>
> > fyi, there is an interesting note on disabling the write cache here:
> >
> >
> >
> https://yourcmc.ru/wiki/index.php?title=Ceph_performance=toggle_view_desktop#Drive_cache_is_slowing_you_down
> >
> > On Wed, Jun 24, 2020 at 9:45 AM Benoît Knecht 
> > wrote:
> > >
> > > Hi Igor,
> > >
> > > Igor Fedotov wrote:
> > > > for the sake of completeness one more experiment please if possible:
> > > >
> > > > turn off write cache for HGST drives and measure commit latency once
> > again.
> > >
> > > I just did the same experiment with HGST drives, and disabling the
> write
> > cache
> > > on those drives brought the latency down from about 7.5ms to about 4ms.
> > >
> > > So it seems disabling the write cache across the board would be
> > advisable in
> > > our case. Is it recommended in general, or specifically when the DB+WAL
> > is on
> > > the same hard drive?
> > >
> > > Stefan, Mark, are you disabling the write cache on your HDDs by
> default?
> > >
> > > Cheers,
> > >
> > > --
> > > Ben
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


[ceph-users] Re: Feedback of the used configuration

2020-06-24 Thread Paul Emmerich
Have a look at cephfs subvolumes:
https://docs.ceph.com/docs/master/cephfs/fs-volumes/#fs-subvolumes

They are internally just directories with a quota, pool placement layout,
and namespace, plus some mgr magic to make it easier than doing all of that
by hand.
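
Roughly like this (untested; I'm assuming your fs is called "cephfs",
"server1" is just a placeholder name, and the EC pool must already be added
as a data pool of the fs):

$ ceph fs subvolume create cephfs server1 --pool_layout ec_data_server1
$ ceph fs subvolume getpath cephfs server1

That gives you a directory with the data pool layout already applied, so
you can skip the setfattr step.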

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jun 24, 2020 at 4:38 PM Simon Sutter  wrote:

> Hello,
>
> After two months of the "ceph try and error game", I finally managed to
> get an Octopuss cluster up and running.
> The unconventional thing about it is, it's just for hot backups, no
> virtual machines on there.
> All the  nodes are without any caching ssd's, just plain hdd's.
> At the moment there are eight of them with a total of 50TB. We are
> planning to go up to 25 and bigger disks so we end on 300TB-400TB
>
> I decided to go with cephfs, because I don't have any experience in things
> like S3 and I need to read the same file system from more than one client.
>
> I made one cephfs with a replicated pool.
> On there I added erasure-coded pools to save some Storage.
> To add those pools, I did it with the setfattr command like this:
> setfattr -n ceph.dir.layout.pool -v ec_data_server1 /cephfs/nfs/server1
>
> Some of our servers cannot use cephfs (old kernels, special OS's) so I
> have to use nfs.
> This is set up with the included ganesha-nfs.
> Exported is the /cephfs/nfs folder and clients can mount folders below
> this.
>
> There are two final questions:
>
> -  Was it right to go with the way of "mounting" pools with
> setfattr, or should I have used multiple cephfs?
>
> First I was thinking about using multiple cephfs but there are warnings
> everywhere. The deeper I got in, the more it seems I would have been fine
> with multiple cephfs.
>
> -  Is there a way I don't know, but it would be easier?
>
> I still don't know much about Rest, S3, RBD etc... so there may be a
> better way
>
> Other remarks are desired.
>
> Thanks in advance,
> Simon
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


[ceph-users] Re: High ceph_osd_commit_latency_ms on Toshiba MG07ACA14TE HDDs

2020-06-24 Thread Paul Emmerich
Has anyone ever encountered a drive with a write cache that actually
*helped*?
I haven't.

As in: would it be a good idea for the OSD to just disable the write cache
on startup? Worst case it doesn't do anything, best case it improves
latency.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jun 24, 2020 at 3:49 PM Frank R  wrote:

> fyi, there is an interesting note on disabling the write cache here:
>
>
> https://yourcmc.ru/wiki/index.php?title=Ceph_performance=toggle_view_desktop#Drive_cache_is_slowing_you_down
>
> On Wed, Jun 24, 2020 at 9:45 AM Benoît Knecht 
> wrote:
> >
> > Hi Igor,
> >
> > Igor Fedotov wrote:
> > > for the sake of completeness one more experiment please if possible:
> > >
> > > turn off write cache for HGST drives and measure commit latency once
> again.
> >
> > I just did the same experiment with HGST drives, and disabling the write
> cache
> > on those drives brought the latency down from about 7.5ms to about 4ms.
> >
> > So it seems disabling the write cache across the board would be
> advisable in
> > our case. Is it recommended in general, or specifically when the DB+WAL
> is on
> > the same hard drive?
> >
> > Stefan, Mark, are you disabling the write cache on your HDDs by default?
> >
> > Cheers,
> >
> > --
> > Ben
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


[ceph-users] Re: Nautilus: Monitors not listening on msgrv1

2020-06-23 Thread Paul Emmerich
It's only listening on v2 because the mon map says so. How it got into the
mon map like this is hard to guess, but that's the place where you have to
fix it.
The simplest way to change a mon's address is to destroy and re-create it,
but you can also edit the monmap manually following these instructions:

https://docs.ceph.com/docs/octopus/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way

Specify both v1/v2 addresses explicitly using the same syntax as you've
used in the config file.

Speaking of your config file: your mon_host line is unnecessarily complex;
trying both v1 and v2 is the default.
You can simply write it like this if you are running on the default ports:

mon_host = 10.144.0.2, 10.144.0.3, 10.144.0.4

This has the advantage of being backwards-compatible with old clients.
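
From memory, the manual edit looks roughly like this; follow the linked doc
for the exact steps, "mon1" is a placeholder for the mon's name, and the
mon has to be stopped before injecting:

$ ceph mon getmap -o /tmp/monmap
$ monmaptool --print /tmp/monmap
$ monmaptool --rm mon1 /tmp/monmap
$ monmaptool --addv mon1 '[v2:10.144.0.2:3300,v1:10.144.0.2:6789]' /tmp/monmap
$ ceph-mon -i mon1 --inject-monmap /tmp/monmap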


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Tue, Jun 23, 2020 at 10:52 PM Julian Fölsch 
wrote:

> Hi,
>
> Sorry about the second mail, I forgot the attachement!
> [0] https://paste.myftb.de/umefarejol.txt
>
> Am 23.06.20 um 22:44 schrieb Julian Fölsch:
> > Hi,
> >
> > I am currently facing the problem that our Ceph Cluster running Nautilus
> > is only listening on msgrv2 and we are not sure why.
> > This stops us from using block devices via rbd or mounting ceph via the
> > kernel module.
> > Attached[0] you can find the output of 'cat /etc/ceph/ceph.conf', 'ceph
> > mon dump' and 'ceph config dump'.
> > I already asked on IRC and was told that I probably have more success on
> > the mailing list so hopefully someone here also encountered that issue
> > and can help us out.
> >
> > Kind regards,
> > Julian Fölsch
> >
> --
> Julian Fölsch
>
>Arbeitsgemeinschaft Dresdner Studentennetz (AG DSN)
>Stellvertretender Schatzmeister
>
>Telefon: +49 351 271816 69
>Mobil: +49 152 22915871
>Fax: +49 351 46469685
>Email: julian.foel...@agdsn.de
>
>Studierendenrat der TU Dresden
>Helmholtzstr. 10
>01069 Dresden
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


[ceph-users] Re: Radosgw huge traffic to index bucket compared to incoming requests

2020-06-19 Thread Paul Emmerich
Hi,

Huge read amplification on index buckets is unfortunately normal; the
complexity of a read request is O(n), where n is the number of objects in
that bucket.
I've worked on many clusters with huge buckets, and having 10 Gbit/s of
network traffic between the OSDs and radosgw is unfortunately not unusual
when running a lot of listing requests.

The problem is that it needs to read *all* shards for each list request,
and the number of shards is the number of objects divided by 100k by
default. It's a little bit better in Octopus, but still not great for huge
buckets.

My experiences with building rgw setups for huge (> 200 million) buckets
can be summed up as:

* use *good* NVMe disks for the index pool (very good experiences with the
Samsung 1725a; I've seen these things do > 50k IOPS during recoveries)
* it can be beneficial to have a larger number of OSDs handling the load,
as huge rocksdb sizes can be a problem; that means it can be better to use
the NVMe disks as DB devices for HDDs and put the index pool there than to
run a dedicated NVMe-only pool on very few OSDs
* go for larger shards on large buckets; shard sizes of 300k - 600k objects
are perfectly fine on fast NVMes (the trade-off here is recovery
speed/locked objects vs. read amplification)

I think the formula shards = bucket_size / 100k shouldn't apply for buckets
with >= 100 million objects; shards should become bigger as the bucket size
increases.
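
For example (bucket name made up, untested; manual resharding locks the
bucket while it runs):

$ radosgw-admin bucket limit check    # shows objects per shard / fill status
# ~200 million objects at ~500k objects per shard -> ~400 shards
$ radosgw-admin bucket reshard --bucket=big-bucket --num-shards=401

If you rely on dynamic resharding instead, raising rgw_max_objs_per_shard
has a similar effect.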


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Jun 18, 2020 at 9:25 AM Mariusz Gronczewski <
mariusz.gronczew...@efigence.com> wrote:

> Hi,
>
> we're using Ceph as S3-compatible storage to serve static files (mostly
> css/js/images + some videos) and I've noticed that there seem to be
> huge read amplification for index pool.
>
> Incoming traffic magniture is of around 15k req/sec (mostly sub 1MB
> request but index pool is getting hammered:
>
> pool pl-war1.rgw.buckets.index id 10
>   client io 632 MiB/s rd, 277 KiB/s wr, 129.92k op/s rd, 415 op/s wr
>
> pool pl-war1.rgw.buckets.data id 11
>   client io 4.5 MiB/s rd, 6.8 MiB/s wr, 640 op/s rd, 1.65k op/s wr
>
> and is getting order of magnitude more requests
>
> running 15.2.3, nothing special in terms of tunning aside from
> disabling some logging as to not overflow the logs.
>
> We've had similar test cluster on 12.x (and way slower hardware)
> getting similar traffic and haven't observed that magnitude of
> difference.
>
> when enabling debug on affected OSD I only get spam of
>
> 2020-06-17T12:35:05.700+0200 7f80694c4700 10
> bluestore(/var/lib/ceph/osd/ceph-20) omap_get_header 10.51_head oid
> #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> = 0
> 2020-06-17T12:35:05.700+0200 7f80694c4700 10
> bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head
> #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> 2020-06-17T12:35:05.700+0200 7f80694c4700 10
> bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head
> #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> 2020-06-17T12:35:05.700+0200 7f80694c4700 10
> bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head
> #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> 2020-06-17T12:35:05.700+0200 7f80694c4700 10
> bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head
> #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> 2020-06-17T12:35:05.704+0200 7f80694c4700 10
> bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head
> #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> 2020-06-17T12:35:05.704+0200 7f80694c4700 10
> bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head
> #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> 2020-06-17T12:35:05.704+0200 7f80694c4700 10
> bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head
> #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> 2020-06-17T12:35:05.704+0200 7f80694c4700 10
> bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head
> #10:8b1a34d8:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.214:head#
> 2020-06-17T12:35:05.704+0200 7f80694c4700 10
> bluestore(/var/lib/ceph/osd/ceph-20) omap_get_header 10.51_head oid
> #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head#
> = 0
> 2020-06-17T12:35:05.704+0200 7f80694c4700 10
> bluestore(/var/lib/ceph/osd/ceph-20) get_omap_iterator 10.51_head
> #10:8b0d75b0:::.dir.88d4f221-0da5-444d-81a8-517771278350.454759.8.222:head#
> 2020-06-17T12:35:05.704+0200 7f80694c4700 10
>

[ceph-users] Re: help with failed osds after reboot

2020-06-15 Thread Paul Emmerich
On Mon, Jun 15, 2020 at 7:01 PM  wrote:

> Ceph version 10.2.7
>
> ceph.conf
> [global]
> fsid = 75d6dba9-2144-47b1-87ef-1fe21d3c58a8
>

(...)


> mount_activate: Failed to activate
> ceph-disk: Error: No cluster conf found in /etc/ceph with fsid
> e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


> mount_activate: Failed to activate
> ceph-disk: Error: No cluster conf found in /etc/ceph with fsid
> e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9
> mount_activate: Failed to activate
> ceph-disk: Error: No cluster conf found in /etc/ceph with fsid
> e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9
> mount_activate: Failed to activate
> ceph-disk: Error: No cluster conf found in /etc/ceph with fsid
> e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9
> mount_activate: Failed to activate
> ceph-disk: Error: No cluster conf found in /etc/ceph with fsid
> e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9
> mount_activate: Failed to activate
> ceph-disk: Error: No cluster conf found in /etc/ceph with fsid
> e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9
> mount_activate: Failed to activate
> ceph-disk: Error: No cluster conf found in /etc/ceph with fsid
> e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9
> mount_activate: Failed to activate
> ceph-disk: Error: No cluster conf found in /etc/ceph with fsid
> e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9
> mount_activate: Failed to activate
> ceph-disk: Error: No cluster conf found in /etc/ceph with fsid
> e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9
> mount_activate: Failed to activate
> ceph-disk: Error: No cluster conf found in /etc/ceph with fsid
> e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9
> ceph-disk: Error: One or more partitions failed to activate
>
> I am trying to gather as many details as possible, is there anything I am
> missing that I should take a look at?
> I still have not figured out why this started being a problem or how to
> resolve.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


[ceph-users] Re: OSD upgrades

2020-06-02 Thread Paul Emmerich
Correct, "crush weight" and normal "reweight" are indeed very different.
The original post mentions "rebuilding" servers; in this case the correct
way is to use "destroy" and then explicitly re-use the OSD afterwards.

Purge is really only for OSDs that you don't get back (or broken disks that
you don't replace quickly).
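
Rough sketch of the destroy/re-use flow (osd id and device are
placeholders):

$ ceph osd destroy 12 --yes-i-really-mean-it
# rebuild the server, then re-create the OSD keeping the same id:
$ ceph-volume lvm create --osd-id 12 --data /dev/sdb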


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Tue, Jun 2, 2020 at 12:32 PM Thomas Byrne - UKRI STFC <
tom.by...@stfc.ac.uk> wrote:

> As you have noted, 'ceph osd reweight 0' is the same as an 'ceph osd out',
> but it is not the same as removing the OSD from the crush map (or setting
> crush weight to 0). This explains your observation of the double rebalance
> when you mark an OSD out (or reweight an OSD to 0), and then remove it
> later.
>
> To avoid this, I use a crush reweight for the initial step to move PGs off
> an OSD when draining nodes. You can then purge the OSD with no further PG
> movement.
>
> Double movement:
> > ceph osd out $i
> # rebalancing
> > ceph osd purge $i
> # more rebalancing
>
> Single movement:
> > ceph osd crush reweight $i 0
> # rebalancing
> > ceph osd purge $i
> # no rebalancing
>
> The reason this occurs (as I understand it) is that the reweight value is
> taken into account later in the crush calc, so an OSD with a reweight of 0
> can still be picked for a PG set, and then the reweight kicks in and forces
> the calc to be retried, giving a different value for the PG set compared to
> if the OSD was not present, or had a crush weight of 0.
>
> Cheers,
> Tom
>
> > -Original Message-
> > From: Brent Kennedy 
> > Sent: 02 June 2020 04:44
> > To: 'ceph-users' 
> > Subject: [ceph-users] OSD upgrades
> >
> > We are rebuilding servers and before luminous our process was:
> >
> >
> >
> > 1.   Reweight the OSD to 0
> >
> > 2.   Wait for rebalance to complete
> >
> > 3.   Out the osd
> >
> > 4.   Crush remove osd
> >
> > 5.   Auth del osd
> >
> > 6.   Ceph osd rm #
> >
> >
> >
> > Seems the luminous documentation says that you should:
> >
> > 1.   Out the osd
> >
> > 2.   Wait for the cluster rebalance to finish
> >
> > 3.   Stop the osd
> >
> > 4.   Osd purge #
> >
> >
> >
> > Is reweighting to 0 no longer suggested?
> >
> >
> >
> > Side note:  I tried our existing process and even after reweight, the
> entire
> > cluster restarted the balance again after step 4 ( crush remove osd ) of
> the old
> > process.  I should also note, by reweighting to 0, when I tried to run
> "ceph osd
> > out #", it said it was already marked out.
> >
> >
> >
> > I assume the docs are correct, but just want to make sure since
> reweighting
> > had been previously recommended.
> >
> >
> >
> > Regards,
> >
> > -Brent
> >
> >
> >
> > Existing Clusters:
> >
> > Test: Nautilus 14.2.2 with 3 osd servers, 1 mon/man, 1 gateway, 2 iscsi
> > gateways ( all virtual on nvme )
> >
> > US Production(HDD): Nautilus 14.2.2 with 11 osd servers, 3 mons, 4
> gateways,
> > 2 iscsi gateways
> >
> > UK Production(HDD): Nautilus 14.2.2 with 12 osd servers, 3 mons, 4
> gateways
> >
> > US Production(SSD): Nautilus 14.2.2 with 6 osd servers, 3 mons, 3
> gateways,
> > 2 iscsi gateways
> >
> >
> >
> >
> >
> >
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to
> > ceph-users-le...@ceph.io
>
> This email and any attachments are intended solely for the use of the
> named recipients. If you are not the intended recipient you must not use,
> disclose, copy or distribute this email or any of its attachments and
> should notify the sender immediately and delete this email from your
> system. UK Research and Innovation (UKRI) has taken every reasonable
> precaution to minimise risk of this email or any attachments containing
> viruses or malware but the recipient should carry out its own virus and
> malware checks before opening the attachments. UKRI does not accept any
> liability for any losses or damages which the recipient may sustain due to
> presence of any viruses. Opinions, conclusions or other information in this
> message and attachments that are not related directly to UKRI business are
> solely those of the author and do not represent the views of UKRI.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


[ceph-users] Re: OSD upgrades

2020-06-02 Thread Paul Emmerich
"reweight 0" and "out" are the exact same thing


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Tue, Jun 2, 2020 at 9:30 AM Wido den Hollander  wrote:

>
>
> On 6/2/20 5:44 AM, Brent Kennedy wrote:
> > We are rebuilding servers and before luminous our process was:
> >
> >
> >
> > 1.   Reweight the OSD to 0
> >
> > 2.   Wait for rebalance to complete
> >
> > 3.   Out the osd
> >
> > 4.   Crush remove osd
> >
> > 5.   Auth del osd
> >
> > 6.   Ceph osd rm #
> >
> >
> >
> > Seems the luminous documentation says that you should:
> >
> > 1.   Out the osd
> >
> > 2.   Wait for the cluster rebalance to finish
> >
> > 3.   Stop the osd
> >
> > 4.   Osd purge #
> >
> >
> >
> > Is reweighting to 0 no longer suggested?
> >
> >
> >
> > Side note:  I tried our existing process and even after reweight, the
> entire
> > cluster restarted the balance again after step 4 ( crush remove osd ) of
> the
> > old process.  I should also note, by reweighting to 0, when I tried to
> run
> > "ceph osd out #", it said it was already marked out.
> >
> >
> >
> > I assume the docs are correct, but just want to make sure since
> reweighting
> > had been previously recommended.
>
> The new commands just make it more simple. There are many ways to
> accomplish the same goal, but what the docs describe should work in most
> scenarios.
>
> Wido
>
> >
> >
> >
> > Regards,
> >
> > -Brent
> >
> >
> >
> > Existing Clusters:
> >
> > Test: Nautilus 14.2.2 with 3 osd servers, 1 mon/man, 1 gateway, 2 iscsi
> > gateways ( all virtual on nvme )
> >
> > US Production(HDD): Nautilus 14.2.2 with 11 osd servers, 3 mons, 4
> gateways,
> > 2 iscsi gateways
> >
> > UK Production(HDD): Nautilus 14.2.2 with 12 osd servers, 3 mons, 4
> gateways
> >
> > US Production(SSD): Nautilus 14.2.2 with 6 osd servers, 3 mons, 3
> gateways,
> > 2 iscsi gateways
> >
> >
> >
> >
> >
> >
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


[ceph-users] Re: [ceph-users]: Ceph Nautius not working after setting MTU 9000

2020-05-29 Thread Paul Emmerich
Please do not apply any optimization without benchmarking *before* and
*after* in a somewhat realistic scenario.

No, iperf is likely not a realistic setup because it will usually be
limited by the available network bandwidth, which is rarely (or at least
should rarely be) maxed out on your actual Ceph setup.
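
Something closer to reality is to run your usual benchmark against the
cluster itself, once per MTU setting, and compare latency as well as
bandwidth. For example (pool name and sizes are just placeholders):

$ rados bench -p rbd 60 write -b 4096 -t 1        # latency-sensitive small writes
$ rados bench -p rbd 60 write -b 4194304 -t 16    # throughput-oriented large writes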

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Fri, May 29, 2020 at 2:15 AM Dave Hall  wrote:

> Hello.
>
> A few days ago I offered to share the notes I've compiled on network
> tuning.  Right now it's a Google Doc:
>
>
> https://docs.google.com/document/d/1nB5fzIeSgQF0ti_WN-tXhXAlDh8_f8XF9GhU7J1l00g/edit?usp=sharing
>
> I've set it up to allow comments and I'd be glad for questions and
> feedback.  If Google Docs not an acceptable format I'll try to put it up
> somewhere as HTML or Wiki.  Disclosure: some sections were copied
> verbatim from other sources.
>
> Regarding the current discussion about iperf, the likely bottleneck is
> buffering.  There is a per-NIC output queue set with 'ip link' and a per
> CPU core input queue set with 'sysctl'.  Both should be set to some
> multiple of the frame size based on calculations related to link speed
> and latency.  Jumping from 1500 to 9000 could negatively impact
> performance because one buffer or the other might be 1500 bytes short of
> a low multiple of 9000.
>
> It would be interesting to see the iperf tests repeated with
> corresponding buffer sizing.  I will perform this experiment as soon as
> I complete some day-job tasks.
>
> -Dave
>
> Dave Hall
> Binghamton University
> kdh...@binghamton.edu
> 607-760-2328 (Cell)
> 607-777-4641 (Office)
>
> On 5/27/2020 6:51 AM, EDH - Manuel Rios wrote:
> > Anyone can share their table with other MTU values?
> >
> > Also interested into Switch CPU load
> >
> > KR,
> > Manuel
> >
> > -Mensaje original-
> > De: Marc Roos 
> > Enviado el: miércoles, 27 de mayo de 2020 12:01
> > Para: chris.palmer ; paul.emmerich <
> paul.emmer...@croit.io>
> > CC: amudhan83 ; anthony.datri <
> anthony.da...@gmail.com>; ceph-users ; doustar <
> dous...@rayanexon.ir>; kdhall ; sstkadu <
> sstk...@gmail.com>
> > Asunto: [ceph-users] Re: [External Email] Re: Ceph Nautius not working
> after setting MTU 9000
> >
> >
> > Interesting table. I have this on a production cluster 10gbit at a
> > datacenter (obviously doing not that much).
> >
> >
> > [@]# iperf3 -c 10.0.0.13 -P 1 -M 9000
> > Connecting to host 10.0.0.13, port 5201
> > [  4] local 10.0.0.14 port 52788 connected to 10.0.0.13 port 5201
> > [ ID] Interval   Transfer Bandwidth   Retr  Cwnd
> > [  4]   0.00-1.00   sec  1.14 GBytes  9.77 Gbits/sec0690 KBytes
> > [  4]   1.00-2.00   sec  1.15 GBytes  9.90 Gbits/sec0   1.08 MBytes
> > [  4]   2.00-3.00   sec  1.15 GBytes  9.88 Gbits/sec0   1.08 MBytes
> > [  4]   3.00-4.00   sec  1.15 GBytes  9.88 Gbits/sec0   1.08 MBytes
> > [  4]   4.00-5.00   sec  1.15 GBytes  9.88 Gbits/sec0   1.08 MBytes
> > [  4]   5.00-6.00   sec  1.15 GBytes  9.90 Gbits/sec0   1.21 MBytes
> > [  4]   6.00-7.00   sec  1.15 GBytes  9.89 Gbits/sec0   1.21 MBytes
> > [  4]   7.00-8.00   sec  1.15 GBytes  9.88 Gbits/sec0   1.21 MBytes
> > [  4]   8.00-9.00   sec  1.15 GBytes  9.89 Gbits/sec0   1.21 MBytes
> > [  4]   9.00-10.00  sec  1.15 GBytes  9.89 Gbits/sec0   1.21 MBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval   Transfer Bandwidth   Retr
> > [  4]   0.00-10.00  sec  11.5 GBytes  9.87 Gbits/sec0
> > sender
> > [  4]   0.00-10.00  sec  11.5 GBytes  9.87 Gbits/sec
> > receiver
> >
> >
> > -Original Message-
> > Subject: Re: [ceph-users] Re: [External Email] Re: Ceph Nautius not
> > working after setting MTU 9000
> >
> > To elaborate on some aspects that have been mentioned already and add
> > some others::
> >
> >
> > * Test using iperf3.
> >
> > * Don't try to use jumbos on networks where you don't have complete
> > control over every host. This usually includes the main ceph network.
> > It's just too much grief. You can consider using it for limited-access
> > networks (e.g. ceph cluster network, hypervisor migration network, etc)
> > where you know every switch & host is tuned correctly. (This works even
> > when those nets share a vlan trunk with non-jumbo vlans - just set the
> > max value on the trunk itself, and individua

[ceph-users] Re: No scrubbing during upmap balancing

2020-05-29 Thread Paul Emmerich
Did you disable "osd scrub during recovery"?
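
It defaults to false as far as I remember, so scrubs simply wait while
backfill/recovery is running. You can check and change it at runtime,
something like:

$ ceph config get osd osd_scrub_during_recovery
$ ceph config set osd osd_scrub_during_recovery true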

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Fri, May 29, 2020 at 12:04 AM Vytenis A  wrote:

> Forgot to mention the CEPH version we're running: Nautilus 14.2.9
>
> On Fri, May 29, 2020 at 12:44 AM Vytenis A  wrote:
> >
> > Hi list,
> >
> > We have balancer plugin in upmap mode running for a while now:
> >
> > health: HEALTH_OK
> >
> > pgs:
> >   1973 active+clean
> >194  active+remapped+backfilling
> >73   active+remapped+backfill_wait
> >
> > recovery: 588 MiB/s, 343 objects/s
> >
> >
> > Our objects are stored on EC pool. We got an PG_NOT_DEEP_SCRUBBED
> > alert and have noticed that no scrubbing (literally zero) was done
> > since the balancing started. Has anyone some ideas why this is
> > happening?
> >
> > "pg deep-scrub " did not help.
> >
> > Thanks!
> >
> >
> > --
> > Vytenis
>
>
>
> --
> Vytenis
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


[ceph-users] Re: The sufficient OSD capabilities to enable write access on cephfs

2020-05-29 Thread Paul Emmerich
There are two bugs that may cause the tag to be missing from the pools;
you can manually add these tags with "ceph osd pool application ...". I
think I posted the exact commands some time ago on tracker.ceph.com.
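
From memory it was something like this (untested, double-check against the
tracker issue; assumes the fs is called "cephfs" with the usual
cephfs_data/cephfs_metadata pools):

$ ceph osd pool application set cephfs_data cephfs data cephfs
$ ceph osd pool application set cephfs_metadata cephfs metadata cephfs
$ ceph osd pool application get cephfs_data    # verify the tag is there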

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Fri, May 29, 2020 at 7:21 AM Derrick Lin  wrote:

> I did some brute-force experiments and found the following setting works
> for me:
>
> caps osd = "allow rw pool=cephfs_data"
>
>   I am not sure why ceph fs authorize command set in that way and for what
> purpose...
>
> Cheers
>
> On Fri, May 29, 2020 at 12:03 PM Derrick Lin  wrote:
>
> > Hi guys,
> >
> > I have a Ceph Cluster up and running and cephfs created (all done by
> > ceph-ansible).
> >
> > I following the guide to mount the volume on CentOS7 via FUSE.
> >
> > When I mount the volume as the default admin (client.admin), everything
> > works fine just like normal file system.
> >
> > Then I created a new client just for FUSE mount purpose, follow this
> > guide: https://docs.ceph.com/docs/master/cephfs/mount-prerequisites/
> >
> > The ceph fs authorize command created a new client with the following
> caps:
> >
> > [client.wp_test]
> > key = AQDAEc9ebLXjGhAAxEGqTuTvCOoN30g4UzF5jw==
> > caps mds = "allow rw"
> > caps mon = "allow r"
> > caps osd = "allow rw tag cephfs data=cephfs"
> >
> >
> > It can mount the volume, and I can touch a file. But when I tried write
> > data, such as editing a new text or cat a file, I got some read-only
> error
> > or
> >
> > [root@mon-6-26 ceph_root]# cat test.txt
> > cat: test.txt: Operation not permitted
> >
> > if I modified the OSD cap to "allow *", the it allows write again.
> >
> > Can anyone suggest what have been done incorrectly?
> >
> > We are using
> >
> > 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0)
> > nautilus (stable)
> >
> > Cheers,
> > Derrick
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


[ceph-users] Re: High latency spikes under jewel

2020-05-27 Thread Paul Emmerich
This is a common problem with FileStore and there's really no point in
debugging it: upgrade everything to a recent version and migrate to
BlueStore. 99% of random latency spikes are simply fixed by doing that.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, May 27, 2020 at 3:26 PM Bence Szabo  wrote:

> Hi,
> We experienced random and relative high latency spikes (around 0.5-10 sec)
> in our ceph cluster which consists 6 osd nodes, all osd nodes have 6 osd-s.
> One osd built with one spinning disk and two nvme device.
> We use a bcache device for osd back end (mixed with hdd and an nvme
> partition as caching device) and one nvme partition for journal.
> This synthetic command can be use for check io and latency:
> rados bench -p rbd 10 write -b 4000 -t 64
> With this parameters we often got about 1.5 sec or higher for maximum
> latency.
> We cannot decide if our cluster is misconfigured or just this is a natural
> ceph behavior.
> Any help, suggestion would be appreciated.
> Regards,
> Bence
>
> --
> --Szabo Bence
> --
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


[ceph-users] Re: 15.2.2 bluestore issue

2020-05-27 Thread Paul Emmerich
Hi,

since this bug may lead to data loss when several OSDs crash at the same
time (e.g., after a power outage): can we pull the release from the mirrors
and docker hub?

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, May 20, 2020 at 7:18 PM Josh Durgin  wrote:

> Hi folks, at this time we recommend pausing OSD upgrades to 15.2.2.
>
> There have been a couple reports of OSDs crashing due to rocksdb
> corruption after upgrading to 15.2.2 [1] [2]. It's safe to upgrade
> monitors and mgr, but OSDs and everything else should wait.
>
> We're investigating and will get a fix out as soon as we can. You
> can follow progress on this tracker:
>
>https://tracker.ceph.com/issues/45613
>
> Josh
>
> [1]
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/CX5PRFGL6UBFMOJC6CLUMLPMT4B2CXVQ/
> [2]
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/CWN7BNPGSRBKZHUF2D7MDXCOAE3U2ERU/
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


[ceph-users] Re: [External Email] Re: Ceph Nautius not working after setting MTU 9000

2020-05-26 Thread Paul Emmerich
Don't optimize stuff without benchmarking *before and after*; don't apply
random tuning tips from the Internet without benchmarking them.

My experience with Jumbo frames: about 3% more performance, on an NVMe-only
setup with a 100 Gbit/s network.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, May 26, 2020 at 7:02 PM Marc Roos  wrote:

>
>
> Look what I have found!!! :)
> https://ceph.com/geen-categorie/ceph-loves-jumbo-frames/
>
>
>
> -Original Message-
> From: Anthony D'Atri [mailto:anthony.da...@gmail.com]
> Sent: maandag 25 mei 2020 22:12
> To: Marc Roos
> Cc: kdhall; martin.verges; sstkadu; amudhan83; ceph-users; doustar
> Subject: Re: [ceph-users] Re: [External Email] Re: Ceph Nautius not
> working after setting MTU 9000
>
> Quick and easy depends on your network infrastructure.  Sometimes it is
> difficult or impossible to retrofit a live cluster without disruption.
>
>
> > On May 25, 2020, at 1:03 AM, Marc Roos 
> wrote:
> >
> > 
> > I am interested. I am always setting mtu to 9000. To be honest I
> > cannot imagine there is no optimization since you have less interrupt
> > requests, and you are able x times as much data. Every time there
> > something written about optimizing the first thing mention is changing
>
> > to the mtu 9000. Because it is quick and easy win.
> >
> >
> >
> >
> > -Original Message-
> > From: Dave Hall [mailto:kdh...@binghamton.edu]
> > Sent: maandag 25 mei 2020 5:11
> > To: Martin Verges; Suresh Rama
> > Cc: Amudhan P; Khodayar Doustar; ceph-users
> > Subject: [ceph-users] Re: [External Email] Re: Ceph Nautius not
> > working after setting MTU 9000
> >
> > All,
> >
> > Regarding Martin's observations about Jumbo Frames
> >
> > I have recently been gathering some notes from various internet
> > sources regarding Linux network performance, and Linux performance in
> > general, to be applied to a Ceph cluster I manage but also to the rest
>
> > of the Linux server farm I'm responsible for.
> >
> > In short, enabling Jumbo Frames without also tuning a number of other
> > kernel and NIC attributes will not provide the performance increases
> > we'd like to see.  I have not yet had a chance to go through the rest
> > of the testing I'd like to do, but  I can confirm (via iperf3) that
> > only enabling Jumbo Frames didn't make a significant difference.
> >
> > Some of the other attributes I'm referring to are incoming and
> > outgoing buffer sizes at the NIC, IP, and TCP levels, interrupt
> > coalescing, NIC offload functions that should or shouldn't be turned
> > on, packet queuing disciplines (tc), the best choice of TCP slow-start
>
> > algorithms, and other TCP features and attributes.
> >
> > The most off-beat item I saw was something about adding IPTABLES rules
>
> > to bypass CONNTRACK table lookups.
> >
> > In order to do anything meaningful to assess the effect of all of
> > these settings I'd like to figure out how to set them all via Ansible
> > - so more to learn before I can give opinions.
> >
> > -->  If anybody has added this type of configuration to Ceph Ansible,
> > I'd be glad for some pointers.
> >
> > I have started to compile a document containing my notes.  It's rough,
>
> > but I'd be glad to share if anybody is interested.
> >
> > -Dave
> >
> > Dave Hall
> > Binghamton University
> >
> >> On 5/24/2020 12:29 PM, Martin Verges wrote:
> >>
> >> Just save yourself the trouble. You won't have any real benefit from
> > MTU
> >> 9000. It has some smallish, but it is not worth the effort, problems,
> > and
> >> loss of reliability for most environments.
> >> Try it yourself and do some benchmarks, especially with your regular
> >> workload on the cluster (not the maximum peak performance), then drop
> > the
> >> MTU to default ;).
> >>
> >> Please if anyone has other real world benchmarks showing huge
> > differences
> >> in regular Ceph clusters, please feel free to post it here.
> >>
> >> --
> >> Martin Verges
> >> Managing director
> >>
> >> Mobile: +49 174 9335695
> >> E-Mail: martin.ver...@croit.io
> >> Chat: https://t.me/MartinVerges
> >>
> >> croit GmbH, Freseniusstr. 31h, 81247 Munich
> >> CEO: Martin Verges - VAT-ID: DE310638492 Com. register: Amtsgericht
> >

[ceph-users] Re: diskprediction_local prediction granularity

2020-05-20 Thread Paul Emmerich
On Wed, May 20, 2020 at 5:36 PM Vytenis A  wrote:

> Is it possible to get any finer prediction date?
>

Related question: did anyone actually observe any correlation between the
predicted failure time and the actual time until a failure occurs?


Paul


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


>
> --
> Vytenis
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


[ceph-users] Re: osds dropping out of the cluster w/ "OSD::osd_op_tp thread … had timed out"

2020-05-19 Thread Paul Emmerich
On Tue, May 19, 2020 at 3:11 PM thoralf schulze 
wrote:

>
> On 5/19/20 2:13 PM, Paul Emmerich wrote:
> > 3) if necessary add more OSDs; common problem is having very
> > few dedicated OSDs for the index pool; running the index on
> > all OSDs (and having a fast DB device for every disk) is
> > better. But sounds like you already have that
>
> nope, unfortunately not. default.rgw.buckets.index is an replicated pool
> on hdds with only 4 pgs, i'll see if i can change that.
>
>
These PGs should be distributed across all OSDs; in general it's a good
idea to have at least as many PGs as you have OSDs of the target type for
that pool (technically a third would be enough to target one PG per OSD,
because of x3 replication).
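
For example, something along these lines for your index pool (128 is just
an example, pick it based on how many OSDs should carry the pool):

$ ceph osd pool set default.rgw.buckets.index pg_num 128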


Paul



> back to igors questions:
>
> > Some questions about your cases:
> > - What kind of payload do you have - RGW or something else?
> mostly cephfs. the most active pools in terms of i/o are the openstack
> rgw ones, though.
>
> > - Have you done massive removals recently?
> yes, see above
>
> > - How large are main and DB disks for suffering OSDs? How much is their
> > current utilization?
> for osd.293, for which i've sent the log:
> main: 2tb hdd (5% used), db: 14gb partition on a 180gb nvme (~400mb used)
> … i'll attach a perf dump for this osd.
>
> > - Do you see multiple "slow operation observed" patterns in OSD logs?
> yes, although they do not necessarily correlate with osd down events.
>
> > Are they all about _collection_list function?
> no, there are also submit_transact and _txc_committed_kv, with about the
> same frequency as collection_list.
>
> thank you very much for your analysis & with kind regards,
> thoralf.
>


[ceph-users] Re: osds dropping out of the cluster w/ "OSD::osd_op_tp thread … had timed out"

2020-05-19 Thread Paul Emmerich
On Tue, May 19, 2020 at 2:06 PM Igor Fedotov  wrote:

> Hi Thoralf,
>
> given the following indication from your logs:
>
> May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.211
> 7fb25cc80700  0 bluestore(/var/lib/ceph/osd/ceph-293) log_latency_fn
> slow operation observed for _collection_list, latency = 96.337s, lat =
> 96s cid =2.0s2_head start GHMAX end GHMAX max 30
>
>
> I presume that your OSDs suffer from slow RocksDB access,
> collection_listing operation is a culprit in this case - 30 items
> listing takes 96seconds to complete.
>  From my experience such issues tend to happen after massive DB data
> removals (e.g. pool removal(s)) often backed by RGW usage which is "DB
> access greedy".
>

+1, also my experience: this happens when running large RGW setups.

Usually the solution is making sure that

1) all metadata really goes to SSD
2) there is no spillover
3) if necessary add more OSDs; a common problem is having very few
dedicated OSDs for the index pool; running the index on all OSDs (and
having a fast DB device for every disk) is better. But it sounds like you
already have that.
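
To check 2) and to apply the compaction workaround Igor mentions below (the
osd id is just the one from your log):

$ ceph health detail | grep -i spillover
$ ceph daemon osd.293 compact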


Paul



> DB data fragmentation is presumably the root cause for the resulting
> slowdown. BlueFS spillover to main HDD device if any to be eliminated too.
> To temporary workaround the issue you might want to do manual RocksDB
> compaction - it's known to be helpful in such cases. But the positive
> effect doesn't last forever - DB might go into degraded state again.
>
> Some questions about your cases:
> - What kind of payload do you have - RGW or something else?
> - Have you done massive removals recently?
> - How large are main and DB disks for suffering OSDs? How much is their
> current utilization?
> - Do you see multiple "slow operation observed" patterns in OSD logs?
> Are they all about _collection_list function?
>
> Thanks,
> Igor
> On 5/19/2020 12:07 PM, thoralf schulze wrote:
> > hi there,
> >
> > we are seeing osd occasionally getting kicked out of our cluster, after
> > having been marked down by other osds. most of the time, the affected
> > osd rejoins the cluster after about ~5 minutes, but sometimes this takes
> > much longer. during that time, the osd seems to run just fine.
> >
> > this happens more often that we'd like it to … is "OSD::osd_op_tp thread
> > … had timed out" a real error condition or just a warning about certain
> > operations on the osd taking a long time? i already set
> > osd_op_thread_timeout to 120 (was 60 before, default should be 15
> > according to the docs), but apparently that doesn't make any difference.
> >
> > are there any other settings that prevent this kind of behaviour?
> > mon_osd_report_timeout maybe, as in frank schilder's case?
> >
> > the cluster runs nautilus 14.2.7, osds are backed by spinning platters
> > with their rocksdb and wals on nvmes. in general, there seems to be the
> > following pattern:
> >
> > - it happens under moderate to heavy load, eg. while creating pools with
> > a lot of pgs
> > - the affected osd logs a lot of:
> > "heartbeat_map is_healthy 'OSD::osd_op_tp thread ${thread-id}' had timed
> > out after 60"
> >   … and finally something along the lines of:
> > May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.211
> > 7fb25cc80700  0 bluestore(/var/lib/ceph/osd/ceph-293) log_latency_fn
> > slow operation observed for _collection_list, latency = 96.337s, lat =
> > 96s cid =2.0s2_head start GHMAX end GHMAX max 30
> > May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.219
> > 7fb25cc80700  1 heartbeat_map clear_timeout 'OSD::osd_op_tp thread
> > 0x7fb25cc80700' had timed out after 60
> > May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: osd.293 osd.293 2 :
> > Monitor daemon marked osd.293 down, but it is still running
> > May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315
> > 7fb267c96700  0 log_channel(cluster) log [WRN] : Monitor daemon marked
> > osd.293 down, but it is still running
> > May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315
> > 7fb267c96700  0 log_channel(cluster) do_log log to syslog
> > May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315
> > 7fb267c96700  0 log_channel(cluster) log [DBG] : map e646639 wrongly
> > marked me down at e646638
> > May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315
> > 7fb267c96700  0 log_channel(cluster) do_log log to syslog
> > May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.371
> > 7fb272cac700 -1 osd.293 646639 set_numa_affinity unable to identify
> > public interface 'br-bond0' numa node: (2) No such file or directory
> > May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.371
> > 7fb272cac700 -1 osd.293 646639 set_numa_affinity unable to identify
> > public interface 'br-bond0' numa node: (2) No such file or directory
> >
> > - meanwhile on the mon:
> > 2020-05-18 21:12:16.440 7f08f7933700  0 mon.ceph-mon-01@0(leader) e4
> > 

[ceph-users] Re: Dealing with non existing crush-root= after reclassify on ec pools

2020-05-18 Thread Paul Emmerich
That part of an erasure profile is only used when a crush rule is created
automatically, i.e. when you create a pool without explicitly specifying a
crush rule.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, May 18, 2020 at 9:09 PM Dan  wrote:

> I think I did a bad job explaining my issue:
>
> I have a fairly old cluster which had a crush map with two trees, one for
> hdds and one for ssd, like root hdd {..} and root ssd {...}  now with the
> newer class based rules I used crushtool —reclassify to merge those two
> trees into root default {...} So I already downloaded, edited and
> Reuploaded the crush map, which resulted in a very minor data movement,
> which crushtool —compare predicted.  One of my pools is an ec pool with an
> ec profile with crush-root=hdd. I can not, I think, change the ec-profile
> of an existing pool. But since the pool runs on that profile, with the  now
> non existing crush-root=hdd, I am wondering if I can expect to run into
> trouble down the line or does the cluster use some internal id, and the
> string displayed only matters on creation. Basically am I safe or am I
> hosed?
>
>
> On Mon 18. May 2020 at 19:05, Eric Smith  wrote:
>
> > You'll probably have to decompile, hand edit, recompile, and reset the
> > crush map pointing at the expected root. The EC profile is only used
> during
> > pool creation and will not change the crush map if you change the EC
> > profile. I think you can expect some data movement if you change the root
> > but either way I would test it in a lab if you have one available.
> >
> > -Original Message-
> > From: Dan  On Behalf Of Dan
> > Sent: Monday, May 18, 2020 9:14 AM
> > To: ceph-users@ceph.io
> > Subject: [ceph-users] Dealing with non existing crush-root= after
> > reclassify on ec pools
> >
> > I have reclassified a CRUSH map, using the crushtool to a class based
> > ruleset.
> > I still have an ec pool with an older ec profile with a new non existing
> > crush-root=hdd
> >
> > I already switched the pool’s ruleset over to a newer rule with a newer
> > ec-profile with a correct crush-root But pool ls detail still shows:
> >
> >
> > pool 9 'data' erasure profile jerasure-3-1 size 4 min_size 3 …..
> >
> > Jerasure-3-1 being the old profile with non existing crush-root
> >
> > So what do I do now? Switching over the pool ruleset does not change the
> > ec-profile, can I switch the ec-profile over?
> > What can I expect having a pool with a ec-profile with a non existing
> > crush-root key?
> >
> > Please advise.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


[ceph-users] Re: nfs migrate to rgw

2020-05-18 Thread Paul Emmerich
On Mon, May 18, 2020 at 1:52 PM Zhenshi Zhou  wrote:

>
> 50KB, and much video files around 30MB. The amount of the files is more
> than
> 1 million. Maybe I can find a way to seperate the files in more buckets so
> that
> there is no more than 1M objects in each bucket. But how about the small
> files
> around 50KB. Does rgw serve well on small files?
>

1 million files is usually the point where you first need to start thinking
about some optimizations, but that's mostly just making sure that the index
is on SSD and it'll happily work up to ~10 million files.
Then you might need to start thinking about the index being on *good* SSDs
(and/or on many SSDs/DB devices).

It starts to get interesting if you need to go beyond 100 million files;
that's the point where you need to start tuning shard sizes and the types
of index queries that you send...

I've found that a few hundred million objects per bucket are no problem if
you run with large shard sizes (500k - 1 million); however, there are some
index queries that can be really expensive, like filtering on prefixes in
some pathological cases...

Small files: sure, they work well, but they can be challenging for erasure
coding on HDDs; that's unrelated to RGW though, you'd have the same problem
with CephFS.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


>
> Wido den Hollander  于2020年5月12日周二 下午2:41写道:
>
> >
> >
> > On 5/12/20 4:22 AM, Zhenshi Zhou wrote:
> > > Hi all,
> > >
> > > We have several nfs servers providing file storage. There is a nginx in
> > > front of
> > > nfs servers in order to serve the clients. The files are mostly small
> > files
> > > and
> > > nearly about 30TB in total.
> > >
> >
> > What is small? How many objects/files are you talking about?
> >
> > > I'm gonna use ceph rgw as the storage. I wanna know if it's appropriate
> > > to do so. Migrating the data from nfs to rgw is a huge job. Besides, I'm
> > > not sure whether ceph rgw is suitable in this scenario or not.
> > >
> >
> > Yes, it is. But make sure you don't put millions of objects into a
> > single bucket. Spread them out so that you have, let's say, at most 1M
> > objects per bucket.
> >
> > Wido
> >
> > > Thanks
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Disproportionate Metadata Size

2020-05-13 Thread Paul Emmerich
osd df is misleading when using external DB devices; they are always
counted as 100% full there.
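
To see what actually sits on the DB device, the BlueFS counters on the OSD
admin socket are more useful than osd df (osd.8 taken from the report below):

  ceph daemon osd.8 perf dump bluefs    # db_used_bytes vs db_total_bytes is the real DB usage
  ceph daemon osd.8 compact             # the manual RocksDB compaction mentioned below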


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, May 13, 2020 at 11:40 AM Denis Krienbühl  wrote:

> Hi
>
> On one of our Ceph clusters, some OSDs have been marked as full. Since
> this is a staging cluster that does not have much data on it, this is
> strange.
>
> Looking at the full OSDs through “ceph osd df” I figured out that the
> space is mostly used by metadata:
>
> SIZE: 122 GiB
> USE: 118 GiB
> DATA: 2.4 GiB
> META: 116 GiB
>
> We run mimic, and for the affected OSDs we use a db device (nvme) in
> addition to the primary device (hdd).
>
> In the logs we see the following errors:
>
> 2020-05-12 17:10:26.089 7f183f604700  1 bluefs _allocate failed to
> allocate 0x40 on bdev 1, free 0x0; fallback to bdev 2
> 2020-05-12 17:10:27.113 7f183f604700  1
> bluestore(/var/lib/ceph/osd/ceph-8) _balance_bluefs_freespace gifting
> 0x180a00~40 to bluefs
> 2020-05-12 17:10:27.153 7f183f604700  1 bluefs add_block_extent bdev 2
> 0x180a00~40
>
> We assume it is an issue with Rocksdb, as the following call will quickly
> fix the problem:
>
> ceph daemon osd.8 compact
>
> The question is, why is this happening? I would think that “compact” is
> something that runs automatically from time to time, but I’m not sure.
>
> Is it on us to run this regularly?
>
> Any pointers are welcome. I’m quite new to Ceph :)
>
> Cheers,
>
> Denis
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD corruption and down PGs

2020-05-12 Thread Paul Emmerich
First thing I'd try is to use ceph-objectstore-tool to scrape the
inactive/broken PGs from the dead OSDs using its PG export feature.
Then import these PGs into any other OSD; the cluster will recover them
automatically.
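
Roughly like this; the dead OSD's path is taken from your mail (ceph-97),
the PG id and the target OSD are examples, and both OSDs must be stopped
while the tool runs:

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-97 \
      --pgid 9.1a --op export --file /root/pg.9.1a.export
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-42 \
      --op import --file /root/pg.9.1a.export
  # then start the target OSD again and let the cluster recover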

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Tue, May 12, 2020 at 2:07 PM Kári Bertilsson 
wrote:

> Yes
> ceph osd df tree and ceph -s is at https://pastebin.com/By6b1ps1
>
> On Tue, May 12, 2020 at 10:39 AM Eugen Block  wrote:
>
> > Can you share your osd tree and the current ceph status?
> >
> >
> > Zitat von Kári Bertilsson :
> >
> > > Hello
> > >
> > > I had an incident where 3 OSDs crashed at once completely and won't
> > > power up. And during recovery 3 OSDs in another host have somehow become
> > > corrupted. I am running erasure coding with an 8+2 setup using a crush
> > > map which takes 2 OSDs per host, and after losing the other 2 OSDs I have
> > > a few PGs down. Unfortunately these PGs seem to overlap almost all data
> > > on the pool, so I believe the entire pool is mostly lost after only these
> > > 2% of PGs down.
> > >
> > > I am running ceph 14.2.9.
> > >
> > > OSD 92 log https://pastebin.com/5aq8SyCW
> > > OSD 97 log https://pastebin.com/uJELZxwr
> > >
> > > ceph-bluestore-tool repair without --deep showed "success" but OSD's
> > still
> > > fail with the log above.
> > >
> > > Log from trying ceph-bluestore-tool repair --deep which is still
> running,
> > > not sure if it will actually fix anything and log looks pretty bad.
> > > https://pastebin.com/gkqTZpY3
> > >
> > > Trying "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-97
> --op
> > > list" gave me input/output error. But everything in SMART looks OK,
> and i
> > > see no indication of hardware read error in any logs. Same for both
> OSD.
> > >
> > > The OSD's with corruption have absolutely no bad sectors and likely
> have
> > > only a minor corruption but at important locations.
> > >
> > > Any ideas on how to recover this kind of scenario ? Any tips would be
> > > highly appreciated.
> > >
> > > Best regards,
> > > Kári Bertilsson
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Zeroing out rbd image or volume

2020-05-12 Thread Paul Emmerich
And many hypervisors will turn writing zeroes into an unmap/trim (qemu
detect-zeroes=unmap), so running trim on the entire empty disk is often the
same as writing zeroes.
So +1 for encryption being the proper way here
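
For reference, layering the encryption on the client side would look
something like this (pool, image and mapping names are examples):

  rbd map mypool/myimage                  # -> /dev/rbd0
  cryptsetup luksFormat /dev/rbd0
  cryptsetup open /dev/rbd0 myimage-crypt
  mkfs.xfs /dev/mapper/myimage-crypt
  # "secure delete" then means destroying the LUKS header/keys
  # (e.g. cryptsetup luksErase /dev/rbd0) instead of rewriting the data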


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Tue, May 12, 2020 at 1:52 PM Jason Dillaman  wrote:

> I would also like to add that the OSDs can (and will) use redirect on write
> techniques (not to mention the physical device hardware as well).
> Therefore, your zeroing of the device might just cause the OSDs to allocate
> new extents of zeros while the old extents remain intact (albeit
> unreferenced and available for future writes). The correct solution would
> be to layer LUKS/dm-crypt on top of the RBD device if you need a strong
> security guarantee about a specific image, or use encrypted OSDs if the
> concern is about the loss of the OSD physical device.
>
> On Tue, May 12, 2020 at 6:58 AM Marc Roos 
> wrote:
>
> >
> > dd if=/dev/zero of=rbd  :) but if you have encrypted osd's, what
> > would be the use of this?
> >
> >
> >
> > -Original Message-
> > From: huxia...@horebdata.cn [mailto:huxia...@horebdata.cn]
> > Sent: 12 May 2020 12:55
> > To: ceph-users
> > Subject: [ceph-users] Zeroing out rbd image or volume
> >
> > Hi, Ceph folks,
> >
> > Is there an rbd command, or any other way, to zero out rbd images or
> > volumes? I would like to write all-zero data to an rbd image/volume
> > before removing it.
> >
> > Any comments would be appreciated.
> >
> > best regards,
> >
> > samuel
> > Horebdata AG
> > Switzerland
> >
> >
> >
> >
> > huxia...@horebdata.cn
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > email to ceph-users-le...@ceph.io
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
>
> --
> Jason
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph meltdown, need help

2020-05-05 Thread Paul Emmerich
Check network connectivity on all configured networks between all hosts;
OSDs running but being marked as down is usually a network problem.
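
A quick sketch of what to check from every host, on both the public and the
cluster network (addresses and MTU are examples):

  ping -c 3 <peer-public-ip>
  ping -c 3 <peer-cluster-ip>
  ping -c 3 -M do -s 8972 <peer-cluster-ip>   # only if you run jumbo frames; checks MTU end to end
  ss -tnp | grep ceph-osd                     # are the OSD<->OSD sessions actually established?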


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Tue, May 5, 2020 at 4:45 PM Frank Schilder  wrote:

> Dear Dan,
>
> thank you for your fast response. Please find the log of the first OSD
> that went down and the ceph.log with these links:
>
> https://files.dtu.dk/u/tF1zv5zdc6mmXXO_/ceph.log?l
> https://files.dtu.dk/u/hPb5qax2-b6W9vmp/ceph-osd.2.log?l
>
> I can collect more osd logs if this helps.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster 
> Sent: 05 May 2020 16:25:31
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: [ceph-users] Ceph meltdown, need help
>
> Hi Frank,
>
> Could you share any ceph-osd logs and also the ceph.log from a mon to
> see why the cluster thinks all those osds are down?
>
> Simply marking them up isn't going to help, I'm afraid.
>
> Cheers, Dan
>
>
> On Tue, May 5, 2020 at 4:12 PM Frank Schilder  wrote:
> >
> > Hi all,
> >
> > a lot of OSDs crashed in our cluster. Mimic 13.2.8. Current status
> included below. All daemons are running, no OSD process crashed. Can I
> start marking OSDs in and up to get them back talking to each other?
> >
> > Please advice on next steps. Thanks!!
> >
> > [root@gnosis ~]# ceph status
> >   cluster:
> > id: e4ece518-f2cb-4708-b00f-b6bf511e91d9
> > health: HEALTH_WARN
> > 2 MDSs report slow metadata IOs
> > 1 MDSs report slow requests
> > nodown,noout,norecover flag(s) set
> > 125 osds down
> > 3 hosts (48 osds) down
> > Reduced data availability: 2221 pgs inactive, 1943 pgs down,
> 190 pgs peering, 13 pgs stale
> > Degraded data redundancy: 5134396/500993581 objects degraded
> (1.025%), 296 pgs degraded, 299 pgs undersized
> > 9622 slow ops, oldest one blocked for 2913 sec, daemons
> [osd.0,osd.100,osd.101,osd.112,osd.118,osd.133,osd.136,osd.142,osd.144,osd.145]...
> have slow ops.
> >
> >   services:
> > mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
> > mgr: ceph-02(active), standbys: ceph-03, ceph-01
> > mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
> > osd: 288 osds: 90 up, 215 in; 230 remapped pgs
> >  flags nodown,noout,norecover
> >
> >   data:
> > pools:   10 pools, 2545 pgs
> > objects: 62.61 M objects, 144 TiB
> > usage:   219 TiB used, 1.6 PiB / 1.8 PiB avail
> > pgs: 1.729% pgs unknown
> >  85.540% pgs not active
> >  5134396/500993581 objects degraded (1.025%)
> >  1796 down
> >  226  active+undersized+degraded
> >  147  down+remapped
> >  140  peering
> >  65   active+clean
> >  44   unknown
> >  38   undersized+degraded+peered
> >  38   remapped+peering
> >  17   active+undersized+degraded+remapped+backfill_wait
> >  12   stale+peering
> >  12   active+undersized+degraded+remapped+backfilling
> >  4active+undersized+remapped
> >  2remapped
> >  2undersized+degraded+remapped+peered
> >  1stale
> >  1undersized+degraded+remapped+backfilling+peered
> >
> >   io:
> > client:   26 KiB/s rd, 206 KiB/s wr, 21 op/s rd, 50 op/s wr
> >
> > [root@gnosis ~]# ceph health detail
> > HEALTH_WARN 2 MDSs report slow metadata IOs; 1 MDSs report slow
> requests; nodown,noout,norecover flag(s) set; 125 osds down; 3 hosts (48
> osds) down; Reduced data availability: 2219 pgs inactive, 1943 pgs down,
> 188 pgs peering, 13 pgs stale; Degraded data redundancy: 5214696/500993589
> objects degraded (1.041%), 298 pgs degraded, 299 pgs undersized; 9788 slow
> ops, oldest one blocked for 2953 sec, daemons
> [osd.0,osd.100,osd.101,osd.112,osd.118,osd.133,osd.136,osd.142,osd.144,osd.145]...
> have slow ops.
> > MDS_SLOW_METADATA_IO 2 MDSs report slow metadata IOs
> > mdsceph-08(mds.0): 100+ slow metadata IOs are blocked > 30 secs,
> oldest blocked for 2940 secs
> > mdsceph-12(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest
> blocked for 2942 secs
> &

[ceph-users] Re: mount issues with rbd running xfs - Structure needs cleaning

2020-05-04 Thread Paul Emmerich
Yeah, file systems rarely really do a read-only mount without providing
some very obscure options; no idea about xfs specifically.

Suggestion: use a keyring with profile rbd-read-only to ensure that it
definitely can't write when mapping the rbd. xfs might just do the right
thing automatically when encountering a read-only block device
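
A sketch of what that could look like (client name, pool and image are
examples; if I remember the cap profiles right, the mon cap uses
'profile rbd' while the read-only profile goes on the OSD cap):

  ceph auth get-or-create client.rbd-ro mon 'profile rbd' \
      osd 'profile rbd-read-only pool=mypool'
  rbd map --id rbd-ro --read-only mypool/myimage
  mount -o ro,norecovery,nouuid /dev/rbd0 /mnt   # norecovery/nouuid keep xfs from writing at mount time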


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, May 4, 2020 at 7:05 PM Void Star Nill 
wrote:

> Thanks Janne. I actually meant that the RW mount is unmounted already -
> sorry about the confusion.
>
> - Shridhar
>
> On Mon, 4 May 2020 at 00:35, Janne Johansson  wrote:
>
> > Den mån 4 maj 2020 kl 05:14 skrev Void Star Nill <
> void.star.n...@gmail.com
> > >:
> >
> >> One of the use cases (e.g. machine learning workloads) for RBD volumes
> in
> >> our production environment is that, users could mount an RBD volume in
> RW
> >> mode in a container, write some data to it and later use the same volume
> >> in
> >> RO mode into a number of containers in parallel to consume the data.
> >>
> >> I am trying to test this scenario with different file systems (ext3/4
> and
> >> xfs). I have an automated test code that creates a volume, maps it to a
> >> node, mounts in RW mode and write some data into it. Later the same
> volume
> >> is mounted in RO mode in a number of other nodes and a process reads
> from
> >> the file.
> >>
> >
> > Is the RW unmounted or not? You write "stopped writing" but that doesn't
> > clearly
> > indicate if you make it impossible or just "I ask it to not make much
> IO".
> > Given that many filesystems are doing very lazy writes, caches a lot and
> > so on,
> > it would be very important to make sure 1) ALL writes are done, which is
> > easiest with
> > umount I think, and 2) that the mounting clients know they can't write to it at
> > all, or otherwise
> > as someone said, it might still be updating some metainfo like the
> > journals or
> > "last mounted on /X" or whatever magic fs's store even while not altering
> > the files
> > inside the fs.
> >
> > It's kind of hard to tell filesystems that are accustomed to being in
> > charge of all
> > mounted instances to sit in the back seat and not be allowed to control
> > stuff.
> >
> > --
> > May the most significant bit of your life be positive.
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 14.2.9 MDS Failing

2020-05-01 Thread Paul Emmerich
On Fri, May 1, 2020 at 9:27 PM Paul Emmerich  wrote:

> The OpenFileTable objects are safe to delete while the MDS is offline
> anyways, the RADOS object names are mds*_openfiles*
>

I should clarify this a little bit: you shouldn't touch the CephFS internal
state or data structures unless you know *exactly* what you are doing.

However, it is pretty safe to delete these files in general; running a
scrub afterwards is a good idea anyway.

But only do this after reading up on lots of details or consulting an
expert.
My assessment here is purely an educated guess based on the error and can
be wrong or counter-productive. All of my mailing list advice is just
things that I know off the top of my head with no further research.
Take with a grain of salt. Don't touch stuff that you don't understand if
you have important data in there.
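
Purely to make the above concrete, and with all the caveats just mentioned:
with the MDS stopped, the operation being discussed would look roughly like
this (metadata pool and object names are examples):

  rados -p cephfs_metadata ls | grep -E '^mds[0-9]+_openfiles'
  rados -p cephfs_metadata rm mds0_openfiles.0
  # once the MDS is back up, scrub:
  ceph daemon mds.<name> scrub_path / recursive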


Paul


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


>
>
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>
> On Fri, May 1, 2020 at 9:04 PM Marco Pizzolo 
> wrote:
>
>> Also seeing errors such as this:
>>
>>
>> [2020-05-01 13:15:20,970][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:20,970][systemd][WARNING] failed activating OSD, retries
>> left: 11
>> [2020-05-01 13:15:20,974][ceph_volume.process][INFO  ] stderr -->
>>  RuntimeError: could not find osd.13 with osd_fsid
>> dd49cd80-418e-4a8c-8ebf-a33d339663ff
>> [2020-05-01 13:15:20,989][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:20,989][systemd][WARNING] failed activating OSD, retries
>> left: 11
>> [2020-05-01 13:15:20,998][ceph_volume.process][INFO  ] stderr -->
>>  RuntimeError: could not find osd.5 with osd_fsid
>> 4eaf2baa-60f2-4045-8964-6152608c742a
>> [2020-05-01 13:15:21,014][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:21,014][systemd][WARNING] failed activating OSD, retries
>> left: 11
>> [2020-05-01 13:15:21,019][ceph_volume.process][INFO  ] stderr -->
>>  RuntimeError: could not find osd.9 with osd_fsid
>> 32f4a716-f26e-4579-a074-5d6452c22e34
>> [2020-05-01 13:15:21,035][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:21,035][systemd][WARNING] failed activating OSD, retries
>> left: 11
>> [2020-05-01 13:15:25,972][ceph_volume.process][INFO  ] Running command:
>> /usr/sbin/ceph-volume lvm trigger 1-0f0e6dd7-9dd8-4b48-beaa-084f55f73b32
>> [2020-05-01 13:15:25,994][ceph_volume.process][INFO  ] Running command:
>> /usr/sbin/ceph-volume lvm trigger 13-dd49cd80-418e-4a8c-8ebf-a33d339663ff
>> [2020-05-01 13:15:26,020][ceph_volume.process][INFO  ] Running command:
>> /usr/sbin/ceph-volume lvm trigger 5-4eaf2baa-60f2-4045-8964-6152608c742a
>> [2020-05-01 13:15:26,040][ceph_volume.process][INFO  ] Running command:
>> /usr/sbin/ceph-volume lvm trigger 9-32f4a716-f26e-4579-a074-5d6452c22e34
>> [2020-05-01 13:15:26,388][ceph_volume.process][INFO  ] stderr -->
>>  RuntimeError: could not find osd.1 with osd_fsid
>> 0f0e6dd7-9dd8-4b48-beaa-084f55f73b32
>> [2020-05-01 13:15:26,389][ceph_volume.process][INFO  ] stderr -->
>>  RuntimeError: could not find osd.13 with osd_fsid
>> dd49cd80-418e-4a8c-8ebf-a33d339663ff
>> [2020-05-01 13:15:26,391][ceph_volume.process][INFO  ] stderr -->
>>  RuntimeError: could not find osd.5 with osd_fsid
>> 4eaf2baa-60f2-4045-8964-6152608c742a
>> [2020-05-01 13:15:26,402][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:26,403][systemd][WARNING] failed activating OSD, retries
>> left: 10
>> [2020-05-01 13:15:26,403][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:26,404][systemd][WARNING] failed activating OSD, retries
>> left: 10
>> [2020-05-01 13:15:26,404][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:26,405][systemd][WARNING] failed activating OSD, retries
>> left: 10
>> [2020-05-01 13:15:26,411][ceph_volume.process][INFO  ] stderr -->
>>  RuntimeError: could not find osd.9 with osd_fsid
>> 32f4a716-f26e-4579-a074-5d6452c22e34
>> [2020-05-01 13:15:26,424][systemd][WARNING] command returned non-zero exit
>> status: 1
>> [2020-05-01 13:15:26,424][systemd][WARNING] failed activating OSD, retries
>> left: 10

[ceph-users] Re: 14.2.9 MDS Failing

2020-05-01 Thread Paul Emmerich
The OpenFileTable objects are safe to delete while the MDS is offline
anyways, the RADOS object names are mds*_openfiles*



Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Fri, May 1, 2020 at 9:04 PM Marco Pizzolo  wrote:

> Also seeing errors such as this:
>
>
> [2020-05-01 13:15:20,970][systemd][WARNING] command returned non-zero exit
> status: 1
> [2020-05-01 13:15:20,970][systemd][WARNING] failed activating OSD, retries
> left: 11
> [2020-05-01 13:15:20,974][ceph_volume.process][INFO  ] stderr -->
>  RuntimeError: could not find osd.13 with osd_fsid
> dd49cd80-418e-4a8c-8ebf-a33d339663ff
> [2020-05-01 13:15:20,989][systemd][WARNING] command returned non-zero exit
> status: 1
> [2020-05-01 13:15:20,989][systemd][WARNING] failed activating OSD, retries
> left: 11
> [2020-05-01 13:15:20,998][ceph_volume.process][INFO  ] stderr -->
>  RuntimeError: could not find osd.5 with osd_fsid
> 4eaf2baa-60f2-4045-8964-6152608c742a
> [2020-05-01 13:15:21,014][systemd][WARNING] command returned non-zero exit
> status: 1
> [2020-05-01 13:15:21,014][systemd][WARNING] failed activating OSD, retries
> left: 11
> [2020-05-01 13:15:21,019][ceph_volume.process][INFO  ] stderr -->
>  RuntimeError: could not find osd.9 with osd_fsid
> 32f4a716-f26e-4579-a074-5d6452c22e34
> [2020-05-01 13:15:21,035][systemd][WARNING] command returned non-zero exit
> status: 1
> [2020-05-01 13:15:21,035][systemd][WARNING] failed activating OSD, retries
> left: 11
> [2020-05-01 13:15:25,972][ceph_volume.process][INFO  ] Running command:
> /usr/sbin/ceph-volume lvm trigger 1-0f0e6dd7-9dd8-4b48-beaa-084f55f73b32
> [2020-05-01 13:15:25,994][ceph_volume.process][INFO  ] Running command:
> /usr/sbin/ceph-volume lvm trigger 13-dd49cd80-418e-4a8c-8ebf-a33d339663ff
> [2020-05-01 13:15:26,020][ceph_volume.process][INFO  ] Running command:
> /usr/sbin/ceph-volume lvm trigger 5-4eaf2baa-60f2-4045-8964-6152608c742a
> [2020-05-01 13:15:26,040][ceph_volume.process][INFO  ] Running command:
> /usr/sbin/ceph-volume lvm trigger 9-32f4a716-f26e-4579-a074-5d6452c22e34
> [2020-05-01 13:15:26,388][ceph_volume.process][INFO  ] stderr -->
>  RuntimeError: could not find osd.1 with osd_fsid
> 0f0e6dd7-9dd8-4b48-beaa-084f55f73b32
> [2020-05-01 13:15:26,389][ceph_volume.process][INFO  ] stderr -->
>  RuntimeError: could not find osd.13 with osd_fsid
> dd49cd80-418e-4a8c-8ebf-a33d339663ff
> [2020-05-01 13:15:26,391][ceph_volume.process][INFO  ] stderr -->
>  RuntimeError: could not find osd.5 with osd_fsid
> 4eaf2baa-60f2-4045-8964-6152608c742a
> [2020-05-01 13:15:26,402][systemd][WARNING] command returned non-zero exit
> status: 1
> [2020-05-01 13:15:26,403][systemd][WARNING] failed activating OSD, retries
> left: 10
> [2020-05-01 13:15:26,403][systemd][WARNING] command returned non-zero exit
> status: 1
> [2020-05-01 13:15:26,404][systemd][WARNING] failed activating OSD, retries
> left: 10
> [2020-05-01 13:15:26,404][systemd][WARNING] command returned non-zero exit
> status: 1
> [2020-05-01 13:15:26,405][systemd][WARNING] failed activating OSD, retries
> left: 10
> [2020-05-01 13:15:26,411][ceph_volume.process][INFO  ] stderr -->
>  RuntimeError: could not find osd.9 with osd_fsid
> 32f4a716-f26e-4579-a074-5d6452c22e34
> [2020-05-01 13:15:26,424][systemd][WARNING] command returned non-zero exit
> status: 1
> [2020-05-01 13:15:26,424][systemd][WARNING] failed activating OSD, retries
> left: 10
> [2020-05-01 13:15:31,408][ceph_volume.process][INFO  ] Running command:
> /usr/sbin/ceph-volume lvm trigger 1-0f0e6dd7-9dd8-4b48-beaa-084f55f73b32
> [2020-05-01 13:15:31,408][ceph_volume.process][INFO  ] Running command:
> /usr/sbin/ceph-volume lvm trigger 5-4eaf2baa-60f2-4045-8964-6152608c742a
> [2020-05-01 13:15:31,409][ceph_volume.process][INFO  ] Running command:
> /usr/sbin/ceph-volume lvm trigger 13-dd49cd80-418e-4a8c-8ebf-a33d339663ff
> [2020-05-01 13:15:31,429][ceph_volume.process][INFO  ] Running command:
> /usr/sbin/ceph-volume lvm trigger 9-32f4a716-f26e-4579-a074-5d6452c22e34
> [2020-05-01 13:15:31,743][ceph_volume.process][INFO  ] stderr -->
>  RuntimeError: could not find osd.5 with osd_fsid
> 4eaf2baa-60f2-4045-8964-6152608c742a
> [2020-05-01 13:15:31,750][ceph_volume.process][INFO  ] stderr -->
>  RuntimeError: could not find osd.13 with osd_fsid
> dd49cd80-418e-4a8c-8ebf-a33d339663ff
> [2020-05-01 13:15:31,752][systemd][WARNING] command returned non-zero exit
> status: 1
> [2020-05-01 13:15:31,752][systemd][WARNING] failed activating OSD, retries
> left: 9
> [2020-05-01 13:15:31,754][ceph_volume.process][INFO  ] stderr -->
>  RuntimeError: could not fi

[ceph-users] Re: 4.14 kernel or greater recommendation for multiple active MDS

2020-05-01 Thread Paul Emmerich
I've seen issues with client reconnects on older kernels, yeah. They
sometimes get stuck after a network failure.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Apr 30, 2020 at 10:19 PM Gregory Farnum  wrote:

> On Tue, Apr 28, 2020 at 11:52 AM Robert LeBlanc 
> wrote:
> >
> > In the Nautilus manual it recommends >= 4.14 kernel for multiple active
> > MDSes. What are the potential issues for running the 4.4 kernel with
> > multiple MDSes? We are in the process of upgrading the clients, but at
> > times overrun the capacity of a single MDS server.
>
> I don't think this is documented specifically; you'd have to go
> through the git logs. Talked with the team and 4.14 was the upstream
> kernel when we marked multi-MDS as stable, with the general stream of
> ongoing fixes that always applies there.
>
> There aren't any known issues that will cause file consistency to
> break or anything; I'd be more worried about clients having issues
> reconnecting when their network blips or an MDS fails over.
> -Greg
>
> >
> > MULTIPLE ACTIVE METADATA SERVERS
> > <
> https://docs.ceph.com/docs/nautilus/cephfs/kernel-features/#multiple-active-metadata-servers
> >
> >
> > The feature has been supported since the Luminous release. It is
> > recommended to use Linux kernel clients >= 4.14 when there are multiple
> > active MDS.
> > Thank you,
> > Robert LeBlanc
> > 
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph MDS - busy?

2020-04-30 Thread Paul Emmerich
Things to check (rough example commands below):

* metadata is on SSD?
* try multiple active MDS servers
* try a larger cache for the MDS
* try a recent version of Ceph
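
Something like this (pool and MDS names taken from the fs status output
below; the cache size is an illustration, and on Luminous you may need to
set allow_multimds before raising max_mds):

  ceph osd pool get cephfs_metadata crush_rule    # does the metadata pool map to SSD OSDs?
  ceph fs set cephfs max_mds 2                    # second active MDS
  ceph daemon mds.ceph-mds2 config set mds_cache_memory_limit 17179869184   # ~16 GiB cache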

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Apr 30, 2020 at 10:02 AM  wrote:

> Hi.
>
> How do I find out if the MDS is "busy" - being the one limiting CephFS
> metadata throughput. (12.2.8).
>
> $ time find . | wc -l
> 1918069
>
> real8m43.008s
> user0m2.689s
> sys 0m7.818s
>
> or 3.667ms per file.
> In the light of "potentially batching" and a network latency of ~0.20ms to
> the MDS - I have a feeling that this could be significantly improved.
>
> Then I additionally tried to do the same through the NFS-Ganesha gateway.
>
> For reference:
> Same - but on "local DAS - xfs".
> $ time find . | wc -l
> 1918061
>
> real0m4.848s
> user0m2.360s
> sys 0m2.816s
>
> Same but "above local DAS over NFS":
> $ time find . | wc -l
> 1918061
>
> real5m56.546s
> user0m2.903s
> sys 0m34.381s
>
>
> jk@ceph-mon1:~$ sudo ceph fs status
> [sudo] password for jk:
> cephfs - 84 clients
> ==
> +--++---+---+---+---+
> | Rank | State  |MDS|Activity   |  dns  |  inos |
> +--++---+---+---+---+
> |  0   | active | ceph-mds2 | Reqs: 1369 /s | 11.3M | 11.3M |
> | 0-s  | standby-replay | ceph-mds1 | Evts:0 /s |0  |0  |
> +--++---+---+---+---+
> +--+--+---+---+
> |   Pool   |   type   |  used | avail |
> +--+--+---+---+
> | cephfs_metadata  | metadata |  226M | 16.4T |
> |   cephfs_data|   data   |  164T |  132T |
> | cephfs_data_ec42 |   data   |  180T |  265T |
> +--+--+---+---+
>
> +-+
> | Standby MDS |
> +-+
> +-+
> MDS version: ceph version 12.2.5-45redhat1xenial
> (d4b9f17b56b3348566926849313084dd6efc2ca2) luminous (stable)
>
> How can we asses where the bottleneck is and what to do to speed it up?
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph crash hangs forever and recovery stop

2020-04-30 Thread Paul Emmerich
Best guess: the recovery process doesn't really stop, but it's just that
the mgr is dead and it no longer reports the progress

And yeah, I can confirm that having a huge number of crash reports is a
problem (had a case where a monitoring script crashed due to a
radosgw-admin bug... lots of crash reports)
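
If the backlog of crash reports itself is the problem, pruning them can help
once the mgr responds again (the retention value is an example; archive-all
only exists on newer releases):

  ceph crash ls | wc -l
  ceph crash prune 30        # drop reports older than 30 days
  ceph crash archive-all     # mark the remaining ones as seen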

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Apr 30, 2020 at 4:09 PM Francois Legrand 
wrote:

> Hi everybody (again),
> We recently had a lot of osd crashes (more than 30 osds crashed). This is
> now fixed, but it triggered a huge rebalancing+recovery.
> More or less in the same time, we noticed that the ceph crash ls (or
> whatever other ceph crash command) hangs forever and never returns.
> And finally, the recovery process stops regularly (after ~1 hour) but it
> can be restarted by resetting the mgr daemon (systemctl restart
> ceph-mgr.target on the active manager).
> There is nothing in the logs (the manager still works, the service is
> up, the dashboard is accessible but simply the recovery stops).
> We also tried to reboot the managers, but it doesn't solve the problem.
> I guess theses two problems should be linked, but not sure.
> Does anybody have a clue ?
> Thanks.
> F.
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade Luminous to Nautilus on a Debian system

2020-04-29 Thread Paul Emmerich
We run the Luminous/Mimic -> Nautilus upgrade by upgrading Ceph and
Debian at the same time, i.e., your first scenario.

Didn't encounter any problems with that; the Nautilus upgrade has been
very smooth for us and we've migrated almost all of our deployments
using our fully automated upgrade assistant; it's just one button that
does all the right things in the right order.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Apr 29, 2020 at 8:58 PM Herve Ballans
 wrote:
>
> Hi Alex,
>
> Thanks a lot for your tips. I note that for my planned upgrade.
>
> I take the opportunity here to add a complementary question regarding
> the require-osd-release functionality (ceph osd require-osd-release
> nautilus )
>
> I remember that one time I did that (on another cluster, a proxmox one)
> and it took a very long time and had a strong impact on the ceph
> performance during this operation (several hours)
>
> Did you notice that too on your side ?
>
> Thanks again,
> Hervé
>
> On 29/04/2020 20:39, Alex Gorbachev wrote:
> >
> >
> >
> > On Wed, Apr 29, 2020 at 11:54 AM Herve Ballans
> >  wrote:
> >
> > Hi all,
> >
> > I'm planning to upgrade one on my Ceph Cluster currently on Luminous
> > 12.2.13 / Debian Stretch (updated).
> > On this cluster, Luminous is packaged from the official Ceph repo
> > (deb
> > https://download.ceph.com/debian-luminous/ stretch main)
> >
> > I would like to upgrade it with Debian Buster and Nautilus using the
> > croit.io <http://croit.io> repository (deb
> > https://mirror.croit.io/debian-nautilus/ buster
> > main)
> >
> > I already prepared the steps procedure but I just want to verify one
> > step regarding the upgrade of the ceph packages.
> >
> > Do I have to upgrade ceph in the same time than Debian or do i
> > have to
> > upgrade ceph after the Debian upgrade from Stretch to Buster ?
> >
> > 1) In the first case :
> >
> >   * Replace stretch by buster in /etc/apt/sources.list
> >   * Modify the ceph.list repo to the croit.io one
> >   * Upgrade the entire nodes
> >
> > 2) In the second case (upgrade Debian then Ceph)
> >
> >   * Replace stretch by buster in /etc/apt/sources.list
> >   * keep the /etc/apt/sources.list.d/ceph.list as it is
> >   * Upgrade and reboot the nodes
> >   * replace the ceph.list file with the croit.io one
> >   * upgrade the ceph packages
> >   * restarting the Ceph services (in the right order MON -> MGR -> OSD
> > -> MDS)
> >
> > Thanks a lot for your advices
> >
> > Regards,
> > Hervé
> >
> >
> > Hi Hervé,
> >
> > The one thing I had trouble with (and it's primarily from not reading
> > the docs very carefully) is that you should NOT enable the messenger 2
> > protocol until all OSDs have been updated.  In other words, Ceph will
> > complain about not running msgr2, but you should leave it like that
> > until all OSDs are on Nautilus.  Then you run:
> > ceph mon enable-msgr2
> > ceph osd require-osd-release nautilus
> >
> > Ref: https://docs.ceph.com/docs/master/releases/nautilus/
> >
> > --
> > Alex Gorbachev
> > Intelligent Systems Services Inc.
> >
> >
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus cluster damaged + crashing OSDs

2020-04-21 Thread Paul Emmerich
On Tue, Apr 21, 2020 at 12:44 PM Brad Hubbard  wrote:
>
> On Tue, Apr 21, 2020 at 6:35 PM Paul Emmerich  wrote:
> >
> > On Tue, Apr 21, 2020 at 3:20 AM Brad Hubbard  wrote:
> > >
> > > Wait for recovery to finish so you know whether any data from the down
> > > OSDs is required. If not just reprovision them.
> >
> > Recovery will not finish from this state as several PGs are down and/or 
> > stale.
>
> What I meant was let recovery get as far as it can.

Which doesn't solve anything; you can already see that you need to get
at least some of these OSDs back in order to fix it.
No point in waiting for the recovery.

I agree that it looks like https://tracker.ceph.com/issues/36337
I happen to know Jonas, who opened that issue and wrote the script;
I'll poke him, maybe he has an idea or additional input.
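
For reference, to see exactly which dead OSDs a down PG is waiting for (the
PG id is an example):

  ceph pg 9.1f query | grep -A 5 down_osds_we_would_probe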


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

>
> >
> >
> > Paul
> >
> > >
> > > If data is required from the down OSDs you will need to run a query on
> > > the pg(s) to find out what OSDs have the required copies of the
> > > pg/object required. you can then export the pg from the down osd using
> > > the ceph-objectstore-tool, back it up, then import it back into the
> > > cluster.
> > >
> > > On Tue, Apr 21, 2020 at 1:05 AM Robert Sander
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > one of our customers had his Ceph cluster crashed due to a power or 
> > > > network outage (they still try to figure out what happened).
> > > >
> > > > The cluster is very unhealthy but recovering:
> > > >
> > > > # ceph -s
> > > >   cluster:
> > > > id: 1c95ca5d-948b-4113-9246-14761cb9a82a
> > > > health: HEALTH_ERR
> > > > 1 filesystem is degraded
> > > > 1 mds daemon damaged
> > > > 1 osds down
> > > > 1 pools have many more objects per pg than average
> > > > 1/115117480 objects unfound (0.000%)
> > > > Reduced data availability: 71 pgs inactive, 53 pgs down, 18 
> > > > pgs peering, 27 pgs stale
> > > > Possible data damage: 1 pg recovery_unfound
> > > > Degraded data redundancy: 7303464/230234960 objects 
> > > > degraded (3.172%), 693 pgs degraded, 945 pgs undersized
> > > > 14 daemons have recently crashed
> > > >
> > > >   services:
> > > > mon: 3 daemons, quorum 
> > > > maslxlabstore01,maslxlabstore02,maslxlabstore04 (age 64m)
> > > > mgr: maslxlabstore01(active, since 69m), standbys: maslxlabstore03, 
> > > > maslxlabstore02, maslxlabstore04
> > > > mds: cephfs:2/3 
> > > > {0=maslxlabstore03=up:resolve,1=maslxlabstore01=up:resolve} 2 
> > > > up:standby, 1 damaged
> > > > osd: 140 osds: 130 up (since 4m), 131 in (since 4m); 847 remapped 
> > > > pgs
> > > > rgw: 4 daemons active (maslxlabstore01.rgw0, maslxlabstore02.rgw0, 
> > > > maslxlabstore03.rgw0, maslxlabstore04.rgw0)
> > > >
> > > >   data:
> > > > pools:   6 pools, 8328 pgs
> > > > objects: 115.12M objects, 218 TiB
> > > > usage:   425 TiB used, 290 TiB / 715 TiB avail
> > > > pgs: 0.853% pgs not active
> > > >  7303464/230234960 objects degraded (3.172%)
> > > >  13486/230234960 objects misplaced (0.006%)
> > > >  1/115117480 objects unfound (0.000%)
> > > >  7311 active+clean
> > > >  338  active+undersized+degraded+remapped+backfill_wait
> > > >  255  active+undersized+degraded+remapped+backfilling
> > > >  215  active+undersized+remapped+backfilling
> > > >  99   active+undersized+degraded
> > > >  44   down
> > > >  37   active+undersized+remapped+backfill_wait
> > > >  13   stale+peering
> > > >  9stale+down
> > > >  5stale+remapped+peering
> > > >  1active+recovery_unfound+undersized+degraded+remapped
> > > >  1active+clean+remapped
> > > >
> > > >   io:
> > > > client:   168 B/s 

[ceph-users] Re: Nautilus cluster damaged + crashing OSDs

2020-04-21 Thread Paul Emmerich
On Tue, Apr 21, 2020 at 3:20 AM Brad Hubbard  wrote:
>
> Wait for recovery to finish so you know whether any data from the down
> OSDs is required. If not just reprovision them.

Recovery will not finish from this state as several PGs are down and/or stale.


Paul

>
> If data is required from the down OSDs you will need to run a query on
> the pg(s) to find out what OSDs have the required copies of the
> pg/object required. you can then export the pg from the down osd using
> the ceph-objectstore-tool, back it up, then import it back into the
> cluster.
>
> On Tue, Apr 21, 2020 at 1:05 AM Robert Sander
>  wrote:
> >
> > Hi,
> >
> > one of our customers had his Ceph cluster crashed due to a power or network 
> > outage (they still try to figure out what happened).
> >
> > The cluster is very unhealthy but recovering:
> >
> > # ceph -s
> >   cluster:
> > id: 1c95ca5d-948b-4113-9246-14761cb9a82a
> > health: HEALTH_ERR
> > 1 filesystem is degraded
> > 1 mds daemon damaged
> > 1 osds down
> > 1 pools have many more objects per pg than average
> > 1/115117480 objects unfound (0.000%)
> > Reduced data availability: 71 pgs inactive, 53 pgs down, 18 pgs 
> > peering, 27 pgs stale
> > Possible data damage: 1 pg recovery_unfound
> > Degraded data redundancy: 7303464/230234960 objects degraded 
> > (3.172%), 693 pgs degraded, 945 pgs undersized
> > 14 daemons have recently crashed
> >
> >   services:
> > mon: 3 daemons, quorum maslxlabstore01,maslxlabstore02,maslxlabstore04 
> > (age 64m)
> > mgr: maslxlabstore01(active, since 69m), standbys: maslxlabstore03, 
> > maslxlabstore02, maslxlabstore04
> > mds: cephfs:2/3 
> > {0=maslxlabstore03=up:resolve,1=maslxlabstore01=up:resolve} 2 up:standby, 1 
> > damaged
> > osd: 140 osds: 130 up (since 4m), 131 in (since 4m); 847 remapped pgs
> > rgw: 4 daemons active (maslxlabstore01.rgw0, maslxlabstore02.rgw0, 
> > maslxlabstore03.rgw0, maslxlabstore04.rgw0)
> >
> >   data:
> > pools:   6 pools, 8328 pgs
> > objects: 115.12M objects, 218 TiB
> > usage:   425 TiB used, 290 TiB / 715 TiB avail
> > pgs: 0.853% pgs not active
> >  7303464/230234960 objects degraded (3.172%)
> >  13486/230234960 objects misplaced (0.006%)
> >  1/115117480 objects unfound (0.000%)
> >  7311 active+clean
> >  338  active+undersized+degraded+remapped+backfill_wait
> >  255  active+undersized+degraded+remapped+backfilling
> >  215  active+undersized+remapped+backfilling
> >  99   active+undersized+degraded
> >  44   down
> >  37   active+undersized+remapped+backfill_wait
> >  13   stale+peering
> >  9stale+down
> >  5stale+remapped+peering
> >  1active+recovery_unfound+undersized+degraded+remapped
> >  1active+clean+remapped
> >
> >   io:
> > client:   168 B/s rd, 0 B/s wr, 0 op/s rd, 0 op/s wr
> > recovery: 1.9 GiB/s, 15 keys/s, 948 objects/s
> >
> >
> > The MDS cluster is unable to start because one of them is damaged.
> >
> > 10 of the OSDs do not start. They crash very early in the boot process:
> >
> > 2020-04-20 16:26:14.935 7f818ec8cc00  0 set uid:gid to 64045:64045 
> > (ceph:ceph)
> > 2020-04-20 16:26:14.935 7f818ec8cc00  0 ceph version 14.2.9 
> > (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable), process 
> > ceph-osd, pid 69463
> > 2020-04-20 16:26:14.935 7f818ec8cc00  0 pidfile_write: ignore empty 
> > --pid-file
> > 2020-04-20 16:26:15.503 7f818ec8cc00  0 starting osd.42 osd_data 
> > /var/lib/ceph/osd/ceph-42 /var/lib/ceph/osd/ceph-42/journal
> > 2020-04-20 16:26:15.523 7f818ec8cc00  0 load: jerasure load: lrc load: isa
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > compaction_readahead_size = 2MB
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > compaction_style = kCompactionStyleLevel
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > compaction_threads = 32
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option compression = 
> > kNoCompression
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option flusher_threads 
> > = 8
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > level0_file_num_compaction_trigger = 8
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > level0_slowdown_writes_trigger = 32
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > level0_stop_writes_trigger = 64
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > max_background_compactions = 31
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > max_bytes_for_level_base = 536870912
> > 2020-04-20 16:26:16.339 7f818ec8cc00  0  set rocksdb option 
> > max_bytes_for_level_multiplier = 8
> > 

[ceph-users] Re: Check if upmap is supported by client?

2020-04-14 Thread Paul Emmerich
Hi,

On Mon, Apr 13, 2020 at 3:08 PM Frank Schilder  wrote:
>
> Hi Paul,
>
> thanks for the fast reply. When you say "bit 21", do you mean "(feature_map & 
> 2^21) == true" (i.e., counting from 0 starting at the right-hand end)?

yes

> Assuming upmap is supported by all clients. If I understand correctly, to use 
> the upmap mode with balancer, I need to set
>
> ceph osd set-require-min-compat-client luminous
>
> Which I would guess will not allow the jewel clients to reconnect. I would be 
> grateful if you could clarify these points to me:

yes

> 1) Can I use up-map mode without setting this?

no

> 2) If so, what happens if a jewel client without this feature bit set tries 
> to connect?

it'll error out with a message about feature mismatch; it checks the
actually relevant feature flags, not the reverse mapping to a release,
which is usually wrong for kernel clients.

> 3) I guess that in case that as soon as an up-map table is created, only 
> clients with this bit set can connect. In case we run into problems, is there 
> a way to roll back?

yes, you can remove the upmap items manually and change the client
requirement; I don't know the command to do this off the top of my
head
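
If I recall the CLI correctly, the rollback would look roughly like this
(the PG id is an example; all upmap entries have to be gone before the
client requirement can be lowered, and the last step may want
--yes-i-really-mean-it):

  ceph osd dump | grep pg_upmap_items    # list existing upmap entries
  ceph osd rm-pg-upmap-items 1.2f        # remove one entry
  ceph osd set-require-min-compat-client jewel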


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

>
> Many thanks and best regards,
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Paul Emmerich 
> Sent: 13 April 2020 13:32:40
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: [ceph-users] Check if upmap is supported by client?
>
> bit 21 in the features bitmap is upmap support
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Mon, Apr 13, 2020 at 11:53 AM Frank Schilder  wrote:
> >
> > Dear all,
> >
> > I would like to enable the balancer on a mimic 13.2.8 cluster in upmap 
> > mode. Unfortunately, I have a lot of ceph fs kernel clients that report 
> > their version as jewel, but might already support upmap. The ceph client 
> > kernel module received already a lot of back-ports and supports features of 
> > later ceph versions, for example, quotas. I guess they report back jewel, 
> > because not all luminous/mimic features are back-ported yet. Is there a way 
> > to check if a client supports upmap?
> >
> > Here some info:
> >
> > [root@gnosis ~]# ceph features
> > {
> > [...]
> >  "client": [
> > {
> > "features": "0x27018fb86aa42ada",
> > "release": "jewel",
> > "num": 1676
> > },
> > {
> > "features": "0x2f018fb86aa42ada",
> > "release": "luminous",
> > "num": 1
> > },
> > {
> > "features": "0x3ffddff8ffacfffb",
> > "release": "luminous",
> > "num": 167
> > }
> > ],
> >
> > The fs clients are the top two entries, the third entry is rbd clients. 
> > Note that the feature key for the fs clients is almost identical. Here a 
> > snippet from mds session ls for one such jewel client:
> >
> > {
> > "id": 25641514,
> > "num_leases": 0,
> > "num_caps": 1,
> > "state": "open",
> > "request_load_avg": 0,
> > "uptime": 588563.550276,
> > "replay_requests": 0,
> > "completed_requests": 0,
> > "reconnecting": false,
> > "inst": "client.25641514 192.168.57.124:0/3398308464",
> > "client_metadata": {
> > "features": "00ff",
> > "entity_id": "con-fs2-hpc",
> > "hostname": "sn253.hpc.ait.dtu.dk",
> > "kernel_version": "3.10.0-957.12.2.el7.x86_64",
> > "root": "/hpc/groups"
> > }
> > }
> >
> > Since I would like to use upmap right from the beginning, my alternative is 
> > to re-weight a few of the really bad outliers manually to simplify changing 
> > back.
> >
> > What would you suggest?
> >
> > Thanks and best regards,
> >
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Check if upmap is supported by client?

2020-04-13 Thread Paul Emmerich
bit 21 in the features bitmap is upmap support
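
A quick way to test that bit against the mask reported by 'ceph features'
(the mask below is taken from the quoted output):

  echo $(( (0x27018fb86aa42ada >> 21) & 1 ))   # prints 1 -> these clients do support upmap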

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Apr 13, 2020 at 11:53 AM Frank Schilder  wrote:
>
> Dear all,
>
> I would like to enable the balancer on a mimic 13.2.8 cluster in upmap mode. 
> Unfortunately, I have a lot of ceph fs kernel clients that report their 
> version as jewel, but might already support upmap. The ceph client kernel 
> module received already a lot of back-ports and supports features of later 
> ceph versions, for example, quotas. I guess they report back jewel, because 
> not all luminous/mimic features are back-ported yet. Is there a way to check 
> if a client supports upmap?
>
> Here some info:
>
> [root@gnosis ~]# ceph features
> {
> [...]
>  "client": [
> {
> "features": "0x27018fb86aa42ada",
> "release": "jewel",
> "num": 1676
> },
> {
> "features": "0x2f018fb86aa42ada",
> "release": "luminous",
> "num": 1
> },
> {
> "features": "0x3ffddff8ffacfffb",
> "release": "luminous",
> "num": 167
> }
> ],
>
> The fs clients are the top two entries, the third entry is rbd clients. Note 
> that the feature key for the fs clients is almost identical. Here a snippet 
> from mds session ls for one such jewel client:
>
> {
> "id": 25641514,
> "num_leases": 0,
> "num_caps": 1,
> "state": "open",
> "request_load_avg": 0,
> "uptime": 588563.550276,
> "replay_requests": 0,
> "completed_requests": 0,
> "reconnecting": false,
> "inst": "client.25641514 192.168.57.124:0/3398308464",
> "client_metadata": {
> "features": "00ff",
> "entity_id": "con-fs2-hpc",
> "hostname": "sn253.hpc.ait.dtu.dk",
> "kernel_version": "3.10.0-957.12.2.el7.x86_64",
> "root": "/hpc/groups"
> }
> }
>
> Since I would like to use upmap right from the beginning, my alternative is 
> to re-weight a few of the really bad outliers manually to simplify changing 
> back.
>
> What would you suggest?
>
> Thanks and best regards,
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-df free discrepancy

2020-04-10 Thread Paul Emmerich
On Sat, Apr 11, 2020 at 12:43 AM Reed Dier  wrote:
> That said, as a straw man argument, ~380GiB free, times 60 OSDs, should be 
> ~22.8TiB free, if all OSD's grew evenly, which they won't

Yes, that's the problem. They won't grow evenly. The fullest one will
grow faster than the others. Also, your full-ratio is probably 95%, not
100%.
So it'll be full as soon as OSD 70 takes another ~360 GB of data. But by
then the others will have taken less than 360 GB because of the bad
balancing. For example, OSD 28 will only get around 233 GB of data by
the time OSD 70 has 360 GB.
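
Roughly (a sketch, not the exact formula Ceph uses):

  MAX AVAIL ~= (full_ratio - usage of the fullest OSD) * OSD size
               / (that OSD's share of the pool's data) / replica count

i.e. MAX AVAIL is extrapolated from the fullest OSD, not summed over the
free space of all OSDs.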



Paul

> , which is still far short of 37TiB raw free, as expected.
> However, what doesn't track is the 5.6TiB available at the pools level, even 
> for a 3x replicated pool (5.6*3=16.8TiB, which is about 34% less than my 
> napkin math, which would be 22.8/3=7.6TiB.
> But what tracks even less is the hybrid pools, which use 1/3 of what the 
> 3x-replicated data consumes.
> Meaning if my napkin math is right, should show ~22.8TiB free.
>
> Am I grossly mis-understanding how this is calculated?
> Maybe this is fixed in Octopus?
>
> Just trying to get a grasp on what I'm seeing not matching expectations.
>
> Thanks,
>
> Reed
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recommendation for decent write latency performance from HDDs

2020-04-10 Thread Paul Emmerich
My main problem with LVM cache was always the unpredictable
performance. It's *very* hard to benchmark properly even in a
synthetic setup, even harder to guess anything about a real-world
workload.
And testing out both configurations for a real-world setup is often
not feasible, especially as usage patterns change over the lifetime of
a cluster.

Does anyone have any real-world experience with LVM cache?

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Apr 10, 2020 at 11:19 PM Reed Dier  wrote:
>
> Going to resurrect this thread to provide another option:
>
> LVM-cache, ie putting a cache device in-front of the bluestore-LVM LV.
>
> I only mention this because I noticed it in the SUSE documentation for SES6 
> (based on Nautilus) here: 
> https://documentation.suse.com/ses/6/html/ses-all/lvmcache.html
>
>  If you plan to use a fast drive as an LVM cache for multiple OSDs, be aware 
> that all OSD operations (including replication) will go through the caching 
> device. All reads will be queried from the caching device, and are only 
> served from the slow device in case of a cache miss. Writes are always 
> applied to the caching device first, and are flushed to the slow device at a 
> later time ('writeback' is the default caching mode).
> When deciding whether to utilize an LVM cache, verify whether the fast drive 
> can serve as a front for multiple OSDs while still providing an acceptable 
> amount of IOPS. You can test it by measuring the maximum amount of IOPS that 
> the fast device can serve, and then dividing the result by the number of OSDs 
> behind the fast device. If the result is lower or close to the maximum amount 
> of IOPS that the OSD can provide without the cache, LVM cache is probably not 
> suited for this setup.
>
> The interaction of the LVM cache device with OSDs is important. Writes are 
> periodically flushed from the caching device to the slow device. If the 
> incoming traffic is sustained and significant, the caching device will 
> struggle to keep up with incoming requests as well as the flushing process, 
> resulting in performance drop. Unless the fast device can provide much more 
> IOPS with better latency than the slow device, do not use LVM cache with a 
> sustained high volume workload. Traffic in a burst pattern is more suited for 
> LVM cache as it gives the cache time to flush its dirty data without 
> interfering with client traffic. For a sustained low traffic workload, it is 
> difficult to guess in advance whether using LVM cache will improve 
> performance. The best test is to benchmark and compare the LVM cache setup 
> against the WAL/DB setup. Moreover, as small writes are heavy on the WAL 
> partition, it is suggested to use the fast device for the DB and/or WAL 
> instead of an LVM cache.
>
>
> So it sounds like you could partition your NVMe for either LVM-cache, DB/WAL, 
> or both?
>
> Just figured this sounded a bit more akin to what you were looking for in 
> your original post and figured I would share.
>
> I don't use this, but figured I would share it.
>
> Reed
>
> On Apr 4, 2020, at 9:12 AM, jes...@krogh.cc wrote:
>
> Hi.
>
> We have a need for "bulk" storage - but with decent write latencies.
> Normally we would do this with a DAS with a Raid5 with 2GB Battery
> backed write cache in front - As cheap as possible but still getting the
> features of scalability of ceph.
>
> In our "first" ceph cluster we did the same - just stuffed in BBWC
> in the OSD nodes and we're fine - but now we're onto the next one and
> systems like:
> https://www.supermicro.com/en/products/system/1U/6119/SSG-6119P-ACR12N4L.cfm
> Does not support a Raid controller like that - but is branded as for "Ceph
> Storage Solutions".
>
> It do however support 4 NVMe slots in the front - So - some level of
> "tiering" using the NVMe drives should be what is "suggested" - but what
> do people do? What is recommeneded. I see multiple options:
>
> Ceph tiering at the "pool - layer":
> https://docs.ceph.com/docs/master/rados/operations/cache-tiering/
> And rumors that it is "deprecated":
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2.0/html/release_notes/deprecated_functionality
>
> Pro: Abstract layer
> Con: Deprecated? - Lots of warnings?
>
> Offloading the block.db on NVMe / SSD:
> https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
>
> Pro: Easy to deal with - seem heavily supported.
> Con: As far as I can tell - this will only benefit the metadata of the
> osd- not actual data. Thus a data-

[ceph-users] Re: remove S3 bucket with rados CLI

2020-04-10 Thread Paul Emmerich
Quick & dirty solution if only one OSD is full (likely as it looks
very unbalanced): take down the full OSD, delete data, take it back
online
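
Rough sequence (OSD id and bucket name are examples; make sure the pool
still satisfies min_size with that OSD down):

  ceph osd set noout                    # don't rebalance while the OSD is down
  systemctl stop ceph-osd@<id>          # on the host with the full OSD
  radosgw-admin bucket rm --bucket=mybucket --purge-objects
  systemctl start ceph-osd@<id>
  ceph osd unset noout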


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Apr 9, 2020 at 3:30 PM Dan van der Ster  wrote:
>
> On Thu, Apr 9, 2020 at 3:25 PM Robert Sander
>  wrote:
> >
> > Hi Dan,
> >
> > Am 09.04.20 um 15:08 schrieb Dan van der Ster:
> > >
> > > What do you have for full_ratio?
> >
> > The cluster is running Nautilus and the ratios should still be the
> > default values. Currently I have to direct access to report them.
> >
> > > Maybe you can unblock by setting the full_ratio to 0.96?
> >
> > We will try that on tuesday.
> >
> > Additionally here is the output of "ceph df":
> >
> > [root@fra1s80103 ~]# ceph df
> > RAW STORAGE:
> > CLASS SIZEAVAIL   USEDRAW USED %RAW USED
> > hdd   524 TiB 101 TiB 416 TiB  423 TiB 80.74
> > ssd11 TiB 7.8 TiB 688 MiB  3.2 TiB 28.92
> > TOTAL 535 TiB 109 TiB 416 TiB  426 TiB 79.68
> >
> > POOLS:
> > POOLID  STORED  OBJECTSUSED   %USEDMAX AVAIL
> > .rgw.root2  1.2 KiB  4  256 KiB   01.4 TiB
> > default.rgw.control  30 B80 B 01.4 TiB
> > default.rgw.meta 4  3.2 KiB 13  769 KiB   01.4 TiB
> > default.rgw.log  5   48 KiB210   48 KiB   01.4 TiB
> > default.rgw.buckets.index 6 487 GiB  21.10k 487 GiB   8.09 1.4 TiB
> > default.rgw.buckets.data 8  186 TiB 671.88M 416 TiB 100.00   0 B
> > default.rgw.buckets.non-ec 9  0 B00 B 0  0 B
> >
> > It's a four node cluster with the buckets.data pool erasure coded on hdd
> > with k=m=2 and size=4 and min_size=2, to have each part on a different node.
> >
> > New HDDs and even new nodes are currently being ordered to expand this
> > proof of concept setup for backup storage.
>
> This looks like an unbalanced cluster.
>
> # ceph osd df  | sort -n -k17
>
> should be illuminating.
>
> -- dan
>
>
> >
> > Regards
> > --
> > Robert Sander
> > Heinlein Support GmbH
> > Schwedter Str. 8/9b, 10119 Berlin
> >
> > http://www.heinlein-support.de
> >
> > Tel: 030 / 405051-43
> > Fax: 030 / 405051-19
> >
> > Zwangsangaben lt. §35a GmbHG:
> > HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> > Geschäftsführer: Peer Heinlein -- Sitz: Berlin
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Octopus] OSD overloading

2020-04-08 Thread Paul Emmerich
What's the CPU busy with while spinning at 100%?

Check "perf top" for a quick overview


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Apr 8, 2020 at 3:09 PM Jack  wrote:
>
> I do:
> root@backup1:~# ceph config dump | grep snap_trim_sleep
> global    advanced    osd_snap_trim_sleep        60.00
> global    advanced    osd_snap_trim_sleep_hdd    60.00
>
> (cluster is fully rusty)
>
>
> On 4/8/20 2:53 PM, Dan van der Ster wrote:
> > Do you have a custom value for osd_snap_trim_sleep ?
> >
> > On Wed, Apr 8, 2020 at 2:03 PM Jack  wrote:
> >>
> >> I put the nosnaptrim during upgrade because I saw high CPU usage and
> >> though it was somehow related to the upgrade process
> >> However, all my daemon are now running Octopus, and the issue is still
> >> here, so I was wrong
> >>
> >>
> >> On 4/8/20 1:58 PM, Wido den Hollander wrote:
> >>>
> >>>
> >>> On 4/8/20 1:38 PM, Jack wrote:
> >>>> Hello,
> >>>>
> >>>> I've a issue, since my Nautilus -> Octopus upgrade
> >>>>
> >>>> My cluster has many rbd images (~3k or something)
> >>>> Each of them has ~30 snapshots
> >>>> Each day, I create and remove a least a snapshot per image
> >>>>
> >>>> Since Octopus, when I remove the "nosnaptrim" flags, each OSDs uses 100%
> >>>> of its CPU time
> >>>
> >>> Why do you have the 'nosnaptrim' flag set? I'm missing that piece of
> >>> information.
> >>>
> >>>> The whole cluster collapses: OSDs no longer see each others, most of
> >>>> them are seens as down ..
> >>>> I do not see any progress being made : it does not appear the problem
> >>>> will solve by itself
> >>>>
> >>>> What can I do ?
> >>>>
> >>>> Best regards,
> >>>> ___
> >>>> ceph-users mailing list -- ceph-users@ceph.io
> >>>> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>>>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Fwd: question on rbd locks

2020-04-07 Thread Paul Emmerich
On Tue, Apr 7, 2020 at 6:49 PM Void Star Nill  wrote:
> So is there a way to tell ceph to release the lock if the client becomes
> unavailable?

That's the task of the new client trying to take the lock: it needs to
kick out the old client and blacklist the old connection to ensure
consistency.

A common problem is using incorrect permissions on the Ceph keyring.
Check that you are using "profile rbd" for the mon permissions.
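
Rough sketch (client, pool and image names below are made-up examples):

# check and, if necessary, fix the client's caps
ceph auth get client.myclient
ceph auth caps client.myclient mon 'profile rbd' osd 'profile rbd pool=rbd'

# inspect or remove a stale lock by hand
rbd lock ls rbd/myimage
rbd lock rm rbd/myimage <lock-id> <locker>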


Paul

>
> Thanks,
> Shridhar
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recommendation for decent write latency performance from HDDs

2020-04-06 Thread Paul Emmerich
The keyword to search for is "deferred writes"; there are several
parameters that control the size and the maximum number of ops that will be
"cached". Increasing the size to 1 MB is probably a bad idea.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sat, Apr 4, 2020 at 9:41 PM  wrote:
>
> > On Sat, Apr 4, 2020 at 4:13 PM  wrote:
> >> Offloading the block.db on NVMe / SSD:
> >> https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
> >>
> >> Pro: Easy to deal with - seems heavily supported.
> >> Con: As far as I can tell - this will only benefit the metadata of the
> >> OSD - not actual data. Thus a data commit to the OSD will still be
> >> dominated
> >> by the write latency of the underlying - very slow - HDD.
> >
> > small writes (<= 32kb, configurable) are written to db first and
> > written back to the slow disk asynchronous to the original request.
>
> Now, that sounds really interesting - I haven't been able to find that in
> the documentation - can you provide a pointer? What's the configuration
> parameter named?
>
> Meaning that moving block.db to, say, a 256GB NVMe will do "the right thing"
> for the system and deliver a fast write cache for smallish writes.
>
> Would setting the parameter to 1MB be "insane"?
>
> Jesper
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recommendation for decent write latency performance from HDDs

2020-04-04 Thread Paul Emmerich
On Sat, Apr 4, 2020 at 4:13 PM  wrote:
> Offloading the block.db on NVMe / SSD:
> https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
>
> Pro: Easy to deal with - seems heavily supported.
> Con: As far as I can tell - this will only benefit the metadata of the
> OSD - not actual data. Thus a data commit to the OSD will still be dominated
> by the write latency of the underlying - very slow - HDD.

small writes (<= 32kb, configurable) are written to db first and
written back to the slow disk asynchronous to the original request.


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

>
> Bcache:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027713.html
>
> Pro: Closest to the BBWC mentioned above - but with way-way larger cache
> sizes.
> Con: It is hard to see if I end up being the only one on the planet using
> this
> solution.
>
> Eat it - Writes will be as slow as hitting dead-rust - anything that
> cannot live
> with that needs to be entirely on SSD/NVMe.
>
> Other?
>
> Thanks for your input.
>
> Jesper
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: different RGW Versions on same ceph cluster

2020-04-03 Thread Paul Emmerich
No, this is not supported. You must follow the upgrade order for
services. The reason is that many parts of RGW are implemented in the
OSDs themselves, so you can't run a new RGW against older OSDs.



Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Apr 3, 2020 at 4:15 PM Scheurer François
 wrote:
>
> Dear All
>
>
>
> One ceph cluster is running with all daemons (mon, mgr, osd, rgw) having the 
> version 12.2.12.
>
>
> Let's say we configure an additional radosgw instance with version 14.2.8, 
> configured with the same ceph cluster name, realm, zonegroup and zone as the 
> existing instances.
>
> Is it dangerous to start a different version of rgw concurrently with the 
> older rgw?
>
> Can both versions serve PUT & GET requests concurrently on the same buckets?
>
> It is probably ok, because during an upgrade different versions are in any
> case expected to run concurrently for a while...
> So the question is kind of the same as asking if one may cancel the radosgw 
> upgrade from Luminous to Nautilus in the middle, rolling back the radosgw 
> packages.
>
>
> Many thanks in advance.
>
>
>
> Cheers
>
> Francois Scheurer
>
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: LARGE_OMAP_OBJECTS 1 large omap objects

2020-04-02 Thread Paul Emmerich
It's safe to ignore this or to increase the warning threshold. You are seeing
it because the warning level was recently reduced from 2M to 200k keys.

The openfiles object will be sharded in a newer version, which will clean this up.
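
If you just want the warning gone, something like this should work (assuming
the centralized config of Mimic/Nautilus - otherwise set it in ceph.conf - and
a threshold above your key count):

ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 300000
# the warning clears once the PG holding the object is deep-scrubbed again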


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Apr 2, 2020 at 9:50 AM Dietmar Rieder
 wrote:
>
> Hi,
>
> I'm trying to understand the "LARGE_OMAP_OBJECTS 1 large omap objects"
> warning for our cephfs metadata pool.
>
> It seems that pg 5.26 has a large omap object with > 200k keys
>
> [WRN] : Large omap object found. Object:
> 5:654134d2:::mds0_openfiles.0:head PG: 5.4b2c82a6 (5.26) Key count:
> 286083 Size (bytes): 14043228
>
> I guess this object is related to the open files on the (cephfs
> mds0_openfiles.0). But what exactly does it tell me? Is the number of
> keys the number of currently open files?
> If yes, this is not matching the sum of open files over all clients
> obtained with lsof (which is less than 1000).
>
> So how can I get rid of this? (Reboot the clients?)
>
> Thanks for your help
>   Dietmar
>
> --
> _
> D i e t m a r  R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Institute of Bioinformatics
> Innrain 80, 6020 Innsbruck
> Email: dietmar.rie...@i-med.ac.at
> Web:   http://www.icbi.at
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rados example: create namespace, user for this namespace, read and write objects with created namespace and user

2020-03-11 Thread Paul Emmerich
Easiest to think of namespaces just as a prefix for the object name
that is added and removed internally; there isn't really such a thing
as "creating a namespace".

(Yes, the rbd cli utility begs to differ, but that's really just
creating some metadata specific to rbd when you tell it to create a
namespace)
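
A minimal sketch with the rados CLI (pool, namespace and user names below are
made up):

# write and list objects in a namespace - it springs into existence on first use
rados -p mypool -N myns put testobj ./somefile
rados -p mypool -N myns ls

# a user restricted to that namespace
ceph auth get-or-create client.nsuser \
    mon 'allow r' osd 'allow rw pool=mypool namespace=myns' \
    -o /etc/ceph/ceph.client.nsuser.keyring
rados -p mypool -N myns --id nsuser \
    --keyring /etc/ceph/ceph.client.nsuser.keyring ls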


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Mar 11, 2020 at 4:22 PM Rodrigo Severo - Fábrica
 wrote:
>
> On Tue, Mar 10, 2020 at 8:20 PM, JC Lopez  wrote:
> >
> > no need to create a namespace. You just specify the namespace you want to 
> > access.
>
> So it's created automatically when I use it through rados command line
> utility? Interesting...
>
> > See https://docs.ceph.com/docs/nautilus/man/8/rados/ the -N cli option
>
> I had already read it. From it, it's clear to me that the -N option sets the
> namespace to be used by the rados command line utility, but I would never
> have understood from this man page that the -N option of the rados
> command line utility would automatically create the namespace.
>
> > For access to a particular namespace have a look at the example here: 
> > https://docs.ceph.com/docs/nautilus/rados/operations/user-management/#modify-user-capabilities
>
> I had already read these generic instructions.
>
> What I am looking for is one simple working example where I can build
> my config from without having to do massive experimentation.
>
>
> Regards,
>
> Rodrigo Severo
>
>
> >
> > Regards
> > JC
> >
> >
> > > On Mar 10, 2020, at 13:10, Rodrigo Severo - Fábrica 
> > >  wrote:
> > >
> > > Hi,
> > >
> > >
> > > I'm trying to create a namespace in rados, create a user that has
> > > access to this created namespace and with rados command line utility
> > > read and write objects in this created namespace using the created
> > > user.
> > >
> > > I can't find an example on how to do it.
> > >
> > > Can someone point me to such example or show me how to do it?
> > >
> > >
> > > Regards,
> > >
> > > Rodrigo Severo
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_pg_create causing slow requests in Nautilus

2020-03-11 Thread Paul Emmerich
Encountered this one again today, I've updated the issue with new
information: https://tracker.ceph.com/issues/44184


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sat, Feb 29, 2020 at 10:21 PM Nikola Ciprich
 wrote:
>
> Hi,
>
> I just wanted to report we've just hit very similar problem.. on mimic
> (13.2.6). Any manipulation with OSD (ie restart) causes lot of slow
> ops caused by waiting for new map. It seems those are slowed by SATA
> OSDs which keep being 100% busy reading for long time until all ops are gone,
> blocking OPS on unrelated NVME pools - SATA pools are completely unused now.
>
> is this possible that those maps are being requested from slow SATA OSDs
> and it takes such a long time for some reason? why could it take so long?
> the cluster is very small with very light load..
>
> BR
>
> nik
>
>
>
> On Wed, Feb 19, 2020 at 10:03:35AM +0100, Wido den Hollander wrote:
> >
> >
> > On 2/19/20 9:34 AM, Paul Emmerich wrote:
> > > On Wed, Feb 19, 2020 at 7:26 AM Wido den Hollander  wrote:
> > >>
> > >>
> > >>
> > >> On 2/18/20 6:54 PM, Paul Emmerich wrote:
> > >>> I've also seen this problem on Nautilus with no obvious reason for the
> > >>> slowness once.
> > >>
> > >> Did this resolve itself? Or did you remove the pool?
> > >
> > > I've seen this twice on the same cluster, it fixed itself the first
> > > time (maybe with some OSD restarts?) and the other time I removed the
> > > pool after a few minutes because the OSDs were running into heartbeat
> > > timeouts. There unfortunately seems to be no way to reproduce this :(
> > >
> >
> > Yes, that's the problem. I've been trying to reproduce it, but I can't.
> > It works on all my Nautilus systems except for this one.
> >
> > As you saw it, Bryan saw it, I expect others to encounter this at some
> > point as well.
> >
> > I don't have any extensive logging as this cluster is in production and
> > I can't simply crank up the logging and try again.
> >
> > > In this case it wasn't a new pool that caused problems but a very old one.
> > >
> > >
> > > Paul
> > >
> > >>
> > >>> In my case it was a rather old cluster that was upgraded all the way
> > >>> from firefly
> > >>>
> > >>>
> > >>
> > >> This cluster has also been installed with Firefly. It was installed in
> > >> 2015, so a while ago.
> > >>
> > >> Wido
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
> --
> -
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28.rijna 168, 709 00 Ostrava
>
> tel.:   +420 591 166 214
> fax:+420 596 621 273
> mobil:  +420 777 093 799
> www.linuxbox.cz
>
> mobil servis: +420 737 238 656
> email servis: ser...@linuxbox.cz
> -
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Accidentally removed client.admin caps - fix via mon doesn't work

2020-03-11 Thread Paul Emmerich
This indicates that there's something wrong with the config on that
mon node. The command should work on any Ceph node that has the
keyring.

You should check ceph.conf on the monitor node; maybe there's some
kind of misconfiguration there that will cause other problems in the
future.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Mar 11, 2020 at 10:46 AM Julian Wittler  wrote:
>
> I resolved this issue by copying the mon keyring to a non-monitor node. From
> there the command worked and I have my admin caps back.
> Thanks again for your Support, Paul
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs snap mkdir strange timestamp

2020-03-10 Thread Paul Emmerich
There's an xattr for this: ceph.snap.btime IIRC
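
Something like this should show it (the mount path is an example; requires a
reasonably recent cluster and client):

getfattr -n ceph.snap.btime /mnt/cephfs/test/.snap/snap-9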

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Mar 10, 2020 at 11:42 AM Marc Roos  wrote:
>
>
>
> If I make a directory in Linux, the directory has the current date; why is
> this not the case when creating a snap dir? Is this not a bug? One expects
> this to behave the same as in Linux, no?
>
> [ @ test]$ mkdir temp
>
> [ @os0 test]$ ls -arltn
> total 28
> drwxrwxrwt. 27   0   0 20480 Mar 10 11:38 ..
> drwxrwxr-x   2 801 801  4096 Mar 10 11:38 temp
> drwxrwxr-x   3 801 801  4096 Mar 10 11:38 .
>
>
> [ @ .snap]# mkdir test
> [ @ .snap]# ls -lartn
> total 0
> drwxr-xr-x 861886554 0 0 8390344070786420358 Jan  1  1970 .
> drwxr-xr-x 4 0 0   2 Mar  6 14:43 test
> drwxr-xr-x 4 0 0   2 Mar  6 14:43 snap-9
> drwxr-xr-x 4 0 0   2 Mar  6 14:43 snap-8
> drwxr-xr-x 4 0 0   2 Mar  6 14:43 snap-7
> drwxr-xr-x 4 0 0   2 Mar  6 14:43 snap-6
> drwxr-xr-x 4 0 0   2 Mar  6 14:43 snap-5
> drwxr-xr-x 4 0 0   2 Mar  6 14:43 snap-10
> drwxr-xr-x 4 0 0   2 Mar  6 14:43 ..
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Monitors' election failed on VMs : e4 handle_auth_request failed to assign global_id

2020-03-10 Thread Paul Emmerich
On Tue, Mar 10, 2020 at 8:18 AM Yoann Moulin  wrote:
> I have added 3 new monitors on 3 VMs and I'd like to stop the 3 old monitor
> daemons. But as soon as I stop the 3rd old monitor, the cluster gets stuck
> because the election of a new monitor fails.

By "stop" you mean "stop and then immediately remove before stopping
the next one"? Otherwise that's the problem.

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

>
> The 3 old monitors are in 14.2.4-1xenial
> The 3 new monitors are in 14.2.7-1bionic
>
> > 2020-03-09 16:06:00.167 7fc4a3138700  1 mon.icvm0017@3(peon).paxos(paxos 
> > active c 20918592..20919120) lease_timeout -- calling new election
> > 2020-03-09 16:06:02.143 7fc49f931700  1 mon.icvm0017@3(probing) e4 
> > handle_auth_request failed to assign global_id
>
> Did I miss something?
>
> In attachment : some logs and ceph.conf
>
> Thanks for your help.
>
> Best,
>
> --
> Yoann Moulin
> EPFL IC-IT
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Accidentally removed client.admin caps - fix via mon doesn't work

2020-03-09 Thread Paul Emmerich
There's only one mon keyring, shared by all mons, so the mon user name
doesn't contain the mon name.

Try "-n mon."


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Mar 9, 2020 at 10:35 AM  wrote:
>
> Hello Guys,
>
> Unfortunately, I've deleted some caps from client.admin and tried the
> following solution to set them back:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-January/015474.html
>
> I’ve tried the following:
>
> # ssh’d to a mon node and changed dir to the mon directory
> cd /var/lib/ceph/mon/
>
> # tried to authenticate with the monitor keyring and set the client.admin 
> caps to give back full permissions
> ceph -n mon. --keyring keyring  auth caps client.admin mds 'allow *' 
> osd 'allow *' mon 'allow *'
>
> When I try to modify the client.admin caps, the command just hangs at the
> shell until I press ctrl-c, which gets acknowledged with "Cluster connection
> aborted".
> Also I can't see the connection in the ceph.audit.log.
>
> Am I making a mistake here, or is this "workaround" no longer supported in
> nautilus (14.2.6)?
>
> Regards
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph df hangs

2020-03-09 Thread Paul Emmerich
"ceph df" is handled by the mgr, check if your mgr is up and running
and if the user has the necessary permissions for the mgr.
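
A quick check, assuming you are running as client.admin:

ceph mgr stat                  # is an active mgr listed?
ceph auth get client.admin     # caps should include: mgr 'allow *'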


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Mar 9, 2020 at 4:52 AM Rebecca CH  wrote:
>
> Hello
>
> When I run 'ceph df' on one of the OSD nodes, the command just hangs and
> produces no output.
>
> How can I resolve this?
>
> Thanks.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Octopus release announcement

2020-03-03 Thread Paul Emmerich
On Mon, Mar 2, 2020 at 7:19 PM Alex Chalkias
 wrote:
>
> Thanks for the update. Are you doing a beta-release prior to the official
> launch?

the first RC was tagged a few weeks ago:
https://github.com/ceph/ceph/tree/v15.1.0


Paul

>
>
> On Mon, Mar 2, 2020 at 7:12 PM Sage Weil  wrote:
>
> > It's getting close.  My guess is 1-2 weeks away.
> >
> > On Mon, 2 Mar 2020, Alex Chalkias wrote:
> >
> > > Hello,
> > >
> > > I was looking for an official announcement for Octopus release, as the
> > > latest update (back in Q3/2019) on the subject said it was scheduled for
> > > March 1st.
> > >
> > > Any updates on that?
> > >
> > > BR,
> > > --
> > > Alex Chalkias
> > > *Product Manager*
> > > alex.chalk...@canonical.com
> > > +33 766599367
> > > *Canonical | **Ubuntu*
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> > >
> >
>
>
> --
> Alex Chalkias
> *Product Manager*
> alex.chalk...@canonical.com
> +33 766599367
> *Canonical | **Ubuntu*
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cache tier OSDs crashing due to unfound hitset object 14.2.7

2020-02-27 Thread Paul Emmerich
Also: make a backup using the PG export feature of objectstore-tool
before doing anything else.

Sometimes it's enough to export and delete the PG from the broken OSD
and import it into a different OSD using objectstore-tool.
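
Rough sketch (OSD ids and paths below are examples; the OSD has to be stopped
while you run this):

systemctl stop ceph-osd@42
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-42 \
    --pgid 36.2047 --op export --file /root/pg36.2047.export

# optionally remove the PG copy from the broken OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-42 \
    --pgid 36.2047 --op remove --force

# and import it into another (stopped) OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57 \
    --op import --file /root/pg36.2047.export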

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Feb 27, 2020 at 1:27 PM Paul Emmerich  wrote:
>
> Crash happens in PG::activate, so it's unrelated to IO etc.
>
> My first approach here would be to read the code and try to understand
> why it crashes/what the exact condition is that is violated here.
> It looks like something that can probably be fixed by fiddling around
> with ceph-objectstore-tool (but you should try to understand what
> exactly is happening before running random ceph-objectstore-tool
> commands)
>
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Thu, Feb 27, 2020 at 1:15 PM Lincoln Bryant  wrote:
> >
> > Thanks Paul.
> >
> > I was able to mark many of the unfound ones as lost, but I'm still stuck 
> > with one unfound and OSD assert at this point.
> >
> > I've tried setting many of the OSD options to pause all cluster I/O, 
> > backfilling, rebalancing, tiering agent, etc to try to avoid hitting the 
> > assert but alas this one OSD is still crashing. The OSD in question does 
> > manage to log quite a bit of things before crashing.
> >
> > Is there any way for me to delete this or create a dummy object in RADOS 
> > that will let this OSD come up, I wonder?
> >
> > --Lincoln
> >
> > OBJECT_UNFOUND 1/793053192 objects unfound (0.000%)
> > pg 36.1755 has 1 unfound objects
> > PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs down
> > pg 36.1153 is down+remapped, acting [299]
> > pg 36.2047 is down+remapped, acting [242]
> >
> > -2> 2020-02-27 06:13:12.265 7f0824f1c700  0 0x55ed866481e0 36.2047 
> > unexpected need for 
> > 36:e204:.ceph-internal::hit_set_36.2047_archive_2020-02-25 
> > 19%3a32%3a07.171593_2020-02-25 21%3a27%3a36.268116:head have 
> > 1363674'2866712 f
> > lags = none tried to add 1365222'2867906(1363674'2866712) flags = delete
> >
> >
> > 
> > From: Paul Emmerich 
> > Sent: Thursday, February 27, 2020 5:27 AM
> > To: Lincoln Bryant 
> > Cc: ceph-users@ceph.io 
> > Subject: Re: [ceph-users] Cache tier OSDs crashing due to unfound hitset 
> > object 14.2.7
> >
> > I've also encountered this issue, but luckily without the crashing
> > OSDs, so marking as lost resolved it for us.
> >
> > See https://tracker.ceph.com/issues/44286
> >
> >
> > Paul
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> >
> > On Thu, Feb 27, 2020 at 6:02 AM Lincoln Bryant  
> > wrote:
> > >
> > > Hello Ceph experts,
> > >
> > > In the last day or so, we had a few nodes randomly reboot and now unfound 
> > > objects are reported in Ceph health during cluster during recovery.
> > >
> > > It appears that the object in question is a hit set object, which I now 
> > > cannot mark lost because Ceph cannot probe the OSDs that keep crashing 
> > > due to missing the hit set object.
> > >
> > > Pasted below is the crash message[1] for osd.299, and some of the unfound 
> > > objects[2]. Lastly [3] shows a sample of the hit set objects that are 
> > > lost.
> > >
> > > I would greatly appreciate any insight you may have on how to move 
> > > forward. As of right now this cluster is inoperable due to 3 down PGs.
> > >
> > > Thanks,
> > > Lincoln Bryant
> > >
> > >
> > > [1]
> > >-4> 2020-02-26 22:26:29.455 7ff52edaa700  0 0x559587fa91e0 36.321b 
> > > unexpected need for 
> > > 36:d84c:.ceph-internal::hit_set_36.321b_archive_2020-02-24 
> > > 21%3a15%3a16.792846_2020-02-24 21%3a15%3a32.457855:head have 
> > > 1352209'2834660 flags = none tried to add 1352209'2834660 flags = none
> > > -3> 2020-02-26 22:26:29.455 7ff52edaa700  0 0x559587fa91e0 36.321b 
> > > unexpected need 

[ceph-users] Re: Cache tier OSDs crashing due to unfound hitset object 14.2.7

2020-02-27 Thread Paul Emmerich
Crash happens in PG::activate, so it's unrelated to IO etc.

My first approach here would be to read the code and try to understand
why it crashes/what the exact condition is that is violated here.
It looks like something that can probably be fixed by fiddling around
with ceph-objectstore-tool (but you should try to understand what
exactly is happening before running random ceph-objectstore-tool
commands)


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Feb 27, 2020 at 1:15 PM Lincoln Bryant  wrote:
>
> Thanks Paul.
>
> I was able to mark many of the unfound ones as lost, but I'm still stuck with 
> one unfound and OSD assert at this point.
>
> I've tried setting many of the OSD options to pause all cluster I/O, 
> backfilling, rebalancing, tiering agent, etc to try to avoid hitting the 
> assert but alas this one OSD is still crashing. The OSD in question does 
> manage to log quite a bit of things before crashing.
>
> Is there any way for me to delete this or create a dummy object in RADOS that 
> will let this OSD come up, I wonder?
>
> --Lincoln
>
> OBJECT_UNFOUND 1/793053192 objects unfound (0.000%)
> pg 36.1755 has 1 unfound objects
> PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs down
> pg 36.1153 is down+remapped, acting [299]
> pg 36.2047 is down+remapped, acting [242]
>
> -2> 2020-02-27 06:13:12.265 7f0824f1c700  0 0x55ed866481e0 36.2047 
> unexpected need for 
> 36:e204:.ceph-internal::hit_set_36.2047_archive_2020-02-25 
> 19%3a32%3a07.171593_2020-02-25 21%3a27%3a36.268116:head have 1363674'2866712 f
> lags = none tried to add 1365222'2867906(1363674'2866712) flags = delete
>
>
> 
> From: Paul Emmerich 
> Sent: Thursday, February 27, 2020 5:27 AM
> To: Lincoln Bryant 
> Cc: ceph-users@ceph.io 
> Subject: Re: [ceph-users] Cache tier OSDs crashing due to unfound hitset 
> object 14.2.7
>
> I've also encountered this issue, but luckily without the crashing
> OSDs, so marking as lost resolved it for us.
>
> See https://tracker.ceph.com/issues/44286
>
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Thu, Feb 27, 2020 at 6:02 AM Lincoln Bryant  wrote:
> >
> > Hello Ceph experts,
> >
> > In the last day or so, we had a few nodes randomly reboot and now unfound 
> > objects are reported in Ceph health during cluster during recovery.
> >
> > It appears that the object in question is a hit set object, which I now 
> > cannot mark lost because Ceph cannot probe the OSDs that keep crashing due 
> > to missing the hit set object.
> >
> > Pasted below is the crash message[1] for osd.299, and some of the unfound 
> > objects[2]. Lastly [3] shows a sample of the hit set objects that are lost.
> >
> > I would greatly appreciate any insight you may have on how to move forward. 
> > As of right now this cluster is inoperable due to 3 down PGs.
> >
> > Thanks,
> > Lincoln Bryant
> >
> >
> > [1]
> >-4> 2020-02-26 22:26:29.455 7ff52edaa700  0 0x559587fa91e0 36.321b 
> > unexpected need for 
> > 36:d84c:.ceph-internal::hit_set_36.321b_archive_2020-02-24 
> > 21%3a15%3a16.792846_2020-02-24 21%3a15%3a32.457855:head have 
> > 1352209'2834660 flags = none tried to add 1352209'2834660 flags = none
> > -3> 2020-02-26 22:26:29.455 7ff52edaa700  0 0x559587fa91e0 36.321b 
> > unexpected need for 
> > 36:d84c:.ceph-internal::hit_set_36.321b_archive_2020-02-24 
> > 21%3a15%3a16.792846_2020-02-24 21%3a15%3a32.457855:head have 
> > 1352209'2834660 flags = none tried to add 1359781'2835659 flags = delete
> > -2> 2020-02-26 22:26:29.456 7ff53adc2700  3 osd.299 1367392 
> > handle_osd_map epochs [1367392,1367392], i have 1367392, src has 
> > [1349017,1367392]
> > -1> 2020-02-26 22:26:29.460 7ff52edaa700 -1 
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/PG.h:
> >  In function 'void PG::MissingLoc::add_active_missing(const pg_missing_t&)' 
> > thread 7ff52edaa700 time 2020-02-26 22:26:29.457170
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/PG.h:
> >

[ceph-users] Re: Cache tier OSDs crashing due to unfound hitset object 14.2.7

2020-02-27 Thread Paul Emmerich
I've also encountered this issue, but luckily without the crashing
OSDs, so marking as lost resolved it for us.

See https://tracker.ceph.com/issues/44286


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Feb 27, 2020 at 6:02 AM Lincoln Bryant  wrote:
>
> Hello Ceph experts,
>
> In the last day or so, we had a few nodes randomly reboot and now unfound 
> objects are reported in Ceph health during cluster during recovery.
>
> It appears that the object in question is a hit set object, which I now 
> cannot mark lost because Ceph cannot probe the OSDs that keep crashing due to 
> missing the hit set object.
>
> Pasted below is the crash message[1] for osd.299, and some of the unfound 
> objects[2]. Lastly [3] shows a sample of the hit set objects that are lost.
>
> I would greatly appreciate any insight you may have on how to move forward. 
> As of right now this cluster is inoperable due to 3 down PGs.
>
> Thanks,
> Lincoln Bryant
>
>
> [1]
>-4> 2020-02-26 22:26:29.455 7ff52edaa700  0 0x559587fa91e0 36.321b 
> unexpected need for 
> 36:d84c:.ceph-internal::hit_set_36.321b_archive_2020-02-24 
> 21%3a15%3a16.792846_2020-02-24 21%3a15%3a32.457855:head have 1352209'2834660 
> flags = none tried to add 1352209'2834660 flags = none
> -3> 2020-02-26 22:26:29.455 7ff52edaa700  0 0x559587fa91e0 36.321b 
> unexpected need for 
> 36:d84c:.ceph-internal::hit_set_36.321b_archive_2020-02-24 
> 21%3a15%3a16.792846_2020-02-24 21%3a15%3a32.457855:head have 1352209'2834660 
> flags = none tried to add 1359781'2835659 flags = delete
> -2> 2020-02-26 22:26:29.456 7ff53adc2700  3 osd.299 1367392 
> handle_osd_map epochs [1367392,1367392], i have 1367392, src has 
> [1349017,1367392]
> -1> 2020-02-26 22:26:29.460 7ff52edaa700 -1 
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/PG.h:
>  In function 'void PG::MissingLoc::add_active_missing(const pg_missing_t&)' 
> thread 7ff52edaa700 time 2020-02-26 22:26:29.457170
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/PG.h:
>  838: FAILED ceph_assert(i->second.need == j->second.need)
>
>  ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus 
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x14a) [0x55955fdafc0f]
>  2: (()+0x47) [0x55955fdafdd7]
>  3: (PG::MissingLoc::add_active_missing(pg_missing_set const&)+0x1e0) 
> [0x55955ffa0cb0]
>  4: (PG::activate(ObjectStore::Transaction&, unsigned int, std::map std::map, std::allocator const, pg_query_t> > >, std::less, std::allocator std::map, std::allocator const, pg_query_t> > > > > >&, std::map std::vector, 
> std::allocator > >, std::less, 
> std::allocator PastIntervals>, std::allocator > > > > 
> >*, PG::RecoveryCtx*)+0x1916) [0x55955ff3f1e6]
>  5: 
> (PG::RecoveryState::Active::Active(boost::statechart::state  PG::RecoveryState::Primary, PG::RecoveryState::Activating, 
> (boost::statechart::history_mode)0>::my_context)+0x370) [0x55955ff62d20]
>  6: (boost::statechart::simple_state PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, 
> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
> const&, void const*)+0xfb) [0x55955ffa8d5b]
>  7: (boost::statechart::state_machine PG::RecoveryState::Initial, std::allocator, 
> boost::statechart::null_exception_translator>::process_queued_events()+0x97) 
> [0x55955ff88507]
>  8: (PG::handle_activate_map(PG::RecoveryCtx*)+0x1a8) [0x55955ff75848]
>  9: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
> PG::RecoveryCtx*)+0x61d) [0x55955feb161d]
>  10: (OSD::dequeue_peering_evt(OSDShard*, PG*, 
> std::shared_ptr, ThreadPool::TPHandle&)+0xa6) [0x55955feb2d16]
>  11: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr&, 
> ThreadPool::TPHandle&)+0x51) [0x55956011a481]
>  12: (OSD::ShardedOpWQ::_process(unsigned int, 
> ceph::heartbeat_handle_d*)+0x90f) [0x55955fea7bbf]
>  13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) 
> [0x559560448976]
>  14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55956044b490]
>  15: (()+0x7e25) [0x7ff5669bae25]
>  16: (clone()+0x6d) [0x7ff565a9a34d]
>
>  0> 2020-02-26 22:26:29.465 7ff52edaa700 -1 *** Caught signal (Aborted) **
>  in thread 7ff52eda

[ceph-users] Re: Migrating data to a more efficient EC pool

2020-02-25 Thread Paul Emmerich
Possible without downtime: Configure multi-site, create a new zone for
the new pool, let the cluster sync to itself, do a failover to the new
zone, delete old zone.
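
Very rough outline (zone name, endpoint and data pool below are placeholders;
read the multi-site docs before attempting this):

radosgw-admin zone create --rgw-zonegroup=default --rgw-zone=newzone \
    --endpoints=http://rgw-host:80
radosgw-admin zone placement modify --rgw-zone=newzone \
    --placement-id=default-placement --data-pool=rgw.buckets.data.k5m2
radosgw-admin period update --commit
# wait until "radosgw-admin sync status" has caught up, then promote the new zone
radosgw-admin zone modify --rgw-zone=newzone --master --default
radosgw-admin period update --commit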


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Feb 24, 2020 at 6:14 PM Vladimir Brik
 wrote:
>
> Hello
>
> I have ~300TB of data in default.rgw.buckets.data k2m2 pool and I would
> like to move it to a new k5m2 pool.
>
> I found instructions using cache tiering[1], but they come with a vague
> scary warning, and it looks like EC-EC may not even be possible [2] (is
> it still the case?).
>
> Can anybody recommend a safe procedure to copy an EC pool's data to
> another pool with a more efficient erasure coding? Perhaps there is a
> tool out there that could do it?
>
> A few days of downtime would be tolerable, if it will simplify things.
> Also, I have enough free space to temporarily store the k2m2 data in a
> replicated pool (if EC-EC tiering is not possible, but EC-replicated and
> replicated-EC tiering is possible).
>
> Is there a tool or some efficient way to verify that the content of two
> pools is the same?
>
>
> Thanks,
>
> Vlad
>
> [1] https://ceph.io/geen-categorie/ceph-pool-migration/
> [2]
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-February/016109.html
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph nvme 2x replication

2020-02-19 Thread Paul Emmerich
x2 replication is perfectly fine as long as you also keep min_size at 2 ;)

(But that means I/O blocks as soon as either copy is offline)
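
I.e. (the pool name is an example):

ceph osd pool set mypool size 2
ceph osd pool set mypool min_size 2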

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Feb 19, 2020 at 4:41 PM Wido den Hollander  wrote:
>
>
>
> On 2/19/20 3:17 PM, Frank R wrote:
> > Hi all,
> >
> > I have noticed that RedHat is willing to support 2x replication with
> > NVME drives. Additionally, I have seen CERN presentation where they
> > use a 2x replication with NVME for a hyperconverged/HPC/CephFS
> > solution.
> >
>
> Don't do this if you care about your data. NVMe isn't anything better or
> worse than SSDs. It's actually still an SSD, but we swapped the SATA/SAS
> controller for NVMe, but it's still flash.
>
> > I would like to hear some opinions on whether this is really a good
> > idea for production. Is this setup (NVME/2x replication) really only
> > meant to be used for data that is backed up and/or can be lost without
> > causing a catastrophe.
> >
>
> Yes.
>
> You can still loose data due to a single drive failure or OSD crash.
> Let's say you have an OSD/host down for maintenance or due to a network
> outage. The OSD's device isn't lost, but it's unavailable.
>
> While that happens you loose another OSD, but this time you actually
> loose the device due to a failure.
>
> Now you've lost data. Although you *think* you still have another OSD
> which is in a healthy state. If you boot the OSD you'll find out it's
> outdated because writes happened to the OSD you just lost.
>
> Result = data loss
>
> 2x replication is a bad thing in production if you care about your data.
>
> Wido
>
> > Thanks,
> > Frank
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [FORGED] Lost all Monitors in Nautilus Upgrade, best way forward?

2020-02-19 Thread Paul Emmerich
On Wed, Feb 19, 2020 at 10:03 AM Wido den Hollander  wrote:
>
>
>
> On 2/19/20 8:49 AM, Sean Matheny wrote:
> > Thanks,
> >
> >> If the OSDs have a newer epoch of the OSDMap than the MON it won't work.
> >
> > How can I verify this? (i.e the epoch of the monitor vs the epoch of the
> > osd(s))
> >
>
> Check the status of the OSDs:
>
> $ ceph daemon osd.X status
>
> This should tell the newest map it has.
>
> Then check on the mons:
>
> $ ceph osd dump|head -n 10

mons are offline

> Or using ceph-monstore-tool to see what the latest map is the MON has.

ceph-monstore-tool  dump-keys

Also useful:

ceph-monstore-tool  get osdmap

Paul

>
> Wido
>
> > Cheers,
> > Sean
> >
> >
> >> On 19/02/2020, at 7:25 PM, Wido den Hollander  >> > wrote:
> >>
> >>
> >>
> >> On 2/19/20 5:45 AM, Sean Matheny wrote:
> >>> I wanted to add a specific question to the previous post, in the
> >>> hopes it’s easier to answer.
> >>>
> >>> We have a Luminous monitor restored from the OSDs using
> >>> ceph-objectstore-tool, which seems like the best chance of any success. We
> >>> followed this rough process:
> >>>
> >>> https://tracker.ceph.com/issues/24419
> >>>
> >>> The monitor has come up (as a single monitor cluster), but it’s
> >>> reporting wildly inaccurate info, such as the number of osds that are
> >>> down (157 but all 223 are down), and hosts (1, but all 14 are down).
> >>>
> >>
> >> Have you verified that the MON's database has the same epoch of the
> >> OSDMap (or newer) as all the other OSDs?
> >>
> >> If the OSDs have a newer epoch of the OSDMap than the MON it won't work.
> >>
> >>> The OSD Daemons are still off, but I’m not sure if starting them back
> >>> up with this monitor will make things worse. The fact that this mon
> >>> daemon can’t even see how many OSDs are correctly down makes me think
> >>> that nothing good will come from turning the OSDs back on.
> >>>
> >>> Do I run risk of further corruption (i.e. on the Ceph side, not
> >>> client data as the cluster is paused) if I proceed and turn on the
> >>> osd daemons? Or is it worth a shot?
> >>>
> >>> Are these Ceph health metrics commonly inaccurate until it can talk
> >>> to the daemons?
> >>
> >> The PG stats will be inaccurate indeed and the number of OSDs can vary
> >> as long as they aren't able to peer with each other and the MONs.
> >>
> >>>
> >>> (Also other commands like `ceph osd tree` agree with the below `ceph
> >>> -s` so far)
> >>>
> >>> Many thanks for any wisdom… I just don’t want to make things
> >>> (unnecessarily) much worse.
> >>>
> >>> Cheers,
> >>> Sean
> >>>
> >>>
> >>> root@ntr-mon01:/var/log/ceph# ceph -s
> >>>  cluster:
> >>>id: ababdd7f-1040-431b-962c-c45bea5424aa
> >>>health: HEALTH_WARN
> >>>pauserd,pausewr,noout,norecover,noscrub,nodeep-scrub
> >>> flag(s) set
> >>>157 osds down
> >>>1 host (15 osds) down
> >>>Reduced data availability: 12225 pgs inactive, 885 pgs
> >>> down, 673 pgs peering
> >>>Degraded data redundancy: 14829054/35961087 objects
> >>> degraded (41.236%), 2869 pgs degraded, 2995 pgs undersized  services:
> >>>mon: 1 daemons, quorum ntr-mon01
> >>>mgr: ntr-mon01(active)
> >>>osd: 223 osds: 66 up, 223 in
> >>> flags pauserd,pausewr,noout,norecover,noscrub,nodeep-scrub  data:
> >>>pools:   14 pools, 15220 pgs
> >>>objects: 10.58M objects, 40.1TiB
> >>>usage:   43.0TiB used, 121TiB / 164TiB avail
> >>>pgs: 70.085% pgs unknown
> >>> 10.237% pgs not active
> >>> 14829054/35961087 objects degraded (41.236%)
> >>> 10667 unknown
> >>> 2869  active+undersized+degraded
> >>> 885   down
> >>> 673   peering
> >>> 126   active+undersized
> >>>
> >>>
> >>> On 19/02/2020, at 10:18 AM, Sean Matheny  >>> >
> >>> wrote:
> >>>
> >>> Hi folks,
> >>>
> >>> Our entire cluster is down at the moment.
> >>>
> >>> We started upgrading from 12.2.13 to 14.2.7 with the monitors. The
> >>> first monitor we upgraded crashed. We reverted to luminous on this
> >>> one and tried another, and it was fine. We upgraded the rest, and
> >>> they all worked.
> >>>
> >>> Then we upgraded the first one again, and after it became the leader,
> >>> it died. Then the second one became the leader, and it died. Then the
> >>> third became the leader, and it died, leaving mon 4 and 5 unable to
> >>> form a quorum.
> >>>
> >>> We tried creating a single monitor cluster by editing the monmap of
> >>> mon05, and it died in the same way, just without the paxos
> >>> negotiation first.
> >>>
> >>> We have tried to revert to a luminous (12.2.12) monitor backup taken
> >>> a few hours before the crash. The mon daemon will start, but is
> >>> flooded with blocked requests and unknown pgs after a while. For
> >>> better or worse we removed the “noout” flag and 144 of 232 OSDs are
> >>> now 

[ceph-users] Re: osd_pg_create causing slow requests in Nautilus

2020-02-19 Thread Paul Emmerich
On Wed, Feb 19, 2020 at 7:26 AM Wido den Hollander  wrote:
>
>
>
> On 2/18/20 6:54 PM, Paul Emmerich wrote:
> > I've also seen this problem on Nautilus with no obvious reason for the
> > slowness once.
>
> Did this resolve itself? Or did you remove the pool?

I've seen this twice on the same cluster, it fixed itself the first
time (maybe with some OSD restarts?) and the other time I removed the
pool after a few minutes because the OSDs were running into heartbeat
timeouts. There unfortunately seems to be no way to reproduce this :(

In this case it wasn't a new pool that caused problems but a very old one.


Paul

>
> > In my case it was a rather old cluster that was upgraded all the way
> > from firefly
> >
> >
>
> This cluster has also been installed with Firefly. It was installed in
> 2015, so a while ago.
>
> Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_pg_create causing slow requests in Nautilus

2020-02-18 Thread Paul Emmerich
I've also seen this problem on Nautilus with no obvious reason for the
slowness once.
In my case it was a rather old cluster that was upgraded all the way
from firefly


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Feb 18, 2020 at 5:52 PM Wido den Hollander  wrote:
>
>
>
> On 8/27/19 11:49 PM, Bryan Stillwell wrote:
> > We've run into a problem on our test cluster this afternoon which is 
> > running Nautilus (14.2.2).  It seems that any time PGs move on the cluster 
> > (from marking an OSD down, setting the primary-affinity to 0, or by using 
> > the balancer), a large number of the OSDs in the cluster peg the CPU cores 
> > they're running on for a while which causes slow requests.  From what I can 
> > tell it appears to be related to slow peering caused by osd_pg_create() 
> > taking a long time.
> >
> > This was seen on quite a few OSDs while waiting for peering to complete:
> >
> > # ceph daemon osd.3 ops
> > {
> > "ops": [
> > {
> > "description": "osd_pg_create(e179061 287.7a:177739 
> > 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 
> > 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
> > "initiated_at": "2019-08-27 14:34:46.556413",
> > "age": 318.2523453801,
> > "duration": 318.2524189532,
> > "type_data": {
> > "flag_point": "started",
> > "events": [
> > {
> > "time": "2019-08-27 14:34:46.556413",
> > "event": "initiated"
> > },
> > {
> > "time": "2019-08-27 14:34:46.556413",
> > "event": "header_read"
> > },
> > {
> > "time": "2019-08-27 14:34:46.556299",
> > "event": "throttled"
> > },
> > {
> > "time": "2019-08-27 14:34:46.556456",
> > "event": "all_read"
> > },
> > {
> > "time": "2019-08-27 14:35:12.456901",
> > "event": "dispatched"
> > },
> > {
> > "time": "2019-08-27 14:35:12.456903",
> > "event": "wait for new map"
> > },
> > {
> > "time": "2019-08-27 14:40:01.292346",
> > "event": "started"
> > }
> > ]
> > }
> > },
> > ...snip...
> > {
> > "description": "osd_pg_create(e179066 287.7a:177739 
> > 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 
> > 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
> > "initiated_at": "2019-08-27 14:35:09.908567",
> > "age": 294.900191001,
> > "duration": 294.9006841689,
> > "type_data": {
> > "flag_point": "delayed",
> > "events": [
> > {
> > "time": "2019-08-27 14:35:09.908567",
> > "event": "initiated"
> > },
> > {
> > "time": "2019-08-27 14:35:09.908567",
> > "event": "header_read"
> > },
> > {
> > "time": "2019-08-27 14:35:09.908520",
> > "event": "throttled"
> > },
> > {
> > "

[ceph-users] Re: Identify slow ops

2020-02-17 Thread Paul Emmerich
that's probably just https://tracker.ceph.com/issues/43893
(a harmless bug)

Restart the mons to get rid of the message
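
E.g., one mon host at a time:

systemctl restart ceph-mon.target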

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Feb 17, 2020 at 2:59 PM Thomas Schneider <74cmo...@gmail.com> wrote:
>
> Hi,
>
> the current output of ceph -s reports a warning:
> 2 slow ops, oldest one blocked for 347335 sec, mon.ld5505 has slow ops
> This time is increasing.
>
> root@ld3955:~# ceph -s
>   cluster:
> id: 6b1b5117-6e08-4843-93d6-2da3cf8a6bae
> health: HEALTH_WARN
> 9 daemons have recently crashed
> 2 slow ops, oldest one blocked for 347335 sec, mon.ld5505
> has slow ops
>
>   services:
> mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 3d)
> mgr: ld5507(active, since 8m), standbys: ld5506, ld5505
> mds: cephfs:2 {0=ld5507=up:active,1=ld5505=up:active} 2
> up:standby-replay 3 up:standby
> osd: 442 osds: 442 up (since 8d), 442 in (since 9d)
>
>   data:
> pools:   7 pools, 19628 pgs
> objects: 65.78M objects, 251 TiB
> usage:   753 TiB used, 779 TiB / 1.5 PiB avail
> pgs: 19628 active+clean
>
>   io:
> client:   427 KiB/s rd, 22 MiB/s wr, 851 op/s rd, 647 op/s wr
>
> The details are as follows:
> root@ld3955:~# ceph health detail
> HEALTH_WARN 9 daemons have recently crashed; 2 slow ops, oldest one
> blocked for 347755 sec, mon.ld5505 has slow ops
> RECENT_CRASH 9 daemons have recently crashed
> mds.ld4464 crashed on host ld4464 at 2020-02-09 07:33:59.131171Z
> mds.ld5506 crashed on host ld5506 at 2020-02-09 07:42:52.036592Z
> mds.ld4257 crashed on host ld4257 at 2020-02-09 07:47:44.369505Z
> mds.ld4464 crashed on host ld4464 at 2020-02-09 06:10:24.515912Z
> mds.ld5507 crashed on host ld5507 at 2020-02-09 07:13:22.400268Z
> mds.ld4257 crashed on host ld4257 at 2020-02-09 06:48:34.742475Z
> mds.ld5506 crashed on host ld5506 at 2020-02-09 06:10:24.680648Z
> mds.ld4465 crashed on host ld4465 at 2020-02-09 06:52:33.204855Z
> mds.ld5506 crashed on host ld5506 at 2020-02-06 07:59:37.089007Z
> SLOW_OPS 2 slow ops, oldest one blocked for 347755 sec, mon.ld5505 has
> slow ops
>
> There's no error on services (mgr, mon, osd).
>
> Can you please advise how to identify the root cause of this slow ops?
>
> THX
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recovery_unfound

2020-02-03 Thread Paul Emmerich
This might be related to recent problems with OSDs not being queried
for unfound objects properly in some cases (which I think was fixed in
master?)

Anyway: run "ceph pg <pgid> query" on the affected PGs, check for "might
have unfound", and try restarting the OSDs mentioned there. It's probably
also sufficient to just run "ceph osd down" on the primaries of the
affected PGs to get them to re-check.
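
For example, for one of the PGs from your output (the OSD id is whatever
"ceph pg map 5.f2f" reports as the primary):

ceph pg 5.f2f query | grep -A5 might_have_unfound
ceph osd down <primary-osd-id>     # it rejoins immediately and re-peers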


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Feb 3, 2020 at 4:27 PM Jake Grimmett  wrote:
>
> Dear All,
>
> Due to a mistake in my "rolling restart" script, one of our ceph
> clusters now has a number of unfound objects:
>
> There is an 8+2 erasure encoded data pool, 3x replicated metadata pool,
> all data is stored as cephfs.
>
> root@ceph7 ceph-archive]# ceph health
> HEALTH_ERR 24/420880027 objects unfound (0.000%); Possible data damage:
> 14 pgs recovery_unfound; Degraded data redundancy: 64/4204261148 objects
> degraded (0.000%), 14 pgs degraded
>
> "ceph health detail" gives me a handle on which pgs are affected.
> e.g:
> pg 5.f2f has 2 unfound objects
> pg 5.5c9 has 2 unfound objects
> pg 5.4c1 has 1 unfound objects
> and so on...
>
> plus more entries of this type:
>   pg 5.6d is active+recovery_unfound+degraded, acting
> [295,104,57,442,240,338,219,33,150,382], 1 unfound
> pg 5.3fa is active+recovery_unfound+degraded, acting
> [343,147,21,131,315,63,214,365,264,437], 2 unfound
> pg 5.41d is active+recovery_unfound+degraded, acting
> [20,104,190,377,52,141,418,358,240,289], 1 unfound
>
> Digging deeper into one of the bad pg, we see the oid for the two
> unfound objects:
>
> root@ceph7 ceph-archive]# ceph pg 5.f2f list_unfound
> {
> "num_missing": 4,
> "num_unfound": 2,
> "objects": [
> {
> "oid": {
> "oid": "1000ba25e49.0207",
> "key": "",
> "snapid": -2,
> "hash": 854007599,
> "max": 0,
> "pool": 5,
> "namespace": ""
> },
> "need": "22541'3088478",
> "have": "0'0",
> "flags": "none",
> "locations": [
> "189(8)",
> "263(9)"
> ]
> },
> {
> "oid": {
> "oid": "1000bb25a5b.0091",
> "key": "",
> "snapid": -2,
> "hash": 3637976879,
> "max": 0,
> "pool": 5,
> "namespace": ""
> },
> "need": "22541'3088476",
> "have": "0'0",
> "flags": "none",
> "locations": [
> "189(8)",
> "263(9)"
> ]
> }
> ],
> "more": false
> }
>
>
> While it would be nice to recover the data, this cluster is only used
> for storing backups.
>
> As all OSDs are up and running, presumably the data blocks are
> permanently lost?
>
> If it's hard / impossible to recover the data, presumably we should now
> consider using "ceph pg 5.f2f  mark_unfound_lost delete" on each
> affected pg?
>
> Finally, can we use the oid to identify the affected files?
>
> best regards,
>
> Jake
>
> --
> Jake Grimmett
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephf_metadata: Large omap object found

2020-02-03 Thread Paul Emmerich
The warning threshold recently changed; I'd just increase it in this
particular case. It just means you have lots of open files.

I think there's some work going on to split the openfiles object into
multiple objects, so that problem will eventually be fixed.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Feb 3, 2020 at 5:39 PM Yoann Moulin  wrote:
>
> Hello,
>
> I have this message on my new ceph cluster in Nautilus. I have a cephfs with 
> a copy of ~100TB in progress.
>
> > /var/log/ceph/artemis.log:2020-02-03 16:22:49.970437 osd.66 (osd.66) 1137 : 
> > cluster [WRN] Large omap object found. Object: 
> > 8:579bf162:::mds3_openfiles.0:head PG: 8.468fd9ea (8.2a) Key count: 206548 
> > Size (bytes): 6691941
>
> > /var/log/ceph/artemis-osd.66.log:2020-02-03 16:22:49.966 7fe77af62700  0 
> > log_channel(cluster) log [WRN] : Large omap object found. Object: 
> > 8:579bf162:::mds3_openfiles.0:head PG: 8.468fd9ea (8.2a) Key count: 206548 
> > Size (bytes): 6691941
>
> I found this thread about a similar issue in the archives of the list
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JUFYDCQ2AHFA23NFJQY743ELJHG2N5DI/
>
> But I'm not sure what I can do in my situation, can I increase 
> osd_deep_scrub_large_omap_object_key_threshold or it's a bad idea?
>
> Thanks for your help.
>
> Here some useful (I guess) information:
>
> > Filesystem  Size  Used Avail Use% Mounted on
> > 10.90.37.4,10.90.37.6,10.90.37.8:/  329T   32T  297T  10% /artemis
>
> > artemis@icitsrv5:~$ ceph -s
> >   cluster:
> > id: 815ea021-7839-4a63-9dc1-14f8c5feecc6
> > health: HEALTH_WARN
> > 1 large omap objects
> >
> >   services:
> > mon: 3 daemons, quorum iccluster003,iccluster005,iccluster007 (age 2w)
> > mgr: iccluster021(active, since 7h), standbys: iccluster009, 
> > iccluster023
> > mds: cephfs:5 5 up:active
> > osd: 120 osds: 120 up (since 5d), 120 in (since 5d)
> > rgw: 8 daemons active (iccluster003.rgw0, iccluster005.rgw0, 
> > iccluster007.rgw0, iccluster013.rgw0, iccluster015.rgw0, iccluster019.rgw0, 
> > iccluster021.rgw0, iccluster023.rgw0)
> >
> >   data:
> > pools:   10 pools, 2161 pgs
> > objects: 72.02M objects, 125 TiB
> > usage:   188 TiB used, 475 TiB / 662 TiB avail
> > pgs: 2157 active+clean
> >  4    active+clean+scrubbing+deep
> >
> >   io:
> > client:   31 KiB/s rd, 803 KiB/s wr, 31 op/s rd, 184 op/s wr
>
> > artemis@icitsrv5:~$ ceph health detail
> > HEALTH_WARN 1 large omap objects
> > LARGE_OMAP_OBJECTS 1 large omap objects
> > 1 large objects found in pool 'cephfs_metadata'
> > Search the cluster log for 'Large omap object found' for more details.
>
>
> > artemis@icitsrv5:~$ ceph fs status
> > cephfs - 3 clients
> > ==
> > +--++--+---+---+---+
> > | Rank | State  | MDS  |Activity   |  dns  |  inos |
> > +--++--+---+---+---+
> > |  0   | active | iccluster015 | Reqs:0 /s |  251k |  251k |
> > |  1   | active | iccluster001 | Reqs:3 /s | 20.2k | 19.1k |
> > |  2   | active | iccluster017 | Reqs:1 /s |  116k |  112k |
> > |  3   | active | iccluster019 | Reqs:0 /s |  263k |  263k |
> > |  4   | active | iccluster013 | Reqs:  123 /s | 16.3k | 16.3k |
> > +--++--+---+---+---+
> > +-+--+---+---+
> > |   Pool  |   type   |  used | avail |
> > +-+--+---+---+
> > | cephfs_metadata | metadata | 13.9G |  135T |
> > |   cephfs_data   |   data   | 51.3T |  296T |
> > +-+--+---+---+
> > +-+
> > | Standby MDS |
> > +-+
> > +-+
> > MDS version: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) 
> > nautilus (stable)
> > root@iccluster019:~# ceph --cluster artemis daemon osd.13 config show | 
> > grep large_omap
> > "osd_deep_scrub_large_omap_object_key_threshold": "20",
> > "osd_deep_scrub_large_omap_object_value_sum_threshold": "1073741824",
>
> > artemis@icitsrv5:~$ rados -p cephfs_metadata listxattr mds3_openfiles.0
> > artemis@icitsrv5:~$ rados -p cephfs_metadata getomapheader mds3_openfiles.0
> > header (42 bytes) :

[ceph-users] Re: data loss on full file system?

2020-02-03 Thread Paul Emmerich
On Sun, Feb 2, 2020 at 9:35 PM Håkan T Johansson  wrote:
>

>
> Changing cp (or whatever standard tool is used) to call fsync() before
> each close() is not an option for a user.  Also, doing that would lead to
> terrible performance generally.  Just tested - a recursive copy of a 70k
> files linux source tree went from 15 s to 6 minutes on a local filesystem
> I have at hand.

Don't do it for every file:  cp foo bar; sync
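A sketch of that pattern, assuming the cephfs mount is at /mnt/cephfs;
sync -f (coreutils 8.24+) limits the flush to that one filesystem instead
of everything on the host:

cp -r mydir /mnt/cephfs/mydir
sync -f /mnt/cephfs/mydir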

>
> Best regards,
> Håkan
>
>
>
> >
> >
> > Paul
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> >
> > On Mon, Jan 27, 2020 at 9:11 PM Håkan T Johansson  
> > wrote:
> >>
> >>
> >> Hi,
> >>
> >> for test purposes, I have set up two 100 GB OSDs, one
> >> taking a data pool and the other metadata pool for cephfs.
> >>
> >> Am running 14.2.6-1-gffd69200ad-1 with packages from
> >> https://mirror.croit.io/debian-nautilus
> >>
> >> Am then running a program that creates a lot of 1 MiB files by calling
> >>fopen()
> >>fwrite()
> >>fclose()
> >> for each of them.  Error codes are checked.
> >>
> >> This works successfully for ~100 GB of data, and then strangely also 
> >> succeeds
> >> for many more 100 GB of data...  ??
> >>
> >> All written files have size 1 MiB with 'ls', and thus should contain the 
> >> data
> >> written.  However, on inspection, the files written after the first ~100 
> >> GiB,
> >> are full of just 0s.  (hexdump -C)
> >>
> >>
> >> To further test this, I use the standard tool 'cp' to copy a few 
> >> random-content
> >> files into the full cephfs filessystem.  cp reports no complaints, and 
> >> after
> >> the copy operations, content is seen with hexdump -C.  However, after 
> >> forcing
> >> the data out of cache on the client by reading other earlier created files,
> >> hexdump -C show all-0 content for the files copied with 'cp'.  Data that 
> >> was
> >> there is suddenly gone...?
> >>
> >>
> >> I am new to ceph.  Is there an option I have missed to avoid this 
> >> behaviour?
> >> (I could not find one in
> >> https://docs.ceph.com/docs/master/man/8/mount.ceph/ )
> >>
> >> Is this behaviour related to
> >> https://docs.ceph.com/docs/mimic/cephfs/full/
> >> ?
> >>
> >> (That page states 'sometime after a write call has already returned 0'. 
> >> But if
> >> write returns 0, then no data has been written, so the user program would 
> >> not
> >> assume any kind of success.)
> >>
> >> Best regards,
> >>
> >> Håkan
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Inactive pgs preventing osd from starting

2020-01-31 Thread Paul Emmerich
If you don't care about the data: set
osd_find_best_info_ignore_history_les = true on the affected OSDs
temporarily.

This means losing data.

For anyone else reading this: don't ever use this option. It's evil
and causes data loss (but gets your PG back and active, yay!)
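For completeness, a sketch of how such an override could be applied
temporarily and then removed again (per-daemon settings via the config
store need Mimic or newer; on older releases put it in ceph.conf on that
host instead; osd.29 is just the OSD from the log referenced below):

# DANGER: as stated above, this can cause data loss
ceph config set osd.29 osd_find_best_info_ignore_history_les true
systemctl restart ceph-osd@29
# once the PGs have peered and gone active, remove the override again
ceph config rm osd.29 osd_find_best_info_ignore_history_les
systemctl restart ceph-osd@29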

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Jan 31, 2020 at 3:14 PM Ragan, Tj (Dr.)
 wrote:
>
> Hi All,
>
> Long story-short, we’re doing disaster recovery on a cephfs cluster, and are 
> at a point where we have 8 pgs stuck incomplete.  Just before the disaster, I 
> increased the pg_count on two of the pools, and they had not completed 
> increasing the pgp_num yet.  I’ve since forced pgp_num to the current values.
>
> So far, I’ve tried mark_unfound_lost but they don’t report any unfound 
> objects, and I’ve tried force-create-pg but that has no effect, except on one 
> of the pgs, which went to creating+incomplete.  During the disaster recovery, 
> I had to re-create several OSDs (due to unreadable superblocks,) and now one 
> of the new osds, as well as one of the existing osds won’t start.  The log 
> from the startup of osd.29 is here: https://pastebin.com/PX9AAj8m, which 
> seems to indicate that it won’t start because it’s supposed to have copies of 
> the incomplete placement groups.
>
> ceph pg 5.38 query (one of the incomplete) gives: 
> https://pastebin.com/Jf4GnZTc
>
> I have hunted around in the osds listed for all the placement groups for any 
> sign of a pg that I could mark as complete with ceph-objectstore-tool, but 
> can’t find any.  I don’t care about the data in the pgs, but I can’t abandon 
> the filesystem.
>
> Any help would be greatly appreciated.
>
> -TJ Ragan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Micron SSD/Basic Config

2020-01-31 Thread Paul Emmerich
On Fri, Jan 31, 2020 at 2:06 PM EDH - Manuel Rios
 wrote:
>
> Hmm change 40Gbps to 100Gbps networking.
>
> 40Gbps technology its just a bond of 4x10 Links with some latency due link 
> aggregation.
> 100 Gbps and 25Gbps got less latency and Good performance. In ceph a 50% of 
> the latency comes from Network commits and the other 50% from disk commits.

40G ethernet is not the same as 4x 10G bond. A bond load balances on a
per-packet (or well, per flow usually) basis. A 40G link uses all four
links even for a single packet.
100G is "just" 4x 25G

I also wouldn't agree that network and disk latency is a 50/50 split
in Ceph unless you have some NVRAM disks or something.

Even for the network speed the processing and queuing in the network
stack dominates over the serialization delay from a 40G/100G
difference (4kb at 100G is 320ns, and 800ns at 40G for the
serialization; I don't have any figures for processing times on
40/100G ethernet, but 10G fiber is at 300ns, 10G base-t at 2300
nanoseconds)

Paul


>
> A fast graph : 
> https://blog.mellanox.com/wp-content/uploads/John-Kim-030416-Fig-3a-1024x747.jpg
> Article: 
> https://blog.mellanox.com/2016/03/25-is-the-new-10-50-is-the-new-40-100-is-the-new-amazing/
>
> Micron got their own Whitepaper for CEPH and looks like performs fine.
> https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9200_max_ceph_12,-d-,2,-d-,8_luminous_bluestore_reference_architecture.pdf?la=en
>
>
> AS your Budget is high, please buy 3 x 1.5K $ nodes for your monitors and you 
> Will sleep better. They just need 4 cores / 16GB RAM and 2x128GB SSD or NVME 
> M2 .
>
> -Mensaje original-
> De: Adam Boyhan 
> Enviado el: viernes, 31 de enero de 2020 13:59
> Para: ceph-users 
> Asunto: [ceph-users] Micron SSD/Basic Config
>
> Looking to role out a all flash Ceph cluster. Wanted to see if anyone else 
> was using Micron drives along with some basic input on my design so far?
>
> Basic Config
> Ceph OSD Nodes
> 8x Supermicro A+ Server 2113S-WTRT
> - AMD EPYC 7601 32 Core 2.2Ghz
> - 256G Ram
> - AOC-S3008L-L8e HBA
> - 10GB SFP+ for client network
> - 40GB QSFP+ for ceph cluster network
>
> OSD
> 10x Micron 5300 PRO 7.68TB in each ceph node
> - 80 total drives across the 8 nodes
>
> WAL/DB
> 5x Micron 7300 MAX NVMe 800GB per Ceph Node
> - Plan on dedicating 1 for each 2 OSD's
>
> Still thinking out a external monitor node as I have a lot of options, but 
> this is a pretty good start. Open to suggestions as well!
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
> ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: data loss on full file system?

2020-01-28 Thread Paul Emmerich
Yes, data that is not synced is not guaranteed to be written to disk,
this is consistent with POSIX semantics.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Jan 27, 2020 at 9:11 PM Håkan T Johansson  wrote:
>
>
> Hi,
>
> for test purposes, I have set up two 100 GB OSDs, one
> taking a data pool and the other metadata pool for cephfs.
>
> Am running 14.2.6-1-gffd69200ad-1 with packages from
> https://mirror.croit.io/debian-nautilus
>
> Am then running a program that creates a lot of 1 MiB files by calling
>fopen()
>fwrite()
>fclose()
> for each of them.  Error codes are checked.
>
> This works successfully for ~100 GB of data, and then strangely also succeeds
> for many more 100 GB of data...  ??
>
> All written files have size 1 MiB with 'ls', and thus should contain the data
> written.  However, on inspection, the files written after the first ~100 GiB,
> are full of just 0s.  (hexdump -C)
>
>
> To further test this, I use the standard tool 'cp' to copy a few 
> random-content
> files into the full cephfs filessystem.  cp reports no complaints, and after
> the copy operations, content is seen with hexdump -C.  However, after forcing
> the data out of cache on the client by reading other earlier created files,
> hexdump -C show all-0 content for the files copied with 'cp'.  Data that was
> there is suddenly gone...?
>
>
> I am new to ceph.  Is there an option I have missed to avoid this behaviour?
> (I could not find one in
> https://docs.ceph.com/docs/master/man/8/mount.ceph/ )
>
> Is this behaviour related to
> https://docs.ceph.com/docs/mimic/cephfs/full/
> ?
>
> (That page states 'sometime after a write call has already returned 0'. But if
> write returns 0, then no data has been written, so the user program would not
> assume any kind of success.)
>
> Best regards,
>
> Håkan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: EC pool creation results in incorrect M value?

2020-01-27 Thread Paul Emmerich
min_size in the crush rule and min_size in the pool are completely
different things that happen to share the same name.

Ignore min_size in the crush rule; it has virtually no meaning in
almost all cases (like this one).
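If the goal is to keep serving I/O with two hosts down, the relevant knob
is the pool's min_size, which defaults to k+1 = 4 for a 3+2 profile. A
sketch, using the pool name from the listing below; be aware that running
at min_size = k leaves no redundancy margin while degraded:

ceph osd pool get es32 min_size
ceph osd pool set es32 min_size 3   # only if you accept I/O with zero spare shards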


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Jan 27, 2020 at 3:41 PM Smith, Eric  wrote:
>
> I have a Ceph Luminous (12.2.12) cluster with 6 nodes. I’m attempting to 
> create an EC3+2 pool with the following commands:
>
> Create the EC profile:
>
> ceph osd erasure-code-profile set es32 k=3 m=2 plugin=jerasure w=8 
> technique=reed_sol_van crush-failure-domain=host crush-root=sgshared
>
> Verify profile creation:
>
> [root@mon-1 ~]# ceph osd erasure-code-profile get es32
>
> crush-device-class=
>
> crush-failure-domain=host
>
> crush-root=sgshared
>
> jerasure-per-chunk-alignment=false
>
> k=3
>
> m=2
>
> plugin=jerasure
>
> technique=reed_sol_van
>
> w=8
>
> Create a pool using this profile:
>
> ceph osd pool create ec32pool 1024 1024 erasure es32
>
> List pool detail:
>
> pool 31 'es32' erasure size 5 min_size 4 crush_rule 11 object_hash rjenkins 
> pg_num 1024 pgp_num 1024 last_change 1568 flags hashpspool stripe_width 12288 
> application ES
>
> Here’s the crush rule that’s created:
> {
>
> "rule_id": 11,
>
> "rule_name": "es32",
>
> "ruleset": 11,
>
> "type": 3,
>
> "min_size": 3,
>
> "max_size": 5,
>
> "steps": [
>
> {
>
> "op": "set_chooseleaf_tries",
>
> "num": 5
>
> },
>
> {
>
> "op": "set_choose_tries",
>
> "num": 100
>
> },
>
> {
>
> "op": "take",
>
> "item": -2,
>
> "item_name": "sgshared"
>
> },
>
> {
>
> "op": "chooseleaf_indep",
>
> "num": 0,
>
> "type": "host"
>
> },
>
> {
>
> "op": "emit"
>
> }
>
> ]
>
> },
>
>
>
> From the output of “ceph osd pool ls detail” you can see min_size=4, the 
> crush rule says min_size=3 however the pool does NOT survive 2 hosts failing.
>
>
>
> Am I missing something?
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs : write error: Operation not permitted

2020-01-24 Thread Paul Emmerich
There are two bugs that can cause these application tags to be missing:
one of them is fixed (but old pools aren't fixed automatically), the
other is https://tracker.ceph.com/issues/43061, which happens if you
create the cephfs pools manually.

You can fix the pools like this:

ceph osd pool application set   cephfs  

To work with "ceph fs authorize"

We automatically run this on startup for all cephfs pools in croit to
make the permissions work properly for our users.
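To check whether a pool already carries the right tag, and what the fix
looks like with placeholders spelled out (the key/value layout below is my
understanding of how "ceph fs new" tags its pools):

ceph osd pool application get cephfs_data
# expected: a "cephfs" application whose data/metadata key points at the fs name
# the fix with placeholders filled in:
# ceph osd pool application set <data pool> cephfs data <fs name>
# ceph osd pool application set <metadata pool> cephfs metadata <fs name>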


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Jan 24, 2020 at 2:11 PM Yoann Moulin  wrote:
>
> Le 23.01.20 à 15:51, Ilya Dryomov a écrit :
> > On Wed, Jan 22, 2020 at 8:58 AM Yoann Moulin  wrote:
> >>
> >> Hello,
> >>
> >> On a fresh install (Nautilus 14.2.6) deploy with ceph-ansible playbook 
> >> stable-4.0, I have an issue with cephfs. I can create a folder, I can
> >> create empty files, but cannot write data on like I'm not allowed to write 
> >> to the cephfs_data pool.
> >>
> >>> $ ceph -s
> >>>cluster:
> >>>  id: fded5bb5-62c5-4a88-b62c-0986d7c7ac09
> >>>  health: HEALTH_OK
> >>>
> >>>services:
> >>>  mon: 3 daemons, quorum iccluster039,iccluster041,iccluster042 (age 
> >>> 23h)
> >>>  mgr: iccluster039(active, since 21h), standbys: iccluster041, 
> >>> iccluster042
> >>>  mds: cephfs:3 
> >>> {0=iccluster043=up:active,1=iccluster041=up:active,2=iccluster042=up:active}
> >>>  osd: 24 osds: 24 up (since 22h), 24 in (since 22h)
> >>>  rgw: 1 daemon active (iccluster043.rgw0)
> >>>
> >>>data:
> >>>  pools:   9 pools, 568 pgs
> >>>  objects: 800 objects, 225 KiB
> >>>  usage:   24 GiB used, 87 TiB / 87 TiB avail
> >>>  pgs: 568 active+clean
> >>
> >> The 2 cephfs pools:
> >>
> >>> $ ceph osd pool ls detail | grep cephfs
> >>> pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 
> >>> object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn 
> >>> last_change 83 lfor 0/0/81 flags hashpspool stripe_width 0 
> >>> expected_num_objects 1 application cephfs
> >>> pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 
> >>> object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode warn last_change 
> >>> 48 flags hashpspool stripe_width 0 expected_num_objects 1 
> >>> pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
> >>
> >> The status of the cephfs filesystem:
> >>
> >>> $ ceph fs status
> >>> cephfs - 1 clients
> >>> ==
> >>> +--++--+---+---+---+
> >>> | Rank | State  | MDS  |Activity   |  dns  |  inos |
> >>> +--++--+---+---+---+
> >>> |  0   | active | iccluster043 | Reqs:0 /s |   34  |   18  |
> >>> |  1   | active | iccluster041 | Reqs:0 /s |   12  |   16  |
> >>> |  2   | active | iccluster042 | Reqs:0 /s |   10  |   13  |
> >>> +--++--+---+---+---+
> >>> +-+--+---+---+
> >>> |   Pool  |   type   |  used | avail |
> >>> +-+--+---+---+
> >>> | cephfs_metadata | metadata | 4608k | 27.6T |
> >>> |   cephfs_data   |   data   |0  | 27.6T |
> >>> +-+--+---+---+
> >>> +-+
> >>> | Standby MDS |
> >>> +-+
> >>> +-+
> >>> MDS version: ceph version 14.2.6 
> >>> (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
> >>
> >>
> >>> # mkdir folder
> >>> # echo "foo" > bar
> >>> -bash: echo: write error: Operation not permitted
> >>> # ls -al
> >>> total 4
> >>> drwxrwxrwx  1 root root2 Jan 22 07:30 .
> >>> drwxr-xr-x 28 root root 4096 Jan 21 09:25 ..
> >>> -rw-r--r--  1 root root0 Jan 22 07:30 bar
> >>> drwxrwxrwx  1 root root1 Jan 21 16:49 folder
> >>
> >>> # df -hT .
> >>> Filesystem Type  Size  Used Avail 

[ceph-users] Re: Benchmark results for Seagate Exos2X14 Dual Actuator HDDs

2020-01-16 Thread Paul Emmerich
Sorry, we no longer have these test drives :(


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Jan 16, 2020 at 1:48 PM  wrote:

> Hi,
>
> The results look strange to me...
>
> To begin with, it's strange that read and write performance differs. But
> the thing is that a lot (if not most) large Seagate EXOS drives have
> internal SSD cache (~8 GB of it). I suspect that new EXOS also does and
> I'm not sure if Toshiba has it. It could explain the write performance
> difference in your test.
>
> Try to disable Seagates' write cache with sdparm --set WCE=0 /dev/sdX
> and see how the performance changes. If there is an SSD cache you'll
> probably see an increase in iops. Due to the nature of Bluestore and at
> least with an external block.db on SSD the difference is like ~230 iops
> vs ~1200 iops with iodepth=1. This is the result for ST8000NM0055.
>
> Also it's strange that read performance is almost the same. Can you
> benchmark the drive with fio alone, without Ceph?
>
> > Hi,
> >
> > we ran some benchmarks with a few samples of Seagate's new HDDs that
> > some of you might find interesting:
> >
> > Blog post:
> >
> > https://croit.io/2020/01/06/2020-01-06-benchmark-mach2
> >
> > GitHub repo with scripts and raw data:
> > https://github.com/croit/benchmarks/tree/master/mach2-disks
> >
> > Tl;dr: way faster for writes, somewhat faster for reads in some
> > scenarios
> >
> > Paul
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at
> > https://croit.io
> >
> > Looking for Ceph training? We have some free spots available
> > https://croit.io/training/4-days-ceph-in-depth-training
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io [1]
> > Tel: +49 89 1896585 90
> >
> > Links:
> > --
> > [1] http://www.croit.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Benchmark results for Seagate Exos2X14 Dual Actuator HDDs

2020-01-15 Thread Paul Emmerich
Hi,

we ran some benchmarks with a few samples of Seagate's new HDDs that some
of you might find interesting:

Blog post:
https://croit.io/2020/01/06/2020-01-06-benchmark-mach2

GitHub repo with scripts and raw data:
https://github.com/croit/benchmarks/tree/master/mach2-disks

Tl;dr: way faster for writes, somewhat faster for reads in some scenarios


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io
Looking for Ceph training? We have some free spots available
https://croit.io/training/4-days-ceph-in-depth-training

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Experience with messenger v2 in Nautilus

2020-01-02 Thread Paul Emmerich
We run it on some of our test clusters, but not yet for customer
deployments by default (unless when importing a cluster with it already
enabled).

It's running mostly fine on the test systems nowadays (only issue I can
think of is a rare crash during shutdown of daemons/cli tools, tracker
https://tracker.ceph.com/issues/42583 ). We also had some problems during
upgrades in the earlier Nautilus releases, but that seems to be fixed.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Jan 2, 2020 at 9:28 PM Stefan Kooman  wrote:

> Hi,
>
> I'm wondering how many of are using messenger v2 in Nautilus after
> upgrading from a previous release (Luminous / Mimic).
>
> Does it work for you? Or why did you not enable it (yet)?
>
> Thanks,
>
> Gr. Stefan
>
>
> --
> | BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: radosgw - Etags suffixed with #x0e

2020-01-02 Thread Paul Emmerich
On Wed, Dec 18, 2019 at 11:41 AM Ingo Reimann  wrote:

> Hi,
>
> We had a strange problem with some buckets. After a s3cmd sync, some
> objects got ETAGs with the suffix "#x0e". This rendered the XML output of
> "GET /" e.g. (s3cmd du) invalid. Unfortunately, this behaviour was not
> reproducable but could be fixed by "GET /{object}" + "PUT /{object}" (s3cmd
> get + s3cmd put).
>
> I am not sure, how this appeared and how to avoid that. Just now, we have
> nautilus mons and osds with jewel radosgws.


Please don't mix versions like that. Running nautilus and jewel at the same
time is unsupported. Upgrade everything and check if that solves your
problem.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


> At the time of first appearence, also a nautilus gateway had been online,
> but the requests had been handled by both types.
>
> Any ideas?
>
> best regards,
> Ingo
>
> --
> Ingo Reimann
> Teamleiter Technik
> [ https://www.dunkel.de/ ]
> Dunkel GmbH
> Philipp-Reis-Straße 2
> 65795 Hattersheim
> Fon: +49 6190 889-100
> Fax: +49 6190 889-399
> eMail: supp...@dunkel.de
> http://www.Dunkel.de/   Amtsgericht Frankfurt/Main
> HRB: 37971
> Geschäftsführer: Axel Dunkel
> Ust-ID: DE 811622001
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: High CPU usage by ceph-mgr in 14.2.5

2019-12-19 Thread Paul Emmerich
We're also seeing unusually high mgr CPU usage on some setups; the only
thing they have in common seems to be > 300 OSDs.

Threads using the CPU are "mgr-fin" and "ms_dispatch".
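A quick way to see which threads are burning CPU on the active mgr
(thread names show up in the COMM column; assumes a single ceph-mgr
process on the host):

top -H -p "$(pidof ceph-mgr)"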


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Dec 19, 2019 at 9:40 AM Serkan Çoban  wrote:

> +1
> 1500 OSDs, mgr is constant %100 after upgrading from 14.2.2 to 14.2.5.
>
> On Thu, Dec 19, 2019 at 11:06 AM Toby Darling 
> wrote:
> >
> > On 18/12/2019 22:40, Bryan Stillwell wrote:
> > > That's how we noticed it too.  Our graphs went silent after the upgrade
> > > completed.  Is your large cluster over 350 OSDs?
> >
> > A 'me too' on this - graphs have gone quiet, and mgr is using 100% CPU.
> > This happened when we grew our 14.2.5 cluster from 328 to 436 OSDs.
> >
> > Cheers
> > Toby
> > --
> > Toby Darling, Scientific Computing (2N249)
> > MRC Laboratory of Molecular Biology
> > Francis Crick Avenue
> > Cambridge Biomedical Campus
> > Cambridge CB2 0QH
> > Phone 01223 267070
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD state: transitioning to Stray

2019-12-09 Thread Paul Emmerich
An OSD that is down does not recover or backfill. Faster recovery or
backfill will not resolve down OSDs


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, Dec 9, 2019 at 1:42 PM Thomas Schneider <74cmo...@gmail.com> wrote:

> Hi,
>
> I think I can speed-up the recovery / backfill.
>
> What is the recommended setting for
> osd_max_backfills
> osd_recovery_max_active
> ?
>
> THX
>
> Am 09.12.2019 um 13:36 schrieb Paul Emmerich:
> > This message is expected.
> >
> > But your current situation is a great example of why having a separate
> > cluster network is a bad idea in most situations.
> > First thing I'd do in this scenario is to get rid of the cluster
> > network and see if that helps
> >
> >
> > Paul
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io <http://www.croit.io>
> > Tel: +49 89 1896585 90
> >
> >
> > On Mon, Dec 9, 2019 at 11:22 AM Thomas Schneider <74cmo...@gmail.com
> > <mailto:74cmo...@gmail.com>> wrote:
> >
> > Hi,
> > I had a failure on 2 of 7 OSD nodes.
> > This caused a server reboot and unfortunately the cluster network
> > failed
> > to come up.
> >
> > This resulted in many OSD down situation.
> >
> > I decided to stop all services (OSD, MGR, MON) and to start them
> > sequentially.
> >
> > Now I have multiple OSD marked as down although the service is
> > running.
> > None of these down OSDS is connected to the 2 nodes with failure.
> >
> > In the OSD logs I can see multiple entries like this:
> > 2019-12-09 11:13:10.378 7f9a372fb700  1 osd.374 pg_epoch: 493189
> > pg[11.1992( v 457986'92619 (303558'88266,457986'92619]
> > local-lis/les=466724/466725 n=4107 ec=8346/8346 lis/c 466724/466724
> > les/c/f 466725/466725/176266 468956/493184/468423) [203,412] r=-1
> > lpr=493184 pi=[466724,493184)/1 crt=457986'92619 lcod 0'0 unknown
> > NOTIFY
> > mbc={}] state: transitioning to Stray
> >
> > I tried to restart the impacted OSD w/o success, means the
> > relevant OSD
> > is still marked as down.
> >
> > Is there a procedure to overcome this issue, means getting all OSD
> up?
> >
> > THX
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > <mailto:ceph-users@ceph.io>
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> > <mailto:ceph-users-le...@ceph.io>
> >
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD state: transitioning to Stray

2019-12-09 Thread Paul Emmerich
This message is expected.

But your current situation is a great example of why having a separate
cluster network is a bad idea in most situations.
First thing I'd do in this scenario is to get rid of the cluster network
and see if that helps


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, Dec 9, 2019 at 11:22 AM Thomas Schneider <74cmo...@gmail.com> wrote:

> Hi,
> I had a failure on 2 of 7 OSD nodes.
> This caused a server reboot and unfortunately the cluster network failed
> to come up.
>
> This resulted in many OSD down situation.
>
> I decided to stop all services (OSD, MGR, MON) and to start them
> sequentially.
>
> Now I have multiple OSD marked as down although the service is running.
> None of these down OSDS is connected to the 2 nodes with failure.
>
> In the OSD logs I can see multiple entries like this:
> 2019-12-09 11:13:10.378 7f9a372fb700  1 osd.374 pg_epoch: 493189
> pg[11.1992( v 457986'92619 (303558'88266,457986'92619]
> local-lis/les=466724/466725 n=4107 ec=8346/8346 lis/c 466724/466724
> les/c/f 466725/466725/176266 468956/493184/468423) [203,412] r=-1
> lpr=493184 pi=[466724,493184)/1 crt=457986'92619 lcod 0'0 unknown NOTIFY
> mbc={}] state: transitioning to Stray
>
> I tried to restart the impacted OSD w/o success, means the relevant OSD
> is still marked as down.
>
> Is there a procedure to overcome this issue, means getting all OSD up?
>
> THX
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Size and capacity calculations questions

2019-12-06 Thread Paul Emmerich
Home directories probably mean lots of small objects. The default minimum
allocation size of BlueStore on HDD is 64 KiB, so there's a lot of overhead
for everything smaller.

Details: google "bluestore min alloc size"; it can only be changed at OSD
creation time.
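You can check the currently configured value on an OSD via its admin
socket (osd.0 is just an example); note that what actually applies is
whatever value was in effect when that OSD was created:

ceph daemon osd.0 config get bluestore_min_alloc_size_hdd
# e.g. 65536 -> a 4 KiB object still occupies a full 64 KiB allocation unit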

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Fri, Dec 6, 2019 at 12:57 PM Jochen Schulz 
wrote:

> Hi!
>
> Thank you!
> The output of both commands are below.
> I still dont understand why there are 21T used data (because 5.5T*3 =
> 16.5T != 21T) and why there seems to be only 4.5 T MAX AVAIL, but the
> osd output tells we have 25T free space.
>
>
> $ sudo ceph df
> RAW STORAGE:
> CLASS SIZEAVAIL   USEDRAW USED %RAW USED
> hdd45 TiB  24 TiB  21 TiB   21 TiB 46.33
> ssd   596 GiB 524 GiB 1.7 GiB   72 GiB 12.09
> TOTAL  46 TiB  25 TiB  21 TiB   21 TiB 45.89
>
> POOLS:
> POOLID STORED  OBJECTS USED%USED
> MAX AVAIL
> images   8 149 GiB  38.30k 354 GiB  2.52
>   4.5 TiB
> cephfs_data  9 5.5 TiB  26.61M  20 TiB 60.36
>   4.5 TiB
> cephfs_metadata 10  12 GiB   3.17M  13 GiB  2.57
>   164 GiB
>
>
> $ sudo ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZERAW USE DATAOMAPMETAAVAIL
>%USE  VAR  PGS STATUS
>  0   hdd 0.89000  1.0 931 GiB 456 GiB 453 GiB 136 MiB 3.5 GiB  475
> GiB 49.01 1.07 103 up
>  1   hdd 0.89000  1.0 931 GiB 495 GiB 491 GiB 100 MiB 3.9 GiB  436
> GiB 53.14 1.16  89 up
>  4   hdd 0.89000  1.0 931 GiB 345 GiB 342 GiB 108 MiB 2.9 GiB  586
> GiB 37.05 0.81  87 up
>  5   hdd 0.89000  1.0 931 GiB 521 GiB 517 GiB 108 MiB 4.1 GiB  410
> GiB 55.96 1.22  98 up
>  6   hdd 0.89000  1.0 931 GiB 367 GiB 364 GiB  95 MiB 3.2 GiB  564
> GiB 39.44 0.86  95 up
>  7   hdd 0.89000  1.0 931 GiB 540 GiB 536 GiB  77 MiB 3.7 GiB  392
> GiB 57.96 1.26 111 up
> 20   hdd 0.89000  1.0 931 GiB 382 GiB 378 GiB  60 MiB 3.3 GiB  550
> GiB 40.96 0.89  85 up
> 23   hdd 1.81929  1.0 1.8 TiB 706 GiB 701 GiB 113 MiB 4.9 GiB  1.1
> TiB 37.92 0.83 182 up
> 44   hdd 0.89000  1.0 931 GiB 468 GiB 465 GiB  34 MiB 3.3 GiB  463
> GiB 50.29 1.10  93 up
> 45   hdd 1.78999  1.0 1.8 TiB 882 GiB 875 GiB 138 MiB 6.3 GiB  981
> GiB 47.33 1.03 179 up
> 46   hdd 1.78999  1.0 1.8 TiB 910 GiB 903 GiB 127 MiB 6.4 GiB  953
> GiB 48.83 1.06 192 up
> 22   ssd 0.11639  1.0 119 GiB  15 GiB 357 MiB  12 GiB 2.8 GiB  104
> GiB 12.61 0.27 315 up
> 12   hdd 0.89000  1.0 931 GiB 499 GiB 494 GiB  64 MiB 4.5 GiB  432
> GiB 53.57 1.17 116 up
> 13   hdd 0.89000  1.0 931 GiB 536 GiB 532 GiB  48 MiB 4.4 GiB  395
> GiB 57.59 1.26 109 up
> 30   hdd 0.89000  1.0 931 GiB 510 GiB 506 GiB  33 MiB 3.9 GiB  421
> GiB 54.80 1.19 100 up
> 32   hdd 0.89000  1.0 931 GiB 495 GiB 491 GiB  56 MiB 4.1 GiB  436
> GiB 53.17 1.16 101 up
> 33   hdd 0.89000  1.0 931 GiB 333 GiB 330 GiB  56 MiB 3.1 GiB  598
> GiB 35.80 0.78  82 up
> 15   ssd 0.11639  1.0 119 GiB  14 GiB 336 MiB  11 GiB 2.9 GiB  105
> GiB 12.13 0.26 305 up
> 17   hdd 0.89000  1.0 931 GiB 577 GiB 573 GiB  77 MiB 4.4 GiB  354
> GiB 61.99 1.35  97 up
> 18   hdd 0.89000  1.0 931 GiB 413 GiB 409 GiB  70 MiB 4.0 GiB  518
> GiB 44.34 0.97  95 up
> 19   hdd 1.81879  1.0 1.8 TiB 895 GiB 889 GiB 144 MiB 5.6 GiB  967
> GiB 48.06 1.05 184 up
> 21   hdd 0.89000  1.0 931 GiB 360 GiB 357 GiB  60 MiB 3.4 GiB  570
> GiB 38.72 0.84 100 up
> 31   hdd 0.90909  1.0 931 GiB 508 GiB 505 GiB  80 MiB 3.5 GiB  423
> GiB 54.58 1.19 102 up
> 25   ssd 0.11639  1.0 119 GiB  14 GiB 339 MiB  11 GiB 2.7 GiB  105
> GiB 11.86 0.26 310 up
>  8   hdd 0.89000  1.0 931 GiB 359 GiB 356 GiB  72 MiB 3.1 GiB  572
> GiB 38.55 0.84  80 up
>  9   hdd 0.89000  1.0 931 GiB 376 GiB 373 GiB  42 MiB 3.0 GiB  555
> GiB 40.39 0.88  87 up
> 24   hdd 0.89000  1.0 931 GiB 342 GiB 339 GiB  70 MiB 2.8 GiB  590
> GiB 36.69 0.80  78 up
> 26   hdd 1.78999  1.0 1.8 TiB 921 GiB 915 GiB 129 MiB 6.1 GiB  942
> GiB 49.45 1.08 177 up
> 27   hdd 1.78999  1.0 1.8 TiB 891 GiB 885 GiB 125 MiB 5.7 GiB  972
> GiB 47.82 1.04 208 up
> 35   hdd 1.81929  1.0 1.8 TiB 819 GiB 814 GiB 110 MiB 5.3 GiB  1.0
> TiB 43.99 0.96 184 up
> 29   ssd 0.11638  1.0 119 GiB  15 GiB 339 MiB  11 GiB 2.9 GiB  105
> GiB 12.25 0.27 311 up
> 14   hdd 1

[ceph-users] Re: Upgrade from Jewel to Nautilus

2019-12-05 Thread Paul Emmerich
You should definitely migrate to BlueStore; that'll also take care of the
leveldb/rocksdb upgrade :)
For the mons: since it's super easy to delete and re-create a mon, that's
usually the best/simplest way to go.

Also, note that you can't immediately continue from Luminous to Nautilus:
you have to scrub everything while still on Luminous, because the first
scrub on Luminous performs some data structure migrations that are no
longer supported on Nautilus.
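A sketch for kicking off that pre-upgrade scrub pass and checking the
release flag (expect the scrubs to take a while on a large cluster):

for osd in $(ceph osd ls); do ceph osd scrub "$osd"; done
ceph -s                                    # wait for the scrubbing PGs to drain
ceph osd dump | grep require_osd_release   # should say luminous before you move on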


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Fri, Dec 6, 2019 at 2:06 AM 徐蕴  wrote:

> Hello,
>
> We are planning to upgrade our cluster from Jewel to Nautilus. From my
> understanding, leveldb of monitor and filestore of OSDs will not be
> converted to rocketdb and bluestore automatically. So do you suggest to
> convert them manually after upgrading software? Is there any document or
> guidance available?
>
> Br,
> Xu Yun
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Building a petabyte cluster from scratch

2019-12-03 Thread Paul Emmerich
It's pretty pointless to discuss erasure coding vs replicated without
knowing how it'll be used.

There are setups where erasure coding is faster than replicated. You
do need to write less data overall, so if that's your bottleneck then
erasure coding will be faster.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Dec 3, 2019 at 9:56 PM Nathan Fish  wrote:
>
> If k=8,m=3 is too slow on HDDs, so you need replica 3 and SSD DB/WAL,
> vs EC 8,3 on SSD, then that's (1/3) / (8/11) = 0.45 multiplier on the
> SSD space required vs HDDs.
> That brings it from 6x to 2.7x. Then you have the benefit of not
> needing separate SSDs for DB/WAL both in hardware cost and complexity.
> SSDs will still be more expensive; but perhaps justifiable given the
> performance, rebuild times, etc.
>
> If you only need cold-storage, then EC 8,3 on HDDs will be cheap. But
> is that fast enough?
>
> On Tue, Dec 3, 2019 at 3:47 PM  wrote:
> >
> > >> * Hardware raid with Battery Backed write-cache - will allow OSD to ack
> > >> writes before hitting spinning rust.
> > >
> > > Disagree.  See my litany from a few months ago.  Use a plain, IT-mode HBA.
> > >  Take the $$ you save and put it toward building your cluster out of SSDs
> > > instead of HDDs.  That way you don’t have to mess with the management
> > > hassles of maintaining and allocating external WAL+DB partitions too.
> >
> > These things are not really comparable - are they?  Cost of SSD vs. HDD is
> >  still in the 6:1 favor of HHD's. Yes SSD would be great but not
> > nessesarily affordable - or have I missed something that makes the math
> > work ?
> >
> > --
> > Jesper
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: iSCSI Gateway reboots and permanent loss

2019-12-03 Thread Paul Emmerich
Gateway removal is indeed supported since ceph-iscsi 3.0 (or was it
2.7?) and it works while it is offline :)

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Dec 3, 2019 at 8:21 PM Jason Dillaman  wrote:
>
> If I recall correctly, the recent ceph-iscsi release supports the
> removal of a gateway via the "gwcli". I think the Ceph dashboard can
> do that as well.
>
> On Tue, Dec 3, 2019 at 1:59 PM Wesley Dillingham  
> wrote:
> >
> > We utilize 4 iSCSI gateways in a cluster and have noticed the following 
> > during patching cycles when we sequentially reboot single iSCSI-gateways:
> >
> > "gwcli" often hangs on the still-up iSCSI GWs but sometimes still functions 
> > and gives the message:
> >
> > "1 gateway is inaccessible - updates will be disabled"
> >
> > This got me thinking about what the course of action would be should an 
> > iSCSI gateway fail permanently or semi-permanently, say a hardware issue. 
> > What would be the best course of action to instruct the remaining iSCSI 
> > gateways that one of them is no longer available and that they should allow 
> > updates again and take ownership of the now-defunct-node's LUNS?
> >
> > I'm guessing pulling down the RADOS config object and rewriting it and 
> > re-put'ing it followed by a rbd-target-api restart might do the trick but 
> > am hoping there is a more "in-band" and less potentially devastating way to 
> > do this.
> >
> > Thanks for any insights.
> >
> > Respectfully,
> >
> > Wes Dillingham
> > w...@wesdillingham.com
> > LinkedIn
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> --
> Jason
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4

2019-12-02 Thread Paul Emmerich
On Mon, Dec 2, 2019 at 4:55 PM Simon Ironside  wrote:
>
> Any word on 14.2.5? Nervously waiting here . . .

real soon, the release is 99% done (check the corresponding thread on
the devel mailing list)



Paul

>
> Thanks,
> Simon.
>
> On 18/11/2019 11:29, Simon Ironside wrote:
>
> > I will sit tight and wait for 14.2.5.
> >
> > Thanks again,
> > Simon.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can min_read_recency_for_promote be -1

2019-12-02 Thread Paul Emmerich
I've recently configured something like this for a backup cluster with
these settings:

ceph osd pool set cache_test hit_set_type bloom
ceph osd pool set cache_test hit_set_count 1
ceph osd pool set cache_test hit_set_period 7200
ceph osd pool set cache_test target_max_bytes 1
ceph osd pool set cache_test min_read_recency_for_promote 1
ceph osd pool set cache_test min_write_recency_for_promote 0
ceph osd pool set cache_test cache_target_dirty_ratio 0.1
ceph osd pool set cache_test cache_target_dirty_high_ratio 0.33
ceph osd pool set cache_test cache_target_full_ratio 0.8


The goal here was just to handle bad IO patterns generated by bad
backup software (why do they love to run with a stupidly low queue
depth and small IOs?)
It's not ideal and doesn't really match your use case (since the data
in question isn't read back here)

But yeah, I also thought about building a specialized cache mode that
just acts as a write buffer, there are quite a few applications that
would benefit from that.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Dec 2, 2019 at 11:40 PM Robert LeBlanc  wrote:
>
> I'd like to configure a cache tier to act as a write buffer, so that if 
> writes come in, it promotes objects, but reads never promote an object. We 
> have a lot of cold data so we would like to tier down to an EC pool (CephFS) 
> after a period of about 30 days to save space. The storage tier and the 
> 'cache' tier would be on the same spindles, so the only performance 
> improvement would be from the faster writes with replication. So we don't 
> want to really move data between tiers.
>
> The idea would be to not promote on read since EC read performance is good 
> enough and have writes go to the cache tier where the data may be 'hot' for a 
> week or so, then get cold.
>
> It seems that we would only need one hit_set and if -1 can't be set for 
> min_read_recency_for_promote, I could probably use 2 which would never hit 
> because there is only one set, but that may error too. The follow up is how 
> big a set should be as it only really tells if an object "may" be in cache 
> and does not determine when things are flushed, so it really only matters how 
> out-of-date we are okay with the bloom filter being out of date, right? So we 
> could have it be a day long if we are okay with that stale rate? Is there any 
> advantage to having a longer period for a bloom filter? Now, I'm starting to 
> wonder if I even need a bloom filter for this use case, can I get tiering to 
> work without it and only use cache_min_flush_age/cach_min_evict_age since I 
> don't care about promoting when there are X hits in Y time?
>
> Thanks
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Questions about the EC pool

2019-11-29 Thread Paul Emmerich
It should take ~25 seconds by default to detect a network failure; the
config option that controls this is "osd heartbeat grace" (default 20
seconds, though it takes a little longer than that for the failure to
actually be registered).
Check ceph -w while performing the test.
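If you want to experiment with faster failure detection (this is a
Luminous cluster, so injectargs rather than the config store; 10 seconds
is only an example value, and lower values increase the risk of flapping
OSDs under load):

ceph tell osd.* injectargs '--osd_heartbeat_grace 10'
# also set the same value in the [global] section of ceph.conf so the mons
# and any restarted daemons pick it up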


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Nov 29, 2019 at 8:14 AM majia xiao  wrote:
>
> Hello,
>
>
> We have a Ceph cluster (version 12.2.4) with 10 hosts, and there are 21 OSDs 
> on each host.
>
>
>  An EC pool is created with the following commands:
>
>
> ceph osd erasure-code-profile set profile_jerasure_4_3_reed_sol_van \
>
>   plugin=jerasure \
>
>   k=4 \
>
>   m=3 \
>
>   technique=reed_sol_van \
>
>   packetsize=2048 \
>
>   crush-device-class=hdd \
>
>   crush-failure-domain=host
>
>
> ceph osd pool create pool_jerasure_4_3_reed_sol_van 2048 2048 erasure 
> profile_jerasure_4_3_reed_sol_van
>
>
>
> Here are my questions:
>
> The EC pool is created using k=4, m=3, and crush-device-class=hdd, so we just 
> disable the network interfaces of some hosts (using "ifdown" command) to 
> verify the functionality of the EC pool while performing ‘rados bench’ 
> command.
> However, the IO rate drops immediately to 0 when a single host goes offline, 
> and it takes a long time (~100 seconds) for the IO rate becoming normal.
> As far as I know, the default value of min_size is k+1 or 5, which means that 
> the EC pool can be still working even if there are two hosts offline.
> Is there something wrong with my understanding?
> According to our observations, it seems that the IO rate becomes normal when 
> Ceph detects all OSDs corresponding to the failed host.
> Is there any way to reduce the time needed for Ceph to detect all failed OSDs?
>
>
>
> Thanks for any help.
>
>
> Best regards,
>
> Majia Xiao
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Changing failure domain

2019-11-28 Thread Paul Emmerich
Use a crush rule like this for a replicated pool:

1) root default class XXX
2) choose 2 rooms
3) choose 2 disks

That'll get you 4 OSDs in two rooms; the first 3 of these get data and
the fourth is ignored. That guarantees that losing a room will lose you
at most 2 out of 3 copies. This is for disaster recovery only: it
guarantees durability if you lose a room, but not availability.
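Written out as an editable crush rule, roughly (a sketch: the rule name,
id and device class are placeholders; host is used as the leaf so the two
disks per room land on different hosts; decompile the crush map, add the
rule, recompile and inject it):

# ceph osd getcrushmap -o cm.bin && crushtool -d cm.bin -o cm.txt
rule replicated_two_rooms {
    id 10
    type replicated
    min_size 2
    max_size 4
    step take default class hdd
    step choose firstn 2 type room
    step chooseleaf firstn 2 type host
    step emit
}
# crushtool -c cm.txt -o cm.new && ceph osd setcrushmap -i cm.new
# ceph osd pool set <pool> crush_rule replicated_two_rooms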

3+2 erasure coding cannot be split across two rooms in this way
because, well, you need 3 out of 5 shards to survive, so you cannot
lose half of them.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Nov 28, 2019 at 5:40 PM Francois Legrand  wrote:
>
> Hi,
> I have a cephfs in production based on 2 pools (data+metadata).
>
> Data is  in erasure coding with the profile :
> crush-failure-domain=host
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=3
> m=2
> plugin=jerasure
> technique=reed_sol_van
> w=8
>
> Metadata is in replicated mode with k=3
>
> The crush rules are as follow :
> [
>  {
>  "rule_id": 0,
>  "rule_name": "replicated_rule",
>  "ruleset": 0,
>  "type": 1,
>  "min_size": 1,
>  "max_size": 10,
>  "steps": [
>  {
>  "op": "take",
>  "item": -1,
>  "item_name": "default"
>  },
>  {
>  "op": "chooseleaf_firstn",
>  "num": 0,
>  "type": "host"
>  },
>  {
>  "op": "emit"
>  }
>  ]
>  },
>  {
>  "rule_id": 1,
>  "rule_name": "ec_data",
>  "ruleset": 1,
>  "type": 3,
>  "min_size": 3,
>  "max_size": 5,
>  "steps": [
>  {
>  "op": "set_chooseleaf_tries",
>  "num": 5
>  },
>  {
>  "op": "set_choose_tries",
>  "num": 100
>  },
>  {
>  "op": "take",
>  "item": -1,
>  "item_name": "default"
>  },
>  {
>  "op": "chooseleaf_indep",
>  "num": 0,
>  "type": "host"
>  },
>  {
>  "op": "emit"
>  }
>  ]
>  }
> ]
>
> When we installed it, everything was in the same room, but know we
> splitted our cluster (6 servers but soon 8) in 2 rooms. Thus we updated
> the crushmap by adding a room layer (with ceph osd crush add-bucket
> room1 room etc)  and move all our servers in the tree to the correct
> place (ceph osd crush move server1 room=room1 etc...).
>
> Now, we would like to change the rules to set a failure domain to room
> instead of host (to be sure that in case of disaster in one of the rooms
> we will still have a copy in the other).
>
> What is the best strategy to do this ?
>
> F.
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: EC PGs stuck activating, 2^31-1 as OSD ID, automatic recovery not kicking in

2019-11-22 Thread Paul Emmerich
On Fri, Nov 22, 2019 at 9:33 PM Zoltan Arnold Nagy
 wrote:

> The 2^31-1 in there seems to indicate an overflow somewhere - the way we
> were able to figure out where exactly
> is to query the PG and compare the "up" and "acting" sets - only _one_
> of them had the 2^31-1 number in place
> of the correct OSD number. We restarted that and the PG started doing
> its job and recovered.

no, this value is intentional (and shows up as 'None' on higher level
tools), it means no mapping could be found; check your crush map and
crush rule

Paul


>
> The issue seems to be going back to 2015:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001661.html
> however no solution...
>
> I'm more concerned about the cluster not being able to recover (it's a
> 4+2 EC pool across 12 hosts - plenty of room
> to heal) than about the weird print-out.
>
> The VMs who wanted to access data in any of the affected PGs of course
> died.
>
> Are we missing some settings to let the cluster self-heal even for EC
> pools? First EC pool in production :)
>
> Cheers,
> Zoltan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: msgr2 not used on OSDs in some Nautilus clusters

2019-11-19 Thread Paul Emmerich
There should be a warning that says something like "all OSDs are
running nautilus but require-osd-release nautilus is not set"

That warning did exist for older releases, pretty sure nautilus also has it?

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Nov 19, 2019 at 8:42 PM Bryan Stillwell  wrote:
>
> Closing the loop here.  I figured out that I missed a step during the 
> Nautilus upgrade which was causing this issue:
>
> ceph osd require-osd-release nautilus
>
> If you don't do this your cluster will start having problems once you enable 
> msgr2:
>
> ceph mon enable-msgr2
>
> Based on how hard this was to track down, maybe a check should be added 
> before enabling msgr2 to make sure the require-osd-release is set to nautilus?
>
> Bryan
>
> > On Nov 18, 2019, at 5:41 PM, Bryan Stillwell  wrote:
> >
> > I cranked up debug_ms to 20 on two of these clusters today and I'm still 
> > not understanding why some of the clusters use v2 and some just use v1.
> >
> > Here's the boot/peering process for the cluster which uses v2:
> >
> > 2019-11-18 16:46:03.027 7fabb6281dc0  0 osd.0 39101 done with init, 
> > starting boot process
> > 2019-11-18 16:46:03.028 7fabb6281dc0  1 osd.0 39101 start_boot
> > 2019-11-18 16:46:03.030 7fabaebac700  5 --2- 
> > [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
> > [v2:10.0.32.3:6800/1473285,v1:10.0.32.3:6801/1473285] conn(0x5596b30c3000 
> > 0x5596b4bf4000 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 
> > tx=0).handle_hello received hello: peer_type=16 
> > peer_addr_for_me=v2:10.0.32.67:51508/0
> > 2019-11-18 16:46:03.034 7faba8116700  1 -- 
> > [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] --> 
> > [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] -- osd_boot(osd.0 booted 0 
> > features 4611087854031667199 v39101) v7 -- 0x5596b4bd6000 con 0x5596b3b06400
> > 2019-11-18 16:46:03.034 7faba8116700  5 --2- 
> > [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
> > [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 
> > 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 
> > tx=0).send_message enqueueing message m=0x5596b4bd6000 type=71 
> > osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) v7
> > 2019-11-18 16:46:03.034 7fabaf3ad700 20 --2- 
> > [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
> > [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 
> > 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 
> > tx=0).prepare_send_message m=osd_boot(osd.0 booted 0 features 
> > 4611087854031667199 v39101) v7
> > 2019-11-18 16:46:03.034 7fabaf3ad700 20 --2- 
> > [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
> > [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 
> > 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 
> > tx=0).prepare_send_message encoding features 4611087854031667199 
> > 0x5596b4bd6000 osd_boot(osd.0 booted 0 features 4611087854031667199 v39101) 
> > v7
> > 2019-11-18 16:46:03.034 7fabaf3ad700  5 --2- 
> > [v2:10.0.32.67:6800/258117,v1:10.0.32.67:6801/258117] >> 
> > [v2:10.0.32.65:3300/0,v1:10.0.32.65:6789/0] conn(0x5596b3b06400 
> > 0x5596b2bca580 crc :-1 s=READY pgs=11687624 cs=0 l=1 rx=0 
> > tx=0).write_message sending message m=0x5596b4bd6000 seq=8 osd_boot(osd.0 
> > booted 0 features 4611087854031667199 v39101) v7
> > 2019-11-18 16:46:03.352 7fab9d100700  1 osd.0 39104 state: booting -> active
> > 2019-11-18 16:46:03.354 7fabaebac700  5 --2- 
> > [v2:10.0.32.67:6802/258117,v1:10.0.32.67:6803/258117] >> 
> > [v2:10.0.32.9:6802/3892454,v1:10.0.32.9:6803/3892454] conn(0x5596b4d68800 
> > 0x5596b4bf5080 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=0 rx=0 
> > tx=0).handle_hello received hello: peer_type=4 
> > peer_addr_for_me=v2:10.0.32.67:45488/0
> > 2019-11-18 16:46:03.354 7fabafbae700  5 --2- 
> > [v2:10.0.32.67:6802/258117,v1:10.0.32.67:6803/258117] >> 
> > [v2:10.0.32.142:6810/2881684,v1:10.0.32.142:6811/2881684] 
> > conn(0x5596b4d68000 0x5596b4bf4580 unknown :-1 s=HELLO_CONNECTING pgs=0 
> > cs=0 l=0 rx=0 tx=0).handle_hello received hello: peer_type=4 
> > peer_addr_for_me=v2:10.0.32.67:39044/0
> > 2019-11-18 16:46:03.355 7fabaf3ad700  5 --2-  >> 
> > [v2:10.0.32.67:6814/100535,v1:10.0.32.67:6815/100535] conn(0x5596b4d68400 
> > 0x5596b4bf4b00 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=1 rx=0 
> > tx=0).handle_hello 

[ceph-users] Re: Balancing PGs across OSDs

2019-11-18 Thread Paul Emmerich
You have way too few PGs in one of the roots. Many OSDs have so few
PGs that you should see a lot of health warnings because of it.
The other root has a factor of 5 difference in disk sizes, which isn't
ideal either.
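A few commands that make the imbalance visible and, on Nautilus with the
pg_autoscaler module enabled, suggest targets (the pool name and pg_num
below are only examples):

ceph osd df tree                 # PGS column: roughly 100 per OSD is the usual target
ceph osd pool autoscale-status
# raising pg_num on the under-split pool, e.g.:
# ceph osd pool set hdb_backup pg_num 4096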


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Nov 18, 2019 at 3:03 PM Thomas Schneider <74cmo...@gmail.com> wrote:
>
> Hi,
>
> in this <https://ceph.io/community/the-first-telemetry-results-are-in/>
> blog post I find this statement:
> "So, in our ideal world so far (assuming equal size OSDs), every OSD now
> has the same number of PGs assigned."
>
> My issue is that accross all pools the number of PGs per OSD is not equal.
> And I conclude that this is causing very unbalanced data placement.
> As a matter of fact the data stored on my 1.6TB HDD in specific pool
> "hdb_backup" is in a range starting with
> osd.228 size: 1.6 usage: 52.61 reweight: 1.0
> and ending with
> osd.145 size: 1.6 usage: 81.11 reweight: 1.0
>
> This heavily impacts the amount of data that can be stored in the cluster.
>
> Ceph balancer is enabled, but this is not solving this issue.
> root@ld3955:~# ceph balancer status
> {
> "active": true,
> "plans": [],
> "mode": "upmap"
> }
>
> Therefore I would like to ask for suggestions on how to address this
> unbalanced data distribution.
>
> I have attached pastebin for
> - ceph osd df sorted by usage <https://pastebin.com/QLQHjA9g>
> - ceph osd df tree <https://pastebin.com/SvhP2hp5>
>
> My cluster has multiple crush roots representing different disk types.
> In addition I have defined multiple pools, one pool for each disk type:
> hdd, ssd, nvme.
>
> THX
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: add debian buster stable support for ceph-deploy

2019-11-18 Thread Paul Emmerich
We maintain an unofficial mirror for Buster packages:
https://croit.io/2019/07/07/2019-07-07-debian-mirror


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Nov 18, 2019 at 5:16 PM Jelle de Jong  wrote:
>
> Hello everybody,
>
> Can somebody add support for Debian buster and ceph-deploy:
> https://tracker.ceph.com/issues/42870
>
> Highly appreciated,
>
> Regards,
>
> Jelle de Jong
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow write speed on 3-node cluster with 6* SATA Harddisks (~ 3.5 MB/s)

2019-11-06 Thread Paul Emmerich
On Wed, Nov 6, 2019 at 5:57 PM Hermann Himmelbauer  wrote:
>
> Dear Vitaliy, dear Paul,
>
> Changing the block size for "dd" makes a huge difference.
>
> However, still some things are not fully clear to me:
>
> As recommended, I tried writing / reading directly to the rbd and this
> is blazingly fast:
>
> fio -ioengine=rbd -name=test -direct=1 -rw=read -bs=4M -iodepth=16
> -pool=SATA -rbdname=vm-100-disk-0
>
> write: IOPS=40, BW=160MiB/s (168MB/s)(4096MiB/25529msec)
> read: IOPS=135, BW=542MiB/s (568MB/s)(4096MiB/7556msec)
>
> When I do the same within the virtual machine, I get the following results:
>
> fio --filename=/dev/vdb -name=test -direct=1 -rw=write -bs=4M -iodepth=16

--ioengine=aio
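
i.e. something along these lines inside the VM (a sketch; libaio is the
usual name of the Linux async engine in fio):

fio --filename=/dev/vdb --name=test --direct=1 --rw=write --bs=4M --iodepth=16 --ioengine=libaio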

Also, VirtIO block devices are usually slower than virtual SCSI disks
on VirtIO SCSI controllers.
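
In Proxmox that would roughly mean (untested sketch, adjust storage name
and IDs) switching the controller to VirtIO SCSI and attaching the disk as
scsiX instead of virtioX:

qm set 100 --scsihw virtio-scsi-pci
qm set 100 --scsi1 <rbd-storage>:vm-100-disk-0,cache=writeback,discard=on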

Paul


>
>  cache = writeback --
> Blocksize: 4M:
> read : io=4096.0MB, bw=97640KB/s, iops=23, runt= 42957msec
> write: io=4096.0MB, bw=85250KB/s, iops=20, runt= 49200msec
>
> Blocksize: 4k:
> read : io=4096.0MB, bw=3988.6KB/s, iops=997, runt=1051599msec
> write: io=4096.0MB, bw=14529KB/s, iops=3632, runt=288686msec
> -
>
> The speeds are much slower, though I don't really know why.
> Nevertheless, this seems reasonable as well, although it's strange
> that there's no difference between "cache = unsafe" and "cache =
> writeback". Moreover, I find it strange that reading with 4k blocks is
> that slow, while writing is still OK.
>
> My virtual machine is Debian 8, with a paravirtualized block device
> (/dev/vdb), the process (and qemu parameters) look like the following:
>
> root 1854681 18.3  8.0 5813032 1980804 ? Sl   16:00  20:55
> /usr/bin/kvm -id 100 -name bya-backend -chardev
> socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait -mon
> chardev=qmp,mode=control -chardev
> socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5 -mon
> chardev=qmp-event,mode=control -pidfile /var/run/qemu-server/100.pid
> -daemonize -smbios type=1,uuid=e53d6e2d-708e-4511-bb26-b0f1aefd81c6 -smp
> 4,sockets=1,cores=4,maxcpus=4 -nodefaults -boot
> menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg
> -vnc unix:/var/run/qemu-server/100.vnc,password -cpu
> kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 4096 -device
> pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e -device
> pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f -device
> piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2 -device
> usb-tablet,id=tablet,bus=uhci.0,port=1 -device
> VGA,id=vga,bus=pci.0,addr=0x2 -device
> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -iscsi
> initiator-name=iqn.1993-08.org.debian:01:7c6dc4c7e9f -drive
> if=none,id=drive-ide2,media=cdrom,aio=threads -device
> ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200 -drive
> file=/mnt/pve/pontos-images/images/100/vm-100-disk-2.raw,if=none,id=drive-virtio0,cache=writeback,format=raw,aio=threads,detect-zeroes=on
> -device
> virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100
> -drive
> file=rbd:SATA/vm-100-disk-0:conf=/etc/pve/ceph.conf:id=admin:keyring=/etc/pve/priv/ceph/SATA.keyring,if=none,id=drive-virtio1,cache=writeback,format=raw,aio=threads,detect-zeroes=on
> -device virtio-blk-pci,drive=drive-virtio1,id=virtio1,bus=pci.0,addr=0xb
> -netdev
> type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown
> -device
> e1000,mac=46:22:36:C3:37:7E,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300
> -machine type=pc
>
> In case you have some further speedup-hints for me, I'm glad.
>
> Nevertheless thank you a lot for help!
>
> Best Regards,
> Hermann
>
> Am 05.11.19 um 12:49 schrieb Виталий Филиппов:
> > Yes, cache=unsafe has no effect with RBD. Hm, that's strange, you should
> > get ~40*6 MB/s linear write with 6 HDDs and Bluestore.
> >
> > Try to create a test image and test it with 'fio -ioengine=rbd
> > -name=test -direct=1 -rw=write -bs=4M -iodepth=16 -pool=
> > -rbdname=' from outside a VM.
> >
> > If you still get 4 MB/s, something's wrong with your ceph. If you get
> > adequate performance, something's wrong with your VM settings.
> >
> > 5 ноября 2019 г. 14:31:38 GMT+03:00, Hermann Himmelbauer
> >  пишет:
> >
> > Hi,
> > Thank you for your quick reply, Proxmox offers me "writeback"
> > (cache=writeback) and "writeback unsafe" (cache=unsafe), however, for my
> > "dd" test, this makes no difference at all.
> >
> > I still have write speeds of ~ 4,5 MB/s.
> >
> > Perhaps "dd" disables the write cache?
> >
> > Would it perhaps help to put the journal or something else on a SSD?
> >
> > Best Regards,
> > Hermann
> >
> > Am 05.11.19 um 11:49 schrieb vita...@yourcmc.ru:
> >
> > Use `cache=writeback` QEMU option for HDD clusters, that should
> > solve
> > your issue
> >
> > Hi,
> > I recently 

[ceph-users] Re: Slow write speed on 3-node cluster with 6* SATA Harddisks (~ 3.5 MB/s)

2019-11-05 Thread Paul Emmerich
On Mon, Nov 4, 2019 at 11:44 PM Hermann Himmelbauer  wrote:
>
> Hi,
> I recently upgraded my 3-node cluster to proxmox 6 / debian-10 and
> recreated my ceph cluster with a new release (14.2.4 bluestore) -
> basically hoping to gain some I/O speed.
>
> The installation went flawlessly, reading is faster than before (~ 80
> MB/s), however, the write speed is still really slow (~ 3,5 MB/s).
>
> I wonder if I can do anything to speed things up?
>
> My Hardware is as the following:
>
> 3 Nodes with Supermicro X8DTT-HIBQF Mainboard each,
> 2 OSD per node (2TB SATA harddisks, WDC WD2000F9YZ-0),
> interconnected via Infiniband 40
>
> The network should be reasonably fast, I measure ~ 16 GBit/s with iperf,
> so this seems fine.
>
> I use ceph for RBD only, so my measurement is simply doing a very simple
> "dd" read and write test within a virtual machine (Debian 8) like the
> following:
>
> read:
> dd if=/dev/vdb | pv | dd of=/dev/null
> -> 80 MB/s
>
>
> write:
> dd if=/dev/zero | pv | dd of=/dev/vdb
> -> 3.5 MB/s

you are mainly measuring latency, not bandwidth here. Use a larger
block size (bs=4M) to measure bandwidth.

Using /dev/zero as source can also be a bad benchmark if you are
running with detect-zeroes=unmap (which is the proxmox default IIRC)
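
For a bandwidth number inside the VM, something like fio with a large block
size and non-zero data is more telling than dd (a sketch; note these
examples write to and therefore destroy /dev/vdb):

fio --filename=/dev/vdb --name=write-bw --ioengine=libaio --direct=1 --rw=write --bs=4M --iodepth=16 --size=4G
fio --filename=/dev/vdb --name=read-bw --ioengine=libaio --direct=1 --rw=read --bs=4M --iodepth=16 --size=4G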

Paul

>
> When I do the same on the virtual machine on a disk that is on a NFS
> storage, I get something about 30 MB/s for reading and writing.
>
> If I disable the write cache on all OSD disks via "hdparm -W 0
> /dev/sdX", I gain a little bit of performance, write speed is then 4.3 MB/s.
>
> Thanks to your help from the list I plan to install a second ceph
> cluster which is SSD based (Samsung PM1725b) which should be much
> faster, however, I still wonder if there is any way to speed up my
> harddisk based cluster?
>
> Thank you in advance for any help,
>
> Best Regards,
> Hermann
>
>
> --
> herm...@qwer.tk
> PGP/GPG: 299893C7 (on keyservers)
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Health error right after starting balancer

2019-11-01 Thread Paul Emmerich
Looks like you didn't tell the whole story; please post the *full*
output of ceph -s and ceph osd df tree.

Wild guess: you need to increase "mon max pg per osd"
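
If that turns out to be it, something like this raises the limit (the value
is just an example; on older releases you may have to set it in ceph.conf
and restart the mons/mgr instead):

ceph config set global mon_max_pg_per_osd 400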

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Oct 31, 2019 at 8:17 PM Thomas <74cmo...@gmail.com> wrote:
>
> This is the output of OSD.270 that remains with slow requests blocked
> even after restarting.
> What's the interpretation of it?
>
> root@ld5507:~# ceph daemon osd.270 dump_blocked_ops
> {
>  "ops": [
>  {
>  "description": "osd_pg_create(e293649 59.b:267033
> 59.2c:267033)",
>  "initiated_at": "2019-10-31 19:22:13.563017",
>  "age": 2785.269856041,
>  "duration": 2785.269905628,
>  "type_data": {
>  "flag_point": "started",
>  "events": [
>  {
>  "time": "2019-10-31 19:22:13.563017",
>  "event": "initiated"
>  },
>  {
>  "time": "2019-10-31 19:22:13.563017",
>  "event": "header_read"
>  },
>  {
>  "time": "2019-10-31 19:22:13.563011",
>  "event": "throttled"
>  },
>  {
>  "time": "2019-10-31 19:22:13.563024",
>  "event": "all_read"
>  },
>  {
>  "time": "2019-10-31 20:07:43.881441",
>  "event": "dispatched"
> },
>  {
>  "time": "2019-10-31 20:07:43.881472",
>  "event": "wait for new map"
>  },
>  {
>  "time": "2019-10-31 20:07:44.665714",
>  "event": "started"
>  }
>  ]
>  }
>  },
>  {
>  "description": "osd_pg_create(e293650 59.b:267033
> 59.2c:267033)",
>  "initiated_at": "2019-10-31 19:23:16.150040",
>  "age": 2722.682833165,
>  "duration": 2722.683007228,
>  "type_data": {
>  "flag_point": "delayed",
>  "events": [
>  {
>  "time": "2019-10-31 19:23:16.150040",
>  "event": "initiated"
>  },
>  {
>  "time": "2019-10-31 19:23:16.150040",
>  "event": "header_read"
>  },
>  {
>  "time": "2019-10-31 19:23:16.150035",
>  "event": "throttled"
> },
>  {
>  "time": "2019-10-31 19:23:16.150055",
>  "event": "all_read"
>  },
>  {
>  "time": "2019-10-31 20:07:43.882197",
>  "event": "dispatched"
>  },
>  {
>  "time": "2019-10-31 20:07:43.882198",
>  "event": "wait for new map"
>  }
>  ]
>  }
>  },
>  {
>  "description": "osd_pg_create(e293651 59.b:267033
> 59.2c:267033)",
>  "initiated_at": "2019-10-31 19:23:17.779034",
>  "age": 2721.0538393319998,
>  "duration": 2721.0541152350002,
>  "type_data": {
>  "flag_point": "delayed&

[ceph-users] Re: Ceph Health error right after starting balancer

2019-10-31 Thread Paul Emmerich
Requests stuck for > 2 hours cannot be attributed to "IO load on the cluster".

Looks like some OSDs really are stuck, things to try:

* run "ceph daemon osd.X dump_blocked_ops" on one of the affected OSDs
to see what is stuck
* try restarting OSDs to see if it clears up automatically
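
For example (assuming a systemd-based deployment; take the OSD ids from the
health detail output, e.g. osd.62):

ceph daemon osd.62 dump_blocked_ops
systemctl restart ceph-osd@62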


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Oct 31, 2019 at 2:27 PM Thomas Schneider <74cmo...@gmail.com> wrote:
>
> Hi,
>
> after enabling ceph balancer (with command ceph balancer on) the health
> status changed to error.
> This is the current output of ceph health detail:
> root@ld3955:~# ceph health detail
> HEALTH_ERR 1438 slow requests are blocked > 32 sec; 861 stuck requests
> are blocked > 4096 sec; mon ld5505 is low on available space
> REQUEST_SLOW 1438 slow requests are blocked > 32 sec
> 683 ops are blocked > 2097.15 sec
> 436 ops are blocked > 1048.58 sec
> 191 ops are blocked > 524.288 sec
> 78 ops are blocked > 262.144 sec
> 35 ops are blocked > 131.072 sec
> 11 ops are blocked > 65.536 sec
> 4 ops are blocked > 32.768 sec
> osd.62 has blocked requests > 65.536 sec
> osds 39,72 have blocked requests > 262.144 sec
> osds 6,19,67,173,174,187,188,269,434 have blocked requests > 524.288 sec
> osds
> 8,16,35,36,37,61,63,64,68,73,75,178,186,271,369,420,429,431,433,436 have
> blocked requests > 1048.58 sec
> osds 3,5,7,24,34,38,40,41,59,66,69,74,180,270,370,421,432,435 have
> blocked requests > 2097.15 sec
> REQUEST_STUCK 861 stuck requests are blocked > 4096 sec
> 25 ops are blocked > 8388.61 sec
> 836 ops are blocked > 4194.3 sec
> osds 2,28,29,32,60,65,181,185,268,368,423,424,426 have stuck
> requests > 4194.3 sec
> osds 0,30,70,71,184 have stuck requests > 8388.61 sec
>
> I understand that when the balancer starts shifting PGs to other OSDs,
> this causes IO load on the cluster.
> However, I don't understand why this is affecting the OSDs so heavily.
> And I don't understand why OSDs of a specific type (SSD, NVMe) suffer
> although there's no balancing occurring on them.
>
> Regards
> Thomas
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: iSCSI write performance

2019-10-31 Thread Paul Emmerich
On Fri, Oct 25, 2019 at 11:14 PM Maged Mokhtar  wrote:
> 3. vmotion between a Ceph datastore and an external datastore... this will be
> bad. This seems to be the case you are testing. It is bad because between 2
> different storage systems (IQNs are served on different targets), VAAI xcopy
> cannot be used and VMware does its own copying. It moves data using a 64k block
> size, which gives low performance... to add some flavor, it does indeed use 32
> threads, but unfortunately they use co-located addresses, which does not work
> well in Ceph as they are hitting the same rbd object, which gets serialized
> due to pg locks, so you will not get any parallelization. Your speed will
> mostly be determined by a serial 64k stream, so with 1 ms write latency for an ssd
> cluster, you will get around 64 MB/s... it will be slightly higher as the extra
> threads have some small effect.

Yes, vmotion is the worst IO pattern ever for a sequential copy.

However, the situation you are describing can be fixed with RBD
striping v2, just make Ceph switch to another object every 64kb, see
https://docs.ceph.com/docs/master/dev/file-striping/
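
Striping is a create-time setting, so this only helps for new images; as a
sketch, something like this makes librbd rotate to the next object after
every 64 KB instead of writing 4 MB to one object:

rbd create mypool/myimage --size 100G --object-size 4M --stripe-unit 64K --stripe-count 16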

I'm not sure about the state of striping v2 support in the kernel
module, last time I checked it wasn't supported. But
ceph-iscsi/tcmu-runner got quite good over the past year, I don't see
any point in still using the kernel data path for iscsi nowadays.


Paul


>
> Note your esxtop does show 32 active IOs under ACTV; the QUED of zero is
> not the queue depth, but rather the "queued" IO that ESX would suspend in
> case your active count reaches the maximum per adapter (128).
>
> This is just to clarify: if case 3 is not your primary concern, then I would
> forget about it and benchmark 1 and 2 if they are relevant. Else, if 3 is
> important, I am not sure you can do much as it is happening within
> VMware... maybe there could be a way to map the external IQN to be served by
> the same target serving the Ceph IQN, then there could be a chance the xcopy
> could be activated... Mike would probably know if this has any chance of
> working :)
>
> /Maged
>
>
> On 25/10/2019 22:01, Ryan wrote:
>
> esxtop is showing a queue length of 0
>
> Storage motion to ceph
> DEVICEPATH/WORLD/PARTITION DQLEN WQLEN ACTV 
> QUED %USD  LOAD   CMDS/s  READS/s WRITES/s MBREAD/s MBWRTN/s DAVG/cmd 
> KAVG/cmd GAVG/cmd QAVG/cmd
> naa.6001405ec60d8b82342404d929fbbd03   - 128 -   32   
>  0   25  0.25  1442.32 0.18  1440.50 0.0089.7821.32 0.01  
>   21.34 0.01
>
> Storage motion from ceph
> DEVICEPATH/WORLD/PARTITION DQLEN WQLEN ACTV 
> QUED %USD  LOAD   CMDS/s  READS/s WRITES/s MBREAD/s MBWRTN/s DAVG/cmd 
> KAVG/cmd GAVG/cmd QAVG/cmd
> naa.6001405ec60d8b82342404d929fbbd03   - 128 -   32   
>  0   25  0.25  4065.38  4064.83 0.36   253.52 0.00 7.57 0.01  
>7.58 0.00
>
> I tried using fio like you mentioned but it was hanging with 
> [r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS] and the ETA kept climbing. I ended up using 
> rbd bench on the ceph iscsi gateway. With a 64K write workload I'm seeing 
> 400MB/s transfers.
>
> rbd create test --size 100G --image-feature layering
> rbd map test
> mkfs.ext4 /dev/rbd/rbd/test
> mount /dev/rbd/rbd/test test
>
> rbd create testec --size 100G --image-feature layering --data-pool rbd_ec
> rbd map testec
> mkfs.ext4 /dev/rbd/rbd/testec
> mount /dev/rbd/rbd/testec testec
>
> [root@ceph-iscsi1 mnt]# rbd bench --image test --io-size 64K --io-type write 
> --io-total 10G
> bench  type write io_size 65536 io_threads 16 bytes 10737418240 pattern 
> sequential
>   SEC   OPS   OPS/SEC   BYTES/SEC
> 1  6368   6377.59  417961796.64
> 2 12928   6462.27  423511630.71
> 3 19296   6420.18  420752986.78
> 4 26320   6585.61  431594792.67
> 5 33296   6662.37  436624891.04
> 6 40128   6754.67  442673957.25
> 7 46784   6765.75  443400452.26
> 8 53280   6809.02  446236110.93
> 9 60032   6739.67  441691068.73
>10 66784   6698.91  439019550.77
>11 73616   6690.88  438493253.66
>12 80016   6654.35  436099640.00
>13 85712   6485.07  425005611.11
>14 91088   6202.49  406486113.46
>15 96896   6021.17  394603137.62
>16102368   5741.19  376254347.24
>17107568   5501.57  360550910.38
>18113728   5603.17  367209502.58
>19120144   5820.48  381451245.32
>20126496   5917.60  387816078.53
>21132768   6089.71  399095466.00
>22139040   6306.98  413334431.09
>23145104   6276.42  411331743.63
>24151440   6256.67  410036891.68
>25157808   6261.12  410328554.98
>26163456   6140.03  402392725.36
> elapsed:26  ops:   163840  ops/sec:  6271.36  bytes/sec: 410999626.38
>
> [root@ceph-iscsi1 mnt]# rbd bench --image testec --io-size 64K --io-type 
> write --io-total 10G
> bench  

[ceph-users] Re: iSCSI write performance

2019-10-31 Thread Paul Emmerich
On Mon, Oct 28, 2019 at 8:07 PM Mike Christie  wrote:
>
> On 10/25/2019 03:25 PM, Ryan wrote:
> > Can you point me to the directions for the kernel mode iscsi backend. I
> > was following these directions
> > https://docs.ceph.com/docs/master/rbd/iscsi-target-cli/
> >
>
> If you just wanted to use the krbd device /dev/rbd$N and export it with
> iscsi from a single iscsi target, then you can use targetcli like
> described here:
>
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Storage_Administration_Guide/ch24.html
>
> Instead of /dev/sdb in the example you would just use /dev/rbd$N in the
> section "BLOCK (Linux BLOCK devices)".

this did not work very well in the past (even without multi-pathing)


Paul


>
> Maged and SUSE also have a special module but I am not sure where to get
> their kernels and manuals. Check out suse and petasan's sites for that.
>
>
> > Thanks,
> > Ryan
> >
> >
> > On Fri, Oct 25, 2019 at 11:29 AM Mike Christie  > > wrote:
> >
> > On 10/25/2019 09:31 AM, Ryan wrote:
> > > I'm not seeing the emulate_3pc setting under disks/rbd/diskname when
> >
> > emulate_3pc is only for kernel based backends. tcmu-runner always has
> > xcopy on.
> >
> > > calling info. A google search shows that SUSE Enterprise Storage
> > has it
> > > available. I thought I had the latest packages, but maybe not. I'm
> > using
> > > tcmu-runner 1.5.2 and ceph-iscsi 3.3. Almost all of my VMs are
> > currently
> > > on Nimble iSCSI storage. I've actually tested from both and
> > performance
> > > is the same. Doing the math off the ceph status does show it using 64K
> > > blocks in both cases.
> > >
> > > Control Values
> > > - hw_max_sectors .. 1024
> > > - max_data_area_mb .. 256 (override)
> > > - osd_op_timeout .. 30
> > > - qfull_timeout .. 5
> > >
> > > On Fri, Oct 25, 2019 at 4:46 AM Maged Mokhtar
> > mailto:mmokh...@petasan.org>
> > > >> wrote:
> > >
> > > Actually this may not work if moving from a local datastore to
> > Ceph.
> > > For iSCSI xcopy, both the source and destination need to be
> > > accessible by the target such as in moving vms across Ceph
> > > datastores. So in your case, vmotion will be handled by VMWare
> > data
> > > mover which uses 64K block sizes.
> > >
> > > On 25/10/2019 10:28, Maged Mokhtar wrote:
> > >>
> > >> For vmotion speed, check "emulate_3pc" attribute on the LIO
> > >> target. If 0 (default), VMWare will issue io in 64KB blocks which
> > >> gives low speed. if set to 1  this will trigger VMWare to use
> > vaai
> > >> extended copy, which activates LIO's xcopy functionality which
> > >> uses 512KB block sizes by default. We also bumped the xcopy block
> > >> size to 4M (rbd object size) which gives around 400 MB/s vmotion
> > >> speed, the same speed can also be achieved via Veeam backups.
> > >>
> > >> /Maged
> > >>
> > >> On 25/10/2019 06:47, Ryan wrote:
> > >>> I'm using CentOS 7.7.1908 with kernel
> > 3.10.0-1062.1.2.el7.x86_64.
> > >>> The workload was a VMware Storage Motion from a local SSD backed
> > >>> datastore to the ceph backed datastore. Performance was measured
> > >>> using dstat on the iscsi gateway for network traffic and ceph
> > >>> status as this cluster is basically idle.  I changed
> > >>> max_data_area_mb to 256 and cmdsn_depth to 128. This appears to
> > >>> have given a slight improvement of maybe 10MB/s.
> > >>>
> > >>> Moving VM to the ceph backed datastore
> > >>> io:
> > >>> client:   124 KiB/s rd, 76 MiB/s wr, 95 op/s rd, 1.26k
> > op/s wr
> > >>>
> > >>> Moving VM off the ceph backed datastore
> > >>>   io:
> > >>> client:   344 MiB/s rd, 625 KiB/s wr, 5.54k op/s rd, 62
> > op/s wr
> > >>>
> > >>> I'm going to test bonnie++ with an rbd volume mounted
> > directly on
> > >>> the iscsi gateway. Also will test bonnie++ inside a VM on a ceph
> > >>> backed datastore.
> > >>>
> > >>> On Thu, Oct 24, 2019 at 7:15 PM Mike Christie
> > >>> mailto:mchri...@redhat.com>
> > >> wrote:
> > >>>
> > >>> On 10/24/2019 12:22 PM, Ryan wrote:
> > >>> > I'm in the process of testing the iscsi target feature of
> > >>> ceph. The
> > >>> > cluster is running ceph 14.2.4 and ceph-iscsi 3.3. It
> > >>> consists of 5
> > >>>
> > >>> What kernel are you using?
> > >>>
> > >>> > hosts with 12 SSD OSDs per host. Some basic testing moving
> > >>> VMs to a ceph
> > 

[ceph-users] Re: Correct Migration Workflow Replicated -> Erasure Code

2019-10-30 Thread Paul Emmerich
We've solved this off-list (because I already got access to the cluster)

For the list:

Copying on the rados level is possible, but requires shutting down radosgw
to get a consistent copy. That wasn't feasible here due to the size and
the performance requirements.
We've instead added a second zone to the zonegroup whose placement maps to
an EC pool, and it's currently copying over the data. We'll then make the
second zone master and default and ultimately delete the first one.
This allows for a migration without downtime.
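
Roughly, from memory (names and pools are placeholders, and you still need
a radosgw instance serving the new zone plus the usual sync/system-user
setup):

radosgw-admin zone create --rgw-zonegroup=default --rgw-zone=ec-zone --endpoints=http://rgw2:80
radosgw-admin zone placement modify --rgw-zone=ec-zone --placement-id=default-placement --data-pool=ec-zone.rgw.buckets.data
radosgw-admin period update --commit
# once the sync has caught up:
radosgw-admin zone modify --rgw-zone=ec-zone --master --default
radosgw-admin period update --commit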

Another possibility would be using a Transition lifecycle rule, but
that's not ideal because it doesn't actually change the bucket.

I don't think it would be too complicated to add a native bucket
migration mechanism that works similar to "bucket rewrite" (which is
intended for something similar but different).

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Oct 30, 2019 at 6:46 AM Konstantin Shalygin  wrote:
>
> On 10/29/19 1:40 AM, Mac Wynkoop wrote:
> > So, I'm in the process of trying to migrate our rgw.buckets.data pool
> > from a replicated rule pool to an erasure coded pool. I've gotten the
> > EC pool set up, good EC profile and crush ruleset, pool created
> > successfully, but when I go to "rados cppool xxx.rgw.buckets.data
> > xxx.rgw.buckets.data.new", I get this error after it transfers 4GB of
> > data:
> >
> > error copying object: (2) No such file or directory
> > error copying pool xxx.rgw.buckets.data => xxx.rgw.buckets.data.new:
> > (2) No such file or directory
> >
> > Is "rados cppool" still the blessed way to do the migration, or has
> > something better/not deprecated been developed that I can use?
>
> rados cppool AFAIK lacks support for copying from replicated to EC. Maybe
> that is wrong by now.
>
> Also you can use rados import/export.
>
>
>
> k
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Compression on existing RGW buckets

2019-10-29 Thread Paul Emmerich
On Tue, Oct 29, 2019 at 7:26 PM Bryan Stillwell  wrote:
>
> Thanks Casey,
>
> If I'm understanding this correctly, the only way to turn on RGW compression
> in Luminous is to do it essentially cluster-wide, since all our existing
> buckets use the same placement rule?  That's not going to work for what I
> want to do since it's a shared cluster and other buckets need the performance.

Luminous supports placement rules (just not compression at this
level), but you can create two placement rules (or storage classes)
that go to different pools and you can compress on the pool level.
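
Pool-level compression is just the following two settings (algorithm and
mode are examples), but note it only takes effect on BlueStore OSDs, so on
a FileStore cluster it won't do anything until after the conversion:

ceph osd pool set <pool> compression_algorithm snappy
ceph osd pool set <pool> compression_mode aggressive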

Regarding enabling compression after the fact: could the ancient
"bucket rewrite" be helpful here? I guess it detects that a rewrite
isn't necessary and does nothing, but it should be rather simple
to add a --force-rewrite flag or something?


Paul

>
> We're in the process of upgrading to Nautilus and switching to BlueStore, 
> unfortunately this cluster hasn't been converted yet.  I do appreciate the 
> details of what Nautilus added though!
>
> Thanks again,
> Bryan
>
> On Oct 29, 2019, at 11:12 AM, Casey Bodley  wrote:
> > Luminous docs about pool placement and compression can be found at
> > https://docs.ceph.com/docs/luminous/radosgw/placement/. You're correct
> > that a bucket's placement target is set on creation and can't be
> > changed. But the placement target itself can be modified to enable
> > compression after the fact, and (once the gateways restart) the
> > compression would take effect on new objects uploaded to buckets with
> > that placement rule.
> >
> > In Nautilus, the compression setting is per storage class. See the
> > updated docs at https://docs.ceph.com/docs/nautilus/radosgw/placement/
> > for details. So you could either add a new storage class to your
> > existing placement target that enables compression, and use the S3 apis
> > like COPY Object or lifecycle transitions to compress existing object
> > data. Or you could modify the default STANDARD storage class to enable
> > compression, which would again apply only to new object uploads.
> >
> > For per-user compression, you can specify a default placement target
> > that applies when the user creates new buckets. And as of Nautilus you
> > can specify a default storage class to be used for new object uploads -
> > just note that some 'helpful' s3 clients will insert a
> > 'x-amz-storage-class: STANDARD' header to requests that don't specify
> > one, and the presence of this header will override the user's default
> > storage class.
> >
> > On 10/29/19 12:20 PM, Bryan Stillwell wrote:
> >> I'm wondering if it's possible to enable compression on existing RGW 
> >> buckets?  The cluster is running Luminous 12.2.12 with FileStore as the 
> >> backend (no BlueStore compression then).
> >>
> >> We have a cluster that recently started to rapidly fill up with 
> >> compressible content (qcow2 images) and I would like to enable compression 
> >> for new uploads to slow the growth.  The documentation seems to imply that 
> >> changing zone placement rules can only be done at creation time.  Is there 
> >> something I'm missing that would allow me to enable compression on a 
> >> per-bucket or even a per-user basis after a cluster has been used for 
> >> quite a while?
> >>
> >> Thanks,
> >> Bryan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Choosing suitable SSD for Ceph cluster

2019-10-25 Thread Paul Emmerich
Disabling write cache helps with the 970 Pro, but it still sucks. I've
worked on a setup with heavy metadata requirements (gigantic S3
buckets being listed) that unfortunately had all of that stored on 970
Pros and that never really worked out.
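
For reference, the volatile write cache can usually be toggled like this
(0x6 is the NVMe volatile write cache feature; for SATA it's hdparm).
It helps a bit on these consumer drives, but it doesn't turn them into
datacenter SSDs:

nvme get-feature /dev/nvme0 -f 0x6
nvme set-feature /dev/nvme0 -f 0x6 -v 0
hdparm -W 0 /dev/sdX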

Just get a proper SSD like the 883, 983, or 1725. The (tiny) price
difference vs. the consumer disks just isn't worth the hassle and the
problems you are going to run into.

Paul

On Thu, Oct 24, 2019 at 9:08 PM Hermann Himmelbauer  wrote:
>
> Hi,
> I am running a nice ceph (proxmox 4 / debian-8 / ceph 0.94.3) cluster on
> 3 nodes (supermicro X8DTT-HIBQF), 2 OSD each (2TB SATA harddisks),
> interconnected via Infiniband 40.
>
> Problem is that the ceph performance is quite bad (approx. 30MiB/s
> reading, 3-4 MiB/s writing ), so I thought about plugging into each node
> a PCIe to NVMe/M.2 adapter and install SSD harddisks. The idea is to
> have a faster ceph storage and also some storage extension.
>
> The question is now which SSDs I should use. If I understand it right,
> not every SSD is suitable for ceph, as is denoted at the links below:
>
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> or here:
> https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
>
> In the first link, the Samsung SSD 950 PRO 512GB NVMe is listed as a
> fast SSD for ceph. As the 950 is not available anymore, I ordered a
> Samsung 970 1TB for testing, unfortunately, the "EVO" instead of PRO.
>
> Before equipping all nodes with these SSDs, I did some tests with "fio"
> as recommended, e.g. like this:
>
> fio --filename=/dev/DEVICE --direct=1 --sync=1 --rw=write --bs=4k
> --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
> --name=journal-test
>
> The results are as the following:
>
> ---
> 1) Samsung 970 EVO NVMe M.2 mit PCIe Adapter
> Jobs: 1:
> read : io=26706MB, bw=445MiB/s, iops=113945, runt= 60001msec
> write: io=252576KB, bw=4.1MiB/s, iops=1052, runt= 60001msec
>
> Jobs: 4:
> read : io=21805MB, bw=432.7MiB/s, iops=93034, runt= 60001msec
> write: io=422204KB, bw=6.8MiB/s, iops=1759, runt= 60002msec
>
> Jobs: 10:
> read : io=26921MB, bw=448MiB/s, iops=114859, runt= 60001msec
> write: io=435644KB, bw=7MiB/s, iops=1815, runt= 60004msec
> ---
>
> So the read speed is impressive, but the write speed is really bad.
>
> Therefore I ordered the Samsung 970 PRO (1TB) as it has faster NAND
> chips (MLC instead of TLC). The results are, however even worse for writing:
>
> ---
> Samsung 970 PRO NVMe M.2 mit PCIe Adapter
> Jobs: 1:
> read : io=15570MB, bw=259.4MiB/s, iops=66430, runt= 60001msec
> write: io=199436KB, bw=3.2MiB/s, iops=830, runt= 60001msec
>
> Jobs: 4:
> read : io=48982MB, bw=816.3MiB/s, iops=208986, runt= 60001msec
> write: io=327800KB, bw=5.3MiB/s, iops=1365, runt= 60002msec
>
> Jobs: 10:
> read : io=91753MB, bw=1529.3MiB/s, iops=391474, runt= 60001msec
> write: io=343368KB, bw=5.6MiB/s, iops=1430, runt= 60005msec
> ---
>
> I did some research and found out that the "--sync" flag sets the
> "O_DSYNC" flag, which seems to disable the SSD cache and leads to these
> horrid write speeds.
>
> It seems that this relates to the fact that the write cache is only left
> enabled for SSDs which implement some kind of battery/capacitor buffer that
> guarantees a data flush to the flash in case of a power loss.
>
> However, it seems impossible to find out which SSDs do have this
> power-loss protection. Moreover, these enterprise SSDs are crazy
> expensive compared to the SSDs above, and it's unclear if
> power-loss protection is even available in the NVMe form factor. So
> building a 1 or 2 TB cluster does not seem really affordable/viable.
>
> So, can anyone please give me hints on what to do? Is it possible to ensure
> that the write cache is not disabled in some way (my server is situated
> in a data center, so there will probably never be a loss of power)?
>
> Or is the link above already outdated as newer ceph releases somehow
> deal with this problem? Or maybe a later Debian release (10) will handle
> the O_DSYNC flag differently?
>
> Perhaps I should simply invest in faster (and bigger) harddisks and
> forget the SSD-cluster idea?
>
> Thank you in advance for any help,
>
> Best Regards,
> Hermann
>
>
> --
> herm...@qwer.tk
> PGP/GPG: 299893C7 (on keyservers)
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG badly corrupted after merging PGs on mixed FileStore/BlueStore setup

2019-10-23 Thread Paul Emmerich
On Wed, Oct 23, 2019 at 11:27 PM Sage Weil  wrote:
>
> On Wed, 23 Oct 2019, Paul Emmerich wrote:
> > Hi,
> >
> > I'm working on a curious case that looks like a bug in PG merging
> > maybe related to FileStore.
> >
> > Setup is 14.2.1 that is half BlueStore half FileStore (being
> > migrated), and the number of PGs on an RGW index pool were reduced,
> > now one of the PGs (3 FileStore OSDs) seems to be corrupted. There are
> > some (29) objects that are affected (~20% of the PG), the issue looks
> > like this for one of the affected objects which I'll call dir.A here
> >
> > # object seems to exist according to rados
> > rados -p default.rgw.buckets.index ls | grep .dir.A
> > .dir.A
> >
> > # or doesn't it?
> > rados -p default.rgw.buckets.index get .dir.A -
> > error getting default.rgw.buckets.index/.dir.A: (2) No such file or 
> > directory
> >
> > Running deep-scrub reports that everything is okay with the affected PG
>
> My guess is that the actual file is not in the right directory hash level.
> Did you look at the underlying file system to see if it is clearly out of
> place with the other objects?

The PG is tiny with only ~150 files, so they aren't split into subdirectories;
the file is right there next to all the working objects.

> Also, I'm curious if all of the replicas are similarly affected?  What
> happens if you move the primary to one of the other replicas (e.g., via
> ceph osd primary-affinity) and try reading it then?

yes, I've tried all 3 replicas, same problem :(


Paul


>
> s
>
> >
> > This is what the OSD logs when trying to access it, nothing really
> > relevant with debug 20:
> >
> > 10 osd.57 pg_epoch: 1149030 pg[18.2( v 1148996'1422066
> > (1144429'1418988,1148996'1422066] local-lis/les=1149021/1149022 n=135
> > ec=49611/596 lis/c 1149021/1149021 les/c/f 1149022/1149022/0
> > 1149015/1149021/1149021) [57,0,31] r=0 lpr=1149021 crt=1148996'1422066
> > lcod 1148996'1422065 mlcod 0'0 active+clean] get_object_context: no
> > obc for soid 18:764060e4:::.dir.A:head and !can_create
> >
> > So going one level deeper with ceph-objectstore-tool:
> > # --op list
> > (29 messages like this)
> > error getting default.rgw.buckets.index/.dir.A: (2) No such file or 
> > directory
> > followed by a complete output of the JSON for the objects, including
> > the broken ones
> >
> > # .dir.A dump
> > dump
> > Error stat on : 18.2_head,#18:73996afb:::.dir.A:head#, (2) No such
> > file or directory
> > Error getting snapset on : 18.2_head,#18:73996afb:::.dir.A:head#, (2)
> > No such file or directory
> > {
> > "id": {
> > "oid": ".dir.A",
> > "key": "",
> > "snapid": -2,
> > "hash": 3746994638,
> >     "max": 0,
> > "pool": 18,
> > "namespace": "",
> > "max": 0
> > }
> > }
> >
> > # --op export
> > stops after encountering a bad object with 'export_files error -2'
> >
> > This is the same for all 3 OSDs in that PG.
> >
> > Has anyone encountered something similar? I'll probably just nuke the
> > affected bucket indices tomorrow and re-create them.
> >
> > Paul
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] PG badly corrupted after merging PGs on mixed FileStore/BlueStore setup

2019-10-23 Thread Paul Emmerich
Hi,

I'm working on a curious case that looks like a bug in PG merging
maybe related to FileStore.

Setup is 14.2.1 that is half BlueStore half FileStore (being
migrated), and the number of PGs on an RGW index pool were reduced,
now one of the PGs (3 FileStore OSDs) seems to be corrupted. There are
some (29) objects that are affected (~20% of the PG), the issue looks
like this for one of the affected objects which I'll call dir.A here

# object seems to exist according to rados
rados -p default.rgw.buckets.index ls | grep .dir.A
.dir.A

# or doesn't it?
rados -p default.rgw.buckets.index get .dir.A -
error getting default.rgw.buckets.index/.dir.A: (2) No such file or directory

Running deep-scrub reports that everything is okay with the affected PG

This is what the OSD logs when trying to access it, nothing really
relevant with debug 20:

10 osd.57 pg_epoch: 1149030 pg[18.2( v 1148996'1422066
(1144429'1418988,1148996'1422066] local-lis/les=1149021/1149022 n=135
ec=49611/596 lis/c 1149021/1149021 les/c/f 1149022/1149022/0
1149015/1149021/1149021) [57,0,31] r=0 lpr=1149021 crt=1148996'1422066
lcod 1148996'1422065 mlcod 0'0 active+clean] get_object_context: no
obc for soid 18:764060e4:::.dir.A:head and !can_create

So going one level deeper with ceph-objectstore-tool:
# --op list
(29 messages like this)
error getting default.rgw.buckets.index/.dir.A: (2) No such file or directory
followed by a complete output of the JSON for the objects, including
the broken ones

# .dir.A dump
dump
Error stat on : 18.2_head,#18:73996afb:::.dir.A:head#, (2) No such
file or directory
Error getting snapset on : 18.2_head,#18:73996afb:::.dir.A:head#, (2)
No such file or directory
{
"id": {
"oid": ".dir.A",
"key": "",
"snapid": -2,
"hash": 3746994638,
"max": 0,
"pool": 18,
"namespace": "",
"max": 0
}
}

# --op export
stops after encountering a bad object with 'export_files error -2'

This is the same for all 3 OSDs in that PG.

Has anyone encountered something similar? I'll probably just nuke the
affected bucket indices tomorrow and re-create them.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RadosGW cant list objects when there are too many of them

2019-10-21 Thread Paul Emmerich
On Mon, Oct 21, 2019 at 11:20 AM Arash Shams  wrote:
> Yes, listing v2 is not supported yet. I checked the metadata OSDs and all of
> them are 600 GB 10k HDDs; I don't think this was the issue.
> I will test --allow-unordered.

5 million objects in a single bucket and metadata on HDD is a disaster
waiting to happen if this continues to grow.

rgw setups with large buckets need SSDs (or better NVMe) for metadata
if you value availability. Recovering after a node failure will be
horrible if you keep this on HDDs.
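
If the cluster already has some SSDs, moving the index pool over is usually
just a device-class rule away, e.g. (rule name is arbitrary, and this will
trigger a rebalance of the index PGs):

ceph osd crush rule create-replicated rgw-index-ssd default host ssd
ceph osd pool set default.rgw.buckets.index crush_rule rgw-index-ssd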

Paul


>
>
> Regards
> ____
> From: Paul Emmerich 
> Sent: Thursday, October 17, 2019 10:00 AM
> To: Arash Shams 
> Cc: ceph-users@ceph.io 
> Subject: Re: [ceph-users] RadosGW cant list objects when there are too many 
> of them
>
> Listing large buckets is slow due to S3 ordering requirements, it's
> approximately O(n^2).
> However, I wouldn't consider 5M to be a large bucket, it should go to
> only ~50 shards which should still perform reasonable. How fast are
> your metadata OSDs?
>
> Try --allow-unordered in radosgw-admin to get an unordered result
> which is only O(n) as you'd expect.
>
> For boto3: I'm not sure if v2 object listing is available yet (I think
> it has only been merged into master but has not yet made it into a
> release?). It doesn't support unordered listing but there has been
> some work to implement it there, not sure about the current state.
>
>
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Thu, Oct 17, 2019 at 9:19 AM Arash Shams  wrote:
> >
> > Dear All
> >
> > I have a bucket with 5 million Objects and I cant list objects with
> > radosgw-admin bucket list --bucket=bucket | jq .[].name
> > or listing files using boto3
> >
> > s3 = boto3.client('s3',
> >   endpoint_url=credentials['endpoint_url'],
> >   aws_access_key_id=credentials['access_key'],
> >   aws_secret_access_key=credentials['secret_key'])
> >
> > response = s3.list_objects_v2(Bucket=bucket_name)
> > for item in response['Contents']:
> >     print(item['Key'])
> >
> > what is the solution ? how can I find list of my objects ?
> >
> >
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

