Re: [ceph-users] Unexpected "out" OSD behaviour
Dear Jonas,

I tried just now on a 14.2.5 cluster, and sadly, the unexpected behaviour is still there, i.e. an OSD marked "out" and then restarted is no longer considered as a data source. I also tried with a 13.2.8 OSD (in a cluster running 13.2.6 on the other OSDs, MONs and MGRs), with the same effect.

However, the trick you described ("mark your OSD in and then out right away") helps in both cases: the data on the OSDs is considered as a data source again and any degradation is gone.

So while I think your patch should solve the issue, for some reason it does not seem to be effective.

Cheers,
Oliver

Am 22.12.19 um 23:50 schrieb Oliver Freyermuth:
> Dear Jonas,
>
> Am 22.12.19 um 23:40 schrieb Jonas Jelten:
>> hi!
>>
>> I've also noticed that behavior and have submitted a patch some time ago
>> that should fix (2):
>> https://github.com/ceph/ceph/pull/27288
>
> thanks, this does indeed seem very much like the issue I saw!
> I'm luckily not in a critical situation at the moment, but was just wondering
> if this behaviour was normal (since it does not fit well with the goal of
> ensuring maximum possible redundancy at all times).
>
> However, I observed this on 13.2.6, which - if I read the release notes
> correctly - should already have your patch in. Strange.
>
>> But it may well be that there's more cases where PGs are not discovered on
>> devices that do have them. Just recently a lot of my data was degraded and
>> then recreated even though it would have been available on a node that had
>> taken very long to reboot.
>
> We've set "mon_osd_down_out_subtree_limit" to "host" to make sure recovery of
> data from full hosts does not start without one of us admins telling Ceph to
> go ahead. Maybe this also helps in your case?
>
>> What you can do also is to mark your OSD in and then out right away, the
>> data is discovered then. Although with my patch that shouldn't be necessary
>> any more. Hope this helps you.
>
> I will keep this in mind the next time it happens (I may be able to provoke
> it, we have to drain more nodes, and once the next node is almost empty,
> I can just restart one of the "out" OSDs and see what happens).
>
> Cheers and many thanks,
> Oliver
>
>> Cheers
>> -- Jonas
>>
>> On 22/12/2019 19.48, Oliver Freyermuth wrote:
>>> Dear Cephers,
>>>
>>> I realized the following behaviour only recently:
>>>
>>> 1. Marking an OSD "out" sets the weight to zero and allows to migrate data
>>>    away (as long as it is up), i.e. it is still considered as a "source"
>>>    and nothing goes to degraded state (so far, everything expected).
>>> 2. Restarting an "out" OSD, however, means it will come back with "0 pgs",
>>>    and if data was not fully migrated away yet, it means the PGs which were
>>>    still kept on it before will enter degraded state since they now lack a
>>>    copy / shard.
>>>
>>> Is (2) expected?
>>>
>>> If so, my understanding that taking an OSD "out" lets the data be migrated
>>> away without losing any redundancy is wrong, since redundancy will be lost
>>> as soon as the "out" OSD is restarted (e.g. due to a crash, node
>>> reboot, ...) and the only safe options would be:
>>> 1. Disable the automatic balancer.
>>> 2. Either adjust the weights of the OSDs to drain to zero, or use pg upmap
>>>    to drain them.
>>> 3. Re-enable the automatic balancer only after having fully drained those
>>>    OSDs and performing the necessary intervention (in our case, recreating
>>>    the OSDs with a faster blockdb).
>>>
>>> Is this correct?
>>>
>>> Cheers,
>>> Oliver
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
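The "mark your OSD in and then out right away" trick quoted above can be scripted. A minimal sketch, assuming osd.12 is the restarted "out" OSD (the function name and OSD id are mine, not from the thread):

```shell
# Sketch of the in-then-out workaround described in this thread.
# The OSD id is an assumption; substitute the id of the restarted "out" OSD.
reinout() {
    osd="osd.$1"
    ceph osd in  "$osd"   # briefly "in": the OSD re-registers its PGs as a data source
    ceph osd out "$osd"   # "out" again right away: weight returns to zero
}
# usage: reinout 12
```

The two commands must run back to back so that no data starts moving onto the OSD in between.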
Re: [ceph-users] Unexpected "out" OSD behaviour
Dear Jonas,

Am 22.12.19 um 23:40 schrieb Jonas Jelten:
> hi!
>
> I've also noticed that behavior and have submitted a patch some time ago
> that should fix (2):
> https://github.com/ceph/ceph/pull/27288

thanks, this does indeed seem very much like the issue I saw!
I'm luckily not in a critical situation at the moment, but was just wondering if this behaviour was normal (since it does not fit well with the goal of ensuring maximum possible redundancy at all times).

However, I observed this on 13.2.6, which - if I read the release notes correctly - should already have your patch in. Strange.

> But it may well be that there's more cases where PGs are not discovered on
> devices that do have them. Just recently a lot of my data was degraded and
> then recreated even though it would have been available on a node that had
> taken very long to reboot.

We've set "mon_osd_down_out_subtree_limit" to "host" to make sure recovery of data from full hosts does not start without one of us admins telling Ceph to go ahead. Maybe this also helps in your case?

> What you can do also is to mark your OSD in and then out right away, the
> data is discovered then. Although with my patch that shouldn't be necessary
> any more. Hope this helps you.

I will keep this in mind the next time it happens (I may be able to provoke it, we have to drain more nodes, and once the next node is almost empty, I can just restart one of the "out" OSDs and see what happens).

Cheers and many thanks,
Oliver

> Cheers
> -- Jonas
>
> On 22/12/2019 19.48, Oliver Freyermuth wrote:
>> Dear Cephers,
>>
>> I realized the following behaviour only recently:
>>
>> 1. Marking an OSD "out" sets the weight to zero and allows to migrate data
>>    away (as long as it is up), i.e. it is still considered as a "source"
>>    and nothing goes to degraded state (so far, everything expected).
>> 2. Restarting an "out" OSD, however, means it will come back with "0 pgs",
>>    and if data was not fully migrated away yet, it means the PGs which were
>>    still kept on it before will enter degraded state since they now lack a
>>    copy / shard.
>>
>> Is (2) expected?
>>
>> If so, my understanding that taking an OSD "out" lets the data be migrated
>> away without losing any redundancy is wrong, since redundancy will be lost
>> as soon as the "out" OSD is restarted (e.g. due to a crash, node
>> reboot, ...) and the only safe options would be:
>> 1. Disable the automatic balancer.
>> 2. Either adjust the weights of the OSDs to drain to zero, or use pg upmap
>>    to drain them.
>> 3. Re-enable the automatic balancer only after having fully drained those
>>    OSDs and performing the necessary intervention (in our case, recreating
>>    the OSDs with a faster blockdb).
>>
>> Is this correct?
>>
>> Cheers,
>> Oliver
[ceph-users] Unexpected "out" OSD behaviour
Dear Cephers,

I realized the following behaviour only recently:

1. Marking an OSD "out" sets the weight to zero and allows to migrate data
   away (as long as it is up), i.e. it is still considered as a "source" and
   nothing goes to degraded state (so far, everything expected).
2. Restarting an "out" OSD, however, means it will come back with "0 pgs",
   and if data was not fully migrated away yet, it means the PGs which were
   still kept on it before will enter degraded state since they now lack a
   copy / shard.

Is (2) expected?

If so, my understanding that taking an OSD "out" lets the data be migrated away without losing any redundancy is wrong, since redundancy will be lost as soon as the "out" OSD is restarted (e.g. due to a crash, node reboot, ...) and the only safe options would be:

1. Disable the automatic balancer.
2. Either adjust the weights of the OSDs to drain to zero, or use pg upmap
   to drain them.
3. Re-enable the automatic balancer only after having fully drained those
   OSDs and performing the necessary intervention (in our case, recreating
   the OSDs with a faster blockdb).

Is this correct?

Cheers,
Oliver
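The three-step "safe drain" procedure above can be sketched as shell. All names here are assumptions (osd.12 as the OSD to drain); this is illustrative, not a definitive runbook:

```shell
# Sketch of the safe-drain steps listed above. The OSD id is an assumption.
drain_osd() {
    osd="osd.$1"
    ceph balancer off                     # 1. disable the automatic balancer
    ceph osd crush reweight "$osd" 0      # 2. drain via CRUSH weight zero
    # ... wait until no misplaced/degraded objects remain, then service the OSD ...
    ceph balancer on                      # 3. re-enable only after the drain
}
# usage: drain_osd 12
```

Draining via `pg upmap` instead of CRUSH reweight (the other option mentioned above) would replace step 2 with per-PG upmap entries.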
Re: [ceph-users] dashboard hangs
Hi,

On 2019-11-20 15:55, thoralf schulze wrote:
> hi,
>
> we were able to track this down to the auto balancer: disabling the auto
> balancer and cleaning out old (and probably not very meaningful)
> upmap-entries via
>   ceph osd rm-pg-upmap-items
> brought back stable mgr daemons and a usable dashboard.

I can confirm that. In our case, I see this on a 14.2.4 cluster (which started its life with an earlier Nautilus version and developed this issue over the past weeks), and doing:
  ceph balancer off
has been sufficient to make the mgrs stable again (i.e. I left the upmap-items in place).

Interestingly, we did not see this with Luminous or Mimic on different clusters (which however have a more stable number of OSDs).

@devs: If there's any more info needed to track this down, please let us know.

Cheers,
Oliver

> the not-so-sensible upmap-entries might or might not have been caused by us
> updating from mimic to nautilus - it's too late to debug this now.
>
> this seems to be consistent with bryan stillwell's findings ("mgr hangs
> with upmap balancer").
>
> thank you very much & with kind regards,
> thoralf.
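The cleanup thoralf describes (removing stale upmap entries after disabling the balancer) can be sketched as below. This is purely illustrative; on a real cluster, review each entry before removing it:

```shell
# Sketch: list all pg_upmap_items entries from "ceph osd dump" and remove
# them one by one, mirroring the workaround described above.
cleanup_upmaps() {
    ceph osd dump \
      | awk '/^pg_upmap_items/ {print $2}' \
      | while read -r pgid; do
            echo "removing upmap entry for ${pgid}"
            ceph osd rm-pg-upmap-items "$pgid"
        done
}
# usage: ceph balancer off && cleanup_upmaps
```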
Re: [ceph-users] Can't create erasure coded pools with k+m greater than hosts?
On 2019-10-24 09:46, Janne Johansson wrote:
> (Slightly abbreviated)
>
> Den tors 24 okt. 2019 kl 09:24 skrev Frank Schilder <fr...@dtu.dk>:
>
>> What I learned are the following:
>> 1) Avoid this work-around for too few hosts for an EC rule at all cost.
>> 2) Do not use EC 2+1. It does not offer anything interesting for
>>    production. Use 4+2 (or 8+2, 8+3 if you have the hosts).
>> 3) If you have no perspective of getting at least 7 servers in the long
>>    run (4+2=6 for the EC profile, +1 for fail-over / automatic rebuild),
>>    do not go for EC.
>> 4) Before you start thinking about replicating to a second site, you
>>    should have a primary site running solid first.
>>
>> This is collected from my experience. I would do things differently now and
>> maybe it helps you with deciding how to proceed. It's basically about what
>> resources you can expect in the foreseeable future and what compromises you
>> are willing to make with regards to sleep and sanity.
>
> Amen to all of those points. We made similar-but-not-same mistakes on an EC
> cluster here. You are going to produce more tears than I/O if you make the
> mis-designs mentioned above. We could add:
> 5) Never buy SMR drives, pretend they don't even exist. If a similar
>    technology appears tomorrow for cheap SSD/NVMe, skip it.

Amen from my side, too. Luckily, we only made a small fraction of these mistakes (running 4+2 on 6 servers and wondering about funny effects when taking one server offline, while we were still testing the setup, before we finally decided to ask for a 7th server), but this can in parts be extrapolated.

Concerning SMR, I learnt that SMR-awareness is on Ceph's roadmap (for host-managed SMR drives). Once that is available, host-managed SMR drives should be a well-working and cheap solution, especially for backup / WORM workloads. But as of now, even disk vendors will tell you to avoid SMR for datacenter setups (unless you have a storage system aware of it and host-managed drives).

Cheers,
Oliver

> --
> May the most significant bit of your life be positive.
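The 4+2 profile recommended above, with host failure domain, can be created as sketched below. The profile and pool names and the PG count are assumptions, not taken from the thread:

```shell
# Sketch: create a 4+2 EC profile with host failure domain and an EC pool
# using it, per the recommendation above. Names/pg counts are assumptions.
make_ec_pool() {
    ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
    ceph osd pool create ecpool 128 128 erasure ec42
    ceph osd pool set ecpool allow_ec_overwrites true  # needed for CephFS/RBD data pools
}
# usage: make_ec_pool
```

With `crush-failure-domain=host`, each of the k+m=6 chunks lands on a different host, which is why at least 7 hosts are advisable (one spare for automatic rebuild).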
Re: [ceph-users] POOL_TARGET_SIZE_BYTES_OVERCOMMITTED
Hi together,

can somebody confirm whether I should put this in a ticket, or whether this is wanted (but very unexpected) behaviour?

We have some pools which gain a factor of three by compression:

POOL  ID  STORED   OBJECTS  USED     %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY    USED COMPR  UNDER COMPR
rbd   2   1.2 TiB  472.44k  1.8 TiB  35.24  1.1 TiB    N/A            N/A          472.44k  717 GiB     2.1 TiB

so as of now, this always leads to a health warning via the pg-autoscaler as soon as the cluster is 33 % filled, since it thinks the subtree is overcommitted:

POOL                      SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
default.rgw.buckets.data  61358M               3.0   5952G         0.0302  0.0700        1.0   32                  on
rbd                       1856G                3.0   5952G         0.9359  0.9200        1.0   256                 on

Cheers,
Oliver

Am 12.09.19 um 23:34 schrieb Oliver Freyermuth:
> Dear Cephalopodians,
>
> I can confirm the same problem described by Joe Ryner in 14.2.2. I'm also
> getting (in a small test setup):
>
> # ceph health detail
> HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes; 1 subtrees
> have overcommitted pool target_size_ratio
> POOL_TARGET_SIZE_BYTES_OVERCOMMITTED 1 subtrees have overcommitted pool
> target_size_bytes
>     Pools ['rbd', '.rgw.root', 'default.rgw.control', 'default.rgw.meta',
>     'default.rgw.log', 'default.rgw.buckets.index',
>     'default.rgw.buckets.data'] overcommit available storage by 1.068x due
>     to target_size_bytes 0 on pools []
> POOL_TARGET_SIZE_RATIO_OVERCOMMITTED 1 subtrees have overcommitted pool
> target_size_ratio
>     Pools ['rbd', '.rgw.root', 'default.rgw.control', 'default.rgw.meta',
>     'default.rgw.log', 'default.rgw.buckets.index',
>     'default.rgw.buckets.data'] overcommit available storage by 1.068x due
>     to target_size_ratio 0.000 on pools []
>
> However, there's not much actual data STORED:
>
> # ceph df
> RAW STORAGE:
> CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
> hdd    4.0 TiB  2.6 TiB  1.4 TiB  1.4 TiB   35.94
> TOTAL  4.0 TiB  2.6 TiB  1.4 TiB  1.4 TiB   35.94
>
> POOLS:
> POOL                       ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
> rbd                        2   676 GiB  266.40k  707 GiB  23.42  771 GiB
> .rgw.root                  9   1.2 KiB  4        768 KiB  0      771 GiB
> default.rgw.control        10  0 B      8        0 B      0      771 GiB
> default.rgw.meta           11  1.2 KiB  8        1.3 MiB  0      771 GiB
> default.rgw.log            12  0 B      175      0 B      0      771 GiB
> default.rgw.buckets.index  13  0 B      1        0 B      0      771 GiB
> default.rgw.buckets.data   14  249 GiB  99.62k   753 GiB  24.57  771 GiB
>
> The main culprit here seems to be the default.rgw.buckets.data pool, but
> also the rbd pool contains thin images. As in the case of Joe, the
> autoscaler seems to look at the "USED" space, not at the "STORED" bytes:
>
> POOL                       SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
> default.rgw.meta           1344k                3.0   4092G         0.                    1.0   8                   on
> default.rgw.buckets.index  0                    3.0   4092G         0.                    1.0   8                   on
> default.rgw.control        0                    3.0   4092G         0.                    1.0   8                   on
> default.rgw.buckets.data   788.6G               3.0   4092G         0.5782                1.0   128                 on
> .rgw.root                  768.0k               3.0   4092G         0.                    1.0   8                   on
> rbd                        710.8G               3.0   4092G         0.5212                1.0   64                  on
> default.rgw.log            0                    3.0   4092G         0.                    1.0   8                   on
>
> This does seem like a bug to me. The warning actually fires on a cluster
> with 35 % raw usage, and things are mostly balanced. Is there already a
> tracker entry on this?
>
> Cheers,
> Oliver
>
> On 2019-05-01 22:01, Joe Ryner wrote:
>> I think I have figured out the issue. POOL
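To compare the autoscaler's view (SIZE, i.e. USED) with the actual STORED bytes, and to pin an explicit target, the commands below can be used. The pool name and the ratio value are assumptions for illustration:

```shell
# Sketch: inspect the pg-autoscaler's per-pool view and set an explicit
# target ratio for a pool, as discussed in this thread. Pool name and
# ratio are assumptions.
autoscale_overview() {
    ceph osd pool autoscale-status           # SIZE column here is USED, not STORED
    ceph df                                  # compare STORED vs USED per pool
    ceph osd pool set rbd target_size_ratio 0.92   # pin an explicit target
}
# usage: autoscale_overview
```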
Re: [ceph-users] eu.ceph.com mirror out of sync?
Dear Wido,

On 2019-09-24 08:53, Wido den Hollander wrote:
> On 9/17/19 11:01 PM, Oliver Freyermuth wrote:
>> Dear Cephalopodians,
>>
>> I realized just now that:
>> https://eu.ceph.com/rpm-nautilus/el7/x86_64/
>> still holds only releases up to 14.2.2, and nothing is to be seen of 14.2.3
>> or 14.2.4, while the main repository at:
>> https://download.ceph.com/rpm-nautilus/el7/x86_64/
>> looks as expected.
>>
>> Is this issue with the eu.ceph.com mirror already known?
>
> I missed this message and I see what's going on. Going to fix it right away.
> I manage this mirror.

many thanks, it looks like it's already fixed now, at least the new packages are popping up :-).

I'll also contact the other mirror owners whose mirrors appear to have issues or are out of sync, now that I have been pointed to the list of people managing them.

Cheers and thanks,
Oliver

> Wido
>
>> Cheers,
>> Oliver
Re: [ceph-users] eu.ceph.com mirror out of sync?
Dear Matthew,

On 2019-09-24 01:50, Matthew Taylor wrote:
> Hi David,
>
> RedHat staff had transitioned the mirror mailing list to a new domain and a
> self-hosted instance of Mailman on this date:
>
>   Subject: [Ceph-mirrors] FYI: Mailing list domain change
>   Date: Mon, 17 Jun 2019 16:19:55 -0400
>   From: David Galloway
>   To: ceph-mirr...@lists.ceph.com
>
> The new mirror list email is: ceph-mirr...@ceph.io
> You can subscribe to the list via this URL:
> https://lists.ceph.io/postorius/lists/
>
> Please note that the actual mirror "project" is quite loose and vastly
> ignored, as mirrors can easily be considered 'set and forget' once set up.

many thanks! This clarifies why I have never seen any mirror discussions, and the fact that so many of the mirrors are either unreachable or out of sync at the same time (well, probably since a long time, but I checked only now).

> We used to have some strong advocates promoting improvement on the older
> mailing list (myself included), however the list itself (old and new) has
> next to no traffic on it, inclusive of RedHat staff. The list has been
> active since 2015-11-10 (thank you, Wido).
>
> With that being said, and to be fair, the official docs at the time of
> writing this don't really give any direction about the mailing list or the
> project itself:
> https://docs.ceph.com/docs/master/install/mirrors/

As a "mirror user", indeed all this was very unclear to me, since those mirrors are "just part of the install instructions".

> At this stage, I can really only suggest reaching out to the individual
> mirror maintainers should you have issues with them. Here is a list of
> current mirrors and their maintainers' contact info:
> https://github.com/ceph/ceph/blob/master/mirroring/MIRRORS

Many thanks for this! This is really helpful. I see Wido is there for the EU mirror. Since he is usually very active on this list, I guess he is on well-deserved holidays, which would explain the silence ;-).

In any case, I will walk through the list later and contact those mirror operators whose mirrors are either out of date or unreachable. An automated script checking https://MIRROR_URL/timestamp and alerting mirror owners automatically could technically also do this.

Many thanks for the valuable information and your work in maintaining au.ceph.com!

Cheers,
Oliver

> Cheers,
> Matthew.
> (au.ceph.com maintainer)
>
> On 24/9/19 6:48 am, David Majchrzak, ODERLAND Webbhotell AB wrote:
>> Hi,
>>
>> I'll have a look at the status of se.ceph.com tomorrow morning, it's
>> maintained by us.
>>
>> Kind Regards,
>> David
>>
>> On mån, 2019-09-23 at 22:41 +0200, Oliver Freyermuth wrote:
>>> Hi together,
>>>
>>> the EU mirror still seems to be out of sync - does somebody on this list
>>> happen to know whom to contact about this? Or is this mirror unmaintained
>>> and we should switch to something else?
>>>
>>> Going through the list of appropriate mirrors from
>>> https://docs.ceph.com/docs/master/install/mirrors/ (we are in Germany),
>>> I also find http://de.ceph.com/ (the mirror in Germany) to be
>>> non-resolvable.
>>>
>>> Closest by for us is then possibly France:
>>> http://fr.ceph.com/rpm-nautilus/el7/x86_64/
>>> but also here, there's only 14.2.2, so that's also out of sync.
>>>
>>> So in the EU, at least geographically, this only leaves Sweden and the UK.
>>> Sweden at se.ceph.com does not load for me, but the UK indeed seems fine.
>>>
>>> Should people in the EU use that mirror, or should we all just use
>>> download.ceph.com instead of something geographically close by?
>>>
>>> Cheers,
>>> Oliver
>>>
>>> On 2019-09-17 23:01, Oliver Freyermuth wrote:
>>>> Dear Cephalopodians,
>>>>
>>>> I realized just now that:
>>>> https://eu.ceph.com/rpm-nautilus/el7/x86_64/
>>>> still holds only releases up to 14.2.2, and nothing is to be seen of
>>>> 14.2.3 or 14.2.4, while the main repository at:
>>>> https://download.ceph.com/rpm-nautilus/el7/x86_64/
>>>> looks as expected.
>>>>
>>>> Is this issue with the eu.ceph.com mirror already known?
>>>>
>>>> Cheers,
>>>> Oliver
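The automated staleness check mentioned above could look like this. It assumes, as the mail suggests, that each mirror publishes a /timestamp file; the epoch format, the one-day threshold, and the function name are my assumptions:

```shell
# Sketch of the automated mirror-staleness check suggested above.
# Assumption: each mirror serves a /timestamp file containing a Unix epoch.
check_mirror() {
    url="$1"
    ts=$(curl -sf "${url}/timestamp") || { echo "${url}: unreachable"; return; }
    age=$(( $(date +%s) - ts ))
    if [ "$age" -gt 86400 ]; then
        echo "${url}: stale (${age}s behind)"
    else
        echo "${url}: ok"
    fi
}
# usage: for m in https://eu.ceph.com https://fr.ceph.com; do check_mirror "$m"; done
```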
Re: [ceph-users] eu.ceph.com mirror out of sync?
Hi together,

the EU mirror still seems to be out of sync - does somebody on this list happen to know whom to contact about this? Or is this mirror unmaintained and we should switch to something else?

Going through the list of appropriate mirrors from https://docs.ceph.com/docs/master/install/mirrors/ (we are in Germany), I also find http://de.ceph.com/ (the mirror in Germany) to be non-resolvable.

Closest by for us is then possibly France: http://fr.ceph.com/rpm-nautilus/el7/x86_64/ - but also here, there's only 14.2.2, so that's also out of sync.

So in the EU, at least geographically, this only leaves Sweden and the UK. Sweden at se.ceph.com does not load for me, but the UK indeed seems fine.

Should people in the EU use that mirror, or should we all just use download.ceph.com instead of something geographically close by?

Cheers,
Oliver

On 2019-09-17 23:01, Oliver Freyermuth wrote:
> Dear Cephalopodians,
>
> I realized just now that:
> https://eu.ceph.com/rpm-nautilus/el7/x86_64/
> still holds only releases up to 14.2.2, and nothing is to be seen of 14.2.3
> or 14.2.4, while the main repository at:
> https://download.ceph.com/rpm-nautilus/el7/x86_64/
> looks as expected.
>
> Is this issue with the eu.ceph.com mirror already known?
>
> Cheers,
> Oliver
Re: [ceph-users] OSDs keep crashing after cluster reboot
Hi together,

for those reading along: We had to turn off all OSDs keeping our cephfs-data pool during the intervention; luckily, everything came back fine. However, we managed to leave the MDSs, the OSDs keeping the cephfs-metadata pool, and the MONs online. We restarted those sequentially afterwards, though.

So this probably means we are not affected by the upgrade bug - still, I would sleep better if somebody could confirm how to detect this bug and - if you are affected - how to edit the pool to fix it.

Cheers,
Oliver

On 2019-09-17 21:23, Oliver Freyermuth wrote:
> Hi together,
>
> it seems the issue described by Ansgar was reported and closed here as being
> fixed for newly created pools in post-Luminous releases:
> https://tracker.ceph.com/issues/41336
>
> However, it is unclear to me:
> - How to find out if an EC CephFS you have created in Luminous is actually
>   affected, before actually testing the "shutdown all" procedure, and thus
>   having dying OSDs.
> - If affected, how to fix it without purging the pool completely (which is
>   not easily done if there is 0.5 PB inside, which can't be restored without
>   a long downtime).
>
> If this is an acknowledged issue, it should probably also be mentioned in the
> upgrade notes from pre-Mimic to Mimic and newer before more people lose data.
>
> In our case, we have such a CephFS on an EC pool created with Luminous, and
> are right now running Mimic 13.2.6, but never tried a "full shutdown". We
> need to try that on Friday, though... (cooling system maintenance).
"osd dump" contains: pool 1 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 40903 flags hashpspool stripe_width 0 compression_algorithm snappy compression_mode aggressive application cephfs pool 2 'cephfs_data' erasure size 6 min_size 5 crush_rule 2 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 40953 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 compression_algorithm snappy compression_mode aggressive application cephfs and the EC profile is: # ceph osd erasure-code-profile get cephfs_data crush-device-class=hdd crush-failure-domain=host crush-root=default jerasure-per-chunk-alignment=false k=4 m=2 plugin=jerasure technique=reed_sol_van w=8 Neither contains the stripe_unit explicitly, so I wonder how to find out if it is (in)valid. Checking the xattr ceph.file.layout.stripe_unit of some "old" files on the FS reveals 4194304 in my case. Any help appreciated. Cheers and all the best, Oliver Am 09.08.19 um 08:54 schrieb Ansgar Jazdzewski: We got our OSD's back Since we removed the EC-Pool (cephfs.data) we had to figure out how to remove the PG from teh Offline OSD and hier is how we did it. 
remove cehfs, remove cache layer, remove pools: #ceph mds fail 0 #ceph fs rm cephfs --yes-i-really-mean-it #ceph osd tier remove-overlay cephfs.data there is now (or already was) no overlay for 'cephfs.data' #ceph osd tier remove cephfs.data cephfs.cache pool 'cephfs.cache' is now (or already was) not a tier of 'cephfs.data' #ceph tell mon.\* injectargs '--mon-allow-pool-delete=true' #ceph osd pool delete cephfs.cache cephfs.cache --yes-i-really-really-mean-it pool 'cephfs.cache' removed #ceph osd pool delete cephfs.data cephfs.data --yes-i-really-really-mean-it pool 'cephfs.data' removed #ceph osd pool delete cephfs.metadata cephfs.metadata --yes-i-really-really-mean-it pool 'cephfs.metadata' removed remove placement groups of pool 23 (cephfs.data) from all offline OSDs: DATAPATH=/var/lib/ceph/osd/ceph-${OSD} a=`ceph-objectstore-tool --data-path ${DATAPATH} --op list-pgs | grep "^23\."` for i in $a; do echo "INFO: removing ${i} from OSD ${OSD}" ceph-objectstore-tool --data-path ${DATAPATH} --pgid ${i} --op remove --force done since we now had removed our cephfs we still not know if we could have solved it without data loss by upgrading to nautilus. Have a nice Weekend, Ansgar Am Mi., 7. Aug. 2019 um 17:03 Uhr schrieb Ansgar Jazdzewski : another update, we now took the more destructive route and removed the cephfs pools (lucky we had only test date in the filesystem) Our hope was that within the startup-process the osd will delete the no longer needed PG, But this is NOT the Case. So we are still have the same issue the only difference is that the PG does not belong to a pool anymore. -360> 2019-08-07 14:52:32.655 7fb14db8de00 5 osd.44 pg_epoch: 196586 pg[23.f8s0(unlocked)] enter Initial -360> 2019-08-07 14:52:32.659 7fb14db8de00 -1 /build/ceph-13.2.6/src/osd/ECUtil.h: In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread 7fb14db8de00 time 2019-08-07 14:52:32.660169 /build/ceph-13.
[ceph-users] eu.ceph.com mirror out of sync?
Dear Cephalopodians,

I realized just now that:
https://eu.ceph.com/rpm-nautilus/el7/x86_64/
still holds only releases up to 14.2.2, and nothing is to be seen of 14.2.3 or 14.2.4, while the main repository at:
https://download.ceph.com/rpm-nautilus/el7/x86_64/
looks as expected.

Is this issue with the eu.ceph.com mirror already known?

Cheers,
Oliver
Re: [ceph-users] OSDs keep crashing after cluster reboot
Hi together,

it seems the issue described by Ansgar was reported and closed here as being fixed for newly created pools in post-Luminous releases:
https://tracker.ceph.com/issues/41336

However, it is unclear to me:
- How to find out if an EC CephFS you have created in Luminous is actually affected, before actually testing the "shutdown all" procedure, and thus having dying OSDs.
- If affected, how to fix it without purging the pool completely (which is not easily done if there is 0.5 PB inside, which can't be restored without a long downtime).

If this is an acknowledged issue, it should probably also be mentioned in the upgrade notes from pre-Mimic to Mimic and newer before more people lose data.

In our case, we have such a CephFS on an EC pool created with Luminous, and are right now running Mimic 13.2.6, but never tried a "full shutdown". We need to try that on Friday, though... (cooling system maintenance).

"osd dump" contains:

pool 1 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 40903 flags hashpspool stripe_width 0 compression_algorithm snappy compression_mode aggressive application cephfs
pool 2 'cephfs_data' erasure size 6 min_size 5 crush_rule 2 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 40953 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 compression_algorithm snappy compression_mode aggressive application cephfs

and the EC profile is:

# ceph osd erasure-code-profile get cephfs_data
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8

Neither contains the stripe_unit explicitly, so I wonder how to find out if it is (in)valid. Checking the xattr ceph.file.layout.stripe_unit of some "old" files on the FS reveals 4194304 in my case.

Any help appreciated.

Cheers and all the best,
Oliver

Am 09.08.19 um 08:54 schrieb Ansgar Jazdzewski:
> We got our OSDs back! Since we removed the EC pool (cephfs.data), we had to
> figure out how to remove the PGs from the offline OSDs, and here is how we
> did it.
>
> Remove CephFS, remove the cache layer, remove the pools:
>
> # ceph mds fail 0
> # ceph fs rm cephfs --yes-i-really-mean-it
> # ceph osd tier remove-overlay cephfs.data
> there is now (or already was) no overlay for 'cephfs.data'
> # ceph osd tier remove cephfs.data cephfs.cache
> pool 'cephfs.cache' is now (or already was) not a tier of 'cephfs.data'
> # ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
> # ceph osd pool delete cephfs.cache cephfs.cache --yes-i-really-really-mean-it
> pool 'cephfs.cache' removed
> # ceph osd pool delete cephfs.data cephfs.data --yes-i-really-really-mean-it
> pool 'cephfs.data' removed
> # ceph osd pool delete cephfs.metadata cephfs.metadata --yes-i-really-really-mean-it
> pool 'cephfs.metadata' removed
>
> Remove the placement groups of pool 23 (cephfs.data) from all offline OSDs:
>
> DATAPATH=/var/lib/ceph/osd/ceph-${OSD}
> a=`ceph-objectstore-tool --data-path ${DATAPATH} --op list-pgs | grep "^23\."`
> for i in $a; do
>   echo "INFO: removing ${i} from OSD ${OSD}"
>   ceph-objectstore-tool --data-path ${DATAPATH} --pgid ${i} --op remove --force
> done
>
> Since we have now removed our cephfs, we still do not know if we could have
> solved it without data loss by upgrading to Nautilus.
>
> Have a nice weekend,
> Ansgar
>
> Am Mi., 7. Aug. 2019 um 17:03 Uhr schrieb Ansgar Jazdzewski:
>> another update: we now took the more destructive route and removed the
>> cephfs pools (luckily we had only test data in the filesystem). Our hope
>> was that within the startup process the OSD would delete the no longer
>> needed PGs, but this is NOT the case. So we still have the same issue; the
>> only difference is that the PG does not belong to a pool anymore.
>>
>> -360> 2019-08-07 14:52:32.655 7fb14db8de00 5 osd.44 pg_epoch: 196586
>> pg[23.f8s0(unlocked)] enter Initial
>> -360> 2019-08-07 14:52:32.659 7fb14db8de00 -1
>> /build/ceph-13.2.6/src/osd/ECUtil.h: In function
>> 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread
>> 7fb14db8de00 time 2019-08-07 14:52:32.660169
>> /build/ceph-13.2.6/src/osd/ECUtil.h: 34: FAILED assert(stripe_width % stripe_size == 0)
>>
>> We now can take one route and try to delete the PG by hand in the OSD
>> (bluestore) - how can this be done? OR we try to upgrade to Nautilus and
>> hope for the best.
>>
>> Any help / hints are welcome, have a nice one,
>> Ansgar
>>
>> Am Mi., 7. Aug. 2019 um 11:32 Uhr schrieb Ansgar Jazdzewski:
>>> Hi,
>>>
>>> as a follow-up:
>>> * a full log of one OSD failing to start: https://pastebin.com/T8UQ2rZ6
>>> * our EC pool creation in the first place: https://pastebin.com/20cC06Jn
>>> * ceph osd dump and ceph osd erasure-code-profile get cephfs: https://pastebin.com/TRLPaWcH
>>>
>>> As we try to dig more into it, it looks like a bug
Re: [ceph-users] Ceph RBD Mirroring
Dear Jason,

On 15.09.19 at 00:03, Jason Dillaman wrote:
> I was able to repeat this issue locally by restarting the primary OSD
> for the "rbd_mirroring" object. It seems that a regression was
> introduced w/ the introduction of Ceph msgr2 in that upon reconnect,
> the connection type for the client switches from ANY to V2 -- but only
> for the watcher session and not the status updates. I've opened a
> tracker ticket for this issue [1].
>
> Thanks.

Many thanks to you for the detailed investigation and reproduction! While I did not restart the first 5 OSDs of the test cluster, I added an OSD and rebalanced, so I guess this can also be triggered if the primary OSD for the object changes, which should of course also lead to a reconnection. I can also add to my observations that now, while not touching the cluster anymore, things stay in "up+replaying".

Thanks and all the best, Oliver

> > On Fri, Sep 13, 2019 at 12:44 PM Oliver Freyermuth > wrote: >> >> On 13.09.19 at 18:38, Jason Dillaman wrote: >>> On Fri, Sep 13, 2019 at 11:30 AM Oliver Freyermuth >>> wrote: >>>> >>>> On 13.09.19 at 17:18, Jason Dillaman wrote: >>>>> On Fri, Sep 13, 2019 at 10:41 AM Oliver Freyermuth >>>>> wrote: >>>>>> >>>>>> On 13.09.19 at 16:30, Jason Dillaman wrote: >>>>>>> On Fri, Sep 13, 2019 at 10:17 AM Jason Dillaman >>>>>>> wrote: >>>>>>>> >>>>>>>> On Fri, Sep 13, 2019 at 10:02 AM Oliver Freyermuth >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Dear Jason, >>>>>>>>> >>>>>>>>> thanks for the very detailed explanation! This was very instructive. >>>>>>>>> Sadly, the watchers look correct - see details inline. >>>>>>>>> >>>>>>>>> On 13.09.19 at 15:02, Jason Dillaman wrote: >>>>>>>>>> On Thu, Sep 12, 2019 at 9:55 PM Oliver Freyermuth >>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> Dear Jason, >>>>>>>>>>> >>>>>>>>>>> thanks for taking care and developing a patch so quickly! >>>>>>>>>>> >>>>>>>>>>> I have another strange observation to share.
In our test setup, >>>>>>>>>>> only a single RBD mirroring daemon is running for 51 images. >>>>>>>>>>> It works fine with a constant stream of 1-2 MB/s, but at some point >>>>>>>>>>> after roughly 20 hours, _all_ images go to this interesting state: >>>>>>>>>>> - >>>>>>>>>>> # rbd mirror image status test-vm.X-disk2 >>>>>>>>>>> test-vm.X-disk2: >>>>>>>>>>> global_id: XXX >>>>>>>>>>> state: down+replaying >>>>>>>>>>> description: replaying, master_position=[object_number=14, >>>>>>>>>>> tag_tid=6, entry_tid=6338], mirror_position=[object_number=14, >>>>>>>>>>> tag_tid=6, entry_tid=6338], entries_behind_master=0 >>>>>>>>>>> last_update: 2019-09-13 03:45:43 >>>>>>>>>>> - >>>>>>>>>>> Running this command several times, I see entry_tid increasing at >>>>>>>>>>> both ends, so mirroring seems to be working just fine. >>>>>>>>>>> >>>>>>>>>>> However: >>>>>>>>>>> - >>>>>>>>>>> # rbd mirror pool status >>>>>>>>>>> health: WARNING >>>>>>>>>>> images: 51 total >>>>>>>>>>> 51 unknown >>>>>>>>>>> - >>>>>>>>>>> The health warning is not visible in the dashboard (also not in the >>>>>>>>>>> mirroring menu), the daemon still seems to be running, dropped >>>>>>>>>>> nothing in the logs, >>>>&g
Re: [ceph-users] Ceph RBD Mirroring
Am 13.09.19 um 18:38 schrieb Jason Dillaman: On Fri, Sep 13, 2019 at 11:30 AM Oliver Freyermuth wrote: Am 13.09.19 um 17:18 schrieb Jason Dillaman: On Fri, Sep 13, 2019 at 10:41 AM Oliver Freyermuth wrote: Am 13.09.19 um 16:30 schrieb Jason Dillaman: On Fri, Sep 13, 2019 at 10:17 AM Jason Dillaman wrote: On Fri, Sep 13, 2019 at 10:02 AM Oliver Freyermuth wrote: Dear Jason, thanks for the very detailed explanation! This was very instructive. Sadly, the watchers look correct - see details inline. Am 13.09.19 um 15:02 schrieb Jason Dillaman: On Thu, Sep 12, 2019 at 9:55 PM Oliver Freyermuth wrote: Dear Jason, thanks for taking care and developing a patch so quickly! I have another strange observation to share. In our test setup, only a single RBD mirroring daemon is running for 51 images. It works fine with a constant stream of 1-2 MB/s, but at some point after roughly 20 hours, _all_ images go to this interesting state: - # rbd mirror image status test-vm.X-disk2 test-vm.X-disk2: global_id: XXX state: down+replaying description: replaying, master_position=[object_number=14, tag_tid=6, entry_tid=6338], mirror_position=[object_number=14, tag_tid=6, entry_tid=6338], entries_behind_master=0 last_update: 2019-09-13 03:45:43 - Running this command several times, I see entry_tid increasing at both ends, so mirroring seems to be working just fine. However: - # rbd mirror pool status health: WARNING images: 51 total 51 unknown - The health warning is not visible in the dashboard (also not in the mirroring menu), the daemon still seems to be running, dropped nothing in the logs, and claims to be "ok" in the dashboard - it's only that all images show up in unknown state even though all seems to be working fine. Any idea on how to debug this? When I restart the rbd-mirror service, all images come back as green. I already encountered this twice in 3 days. The dashboard relies on the rbd-mirror daemon to provide it errors and warnings. 
You can see the status reported by rbd-mirror by running "ceph service status": $ ceph service status { "rbd-mirror": { "4152": { "status_stamp": "2019-09-13T08:58:41.937491-0400", "last_beacon": "2019-09-13T08:58:41.937491-0400", "status": { "json": "{\"1\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":1,\"image_error_count\":0,\"image_local_count\":1,\"image_remote_count\":1,\"image_warning_count\":0,\"instance_id\":\"4154\",\"leader\":true},\"2\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_local_count\":0,\"image_remote_count\":0,\"image_warning_count\":0,\"instance_id\":\"4156\",\"leader\":true}}" } } } } In your case, most likely it seems like rbd-mirror thinks all is good with the world so it's not reporting any errors. This is indeed the case: # ceph service status { "rbd-mirror": { "84243": { "status_stamp": "2019-09-13 15:40:01.149815", "last_beacon": "2019-09-13 15:40:26.151381", "status": { "json": "{\"2\":{\"name\":\"rbd\",\"callouts\":{},\"image_assigned_count\":51,\"image_error_count\":0,\"image_local_count\":51,\"image_remote_count\":51,\"image_warning_count\":0,\"instance_id\":\"84247\",\"leader\":true}}" } } }, "rgw": { ... } } The "down" state indicates that the rbd-mirror daemon isn't correctly watching the "rbd_mirroring" object in the pool. You can see who it watching that object by running the "rados" "listwatchers" command: $ rados -p listwatchers rbd_mirroring watcher=1.2.3.4:0/199388543 client.4154 cookie=94769010788992 watcher=1.2.3.4:0/199388543 client.4154 cookie=94769061031424 In my case, the "4154" from "client.4154" is the unique global id for my connection to the cluster, which relates back to the "ceph service status" dump which also shows status by daemon using the unique global id. Sadly(?), this looks as expected: # rados -p rb
Re: [ceph-users] Ceph RBD Mirroring
Am 13.09.19 um 17:18 schrieb Jason Dillaman: On Fri, Sep 13, 2019 at 10:41 AM Oliver Freyermuth wrote: Am 13.09.19 um 16:30 schrieb Jason Dillaman: On Fri, Sep 13, 2019 at 10:17 AM Jason Dillaman wrote: On Fri, Sep 13, 2019 at 10:02 AM Oliver Freyermuth wrote: Dear Jason, thanks for the very detailed explanation! This was very instructive. Sadly, the watchers look correct - see details inline. Am 13.09.19 um 15:02 schrieb Jason Dillaman: On Thu, Sep 12, 2019 at 9:55 PM Oliver Freyermuth wrote: Dear Jason, thanks for taking care and developing a patch so quickly! I have another strange observation to share. In our test setup, only a single RBD mirroring daemon is running for 51 images. It works fine with a constant stream of 1-2 MB/s, but at some point after roughly 20 hours, _all_ images go to this interesting state: - # rbd mirror image status test-vm.X-disk2 test-vm.X-disk2: global_id: XXX state: down+replaying description: replaying, master_position=[object_number=14, tag_tid=6, entry_tid=6338], mirror_position=[object_number=14, tag_tid=6, entry_tid=6338], entries_behind_master=0 last_update: 2019-09-13 03:45:43 - Running this command several times, I see entry_tid increasing at both ends, so mirroring seems to be working just fine. However: - # rbd mirror pool status health: WARNING images: 51 total 51 unknown - The health warning is not visible in the dashboard (also not in the mirroring menu), the daemon still seems to be running, dropped nothing in the logs, and claims to be "ok" in the dashboard - it's only that all images show up in unknown state even though all seems to be working fine. Any idea on how to debug this? When I restart the rbd-mirror service, all images come back as green. I already encountered this twice in 3 days. The dashboard relies on the rbd-mirror daemon to provide it errors and warnings. 
You can see the status reported by rbd-mirror by running "ceph service status": $ ceph service status { "rbd-mirror": { "4152": { "status_stamp": "2019-09-13T08:58:41.937491-0400", "last_beacon": "2019-09-13T08:58:41.937491-0400", "status": { "json": "{\"1\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":1,\"image_error_count\":0,\"image_local_count\":1,\"image_remote_count\":1,\"image_warning_count\":0,\"instance_id\":\"4154\",\"leader\":true},\"2\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_local_count\":0,\"image_remote_count\":0,\"image_warning_count\":0,\"instance_id\":\"4156\",\"leader\":true}}" } } } } In your case, most likely it seems like rbd-mirror thinks all is good with the world so it's not reporting any errors. This is indeed the case: # ceph service status { "rbd-mirror": { "84243": { "status_stamp": "2019-09-13 15:40:01.149815", "last_beacon": "2019-09-13 15:40:26.151381", "status": { "json": "{\"2\":{\"name\":\"rbd\",\"callouts\":{},\"image_assigned_count\":51,\"image_error_count\":0,\"image_local_count\":51,\"image_remote_count\":51,\"image_warning_count\":0,\"instance_id\":\"84247\",\"leader\":true}}" } } }, "rgw": { ... } } The "down" state indicates that the rbd-mirror daemon isn't correctly watching the "rbd_mirroring" object in the pool. You can see who it watching that object by running the "rados" "listwatchers" command: $ rados -p listwatchers rbd_mirroring watcher=1.2.3.4:0/199388543 client.4154 cookie=94769010788992 watcher=1.2.3.4:0/199388543 client.4154 cookie=94769061031424 In my case, the "4154" from "client.4154" is the unique global id for my connection to the cluster, which relates back to the "ceph service status" dump which also shows status by daemon using the unique global id. Sadly(?), this looks as expected: # rados -p rbd listwatchers rbd_mirroring watcher=10.160.19.240:0/2922488671 client.84247 cookie=139770046978672 watcher=10.160.19.240:0
Re: [ceph-users] Ceph RBD Mirroring
Am 13.09.19 um 16:30 schrieb Jason Dillaman: On Fri, Sep 13, 2019 at 10:17 AM Jason Dillaman wrote: On Fri, Sep 13, 2019 at 10:02 AM Oliver Freyermuth wrote: Dear Jason, thanks for the very detailed explanation! This was very instructive. Sadly, the watchers look correct - see details inline. Am 13.09.19 um 15:02 schrieb Jason Dillaman: On Thu, Sep 12, 2019 at 9:55 PM Oliver Freyermuth wrote: Dear Jason, thanks for taking care and developing a patch so quickly! I have another strange observation to share. In our test setup, only a single RBD mirroring daemon is running for 51 images. It works fine with a constant stream of 1-2 MB/s, but at some point after roughly 20 hours, _all_ images go to this interesting state: - # rbd mirror image status test-vm.X-disk2 test-vm.X-disk2: global_id: XXX state: down+replaying description: replaying, master_position=[object_number=14, tag_tid=6, entry_tid=6338], mirror_position=[object_number=14, tag_tid=6, entry_tid=6338], entries_behind_master=0 last_update: 2019-09-13 03:45:43 - Running this command several times, I see entry_tid increasing at both ends, so mirroring seems to be working just fine. However: - # rbd mirror pool status health: WARNING images: 51 total 51 unknown - The health warning is not visible in the dashboard (also not in the mirroring menu), the daemon still seems to be running, dropped nothing in the logs, and claims to be "ok" in the dashboard - it's only that all images show up in unknown state even though all seems to be working fine. Any idea on how to debug this? When I restart the rbd-mirror service, all images come back as green. I already encountered this twice in 3 days. The dashboard relies on the rbd-mirror daemon to provide it errors and warnings. 
You can see the status reported by rbd-mirror by running "ceph service status": $ ceph service status { "rbd-mirror": { "4152": { "status_stamp": "2019-09-13T08:58:41.937491-0400", "last_beacon": "2019-09-13T08:58:41.937491-0400", "status": { "json": "{\"1\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":1,\"image_error_count\":0,\"image_local_count\":1,\"image_remote_count\":1,\"image_warning_count\":0,\"instance_id\":\"4154\",\"leader\":true},\"2\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_local_count\":0,\"image_remote_count\":0,\"image_warning_count\":0,\"instance_id\":\"4156\",\"leader\":true}}" } } } } In your case, most likely it seems like rbd-mirror thinks all is good with the world so it's not reporting any errors. This is indeed the case: # ceph service status { "rbd-mirror": { "84243": { "status_stamp": "2019-09-13 15:40:01.149815", "last_beacon": "2019-09-13 15:40:26.151381", "status": { "json": "{\"2\":{\"name\":\"rbd\",\"callouts\":{},\"image_assigned_count\":51,\"image_error_count\":0,\"image_local_count\":51,\"image_remote_count\":51,\"image_warning_count\":0,\"instance_id\":\"84247\",\"leader\":true}}" } } }, "rgw": { ... } } The "down" state indicates that the rbd-mirror daemon isn't correctly watching the "rbd_mirroring" object in the pool. You can see who it watching that object by running the "rados" "listwatchers" command: $ rados -p listwatchers rbd_mirroring watcher=1.2.3.4:0/199388543 client.4154 cookie=94769010788992 watcher=1.2.3.4:0/199388543 client.4154 cookie=94769061031424 In my case, the "4154" from "client.4154" is the unique global id for my connection to the cluster, which relates back to the "ceph service status" dump which also shows status by daemon using the unique global id. 
Sadly(?), this looks as expected: # rados -p rbd listwatchers rbd_mirroring watcher=10.160.19.240:0/2922488671 client.84247 cookie=139770046978672 watcher=10.160.19.240:0/2922488671 client.84247 cookie=139771389162560 Hmm, the unique id is different (84243 vs 84247). I wouldn't have expected the global id to
Re: [ceph-users] Ceph RBD Mirroring
Am 13.09.19 um 16:17 schrieb Jason Dillaman: On Fri, Sep 13, 2019 at 10:02 AM Oliver Freyermuth wrote: Dear Jason, thanks for the very detailed explanation! This was very instructive. Sadly, the watchers look correct - see details inline. Am 13.09.19 um 15:02 schrieb Jason Dillaman: On Thu, Sep 12, 2019 at 9:55 PM Oliver Freyermuth wrote: Dear Jason, thanks for taking care and developing a patch so quickly! I have another strange observation to share. In our test setup, only a single RBD mirroring daemon is running for 51 images. It works fine with a constant stream of 1-2 MB/s, but at some point after roughly 20 hours, _all_ images go to this interesting state: - # rbd mirror image status test-vm.X-disk2 test-vm.X-disk2: global_id: XXX state: down+replaying description: replaying, master_position=[object_number=14, tag_tid=6, entry_tid=6338], mirror_position=[object_number=14, tag_tid=6, entry_tid=6338], entries_behind_master=0 last_update: 2019-09-13 03:45:43 - Running this command several times, I see entry_tid increasing at both ends, so mirroring seems to be working just fine. However: - # rbd mirror pool status health: WARNING images: 51 total 51 unknown - The health warning is not visible in the dashboard (also not in the mirroring menu), the daemon still seems to be running, dropped nothing in the logs, and claims to be "ok" in the dashboard - it's only that all images show up in unknown state even though all seems to be working fine. Any idea on how to debug this? When I restart the rbd-mirror service, all images come back as green. I already encountered this twice in 3 days. The dashboard relies on the rbd-mirror daemon to provide it errors and warnings. 
You can see the status reported by rbd-mirror by running "ceph service status": $ ceph service status { "rbd-mirror": { "4152": { "status_stamp": "2019-09-13T08:58:41.937491-0400", "last_beacon": "2019-09-13T08:58:41.937491-0400", "status": { "json": "{\"1\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":1,\"image_error_count\":0,\"image_local_count\":1,\"image_remote_count\":1,\"image_warning_count\":0,\"instance_id\":\"4154\",\"leader\":true},\"2\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_local_count\":0,\"image_remote_count\":0,\"image_warning_count\":0,\"instance_id\":\"4156\",\"leader\":true}}" } } } } In your case, most likely it seems like rbd-mirror thinks all is good with the world so it's not reporting any errors. This is indeed the case: # ceph service status { "rbd-mirror": { "84243": { "status_stamp": "2019-09-13 15:40:01.149815", "last_beacon": "2019-09-13 15:40:26.151381", "status": { "json": "{\"2\":{\"name\":\"rbd\",\"callouts\":{},\"image_assigned_count\":51,\"image_error_count\":0,\"image_local_count\":51,\"image_remote_count\":51,\"image_warning_count\":0,\"instance_id\":\"84247\",\"leader\":true}}" } } }, "rgw": { ... } } The "down" state indicates that the rbd-mirror daemon isn't correctly watching the "rbd_mirroring" object in the pool. You can see who it watching that object by running the "rados" "listwatchers" command: $ rados -p listwatchers rbd_mirroring watcher=1.2.3.4:0/199388543 client.4154 cookie=94769010788992 watcher=1.2.3.4:0/199388543 client.4154 cookie=94769061031424 In my case, the "4154" from "client.4154" is the unique global id for my connection to the cluster, which relates back to the "ceph service status" dump which also shows status by daemon using the unique global id. 
Sadly(?), this looks as expected: # rados -p rbd listwatchers rbd_mirroring watcher=10.160.19.240:0/2922488671 client.84247 cookie=139770046978672 watcher=10.160.19.240:0/2922488671 client.84247 cookie=139771389162560 Hmm, the unique id is different (84243 vs 84247). I wouldn't have expected the global id to have changed. Did you restart the Ceph cluster or MON
Re: [ceph-users] Ceph RBD Mirroring
Dear Jason, thanks for the very detailed explanation! This was very instructive. Sadly, the watchers look correct - see details inline. Am 13.09.19 um 15:02 schrieb Jason Dillaman: On Thu, Sep 12, 2019 at 9:55 PM Oliver Freyermuth wrote: Dear Jason, thanks for taking care and developing a patch so quickly! I have another strange observation to share. In our test setup, only a single RBD mirroring daemon is running for 51 images. It works fine with a constant stream of 1-2 MB/s, but at some point after roughly 20 hours, _all_ images go to this interesting state: - # rbd mirror image status test-vm.X-disk2 test-vm.X-disk2: global_id: XXX state: down+replaying description: replaying, master_position=[object_number=14, tag_tid=6, entry_tid=6338], mirror_position=[object_number=14, tag_tid=6, entry_tid=6338], entries_behind_master=0 last_update: 2019-09-13 03:45:43 - Running this command several times, I see entry_tid increasing at both ends, so mirroring seems to be working just fine. However: - # rbd mirror pool status health: WARNING images: 51 total 51 unknown - The health warning is not visible in the dashboard (also not in the mirroring menu), the daemon still seems to be running, dropped nothing in the logs, and claims to be "ok" in the dashboard - it's only that all images show up in unknown state even though all seems to be working fine. Any idea on how to debug this? When I restart the rbd-mirror service, all images come back as green. I already encountered this twice in 3 days. The dashboard relies on the rbd-mirror daemon to provide it errors and warnings. 
You can see the status reported by rbd-mirror by running "ceph service status": $ ceph service status { "rbd-mirror": { "4152": { "status_stamp": "2019-09-13T08:58:41.937491-0400", "last_beacon": "2019-09-13T08:58:41.937491-0400", "status": { "json": "{\"1\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":1,\"image_error_count\":0,\"image_local_count\":1,\"image_remote_count\":1,\"image_warning_count\":0,\"instance_id\":\"4154\",\"leader\":true},\"2\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_local_count\":0,\"image_remote_count\":0,\"image_warning_count\":0,\"instance_id\":\"4156\",\"leader\":true}}" } } } } In your case, most likely it seems like rbd-mirror thinks all is good with the world so it's not reporting any errors. This is indeed the case: # ceph service status { "rbd-mirror": { "84243": { "status_stamp": "2019-09-13 15:40:01.149815", "last_beacon": "2019-09-13 15:40:26.151381", "status": { "json": "{\"2\":{\"name\":\"rbd\",\"callouts\":{},\"image_assigned_count\":51,\"image_error_count\":0,\"image_local_count\":51,\"image_remote_count\":51,\"image_warning_count\":0,\"instance_id\":\"84247\",\"leader\":true}}" } } }, "rgw": { ... } } The "down" state indicates that the rbd-mirror daemon isn't correctly watching the "rbd_mirroring" object in the pool. You can see who is watching that object by running the "rados" "listwatchers" command: $ rados -p listwatchers rbd_mirroring watcher=1.2.3.4:0/199388543 client.4154 cookie=94769010788992 watcher=1.2.3.4:0/199388543 client.4154 cookie=94769061031424 In my case, the "4154" from "client.4154" is the unique global id for my connection to the cluster, which relates back to the "ceph service status" dump which also shows status by daemon using the unique global id. 
Sadly(?), this looks as expected: # rados -p rbd listwatchers rbd_mirroring watcher=10.160.19.240:0/2922488671 client.84247 cookie=139770046978672 watcher=10.160.19.240:0/2922488671 client.84247 cookie=139771389162560 However, the dashboard still shows those images in "unknown", and this also shows up via command line: # rbd mirror pool status health: WARNING images: 51 total 51 unknown # rbd mirror image status test-vm.physik.uni-bonn.de-disk1 test-vm.physik.uni-bonn.de-disk2: global_
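Side note on reading the `ceph service status` output quoted in this exchange: the "json" field is itself a string-encoded JSON document, one entry per pool. A small sketch to decode it (the sample payload below is trimmed from the output above; python3 availability is assumed):

```shell
# Decode the nested "json" field of `ceph service status`. The sample
# payload is a trimmed copy of the dump quoted above, not live output.
STATUS='{"rbd-mirror":{"84243":{"status":{"json":"{\"2\":{\"name\":\"rbd\",\"image_assigned_count\":51,\"image_error_count\":0}}"}}}}'
OUT=$(echo "$STATUS" | python3 -c '
import json, sys
svc = json.load(sys.stdin)
for daemon_id, info in svc["rbd-mirror"].items():
    inner = json.loads(info["status"]["json"])   # the field is JSON-in-a-string
    for pool_id, s in inner.items():
        print("daemon", daemon_id, "pool", pool_id + ":",
              s["image_assigned_count"], "images,",
              s["image_error_count"], "errors")
')
echo "$OUT"
```

This makes it easier to spot a non-zero image_error_count or image_warning_count at a glance than reading the escaped string.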
Re: [ceph-users] Ceph RBD Mirroring
Dear Jason, thanks for taking care and developing a patch so quickly! I have another strange observation to share. In our test setup, only a single RBD mirroring daemon is running for 51 images. It works fine with a constant stream of 1-2 MB/s, but at some point after roughly 20 hours, _all_ images go to this interesting state: - # rbd mirror image status test-vm.X-disk2 test-vm.X-disk2: global_id: XXX state: down+replaying description: replaying, master_position=[object_number=14, tag_tid=6, entry_tid=6338], mirror_position=[object_number=14, tag_tid=6, entry_tid=6338], entries_behind_master=0 last_update: 2019-09-13 03:45:43 - Running this command several times, I see entry_tid increasing at both ends, so mirroring seems to be working just fine. However: - # rbd mirror pool status health: WARNING images: 51 total 51 unknown - The health warning is not visible in the dashboard (also not in the mirroring menu), the daemon still seems to be running, dropped nothing in the logs, and claims to be "ok" in the dashboard - it's only that all images show up in unknown state even though all seems to be working fine. Any idea on how to debug this? When I restart the rbd-mirror service, all images come back as green. I already encountered this twice in 3 days. Any idea on this (or how I can extract more information)? I fear keeping high-level debug logs active for ~24h is not feasible. Cheers, Oliver On 2019-09-11 19:14, Jason Dillaman wrote: > On Wed, Sep 11, 2019 at 12:57 PM Oliver Freyermuth > wrote: >> >> Dear Jason, >> >> I played a bit more with rbd mirroring and learned that deleting an image at >> the source (or disabling journaling on it) immediately moves the image to >> trash at the target - >> but setting rbd_mirroring_delete_delay helps to have some more grace time to >> catch human mistakes. 
>> >> However, I have issues restoring such an image which has been moved to trash >> by the RBD-mirror daemon as user: >> --- >> [root@mon001 ~]# rbd trash ls -la >> ID NAME SOURCEDELETED_AT >> STATUS PARENT >> d4fbe8f63905 test-vm-XX-disk2 MIRRORING Wed Sep 11 18:43:14 >> 2019 protected until Thu Sep 12 18:43:14 2019 >> [root@mon001 ~]# rbd trash restore --image foo-image d4fbe8f63905 >> rbd: restore error: 2019-09-11 18:50:15.387 7f5fa9590b00 -1 >> librbd::api::Trash: restore: Current trash source: mirroring does not match >> expected: user >> (22) Invalid argument >> --- >> This is issued on the mon, which has the client.admin key, so it should not >> be a permission issue. >> It also fails when I try that in the Dashboard. >> >> Sadly, the error message is not clear enough for me to figure out what could >> be the problem - do you see what I did wrong? > > Good catch, it looks like we accidentally broke this in Nautilus when > image live-migration support was added. I've opened a new tracker > ticket to fix this [1]. > >> Cheers and thanks again, >> Oliver >> >> On 2019-09-10 23:17, Oliver Freyermuth wrote: >>> Dear Jason, >>> >>> On 2019-09-10 23:04, Jason Dillaman wrote: >>>> On Tue, Sep 10, 2019 at 2:08 PM Oliver Freyermuth >>>> wrote: >>>>> >>>>> Dear Jason, >>>>> >>>>> On 2019-09-10 18:50, Jason Dillaman wrote: >>>>>> On Tue, Sep 10, 2019 at 12:25 PM Oliver Freyermuth >>>>>> wrote: >>>>>>> >>>>>>> Dear Cephalopodians, >>>>>>> >>>>>>> I have two questions about RBD mirroring. >>>>>>> >>>>>>> 1) I can not get it to work - my setup is: >>>>>>> >>>>>>> - One cluster holding the live RBD volumes and snapshots, in pool >>>>>>> "rbd", cluster name "ceph", >>>>>>>running latest Mimic. >>>>>>>I ran "rbd mirror pool enable rbd pool" on that cluster and >>>>>>> created a cephx user "rbd_mirror" with (is there a better way?): >>>>>>>ceph auth get-or-create client.rbd_mirror mon 'allow r' osd >>>>>>> 'allow class-read object_prefix rbd_children, allow pool rbd r' -o
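As a side note on the trash listing quoted above, the "protected until" timestamp is simply DELETED_AT plus the configured deletion delay. A quick sketch of that arithmetic; the 86400-second value is an assumption chosen to match the one-day window shown in the listing, not a confirmed setting from the thread:

```shell
# "protected until" = DELETED_AT + deletion delay. The delay value below
# is assumed (one day, matching the window in the trash listing above).
OUT=$(python3 - <<'EOF'
import datetime
delete_delay = 86400  # assumed rbd_mirroring_delete_delay, in seconds
deleted_at = datetime.datetime(2019, 9, 11, 18, 43, 14)  # DELETED_AT above
print((deleted_at + datetime.timedelta(seconds=delete_delay))
      .strftime("%a %b %d %H:%M:%S %Y"))
EOF
)
echo "$OUT"
```

The result reproduces the "protected until Thu Sep 12 18:43:14 2019" shown in the listing, which is useful when picking a delay long enough to catch human mistakes.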
Re: [ceph-users] POOL_TARGET_SIZE_BYTES_OVERCOMMITTED
Dear Cephalopodians,

I can confirm the same problem described by Joe Ryner in 14.2.2. I'm also getting (in a small test setup):
-
# ceph health detail
HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes; 1 subtrees have overcommitted pool target_size_ratio
POOL_TARGET_SIZE_BYTES_OVERCOMMITTED 1 subtrees have overcommitted pool target_size_bytes
Pools ['rbd', '.rgw.root', 'default.rgw.control', 'default.rgw.meta', 'default.rgw.log', 'default.rgw.buckets.index', 'default.rgw.buckets.data'] overcommit available storage by 1.068x due to target_size_bytes 0 on pools []
POOL_TARGET_SIZE_RATIO_OVERCOMMITTED 1 subtrees have overcommitted pool target_size_ratio
Pools ['rbd', '.rgw.root', 'default.rgw.control', 'default.rgw.meta', 'default.rgw.log', 'default.rgw.buckets.index', 'default.rgw.buckets.data'] overcommit available storage by 1.068x due to target_size_ratio 0.000 on pools []
-
However, there's not much actual data STORED:
-
# ceph df
RAW STORAGE:
CLASS    SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd      4.0 TiB  2.6 TiB  1.4 TiB  1.4 TiB   35.94
TOTAL    4.0 TiB  2.6 TiB  1.4 TiB  1.4 TiB   35.94
POOLS:
POOL                       ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
rbd                        2   676 GiB  266.40k  707 GiB  23.42  771 GiB
.rgw.root                  9   1.2 KiB  4        768 KiB  0      771 GiB
default.rgw.control        10  0 B      8        0 B      0      771 GiB
default.rgw.meta           11  1.2 KiB  8        1.3 MiB  0      771 GiB
default.rgw.log            12  0 B      175      0 B      0      771 GiB
default.rgw.buckets.index  13  0 B      1        0 B      0      771 GiB
default.rgw.buckets.data   14  249 GiB  99.62k   753 GiB  24.57  771 GiB
-
The main culprit here seems to be the default.rgw.buckets.data pool, but also the rbd pool contains thin images. As in the case of Joe, the autoscaler seems to look at the "USED" space, not at the "STORED" bytes:
-
POOL                       SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
default.rgw.meta           1344k                3.0   4092G         0.0000                1.0   8                   on
default.rgw.buckets.index  0                    3.0   4092G         0.0000                1.0   8                   on
default.rgw.control        0                    3.0   4092G         0.0000                1.0   8                   on
default.rgw.buckets.data   788.6G               3.0   4092G         0.5782                1.0   128                 on
.rgw.root                  768.0k               3.0   4092G         0.0000                1.0   8                   on
rbd                        710.8G               3.0   4092G         0.5212                1.0   64                  on
default.rgw.log            0                    3.0   4092G         0.0000                1.0   8                   on
-
This does seem like a bug to me. The warning actually fires on a cluster with 35 % raw usage, and things are mostly balanced. Is there already a tracker entry on this?

Cheers, Oliver

On 2019-05-01 22:01, Joe Ryner wrote:
> I think I have figured out the issue.
>
> POOL   SIZE   TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
> images 28523G              3.0   68779G        1.2441                1000                warn
>
> My images are 28523G with a replication level 3 and have a total of 68779G in Raw Capacity.
>
> According to the documentation http://docs.ceph.com/docs/master/rados/operations/placement-groups/
>
> "*SIZE* is the amount of data stored in the pool. *TARGET SIZE*, if present, is the amount of data the administrator has specified that they expect to eventually be stored in this pool. The system uses the larger of the two values for its calculation.
>
> *RATE* is the multiplier for the pool that determines how much raw storage capacity is consumed. For example, a 3 replica pool will have a ratio of 3.0, while a k=4,m=2 erasure coded pool will have a ratio of 1.5.
>
> *RAW CAPACITY* is the total amount of raw storage capacity on the OSDs that are responsible for storing this pool's (and perhaps other pools') data.
> *RATIO* is the ratio of that total capacity that this pool is
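The RATIO column from the documentation Joe quotes can be reproduced directly as SIZE * RATE / RAW CAPACITY; a worked check with the values from his "images" pool:

```shell
# Worked check of the autoscaler formula quoted above:
#   RATIO = SIZE * RATE / RAW CAPACITY
# using the "images" pool values from Joe's status output.
OUT=$(python3 -c '
size_g = 28523.0   # SIZE (GiB)
rate = 3.0         # replica-3 multiplier
raw_g = 68779.0    # RAW CAPACITY (GiB)
print(round(size_g * rate / raw_g, 4))
')
echo "$OUT"
```

This reproduces the 1.2441 in Joe's table, which supports his reading that the autoscaler works from the replicated (USED-style) size rather than the logical STORED bytes.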
Re: [ceph-users] Ceph RBD Mirroring
Dear Jason, I played a bit more with rbd mirroring and learned that deleting an image at the source (or disabling journaling on it) immediately moves the image to trash at the target - but setting rbd_mirroring_delete_delay helps to have some more grace time to catch human mistakes. However, I have issues restoring such an image which has been moved to trash by the RBD-mirror daemon as user: --- [root@mon001 ~]# rbd trash ls -la ID NAME SOURCE DELETED_AT STATUS PARENT d4fbe8f63905 test-vm-XX-disk2 MIRRORING Wed Sep 11 18:43:14 2019 protected until Thu Sep 12 18:43:14 2019 [root@mon001 ~]# rbd trash restore --image foo-image d4fbe8f63905 rbd: restore error: 2019-09-11 18:50:15.387 7f5fa9590b00 -1 librbd::api::Trash: restore: Current trash source: mirroring does not match expected: user (22) Invalid argument --- This is issued on the mon, which has the client.admin key, so it should not be a permission issue. It also fails when I try that in the Dashboard. Sadly, the error message is not clear enough for me to figure out what could be the problem - do you see what I did wrong? Cheers and thanks again, Oliver On 2019-09-10 23:17, Oliver Freyermuth wrote: Dear Jason, On 2019-09-10 23:04, Jason Dillaman wrote: On Tue, Sep 10, 2019 at 2:08 PM Oliver Freyermuth wrote: Dear Jason, On 2019-09-10 18:50, Jason Dillaman wrote: On Tue, Sep 10, 2019 at 12:25 PM Oliver Freyermuth wrote: Dear Cephalopodians, I have two questions about RBD mirroring. 1) I can not get it to work - my setup is: - One cluster holding the live RBD volumes and snapshots, in pool "rbd", cluster name "ceph", running latest Mimic. 
I ran "rbd mirror pool enable rbd pool" on that cluster and created a cephx user "rbd_mirror" with (is there a better way?): ceph auth get-or-create client.rbd_mirror mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow pool rbd r' -o ceph.client.rbd_mirror.keyring --cluster ceph In that pool, two images have the journaling feature activated, all others have it disabled still (so I would expect these two to be mirrored). You can just use "mon 'profile rbd' osd 'profile rbd'" for the caps -- but you definitely need more than read-only permissions to the remote cluster since it needs to be able to create snapshots of remote images and update/trim the image journals. these profiles really make life a lot easier. I should have thought of them rather than "guessing" a potentially good configuration... - Another (empty) cluster running latest Nautilus, cluster name "ceph", pool "rbd". I've used the dashboard to activate mirroring for the RBD pool, and then added a peer with cluster name "ceph-virt", cephx-ID "rbd_mirror", filled in the mons and key created above. I've then run: ceph auth get-or-create client.rbd_mirror_backup mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow pool rbd rwx' -o client.rbd_mirror_backup.keyring --cluster ceph and deployed that key on the rbd-mirror machine, and started the service with: Please use "mon 'profile rbd-mirror' osd 'profile rbd'" for your caps [1]. That did the trick (in combination with the above)! Again a case of PEBKAC: I should have read the documentation until the end, clearly my fault. It works well now, even though it seems to run a bit slow (~35 MB/s for the initial sync when everything is 1 GBit/s), but that may also be caused by combination of some very limited hardware on the receiving end (which will be scaled up in the future). 
A single host with 6 disks, replica 3 and a RAID controller which can only do RAID0 and not JBOD is certainly not ideal, so commit latency may cause this slow bandwidth. You could try increasing "rbd_concurrent_management_ops" from the default of 10 ops to something higher to attempt to account for the latency. However, I wouldn't expect near-line speed w/ RBD mirroring. Thanks - I will play with this option once we have more storage available in the target pool ;-). systemctl start ceph-rbd-mirror@rbd_mirror_backup.service After this, everything looks fine: # rbd mirror pool info Mode: pool Peers: UUID NAME CLIENT XXX ceph-virt client.rbd_mirror The service also seems to start fine, but logs show (debug rbd_mirror=20): rbd::mirror::ClusterWatcher:0x5575e2a7d390 resolve_peer_config_keys: retrieving config-key: pool_id=2, pool_name=rbd, peer_uuid=XXX rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: enter rbd::mirror::Mirror
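One observation on the trash listing at the start of this message: the "protected until" timestamp is exactly the deletion time plus one day, which would correspond to an rbd_mirroring_delete_delay of 86400 seconds (the concrete delay value is my inference, it is not stated above). A quick check:

```python
from datetime import datetime, timedelta

fmt = "%a %b %d %H:%M:%S %Y"
deleted_at = datetime.strptime("Wed Sep 11 18:43:14 2019", fmt)  # DELETED_AT from "rbd trash ls -la"
delay = timedelta(seconds=86400)  # assumed rbd_mirroring_delete_delay of one day

# Reproduce the "protected until" timestamp from the listing
print((deleted_at + delay).strftime(fmt))  # Thu Sep 12 18:43:14 2019
```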
Re: [ceph-users] Ceph RBD Mirroring
Dear Jason,

On 2019-09-10 23:04, Jason Dillaman wrote:
> On Tue, Sep 10, 2019 at 2:08 PM Oliver Freyermuth wrote:
>>
>> Dear Jason,
>>
>> On 2019-09-10 18:50, Jason Dillaman wrote:
>>> On Tue, Sep 10, 2019 at 12:25 PM Oliver Freyermuth wrote:
>>>>
>>>> Dear Cephalopodians,
>>>>
>>>> I have two questions about RBD mirroring.
>>>>
>>>> 1) I can not get it to work - my setup is:
>>>>
>>>> - One cluster holding the live RBD volumes and snapshots, in pool "rbd", cluster name "ceph", running latest Mimic.
>>>>   I ran "rbd mirror pool enable rbd pool" on that cluster and created a cephx user "rbd_mirror" with (is there a better way?):
>>>>   ceph auth get-or-create client.rbd_mirror mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow pool rbd r' -o ceph.client.rbd_mirror.keyring --cluster ceph
>>>>   In that pool, two images have the journaling feature activated, all others have it disabled still (so I would expect these two to be mirrored).
>>>
>>> You can just use "mon 'profile rbd' osd 'profile rbd'" for the caps -- but you definitely need more than read-only permissions to the remote cluster since it needs to be able to create snapshots of remote images and update/trim the image journals.
>>
>> these profiles really make life a lot easier. I should have thought of them rather than "guessing" a potentially good configuration...
>>
>>>> - Another (empty) cluster running latest Nautilus, cluster name "ceph", pool "rbd".
>>>>   I've used the dashboard to activate mirroring for the RBD pool, and then added a peer with cluster name "ceph-virt", cephx-ID "rbd_mirror", filled in the mons and key created above.
>>>>   I've then run:
>>>>   ceph auth get-or-create client.rbd_mirror_backup mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow pool rbd rwx' -o client.rbd_mirror_backup.keyring --cluster ceph
>>>>   and deployed that key on the rbd-mirror machine, and started the service with:
>>>
>>> Please use "mon 'profile rbd-mirror' osd 'profile rbd'" for your caps [1].
>>
>> That did the trick (in combination with the above)!
>> Again a case of PEBKAC: I should have read the documentation until the end, clearly my fault.
>>
>> It works well now, even though it seems to run a bit slow (~35 MB/s for the initial sync when everything is 1 GBit/s), but that may also be caused by a combination of some very limited hardware on the receiving end (which will be scaled up in the future).
>> A single host with 6 disks, replica 3 and a RAID controller which can only do RAID0 and not JBOD is certainly not ideal, so commit latency may cause this slow bandwidth.
>
> You could try increasing "rbd_concurrent_management_ops" from the default of 10 ops to something higher to attempt to account for the latency. However, I wouldn't expect near-line speed w/ RBD mirroring.

Thanks - I will play with this option once we have more storage available in the target pool ;-).

>>>> systemctl start ceph-rbd-mirror@rbd_mirror_backup.service
>>>>
>>>> After this, everything looks fine:
>>>> # rbd mirror pool info
>>>> Mode: pool
>>>> Peers:
>>>>   UUID NAME      CLIENT
>>>>   XXX  ceph-virt client.rbd_mirror
>>>>
>>>> The service also seems to start fine, but logs show (debug rbd_mirror=20):
>>>>
>>>> rbd::mirror::ClusterWatcher:0x5575e2a7d390 resolve_peer_config_keys: retrieving config-key: pool_id=2, pool_name=rbd, peer_uuid=XXX
>>>> rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: enter
>>>> rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: restarting failed pool replayer for uuid: XXX cluster: ceph-virt client: client.rbd_mirror
>>>> rbd::mirror::PoolReplayer: 0x5575e2a7da20 init: replaying for uuid: XXX cluster: ceph-virt client: c
Re: [ceph-users] Ceph RBD Mirroring
Dear Jason,

On 2019-09-10 18:50, Jason Dillaman wrote:
> On Tue, Sep 10, 2019 at 12:25 PM Oliver Freyermuth wrote:
>>
>> Dear Cephalopodians,
>>
>> I have two questions about RBD mirroring.
>>
>> 1) I can not get it to work - my setup is:
>>
>> - One cluster holding the live RBD volumes and snapshots, in pool "rbd", cluster name "ceph", running latest Mimic.
>>   I ran "rbd mirror pool enable rbd pool" on that cluster and created a cephx user "rbd_mirror" with (is there a better way?):
>>   ceph auth get-or-create client.rbd_mirror mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow pool rbd r' -o ceph.client.rbd_mirror.keyring --cluster ceph
>>   In that pool, two images have the journaling feature activated, all others have it disabled still (so I would expect these two to be mirrored).
>
> You can just use "mon 'profile rbd' osd 'profile rbd'" for the caps -- but you definitely need more than read-only permissions to the remote cluster since it needs to be able to create snapshots of remote images and update/trim the image journals.

these profiles really make life a lot easier. I should have thought of them rather than "guessing" a potentially good configuration...

>> - Another (empty) cluster running latest Nautilus, cluster name "ceph", pool "rbd".
>>   I've used the dashboard to activate mirroring for the RBD pool, and then added a peer with cluster name "ceph-virt", cephx-ID "rbd_mirror", filled in the mons and key created above.
>>   I've then run:
>>   ceph auth get-or-create client.rbd_mirror_backup mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow pool rbd rwx' -o client.rbd_mirror_backup.keyring --cluster ceph
>>   and deployed that key on the rbd-mirror machine, and started the service with:
>
> Please use "mon 'profile rbd-mirror' osd 'profile rbd'" for your caps [1].

That did the trick (in combination with the above)!
Again a case of PEBKAC: I should have read the documentation until the end, clearly my fault.

It works well now, even though it seems to run a bit slow (~35 MB/s for the initial sync when everything is 1 GBit/s), but that may also be caused by a combination of some very limited hardware on the receiving end (which will be scaled up in the future).
A single host with 6 disks, replica 3 and a RAID controller which can only do RAID0 and not JBOD is certainly not ideal, so commit latency may cause this slow bandwidth.

>> systemctl start ceph-rbd-mirror@rbd_mirror_backup.service
>>
>> After this, everything looks fine:
>> # rbd mirror pool info
>> Mode: pool
>> Peers:
>>   UUID NAME      CLIENT
>>   XXX  ceph-virt client.rbd_mirror
>>
>> The service also seems to start fine, but logs show (debug rbd_mirror=20):
>>
>> rbd::mirror::ClusterWatcher:0x5575e2a7d390 resolve_peer_config_keys: retrieving config-key: pool_id=2, pool_name=rbd, peer_uuid=XXX
>> rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: enter
>> rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: restarting failed pool replayer for uuid: XXX cluster: ceph-virt client: client.rbd_mirror
>> rbd::mirror::PoolReplayer: 0x5575e2a7da20 init: replaying for uuid: XXX cluster: ceph-virt client: client.rbd_mirror
>> rbd::mirror::PoolReplayer: 0x5575e2a7da20 init_rados: error connecting to remote peer uuid: XXX cluster: ceph-virt client: client.rbd_mirror: (95) Operation not supported
>> rbd::mirror::ServiceDaemon: 0x5575e29c8d70 add_or_update_callout: pool_id=2, callout_id=2, callout_level=error, text=unable to connect to remote cluster
>
> If it's still broken after fixing your caps above, perhaps increase debugging for "rados", "monc", "auth", and "ms" to see if you can determine the source of the op not supported error.
>
>> I already tried storing the ceph.client.rbd_mirror.keyring (i.e. from the cluster with the live images) on the rbd-mirror machine explicitly (i.e. not only in mon config storage),
>> and after doing that:
>> rbd -m mon_ip_of_ceph_virt_cluster --id=rbd_mirror ls
>> works fine. So it's not a connectivity issue. Maybe a permission issue? Or did I miss something?
>
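Putting Jason's two caps recommendations together, the resulting users would look roughly like the following keyring fragments (key values redacted; this is a sketch assuming the standard keyring file format, not a verbatim excerpt from the thread):

```ini
# Peer user on the primary ("ceph-virt") cluster, used by rbd-mirror to read
# images there and to create snapshots / update and trim the image journals:
[client.rbd_mirror]
        key = <redacted>
        caps mon = "profile rbd"
        caps osd = "profile rbd"

# Local user for the rbd-mirror daemon on the backup cluster:
[client.rbd_mirror_backup]
        key = <redacted>
        caps mon = "profile rbd-mirror"
        caps osd = "profile rbd"
```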
[ceph-users] Ceph RBD Mirroring
Dear Cephalopodians,

I have two questions about RBD mirroring.

1) I can not get it to work - my setup is:

- One cluster holding the live RBD volumes and snapshots, in pool "rbd", cluster name "ceph", running latest Mimic.
  I ran "rbd mirror pool enable rbd pool" on that cluster and created a cephx user "rbd_mirror" with (is there a better way?):
  ceph auth get-or-create client.rbd_mirror mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow pool rbd r' -o ceph.client.rbd_mirror.keyring --cluster ceph
  In that pool, two images have the journaling feature activated, all others have it disabled still (so I would expect these two to be mirrored).

- Another (empty) cluster running latest Nautilus, cluster name "ceph", pool "rbd".
  I've used the dashboard to activate mirroring for the RBD pool, and then added a peer with cluster name "ceph-virt", cephx-ID "rbd_mirror", filled in the mons and key created above.
  I've then run:
  ceph auth get-or-create client.rbd_mirror_backup mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow pool rbd rwx' -o client.rbd_mirror_backup.keyring --cluster ceph
  and deployed that key on the rbd-mirror machine, and started the service with:
  systemctl start ceph-rbd-mirror@rbd_mirror_backup.service

After this, everything looks fine:
# rbd mirror pool info
Mode: pool
Peers:
  UUID NAME      CLIENT
  XXX  ceph-virt client.rbd_mirror

The service also seems to start fine, but logs show (debug rbd_mirror=20):

rbd::mirror::ClusterWatcher:0x5575e2a7d390 resolve_peer_config_keys: retrieving config-key: pool_id=2, pool_name=rbd, peer_uuid=XXX
rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: enter
rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: restarting failed pool replayer for uuid: XXX cluster: ceph-virt client: client.rbd_mirror
rbd::mirror::PoolReplayer: 0x5575e2a7da20 init: replaying for uuid: XXX cluster: ceph-virt client: client.rbd_mirror
rbd::mirror::PoolReplayer: 0x5575e2a7da20 init_rados: error connecting to remote peer uuid: XXX cluster: ceph-virt client: client.rbd_mirror: (95) Operation not supported
rbd::mirror::ServiceDaemon: 0x5575e29c8d70 add_or_update_callout: pool_id=2, callout_id=2, callout_level=error, text=unable to connect to remote cluster

I already tried storing the ceph.client.rbd_mirror.keyring (i.e. from the cluster with the live images) on the rbd-mirror machine explicitly (i.e. not only in mon config storage), and after doing that:
rbd -m mon_ip_of_ceph_virt_cluster --id=rbd_mirror ls
works fine. So it's not a connectivity issue. Maybe a permission issue? Or did I miss something?
Any idea what "operation not supported" means?
It's unclear to me whether things should work well using Mimic with Nautilus, and whether enabling pool mirroring while only having journaling on for two images is a supported case.

2) Since there is a performance drawback (about 2x) for journaling, is it also possible to only mirror snapshots, and leave the live volumes alone?
This would cover the common backup usecase before deferred mirroring is implemented (or is it there already?).

Cheers and thanks in advance,
Oliver

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Urgent Help Needed (regarding rbd cache)
Hi together,

On 01.08.19 08:45, Janne Johansson wrote:
> On Thu, 1 Aug 2019 at 07:31, Muhammad Junaid wrote:
>> Your email has cleared many things to me. Let me repeat my understanding. Every critical data write (like Oracle / any other DB) will be done with sync / fsync flags, meaning it will only be confirmed to the DB/app after it is actually written to hard drives / OSDs. Any other application can do this as well.
>> All other writes, like OS logs etc., will be confirmed immediately to the app/user but written later, passing through the kernel, the RBD cache, the physical drive cache (if any) and then to disk. These are susceptible to power-failure loss, but overall things are recoverable / non-critical.
>
> That last part is probably simplified a bit. I suspect that between a program in a guest sending its data to the virtualised device, running in a KVM on top of an OS that has remote storage over the network, to a storage server with its own OS and drive controller chip and lastly physical drive(s) to store the write, there will be something like ~10 layers of write caching possible, of which the RBD cache you were asking about is just one.
> It is just located very conveniently before the I/O has to leave the KVM host and go back and forth over the network, so it is the last place where you can see huge gains in the guests' I/O response time, but at the same time it can be shared between lots of guests on the KVM host, which should have tons of RAM available compared to any single guest - so it is a nice way to get a large cache for outgoing writes.
>
> Also, to answer your first part: yes, all critical software that depends heavily on write ordering and integrity is hopefully already doing write operations that way, asking for sync(), fsync() or fdatasync() and similar calls, but I can't produce a list of all programs that do.
> Since there already are many layers of delayed cached writes even without virtualisation and/or ceph, applications that are important have mostly learned their lessons by now, so chances are very high that all your important databases and similar programs are doing the right thing.

Just to add on this: one such software, for which people cared a lot, is of course a file system itself.
BTRFS is notably a candidate very sensitive to broken flush / FUA ( https://en.wikipedia.org/wiki/Disk_buffer#Force_Unit_Access_(FUA) ) implementations at any layer of the I/O path, due to its rather complicated metadata structure.
While for in-kernel and other open source software (such as librbd), there are usually a lot of people checking the code for a correct implementation and testing things, there is also broken hardware (or rather, firmware) in the wild.
But there are even software issues around, if you think more generally and strive for data correctness (since corruption can happen at any layer): I was hit by an in-kernel issue in the past (a network driver writing network statistics via DMA to the wrong memory location - "sometimes"), corrupting two BTRFS partitions of mine and causing random crashes in browsers and mail clients. BTRFS has been hardened only in kernel 5.2 to check the metadata tree before flushing it to disk.

If you are curious about known hardware issues, check out this lengthy, but very insightful mail on the linux-btrfs list:
https://lore.kernel.org/linux-btrfs/20190623204523.gc11...@hungrycats.org/
As you can learn there, there are many drive and firmware combinations out there which do not implement flush / FUA correctly, so your BTRFS may be corrupted after a power failure. The very same thing can happen to Ceph, but replication across several OSDs and the lower probability of having broken disks in all hosts make this issue less likely.

For what it is worth, we also use writeback caching for our virtualization cluster and are very happy with it - we also tried pulling power plugs on hypervisors, MONs and OSDs at random times during writes, and ext4 could always recover easily with an fsck making use of the journal.

Cheers and HTH,
Oliver

> But if the guest is instead running a mail filter that does antivirus checks, spam checks and so on, operating on files that live on the machine for something like one second and then either get dropped or sent to the destination mailbox somewhere else, then having aggressive write caches would be very useful, since the effects of a crash would still mostly mean "the emails that were in the queue were lost, not acked by the final mailserver, and will probably be resent by the previous smtp server".
> For such a guest VM, forcing sync writes would only be a net loss; it would gain much by having large RAM write caches.
>
> --
> May the most significant bit of your life be positive.
Re: [ceph-users] Fix scrub error in bluestore.
Hi Alfredo,

you may want to check the SMART data for the disk. I also had such a case recently (see http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/035117.html for the thread), and the disk had one unreadable sector which was pending reallocation.
Triggering "ceph pg repair" for the problematic placement group made the OSD rewrite the problematic sector and allowed the disk to reallocate this unreadable sector.

Cheers,
Oliver

On 06.06.19 18:45, Tarek Zegar wrote:
> Look here: http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#pgs-inconsistent
> A read error typically is a disk issue. The doc is not clear on how to resolve that.
>
> From: Alfredo Rezinovsky
> To: Ceph Users
> Date: 06/06/2019 10:58 AM
> Subject: [EXTERNAL] [ceph-users] Fix scrub error in bluestore.
>
> https://ceph.com/geen-categorie/ceph-manually-repair-object/ is a little outdated.
> After stopping the OSD and flushing the journal, I don't have any clue on how to move the object (easy in filestore).
> I have this in my osd log:
>
> 2019-06-05 10:46:41.418 7f47d0502700 -1 log_channel(cluster) log [ERR] : 10.c5 shard 2 soid 10:a39e2c78:::183f81f.0001:head : candidate had a read error
>
> How can I fix it?
>
> --
> Alfrenovsky
Re: [ceph-users] Object read error - enough copies available
Hi,

On 31.05.19 12:07, Burkhard Linke wrote:
> Hi,
>
> see my post in the recent 'CephFS object mapping.' thread. It describes the necessary commands to look up a file based on its rados object name.

many thanks! I somehow missed the important part in that thread earlier and only got the functional, but not really scaling "find . -xdev -inum xxx" approach before I stopped reading. Now I have followed it in full - very enlightening indeed: one needs to look at the xattrs of the RADOS objects! Very logical once you know it.

Thanks again!
Oliver

> Regards,
>
> Burkhard
Re: [ceph-users] Object read error - enough copies available
On 30.05.19 17:00, Oliver Freyermuth wrote:
> Dear Cephalopodians,
>
> I found the messages:
> 2019-05-30 16:08:51.656363 [ERR] Error -5 reading object 2:0979ae43:::10002954ea6.007c:head
> 2019-05-30 16:08:51.760660 [WRN] Error(s) ignored for 2:0979ae43:::10002954ea6.007c:head enough copies available
> just now in our logs (Mimic 13.2.5). However, everything stayed HEALTH_OK and seems fine. Pool 2 is an EC pool containing CephFS.
>
> Up to now I've never had to delve into the depths of RADOS, so I have some questions. If there are docs and I missed them, just redirect me :-).
>
> - How do I find the OSDs / PG for that object (is the PG contained in the name?)
>   I'd love to check SMART in more detail and deep-scrub that PG to see if this was just a hiccup, or a permanent error.

I've progressed - and put it on the list in the hope it can also help others:

# ceph osd map cephfs_data 10002954ea6.007c
osdmap e40907 pool 'cephfs_data' (2) object '10002954ea6.007c' -> pg 2.c2759e90 (2.e90) -> up ([196,101,14,156,47,177], p196) acting ([196,101,14,156,47,177], p196)
# ceph pg deep-scrub 2.e90
instructing pg 2.e90s0 on osd.196 to deep-scrub

Checking the OSD logs (osd 196), I find:
---
2019-05-30 16:08:51.759 7f46b36ac700  0 log_channel(cluster) log [WRN] : Error(s) ignored for 2:0979ae43:::10002954ea6.007c:head enough copies available
2019-05-30 17:13:39.817 7f46b36ac700  0 log_channel(cluster) log [DBG] : 2.e90 deep-scrub starts
2019-05-30 17:19:51.013 7f46b36ac700 -1 log_channel(cluster) log [ERR] : 2.e90 shard 14(2) soid 2:0979ae43:::10002954ea6.007c:head : candidate had a read error
2019-05-30 17:23:52.360 7f46b36ac700 -1 log_channel(cluster) log [ERR] : 2.e90s0 deep-scrub 0 missing, 1 inconsistent objects
2019-05-30 17:23:52.360 7f46b36ac700 -1 log_channel(cluster) log [ERR] : 2.e90 deep-scrub 1 errors
---
And now, the cluster is in HEALTH_ERR as expected.
So that would probably have happened automatically after a while - wouldn't it be better to alert the operator immediately, e.g. by scheduling an immediate deep-scrub after a read error?

I presume "shard 14(2)" means: "shard on OSD 14, third (index 2) in the acting set". Correct?
Checking that OSD's logs, I do indeed find:
---
2019-05-30 16:08:51.566 7f2e7dc15700 -1 bdev(0x55ae2eade000 /var/lib/ceph/osd/ceph-14/block) _aio_thread got r=-5 ((5) Input/output error)
2019-05-30 16:08:51.566 7f2e7dc15700 -1 bdev(0x55ae2eade000 /var/lib/ceph/osd/ceph-14/block) _aio_thread translating the error to EIO for upper layer
2019-05-30 16:08:51.655 7f2e683ea700 -1 log_channel(cluster) log [ERR] : Error -5 reading object 2:0979ae43:::10002954ea6.007c:head
---
The underlying disk has one problematic sector in SMART. Issuing:
# ceph pg repair 2.e90
has triggered rewriting that sector and allowed the disk to reallocate it, and Ceph is HEALTH_OK again.

So my issue is solved, but two questions remain:
- Is it wanted that the error is "ignored" until the next deep-scrub happens?
- Is there also a way to map the object name to a CephFS file object and vice-versa? In one direction (file / inode to object), it seems this approach should work:
  http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005384.html

Cheers and thanks,
Oliver
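On the name-mapping question: as far as I understand the naming (this is my own sketch, not from the thread), the hex blob inside the internal object id (`2:0979ae43:::...`) is the bit-reversed placement hash, and CephFS data objects are named `<inode hex>.<block index, 8 hex digits>` (the zero padding appears to have been eaten by the list archive). A small Python check against the values quoted above; the pg_num of 4096 is my assumption, inferred from `2.c2759e90` mapping to `2.e90`:

```python
def reverse_bits32(x: int) -> int:
    """Mirror the 32 bits of x; hobject_t prints the bit-reversed hash."""
    return int(f"{x:032b}"[::-1], 2)

# "2:0979ae43:::10002954ea6.007c:head" vs. "pg 2.c2759e90 (2.e90)" from "ceph osd map":
raw_hash = reverse_bits32(0x0979ae43)
print(hex(raw_hash))               # 0xc2759e90 -> the raw hash shown by "ceph osd map"
print(hex(raw_hash & (4096 - 1)))  # 0xe90 -> PG 2.e90, assuming pg_num=4096

# CephFS data object names: "<inode hex>.<block index, 8 hex digits>"
ino, block = 0x10002954ea6, 0x7c
print(f"{ino:x}.{block:08x}")      # 10002954ea6.0000007c
```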
[ceph-users] Object read error - enough copies available
Dear Cephalopodians,

I found the messages:
2019-05-30 16:08:51.656363 [ERR] Error -5 reading object 2:0979ae43:::10002954ea6.007c:head
2019-05-30 16:08:51.760660 [WRN] Error(s) ignored for 2:0979ae43:::10002954ea6.007c:head enough copies available
just now in our logs (Mimic 13.2.5). However, everything stayed HEALTH_OK and seems fine. Pool 2 is an EC pool containing CephFS.

Up to now I've never had to delve into the depths of RADOS, so I have some questions. If there are docs and I missed them, just redirect me :-).

- How do I find the OSDs / PG for that object (is the PG contained in the name?)
  I'd love to check SMART in more detail and deep-scrub that PG to see if this was just a hiccup, or a permanent error.

- Is there also a way to map the object name to a CephFS file object and vice-versa? In one direction (file / inode to object), it seems this approach should work:
  http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005384.html

- Should Ceph stay healthy in that case? Does it maybe even deep-scrub automatically, and only decide afterwards whether to stay healthy / whether repair is needed?

Cheers and thanks,
Oliver
Re: [ceph-users] Balancer: uneven OSDs
y] Loaded module_config entry mgr/balancer/mode:upmap
2019-05-29 17:06:54.299 7f40cd3e8700 4 mgr get_config get_config key: mgr/balancer/active
2019-05-29 17:06:54.299 7f40cd3e8700 4 mgr get_config get_config key: mgr/balancer/begin_time
2019-05-29 17:06:54.299 7f40cd3e8700 4 mgr get_config get_config key: mgr/balancer/end_time
2019-05-29 17:06:54.299 7f40cd3e8700 4 mgr get_config get_config key: mgr/balancer/sleep_interval
2019-05-29 17:06:54.327 7f40cd3e8700 4 mgr[balancer] Optimize plan auto_2019-05-29_17:06:54
2019-05-29 17:06:54.327 7f40cd3e8700 4 mgr get_config get_config key: mgr/balancer/mode
2019-05-29 17:06:54.327 7f40cd3e8700 4 mgr get_config get_config key: mgr/balancer/max_misplaced
2019-05-29 17:06:54.327 7f40cd3e8700 4 mgr[balancer] Mode upmap, max misplaced 0.50
2019-05-29 17:06:54.327 7f40cd3e8700 4 mgr[balancer] do_upmap
2019-05-29 17:06:54.327 7f40cd3e8700 4 mgr get_config get_config key: mgr/balancer/upmap_max_iterations
2019-05-29 17:06:54.327 7f40cd3e8700 4 mgr get_config get_config key: mgr/balancer/upmap_max_deviation
2019-05-29 17:06:54.327 7f40cd3e8700 4 mgr[balancer] pools ['rbd']
2019-05-29 17:06:54.327 7f40cd3e8700 4 mgr[balancer] prepared 0/10 changes

> From: Oliver Freyermuth
> To: Tarek Zegar
> Cc: ceph-users@lists.ceph.com
> Date: 05/29/2019 11:59 AM
> Subject: [EXTERNAL] Re: [ceph-users] Balancer: uneven OSDs
>
> Hi Tarek,
>
> On 29.05.19 18:49, Tarek Zegar wrote:
>> Hi Oliver,
>>
>> Thank you for the response, I did ensure that min-client-compat-level is indeed Luminous (see below). I have no kernel mapped rbd clients. Ceph versions reports mimic. Also below is the output of ceph balancer status.
>> One thing to note, I did enable the balancer after I already filled the cluster, not from the onset. I had hoped that it wouldn't matter, though your comment "if the compat-level is too old for upmap, you'll only find a small warning about that in the logfiles" leads me to believe that it will *not* work in doing it this way. Please confirm and let me know what message to look for in /var/log/ceph.
>
> it should also work well on existing clusters - we have also used it on a Luminous cluster after it was already half-filled, and it worked well - that's what it was made for ;-).
> The only issue we encountered was that the client-compat-level needed to be set to Luminous before enabling the balancer plugin, but since you can always disable and re-enable a plugin, this is not a "blocker".
>
> Do you see anything in the logs of the active mgr when disabling and re-enabling the balancer plugin?
> That's how we initially found the message that we needed to raise the client-compat-level.
>
> Cheers,
> Oliver
>
>> Thank you!
>>
>> root@hostadmin:~# ceph balancer status
>> {
>>     "active": true,
>>     "plans": [],
>>     "mode": "upmap"
>> }
>>
>> root@hostadmin:~# ceph features
>> {
>>     "mon": [
>>         {
>>             "features"
Re: [ceph-users] Balancer: uneven OSDs
Hi Tarek,

On 29.05.19 18:49, Tarek Zegar wrote:
> Hi Oliver,
>
> Thank you for the response, I did ensure that min-client-compat-level is indeed Luminous (see below). I have no kernel mapped rbd clients. Ceph versions reports mimic. Also below is the output of ceph balancer status.
> One thing to note, I did enable the balancer after I already filled the cluster, not from the onset. I had hoped that it wouldn't matter, though your comment "if the compat-level is too old for upmap, you'll only find a small warning about that in the logfiles" leads me to believe that it will *not* work in doing it this way. Please confirm and let me know what message to look for in /var/log/ceph.

it should also work well on existing clusters - we have also used it on a Luminous cluster after it was already half-filled, and it worked well - that's what it was made for ;-).
The only issue we encountered was that the client-compat-level needed to be set to Luminous before enabling the balancer plugin, but since you can always disable and re-enable a plugin, this is not a "blocker".

Do you see anything in the logs of the active mgr when disabling and re-enabling the balancer plugin?
That's how we initially found the message that we needed to raise the client-compat-level.

Cheers,
Oliver

> Thank you!
>
> root@hostadmin:~# ceph balancer status
> {
>     "active": true,
>     "plans": [],
>     "mode": "upmap"
> }
>
> root@hostadmin:~# ceph features
> {
>     "mon": [
>         {
>             "features": "0x3ffddff8ffacfffb",
>             "release": "luminous",
>             "num": 3
>         }
>     ],
>     "osd": [
>         {
>             "features": "0x3ffddff8ffacfffb",
>             "release": "luminous",
>             "num": 7
>         }
>     ],
>     "client": [
>         {
>             "features": "0x3ffddff8ffacfffb",
>             "release": "luminous",
>             "num": 1
>         }
>     ],
>     "mgr": [
>         {
>             "features": "0x3ffddff8ffacfffb",
>             "release": "luminous",
>             "num": 3
>         }
>     ]
> }
>
> From: Oliver Freyermuth
> To: ceph-users@lists.ceph.com
> Date: 05/29/2019 11:13 AM
> Subject: [EXTERNAL] Re: [ceph-users] Balancer: uneven OSDs
> Sent by: "ceph-users"
>
> Hi Tarek,
>
> what's the output of "ceph balancer status"?
> In case you are using "upmap" mode, you must make sure to have a min-client-compat-level of at least Luminous:
> http://docs.ceph.com/docs/mimic/rados/operations/upmap/
> Of course, please be aware that your clients must be recent enough (especially for kernel clients).
>
> Sadly, if the compat-level is too old for upmap, you'll only find a small warning about that in the logfiles, but no error on the terminal when activating the balancer, nor any other kind of error / health condition.
>
> Cheers,
> Oliver
>
> On 29.05.19 17:52, Tarek Zegar wrote:
>> Can anyone help with this? Why can't I optimize this cluster, the pg counts and data distribution is way off.
>>
>> I enabled the balancer plugin and even tried to manually invoke it but it won't allow any changes. Looking at ceph osd df, it's not even at all. Thoughts?
>>
>> root@hostadmin:~# ceph osd df
>>
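Since the open question is whether every connected entity is recent enough for upmap, here is a quick way (my own sketch, not from the thread) to check the `ceph features` JSON quoted above; with everything reporting "luminous", upmap should be permitted:

```python
import json

# Condensed from the "ceph features" output quoted in the message above
features_json = """
{
  "mon":    [{"features": "0x3ffddff8ffacfffb", "release": "luminous", "num": 3}],
  "osd":    [{"features": "0x3ffddff8ffacfffb", "release": "luminous", "num": 7}],
  "client": [{"features": "0x3ffddff8ffacfffb", "release": "luminous", "num": 1}],
  "mgr":    [{"features": "0x3ffddff8ffacfffb", "release": "luminous", "num": 3}]
}
"""

# Collect the release reported by every daemon/client group
releases = {entry["release"]
            for group in json.loads(features_json).values()
            for entry in group}
print(releases)  # {'luminous'}
```

With output like this, there is no pre-Luminous straggler that would block the upmap balancer mode.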
Re: [ceph-users] Balancer: uneven OSDs
Hi Tarek,

what's the output of "ceph balancer status"?
In case you are using "upmap" mode, you must make sure to have a min-client-compat-level of at least Luminous:
http://docs.ceph.com/docs/mimic/rados/operations/upmap/
Of course, please be aware that your clients must be recent enough (especially for kernel clients).

Sadly, if the compat-level is too old for upmap, you'll only find a small warning about that in the logfiles, but no error on the terminal when activating the balancer, nor any other kind of error or health condition.

Cheers,
Oliver

Am 29.05.19 um 17:52 schrieb Tarek Zegar:
> Can anyone help with this? Why can't I optimize this cluster? The pg counts and data distribution are way off.
>
> I enabled the balancer plugin and even tried to manually invoke it, but it won't allow any changes. Looking at "ceph osd df", it's not even at all. Thoughts?
>
> root@hostadmin:~# ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZE   USE      AVAIL    %USE  VAR  PGS
>  1   hdd 0.00980        0    0 B     0 B      0 B      0    0    0
>  3   hdd 0.00980      1.0 10 GiB  8.3 GiB  1.7 GiB 82.83 1.14  156
>  6   hdd 0.00980      1.0 10 GiB  8.4 GiB  1.6 GiB 83.77 1.15  144
>  0   hdd 0.00980        0    0 B     0 B      0 B      0    0    0
>  5   hdd 0.00980      1.0 10 GiB  9.0 GiB 1021 MiB 90.03 1.23  159
>  7   hdd 0.00980      1.0 10 GiB  7.7 GiB  2.3 GiB 76.57 1.05  141
>  2   hdd 0.00980      1.0 10 GiB  5.5 GiB  4.5 GiB 55.42 0.76   90
>  4   hdd 0.00980      1.0 10 GiB  5.9 GiB  4.1 GiB 58.78 0.81   99
>  8   hdd 0.00980      1.0 10 GiB  6.3 GiB  3.7 GiB 63.12 0.87  111
>                 TOTAL 90 GiB   53 GiB   37 GiB  72.93
> MIN/MAX VAR: 0.76/1.23  STDDEV: 12.67
>
> root@hostadmin:~# osdmaptool om --upmap out.txt --upmap-pool rbd
> osdmaptool: osdmap file 'om'
> writing upmap command output to: out.txt
> checking for upmap cleanups
> upmap, max-count 100, max deviation 0.01   <--- really? It's not even close to 1% across the drives
> limiting to pools rbd (1)
> no upmaps proposed
>
> ceph balancer optimize myplan
> Error EALREADY: Unable to find further optimization, or distribution is already perfect

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
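As an aside, the MIN/MAX VAR and STDDEV figures printed by "ceph osd df" can be reproduced from the %USE column alone, which makes it easy to judge how uneven a cluster really is. A quick sketch using the values quoted above (the two weight-0 OSDs drop out of the statistics, matching the TOTAL line):

```python
import math

# %USE of the in/up OSDs from the "ceph osd df" output quoted above
# (the two weight-0 OSDs contribute neither data nor statistics).
use = [82.83, 83.77, 90.03, 76.57, 55.42, 58.78, 63.12]

mean = sum(use) / len(use)          # overall utilization (the TOTAL line)
var_min = min(use) / mean           # MIN VAR
var_max = max(use) / mean           # MAX VAR
stddev = math.sqrt(sum((u - mean) ** 2 for u in use) / len(use))

print(f"mean {mean:.2f}  MIN/MAX VAR: {var_min:.2f}/{var_max:.2f}  STDDEV: {stddev:.2f}")
# matches the quoted output: 72.93, 0.76/1.23, 12.67
```

A STDDEV above 12 on 10 GiB devices is indeed far from the 1% target the upmap max deviation implies, which is why the "no upmaps proposed" answer looks so surprising here.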
Re: [ceph-users] Quotas with Mimic (CephFS-FUSE) clients in a Luminous Cluster
Am 28.05.19 um 03:24 schrieb Yan, Zheng: On Mon, May 27, 2019 at 6:54 PM Oliver Freyermuth wrote: Am 27.05.19 um 12:48 schrieb Oliver Freyermuth: Am 27.05.19 um 11:57 schrieb Dan van der Ster: On Mon, May 27, 2019 at 11:54 AM Oliver Freyermuth wrote: Dear Dan, thanks for the quick reply! Am 27.05.19 um 11:44 schrieb Dan van der Ster: Hi Oliver, We saw the same issue after upgrading to mimic. IIRC we could make the max_bytes xattr visible by touching an empty file in the dir (thereby updating the dir inode). e.g. touch /cephfs/user/freyermu/.quota; rm /cephfs/user/freyermu/.quota sadly, no, not even with sync's in between: - $ touch /cephfs/user/freyermu/.quota; sync; rm -f /cephfs/user/freyermu/.quota; sync; getfattr --absolute-names --only-values -n ceph.quota.max_bytes /cephfs/user/freyermu/ /cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute - Also restarting the FUSE client after that does not change it. Maybe this requires the rest of the cluster to be upgraded to work? I'm just guessing here, but maybe the MDS needs the file creation / update of the directory inode to "update" the way the quota attributes are exported. If something changed here with Mimic, this would explain why the "touch" is needed. And this would also explain why this might only help if the MDS is upgraded to Mimic, too. I think the relevant change which is causing this is the new_snaps in mimic. Did you already enable them? `ceph fs set cephfs allow_new_snaps 1` Good point! We wanted to enable these anyways with Mimic. I've enabled it just now (since servers are still Luminous, that required "--yes-i-really-mean-it") but sadly, the max_bytes attribute is still not there (also not after remounting on the client / using the file creation and deletion trick). That's interesting - it suddenly started to work for one directory after creating a snapshot for one directory subtree on which we have quotas enabled, and removing that snapshot again. 
I can reproduce that for other directories. So it seems enabling snapshots and snapshotting once fixes it for that directory tree. If that's the case, maybe this could be added to the upgrade notes? quota handling code changed in mimic. mimic client + luminous mds have compat issue. there should be no issue if both mds and client are both upgraded to mimic, Thanks for the confirmation! We have by now upgraded all our MDSs, and indeed now the trick which Dan outlined initially works: touch /directory/with/quotas/.somefile rm /directory/with/quotas/.somefile to get the attribute to show up again. No creation of snaps is needed anymore, but it's also not showing up by itself (an update of the directory inode seems needed to trigger the "migration"). Since a change inside the subtree is also sufficient, this means things will "heal" automatically for us. Still, this surprised me - maybe this compat issue could / should be mentioned in the upgrade notes? Naïvely, I believed that (fuse) clients should be relatively safe to upgrade even if the rest of the cluster is not there yet. Cheers and thanks, Oliver Regards Yan, Zheng Cheers, Oliver Cheers, Oliver -- dan We have scheduled the remaining parts of the upgrade for Wednesday, and worst case could survive until then without quota enforcement, but it's a really strange and unexpected incompatibility. Cheers, Oliver Does that work? -- dan On Mon, May 27, 2019 at 11:36 AM Oliver Freyermuth wrote: Dear Cephalopodians, in the process of migrating a cluster from Luminous (12.2.12) to Mimic (13.2.5), we have upgraded the FUSE clients first (we took the chance during a time of low activity), thinking that this should not cause any issues. All MDS+MON+OSDs are still on Luminous, 12.2.12. 
However, it seems quotas have stopped working - with a (FUSE) Mimic client (13.2.5), I see: $ getfattr --absolute-names --only-values -n ceph.quota.max_bytes /cephfs/user/freyermu/ /cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute A Luminous client (12.2.12) on the same cluster sees: $ getfattr --absolute-names --only-values -n ceph.quota.max_bytes /cephfs/user/freyermu/ 5 It does not seem as if the attribute has been renamed (e.g. https://github.com/ceph/ceph/blob/mimic/qa/tasks/cephfs/test_quota.py still references it, same for the docs), and I have to assume the clients also do not enforce quota if they do not see it. Is this a known incompatibility between Mimic clients and a Luminous cluster? The release notes of Mimic only mention that quota support was added to the kernel client, but nothing else quota related catches my eye. Cheers, Oliver ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Oliver Freyermuth Universität Bonn Physikal
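For anyone wanting to monitor this, the getfattr probes quoted above can also be issued programmatically. A minimal sketch (Linux-only; the function name is mine, not a Ceph API) that treats a missing attribute the way the affected Mimic client presents it, i.e. as simply absent:

```python
import errno
import os

def cephfs_quota_max_bytes(path):
    """Return the ceph.quota.max_bytes value of a directory, or None if
    the attribute is not visible (as with the Mimic-client / Luminous-MDS
    combination discussed in this thread)."""
    try:
        return int(os.getxattr(path, "ceph.quota.max_bytes"))
    except OSError as e:
        # ENODATA: attribute not set / not exposed;
        # ENOTSUP / EOPNOTSUPP: filesystem does not handle this xattr.
        if e.errno in (errno.ENODATA, errno.ENOTSUP, errno.EOPNOTSUPP):
            return None
        raise
```

Against a directory with a working quota a Luminous client would return the configured limit; the broken combination yields None instead of raising, which is convenient for a health check.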
Re: [ceph-users] Quotas with Mimic (CephFS-FUSE) clients in a Luminous Cluster
Am 27.05.19 um 12:48 schrieb Oliver Freyermuth: Am 27.05.19 um 11:57 schrieb Dan van der Ster: On Mon, May 27, 2019 at 11:54 AM Oliver Freyermuth wrote: Dear Dan, thanks for the quick reply! Am 27.05.19 um 11:44 schrieb Dan van der Ster: Hi Oliver, We saw the same issue after upgrading to mimic. IIRC we could make the max_bytes xattr visible by touching an empty file in the dir (thereby updating the dir inode). e.g. touch /cephfs/user/freyermu/.quota; rm /cephfs/user/freyermu/.quota sadly, no, not even with sync's in between: - $ touch /cephfs/user/freyermu/.quota; sync; rm -f /cephfs/user/freyermu/.quota; sync; getfattr --absolute-names --only-values -n ceph.quota.max_bytes /cephfs/user/freyermu/ /cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute - Also restarting the FUSE client after that does not change it. Maybe this requires the rest of the cluster to be upgraded to work? I'm just guessing here, but maybe the MDS needs the file creation / update of the directory inode to "update" the way the quota attributes are exported. If something changed here with Mimic, this would explain why the "touch" is needed. And this would also explain why this might only help if the MDS is upgraded to Mimic, too. I think the relevant change which is causing this is the new_snaps in mimic. Did you already enable them? `ceph fs set cephfs allow_new_snaps 1` Good point! We wanted to enable these anyways with Mimic. I've enabled it just now (since servers are still Luminous, that required "--yes-i-really-mean-it") but sadly, the max_bytes attribute is still not there (also not after remounting on the client / using the file creation and deletion trick). That's interesting - it suddenly started to work for one directory after creating a snapshot for one directory subtree on which we have quotas enabled, and removing that snapshot again. I can reproduce that for other directories. So it seems enabling snapshots and snapshotting once fixes it for that directory tree. 
If that's the case, maybe this could be added to the upgrade notes? Cheers, Oliver Cheers, Oliver -- dan We have scheduled the remaining parts of the upgrade for Wednesday, and worst case could survive until then without quota enforcement, but it's a really strange and unexpected incompatibility. Cheers, Oliver Does that work? -- dan On Mon, May 27, 2019 at 11:36 AM Oliver Freyermuth wrote: Dear Cephalopodians, in the process of migrating a cluster from Luminous (12.2.12) to Mimic (13.2.5), we have upgraded the FUSE clients first (we took the chance during a time of low activity), thinking that this should not cause any issues. All MDS+MON+OSDs are still on Luminous, 12.2.12. However, it seems quotas have stopped working - with a (FUSE) Mimic client (13.2.5), I see: $ getfattr --absolute-names --only-values -n ceph.quota.max_bytes /cephfs/user/freyermu/ /cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute A Luminous client (12.2.12) on the same cluster sees: $ getfattr --absolute-names --only-values -n ceph.quota.max_bytes /cephfs/user/freyermu/ 5 It does not seem as if the attribute has been renamed (e.g. https://github.com/ceph/ceph/blob/mimic/qa/tasks/cephfs/test_quota.py still references it, same for the docs), and I have to assume the clients also do not enforce quota if they do not see it. Is this a known incompatibility between Mimic clients and a Luminous cluster? The release notes of Mimic only mention that quota support was added to the kernel client, but nothing else quota related catches my eye. 
Cheers, Oliver

--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax: +49 228 73 7869
--
Re: [ceph-users] Quotas with Mimic (CephFS-FUSE) clients in a Luminous Cluster
Am 27.05.19 um 11:57 schrieb Dan van der Ster: On Mon, May 27, 2019 at 11:54 AM Oliver Freyermuth wrote: Dear Dan, thanks for the quick reply! Am 27.05.19 um 11:44 schrieb Dan van der Ster: Hi Oliver, We saw the same issue after upgrading to mimic. IIRC we could make the max_bytes xattr visible by touching an empty file in the dir (thereby updating the dir inode). e.g. touch /cephfs/user/freyermu/.quota; rm /cephfs/user/freyermu/.quota sadly, no, not even with sync's in between: - $ touch /cephfs/user/freyermu/.quota; sync; rm -f /cephfs/user/freyermu/.quota; sync; getfattr --absolute-names --only-values -n ceph.quota.max_bytes /cephfs/user/freyermu/ /cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute - Also restarting the FUSE client after that does not change it. Maybe this requires the rest of the cluster to be upgraded to work? I'm just guessing here, but maybe the MDS needs the file creation / update of the directory inode to "update" the way the quota attributes are exported. If something changed here with Mimic, this would explain why the "touch" is needed. And this would also explain why this might only help if the MDS is upgraded to Mimic, too. I think the relevant change which is causing this is the new_snaps in mimic. Did you already enable them? `ceph fs set cephfs allow_new_snaps 1` Good point! We wanted to enable these anyways with Mimic. I've enabled it just now (since servers are still Luminous, that required "--yes-i-really-mean-it") but sadly, the max_bytes attribute is still not there (also not after remounting on the client / using the file creation and deletion trick). Cheers, Oliver -- dan We have scheduled the remaining parts of the upgrade for Wednesday, and worst case could survive until then without quota enforcement, but it's a really strange and unexpected incompatibility. Cheers, Oliver Does that work? 
-- dan On Mon, May 27, 2019 at 11:36 AM Oliver Freyermuth wrote: Dear Cephalopodians, in the process of migrating a cluster from Luminous (12.2.12) to Mimic (13.2.5), we have upgraded the FUSE clients first (we took the chance during a time of low activity), thinking that this should not cause any issues. All MDS+MON+OSDs are still on Luminous, 12.2.12. However, it seems quotas have stopped working - with a (FUSE) Mimic client (13.2.5), I see: $ getfattr --absolute-names --only-values -n ceph.quota.max_bytes /cephfs/user/freyermu/ /cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute A Luminous client (12.2.12) on the same cluster sees: $ getfattr --absolute-names --only-values -n ceph.quota.max_bytes /cephfs/user/freyermu/ 5 It does not seem as if the attribute has been renamed (e.g. https://github.com/ceph/ceph/blob/mimic/qa/tasks/cephfs/test_quota.py still references it, same for the docs), and I have to assume the clients also do not enforce quota if they do not see it. Is this a known incompatibility between Mimic clients and a Luminous cluster? The release notes of Mimic only mention that quota support was added to the kernel client, but nothing else quota related catches my eye. Cheers, Oliver ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Oliver Freyermuth Universität Bonn Physikalisches Institut, Raum 1.047 Nußallee 12 53115 Bonn -- Tel.: +49 228 73 2367 Fax: +49 228 73 7869 -- -- Oliver Freyermuth Universität Bonn Physikalisches Institut, Raum 1.047 Nußallee 12 53115 Bonn -- Tel.: +49 228 73 2367 Fax: +49 228 73 7869 -- smime.p7s Description: S/MIME Cryptographic Signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Quotas with Mimic (CephFS-FUSE) clients in a Luminous Cluster
Dear Dan, thanks for the quick reply! Am 27.05.19 um 11:44 schrieb Dan van der Ster: Hi Oliver, We saw the same issue after upgrading to mimic. IIRC we could make the max_bytes xattr visible by touching an empty file in the dir (thereby updating the dir inode). e.g. touch /cephfs/user/freyermu/.quota; rm /cephfs/user/freyermu/.quota sadly, no, not even with sync's in between: - $ touch /cephfs/user/freyermu/.quota; sync; rm -f /cephfs/user/freyermu/.quota; sync; getfattr --absolute-names --only-values -n ceph.quota.max_bytes /cephfs/user/freyermu/ /cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute - Also restarting the FUSE client after that does not change it. Maybe this requires the rest of the cluster to be upgraded to work? I'm just guessing here, but maybe the MDS needs the file creation / update of the directory inode to "update" the way the quota attributes are exported. If something changed here with Mimic, this would explain why the "touch" is needed. And this would also explain why this might only help if the MDS is upgraded to Mimic, too. We have scheduled the remaining parts of the upgrade for Wednesday, and worst case could survive until then without quota enforcement, but it's a really strange and unexpected incompatibility. Cheers, Oliver Does that work? -- dan On Mon, May 27, 2019 at 11:36 AM Oliver Freyermuth wrote: Dear Cephalopodians, in the process of migrating a cluster from Luminous (12.2.12) to Mimic (13.2.5), we have upgraded the FUSE clients first (we took the chance during a time of low activity), thinking that this should not cause any issues. All MDS+MON+OSDs are still on Luminous, 12.2.12. 
However, it seems quotas have stopped working - with a (FUSE) Mimic client (13.2.5), I see: $ getfattr --absolute-names --only-values -n ceph.quota.max_bytes /cephfs/user/freyermu/ /cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute A Luminous client (12.2.12) on the same cluster sees: $ getfattr --absolute-names --only-values -n ceph.quota.max_bytes /cephfs/user/freyermu/ 5 It does not seem as if the attribute has been renamed (e.g. https://github.com/ceph/ceph/blob/mimic/qa/tasks/cephfs/test_quota.py still references it, same for the docs), and I have to assume the clients also do not enforce quota if they do not see it. Is this a known incompatibility between Mimic clients and a Luminous cluster? The release notes of Mimic only mention that quota support was added to the kernel client, but nothing else quota related catches my eye. Cheers, Oliver ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Oliver Freyermuth Universität Bonn Physikalisches Institut, Raum 1.047 Nußallee 12 53115 Bonn -- Tel.: +49 228 73 2367 Fax: +49 228 73 7869 -- smime.p7s Description: S/MIME Cryptographic Signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Quotas with Mimic (CephFS-FUSE) clients in a Luminous Cluster
Dear Cephalopodians,

in the process of migrating a cluster from Luminous (12.2.12) to Mimic (13.2.5), we have upgraded the FUSE clients first (we took the chance during a time of low activity), thinking that this should not cause any issues. All MDS+MON+OSDs are still on Luminous, 12.2.12.

However, it seems quotas have stopped working - with a (FUSE) Mimic client (13.2.5), I see:

$ getfattr --absolute-names --only-values -n ceph.quota.max_bytes /cephfs/user/freyermu/
/cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute

A Luminous client (12.2.12) on the same cluster sees:

$ getfattr --absolute-names --only-values -n ceph.quota.max_bytes /cephfs/user/freyermu/
5

It does not seem as if the attribute has been renamed (e.g. https://github.com/ceph/ceph/blob/mimic/qa/tasks/cephfs/test_quota.py still references it, same for the docs), and I have to assume the clients also do not enforce quotas if they do not see the attribute. Is this a known incompatibility between Mimic clients and a Luminous cluster? The release notes of Mimic only mention that quota support was added to the kernel client, but nothing else quota-related catches my eye.

Cheers,
Oliver
Re: [ceph-users] Inodes on /cephfs
Dear Yury,

Am 01.05.19 um 08:07 schrieb Yury Shevchuk:
> cephfs is not alone at this, there are other inode-less filesystems
> around. They all go with zeroes:
>
> # df -i /nfs-dir
> Filesystem              Inodes IUsed IFree IUse% Mounted on
> xxx.xxx.xx.x:/xxx/xxx/x      0     0     0     - /xxx
>
> # df -i /reiserfs-dir
> Filesystem              Inodes IUsed IFree IUse% Mounted on
> /xxx//x                      0     0     0     - /xxx/xxx//x
>
> # df -i /btrfs-dir
> Filesystem              Inodes IUsed IFree IUse% Mounted on
> /xxx/xx/                     0     0     0     - /

you are right, thanks for pointing me to these examples!

> Would YUM refuse to install on them all, including mainstream btrfs?
> I doubt that. Perhaps YUM is confused by the Inodes count that
> cephfs (alone!) reports as non-zero. Look at YUM sources?

Indeed, Yum works on all these file systems. Here's the place in the sources:
https://github.com/rpm-software-management/rpm/blob/6913360d66510e60d7b6399cd338425d663a051b/lib/transaction.c#L172
That's actually in RPM, since Yum calls RPM and the complaint comes from RPM.

Reading the sources, they just interpret the results from the statfs call. If a file system reports:
sfb.f_ffree == 0 && sfb.f_files == 0
i.e. zero total and zero free inodes, then it's assumed the file system has no notion of inodes, and the check is disabled. However, since CephFS reports something non-zero for the total count (f_files), RPM assumes it has a notion of inodes, and a check should be performed.

So indeed, another solution would be to change f_files to also report 0, as all other file systems without actual inodes seem to do. That would (in my opinion) also be more correct than what is currently done, since reporting something non-zero as f_files but zero as f_ffree logically means the file system is "full". Even df shows a more useful output with both being zero - it just shows a dash, highlighting that this is not information to be monitored.

What do you think?
Cheers, Oliver > > > -- Yury > > On Wed, May 01, 2019 at 01:23:57AM +0200, Oliver Freyermuth wrote: >> Am 01.05.19 um 00:51 schrieb Patrick Donnelly: >>> On Tue, Apr 30, 2019 at 8:01 AM Oliver Freyermuth >>> wrote: >>>> >>>> Dear Cephalopodians, >>>> >>>> we have a classic libvirtd / KVM based virtualization cluster using >>>> Ceph-RBD (librbd) as backend and sharing the libvirtd configuration >>>> between the nodes via CephFS >>>> (all on Mimic). >>>> >>>> To share the libvirtd configuration between the nodes, we have symlinked >>>> some folders from /etc/libvirt to their counterparts on /cephfs, >>>> so all nodes see the same configuration. >>>> In general, this works very well (of course, there's a "gotcha": Libvirtd >>>> needs reloading / restart for some changes to the XMLs, we have automated >>>> that), >>>> but there is one issue caused by Yum's cleverness (that's on CentOS 7). >>>> Whenever there's a libvirtd update, unattended upgrades fail, and we see: >>>> >>>>Transaction check error: >>>> installing package >>>> libvirt-daemon-driver-network-4.5.0-10.el7_6.7.x86_64 needs 2 inodes on >>>> the /cephfs filesystem >>>> installing package >>>> libvirt-daemon-config-nwfilter-4.5.0-10.el7_6.7.x86_64 needs 18 inodes on >>>> the /cephfs filesystem >>>> >>>> So it seems yum follows the symlinks and checks the available inodes on >>>> /cephfs. Sadly, that reveals: >>>>[root@kvm001 libvirt]# LANG=C df -i /cephfs/ >>>>Filesystem Inodes IUsed IFree IUse% Mounted on >>>>ceph-fuse 6868 0 100% /cephfs >>>> >>>> I think that's just because there is no real "limit" on the maximum inodes >>>> on CephFS. However, returning 0 breaks some existing tools (notably, Yum). >>>> >>>> What do you think? Should CephFS return something different than 0 here to >>>> not break existing tools? >>>> Or should the tools behave differently? But one might also argue that if >>>> the total number of Inodes matches the used number of Inodes, the FS is >>>> indeed "full". 
>>>> It's just unclear to me who to file a bug against ;-).
>>>>
>>>> Right now, I am just using:
>>>> yum -y --setopt=diskspacecheck=0 update
>>>> as a manual workaround, but this is naturally rather cumbersome.
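The RPM logic referenced above boils down to a two-line decision rule. A paraphrase of the linked transaction.c check (function names are mine, not RPM's):

```python
def fs_tracks_inodes(f_files, f_ffree):
    """RPM's heuristic, paraphrased from the linked transaction.c:
    a filesystem reporting zero total AND zero free inodes is assumed
    to have no notion of inodes at all."""
    return not (f_files == 0 and f_ffree == 0)

def inode_check_passes(f_files, f_ffree, inodes_needed):
    """Would RPM's per-filesystem inode check succeed?"""
    if not fs_tracks_inodes(f_files, f_ffree):
        return True                      # btrfs / NFS / reiserfs: check skipped
    return f_ffree >= inodes_needed      # ext4 ... and, notably, CephFS

# CephFS's statfs reply (non-zero f_files, f_ffree == 0) makes the check
# fire and fail for any package needing inodes there:
# inode_check_passes(100, 0, 2) -> False ("needs 2 inodes on /cephfs")
# inode_check_passes(0, 0, 18)  -> True  (inode-less fs: check skipped)
```

This makes it clear why setting f_files to 0 as well would route CephFS into the "no notion of inodes" branch that all the other inode-less filesystems already take.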
Re: [ceph-users] Inodes on /cephfs
Am 01.05.19 um 00:51 schrieb Patrick Donnelly: > On Tue, Apr 30, 2019 at 8:01 AM Oliver Freyermuth > wrote: >> >> Dear Cephalopodians, >> >> we have a classic libvirtd / KVM based virtualization cluster using Ceph-RBD >> (librbd) as backend and sharing the libvirtd configuration between the nodes >> via CephFS >> (all on Mimic). >> >> To share the libvirtd configuration between the nodes, we have symlinked >> some folders from /etc/libvirt to their counterparts on /cephfs, >> so all nodes see the same configuration. >> In general, this works very well (of course, there's a "gotcha": Libvirtd >> needs reloading / restart for some changes to the XMLs, we have automated >> that), >> but there is one issue caused by Yum's cleverness (that's on CentOS 7). >> Whenever there's a libvirtd update, unattended upgrades fail, and we see: >> >>Transaction check error: >> installing package >> libvirt-daemon-driver-network-4.5.0-10.el7_6.7.x86_64 needs 2 inodes on the >> /cephfs filesystem >> installing package >> libvirt-daemon-config-nwfilter-4.5.0-10.el7_6.7.x86_64 needs 18 inodes on >> the /cephfs filesystem >> >> So it seems yum follows the symlinks and checks the available inodes on >> /cephfs. Sadly, that reveals: >>[root@kvm001 libvirt]# LANG=C df -i /cephfs/ >>Filesystem Inodes IUsed IFree IUse% Mounted on >>ceph-fuse 6868 0 100% /cephfs >> >> I think that's just because there is no real "limit" on the maximum inodes >> on CephFS. However, returning 0 breaks some existing tools (notably, Yum). >> >> What do you think? Should CephFS return something different than 0 here to >> not break existing tools? >> Or should the tools behave differently? But one might also argue that if the >> total number of Inodes matches the used number of Inodes, the FS is indeed >> "full". >> It's just unclear to me who to file a bug against ;-). >> >> Right now, I am just using: >> yum -y --setopt=diskspacecheck=0 update >> as a manual workaround, but this is naturally rather cumbersome. 
> > This is fallout from [1]. See discussion on setting f_free to 0 here > [2]. In summary, userland tools are trying to be too clever by looking > at f_free. [I could be convinced to go back to f_free = ULONG_MAX if > there are other instances of this.] > > [1] https://github.com/ceph/ceph/pull/23323 > [2] https://github.com/ceph/ceph/pull/23323#issuecomment-409249911 Thanks for the references! That certainly enlightens me on why this decision was taken, and of course I congratulate upon trying to prevent false monitoring. Still, even though I don't have other instances at hand (yet), I am not yet convinced "0" is a better choice than "ULONG_MAX". It certainly alerts users / monitoring software about doing something wrong, but it prevents a check which any file system (or rather, any file system I encountered so far) allows. Yum (or other package managers doing things in a safe manner) need to ensure they can fully install a package in an "atomic" way before doing so, since rolling back may be complex or even impossible (for most file systems). So they need a way to check if a file system can store the additional files in terms of space and inodes, before placing the data there, or risk installing something only partially, and potentially being unable to roll back. In most cases, the free number of inodes allows for that check. Of course, that has no (direct) meaning for CephFS, so one might argue the tools should add an exception for CephFS - but as the discussion correctly stated, there's no defined way to find out where the file system has a notion of "free inodes", and - if we go for an exceptional treatment for a list of file systems - not even a "clean" way to find out if the file system is CephFS (the tools will only see it is FUSE for ceph-fuse) [1]. So my question is: How are tools which need to ensure that a file system can accept a given number of bytes and inodes before actually placing the data there check that in case of CephFS? 
And if they should not, how do they find out that this check which is valid on e.g. ext4 is not useful on CephFS? (or, in other words: if I would file a bug report against Yum, I could not think of any implementation they could make to solve this issue) Of course, if it's just us, we can live with the workaround. We monitor space consumption on all file systems, and may start monitoring free inodes on our ext4 file systems, such that we can safely disable the Yum check on the affected nodes. But I wonder whether this is the best way
[ceph-users] Inodes on /cephfs
Dear Cephalopodians,

we have a classic libvirtd / KVM based virtualization cluster using Ceph-RBD (librbd) as backend and sharing the libvirtd configuration between the nodes via CephFS (all on Mimic).

To share the libvirtd configuration between the nodes, we have symlinked some folders from /etc/libvirt to their counterparts on /cephfs, so all nodes see the same configuration. In general, this works very well (of course, there's a "gotcha": libvirtd needs reloading / restart for some changes to the XMLs, we have automated that), but there is one issue caused by Yum's cleverness (that's on CentOS 7). Whenever there's a libvirtd update, unattended upgrades fail, and we see:

Transaction check error:
  installing package libvirt-daemon-driver-network-4.5.0-10.el7_6.7.x86_64 needs 2 inodes on the /cephfs filesystem
  installing package libvirt-daemon-config-nwfilter-4.5.0-10.el7_6.7.x86_64 needs 18 inodes on the /cephfs filesystem

So it seems Yum follows the symlinks and checks the available inodes on /cephfs. Sadly, that reveals:

[root@kvm001 libvirt]# LANG=C df -i /cephfs/
Filesystem Inodes IUsed IFree IUse% Mounted on
ceph-fuse      68    68     0  100% /cephfs

I think that's just because there is no real "limit" on the maximum inodes on CephFS. However, returning 0 breaks some existing tools (notably, Yum).

What do you think? Should CephFS return something different than 0 here to not break existing tools? Or should the tools behave differently? But one might also argue that if the total number of inodes matches the used number of inodes, the FS is indeed "full". It's just unclear to me who to file a bug against ;-).

Right now, I am just using:
yum -y --setopt=diskspacecheck=0 update
as a manual workaround, but this is naturally rather cumbersome.

Cheers,
Oliver
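The "100% full" display above follows directly from how df -i derives IUse% from the statvfs fields. A rough sketch of that computation (my paraphrase of df's behaviour, not its actual source):

```python
import os

def iuse_percent(f_files, f_ffree):
    """IUse% roughly as df -i derives it from statvfs(): None (printed
    as '-') when the filesystem reports zero total inodes, otherwise
    used/total in percent."""
    if f_files == 0:
        return None
    return 100.0 * (f_files - f_ffree) / f_files

# CephFS as quoted: total == used, free == 0   -> reported 100% "full".
# An inode-less fs (NFS/btrfs/reiserfs): 0 / 0 -> None, df prints '-'.

# Live check against the local root filesystem:
st = os.statvfs("/")
print(iuse_percent(st.f_files, st.f_ffree))
```

So only a filesystem reporting zero total inodes gets the harmless dash; any non-zero total with zero free inodes inevitably shows up as 100%.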
[ceph-users] Some ceph config parameters default values
Dear Cephalopodians,

in some recent threads on this list, I have read about the following "knobs":

- pglog_hardlimit (false by default, available at least with 12.2.11 and 13.2.5)
- bdev_enable_discard (false by default, advanced option, no description)
- bdev_async_discard (false by default, advanced option, no description)

I am wondering about the defaults for these settings, and why they seem mostly undocumented.

It seems to me that on SSD / NVMe devices, you would always want to enable discard for significantly increased lifetime, or run fstrim regularly (which you can't with Bluestore, since it's a filesystem of its own). From personal experience, I have already lost two eMMC devices in Android phones early because trimming did not work properly. Of course, on first-generation SSD devices, "discard" may lead to data loss (which for most devices has been fixed with firmware updates, though).

I would presume that async discard is also advantageous, since it seems to queue the discards and work on them in bulk later, instead of issuing them immediately (that's what I grasp from the code). Additionally, it's unclear to me whether the bdev-discard settings also affect WAL/DB devices, which are very commonly SSD/NVMe devices in the Bluestore age.

Concerning pglog_hardlimit, I read on this list that it's safe and that it limits maximum memory consumption, especially for backfills / during recovery. So it "sounds" like this is also something that could be on by default. But maybe that is not the case yet to allow downgrades after failed upgrades?

So in the end, my question is: Is there a reason why these values are not on by default, and are also not really mentioned in the documentation? Are they just "not ready yet" / unsafe to be on by default, or are the defaults just like that because they have always been at this value, and will they change with the next major release (Nautilus)?
Cheers,
Oliver
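For anyone wanting to experiment, the two discard options discussed here can be toggled per OSD in ceph.conf. This is a sketch only, not a recommendation - whether a given device handles online discard well is exactly what this thread is questioning, so verify on test hardware first:

```ini
[osd]
# opt in to online discard on bluestore block devices (off by default)
bdev_enable_discard = true
# queue discards and issue them in the background instead of inline
bdev_async_discard = true
```

pglog_hardlimit, by contrast, appears to be a cluster flag rather than a plain config option: if I read the 12.2.11 / 13.2.5 release notes correctly, it is enabled with "ceph osd set pglog_hardlimit" once all OSDs run at least those versions, and it cannot be unset afterwards - which would fit the downgrade concern above.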
Re: [ceph-users] read-only mounts of RBD images on multiple nodes for parallel reads
Hi,

first off: I'm probably not the expert you are waiting for, but we are using CephFS for HPC / HTC (storing data files) and make use of containers for all jobs (up to ~2000 running in parallel). We also use RBD, but for our virtualization infrastructure.

While I'm always one of the first to recommend CephFS / RBD, I personally think that another (open source) file system - CVMFS - may suit your container use case significantly better. We use it to store our container images (and software in several versions). The containers are rebuilt daily.

CVMFS is read-only for the clients by design. An administrator commits changes on the "Stratum 0" server, and the clients see the new changes shortly after the commit has happened. Things are revisioned, and you can roll back in case something goes wrong.

Why did we choose CVMFS here?
- No need to have an explicit write-lock when changing things.
- Deduplication built in. We build several new containers daily and keep them for 30 days (for long-running jobs). Deduplication spares us from needing many factors more of storage. I still hope Ceph learns deduplication some day ;-).
- Extreme caching. The file system works via HTTP, i.e. you can use standard caching proxies (squids), and all clients have their own local disk cache. The deduplication also applies to that, so only unique chunks need to be fetched.

High availability is rather easy to get (not as easy as with Ceph, but you can have it by running one "Stratum 0" machine which does the writing, at least two "Stratum 1" machines syncing everything, and, if you want more performance, also at least two squid servers in front). It's a FUSE filesystem, but it performs unexpectedly well, especially for the small files you have for software and containers. The caching and deduplication heavily reduce traffic when you run many containers, especially when they all start concurrently.
That's just my 2 cents, and your mileage may vary (for example, this does not work well if the machines running the containers do not have any local storage to cache things). And maybe you do not run thousands of containers in parallel, and you do not gain as much as we do from the deduplication. If it does not fit your case, I think RBD is a good way to go, but sadly I cannot share experience with how well / stably it works with many clients mounting the volume read-only in parallel. In our virtualization, there's always only one exclusive lock on a volume. Cheers, Oliver Am 17.01.19 um 19:27 schrieb Void Star Nill: > Hi, > > We are trying to use Ceph in our products to address some of the use cases. We > think the Ceph block device fits us. One of the use cases is that we have a number > of jobs running in containers that need to have Read-Only access to shared > data. The data is written once and is consumed multiple times. I have read > through some of the similar discussions and the recommendations on using > CephFS for these situations, but in our case the block device makes more sense as > it fits well with other use cases and restrictions we have around this use > case. > > The following scenario seems to work as expected when we tried it on a test > cluster, but we wanted to get an expert opinion to see if there would be any > issues in production. The usage scenario is as follows: > > - A block device is created with the "--image-shared" option: > > rbd create mypool/foo --size 4G --image-shared > > > - The image is mapped to a host, formatted in ext4 format (or another file > system), mounted to a directory in read/write mode and data is written to > it. Please note that the image will be mapped in exclusive write mode -- no > other read/write mounts are allowed at this time. 
> > - The volume is unmapped from the host and then mapped on to N number of > other hosts where it will be mounted in read-only mode and the data is read > simultaneously from N readers > > As mentioned above, this seems to work as expected, but we wanted to confirm > that we won't run into any unexpected issues. > > Appreciate any inputs on this. > > Thanks, > Shridhar > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
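The write-once / read-many scenario Shridhar describes can be sketched as a shell sequence. The pool and image names are the ones from his example; the `ro,noload` mount options and the `run` dry-run wrapper (which only echoes the commands, so this can be inspected without a cluster) are my additions — `noload` skips ext4 journal recovery, which a read-only block device could not perform anyway:

```shell
#!/bin/sh
run() { echo "+ $*"; }   # dry-run wrapper: print instead of execute

POOL=mypool
IMG=foo

# Write phase on a single host (exclusive writer, as described above):
run rbd create "$POOL/$IMG" --size 4G --image-shared
run rbd map "$POOL/$IMG"
run mkfs.ext4 /dev/rbd0
run mount /dev/rbd0 /mnt/shared
# ... write the data once, then release the image:
run umount /mnt/shared
run rbd unmap "$POOL/$IMG"

# Read phase, on each of the N consumer hosts:
run rbd map --read-only "$POOL/$IMG"
# "ro" keeps the mount read-only; "noload" avoids any journal replay.
run mount -o ro,noload /dev/rbd0 /mnt/shared
```

Mapping with `--read-only` makes the intent explicit on the reader side, so a misconfigured reader cannot accidentally write to the shared image.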
Re: [ceph-users] Invalid RBD object maps of snapshots on Mimic
Am 10.01.19 um 16:53 schrieb Jason Dillaman: > On Thu, Jan 10, 2019 at 10:50 AM Oliver Freyermuth > wrote: >> >> Dear Jason and list, >> >> Am 10.01.19 um 16:28 schrieb Jason Dillaman: >>> On Thu, Jan 10, 2019 at 4:01 AM Oliver Freyermuth >>> wrote: >>>> >>>> Dear Cephalopodians, >>>> >>>> I performed several consistency checks now: >>>> - Exporting an RBD snapshot before and after the object map rebuilding. >>>> - Exporting a backup as raw image, all backups (re)created before and >>>> after the object map rebuilding. >>>> - md5summing all of that for a snapshot for which the rebuilding was >>>> actually needed. >>>> >>>> The good news: I found that all checksums are the same. So the backups are >>>> (at least for those I checked) not broken. >>>> >>>> I also checked the source and found: >>>> https://github.com/ceph/ceph/blob/master/src/include/rbd/object_map_types.h >>>> So to my understanding, the object map entries are OBJECT_EXISTS, but >>>> should be OBJECT_EXISTS_CLEAN. >>>> Do I understand correctly that OBJECT_EXISTS_CLEAN relates to the object >>>> being unchanged ("clean") as compared to another snapshot / the main >>>> volume? >>>> >>>> If so, this would explain why the backups, exports etc. are all okay, >>>> since the backup tools only got "too many" objects in the fast-diff and >>>> hence extracted too many objects from Ceph-RBD even though that was not >>>> needed. Since both Benji and Backy2 deduplicate again in their backends, >>>> this causes only a minor network traffic inefficiency. >>>> >>>> Is my understanding correct? >>>> Then the underlying issue would still be a bug, but (as it seems) a >>>> harmless one. >>> >>> Yes, your understanding is correct in that it's harmless from a >>> data-integrity point-of-view. 
>>> >>> During the creation of the snapshot, the current object map (for the >>> HEAD revision) is copied to a new object map for that snapshot and >>> then all the objects in the HEAD revision snapshot are marked as >>> EXISTS_CLEAN (if they EXIST). Somehow an IO operation is causing the >>> object map to think there is an update, but apparently no object >>> update is actually occurring (or at least the OSD doesn't think a >>> change occurred). >> >> thanks a lot for the clarification! Good to know my understanding is correct. >> >> I re-checked all object maps just now. Again, the most recent snapshots show >> this issue, but only those. >> The only "special" thing which probably not everybody is doing would likely >> be us running fstrim in the machines >> running from the RBD regularly, to conserve space. >> >> I am not sure how exactly the DISCARD operation is handled in rbd. But since >> this was my guess, I just did an fstrim inside one of the VMs, >> and checked the object-maps again. I get: >> 2019-01-10 16:44:25.320 7f06f67fc700 -1 librbd::ObjectMapIterateRequest: >> object map error: object rbd_data.4f587327b23c6.0040 marked as >> 1, but should be 3 >> In this case, I got it for the volume itself and not a snapshot. >> >> So it seems to me that sometimes, DISCARD causes objects to think they have >> been updated, even though they have not. >> Sadly, due to my lack of in-depth code knowledge and of a real debug setup, I >> cannot track it down further :-(. >> >> Cheers and hope that helps a code expert in tracking it down (at least it's >> not affecting data integrity), > > Thanks, that definitely provides a good investigation starting point. Should we also put it into a ticket, so it can be tracked? I could do it if you like. On the other hand, maybe you could summarize the issue more concisely than I can. 
Cheers and all the best, Oliver > >> Oliver >> >>> >>>> I'll let you know if it happens again to some of our snapshots, and if so, >>>> if it only happens to newly created ones... >>>> >>>> Cheers, >>>> Oliver >>>> >>>> Am 10.01.19 um 01:18 schrieb Oliver Freyermuth: >>>>> Dear Cephalopodians, >>>>> >>>>> inspired by >>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/
Re: [ceph-users] Invalid RBD object maps of snapshots on Mimic
Dear Jason and list, Am 10.01.19 um 16:28 schrieb Jason Dillaman: On Thu, Jan 10, 2019 at 4:01 AM Oliver Freyermuth wrote: Dear Cephalopodians, I performed several consistency checks now: - Exporting an RBD snapshot before and after the object map rebuilding. - Exporting a backup as raw image, all backups (re)created before and after the object map rebuilding. - md5summing all of that for a snapshot for which the rebuilding was actually needed. The good news: I found that all checksums are the same. So the backups are (at least for those I checked) not broken. I also checked the source and found: https://github.com/ceph/ceph/blob/master/src/include/rbd/object_map_types.h So to my understanding, the object map entries are OBJECT_EXISTS, but should be OBJECT_EXISTS_CLEAN. Do I understand correctly that OBJECT_EXISTS_CLEAN relates to the object being unchanged ("clean") as compared to another snapshot / the main volume? If so, this would explain why the backups, exports etc. are all okay, since the backup tools only got "too many" objects in the fast-diff and hence extracted too many objects from Ceph-RBD even though that was not needed. Since both Benji and Backy2 deduplicate again in their backends, this causes only a minor network traffic inefficiency. Is my understanding correct? Then the underlying issue would still be a bug, but (as it seems) a harmless one. Yes, your understanding is correct in that it's harmless from a data-integrity point-of-view. During the creation of the snapshot, the current object map (for the HEAD revision) is copied to a new object map for that snapshot and then all the objects in the HEAD revision snapshot are marked as EXISTS_CLEAN (if they EXIST). Somehow an IO operation is causing the object map to think there is an update, but apparently no object update is actually occurring (or at least the OSD doesn't think a change occurred). thanks a lot for the clarification! Good to know my understanding is correct. 
I re-checked all object maps just now. Again, the most recent snapshots show this issue, but only those. The only "special" thing which probably not everybody is doing would likely be us running fstrim in the machines running from the RBD regularly, to conserve space. I am not sure how exactly the DISCARD operation is handled in rbd. But since this was my guess, I just did an fstrim inside one of the VMs, and checked the object-maps again. I get: 2019-01-10 16:44:25.320 7f06f67fc700 -1 librbd::ObjectMapIterateRequest: object map error: object rbd_data.4f587327b23c6.0040 marked as 1, but should be 3 In this case, I got it for the volume itself and not a snapshot. So it seems to me that sometimes, DISCARD causes objects to think they have been updated, even though they have not. Sadly, due to my lack of in-depth code knowledge and of a real debug setup, I cannot track it down further :-(. Cheers and hope that helps a code expert in tracking it down (at least it's not affecting data integrity), Oliver I'll let you know if it happens again to some of our snapshots, and if so, if it only happens to newly created ones... Cheers, Oliver Am 10.01.19 um 01:18 schrieb Oliver Freyermuth: Dear Cephalopodians, inspired by http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-January/032092.html I did a check of the object-maps of our RBD volumes and snapshots. We are running 13.2.1 on the cluster I am talking about, all hosts (OSDs, MONs, RBD client nodes) still on CentOS 7.5. 
Sadly, I found that for at least 50 % of the snapshots (only the snapshots, not the volumes themselves), I got something like: -- 2019-01-09 23:00:06.481 7f89aeffd700 -1 librbd::ObjectMapIterateRequest: object map error: object rbd_data.519c46b8b4567.0260 marked as 1, but should be 3 2019-01-09 23:00:06.563 7f89aeffd700 -1 librbd::ObjectMapIterateRequest: object map error: object rbd_data.519c46b8b4567.0840 marked as 1, but should be 3 -- 2019-01-09 23:00:09.166 7fbcff7fe700 -1 librbd::ObjectMapIterateRequest: object map error: object rbd_data.519c46b8b4567.0480 marked as 1, but should be 3 2019-01-09 23:00:09.228 7fbcff7fe700 -1 librbd::ObjectMapIterateRequest: object map error: object rbd_data.519c46b8b4567.0840 marked as 1, but should be 3 -- It often appears to affect 1-3 entries in the map of a snapshot. The Object Map was *not* marked invalid before I ran the check. After rebuilding it, the check is fine again. The cluster has not yet seen any Ceph update (it was installed as 13.2.1, we plan to upgrade to 13.2.4 soonish). There have been no major causes of worries
Re: [ceph-users] Invalid RBD object maps of snapshots on Mimic
Dear Cephalopodians, I performed several consistency checks now: - Exporting an RBD snapshot before and after the object map rebuilding. - Exporting a backup as raw image, all backups (re)created before and after the object map rebuilding. - md5summing all of that for a snapshot for which the rebuilding was actually needed. The good news: I found that all checksums are the same. So the backups are (at least for those I checked) not broken. I also checked the source and found: https://github.com/ceph/ceph/blob/master/src/include/rbd/object_map_types.h So to my understanding, the object map entries are OBJECT_EXISTS, but should be OBJECT_EXISTS_CLEAN. Do I understand correctly that OBJECT_EXISTS_CLEAN relates to the object being unchanged ("clean") as compared to another snapshot / the main volume? If so, this would explain why the backups, exports etc. are all okay, since the backup tools only got "too many" objects in the fast-diff and hence extracted too many objects from Ceph-RBD even though that was not needed. Since both Benji and Backy2 deduplicate again in their backends, this causes only a minor network traffic inefficiency. Is my understanding correct? Then the underlying issue would still be a bug, but (as it seems) a harmless one. I'll let you know if it happens again to some of our snapshots, and if so, if it only happens to newly created ones... Cheers, Oliver Am 10.01.19 um 01:18 schrieb Oliver Freyermuth: Dear Cephalopodians, inspired by http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-January/032092.html I did a check of the object-maps of our RBD volumes and snapshots. We are running 13.2.1 on the cluster I am talking about, all hosts (OSDs, MONs, RBD client nodes) still on CentOS 7.5. 
Sadly, I found that for at least 50 % of the snapshots (only the snapshots, not the volumes themselves), I got something like: -- 2019-01-09 23:00:06.481 7f89aeffd700 -1 librbd::ObjectMapIterateRequest: object map error: object rbd_data.519c46b8b4567.0260 marked as 1, but should be 3 2019-01-09 23:00:06.563 7f89aeffd700 -1 librbd::ObjectMapIterateRequest: object map error: object rbd_data.519c46b8b4567.0840 marked as 1, but should be 3 -- 2019-01-09 23:00:09.166 7fbcff7fe700 -1 librbd::ObjectMapIterateRequest: object map error: object rbd_data.519c46b8b4567.0480 marked as 1, but should be 3 2019-01-09 23:00:09.228 7fbcff7fe700 -1 librbd::ObjectMapIterateRequest: object map error: object rbd_data.519c46b8b4567.0840 marked as 1, but should be 3 -- It often appears to affect 1-3 entries in the map of a snapshot. The Object Map was *not* marked invalid before I ran the check. After rebuilding it, the check is fine again. The cluster has not yet seen any Ceph update (it was installed as 13.2.1, we plan to upgrade to 13.2.4 soonish). There have been no major causes of worries so far. We purged a single OSD disk, balanced PGs with upmap, modified the CRUSH topology slightly etc. The cluster never was in a prolonged unhealthy period nor did we have to repair any PG. Is this a known error? Is it harmful, or is this just something like reference counting being off, and objects being in the map which did not really change in the snapshot? Our usecase, in case that helps to understand or reproduce: - RBDs are used as disks for qemu/kvm virtual machines. - Every night: - We run an fstrim in the VM (which propagates to RBD and purges empty blocks), fsfreeze it, take a snapshot, thaw it again. - After that, we run two backups with Benji backup ( https://benji-backup.me/ ) and Backy2 backup ( http://backy2.com/docs/ ) which seems to work rather well so far. - We purge some old snapshots. 
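The nightly fstrim / fsfreeze / snapshot / thaw cycle listed above can be sketched with virsh and rbd roughly as follows. The domain, pool and image names are made up, and the `run` wrapper only echoes the commands, so the sketch can be dry-run without a cluster:

```shell
#!/bin/sh
run() { echo "+ $*"; }    # dry-run wrapper: print instead of execute

DOM=my-vm                 # hypothetical libvirt domain name
SPEC=rbd/my-vm-disk       # hypothetical pool/image
SNAP=nightly-$(date +%F)

run virsh domfstrim "$DOM"         # discard free blocks; purges empty RBD objects
run virsh domfsfreeze "$DOM"       # quiesce guest filesystems via the agent
run rbd snap create "$SPEC@$SNAP"  # consistent point-in-time snapshot
run virsh domfsthaw "$DOM"         # resume guest I/O as soon as possible
# ... then run the Benji / Backy2 backup against "$SPEC@$SNAP",
# and prune snapshots older than the retention window.
```

Keeping the snapshot creation between freeze and thaw minimizes the window in which the guest sees frozen I/O; the backup itself reads from the snapshot afterwards and needs no freeze.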
We use the following RBD feature flags: layering, exclusive-lock, object-map, fast-diff, deep-flatten Since Benji and Backy2 are optimized for differential RBD backups to deduplicated storage, they leverage "rbd diff" (and hence make use of fast-diff, I would think). If rbd diff produces wrong output due to this issue, it would affect our backups (but it would also affect classic backups of snapshots via "rbd export"...). In case the issue is known or understood, can somebody extrapolate whether this means "rbd diff" contains too many blocks or actually misses changed blocks? We are from now on running daily, full object-map checks on all volumes and backups, and automatically rebuild any object-map which was found invalid after the check. Hopefully, this will allow us to correlate the appearance of these issues with "something" happening on the cluster. I did not detect
[ceph-users] Invalid RBD object maps of snapshots on Mimic
Dear Cephalopodians, inspired by http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-January/032092.html I did a check of the object-maps of our RBD volumes and snapshots. We are running 13.2.1 on the cluster I am talking about, all hosts (OSDs, MONs, RBD client nodes) still on CentOS 7.5. Sadly, I found that for at least 50 % of the snapshots (only the snapshots, not the volumes themselves), I got something like: -- 2019-01-09 23:00:06.481 7f89aeffd700 -1 librbd::ObjectMapIterateRequest: object map error: object rbd_data.519c46b8b4567.0260 marked as 1, but should be 3 2019-01-09 23:00:06.563 7f89aeffd700 -1 librbd::ObjectMapIterateRequest: object map error: object rbd_data.519c46b8b4567.0840 marked as 1, but should be 3 -- 2019-01-09 23:00:09.166 7fbcff7fe700 -1 librbd::ObjectMapIterateRequest: object map error: object rbd_data.519c46b8b4567.0480 marked as 1, but should be 3 2019-01-09 23:00:09.228 7fbcff7fe700 -1 librbd::ObjectMapIterateRequest: object map error: object rbd_data.519c46b8b4567.0840 marked as 1, but should be 3 -- It often appears to affect 1-3 entries in the map of a snapshot. The Object Map was *not* marked invalid before I ran the check. After rebuilding it, the check is fine again. The cluster has not yet seen any Ceph update (it was installed as 13.2.1, we plan to upgrade to 13.2.4 soonish). There have been no major causes of worries so far. We purged a single OSD disk, balanced PGs with upmap, modified the CRUSH topology slightly etc. The cluster never was in a prolonged unhealthy period nor did we have to repair any PG. Is this a known error? Is it harmful, or is this just something like reference counting being off, and objects being in the map which did not really change in the snapshot? Our usecase, in case that helps to understand or reproduce: - RBDs are used as disks for qemu/kvm virtual machines. 
- Every night: - We run an fstrim in the VM (which propagates to RBD and purges empty blocks), fsfreeze it, take a snapshot, thaw it again. - After that, we run two backups with Benji backup ( https://benji-backup.me/ ) and Backy2 backup ( http://backy2.com/docs/ ) which seems to work rather well so far. - We purge some old snapshots. We use the following RBD feature flags: layering, exclusive-lock, object-map, fast-diff, deep-flatten Since Benji and Backy2 are optimized for differential RBD backups to deduplicated storage, they leverage "rbd diff" (and hence make use of fast-diff, I would think). If rbd diff produces wrong output due to this issue, it would affect our backups (but it would also affect classic backups of snapshots via "rbd export"...). In case the issue is known or understood, can somebody extrapolate whether this means "rbd diff" contains too many blocks or actually misses changed blocks? We are from now on running daily, full object-map checks on all volumes and backups, and automatically rebuild any object-map which was found invalid after the check. Hopefully, this will allow us to correlate the appearance of these issues with "something" happening on the cluster. I did not detect a clean pattern in the affected snapshots, though; it seemed rather random... Maybe it would also help to understand this issue if somebody else using RBD in a similar manner on Mimic could also check the object-maps. Since this issue does not show up until a check is performed, this was below our radar for many months now... Cheers, Oliver
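The daily check-and-rebuild described above could look roughly like this. The pool name and the image/snapshot listing helpers are stand-ins (a real run would use `rbd ls` and `rbd snap ls`), and the `run` wrapper only echoes the rbd invocations so the loop structure can be inspected without a cluster:

```shell
#!/bin/sh
# Dry-run wrapper: print the command instead of executing it.
run() { echo "+ $*"; }

POOL=rbd   # hypothetical pool name

check_and_rebuild() {
  spec="$1"
  # "rbd object-map check" exits non-zero for an invalid map, which then
  # triggers the rebuild. (In this dry run "run" always succeeds, so the
  # rebuild branch is never taken.)
  run rbd object-map check "$spec" || run rbd object-map rebuild "$spec"
}

# Stand-ins for: rbd ls "$POOL"   and   rbd snap ls "$POOL/$img"
list_images() { printf 'vm-disk-1\nvm-disk-2\n'; }
list_snaps()  { printf 'nightly-2019-01-09\n'; }

for img in $(list_images); do
  check_and_rebuild "$POOL/$img"            # HEAD object map
  for snap in $(list_snaps "$img"); do
    check_and_rebuild "$POOL/$img@$snap"    # per-snapshot object maps
  done
done
```

Logging which specs needed a rebuild, together with timestamps, would then give the correlation data mentioned above.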
Re: [ceph-users] RBD snapshot atomicity guarantees?
Am 18.12.18 um 11:48 schrieb Hector Martin: > On 18/12/2018 18:28, Oliver Freyermuth wrote: >> We have yet to observe these hangs, we are running this with ~5 VMs with ~10 >> disks for about half a year now with daily snapshots. But all of these VMs >> have very "low" I/O, >> since we put anything I/O intensive on bare metal (but with automated >> provisioning of course). >> >> So I'll chime in on your question, especially since there might be VMs on >> our cluster in the future where the inner OS may not be running an agent. >> Since we did not observe this yet, I'll also add: What's your "scale", is it >> hundreds of VMs / disks? Hourly snapshots? I/O intensive VMs? > > 5 hosts, 15 VMs, daily snapshots. I/O is variable (customer workloads); > usually not that high, but it can easily peak at 100% when certain things > happen. We don't have great I/O performance (RBD over 1gbps links to HDD > OSDs). > > I'm poring through monitoring graphs now and I think the issue this time > around was just too much dirty data in the page cache of a guest. The VM that > failed spent 3 minutes flushing out writes to disk before its I/O was > quiesced, at around 100 IOPS throughput (the actual data throughput was low, > though, so small writes). That exceeded our timeout and then things went > south from there. > > I wasn't sure if fsfreeze did a full sync to disk, but given the I/O behavior > I'm seeing that seems to be the case. Unfortunately coming up with an upper > bound for the freeze time seems tricky now. I'm increasing our timeout to 15 > minutes, we'll see if the problem recurs. > > Given this, it makes even more sense to just avoid the freeze if at all > reasonable. There's no real way to guarantee that a fsfreeze will complete in > a "reasonable" amount of time as far as I can tell. 
Potentially, if granted arbitrary command execution by the guest agent, you could check (there might be a better interface than parsing meminfo...): cat /proc/meminfo | grep -i dirty Dirty: 19476 kB You could guess from that information how long the fsfreeze may take (ideally, combining that with allowed IOPS). Of course, if you have control over your VMs, you may also play with the vm.dirty_ratio and vm.dirty_background_ratio. Interestingly, tuned on CentOS 7 configures for a "virtual-guest" profile: vm.dirty_ratio = 30 (default is 20 %) so they optimize for performance by increasing the dirty buffers to delay writeback even more. They take the opposite for their "virtual-host" profile: vm.dirty_background_ratio = 5 (default is 10 %). I believe these choices are good for performance, but may increase the time it takes to freeze the VMs, especially if IOPS are limited and there's a lot of dirty data. Since we also have 1 Gbps links and HDD OSDs, and plan to add more and more VMs and hosts, we may also observe this one day... So I'm curious: How did you implement the timeout in your case? Are you using a qemu-agent-command issuing fsfreeze with --async and --timeout instead of domfsfreeze? We are using domfsfreeze as of now, which (probably) has an infinite timeout, or at least no timeout documented in the manpage. Cheers, Oliver
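One way to combine the two ideas above — check the dirty-page volume first, then issue a freeze that is bounded in time — is sketched below. The 512 MB threshold is arbitrary, the domain name is made up, the meminfo line is the sample value from the mail, and the `run` wrapper only echoes; `virsh qemu-agent-command` (unlike `domfsfreeze`) accepts a `--timeout`:

```shell
#!/bin/sh
run() { echo "+ $*"; }   # dry-run wrapper: print instead of execute

DOM=my-vm   # hypothetical libvirt domain

# Parse the "Dirty:" line of a meminfo dump into kB. Pure text processing,
# so it works on /proc/meminfo or on output fetched out of the guest.
dirty_kb() { awk '/^Dirty:/ {print $2}'; }

DIRTY=$(printf 'Dirty: 19476 kB\n' | dirty_kb)   # sample value from above

# At e.g. ~100 IOPS of small writes, even tens of MB of dirty data can
# take minutes to flush; postpone the snapshot if it looks too expensive.
if [ "$DIRTY" -lt 524288 ]; then   # arbitrary 512 MB threshold
  run virsh qemu-agent-command "$DOM" --timeout 900 \
    '{"execute":"guest-fsfreeze-freeze"}'
else
  echo "too much dirty data ($DIRTY kB), skipping freeze for now"
fi
```

The freeze still has to wait for the guest to flush, but the bounded timeout at least keeps the host-side tooling from hanging indefinitely.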
Re: [ceph-users] RBD snapshot atomicity guarantees?
Dear Hector, we are using the very same approach on CentOS 7 (freeze + thaw), but preceded by an fstrim. With virtio-scsi, using fstrim propagates the discards from within the VM to Ceph RBD (if qemu is configured accordingly), and a lot of space is saved. We have yet to observe these hangs; we have been running this with ~5 VMs with ~10 disks for about half a year now, with daily snapshots. But all of these VMs have very "low" I/O, since we put anything I/O intensive on bare metal (but with automated provisioning of course). So I'll chime in on your question, especially since there might be VMs on our cluster in the future where the inner OS may not be running an agent. Since we did not observe this yet, I'll also add: What's your "scale", is it hundreds of VMs / disks? Hourly snapshots? I/O intensive VMs? Cheers, Oliver Am 18.12.18 um 10:10 schrieb Hector Martin: Hi list, I'm running libvirt qemu guests on RBD, and currently taking backups by issuing a domfsfreeze, taking a snapshot, and then issuing a domfsthaw. This seems to be a common approach. This is safe, but it's impactful: the guest has frozen I/O for the duration of the snapshot. This is usually only a few seconds. Unfortunately, the freeze action doesn't seem to be very reliable. Sometimes it times out, leaving the guest in a messy situation with frozen I/O (thaw times out too when this happens, or returns success but FSes end up frozen anyway). This is clearly a bug somewhere, but I wonder whether the freeze is a hard requirement or not. Are there any atomicity guarantees for RBD snapshots taken *without* freezing the filesystem? Obviously the filesystem will be dirty and will require journal recovery, but that is okay; it's equivalent to a hard shutdown/crash. But is there any chance of corruption related to the snapshot being taken in a non-atomic fashion? 
Filesystems and applications these days should have no trouble with hard shutdowns, as long as storage writes follow ordering guarantees (no writes getting reordered across a barrier and such). Put another way: do RBD snapshots have ~identical atomicity guarantees to e.g. LVM snapshots? If we can get away without the freeze, honestly I'd rather go that route. If I really need to pause I/O during the snapshot creation, I might end up resorting to pausing the whole VM (suspend/resume), which has higher impact but also probably a much lower chance of messing up (or having excess latency), since it doesn't involve the guest OS or the qemu agent at all...
Re: [ceph-users] [Warning: Forged Email] Ceph 10.2.11 - Status not working
That's kind of unrelated to Ceph, but since you wrote two mails already, and I believe it is caused by the mailing list software for ceph-users... Your original mail distributed via the list ("[ceph-users] Ceph 10.2.11 - Status not working") did *not* have the forged-warning. Only the subsequent "Re:"-replies by yourself had it. That also matches what you will find in the archives. So my guess is that "[Warning: Forged Email]" was added by your own mailing system for the mail incoming to you after it was distributed by the ceph-users list server. That's probably because the mailman sending mail for ceph-users leaves the "From:" intact, and that contains your domain (oeg.com.au). So the mailman server for ceph-users is "forging", since it sends mail with "From: m...@oeg.com.au", but using its own IP, hence violating your SPF record. It also breaks DKIM by adding the footer (ceph-users mailing list, ceph-users@lists.ceph.com, http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com), thus manipulating the body of the mail. So in short: The mailman used for ceph-users breaks both SPF and DKIM (most mailing lists still do that). My guess is that your mailing system adds a tag "[Warning: Forged Email]" at least for mail with a "From:" matching your domain in case SPF and / or DKIM is broken. If somebody wants to "fix" this: The reason is sadly that SPF and DKIM are not well suited for mailing lists :-(. But workarounds exist. Newer mailing list software (including modern mailman releases) allows manipulating the "From:" before sending out mail, e.g. writing in the header: From: "Mike O'Connor (via ceph-users list)" Reply-To: "Mike O'Connor" With this, SPF is fine, since the mail server sending the mail is allowed to do so for @lists.ceph.com . Users can still reply just fine. Concerning DKIM, there's also a midway. The cleanest (I believe) is pruning all previous DKIM signatures on the list server and re-signing before sending it out. 
S/MIME will still break by adding the footer, but that's another matter. Cheers, Oliver Am 18.12.18 um 01:34 schrieb Mike O'Connor: > mmm wonder why the list is saying my email is forged, wonder what I have > wrong. > > My email is sent via an outbound spam filter, but I was sure I had the > SPF set correctly. > > Mike > > On 18/12/18 10:53 am, Mike O'Connor wrote: >> Hi All >> >> I have a ceph cluster which has been working with out issues for about 2 >> years now, it was upgrade about 6 month ago to 10.2.11 >> >> root@blade3:/var/lib/ceph/mon# ceph status >> 2018-12-18 10:42:39.242217 7ff770471700 0 -- 10.1.5.203:0/1608630285 >> >> 10.1.5.207:6789/0 pipe(0x7ff768000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 >> c=0x7ff768001f90).fault >> 2018-12-18 10:42:45.242745 7ff770471700 0 -- 10.1.5.203:0/1608630285 >> >> 10.1.5.207:6789/0 pipe(0x7ff7680051e0 sd=3 :0 s=1 pgs=0 cs=0 l=1 >> c=0x7ff768002410).fault >> 2018-12-18 10:42:51.243230 7ff770471700 0 -- 10.1.5.203:0/1608630285 >> >> 10.1.5.207:6789/0 pipe(0x7ff7680051e0 sd=3 :0 s=1 pgs=0 cs=0 l=1 >> c=0x7ff768002f40).fault >> 2018-12-18 10:42:54.243452 7ff770572700 0 -- 10.1.5.203:0/1608630285 >> >> 10.1.5.205:6789/0 pipe(0x7ff768000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 >> c=0x7ff768008060).fault >> 2018-12-18 10:42:57.243715 7ff770471700 0 -- 10.1.5.203:0/1608630285 >> >> 10.1.5.207:6789/0 pipe(0x7ff7680051e0 sd=3 :0 s=1 pgs=0 cs=0 l=1 >> c=0x7ff768003580).fault >> 2018-12-18 10:43:03.244280 7ff7781b9700 0 -- 10.1.5.203:0/1608630285 >> >> 10.1.5.205:6789/0 pipe(0x7ff7680051e0 sd=3 :0 s=1 pgs=0 cs=0 l=1 >> c=0x7ff768003670).fault >> >> All system can ping each other. I simple can not see why its failing. 
>> >> >> ceph.conf >> >> [global] >> auth client required = cephx >> auth cluster required = cephx >> auth service required = cephx >> cluster network = 10.1.5.0/24 >> filestore xattr use omap = true >> fsid = 42a0f015-76da-4f47-b506-da5cdacd030f >> keyring = /etc/pve/priv/$cluster.$name.keyring >> osd journal size = 5120 >> osd pool default min size = 1 >> public network = 10.1.5.0/24 >> mon_pg_warn_max_per_osd = 0 >> >> [client] >> rbd cache = true >> [osd] >> keyring = /var/lib/ceph/osd/ceph-$id/keyring >> osd max backfills = 1 >> osd recovery max active = 1 >> osd_disk_threads = 1 >> osd_disk_thread_ioprio_class = idle >> osd_disk_thread_ioprio_priority = 7 >> [mon.2] >> host = blade5 >> mon addr = 10.1.5.205:6789 >> [mon.1] >> host = blade3 >> mon addr = 10.1.5.203:6789 >> [mon.3] >> host = blade7 >> mon addr = 10.1.5.207:6789 >> [mon.0] >> host = blade1 >> mon addr = 10.1.5.201:6789 >> [mds] >> mds data = /var/lib/ceph/mds/mds.$id >> keyring = /var/lib/ceph/mds/mds.$id/mds.$id.keyring >> [mds.0] >> host = blade1 >> [mds.1] >> host = blade3 >> [mds.2] >> host = blade5 >> [mds.3] >> host = blade7 >> >> >> Any
Re: [ceph-users] Upgrade to Luminous (mon+osd)
There's also an additional issue which made us activate CEPH_AUTO_RESTART_ON_UPGRADE=yes (and, of course, not have automatic updates of Ceph): When using compression, e.g. with Snappy, it seems that for some version upgrades, already-running OSDs which try to dlopen() the snappy library become unhappy if the version mismatches expectations (i.e. symbols don't match). So effectively, it seems that in some cases you cannot get around restarting the OSDs when updating the corresponding packages. Cheers, Oliver Am 03.12.18 um 15:51 schrieb Dan van der Ster: It's not that simple, see http://tracker.ceph.com/issues/21672 For the 12.2.8 to 12.2.10 upgrade it seems the selinux module was updated -- so the rpms restart the ceph.target. What's worse is that this seems to happen before all the new updated files are in place. Our 12.2.8 to 12.2.10 upgrade procedure is: systemctl stop ceph.target yum update systemctl start ceph.target -- Dan On Mon, Dec 3, 2018 at 12:42 PM Paul Emmerich wrote: Upgrading Ceph packages does not restart the services -- exactly for this reason. This means there's something broken with your yum setup if the services are restarted when only installing the new version. Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 Am Mo., 3. Dez. 2018 um 11:56 Uhr schrieb Jan Kasprzak : Hello, ceph users, I have a small(-ish) Ceph cluster, where there are osds on each host, and in addition to that, there are mons on the first three hosts. Is it possible to upgrade the cluster to Luminous without service interruption? I have tested that when I run "yum --enablerepo Ceph update" on a mon host, the osds on that host remain down until all three mons are upgraded to Luminous. Is it possible to upgrade ceph-mon only, and keep ceph-osd running the old version (Jewel in my case) as long as possible? 
It seems RPM dependencies forbid this, but with --nodeps it could be done. Is there a supported way to upgrade a host running both a MON and OSDs to Luminous?

Thanks, -Yenya
-- | Jan "Yenya" Kasprzak | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
This is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. --P. Zaitcev

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
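Dan's workaround above boils down to a strict stop, update, start ordering on each host. A minimal dry-run sketch of that ordering (the `run` wrapper only echoes, so this touches nothing; on a real host you would drop the echo, and the "Ceph" repo name is specific to that setup):

```shell
#!/bin/sh
# Dry-run sketch of the stop -> update -> start upgrade ordering.
# "run" only echoes the commands instead of executing them.
run() { echo "+ $*"; }

upgrade_host() {
    # Stop daemons first: the RPM scriptlets may otherwise restart
    # ceph.target before all updated files are in place.
    run systemctl stop ceph.target
    run yum --enablerepo Ceph update -y
    run systemctl start ceph.target
}

upgrade_host
```

Since the whole host's daemons go down together, this is done one host at a time, waiting for HEALTH_OK in between.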
Re: [ceph-users] Customized Crush location hooks in Mimic
Dear Greg,

On 30.11.18 at 18:38, Gregory Farnum wrote:
> I’m pretty sure the monitor command there won’t move intermediate buckets like the host. This is so if an osd has incomplete metadata it doesn’t inadvertently move 11 other OSDs into a different rack/row/whatever.
>
> So in this case, it finds the host osd0001 and matches it, but since the crush map already knows about osd0001 it doesn’t pay any attention to the datacenter field.
> Whereas if you tried setting it with mynewhost, the monitor wouldn’t know where that host exists and would look at the other fields to set it in the specified data center.

thanks! That's a good and clear explanation. It was not apparent to me from the documentation, but it sounds like the safest way to go. So in the end, crush location hooks are mostly useful for freshly created OSDs, e.g. on a new host (they should then go directly to the correct rack / datacenter etc.). I wonder if that's the only sensible use case, but right now it seems to me that it is. So for our scheme, I will indeed use it for that, and when moving hosts physically, move the corresponding ceph buckets manually to the other rack / datacenter. Thanks for the explanation!
Cheers, Oliver

> -Greg
> On Fri, Nov 30, 2018 at 6:46 AM Oliver Freyermuth <freyerm...@physik.uni-bonn.de> wrote:
>
> Dear Cephalopodians,
>
> sorry for the spam, but I found the following in mon logs just now and am finally out of ideas:
>
> 2018-11-30 15:43:05.207 7f9d64aac700 0 mon.mon001@0(leader) e3 handle_command mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["1"]} v 0) v1
> 2018-11-30 15:43:05.207 7f9d64aac700 0 log_channel(audit) log [INF] : from='osd.1 10.160.12.101:6816/90528' entity='osd.1' cmd=[{"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["1"]}]: dispatch
> 2018-11-30 15:43:05.208 7f9d64aac700 0 mon.mon001@0(leader) e3 handle_command mon_command({"prefix": "osd crush create-or-move", "id": 1, "weight":3.6824, "args": ["datacenter=FTD", "host=osd001", "root=default"]} v 0) v1
> 2018-11-30 15:43:05.208 7f9d64aac700 0 log_channel(audit) log [INF] : from='osd.1 10.160.12.101:6816/90528' entity='osd.1' cmd=[{"prefix": "osd crush create-or-move", "id": 1, "weight":3.6824, "args": ["datacenter=FTD", "host=osd001", "root=default"]}]: dispatch
> 2018-11-30 15:43:05.208 7f9d64aac700 0 mon.mon001@0(leader).osd e2464 create-or-move crush item name 'osd.1' initial_weight 3.6824 at location {datacenter=FTD,host=osd001,root=default}
>
> So the request to move to datacenter=FTD arrives at the mon, but no action is taken, and the OSD is left in FTD_1.
>
> Cheers,
> Oliver
>
> On 30.11.18 at 15:25, Oliver Freyermuth wrote:
> > Dear Cephalopodians,
> >
> > further experiments revealed that the crush location hook is indeed called!
> > It's just my check (writing to a file in /tmp from inside the hook) which somehow failed. Using "logger" works for debugging.
> >
> > So now, my hook outputs:
> > host=osd001 datacenter=FTD root=default
> > as explained before. I have also explicitly created the buckets beforehand in case that is needed.
> >
> > Tree looks like that:
> > # ceph osd tree
> > ID  CLASS WEIGHT   TYPE NAME            STATUS REWEIGHT PRI-AFF
> >  -1       55.23582 root default
> >  -9              0     datacenter FTD
> > -12       18.41194     datacenter FTD_1
> >  -3       18.41194         host osd001
> >   0   hdd  3.68239             osd.0        up      1.0     1.0
> >   1   hdd  3.68239             osd.1        up      1.0     1.0
> >   2   hdd  3.68239             osd.2        up      1.0     1.0
> >   3   hdd  3.68239             osd.3        up      1.0     1.0
> >   4   hdd  3.68239             osd.4        up      1.0     1.0
> > -11              0     datacenter FTD_2
> >  -5       18.41194         host osd002
> >   5
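Greg's point above, that create-or-move ignores the location fields for a host bucket the CRUSH map already knows, can be illustrated with a toy simulation. This is plain shell standing in for monitor logic, not actual Ceph code; the host names are taken from the thread:

```shell
#!/bin/sh
# Toy model of "osd crush create-or-move" semantics: a host already in
# the CRUSH map keeps its current parent; only unknown hosts are placed
# according to the supplied location fields.
KNOWN_HOSTS="osd0001 osd0002"   # hosts already present in the CRUSH map
CURRENT_PARENT="FTD_1"          # where osd0001 currently sits

create_or_move() {
    host=$1; location=$2
    case " $KNOWN_HOSTS " in
        *" $host "*) echo "$host matched: stays under $CURRENT_PARENT, $location ignored" ;;
        *)           echo "$host unknown: placed per $location" ;;
    esac
}

create_or_move osd0001   datacenter=FTD   # existing host: no move happens
create_or_move mynewhost datacenter=FTD   # new host: location fields used
```

This is why the mon log shows the create-or-move dispatch but the OSD stays in FTD_1: the match on the existing host bucket short-circuits any evaluation of the datacenter field.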
Re: [ceph-users] Customized Crush location hooks in Mimic
Dear Cephalopodians,

sorry for the spam, but I found the following in the mon logs just now and am finally out of ideas:

2018-11-30 15:43:05.207 7f9d64aac700 0 mon.mon001@0(leader) e3 handle_command mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["1"]} v 0) v1
2018-11-30 15:43:05.207 7f9d64aac700 0 log_channel(audit) log [INF] : from='osd.1 10.160.12.101:6816/90528' entity='osd.1' cmd=[{"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["1"]}]: dispatch
2018-11-30 15:43:05.208 7f9d64aac700 0 mon.mon001@0(leader) e3 handle_command mon_command({"prefix": "osd crush create-or-move", "id": 1, "weight":3.6824, "args": ["datacenter=FTD", "host=osd001", "root=default"]} v 0) v1
2018-11-30 15:43:05.208 7f9d64aac700 0 log_channel(audit) log [INF] : from='osd.1 10.160.12.101:6816/90528' entity='osd.1' cmd=[{"prefix": "osd crush create-or-move", "id": 1, "weight":3.6824, "args": ["datacenter=FTD", "host=osd001", "root=default"]}]: dispatch
2018-11-30 15:43:05.208 7f9d64aac700 0 mon.mon001@0(leader).osd e2464 create-or-move crush item name 'osd.1' initial_weight 3.6824 at location {datacenter=FTD,host=osd001,root=default}

So the request to move to datacenter=FTD arrives at the mon, but no action is taken, and the OSD is left in FTD_1.

Cheers, Oliver

On 30.11.18 at 15:25, Oliver Freyermuth wrote:
Dear Cephalopodians, further experiments revealed that the crush location hook is indeed called! It's just my check (writing to a file in /tmp from inside the hook) which somehow failed. Using "logger" works for debugging. So now, my hook outputs:
host=osd001 datacenter=FTD root=default
as explained before. I have also explicitly created the buckets beforehand in case that is needed.
Tree looks like that:

# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME            STATUS REWEIGHT PRI-AFF
 -1       55.23582 root default
 -9              0     datacenter FTD
-12       18.41194     datacenter FTD_1
 -3       18.41194         host osd001
  0   hdd  3.68239             osd.0        up      1.0     1.0
  1   hdd  3.68239             osd.1        up      1.0     1.0
  2   hdd  3.68239             osd.2        up      1.0     1.0
  3   hdd  3.68239             osd.3        up      1.0     1.0
  4   hdd  3.68239             osd.4        up      1.0     1.0
-11              0     datacenter FTD_2
 -5       18.41194         host osd002
  5   hdd  3.68239             osd.5        up      1.0     1.0
  6   hdd  3.68239             osd.6        up      1.0     1.0
  7   hdd  3.68239             osd.7        up      1.0     1.0
  8   hdd  3.68239             osd.8        up      1.0     1.0
  9   hdd  3.68239             osd.9        up      1.0     1.0
 -7       18.41194         host osd003
 10   hdd  3.68239             osd.10       up      1.0     1.0
 11   hdd  3.68239             osd.11       up      1.0     1.0
 12   hdd  3.68239             osd.12       up      1.0     1.0
 13   hdd  3.68239             osd.13       up      1.0     1.0
 14   hdd  3.68239             osd.14       up      1.0     1.0

So naively, I would expect that when I restart osd.0, it should move itself into datacenter=FTD. But that does not happen... Any idea what I am missing?

Cheers, Oliver

On 30.11.18 at 11:44, Oliver Freyermuth wrote:
Dear Cephalopodians, I'm probably missing something obvious, but I am at a loss here on how to actually make use of a customized crush location hook. I'm currently on "ceph version 13.2.1" on CentOS 7 (i.e. the last version before the upgrade-preventing bugs). Here's what I did:

1. Write a script /usr/local/bin/customized-ceph-crush-location. The script can be executed by user "ceph":
# sudo -u ceph /usr/local/bin/customized-ceph-crush-location
host=osd001 datacenter=FTD root=default

2. Add the following to ceph.conf:
[osd]
crush_location_hook = /usr/local/bin/customized-ceph-crush-location

3. Restart an OSD and confirm that it is picked up:
# systemctl restart ceph-osd@0
# ceph config show-with-defaults osd.0
...
crush_location_hook        /usr/local/bin/customized-ceph-crush-location  file
...
osd_crush_update_on_start  true  default
...

However, the script is not executed, and I can ensure that since the script should also write a log to /tmp, which is not created. Also, the "datacenter" type does not show up in the crush tree.
I have already disabled S
Re: [ceph-users] Customized Crush location hooks in Mimic
Dear Cephalopodians,

further experiments revealed that the crush location hook is indeed called! It's just my check (writing to a file in /tmp from inside the hook) which somehow failed. Using "logger" works for debugging.

So now, my hook outputs:
host=osd001 datacenter=FTD root=default
as explained before. I have also explicitly created the buckets beforehand in case that is needed.

Tree looks like that:

# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME            STATUS REWEIGHT PRI-AFF
 -1       55.23582 root default
 -9              0     datacenter FTD
-12       18.41194     datacenter FTD_1
 -3       18.41194         host osd001
  0   hdd  3.68239             osd.0        up      1.0     1.0
  1   hdd  3.68239             osd.1        up      1.0     1.0
  2   hdd  3.68239             osd.2        up      1.0     1.0
  3   hdd  3.68239             osd.3        up      1.0     1.0
  4   hdd  3.68239             osd.4        up      1.0     1.0
-11              0     datacenter FTD_2
 -5       18.41194         host osd002
  5   hdd  3.68239             osd.5        up      1.0     1.0
  6   hdd  3.68239             osd.6        up      1.0     1.0
  7   hdd  3.68239             osd.7        up      1.0     1.0
  8   hdd  3.68239             osd.8        up      1.0     1.0
  9   hdd  3.68239             osd.9        up      1.0     1.0
 -7       18.41194         host osd003
 10   hdd  3.68239             osd.10       up      1.0     1.0
 11   hdd  3.68239             osd.11       up      1.0     1.0
 12   hdd  3.68239             osd.12       up      1.0     1.0
 13   hdd  3.68239             osd.13       up      1.0     1.0
 14   hdd  3.68239             osd.14       up      1.0     1.0

So naively, I would expect that when I restart osd.0, it should move itself into datacenter=FTD. But that does not happen... Any idea what I am missing?

Cheers, Oliver

On 30.11.18 at 11:44, Oliver Freyermuth wrote:
Dear Cephalopodians, I'm probably missing something obvious, but I am at a loss here on how to actually make use of a customized crush location hook. I'm currently on "ceph version 13.2.1" on CentOS 7 (i.e. the last version before the upgrade-preventing bugs). Here's what I did:

1. Write a script /usr/local/bin/customized-ceph-crush-location. The script can be executed by user "ceph":
# sudo -u ceph /usr/local/bin/customized-ceph-crush-location
host=osd001 datacenter=FTD root=default

2. Add the following to ceph.conf:
[osd]
crush_location_hook = /usr/local/bin/customized-ceph-crush-location

3. Restart an OSD and confirm that it is picked up:
# systemctl restart ceph-osd@0
# ceph config show-with-defaults osd.0
...
crush_location_hook        /usr/local/bin/customized-ceph-crush-location  file
...
osd_crush_update_on_start  true  default
...

However, the script is not executed, and I can ensure that since the script should also write a log to /tmp, which is not created. Also, the "datacenter" type does not show up in the crush tree. I have already disabled SELinux just to make sure. Any ideas what I am missing here?

Cheers and thanks in advance, Oliver
[ceph-users] Customized Crush location hooks in Mimic
Dear Cephalopodians,

I'm probably missing something obvious, but I am at a loss here on how to actually make use of a customized crush location hook. I'm currently on "ceph version 13.2.1" on CentOS 7 (i.e. the last version before the upgrade-preventing bugs). Here's what I did:

1. Write a script /usr/local/bin/customized-ceph-crush-location. The script can be executed by user "ceph":
# sudo -u ceph /usr/local/bin/customized-ceph-crush-location
host=osd001 datacenter=FTD root=default

2. Add the following to ceph.conf:
[osd]
crush_location_hook = /usr/local/bin/customized-ceph-crush-location

3. Restart an OSD and confirm that it is picked up:
# systemctl restart ceph-osd@0
# ceph config show-with-defaults osd.0
...
crush_location_hook        /usr/local/bin/customized-ceph-crush-location  file
...
osd_crush_update_on_start  true  default
...

However, the script is not executed, and I can ensure that since the script should also write a log to /tmp, which is not created. Also, the "datacenter" type does not show up in the crush tree. I have already disabled SELinux just to make sure. Any ideas what I am missing here?

Cheers and thanks in advance, Oliver
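For reference, a crush location hook of the kind described in this thread is just a script printing one line of key=value pairs on stdout; Ceph invokes it with arguments such as --cluster/--id/--type, which a simple hook can ignore. A minimal sketch (the datacenter value is hard-coded here purely for illustration):

```shell
#!/bin/sh
# Minimal crush location hook sketch: emit one line of key=value pairs.
# The invoking OSD passes --cluster/--id/--type arguments (ignored here).
# "datacenter=FTD" is a hard-coded placeholder for this example.
crush_location() {
    echo "host=$(hostname -s) datacenter=FTD root=default"
}
crush_location
```

Anything written to stderr (or via "logger") is useful for debugging, since stdout is parsed as the location.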
Re: [ceph-users] ceph df space usage confusion - balancing needed?
Am 27.10.18 um 04:12 schrieb Linh Vu: > Should be fine as long as your "mgr/balancer/max_misplaced" is reasonable. I > find the default value of 0.05 decent enough, although from experience that > seems like 0.05% rather than 5% as suggested here: > http://docs.ceph.com/docs/luminous/mgr/balancer/ Ok! I did actually choose 0.01. Interestingly, during the initial large rebalancing, it went up to > 2 % of misplaced objects (in small steps) until I decided to stop the balancer for a day to give the cluster enough time to adapt. > You can also choose to turn it on only during certain hours when the cluster > might be less busy. The config-keys are there somewhere (there's a post by > Dan van der Ster on the ML about them) but they don't actually work in 12.2.8 > at least, when I tried them. I suggest just use cron to turn the balancer on > and off. I found that mail in the archives. Indeed, that seems helpful. I'll start with permanently leaving the balancer on for now and observe if it has any impact. Since we rarely change the cluster's layout, it should effectively just sit there silently most of the time. Thanks! Oliver > > ---------- > *From:* Oliver Freyermuth > *Sent:* Friday, 26 October 2018 9:32:14 PM > *To:* Linh Vu; Janne Johansson > *Cc:* ceph-users@lists.ceph.com; Peter Wienemann > *Subject:* Re: [ceph-users] ceph df space usage confusion - balancing needed? > > Dear Cephalopodians, > > thanks for all your feedback! > > I finally "pushed the button" and let upmap run for ~36 hours. > Previously, we had ~63 % usage of our CephFS with only 50 % raw usage, now, > we see only 53.77 % usage. > > That's as close as I expect things to ever become, and we gained about 70 TiB > of free storage by that, which is almost one file server. > So the outcome is really close to perfection :-). > > I'm leaving the balancer active now in upmap mode. Any bad experiences with > leaving it active "forever"? 
> > Cheers and many thanks again, > Oliver > > Am 23.10.18 um 01:14 schrieb Linh Vu: >> Upmap is awesome. I ran it on our new cluster before we started ingesting >> data, so that the PG count is balanced on all OSDs. After ingesting about >> 315TB, it's still beautifully balanced. Note: we have a few nodes with 8TB >> OSDs, and the rest on 10TBs. >> >> >> # ceph osd df plain >> ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS >> 0 mf1hdd 7.27739 1.0 7.28TiB 2.06TiB 5.21TiB 28.34 1.01 144 >> 1 mf1hdd 7.27739 1.0 7.28TiB 2.07TiB 5.21TiB 28.38 1.02 144 >> 2 mf1hdd 7.27739 1.0 7.28TiB 2.03TiB 5.24TiB 27.96 1.00 142 >> 3 mf1hdd 7.27739 1.0 7.28TiB 2.06TiB 5.21TiB 28.37 1.02 144 >> 4 mf1hdd 7.27739 1.0 7.28TiB 2.03TiB 5.24TiB 27.96 1.00 142 >> 5 mf1hdd 7.27739 1.0 7.28TiB 2.02TiB 5.26TiB 27.73 0.99 141 >> 6 mf1hdd 7.27739 1.0 7.28TiB 2.03TiB 5.24TiB 27.94 1.00 142 >> 7 mf1hdd 7.27739 1.0 7.28TiB 2.06TiB 5.21TiB 28.35 1.02 144 >> 8 mf1hdd 7.27739 1.0 7.28TiB 2.02TiB 5.26TiB 27.76 0.99 141 >> 9 mf1hdd 7.27739 1.0 7.28TiB 2.04TiB 5.24TiB 27.97 1.00 142 >> 10 mf1hdd 7.27739 1.0 7.28TiB 2.06TiB 5.21TiB 28.35 1.02 144 >> 11 mf1hdd 7.27739 1.0 7.28TiB 2.04TiB 5.24TiB 27.99 1.00 142 >> 12 mf1hdd 7.27739 1.0 7.28TiB 2.02TiB 5.26TiB 27.75 0.99 141 >> 13 mf1hdd 7.27739 1.0 7.28TiB 2.03TiB 5.24TiB 27.96 1.00 142 >> 14 mf1hdd 7.27739 1.0 7.28TiB 2.02TiB 5.26TiB 27.78 0.99 141 >> 15 mf1hdd 7.27739 1.0 7.28TiB 2.07TiB 5.21TiB 28.38 1.02 144 >> 224 nvmemeta 0.02179 1.0 22.3GiB 1.52GiB 20.8GiB 6.82 0.24 185 >> 225 nvmemeta 0.02179 1.0 22.4GiB 1.49GiB 20
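The setup discussed above (upmap mode, a conservative max_misplaced, and cron-driven on/off as a workaround for the non-working time-window keys) can be sketched as a short command sequence. The `run` wrapper only echoes, so this is a dry run rather than something to paste into a live cluster:

```shell
#!/bin/sh
# Dry-run sketch of the balancer configuration from this thread;
# "run" only echoes, nothing here talks to a cluster.
run() { echo "+ $*"; }

run ceph config-key set mgr/balancer/max_misplaced 0.01   # keep churn low
run ceph balancer mode upmap                              # needs luminous+ clients
run ceph balancer on
# Since the begin/end-hour keys did not work in 12.2.8, a cron pair can
# bound the active window instead, e.g.:
#   0 22 * * *  ceph balancer on
#   0 6  * * *  ceph balancer off
```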
Re: [ceph-users] ceph df space usage confusion - balancing needed?
7.99 1.00 174 > 137 mf1hdd 8.91019 1.0 8.91TiB 2.48TiB 6.43TiB 27.82 1.00 173 > 138 mf1hdd 8.91019 1.0 8.91TiB 2.48TiB 6.43TiB 27.81 1.00 173 > 139 mf1hdd 8.91019 1.0 8.91TiB 2.48TiB 6.43TiB 27.84 1.00 173 > 140 mf1hdd 8.91019 1.0 8.91TiB 2.48TiB 6.43TiB 27.81 1.00 173 > 141 mf1hdd 8.91019 1.0 8.91TiB 2.48TiB 6.43TiB 27.82 1.00 173 > 142 mf1hdd 8.91019 1.0 8.91TiB 2.50TiB 6.41TiB 28.00 1.00 174 > 143 mf1hdd 8.91019 1.0 8.91TiB 2.48TiB 6.43TiB 27.82 1.00 173 > 240 nvmemeta 0.02179 1.0 22.3GiB 1.61GiB 20.7GiB 7.22 0.26 184 > 241 nvmemeta 0.02179 1.0 22.4GiB 1.43GiB 20.9GiB 6.41 0.23 182 > TOTAL 1.85PiB 528TiB 1.33PiB 27.93 > MIN/MAX VAR: 0.23/1.02 STDDEV: 7.10 > > ------ > *From:* ceph-users on behalf of Oliver > Freyermuth > *Sent:* Sunday, 21 October 2018 6:57:49 AM > *To:* Janne Johansson > *Cc:* ceph-users@lists.ceph.com; Peter Wienemann > *Subject:* Re: [ceph-users] ceph df space usage confusion - balancing needed? > > Ok, I'll try out the balancer end of the upcoming week then (after we've > fixed a HW-issue with one of our mons > and the cooling system). > > Until then, any further advice and whether upmap is recommended over > crush-compat (all clients are Luminous) are welcome ;-). > > Cheers, > Oliver > > Am 20.10.18 um 21:26 schrieb Janne Johansson: >> Ok, can't say "why" then, I'd reweigh them somewhat to even it out, >> 1.22 -vs- 0.74 in variance is a lot, so either a balancer plugin for >> the MGRs, a script or just a few manual tweaks might be in order. >> >> Den lör 20 okt. 2018 kl 21:02 skrev Oliver Freyermuth >> : >>> >>> All OSDs are of the very same size. One OSD host has slightly more disks >>> (33 instead of 31), though. >>> So also that that can't explain the hefty difference. >>> >>> I attach the output of "ceph osd tree" and "ceph osd df". 
>>> >>> The crush rule for the ceph_data pool is: >>> rule cephfs_data { >>> id 2 >>> type erasure >>> min_size 3 >>> max_size 6 >>> step set_chooseleaf_tries 5 >>> step set_choose_tries 100 >>> step take default class hdd >>> step chooseleaf indep 0 type host >>> step emit >>> } >>> So that only considers the hdd device class. EC is done with k=4 m=2. >>> >>> So I don't see any imbalance on the hardware level, but only a somewhat >>> uneven distribution of PGs. >>> Am I missing something, or is this really just a case for the ceph balancer >>> plugin? >>> I'm just a bit astonished this effect is so huge. >>> Maybe our 4096 PGs for the ceph_data pool are not enough to get an even >>> distribution without balancing? >>> But it yields about 100 PGs per OSD, as you can see... >>> >>> -- >>> # ceph osd tree >>> ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF >>> -1 826.26428 root default >>> -3 0.43700 host mon001 >>> 0 ssd 0.21799 osd.0 up 1.0 1.0 >>> 1 ssd 0.21799 osd.1 up 1.0 1.0 >>> -5 0.43700 host mon002 >>> 2 ssd 0.21799 osd.2 up 1.0 1.0 >>> 3 ssd 0.21799 osd.3 up 1.0 1.0 >>> -31 1.81898 host mon003 >>> 230 ssd 0.90999 osd.230 up 1.0 1.0 >>> 231 ssd 0.90999 osd.231 up 1.0 1.0 >>> -10 116.64600 host osd001 >>> 4 hdd 3.64499 osd.4 up 1.0 1.0 >>> 5 hdd 3.64499 osd.5 up 1.0 1.0 >>> 6 hdd 3.64499 osd.6 up 1.0 1.0 >>> 7 hdd 3.64499 osd.7 up 1.0 1.0 >>> 8 hdd 3.64499 osd.8 up 1.0 1.0 >>> 9 hdd 3.64499 osd.9 up 1.0 1.0 >>> 10 hdd 3.64499 osd.10 up 1.0 1.0 >>> 11 hdd 3.64499 osd.11 up 1.0 1.0 >>> 12 hdd 3.64499 osd.12 up 1.0 1.0 >>> 13 hdd 3.64499 osd.13 up 1.0 1.000
Re: [ceph-users] ceph df space usage confusion - balancing needed?
Ok, I'll try out the balancer end of the upcoming week then (after we've fixed a HW-issue with one of our mons and the cooling system). Until then, any further advice and whether upmap is recommended over crush-compat (all clients are Luminous) are welcome ;-). Cheers, Oliver Am 20.10.18 um 21:26 schrieb Janne Johansson: > Ok, can't say "why" then, I'd reweigh them somewhat to even it out, > 1.22 -vs- 0.74 in variance is a lot, so either a balancer plugin for > the MGRs, a script or just a few manual tweaks might be in order. > > Den lör 20 okt. 2018 kl 21:02 skrev Oliver Freyermuth > : >> >> All OSDs are of the very same size. One OSD host has slightly more disks (33 >> instead of 31), though. >> So also that that can't explain the hefty difference. >> >> I attach the output of "ceph osd tree" and "ceph osd df". >> >> The crush rule for the ceph_data pool is: >> rule cephfs_data { >> id 2 >> type erasure >> min_size 3 >> max_size 6 >> step set_chooseleaf_tries 5 >> step set_choose_tries 100 >> step take default class hdd >> step chooseleaf indep 0 type host >> step emit >> } >> So that only considers the hdd device class. EC is done with k=4 m=2. >> >> So I don't see any imbalance on the hardware level, but only a somewhat >> uneven distribution of PGs. >> Am I missing something, or is this really just a case for the ceph balancer >> plugin? >> I'm just a bit astonished this effect is so huge. >> Maybe our 4096 PGs for the ceph_data pool are not enough to get an even >> distribution without balancing? >> But it yields about 100 PGs per OSD, as you can see... 
>> >> -- >> # ceph osd tree >> ID CLASS WEIGHTTYPE NAME STATUS REWEIGHT PRI-AFF >> -1 826.26428 root default >> -3 0.43700 host mon001 >> 0 ssd 0.21799 osd.0 up 1.0 1.0 >> 1 ssd 0.21799 osd.1 up 1.0 1.0 >> -5 0.43700 host mon002 >> 2 ssd 0.21799 osd.2 up 1.0 1.0 >> 3 ssd 0.21799 osd.3 up 1.0 1.0 >> -31 1.81898 host mon003 >> 230 ssd 0.90999 osd.230 up 1.0 1.0 >> 231 ssd 0.90999 osd.231 up 1.0 1.0 >> -10 116.64600 host osd001 >> 4 hdd 3.64499 osd.4 up 1.0 1.0 >> 5 hdd 3.64499 osd.5 up 1.0 1.0 >> 6 hdd 3.64499 osd.6 up 1.0 1.0 >> 7 hdd 3.64499 osd.7 up 1.0 1.0 >> 8 hdd 3.64499 osd.8 up 1.0 1.0 >> 9 hdd 3.64499 osd.9 up 1.0 1.0 >> 10 hdd 3.64499 osd.10 up 1.0 1.0 >> 11 hdd 3.64499 osd.11 up 1.0 1.0 >> 12 hdd 3.64499 osd.12 up 1.0 1.0 >> 13 hdd 3.64499 osd.13 up 1.0 1.0 >> 14 hdd 3.64499 osd.14 up 1.0 1.0 >> 15 hdd 3.64499 osd.15 up 1.0 1.0 >> 16 hdd 3.64499 osd.16 up 1.0 1.0 >> 17 hdd 3.64499 osd.17 up 1.0 1.0 >> 18 hdd 3.64499 osd.18 up 1.0 1.0 >> 19 hdd 3.64499 osd.19 up 1.0 1.0 >> 20 hdd 3.64499 osd.20 up 1.0 1.0 >> 21 hdd 3.64499 osd.21 up 1.0 1.0 >> 22 hdd 3.64499 osd.22 up 1.0 1.0 >> 23 hdd 3.64499 osd.23 up 1.0 1.0 >> 24 hdd 3.64499 osd.24 up 1.0 1.0 >> 25 hdd 3.64499 osd.25 up 1.0 1.0 >> 26 hdd 3.64499 osd.26 up 1.0 1.0 >> 27 hdd 3.64499 osd.27 up 1.0 1.0 >> 28 hdd 3.64499 osd.28 up 1.0 1.0 >> 29 hdd 3.64499 osd.29 up 1.0 1.0 >> 30 hdd 3.64499 osd.30 up 1.0 1.0 >> 31 hdd 3.64499 osd.31 up 1.0 1.0 >> 32 hdd 3.64499 osd.32 up 1.0 1.0 >> 33 hdd 3.64499 osd.33 up 1.0 1.0 >> 34 hdd 3.64499 osd.34 up 1.0 1.0 >> 35 hdd 3.64499 osd.35 up 1.0 1.0 >> -13 116.64600 host osd002 >> 36 hdd 3.64499 os
Re: [ceph-users] ceph df space usage confusion - balancing needed?
76G 1949G 47.69 0.95 104 227 hdd 3.63899 1.0 3726G 1929G 1796G 51.78 1.03 113 228 hdd 3.63899 1.0 3726G 1657G 2068G 44.48 0.89 97 229 hdd 3.63899 1.0 3726G 1843G 1882G 49.47 0.98 108 TOTAL 825T 414T 410T 50.24 MIN/MAX VAR: 0.01/1.29 STDDEV: 9.22 -- Am 20.10.18 um 20:35 schrieb Janne Johansson: > Yes, if you have uneven sizes I guess you could end up in a situation > where you have > lots of 1TB OSDs and a number of 2TB OSD but pool replication forces > the pool to have one > PG replica on the 1TB OSD, then it would be possible to state "this > pool cant write more than X G" > but when it is full, there would be free space left on some of the > 2TB-OSDs, but which the pool > cant utilize. Probably same for uneven OSD hosts if you have those. > > Den lör 20 okt. 2018 kl 20:28 skrev Oliver Freyermuth > : >> >> Dear Janne, >> >> yes, of course. But since we only have two pools here, this can not explain >> the difference. >> The metadata is replicated (3 copies) across ssd drives, and we have < 3 TB >> of total raw storage for that. >> So looking at the raw space usage, we can ignore that. >> >> All the rest is used for the ceph_data pool. So the ceph_data pool, in terms >> of raw storage, is about 50 % used. >> >> But in terms of storage shown for that pool, it's almost 63 % %USED. >> So I guess this can purely be from bad balancing, correct? >> >> Cheers, >> Oliver >> >> Am 20.10.18 um 19:49 schrieb Janne Johansson: >>> Do mind that drives may have more than one pool on them, so RAW space >>> is what it says, how much free space there is. Then the avail and >>> %USED on per-pool stats will take replication into account, it can >>> tell how much data you may write into that particular pool, given that >>> pools replication or EC settings. >>> >>> Den lör 20 okt. 2018 kl 19:09 skrev Oliver Freyermuth >>> : >>>> >>>> Dear Cephalopodians, >>>> >>>> as many others, I'm also a bit confused by "ceph df" output >>>> in a pretty straightforward configuration. 
>>>> >>>> We have a CephFS (12.2.7) running, with 4+2 EC profile. >>>> >>>> I get: >>>> >>>> # ceph df >>>> GLOBAL: >>>> SIZE AVAIL RAW USED %RAW USED >>>> 824T 410T 414T 50.26 >>>> POOLS: >>>> NAMEID USED %USED MAX AVAIL OBJECTS >>>> cephfs_metadata 1 452M 0.05 860G 365774 >>>> cephfs_data 2 275T 62.68 164T 75056403 >>>> >>>> >>>> So about 50 % of raw space are used, but already ~63 % of filesystem space >>>> are used. >>>> Is this purely from imperfect balancing? >>>> In "ceph osd df", I do indeed see OSD usages spreading from 65.02 % usage >>>> down to 37.12 %. >>>> >>>> We did not yet use the balancer plugin. >>>> We don't have any pre-luminous clients. >>>> In that setup, I take it that "upmap" mode would be recommended - correct? >>>> Any "gotchas" using that on luminous? >>>> >>>> Cheers, >>>> Oliver >>>> >>>> ___ >>>> ceph-users mailing list >>>> ceph-users@lists.ceph.com >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >>> >>> >> >> > > smime.p7s Description: S/MIME Cryptographic Signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph df space usage confusion - balancing needed?
Dear Janne, yes, of course. But since we only have two pools here, this can not explain the difference. The metadata is replicated (3 copies) across ssd drives, and we have < 3 TB of total raw storage for that. So looking at the raw space usage, we can ignore that. All the rest is used for the ceph_data pool. So the ceph_data pool, in terms of raw storage, is about 50 % used. But in terms of storage shown for that pool, it's almost 63 % %USED. So I guess this can purely be from bad balancing, correct? Cheers, Oliver Am 20.10.18 um 19:49 schrieb Janne Johansson: > Do mind that drives may have more than one pool on them, so RAW space > is what it says, how much free space there is. Then the avail and > %USED on per-pool stats will take replication into account, it can > tell how much data you may write into that particular pool, given that > pools replication or EC settings. > > Den lör 20 okt. 2018 kl 19:09 skrev Oliver Freyermuth > : >> >> Dear Cephalopodians, >> >> as many others, I'm also a bit confused by "ceph df" output >> in a pretty straightforward configuration. >> >> We have a CephFS (12.2.7) running, with 4+2 EC profile. >> >> I get: >> >> # ceph df >> GLOBAL: >> SIZE AVAIL RAW USED %RAW USED >> 824T 410T 414T 50.26 >> POOLS: >> NAMEID USED %USED MAX AVAIL OBJECTS >> cephfs_metadata 1 452M 0.05 860G 365774 >> cephfs_data 2 275T 62.68 164T 75056403 >> >> >> So about 50 % of raw space are used, but already ~63 % of filesystem space >> are used. >> Is this purely from imperfect balancing? >> In "ceph osd df", I do indeed see OSD usages spreading from 65.02 % usage >> down to 37.12 %. >> >> We did not yet use the balancer plugin. >> We don't have any pre-luminous clients. >> In that setup, I take it that "upmap" mode would be recommended - correct? >> Any "gotchas" using that on luminous? 
>> >> Cheers, >> Oliver >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > smime.p7s Description: S/MIME Cryptographic Signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
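As a cross-check of the numbers in this thread: with k=4, m=2 erasure coding, each terabyte stored costs (k+m)/k = 1.5 TB raw, so the 275T in cephfs_data should map to roughly 412.5T of the reported 414T raw used (the replicated metadata is negligible). The two views are therefore consistent, and the higher pool %USED likely reflects MAX AVAIL being limited by the fullest OSDs, i.e. the balancing issue. A quick arithmetic check:

```shell
#!/bin/sh
# Sanity-check the df numbers: raw overhead for k=4,m=2 EC is (k+m)/k.
awk 'BEGIN {
    k = 4; m = 2
    stored = 275      # cephfs_data USED, in T (from "ceph df")
    raw_used = 414    # GLOBAL RAW USED, in T
    printf "expected raw: %.1fT, reported: %dT\n", stored * (k + m) / k, raw_used
}'
```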
[ceph-users] ceph df space usage confusion - balancing needed?
Dear Cephalopodians,

as many others, I'm also a bit confused by "ceph df" output in a pretty straightforward configuration. We have a CephFS (12.2.7) running, with a 4+2 EC profile. I get:

# ceph df
GLOBAL:
    SIZE  AVAIL  RAW USED  %RAW USED
    824T  410T   414T      50.26
POOLS:
    NAME             ID  USED  %USED  MAX AVAIL  OBJECTS
    cephfs_metadata  1   452M   0.05  860G       365774
    cephfs_data      2   275T  62.68  164T       75056403

So about 50 % of raw space is used, but already ~63 % of filesystem space is used. Is this purely from imperfect balancing? In "ceph osd df", I do indeed see OSD usages spreading from 65.02 % down to 37.12 %.

We did not yet use the balancer plugin. We don't have any pre-luminous clients. In that setup, I take it that "upmap" mode would be recommended - correct? Any "gotchas" using that on luminous?

Cheers, Oliver
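The 65.02 % vs 37.12 % spread mentioned above can be quantified directly from "ceph osd df" output. A sketch using sample %USE values as stand-in input (only the min and max come from this mail, the middle values are invented; on a live cluster you would extract the %USE column from `ceph osd df` instead of the printf):

```shell
#!/bin/sh
# Compute min/max/mean/stddev of per-OSD %USE values to quantify imbalance.
# The values below are stand-ins; feed real "ceph osd df" %USE data instead.
printf '%s\n' 65.02 58.3 50.1 44.9 37.12 |
awk '{
    sum += $1; sumsq += $1 * $1; n++
    if (min == "" || $1 < min) min = $1
    if ($1 > max) max = $1
} END {
    mean = sum / n
    printf "min=%.2f max=%.2f mean=%.2f stddev=%.2f\n",
           min, max, mean, sqrt(sumsq / n - mean * mean)
}'
```

A large min-to-max spread like this is exactly what the upmap balancer is meant to flatten.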
Re: [ceph-users] backup ceph
Hi, Am 21.09.18 um 03:28 schrieb ST Wong (ITSC): > Hi, > >>> Will the RAID 6 be mirrored to another storage in remote site for DR >>> purpose? >> >> Not yet. Our goal is to have the backup ceph to which we will replicate >> spread across three different buildings, with 3 replicas. > > May I ask if the backup ceph is a single ceph cluster span across 3 different > buildings, or compose of 3 ceph clusters in 3 different buildings? Thanks. > This will be a single ceph cluster with a failure domain corresponding to the building and three replicas. To test updates before rolling them out to the full cluster, we will also instantiate a small test cluster separately, but we try to keep the number of production clusters down and rather let Ceph handle failover and replication than doing that ourselves, which also allows to grow / shrink the cluster more easily as needed ;-). All the best, Oliver > Thanks again for your help. > Best Regards, > /ST Wong > > -Original Message- > From: Oliver Freyermuth > Sent: Thursday, September 20, 2018 2:10 AM > To: ST Wong (ITSC) > Cc: Peter Wienemann ; ceph-users@lists.ceph.com > Subject: Re: [ceph-users] backup ceph > > Hi, > > Am 19.09.18 um 18:32 schrieb ST Wong (ITSC): >> Thanks for your help. > > You're welcome! > I should also add we don't have very long-term experience with this yet - > Benji is pretty modern. > >>> For the moment, we use Benji to backup to a classic RAID 6. >> Will the RAID 6 be mirrored to another storage in remote site for DR purpose? > > Not yet. Our goal is to have the backup ceph to which we will replicate > spread across three different buildings, with 3 replicas. > >> >>> For RBD mirroring, you do indeed need another running Ceph Cluster, but we >>> plan to use that in the long run (on separate hardware of course). >> Seems this is the way to go, regardless of additional resources required? 
:) >> Btw, RBD mirroring looks like a DR copy instead of a daily backup from which >> we can restore image of particular date ? > > We would still perform daily snapshots, and keep those both in the RBD mirror > and in the Benji backup. Even when fading out the current RAID 6 machine at > some point, > we'd probably keep Benji and direct it's output to a CephFS pool on our > backup Ceph cluster. If anything goes wrong with the mirroring, this still > leaves us > with an independent backup approach. We also keep several days of snapshots > in the production RBD pool to be able to quickly roll back a VM if anything > goes wrong. > With Benji, you can also mount any of these daily snapshots via NBD in case > it is needed, or restore from a specific date. > > All the best, > Oliver > >> >> Thanks again. >> /st wong >> >> -Original Message- >> From: Oliver Freyermuth >> Sent: Wednesday, September 19, 2018 5:28 PM >> To: ST Wong (ITSC) >> Cc: Peter Wienemann ; ceph-users@lists.ceph.com >> Subject: Re: [ceph-users] backup ceph >> >> Hi, >> >> Am 19.09.18 um 03:24 schrieb ST Wong (ITSC): >>> Hi, >>> >>> Thanks for your information. >>> May I know more about the backup destination to use? As the size of the >>> cluster will be a bit large (~70TB to start with), we're looking for some >>> efficient method to do that backup. Seems RBD mirroring or incremental >>> snapshot s with RBD >>> (https://ceph.com/geen-categorie/incremental-snapshots-with-rbd/) are some >>> ways to go, but requires another running Ceph cluster. Is my understanding >>> correct?Thanks. >> >> For the moment, we use Benji to backup to a classic RAID 6. With Benji, only >> the changed chunks are backed up, and it learns that by asking Ceph for a >> diff of the RBD snapshots. >> So that's really fast after the first backup, and especially if you do >> trimming (e.g. via guest agent if you run VMs) of the RBD volumes before >> backing them up. 
>> The same is true for Backy2, but it does not support compression (which >> really helps by several factors(!) in saving I/O, and with zstd it does not >> use much CPU). >> >> For RBD mirroring, you do indeed need another running Ceph Cluster, but we >> plan to use that in the long run (on separate hardware of course). >> >>> Btw, is this one (https://benji-backup.me/) the Benji you're referring to? >>> Thanks a lot. >> >> Exactly :-). >> >> Cheers, >> Oliver >> >>> >>> >>> >>> Cheers
Re: [ceph-users] backup ceph
Hi, Am 19.09.18 um 18:32 schrieb ST Wong (ITSC): > Thanks for your help. You're welcome! I should also add we don't have very long-term experience with this yet - Benji is pretty modern. >> For the moment, we use Benji to backup to a classic RAID 6. > Will the RAID 6 be mirrored to another storage in remote site for DR purpose? Not yet. Our goal is to have the backup ceph to which we will replicate spread across three different buildings, with 3 replicas. > >> For RBD mirroring, you do indeed need another running Ceph Cluster, but we >> plan to use that in the long run (on separate hardware of course). > Seems this is the way to go, regardless of additional resources required? :) > Btw, RBD mirroring looks like a DR copy instead of a daily backup from which > we can restore an image of a particular date? We would still perform daily snapshots, and keep those both in the RBD mirror and in the Benji backup. Even when fading out the current RAID 6 machine at some point, we'd probably keep Benji and direct its output to a CephFS pool on our backup Ceph cluster. If anything goes wrong with the mirroring, this still leaves us with an independent backup approach. We also keep several days of snapshots in the production RBD pool to be able to quickly roll back a VM if anything goes wrong. With Benji, you can also mount any of these daily snapshots via NBD in case it is needed, or restore from a specific date. All the best, Oliver > > Thanks again. > /st wong > > -Original Message- > From: Oliver Freyermuth > Sent: Wednesday, September 19, 2018 5:28 PM > To: ST Wong (ITSC) > Cc: Peter Wienemann ; ceph-users@lists.ceph.com > Subject: Re: [ceph-users] backup ceph > > Hi, > > Am 19.09.18 um 03:24 schrieb ST Wong (ITSC): >> Hi, >> >> Thanks for your information. >> May I know more about the backup destination to use? As the size of the >> cluster will be a bit large (~70TB to start with), we're looking for some >> efficient method to do that backup. 
Seems RBD mirroring or incremental >> snapshot s with RBD >> (https://ceph.com/geen-categorie/incremental-snapshots-with-rbd/) are some >> ways to go, but requires another running Ceph cluster. Is my understanding >> correct?Thanks. > > For the moment, we use Benji to backup to a classic RAID 6. With Benji, only > the changed chunks are backed up, and it learns that by asking Ceph for a > diff of the RBD snapshots. > So that's really fast after the first backup, and especially if you do > trimming (e.g. via guest agent if you run VMs) of the RBD volumes before > backing them up. > The same is true for Backy2, but it does not support compression (which > really helps by several factors(!) in saving I/O and with zstd it does not > use much CPU). > > For RBD mirroring, you do indeed need another running Ceph Cluster, but we > plan to use that in the long run (on separate hardware of course). > >> Btw, is this one (https://benji-backup.me/) Benji you'r referring to ? >> Thanks a lot. > > Exactly :-). > > Cheers, > Oliver > >> >> >> >> Cheers, >> /ST Wong >> >> >> >> -Original Message- >> From: Oliver Freyermuth >> Sent: Tuesday, September 18, 2018 6:09 PM >> To: ST Wong (ITSC) >> Cc: Peter Wienemann >> Subject: Re: [ceph-users] backup ceph >> >> Hi, >> >> we're also just starting to collect experiences, so we have nothing to share >> (yet). However, we are evaluating using Benji (a well-maintained fork of >> Backy2 which can also compress) in addition, trimming and fsfreezing the VM >> disks shortly before, >> and additionally keeping a few daily and weekly snapshots. >> We may add RBD mirroring to a backup system in the future. >> >> Since our I/O requirements are not too high, I guess we will be fine either >> way, but any shared experience is very welcome. >> >> Cheers, >> Oliver >> >> Am 18.09.18 um 11:54 schrieb ST Wong (ITSC): >>> Hi, >>> >>> >>> >>> We're newbie to Ceph. 
Besides using incremental snapshots with RBD to >>> back up data on one Ceph cluster to another running Ceph cluster, or using >>> backup tools like backy2, will there be any recommended way to back up Ceph >>> data? Someone here suggested taking a snapshot of RBD daily and keeping 30 >>> days of them in place of backups. I wonder if this is practical and if performance >>> will be impacted. >>> >>> Thanks a lot. >>> >>> Regards >>> >>> /st wong >>> >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] backup ceph
Hi, Am 19.09.18 um 03:24 schrieb ST Wong (ITSC): > Hi, > > Thanks for your information. > May I know more about the backup destination to use? As the size of the > cluster will be a bit large (~70TB to start with), we're looking for some > efficient method to do that backup. Seems RBD mirroring or incremental > snapshots with RBD > (https://ceph.com/geen-categorie/incremental-snapshots-with-rbd/) are some > ways to go, but require another running Ceph cluster. Is my understanding > correct? Thanks. For the moment, we use Benji to back up to a classic RAID 6. With Benji, only the changed chunks are backed up, and it learns that by asking Ceph for a diff of the RBD snapshots. So that's really fast after the first backup, especially if you do trimming (e.g. via guest agent if you run VMs) of the RBD volumes before backing them up. The same is true for Backy2, but it does not support compression (which really helps by several factors(!) in saving I/O, and with zstd it does not use much CPU). For RBD mirroring, you do indeed need another running Ceph Cluster, but we plan to use that in the long run (on separate hardware of course). > Btw, is this one (https://benji-backup.me/) the Benji you're referring to? > Thanks a lot. Exactly :-). Cheers, Oliver > > > > Cheers, > /ST Wong > > > > -Original Message- > From: Oliver Freyermuth > Sent: Tuesday, September 18, 2018 6:09 PM > To: ST Wong (ITSC) > Cc: Peter Wienemann > Subject: Re: [ceph-users] backup ceph > > Hi, > > we're also just starting to collect experiences, so we have nothing to share > (yet). However, we are evaluating using Benji (a well-maintained fork of > Backy2 which can also compress) in addition, trimming and fsfreezing the VM > disks shortly before, > and additionally keeping a few daily and weekly snapshots. > We may add RBD mirroring to a backup system in the future. > > Since our I/O requirements are not too high, I guess we will be fine either > way, but any shared experience is very welcome. 
> > Cheers, > Oliver > > Am 18.09.18 um 11:54 schrieb ST Wong (ITSC): >> Hi, >> >> We're newbies to Ceph. Besides using incremental snapshots with RBD to >> back up data on one Ceph cluster to another running Ceph cluster, or using >> backup tools like backy2, will there be any recommended way to back up Ceph >> data? Someone here suggested taking a snapshot of RBD daily and keeping 30 >> days of them in place of backups. I wonder if this is practical and if performance >> will be impacted. >> >> Thanks a lot. >> >> Regards >> >> /st wong >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
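For reference, the incremental flow that tools like Benji and Backy2 automate can be sketched with plain rbd commands. This is not from the thread; the pool "rbd", image "vm-disk" and snapshot naming below are placeholders of mine:

```shell
#!/bin/sh
# Illustrative sketch of an incremental RBD backup via snapshot diffs.
# Pool, image and snapshot names are placeholders, not production values.
set -eu

POOL=rbd
IMAGE=vm-disk
TODAY=$(date +%F)

# Take today's snapshot.
rbd snap create "$POOL/$IMAGE@backup-$TODAY"

if [ -f last_snap ]; then
    # Export only the blocks changed since the previous snapshot,
    # compressed with zstd (cheap on CPU, saves a lot of I/O).
    rbd export-diff --from-snap "$(cat last_snap)" \
        "$POOL/$IMAGE@backup-$TODAY" - | zstd > "$IMAGE-$TODAY.diff.zst"
else
    # First run: export the full snapshot.
    rbd export-diff "$POOL/$IMAGE@backup-$TODAY" - | zstd > "$IMAGE-$TODAY.diff.zst"
fi

echo "backup-$TODAY" > last_snap
```

Restoring replays the diff chain onto an image with `rbd import-diff`; Benji adds deduplication, version tracking and the NBD mount feature mentioned above on top of this mechanism.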
Re: [ceph-users] CephFS Quota and ACL support
Am 28.08.18 um 07:14 schrieb Yan, Zheng: > On Mon, Aug 27, 2018 at 10:53 AM Oliver Freyermuth > wrote: >> >> Thanks for the replies. >> >> Am 27.08.18 um 19:25 schrieb Patrick Donnelly: >>> On Mon, Aug 27, 2018 at 12:51 AM, Oliver Freyermuth >>> wrote: >>>> These features are critical for us, so right now we use the Fuse client. >>>> My hope is CentOS 8 will use a recent enough kernel >>>> to get those features automatically, though. >>> >>> Your cluster needs to be running Mimic and Linux v4.17+. >>> >>> See also: https://github.com/ceph/ceph/pull/23728/files >>> >> >> Yes, I know that it's part of the official / vanilla kernel as of 4.17. >> However, I was wondering whether this functionality is also likely to be >> backported to the RedHat-maintained kernel which is also used in CentOS 7? >> Even though the kernel version is "stone-aged", it matches CentOS 7's >> userspace and RedHat is taking good care to implement fixes. >> > > We have already backported quota patches to RHEL 3.10 kernel. It may > take some time for redhat to release the new kernel. That's great news, many thanks - looking forward to it! I also noted the CephFS kernel client is now mentioned as "fully supported" with the upcoming RHEL 7.6: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7-beta/html-single/7.6_release_notes/index#new_features_file_systems Those release notes still talk about missing quota support, but I guess this will then be added soonish :-). All the best, Oliver > > Regards > Yan, Zheng > >> Seeing that even features are backported, it would be really helpful if also >> this functionality would appear as part of CentOS 7.6 / 7.7, >> especially since CentOS 8 still appears to be quite some time away. 
>> >> Cheers, >> Oliver >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
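A quick way to check a node against the 4.17 threshold discussed here (the `version_ge` helper is a small sketch of mine, not a standard tool):

```shell
# Check whether the running kernel is at least 4.17, the first mainline
# release with CephFS kernel-client quota support per this thread.
version_ge() {
    # True if $1 >= $2 in version-sort order (GNU sort -V).
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

KVER=$(uname -r | cut -d- -f1)
if version_ge "$KVER" "4.17"; then
    echo "kernel $KVER: quota support expected in the kernel client"
else
    echo "kernel $KVER: use the FUSE client for quotas"
fi
```

Note that a backported feature (as in the RHEL 3.10 kernel mentioned above) will not be caught by a plain version check; it only covers mainline kernels.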
Re: [ceph-users] CephFS Quota and ACL support
Thanks for the replies. Am 27.08.18 um 19:25 schrieb Patrick Donnelly: > On Mon, Aug 27, 2018 at 12:51 AM, Oliver Freyermuth > wrote: >> These features are critical for us, so right now we use the Fuse client. My >> hope is CentOS 8 will use a recent enough kernel >> to get those features automatically, though. > > Your cluster needs to be running Mimic and Linux v4.17+. > > See also: https://github.com/ceph/ceph/pull/23728/files > Yes, I know that it's part of the official / vanilla kernel as of 4.17. However, I was wondering whether this functionality is also likely to be backported to the RedHat-maintained kernel which is also used in CentOS 7? Even though the kernel version is "stone-aged", it matches CentOS 7's userspace and RedHat is taking good care to implement fixes. Seeing that even features are backported, it would be really helpful if this functionality would also appear as part of CentOS 7.6 / 7.7, especially since CentOS 8 still appears to be quite some time away. Cheers, Oliver
[ceph-users] CephFS Quota and ACL support
Dear Cephalopodians, sorry if this is the wrong place to ask - but does somebody know if the recently added quota support in the kernel client, and the ACL support, are going to be backported to RHEL 7 / CentOS 7 kernels? Or can someone redirect me to the correct place to ask? We don't have a RHEL subscription, but are using CentOS. These features are critical for us, so right now we use the Fuse client. My hope is CentOS 8 will use a recent enough kernel to get those features automatically, though. Cheers and thanks, Oliver
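For readers not yet using them: the quotas in question are configured via extended attributes on CephFS directories. A sketch, assuming a CephFS mounted at /mnt/cephfs (path and sizes are example values of mine):

```shell
# CephFS quotas are stored as extended attributes on directories.
# /mnt/cephfs/myproject is a placeholder path.

# Limit the directory tree to 100 GB and 100k files:
setfattr -n ceph.quota.max_bytes -v 100000000000 /mnt/cephfs/myproject
setfattr -n ceph.quota.max_files -v 100000 /mnt/cephfs/myproject

# Inspect the current byte limit:
getfattr -n ceph.quota.max_bytes /mnt/cephfs/myproject

# Remove a quota by setting it to 0:
setfattr -n ceph.quota.max_bytes -v 0 /mnt/cephfs/myproject
```

Setting the attributes works with any client; the question in this thread is only whether a given client (FUSE vs. kernel, and which kernel) actually enforces them.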
Re: [ceph-users] how can time machine know difference between cephfs fuse and kernel client?
Hi, completely different idea: Have you tried to export the "time capsule" storage via AFP (using netatalk) instead of Samba? We are also planning to offer something like this for our users (in the mid-term future), but my feeling was that compatibility with netatalk / AFP would be better than with Samba. That also appears to be the implementation consumer-grade NAS devices are using behind the scenes for their "time capsule" functionality. I also don't have experience with this (yet), but I know some users backing up their time machine data to AFP shares from NAS devices, and in general this appears to work well. Probably it won't help with the space reporting issue, but it might still be of interest for the use case? In any case, I'd be very interested in case you have experience with both, and if so, why you decided on Samba ;-). And since our plans were also to export a CephFS mounted via fuse, I'll closely follow your issue... Cheers, Oliver Am 17.08.18 um 17:13 schrieb Chad William Seys: > Hello all, > I have used cephfs served over Samba to set up a "time capsule" server. > However, I could only get this to work using the cephfs kernel module. Time > machine would give errors if cephfs were mounted with fuse. (Sorry, I didn't > write down the error messages!) > Anyone have an idea how the two methods of mounting are detectable by time > machine through Samba? > Windows 10 File History behaved the same way. Error messages are "Could > not enable File History. There is not enough space on the disk". (Although it > shows the correct amount of space.) And "File History doesn't recognize this > drive." > I'd like to use cephfs fuse for the quota support. (The kernel client is > said to support quotas with Mimic and kernel version >= 4.17, but that is too > cutting edge for me ATM.) > > Thanks! > Chad. 
> ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")
Hi together, for all others on this list, it might also be helpful to know which setups are likely affected. Does this only occur for Filestore disks, i.e. if ceph-volume has taken over taking care of these? Does it happen on every RHEL 7.5 system? We're still on 13.2.0 here and ceph-detect-init works fine on our CentOS 7.5 systems (it just echoes "systemd"). We're on Bluestore. Should we hold off on an upgrade, or are we unaffected? Cheers, Oliver Am 30.07.2018 um 09:50 schrieb ceph.nov...@habmalnefrage.de: > Hey Nathan. > > No blaming here. I'm very thankful for this great piece (ok, sometimes more of > a beast ;) ) of open-source SDS and all the great work around it incl. > community and users... and happy the problem is identified and can be fixed > for others/the future as well :) > > Well, yes, I can confirm the "error" you found here as well: > > [root@sds20 ~]# ceph-detect-init > Traceback (most recent call last): > File "/usr/bin/ceph-detect-init", line 9, in > load_entry_point('ceph-detect-init==1.0.1', 'console_scripts', > 'ceph-detect-init')() > File "/usr/lib/python2.7/site-packages/ceph_detect_init/main.py", line 56, > in run > print(ceph_detect_init.get(args.use_rhceph).init) > File "/usr/lib/python2.7/site-packages/ceph_detect_init/__init__.py", line > 42, in get > release=release) > ceph_detect_init.exc.UnsupportedPlatform: Platform is not supported.: rhel > 7.5 > > > Gesendet: Sonntag, 29. Juli 2018 um 20:33 Uhr > Von: "Nathan Cutler" > An: ceph.nov...@habmalnefrage.de, "Vasu Kulkarni" > Cc: ceph-users , "Ceph Development" > > Betreff: Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released") >> Strange... >> - wouldn't swear, but pretty sure v13.2.0 was working ok before >> - so what do others say/see? >> - no one on v13.2.1 so far (hard to believe) OR >> - just don't have this "systemctl ceph-osd.target" problem and all just >> works? 
>> >> If you also __MIGRATED__ from Luminous (say ~ v12.2.5 or older) to Mimic >> (say v13.2.0 -> v13.2.1) and __DO NOT__ see the same systemctl problems, >> whats your Linix OS and version (I'm on RHEL 7.5 here) ? :O > > Best regards > Anton > > > > Hi ceph.novice: > > I'm the one to blame for this regretful incident. Today I have > reproduced the issue in teuthology: > > 2018-07-29T18:20:07.288 INFO:teuthology.orchestra.run.ovh093:Running: > 'sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph-detect-init' > 2018-07-29T18:20:07.796 > INFO:teuthology.orchestra.run.ovh093.stderr:Traceback (most recent call > last): > 2018-07-29T18:20:07.797 INFO:teuthology.orchestra.run.ovh093.stderr: > File "/bin/ceph-detect-init", line 9, in > 2018-07-29T18:20:07.797 INFO:teuthology.orchestra.run.ovh093.stderr: > load_entry_point('ceph-detect-init==1.0.1', 'console_scripts', > 'ceph-detect-init')() > 2018-07-29T18:20:07.797 INFO:teuthology.orchestra.run.ovh093.stderr: > File "/usr/lib/python2.7/site-packages/ceph_detect_init/main.py", line > 56, in run > 2018-07-29T18:20:07.797 INFO:teuthology.orchestra.run.ovh093.stderr: > print(ceph_detect_init.get(args.use_rhceph).init) > 2018-07-29T18:20:07.797 INFO:teuthology.orchestra.run.ovh093.stderr: > File "/usr/lib/python2.7/site-packages/ceph_detect_init/__init__.py", > line 42, in get > 2018-07-29T18:20:07.797 INFO:teuthology.orchestra.run.ovh093.stderr: > release=release) > 2018-07-29T18:20:07.797 > INFO:teuthology.orchestra.run.ovh093.stderr:ceph_detect_init.exc.UnsupportedPlatform: > Platform is not supported.: rhel 7.5 > > Just to be sure, can you confirm? (I.e. issue the command > "ceph-detect-init" on your RHEL 7.5 system. Instead of saying "systemd" > it gives an error like above?) 
> > I'm working on a fix now at https://github.com/ceph/ceph/pull/23303 > > Nathan > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
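To illustrate the failure mode (this is a sketch of mine, not the actual ceph-detect-init code): the tool maps a (distro, release) pair to an init system, and a mapping that does not know "rhel" makes RHEL 7.5 fall through to the UnsupportedPlatform error seen in the traceback. Roughly:

```shell
# Illustrative sketch of the kind of platform mapping involved.
# Not the real ceph-detect-init implementation.
detect_init() {
    # $1 = distro id, $2 = release (e.g. "rhel" "7.5")
    case "$1" in
        centos|rhel)
            # EL7+ uses systemd, EL6 used sysvinit/upstart.
            [ "${2%%.*}" -ge 7 ] && echo systemd || echo sysvinit ;;
        fedora|debian|ubuntu|suse)
            echo systemd ;;
        *)
            echo "Platform is not supported.: $1 $2" >&2
            return 1 ;;
    esac
}

detect_init rhel 7.5
```

If "rhel" were missing from the first case branch, the call above would hit the unsupported branch - which matches the traceback posted earlier in this thread.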
Re: [ceph-users] Self shutdown of 1 whole system (Derbian stretch/Ceph 12.2.7/bluestore)
Am 23.07.2018 um 14:59 schrieb Nicolas Huillard: > Le lundi 23 juillet 2018 à 12:40 +0200, Oliver Freyermuth a écrit : >> Am 23.07.2018 um 11:18 schrieb Nicolas Huillard: >>> Le lundi 23 juillet 2018 à 18:23 +1000, Brad Hubbard a écrit : >>>> Ceph doesn't shut down systems as in kill or reboot the box if >>>> that's >>>> what you're saying? >>> >>> That's the first part of what I was saying, yes. I was pretty sure >>> Ceph >>> doesn't reboot/shutdown/reset, but now it's 100% sure, thanks. >>> Maybe systemd triggered something, but without any lasting traces. >>> The kernel didn't leave any more traces in kernel.log, and since >>> the >>> server was off, there was no oops remaining on the console... >> >> If there was an oops, it should also be recorded in pstore. >> If the kernel was still running and able to show a stacktrace, even >> if disk I/O has become impossible, >> it will in general dump the stacktrace to pstore (e.g. UEFI pstore if >> you boot via EFI, or ACPI pstore, if available). > > I was sure I would learn something from this thread. Thanks! > Unfortunately, those machines don't boot using UEFI, /sys/fs/pstore/ is > empty, and: > /sys/module/pstore/parameters/backend:(null) > /sys/module/pstore/parameters/update_ms:-1 > > I suppose this pstore is also shown in the BMC web interface as "Server > Health / System Log". This is empty too, and I wondered what would fill > it. Maybe I'll use UEFI boot next time. It's usually not shown anywhere else - in the end, the UEFI pstore is just permanent storage, which the Linux kernel uses to save OOPSes and other kinds of PANICs. It's very unlikely that the BMC can interpret the very same format the Linux kernel writes there. Sadly, it seems your machine does not have any backend available (unless booted via UEFI). Our machines can luckily use ACPI ERST (Error Record Serialization Table) even if legacy-booted. 
So probably, booting via UEFI is your only option (other options could be netconsole, but it is less robust / does not capture everything, or ramoops, but I've never used that). Cheers, Oliver
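For anyone wanting to check their own machines, the facilities discussed above can be inspected quickly (standard Linux sysfs paths; the output is machine-specific):

```shell
# Which pstore backend (if any) the kernel registered;
# "(null)" or a missing file means oops/panic dumps cannot be persisted.
cat /sys/module/pstore/parameters/backend 2>/dev/null || echo "no pstore module"

# Any stored crash records - oops/panic dumps survive a reboot here:
ls -l /sys/fs/pstore/ 2>/dev/null

# ACPI ERST availability shows up in the kernel boot log:
dmesg | grep -i erst
```

If the backend is "(null)" as in Nicolas' case, switching to UEFI boot (for the EFI variable backend) is what makes pstore usable.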
Re: [ceph-users] "CPU CATERR Fault" Was: Self shutdown of 1 whole system (Derbian stretch/Ceph 12.2.7/bluestore)
Am 23.07.2018 um 11:39 schrieb Nicolas Huillard: > Le lundi 23 juillet 2018 à 10:28 +0200, Caspar Smit a écrit : >> Do you have any hardware watchdog running in the system? A watchdog >> could >> trigger a powerdown if it meets some value. Any event logs from the >> chassis >> itself? > > Nice suggestions ;-) > > I see some [watchdog/N] and one [watchdogd] kernel threads, along with > a "kernel: [0.116002] NMI watchdog: enabled on all CPUs, > permanently consumes one hw-PMU counter." line in the kernel log, but > no user-land watchdog daemon: I'm not sure if the watchdog is actually > active. > > There ARE chassis/BMC/IPMI level events, one of which is "CPU CATERR > Fault", with a timestamp matching the timestamps below, and no more > information. If this kind of failure (or a less severe one) also happens at runtime, mcelog should catch it. For CATERR errors, we also found that sometimes the web interface of the BMC shows more information for the event log entry than querying the event log via ipmitool - you may want to check this. > If I understand correctly, this is a signal emitted by the CPU, to the > BMC, upon "catastrophic error" (more than "fatal"), which the BMC must > respond to the way it wants, Intel suggestions including resetting the > chassis. > > https://www.intel.in/content/dam/www/public/us/en/documents/white-paper > s/platform-level-error-strategies-paper.pdf > > Does that mean that the hardware is failing, or a neutrino just crossed > some CPU register? > CPU is a Xeon D-1521 with ECC memory. > >> Kind regards, > > Many thanks! > >> >> Caspar >> >> 2018-07-21 10:31 GMT+02:00 Nicolas Huillard : >> >>> Hi all, >>> >>> One of my server silently shutdown last night, with no explanation >>> whatsoever in any logs. 
According to the existing logs, the >>> shutdown >>> (without reboot) happened between 03:58:20.061452 (last timestamp >>> from >>> /var/log/ceph/ceph-mgr.oxygene.log) and 03:59:01.515308 (new MON >>> election called, for which oxygene didn't answer). >>> >>> Is there any way in which Ceph could silently shutdown a server? >>> Can SMART self-test influence scrubbing or compaction? >>> >>> The only thing I have is that smartd stated a long self-test on >>> both >>> OSD spinning drives on that host: >>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sda [SAT], >>> starting >>> scheduled Long Self-Test. >>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdb [SAT], >>> starting >>> scheduled Long Self-Test. >>> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdc [SAT], >>> starting >>> scheduled Long Self-Test. >>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sda [SAT], self- >>> test in >>> progress, 90% remaining >>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdb [SAT], self- >>> test in >>> progress, 90% remaining >>> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdc [SAT], >>> previous >>> self-test completed without error >>> >>> ...and smartctl now says that the self-tests didn't finish (on both >>> drives) : >>> # 1 Extended offlineInterrupted (host >>> reset) 00% 10636 >>> - >>> >>> MON logs on oxygene talks about rockdb compaction a few minutes >>> before >>> the shutdown, and a deep-scrub finished earlier: >>> /var/log/ceph/ceph-osd.6.log >>> 2018-07-21 03:32:54.086021 7fd15d82c700 0 log_channel(cluster) log >>> [DBG] >>> : 6.1d deep-scrub starts >>> 2018-07-21 03:34:31.185549 7fd15d82c700 0 log_channel(cluster) log >>> [DBG] >>> : 6.1d deep-scrub ok >>> 2018-07-21 03:43:36.720707 7fd178082700 0 -- >>> 172.22.0.16:6801/478362 >> >>> 172.21.0.16:6800/1459922146 conn(0x556f0642b800 :6801 >>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 >>> l=1).handle_connect_msg: challenging authorizer >>> >>> /var/log/ceph/ceph-mgr.oxygene.log 
>>> 2018-07-21 03:58:16.060137 7fbcd300 1 mgr send_beacon standby >>> 2018-07-21 03:58:18.060733 7fbcd300 1 mgr send_beacon standby >>> 2018-07-21 03:58:20.061452 7fbcd300 1 mgr send_beacon standby >>> >>> /var/log/ceph/ceph-mon.oxygene.log >>> 2018-07-21 03:52:27.702314 7f25b5406700 4 rocksdb: (Original Log >>> Time >>> 2018/07/21-03:52:27.702302) [/build/ceph-12.2.7/src/ >>> rocksdb/db/db_impl_compaction_flush.cc:1392] [default] Manual >>> compaction >>> from level-0 to level-1 from 'mgrstat .. ' >>> 2018-07-21 03:52:27.702321 7f25b5406700 4 rocksdb: >>> [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1403] >>> [default] [JOB >>> 1746] Compacting 1@0 + 1@1 files to L1, score -1.00 >>> 2018-07-21 03:52:27.702329 7f25b5406700 4 rocksdb: >>> [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1407] >>> [default] >>> Compaction start summary: Base version 1745 Base level 0, inputs: >>> [149507(602KB)], [149505(13MB)] >>> 2018-07-21 03:52:27.702348 7f25b5406700 4 rocksdb: EVENT_LOG_v1 >>> {"time_micros": 1532137947702334, "job": 1746, "event": >>> "compaction_started", "files_L0": [149507], "files_L1": [149505], >>> "score": >>> -1, "input_data_size": 14916379}
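To dig further into an event like the "CPU CATERR Fault" above, the usual places to look are the BMC's sensor event log and the kernel's machine-check reporting. A sketch - ipmitool and mcelog must be installed, and the record ID is a placeholder:

```shell
# Full BMC sensor event log with timestamps (sometimes terser than
# the BMC web UI, so check both as suggested above):
ipmitool sel elist

# Detail for a single record, e.g. record 0x42 (placeholder ID):
ipmitool sel get 0x42

# Machine-check errors the kernel caught while the OS was running:
mcelog --client 2>/dev/null || journalctl -u mcelog --no-pager | tail
```

A CATERR is often too severe for the OS to log anything, which is why the BMC-side log is usually the only record of it.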
Re: [ceph-users] Self shutdown of 1 whole system (Derbian stretch/Ceph 12.2.7/bluestore)
Am 23.07.2018 um 11:18 schrieb Nicolas Huillard: > Le lundi 23 juillet 2018 à 18:23 +1000, Brad Hubbard a écrit : >> Ceph doesn't shut down systems as in kill or reboot the box if that's >> what you're saying? > > That's the first part of what I was saying, yes. I was pretty sure Ceph > doesn't reboot/shutdown/reset, but now it's 100% sure, thanks. > Maybe systemd triggered something, but without any lasting traces. > The kernel didn't leave any more traces in kernel.log, and since the > server was off, there was no oops remaining on the console... If there was an oops, it should also be recorded in pstore. If the kernel was still running and able to show a stacktrace, even if disk I/O has become impossible, it will in general dump the stacktrace to pstore (e.g. UEFI pstore if you boot via EFI, or ACPI pstore, if available). Cheers, Oliver > > I'm currently activating "Auto video recording" at the BMC/IPMI level, > as that may help next time this event occurs... Triggers look like > they're tuned for Windows BSOD though... > > Thanks for all answers ;-) > >> On Mon, Jul 23, 2018 at 5:04 PM, Nicolas Huillard > .fr> wrote: >>> Le lundi 23 juillet 2018 à 11:07 +0700, Konstantin Shalygin a écrit >>> : > I even have no fancy kernel or device, just real standard > Debian. > The > uptime was 6 days since the upgrade from 12.2.6... Nicolas, you should upgrade your 12.2.6 to 12.2.7 due to bugs in this release.
Re: [ceph-users] Self shutdown of 1 whole system (Derbian stretch/Ceph 12.2.7/bluestore)
Since all services are running on these machines - are you by any chance running low on memory? Do you have a monitoring of this? We observe some strange issues with our servers if they run for a long while, and with high memory pressure (more memory is ordered...). Then, it seems our Infiniband driver can not allocate sufficiently large pages anymore, communication is lost between the Ceph nodes, recovery starts, memory usage grows even higher from this, etc. In some cases, it seems this may lead to a freeze / lockup (not reboot). My feeling is that the CentOS 7.5 kernel is not doing as well on memory compaction as the modern kernels do. Right now, this is just a hunch of mine, but my recommendation would be to have some monitoring of the machine and see if something strange happens in terms of memory usage, CPU usage, or disk I/O (e.g. iowait) to further pin down the issue. It may as well be something completely different. Other options to investigate would be a potential kernel stacktrace in pstore, or something in mcelog. Cheers, Oliver Am 21.07.2018 um 14:34 schrieb Nicolas Huillard: > I forgot to mention that this server, along with all the other Ceph > servers in my cluster, do not run anything else than Ceph, and each run > all the Ceph daemons (mon, mgr, mds, 2×osd). > > Le samedi 21 juillet 2018 à 10:31 +0200, Nicolas Huillard a écrit : >> Hi all, >> >> One of my server silently shutdown last night, with no explanation >> whatsoever in any logs. According to the existing logs, the shutdown >> (without reboot) happened between 03:58:20.061452 (last timestamp >> from >> /var/log/ceph/ceph-mgr.oxygene.log) and 03:59:01.515308 (new MON >> election called, for which oxygene didn't answer). >> >> Is there any way in which Ceph could silently shutdown a server? >> Can SMART self-test influence scrubbing or compaction? 
>> >> The only thing I have is that smartd stated a long self-test on both >> OSD spinning drives on that host: >> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sda [SAT], starting >> scheduled Long Self-Test. >> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdb [SAT], starting >> scheduled Long Self-Test. >> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdc [SAT], starting >> scheduled Long Self-Test. >> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sda [SAT], self- >> test in progress, 90% remaining >> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdb [SAT], self- >> test in progress, 90% remaining >> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdc [SAT], previous >> self-test completed without error >> >> ...and smartctl now says that the self-tests didn't finish (on both >> drives) : >> # 1 Extended offlineInterrupted (host >> reset) 00% 10636 - >> >> MON logs on oxygene talks about rockdb compaction a few minutes >> before >> the shutdown, and a deep-scrub finished earlier: >> /var/log/ceph/ceph-osd.6.log >> 2018-07-21 03:32:54.086021 7fd15d82c700 0 log_channel(cluster) log >> [DBG] : 6.1d deep-scrub starts >> 2018-07-21 03:34:31.185549 7fd15d82c700 0 log_channel(cluster) log >> [DBG] : 6.1d deep-scrub ok >> 2018-07-21 03:43:36.720707 7fd178082700 0 -- 172.22.0.16:6801/478362 172.21.0.16:6800/1459922146 conn(0x556f0642b800 :6801 >> >> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 >> l=1).handle_connect_msg: challenging authorizer >> >> /var/log/ceph/ceph-mgr.oxygene.log >> 2018-07-21 03:58:16.060137 7fbcd300 1 mgr send_beacon standby >> 2018-07-21 03:58:18.060733 7fbcd300 1 mgr send_beacon standby >> 2018-07-21 03:58:20.061452 7fbcd300 1 mgr send_beacon standby >> >> /var/log/ceph/ceph-mon.oxygene.log >> 2018-07-21 03:52:27.702314 7f25b5406700 4 rocksdb: (Original Log >> Time 2018/07/21-03:52:27.702302) [/build/ceph- >> 12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:1392] [default] >> Manual compaction from level-0 to level-1 
from 'mgrstat .. ' >> 2018-07-21 03:52:27.702321 7f25b5406700 4 rocksdb: [/build/ceph- >> 12.2.7/src/rocksdb/db/compaction_job.cc:1403] [default] [JOB 1746] >> Compacting 1@0 + 1@1 files to L1, score -1.00 >> 2018-07-21 03:52:27.702329 7f25b5406700 4 rocksdb: [/build/ceph- >> 12.2.7/src/rocksdb/db/compaction_job.cc:1407] [default] Compaction >> start summary: Base version 1745 Base level 0, inputs: >> [149507(602KB)], [149505(13MB)] >> 2018-07-21 03:52:27.702348 7f25b5406700 4 rocksdb: EVENT_LOG_v1 >> {"time_micros": 1532137947702334, "job": 1746, "event": >> "compaction_started", "files_L0": [149507], "files_L1": [149505], >> "score": -1, "input_data_size": 14916379} >> 2018-07-21 03:52:27.785532 7f25b5406700 4 rocksdb: [/build/ceph- >> 12.2.7/src/rocksdb/db/compaction_job.cc:1116] [default] [JOB 1746] >> Generated table #149508: 4904 keys, 14808953 bytes >> 2018-07-21 03:52:27.785587 7f25b5406700 4 rocksdb: EVENT_LOG_v1 >> {"time_micros": 1532137947785565, "cf_name": "default", "job": 1746, >> "event": "table_file_creation", "file_number":
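The monitoring suggested earlier in this thread can start very small. A sketch of a memory-pressure check (the 512 MB threshold is an arbitrary example value):

```shell
# Minimal memory-pressure check based on /proc/meminfo.
mem_available_kb() {
    # Print MemAvailable (in kB) from a meminfo-format file;
    # defaults to the live /proc/meminfo.
    awk '$1 == "MemAvailable:" { print $2 }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    avail=$(mem_available_kb)
    # Warn below 512 MB available - tune the threshold to your nodes.
    if [ "$avail" -lt 524288 ]; then
        echo "WARNING: only ${avail} kB available"
    else
        echo "OK: ${avail} kB available"
    fi
fi
```

Run from cron or wired into any monitoring system, a trace of MemAvailable over time would show whether the lockups correlate with memory pressure as suspected above.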
Re: [ceph-users] JBOD question
Hi Satish, that really depends entirely on your controller. For what it's worth: We have AVAGO MegaRAID controllers (9361 series). They can be switched to a "JBOD personality". After doing so and reinitializing (powercycling), the cards change PCI-ID and run a different firmware, optimized for JBOD mode (with different caching etc.). Also, the block devices are ordered differently. In that mode, new disks will be exported as JBOD by default, but you can still do RAID1 and RAID0. I think RAID5 and RAID6 are disabled, though. We are using those to have a RAID 1 for our OS and export the rest as JBOD for CephFS. So there surely are controllers which can do JBOD alongside RAID (without a special controller mode / "personality"), controllers which can be switched but still offer simple RAID levels, and I'm also sure there are controllers out there which can be switched to JBOD mode and can't do any RAID anymore in that mode. If that's the case, just go with software RAID for the OS, or install your servers with a good deployment tool so you can just reinstall them if the OS breaks (we also do that for some Ceph servers with simpler RAID controllers). With a good deployment tool, reinstalling takes one click and 40 minutes of waiting - but of course, the server will still be down until a broken OS HDD is replaced physically. But Ceph has redundancy for that :-). Cheers, Oliver Am 20.07.2018 um 23:52 schrieb Satish Patel: > Thanks Brian, > > That make sense because i was reading document and found you can > either choose RAID or JBOD > > On Fri, Jul 20, 2018 at 5:33 PM, Brian : wrote: >> Hi Satish >> >> You should be able to choose different modes of operation for each >> port / disk. Most dell servers will let you do RAID and JBOD in >> parallel. 
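For the software-RAID-for-the-OS route mentioned above, a minimal sketch - note that the device names /dev/sda2 and /dev/sdb2 are placeholders, adapt them to your own partition layout:

```shell
# Mirror the OS across two disks in software when the controller
# runs in pure JBOD mode. /dev/sda2 and /dev/sdb2 are placeholder
# partitions reserved for the OS on each disk.

# Create a RAID1 md device from the two OS partitions:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

# Persist the array definition so it assembles on boot:
mdadm --detail --scan >> /etc/mdadm.conf

# Check that both members are active ("[UU]" means both mirrors are up):
cat /proc/mdstat
```

Nothing here is Ceph-specific - it is a standard Linux md mirror, exactly the "SW RAID for your OS" Brian suggested.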
>> >> If you can't do that and can only either turn RAID on or off then you >> can use SW RAID for your OS >> >> >> On Fri, Jul 20, 2018 at 9:01 PM, Satish Patel wrote: >>> Folks, >>> >>> I never used JBOD mode before and now i am planning so i have stupid >>> question if i switch RAID controller to JBOD mode in that case how >>> does my OS disk will get mirror? >>> >>> Do i need to use software raid for OS disk when i use JBOD mode? >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > smime.p7s Description: S/MIME Cryptographic Signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Crush Rules with multiple Device Classes
Am 19.07.2018 um 08:43 schrieb Linh Vu: > Since the new NVMes are meant to replace the existing SSDs, why don't you > assign class "ssd" to the new NVMe OSDs? That way you don't need to change > the existing OSDs nor the existing crush rule. And the new NVMe OSDs won't > lose any performance, "ssd" or "nvme" is just a name. > > When you deploy the new NVMe, you can chuck this under [osd] in their local > ceph.conf: `osd_class_update_on_start = false` They should then come up with > a blank class and you can set the class to ssd afterwards. Right, this should also work. But then I'd prefer to "relabel" the existing SSDs and the crush rule to read "NVME" such that the future NVMEs will update themselves automatically without manual configuration. We are trying to keep our ceph.conf small to follow the spirit of Mimic and future releases ;-). I'll schedule this change for our next I/O pause just to be on the safe side. Thanks and all the best, Oliver > > ------ > *From:* ceph-users on behalf of Oliver > Freyermuth > *Sent:* Thursday, 19 July 2018 6:13:25 AM > *To:* ceph-users@lists.ceph.com > *Cc:* Peter Wienemann > *Subject:* [ceph-users] Crush Rules with multiple Device Classes > > Dear Cephalopodians, > > we use an SSD-only pool to store the metadata of our CephFS. > In the future, we will add a few NVMEs, and in the long-term view, replace > the existing SSDs by NVMEs, too. > > Thinking this through, I came up with three questions which I do not find > answered in the docs (yet). > > Currently, we use the following crush-rule: > > rule cephfs_metadata { > id 1 > type replicated > min_size 1 > max_size 10 > step take default class ssd > step choose firstn 0 type osd > step emit > } > > As you can see, this uses "class ssd". > > Now my first question is: > 1) Is there a way to specify "take default class (ssd or nvme)"? > Then we could just do this for the migration period, and at some point > remove "ssd". 
> > If multi-device-class in a crush rule is not supported yet, the only > workaround which comes to my mind right now is to issue: > $ ceph osd crush set-device-class nvme > for all our old SSD-backed osds, and modify the crush rule to refer to class > "nvme" straightaway. > > This leads to my second question: > 2) Since the OSD IDs do not change, Ceph should not move any data around by > changing both the device classes of the OSDs and the device class in the > crush rule - correct? > > After this operation, adding NVMEs to our cluster should let them > automatically join this crush rule, and once all SSDs are replaced with NVMEs, > the workaround is automatically gone. > > As long as the SSDs are still there, some tunables might not fit well anymore > out of the box, i.e. the "sleep" values for scrub and repair, though. > > Here my third question: > 3) Are the tunables used for NVME devices the same as for SSD devices? > I do not find any NVME tunables here: > http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/ > Only SSD, HDD and Hybrid are shown. > > Cheers, > Oliver > smime.p7s Description: S/MIME Cryptographic Signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
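The relabelling workaround described above can be sketched with the device-class commands (osd.0 stands in for each SSD-backed OSD; repeat per OSD):

```shell
# An existing device class has to be removed before a new one can be set:
ceph osd crush rm-device-class osd.0
ceph osd crush set-device-class nvme osd.0

# Verify: list all classes known to the cluster,
# and which OSDs now carry the nvme class:
ceph osd crush class ls
ceph osd crush class ls-osd nvme
```

Combined with switching the crush rule to class "nvme", the OSD IDs and weights stay the same, so no data movement should be triggered - matching the expectation in question 2 above.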
Re: [ceph-users] Crush Rules with multiple Device Classes
Am 19.07.2018 um 05:57 schrieb Konstantin Shalygin: >> Now my first question is: >> 1) Is there a way to specify "take default class (ssd or nvme)"? >>Then we could just do this for the migration period, and at some point >> remove "ssd". >> >> If multi-device-class in a crush rule is not supported yet, the only >> workaround which comes to my mind right now is to issue: >> $ ceph osd crush set-device-class nvme >> for all our old SSD-backed osds, and modify the crush rule to refer to class >> "nvme" straightaway. > > > My advice is to set class 'nvme' on your current osds with class 'ssd' > and change the crush rule to this class. > > You will have to do it eventually anyway, better sooner than later - > otherwise you would keep using the ssd class for your future drives even > once all your ssds are switched to nvme and the ssd disks are long gone. Yes, this sounds good. I'll schedule this for as soon as we have a small I/O pause in any case, just to be sure this will not interfere with ongoing I/O. Changing the old devices and the crush rule sounds like the best plan, then all future NVMEs will be handled correctly without any manual intervention. > > >> Here my third question: >> 3) Are the tunables used for NVME devices the same as for SSD devices? >>I do not find any NVME tunables here: >>http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/ >>Only SSD, HDD and Hybrid are shown. > > Ceph doesn't care about nvme/ssd. Ceph only cares whether a drive is_rotational or > not. > > > "bluefs_db_rotational": "0", > "bluefs_slow_rotational": "1", > "bluefs_wal_rotational": "0", > "bluestore_bdev_rotational": "1", > "journal_rotational": "0", > "rotational": "1" > Ah, I see! So those tunables (osd recovery sleep ssd, osd recovery sleep hdd, osd recovery sleep hybrid, and the other sleep parameters) just have a misleading name ;-). 
Thanks and all the best, Oliver > > > k > smime.p7s Description: S/MIME Cryptographic Signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
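Whether Ceph picked the hdd, ssd, or hybrid tunables for a given OSD can be checked via the rotational flags discussed above - a quick sketch (OSD id 0 and /dev/sda are placeholders):

```shell
# The flags Ceph recorded for the OSD (the same list quoted above):
ceph osd metadata 0 | grep rotational

# Kernel-side source of that information, per block device;
# 0 = non-rotational (ssd/nvme), 1 = rotational (hdd):
cat /sys/block/sda/queue/rotational
```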
[ceph-users] Crush Rules with multiple Device Classes
Dear Cephalopodians,

we use an SSD-only pool to store the metadata of our CephFS. In the future, we will add a few NVMEs, and in the long-term view, replace the existing SSDs by NVMEs, too. Thinking this through, I came up with three questions which I do not find answered in the docs (yet).

Currently, we use the following crush rule:

rule cephfs_metadata {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step choose firstn 0 type osd
        step emit
}

As you can see, this uses "class ssd".

Now my first question is:
1) Is there a way to specify "take default class (ssd or nvme)"? Then we could just do this for the migration period, and at some point remove "ssd".

If multi-device-class in a crush rule is not supported yet, the only workaround which comes to my mind right now is to issue:
$ ceph osd crush set-device-class nvme
for all our old SSD-backed osds, and modify the crush rule to refer to class "nvme" straightaway.

This leads to my second question:
2) Since the OSD IDs do not change, Ceph should not move any data around by changing both the device classes of the OSDs and the device class in the crush rule - correct?

After this operation, adding NVMEs to our cluster should let them automatically join this crush rule, and once all SSDs are replaced with NVMEs, the workaround is automatically gone. As long as the SSDs are still there, some tunables might not fit well anymore out of the box, i.e. the "sleep" values for scrub and repair, though.

Here my third question:
3) Are the tunables used for NVME devices the same as for SSD devices? I do not find any NVME tunables here: http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/ Only SSD, HDD and Hybrid are shown.

Cheers, Oliver smime.p7s Description: S/MIME Cryptographic Signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
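Switching the class referenced by such a rule can be done by round-tripping the crush map through crushtool - a sketch (the file names are arbitrary):

```shell
# Export and decompile the current crush map:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# Edit crushmap.txt, replacing "step take default class ssd"
# with "step take default class nvme" in rule cephfs_metadata, then
# recompile and inject it back:
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin
```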
Re: [ceph-users] v12.2.7 Luminous released
Am 18.07.2018 um 16:20 schrieb Sage Weil: > On Wed, 18 Jul 2018, Oliver Freyermuth wrote: >> Am 18.07.2018 um 14:20 schrieb Sage Weil: >>> On Wed, 18 Jul 2018, Linh Vu wrote: >>>> Thanks for all your hard work in putting out the fixes so quickly! :) >>>> >>>> We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS, >>>> not RGW. In the release notes, it says RGW is a risk especially the >>>> garbage collection, and the recommendation is to either pause IO or >>>> disable RGW garbage collection. >>>> >>>> >>>> In our case with CephFS, not RGW, is it a lot less risky to perform the >>>> upgrade to 12.2.7 without the need to pause IO? >>>> >>>> >>>> What does pause IO do? Do current sessions just get queued up and IO >>>> resume normally with no problem after unpausing? >>>> >>>> >>>> If we have to pause IO, is it better to do something like: pause IO, >>>> restart OSDs on one node, unpause IO - repeated for all the nodes >>>> involved in the EC pool? >> >> Hi! >> >> sorry for asking again, but... >> >>> >>> CephFS can generate a problem rados workload too when files are deleted or >>> truncated. If that isn't happening in your workload then you're probably >>> fine. If deletes are mixed in, then you might consider pausing IO for the >>> upgrade. >>> >>> FWIW, if you have been running 12.2.5 for a while and haven't encountered >>> the OSD FileStore crashes with >>> >>> src/os/filestore/FileStore.cc: 5524: FAILED assert(0 == "ERROR: source must >>> exist") >>> >>> but have had OSDs go up/down then you are probably okay. >> >> => Does this issue only affect filestore, or also bluestore? >> In your "IMPORTANT" warning mail, you wrote: >> "It seems to affect filestore and busy clusters with this specific >> workload." >> concerning this issue. >> However, the release notes do not mention explicitly that only Filestore is >> affected. >> >> Both Linh Vu and me are using Bluestore (exclusively). 
>> Are we potentially affected unless we pause I/O during the upgrade? > > The bug should apply to both FileStore and BlueStore, but we have only > seen crashes with FileStore. I'm not entirely sure why that is. One > theory is that the filestore apply timing is different and that makes the > bug more likely to happen. Another is that filestore splitting is a > "good" source of that latency that tends to trigger the bug easily. > > If it were me I would err on the safe side. :) That's certainly the choice of a sage ;-). We'll do that, too - we informed our users just now I/O will be blocked for thirty minutes or so to give us some leeway for the upgrade... They will certainly survive the pause with the nice weather outside :-). Cheers and many thanks, Oliver > > sage > smime.p7s Description: S/MIME Cryptographic Signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
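The "I/O will be blocked for thirty minutes" plan above boils down to setting the pause flag around the upgrade - a sketch:

```shell
ceph osd set noout   # keep OSDs "in" while their daemons restart
ceph osd set pause   # block all client reads and writes

# ... upgrade packages and restart the daemons here ...

ceph osd unset pause
ceph osd unset noout
ceph -s              # confirm the cluster is healthy again
```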
Re: [ceph-users] v12.2.7 Luminous released
Am 18.07.2018 um 14:20 schrieb Sage Weil: > On Wed, 18 Jul 2018, Linh Vu wrote: >> Thanks for all your hard work in putting out the fixes so quickly! :) >> >> We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS, >> not RGW. In the release notes, it says RGW is a risk especially the >> garbage collection, and the recommendation is to either pause IO or >> disable RGW garbage collection. >> >> >> In our case with CephFS, not RGW, is it a lot less risky to perform the >> upgrade to 12.2.7 without the need to pause IO? >> >> >> What does pause IO do? Do current sessions just get queued up and IO >> resume normally with no problem after unpausing? >> >> >> If we have to pause IO, is it better to do something like: pause IO, >> restart OSDs on one node, unpause IO - repeated for all the nodes >> involved in the EC pool? Hi! sorry for asking again, but... > > CephFS can generate a problem rados workload too when files are deleted or > truncated. If that isn't happening in your workload then you're probably > fine. If deletes are mixed in, then you might consider pausing IO for the > upgrade. > > FWIW, if you have been running 12.2.5 for a while and haven't encountered > the OSD FileStore crashes with > > src/os/filestore/FileStore.cc: 5524: FAILED assert(0 == "ERROR: source must > exist") > > but have had OSDs go up/down then you are probably okay. => Does this issue only affect filestore, or also bluestore? In your "IMPORTANT" warning mail, you wrote: "It seems to affect filestore and busy clusters with this specific workload." concerning this issue. However, the release notes do not mention explicitly that only Filestore is affected. Both Linh Vu and me are using Bluestore (exclusively). Are we potentially affected unless we pause I/O during the upgrade? All the best, Oliver > > Thanks! 
> sage > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > smime.p7s Description: S/MIME Cryptographic Signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v12.2.7 Luminous released
Also many thanks from my side! Am 18.07.2018 um 03:04 schrieb Linh Vu: > Thanks for all your hard work in putting out the fixes so quickly! :) > > We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS, not > RGW. In the release notes, it says RGW is a risk especially the garbage > collection, and the recommendation is to either pause IO or disable RGW > garbage collection. > > > In our case with CephFS, not RGW, is it a lot less risky to perform the > upgrade to 12.2.7 without the need to pause IO? > > > What does pause IO do? Do current sessions just get queued up and IO resume > normally with no problem after unpausing? That's my understanding, pause blocks any reads and writes. If the processes accessing CephFS do not have any wallclock-related timeout handlers, they should be fine IMHO. I'm unsure how NFS Ganesha would handle such a pause, though. But indeed I have the very same question - we also have a pure CephFS cluster, without RGW, EC-pool-backed, on 12.2.5. Should we pause IO during the upgrade, or is it safe to upgrade without pausing I/O? The update notes in the blog do not state whether a pure CephFS setup is affected. Cheers, Oliver > > > If we have to pause IO, is it better to do something like: pause IO, restart > OSDs on one node, unpause IO - repeated for all the nodes involved in the EC > pool? > > > Regards, > > Linh > > -- > *From:* ceph-users on behalf of Sage Weil > > *Sent:* Wednesday, 18 July 2018 4:42:41 AM > *To:* Stefan Kooman > *Cc:* ceph-annou...@ceph.com; ceph-de...@vger.kernel.org; > ceph-maintain...@ceph.com; ceph-us...@ceph.com > *Subject:* Re: [ceph-users] v12.2.7 Luminous released > > On Tue, 17 Jul 2018, Stefan Kooman wrote: >> Quoting Abhishek Lekshmanan (abhis...@suse.com): >> >> > *NOTE* The v12.2.5 release has a potential data corruption issue with >> > erasure coded pools. If you ran v12.2.5 with erasure coding, please see > ^^^ >> > below. 
>> >> < snip > >> >> > Upgrading from v12.2.5 or v12.2.6 >> > - >> > >> > If you used v12.2.5 or v12.2.6 in combination with erasure coded > ^ >> > pools, there is a small risk of corruption under certain workloads. >> > Specifically, when: >> >> < snip > >> >> One section mentions Luminous clusters _with_ EC pools specifically, the >> other >> section mentions Luminous clusters running 12.2.5. > > I think they both do? > >> I might be misreading this, but to make things clear for current Ceph >> Luminous 12.2.5 users. Is the following statement correct? >> >> If you do _NOT_ use EC in your 12.2.5 cluster (only replicated pools), there >> is >> no need to quiesce IO (ceph osd pause). > > Correct. > >> http://docs.ceph.com/docs/master/releases/luminous/#upgrading-from-other-versions >> If your cluster did not run v12.2.5 or v12.2.6 then none of the above >> issues apply to you and you should upgrade normally. >> >> ^^ Above section would indicate all 12.2.5 luminous clusters. > > The intent here is to clarify that any cluster running 12.2.4 or > older can upgrade without reading carefully. If the cluster > does/did run 12.2.5 or .6, then read carefully because it may (or may not) > be affected. > > Does that help? Any suggested revisions to the wording in the release > notes that make it clearer are welcome! > > Thanks- > sage > > >> >> Please clarify, >> >> Thanks, >> >> Stefan >> >> -- >> | BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351 >> | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ > ceph-users mailing
Re: [ceph-users] mds daemon damaged
Hi Kevin, Am 13.07.2018 um 04:21 schrieb Kevin: > That thread looks exactly like what I'm experiencing. Not sure why my > repeated googles didn't find it! maybe the thread was still too "fresh" for Google's indexing. > > I'm running 12.2.6 and CentOS 7 > > And yes, I recently upgraded from jewel to luminous following the > instructions of changing the repo and then updating. Everything has been > working fine up until this point > > Given that previous thread I feel at a bit of a loss as to what to try now > since that thread ended with no resolution I could see. I hope the thread is still continuing, given that another affected person just commented on it. We also planned to upgrade our production cluster to 12.2.6 (also on CentOS 7) over the weekend, since we have been affected for months by two Ceph-fuse bugs causing inconsistency of directory contents which have been fixed in 12.2.6, but given this situation, we'd rather live with that a bit longer and hold off on the update... > > Thanks for pointing that out though, it seems like almost the exact same > situation > > On 2018-07-12 18:23, Oliver Freyermuth wrote: >> Hi, >> >> all this sounds an awful lot like: >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-July/027992.html >> In that case, things started with an update to 12.2.6. Which version >> are you running? >> >> Cheers, >> Oliver >> >> Am 12.07.2018 um 23:30 schrieb Kevin: >>> Sorry for the long posting but trying to cover everything >>> >>> I woke up to find my cephfs filesystem down. This was in the logs >>> >>> 2018-07-11 05:54:10.398171 osd.1 [ERR] 2.4 full-object read crc 0x6fc2f65a >>> != expected 0x1c08241c on 2:292cf221:::200.:head >>> >>> I had one standby MDS, but as far as I can tell it did not fail over. This >>> was in the logs >>> >>> (insufficient standby MDS daemons available) >>> >>> Currently my ceph looks like this >>> cluster: >>> id: .. 
>>> health: HEALTH_ERR >>> 1 filesystem is degraded >>> 1 mds daemon damaged >>> >>> services: >>> mon: 6 daemons, quorum ds26,ds27,ds2b,ds2a,ds28,ds29 >>> mgr: ids27(active) >>> mds: test-cephfs-1-0/1/1 up , 3 up:standby, 1 damaged >>> osd: 5 osds: 5 up, 5 in >>> >>> data: >>> pools: 3 pools, 202 pgs >>> objects: 1013k objects, 4018 GB >>> usage: 12085 GB used, 6544 GB / 18630 GB avail >>> pgs: 201 active+clean >>> 1 active+clean+scrubbing+deep >>> >>> io: >>> client: 0 B/s rd, 0 op/s rd, 0 op/s wr >>> >>> I started trying to get the damaged MDS back online >>> >>> Based on this page >>> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts >>> >>> # cephfs-journal-tool journal export backup.bin >>> 2018-07-12 13:35:15.675964 7f3e1389bf00 -1 Header 200. is unreadable >>> 2018-07-12 13:35:15.675977 7f3e1389bf00 -1 journal_export: Journal not >>> readable, attempt object-by-object dump with `rados` >>> Error ((5) Input/output error) >>> >>> # cephfs-journal-tool event recover_dentries summary >>> Events by type: >>> 2018-07-12 13:36:03.000590 7fc398a18f00 -1 Header 200. 
is >>> unreadableErrors: 0 >>> >>> cephfs-journal-tool journal reset - (I think this command might have worked) >>> >>> Next up, tried to reset the filesystem >>> >>> ceph fs reset test-cephfs-1 --yes-i-really-mean-it >>> >>> Each time same errors >>> >>> 2018-07-12 11:56:35.760449 mon.ds26 [INF] Health check cleared: MDS_DAMAGE >>> (was: 1 mds daemon damaged) >>> 2018-07-12 11:56:35.856737 mon.ds26 [INF] Standby daemon mds.ds27 assigned >>> to filesystem test-cephfs-1 as rank 0 >>> 2018-07-12 11:56:35.947801 mds.ds27 [ERR] Error recovering journal 0x200: >>> (5) Input/output error >>> 2018-07-12 11:56:36.900807 mon.ds26 [ERR] Health check failed: 1 mds daemon >>> damaged (MDS_DAMAGE) >>> 2018-07-12 11:56:35.945544 osd.0 [ERR] 2.4 full-object read crc 0x6fc2f65a >>> != expected 0x1c08241c on 2:292cf221:::200.:head >>> 2018-07-12 12:00:00.000142 mon.ds26 [ERR] overall HEALTH_ERR 1 filesystem >>> is degraded; 1 mds daemon damaged >>> >>> Tried to
Re: [ceph-users] mds daemon damaged
Hi, all this sounds an awful lot like: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-July/027992.html In that case, things started with an update to 12.2.6. Which version are you running? Cheers, Oliver Am 12.07.2018 um 23:30 schrieb Kevin: > Sorry for the long posting but trying to cover everything > > I woke up to find my cephfs filesystem down. This was in the logs > > 2018-07-11 05:54:10.398171 osd.1 [ERR] 2.4 full-object read crc 0x6fc2f65a != > expected 0x1c08241c on 2:292cf221:::200.:head > > I had one standby MDS, but as far as I can tell it did not fail over. This > was in the logs > > (insufficient standby MDS daemons available) > > Currently my ceph looks like this > cluster: > id: .. > health: HEALTH_ERR > 1 filesystem is degraded > 1 mds daemon damaged > > services: > mon: 6 daemons, quorum ds26,ds27,ds2b,ds2a,ds28,ds29 > mgr: ids27(active) > mds: test-cephfs-1-0/1/1 up , 3 up:standby, 1 damaged > osd: 5 osds: 5 up, 5 in > > data: > pools: 3 pools, 202 pgs > objects: 1013k objects, 4018 GB > usage: 12085 GB used, 6544 GB / 18630 GB avail > pgs: 201 active+clean > 1 active+clean+scrubbing+deep > > io: > client: 0 B/s rd, 0 op/s rd, 0 op/s wr > > I started trying to get the damaged MDS back online > > Based on this page > http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts > > # cephfs-journal-tool journal export backup.bin > 2018-07-12 13:35:15.675964 7f3e1389bf00 -1 Header 200. is unreadable > 2018-07-12 13:35:15.675977 7f3e1389bf00 -1 journal_export: Journal not > readable, attempt object-by-object dump with `rados` > Error ((5) Input/output error) > > # cephfs-journal-tool event recover_dentries summary > Events by type: > 2018-07-12 13:36:03.000590 7fc398a18f00 -1 Header 200. 
is > unreadableErrors: 0 > > cephfs-journal-tool journal reset - (I think this command might have worked) > > Next up, tried to reset the filesystem > > ceph fs reset test-cephfs-1 --yes-i-really-mean-it > > Each time same errors > > 2018-07-12 11:56:35.760449 mon.ds26 [INF] Health check cleared: MDS_DAMAGE > (was: 1 mds daemon damaged) > 2018-07-12 11:56:35.856737 mon.ds26 [INF] Standby daemon mds.ds27 assigned to > filesystem test-cephfs-1 as rank 0 > 2018-07-12 11:56:35.947801 mds.ds27 [ERR] Error recovering journal 0x200: (5) > Input/output error > 2018-07-12 11:56:36.900807 mon.ds26 [ERR] Health check failed: 1 mds daemon > damaged (MDS_DAMAGE) > 2018-07-12 11:56:35.945544 osd.0 [ERR] 2.4 full-object read crc 0x6fc2f65a != > expected 0x1c08241c on 2:292cf221:::200.:head > 2018-07-12 12:00:00.000142 mon.ds26 [ERR] overall HEALTH_ERR 1 filesystem is > degraded; 1 mds daemon damaged > > Tried to 'fail' mds.ds27 > # ceph mds fail ds27 > # failed mds gid 1929168 > > Command worked, but each time I run the reset command the same errors above > appear > > Online searches say the object read error has to be removed. But there's no > object listed. This web page is the closest to the issue > http://tracker.ceph.com/issues/20863 > > Recommends fixing error by hand. Tried running deep scrub on pg 2.4, it > completes but still have the same issue above > > Final option is to attempt removing mds.ds27. If mds.ds29 was a standby and > has data it should become live. If it was not > I assume we will lose the filesystem at this point > > Why didn't the standby MDS failover? > > Just looking for any way to recover the cephfs, thanks! > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com smime.p7s Description: S/MIME Cryptographic Signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Bug? Ceph-volume /var/lib/ceph/osd permissions
Am 02.06.2018 um 12:35 schrieb Marc Roos: > > o+w? I don’t think that is necessary not? I also wondered about that, but it seems safe - it's only a tmpfs, with sticky bit set - and all files within have: -rw---. as you can check. Also, on our systems, we have: drwxr-x---. for /var/lib/ceph, so nobody can enter there in the first place. Still it would be nice to remove the unnecessary permissions from the OSD subdirectories. I guess what's there now is just the tmpfs default without any mask... Cheers, Oliver > > drwxr-xr-x 2 ceph ceph 182 May 9 12:59 ceph-15 > drwxr-xr-x 2 ceph ceph 182 May 9 20:51 ceph-14 > drwxr-xr-x 2 ceph ceph 182 May 12 10:32 ceph-16 > drwxr-xr-x 2 ceph ceph 6 Jun 2 17:21 ceph-19 > drwxr-x--- 13 ceph ceph 168 Jun 2 17:47 . > drwxrwxrwt 2 ceph ceph 300 Jun 2 17:47 ceph-20 <<< > > I feel like beta tester, playing a bit with this ceph-volume. > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > smime.p7s Description: S/MIME Cryptographic Signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
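If one wanted to tighten the bits by hand in the meantime, a sketch (ceph-20 is the example directory from the listing above):

```shell
# Drop the world-writable and sticky bits on the tmpfs-backed OSD dir:
chmod 750 /var/lib/ceph/osd/ceph-20

# Confirm mode and ownership:
stat -c '%A %U:%G %n' /var/lib/ceph/osd/ceph-20
```

As noted above this is largely cosmetic while /var/lib/ceph itself is 750 and the files inside are 0600 - and since the directory is a tmpfs mount created at activation, the mode would presumably come back on the next OSD start.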
Re: [ceph-users] Should ceph-volume lvm prepare not be backwards compatible with ceph-disk?
Am 02.06.2018 um 11:44 schrieb Marc Roos: > > > ceph-disk does not require bootstrap-osd/ceph.keyring and ceph-volume > does I believe that's expected when you use "prepare". For ceph-volume, "prepare" already bootstraps the OSD and fetches a fresh OSD id, for which it needs the keyring. For ceph-disk, this was not part of "prepare", but you only needed a key for "activate" later, I think. Since we always use "create" here via ceph-deploy, I'm not an expert on the subtle command differences, though - but ceph-deploy is doing a good job at making you survive without learning them ;-). Cheers, Oliver > > > > [@~]# ceph-disk prepare --bluestore --zap-disk /dev/sdf > > *** > Found invalid GPT and valid MBR; converting MBR to GPT format. > *** > > GPT data structures destroyed! You may now partition the disk using > fdisk or > other utilities. > Creating new GPT entries. > The operation has completed successfully. > The operation has completed successfully. > The operation has completed successfully. > The operation has completed successfully. > meta-data=/dev/sdf1 isize=2048 agcount=4, agsize=6400 > blks > = sectsz=4096 attr=2, projid32bit=1 > = crc=1finobt=0, sparse=0 > data = bsize=4096 blocks=25600, imaxpct=25 > = sunit=0 swidth=0 blks > naming =version 2 bsize=4096 ascii-ci=0 ftype=1 > log =internal log bsize=4096 blocks=1608, version=2 > = sectsz=4096 sunit=1 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 > Warning: The kernel is still using the old partition table. > The new table will be used at the next reboot. > The operation has completed successfully. 
> > [@~]# ceph-disk zap /dev/sdf > /dev/sdf1: 4 bytes were erased at offset 0x (xfs): 58 46 53 42 > 100+0 records in > 100+0 records out > 104857600 bytes (105 MB) copied, 0.946816 s, 111 MB/s > 110+0 records in > 110+0 records out > 115343360 bytes (115 MB) copied, 0.876412 s, 132 MB/s > Caution: invalid backup GPT header, but valid main header; regenerating > backup header from main header. > > Warning! Main and backup partition tables differ! Use the 'c' and 'e' > options > on the recovery & transformation menu to examine the two tables. > > Warning! One or more CRCs don't match. You should repair the disk! > > > > Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but > disk > verification and recovery are STRONGLY recommended. > > > GPT data structures destroyed! You may now partition the disk using > fdisk or > other utilities. > Creating new GPT entries. > The operation has completed successfully. > > > > [@ ~]# fdisk -l /dev/sdf > WARNING: fdisk GPT support is currently new, and therefore in an > experimental phase. Use at your own discretion. 
> > Disk /dev/sdf: 3000.6 GB, 3000592982016 bytes, 5860533168 sectors > Units = sectors of 1 * 512 = 512 bytes > Sector size (logical/physical): 512 bytes / 4096 bytes > I/O size (minimum/optimal): 4096 bytes / 4096 bytes > Disk label type: gpt > Disk identifier: 7DB3B9B6-CD8E-41B5-85BA-3ABB566BAF8E > > > # Start EndSize TypeName > > > [@ ~]# ceph-volume lvm prepare --bluestore --data /dev/sdf > Running command: /bin/ceph-authtool --gen-print-key > Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd > --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new > 8a2440c2-55a3-4b09-8906-965c25e36066 > stderr: 2018-06-02 17:00:47.309487 7f5a083c1700 -1 auth: unable to find > a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file > or directory > stderr: 2018-06-02 17:00:47.309502 7f5a083c1700 -1 monclient: ERROR: > missing keyring, cannot use cephx for authentication > stderr: 2018-06-02 17:00:47.309505 7f5a083c1700 0 librados: > client.bootstrap-osd initialization error (2) No such file or directory > stderr: [errno 2] error connecting to the cluster > --> RuntimeError: Unable to create a new OSD id > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > smime.p7s Description: S/MIME Cryptographic Signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
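The "unable to find a keyring" error at the end can usually be resolved by exporting the bootstrap-osd key onto the node before running prepare - a sketch (run with admin credentials):

```shell
# Fetch the bootstrap-osd key from the cluster into the expected path:
mkdir -p /var/lib/ceph/bootstrap-osd
ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring

# Restrict it to the ceph user:
chown ceph:ceph /var/lib/ceph/bootstrap-osd/ceph.keyring
chmod 600 /var/lib/ceph/bootstrap-osd/ceph.keyring
```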
Re: [ceph-users] Bug? ceph-volume zap not working
The command mapping from ceph-disk to ceph-volume is certainly not 1:1. What we ended up using is: "ceph-volume lvm zap /dev/sda --destroy". This takes care of destroying the PVs and LVs (as the documentation says). Cheers, Oliver Am 02.06.2018 um 12:16 schrieb Marc Roos: > > I guess zap should be used instead of destroy? Maybe keep ceph-disk > backwards compatibility and keep destroy?? > > [root@c03 bootstrap-osd]# ceph-volume lvm zap /dev/sdf > --> Zapping: /dev/sdf > --> Unmounting /var/lib/ceph/osd/ceph-19 > Running command: umount -v /var/lib/ceph/osd/ceph-19 > stderr: umount: /var/lib/ceph/osd/ceph-19 (tmpfs) unmounted > Running command: wipefs --all /dev/sdf > stderr: wipefs: error: /dev/sdf: probing initialization failed: Device > or resource busy > --> RuntimeError: command returned non-zero exit status: 1 > > Pvs / lvs are still there, I guess these are keeping the 'resource busy' > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >
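For scripted re-provisioning, the working invocation above can be wrapped in a small helper. This is a minimal sketch, not part of the original thread: it simply shells out to `ceph-volume lvm zap --destroy` (which removes the LVM state that otherwise keeps the device busy) and must run on an OSD host; the injectable `runner` parameter is a hypothetical addition that only exists to make the sketch testable without a real cluster.

```python
import subprocess


def zap_device(dev: str, runner=subprocess.run) -> None:
    """Wipe `dev` for re-use as an OSD.

    Plain `ceph-volume lvm zap` fails with 'Device or resource busy'
    while PVs/LVs from a previous OSD still exist on the device;
    `--destroy` tears those down (plus partitions) before wiping.
    """
    # check=True raises CalledProcessError if ceph-volume exits non-zero.
    runner(["ceph-volume", "lvm", "zap", dev, "--destroy"], check=True)
```

Usage on a host would simply be `zap_device("/dev/sdf")`; the default runner executes the command for real.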
Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"
Am 01.06.2018 um 02:59 schrieb Yan, Zheng: > On Wed, May 30, 2018 at 5:17 PM, Oliver Freyermuth > wrote: >> Am 30.05.2018 um 10:37 schrieb Yan, Zheng: >>> On Wed, May 30, 2018 at 3:04 PM, Oliver Freyermuth >>> wrote: >>>> Hi, >>>> >>>> ij our case, there's only a single active MDS >>>> (+1 standby-replay + 1 standby). >>>> We also get the health warning in case it happens. >>>> >>> >>> Were there "client.xxx isn't responding to mclientcaps(revoke)" >>> warnings in cluster log. please send them to me if there were. >> >> Yes, indeed, I almost missed them! >> >> Here you go: >> >> >> 2018-05-29 12:16:02.491186 mon.mon003 mon.0 10.161.8.40:6789/0 11177 : >> cluster [WRN] MDS health message (mds.0): Client XXX:XXX failing to >> respond to capability release >> 2018-05-29 12:16:03.401014 mon.mon003 mon.0 10.161.8.40:6789/0 11178 : >> cluster [WRN] Health check failed: 1 clients failing to respond to >> capability release (MDS_CLIENT_LATE_RELEASE) >> >> 2018-05-29 12:16:00.567520 mds.mon001 mds.0 10.161.8.191:6800/3068262341 >> 15745 : cluster [WRN] client.1524813 isn't responding to >> mclientcaps(revoke), ino 0x1388ae0 pending pAsLsXsFr issued pAsLsXsFrw, >> sent 63.908382 seconds ago >> >>> repetition of message with increasing delays in between> >> >> 2018-05-29 16:31:00.899416 mds.mon001 mds.0 10.161.8.191:6800/3068262341 >> 17169 : cluster [WRN] client.1524813 isn't responding to >> mclientcaps(revoke), ino 0x1388ae0 pending pAsLsXsFr issued pAsLsXsFrw, >> sent 15364.240272 seconds ago >> >> >> After evicting the client, I also get: >> 2018-05-29 17:00:00.000134 mon.mon003 mon.0 10.161.8.40:6789/0 11293 : >> cluster [WRN] overall HEALTH_WARN 1 clients failing to respond to capability >> release; 1 MDSs report slow requests >> 2018-05-29 17:09:50.964730 mon.mon003 mon.0 10.161.8.40:6789/0 11297 : >> cluster [INF] MDS health message cleared (mds.0): Client XXX:XXX >> failing to respond to capability release >> 2018-05-29 17:09:50.964767 mon.mon003 mon.0 
10.161.8.40:6789/0 11298 : >> cluster [INF] MDS health message cleared (mds.0): 123 slow requests are >> blocked > 30 sec >> 2018-05-29 17:09:51.015071 mon.mon003 mon.0 10.161.8.40:6789/0 11299 : >> cluster [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients >> failing to respond to capability release) >> 2018-05-29 17:09:51.015154 mon.mon003 mon.0 10.161.8.40:6789/0 11300 : >> cluster [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report >> slow requests) >> 2018-05-29 17:09:51.015191 mon.mon003 mon.0 10.161.8.40:6789/0 11301 : >> cluster [INF] Cluster is now healthy >> 2018-05-29 17:14:26.178321 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 >> 8 : cluster [WRN] replayed op client.1495010:32710304,32710299 used ino >> 0x13909d0 but session next is 0x1388af6 >> 2018-05-29 17:14:26.178393 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 >> 9 : cluster [WRN] replayed op client.1495010:32710306,32710299 used ino >> 0x13909d1 but session next is 0x1388af6 >> 2018-05-29 18:00:00.000132 mon.mon003 mon.0 10.161.8.40:6789/0 11304 : >> cluster [INF] overall HEALTH_OK >> >> Thanks for looking into it! >> >> Cheers, >> Oliver >> >> > > I found cause of your issue. http://tracker.ceph.com/issues/24369 Wow, many thanks! I did not yet manage to reproduce the stuck behaviour, since the user who could reliably cause it made use of the national holiday around here. But the issue seems extremely likely to be exactly that one - quotas are set for the directory tree which was affected. Let me know if I still should ask him to reproduce and collect the information from the client to confirm. 
Many thanks and cheers, Oliver > >>> >>>> Cheers, >>>> Oliver >>>> >>>> Am 30.05.2018 um 03:25 schrieb Yan, Zheng: >>>>> I could be http://tracker.ceph.com/issues/24172 >>>>> >>>>> >>>>> On Wed, May 30, 2018 at 9:01 AM, Linh Vu wrote: >>>>>> In my case, I have multiple active MDS (with directory pinning at the >>>>>> very >>>>>> top level), and there would be "Client xxx failing to respond to >>>>>> capability >>>>>> release" health warning every single time that happens. >>>>>> >>>
Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"
Am 30.05.2018 um 10:37 schrieb Yan, Zheng: > On Wed, May 30, 2018 at 3:04 PM, Oliver Freyermuth > wrote: >> Hi, >> >> ij our case, there's only a single active MDS >> (+1 standby-replay + 1 standby). >> We also get the health warning in case it happens. >> > > Were there "client.xxx isn't responding to mclientcaps(revoke)" > warnings in cluster log. please send them to me if there were. Yes, indeed, I almost missed them! Here you go: 2018-05-29 12:16:02.491186 mon.mon003 mon.0 10.161.8.40:6789/0 11177 : cluster [WRN] MDS health message (mds.0): Client XXX:XXX failing to respond to capability release 2018-05-29 12:16:03.401014 mon.mon003 mon.0 10.161.8.40:6789/0 11178 : cluster [WRN] Health check failed: 1 clients failing to respond to capability release (MDS_CLIENT_LATE_RELEASE) 2018-05-29 12:16:00.567520 mds.mon001 mds.0 10.161.8.191:6800/3068262341 15745 : cluster [WRN] client.1524813 isn't responding to mclientcaps(revoke), ino 0x1388ae0 pending pAsLsXsFr issued pAsLsXsFrw, sent 63.908382 seconds ago >repetition of message with increasing delays in between> 2018-05-29 16:31:00.899416 mds.mon001 mds.0 10.161.8.191:6800/3068262341 17169 : cluster [WRN] client.1524813 isn't responding to mclientcaps(revoke), ino 0x1388ae0 pending pAsLsXsFr issued pAsLsXsFrw, sent 15364.240272 seconds ago After evicting the client, I also get: 2018-05-29 17:00:00.000134 mon.mon003 mon.0 10.161.8.40:6789/0 11293 : cluster [WRN] overall HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests 2018-05-29 17:09:50.964730 mon.mon003 mon.0 10.161.8.40:6789/0 11297 : cluster [INF] MDS health message cleared (mds.0): Client XXX:XXX failing to respond to capability release 2018-05-29 17:09:50.964767 mon.mon003 mon.0 10.161.8.40:6789/0 11298 : cluster [INF] MDS health message cleared (mds.0): 123 slow requests are blocked > 30 sec 2018-05-29 17:09:51.015071 mon.mon003 mon.0 10.161.8.40:6789/0 11299 : cluster [INF] Health check cleared: 
MDS_CLIENT_LATE_RELEASE (was: 1 clients failing to respond to capability release) 2018-05-29 17:09:51.015154 mon.mon003 mon.0 10.161.8.40:6789/0 11300 : cluster [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests) 2018-05-29 17:09:51.015191 mon.mon003 mon.0 10.161.8.40:6789/0 11301 : cluster [INF] Cluster is now healthy 2018-05-29 17:14:26.178321 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 8 : cluster [WRN] replayed op client.1495010:32710304,32710299 used ino 0x13909d0 but session next is 0x1388af6 2018-05-29 17:14:26.178393 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 9 : cluster [WRN] replayed op client.1495010:32710306,32710299 used ino 0x13909d1 but session next is 0x1388af6 2018-05-29 18:00:00.000132 mon.mon003 mon.0 10.161.8.40:6789/0 11304 : cluster [INF] overall HEALTH_OK Thanks for looking into it! Cheers, Oliver > >> Cheers, >> Oliver >> >> Am 30.05.2018 um 03:25 schrieb Yan, Zheng: >>> I could be http://tracker.ceph.com/issues/24172 >>> >>> >>> On Wed, May 30, 2018 at 9:01 AM, Linh Vu wrote: >>>> In my case, I have multiple active MDS (with directory pinning at the very >>>> top level), and there would be "Client xxx failing to respond to capability >>>> release" health warning every single time that happens. >>>> >>>> >>>> From: ceph-users on behalf of Yan, >>>> Zheng >>>> >>>> Sent: Tuesday, 29 May 2018 9:53:43 PM >>>> To: Oliver Freyermuth >>>> Cc: Ceph Users; Peter Wienemann >>>> Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to >>>> authpin local pins" >>>> >>>> Single or multiple acitve mds? Were there "Client xxx failing to >>>> respond to capability release" health warning? >>>> >>>> On Mon, May 28, 2018 at 10:38 PM, Oliver Freyermuth >>>> wrote: >>>>> Dear Cephalopodians, >>>>> >>>>> we just had a "lockup" of many MDS requests, and also trimming fell >>>>> behind, for over 2 days. 
>>>>> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status >>>>> "currently failed to authpin local pins". Metadata pool usage did grow by >>>>> 10 >>>>> GB in those 2 days. >>>>> >>>>> Rebooting the node to force a client eviction solved the issue, and now >>>>> metadata usage is down again, and all stuck requests were processed >>>>> quickly. >>>>
Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"
Hi, in our case, there's only a single active MDS (+1 standby-replay + 1 standby). We also get the health warning in case it happens. Cheers, Oliver Am 30.05.2018 um 03:25 schrieb Yan, Zheng: > I could be http://tracker.ceph.com/issues/24172 > > > On Wed, May 30, 2018 at 9:01 AM, Linh Vu wrote: >> In my case, I have multiple active MDS (with directory pinning at the very >> top level), and there would be "Client xxx failing to respond to capability >> release" health warning every single time that happens. >> >> >> From: ceph-users on behalf of Yan, Zheng >> >> Sent: Tuesday, 29 May 2018 9:53:43 PM >> To: Oliver Freyermuth >> Cc: Ceph Users; Peter Wienemann >> Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to >> authpin local pins" >> >> Single or multiple acitve mds? Were there "Client xxx failing to >> respond to capability release" health warning? >> >> On Mon, May 28, 2018 at 10:38 PM, Oliver Freyermuth >> wrote: >>> Dear Cephalopodians, >>> >>> we just had a "lockup" of many MDS requests, and also trimming fell >>> behind, for over 2 days. >>> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status >>> "currently failed to authpin local pins". Metadata pool usage did grow by 10 >>> GB in those 2 days. >>> >>> Rebooting the node to force a client eviction solved the issue, and now >>> metadata usage is down again, and all stuck requests were processed quickly. >>> >>> Is there any idea on what could cause something like that? On the client, >>> der was no CPU load, but many processes waiting for cephfs to respond. >>> Syslog did yield anything. It only affected one user and his user >>> directory. >>> >>> If there are no ideas: How can I collect good debug information in case >>> this happens again? 
>>> >>> Cheers, >>> Oliver >>> >>> >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> >>> https://protect-au.mimecast.com/s/Zl9aCXLKNwFxY9nNc6jQJC?domain=lists.ceph.com >>> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>
Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"
I get the feeling this is not dependent on the exact Ceph version... In our case, I know what the user has done (and he'll not do it again). He misunderstood how our cluster works and started 1100 cluster jobs, all entering the very same directory on CephFS (mounted via ceph-fuse on 38 machines), all running "make clean; make -j10 install". So 1100 processes from 38 clients have been trying to lock / delete / write the very same files. In parallel, an IDE (eclipse) and an indexing service (zeitgeist...) may have accessed the very same directory via nfs-ganesha since the user mounted the NFS-exported directory via sshfs into his desktop home directory... So I can't really blame CephFS for becoming as unhappy as I would become myself. However, I would have hoped it would not enter a "stuck" state in which only client eviction will help... Cheers, Oliver Am 29.05.2018 um 03:26 schrieb Linh Vu: > I get the exact opposite to the same error message "currently failed to > authpin local pins". Had a few clients on ceph-fuse 12.2.2 and they ran into > those issues a lot (evicting works). Upgrading to ceph-fuse 12.2.5 fixed it. > The main cluster is on 12.2.4. > > > The cause is user's HPC jobs or even just their login on multiple nodes > accessing the same files, in a particular way. Doesn't happen to other users. > Haven't quite dug into it deep enough as upgrading to 12.2.5 fixed our > problem. > > ------ > *From:* ceph-users on behalf of Oliver > Freyermuth > *Sent:* Tuesday, 29 May 2018 7:29:06 AM > *To:* Paul Emmerich > *Cc:* Ceph Users; Peter Wienemann > *Subject:* Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to > authpin local pins" > > Dear Paul, > > Am 28.05.2018 um 20:16 schrieb Paul Emmerich: >> I encountered the exact same issue earlier today immediately after upgrading >> a customer's cluster from 12.2.2 to 12.2.5. >> I've evicted the session and restarted the ganesha client to fix it, as I >> also couldn't find any obvious problem. 
> > interesting! In our case, the client with the problem (it happened again a > few hours later...) always was a ceph-fuse client. Evicting / rebooting the > client node helped. > However, it may well be that the original issue way caused by a Ganesha > client, which we also use (and the user in question who complained was > accessing files in parallel via NFS and ceph-fuse), > but I don't have a clear indication of that. > > Cheers, > Oliver > >> >> Paul >> >> 2018-05-28 16:38 GMT+02:00 Oliver Freyermuth > <mailto:freyerm...@physik.uni-bonn.de>>: >> >> Dear Cephalopodians, >> >> we just had a "lockup" of many MDS requests, and also trimming fell >>behind, for over 2 days. >> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status >>"currently failed to authpin local pins". Metadata pool usage did grow by 10 >>GB in those 2 days. >> >> Rebooting the node to force a client eviction solved the issue, and now >>metadata usage is down again, and all stuck requests were processed quickly. >> >> Is there any idea on what could cause something like that? On the >>client, der was no CPU load, but many processes waiting for cephfs to respond. >> Syslog did yield anything. It only affected one user and his user >>directory. >> >> If there are no ideas: How can I collect good debug information in case >>this happens again? >> >> Cheers, >> Oliver >> >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com <mailto:ceph-
Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"
Dear Paul, Am 28.05.2018 um 20:16 schrieb Paul Emmerich: > I encountered the exact same issue earlier today immediately after upgrading > a customer's cluster from 12.2.2 to 12.2.5. > I've evicted the session and restarted the ganesha client to fix it, as I > also couldn't find any obvious problem. interesting! In our case, the client with the problem (it happened again a few hours later...) always was a ceph-fuse client. Evicting / rebooting the client node helped. However, it may well be that the original issue was caused by a Ganesha client, which we also use (and the user in question who complained was accessing files in parallel via NFS and ceph-fuse), but I don't have a clear indication of that. Cheers, Oliver > > Paul > > 2018-05-28 16:38 GMT+02:00 Oliver Freyermuth <mailto:freyerm...@physik.uni-bonn.de>>: > > Dear Cephalopodians, > > we just had a "lockup" of many MDS requests, and also trimming fell > behind, for over 2 days. > One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status > "currently failed to authpin local pins". Metadata pool usage did grow by 10 > GB in those 2 days. > > Rebooting the node to force a client eviction solved the issue, and now > metadata usage is down again, and all stuck requests were processed quickly. > > Is there any idea on what could cause something like that? On the client, > der was no CPU load, but many processes waiting for cephfs to respond. > Syslog did yield anything. It only affected one user and his user > directory. > > If there are no ideas: How can I collect good debug information in case > this happens again? > > Cheers, > Oliver > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com> > > > > > -- > Paul Emmerich > > Looking for help with your Ceph cluster? Contact us at https://croit.io > > croit GmbH > Freseniusstr. 
31h > 81247 München > www.croit.io <http://www.croit.io> > Tel: +49 89 1896585 90
Re: [ceph-users] CephFS "move" operation
Am 25.05.2018 um 15:39 schrieb Sage Weil: > On Fri, 25 May 2018, Oliver Freyermuth wrote: >> Dear Ric, >> >> I played around a bit - the common denominator seems to be: Moving it >> within a directory subtree below a directory for which max_bytes / >> max_files quota settings are set, things work fine. Moving it to another >> directory tree without quota settings / with different quota settings, >> rename() returns EXDEV. > > Aha, yes, this is the issue. > > When you set a quota you force subvolume-like behavior. This is done > because hard links across this quota boundary won't correctly account for > utilization (only one of the file links will accrue usage). The > expectation is that quotas are usually set in locations that aren't > frequently renamed across. Understood, that explains it. That's indeed also true for our application in most cases - but sometimes, we have the case that users want to migrate their data to group storage, or vice-versa. > > It might be possible to allow rename(2) to proceed in cases where > nlink==1, but the behavior will probably seem inconsistent (some files get > EXDEV, some don't). I believe even this would be extremely helpful, performance-wise. At least in our case, hardlinks are seldomly used, it's more about data movement between user, group and scratch areas. For files with nlinks>1, it's more or less expected a copy has to be performed when crossing quota boundaries (I think). Cheers, Oliver > > sage > > > >> >> Cheers, Oliver >> >> >> Am 25.05.2018 um 15:18 schrieb Ric Wheeler: >>> That seems to be the issue - we need to understand why rename sees them as >>> different. >>> >>> Ric >>> >>> >>> On Fri, May 25, 2018, 9:15 AM Oliver Freyermuth >>> <freyerm...@physik.uni-bonn.de <mailto:freyerm...@physik.uni-bonn.de>> >>> wrote: >>> >>> Mhhhm... that's funny, I checked an mv with an strace now. 
I get: >>> >>> - >>> access("/cephfs/some_folder/file", W_OK) = 0 >>> rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid >>> cross-device link) >>> unlink("/cephfs/some_folder/file") = 0 >>> lgetxattr("foo", "security.selinux", "system_u:object_r:fusefs_t:s0", >>> 255) = 30 >>> >>> - >>> But I can assure it's only a single filesystem, and a single ceph-fuse >>> client running. >>> >>> Same happens when using absolute paths. >>> >>> Cheers, >>> Oliver >>> >>> Am 25.05.2018 um 15:06 schrieb Ric Wheeler: >>> > We should look at what mv uses to see if it thinks the directories >>> are on different file systems. >>> > >>> > If the fstat or whatever it looks at is confused, that might explain >>> it. >>> > >>> > Ric >>> > >>> > >>> > On Fri, May 25, 2018, 9:04 AM Oliver Freyermuth >>> <freyerm...@physik.uni-bonn.de <mailto:freyerm...@physik.uni-bonn.de> >>> <mailto:freyerm...@physik.uni-bonn.de >>> <mailto:freyerm...@physik.uni-bonn.de>>> wrote: >>> > >>> > Am 25.05.2018 um 14:57 schrieb Ric Wheeler: >>> > > Is this move between directories on the same file system? >>> > >>> > It is, we only have a single CephFS in use. There's also only a >>> single ceph-fuse client running. >>> > >>> > What's different, though, are different ACLs set for source and >>> target directory, and owner / group, >>> > but I hope that should not matter. >>> > >>> > All the best, >>> > Oliver >>> > >>> > > Rename as a system call only works within a file system. >>> > > >>> > > The user space mv command becomes a copy when not the same file >>> system. >>> > > >>> > > Regards, >>> > > >>> > > Ric >>> > > >>> > > >>> > > On Fri, May 25, 2018, 8:51 AM John Spray <jsp...@redhat.com &g
Re: [ceph-users] CephFS "move" operation
Am 25.05.2018 um 15:26 schrieb Luis Henriques: > Oliver Freyermuth <freyerm...@physik.uni-bonn.de> writes: > >> Mhhhm... that's funny, I checked an mv with an strace now. I get: >> - >> access("/cephfs/some_folder/file", W_OK) = 0 >> rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid cross-device >> link) > > I believe this could happen if you have quotas set on any of the paths, > or different snapshot realms. Wow - yes, this matches my observations! So in this case, e.g. moving files from a "user" directory with quota to a "group" directory with different quota, it's currently expected that files cannot be renamed across those boundaries? Cheers, Oliver > > Cheers, >
Re: [ceph-users] CephFS "move" operation
Dear Sage, here you go, some_folder in reality is "/cephfs/group": # stat foo File: ‘foo’ Size: 1048576000 Blocks: 2048000IO Block: 4194304 regular file Device: 27h/39d Inode: 1099515065517 Links: 1 Access: (0644/-rw-r--r--) Uid: (0/root) Gid: (0/root) Context: system_u:object_r:fusefs_t:s0 Access: 2018-05-25 15:27:59.433279424 +0200 Modify: 2018-05-25 15:28:01.379754052 +0200 Change: 2018-05-25 15:28:01.379754052 +0200 Birth: - # stat -f foo File: "foo" ID: 0Namelen: 255 Type: fuseblk Block size: 4194304Fundamental block size: 4194304 Blocks: Total: 104471885 Free: 79096968 Available: 79096968 Inodes: Total: 26258533 Free: -1 # stat -f /cephfs/group/ File: "/cephfs/group/" ID: 0Namelen: 255 Type: fuseblk Block size: 4194304Fundamental block size: 4194304 Blocks: Total: 104471835 Free: 79098264 Available: 79098264 Inodes: Total: 26257190 Free: -1 # stat /cephfs/group/ File: ‘/cephfs/group/’ Size: 73167320986856 Blocks: 1 IO Block: 4096 directory Device: 27h/39d Inode: 1099511627888 Links: 1 Access: (0755/drwxr-xr-x) Uid: (0/root) Gid: (0/root) Context: system_u:object_r:fusefs_t:s0 Access: 2018-03-09 18:22:47.061501906 +0100 Modify: 2018-05-25 15:18:02.164391701 +0200 Change: 2018-05-25 15:18:02.164391701 +0200 Birth: - Cheers, Oliver Am 25.05.2018 um 15:21 schrieb Sage Weil: > Can you paste the output of 'stat foo' and 'stat /cephfs/some_folder'? > (Maybe also the same with 'stat -f'.) > > Thanks! > sage > > > On Fri, 25 May 2018, Ric Wheeler wrote: >> That seems to be the issue - we need to understand why rename sees them as >> different. >> >> Ric >> >> >> On Fri, May 25, 2018, 9:15 AM Oliver Freyermuth < >> freyerm...@physik.uni-bonn.de> wrote: >> >>> Mhhhm... that's funny, I checked an mv with an strace now. 
I get: >>> >>> - >>> access("/cephfs/some_folder/file", W_OK) = 0 >>> rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid cross-device >>> link) >>> unlink("/cephfs/some_folder/file") = 0 >>> lgetxattr("foo", "security.selinux", "system_u:object_r:fusefs_t:s0", 255) >>> = 30 >>> >>> - >>> But I can assure it's only a single filesystem, and a single ceph-fuse >>> client running. >>> >>> Same happens when using absolute paths. >>> >>> Cheers, >>> Oliver >>> >>> Am 25.05.2018 um 15:06 schrieb Ric Wheeler: >>>> We should look at what mv uses to see if it thinks the directories are >>> on different file systems. >>>> >>>> If the fstat or whatever it looks at is confused, that might explain it. >>>> >>>> Ric >>>> >>>> >>>> On Fri, May 25, 2018, 9:04 AM Oliver Freyermuth < >>> freyerm...@physik.uni-bonn.de <mailto:freyerm...@physik.uni-bonn.de>> >>> wrote: >>>> >>>> Am 25.05.2018 um 14:57 schrieb Ric Wheeler: >>>> > Is this move between directories on the same file system? >>>> >>>> It is, we only have a single CephFS in use. There's also only a >>> single ceph-fuse client running. >>>> >>>> What's different, though, are different ACLs set for source and >>> target directory, and owner / group, >>>> but I hope that should not matter. >>>> >>>> All the best, >>>> Oliver >>>> >>>> > Rename as a system call only works within a file system. >>>> > >>>> > The user space mv command becomes a copy when not the same file >>> system. >>>> > >>>> > Regards, >>>> > >>>> > Ric >>>> > >>>> > >>>> > On Fri, May 25, 2018, 8:51 AM John Spray <jsp...@redhat.com >>> <mailto:jsp...@redhat.com> <mailto:jsp...@redhat.com >> jsp...@redhat.com>>> wrote: >>>> > >>>> > On Fri, May 2
Re: [ceph-users] CephFS "move" operation
Dear Ric, I played around a bit - the common denominator seems to be: Moving it within a directory subtree below a directory for which max_bytes / max_files quota settings are set, things work fine. Moving it to another directory tree without quota settings / with different quota settings, rename() returns EXDEV. Cheers, Oliver Am 25.05.2018 um 15:18 schrieb Ric Wheeler: > That seems to be the issue - we need to understand why rename sees them as > different. > > Ric > > > On Fri, May 25, 2018, 9:15 AM Oliver Freyermuth > <freyerm...@physik.uni-bonn.de <mailto:freyerm...@physik.uni-bonn.de>> wrote: > > Mhhhm... that's funny, I checked an mv with an strace now. I get: > > - > access("/cephfs/some_folder/file", W_OK) = 0 > rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid > cross-device link) > unlink("/cephfs/some_folder/file") = 0 > lgetxattr("foo", "security.selinux", "system_u:object_r:fusefs_t:s0", > 255) = 30 > > - > But I can assure it's only a single filesystem, and a single ceph-fuse > client running. > > Same happens when using absolute paths. > > Cheers, > Oliver > > Am 25.05.2018 um 15:06 schrieb Ric Wheeler: > > We should look at what mv uses to see if it thinks the directories are > on different file systems. > > > > If the fstat or whatever it looks at is confused, that might explain it. > > > > Ric > > > > > > On Fri, May 25, 2018, 9:04 AM Oliver Freyermuth > <freyerm...@physik.uni-bonn.de <mailto:freyerm...@physik.uni-bonn.de> > <mailto:freyerm...@physik.uni-bonn.de > <mailto:freyerm...@physik.uni-bonn.de>>> wrote: > > > > Am 25.05.2018 um 14:57 schrieb Ric Wheeler: > > > Is this move between directories on the same file system? > > > > It is, we only have a single CephFS in use. There's also only a > single ceph-fuse client running. > > > > What's different, though, are different ACLs set for source and > target directory, and owner / group, > > but I hope that should not matter. 
> > > > All the best, > > Oliver > > > > > Rename as a system call only works within a file system. > > > > > > The user space mv command becomes a copy when not the same file > system. > > > > > > Regards, > > > > > > Ric > > > > > > > > > On Fri, May 25, 2018, 8:51 AM John Spray <jsp...@redhat.com > <mailto:jsp...@redhat.com> <mailto:jsp...@redhat.com > <mailto:jsp...@redhat.com>> <mailto:jsp...@redhat.com > <mailto:jsp...@redhat.com> <mailto:jsp...@redhat.com > <mailto:jsp...@redhat.com>>>> wrote: > > > > > > On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth > > > <freyerm...@physik.uni-bonn.de > <mailto:freyerm...@physik.uni-bonn.de> <mailto:freyerm...@physik.uni-bonn.de > <mailto:freyerm...@physik.uni-bonn.de>> <mailto:freyerm...@physik.uni-bonn.de > <mailto:freyerm...@physik.uni-bonn.de> <mailto:freyerm...@physik.uni-bonn.de > <mailto:freyerm...@physik.uni-bonn.de>>>> wrote: > > > > Dear Cephalopodians, > > > > > > > > I was wondering why a simple "mv" is taking extraordinarily > long on CephFS and must note that, > > > > at least with the fuse-client (12.2.5) and when moving a > file from one directory to another, > > > > the file appears to be copied first (byte by byte, traffic > going through the client?) before the initial file is deleted. > > > > > > > > Is this true, or am I missing something? > > > > > > A mv should not involve copying a file through the client -- > it's > > > implemented in the MDS as a rename from one location to > another. > > > What's the observation that's making it seem like the data is > going > > > through the client? > > > > > > John > > > > > > > >
Re: [ceph-users] CephFS "move" operation
Mhhhm... that's funny, I checked an mv with an strace now. I get: - access("/cephfs/some_folder/file", W_OK) = 0 rename("foo", "/cephfs/some_folder/file") = -1 EXDEV (Invalid cross-device link) unlink("/cephfs/some_folder/file") = 0 lgetxattr("foo", "security.selinux", "system_u:object_r:fusefs_t:s0", 255) = 30 - But I can assure it's only a single filesystem, and a single ceph-fuse client running. Same happens when using absolute paths. Cheers, Oliver Am 25.05.2018 um 15:06 schrieb Ric Wheeler: > We should look at what mv uses to see if it thinks the directories are on > different file systems. > > If the fstat or whatever it looks at is confused, that might explain it. > > Ric > > > On Fri, May 25, 2018, 9:04 AM Oliver Freyermuth > <freyerm...@physik.uni-bonn.de <mailto:freyerm...@physik.uni-bonn.de>> wrote: > > Am 25.05.2018 um 14:57 schrieb Ric Wheeler: > > Is this move between directories on the same file system? > > It is, we only have a single CephFS in use. There's also only a single > ceph-fuse client running. > > What's different, though, are different ACLs set for source and target > directory, and owner / group, > but I hope that should not matter. > > All the best, > Oliver > > > Rename as a system call only works within a file system. > > > > The user space mv command becomes a copy when not the same file system. 
> > > > Regards, > > > > Ric > > > > > > On Fri, May 25, 2018, 8:51 AM John Spray <jsp...@redhat.com > <mailto:jsp...@redhat.com> <mailto:jsp...@redhat.com > <mailto:jsp...@redhat.com>>> wrote: > > > > On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth > > <freyerm...@physik.uni-bonn.de > <mailto:freyerm...@physik.uni-bonn.de> <mailto:freyerm...@physik.uni-bonn.de > <mailto:freyerm...@physik.uni-bonn.de>>> wrote: > > > Dear Cephalopodians, > > > > > > I was wondering why a simple "mv" is taking extraordinarily long > on CephFS and must note that, > > > at least with the fuse-client (12.2.5) and when moving a file > from one directory to another, > > > the file appears to be copied first (byte by byte, traffic going > through the client?) before the initial file is deleted. > > > > > > Is this true, or am I missing something? > > > > A mv should not involve copying a file through the client -- it's > > implemented in the MDS as a rename from one location to another. > > What's the observation that's making it seem like the data is going > > through the client? > > > > John > > > > > > > > For large files, this might be rather time consuming, > > > and we should certainly advise all our users to not move files > around needlessly if this is the case. > > > > > > Cheers, > > > Oliver > > > > > > > > > ___ > > > ceph-users mailing list > > > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> > <mailto:ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> > <mailto:ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > smime.p7s Description: S/MIME Cryptographic Signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS "move" operation
On 25.05.2018 at 14:57, Ric Wheeler wrote:
> Is this move between directories on the same file system?

It is, we only have a single CephFS in use. There's also only a single ceph-fuse client running.

What's different, though, are the ACLs set for the source and target directories, and the owner / group - but I hope that should not matter.

All the best,
Oliver

> Rename as a system call only works within a file system.
>
> The user space mv command becomes a copy when it is not the same file system.
>
> Regards,
>
> Ric
>
> On Fri, May 25, 2018, 8:51 AM John Spray <jsp...@redhat.com> wrote:
>
> > On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth <freyerm...@physik.uni-bonn.de> wrote:
> > > Dear Cephalopodians,
> > >
> > > I was wondering why a simple "mv" is taking extraordinarily long on CephFS and must note that,
> > > at least with the fuse-client (12.2.5) and when moving a file from one directory to another,
> > > the file appears to be copied first (byte by byte, traffic going through the client?) before the initial file is deleted.
> > >
> > > Is this true, or am I missing something?
> >
> > A mv should not involve copying a file through the client -- it's
> > implemented in the MDS as a rename from one location to another.
> > What's the observation that's making it seem like the data is going
> > through the client?
> >
> > John
> >
> > > For large files, this might be rather time consuming,
> > > and we should certainly advise all our users not to move files around needlessly if this is the case.
> > >
> > > Cheers,
> > > Oliver
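To answer Ric's question mechanically: mv-like tools decide whether a rename can work by comparing the st_dev of the two paths; only the kernel's actual rename(2) answer is authoritative. A quick check (a sketch; the paths are placeholders):

```python
import os

def same_filesystem(a: str, b: str) -> bool:
    # rename(2) can only succeed when both paths live on the same
    # mounted filesystem, i.e. their st_dev values match.
    return os.stat(a).st_dev == os.stat(b).st_dev

# On a single CephFS mount this should print True for any two
# directories below the mount point ("." stands in for real paths):
print(same_filesystem(".", "."))
```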
Re: [ceph-users] CephFS "move" operation
On 25.05.2018 at 14:50, John Spray wrote:
> On Fri, May 25, 2018 at 1:10 PM, Oliver Freyermuth <freyerm...@physik.uni-bonn.de> wrote:
>> Dear Cephalopodians,
>>
>> I was wondering why a simple "mv" is taking extraordinarily long on CephFS and must note that,
>> at least with the fuse-client (12.2.5) and when moving a file from one directory to another,
>> the file appears to be copied first (byte by byte, traffic going through the client?) before the initial file is deleted.
>>
>> Is this true, or am I missing something?
>
> A mv should not involve copying a file through the client -- it's
> implemented in the MDS as a rename from one location to another.
> What's the observation that's making it seem like the data is going
> through the client?

The fact that it's happening at only about 1 GBit/s while all OSDs are reading and writing. I will also check the network interface of the client the next time it occurs. In addition, ceph-fuse was at 50 % CPU load just from this.

Also, I observe that the file at the source is kept during the copy while the file at the target grows slowly. So it's definitely a copy, and only at the end is the source file deleted.

> John
>
>> For large files, this might be rather time consuming,
>> and we should certainly advise all our users not to move files around needlessly if this is the case.
>>
>> Cheers,
>> Oliver
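One way to tell after the fact whether an mv was a true rename or a copy fallback is to compare inode numbers: a rename keeps the inode, a copy allocates a new one. A small demonstration on a local filesystem (illustrative; on CephFS you would compare st_ino of the file before and after the mv):

```python
import os
import shutil
import tempfile

# A true rename(2) keeps the inode: only directory entries change,
# no file data moves.  A copy allocates a new inode and streams bytes.
d = tempfile.mkdtemp()
src = os.path.join(d, "big")
with open(src, "wb") as f:
    f.write(b"\0" * (1 << 20))  # 1 MiB of payload

ino_before = os.stat(src).st_ino

renamed = os.path.join(d, "renamed")
os.rename(src, renamed)                # metadata-only: inode unchanged
assert os.stat(renamed).st_ino == ino_before

copied = os.path.join(d, "copied")
shutil.copy2(renamed, copied)          # byte-by-byte: new inode
assert os.stat(copied).st_ino != ino_before
print("rename kept the inode; copy allocated a new one")
```

If the inode number changes across a plain `mv` within one mount, the tool fell back to copying, matching the growing-target-file observation above.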
[ceph-users] CephFS "move" operation
Dear Cephalopodians,

I was wondering why a simple "mv" is taking extraordinarily long on CephFS and must note that, at least with the fuse-client (12.2.5) and when moving a file from one directory to another, the file appears to be copied first (byte by byte, with the traffic going through the client?) before the initial file is deleted.

Is this true, or am I missing something?

For large files, this might be rather time consuming, and we should certainly advise all our users not to move files around needlessly if this is the case.

Cheers,
Oliver
Re: [ceph-users] Nfs-ganesha 2.6 packages in ceph repo
Hi David,

thanks for the reply! Interesting that the package was not installed - it was for us, but the machines we run the nfs-ganesha servers on are also OSDs, so it might have been pulled in via the ceph packages for us. In any case, I'd say this means librados2 is missing as a dependency either in the libcephfs or the nfs-ganesha packages.

Also, good news that things work fine with 12.2.5 - so I hope our upgrade will also go without bumps ;-).

My experience is sadly only a few months old. We started with nfs-ganesha 2.5 from the Ceph repos, but hit a bad locking issue, which I also reported to this list. After upgrading to 2.6, we did not observe any further hard issues. It seems there are sometimes issues with slow locks if processes are running with a working directory in CephFS and other ceph-fuse clients want to access files in the same directory, but there are no "deadlock" situations anymore.

In terms of tuning, I did not do anything special yet. I'm running with some basic NFS / fileserver kernel tunables (sysctl):

  net.core.rmem_max = 12582912
  net.core.wmem_max = 12582912
  net.ipv4.tcp_rmem = 10240 87380 12582912
  net.ipv4.tcp_wmem = 10240 87380 12582912
  net.ipv4.tcp_window_scaling = 1
  net.ipv4.tcp_timestamps = 1
  net.ipv4.tcp_sack = 1
  net.ipv4.tcp_no_metrics_save = 1
  net.core.netdev_max_backlog = 25
  net.core.default_qdisc = fq_codel

However, I did not do explicit testing of different values, but just followed general recommendations here. ACLs and quotas seem to be honoured by the NFS server (as expected, since it uses libcephfs behind the scenes). Right now, throughput for bulk data is close to perfect (we manage to saturate our 1 GBit/s link), and metadata access seems close to what ceph-fuse achieves, which is sufficient for us.

Cheers and thanks for the feedback,
Oliver

On 16.05.2018 at 21:06, David C wrote:
> Hi Oliver
>
> Thanks for following up. I just picked this up again today and it was indeed librados2... the package wasn't installed! It's working now; I haven't tested much, but I haven't noticed any problems yet. This is with nfs-ganesha-2.6.1-0.1.el7.x86_64, libcephfs2-12.2.5-0.el7.x86_64 and librados2-12.2.5-0.el7.x86_64. Thanks for the pointer on that.
>
> I'd be interested to hear your experience with ganesha with cephfs if you're happy to share some insights. Any tuning you would recommend?
>
> Thanks,
>
> On Wed, May 16, 2018 at 4:14 PM, Oliver Freyermuth <freyerm...@physik.uni-bonn.de> wrote:
>
> > Hi David,
> >
> > did you already manage to check your librados2 version and pin down the issue?
> >
> > Cheers,
> > Oliver
> >
> > On 11.05.2018 at 17:15, Oliver Freyermuth wrote:
> > > Hi David,
> > >
> > > On 11.05.2018 at 16:55, David C wrote:
> > >> Hi Oliver
> > >>
> > >> Thanks for the detailed response! I've downgraded my libcephfs2 to 12.2.4 and still get a similar error:
> > >>
> > >> load_fsal :NFS STARTUP :CRIT :Could not dlopen module:/usr/lib64/ganesha/libfsalceph.so Error:/lib64/libcephfs.so.2: undefined symbol: _Z14common_preinitRK18CephInitParameters18code_environment_ti
> > >> load_fsal :NFS STARTUP :MAJ :Failed to load module (/usr/lib64/ganesha/libfsalceph.so) because: Can not access a needed shared library
> > >>
> > >> I'm on CentOS 7.4, using the following package versions:
> > >>
> > >> # rpm -qa | grep ganesha
> > >> nfs-ganesha-2.6.1-0.1.el7.x86_64
> > >> nfs-ganesha-vfs-2.6.1-0.1.el7.x86_64
> > >> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> > >>
> > >> # rpm -qa | grep ceph
> > >> libcephfs2-12.2.4-0.el7.x86_64
> > >> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> > >
> > > Mhhhm - that sounds like a mess-up in the dependencies. The symbol you are missing should be provided by librados2-12.2.4-0.el7.x86_64, which contains /usr/lib64/ceph/ceph/libcephfs-common.so.0. Do you have a different version of librados2 installed? If so, I wonder how yum / rpm allowed that ;-).
> > >
> > > Thinking again, it might also be (if you indeed have a different version there) that this is the cause of the previous error as well. If the problematic symbol is indeed not exposed, but can be resolved only if both libraries (libcephfs-common and libcephfs) are loaded in unison with matching versions, it might be that 12.2.5 also works fine...
> > >
> > > First thing, in any case, is to check which version of librados2 you are using ;-).
Re: [ceph-users] Nfs-ganesha 2.6 packages in ceph repo
Hi David,

did you already manage to check your librados2 version and pin down the issue?

Cheers,
Oliver

On 11.05.2018 at 17:15, Oliver Freyermuth wrote:
> Hi David,
>
> On 11.05.2018 at 16:55, David C wrote:
>> Hi Oliver
>>
>> Thanks for the detailed response! I've downgraded my libcephfs2 to 12.2.4 and still get a similar error:
>>
>> load_fsal :NFS STARTUP :CRIT :Could not dlopen module:/usr/lib64/ganesha/libfsalceph.so Error:/lib64/libcephfs.so.2: undefined symbol: _Z14common_preinitRK18CephInitParameters18code_environment_ti
>> load_fsal :NFS STARTUP :MAJ :Failed to load module (/usr/lib64/ganesha/libfsalceph.so) because: Can not access a needed shared library
>>
>> I'm on CentOS 7.4, using the following package versions:
>>
>> # rpm -qa | grep ganesha
>> nfs-ganesha-2.6.1-0.1.el7.x86_64
>> nfs-ganesha-vfs-2.6.1-0.1.el7.x86_64
>> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
>>
>> # rpm -qa | grep ceph
>> libcephfs2-12.2.4-0.el7.x86_64
>> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
>
> Mhhhm - that sounds like a mess-up in the dependencies. The symbol you are missing should be provided by librados2-12.2.4-0.el7.x86_64, which contains /usr/lib64/ceph/ceph/libcephfs-common.so.0. Do you have a different version of librados2 installed? If so, I wonder how yum / rpm allowed that ;-).
>
> Thinking again, it might also be (if you indeed have a different version there) that this is the cause of the previous error as well. If the problematic symbol is indeed not exposed, but can be resolved only if both libraries (libcephfs-common and libcephfs) are loaded in unison with matching versions, it might be that 12.2.5 also works fine...
>
> First thing, in any case, is to check which version of librados2 you are using ;-).
>
>> I don't have the ceph user space components installed, assuming they're not necessary apart from libcephfs2? Any idea why it's giving me this error?
>>
>> Thanks,
>>
>> On Fri, May 11, 2018 at 2:17 AM, Oliver Freyermuth <freyerm...@physik.uni-bonn.de> wrote:
>>
>> Hi David,
>>
>> for what it's worth, we are running with nfs-ganesha 2.6.1 from the Ceph repos on CentOS 7.4 with the following set of versions:
>> libcephfs2-12.2.4-0.el7.x86_64
>> nfs-ganesha-2.6.1-0.1.el7.x86_64
>> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
>> Of course, we plan to upgrade to 12.2.5 soon-ish...
>>
>> On 11.05.2018 at 00:05, David C wrote:
>> > Hi All
>> >
>> > I'm testing out the nfs-ganesha-2.6.1-0.1.el7.x86_64.rpm package from http://download.ceph.com/nfs-ganesha/rpm-V2.6-stable/luminous/x86_64/
>> >
>> > It's failing to load /usr/lib64/ganesha/libfsalceph.so
>> >
>> > With libcephfs-12.2.1 installed I get the following error in my ganesha log:
>> >
>> > load_fsal :NFS STARTUP :CRIT :Could not dlopen module:/usr/lib64/ganesha/libfsalceph.so Error:
>> > /usr/lib64/ganesha/libfsalceph.so: undefined symbol: ceph_set_deleg_timeout
>> > load_fsal :NFS STARTUP :MAJ :Failed to load module (/usr/lib64/ganesha/libfsalceph.so) because
>> > : Can not access a needed shared library
>>
>> That looks like an ABI incompatibility; probably the nfs-ganesha packages should block this libcephfs2 version (and older ones).
>>
>> > With libcephfs-12.2.5 installed I get:
>> >
>> > load_fsal :NFS STARTUP :CRIT :Could not dlopen module:/usr/lib64/ganesha/libfsalceph.so Error:
>> > /lib64/libcephfs.so.2: undefined symbol: _ZNK5FSMap10parse_roleEN5boost17basic_string_viewIcSt11char_traitsIcEEEP10mds_role_tRSo
>> > load_fsal :NFS STARTUP :MAJ :Failed to load module (/usr/lib64/ganesha/libfsalceph.so) because
>> > : Can not access a needed shared library
>>
>> That looks ugly and makes me fear for our planned 12.2.5 upgrade. Interestingly, we do not have that symbol on 12.2.4:
>>
>> # nm -D /lib64/libcephfs.so.2 | grep FSMap
>>          U _ZNK5FSMap10parse_roleERKSsP10mds_role_tRSo
>>          U _ZNK5FSMap13print_summaryEPN4ceph9FormatterEPSo
>>
>> and NFS-Ganesha works fine.
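ganesha's load_fsal() fails inside dlopen(3): with immediate binding, every undefined symbol in the FSAL and its dependency chain must resolve at load time, which is why a libcephfs2/librados2 version skew surfaces as "undefined symbol". The same loader behaviour can be reproduced from Python, since ctypes ORs RTLD_NOW into its dlopen mode (the library name below is illustrative):

```python
import ctypes

def try_dlopen(name):
    """Load a shared library the way ganesha's load_fsal() does and
    report the dynamic loader's error message on failure."""
    try:
        ctypes.CDLL(name)  # CPython adds RTLD_NOW to the dlopen(3) mode
        return "loaded"
    except OSError as e:
        # e.g. "/lib64/libcephfs.so.2: undefined symbol: _Z14common_preinit..."
        return f"dlopen failed: {e}"

# dlopen(NULL) hands back the running process itself -- always succeeds:
print(try_dlopen(None))
# A missing (or symbol-broken) library yields the loader's error text,
# the same message that ends up in the ganesha log's load_fsal lines:
print(try_dlopen("libfsalceph.so"))
```

To pin down which package should provide a missing mangled symbol on an RPM system, running `nm -D` over the candidate libraries (as in the 12.2.4 example above) combined with `rpm -qf <library path>` is usually enough.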
Re: [ceph-users] Nfs-ganesha 2.6 packages in ceph repo
Hi David,

On 11.05.2018 at 16:55, David C wrote:
> Hi Oliver
>
> Thanks for the detailed response! I've downgraded my libcephfs2 to 12.2.4 and still get a similar error:
>
> load_fsal :NFS STARTUP :CRIT :Could not dlopen module:/usr/lib64/ganesha/libfsalceph.so Error:/lib64/libcephfs.so.2: undefined symbol: _Z14common_preinitRK18CephInitParameters18code_environment_ti
> load_fsal :NFS STARTUP :MAJ :Failed to load module (/usr/lib64/ganesha/libfsalceph.so) because: Can not access a needed shared library
>
> I'm on CentOS 7.4, using the following package versions:
>
> # rpm -qa | grep ganesha
> nfs-ganesha-2.6.1-0.1.el7.x86_64
> nfs-ganesha-vfs-2.6.1-0.1.el7.x86_64
> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
>
> # rpm -qa | grep ceph
> libcephfs2-12.2.4-0.el7.x86_64
> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64

Mhhhm - that sounds like a mess-up in the dependencies. The symbol you are missing should be provided by librados2-12.2.4-0.el7.x86_64, which contains /usr/lib64/ceph/ceph/libcephfs-common.so.0. Do you have a different version of librados2 installed? If so, I wonder how yum / rpm allowed that ;-).

Thinking again, it might also be (if you indeed have a different version there) that this is the cause of the previous error as well. If the problematic symbol is indeed not exposed, but can be resolved only if both libraries (libcephfs-common and libcephfs) are loaded in unison with matching versions, it might be that 12.2.5 also works fine...

First thing, in any case, is to check which version of librados2 you are using ;-).

Cheers,
Oliver

> I don't have the ceph user space components installed, assuming they're not necessary apart from libcephfs2? Any idea why it's giving me this error?
>
> Thanks,
>
> On Fri, May 11, 2018 at 2:17 AM, Oliver Freyermuth <freyerm...@physik.uni-bonn.de> wrote:
>
> > Hi David,
> >
> > for what it's worth, we are running with nfs-ganesha 2.6.1 from the Ceph repos on CentOS 7.4 with the following set of versions:
> > libcephfs2-12.2.4-0.el7.x86_64
> > nfs-ganesha-2.6.1-0.1.el7.x86_64
> > nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> > Of course, we plan to upgrade to 12.2.5 soon-ish...
> >
> > On 11.05.2018 at 00:05, David C wrote:
> > > Hi All
> > >
> > > I'm testing out the nfs-ganesha-2.6.1-0.1.el7.x86_64.rpm package from http://download.ceph.com/nfs-ganesha/rpm-V2.6-stable/luminous/x86_64/
> > >
> > > It's failing to load /usr/lib64/ganesha/libfsalceph.so
> > >
> > > With libcephfs-12.2.1 installed I get the following error in my ganesha log:
> > >
> > > load_fsal :NFS STARTUP :CRIT :Could not dlopen module:/usr/lib64/ganesha/libfsalceph.so Error:
> > > /usr/lib64/ganesha/libfsalceph.so: undefined symbol: ceph_set_deleg_timeout
> > > load_fsal :NFS STARTUP :MAJ :Failed to load module (/usr/lib64/ganesha/libfsalceph.so) because
> > > : Can not access a needed shared library
> >
> > That looks like an ABI incompatibility; probably the nfs-ganesha packages should block this libcephfs2 version (and older ones).
> >
> > > With libcephfs-12.2.5 installed I get:
> > >
> > > load_fsal :NFS STARTUP :CRIT :Could not dlopen module:/usr/lib64/ganesha/libfsalceph.so Error:
> > > /lib64/libcephfs.so.2: undefined symbol: _ZNK5FSMap10parse_roleEN5boost17basic_string_viewIcSt11char_traitsIcEEEP10mds_role_tRSo
> > > load_fsal :NFS STARTUP :MAJ :Failed to load module (/usr/lib64/ganesha/libfsalceph.so) because
> > > : Can not access a needed shared library
> >
> > That looks ugly and makes me fear for our planned 12.2.5 upgrade. Interestingly, we do not have that symbol on 12.2.4:
> >
> > # nm -D /lib64/libcephfs.so.2 | grep FSMap
> >          U _ZNK5FSMap10parse_roleERKSsP10mds_role_tRSo
> >          U _ZNK5FSMap13print_summaryEPN4ceph9FormatterEPSo
> >
> > and NFS-Ganesha works fine.
> >
> > Looking at:
> > https://github.com/ceph/ceph/blob/v12.2.4/src/mds/FSMap.h
> > versus
> > https://github.com/ceph/ceph/blob/v12.2.5/src/mds/FSMap.h
> > it seems this commit:
> > https://github.com/ceph/ceph/commit/7d8b3c1082b6b870710989773f3cd98a472b9a3d
> > changed the libcephfs2 ABI.
> >
> > I've no idea h