Re: [ceph-users] 12.2.8: 1 node comes up (noout set), from a 6 nodes cluster -> I/O stuck (rbd usage)

2018-10-19 Thread Eugen Block
Hi Denny, the recommendation for ceph maintenance is to set three flags if you need to shut down a node (or the entire cluster): ceph osd set noout ceph osd set nobackfill ceph osd set norecover Although the 'noout' flag seems to be enough for many maintenance tasks, it doesn't prevent the
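A minimal sketch of that flag sequence; the flags are taken from the thread, the unset order afterwards is my assumption:

    # before shutting the node down
    ceph osd set noout
    ceph osd set nobackfill
    ceph osd set norecover
    # once the node and its OSDs are back up
    ceph osd unset norecover
    ceph osd unset nobackfill
    ceph osd unset noout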

Re: [ceph-users] 12.2.8: 1 node comes up (noout set), from a 6 nodes cluster -> I/O stuck (rbd usage)

2018-10-19 Thread Eugen Block
one server with noout. Paul On Fri, 19 Oct 2018 at 11:37, Eugen Block wrote: Hi Denny, the recommendation for ceph maintenance is to set three flags if you need to shut down a node (or the entire cluster): ceph osd set noout ceph osd set nobackfill ceph osd set norecover Although

Re: [ceph-users] mds_cache_memory_limit value

2018-10-05 Thread Eugen Block
Hi, you can monitor the cache size and see if the new values are applied: ceph@mds:~> ceph daemon mds. cache status { "pool": { "items": 106708834, "bytes": 5828227058 } } You should also see in top (or similar tools) that the memory increases/decreases. From my
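A hedged example of checking and adjusting the limit; the daemon name and the 4 GiB value are placeholders, and injectargs is assumed for a Luminous-era cluster:

    ceph daemon mds.<name> config get mds_cache_memory_limit
    ceph tell mds.<name> injectargs '--mds_cache_memory_limit 4294967296'   # 4 GiB, example value
    ceph daemon mds.<name> cache status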

Re: [ceph-users] Does anyone use interactive CLI mode?

2018-10-11 Thread Eugen Block
I only tried to use the Ceph CLI once out of curiosity, simply because it is there, but I don't really benefit from it. Usually when I'm working with clusters it requires a combination of different commands (rbd, rados, ceph etc.), so this would mean either exiting and entering the CLI back

Re: [ceph-users] Clients report OSDs down/up (dmesg) nothing in Ceph logs (flapping OSDs)

2018-08-30 Thread Eugen Block
for the clarification! Zitat von Ilya Dryomov : On Thu, Aug 30, 2018 at 1:04 PM Eugen Block wrote: Hi again, we still didn't figure out the reason for the flapping, but I wanted to get back on the dmesg entries. They just reflect what happened in the past, they're no indicator to predict anything

Re: [ceph-users] Ceph-Deploy error on 15/71 stage

2018-09-03 Thread Eugen Block
and will expire in version "Sodium". ago 31 12:43:51 polar salt-minion[1421]: [ERROR ] Mine on polar.iq.ufrgs.br for cephdisks.list ago 31 12:43:51 polar salt-minion[1421]: [ERROR ] Module function osd.deploy threw an exception. Exception: Mine on polar.

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Eugen Block
Hi Olivier, what size does the cache tier have? You could set cache-mode to forward and flush it, maybe restarting those OSDs (68, 69) helps, too. Or there could be an issue with the cache tier, what do those logs say? Regards, Eugen Zitat von Olivier Bonvalet : Hello, on a Luminous

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Eugen Block
ier On Friday, 21 September 2018 at 09:34 +0000, Eugen Block wrote: Hi Olivier, what size does the cache tier have? You could set cache-mode to forward and flush it, maybe restarting those OSDs (68, 69) helps, too. Or there could be an issue with the cache tier, what do those logs say? Regard

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Eugen Block
I tried to flush the cache with "rados -p cache-bkp-foo cache-flush- evict-all", but it blocks on the object "rbd_data.f66c92ae8944a.000f2596". This is the object that's stuck in the cache tier (according to your output in https://pastebin.com/zrwu5X0w). Can you verify if that block

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Eugen Block
I also switched the cache tier to "readproxy", to avoid using this cache. But, it's still blocked. You could change the cache mode to "none" to disable it. Could you paste the output of: ceph osd pool ls detail | grep cache-bkp-foo Zitat von Olivier Bonvalet : In fact, one object (only

Re: [ceph-users] bluestore osd journal move

2018-09-24 Thread Eugen Block
Hi, I am wondering if it is possible to move the ssd journal for the bluestore osd? I would like to move it from one ssd drive to another. yes, this question has been asked several times. Depending on your deployment there are several things to be aware of, maybe you should first read [1]

Re: [ceph-users] Bluestore DB showing as ssd

2018-09-26 Thread Eugen Block
Hi, how did you create the OSDs? Were they built from scratch with the respective command options (--block.db /dev/)? You could check what the bluestore tool tells you about the block.db: ceph1:~ # ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-21/block | grep path
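A sketch of how that verification could look; the OSD id is a placeholder:

    ls -l /var/lib/ceph/osd/ceph-<id>/ | grep block.db
    ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-<id>/block.db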

Re: [ceph-users] ACL '+' not shown in 'ls' on kernel cephfs mount

2018-09-26 Thread Eugen Block
Hi, I can confirm this for: ceph --version ceph version 12.2.5-419-g8cbf63d997 (8cbf63d997fb5cdc783fe7bfcd4f5032ee140c0c) luminous (stable) Setting ACLs on a file works as expected (restrict file access to specific user), getfacl displays correct information, but 'ls -la' does not

Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-26 Thread Eugen Block
l strongly offers to unset nodown parameter. What do you think? Eugen Block wrote on Wed, 26 Sep 2018 at 12:54: Hi, could this be related to this other Mimic upgrade thread [1]? Your failing MONs sound a bit like the problem described there, eventually the user reported recovery

Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-26 Thread Eugen Block
Hi, could this be related to this other Mimic upgrade thread [1]? Your failing MONs sound a bit like the problem described there, eventually the user reported recovery success. You could try the described steps: - disable cephx auth with 'auth_cluster_required = none' - set the

Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-25 Thread Eugen Block
I would try to reduce recovery to a minimum, something like this helped us in in a small cluster (25 OSDs on 3 hosts) in case of recovery while operation continued without impact: ceph tell 'osd.*' injectargs '--osd-recovery-max-active 2' ceph tell 'osd.*' injectargs '--osd-max-backfills 8'
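The same idea as a hedged sketch with even more conservative values (the numbers are illustrative, not from the thread):

    ceph tell 'osd.*' injectargs '--osd-recovery-max-active 1'
    ceph tell 'osd.*' injectargs '--osd-max-backfills 1'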

Re: [ceph-users] Ceph MDS WRN replayed op client.$id

2018-09-19 Thread Eugen Block
Yeah, since we haven't knowingly done anything about it, it would be a (pleasant) surprise if it was accidentally resolved in mimic ;-) Too bad ;-) Thanks for your help! Eugen Zitat von John Spray : On Wed, Sep 19, 2018 at 10:37 AM Eugen Block wrote: Hi John, > I'm not 100% s

Re: [ceph-users] Slow requests blocked. No rebalancing

2018-09-20 Thread Eugen Block
Hi, to reduce impact on clients during migration I would set the OSD's primary-affinity to 0 beforehand. This should prevent the slow requests, at least this setting has helped us a lot with problematic OSDs. Regards Eugen Zitat von Jaime Ibar : Hi all, we recently upgrade from
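A sketch of the primary-affinity approach; osd.12 is a placeholder for the OSD about to be migrated:

    ceph osd primary-affinity osd.12 0    # stop acting as primary for its PGs
    # ... perform the migration/replacement ...
    ceph osd primary-affinity osd.12 1    # restore the default afterwards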

Re: [ceph-users] block.db on a LV? (Re: Mixed SSD+HDD OSD setup recommendation)

2019-01-18 Thread Eugen Block
Hi Jan, I think you're running into an issue reported a couple of times. For the use of LVM you have to specify the name of the Volume Group and the respective Logical Volume instead of the path, e.g. ceph-volume lvm prepare --bluestore --block.db ssd_vg/ssd00 --data /dev/sda Regards, Eugen
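A possible sequence for preparing such an LV up front; VG/LV names, devices and sizes are illustrative only:

    vgcreate ssd_vg /dev/nvme0n1
    lvcreate -L 60G -n ssd00 ssd_vg
    ceph-volume lvm prepare --bluestore --data /dev/sda --block.db ssd_vg/ssd00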

Re: [ceph-users] Using Ceph central backup storage - Best practice creating pools

2019-01-22 Thread Eugen Block
Hi Thomas, What is the best practice for creating pools & images? Should I create multiple pools, means one pool per database? Or should I create a single pool "backup" and use namespace when writing data in the pool? I don't think one pool per DB is reasonable. If the number of DBs

Re: [ceph-users] The OSD can be “down” but still “in”.

2019-01-23 Thread Eugen Block
Hi, If the OSD represents the primary one for a PG, then all IO will be stopped..which may lead to application failure.. no, that's not how it works. You have an acting set of OSDs for a PG, typically 3 OSDs in a replicated pool. If the primary OSD goes down, the secondary becomes the

Re: [ceph-users] Openstack ceph - non bootable volumes

2018-12-20 Thread Eugen Block
volume-baa6c928-8ac1-4240-b189-32b444b434a3 volume-c23a69dc-d043-45f7-970d-1eec2ccb10cc volume-f1872ae6-48e3-4a62-9f46-bf157f079e7f On Wed, 19 Dec 2018 at 09:25, Eugen Block wrote: Hi, can you explain more detailed what exactly goes wrong? In many cases it's an authentication error, can you

[ceph-users] Clarification of mon osd communication

2019-01-10 Thread Eugen Block
Hello list, there are two config options of mon/osd interaction that I don't fully understand. Maybe one of you could clarify it for me. mon osd report timeout - The grace period in seconds before declaring unresponsive Ceph OSD Daemons down. Default 900 mon osd down out interval - The

[ceph-users] Clarification of communication between mon and osd

2019-01-14 Thread Eugen Block
Hello list, I noticed my last post was displayed as a reply to a different thread, so I re-send my question, please excuse the noise. There are two config options of mon/osd interaction that I don't fully understand. Maybe one of you could clarify it for me. mon osd report timeout - The
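To see what a cluster actually uses, one could query a MON daemon directly (the daemon id is a placeholder; the down-out default of 600 seconds is from memory, not from this post):

    ceph daemon mon.<id> config show | grep -E 'mon_osd_report_timeout|mon_osd_down_out_interval'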

Re: [ceph-users] Clarification of communication between mon and osd

2019-01-14 Thread Eugen Block
help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 On Mon, Jan 14, 2019 at 10:17 AM Eugen Block wrote: Hello list, I noticed my last post was displayed as a reply to a different thread, so I re-send my que

Re: [ceph-users] Benchmark does not show gains with DB on SSD

2018-09-12 Thread Eugen Block
Hi Jan, how did you move the WAL and DB to the SSD/NVMe? By recreating the OSDs or a different approach? Did you check afterwards that the devices were really used for that purpose? We had to deal with that a couple of months ago [1] and it's not really obvious if the new devices are

Re: [ceph-users] Ceph MDS WRN replayed op client.$id

2018-09-13 Thread Eugen Block
Hi Stefan, mds.mds1 [WRN] replayed op client.15327973:15585315,15585103 used ino 0x19918de but session next is 0x1873b8b Nothing of importance is logged in the mds (debug_mds_log": "1/5"). What does this warning message mean / indicate? we face these messages on a regular basis.

Re: [ceph-users] Benchmark does not show gains with DB on SSD

2018-09-14 Thread Eugen Block
Hi, Between tests we destroyed the OSDs and created them from scratch. We used Docker image to deploy Ceph on one machine. I've seen that there are WAL/DB partitions created on the disks. Should I also check somewhere in ceph config that it actually uses those? if you created them from

Re: [ceph-users] Ceph MDS WRN replayed op client.$id

2018-09-19 Thread Eugen Block
the replay mds until we hit a real issue. ;-) It's probably impossible to predict any improvement on this with mimic, right? Regards, Eugen Zitat von John Spray : On Mon, Sep 17, 2018 at 2:49 PM Eugen Block wrote: Hi, from your response I understand that these messages are not expected

Re: [ceph-users] Ceph MDS WRN replayed op client.$id

2018-09-17 Thread Eugen Block
Hi, from your response I understand that these messages are not expected if everything is healthy. We face them every now and then, three or four times a week, but there's no real connection to specific jobs or a high load in our cluster. It's a Luminous cluster (12.2.7) with 1 active, 1

Re: [ceph-users] Openstack ceph - non bootable volumes

2018-12-19 Thread Eugen Block
Hi, can you explain more detailed what exactly goes wrong? In many cases it's an authentication error, can you check if your specified user is allowed to create volumes in the respective pool? You could try something like this (from compute node): rbd --user -k

Re: [ceph-users] rbd: error processing image xxx (2) No such file or directory

2019-04-02 Thread Eugen Block
--long" command. Thanks for the clarification. Eugen Zitat von Jason Dillaman : On Tue, Apr 2, 2019 at 8:42 AM Eugen Block wrote: Hi, > If you run "rbd snap ls --all", you should see a snapshot in > the "trash" namespace. I just tried the command "rbd

Re: [ceph-users] rbd: error processing image xxx (2) No such file or directory

2019-04-02 Thread Eugen Block
Hi, If you run "rbd snap ls --all", you should see a snapshot in the "trash" namespace. I just tried the command "rbd snap ls --all" on a lab cluster (nautilus) and get this error: ceph-2:~ # rbd snap ls --all rbd: image name was not specified Are there any requirements I haven't
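At least on that Nautilus version the command seems to expect an image spec; a hedged example with placeholder pool/image names:

    rbd snap ls --all <pool>/<image>
    # trash-namespace snapshots are listed alongside user snapshots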

Re: [ceph-users] ceph-volume lvm batch OSD replacement

2019-03-21 Thread Eugen Block
Hi Dan, I don't know about keeping the osd-id but I just partially recreated your scenario. I wiped one OSD and recreated it. You are trying to re-use the existing block.db-LV with the device path (--block.db /dev/vg-name/lv-name) instead of the LV notation (--block.db vg-name/lv-name):

Re: [ceph-users] ceph migration

2019-02-26 Thread Eugen Block
t cluster. Regards, Eugen Zitat von Janne Johansson : On Mon, 25 Feb 2019 at 13:40, Eugen Block wrote: I just moved a (virtual lab) cluster to a different network, it worked like a charm. In an offline method - you need to: - set osd noout, ensure there are no OSDs up - Change the MONs IP,

Re: [ceph-users] Placing replaced disks to correct buckets.

2019-02-18 Thread Eugen Block
Hi, We skipped stage 1 and replaced the UUIDs of old disks with the new ones in the policy.cfg We ran salt '*' pillar.items and confirmed that the output was correct. It showed the new UUIDs in the correct places. Next we ran salt-run state.orch ceph.stage.3 PS: All of the above ran

Re: [ceph-users] ceph migration

2019-02-25 Thread Eugen Block
I just moved a (virtual lab) cluster to a different network, it worked like a charm. In an offline method - you need to: - set osd noout, ensure there are no OSDs up - Change the MONs IP, See the bottom of [1] "CHANGING A MONITOR’S IP ADDRESS", MONs are the only ones really sticky with the

Re: [ceph-users] cluster is not stable

2019-03-12 Thread Eugen Block
Hi, my first guess would be a network issue. Double-check your connections and make sure the network setup works as expected. Check syslogs, dmesg, switches etc. for hints that a network interruption may have occurred. Regards, Eugen Zitat von Zhenshi Zhou : Hi, I deployed a ceph

Re: [ceph-users] SSD OSD crashing after upgrade to 12.2.10

2019-03-11 Thread Eugen Block
in DB. And assertion might still happen (hopefully with less frequency). So could you please run fsck for OSDs that were broken once and share the results? Then we can decide if it makes sense to proceed with the repair. Thanks, Igor On 2/7/2019 3:37 PM, Eugen Block wrote: Hi list, I

Re: [ceph-users] Luminous to Mimic: MON upgrade requires "full luminous scrub". What is that?

2019-02-07 Thread Eugen Block
Hi, could it be a missing 'ceph osd require-osd-release luminous' on your cluster? When I check a luminous cluster I get this: host1:~ # ceph osd dump | grep recovery flags sortbitwise,recovery_deletes,purged_snapdirs The flags in the code you quote seem related to that. Can you check that
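A sketch of the check and the fix discussed here (run the second command only if the flag really is missing):

    ceph osd dump | grep flags
    ceph osd require-osd-release luminous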

Re: [ceph-users] best practices for EC pools

2019-02-07 Thread Eugen Block
Hi Francois, Is that correct that recovery will be forbidden by the crush rule if a node is down? yes, that is correct, failure-domain=host means no two chunks of the same PG can be on the same host. So if your PG is divided into 6 chunks, they're all on different hosts, no recovery is
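As an illustration, a profile matching the discussed layout (k=4, m=2, failure domain host) could be created like this; profile and pool names are placeholders:

    ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
    ceph osd erasure-code-profile get ec42
    ceph osd pool create ecpool 64 64 erasure ec42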

[ceph-users] How to control automatic deep-scrubs

2019-02-13 Thread Eugen Block
Hi cephers, I'm struggling a little with the deep-scrubs. I know this has been discussed multiple times (e.g. in [1]) and we also use a known crontab script in a Luminous cluster (12.2.10) to start the deep-scrubbing manually (a quarter of all PGs 4 times a week). The script works just

Re: [ceph-users] How to control automatic deep-scrubs

2019-02-13 Thread Eugen Block
Thank you, Konstantin, I'll give that a try. Do you have any comment on osd_deep_mon_scrub_interval? Eugen Zitat von Konstantin Shalygin : The expectation was to prevent the automatic deep-scrubs but they are started anyway You can disable deep-scrubs per pool via `ceph osd pool set

Re: [ceph-users] will crush rule be used during object relocation in OSD failure ?

2019-02-12 Thread Eugen Block
ction CEPH... Regards, /st -Original Message- From: ceph-users On Behalf Of Eugen Block Sent: Tuesday, February 12, 2019 5:32 PM To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] will crush rule be used during object relocation in OSD failure ? Hi, I came to the same conclusion a

Re: [ceph-users] How to control automatic deep-scrubs

2019-02-13 Thread Eugen Block
I created http://tracker.ceph.com/issues/38310 for this. Regards, Eugen Zitat von Konstantin Shalygin : On 2/14/19 2:21 PM, Eugen Block wrote: Already did, but now with highlighting ;-) http://docs.ceph.com/docs/luminous/rados/operations/health-checks/?highlight=osd_deep_mon_scrub_interval

Re: [ceph-users] How to control automatic deep-scrubs

2019-02-13 Thread Eugen Block
2:16 PM, Eugen Block wrote: Exactly, it's also not available in a Mimic test-cluster. But it's mentioned in the docs for L and M (I didn't check the docs for other releases), that's what I was wondering about. Can you provide url to this page? k

Re: [ceph-users] How to control automatic deep-scrubs

2019-02-13 Thread Eugen Block
My Ceph Luminous don't know anything about this option: # ceph daemon osd.7 config help osd_deep_mon_scrub_interval { "error": "Setting not found: 'osd_deep_mon_scrub_interval'" } Exactly, it's also not available in a Mimic test-cluster. But it's mentioned in the docs for L and M (I

Re: [ceph-users] will crush rule be used during object relocation in OSD failure ?

2019-02-12 Thread Eugen Block
Hi, I came to the same conclusion after doing various tests with rooms and failure domains. I agree with Maged and suggest to use size=4, min_size=2 for replicated pools. It's more overhead but you can survive the loss of one room and even one more OSD (of the affected PG) without losing
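A hedged sketch of applying size=4/min_size=2 to an existing replicated pool; the pool name is a placeholder and the room-aware crush rule is assumed to exist already:

    ceph osd pool set <pool> size 4
    ceph osd pool set <pool> min_size 2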

Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-02-15 Thread Eugen Block
I have no issues opening that site from Germany. Zitat von Dan van der Ster : On Fri, Feb 15, 2019 at 11:40 AM Willem Jan Withagen wrote: On 15/02/2019 10:39, Ilya Dryomov wrote: > On Fri, Feb 15, 2019 at 12:05 AM Mike Perez wrote: >> >> Hi Marc, >> >> You can see previous designs on the

[ceph-users] SSD OSD crashing after upgrade to 12.2.10

2019-02-07 Thread Eugen Block
Hi list, I found this thread [1] about crashing SSD OSDs, although that was about an upgrade to 12.2.7, we just hit (probably) the same issue after our update to 12.2.10 two days ago in a production cluster. Just half an hour ago I saw one OSD (SSD) crashing (for the first time):

Re: [ceph-users] SSD OSD crashing after upgrade to 12.2.10

2019-02-07 Thread Eugen Block
. And assertion might still happen (hopefully with less frequency). So could you please run fsck for OSDs that were broken once and share the results? Then we can decide if it makes sense to proceed with the repair. Thanks, Igor On 2/7/2019 3:37 PM, Eugen Block wrote: Hi list, I found

Re: [ceph-users] min_size vs. K in erasure coded pools

2019-02-20 Thread Eugen Block
Hi, I see that as a security feature ;-) You can prevent data loss if k chunks are intact, but you don't want to work with the least required amount of chunks. In a disaster scenario you can reduce min_size to k temporarily, but the main goal should always be to get the OSDs back up. For
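For illustration, temporarily relaxing min_size on a k=4, m=2 pool in a disaster could look like this (pool name and values are placeholders; revert as soon as the OSDs are back up):

    ceph osd pool set <ec-pool> min_size 4    # down to k, only temporarily
    ceph osd pool set <ec-pool> min_size 5    # back to k+1 after recovery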

Re: [ceph-users] SSD OSD crashing after upgrade to 12.2.10

2019-02-07 Thread Eugen Block
fsck report first. W.r.t to running ceph-bluestore-tool - you might want to specify log file and increase log level to 20 using --log-file and --log-level options. On 2/7/2019 4:45 PM, Eugen Block wrote: Hi Igor, thanks for the quick response! Just to make sure I don't misunderstand

Re: [ceph-users] Creating a block device user with restricted access to image

2019-01-25 Thread Eugen Block
68a5700 failed to open image: (1) Operation not permitted rbd: error opening image isa: (1) Operation not permitted In some cases useful info is found in syslog - try "dmesg | tail". rbd: map failed: (1) Operation not permitted Regards Thomas On 25.01.2019 at 11:52, Eugen B

Re: [ceph-users] Creating a block device user with restricted access to image

2019-01-25 Thread Eugen Block
Hi, I replied to your thread a couple of days ago, maybe you didn't notice: Restricting user access is possible on rbd image level. You can grant read/write access for one client and only read access for other clients, you have to create different clients for that, see [1] for more

Re: [ceph-users] Creating a block device user with restricted access to image

2019-01-25 Thread Eugen Block
You can check all objects of that pool to see if your caps match: rados -p backup ls | grep rbd_id Zitat von Eugen Block : caps osd = "allow pool backup object_prefix rbd_data.18102d6b8b4567; allow rwx pool backup object_prefix rbd_header.18102d6b8b4567; allow rx pool backup object_p

Re: [ceph-users] showing active config settings

2019-04-10 Thread Eugen Block
Hi, I haven't used the --show-config option until now, but if you ask your OSD daemon directly, your change should have been applied: host1:~ # ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4' host1:~ # ceph daemon osd.1 config show | grep osd_recovery_max_active

Re: [ceph-users] showing active config settings

2019-04-10 Thread Eugen Block
I'll keep it that way. ;-) Zitat von Janne Johansson : On Wed, 10 Apr 2019 at 13:37, Eugen Block wrote: > If you don't specify which daemon to talk to, it tells you what the > defaults would be for a random daemon started just now using the same > config as you have in /etc/ceph/

Re: [ceph-users] showing active config settings

2019-04-10 Thread Eugen Block
osd_recovery_max_active osd_recovery_max_active = 3 Zitat von Janne Johansson : On Wed, 10 Apr 2019 at 13:31, Eugen Block wrote: While --show-config still shows host1:~ # ceph --show-config | grep osd_recovery_max_active osd_recovery_max_active = 3 It seems as if --show-config is not really up-to-date

Re: [ceph-users] osd daemon cluster_fsid not reflecting actual cluster_fsid

2019-06-20 Thread Eugen Block
": "173b6382-504b-421f-aa4d-52526fa80dfa", "kv_backend": "rocksdb", "magic": "ceph osd volume v026", "mkfs_done": "yes", "osd_key": "AQBXwwddy5OEAxAAS4AidvOF0kl+kxIBvFhT1A==", "ready": "ready"

Re: [ceph-users] osd daemon cluster_fsid not reflecting actual cluster_fsid

2019-06-18 Thread Eugen Block
Hi, this OSD must have been part of a previous cluster, I assume. I would remove it from crush if it's still there (check just to make sure), wipe the disk, remove any traces like logical volumes (if it was a ceph-volume lvm OSD) and if possible, reboot the node. Regards, Eugen Zitat von
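A possible cleanup sequence, assuming osd.X and /dev/sdX as placeholders and a ceph-volume lvm deployment:

    ceph osd crush remove osd.X      # only if it still shows up in 'ceph osd tree'
    ceph auth del osd.X
    ceph osd rm osd.X
    ceph-volume lvm zap /dev/sdX --destroy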

Re: [ceph-users] Failed Disk simulation question

2019-05-22 Thread Eugen Block
Hi Alex, The cluster has been idle at the moment being new and all. I noticed some disk related errors in dmesg but that was about it. It looked to me for the next 20 - 30 minutes the failure has not been detected. All osds were up and in and health was OK. OSD logs had no smoking gun

Re: [ceph-users] inconsistent number of pools

2019-05-20 Thread Eugen Block
Hi, have you tried 'ceph health detail'? Zitat von Lars Täuber : Hi everybody, with the status report I get a HEALTH_WARN I don't know how to get rid of. It my be connected to recently removed pools. # ceph -s cluster: id: 6cba13d1-b814-489c-9aac-9c04aaf78720 health:

Re: [ceph-users] Is a not active mds doing something?

2019-05-21 Thread Eugen Block
Hi Marc, have you configured the other MDS to be standby-replay for the active MDS? I have three MDS servers, one is active, the second is active-standby and the third just standby. If the active fails, the second takes over within seconds. This is what I have in my ceph.conf: [mds.]

Re: [ceph-users] Nautilus, k+m erasure coding a profile vs size+min_size

2019-05-21 Thread Eugen Block
Hi, this question comes up regularly and is been discussed just now: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/034867.html Regards, Eugen Zitat von Yoann Moulin : Dear all, I am doing some tests with Nautilus and cephfs on erasure coding pool. I noticed something

Re: [ceph-users] Is it possible to get list of all the PGs assigned to an OSD?

2019-04-29 Thread Eugen Block
Sure there is: ceph pg ls-by-osd Regards, Eugen Zitat von Igor Podlesny : Or is there no direct way to accomplish that? What workarounds can be used then? -- End of message. Next message? ___ ceph-users mailing list ceph-users@lists.ceph.com
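For example (the OSD id is a placeholder):

    ceph pg ls-by-osd 17
    ceph pg ls-by-osd 17 | wc -l    # rough count of PGs mapped to that OSD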

Re: [ceph-users] ceph-volume failed after replacing disk

2019-07-05 Thread Eugen Block
Hi, did you also remove that OSD from crush and also from auth before recreating it? ceph osd crush remove osd.71 ceph auth del osd.71 Regards, Eugen Zitat von "ST Wong (ITSC)" : Hi all, We replaced a faulty disk out of N OSD and tried to follow steps according to "Replacing and OSD"

Re: [ceph-users] MGR Logs after Failure Testing

2019-06-27 Thread Eugen Block
Hi, some more information about the cluster status would be helpful, such as ceph -s ceph osd tree service status of all MONs, MDSs, MGRs. Are all services up? Did you configure the spare MDS as standby for rank 0 so that a failover can happen? Regards, Eugen Zitat von

Re: [ceph-users] MGR Logs after Failure Testing

2019-06-28 Thread Eugen Block
m Air International Inc. dhils...@performair.com www.PerformAir.com -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Eugen Block Sent: Thursday, June 27, 2019 8:23 AM To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] MGR Logs after Failure Testing Hi

Re: [ceph-users] Fwd: HW failure cause client IO drops

2019-04-16 Thread Eugen Block
Good morning, the OSDs are usually marked out after 10 minutes, that's when rebalancing starts. But the I/O should not drop during that time, this could be related to your pool configuration. If you have a replicated pool of size 3 and also set min_size to 3 the I/O would pause if a node

Re: [ceph-users] Cinder pool inaccessible after Nautilus upgrade

2019-07-02 Thread Eugen Block
Hi, did you try to use rbd and rados commands with the cinder keyring, not the admin keyring? Did you check if the caps for that client are still valid (do the caps differ between the two cinder pools)? Are the ceph versions on your hypervisors also nautilus? Regards, Eugen Zitat von
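A hedged way to test exactly that, assuming the usual client.cinder naming and a 'volumes' pool (both are placeholders here):

    ceph auth get client.cinder
    rbd -p volumes --id cinder --keyring /etc/ceph/ceph.client.cinder.keyring ls
    rados -p volumes --id cinder --keyring /etc/ceph/ceph.client.cinder.keyring ls | head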

Re: [ceph-users] PGs allocated to osd with weights 0

2019-07-02 Thread Eugen Block
Hi, I can’t get data flushed out of osd with weights set to 0. Is there any way of checking the tasks queued for PG remapping ? Thank You. can you give some more details about your cluster (replicated or EC pools, applied rules etc.)? My first guess would be that the other OSDs are

Re: [ceph-users] ceph device list empty

2019-08-15 Thread Eugen Block
Hi, are the OSD nodes on Nautilus already? We upgraded from Luminous to Nautilus recently and the commands return valid output, except for those OSDs that haven't been upgraded yet. Zitat von Gary Molenkamp : I've had no luck in tracing this down.  I've tried setting debugging and log

Re: [ceph-users] Howto add DB (aka RockDB) device to existing OSD on HDD

2019-08-29 Thread Eugen Block
Hi, Then I tried to move DB to a new device (SSD) that is not formatted: root@ld5505:~# ceph-bluestore-tool bluefs-bdev-new-db –-path /var/lib/ceph/osd/ceph-76 --dev-target /dev/sdbk too many positional options have been specified on the command line I think you're trying the wrong option.

Re: [ceph-users] Howto add DB (aka RockDB) device to existing OSD on HDD

2019-08-29 Thread Eugen Block
Sorry, I misread, your option is correct, of course since there was no external db device. This worked for me: ceph-2:~ # CEPH_ARGS="--bluestore-block-db-size 1048576" ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-1 bluefs-bdev-new-db --dev-target /dev/sdb inferring bluefs devices

[ceph-users] OSD replacement causes slow requests

2019-07-18 Thread Eugen Block
Hi list, we're facing an unexpected recovery behavior of an upgraded cluster (Luminous -> Nautilus). We added new servers with Nautilus to the existing Luminous cluster, so we could first replace the MONs step by step. Then we moved the old servers to a new root in the crush map and then

Re: [ceph-users] OSD replacement causes slow requests

2019-07-24 Thread Eugen Block
needs his vacation. ;-) Regards, Eugen Zitat von Wido den Hollander : On 7/18/19 12:21 PM, Eugen Block wrote: Hi list, we're facing an unexpected recovery behavior of an upgraded cluster (Luminous -> Nautilus). We added new servers with Nautilus to the existing Luminous cluster, so we co

[ceph-users] Nautilus dashboard: crushmap viewer shows only first root

2019-07-24 Thread Eugen Block
Hi all, we just upgraded our cluster to: ceph version 14.2.0-300-gacd2f2b9e1 (acd2f2b9e196222b0350b3b59af9981f91706c7f) nautilus (stable) When clicking through the dashboard to see what's new we noticed that the crushmap viewer only shows the first root of our crushmap (we have two

Re: [ceph-users] Nautilus dashboard: crushmap viewer shows only first root

2019-07-24 Thread Eugen Block
Thank you very much! Zitat von EDH - Manuel Rios Fernandez : Hi Eugen, Yes its solved, we reported in 14.2.1 and team fixed in 14.2.2 Regards, Manuel -Mensaje original- De: ceph-users En nombre de Eugen Block Enviado el: miércoles, 24 de julio de 2019 15:10 Para: ceph-users

Re: [ceph-users] Is deepscrub Part of PG increase?

2019-11-03 Thread Eugen Block
Hi, deep-scrubs can also be configured per pool, so even if you have adjusted the general deep-scrub time the deep-scrubs will still happen. To disable per-pool deep-scrubs you need to set ceph osd pool set <pool> nodeep-scrub true Regards, Eugen Zitat von c...@elchaka.de: Hello, I have a
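For example, for a single pool (the name is a placeholder), set the flag and verify it:

    ceph osd pool set <pool> nodeep-scrub true
    ceph osd pool get <pool> nodeep-scrub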

Re: [ceph-users] Command ceph osd df hangs

2019-11-21 Thread Eugen Block
Hi, check if the active MGR is hanging. I had this when testing pg_autoscaler, after some time every command would hang. Restarting the MGR helped for a short period of time, then I disabled pg_autoscaler. This is an upgraded cluster, currently on Nautilus. Regards, Eugen Zitat von

Re: [ceph-users] Cluster in ERR status when rebalancing

2019-12-09 Thread Eugen Block
Hi, since we upgraded our cluster to Nautilus we also see those messages sometimes when it's rebalancing. There are several reports about this [1] [2], we didn't see it in Luminous. But eventually the rebalancing finished and the error message cleared, so I'd say there's (probably)

Re: [ceph-users] HEALTH_WARN 1 MDSs report oversized cache

2019-12-05 Thread Eugen Block
Hi, can you provide more details? ceph daemon mds. cache status ceph config show mds. | grep mds_cache_memory_limit Regards, Eugen Zitat von Ranjan Ghosh : Okay, now, after I settled the issue with the oneshot service thanks to the amazing help of Paul and Richard (thanks again!), I still

Re: [ceph-users] clust recovery stuck

2019-10-22 Thread Eugen Block
Hi, can you share `ceph osd tree`? What crush rules are in use in your cluster? I assume that the two failed OSDs prevent the remapping because the rules can't be applied. Regards, Eugen Zitat von Philipp Schwaha : hi, I have a problem with a cluster being stuck in recovery after osd

Re: [ceph-users] clust recovery stuck

2019-10-23 Thread Eugen Block
that helps.  This would allow the recovery to proceed - but you should consider adding OSDs (or at least increase the memory allocated to OSDs above the defaults). Andras On 10/22/19 3:02 PM, Philipp Schwaha wrote: hi, On 2019-10-22 08:05, Eugen Block wrote: Hi, can you share `ceph osd t

Re: [ceph-users] ceph stats on the logs

2019-10-08 Thread Eugen Block
Hi, there is also /var/log/ceph/ceph.log on the MONs, it has the stats you're asking for. Does this answer your question? Regards, Eugen Zitat von nokia ceph : Hi Team, With default log settings , the ceph stats will be logged like cluster [INF] pgmap v30410386: 8192 pgs: 8192

Re: [ceph-users] pgs backfill_toofull after removing OSD from CRUSH map

2019-12-19 Thread Eugen Block
Hi Kristof, setting the OSD "out" doesn't change the crush weight of that OSD, but removing it from the tree does, that's why the cluster started to rebalance. Regards, Eugen Zitat von Kristof Coucke : Hi all, We are facing a strange symptom here. We're testing our recovery
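If the goal is to drain an OSD without the sudden weight change that removing it from the tree causes, gradually lowering its crush weight is a common alternative (id and step size are illustrative):

    ceph osd crush reweight osd.7 0.5
    ceph osd crush reweight osd.7 0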

Re: [ceph-users] OSD Marked down unable to restart continuously failing

2020-01-11 Thread Eugen Block
Hi, you say the daemons are locally up and running but restarting fails? Which one is it? Do you see any messages suggesting flapping OSDs? After 5 retries within 10 minutes the OSDs would be marked out. What is the result of your checks for iostat etc.? Anything pointing to a high load on

Re: [ceph-users] ceph (jewel) unable to recover after node failure

2020-01-10 Thread Eugen Block
Hi, A. will ceph be able to recover over time? I am afraid that the 14 PGs that are down will not recover. if all OSDs come back (stable) the recovery should eventually finish. B. what caused the OSDs going down and up during recovery after the failed OSD node came back online? (step 2
