Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
On Friday, 21 September 2018 at 19:45 +0200, Paul Emmerich wrote: > The cache tiering has nothing to do with the PG of the underlying > pool > being incomplete. > You are just seeing these requests as stuck because it's the only > thing trying to write to the underlying pool. I agree, it was

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Paul Emmerich
The cache tiering has nothing to do with the PG of the underlying pool being incomplete. You are just seeing these requests as stuck because it's the only thing trying to write to the underlying pool. What you need to fix is the PG showing incomplete. I assume you already tried reducing the
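For reference, the pool settings that message refers to can be checked read-only like this (pool name taken from the original post; note that with a 4+2 EC profile the data itself needs at least k=4 shards available no matter what min_size is set to):

    ceph osd pool get bkp-sb-raid6 min_size
    ceph osd pool get bkp-sb-raid6 size    # k+m, should report 6 for a 4+2 profile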

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
So I've totally disabled cache-tiering and overlay. Now OSDs 68 & 69 are fine, no longer blocked. But OSD 32 is still blocked, and PG 37.9c is still marked incomplete, with: "recovery_state": [ { "name": "Started/Primary/Peering/Incomplete", "enter_time": "2018-09-21
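A hedged sketch of how one could dig further into the blocked OSD and the peering state (admin-socket command names as in Luminous; JSON field names can vary between releases):

    # on the host carrying osd.32
    ceph daemon osd.32 dump_blocked_ops
    ceph daemon osd.32 ops
    # in the query output, the interesting fields are usually
    # "down_osds_we_would_probe" and "peering_blocked_by"
    ceph pg 37.9c query | grep -A 20 '"recovery_state"'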

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Maks Kowalik
According to the query output you pasted, shards 1 and 2 are broken. But, on the other hand, the EC profile (4+2) should make it possible to recover from 2 shards lost simultaneously... On Fri, 21 Sep 2018 at 16:29, Olivier Bonvalet wrote: > Well, on drive, I can find those parts: > > - cs0 on OSD 29
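For context: with k=4, m=2 any 4 of the 6 shards are enough to rebuild the data, so "incomplete" here more likely means peering cannot assemble an authoritative log history for some past interval than that too many shards are physically gone. The profile can be double-checked like this (the profile name is a placeholder, since it isn't given in the thread):

    ceph osd erasure-code-profile ls
    ceph osd erasure-code-profile get <profile-name>    # expect k=4, m=2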

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
Well, on drive, I can find those parts: - cs0 on OSD 29 and 30 - cs1 on OSD 18 and 19 - cs2 on OSD 13 - cs3 on OSD 66 - cs4 on OSD 0 - cs5 on OSD 75 And I can read those files too. And all those OSDs are UP and IN. On Friday, 21 September 2018 at 13:10, Eugen Block wrote: > > > I
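Since all six shards appear to be readable, it would be prudent to export copies with ceph-objectstore-tool before trying anything destructive. A rough sketch for one shard, assuming filestore OSDs and the usual shard-suffixed pgid (the OSD must be stopped first; --journal-path may also be needed depending on the layout):

    systemctl stop ceph-osd@29
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29 \
        --pgid 37.9cs0 --op export --file /root/37.9cs0.export
    systemctl start ceph-osd@29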

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
Yep: pool 38 'cache-bkp-foo' replicated size 3 min_size 2 crush_rule 26 object_hash rjenkins pg_num 128 pgp_num 128 last_change 585369 lfor 68255/68255 flags hashpspool,incomplete_clones tier_of 37 cache_mode readproxy target_bytes 209715200 hit_set bloom{false_positive_probability: 0.05,

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Eugen Block
I also switched the cache tier to "readproxy", to avoid using this cache. But it's still blocked. You could change the cache mode to "none" to disable it. Could you paste the output of: ceph osd pool ls detail | grep cache-bkp-foo Quoting Olivier Bonvalet: In fact, one object (only
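For reference, fully disabling the tier (rather than leaving it in readproxy) would look roughly like the sequence below. This is only a sketch: the base pool is assumed to be bkp-sb-raid6 (pool 37, per the original post), and the exact order and required flags differ between releases.

    ceph osd tier cache-mode cache-bkp-foo none
    ceph osd tier remove-overlay bkp-sb-raid6
    # optionally, once the cache pool is empty, detach it entirely
    ceph osd tier remove bkp-sb-raid6 cache-bkp-foo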

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Eugen Block
I tried to flush the cache with "rados -p cache-bkp-foo cache-flush-evict-all", but it blocks on the object "rbd_data.f66c92ae8944a.000f2596". This is the object that's stuck in the cache tier (according to your output in https://pastebin.com/zrwu5X0w). Can you verify if that block
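A quick, read-only way to check where that object maps and whether a simple read of it also hangs (the object name may be truncated in this archive listing, so use the full name from the pastebin):

    # object -> PG/OSD mapping, computed from the CRUSH map, no I/O involved
    ceph osd map cache-bkp-foo rbd_data.f66c92ae8944a.000f2596
    ceph osd map bkp-sb-raid6 rbd_data.f66c92ae8944a.000f2596
    # does reading the object header hang as well?
    rados -p cache-bkp-foo stat rbd_data.f66c92ae8944a.000f2596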

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Maks Kowalik
Could you please paste the output of pg 37.9c query? On Fri, 21 Sep 2018 at 14:39, Olivier Bonvalet wrote: > In fact, one object (only one) seems to be blocked on the cache tier > (writeback). > > I tried to flush the cache with "rados -p cache-bkp-foo > cache-flush-evict-all", but it blocks on
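That is (redirecting to a file makes the output easier to paste or attach):

    ceph pg 37.9c query > pg-37.9c-query.txt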

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
In fact, one object (only one) seems to be blocked on the cache tier (writeback). I tried to flush the cache with "rados -p cache-bkp-foo cache-flush-evict-all", but it blocks on the object "rbd_data.f66c92ae8944a.000f2596". So I reduced (a lot) the cache tier to 200MB, "rados -p
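For reference, shrinking the cache tier and retrying the flush would look something like this; the 200MB value matches the target_bytes visible later in the thread, and the try- variant of the flush should skip busy objects instead of blocking on them:

    ceph osd pool set cache-bkp-foo target_max_bytes 209715200
    rados -p cache-bkp-foo cache-try-flush-evict-all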

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
Ok, so it's a replica 3 pool, and OSDs 68 & 69 are on the same host. On Friday, 21 September 2018 at 11:09, Eugen Block wrote: > > cache-tier on this pool has 26GB of data (for 5.7TB of data on the > > EC > > pool). > > We tried to flush the cache tier, and restart OSDs 68 & 69, without >
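(For anyone following along, that placement can be confirmed with, for example:)

    ceph osd find 68
    ceph osd find 69
    # or scan the whole CRUSH tree
    ceph osd tree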

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Eugen Block
cache-tier on this pool has 26GB of data (for 5.7TB of data on the EC pool). We tried to flush the cache tier, and restart OSDs 68 & 69, without any success. I meant the replication size of the pool: ceph osd pool ls detail | grep ... In the experimental state of our cluster we had a cache tier

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Olivier Bonvalet
Hi, cache-tier on this pool has 26GB of data (for 5.7TB of data on the EC pool). We tried to flush the cache tier, and restart OSDs 68 & 69, without any success. But I don't see any related data on the cache-tier OSDs (filestore) with: find /var/lib/ceph/osd/ -maxdepth 3 -name '*37.9c*' I
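One thing worth noting: the cache tier is a separate pool (id 38 in the pool listing later in the thread), so its PGs are named 38.<something> and a search for '37.9c' on a cache-tier OSD will come back empty even if cached data exists. A sketch of what could be searched for instead, assuming the usual filestore <pgid>_head directory layout:

    # on a cache-tier OSD: PG directories of pool 38
    find /var/lib/ceph/osd/ceph-68/current -maxdepth 1 -type d -name '38.*_head'
    # on an EC-pool OSD: shard directories carry a suffix, e.g. 37.9cs2_head
    find /var/lib/ceph/osd/ceph-32/current -maxdepth 1 -type d -name '37.9cs*_head'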

Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Eugen Block
Hi Olivier, what size does the cache tier have? You could set the cache-mode to forward and flush it; maybe restarting those OSDs (68, 69) helps, too. Or there could be an issue with the cache tier; what do those logs say? Regards, Eugen Quoting Olivier Bonvalet: Hello, on a Luminous
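For reference, the flush-and-restart sequence suggested here would look roughly like this (the pool name only appears later in the thread; on Luminous the forward mode may require --yes-i-really-mean-it, or proxy can be used instead):

    ceph osd tier cache-mode cache-bkp-foo forward --yes-i-really-mean-it
    rados -p cache-bkp-foo cache-flush-evict-all
    systemctl restart ceph-osd@68 ceph-osd@69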

[ceph-users] PG stuck incomplete

2018-09-20 Thread Olivier Bonvalet
Hello, on a Luminous cluster, I have an incomplete PG and I can't find how to fix it. It's an EC pool (4+2): pg 37.9c is incomplete, acting [32,50,59,1,0,75] (reducing pool bkp-sb-raid6 min_size from 4 may help; search ceph.com/docs for 'incomplete') Of course, we can't reduce min_size
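A first, read-only round of information gathering for a PG in this state might look like the following (nothing here changes cluster state):

    ceph health detail | grep 37.9c
    ceph pg map 37.9c                          # current up/acting sets
    ceph osd pool ls detail | grep bkp-sb-raid6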

Re: [ceph-users] PG stuck incomplete after power failure.

2016-05-17 Thread Hein-Pieter van Braam
Hi, Thank you so much! This fixed my issue completely, minus one image that was apparently being uploaded while the rack lost power. Is there anything I can do to prevent this from happening in the future, or a way to detect this issue? I've looked online for an explanation of exactly what

Re: [ceph-users] PG stuck incomplete after power failure.

2016-05-17 Thread Samuel Just
Try restarting the primary OSD for that PG with osd_find_best_info_ignore_history_les set to true (don't leave it set long term). -Sam On Tue, May 17, 2016 at 7:50 AM, Hein-Pieter van Braam wrote: > Hello, > > Today we had a power failure in a rack housing our OSD servers. We had >
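A hedged sketch of how that is usually applied: set the option only for the PG's primary OSD, restart that one daemon, and revert as soon as the PG has peered (NNN is a placeholder for the primary's id):

    # ceph.conf on the host of the PG's primary OSD
    [osd.NNN]
        osd find best info ignore history les = true

    # restart that OSD (e.g. systemctl restart ceph-osd@NNN, or whatever the
    # init system in use provides), wait for the PG to peer, then remove the
    # line and restart the OSD once more so the option does not stay set.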

[ceph-users] PG stuck incomplete after power failure.

2016-05-17 Thread Hein-Pieter van Braam
Hello, Today we had a power failure in a rack housing our OSD servers. We had 7 of our 30 total OSD nodes down. Of the affected PG, 2 out of the 3 OSDs went down. After everything was back and mostly healthy, I found one placement group marked as incomplete. I can't figure out why. I'm running
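(A couple of read-only commands that help locate and inspect a PG stuck like this; the PG id is a placeholder:)

    ceph pg dump_stuck inactive
    ceph pg <pgid> query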