Re: [ceph-users] Have an inconsistent PG, repair not working

2018-04-30 Thread Michael Sudnick
Mine repaired themselves after a regular deep scrub. Weird that I couldn't
trigger one manually.

Re: [ceph-users] Have an inconsistent PG, repair not working

2018-04-30 Thread David Turner
My 3 inconsistent PGs finally decided to run automatic scrubs, and now 2 of
the 3 will let me run deep-scrubs and repairs on them.  The deep-scrub did
not show any new information about the objects beyond the fact that each was
missing from one of the copies.  Running a repair fixed the inconsistency.

Re: [ceph-users] Have an inconsistent PG, repair not working

2018-04-24 Thread David Turner
Neither the issue I created nor Michael's [1] ticket that it was rolled
into is getting any traction.  How are y'all faring with your clusters?
I've had 3 PGs inconsistent with 5 scrub errors for a few weeks now.  I
assumed that the third PG was just like the first 2 in that it couldn't be
scrubbed, but I just checked the last scrub timestamp of the 3 PGs and the
third one is able to run scrubs.  I'm going to increase the logging on it
after I finish a round of maintenance we're performing on some OSDs.
Hopefully I'll find out something more about these objects.


[1] http://tracker.ceph.com/issues/23576
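
(For anyone wanting to do the same, raising the logging on the acting
primary is typically done with injectargs; osd.234 here is taken from the
health detail further down the thread, and the values should be reverted
afterwards. A sketch, not tested as written:)

ceph tell osd.234 injectargs '--debug_osd 20/20 --debug_ms 1/1'
ceph pg deep-scrub 145.2e3
# inspect /var/log/ceph/ceph-osd.234.log, then revert to the defaults:
ceph tell osd.234 injectargs '--debug_osd 1/5 --debug_ms 0/5'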

Re: [ceph-users] Have an inconsistent PG, repair not working

2018-04-06 Thread David Turner
I'm using filestore.  I think the root cause is something getting stuck in
the code, so I went ahead and created a [1] tracker issue for this.
Hopefully it gets some traction, as I'm not particularly looking forward to
messing with deleting PGs with ceph-objectstore-tool in production.

[1] http://tracker.ceph.com/issues/23577


Re: [ceph-users] Have an inconsistent PG, repair not working

2018-04-06 Thread Michael Sudnick
I've tried a few more things to get a deep-scrub going on my PG. I tried
instructing the involved OSDs to scrub all their PGs, and even that doesn't
seem to have triggered one.

Do you have any documentation on ceph-objectstore-tool? What I've found
online covers filestore but not bluestore.
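
(For reference, instructing an OSD to deep-scrub all of its placement
groups is a per-OSD command, repeated for each OSD in the acting set; the
id 9 below is just a placeholder:)

ceph osd deep-scrub 9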


Re: [ceph-users] Have an inconsistent PG, repair not working

2018-04-06 Thread David Turner
I'm running into this exact same situation.  I'm running 12.2.2 and I have
an EC PG with a scrub error.  It has the same output for [1] rados
list-inconsistent-obj as mentioned before.  This is the [2] full health
detail, and this is the [3] excerpt from the log of the deep-scrub that
marked the PG inconsistent.  The scrub happened while the PG was starting up
after I used ceph-objectstore-tool to split its filestore subfolders, via a
script I've used for months without any side effects.

I have tried quite a few things to get this PG to deep-scrub or repair, but
to no avail; it will not do anything.  I set osd_max_scrubs to 0 on every
OSD in the cluster, waited for all scrubbing and deep scrubbing to finish,
then raised it to 1 on the 11 OSDs for this PG before issuing a deep-scrub.
The PG will sit there for over an hour without deep-scrubbing.  My current
test is to set osd_max_scrubs to 1 on all OSDs, raise it to 4 on the OSDs
for this PG, and then issue the repair... but similarly nothing happens.
Each time I issue the deep-scrub or repair, the output correctly says
'instructing pg 145.2e3 on osd.234 to repair', but nothing shows up in the
log for the OSD and the PG state stays 'active+clean+inconsistent'.
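
(Spelled out, the sequence described above amounts to something like the
following; untested as written here, with the acting set taken from the
health detail below:)

# disable scrubbing cluster-wide and wait for in-flight scrubs to drain
ceph tell osd.* injectargs '--osd_max_scrubs 0'
# re-enable it only on the 11 OSDs in this PG's acting set
for id in 234 132 33 331 278 217 55 358 79 3 24; do
    ceph tell osd.$id injectargs '--osd_max_scrubs 1'
done
ceph pg deep-scrub 145.2e3    # or: ceph pg repair 145.2e3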

My next step, unless anyone has a better idea, is to find the exact copy of
the PG with the missing object, use ceph-objectstore-tool to back up that
copy of the PG, and remove it.  Starting the OSD back up should then
backfill a full copy of the PG and return it to health.
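
(A minimal sketch of that procedure, assuming osd.234 holds the bad shard
s0 and the default data path; the OSD must be stopped first, and some
versions require --force on the remove:)

systemctl stop ceph-osd@234
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-234 \
    --pgid 145.2e3s0 --op export --file /root/145.2e3s0.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-234 \
    --pgid 145.2e3s0 --op remove --force
systemctl start ceph-osd@234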



[1] $ rados list-inconsistent-obj 145.2e3
No scrub information available for pg 145.2e3
error 2: (2) No such file or directory

[2] $ ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 145.2e3 is active+clean+inconsistent, acting
[234,132,33,331,278,217,55,358,79,3,24]

[3] 2018-04-04 15:24:53.603380 7f54d1820700  0 log_channel(cluster) log
[DBG] : 145.2e3 deep-scrub starts
2018-04-04 17:32:37.916853 7f54d1820700 -1 log_channel(cluster) log [ERR] :
145.2e3s0 deep-scrub 1 missing, 0 inconsistent objects
2018-04-04 17:32:37.916865 7f54d1820700 -1 log_channel(cluster) log [ERR] :
145.2e3 deep-scrub 1 errors


Re: [ceph-users] Have an inconsistent PG, repair not working

2018-04-02 Thread Michael Sudnick
Hi Kjetil,

I've tried to get the pg scrubbing/deep scrubbing and nothing seems to be
happening; I've tried it a few times over the last few days.  My cluster is
recovering from a failed disk (which was probably the reason for the
inconsistency in the first place); do I need to wait for the cluster to
heal before repair/deep scrub works?

-Michael
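
(Not an answer from the thread, but one thing worth checking in this
situation: on recent releases, osd_scrub_during_recovery defaults to false,
so scrubs will not start while recovery is active. The setting can be
inspected on the OSD's host, with <id> as a placeholder:)

ceph daemon osd.<id> config get osd_scrub_during_recovery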


Re: [ceph-users] Have an inconsistent PG, repair not working

2018-04-02 Thread Marc Roos
 
I have had this inconsistent pg on my test cluster for a long time as well,
and have also tried pg repair, among other things. Can I get some help on
this too?


[@c02 ~]# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 17.36 is active+clean+inconsistent, acting [9,0,12]

[@c02 ~]# rados list-inconsistent-obj 17.36
{"epoch":18985,"inconsistents":[]}

I tried deleting these object copies on the other OSDs, hoping osd.9 would
replicate back to osd.0 and osd.12:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pool rbd \
    rbd_data.1f114174b0dc51.0974 removeall

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pool rbd \
    rbd_data.1f114174b0dc51.0974 removeall
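
(With the OSD stopped, one can also confirm that the surviving copy is
actually present on osd.9 before removing the others; a sketch reusing the
object name above:)

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9 \
    --op list rbd_data.1f114174b0dc51.0974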





Re: [ceph-users] Have an inconsistent PG, repair not working

2018-04-02 Thread Kjetil Joergensen
Hi,

Scrub or deep-scrub the pg; that should, in theory, get
list-inconsistent-obj back to spitting out what's wrong.  Then mail that
info to the list.
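
(Concretely, something along these lines, with the pg id from the original
mail; once the deep-scrub has actually completed, the second command should
have data to report:)

ceph pg deep-scrub 49.11c
rados list-inconsistent-obj 49.11c --format=json-pretty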

-KJ

On Sun, Apr 1, 2018 at 9:17 AM, Michael Sudnick  wrote:

> Hello,
>
> I have a small cluster with an inconsistent pg. I've tried ceph pg repair
> multiple times with no luck. rados list-inconsistent-obj 49.11c returns:
>
> # rados list-inconsistent-obj 49.11c
> No scrub information available for pg 49.11c
> error 2: (2) No such file or directory
>
> I'm a bit at a loss here as to what to do to recover. That pg is part of a
> cephfs_data pool with compression set to force/snappy.
>
> Does anyone have any suggestions?
>
> -Michael


-- 
Kjetil Joergensen 
SRE, Medallia Inc