Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-03-14 Thread David Zafman


The fix for tracker 20089 undid the changes you're seeing in the 15368 
pull request.  The attr name mismatch on 'hinfo_key' means that key is 
missing; every erasure-coded object should have a key called 
"hinfo_key."


You should try to determine why your extended attributes are getting 
corrupted.  All the errors are on shard 0.  My testing shows that repair 
will fix this scenario.
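
A rough sketch of how to inspect this (the pg id, osd id and file path
below are examples taken from the logs quoted further down, not commands
verified against this cluster):

# list the scrub errors recorded for the pg, including per-shard attr problems
rados list-inconsistent-obj 70.3db --format=json-pretty

# on the host carrying shard 0 (osd.386 in the logs below), dump all xattrs
# of one affected object file and check whether user.ceph.hinfo_key is there
getfattr -d -m '.*' '<path to the shard-0 object file under .../70.3dbs0_head/>'

# once the cause is understood, re-run the scrub and repair
ceph pg deep-scrub 70.3db
ceph pg repair 70.3db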


David


On 3/13/18 3:48 PM, Graham Allan wrote:
Updated cluster now to 12.2.4 and the cycle of 
inconsistent->repair->unfound seems to continue, though possibly 
slightly differently. A pg does pass through an "active+clean" phase 
after repair, which might be new, but more likely I never observed it 
at the right time before.


I see messages like this in the logs now "attr name mismatch 
'hinfo_key'" - perhaps this might cast more light on the cause:


2018-03-02 18:55:11.583850 osd.386 osd.386 10.31.0.72:6817/4057280 
401 : cluster [ERR] 70.3dbs0 : soid 
70:dbc6ed68:::default.325674.85_bellplants_images%2f1055211.jpg:head 
attr name mismatch 'hinfo_key'
2018-03-02 19:00:18.031929 osd.386 osd.386 10.31.0.72:6817/4057280 
428 : cluster [ERR] 70.3dbs0 : soid 
70:dbc97561:::default.325674.85_bellplants_images%2f1017818.jpg:head 
attr name mismatch 'hinfo_key'
2018-03-02 19:04:50.058477 osd.386 osd.386 10.31.0.72:6817/4057280 
452 : cluster [ERR] 70.3dbs0 : soid 
70:dbcbcb34:::default.325674.85_bellplants_images%2f1049756.jpg:head 
attr name mismatch 'hinfo_key'
2018-03-02 19:13:05.689136 osd.386 osd.386 10.31.0.72:6817/4057280 
494 : cluster [ERR] 70.3dbs0 : soid 
70:dbcfc7c9:::default.325674.85_bellplants_images%2f1021177.jpg:head 
attr name mismatch 'hinfo_key'
2018-03-02 19:13:30.883100 osd.386 osd.386 10.31.0.72:6817/4057280 
495 : cluster [ERR] 70.3dbs0 repair 0 missing, 161 inconsistent objects
2018-03-02 19:13:30.888259 osd.386 osd.386 10.31.0.72:6817/4057280 
496 : cluster [ERR] 70.3db repair 161 errors, 161 fixed


The only similar-sounding issue I could find is

http://tracker.ceph.com/issues/20089

When I look at src/osd/PGBackend.cc be_compare_scrubmaps in luminous, 
I don't see the changes in the commit here:


https://github.com/ceph/ceph/pull/15368/files

of course a lot of other things have changed, but is it possible this 
fix never made it into luminous?


Graham

On 02/17/2018 12:48 PM, David Zafman wrote:


The commits below came after v12.2.2 and may impact this issue. When a 
pg is active+clean+inconsistent, it means that scrub has detected issues 
with 1 or more replicas of 1 or more objects.  An unfound object is a 
potentially temporary state in which the current set of available OSDs 
doesn't allow an object to be recovered/backfilled/repaired.  When the 
primary OSD restarts, any unfound objects (an in-memory structure) are 
reset so that the new set of peered OSDs can determine again which 
objects are unfound.


I'm not clear in this scenario whether recovery failed to start, 
recovery hung earlier due to a bug, or recovery stopped (as designed) 
because of the unfound object.  The new recovery_unfound and 
backfill_unfound states indicate that recovery has stopped due to 
unfound objects.



commit 64047e1bac2e775a06423a03cfab69b88462538c
Author: David Zafman 
Date:   Wed Jan 10 13:30:41 2018 -0800

 osd: Don't start recovery for missing until active pg state set

 I was seeing recovery hang when it is started before 
_activate_committed()
 The state machine passes into "Active" but this transitions to 
activating

 pg state and after commmitted into "active" pg state.

 Signed-off-by: David Zafman 

commit 7f8b0ce9e681f727d8217e3ed74a1a3355f364f3
Author: David Zafman 
Date:   Mon Oct 9 08:19:21 2017 -0700

 osd, mon: Add new pg states recovery_unfound and backfill_unfound

 Signed-off-by: David Zafman 



On 2/16/18 1:40 PM, Gregory Farnum wrote:

On Fri, Feb 16, 2018 at 12:17 PM Graham Allan  wrote:


On 02/16/2018 12:31 PM, Graham Allan wrote:

If I set debug rgw=1 and debug ms=1 before running the "object stat"
command, it seems to stall in a loop of trying to communicate with
osds for pool 96, which is .rgw.control


10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 --
osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping
cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected
e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59 
osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0
uv3933745 ondisk = 0) v8  152+0+0 (2536111836 0 0) 0x7f1158003e20 
con 0x7f117afd8390
Prior to that, probably more relevant, this was the only communication 
logged with the primary osd of the pg:


10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
osd_op(unknown.0.0:96 70.438s0
70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head
[getxattrs,stat] snapc 

Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-03-13 Thread Graham Allan
Updated cluster now to 12.2.4 and the cycle of 
inconsistent->repair->unfound seems to continue, though possibly 
slightly differently. A pg does pass through an "active+clean" phase 
after repair, which might be new, but more likely I never observed it at 
the right time before.


I see messages like this in the logs now "attr name mismatch 
'hinfo_key'" - perhaps this might cast more light on the cause:



2018-03-02 18:55:11.583850 osd.386 osd.386 10.31.0.72:6817/4057280 401 : 
cluster [ERR] 70.3dbs0 : soid 
70:dbc6ed68:::default.325674.85_bellplants_images%2f1055211.jpg:head attr name 
mismatch 'hinfo_key'
2018-03-02 19:00:18.031929 osd.386 osd.386 10.31.0.72:6817/4057280 428 : 
cluster [ERR] 70.3dbs0 : soid 
70:dbc97561:::default.325674.85_bellplants_images%2f1017818.jpg:head attr name 
mismatch 'hinfo_key'
2018-03-02 19:04:50.058477 osd.386 osd.386 10.31.0.72:6817/4057280 452 : 
cluster [ERR] 70.3dbs0 : soid 
70:dbcbcb34:::default.325674.85_bellplants_images%2f1049756.jpg:head attr name 
mismatch 'hinfo_key'
2018-03-02 19:13:05.689136 osd.386 osd.386 10.31.0.72:6817/4057280 494 : 
cluster [ERR] 70.3dbs0 : soid 
70:dbcfc7c9:::default.325674.85_bellplants_images%2f1021177.jpg:head attr name 
mismatch 'hinfo_key'
2018-03-02 19:13:30.883100 osd.386 osd.386 10.31.0.72:6817/4057280 495 : 
cluster [ERR] 70.3dbs0 repair 0 missing, 161 inconsistent objects
2018-03-02 19:13:30.888259 osd.386 osd.386 10.31.0.72:6817/4057280 496 : 
cluster [ERR] 70.3db repair 161 errors, 161 fixed


The only similar-sounding issue I could find is

http://tracker.ceph.com/issues/20089

When I look at src/osd/PGBackend.cc be_compare_scrubmaps in luminous, I 
don't see the changes in the commit here:


https://github.com/ceph/ceph/pull/15368/files

of course a lot of other things have changed, but is it possible this 
fix never made it into luminous?
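
One way to check that directly against the source tree (a sketch;
<commit-sha> stands for the commit hash shown on the pull request page,
which I haven't copied here):

git clone https://github.com/ceph/ceph.git && cd ceph
# does the v12.2.4 tag contain the commit from PR 15368?
git tag --contains <commit-sha> | grep '^v12\.2'
# or look at what actually landed in that file for luminous
git log --oneline v12.2.4 -- src/osd/PGBackend.cc | head -20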


Graham

On 02/17/2018 12:48 PM, David Zafman wrote:


The commits below came after v12.2.2 and may impact this issue. When a 
pg is active+clean+inconsistent, it means that scrub has detected issues 
with 1 or more replicas of 1 or more objects.  An unfound object is a 
potentially temporary state in which the current set of available OSDs 
doesn't allow an object to be recovered/backfilled/repaired.  When the 
primary OSD restarts, any unfound objects (an in-memory structure) are 
reset so that the new set of peered OSDs can determine again which 
objects are unfound.


I'm not clear in this scenario whether recovery failed to start, 
recovery hung earlier due to a bug, or recovery stopped (as designed) 
because of the unfound object.  The new recovery_unfound and 
backfill_unfound states indicate that recovery has stopped due to 
unfound objects.



commit 64047e1bac2e775a06423a03cfab69b88462538c
Author: David Zafman 
Date:   Wed Jan 10 13:30:41 2018 -0800

     osd: Don't start recovery for missing until active pg state set

     I was seeing recovery hang when it is started before 
_activate_committed()
     The state machine passes into "Active" but this transitions to 
activating

     pg state and after commmitted into "active" pg state.

     Signed-off-by: David Zafman 

commit 7f8b0ce9e681f727d8217e3ed74a1a3355f364f3
Author: David Zafman 
Date:   Mon Oct 9 08:19:21 2017 -0700

     osd, mon: Add new pg states recovery_unfound and backfill_unfound

     Signed-off-by: David Zafman 



On 2/16/18 1:40 PM, Gregory Farnum wrote:

On Fri, Feb 16, 2018 at 12:17 PM Graham Allan  wrote:


On 02/16/2018 12:31 PM, Graham Allan wrote:

If I set debug rgw=1 and debug ms=1 before running the "object stat"
command, it seems to stall in a loop of trying to communicate with osds
for pool 96, which is .rgw.control


10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 --
osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping
cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected
e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59 
osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0
uv3933745 ondisk = 0) v8  152+0+0 (2536111836 0 0) 0x7f1158003e20
con 0x7f117afd8390

Prior to that, probably more relevant, this was the only communication
logged with the primary osd of the pg:


10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
osd_op(unknown.0.0:96 70.438s0
70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head
[getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695)
v8 -- 0x7fab79889fa0 con 0
10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1 
osd_backoff(70.438s0 block id 1

[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head) 


e507695) v1  209+0+0 (1958971312 0 0) 0x7fab5003d3c0 con
0x7fab79885980
210.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --

Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-19 Thread Graham Allan



On 02/17/2018 12:48 PM, David Zafman wrote:


The commits below came after v12.2.2 and may impact this issue. When a 
pg is active+clean+inconsistent, it means that scrub has detected issues 
with 1 or more replicas of 1 or more objects.  An unfound object is a 
potentially temporary state in which the current set of available OSDs 
doesn't allow an object to be recovered/backfilled/repaired.  When the 
primary OSD restarts, any unfound objects (an in-memory structure) are 
reset so that the new set of peered OSDs can determine again which 
objects are unfound.


I'm not clear in this scenario whether recovery failed to start, 
recovery hung earlier due to a bug, or recovery stopped (as designed) 
because of the unfound object.  The new recovery_unfound and 
backfill_unfound states indicate that recovery has stopped due to 
unfound objects.


Thanks for your comments David. I could certainly enable any additional 
logging that might help to clarify what's going on here - perhaps on the 
primary OSD for a given pg?
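
(A sketch of one way to do that, with the osd id and debug levels as
examples only; the settings would need reverting once the logs are
captured:)

# raise osd and messenger debug on just the pg's primary (e.g. osd.175 for 70.438)
ceph tell osd.175 injectargs '--debug_osd 20 --debug_ms 1'
# ...reproduce the scrub/repair cycle, collect /var/log/ceph/ceph-osd.175.log, then revert
ceph tell osd.175 injectargs '--debug_osd 1/5 --debug_ms 0/5'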


I am still having a hard time understanding why these objects repeatedly 
get flagged as unfound, when they are downloadable and contain correct 
data whenever they are not in this state. It is a 4+2 EC pool, so I 
would think it possible to reconstruct any missing EC chunks.


It's an extensive problem; while I have been focusing on examining a 
couple of specific pgs, the pool in general is showing 2410 pgs 
inconsistent (out of 4096).


Graham
--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-17 Thread David Zafman


The commits below came after v12.2.2 and may impact this issue. When a 
pg is active+clean+inconsistent, it means that scrub has detected issues 
with 1 or more replicas of 1 or more objects.  An unfound object is a 
potentially temporary state in which the current set of available OSDs 
doesn't allow an object to be recovered/backfilled/repaired.  When the 
primary OSD restarts, any unfound objects (an in-memory structure) are 
reset so that the new set of peered OSDs can determine again which 
objects are unfound.


I'm not clear in this scenario whether recovery failed to start, 
recovery hung earlier due to a bug, or recovery stopped (as designed) 
because of the unfound object.  The new recovery_unfound and 
backfill_unfound states indicate that recovery has stopped due to 
unfound objects.
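
(For reference, a rough sketch of how these states and the unfound
objects show up from the command line, with 70.438 as an example pg id:)

ceph health detail | grep -i unfound          # which pgs report unfound objects
ceph pg dump pgs 2>/dev/null | grep -E 'recovery_unfound|backfill_unfound'
ceph pg 70.438 list_missing                   # per-object detail, including the "need"/"have" versions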



commit 64047e1bac2e775a06423a03cfab69b88462538c
Author: David Zafman 
Date:   Wed Jan 10 13:30:41 2018 -0800

    osd: Don't start recovery for missing until active pg state set

    I was seeing recovery hang when it is started before 
_activate_committed()
    The state machine passes into "Active" but this transitions to 
activating

    pg state and after commmitted into "active" pg state.

    Signed-off-by: David Zafman 

commit 7f8b0ce9e681f727d8217e3ed74a1a3355f364f3
Author: David Zafman 
Date:   Mon Oct 9 08:19:21 2017 -0700

    osd, mon: Add new pg states recovery_unfound and backfill_unfound

    Signed-off-by: David Zafman 



On 2/16/18 1:40 PM, Gregory Farnum wrote:

On Fri, Feb 16, 2018 at 12:17 PM Graham Allan  wrote:


On 02/16/2018 12:31 PM, Graham Allan wrote:

If I set debug rgw=1 and debug ms=1 before running the "object stat"
command, it seems to stall in a loop of trying to communicate with osds for
pool 96, which is .rgw.control


10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 --
osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping
cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected
e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59 
osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0
uv3933745 ondisk = 0) v8  152+0+0 (2536111836 0 0) 0x7f1158003e20
con 0x7f117afd8390

Prior to that, probably more relevant, this was the only communication
logged with the primary osd of the pg:


10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
osd_op(unknown.0.0:96 70.438s0
70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head
[getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695)
v8 -- 0x7fab79889fa0 con 0
10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1 
osd_backoff(70.438s0 block id 1


[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head)

e507695) v1  209+0+0 (1958971312 0 0) 0x7fab5003d3c0 con
0x7fab79885980
210.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
osd_backoff(70.438s0 ack-block id 1


[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head)

e507695) v1 -- 0x7fab48065420 con 0

so I guess the backoff message above is saying the object is
unavailable. OK, that certainly makes sense. Not sure that it helps me
understand how to fix the inconsistencies

If I restart the primary osd for the pg, that makes it forget its state
and return to active+clean+inconsistent. I can then download the
previously-unfound objects again, as well as run "radosgw-admin object
stat".

So the interesting bit is probably figuring out why it decides these
objects are unfound, when clearly they aren't.

What would be the best place to enable additional logging to understand
this - perhaps the primary osd?


David, this sounds like one of the bugs where an OSD can mark objects as
inconsistent locally but then doesn't actually trigger recovery on them. Or
it doesn't like any copy but doesn't persist that.
Do any known issues around that apply to 12.2.2?
-Greg



Thanks for all your help,

Graham
--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-16 Thread Gregory Farnum
On Fri, Feb 16, 2018 at 12:17 PM Graham Allan  wrote:

> On 02/16/2018 12:31 PM, Graham Allan wrote:
> >
> >> If I set debug rgw=1 and debug ms=1 before running the "object stat"
> >> command, it seems to stall in a loop of trying to communicate with osds for
> > pool 96, which is .rgw.control
> >
> >> 10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 --
> >> osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping
> >> cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected
> >> e507695) v8 -- 0x7f10ac033610 con 0
> >> 10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59 
> >> osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0
> >> uv3933745 ondisk = 0) v8  152+0+0 (2536111836 0 0) 0x7f1158003e20
> >> con 0x7f117afd8390
> >
> > Prior to that, probably more relevant, this was the only communication
> > logged with the primary osd of the pg:
> >
> >> 10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
> >> osd_op(unknown.0.0:96 70.438s0
> >> 70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head
> >> [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695)
> >> v8 -- 0x7fab79889fa0 con 0
> >> 10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1 
> >> osd_backoff(70.438s0 block id 1
> >>
> [70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head)
> >> e507695) v1  209+0+0 (1958971312 0 0) 0x7fab5003d3c0 con
> >> 0x7fab79885980
> >> 210.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
> >> osd_backoff(70.438s0 ack-block id 1
> >>
> [70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head)
> >> e507695) v1 -- 0x7fab48065420 con 0
> >
> > so I guess the backoff message above is saying the object is
> > unavailable. OK, that certainly makes sense. Not sure that it helps me
> > understand how to fix the inconsistencies
>
> If I restart the primary osd for the pg, that makes it forget its state
> and return to active+clean+inconsistent. I can then download the
> previously-unfound objects again, as well as run "radosgw-admin object
> stat".
>
> So the interesting bit is probably figuring out why it decides these
> objects are unfound, when clearly they aren't.
>
> What would be the best place to enable additional logging to understand
> this - perhaps the primary osd?
>

David, this sounds like one of the bugs where an OSD can mark objects as
inconsistent locally but then doesn't actually trigger recovery on them. Or
it doesn't like any copy but doesn't persist that.
Do any known issues around that apply to 12.2.2?
-Greg


>
> Thanks for all your help,
>
> Graham
> --
> Graham Allan
> Minnesota Supercomputing Institute - g...@umn.edu
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-16 Thread Graham Allan

On 02/16/2018 12:31 PM, Graham Allan wrote:


If I set debug rgw=1 and debug ms=1 before running the "object stat" 
command, it seems to stall in a loop of trying to communicate with osds for 
pool 96, which is .rgw.control


10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 -- 
osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping 
cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected 
e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59  
osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0 
uv3933745 ondisk = 0) v8  152+0+0 (2536111836 0 0) 0x7f1158003e20 
con 0x7f117afd8390


Prior to that, probably more relevant, this was the only communication 
logged with the primary osd of the pg:


10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 -- 
osd_op(unknown.0.0:96 70.438s0 
70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head 
[getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695) 
v8 -- 0x7fab79889fa0 con 0
10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1  
osd_backoff(70.438s0 block id 1 
[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head) 
e507695) v1  209+0+0 (1958971312 0 0) 0x7fab5003d3c0 con 
0x7fab79885980
210.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 -- 
osd_backoff(70.438s0 ack-block id 1 
[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head) 
e507695) v1 -- 0x7fab48065420 con 0


so I guess the backoff message above is saying the object is 
unavailable. OK, that certainly makes sense. Not sure that it helps me 
understand how to fix the inconsistencies


If I restart the primary osd for the pg, that makes it forget its state 
and return to active+clean+inconsistent. I can then download the 
previously-unfound objects again, as well as run "radosgw-admin object 
stat".


So the interesting bit is probably figuring out why it decides these 
objects are unfound, when clearly they aren't.


What would be the best place to enable additional logging to understand 
this - perhaps the primary osd?


Thanks for all your help,

Graham
--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-16 Thread Graham Allan



On 02/15/2018 05:33 PM, Gregory Farnum wrote:
On Thu, Feb 15, 2018 at 3:10 PM Graham Allan  wrote:


A lot more in xattrs which I won't paste, though the keys are:

 > root@cephmon1:~# ssh ceph03 find
/var/lib/ceph/osd/ceph-295/current/70.3d6s0_head -name '*1089213*'
-exec xattr {} +
 > user.ceph._user.rgw.idtag
 > user.cephos.spill_out
 > user.ceph._
 > user.ceph.snapset
 > user.ceph._user.rgw.manifest
 > user.ceph._@1
 > user.ceph.hinfo_key
 > user.ceph._user.rgw.manifest@1
 > user.ceph._user.rgw.manifest@2
 > user.ceph._user.rgw.acl
 > user.ceph._user.rgw.x-amz-acl
 > user.ceph._user.rgw.etag
 > user.ceph._user.rgw.x-amz-date
 > user.ceph._user.rgw.content_type

Not sure which among these would contain pointers to part files.


I believe it’s the manifest xattrs there.


Thanks, I see better now how this goes together. Rather than dumping out 
the manifest xattrs, I guess I should also be able to understand this 
from the output of "radosgw-admin object stat".
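
(Both routes would look roughly like this; the bucket name and data pool
are placeholders, and the manifest decode assumes ceph-dencoder has the
RGWObjManifest type registered:)

# via radosgw-admin: the stat output includes the decoded manifest
radosgw-admin object stat --bucket=<bucket> --object='bellplants_images/1089213.jpg'

# or read the manifest xattr straight off the head object
rados -p <data-pool> getxattr \
    'default.325674.85_bellplants_images/1089213.jpg' user.rgw.manifest > /tmp/manifest
ceph-dencoder type RGWObjManifest import /tmp/manifest decode dump_json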



One example, yesterday pg 70.438 showed "has 169 objects unfound and
apparently lost". At that time the state was
active+recovery_wait+inconsistent. Today it's showing no unfound objects
but is active+clean+inconsistent, and objects which were inaccessible
via radosgw yesterday can now download. I'm not sure what changed. I
have asked ceph to perform another deep scrub and repair on the pg, but
it has yet to start. I'm really curious to see if it becomes consistent,
or discovers unfound objects again.

Actually now I notice that a pg reported as
active+recovery_wait+inconsistent by "ceph health detail" is shown as
active+recovering+inconsistent by "ceph pg list". That makes more sense
to me - "recovery_wait" implied to me that it was waiting for recovery
to start, while "recovering" explains why the problem might clear
itself.

Right, “recovery_wait” means that the pg needs to do log-based recovery 
but (at least) one of the participating OSDs doesn’t have a slot 
available; that will resolve itself eventually.


It sounds like the scrubbing has detected some inconsistencies but the 
reason you weren’t getting data is just that it hit an object which 
needed recovery but was blocked waiting on it.


Yes, though it seems to be stuck in a cycle. This morning, that same pg 
70.438 is back in active+recovery_wait+inconsistent state with the same 
169 unfound objects - and from s3, the objects which would download 
successfully yesterday now stall. Which probably makes sense while in 
"unfound" state, but clearly the data is there, while ceph is not 
successful in making the pg consistent. Each time it repairs again, it 
claims to fix more errors but finds the same number of unfound objects 
again.



/var/log/ceph/ceph.log.2.gz:2018-02-14 15:48:14.438357 osd.175 osd.175 
10.31.0.71:6838/66301 2928 : cluster [ERR] 70.438s0 repair 0 missing, 169 
inconsistent objects
/var/log/ceph/ceph.log.2.gz:2018-02-14 15:48:14.442875 osd.175 osd.175 
10.31.0.71:6838/66301 2929 : cluster [ERR] 70.438 repair 169 errors, 845 fixed
/var/log/ceph/ceph.log.1.gz:2018-02-15 19:42:25.040196 osd.175 osd.175 
10.31.0.71:6838/66301 2995 : cluster [ERR] 70.438s0 repair 0 missing, 169 
inconsistent objects
/var/log/ceph/ceph.log.1.gz:2018-02-15 19:42:25.046028 osd.175 osd.175 
10.31.0.71:6838/66301 2996 : cluster [ERR] 70.438 repair 169 errors, 685 fixed


I also now see that for these unfound objects "radosgw-admin object 
stat" also hangs. Clearly makes sense since radosgw must perform this to 
retrieve the object. Does it imply that ceph can't access the "head" 
object in order to read the xattr data?


If I set debug rgw=1 and debug ms=1 before running the "object stat" 
command, it seems to stall in a loop of trying to communicate with osds for 
pool 96, which is .rgw.control



10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 -- osd_op(unknown.0.0:541 
96.e 96:7759931f:::notify.3:head [watch ping cookie 139709246356176] snapc 0=[] 
ondisk+write+known_if_redirected e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59  
osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0 uv3933745 
ondisk = 0) v8  152+0+0 (2536111836 0 0) 0x7f1158003e20 con 0x7f117afd8390


Prior to that, probably more relevant, this was the only communication 
logged with the primary osd of the pg:



10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 -- osd_op(unknown.0.0:96 
70.438s0 70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head 
[getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695) v8 -- 
0x7fab79889fa0 con 0
10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1  
osd_backoff(70.438s0 block id 1 

Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-15 Thread Gregory Farnum
On Thu, Feb 15, 2018 at 3:10 PM Graham Allan  wrote:

> On 02/15/2018 11:58 AM, Gregory Farnum wrote:
> >
> > Well, if the objects were uploaded using multi-part upload I believe the
> > objects you’re looking at here will only contain omap (or xattr?)
> > entries pointing to the part files, so the empty file data is to be
> > expected. This might also make slightly more sense in terms of the scrub
> > inconsistencies popping up, although I didn’t think any omap issues I
> > remember should have impacted rgw.
> >
> > Other than that, I’m not sure how it would be turning 0 bytes of data
> > into the correct results.
>
> That makes a lot of sense! So the 0-byte file here is effectively just a
> holder for xattr data.
>
> Now I'm trying to figure out how I can associate it to files which might
> contain the data.
>
> I don't see a lot of data in omap (assuming I'm looking at the right
> thing here):
>
> > root@cephmon1:~# rados -p .rgw.buckets.index getomapval
> .dir.default.325674.85 bellplants_images/1089213.jpg
> > value (193 bytes) :
> >   05 03 bb 00 00 00 1d 00  00 00 62 65 6c 6c 70 6c  |..bellpl|
> > 0010  61 6e 74 73 5f 69 6d 61  67 65 73 2f 31 30 38 39  |ants_images/1089|
> > 0020  32 31 33 2e 6a 70 67 71  70 1f 00 00 00 00 00 01  |213.jpgqp...|
> > 0030  03 03 3a 00 00 00 01 00  a4 92 00 00 00 00 00 10  |..:.|
> > 0040  40 c5 55 00 00 00 00 00  00 00 00 08 00 00 00 74  |@.Ut|
> > 0050  70 72 61 74 68 65 72 11  00 00 00 54 68 6f 6d 61  |pratherThoma|
> > 0060  73 20 46 2e 20 50 72 61  74 68 65 72 00 00 00 00  |s F. Prather|
> > 0070  00 00 00 00 1d 00 00 00  62 65 6c 6c 70 6c 61 6e  |bellplan|
> > 0080  74 73 5f 69 6d 61 67 65  73 2f 31 30 38 39 32 31  |ts_images/108921|
> > 0090  33 2e 6a 70 67 01 01 06  00 00 00 46 84 71 70 1f  |3.jpg..F.qp.|
> > 00a0  00 84 14 ee 09 00 17 00  00 00 64 65 66 61 75 6c  |..defaul|
> > 00b0  74 2e 33 32 35 36 37 34  2e 31 31 35 39 33 36 33  |t.325674.1159363|
> > 00c0  37|7|
> > 00c1
>
> A lot more in xattrs which I won't paste, though the keys are:
>
> > root@cephmon1:~# ssh ceph03 find
> /var/lib/ceph/osd/ceph-295/current/70.3d6s0_head -name '*1089213*' -exec
> xattr {} +
> > user.ceph._user.rgw.idtag
> > user.cephos.spill_out
> > user.ceph._
> > user.ceph.snapset
> > user.ceph._user.rgw.manifest
> > user.ceph._@1
> > user.ceph.hinfo_key
> > user.ceph._user.rgw.manifest@1
> > user.ceph._user.rgw.manifest@2
> > user.ceph._user.rgw.acl
> > user.ceph._user.rgw.x-amz-acl
> > user.ceph._user.rgw.etag
> > user.ceph._user.rgw.x-amz-date
> > user.ceph._user.rgw.content_type
>
> Not sure which among these would contain pointers to part files.


I believe it’s the manifest xattrs there.



>
> While I'm less pessimistic about data being lost I'm still really
> uncertain that the cluster is making progress towards a clean state.
>
> One example, yesterday pg 70.438 showed "has 169 objects unfound and
> apparently lost". At that time the state was
> active+recovery_wait+inconsistent. Today it's showing no unfound objects
> but is active+clean+inconsistent, and objects which were inaccessible
> via radosgw yesterday can now download. I'm not sure what changed. I
> have asked ceph to perform another deep scrub and repair on the pg, but
> it has yet to start. I'm really curious to see if it becomes consistent,
> or discovers unfound objects again.
>
> Actually now I notice that a pg reported as
> active+recovery_wait+inconsistent by "ceph health detail" is shown as
> active+recovering+inconsistent by "ceph pg list". That makes more sense
> to me - "recovery_wait" implied to me that it was waiting for recovery
> to start, while "recovering" explains why the problem might clear itself.


Right, “recovery_wait” means that the pg needs to do log-based recovery but
(at least) one of the participating OSDs doesn’t have a slot available;
that will resolve itself eventually.
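
(The slots here are presumably governed by the OSD recovery throttles; a
sketch of how to inspect them on one of the participating OSDs, run on
that OSD's host, id as an example:)

ceph daemon osd.175 config get osd_recovery_max_active   # concurrent recovery ops allowed per OSD
ceph daemon osd.175 config get osd_max_backfills
ceph pg dump pgs 2>/dev/null | grep recovery_wait        # pgs currently waiting on a slot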

It sounds like the scrubbing has detected some inconsistencies but the
reason you weren’t getting data is just that it hit an object which needed
recovery but was blocked waiting on it.
-Greg



>
> Graham
> --
> Graham Allan
> Minnesota Supercomputing Institute - g...@umn.edu
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-15 Thread Graham Allan

On 02/15/2018 11:58 AM, Gregory Farnum wrote:


Well, if the objects were uploaded using multi-part upload I believe the 
objects you’re looking at here will only contain omap (or xattr?) 
entries pointing to the part files, so the empty file data is to be 
expected. This might also make slightly more sense in terms of the scrub 
inconsistencies popping up, although I didn’t think any omap issues I 
remember should have impacted rgw.


Other than that, I’m not sure how it would be turning 0 bytes of data 
into the correct results.


That makes a lot of sense! So the 0-byte file here is effectively just a 
holder for xattr data.


Now I'm trying to figure out how I can associate it to files which might 
contain the data.


I don't see a lot of data in omap (assuming I'm looking at the right 
thing here):



root@cephmon1:~# rados -p .rgw.buckets.index getomapval .dir.default.325674.85 
bellplants_images/1089213.jpg
value (193 bytes) :
  05 03 bb 00 00 00 1d 00  00 00 62 65 6c 6c 70 6c  |..bellpl|
0010  61 6e 74 73 5f 69 6d 61  67 65 73 2f 31 30 38 39  |ants_images/1089|
0020  32 31 33 2e 6a 70 67 71  70 1f 00 00 00 00 00 01  |213.jpgqp...|
0030  03 03 3a 00 00 00 01 00  a4 92 00 00 00 00 00 10  |..:.|
0040  40 c5 55 00 00 00 00 00  00 00 00 08 00 00 00 74  |@.Ut|
0050  70 72 61 74 68 65 72 11  00 00 00 54 68 6f 6d 61  |pratherThoma|
0060  73 20 46 2e 20 50 72 61  74 68 65 72 00 00 00 00  |s F. Prather|
0070  00 00 00 00 1d 00 00 00  62 65 6c 6c 70 6c 61 6e  |bellplan|
0080  74 73 5f 69 6d 61 67 65  73 2f 31 30 38 39 32 31  |ts_images/108921|
0090  33 2e 6a 70 67 01 01 06  00 00 00 46 84 71 70 1f  |3.jpg..F.qp.|
00a0  00 84 14 ee 09 00 17 00  00 00 64 65 66 61 75 6c  |..defaul|
00b0  74 2e 33 32 35 36 37 34  2e 31 31 35 39 33 36 33  |t.325674.1159363|
00c0  37|7|
00c1


A lot more in xattrs which I won't paste, though the keys are:


root@cephmon1:~# ssh ceph03 find 
/var/lib/ceph/osd/ceph-295/current/70.3d6s0_head -name '*1089213*' -exec xattr 
{} +
user.ceph._user.rgw.idtag
user.cephos.spill_out
user.ceph._
user.ceph.snapset
user.ceph._user.rgw.manifest
user.ceph._@1
user.ceph.hinfo_key
user.ceph._user.rgw.manifest@1
user.ceph._user.rgw.manifest@2
user.ceph._user.rgw.acl
user.ceph._user.rgw.x-amz-acl
user.ceph._user.rgw.etag
user.ceph._user.rgw.x-amz-date
user.ceph._user.rgw.content_type


Not sure which among these would contain pointers to part files.

While I'm less pessimistic about data being lost I'm still really 
uncertain that the cluster is making progress towards a clean state.


One example, yesterday pg 70.438 showed "has 169 objects unfound and 
apparently lost". At that time the state was 
active+recovery_wait+inconsistent. Today it's showing no unfound objects 
but is active+clean+inconsistent, and objects which were inaccessible 
via radosgw yesterday can now download. I'm not sure what changed. I 
have asked ceph to perform another deep scrub and repair on the pg, but 
it has yet to start. I'm really curious to see if it becomes consistent, 
or discovers unfound objects again.


Actually now I notice that a pg reported as 
active+recovery_wait+inconsistent by "ceph health detail" is shown as 
active+recovering+inconsistent by "ceph pg list". That makes more sense 
to me - "recovery_wait" implied to me that it was waiting for recovery 
to start, while "recovering" explains why the problem might clear itself.


Graham
--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-15 Thread Gregory Farnum
On Thu, Feb 15, 2018 at 9:41 AM Graham Allan  wrote:

> Hi Greg,
>
> On 02/14/2018 11:49 AM, Gregory Farnum wrote:
> >
> > On Tue, Feb 13, 2018 at 8:41 AM Graham Allan  wrote:
> >
> > I'm replying to myself here, but it's probably worth mentioning that
> > after this started, I did bring back the failed host, though with
> "ceph
> > osd weight 0" to avoid more data movement.
> >
> > For inconsistent pgs containing unfound objects, the output of "ceph
> pg
> >  query" does then show the original osd being queried for objects,
> > and indeed if I dig through the filesystem I find the same 0-byte
> files
> > dated from 2015-2016.
> >
> > This strongly implies to me that data loss occurred a long time in
> the
> > past and is not related to the osd host going down - this only
> triggered
> > the problem being found.
> >
> >
> > I would assume that too, but unless you had scrubbing disabled then it
> > should have been discovered long ago; I don’t understand how it could
> > have stayed hidden. Did you change any other settings recently?
> >
> > Or, what is this EC pool being used for, and what are the EC settings?
> > Having a bunch of empty files is not surprising if the objects are
> > smaller than the chunk/stripe size — then just the primary and the
> > parity locations would actually have data for them.
>
> We didn't have scrubbing disabled -it /was/ dialed back somewhat using
> the "osd scrub sleep|priority|load_threshold" parameters but I would see
> it running, and keeping an eye on the last scrub times for pgs looked
> ok. (I guess it is appropriate to revisit and maybe remove "tuning
> tweaks" such as the above, as ceph defaults change/improve over time.)
>
> The (4+2) EC pool is used as backing for radosgw - the pool itself
> contains ~400T data in ~300M objects.
>
> Having examined some specific "unfound" objects in some detail, I am
> less pessimistic but even more confused.
>
> For a huge number of the objects I find they are also listed as 0-bytes
> in size when listed via s3 - and these are grouped by "directory" (path)
> so it really makes me question whether these were ever successful
> uploads. It would explain the 0-byte shard files I see in filestore.
>
> However some objects definitely do contain data. I have some identified
> from last week as unfound, which I traced to 6x 0-bytes filestore files,
> and attempting to download them from s3 would simply stall (I do need to
> follow up on radosgw logs for this case).
>
> However these same test objects now *do* download successfully. The pg
> itself has reverted to active+clean+inconsistent state. The files
> contain image data so I can load them and see they are as expected. If I
> again trawl through all the pg filestore locations, I can still only
> find the six 0-byte files - where is the data coming from!?


Well, if the objects were uploaded using multi-part upload I believe the
objects you’re looking at here will only contain omap (or xattr?) entries
pointing to the part files, so the empty file data is to be expected. This
might also make slightly more sense in terms of the scrub inconsistencies
popping up, although I didn’t think any omap issues I remember should have
impacted rgw.

Other than that, I’m not sure how it would be turning 0 bytes of data into
the correct results.



>
> Here's an example (first 3 osds for the pg, the other 3 are the same):
> > root@cephmon1:~# ssh ceph03 find
> /var/lib/ceph/osd/ceph-295/current/70.3d6s0_head -name '*1089213*' -exec ls
> -ltr {} +
> > -rw-r--r-- 1 root root 0 Jan 30 13:50
> /var/lib/ceph/osd/ceph-295/current/70.3d6s0_head/DIR_6/DIR_D/DIR_3/DIR_0/DIR_0/DIR_4/default.325674.85\ubellplants\uimages\s1089213.jpg__head_794003D6__46__0
> > root@cephmon1:~# ssh ceph01 find
> /var/lib/ceph/osd/ceph-221/current/70.3d6s1_head -name '*1089213*' -exec ls
> -ltr {} +
> > -rw-r--r-- 1 root root 0 Aug 22  2015
> /var/lib/ceph/osd/ceph-221/current/70.3d6s1_head/DIR_6/DIR_D/DIR_3/DIR_0/DIR_0/DIR_4/default.325674.85\ubellplants\uimages\s1089213.jpg__head_794003D6__46__1
> > root@cephmon1:~# ssh ceph08 find
> /var/lib/ceph/osd/ceph-357/current/70.3d6s2_head -name '*1089213*' -exec ls
> -ltr {} +
> > -rw-r--r-- 1 root root 0 Feb 17  2016
> /var/lib/ceph/osd/ceph-357/current/70.3d6s2_head/DIR_6/DIR_D/DIR_3/DIR_0/DIR_0/DIR_4/default.325674.85\ubellplants\uimages\s1089213.jpg__head_794003D6__46__2
>
> Thanks for any insights!
>
> Graham
> --
> Graham Allan
> Minnesota Supercomputing Institute - g...@umn.edu
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-15 Thread Graham Allan

Hi Greg,

On 02/14/2018 11:49 AM, Gregory Farnum wrote:


On Tue, Feb 13, 2018 at 8:41 AM Graham Allan  wrote:


I'm replying to myself here, but it's probably worth mentioning that
after this started, I did bring back the failed host, though with "ceph
osd weight 0" to avoid more data movement.

For inconsistent pgs containing unfound objects, the output of "ceph pg
 query" does then show the original osd being queried for objects,
and indeed if I dig through the filesystem I find the same 0-byte files
dated from 2015-2016.

This strongly implies to me that data loss occurred a long time in the
past and is not related to the osd host going down - this only triggered
the problem being found.


I would assume that too, but unless you had scrubbing disabled then it 
should have been discovered long ago; I don’t understand how it could 
have stayed hidden. Did you change any other settings recently?


Or, what is this EC pool being used for, and what are the EC settings? 
Having a bunch of empty files is not surprising if the objects are 
smaller than the chunk/stripe size — then just the primary and the 
parity locations would actually have data for them.


We didn't have scrubbing disabled - it /was/ dialed back somewhat using 
the "osd scrub sleep|priority|load_threshold" parameters but I would see 
it running, and keeping an eye on the last scrub times for pgs looked 
ok. (I guess it is appropriate to revisit and maybe remove "tuning 
tweaks" such as the above, as ceph defaults change/improve over time.)


The (4+2) EC pool is used as backing for radosgw - the pool itself 
contains ~400T data in ~300M objects.
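
(The exact profile behind the 4+2 pool can be confirmed with something
like the following; pool and profile names are placeholders:)

ceph osd pool get <data-pool> erasure_code_profile
ceph osd erasure-code-profile get <profile-name>   # should show k=4, m=2, the plugin and failure domain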


Having examined some specific "unfound" objects in some detail, I am 
less pessimistic but even more confused.


For a huge number of the objects I find they are also listed as 0-bytes 
in size when listed via s3 - and these are grouped by "directory" (path) 
so it really makes me question whether these were ever successful 
uploads. It would explain the 0-byte shard files I see in filestore.


However some objects definitely do contain data. I have some identified 
from last week as unfound, which I traced to 6x 0-bytes filestore files, 
and attempting to download them from s3 would simply stall (I do need to 
follow up on radosgw logs for this case).


However these same test objects now *do* download successfully. The pg 
itself has reverted to active+clean+inconsistent state. The files 
contain image data so I can load them and see they are as expected. If I 
again trawl through all the pg filestore locations, I can still only 
find the six 0-byte files - where is the data coming from!?


Here's an example (first 3 osds for the pg, the other 3 are the same):
root@cephmon1:~# ssh ceph03 find /var/lib/ceph/osd/ceph-295/current/70.3d6s0_head -name '*1089213*' -exec ls -ltr {} +  
-rw-r--r-- 1 root root 0 Jan 30 13:50 /var/lib/ceph/osd/ceph-295/current/70.3d6s0_head/DIR_6/DIR_D/DIR_3/DIR_0/DIR_0/DIR_4/default.325674.85\ubellplants\uimages\s1089213.jpg__head_794003D6__46__0
root@cephmon1:~# ssh ceph01 find /var/lib/ceph/osd/ceph-221/current/70.3d6s1_head -name '*1089213*' -exec ls -ltr {} +  
-rw-r--r-- 1 root root 0 Aug 22  2015 /var/lib/ceph/osd/ceph-221/current/70.3d6s1_head/DIR_6/DIR_D/DIR_3/DIR_0/DIR_0/DIR_4/default.325674.85\ubellplants\uimages\s1089213.jpg__head_794003D6__46__1
root@cephmon1:~# ssh ceph08 find /var/lib/ceph/osd/ceph-357/current/70.3d6s2_head -name '*1089213*' -exec ls -ltr {} +  
-rw-r--r-- 1 root root 0 Feb 17  2016 /var/lib/ceph/osd/ceph-357/current/70.3d6s2_head/DIR_6/DIR_D/DIR_3/DIR_0/DIR_0/DIR_4/default.325674.85\ubellplants\uimages\s1089213.jpg__head_794003D6__46__2
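
(A sketch of how the expected placement can be cross-checked without
walking the filestore directories; the pool name is a placeholder:)

# shows the pg (70.3d6 here) and the up/acting OSD sets holding each shard
ceph osd map <data-pool> 'default.325674.85_bellplants_images/1089213.jpg'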


Thanks for any insights!

Graham
--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-14 Thread Gregory Farnum
On Tue, Feb 13, 2018 at 8:41 AM Graham Allan  wrote:

> I'm replying to myself here, but it's probably worth mentioning that
> after this started, I did bring back the failed host, though with "ceph
> osd weight 0" to avoid more data movement.
>
> For inconsistent pgs containing unfound objects, the output of "ceph pg
>  query" does then show the original osd being queried for objects,
> and indeed if I dig through the filesystem I find the same 0-byte files
> dated from 2015-2016.
>
> This strongly implies to me that data loss occurred a long time in the
> past and is not related to the osd host going down - this only triggered
> the problem being found.


I would assume that too, but unless you had scrubbing disabled then it
should have been discovered long ago; I don’t understand how it could have
stayed hidden. Did you change any other settings recently?

Or, what is this EC pool being used for, and what are the EC settings?
Having a bunch of empty files is not surprising if the objects are smaller
than the chunk/stripe size — then just the primary and the parity locations
would actually have data for them.



>
> Graham
>
> On 02/12/2018 06:26 PM, Graham Allan wrote:
> > Hi,
> >
> > For the past few weeks I've been seeing a large number of pgs on our
> > main erasure coded pool being flagged inconsistent, followed by them
> > becoming active+recovery_wait+inconsistent with unfound objects. The
> > cluster is currently running luminous 12.2.2 but has in the past also
> > run its way through firefly, hammer and jewel.
> >
> > Here's a sample object from "ceph pg list_missing" (there are 150
> > unfound objects in this particular pg):
> >
> > ceph health detail shows:
> >> pg 70.467 is stuck unclean for 1004525.715896, current state
> >> active+recovery_wait+inconsistent, last acting [449,233,336,323,259,193]
> >
> > ceph pg 70.467 list_missing:
> >> {
> >> "oid": {
> >> "oid":
> >>
> "default.323253.6_20150226/Downloads/linux-nvme-HEAD-5aa2ffa/include/config/via/fir.h",
> >>
> >> "key": "",
> >> "snapid": -2,
> >> "hash": 628294759,
> >> "max": 0,
> >> "pool": 70,
> >> "namespace": ""
> >> },
> >> "need": "73222'132227",
> >> "have": "0'0",
> >> "flags": "none",
> >> "locations": [
> >> "193(5)",
> >> "259(4)",
> >> "449(0)"
> >> ]
> >> },
> >
> > When I trace through the filesystem on each OSD, I find the associated
> > file present on each OSD but with size 0 bytes.
> >
> > Interestingly, for the 3 OSDs for which "list_missing" shows locations
> > above (193,259,449), the timestamp of the 0-byte file is recent (within
> > last few weeks). For the other 3 OSDs (233,336,323), it's in the distant
> > past (08/2015 and 02/2016).
> >
> > All the unfound objects I've checked on this pg show the same pattern,
> > along with the "have" epoch showing as "0'0".
> >
> > Other than the potential data loss being disturbing, I wonder why this
> > showed up so suddenly?
> >
> > It seems to have been triggered by one OSD host failing over a long
> > weekend. By the time we looked at it on Monday, the cluster had
> > re-balanced enough data that I decided to simply leave it - we had long
> > wanted to evacuate a first host to convert to a newer OS release, as
> > well as Bluestore. Perhaps this was a bad choice, but the cluster
> > recovery appeared to be proceeding normally, and was apparently complete
> > a few days later. It was only around a week later that the unfound
> > objects started.
> >
> > All the unfound object file fragments I've tracked down so far have
> > their older members with timestamps in the same mid-2015 to mid-2016
> > period. I could be wrong but this really seems like a long-standing
> > problem has just been unearthed. I wonder if it could be connected to
> > this thread from early 2016, concerning a problem on the same cluster:
> >
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008120.html
> >
> > It's a long thread, but the 0-byte files sound very like the "orphaned
> > files" in that thread - related to performing a directory split while
> > handling links on a filename with the special long filename handling...
> >
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008317.html
> >
> > However unlike that thread, I'm not finding any other files with
> > duplicate names in the hierarchy.
> >
> > I'm not sure there's much else I can do besides record the names of any
> > unfound objects before resorting to "mark_unfound_lost delete" - any
> > suggestions for further research?
> >
> > Thanks,
> >
> > Graham
>
> --
> Graham Allan
> Minnesota Supercomputing Institute - g...@umn.edu
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-13 Thread Graham Allan
I'm replying to myself here, but it's probably worth mentioning that 
after this started, I did bring back the failed host, though with "ceph 
osd weight 0" to avoid more data movement.


For inconsistent pgs containing unfound objects, the output of "ceph pg 
 query" does then show the original osd being queried for objects, 
and indeed if I dig through the filesystem I find the same 0-byte files 
dated from 2015-2016.
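
(The relevant piece of the query output is the might_have_unfound list in
the recovery state; a rough sketch, with the pg id as an example:)

ceph pg 70.467 query > /tmp/pg-70.467-query.json
grep -A20 might_have_unfound /tmp/pg-70.467-query.json
# each entry lists an osd and its status: already probed, querying, osd is down, ...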


This strongly implies to me that data loss occurred a long time in the 
past and is not related to the osd host going down - this only triggered 
the problem being found.


Graham

On 02/12/2018 06:26 PM, Graham Allan wrote:

Hi,

For the past few weeks I've been seeing a large number of pgs on our 
main erasure coded pool being flagged inconsistent, followed by them 
becoming active+recovery_wait+inconsistent with unfound objects. The 
cluster is currently running luminous 12.2.2 but has in the past also 
run its way through firefly, hammer and jewel.


Here's a sample object from "ceph pg list_missing" (there are 150 
unfound objects in this particular pg):


ceph health detail shows:
    pg 70.467 is stuck unclean for 1004525.715896, current state 
active+recovery_wait+inconsistent, last acting [449,233,336,323,259,193]


ceph pg 70.467 list_missing:

    {
    "oid": {
    "oid": 
"default.323253.6_20150226/Downloads/linux-nvme-HEAD-5aa2ffa/include/config/via/fir.h", 


    "key": "",
    "snapid": -2,
    "hash": 628294759,
    "max": 0,
    "pool": 70,
    "namespace": ""
    },
    "need": "73222'132227",
    "have": "0'0",
    "flags": "none",
    "locations": [
    "193(5)",
    "259(4)",
    "449(0)"
    ]
    },


When I trace through the filesystem on each OSD, I find the associated 
file present on each OSD but with size 0 bytes.


Interestingly, for the 3 OSDs for which "list_missing" shows locations 
above (193,259,449), the timestamp of the 0-byte file is recent (within 
last few weeks). For the other 3 OSDs (233,336,323), it's in the distant 
past (08/2015 and 02/2016).


All the unfound objects I've checked on this pg show the same pattern, 
along with the "have" epoch showing as "0'0".


Other than the potential data loss being disturbing, I wonder why this 
showed up so suddenly?


It seems to have been triggered by one OSD host failing over a long 
weekend. By the time we looked at it on Monday, the cluster had 
re-balanced enough data that I decided to simply leave it - we had long 
wanted to evacuate a first host to convert to a newer OS release, as 
well as Bluestore. Perhaps this was a bad choice, but the cluster 
recovery appeared to be proceeding normally, and was apparently complete 
a few days later. It was only around a week later that the unfound 
objects started.


All the unfound object file fragments I've tracked down so far have 
their older members with timestamps in the same mid-2015 to mid-2016 
period. I could be wrong but this really seems like a long-standing 
problem has just been unearthed. I wonder if it could be connected to 
this thread from early 2016, concerning a problem on the same cluster:


http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008120.html

It's a long thread, but the 0-byte files sound very like the "orphaned 
files" in that thread - related to performing a directory split while 
handling links on a filename with the special long filename handling...


http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008317.html

However unlike that thread, I'm not finding any other files with 
duplicate names in the hierarchy.


I'm not sure there's much else I can do besides record the names of any 
unfound objects before resorting to "mark_unfound_lost delete" - any 
suggestions for further research?


Thanks,

Graham


--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-12 Thread Graham Allan

Hi,

For the past few weeks I've been seeing a large number of pgs on our 
main erasure coded pool being flagged inconsistent, followed by them 
becoming active+recovery_wait+inconsistent with unfound objects. The 
cluster is currently running luminous 12.2.2 but has in the past also 
run its way through firefly, hammer and jewel.


Here's a sample object from "ceph pg list_missing" (there are 150 
unfound objects in this particular pg):


ceph health detail shows:

pg 70.467 is stuck unclean for 1004525.715896, current state 
active+recovery_wait+inconsistent, last acting [449,233,336,323,259,193]


ceph pg 70.467 list_missing:

{
"oid": {
"oid": 
"default.323253.6_20150226/Downloads/linux-nvme-HEAD-5aa2ffa/include/config/via/fir.h",
"key": "",
"snapid": -2,
"hash": 628294759,
"max": 0,
"pool": 70,
"namespace": ""
},
"need": "73222'132227",
"have": "0'0",
"flags": "none",
"locations": [
"193(5)",
"259(4)",
"449(0)"
]
},


When I trace through the filesystem on each OSD, I find the associated 
file present on each OSD but with size 0 bytes.


Interestingly, for the 3 OSDs for which "list_missing" shows locations 
above (193,259,449), the timestamp of the 0-byte file is recent (within 
last few weeks). For the other 3 OSDs (233,336,323), it's in the distant 
past (08/2015 and 02/2016).


All the unfound objects I've checked on this pg show the same pattern, 
along with the "have" epoch showing as "0'0".


Other than the potential data loss being disturbing, I wonder why this 
showed up so suddenly?


It seems to have been triggered by one OSD host failing over a long 
weekend. By the time we looked at it on Monday, the cluster had 
re-balanced enough data that I decided to simply leave it - we had long 
wanted to evacuate a first host to convert to a newer OS release, as 
well as Bluestore. Perhaps this was a bad choice, but the cluster 
recovery appeared to be proceeding normally, and was apparently complete 
a few days later. It was only around a week later that the unfound 
objects started.


All the unfound object file fragments I've tracked down so far have 
their older members with timestamps in the same mid-2015 to mid-2016 
period. I could be wrong but this really seems like a long-standing 
problem has just been unearthed. I wonder if it could be connected to 
this thread from early 2016, concerning a problem on the same cluster:


http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008120.html

It's a long thread, but the 0-byte files sound very like the "orphaned 
files" in that thread - related to performing a directory split while 
handling links on a filename with the special long filename handling...


http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008317.html

However unlike that thread, I'm not finding any other files with 
duplicate names in the hierarchy.


I'm not sure there's much else I can do besides record the names of any 
unfound objects before resorting to "mark_unfound_lost delete" - any 
suggestions for further research?
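
(For the record, a sketch of that sequence; the pg id is an example and
the last command discards data:)

# keep a copy of the unfound object names and versions first
ceph pg 70.467 list_missing > pg-70.467-unfound.json
rados list-inconsistent-obj 70.467 --format=json-pretty > pg-70.467-inconsistent.json

# only as a last resort, give up on the unfound objects in this pg
ceph pg 70.467 mark_unfound_lost delete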


Thanks,

Graham
--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com