Re: [ceph-users] Ceph Recovery Assistance, pgs stuck peering

2016-03-08 Thread David Zafman


I expected it to return to osd.36.  Oh, if you set "noout" during this 
process then the pg won't move around when you down osd.36.  I expected 
osd.36 to go down and back up quickly.


Also, pg 10.4f is in the same situation, so try the same thing on osd.6.
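
Roughly, the sequence I have in mind (osd ids as in this thread):

    ceph osd set noout        # keep the osds from being marked out while they bounce
    ceph osd down 36          # osd.36 rejoins and re-peers pg 12.7a1
    ceph osd down 6           # same thing for pg 10.4f
    # once the pgs are active+clean:
    ceph osd unset noout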

David

Re: [ceph-users] Ceph Recovery Assistance, pgs stuck peering

2016-03-08 Thread Ben Hines
After making that setting, the pg appeared to start peering, but then it
actually changed the primary OSD to osd.100 and went incomplete again.
Perhaps it did that because another OSD had more data? I presume I need to
set that value on each osd the pg hops to.
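
Something like this, I assume, to find where the pg has hopped to and set the
flag there (not sure injectargs takes effect for peering; putting it in
ceph.conf for that osd and restarting is probably the safer route):

    ceph pg map 12.7a1       # shows the current up/acting set, e.g. primary osd.100
    ceph tell osd.100 injectargs '--osd_find_best_info_ignore_history_les 1'
    ceph osd down 100        # force that osd to re-peer the pg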

-Ben

Re: [ceph-users] Ceph Recovery Assistance, pgs stuck peering

2016-03-08 Thread David Zafman


Ben,

I haven't looked at everything in your message, but pg 12.7a1 has lost
data because of writes that went only to osd.73.  The way to recover
this is to force recovery to ignore this fact and go with whatever data
you have on the remaining OSDs.
I assume this was caused by having min_size 1, multiple nodes failing
while clients continued to write, and then permanently losing osd.73.


You should TEMPORARILY set the osd_find_best_info_ignore_history_les config
variable to 1 on osd.36 and then mark it down (ceph osd down), so it
will rejoin, re-peer and mark the pg active+clean. Don't forget to set
osd_find_best_info_ignore_history_les back to 0.
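
A minimal sketch of that, assuming the option can be injected at runtime (if
not, set it in ceph.conf under [osd.36] and restart the daemon instead):

    ceph tell osd.36 injectargs '--osd_find_best_info_ignore_history_les 1'
    ceph osd down 36         # osd.36 rejoins and re-peers the pg
    # once pg 12.7a1 is active+clean:
    ceph tell osd.36 injectargs '--osd_find_best_info_ignore_history_les 0'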


Later you should fix your crush map.  See 
http://docs.ceph.com/docs/master/rados/operations/crush-map/


The wrong placements make you vulnerable to a single host failure
taking out multiple copies of an object.
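
The usual decompile/edit/recompile cycle for that looks roughly like this
(expect a rebalance when the corrected map goes in):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # edit crushmap.txt so every osd sits under its real host bucket
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new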


David

[ceph-users] Ceph Recovery Assistance, pgs stuck peering

2016-03-07 Thread Ben Hines
Howdy,

I was hoping someone could help me recover a couple of pgs which are causing
problems in my cluster. If we aren't able to resolve this soon, we may have
to just destroy them and lose some data. Recovery has so far been
unsuccessful. Data loss would probably cause some here to reconsider Ceph
as something we'll stick with long term, so I'd love to recover it.

Ceph 9.2.1. I have 4 (well, 3 now) pgs which are incomplete + stuck peering
after a disk failure.

pg 12.7a1 query: https://gist.github.com/benh57/ba4f96103e1f6b3b7a4d
pg 12.7b query: https://gist.github.com/benh57/8db0bfccc5992b9ca71a
pg 10.4f query:  https://gist.github.com/benh57/44bdd2a19ea667d920ab
ceph osd tree: https://gist.github.com/benh57/9fc46051a0f09b6948b7
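
(For completeness, roughly how the state above was captured:)

    ceph pg dump_stuck inactive          # lists the pgs stuck peering/incomplete
    ceph pg 12.7a1 query > pg-12.7a1.json
    ceph osd tree > osd-tree.txt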

- The bad OSD (osd-73) was on mtl-024. There were no 'unfound' objects when
it went down; the pg was 'down + peering'. It was marked lost.
- After marking 73 lost, the new primary still wants to peer and flips
between peering and incomplete.
- Noticed '73' still shows in the pg query output for the bad pgs. (Maybe I
need to bring back an osd with the same name?)
- Noticed that the new primary got set to an osd (osd-77) which was on the
same node as osd-76, which had all the data. Figuring 77 couldn't peer
with 36 because it was on the same node, I set 77 out, 36 became primary
and 76 became one of the replicas. No change. (Commands sketched below.)
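
(Roughly, the commands behind those steps; 'lost' is irreversible, hence the
confirmation flag:)

    ceph osd lost 73 --yes-i-really-mean-it
    ceph osd out 77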

Startup logs of the primaries of the bad pgs (12.7a1, 10.4f), captured with
'debug osd = 20, debug filestore = 30, debug ms = 1' (large files; the config
snippet follows the links):

osd 36 (12.7a1) startup log:
https://raw.githubusercontent.com/benh57/cephdebugging/master/ceph-osd.36.log
osd 6 (10.4f) startup log:
https://raw.githubusercontent.com/benh57/cephdebugging/master/ceph-osd.6.log
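
(Roughly the ceph.conf snippet used on those hosts; since these are startup
logs, the values went into the config before restarting the osds rather than
being injected at runtime:)

    [osd]
        debug osd = 20
        debug filestore = 30
        debug ms = 1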


Some other Notes:

- Searching the OSDs for data in 12.7a1_head, I found that osd-76 has
12G, but primary osd-36 has only 728M. Another OSD which is out (100) also has
a copy of the data. Even after running a pg repair, the pg does not pick up
the data from 76 and remains stuck peering.
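
(How I sized up the copies, assuming default filestore paths, plus the repair
attempt:)

    du -sh /var/lib/ceph/osd/ceph-76/current/12.7a1_head    # ~12G
    du -sh /var/lib/ceph/osd/ceph-36/current/12.7a1_head    # ~728M
    ceph pg repair 12.7a1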

- One of the pgs was part of a pool which was no longer needed (the unused
radosgw .rgw.control pool, with one 0kb object in it). Per previous steps
discussed here for a similar failure, I attempted these recovery steps on
it, to see if they would work for the others (the full command sequence is
sketched after the list):

-- The failed osd's disk only mounts read-only, which causes
ceph-objectstore-tool to fail to export, so I exported from a seemingly
good copy on another osd.
-- stopped all osds
-- exported the pg with ceph-objectstore-tool from the apparently good OSD
-- removed the pg from all osds which had it, using ceph-objectstore-tool
-- imported the pg into an out osd, osd-100

  Importing pgid 4.95
Write 4/88aa5c95/notify.2/head
Import successful

-- Force recreated the pg on the cluster:
   ceph pg force_create_pg 4.95
-- brought up all osds
-- new pg 4.95 primary gets set to osd-99 + osd-64, 0 objects
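
Roughly, the sequence used (paths and placeholder ids are illustrative; all
osds were stopped while ceph-objectstore-tool ran):

    # on the osd that still had a good copy (substitute its id for $ID)
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$ID \
        --journal-path /var/lib/ceph/osd/ceph-$ID/journal \
        --pgid 4.95 --op export --file /tmp/pg4.95.export
    # on every osd that held the pg:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$ID \
        --journal-path /var/lib/ceph/osd/ceph-$ID/journal \
        --pgid 4.95 --op remove
    # import into the out osd (osd-100):
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100 \
        --journal-path /var/lib/ceph/osd/ceph-100/journal \
        --pgid 4.95 --op import --file /tmp/pg4.95.export
    # then, with the osds back up:
    ceph pg force_create_pg 4.95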

However, the object doesn't sync to the pg from osd-100; instead osd.64
tells osd.100 to remove its copy:

2016-03-05 15:44:22.858147 7fc004168700 20 osd.100 68025 _dispatch
0x7fc020867660 osd pg remove(epoch 68025; pg4.95; ) v2
2016-03-05 15:44:22.858174 7fc004168700  7 osd.100 68025 handle_pg_remove
from osd.64 on 1 pgs
2016-03-05 15:44:22.858176 7fc004168700 15 osd.100 68025
require_same_or_newer_map 68025 (i am 68025) 0x7fc020867660
2016-03-05 15:44:22.858188 7fc004168700  5 osd.100 68025
queue_pg_for_deletion: 4.95
2016-03-05 15:44:22.858228 7fc004168700 15 osd.100 68025 project_pg_history
4.95 from 68025 to 68025, start ec=76 les/c/f 62655/62611/0
66982/67983/66982

Not wanting this to happen to the data I need from the other PGs, I didn't
try this procedure with those PGs. After this procedure osd-100 does get
listed in 'pg query' as 'might_have_unfound', but ceph apparently decides
not to use it and the active osd sends a remove.

output of 'ceph pg 4.95 query' after these recovery steps:
https://gist.github.com/benh57/fc9a847cd83f4d5e4dcf


Quite Possibly Related:

I am occasionally noticing some incorrectness in 'ceph osd tree'. It seems
my crush map thinks some osds are on the wrong hosts. I wonder if this is
why peering is failing?
(example)
 -5   9.04999 host cld-mtl-006
 12   1.81000 osd.12   up  1.0  1.0
 13   1.81000 osd.13   up  1.0  1.0
 14   1.81000 osd.14   up  1.0  1.0
 94   1.81000 osd.94   up  1.0  1.0
 26   1.81000 osd.26   up  0.86775  1.0

^^ This host only has 4 osds on it! osd.26 is actually running over on
cld-mtl-004! Restarting 26 fixed the map.
osd.42 (out) was also in the wrong place in 'osd tree'; the tree says it's on
cld-mtl-013, but it's actually on cld-mtl-024.
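
(One way to pin a misplaced osd to its real host explicitly, using the weight
shown in the tree above; restarting the osd with 'osd crush update on start'
enabled has the same effect:)

    ceph osd crush set osd.26 1.81000 host=cld-mtl-004
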
- Fixing these issues caused a large re-balance, so 'ceph health detail' is
a bit dirty right now, but you can see the stuck pgs:
ceph health detail:

-  I wonder if these incorrect crushmaps caused ceph to put some data on