Re: [ceph-users] PG_AVAILABILITY with one osd down?

2019-02-16 Thread Maks Kowalik
Clients' experience depends on whether, at that very moment, they need to
read from or write to the particular PGs involved in peering.
If their objects are placed in other PGs, then I/O operations shouldn't
be impacted.
If clients were performing I/O against the PGs that went into peering,
they will notice increased latency. That's the case for Object and RBD;
with CephFS I have no experience.
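
If you want to check up front whether a particular object would be hit,
you can ask the cluster which PG (and which OSDs) it maps to. A minimal
sketch in Python, assuming the 'ceph' CLI and an admin keyring on the
node; the pool/object names and the JSON field names are placeholders I
would expect, adjust as needed:

#!/usr/bin/env python
# Check whether an object lives in a PG served by a particular OSD
# (for example the one that just went down or is about to restart).
import json
import subprocess

POOL = "rbd"              # placeholder pool name
OBJECT = "some_object"    # placeholder object name
OSD_ID = 29               # the OSD of interest

out = subprocess.check_output(
    ["ceph", "osd", "map", POOL, OBJECT, "--format", "json"])
m = json.loads(out.decode())

acting = m.get("acting", [])
print("object %s/%s -> pg %s, acting set %s"
      % (POOL, OBJECT, m.get("pgid"), acting))
if OSD_ID in acting:
    print("osd.%d serves this PG; expect a short peering blip" % OSD_ID)
else:
    print("osd.%d is not involved; I/O to this object should be unaffected"
          % OSD_ID)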

Peering of several PGs does not mean the whole cluster is unavailable
during that time, only a tiny part of it.
Also, those 6 seconds are the duration of the PG_AVAILABILITY health check
warning, not the length of each PG's unavailability.
They only mean the cluster noticed that some groups performed peering
during that window.
In a proper setup and healthy conditions, a single group peers in a
fraction of a second.
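
Just to illustrate where the 6 seconds come from: it is simply the gap
between the PG_AVAILABILITY "failed" and "cleared" entries in the cluster
log from your original mail (timestamps copied from it):

from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"
failed  = datetime.strptime("2019-02-15 21:40:09.489494", FMT)
cleared = datetime.strptime("2019-02-15 21:40:15.060165", FMT)
print("warning window: %.1f s" % (cleared - failed).total_seconds())
# -> warning window: 5.6 s; the individual PGs peered well within that.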

Restarting an OSD causes the same thing, although it is smoother than an
unexpected death (going into the details would require quite a long
elaboration).
If your setup is correct, you should be able to perform a cluster-wide
rolling restart of everything, and the only effect visible from the
outside would be slightly increased latency.
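
For what it's worth, a rough sketch of such a rolling restart (not a
drop-in script; it assumes systemd-managed OSDs, passwordless ssh to the
OSD hosts, the 'ceph' CLI with an admin keyring, and a made-up
host-to-OSD mapping):

#!/usr/bin/env python
# Rough sketch of a rolling OSD restart. The host -> OSD mapping is a
# placeholder; adjust it to your cluster.
import subprocess
import time

HOSTS = {"bison": [29, 30, 31]}   # placeholder host -> OSD ids

def ceph(*args):
    return subprocess.check_output(("ceph",) + args).decode()

# 'noout' stops the cluster from rebalancing while OSDs bounce.
ceph("osd", "set", "noout")
try:
    for host, osds in HOSTS.items():
        for osd in osds:
            subprocess.check_call(
                ["ssh", host, "systemctl", "restart", "ceph-osd@%d" % osd])
        # Wait until peering/degradation from this host has cleared.
        # (HEALTH_OK won't appear while 'noout' is set, so check the
        # individual health codes instead.)
        while any(k in ceph("health", "detail")
                  for k in ("OSD_DOWN", "PG_AVAILABILITY", "PG_DEGRADED")):
            time.sleep(5)
finally:
    ceph("osd", "unset", "noout")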

Kind regards,
Maks


On Sat, 16 Feb 2019 at 21:39,  wrote:

> > Hello,
> > your log extract shows that:
> >
> > 2019-02-15 21:40:08 OSD.29 DOWN
> > 2019-02-15 21:40:09 PG_AVAILABILITY warning start
> > 2019-02-15 21:40:15 PG_AVAILABILITY warning cleared
> >
> > 2019-02-15 21:44:06 OSD.29 UP
> > 2019-02-15 21:44:08 PG_AVAILABILITY warning start
> > 2019-02-15 21:44:15 PG_AVAILABILITY warning cleared
> >
> > What you saw is the natural consequence of an OSD state change. Those two
> > periods of limited PG availability (6s each) are related to the peering
> > that happens shortly after an OSD goes down or comes up.
> > Basically, the placement groups stored on that OSD need to peer, so
> > incoming requests are directed to the other (alive) OSDs. And, yes,
> > during those few seconds the data are not accessible.
>
> Thanks, bear with my questions. I'm pretty new to Ceph.
> What will clients (CephFS, Object) experience?
> Will they just block until the time has passed and then get through?
>
> Does that mean I'll get 72 x 6 seconds of unavailability when doing
> a rolling restart of my OSDs during upgrades and such? Or is a
> controlled restart different from a crash?
>
> --
> Jesper.
>
>


Re: [ceph-users] PG_AVAILABILITY with one osd down?

2019-02-16 Thread jesper
> Hello,
> your log extract shows that:
>
> 2019-02-15 21:40:08 OSD.29 DOWN
> 2019-02-15 21:40:09 PG_AVAILABILITY warning start
> 2019-02-15 21:40:15 PG_AVAILABILITY warning cleared
>
> 2019-02-15 21:44:06 OSD.29 UP
> 2019-02-15 21:44:08 PG_AVAILABILITY warning start
> 2019-02-15 21:44:15 PG_AVAILABILITY warning cleared
>
> What you saw is the natural consequence of an OSD state change. Those two
> periods of limited PG availability (6s each) are related to the peering
> that happens shortly after an OSD goes down or comes up.
> Basically, the placement groups stored on that OSD need to peer, so
> incoming requests are directed to the other (alive) OSDs. And, yes,
> during those few seconds the data are not accessible.

Thanks, bear with my questions. I'm pretty new to Ceph.
What will clients (CephFS, Object) experience?
Will they just block until the time has passed and then get through?

Does that mean I'll get 72 x 6 seconds of unavailability when doing
a rolling restart of my OSDs during upgrades and such? Or is a
controlled restart different from a crash?

-- 
Jesper.



Re: [ceph-users] PG_AVAILABILITY with one osd down?

2019-02-16 Thread Maks Kowalik
Hello,
your log extract shows that:

2019-02-15 21:40:08 OSD.29 DOWN
2019-02-15 21:40:09 PG_AVAILABILITY warning start
2019-02-15 21:40:15 PG_AVAILABILITY warning cleared

2019-02-15 21:44:06 OSD.29 UP
2019-02-15 21:44:08 PG_AVAILABILITY warning start
2019-02-15 21:44:15 PG_AVAILABILITY warning cleared

What you saw is the natural consequence of an OSD state change. Those two
periods of limited PG availability (6s each) are related to the peering
that happens shortly after an OSD goes down or comes up.
Basically, the placement groups stored on that OSD need to peer, so
incoming requests are directed to the other (alive) OSDs. And, yes,
during those few seconds the data are not accessible.
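
If you want to see which groups those are before taking an OSD down,
something like the following lists the PGs currently mapped to it (a
minimal sketch, assuming the 'ceph' CLI and an admin keyring; osd.29 is
taken from the log extract below):

import subprocess

OSD = "osd.29"   # from the log extract below
out = subprocess.check_output(["ceph", "pg", "ls-by-osd", OSD]).decode()

# Keep only the rows whose first column looks like a pgid ("pool.seed"),
# which skips header/summary lines regardless of Ceph version.
pgids = [line.split()[0] for line in out.splitlines()
         if line.strip() and "." in line.split()[0]]
print("%d PGs currently mapped to %s" % (len(pgids), OSD))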

Kind regards,
Maks


On Sat, 16 Feb 2019 at 07:25,  wrote:

> Yesterday I saw this one.. it puzzles me:
> 2019-02-15 21:00:00.000126 mon.torsk1 mon.0 10.194.132.88:6789/0 604164 :
> cluster [INF] overall HEALTH_OK
> 2019-02-15 21:39:55.793934 mon.torsk1 mon.0 10.194.132.88:6789/0 604304 :
> cluster [WRN] Health check failed: 2 slow requests are blocked > 32 sec.
> Implicated osds 58 (REQUEST_SLOW)
> 2019-02-15 21:40:00.887766 mon.torsk1 mon.0 10.194.132.88:6789/0 604305 :
> cluster [WRN] Health check update: 6 slow requests are blocked > 32 sec.
> Implicated osds 9,19,52,58,68 (REQUEST_SLOW)
> 2019-02-15 21:40:06.973901 mon.torsk1 mon.0 10.194.132.88:6789/0 604306 :
> cluster [WRN] Health check update: 14 slow requests are blocked > 32 sec.
> Implicated osds 3,9,19,29,32,52,55,58,68,69 (REQUEST_SLOW)
> 2019-02-15 21:40:08.466266 mon.torsk1 mon.0 10.194.132.88:6789/0 604307 :
> cluster [INF] osd.29 failed (root=default,host=bison) (6 reporters from
> different host after 33.862482 >= grace 29.247323)
> 2019-02-15 21:40:08.473703 mon.torsk1 mon.0 10.194.132.88:6789/0 604308 :
> cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
> 2019-02-15 21:40:09.489494 mon.torsk1 mon.0 10.194.132.88:6789/0 604310 :
> cluster [WRN] Health check failed: Reduced data availability: 6 pgs
> peering (PG_AVAILABILITY)
> 2019-02-15 21:40:11.008906 mon.torsk1 mon.0 10.194.132.88:6789/0 604312 :
> cluster [WRN] Health check failed: Degraded data redundancy:
> 3828291/700353996 objects degraded (0.547%), 77 pgs degraded (PG_DEGRADED)
> 2019-02-15 21:40:13.474777 mon.torsk1 mon.0 10.194.132.88:6789/0 604313 :
> cluster [WRN] Health check update: 9 slow requests are blocked > 32 sec.
> Implicated osds 3,9,32,55,58,69 (REQUEST_SLOW)
> 2019-02-15 21:40:15.060165 mon.torsk1 mon.0 10.194.132.88:6789/0 604314 :
> cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data
> availability: 17 pgs peering)
> 2019-02-15 21:40:17.128185 mon.torsk1 mon.0 10.194.132.88:6789/0 604315 :
> cluster [WRN] Health check update: Degraded data redundancy:
> 9897139/700354131 objects degraded (1.413%), 200 pgs degraded
> (PG_DEGRADED)
> 2019-02-15 21:40:17.128219 mon.torsk1 mon.0 10.194.132.88:6789/0 604316 :
> cluster [INF] Health check cleared: REQUEST_SLOW (was: 2 slow requests are
> blocked > 32 sec. Implicated osds 32,55)
> 2019-02-15 21:40:22.137090 mon.torsk1 mon.0 10.194.132.88:6789/0 604317 :
> cluster [WRN] Health check update: Degraded data redundancy:
> 9897140/700354194 objects degraded (1.413%), 200 pgs degraded
> (PG_DEGRADED)
> 2019-02-15 21:40:27.249354 mon.torsk1 mon.0 10.194.132.88:6789/0 604318 :
> cluster [WRN] Health check update: Degraded data redundancy:
> 9897142/700354287 objects degraded (1.413%), 200 pgs degraded
> (PG_DEGRADED)
> 2019-02-15 21:40:33.335147 mon.torsk1 mon.0 10.194.132.88:6789/0 604322 :
> cluster [WRN] Health check update: Degraded data redundancy:
> 9897143/700354356 objects degraded (1.413%), 200 pgs degraded
> (PG_DEGRADED)
> ... shortened ..
> 2019-02-15 21:43:48.496536 mon.torsk1 mon.0 10.194.132.88:6789/0 604366 :
> cluster [WRN] Health check update: Degraded data redundancy:
> 9897168/700356693 objects degraded (1.413%), 200 pgs degraded, 201 pgs
> undersized (PG_DEGRADED)
> 2019-02-15 21:43:53.496924 mon.torsk1 mon.0 10.194.132.88:6789/0 604367 :
> cluster [WRN] Health check update: Degraded data redundancy:
> 9897170/700356804 objects degraded (1.413%), 200 pgs degraded, 201 pgs
> undersized (PG_DEGRADED)
> 2019-02-15 21:43:58.497313 mon.torsk1 mon.0 10.194.132.88:6789/0 604368 :
> cluster [WRN] Health check update: Degraded data redundancy:
> 9897172/700356879 objects degraded (1.413%), 200 pgs degraded, 201 pgs
> undersized (PG_DEGRADED)
> 2019-02-15 21:44:03.497696 mon.torsk1 mon.0 10.194.132.88:6789/0 604369 :
> cluster [WRN] Health check update: Degraded data redundancy:
> 9897174/700356996 objects degraded (1.413%), 200 pgs degraded, 201 pgs
> undersized (PG_DEGRADED)
> 2019-02-15 21:44:06.939331 mon.torsk1 mon.0 10.194.132.88:6789/0 604372 :
> cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
> 2019-02-15 21:44:06.965401 mon.torsk1 mon.0 10.194.132.88:6789/0 604373 :
> cluster [INF] osd.29 10.194.133.58:6844/305358 boot
> 2019-02-15 21:44:08.498060 mon.torsk1 mon.0 

[ceph-users] PG_AVAILABILITY with one osd down?

2019-02-15 Thread jesper
Yesterday I saw this one.. it puzzles me:
2019-02-15 21:00:00.000126 mon.torsk1 mon.0 10.194.132.88:6789/0 604164 :
cluster [INF] overall HEALTH_OK
2019-02-15 21:39:55.793934 mon.torsk1 mon.0 10.194.132.88:6789/0 604304 :
cluster [WRN] Health check failed: 2 slow requests are blocked > 32 sec.
Implicated osds 58 (REQUEST_SLOW)
2019-02-15 21:40:00.887766 mon.torsk1 mon.0 10.194.132.88:6789/0 604305 :
cluster [WRN] Health check update: 6 slow requests are blocked > 32 sec.
Implicated osds 9,19,52,58,68 (REQUEST_SLOW)
2019-02-15 21:40:06.973901 mon.torsk1 mon.0 10.194.132.88:6789/0 604306 :
cluster [WRN] Health check update: 14 slow requests are blocked > 32 sec.
Implicated osds 3,9,19,29,32,52,55,58,68,69 (REQUEST_SLOW)
2019-02-15 21:40:08.466266 mon.torsk1 mon.0 10.194.132.88:6789/0 604307 :
cluster [INF] osd.29 failed (root=default,host=bison) (6 reporters from
different host after 33.862482 >= grace 29.247323)
2019-02-15 21:40:08.473703 mon.torsk1 mon.0 10.194.132.88:6789/0 604308 :
cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-02-15 21:40:09.489494 mon.torsk1 mon.0 10.194.132.88:6789/0 604310 :
cluster [WRN] Health check failed: Reduced data availability: 6 pgs
peering (PG_AVAILABILITY)
2019-02-15 21:40:11.008906 mon.torsk1 mon.0 10.194.132.88:6789/0 604312 :
cluster [WRN] Health check failed: Degraded data redundancy:
3828291/700353996 objects degraded (0.547%), 77 pgs degraded (PG_DEGRADED)
2019-02-15 21:40:13.474777 mon.torsk1 mon.0 10.194.132.88:6789/0 604313 :
cluster [WRN] Health check update: 9 slow requests are blocked > 32 sec.
Implicated osds 3,9,32,55,58,69 (REQUEST_SLOW)
2019-02-15 21:40:15.060165 mon.torsk1 mon.0 10.194.132.88:6789/0 604314 :
cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data
availability: 17 pgs peering)
2019-02-15 21:40:17.128185 mon.torsk1 mon.0 10.194.132.88:6789/0 604315 :
cluster [WRN] Health check update: Degraded data redundancy:
9897139/700354131 objects degraded (1.413%), 200 pgs degraded
(PG_DEGRADED)
2019-02-15 21:40:17.128219 mon.torsk1 mon.0 10.194.132.88:6789/0 604316 :
cluster [INF] Health check cleared: REQUEST_SLOW (was: 2 slow requests are
blocked > 32 sec. Implicated osds 32,55)
2019-02-15 21:40:22.137090 mon.torsk1 mon.0 10.194.132.88:6789/0 604317 :
cluster [WRN] Health check update: Degraded data redundancy:
9897140/700354194 objects degraded (1.413%), 200 pgs degraded
(PG_DEGRADED)
2019-02-15 21:40:27.249354 mon.torsk1 mon.0 10.194.132.88:6789/0 604318 :
cluster [WRN] Health check update: Degraded data redundancy:
9897142/700354287 objects degraded (1.413%), 200 pgs degraded
(PG_DEGRADED)
2019-02-15 21:40:33.335147 mon.torsk1 mon.0 10.194.132.88:6789/0 604322 :
cluster [WRN] Health check update: Degraded data redundancy:
9897143/700354356 objects degraded (1.413%), 200 pgs degraded
(PG_DEGRADED)
... shortened ..
2019-02-15 21:43:48.496536 mon.torsk1 mon.0 10.194.132.88:6789/0 604366 :
cluster [WRN] Health check update: Degraded data redundancy:
9897168/700356693 objects degraded (1.413%), 200 pgs degraded, 201 pgs
undersized (PG_DEGRADED)
2019-02-15 21:43:53.496924 mon.torsk1 mon.0 10.194.132.88:6789/0 604367 :
cluster [WRN] Health check update: Degraded data redundancy:
9897170/700356804 objects degraded (1.413%), 200 pgs degraded, 201 pgs
undersized (PG_DEGRADED)
2019-02-15 21:43:58.497313 mon.torsk1 mon.0 10.194.132.88:6789/0 604368 :
cluster [WRN] Health check update: Degraded data redundancy:
9897172/700356879 objects degraded (1.413%), 200 pgs degraded, 201 pgs
undersized (PG_DEGRADED)
2019-02-15 21:44:03.497696 mon.torsk1 mon.0 10.194.132.88:6789/0 604369 :
cluster [WRN] Health check update: Degraded data redundancy:
9897174/700356996 objects degraded (1.413%), 200 pgs degraded, 201 pgs
undersized (PG_DEGRADED)
2019-02-15 21:44:06.939331 mon.torsk1 mon.0 10.194.132.88:6789/0 604372 :
cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-02-15 21:44:06.965401 mon.torsk1 mon.0 10.194.132.88:6789/0 604373 :
cluster [INF] osd.29 10.194.133.58:6844/305358 boot
2019-02-15 21:44:08.498060 mon.torsk1 mon.0 10.194.132.88:6789/0 604376 :
cluster [WRN] Health check update: Degraded data redundancy:
9897174/700357056 objects degraded (1.413%), 200 pgs degraded, 201 pgs
undersized (PG_DEGRADED)
2019-02-15 21:44:08.996099 mon.torsk1 mon.0 10.194.132.88:6789/0 604377 :
cluster [WRN] Health check failed: Reduced data availability: 12 pgs
peering (PG_AVAILABILITY)
2019-02-15 21:44:13.498472 mon.torsk1 mon.0 10.194.132.88:6789/0 604378 :
cluster [WRN] Health check update: Degraded data redundancy: 55/700357161
objects degraded (0.000%), 33 pgs degraded (PG_DEGRADED)
2019-02-15 21:44:15.081437 mon.torsk1 mon.0 10.194.132.88:6789/0 604379 :
cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data
availability: 12 pgs peering)
2019-02-15 21:44:18.498808 mon.torsk1 mon.0 10.194.132.88:6789/0 604380 :
cluster [WRN] Health check update: Degraded data redundancy: 14/700357230
objects degraded