Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem

2019-06-07 Thread Jan Pekař - Imatic

I think I found the source of my problems.

One of my monitors was on a failing disk. The disk would stop responding for a few seconds at a time, so that mon was frequently dropped from and re-added to the cluster (a few times a day). I suspect that a key exchange took place with this problematic monitor during one of those windows and did not succeed, or that the monitors did not handle the situation correctly and never completed the key exchange with the OSDs.


After replacing the system disk on the problematic mon, the verify_authorizer errors no longer appeared in the log.
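
For anyone who wants to catch a flapping mon like this earlier: below is a rough watcher, only a sketch - it assumes the ceph CLI is on the PATH and that "ceph quorum_status --format json" lists the in-quorum mons under "quorum_names" (true on Luminous, as far as I can tell):

#!/usr/bin/env python3
# Sketch: poll the monitor quorum and report membership changes.
# Assumes the ceph CLI is installed and the keyring permits quorum_status.
import json
import subprocess
import time

def quorum_names():
    out = subprocess.check_output(
        ["ceph", "quorum_status", "--format", "json"])
    return set(json.loads(out)["quorum_names"])

previous = quorum_names()
while True:
    time.sleep(10)
    current = quorum_names()
    if current != previous:
        print(time.strftime("%F %T"),
              "quorum changed:", sorted(previous), "->", sorted(current))
    previous = current

A mon that keeps dropping out and rejoining here would point straight at the kind of disk problem described above.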

With regards
Jan Pekar

On 01/05/2019 13.58, Jan Pekař - Imatic wrote:

Today the problem reappeared.

Restarting the mon helps, but it does not solve the issue.

Is there any way to debug this? Can I dump these keys from the MON, the OSDs, or other components? Can I debug the key exchange?

Thank you

On 27/04/2019 10.56, Jan Pekař - Imatic wrote:


On 26/04/2019 21.50, Gregory Farnum wrote:

On Fri, Apr 26, 2019 at 10:55 AM Jan Pekař - Imatic  wrote:

Hi,

yesterday my cluster reported slow requests for minutes, and after I restarted the OSDs that were reporting them, it got stuck with peering PGs. The whole cluster was unresponsive and IO stopped.

I also noticed that the problem was with cephx - all OSDs were reporting the same error (even the same secret_id):

cephx: verify_authorizer could not get service secret for service osd 
secret_id=14086
.. conn(0x559e15a5 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 
cs=0 l=1).handle_connect_msg: got bad authorizer
auth: could not find secret_id=14086

My questions are:

Why did that happen?
Can I prevent the cluster from stopping (with cephx enabled)?
How quickly do keys rotate/expire, and can I somehow check for problems with that?

I'm running NTP on all nodes (including the ceph monitors), so time should not be the issue. I noticed that some monitor nodes have no timezone set, but I hope the MONs use UTC when distributing keys to clients. Or can a different timezone between MON and OSD cause the problem?

Hmm yeah, it's probably not using UTC. (Despite it being good
practice, it's actually not an easy default to adhere to.) cephx
requires synchronized clocks and probably the same timezone (though I
can't swear to that.)


I "fixed" the problem by restarting monitors.

It happened for the second time in the last 3 months, so I'm reporting it as an issue that can happen.

I also noticed in all OSDs logs

2019-04-25 10:06:55.652239 7faf00096700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-25 09:06:55.65)

approximately 7 hours before the problem occurred. I can see that it is related to the issue. But why 7 hours? Is there some timeout or grace period during which old keys are still accepted before they are invalidated?

7 hours shouldn't be directly related. IIRC by default a new rotating key is issued every hour; the monitor gives out the current and next key on request, and daemons accept keys within a half-hour offset of what they believe the current time to be. Something like that.
-Greg


If it is as you wrote, then UTC is not the problem. I've been running the cluster with this configuration for over a year, and there have been only 2-3 incidents of this kind.

I changed the timezone and restarted the services, and now I will wait...
I forgot to mention that I'm running Luminous.

JP


Thank you

With regards

Jan Pekar

--

Ing. Jan Pekař
jan.pe...@imatic.cz

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

--



--

Ing. Jan Pekař
jan.pe...@imatic.cz | +420603811737

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

--



Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem

2019-05-01 Thread Jan Pekař - Imatic

Today the problem reappeared.

Restarting the mon helps, but it does not solve the issue.

Is there any way to debug this? Can I dump these keys from the MON, the OSDs, or other components? Can I debug the key exchange?
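
One crude self-check - just a sketch, assuming OSD logs under /var/log/ceph/ and the exact message texts quoted below - is to scan the logs for the two cephx symptoms from this thread:

#!/usr/bin/env python3
# Sketch: scan OSD logs for the two cephx symptoms seen in this thread.
# Assumes default log names like /var/log/ceph/ceph-osd.*.log.
import glob

PATTERNS = (
    "verify_authorizer could not get service secret",
    "_check_auth_rotating possible clock skew",
)

for path in sorted(glob.glob("/var/log/ceph/ceph-osd.*.log")):
    with open(path, errors="replace") as f:
        for line in f:
            if any(p in line for p in PATTERNS):
                print(path + ": " + line.rstrip())

Beyond that, raising the auth debug level (e.g. debug_auth = 20) should make the cephx exchange itself visible in the daemon logs, though I can't say how readable the output gets.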

Thank you

On 27/04/2019 10.56, Jan Pekař - Imatic wrote:


On 26/04/2019 21.50, Gregory Farnum wrote:

On Fri, Apr 26, 2019 at 10:55 AM Jan Pekař - Imatic  wrote:

Hi,

yesterday my cluster reported slow requests for minutes, and after I restarted the OSDs that were reporting them, it got stuck with peering PGs. The whole cluster was unresponsive and IO stopped.

I also noticed that the problem was with cephx - all OSDs were reporting the same error (even the same secret_id):

cephx: verify_authorizer could not get service secret for service osd 
secret_id=14086
.. conn(0x559e15a5 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 
cs=0 l=1).handle_connect_msg: got bad authorizer
auth: could not find secret_id=14086

My questions are:

Why did that happen?
Can I prevent the cluster from stopping (with cephx enabled)?
How quickly do keys rotate/expire, and can I somehow check for problems with that?

I'm running NTP on all nodes (including the ceph monitors), so time should not be the issue. I noticed that some monitor nodes have no timezone set, but I hope the MONs use UTC when distributing keys to clients. Or can a different timezone between MON and OSD cause the problem?

Hmm yeah, it's probably not using UTC. (Despite it being good
practice, it's actually not an easy default to adhere to.) cephx
requires synchronized clocks and probably the same timezone (though I
can't swear to that.)


I "fixed" the problem by restarting monitors.

It happened for the second time in the last 3 months, so I'm reporting it as an issue that can happen.

I also noticed in all OSDs logs

2019-04-25 10:06:55.652239 7faf00096700 -1 monclient: _check_auth_rotating 
possible clock skew, rotating keys expired way too early (before
2019-04-25 09:06:55.65)

approximately 7 hours before the problem occurred. I can see that it is related to the issue. But why 7 hours? Is there some timeout or grace period during which old keys are still accepted before they are invalidated?

7 hours shouldn't be directly related. IIRC by default a new rotating key is issued every hour; the monitor gives out the current and next key on request, and daemons accept keys within a half-hour offset of what they believe the current time to be. Something like that.
-Greg


If it is as you wrote, then UTC is not the problem. I've been running the cluster with this configuration for over a year, and there have been only 2-3 incidents of this kind.

I changed the timezone and restarted the services, and now I will wait...
I forgot to mention that I'm running Luminous.

JP


Thank you

With regards

Jan Pekar

--

Ing. Jan Pekař
jan.pe...@imatic.cz

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

--



Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem

2019-04-27 Thread Jan Pekař - Imatic


On 26/04/2019 21.50, Gregory Farnum wrote:

On Fri, Apr 26, 2019 at 10:55 AM Jan Pekař - Imatic  wrote:

Hi,

yesterday my cluster reported slow requests for minutes, and after I restarted the OSDs that were reporting them, it got stuck with peering PGs. The whole cluster was unresponsive and IO stopped.

I also noticed that the problem was with cephx - all OSDs were reporting the same error (even the same secret_id):

cephx: verify_authorizer could not get service secret for service osd 
secret_id=14086
.. conn(0x559e15a5 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 
cs=0 l=1).handle_connect_msg: got bad authorizer
auth: could not find secret_id=14086

My questions are:

Why did that happen?
Can I prevent the cluster from stopping (with cephx enabled)?
How quickly do keys rotate/expire, and can I somehow check for problems with that?

I'm running NTP on all nodes (including the ceph monitors), so time should not be the issue. I noticed that some monitor nodes have no timezone set, but I hope the MONs use UTC when distributing keys to clients. Or can a different timezone between MON and OSD cause the problem?

Hmm yeah, it's probably not using UTC. (Despite it being good
practice, it's actually not an easy default to adhere to.) cephx
requires synchronized clocks and probably the same timezone (though I
can't swear to that.)


I "fixed" the problem by restarting monitors.

It happened for the second time in the last 3 months, so I'm reporting it as an issue that can happen.

I also noticed in all OSDs logs

2019-04-25 10:06:55.652239 7faf00096700 -1 monclient: _check_auth_rotating 
possible clock skew, rotating keys expired way too early (before
2019-04-25 09:06:55.65)

approximately 7 hours before the problem occurred. I can see that it is related to the issue. But why 7 hours? Is there some timeout or grace period during which old keys are still accepted before they are invalidated?

7 hours shouldn't be directly related. IIRC by default a new rotating key is issued every hour; the monitor gives out the current and next key on request, and daemons accept keys within a half-hour offset of what they believe the current time to be. Something like that.
-Greg


If it is as you wrote, then UTC is not the problem. I've been running the cluster with this configuration for over a year, and there have been only 2-3 incidents of this kind.

I changed the timezone and restarted the services, and now I will wait...
I forgot to mention that I'm running Luminous.

JP


Thank you

With regards

Jan Pekar

--

Ing. Jan Pekař
jan.pe...@imatic.cz

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

--



Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem

2019-04-26 Thread Brian Topping
> On Apr 26, 2019, at 1:50 PM, Gregory Farnum  wrote:
> 
> Hmm yeah, it's probably not using UTC. (Despite it being good
> practice, it's actually not an easy default to adhere to.) cephx
> requires synchronized clocks and probably the same timezone (though I
> can't swear to that.)

Apps don’t “see” timezones; timezones are a rendering transform of an absolute time. The instant “now” is the same throughout space and time, regardless of how that instant is quantified. UNIX wall time is just one such quantification.

Problems ensue when the rendered time is incorrect for the time zone shown in the rendering. If a machine that is “not using time zones” shows the time as 3PM UTC and its owner lives in London, the internal time will be correct. On the other hand, if they live in NYC, the internal time is incorrect. That is to say, 15:00 UTC rendered as 3PM in NYC is very wrong *because it’s not 3PM in London*, where UTC matches local time.

tl;dr: Make sure that your clock is set to the correct time for whatever time zone the rendering is set to. It doesn’t matter where the system actually resides or whether the TZ matches its geographic location.
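
To make that concrete, here is a small sketch (Python 3.9+, standard zoneinfo module) showing one absolute instant rendered in two zones - the renderings differ, the instant does not:

#!/usr/bin/env python3
# One absolute instant rendered in two time zones: the wall-clock text
# differs, but the instant itself (and any comparison) is identical.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

instant = datetime(2019, 4, 25, 15, 0, tzinfo=timezone.utc)  # 15:00 UTC

london = instant.astimezone(ZoneInfo("Europe/London"))
nyc = instant.astimezone(ZoneInfo("America/New_York"))

print(london)         # 2019-04-25 16:00:00+01:00 (BST in April)
print(nyc)            # 2019-04-25 11:00:00-04:00 (EDT)
print(london == nyc)  # True - same instant, different rendering

If that view is right, daemons may render local time however they like as long as the underlying clocks agree; it is a genuinely skewed clock, not a timezone setting, that should break the rotating keys.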


Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem

2019-04-26 Thread Gregory Farnum
On Fri, Apr 26, 2019 at 10:55 AM Jan Pekař - Imatic  wrote:
>
> Hi,
>
> yesterday my cluster reported slow requests for minutes, and after I restarted the OSDs that were reporting them, it got stuck with peering PGs. The whole cluster was unresponsive and IO stopped.
>
> I also noticed that the problem was with cephx - all OSDs were reporting the same error (even the same secret_id):
>
> cephx: verify_authorizer could not get service secret for service osd 
> secret_id=14086
> .. conn(0x559e15a5 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH 
> pgs=0 cs=0 l=1).handle_connect_msg: got bad authorizer
> auth: could not find secret_id=14086
>
> My questions are:
>
> Why did that happen?
> Can I prevent the cluster from stopping (with cephx enabled)?
> How quickly do keys rotate/expire, and can I somehow check for problems with that?
>
> I'm running NTP on all nodes (including the ceph monitors), so time should not be the issue. I noticed that some monitor nodes have no timezone set, but I hope the MONs use UTC when distributing keys to clients. Or can a different timezone between MON and OSD cause the problem?

Hmm yeah, it's probably not using UTC. (Despite it being good
practice, it's actually not an easy default to adhere to.) cephx
requires synchronized clocks and probably the same timezone (though I
can't swear to that.)

>
> I "fixed" the problem by restarting monitors.
>
> It happened for the second time in the last 3 months, so I'm reporting it as an issue that can happen.
>
> I also noticed in all OSDs logs
>
> 2019-04-25 10:06:55.652239 7faf00096700 -1 monclient: _check_auth_rotating 
> possible clock skew, rotating keys expired way too early (before
> 2019-04-25 09:06:55.65)
>
> approximately 7 hours before the problem occurred. I can see that it is related to the issue. But why 7 hours? Is there some timeout or grace period during which old keys are still accepted before they are invalidated?

7 hours shouldn't be directly related. IIRC by default a new rotating key is issued every hour; the monitor gives out the current and next key on request, and daemons accept keys within a half-hour offset of what they believe the current time to be. Something like that.
-Greg
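
To make the window concrete, here is a toy model of the scheme described above - emphatically not Ceph's actual code, and assuming the hourly period comes from auth_service_ticket_ttl (3600 s by default, if I remember right):

#!/usr/bin/env python3
# Toy model of the rotating-key scheme described above (NOT Ceph's actual
# code): a new key appears every TTL seconds, daemons hold the current and
# next key, and a peer can only verify an authorizer whose secret_id it holds.
TTL = 3600  # assumed rotation period in seconds

def secret_id(clock: float) -> int:
    # Toy identifier: one id per TTL-sized slot of absolute time.
    return int(clock // TTL)

def keys_held(clock: float) -> set:
    # Per the description above: a daemon holds the current and next key.
    cur = secret_id(clock)
    return {cur, cur + 1}

def verify(sender_clock: float, receiver_clock: float) -> bool:
    return secret_id(sender_clock) in keys_held(receiver_clock)

now = 1556200000.0  # an instant in April 2019
print(verify(now, now))             # True: clocks agree
print(verify(now, now - 1800))      # True: sender's key is still known
print(verify(now, now - 3 * 3600))  # False: "could not find secret_id"

Incidentally, if the real secret_id really does advance once per hourly rotation, secret_id=14086 works out to roughly 587 days, which fits a cluster that has been up for over a year.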

> Thank you
>
> With regards
>
> Jan Pekar
>
> --
> 
> Ing. Jan Pekař
> jan.pe...@imatic.cz
> 
> Imatic | Jagellonská 14 | Praha 3 | 130 00
> http://www.imatic.cz
> 
> --
>


[ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem

2019-04-26 Thread Jan Pekař - Imatic

Hi,

yesterday my cluster reported slow requests for minutes, and after I restarted the OSDs that were reporting them, it got stuck with peering PGs. The whole cluster was unresponsive and IO stopped.


I also noticed that the problem was with cephx - all OSDs were reporting the same error (even the same secret_id):

cephx: verify_authorizer could not get service secret for service osd 
secret_id=14086
.. conn(0x559e15a5 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 
cs=0 l=1).handle_connect_msg: got bad authorizer
auth: could not find secret_id=14086

My questions are:

Why did that happen?
Can I prevent the cluster from stopping (with cephx enabled)?
How quickly do keys rotate/expire, and can I somehow check for problems with that?

I'm running NTP on all nodes (including the ceph monitors), so time should not be the issue. I noticed that some monitor nodes have no timezone set, but I hope the MONs use UTC when distributing keys to clients. Or can a different timezone between MON and OSD cause the problem?


I "fixed" the problem by restarting monitors.

It happened for the second time in the last 3 months, so I'm reporting it as an issue that can happen.

I also noticed in all OSDs logs

2019-04-25 10:06:55.652239 7faf00096700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 
2019-04-25 09:06:55.65)


approximately 7 hours before the problem occurred. I can see that it is related to the issue. But why 7 hours? Is there some timeout or grace period during which old keys are still accepted before they are invalidated?


Thank you

With regards

Jan Pekar

--

Ing. Jan Pekař
jan.pe...@imatic.cz

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

--
