Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem
I think I found the source of my problems. One of my monitors was on the problematic disk. The disk was not responding for a few seconds at a time, so the mon was taken out of and brought back into the cluster frequently (a few times a day). I suspect that a key exchange took place through this problematic monitor during such an episode and did not complete, or that the monitors did not handle the situation correctly and failed to complete the key exchange with the OSDs.

After replacing the system disk on the problematic mon, the verify_authorizer problem no longer appeared in the log.

With regards
Jan Pekar

On 01/05/2019 13.58, Jan Pekař - Imatic wrote:
> Today the problem reappeared. Restarting the mon helps, but it does not solve the issue.
>
> Is there any way to debug this? Can I dump the keys from the MON, the OSDs, or other components? Can I debug the key exchange?
>
> Thank you
>
> [...]

--
Ing. Jan Pekař
jan.pe...@imatic.cz | +420603811737

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem
Today the problem reappeared. Restarting the mon helps, but it does not solve the issue.

Is there any way to debug this? Can I dump the keys from the MON, the OSDs, or other components? Can I debug the key exchange?

Thank you

On 27/04/2019 10.56, Jan Pekař - Imatic wrote:
> If it is as you wrote, UTC is not the problem. I have been running the cluster with this configuration for over a year, and there were only 2-3 incidents of this kind.
>
> I changed the timezone, restarted the services, and I will wait...
>
> I forgot to mention that I'm running Luminous.
>
> [...]
Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem
On 26/04/2019 21.50, Gregory Farnum wrote:
> On Fri, Apr 26, 2019 at 10:55 AM Jan Pekař - Imatic wrote:
>> I'm running NTP on the nodes (and also on the ceph monitors), so time should not be the issue. I noticed that some monitor nodes have no timezone set, but I hope the MONs use UTC to distribute keys to clients. Or can a different timezone between MON and OSD cause the problem?
>
> Hmm yeah, it's probably not using UTC. (Despite it being good practice, it's actually not an easy default to adhere to.) cephx requires synchronized clocks and probably the same timezone (though I can't swear to that.)
>
>> But why 7 hours? Is there some timeout or grace period during which old keys can be used before they are invalidated?
>
> 7 hours shouldn't be directly related. IIRC by default a new rotating key is issued every hour, it gives out the current and next key on request, and daemons accept keys within a half-hour offset of what they believe the current time to be. Something like that.
> -Greg

If it is as you wrote, UTC is not the problem. I have been running the cluster with this configuration for over a year, and there were only 2-3 incidents of this kind.

I changed the timezone, restarted the services, and I will wait...

I forgot to mention that I'm running Luminous.

JP

--
Ing. Jan Pekař
jan.pe...@imatic.cz

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz
Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem
> On Apr 26, 2019, at 1:50 PM, Gregory Farnum wrote:
>
> Hmm yeah, it's probably not using UTC. (Despite it being good
> practice, it's actually not an easy default to adhere to.) cephx
> requires synchronized clocks and probably the same timezone (though I
> can't swear to that.)

Apps don't "see" timezones; timezones are a rendering transform of an absolute time. The instant "now" is the same throughout space and time, regardless of how that instant is quantified, and UNIX wall time is just one such quantification.

Problems ensue when the rendered time is incorrect for the time zone shown in the rendering. If a machine that is "not using time zones" shows that the time is 3 PM UTC and its operator lives in London, the internal time will be correct. On the other hand, if they live in NYC, the internal time is incorrect. That is to say, 15:00 UTC rendered as 3 PM in NYC is very wrong *because it's not 3 PM in London*, where UTC is true.

tl;dr: Make sure that your clock is set to the correct time for whatever time zone the rendering uses. It doesn't matter where the system actually resides or whether the TZ matches its geographic location.
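The point that timezones are only a rendering of an absolute instant can be demonstrated with a short, self-contained Python snippet (illustrative only, nothing Ceph-specific; the fixed-offset zone below is a hypothetical stand-in for NYC during DST):

```python
# An epoch timestamp identifies one absolute instant; timezones only
# change how that instant is displayed, not when it "is".
from datetime import datetime, timezone, timedelta

instant = 1556290800  # a fixed instant, seconds since the UNIX epoch

utc = datetime.fromtimestamp(instant, tz=timezone.utc)
nyc = datetime.fromtimestamp(instant, tz=timezone(timedelta(hours=-4)))  # EDT

# Same instant, two different renderings:
print(utc.isoformat())  # 2019-04-26T15:00:00+00:00
print(nyc.isoformat())  # 2019-04-26T11:00:00-04:00

# The underlying absolute times are identical, so software that compares
# epoch timestamps is unaffected by the zone used for display.
assert utc.timestamp() == nyc.timestamp() == instant
```

The "wrong rendering" failure described above is the reverse situation: the operator sets the displayed wall-clock time correctly for their local zone while the system believes it is rendering UTC, which silently shifts the underlying epoch time by the zone offset.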
Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem
On Fri, Apr 26, 2019 at 10:55 AM Jan Pekař - Imatic wrote:
>
> Hi,
>
> yesterday my cluster reported slow requests for minutes, and after restarting the OSDs (the ones reporting slow requests) it got stuck with peering PGs. The whole cluster was not responding and IO stopped.
>
> I also noticed that the problem was with cephx - all OSDs were reporting the same thing (even the same secret_id):
>
> cephx: verify_authorizer could not get service secret for service osd secret_id=14086
> .. conn(0x559e15a5 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: got bad authorizer
> auth: could not find secret_id=14086
>
> My questions are:
>
> Why did that happen?
> Can I prevent the cluster from stopping to work (with cephx enabled)?
> How quickly are keys rotated/expired, and can I check for problems with that somehow?
>
> I'm running NTP on the nodes (and also on the ceph monitors), so time should not be the issue. I noticed that some monitor nodes have no timezone set, but I hope the MONs use UTC to distribute keys to clients. Or can a different timezone between MON and OSD cause the problem?

Hmm yeah, it's probably not using UTC. (Despite it being good practice, it's actually not an easy default to adhere to.) cephx requires synchronized clocks and probably the same timezone (though I can't swear to that.)

> I "fixed" the problem by restarting the monitors.
>
> It happened for the second time during the last 3 months, so I'm reporting it as an issue that can happen.
>
> I also noticed in all OSD logs
>
> 2019-04-25 10:06:55.652239 7faf00096700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-25 09:06:55.65)
>
> approximately 7 hours before the problem occurred. I can see that it is related to the issue. But why 7 hours? Is there some timeout or grace period during which old keys can be used before they are invalidated?

7 hours shouldn't be directly related. IIRC by default a new rotating key is issued every hour, it gives out the current and next key on request, and daemons accept keys within a half-hour offset of what they believe the current time to be. Something like that.
-Greg

> Thank you
>
> With regards
>
> Jan Pekar
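Greg's description of the rotating-key window can be sketched as a toy model (a simplified illustration, not Ceph's actual implementation; the function names, the exact set of accepted IDs, and the 1-hour period are assumptions based on his description above):

```python
# Toy model of time-derived rotating-key IDs, to show how clock skew
# produces "could not find secret_id" errors like the one in this thread.

ROTATION_PERIOD = 3600  # assumption: a new rotating key roughly every hour

def secret_id(epoch_seconds: int) -> int:
    """Map a timestamp to the rotating-key ID valid at that time."""
    return epoch_seconds // ROTATION_PERIOD

def acceptable_ids(server_now: int) -> set:
    """Per Greg: the mon hands out the current and next key, and daemons
    tolerate some offset, so only a narrow window of IDs is acceptable.
    The previous/current/next window here is an assumption."""
    cur = secret_id(server_now)
    return {cur - 1, cur, cur + 1}

def authorizer_accepted(client_now: int, server_now: int) -> bool:
    """A peer whose clock (or cached key set) drifts outside the window
    presents an ID the verifier cannot find -> 'bad authorizer'."""
    return secret_id(client_now) in acceptable_ids(server_now)
```

In this model a 7-hour skew puts the presented ID seven rotation periods away from anything the verifier holds, so every connection attempt fails until the daemons refresh their keys (e.g. after a mon restart), which matches the symptom in the logs even though the 7-hour figure itself has no special meaning.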
[ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem
Hi,

yesterday my cluster reported slow requests for minutes, and after restarting the OSDs (the ones reporting slow requests) it got stuck with peering PGs. The whole cluster was not responding and IO stopped.

I also noticed that the problem was with cephx - all OSDs were reporting the same thing (even the same secret_id):

cephx: verify_authorizer could not get service secret for service osd secret_id=14086
.. conn(0x559e15a5 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: got bad authorizer
auth: could not find secret_id=14086

My questions are:

Why did that happen?
Can I prevent the cluster from stopping to work (with cephx enabled)?
How quickly are keys rotated/expired, and can I check for problems with that somehow?

I'm running NTP on the nodes (and also on the ceph monitors), so time should not be the issue. I noticed that some monitor nodes have no timezone set, but I hope the MONs use UTC to distribute keys to clients. Or can a different timezone between MON and OSD cause the problem?

I "fixed" the problem by restarting the monitors.

It happened for the second time during the last 3 months, so I'm reporting it as an issue that can happen.

I also noticed in all OSD logs

2019-04-25 10:06:55.652239 7faf00096700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-25 09:06:55.65)

approximately 7 hours before the problem occurred. I can see that it is related to the issue. But why 7 hours? Is there some timeout or grace period during which old keys can be used before they are invalidated?

Thank you

With regards

Jan Pekar

--
Ing. Jan Pekař
jan.pe...@imatic.cz

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz