[jira] [Created] (KUDU-3522) A tablet server starts in non-functional state when enabling data-at-rest encryption
Alexey Serbin created KUDU-3522:
---

Summary: A tablet server starts in non-functional state when enabling data-at-rest encryption
Key: KUDU-3522
URL: https://issues.apache.org/jira/browse/KUDU-3522
Project: Kudu
Issue Type: Bug
Components: security, tserver
Affects Versions: 1.17.0, 1.16.0
Reporter: Alexey Serbin

It's possible to configure a Kudu tablet server with the data-at-rest encryption feature enabled in such a way that the server runs in a non-functional state: the {{kudu-tserver}} process starts and runs with no visible issues, but it's not able to host any tablet replicas.

It's easy to address the issue by adding an extra sanity check: when opening an already existing FS data directory structure, make sure the server encryption key isn't empty if the Kudu server is run with the {{\-\-encrypt_data_at_rest}} flag. There might be other alternatives as well.

The reproduction scenario for the issue is below.
# Start a tablet server without encryption-at-rest, making sure the tablet server starts and creates the directory structure on the file system.
# Don't create any tables/ranges yet. Essentially, it's necessary to make sure that not a single tablet replica is placed at the server yet.
# Shut down the tablet server.
# Update the configuration for the tablet server, enabling encryption-at-rest and specifying the key provider. For test purposes, it's enough to use the "default" key provider:
{noformat}
--encrypt_data_at_rest=true
--encryption_key_provider=default
{noformat}
# Start the tablet server.
# Try to create a new tablet replica that would be placed at the tablet server. That could be done by creating a new table or by moving a tablet replica from some other tablet server using the {{kudu tablet change_config move_replica}} CLI tool.
# Check the logs of the Kudu master or the output of the {{kudu}} CLI tool: there should be error messages like {{Failed to initialize encryption: error:0607B083:digital envelope routines:EVP_CipherInit_ex:no cipher set}}
# No tablet replica can now be placed at the tablet server, while nothing suspicious can be found in the tablet server's log.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
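The sanity check proposed in KUDU-3522 could look roughly like the sketch below. It is only an illustration of the idea in Python, not Kudu code: the `FsState` type, its fields, and `check_encryption_config` are hypothetical names introduced here, and the real server would perform the check in C++ while opening its FS data directories.

```python
from dataclasses import dataclass


@dataclass
class FsState:
    """Hypothetical summary of what a server finds on disk at startup."""
    exists: bool        # True if the FS data directory structure already exists
    server_key: bytes   # persisted server encryption key, b"" if there is none


def check_encryption_config(encrypt_data_at_rest: bool, fs: FsState) -> None:
    """Fail fast instead of starting in a non-functional state.

    Raises ValueError when encryption is requested for an already existing
    directory structure that was created without a server encryption key.
    """
    if encrypt_data_at_rest and fs.exists and not fs.server_key:
        raise ValueError(
            "encryption at rest is enabled, but the existing data "
            "directories have no server encryption key")
```

With such a check, the scenario from the reproduction steps (existing directories, no key, encryption switched on afterwards) would be rejected at startup rather than surfacing later as a failure to place tablet replicas.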
[jira] [Resolved] (KUDU-3521) Kudu servers sometimes crash when host clock is synchronized by PTPd
[ https://issues.apache.org/jira/browse/KUDU-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin resolved KUDU-3521.
---
Fix Version/s: 1.18.0
               1.17.1
Resolution: Fixed

> Kudu servers sometimes crash when host clock is synchronized by PTPd
>
> Key: KUDU-3521
> URL: https://issues.apache.org/jira/browse/KUDU-3521
> Project: Kudu
> Issue Type: Bug
> Reporter: Alexey Serbin
> Assignee: Alexey Serbin
> Priority: Major
> Fix For: 1.18.0, 1.17.1
>
> This issue has been reported on the [\#kudu-general Slack channel|https://getkudu.slack.com/archives/C0CPXJ3CH/p1698246065354269]. A Kudu server of version 1.16.0 (not sure whether it was {{kudu-master}} or {{kudu-tserver}}, but it doesn't matter) crashed with the following error:
> {noformat}
> F1024 22:32:06.866636 3323203 hybrid_clock.cc:452] Check failed: _s.ok() unable to get current timestamp with error bound: Service unavailable: clock error estimate (18446744073709551615us) too high (clock considered synchronized by the kernel)
> {noformat}
> From the analysis of the [code in hybrid_clock.cc|https://github.com/apache/kudu/blob/04fdbd0974f4418295d57c0daa4b67de3e777a43/src/kudu/clock/hybrid_clock.cc#L627-L705], the only case in which this could happen is when {{t.maxerror}} turned out to be a negative number (e.g., -1) in [this code|https://github.com/apache/kudu/blob/aeaec84df536cbd9a55e5e09998d64a961f5d706/src/kudu/clock/system_ntp.cc#L176].
> Negative values of the {{timex::maxerror}} field have never been seen when using ntpd or chronyd for clock synchronization, but it's necessary to update the code to handle such situations: apparently, PTPd might set the {{maxerror}} field of the {{timex}} structure to a negative value and then call {{adjtimex()}}. That's evident from [the PTPd's code|https://github.com/ptpd/ptpd/blob/1ec9e650b03e6bd75dd3179fb5f09862ebdc54bf/src/dep/sys.c#L1969-L1984].
> The essence of the issue is that the Kudu code uses unsigned integers for the clock error, while {{timex.maxerror}} is a signed number, and at least PTPd sets it to a negative number when calling {{adjtimex()}}. Also, nowhere in [the documentation for adjtimex()|https://www.man7.org/linux/man-pages/man2/adjtimex.2.html] is it stated that the {{maxerror}} field's value should be a non-negative number.
> As a side note, there was [a prior attempt to address this issue|https://gerrit.cloudera.org/#/c/12149/], but not enough evidence was presented for the RCA.
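The value in the crash message is exactly what a signed -1 becomes when it is reinterpreted as an unsigned 64-bit integer. The short Python sketch below reproduces that arithmetic with `ctypes`; it assumes a 64-bit field width, and the helper name `as_unsigned_usec` is made up for this illustration (it is not a Kudu function).

```python
import ctypes


def as_unsigned_usec(maxerror: int) -> int:
    """Reinterpret a signed 64-bit value as an unsigned 64-bit one.

    This mirrors what an implicit signed-to-unsigned conversion does in C++:
    the result is the input modulo 2**64.
    """
    return ctypes.c_uint64(maxerror).value


# A negative maxerror of -1 wraps around to 2**64 - 1.
print(as_unsigned_usec(-1))  # prints 18446744073709551615
```

The printed number matches the `18446744073709551615us` clock error estimate in the fatal log line, which is consistent with the root cause analysis in the issue description.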
[jira] [Commented] (KUDU-3521) Kudu servers sometimes crash when host clock is synchronized by PTPd
[ https://issues.apache.org/jira/browse/KUDU-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781852#comment-17781852 ]

ASF subversion and git services commented on KUDU-3521:
---

Commit ceaffc5f6f50745f8eaf687668d3b4ac767eea76 in kudu's branch refs/heads/branch-1.17.x from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=ceaffc5f6 ]

[clock] KUDU-3521 fix crash when clock is synchronized by PTPd

There was an earlier attempt to address the issue [1], but the fix hasn't received +2 since there was not enough evidence behind the root cause analysis. With the report on #kudu-general Slack channel [2], from the analysis of the code [3] it's easy to see there isn't any other way to get such a manifestation of the issue but a negative value for the 'maxerror' field of the 'timex' structure returned by the ntp_adjtime()/adjtimex() system call.

The essence of the problem is that in Kudu the maximum error is supposed to be a non-negative number. This patch addresses the issue.

[1] https://gerrit.cloudera.org/#/c/12149/
[2] https://getkudu.slack.com/archives/C0CPXJ3CH/p1698246065354269
[3] https://github.com/apache/kudu/blob/04fdbd0974f4418295d57c0daa4b67de3e777a43/src/kudu/clock/hybrid_clock.cc#L627-L706

Change-Id: Ibbe1a50c4857b9742d2ffde35440d0dee082edc0
Reviewed-on: http://gerrit.cloudera.org:8080/20626
Tested-by: Kudu Jenkins
Reviewed-by: Yingchun Lai
Reviewed-by: Abhishek Chennaka
(cherry picked from commit 4859d290277bf36f0bd84891c4764194c2cf9521)
Reviewed-on: http://gerrit.cloudera.org:8080/20644
[jira] [Commented] (KUDU-3521) Kudu servers sometimes crash when host clock is synchronized by PTPd
[ https://issues.apache.org/jira/browse/KUDU-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781832#comment-17781832 ]

ASF subversion and git services commented on KUDU-3521:
---

Commit 4859d290277bf36f0bd84891c4764194c2cf9521 in kudu's branch refs/heads/master from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=4859d2902 ]

[clock] KUDU-3521 fix crash when clock is synchronized by PTPd

There was an earlier attempt to address the issue [1], but the fix hasn't received +2 since there was not enough evidence behind the root cause analysis. With the report on #kudu-general Slack channel [2], from the analysis of the code [3] it's easy to see there isn't any other way to get such a manifestation of the issue but a negative value for the 'maxerror' field of the 'timex' structure returned by the ntp_adjtime()/adjtimex() system call.

The essence of the problem is that in Kudu the maximum error is supposed to be a non-negative number. This patch addresses the issue.

[1] https://gerrit.cloudera.org/#/c/12149/
[2] https://getkudu.slack.com/archives/C0CPXJ3CH/p1698246065354269
[3] https://github.com/apache/kudu/blob/04fdbd0974f4418295d57c0daa4b67de3e777a43/src/kudu/clock/hybrid_clock.cc#L627-L706

Change-Id: Ibbe1a50c4857b9742d2ffde35440d0dee082edc0
Reviewed-on: http://gerrit.cloudera.org:8080/20626
Tested-by: Kudu Jenkins
Reviewed-by: Yingchun Lai
Reviewed-by: Abhishek Chennaka
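The commit message does not spell out how the patch handles a negative 'maxerror', so the following Python sketch is only one plausible way to enforce the "maximum error must be non-negative" invariant, not a rendering of the actual change. The function name `checked_max_error_us` and the `max_allowed_us` parameter are hypothetical; the point is that the sign is checked while the value is still signed, before it can be mistaken for a huge unsigned error estimate.

```python
def checked_max_error_us(maxerror_us: int, max_allowed_us: int) -> int:
    """Validate a signed clock error estimate (in microseconds).

    Returns the value unchanged when it is usable. Raises ValueError with a
    distinct message for a negative estimate, so that it is not silently
    converted into an enormous unsigned number and reported as 'too high'.
    """
    if maxerror_us < 0:
        raise ValueError(
            f"invalid clock error estimate: {maxerror_us}us is negative")
    if maxerror_us > max_allowed_us:
        raise ValueError(
            f"clock error estimate ({maxerror_us}us) too high")
    return maxerror_us
```

Whether a negative estimate should be reported as an error, clamped, or treated as "clock not synchronized" is a design decision this sketch leaves open; it merely separates that case from the genuine "too high" case.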