[kudu-CR] KUDU-2427: retry more system calls on EINTR
Adar Dembo has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/10435 ) Change subject: KUDU-2427: retry more system calls on EINTR .. KUDU-2427: retry more system calls on EINTR In order to collect its own stack traces, Kudu periodically sends itself a SIGUSR2. The diagnostics log initiates stack collection every 60s, as do some service queue overflow events. In theory, the collection shouldn't affect any ongoing syscalls because the SIGUSR2 signal handler is installed with SA_RESTART; in practice, not all syscalls are restartable, and precisely categorizing those that are and those that aren't is difficult. As such, it's really important that we retry every interruptible syscall rather than surfacing the EINTR up the call stack as a failure. For whatever reason this happens more frequently on Ubuntu 18.04, though maybe it's because I've placed my test directory on tmpfs. For example, I can easily repro a crash due to non-existent retry with the following command line: bin/tablet_server-test --gtest_repeat=1000 --gtest_throw_on_failure \ --diagnostics_log_stack_traces_interval_ms=100 \ --unlock_experimental_flags --gtest_filter=*KUDU_177 This patch also fixes KUDU-2151. Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Reviewed-on: http://gerrit.cloudera.org:8080/10435 Tested-by: Kudu Jenkins Reviewed-by: Alexey Serbin --- M src/kudu/consensus/log_index.cc M src/kudu/gutil/macros.h M src/kudu/gutil/sysinfo.cc M src/kudu/tools/tool_action_common.cc M src/kudu/tools/tool_action_test.cc M src/kudu/util/env-test.cc M src/kudu/util/env_posix.cc M src/kudu/util/net/socket.cc M src/kudu/util/os-util.cc M src/kudu/util/os-util.h M src/kudu/util/pstack_watcher-test.cc M src/kudu/util/pstack_watcher.cc M src/kudu/util/semaphore.cc M src/kudu/util/subprocess-test.cc M src/kudu/util/subprocess.cc 15 files changed, 253 insertions(+), 138 deletions(-) Approvals: Kudu Jenkins: Verified Alexey Serbin: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 8 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/10435 ) Change subject: KUDU-2427: retry more system calls on EINTR .. Patch Set 7: Code-Review+2 LGTM. Thank you for searching for those additional calls of EINTR-specific calls. Maybe, it's worth checking if Todd has more input on this. -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 7 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 23 May 2018 22:55:13 + Gerrit-HasComments: No
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Hello Tidy Bot, Alexey Serbin, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10435 to look at the new patch set (#7). Change subject: KUDU-2427: retry more system calls on EINTR .. KUDU-2427: retry more system calls on EINTR In order to collect its own stack traces, Kudu periodically sends itself a SIGUSR2. The diagnostics log initiates stack collection every 60s, as do some service queue overflow events. In theory, the collection shouldn't affect any ongoing syscalls because the SIGUSR2 signal handler is installed with SA_RESTART; in practice, not all syscalls are restartable, and precisely categorizing those that are and those that aren't is difficult. As such, it's really important that we retry every interruptible syscall rather than surfacing the EINTR up the call stack as a failure. For whatever reason this happens more frequently on Ubuntu 18.04, though maybe it's because I've placed my test directory on tmpfs. For example, I can easily repro a crash due to non-existent retry with the following command line: bin/tablet_server-test --gtest_repeat=1000 --gtest_throw_on_failure \ --diagnostics_log_stack_traces_interval_ms=100 \ --unlock_experimental_flags --gtest_filter=*KUDU_177 This patch also fixes KUDU-2151. Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 --- M src/kudu/consensus/log_index.cc M src/kudu/gutil/macros.h M src/kudu/gutil/sysinfo.cc M src/kudu/tools/tool_action_common.cc M src/kudu/tools/tool_action_test.cc M src/kudu/util/env-test.cc M src/kudu/util/env_posix.cc M src/kudu/util/net/socket.cc M src/kudu/util/os-util.cc M src/kudu/util/os-util.h M src/kudu/util/pstack_watcher-test.cc M src/kudu/util/pstack_watcher.cc M src/kudu/util/semaphore.cc M src/kudu/util/subprocess-test.cc M src/kudu/util/subprocess.cc 15 files changed, 253 insertions(+), 138 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/35/10435/7 -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 7 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/10435 ) Change subject: KUDU-2427: retry more system calls on EINTR .. Patch Set 6: (17 comments) http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/consensus/log_index.cc File src/kudu/consensus/log_index.cc: http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/consensus/log_index.cc@116 PS6, Line 116: RETRY_ON_EINTR(ret, close(fd_)); > It's highly unlikely, but is it worth logging about an error? Done http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/gutil/sysinfo.cc File src/kudu/gutil/sysinfo.cc: http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/gutil/sysinfo.cc@301 PS6, Line 301: RETRY_ON_EINTR(ret, close(fd)); > Is it worth logging about a failure (if it not EINTR, of course)? Done http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/rpc/negotiation.cc File src/kudu/rpc/negotiation.cc: http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/rpc/negotiation.cc@119 PS6, Line 119: RETRY_ON_EINTR(ready, ppoll(&poll_fd, 1, &ts, nullptr)); : #else : RETRY_ON_EINTR(ready, poll(&poll_fd, 1, remaining.ToMilliseconds())); > This update changes the logic a bit so the deadline is no longer applied. You're right; this is a mistake. I'll revert this back to the way it was, with EINTR handled in the outer loop rather than inside the RETRY_ON_EINTR inner loop. http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/env_posix.cc File src/kudu/util/env_posix.cc: http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/env_posix.cc@560 PS6, Line 560: fclose > Wrap into RETRY_ON_EINTR() and log in case of other error? Done http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/env_posix.cc@603 PS6, Line 603: close(fd_); > Wrap into RETRY_ON_EINTR() and log in case of other error? Done http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/env_posix.cc@1174 PS6, Line 1174: closedir(d); > Wrap into RETRY_ON_EINTR() and log in case of other error? The closedir() manpage doesn't mention returning EINTR, or returning other errors set by close() either. http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/env_posix.cc@1380 PS6, Line 1380: close(fd) > Wrap into RETRY_ON_EINTR() here as well? Done http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/env_posix.cc@1392 PS6, Line 1392: PosixFileLock* my_lock = reinterpret_cast(lock); > nit: maybe, wrap this into unique_ptr<> by the way? Done http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/env_posix.cc@1397 PS6, Line 1397: close(my_lock->fd_); > Wrap into RETRY_ON_EINTR() and log if case of other error? Done http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/env_posix.cc@1737 PS6, Line 1737: fts_close(fts) > nit: not a part of this changelist, but just curious, could this ever fail, The fts manpage says: ERRORS The function fts_open() may fail and set errno for any of the errors specified for open(2) and malloc(3). The function fts_close() may fail and set errno for any of the errors specified for chdir(2) and close(2). The functions fts_read() and fts_children() may fail and set errno for any of the errors specified for chdir(2), malloc(3), opendir(3), readdir(3), and stat(2). So yes, we should wrap fts_open and fts_close. http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/semaphore.cc File src/kudu/util/semaphore.cc: http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/semaphore.cc@a66 PS6, Line 66: > I hope that was not the desired behavior to return from this method on EINT Todd and I discussed that on Slack. Returning false on EINTR is not necessarily incorrect, but it leads to more TryAcquire failures than are strictly necessary. It seems to have been this way from day one. We agreed that changing it should be OK. http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/subprocess.cc File src/kudu/util/subprocess.cc: http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/subprocess.cc@384 PS6, Line 384: == STDIN_FILENO > typo? Oops, good catch. Done. http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/subprocess.cc@385 PS6, Line 385: dup2_ret) > dup2_ret == STDIN_FILENO ? Done http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/subprocess.cc@394 PS6, Line 394: == STDOUT_FILENO > ditto Done http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/subprocess.cc@395 PS6, Line 395: dup2_ret > ditto Done http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/subprocess.cc@411 PS6, Line 411: == STDERR_FILENO > ditto Done http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/subprocess.cc@412 PS6, Line 412: dup2_ret > ditto Done -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Cha
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/10435 ) Change subject: KUDU-2427: retry more system calls on EINTR .. Patch Set 6: (17 comments) It seems we also have some calls with the EINTR errno semantic in src/kudu/tools. Does it make sense to wrap those with RETRY_ON_EINTR() as well? E.g., tools/tool_action_test.cc: PCHECK(dup2(STDERR_FILENO, STDOUT_FILENO) == STDOUT_FILENO); tools/tool_action_common.cc:close(read_fd_); tools/tool_action_common.cc:close(write_fd_); http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/consensus/log_index.cc File src/kudu/consensus/log_index.cc: http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/consensus/log_index.cc@116 PS6, Line 116: RETRY_ON_EINTR(ret, close(fd_)); It's highly unlikely, but is it worth logging about an error? http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/gutil/sysinfo.cc File src/kudu/gutil/sysinfo.cc: http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/gutil/sysinfo.cc@301 PS6, Line 301: RETRY_ON_EINTR(ret, close(fd)); Is it worth logging about a failure (if it not EINTR, of course)? http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/rpc/negotiation.cc File src/kudu/rpc/negotiation.cc: http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/rpc/negotiation.cc@119 PS6, Line 119: RETRY_ON_EINTR(ready, ppoll(&poll_fd, 1, &ts, nullptr)); : #else : RETRY_ON_EINTR(ready, poll(&poll_fd, 1, remaining.ToMilliseconds())); This update changes the logic a bit so the deadline is no longer applied. Is it OK? http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/env_posix.cc File src/kudu/util/env_posix.cc: http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/env_posix.cc@560 PS6, Line 560: fclose Wrap into RETRY_ON_EINTR() and log in case of other error? http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/env_posix.cc@603 PS6, Line 603: close(fd_); Wrap into RETRY_ON_EINTR() and log in case of other error? http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/env_posix.cc@1174 PS6, Line 1174: closedir(d); Wrap into RETRY_ON_EINTR() and log in case of other error? http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/env_posix.cc@1380 PS6, Line 1380: close(fd) Wrap into RETRY_ON_EINTR() here as well? http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/env_posix.cc@1392 PS6, Line 1392: PosixFileLock* my_lock = reinterpret_cast(lock); nit: maybe, wrap this into unique_ptr<> by the way? http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/env_posix.cc@1397 PS6, Line 1397: close(my_lock->fd_); Wrap into RETRY_ON_EINTR() and log if case of other error? http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/env_posix.cc@1737 PS6, Line 1737: fts_close(fts) nit: not a part of this changelist, but just curious, could this ever fail, and if yes, whether we need any handling for that. http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/semaphore.cc File src/kudu/util/semaphore.cc: http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/semaphore.cc@a66 PS6, Line 66: I hope that was not the desired behavior to return from this method on EINTR error. http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/subprocess.cc File src/kudu/util/subprocess.cc: http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/subprocess.cc@384 PS6, Line 384: == STDIN_FILENO typo? http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/subprocess.cc@385 PS6, Line 385: dup2_ret) dup2_ret == STDIN_FILENO ? http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/subprocess.cc@394 PS6, Line 394: == STDOUT_FILENO ditto http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/subprocess.cc@395 PS6, Line 395: dup2_ret ditto http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/subprocess.cc@411 PS6, Line 411: == STDERR_FILENO ditto http://gerrit.cloudera.org:8080/#/c/10435/6/src/kudu/util/subprocess.cc@412 PS6, Line 412: dup2_ret ditto -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 6 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 22 May 2018 22:19:47 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/10435 ) Change subject: KUDU-2427: retry more system calls on EINTR .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/10435/3/src/kudu/util/net/socket.cc File src/kudu/util/net/socket.cc: http://gerrit.cloudera.org:8080/#/c/10435/3/src/kudu/util/net/socket.cc@465 PS3, Line 465: Status s = Write(buf, num_to_write, &inc_num_written); : tot_written += inc_num_written; : buf += inc_nu > Yep, SSL_read() and SSL_write() seemed messy when I looked at them last tim OK, I'll revert these changes, and the change to tls_socket.cc. -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 5 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Mon, 21 May 2018 20:52:51 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Hello Tidy Bot, Alexey Serbin, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10435 to look at the new patch set (#6). Change subject: KUDU-2427: retry more system calls on EINTR .. KUDU-2427: retry more system calls on EINTR In order to collect our own stack traces, we send ourselves a SIGUSR2. The diagnostics log initiates collection every 60s, as do service queue overflows. As such, it's really important that we retry every interruptible system call rather than passing the EINTR up the call stack as a failure. For whatever reason this happens more frequently on Ubuntu 18.04, though maybe it's because I've placed my test directory on tmpfs. For example, I can easily repro a crash due to non-existent retry with the following command line: bin/tablet_server-test --gtest_repeat=1000 --gtest_throw_on_failure \ --diagnostics_log_stack_traces_interval_ms=100 \ --unlock_experimental_flags --gtest_filter=*KUDU_177 This patch also fixes KUDU-2151. Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 --- M src/kudu/consensus/log_index.cc M src/kudu/gutil/macros.h M src/kudu/gutil/sysinfo.cc M src/kudu/rpc/negotiation.cc M src/kudu/tools/tool_action_common.cc M src/kudu/util/env_posix.cc M src/kudu/util/net/socket.cc M src/kudu/util/os-util.cc M src/kudu/util/os-util.h M src/kudu/util/semaphore.cc M src/kudu/util/subprocess.cc 11 files changed, 142 insertions(+), 105 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/35/10435/6 -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 6 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/10435 ) Change subject: KUDU-2427: retry more system calls on EINTR .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/10435/3/src/kudu/util/net/socket.cc File src/kudu/util/net/socket.cc: http://gerrit.cloudera.org:8080/#/c/10435/3/src/kudu/util/net/socket.cc@465 PS3, Line 465: Status s = Write(buf, num_to_write, &inc_num_written); : tot_written += inc_num_written; : buf += inc_nu > ah... the RETRY_ON_EINTR around SSL_read is actually a little concerning. I Yep, SSL_read() and SSL_write() seemed messy when I looked at them last time. As they say in the doc: '... The behaviour of SSL_read() depends on the underlying BIO. ...' But first of all, the corresponding change with RETRY_ON_EINTR() in tls_socket.cc looks suspicious to me: while working with OpenSSL functions, it's necessary to clean up the errors from the stack using SSL_get_error(). My another concern is that SSL_read()/SSL_write() might handle EINTR themselves. -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 5 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Mon, 21 May 2018 18:59:10 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/10435 ) Change subject: KUDU-2427: retry more system calls on EINTR .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/10435/3/src/kudu/util/net/socket.cc File src/kudu/util/net/socket.cc: http://gerrit.cloudera.org:8080/#/c/10435/3/src/kudu/util/net/socket.cc@465 PS3, Line 465: // Continue silently when the syscall is interrupted. : if (s.posix_code() == EINTR) { : continue; > Right, but take a look at tls_socket.cc: I changed TlsSocket's Write and Re ah... the RETRY_ON_EINTR around SSL_read is actually a little concerning. I remember Alexey and I spent many hours trying to understand the various return types from that function. I seem to recall there were cases where it returns 0, and then SSL_get_error returns SSL_ERROR_WANT_READ, and errno is set to EINTR. ie SSL_read does not conform to the normal syscall interface that any error is indicated by a negative return. If I recall correctly the behavior we saw wasn't documented properly in the man pages, etc. -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 3 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Mon, 21 May 2018 17:39:03 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/10435 ) Change subject: KUDU-2427: retry more system calls on EINTR .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/10435/3/src/kudu/util/net/socket.cc File src/kudu/util/net/socket.cc: http://gerrit.cloudera.org:8080/#/c/10435/3/src/kudu/util/net/socket.cc@465 PS3, Line 465: Status s = Write(buf, num_to_write, &inc_num_written); : tot_written += inc_num_written; : buf += inc_nu > given that Write and Recv might be overridden by TlsSocket, is this safe? I Right, but take a look at tls_socket.cc: I changed TlsSocket's Write and Recv to also RETRY_ON_EINTR. The idea is that Socket's BlockingWrite and BlockingRecv should never see an EINTR out of Write/Recv, because either Socket's Write/Recv have retried it, or TlsSocket's overrides did. If you think that's a reasonable design, I can add a DCHECK and/or documentation if you think it'd help. -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 5 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Mon, 21 May 2018 17:31:17 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/10435 ) Change subject: KUDU-2427: retry more system calls on EINTR .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/10435/3/src/kudu/util/net/socket.cc File src/kudu/util/net/socket.cc: http://gerrit.cloudera.org:8080/#/c/10435/3/src/kudu/util/net/socket.cc@465 PS3, Line 465: Status s = Write(buf, num_to_write, &inc_num_written); : tot_written += inc_num_written; : buf += inc_nu given that Write and Recv might be overridden by TlsSocket, is this safe? I see for example in TlsSocket::Recv the following code: auto error_code = SSL_get_error(ssl_.get(), bytes_read); if (error_code == SSL_ERROR_WANT_READ) { if (save_errno != 0) { return Status::NetworkError("SSL_read error from " + remote.ToString(), ErrnoToString(save_errno), save_errno); } -- maybe 'save_errno' here could be EINTR, so we end up returning an EINTR posix code to BlockingRecv or BlockingSend? I think we set some flag on OpenSSL to theoretically do auto-retry on EINTR but there was a bunch of messiness around this in the past. See 18e024cf8bcaea192efb63780802cc4c799bbb9c -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 5 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Mon, 21 May 2018 17:21:27 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/10435 ) Change subject: KUDU-2427: retry more system calls on EINTR .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/10435/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/10435/5//COMMIT_MSG@9 PS5, Line 9: In order to collect our own stack traces, we send ourselves a SIGUSR2. The > It's worth noting that for the SIGUSR2 case we set SA_RESTART on our signal Good question. According to signal(7), not all syscalls are restarted. There's a length list of system calls that will not be restarted, and the restartability of some I/O-related syscalls depends on the properties of the file backing the fd. In my testing I observed fallocate(2) returning EINTR, which isn't documented in that list. -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 5 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Fri, 18 May 2018 03:08:52 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/10435 ) Change subject: KUDU-2427: retry more system calls on EINTR .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/10435/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/10435/5//COMMIT_MSG@9 PS5, Line 9: In order to collect our own stack traces, we send ourselves a SIGUSR2. The It's worth noting that for the SIGUSR2 case we set SA_RESTART on our signal handler, so in theory the system should already automatically restart all of our syscalls for us for that signal. That also explains why we haven't seen this be a big problem as of yet. Do we have any alternate explanation? -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 5 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Fri, 18 May 2018 02:57:03 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/10435 ) Change subject: KUDU-2427: retry more system calls on EINTR .. Patch Set 5: Verified+1 (1 comment) Overriding Jenkins, known flaky test. http://gerrit.cloudera.org:8080/#/c/10435/5/src/kudu/util/semaphore.cc File src/kudu/util/semaphore.cc: http://gerrit.cloudera.org:8080/#/c/10435/5/src/kudu/util/semaphore.cc@66 PS5, Line 66: if (errno == EAGAIN) { This is also a semantic change, but a correct one, I think. -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 5 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 17 May 2018 23:26:16 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Adar Dembo has removed Kudu Jenkins from this change. ( http://gerrit.cloudera.org:8080/10435 ) Change subject: KUDU-2427: retry more system calls on EINTR .. Removed reviewer Kudu Jenkins with the following votes: * Verified-1 by Kudu Jenkins (120) -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: deleteReviewer Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 5 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Hello Tidy Bot, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10435 to look at the new patch set (#5). Change subject: KUDU-2427: retry more system calls on EINTR .. KUDU-2427: retry more system calls on EINTR In order to collect our own stack traces, we send ourselves a SIGUSR2. The diagnostics log initiates collection every 60s, as do service queue overflows. As such, it's really important that we retry every interruptible system call rather than passing the EINTR up the call stack as a failure. For whatever reason this happens more frequently on Ubuntu 18.04, though maybe it's because I've placed my test directory on tmpfs. For example, I can easily repro a crash due to non-existent retry with the following command line: bin/tablet_server-test --gtest_repeat=1000 --gtest_throw_on_failure \ --diagnostics_log_stack_traces_interval_ms=100 \ --unlock_experimental_flags --gtest_filter=*KUDU_177 This patch also fixes KUDU-2151. Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 --- M src/kudu/consensus/log_index.cc M src/kudu/gutil/macros.h M src/kudu/gutil/sysinfo.cc M src/kudu/rpc/negotiation.cc M src/kudu/security/tls_socket.cc M src/kudu/tools/tool_action_common.cc M src/kudu/util/env_posix.cc M src/kudu/util/net/socket.cc M src/kudu/util/os-util.cc M src/kudu/util/os-util.h M src/kudu/util/semaphore.cc M src/kudu/util/subprocess.cc 12 files changed, 149 insertions(+), 117 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/35/10435/5 -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 5 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/10435 ) Change subject: KUDU-2427: retry more system calls on EINTR .. Patch Set 4: (3 comments) http://gerrit.cloudera.org:8080/#/c/10435/4/src/kudu/security/tls_socket.cc File src/kudu/security/tls_socket.cc: http://gerrit.cloudera.org:8080/#/c/10435/4/src/kudu/security/tls_socket.cc@63 PS4, Line 63: RETRY_ON_EINTR(bytes_written, SSL_write(ssl_.get(), buf, amt)); This (and the equivalent change to SSL_read) are a little funky; is it safe to expect errno to be the primary vehicle for delivering error information from OpenSSL? Normally errno is only used after libc calls. http://gerrit.cloudera.org:8080/#/c/10435/4/src/kudu/util/net/socket.cc File src/kudu/util/net/socket.cc: http://gerrit.cloudera.org:8080/#/c/10435/4/src/kudu/util/net/socket.cc@a465 PS4, Line 465: : : : The changes in tls_socket.cc allow this (and the equivalent in BlockingRecv) to be removed. http://gerrit.cloudera.org:8080/#/c/10435/4/src/kudu/util/net/socket.cc@128 PS4, Line 128: return (err == EAGAIN) || (err == EWOULDBLOCK); I'm not as confident about this change. I made it because I think EINTR should be handled at the "ends" and thus it shouldn't show up at higher levels of the stack. -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 4 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 17 May 2018 22:07:09 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Hello Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10435 to look at the new patch set (#4). Change subject: KUDU-2427: retry more system calls on EINTR .. KUDU-2427: retry more system calls on EINTR In order to collect our own stack traces, we send ourselves a SIGUSR2. The diagnostics log initiates collection every 60s, as do service queue overflows. As such, it's really important that we retry every interruptible system call rather than passing the EINTR up the call stack as a failure. For whatever reason this happens more frequently on Ubuntu 18.04, though maybe it's because I've placed my test directory on tmpfs. For example, I can easily repro a crash due to non-existent retry with the following command line: bin/tablet_server-test --gtest_repeat=1000 --gtest_throw_on_failure \ --diagnostics_log_stack_traces_interval_ms=100 \ --unlock_experimental_flags --gtest_filter=*KUDU_177 This patch also fixes KUDU-2151. Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 --- M src/kudu/consensus/log_index.cc M src/kudu/gutil/macros.h M src/kudu/gutil/sysinfo.cc M src/kudu/rpc/negotiation.cc M src/kudu/security/tls_socket.cc M src/kudu/tools/tool_action_common.cc M src/kudu/util/env_posix.cc M src/kudu/util/net/socket.cc M src/kudu/util/os-util.cc M src/kudu/util/os-util.h M src/kudu/util/semaphore.cc M src/kudu/util/subprocess.cc 12 files changed, 146 insertions(+), 113 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/35/10435/4 -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 4 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/10435 ) Change subject: KUDU-2427: retry more system calls on EINTR .. Patch Set 3: (1 comment) > Seem like we might be missing a few more: > - Socket::Close > - Log::IndexChunk::~IndexChunk > - a few things in sysinfo.cc (open and close calls) > - os-util.cc DisableCoreDumps > - various close() and open() calls in subprocess.cc I'll push a new rev with these addressed. I've also changed a couple places with manual EINTR handling to use RETRY_ON_EINTR instead. http://gerrit.cloudera.org:8080/#/c/10435/3/src/kudu/util/env_posix.cc File src/kudu/util/env_posix.cc: http://gerrit.cloudera.org:8080/#/c/10435/3/src/kudu/util/env_posix.cc@300 PS3, Line 300: RETRY_ON_EINTR(err, ::close(fd_)); > I think this might be tracked by KUDU-2151? worth adding it to the commit m Yeah, I'll mention it in the commit message. -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 3 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 17 May 2018 20:27:06 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/10435 ) Change subject: KUDU-2427: retry more system calls on EINTR .. Patch Set 3: Verified+1 Unrelated failure; filed KUDU-2444 for it. -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 3 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 17 May 2018 17:50:45 + Gerrit-HasComments: No
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Adar Dembo has removed a vote on this change. Change subject: KUDU-2427: retry more system calls on EINTR .. Removed Verified-1 by Kudu Jenkins (120) -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: deleteVote Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 3 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/10435 ) Change subject: KUDU-2427: retry more system calls on EINTR .. Patch Set 3: (1 comment) Seem like we might be missing a few more: - Socket::Close - Log::IndexChunk::~IndexChunk - a few things in sysinfo.cc (open and close calls) - os-util.cc DisableCoreDumps - various close() and open() calls in subprocess.cc http://gerrit.cloudera.org:8080/#/c/10435/3/src/kudu/util/env_posix.cc File src/kudu/util/env_posix.cc: http://gerrit.cloudera.org:8080/#/c/10435/3/src/kudu/util/env_posix.cc@300 PS3, Line 300: RETRY_ON_EINTR(err, ::close(fd_)); I think this might be tracked by KUDU-2151? worth adding it to the commit message and closing the jira when committed -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 3 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 17 May 2018 17:47:00 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2427: retry more system calls on EINTR
Hello Todd Lipcon, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/10435 to review the following change. Change subject: KUDU-2427: retry more system calls on EINTR .. KUDU-2427: retry more system calls on EINTR In order to collect our own stack traces, we send ourselves a SIGUSR2. The diagnostics log initiates collection every 60s, as do service queue overflows. As such, it's really important that we retry every interruptible system call rather than passing the EINTR up the call stack as a failure. For whatever reason this happens more frequently on Ubuntu 18.04, though maybe it's because I've placed my test directory on tmpfs. For example, I can easily repro a crash due to non-existent retry with the following command line: bin/tablet_server-test --gtest_repeat=1000 --gtest_throw_on_failure \ --diagnostics_log_stack_traces_interval_ms=100 \ --unlock_experimental_flags --gtest_filter=*KUDU_177 I went through env_posix.cc and looked for EINTR in every syscall's manpage. If I found it, I made sure the call was surrounded by RETRY_ON_EINTR. Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 --- M src/kudu/util/env_posix.cc 1 file changed, 33 insertions(+), 15 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/35/10435/1 -- To view, visit http://gerrit.cloudera.org:8080/10435 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I6cce03c4e1b2be32c1910382737526082fc99966 Gerrit-Change-Number: 10435 Gerrit-PatchSet: 1 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Todd Lipcon