[Impala-ASF-CR] IMPALA-8191: Wait for additional breakpad processes during test
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12501 ) Change subject: IMPALA-8191: Wait for additional breakpad processes during test .. Patch Set 3: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/12501 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia4dcc5fecb9b5f38ae1504aae40f099837cf1bca Gerrit-Change-Number: 12501 Gerrit-PatchSet: 3 Gerrit-Owner: Lars Volker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Philip Zeyliger Gerrit-Comment-Date: Wed, 20 Feb 2019 04:06:10 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8191: Wait for additional breakpad processes during test
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12501 ) Change subject: IMPALA-8191: Wait for additional breakpad processes during test .. IMPALA-8191: Wait for additional breakpad processes during test The Breakpad signal handler forks off a process to write a minidump. During the breakpad tests we send signals to the Impala daemons and then wait for all processes to go away. Prior to this change we did this by waiting on the PID returned by process.get_pid(). It is determined by iterating over psutil.get_pid_list() which is an ordered list of PIDs running on the system. We return the first process in the list with a matching command line. In cases where the PID space rolled over, this could have been the forked off breakpad process and we'd wait on that one. During the subsequent check that all processes are indeed gone, we could then pick up the original Impala daemon that had forked off to write the minidump and was still in the process of shutting down. To fix this, we wait for every process twice. Processes are identified by their command and iterating through them twice makes sure we catch both the original daemon and it's breakpad child. This change also contains improvements to the logging of processes in our tests. This should make it easier to identify similar issues in the future. Testing: I ran the breakpad tests in exhaustive mode. I didn't try to exercise it around a PID roll-over, but we shouldn't see the issue in IMPALA-8191 again. Change-Id: Ia4dcc5fecb9b5f38ae1504aae40f099837cf1bca Reviewed-on: http://gerrit.cloudera.org:8080/12501 Reviewed-by: Lars Volker Tested-by: Impala Public Jenkins --- M tests/common/impala_cluster.py M tests/custom_cluster/test_breakpad.py 2 files changed, 55 insertions(+), 22 deletions(-) Approvals: Lars Volker: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/12501 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Ia4dcc5fecb9b5f38ae1504aae40f099837cf1bca Gerrit-Change-Number: 12501 Gerrit-PatchSet: 4 Gerrit-Owner: Lars Volker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Philip Zeyliger
[Impala-ASF-CR] IMPALA-8191: Wait for additional breakpad processes during test
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12501 ) Change subject: IMPALA-8191: Wait for additional breakpad processes during test .. Patch Set 2: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/2161/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12501 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia4dcc5fecb9b5f38ae1504aae40f099837cf1bca Gerrit-Change-Number: 12501 Gerrit-PatchSet: 2 Gerrit-Owner: Lars Volker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Philip Zeyliger Gerrit-Comment-Date: Wed, 20 Feb 2019 00:22:12 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8191: Wait for additional breakpad processes during test
Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/12501 ) Change subject: IMPALA-8191: Wait for additional breakpad processes during test .. Patch Set 2: Code-Review+2 I think get_pids() works better here. Thanks! -- To view, visit http://gerrit.cloudera.org:8080/12501 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia4dcc5fecb9b5f38ae1504aae40f099837cf1bca Gerrit-Change-Number: 12501 Gerrit-PatchSet: 2 Gerrit-Owner: Lars Volker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Philip Zeyliger Gerrit-Comment-Date: Wed, 20 Feb 2019 00:03:08 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8191: Wait for additional breakpad processes during test
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12501 ) Change subject: IMPALA-8191: Wait for additional breakpad processes during test .. Patch Set 3: Code-Review+2 Rebased, carrying Phil's +2 -- To view, visit http://gerrit.cloudera.org:8080/12501 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia4dcc5fecb9b5f38ae1504aae40f099837cf1bca Gerrit-Change-Number: 12501 Gerrit-PatchSet: 3 Gerrit-Owner: Lars Volker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Philip Zeyliger Gerrit-Comment-Date: Wed, 20 Feb 2019 00:08:08 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8191: Wait for additional breakpad processes during test
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12501 ) Change subject: IMPALA-8191: Wait for additional breakpad processes during test .. Patch Set 3: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3795/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/12501 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia4dcc5fecb9b5f38ae1504aae40f099837cf1bca Gerrit-Change-Number: 12501 Gerrit-PatchSet: 3 Gerrit-Owner: Lars Volker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Philip Zeyliger Gerrit-Comment-Date: Wed, 20 Feb 2019 00:08:29 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8191: Wait for additional breakpad processes during test
Hello Philip Zeyliger, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12501 to look at the new patch set (#2). Change subject: IMPALA-8191: Wait for additional breakpad processes during test .. IMPALA-8191: Wait for additional breakpad processes during test The Breakpad signal handler forks off a process to write a minidump. During the breakpad tests we send signals to the Impala daemons and then wait for all processes to go away. Prior to this change we did this by waiting on the PID returned by process.get_pid(). It is determined by iterating over psutil.get_pid_list() which is an ordered list of PIDs running on the system. We return the first process in the list with a matching command line. In cases where the PID space rolled over, this could have been the forked off breakpad process and we'd wait on that one. During the subsequent check that all processes are indeed gone, we could then pick up the original Impala daemon that had forked off to write the minidump and was still in the process of shutting down. To fix this, we wait for every process twice. Processes are identified by their command and iterating through them twice makes sure we catch both the original daemon and it's breakpad child. This change also contains improvements to the logging of processes in our tests. This should make it easier to identify similar issues in the future. Testing: I ran the breakpad tests in exhaustive mode. I didn't try to exercise it around a PID roll-over, but we shouldn't see the issue in IMPALA-8191 again. Change-Id: Ia4dcc5fecb9b5f38ae1504aae40f099837cf1bca --- M tests/common/impala_cluster.py M tests/custom_cluster/test_breakpad.py 2 files changed, 55 insertions(+), 22 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/01/12501/2 -- To view, visit http://gerrit.cloudera.org:8080/12501 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ia4dcc5fecb9b5f38ae1504aae40f099837cf1bca Gerrit-Change-Number: 12501 Gerrit-PatchSet: 2 Gerrit-Owner: Lars Volker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Philip Zeyliger
[Impala-ASF-CR] IMPALA-8191: Wait for additional breakpad processes during test
Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/12501 ) Change subject: IMPALA-8191: Wait for additional breakpad processes during test .. Patch Set 1: (3 comments) Ok. Thanks for explaining. I'm fine with the tactical fix, but I think doing process.get_pids() is probably easy enough and will be less confusing. http://gerrit.cloudera.org:8080/#/c/12501/1/tests/common/impala_cluster.py File tests/common/impala_cluster.py: http://gerrit.cloudera.org:8080/#/c/12501/1/tests/common/impala_cluster.py@321 PS1, Line 321: if set(self.cmd) == set(process.cmdline): Do you know why this is set? http://gerrit.cloudera.org:8080/#/c/12501/1/tests/common/impala_cluster.py@322 PS1, Line 322: return pid Instead of returning pid here, you could collect all matches, and then either pick the better one (parent) or fail? Would that have made this more obvious? http://gerrit.cloudera.org:8080/#/c/12501/1/tests/custom_cluster/test_breakpad.py File tests/custom_cluster/test_breakpad.py: http://gerrit.cloudera.org:8080/#/c/12501/1/tests/custom_cluster/test_breakpad.py@88 PS1, Line 88: def wait_for_all_processes_dead(self, processes, timeout=300): : # For every process in the list we might see the original Impala process plus a forked : # off child that is writing the minidump. To catch both we go through the list twice. : for process in processes * 2: > These are Process objects (https://github.com/apache/impala/blob/master/tes OMG. I would never have thought that "Process" is really "output of pgrep". How would you feel about changing this to process.get_pids() here, so that it's more obvious that a process can get represented multiple times in this case, and we have an answer? -- To view, visit http://gerrit.cloudera.org:8080/12501 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia4dcc5fecb9b5f38ae1504aae40f099837cf1bca Gerrit-Change-Number: 12501 Gerrit-PatchSet: 1 Gerrit-Owner: Lars Volker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Philip Zeyliger Gerrit-Comment-Date: Sat, 16 Feb 2019 00:27:02 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8191: Wait for additional breakpad processes during test
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12501 ) Change subject: IMPALA-8191: Wait for additional breakpad processes during test .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/12501/1/tests/custom_cluster/test_breakpad.py File tests/custom_cluster/test_breakpad.py: http://gerrit.cloudera.org:8080/#/c/12501/1/tests/custom_cluster/test_breakpad.py@88 PS1, Line 88: def wait_for_all_processes_dead(self, processes, timeout=300): : # For every process in the list we might see the original Impala process plus a forked : # off child that is writing the minidump. To catch both we go through the list twice. : for process in processes * 2: > I don't understand this. processes here is a list of pids, and we wait() on These are Process objects (https://github.com/apache/impala/blob/master/tests/common/impala_cluster.py#L280). They store their command and look up their PID by iterating through all PIDs on the system comparing them to their cmd. I think the idea behind this is that processes can restart, but of course we could track the expected state including its PID more explicitly. Let me know if you'd like me to explain what these are in the comment and/or file a Jira to improve the Process class to know its PID. -- To view, visit http://gerrit.cloudera.org:8080/12501 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia4dcc5fecb9b5f38ae1504aae40f099837cf1bca Gerrit-Change-Number: 12501 Gerrit-PatchSet: 1 Gerrit-Owner: Lars Volker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Philip Zeyliger Gerrit-Comment-Date: Fri, 15 Feb 2019 23:10:30 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8191: Wait for additional breakpad processes during test
Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/12501 ) Change subject: IMPALA-8191: Wait for additional breakpad processes during test .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/12501/1/tests/custom_cluster/test_breakpad.py File tests/custom_cluster/test_breakpad.py: http://gerrit.cloudera.org:8080/#/c/12501/1/tests/custom_cluster/test_breakpad.py@88 PS1, Line 88: def wait_for_all_processes_dead(self, processes, timeout=300): : # For every process in the list we might see the original Impala process plus a forked : # off child that is writing the minidump. To catch both we go through the list twice. : for process in processes * 2: I don't understand this. processes here is a list of pids, and we wait() on each of them. Why does waiting on them multiple times help things? It sounded from your description in the commit message is that the issue is where we enumarate which process to wait for (the impalad or its fork), but that's not what this code is doing. I'm sure I'm confused; let me know where! -- To view, visit http://gerrit.cloudera.org:8080/12501 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia4dcc5fecb9b5f38ae1504aae40f099837cf1bca Gerrit-Change-Number: 12501 Gerrit-PatchSet: 1 Gerrit-Owner: Lars Volker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Philip Zeyliger Gerrit-Comment-Date: Fri, 15 Feb 2019 23:00:54 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8191: Wait for additional breakpad processes during test
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12501 ) Change subject: IMPALA-8191: Wait for additional breakpad processes during test .. Patch Set 1: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/2142/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12501 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia4dcc5fecb9b5f38ae1504aae40f099837cf1bca Gerrit-Change-Number: 12501 Gerrit-PatchSet: 1 Gerrit-Owner: Lars Volker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Fri, 15 Feb 2019 21:12:01 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8191: Wait for additional breakpad processes during test
Lars Volker has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12501 Change subject: IMPALA-8191: Wait for additional breakpad processes during test .. IMPALA-8191: Wait for additional breakpad processes during test The Breakpad signal handler forks off a process to write a minidump. During the breakpad tests we send signals to the Impala daemons and then wait for all processes to go away. Prior to this change we did this by waiting on the PID returned by process.get_pid(). It is determined by iterating over psutil.get_pid_list() which is an ordered list of PIDs running on the system. We return the first process in the list with a matching command line. In cases where the PID space rolled over, this could have been the forked off breakpad process and we'd wait on that one. During the subsequent check that all processes are indeed gone, we could then pick up the original Impala daemon that had forked off to write the minidump and was still in the process of shutting down. To fix this, we wait for every process twice. Processes are identified by their command and iterating through them twice makes sure we catch both the original daemon and it's breakpad child. This change also contains improvements to the logging of processes in our tests. This should make it easier to identify similar issues in the future. Testing: I ran the breakpad tests in exhaustive mode. I didn't try to exercise it around a PID roll-over, but we shouldn't see the issue in IMPALA-8191 again. Change-Id: Ia4dcc5fecb9b5f38ae1504aae40f099837cf1bca --- M tests/common/impala_cluster.py M tests/custom_cluster/test_breakpad.py 2 files changed, 27 insertions(+), 14 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/01/12501/1 -- To view, visit http://gerrit.cloudera.org:8080/12501 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Ia4dcc5fecb9b5f38ae1504aae40f099837cf1bca Gerrit-Change-Number: 12501 Gerrit-PatchSet: 1 Gerrit-Owner: Lars Volker