[jira] [Updated] (YARN-493) NodeManager job control logic flaws on Windows
[ https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suresh Srinivas updated YARN-493:
    Fix Version/s: (was: 3.0.0) 2.0.5-beta

I merged the patch to branch-2.

NodeManager job control logic flaws on Windows
    Key: YARN-493
    URL: https://issues.apache.org/jira/browse/YARN-493
    Project: Hadoop YARN
    Issue Type: Bug
    Components: nodemanager
    Reporter: Chris Nauroth
    Assignee: Chris Nauroth
    Fix For: 2.0.5-beta
    Attachments: YARN-493.1.patch, YARN-493.2.patch, YARN-493.3.patch, YARN-493.4.patch

Both product and test code contain some platform-specific assumptions, such as availability of bash for executing a command in a container and signals to check existence of a process and terminate it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-493) NodeManager job control logic flaws on Windows
[ https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Nauroth updated YARN-493:
    Attachment: YARN-493.4.patch

Attaching rebased patch and resubmitting to Jenkins.
[jira] [Updated] (YARN-493) NodeManager job control logic flaws on Windows
[ https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Nauroth updated YARN-493:
    Attachment: YARN-493.3.patch

Here is a new patch that renames the new {{Shell}} methods to {{appendScriptExtension}}.

Regarding trying to use {{Shell#getRunScriptCommand}} in the badSymlink test, I have not been able to get this to work. The test depends on very specific quoting, and the conversion to absolute path inside {{Shell#getRunScriptCommand}} (required by other callers) interferes with this.
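As an illustration of the kind of helper being discussed, here is a minimal sketch of a platform-aware script-name function. The class name, the boolean parameter, and the {{.cmd}} / {{.sh}} suffixes are assumptions made for this example; the message does not show the actual signature or behavior of the {{Shell}} method in the patch.

```java
import java.util.Locale;

// Hypothetical sketch only: not the Hadoop Shell class from the patch.
class ScriptExtensionSketch {

    // Appends a platform-specific script extension to a base name.
    // The ".cmd" and ".sh" suffixes are assumed for illustration.
    static String appendScriptExtension(String baseName, boolean windows) {
        return baseName + (windows ? ".cmd" : ".sh");
    }

    // Detects Windows from the standard "os.name" system property.
    static boolean isWindows() {
        return System.getProperty("os.name", "")
                .toLowerCase(Locale.ROOT)
                .startsWith("windows");
    }

    public static void main(String[] args) {
        // Prints the script name appropriate for the current platform.
        System.out.println(appendScriptExtension("launch_container", isWindows()));
    }
}
```

Passing the platform flag as a parameter, rather than reading it inside the helper, keeps both branches testable on any operating system.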
[jira] [Updated] (YARN-493) NodeManager job control logic flaws on Windows
[ https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Nauroth updated YARN-493:
    Attachment: YARN-493.2.patch

{quote}
would make sense to expose common Shell/Test utilities that would abstract out the following 2 patterns
{quote}

Good idea, Ivan. Here is version 2 of the patch, which adds a few more helper methods to {{Shell}} to assist with this. I've intentionally left one occurrence of this pattern untouched in {{TestContainerLaunch#testSpecialCharSymlinks}} because of a very specific need for internal quoting and escaping in the arguments.
[jira] [Updated] (YARN-493) NodeManager job control logic flaws on Windows
[ https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Nauroth updated YARN-493:
    Attachment: YARN-493.1.patch

This patch addresses the bugs that I found. I've verified that the tests pass on Mac (does not have setsid), Ubuntu (does have setsid), and Windows. Here is an explanation of the changes:

# Discussion on YARN-359 concluded that we should refactor {{getCheckProcessIsAliveCommand}} and {{getSignalKillCommand}} from {{ContainerExecutor}} back to {{Shell}}. I'm taking the opportunity to do it now while we're working on this code. {{isSetsidSupported}} used to return true for Windows, with the rationale being that this flag really means "are process groups supported?" This didn't work out in practice, because there is too much logic that is very specific to using setsid. This had been causing the calls to winutils to prepend a '-' character to the job ID, which is incorrect.
# winutils task kill had been terminating the job with exit code 1, but some of the YARN code depends on seeing a Unix-style exit code from signalled child processes, which is 128 + signal. (See {{ContainerLaunch#call}}.) The Windows {{TerminateJobObject}} API is most analogous to a kill signal, so I've changed task.c to use 128 + 9 = 137.
# {{TestNodeManagerShutdown}}, {{TestContainerManager}}, and {{TestContainerLaunch}} were using bash scripts and signals for testing. I wrote alternatives for Windows that use cmd and winutils. Note that there is no equivalent to bash's ability to trap a signal, so on Windows, the assertions need to check for process existence instead.
# Some test working directories have been shortened by switching from {{Class#getName}} to {{Class#getSimpleName}}, similar to several prior patches.
# {{TestContainerManager}} had been requesting memory in bytes, but the API actually uses megabytes. I'm guessing that the API changed from bytes to MB at some point, but we forgot to update this test. This caused a very interesting problem. {{ContainerImpl#LaunchTransition}} would apply a conversion from bytes to MB, which would cause an overflow to exactly 0. Then, {{ContainersMonitorImpl#isProcessTreeOverLimit}} would see that the container uses 0 MB and decide to kill it. This is a race condition that would cause the test to fail unpredictably on Windows. I hadn't seen the problem on Mac or Ubuntu, where it seems we were just getting lucky. I've changed the test code to use MB.
# {{TestContainerLaunch#setNewEnvironmentHack}} uses reflection to modify the environment during the test. I needed to update this code to handle different internal JDK class structure when running on Windows.
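The "overflow to exactly 0" in the memory-units item can be reproduced with plain Java int arithmetic. The sketch below is an illustration under stated assumptions, not the code from {{ContainerImpl}}: it assumes the unit conversion multiplies an int memory value by 1024 twice with only the last factor being a long, and it assumes a hypothetical request of 100 MB written in bytes. Neither detail is quoted in the message.

```java
// Hypothetical sketch of the overflow; names and values are assumptions.
class MemoryOverflowSketch {

    // The first multiplication is done in 32-bit int arithmetic and may wrap
    // around before the result is widened to long by the 1024L factor.
    static long scaleWithIntArithmetic(int memory) {
        return memory * 1024 * 1024L;
    }

    // Widening to long before the first multiplication avoids the wraparound.
    static long scaleWithLongArithmetic(int memory) {
        return memory * 1024L * 1024L;
    }

    public static void main(String[] args) {
        // Assumed test value: 100 MB expressed in bytes, passed where MB is expected.
        int requestedInBytes = 100 * 1024 * 1024;

        // 104857600 * 1024 = 25 * 2^32, which wraps to 0 in int arithmetic.
        System.out.println(scaleWithIntArithmetic(requestedInBytes));  // prints 0
        System.out.println(scaleWithLongArithmetic(requestedInBytes)); // prints 109951162777600

        // With the value given in MB, as the API expects, nothing overflows.
        System.out.println(scaleWithIntArithmetic(100));               // prints 104857600
    }
}
```

Under these assumptions the result is exactly 0 rather than a negative or otherwise garbled number, because the intermediate product is a whole multiple of 2^32.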
[jira] [Updated] (YARN-493) NodeManager job control logic flaws on Windows
[ https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Nauroth updated YARN-493:
    Description: Both product and test code contain some platform-specific assumptions, such as availability of bash for executing a command in a container and signals to check existence of a process and terminate it. (was: The tests contain some platform-specific assumptions, such as availability of bash for executing a command in a container and signals to check existence of a process and terminate it.)
    Summary: NodeManager job control logic flaws on Windows (was: TestContainerManager fails on Windows)

I'm expanding the scope of this jira to cover some flaws I've discovered in NodeManager's job control logic on Windows:

# Windows was erroneously flagged as supporting setsid, which caused prepending of a '-' character to the job ID passed to winutils.
# Exit code from job terminated by winutils task kill differed from expectations in YARN Java code, so that it couldn't tell the difference between a killed container vs. a container that had exited with failure.
# Multiple tests were relying on bash scripts and signals for launching and controlling containers.

I have a patch in progress. With the expanded scope, the patch will fix the following tests on Windows: {{TestContainerLaunch}}, {{TestContainerManager}}, and {{TestNodeManagerShutdown}}.
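The first two flaws in the list above can be sketched in a few lines of Java. This is an illustrative sketch, not the NodeManager code: the class, enum, and method names are hypothetical, and the classification relies on the Unix convention of 128 + signal number for signalled processes that the patch notes in this thread refer to.

```java
// Hypothetical sketch only: names and structure are assumptions for illustration.
class JobControlSketch {

    static final int SIGKILL = 9;
    static final int SIGTERM = 15;

    enum Outcome { SUCCESS, KILLED, FAILED }

    // Classifies a child exit code using the Unix convention of 128 + signal.
    // An exit code of 1 from a killed job would be classified as FAILED here,
    // which is the ambiguity described in the second flaw.
    static Outcome classify(int exitCode) {
        if (exitCode == 0) {
            return Outcome.SUCCESS;
        }
        if (exitCode == 128 + SIGKILL || exitCode == 128 + SIGTERM) {
            return Outcome.KILLED;
        }
        return Outcome.FAILED;
    }

    // Builds the identifier handed to the kill command. A leading '-' targets a
    // whole process group on Unix when setsid is in use; it is wrong for a
    // Windows job ID, which is the first flaw.
    static String killTarget(String id, boolean setsidSupported) {
        return setsidSupported ? "-" + id : id;
    }

    public static void main(String[] args) {
        System.out.println(classify(137));               // prints KILLED
        System.out.println(classify(1));                 // prints FAILED
        System.out.println(killTarget("1234", true));    // prints -1234
        System.out.println(killTarget("job_42", false)); // prints job_42
    }
}
```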