[jira] [Commented] (YARN-2975) FSLeafQueue app lists are accessed without required locks
[ https://issues.apache.org/jira/browse/YARN-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255134#comment-14255134 ] Hudson commented on YARN-2975: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #48 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/48/]) YARN-2975. FSLeafQueue app lists are accessed without required locks. (kasha) (kasha: rev 24ee9e3431d27811530ffa01d8d241133fd643fe) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerLeafQueueInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/MaxRunningAppsEnforcer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestMaxRunningAppsEnforcer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueueManager.java FSLeafQueue app lists are accessed without required locks - Key: YARN-2975 URL: https://issues.apache.org/jira/browse/YARN-2975 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Fix For: 2.7.0 Attachments: yarn-2975-1.patch, yarn-2975-2.patch, yarn-2975-3.patch YARN-2910 adds explicit locked access to runnable and non-runnable apps in FSLeafQueue. As FSLeafQueue has getters for these, they can be accessed without locks in other places. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
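The locking pattern referenced above — YARN-2910's explicit locked access to FSLeafQueue's runnable and non-runnable app lists, which this fix extends to the remaining call sites — might look roughly like the sketch below. The field names (readLock, runnableApps) are assumptions for illustration; the committed change may be structured differently.

{code:title=FSLeafQueue (illustrative sketch)}
// Readers take the queue's read lock so they never observe the app list
// while a writer (holding the write lock) is adding or removing entries.
public int getNumRunnableApps() {
  readLock.lock();
  try {
    return runnableApps.size();
  } finally {
    readLock.unlock();
  }
}
{code}

Call sites such as MaxRunningAppsEnforcer and FairSchedulerLeafQueueInfo (both touched by the commit above) would then go through accessors like this instead of iterating the raw lists returned by the getters.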
[jira] [Commented] (YARN-2977) TestNMClient get failed intermittently
[ https://issues.apache.org/jira/browse/YARN-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255135#comment-14255135 ] Hudson commented on YARN-2977: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #48 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/48/]) YARN-2977. Fixed intermittent TestNMClient failure. (Contributed by Junping Du) (ozawa: rev cf7fe583d14ebb16fc1b6e29dc2afbf67d24b9d1) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestNMClient.java CHANGES.txt: add YARN-2977 (ozawa: rev 76b0370a27c482caff9498e15ef889d37f413ce7) * hadoop-yarn-project/CHANGES.txt CHANGES.txt: move YARN-2977 from 2.6.1 to 2.7.0 (ozawa: rev 8f5522ed9913ab175c422cbf89928742243c207e) * hadoop-yarn-project/CHANGES.txt TestNMClient get failed intermittently --- Key: YARN-2977 URL: https://issues.apache.org/jira/browse/YARN-2977 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Junping Du Fix For: 2.7.0 Attachments: YARN-2977.patch There are still some test failures for TestNMClient on a slow testbed. As noted in my comments on YARN-2148, the container could finish before CLEANUP_CONTAINER happens due to a slow start. Let's add back exit code 0 and add more messages to the test case. The failure stack: java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:386) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:348) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:227) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
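A minimal sketch of the relaxed check described above, assuming it lands in TestNMClient's container-status assertions; the exact constants and messages in YARN-2977.patch may differ.

{code:title=TestNMClient (illustrative sketch)}
// On a slow testbed the container may already have exited cleanly (code 0)
// before the test's stop/cleanup request takes effect, so 0 is accepted
// alongside the forced-termination status, and the message reports the
// actual value to make future failures easier to diagnose.
int exitStatus = status.getExitStatus();
assertTrue("Unexpected exit status: " + exitStatus,
    exitStatus == 0
        || exitStatus == ContainerExitStatus.KILLED_BY_APPMASTER);
{code}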
[jira] [Commented] (YARN-2977) TestNMClient get failed intermittently
[ https://issues.apache.org/jira/browse/YARN-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255144#comment-14255144 ] Hudson commented on YARN-2977: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #782 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/782/]) YARN-2977. Fixed intermittent TestNMClient failure. (Contributed by Junping Du) (ozawa: rev cf7fe583d14ebb16fc1b6e29dc2afbf67d24b9d1) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestNMClient.java CHANGES.txt: add YARN-2977 (ozawa: rev 76b0370a27c482caff9498e15ef889d37f413ce7) * hadoop-yarn-project/CHANGES.txt CHANGES.txt: move YARN-2977 from 2.6.1 to 2.7.0 (ozawa: rev 8f5522ed9913ab175c422cbf89928742243c207e) * hadoop-yarn-project/CHANGES.txt TestNMClient get failed intermittently --- Key: YARN-2977 URL: https://issues.apache.org/jira/browse/YARN-2977 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Junping Du Fix For: 2.7.0 Attachments: YARN-2977.patch There are still some test failures for TestNMClient in slow testbed. Like my comments in YARN-2148, the container could be finished before CLEANUP_CONTAINER happens due to slow start. Let's add back exit code 0 and add more message for test case. The failure stack: java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:386) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:348) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:227) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2975) FSLeafQueue app lists are accessed without required locks
[ https://issues.apache.org/jira/browse/YARN-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255143#comment-14255143 ] Hudson commented on YARN-2975: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #782 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/782/]) YARN-2975. FSLeafQueue app lists are accessed without required locks. (kasha) (kasha: rev 24ee9e3431d27811530ffa01d8d241133fd643fe) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerLeafQueueInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/MaxRunningAppsEnforcer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestMaxRunningAppsEnforcer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueueManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt FSLeafQueue app lists are accessed without required locks - Key: YARN-2975 URL: https://issues.apache.org/jira/browse/YARN-2975 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Fix For: 2.7.0 Attachments: yarn-2975-1.patch, yarn-2975-2.patch, yarn-2975-3.patch YARN-2910 adds explicit locked access to runnable and non-runnable apps in FSLeafQueue. As FSLeafQueue has getters for these, they can be accessed without locks in other places. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2983) NPE possible in ClientRMService#getQueueInfo
Varun Saxena created YARN-2983: -- Summary: NPE possible in ClientRMService#getQueueInfo Key: YARN-2983 URL: https://issues.apache.org/jira/browse/YARN-2983 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Varun Saxena Assignee: Varun Saxena While going through the code for YARN-2978, I found one issue. During construction of {{GetQueueInfoResponse}} in {{ClientRMService#getQueueInfo}}, we first collect application attempts from the scheduler and then get apps from a {{ConcurrentHashMap}} in {{RMContext}}. Although each operation (get/put/remove, etc.) on a ConcurrentHashMap is thread-safe, a series of multiple {{ConcurrentHashMap#get}} calls (say, in a for loop) is not atomic as a whole. For instance, in the code below we call rmContext.getRMApps()#get in a loop. A ConcurrentHashMap#get can return null if the key doesn't exist, but there is no null check inside this for loop before the returned value, i.e. rmApp, is dereferenced. Although all the application attempts have just been fetched for the queue above the for loop, this block of code is not synchronized, so another thread may delete the RMApp from the ConcurrentHashMap at the same time. This can happen when an app finishes/completes and the number of completed apps exceeds the config {{yarn.resourcemanager.max-completed-applications}}. I think there should be a null check inside this for loop, otherwise an NPE can occur. {code:title=ClientRMService#getQueueInfo} public GetQueueInfoResponse getQueueInfo(GetQueueInfoRequest request) throws YarnException { ... if (request.getIncludeApplications()) { List<ApplicationAttemptId> apps = scheduler.getAppsInQueue(request.getQueueName()); appReports = new ArrayList<ApplicationReport>(apps.size()); for (ApplicationAttemptId app : apps) { RMApp rmApp = rmContext.getRMApps().get(app.getApplicationId()); appReports.add(rmApp.createAndGetApplicationReport(null, true)); } } ... } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
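A minimal sketch of the null check proposed above, following the loop quoted in the description; whether YARN-2983.patch takes exactly this form is an assumption.

{code:title=ClientRMService#getQueueInfo with null check (illustrative sketch)}
for (ApplicationAttemptId app : apps) {
  RMApp rmApp = rmContext.getRMApps().get(app.getApplicationId());
  // The app may have been removed from the ConcurrentHashMap between the
  // scheduler lookup and this get (e.g. completed apps trimmed once
  // yarn.resourcemanager.max-completed-applications is exceeded), so guard
  // before dereferencing.
  if (rmApp != null) {
    appReports.add(rmApp.createAndGetApplicationReport(null, true));
  }
}
{code}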
[jira] [Updated] (YARN-2983) NPE possible in ClientRMService#getQueueInfo
[ https://issues.apache.org/jira/browse/YARN-2983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2983: --- Description: While going through code for checking YARN-2978 , found one issue. During construction of {{GetQueueInfoResponse}} in {{ClientRMService#getQueueInfo}}, we first collect application attempts from scheduler and then get apps from a {{ConcurrentHashMap}} in {{RMContext}}. Although the operation(get/put/remove,etc) itself on a ConcurrentHashMap is thread-safe, but a series of multiple {{ConcurrentHashMap#get}} (say, in a for loop) is not. For instance, in code below, we are calling rmContext.getRMApps()#get in a loop. Now a ConcurrentHashMap#get can return null if the key doesnt exist. But there is no null check inside this for loop before dereferencing the value returned i.e. rmApp. Although all the applicationattempts have been fetched for the queue just above the for loop, but as this block of code is not synchronized, there is a possibility that another thread may delete RMApp from the ConcurrentHashMap at the same time. This can happen when an app finishes/completes and number of completed apps exceed the config {{yarn.resourcemanager.max-completed-applications}}. I think there should be a null check inside this for loop, otherwise a NPE can occur. {code:title=ClientRMService#getQueueInfo} public GetQueueInfoResponse getQueueInfo(GetQueueInfoRequest request) throws YarnException { . if (request.getIncludeApplications()) { ListApplicationAttemptId apps = scheduler.getAppsInQueue(request.getQueueName()); appReports = new ArrayListApplicationReport(apps.size()); for (ApplicationAttemptId app : apps) { RMApp rmApp = rmContext.getRMApps().get(app.getApplicationId()); appReports.add(rmApp.createAndGetApplicationReport(null, true)); } } .. } {code} was: While going through code for checking YARN-2978 , found one issue. During construction GetQueueInfoResponse in {{ClientRMService#getQueueInfo}}, we first collect application attempts from scheduler and then get apps from a ConcurrentHashMap in {{RMContext}}. Although the operation(get/put/remove,etc) itself on a ConcurrentHashMap is thread-safe, but a series of multiple ConcurrentHashMap#get (say, in a for loop) is not. For instance, in code below, we are calling rmContext.getRMApps()#get in a loop. Now a ConcurrentHashMap#get can return null if the key doesnt exist. But there is no null check inside this for loop before dereferencing the value returned i.e. rmApp. Although all the applicationattempts have been fetched for the queue just above the for loop, but as this block of code is not synchronized, there is a possibility that another thread may delete RMApp from the ConcurrentHashMap at the same time. This can happen when an app finishes/completes and number of completed apps exceed the config {{yarn.resourcemanager.max-completed-applications}}. I think there should be a null check inside this for loop, otherwise a NPE can occur. {code:title=ClientRMService#getQueueInfo} public GetQueueInfoResponse getQueueInfo(GetQueueInfoRequest request) throws YarnException { . if (request.getIncludeApplications()) { ListApplicationAttemptId apps = scheduler.getAppsInQueue(request.getQueueName()); appReports = new ArrayListApplicationReport(apps.size()); for (ApplicationAttemptId app : apps) { RMApp rmApp = rmContext.getRMApps().get(app.getApplicationId()); appReports.add(rmApp.createAndGetApplicationReport(null, true)); } } .. 
} {code} NPE possible in ClientRMService#getQueueInfo Key: YARN-2983 URL: https://issues.apache.org/jira/browse/YARN-2983 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Varun Saxena Assignee: Varun Saxena While going through code for checking YARN-2978 , found one issue. During construction of {{GetQueueInfoResponse}} in {{ClientRMService#getQueueInfo}}, we first collect application attempts from scheduler and then get apps from a {{ConcurrentHashMap}} in {{RMContext}}. Although the operation(get/put/remove,etc) itself on a ConcurrentHashMap is thread-safe, but a series of multiple {{ConcurrentHashMap#get}} (say, in a for loop) is not. For instance, in code below, we are calling rmContext.getRMApps()#get in a loop. Now a ConcurrentHashMap#get can return null if the key doesnt exist. But there is no null check inside this for loop before dereferencing the value returned i.e. rmApp. Although all the applicationattempts have been fetched for the queue just above the for loop, but as this block of code is not synchronized, there is a possibility that another thread may delete RMApp from the ConcurrentHashMap at the same time. This can happen when an app finishes/completes and number of completed apps exceed the config {{yarn.resourcemanager.max-completed-applications}}. I think there should be a null check inside this for loop, otherwise a NPE can occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2983) NPE possible in ClientRMService#getQueueInfo
[ https://issues.apache.org/jira/browse/YARN-2983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2983: --- Description: While going through code for checking YARN-2978 , found one issue. During construction of {{GetQueueInfoResponse}} in {{ClientRMService#getQueueInfo}}, we first collect application attempts from scheduler and then get apps from a {{ConcurrentHashMap}} in {{RMContext}}. Although the operation(get/put/remove,etc) itself on a ConcurrentHashMap is thread-safe, but a series of multiple {{ConcurrentHashMap#get}} (say, in a for loop) are not. For instance, in code below, we are calling rmContext.getRMApps()#get in a loop. Now a ConcurrentHashMap#get can return null if the key doesnt exist. But there is no null check inside this for loop before dereferencing the value returned i.e. rmApp. Although all the applicationattempts have been fetched for the queue just above the for loop, but as this block of code is not synchronized, there is a possibility that another thread may delete RMApp from the ConcurrentHashMap at the same time. This can happen when an app finishes/completes and number of completed apps exceed the config {{yarn.resourcemanager.max-completed-applications}}. I think there should be a null check inside this for loop, otherwise a NPE can occur. {code:title=ClientRMService#getQueueInfo} public GetQueueInfoResponse getQueueInfo(GetQueueInfoRequest request) throws YarnException { . if (request.getIncludeApplications()) { ListApplicationAttemptId apps = scheduler.getAppsInQueue(request.getQueueName()); appReports = new ArrayListApplicationReport(apps.size()); for (ApplicationAttemptId app : apps) { RMApp rmApp = rmContext.getRMApps().get(app.getApplicationId()); appReports.add(rmApp.createAndGetApplicationReport(null, true)); } } .. } {code} was: While going through code for checking YARN-2978 , found one issue. During construction of {{GetQueueInfoResponse}} in {{ClientRMService#getQueueInfo}}, we first collect application attempts from scheduler and then get apps from a {{ConcurrentHashMap}} in {{RMContext}}. Although the operation(get/put/remove,etc) itself on a ConcurrentHashMap is thread-safe, but a series of multiple {{ConcurrentHashMap#get}} (say, in a for loop) is not. For instance, in code below, we are calling rmContext.getRMApps()#get in a loop. Now a ConcurrentHashMap#get can return null if the key doesnt exist. But there is no null check inside this for loop before dereferencing the value returned i.e. rmApp. Although all the applicationattempts have been fetched for the queue just above the for loop, but as this block of code is not synchronized, there is a possibility that another thread may delete RMApp from the ConcurrentHashMap at the same time. This can happen when an app finishes/completes and number of completed apps exceed the config {{yarn.resourcemanager.max-completed-applications}}. I think there should be a null check inside this for loop, otherwise a NPE can occur. {code:title=ClientRMService#getQueueInfo} public GetQueueInfoResponse getQueueInfo(GetQueueInfoRequest request) throws YarnException { . if (request.getIncludeApplications()) { ListApplicationAttemptId apps = scheduler.getAppsInQueue(request.getQueueName()); appReports = new ArrayListApplicationReport(apps.size()); for (ApplicationAttemptId app : apps) { RMApp rmApp = rmContext.getRMApps().get(app.getApplicationId()); appReports.add(rmApp.createAndGetApplicationReport(null, true)); } } .. 
} {code} NPE possible in ClientRMService#getQueueInfo Key: YARN-2983 URL: https://issues.apache.org/jira/browse/YARN-2983 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Varun Saxena Assignee: Varun Saxena While going through code for checking YARN-2978 , found one issue. During construction of {{GetQueueInfoResponse}} in {{ClientRMService#getQueueInfo}}, we first collect application attempts from scheduler and then get apps from a {{ConcurrentHashMap}} in {{RMContext}}. Although the operation(get/put/remove,etc) itself on a ConcurrentHashMap is thread-safe, but a series of multiple {{ConcurrentHashMap#get}} (say, in a for loop) are not. For instance, in code below, we are calling rmContext.getRMApps()#get in a loop. Now a ConcurrentHashMap#get can return null if the key doesnt exist. But there is no null check inside this for loop before dereferencing the value returned i.e. rmApp. Although all the applicationattempts have been fetched for the queue just above the for loop, but as this block of code is not synchronized, there is a possibility that another thread may delete RMApp from the ConcurrentHashMap at the same time. This can happen when an app finishes/completes and number of completed apps exceed the config {{yarn.resourcemanager.max-completed-applications}}. I think there should be a null check inside this for loop, otherwise a NPE can occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2983) NPE possible in ClientRMService#getQueueInfo
[ https://issues.apache.org/jira/browse/YARN-2983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2983: --- Attachment: YARN-2983.patch NPE possible in ClientRMService#getQueueInfo Key: YARN-2983 URL: https://issues.apache.org/jira/browse/YARN-2983 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Varun Saxena Assignee: Varun Saxena While going through code for checking YARN-2978 , found one issue. During construction of {{GetQueueInfoResponse}} in {{ClientRMService#getQueueInfo}}, we first collect application attempts from scheduler and then get apps from a {{ConcurrentHashMap}} in {{RMContext}}. Although the operation(get/put/remove,etc) itself on a ConcurrentHashMap is thread-safe, but a series of multiple {{ConcurrentHashMap#get}} (say, in a for loop) are not. For instance, in code below, we are calling rmContext.getRMApps()#get in a loop. Now a ConcurrentHashMap#get can return null if the key doesnt exist. But there is no null check inside this for loop before dereferencing the value returned i.e. rmApp. Although all the applicationattempts have been fetched for the queue just above the for loop, but as this block of code is not synchronized, there is a possibility that another thread may delete RMApp from the ConcurrentHashMap at the same time. This can happen when an app finishes/completes and number of completed apps exceed the config {{yarn.resourcemanager.max-completed-applications}}. I think there should be a null check inside this for loop, otherwise a NPE can occur. {code:title=ClientRMService#getQueueInfo} public GetQueueInfoResponse getQueueInfo(GetQueueInfoRequest request) throws YarnException { . if (request.getIncludeApplications()) { ListApplicationAttemptId apps = scheduler.getAppsInQueue(request.getQueueName()); appReports = new ArrayListApplicationReport(apps.size()); for (ApplicationAttemptId app : apps) { RMApp rmApp = rmContext.getRMApps().get(app.getApplicationId()); appReports.add(rmApp.createAndGetApplicationReport(null, true)); } } .. } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2983) NPE possible in ClientRMService#getQueueInfo
[ https://issues.apache.org/jira/browse/YARN-2983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2983: --- Attachment: (was: YARN-2983.patch) NPE possible in ClientRMService#getQueueInfo Key: YARN-2983 URL: https://issues.apache.org/jira/browse/YARN-2983 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Varun Saxena Assignee: Varun Saxena While going through code for checking YARN-2978 , found one issue. During construction of {{GetQueueInfoResponse}} in {{ClientRMService#getQueueInfo}}, we first collect application attempts from scheduler and then get apps from a {{ConcurrentHashMap}} in {{RMContext}}. Although the operation(get/put/remove,etc) itself on a ConcurrentHashMap is thread-safe, but a series of multiple {{ConcurrentHashMap#get}} (say, in a for loop) are not. For instance, in code below, we are calling rmContext.getRMApps()#get in a loop. Now a ConcurrentHashMap#get can return null if the key doesnt exist. But there is no null check inside this for loop before dereferencing the value returned i.e. rmApp. Although all the applicationattempts have been fetched for the queue just above the for loop, but as this block of code is not synchronized, there is a possibility that another thread may delete RMApp from the ConcurrentHashMap at the same time. This can happen when an app finishes/completes and number of completed apps exceed the config {{yarn.resourcemanager.max-completed-applications}}. I think there should be a null check inside this for loop, otherwise a NPE can occur. {code:title=ClientRMService#getQueueInfo} public GetQueueInfoResponse getQueueInfo(GetQueueInfoRequest request) throws YarnException { . if (request.getIncludeApplications()) { ListApplicationAttemptId apps = scheduler.getAppsInQueue(request.getQueueName()); appReports = new ArrayListApplicationReport(apps.size()); for (ApplicationAttemptId app : apps) { RMApp rmApp = rmContext.getRMApps().get(app.getApplicationId()); appReports.add(rmApp.createAndGetApplicationReport(null, true)); } } .. } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2983) NPE possible in ClientRMService#getQueueInfo
[ https://issues.apache.org/jira/browse/YARN-2983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2983: --- Attachment: YARN-2983.patch NPE possible in ClientRMService#getQueueInfo Key: YARN-2983 URL: https://issues.apache.org/jira/browse/YARN-2983 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Varun Saxena Assignee: Varun Saxena Attachments: YARN-2983.patch While going through code for checking YARN-2978 , found one issue. During construction of {{GetQueueInfoResponse}} in {{ClientRMService#getQueueInfo}}, we first collect application attempts from scheduler and then get apps from a {{ConcurrentHashMap}} in {{RMContext}}. Although the operation(get/put/remove,etc) itself on a ConcurrentHashMap is thread-safe, but a series of multiple {{ConcurrentHashMap#get}} (say, in a for loop) are not. For instance, in code below, we are calling rmContext.getRMApps()#get in a loop. Now a ConcurrentHashMap#get can return null if the key doesnt exist. But there is no null check inside this for loop before dereferencing the value returned i.e. rmApp. Although all the applicationattempts have been fetched for the queue just above the for loop, but as this block of code is not synchronized, there is a possibility that another thread may delete RMApp from the ConcurrentHashMap at the same time. This can happen when an app finishes/completes and number of completed apps exceed the config {{yarn.resourcemanager.max-completed-applications}}. I think there should be a null check inside this for loop, otherwise a NPE can occur. {code:title=ClientRMService#getQueueInfo} public GetQueueInfoResponse getQueueInfo(GetQueueInfoRequest request) throws YarnException { . if (request.getIncludeApplications()) { ListApplicationAttemptId apps = scheduler.getAppsInQueue(request.getQueueName()); appReports = new ArrayListApplicationReport(apps.size()); for (ApplicationAttemptId app : apps) { RMApp rmApp = rmContext.getRMApps().get(app.getApplicationId()); appReports.add(rmApp.createAndGetApplicationReport(null, true)); } } .. } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2983) NPE possible in ClientRMService#getQueueInfo
[ https://issues.apache.org/jira/browse/YARN-2983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255156#comment-14255156 ] Hadoop QA commented on YARN-2983: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12688574/YARN-2983.patch against trunk revision 8f5522e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 15 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6163//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6163//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6163//console This message is automatically generated. NPE possible in ClientRMService#getQueueInfo Key: YARN-2983 URL: https://issues.apache.org/jira/browse/YARN-2983 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Varun Saxena Assignee: Varun Saxena Attachments: YARN-2983.patch While going through code for checking YARN-2978 , found one issue. During construction of {{GetQueueInfoResponse}} in {{ClientRMService#getQueueInfo}}, we first collect application attempts from scheduler and then get apps from a {{ConcurrentHashMap}} in {{RMContext}}. Although the operation(get/put/remove,etc) itself on a ConcurrentHashMap is thread-safe, but a series of multiple {{ConcurrentHashMap#get}} (say, in a for loop) are not. For instance, in code below, we are calling rmContext.getRMApps()#get in a loop. Now a ConcurrentHashMap#get can return null if the key doesnt exist. But there is no null check inside this for loop before dereferencing the value returned i.e. rmApp. Although all the applicationattempts have been fetched for the queue just above the for loop, but as this block of code is not synchronized, there is a possibility that another thread may delete RMApp from the ConcurrentHashMap at the same time. This can happen when an app finishes/completes and number of completed apps exceed the config {{yarn.resourcemanager.max-completed-applications}}. I think there should be a null check inside this for loop, otherwise a NPE can occur. {code:title=ClientRMService#getQueueInfo} public GetQueueInfoResponse getQueueInfo(GetQueueInfoRequest request) throws YarnException { . 
if (request.getIncludeApplications()) { ListApplicationAttemptId apps = scheduler.getAppsInQueue(request.getQueueName()); appReports = new ArrayListApplicationReport(apps.size()); for (ApplicationAttemptId app : apps) { RMApp rmApp = rmContext.getRMApps().get(app.getApplicationId()); appReports.add(rmApp.createAndGetApplicationReport(null, true)); } } .. } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2983) NPE possible in ClientRMService#getQueueInfo
[ https://issues.apache.org/jira/browse/YARN-2983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255157#comment-14255157 ] Varun Saxena commented on YARN-2983: Findbugs to be addressed by YARN-2937 to YARN-2940 NPE possible in ClientRMService#getQueueInfo Key: YARN-2983 URL: https://issues.apache.org/jira/browse/YARN-2983 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Varun Saxena Assignee: Varun Saxena Attachments: YARN-2983.patch While going through code for checking YARN-2978 , found one issue. During construction of {{GetQueueInfoResponse}} in {{ClientRMService#getQueueInfo}}, we first collect application attempts from scheduler and then get apps from a {{ConcurrentHashMap}} in {{RMContext}}. Although the operation(get/put/remove,etc) itself on a ConcurrentHashMap is thread-safe, but a series of multiple {{ConcurrentHashMap#get}} (say, in a for loop) are not. For instance, in code below, we are calling rmContext.getRMApps()#get in a loop. Now a ConcurrentHashMap#get can return null if the key doesnt exist. But there is no null check inside this for loop before dereferencing the value returned i.e. rmApp. Although all the applicationattempts have been fetched for the queue just above the for loop, but as this block of code is not synchronized, there is a possibility that another thread may delete RMApp from the ConcurrentHashMap at the same time. This can happen when an app finishes/completes and number of completed apps exceed the config {{yarn.resourcemanager.max-completed-applications}}. I think there should be a null check inside this for loop, otherwise a NPE can occur. {code:title=ClientRMService#getQueueInfo} public GetQueueInfoResponse getQueueInfo(GetQueueInfoRequest request) throws YarnException { . if (request.getIncludeApplications()) { ListApplicationAttemptId apps = scheduler.getAppsInQueue(request.getQueueName()); appReports = new ArrayListApplicationReport(apps.size()); for (ApplicationAttemptId app : apps) { RMApp rmApp = rmContext.getRMApps().get(app.getApplicationId()); appReports.add(rmApp.createAndGetApplicationReport(null, true)); } } .. } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2975) FSLeafQueue app lists are accessed without required locks
[ https://issues.apache.org/jira/browse/YARN-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255158#comment-14255158 ] Hudson commented on YARN-2975: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1980 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1980/]) YARN-2975. FSLeafQueue app lists are accessed without required locks. (kasha) (kasha: rev 24ee9e3431d27811530ffa01d8d241133fd643fe) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestMaxRunningAppsEnforcer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerLeafQueueInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueueManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/MaxRunningAppsEnforcer.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java FSLeafQueue app lists are accessed without required locks - Key: YARN-2975 URL: https://issues.apache.org/jira/browse/YARN-2975 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Fix For: 2.7.0 Attachments: yarn-2975-1.patch, yarn-2975-2.patch, yarn-2975-3.patch YARN-2910 adds explicit locked access to runnable and non-runnable apps in FSLeafQueue. As FSLeafQueue has getters for these, they can be accessed without locks in other places. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2977) TestNMClient get failed intermittently
[ https://issues.apache.org/jira/browse/YARN-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255159#comment-14255159 ] Hudson commented on YARN-2977: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1980 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1980/]) YARN-2977. Fixed intermittent TestNMClient failure. (Contributed by Junping Du) (ozawa: rev cf7fe583d14ebb16fc1b6e29dc2afbf67d24b9d1) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestNMClient.java CHANGES.txt: add YARN-2977 (ozawa: rev 76b0370a27c482caff9498e15ef889d37f413ce7) * hadoop-yarn-project/CHANGES.txt CHANGES.txt: move YARN-2977 from 2.6.1 to 2.7.0 (ozawa: rev 8f5522ed9913ab175c422cbf89928742243c207e) * hadoop-yarn-project/CHANGES.txt TestNMClient get failed intermittently --- Key: YARN-2977 URL: https://issues.apache.org/jira/browse/YARN-2977 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Junping Du Fix For: 2.7.0 Attachments: YARN-2977.patch There are still some test failures for TestNMClient in slow testbed. Like my comments in YARN-2148, the container could be finished before CLEANUP_CONTAINER happens due to slow start. Let's add back exit code 0 and add more message for test case. The failure stack: java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:386) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:348) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:227) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2975) FSLeafQueue app lists are accessed without required locks
[ https://issues.apache.org/jira/browse/YARN-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255160#comment-14255160 ] Hudson commented on YARN-2975: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #45 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/45/]) YARN-2975. FSLeafQueue app lists are accessed without required locks. (kasha) (kasha: rev 24ee9e3431d27811530ffa01d8d241133fd643fe) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestMaxRunningAppsEnforcer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/MaxRunningAppsEnforcer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueueManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerLeafQueueInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java FSLeafQueue app lists are accessed without required locks - Key: YARN-2975 URL: https://issues.apache.org/jira/browse/YARN-2975 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Fix For: 2.7.0 Attachments: yarn-2975-1.patch, yarn-2975-2.patch, yarn-2975-3.patch YARN-2910 adds explicit locked access to runnable and non-runnable apps in FSLeafQueue. As FSLeafQueue has getters for these, they can be accessed without locks in other places. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2977) TestNMClient get failed intermittently
[ https://issues.apache.org/jira/browse/YARN-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255161#comment-14255161 ] Hudson commented on YARN-2977: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #45 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/45/]) YARN-2977. Fixed intermittent TestNMClient failure. (Contributed by Junping Du) (ozawa: rev cf7fe583d14ebb16fc1b6e29dc2afbf67d24b9d1) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestNMClient.java CHANGES.txt: add YARN-2977 (ozawa: rev 76b0370a27c482caff9498e15ef889d37f413ce7) * hadoop-yarn-project/CHANGES.txt CHANGES.txt: move YARN-2977 from 2.6.1 to 2.7.0 (ozawa: rev 8f5522ed9913ab175c422cbf89928742243c207e) * hadoop-yarn-project/CHANGES.txt TestNMClient get failed intermittently --- Key: YARN-2977 URL: https://issues.apache.org/jira/browse/YARN-2977 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Junping Du Fix For: 2.7.0 Attachments: YARN-2977.patch There are still some test failures for TestNMClient in slow testbed. Like my comments in YARN-2148, the container could be finished before CLEANUP_CONTAINER happens due to slow start. Let's add back exit code 0 and add more message for test case. The failure stack: java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:386) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:348) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:227) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2977) TestNMClient get failed intermittently
[ https://issues.apache.org/jira/browse/YARN-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255180#comment-14255180 ] Hudson commented on YARN-2977: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #49 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/49/]) YARN-2977. Fixed intermittent TestNMClient failure. (Contributed by Junping Du) (ozawa: rev cf7fe583d14ebb16fc1b6e29dc2afbf67d24b9d1) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestNMClient.java CHANGES.txt: add YARN-2977 (ozawa: rev 76b0370a27c482caff9498e15ef889d37f413ce7) * hadoop-yarn-project/CHANGES.txt CHANGES.txt: move YARN-2977 from 2.6.1 to 2.7.0 (ozawa: rev 8f5522ed9913ab175c422cbf89928742243c207e) * hadoop-yarn-project/CHANGES.txt TestNMClient get failed intermittently --- Key: YARN-2977 URL: https://issues.apache.org/jira/browse/YARN-2977 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Junping Du Fix For: 2.7.0 Attachments: YARN-2977.patch There are still some test failures for TestNMClient in slow testbed. Like my comments in YARN-2148, the container could be finished before CLEANUP_CONTAINER happens due to slow start. Let's add back exit code 0 and add more message for test case. The failure stack: java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:386) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:348) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:227) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2975) FSLeafQueue app lists are accessed without required locks
[ https://issues.apache.org/jira/browse/YARN-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255179#comment-14255179 ] Hudson commented on YARN-2975: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #49 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/49/]) YARN-2975. FSLeafQueue app lists are accessed without required locks. (kasha) (kasha: rev 24ee9e3431d27811530ffa01d8d241133fd643fe) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueueManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerLeafQueueInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/MaxRunningAppsEnforcer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestMaxRunningAppsEnforcer.java FSLeafQueue app lists are accessed without required locks - Key: YARN-2975 URL: https://issues.apache.org/jira/browse/YARN-2975 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Fix For: 2.7.0 Attachments: yarn-2975-1.patch, yarn-2975-2.patch, yarn-2975-3.patch YARN-2910 adds explicit locked access to runnable and non-runnable apps in FSLeafQueue. As FSLeafQueue has getters for these, they can be accessed without locks in other places. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2975) FSLeafQueue app lists are accessed without required locks
[ https://issues.apache.org/jira/browse/YARN-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255187#comment-14255187 ] Hudson commented on YARN-2975: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1999 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1999/]) YARN-2975. FSLeafQueue app lists are accessed without required locks. (kasha) (kasha: rev 24ee9e3431d27811530ffa01d8d241133fd643fe) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestMaxRunningAppsEnforcer.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueueManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/FairSchedulerLeafQueueInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/MaxRunningAppsEnforcer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java FSLeafQueue app lists are accessed without required locks - Key: YARN-2975 URL: https://issues.apache.org/jira/browse/YARN-2975 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Fix For: 2.7.0 Attachments: yarn-2975-1.patch, yarn-2975-2.patch, yarn-2975-3.patch YARN-2910 adds explicit locked access to runnable and non-runnable apps in FSLeafQueue. As FSLeafQueue has getters for these, they can be accessed without locks in other places. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2977) TestNMClient get failed intermittently
[ https://issues.apache.org/jira/browse/YARN-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255188#comment-14255188 ] Hudson commented on YARN-2977: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1999 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1999/]) YARN-2977. Fixed intermittent TestNMClient failure. (Contributed by Junping Du) (ozawa: rev cf7fe583d14ebb16fc1b6e29dc2afbf67d24b9d1) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestNMClient.java CHANGES.txt: add YARN-2977 (ozawa: rev 76b0370a27c482caff9498e15ef889d37f413ce7) * hadoop-yarn-project/CHANGES.txt CHANGES.txt: move YARN-2977 from 2.6.1 to 2.7.0 (ozawa: rev 8f5522ed9913ab175c422cbf89928742243c207e) * hadoop-yarn-project/CHANGES.txt TestNMClient get failed intermittently --- Key: YARN-2977 URL: https://issues.apache.org/jira/browse/YARN-2977 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Junping Du Fix For: 2.7.0 Attachments: YARN-2977.patch There are still some test failures for TestNMClient in slow testbed. Like my comments in YARN-2148, the container could be finished before CLEANUP_CONTAINER happens due to slow start. Let's add back exit code 0 and add more message for test case. The failure stack: java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:386) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:348) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:227) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2980) Move health check script related functionality to hadoop-common
[ https://issues.apache.org/jira/browse/YARN-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2980: --- Attachment: YARN-2980.001.patch Move health check script related functionality to hadoop-common --- Key: YARN-2980 URL: https://issues.apache.org/jira/browse/YARN-2980 Project: Hadoop YARN Issue Type: Improvement Reporter: Ming Ma Assignee: Varun Saxena Attachments: YARN-2980.001.patch HDFS might want to leverage health check functionality available in YARN in both namenode https://issues.apache.org/jira/browse/HDFS-7400 and datanode https://issues.apache.org/jira/browse/HDFS-7441. We can move health check functionality including the protocol between hadoop daemons and health check script to hadoop-common. That will simplify the development and maintenance for both hadoop source code and health check script. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2980) Move health check script related functionality to hadoop-common
[ https://issues.apache.org/jira/browse/YARN-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255232#comment-14255232 ] Hadoop QA commented on YARN-2980: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12688579/YARN-2980.001.patch against trunk revision 8f5522e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 8 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 21 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.TestContainersMonitor org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6164//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6164//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6164//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6164//console This message is automatically generated. Move health check script related functionality to hadoop-common --- Key: YARN-2980 URL: https://issues.apache.org/jira/browse/YARN-2980 Project: Hadoop YARN Issue Type: Improvement Reporter: Ming Ma Assignee: Varun Saxena Attachments: YARN-2980.001.patch HDFS might want to leverage health check functionality available in YARN in both namenode https://issues.apache.org/jira/browse/HDFS-7400 and datanode https://issues.apache.org/jira/browse/HDFS-7441. We can move health check functionality including the protocol between hadoop daemons and health check script to hadoop-common. That will simplify the development and maintenance for both hadoop source code and health check script. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2980) Move health check script related functionality to hadoop-common
[ https://issues.apache.org/jira/browse/YARN-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2980: --- Attachment: YARN-2980.002.patch Move health check script related functionality to hadoop-common --- Key: YARN-2980 URL: https://issues.apache.org/jira/browse/YARN-2980 Project: Hadoop YARN Issue Type: Improvement Reporter: Ming Ma Assignee: Varun Saxena Attachments: YARN-2980.001.patch, YARN-2980.002.patch HDFS might want to leverage health check functionality available in YARN in both namenode https://issues.apache.org/jira/browse/HDFS-7400 and datanode https://issues.apache.org/jira/browse/HDFS-7441. We can move health check functionality including the protocol between hadoop daemons and health check script to hadoop-common. That will simplify the development and maintenance for both hadoop source code and health check script. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2980) Move health check script related functionality to hadoop-common
[ https://issues.apache.org/jira/browse/YARN-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255253#comment-14255253 ] Hadoop QA commented on YARN-2980: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12688584/YARN-2980.002.patch against trunk revision 8f5522e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 8 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 21 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.ha.TestZKFailoverController Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6165//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6165//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6165//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6165//console This message is automatically generated. Move health check script related functionality to hadoop-common --- Key: YARN-2980 URL: https://issues.apache.org/jira/browse/YARN-2980 Project: Hadoop YARN Issue Type: Improvement Reporter: Ming Ma Assignee: Varun Saxena Attachments: YARN-2980.001.patch, YARN-2980.002.patch HDFS might want to leverage health check functionality available in YARN in both namenode https://issues.apache.org/jira/browse/HDFS-7400 and datanode https://issues.apache.org/jira/browse/HDFS-7441. We can move health check functionality including the protocol between hadoop daemons and health check script to hadoop-common. That will simplify the development and maintenance for both hadoop source code and health check script. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2980) Move health check script related functionality to hadoop-common
[ https://issues.apache.org/jira/browse/YARN-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255296#comment-14255296 ] Varun Saxena commented on YARN-2980: The test failure is unrelated; it passes locally. The Findbugs warnings are to be addressed by other JIRAs. Move health check script related functionality to hadoop-common --- Key: YARN-2980 URL: https://issues.apache.org/jira/browse/YARN-2980 Project: Hadoop YARN Issue Type: Improvement Reporter: Ming Ma Assignee: Varun Saxena Attachments: YARN-2980.001.patch, YARN-2980.002.patch HDFS might want to leverage health check functionality available in YARN in both namenode https://issues.apache.org/jira/browse/HDFS-7400 and datanode https://issues.apache.org/jira/browse/HDFS-7441. We can move health check functionality including the protocol between hadoop daemons and health check script to hadoop-common. That will simplify the development and maintenance for both hadoop source code and health check script. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2939) Fix new findbugs warnings in hadoop-yarn-common
[ https://issues.apache.org/jira/browse/YARN-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255421#comment-14255421 ] Hadoop QA commented on YARN-2939: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12687590/YARN-2939-121614.patch against trunk revision 7bc0a6d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 15 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6166//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6166//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6166//console This message is automatically generated. Fix new findbugs warnings in hadoop-yarn-common --- Key: YARN-2939 URL: https://issues.apache.org/jira/browse/YARN-2939 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Saxena Assignee: Li Lu Labels: findbugs Attachments: YARN-2939-120914.patch, YARN-2939-121614.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2980) Move health check script related functionality to hadoop-common
[ https://issues.apache.org/jira/browse/YARN-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255484#comment-14255484 ] Varun Saxena commented on YARN-2980: [~mingma], kindly take a look at the patch. Going by the patch for HDFS-7400, it seems LocalDirsHandlerService is not required there, hence it would remain part of the NodeManager. Move health check script related functionality to hadoop-common --- Key: YARN-2980 URL: https://issues.apache.org/jira/browse/YARN-2980 Project: Hadoop YARN Issue Type: Improvement Reporter: Ming Ma Assignee: Varun Saxena Attachments: YARN-2980.001.patch, YARN-2980.002.patch HDFS might want to leverage health check functionality available in YARN in both namenode https://issues.apache.org/jira/browse/HDFS-7400 and datanode https://issues.apache.org/jira/browse/HDFS-7441. We can move health check functionality including the protocol between hadoop daemons and health check script to hadoop-common. That will simplify the development and maintenance for both hadoop source code and health check script. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2962) ZKRMStateStore: Limit the number of znodes under a znode
[ https://issues.apache.org/jira/browse/YARN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255497#comment-14255497 ] Varun Saxena commented on YARN-2962: [~rakeshr], thanks for your input. The ApplicationId in YARN is of the format {noformat}application_[cluster timestamp]_[sequence number]{noformat} Here the sequence number has 4 digits and is therefore in the range 0000-9999. Going along the lines of what you are saying, I think we can break up the sequence number part of the ApplicationId, as the cluster timestamp will probably be the same for most of the application IDs. My suggestion is to have it as {noformat}(app_root)\application_[cluster timestamp]_\[first 2 digits of sequence number]\[last 2 digits]{noformat} We can view it as below:
{noformat}
|--- RM_APP_ROOT
|     |--- (application_{cluster timestamp}_)
|     |      |--- (00 to 99)
|     |      |      |--- (00 to 99)
|     |      |      |      |--- (#ApplicationAttemptIds)
{noformat}
[~rakeshr] and [~kasha], kindly comment on the approach. One constraint is that this would entail a larger number of calls to ZK when the RM is recovering. I am not sure how many znodes it takes to reach the 1 MB limit. We could also split the sequence number into the first 1 digit and the last 3 digits. Moreover, I don't see much of an issue with application attempt znodes, as max-attempts is limited to 2 by default. ZKRMStateStore: Limit the number of znodes under a znode Key: YARN-2962 URL: https://issues.apache.org/jira/browse/YARN-2962 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Varun Saxena Priority: Critical We ran into this issue where we were hitting the default ZK server message size configs, primarily because the message had too many znodes even though individually they were all small. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
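To make the proposed layout concrete, the sketch below maps an ApplicationId onto the two extra znode levels, assuming the 4-digit sequence number discussed above. It is purely illustrative; the helper name and root path are made up and this is not ZKRMStateStore code.
{code:java}
// Illustrative helper for the proposed YARN-2962 layout: split the 4-digit
// ApplicationId sequence number into two 2-digit znode levels so that no
// single parent znode accumulates thousands of children.
public class AppIdZNodePath {

  static String hierarchicalPath(String appRoot, String appId) {
    int lastUnderscore = appId.lastIndexOf('_');
    String prefix = appId.substring(0, lastUnderscore + 1); // application_<cluster timestamp>_
    String seq = appId.substring(lastUnderscore + 1);       // 4-digit sequence number
    return appRoot + "/" + prefix + "/" + seq.substring(0, 2) + "/" + seq.substring(2);
  }

  public static void main(String[] args) {
    // Hypothetical timestamp and root path; only the shape of the result matters.
    System.out.println(hierarchicalPath("/rmstore/RMAppRoot", "application_1419252540193_0137"));
    // prints: /rmstore/RMAppRoot/application_1419252540193_/01/37
  }
}
{code}
On recovery the RM would then have to list the two intermediate levels before reaching the per-application znodes, which is the extra round-trip cost to ZK mentioned above.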
[jira] [Commented] (YARN-2664) Improve RM webapp to expose info about reservations.
[ https://issues.apache.org/jira/browse/YARN-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255533#comment-14255533 ] Anubhav Dhoot commented on YARN-2664: - Overall this is a very natural way to visualize reservations, great job! It maps to the mental model of skylines. High-level comments/questions: a) This is a view of reservations and does not indicate actual allocations, right? But the legend for the y-axis says Utilization GB. Allocation would be a great addition (knowing how much is left of my reservation, etc.). b) This shows everything in terms of memory but not CPU, right? Should we add a switch to show both, and in future other resource types? Showing them together is more correct but harder to visualize. c) Should we also show the total plan capacity as the end of the y-axis or as an explicit ceiling line? Minor usability issues: a) How are the time window for the slider and the time window selected in the slider chosen? Sometimes it keeps the slider at some point before the current time; at other times it shows future time as part of the view. Also, if there are no reservations it does not advance to the current time until a new reservation shows up? b) Related to the previous point, why does refreshing the page allow me to move the chosen time window forward but not the refresh button? Maybe rename the refresh button to refresh queues? Also provide a refresh time button if c) below cannot be solved? c) Is there a query parameter or some other way to get back to a specific queue? That would avoid having to use the drop-down every time I refresh the page. Improve RM webapp to expose info about reservations. Key: YARN-2664 URL: https://issues.apache.org/jira/browse/YARN-2664 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Carlo Curino Assignee: Matteo Mazzucchelli Attachments: PlannerPage_screenshot.pdf, YARN-2664.1.patch, YARN-2664.2.patch, YARN-2664.3.patch, YARN-2664.4.patch, YARN-2664.5.patch, YARN-2664.6.patch, YARN-2664.7.patch, YARN-2664.patch, legal.patch, screenshot_reservation_UI.pdf YARN-1051 provides new functionality in the RM to ask for reservations on resources. Exposing this through the webapp GUI is important. -- This message was sent by Atlassian JIRA (v6.3.4#6332)