[jira] [Commented] (YARN-2566) IOException happens in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152842#comment-14152842 ] zhihai xu commented on YARN-2566: - Picking the directory with the most available space is a good suggestion. I will implement it in my new patch. Thanks. IOException happens in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch, YARN-2566.001.patch startLocalizer in DefaultContainerExecutor only uses the first localDir to copy the token file. If the copy fails for the first localDir due to not enough disk space there, localization fails even when there is plenty of disk space in the other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at 
org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1410663092546_0004 {code}
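As an illustration of the suggestion accepted above (pick the localDir with the most available space before copying the token file), here is a minimal sketch. It assumes a plain {{java.io.File#getUsableSpace}} check and a {{getApplicationDir}} helper like the one DefaultContainerExecutor already has; the helper name {{getBestApplicationDir}} and the exact signatures are illustrative, not the actual patch.
{code}
// Hypothetical replacement for getFirstApplicationDir: choose the localDir
// with the most usable space and build the app storage dir under it.
private Path getBestApplicationDir(List<String> localDirs, String user, String appId) {
  String bestDir = localDirs.get(0);
  long bestSpace = -1L;
  for (String dir : localDirs) {
    long usable = new java.io.File(dir).getUsableSpace();  // bytes free on this volume
    if (usable > bestSpace) {
      bestSpace = usable;
      bestDir = dir;
    }
  }
  return getApplicationDir(new Path(bestDir), user, appId);
}
{code}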
[jira] [Commented] (YARN-2623) Linux container executor only uses the first local directory to copy the token file in container-executor.c.
[ https://issues.apache.org/jira/browse/YARN-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153023#comment-14153023 ] Remus Rusanu commented on YARN-2623: Note that the DCE also picks the first local dir, DefaultContainerExecutor.java@99: {code} // TODO: Why pick first app dir. The same in LCE why not random? Path appStorageDir = getFirstApplicationDir(localDirs, user, appId); {code} Linux container executor only uses the first local directory to copy the token file in container-executor.c. --- Key: YARN-2623 URL: https://issues.apache.org/jira/browse/YARN-2623 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Environment: Linux container executor only uses the first local directory to copy the token file in container-executor.c. Reporter: zhihai xu Assignee: zhihai xu The Linux container executor only uses the first local directory to copy the token file in container-executor.c. If it fails to copy the token file to the first local directory, a localization failure event is raised, even though the token file could be copied to another local directory successfully. The correct behavior would be to copy the token file to the next local directory if the first one fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
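For the DefaultContainerExecutor side quoted above, a hedged sketch of the fallback behaviour being discussed (try each localDir in turn instead of only the first). {{createDir}}, {{getApplicationDir}} and {{APPDIR_PERM}} are assumed from DefaultContainerExecutor and their exact signatures may differ; this is illustrative, not the patch.
{code}
// Illustrative fallback loop for startLocalizer: keep localization going as
// long as at least one localDir can hold the app directory and token file.
Path appStorageDir = null;
IOException lastFailure = null;
for (String localDir : localDirs) {
  Path candidate = getApplicationDir(new Path(localDir), user, appId);
  try {
    createDir(candidate, new FsPermission(APPDIR_PERM), true);  // may fail, e.g. disk full
    appStorageDir = candidate;
    break;                      // copy the token file into this directory
  } catch (IOException e) {
    lastFailure = e;            // remember the failure and try the next localDir
  }
}
if (appStorageDir == null) {
  throw lastFailure;            // every localDir failed
}
{code}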
[jira] [Commented] (YARN-2545) RMApp should transit to FAILED when AM calls finishApplicationMaster with FAILED
[ https://issues.apache.org/jira/browse/YARN-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153065#comment-14153065 ] Tsuyoshi OZAWA commented on YARN-2545: -- Thanks for the report, [~zhiguohong]. I think we should fix this so that the application state is reported correctly. How about checking the final status of the application and dispatching RMAppEventType#ATTEMPT_FAILED in RMAppAttemptImpl#AMUnregisteredTransition? RMApp should transit to FAILED when AM calls finishApplicationMaster with FAILED Key: YARN-2545 URL: https://issues.apache.org/jira/browse/YARN-2545 Project: Hadoop YARN Issue Type: Bug Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor If the AM calls finishApplicationMaster with getFinalApplicationStatus()==FAILED and then exits, the corresponding RMApp and RMAppAttempt transition to the FINISHED state. I think this is wrong and confusing. On the RM WebUI, this application is displayed as State=FINISHED, FinalStatus=FAILED, and is counted as Apps Completed, not as Apps Failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
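A hedged sketch of that suggestion inside {{RMAppAttemptImpl#AMUnregisteredTransition}}: branch on the final status the AM reported. The exact event class used for ATTEMPT_FAILED (and any diagnostics it carries) is an assumption here, not the committed fix.
{code}
// Illustrative only: route unregistration to a different RMApp event depending
// on the final status passed to finishApplicationMaster.
FinalApplicationStatus finalStatus = unregisterEvent.getFinalApplicationStatus();
ApplicationId appId = appAttempt.getAppAttemptId().getApplicationId();
if (finalStatus == FinalApplicationStatus.FAILED) {
  // Let the RMApp transition to FAILED instead of FINISHED.
  appAttempt.eventHandler.handle(new RMAppEvent(appId, RMAppEventType.ATTEMPT_FAILED));
} else {
  appAttempt.eventHandler.handle(new RMAppEvent(appId, RMAppEventType.ATTEMPT_UNREGISTERED));
}
{code}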
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153091#comment-14153091 ] Hudson commented on YARN-1769: -- FAILURE: Integrated in Hadoop-Yarn-trunk #696 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/696/]) YARN-1769. CapacityScheduler: Improve reservations. Contributed by Thomas Graves (jlowe: rev 9c22065109a77681bc2534063eabe8692fbcb3cd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestParentQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestChildQueueOrder.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * hadoop-yarn-project/CHANGES.txt CapacityScheduler: Improve reservations Key: YARN-1769 URL: https://issues.apache.org/jira/browse/YARN-1769 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Fix For: 2.6.0 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch Currently the CapacityScheduler uses reservations in order to handle requests for large containers and the fact 
that there might not currently be enough space available on a single host. The current algorithm for reservations is to reserve as many containers as currently required and then start reserving more above that after a certain number of re-reservations (currently biased against larger containers). Any time it hits the limit on the number reserved, it stops looking at any other nodes. This can result in missing nodes that have enough space to fulfill the request. The other place for improvement is that reservations currently count against your queue capacity. If you have reservations, you could hit the various limits, which would then stop you from looking further at that node. The above two cases can cause an application requesting a larger container to take a long time to get its resources. We could improve upon both of those by simply continuing to look at incoming nodes to see if we could potentially swap out a reservation for an actual allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
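A rough, hedged sketch of the proposed improvement (keep looking at incoming nodes and trade an existing reservation for a real allocation); every method name below is illustrative rather than taken from the patch.
{code}
// Illustrative only: on a node heartbeat, if this node can now fit a request
// that the application holds merely as a reservation on another node, release
// that reservation and allocate a real container here instead.
ResourceRequest request = application.getResourceRequest(priority, node.getNodeName());
Resource required = request.getCapability();
if (Resources.fitsIn(required, node.getAvailableResource())
    && !application.getReservedContainers().isEmpty()) {
  RMContainer reserved = application.getReservedContainers().get(0);
  unreserve(application, priority, reserved);                 // drop the old reservation
  allocateContainer(application, node, priority, required);   // allocate for real
}
{code}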
[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153086#comment-14153086 ] Hudson commented on YARN-2606: -- FAILURE: Integrated in Hadoop-Yarn-trunk #696 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/696/]) YARN-2606. Application History Server tries to access hdfs before doing secure login (Mit Desai via jeagles) (jeagles: rev e10eeaabce2a21840cfd5899493c9d2d4fe2e322) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestFileSystemApplicationHistoryStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java * hadoop-yarn-project/CHANGES.txt Application History Server tries to access hdfs before doing secure login - Key: YARN-2606 URL: https://issues.apache.org/jira/browse/YARN-2606 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Fix For: 2.6.0 Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, YARN-2606.patch While testing the Application Timeline Server, the server would not come up in a secure cluster, as it would keep trying to access hdfs without having done the secure login. It would repeatedly try authenticating and finally hit stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2625) Problems with CLASSPATH in Job Submission REST API
Doug Haigh created YARN-2625: Summary: Problems with CLASSPATH in Job Submission REST API Key: YARN-2625 URL: https://issues.apache.org/jira/browse/YARN-2625 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 2.5.1 Reporter: Doug Haigh There are a couple of issues I have found with specifying the CLASSPATH environment variable using the REST API. 1) In the Java client, the CLASSPATH environment variable is usually made up of either the yarn.application.classpath value in yarn-site.xml or the default YARN classpath value as defined by YarnConfiguration.DEFAULT_YARN_CROSS_PLATFORM_APPLICATION_CLASSPATH. REST API consumers have no way of telling the resource manager to use the default unless they hardcode the default value themselves. If the default ever changes, their code would need to change. 2) If any environment variables are used in the CLASSPATH environment 'value' field, they are evaluated while their values are null, resulting in bad values in the CLASSPATH. For example, if I hardcode the CLASSPATH value to the default of $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/share/hadoop/common/*, $HADOOP_COMMON_HOME/share/hadoop/common/lib/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*, $HADOOP_YARN_HOME/share/hadoop/yarn/*, $HADOOP_YARN_HOME/share/hadoop/yarn/lib/* the classpath passed to the application master is :/share/hadoop/common/*:/share/hadoop/common/lib/*:/share/hadoop/hdfs/*:/share/hadoop/hdfs/lib/*:/share/hadoop/yarn/*:/share/hadoop/yarn/lib/* These two problems require REST API consumers to always have the fully resolved path defined in the yarn.application.classpath value. If the property is missing or contains environment variables, the application created by the REST API will fail due to the CLASSPATH being incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
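For context, this is roughly what the Java client does today, and it is exactly the fallback a REST consumer cannot ask for: use {{yarn.application.classpath}} if set, otherwise the built-in cross-platform default. A hedged sketch, not the client code verbatim; {{Environment}} here is {{ApplicationConstants.Environment}}.
{code}
// Build the AM CLASSPATH with a fallback to the default when the property is unset.
String[] classpathEntries = conf.getTrimmedStrings(
    YarnConfiguration.YARN_APPLICATION_CLASSPATH,
    YarnConfiguration.DEFAULT_YARN_CROSS_PLATFORM_APPLICATION_CLASSPATH);
StringBuilder classpath = new StringBuilder(Environment.PWD.$$());
for (String entry : classpathEntries) {
  classpath.append(ApplicationConstants.CLASS_PATH_SEPARATOR).append(entry.trim());
}
Map<String, String> environment = new HashMap<String, String>();
environment.put(Environment.CLASSPATH.name(), classpath.toString());
{code}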
[jira] [Updated] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-2617: --- Attachment: YARN-2617.3.patch Updated the patch; deleted an unrelated line. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.patch We ([~chenchun]) are testing RM work-preserving restart and found the following logs when we ran a simple MapReduce PI job. The NM continuously reported completed containers whose application had already finished, even though the AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to clean up already completed applications. But it only removes the appId from 'app.context.getApplications()' when ApplicationImpl receives the event 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM might not receive this event for a long time, or might never receive it. * For NonAggregatingLogHandler, it waits for YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, before it is scheduled to delete the application logs and send the event. * For LogAggregationService, it might fail (e.g., if the user does not have HDFS write permission), and then it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
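A hedged sketch of the kind of filter the patch is aiming for in the NM status-report path; the surrounding variable names ({{completedContainers}}, {{context}}) are approximations of NodeStatusUpdater/Context, not the patch itself.
{code}
// Illustrative only: report a completed container to the RM only if its
// application is still tracked as running on this NodeManager.
List<ContainerStatus> toReport = new ArrayList<ContainerStatus>();
for (ContainerStatus status : completedContainers) {
  ApplicationId appId =
      status.getContainerId().getApplicationAttemptId().getApplicationId();
  if (context.getApplications().containsKey(appId)) {
    toReport.add(status);   // app still running, RM still needs this status
  }
  // else: the app already finished on the RM side; resending the status only
  // produces the "Null container completed" messages shown above.
}
{code}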
[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153200#comment-14153200 ] Hudson commented on YARN-2606: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1887 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1887/]) YARN-2606. Application History Server tries to access hdfs before doing secure login (Mit Desai via jeagles) (jeagles: rev e10eeaabce2a21840cfd5899493c9d2d4fe2e322) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestFileSystemApplicationHistoryStore.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java Application History Server tries to access hdfs before doing secure login - Key: YARN-2606 URL: https://issues.apache.org/jira/browse/YARN-2606 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Fix For: 2.6.0 Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, YARN-2606.patch While testing the Application Timeline Server, the server would not come up in a secure cluster, as it would keep trying to access hdfs without having done the secure login. It would repeatedly try authenticating and finally hit stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153205#comment-14153205 ] Hudson commented on YARN-1769: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1887 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1887/]) YARN-1769. CapacityScheduler: Improve reservations. Contributed by Thomas Graves (jlowe: rev 9c22065109a77681bc2534063eabe8692fbcb3cd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestParentQueue.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestChildQueueOrder.java CapacityScheduler: Improve reservations Key: YARN-1769 URL: https://issues.apache.org/jira/browse/YARN-1769 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Fix For: 2.6.0 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch Currently the CapacityScheduler uses reservations in order to handle requests for large containers and the fact 
that there might not currently be enough space available on a single host. The current algorithm for reservations is to reserve as many containers as currently required and then start reserving more above that after a certain number of re-reservations (currently biased against larger containers). Any time it hits the limit on the number reserved, it stops looking at any other nodes. This can result in missing nodes that have enough space to fulfill the request. The other place for improvement is that reservations currently count against your queue capacity. If you have reservations, you could hit the various limits, which would then stop you from looking further at that node. The above two cases can cause an application requesting a larger container to take a long time to get its resources. We could improve upon both of those by simply continuing to look at incoming nodes to see if we could potentially swap out a reservation for an actual allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153262#comment-14153262 ] Hudson commented on YARN-1769: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1912 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1912/]) YARN-1769. CapacityScheduler: Improve reservations. Contributed by Thomas Graves (jlowe: rev 9c22065109a77681bc2534063eabe8692fbcb3cd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestChildQueueOrder.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestParentQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java * hadoop-yarn-project/CHANGES.txt CapacityScheduler: Improve reservations Key: YARN-1769 URL: https://issues.apache.org/jira/browse/YARN-1769 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Fix For: 2.6.0 Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch Currently the CapacityScheduler uses reservations in order to handle requests for large containers and 
the fact that there might not currently be enough space available on a single host. The current algorithm for reservations is to reserve as many containers as currently required and then start reserving more above that after a certain number of re-reservations (currently biased against larger containers). Any time it hits the limit on the number reserved, it stops looking at any other nodes. This can result in missing nodes that have enough space to fulfill the request. The other place for improvement is that reservations currently count against your queue capacity. If you have reservations, you could hit the various limits, which would then stop you from looking further at that node. The above two cases can cause an application requesting a larger container to take a long time to get its resources. We could improve upon both of those by simply continuing to look at incoming nodes to see if we could potentially swap out a reservation for an actual allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153257#comment-14153257 ] Hudson commented on YARN-2606: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1912 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1912/]) YARN-2606. Application History Server tries to access hdfs before doing secure login (Mit Desai via jeagles) (jeagles: rev e10eeaabce2a21840cfd5899493c9d2d4fe2e322) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestFileSystemApplicationHistoryStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java * hadoop-yarn-project/CHANGES.txt Application History Server tries to access hdfs before doing secure login - Key: YARN-2606 URL: https://issues.apache.org/jira/browse/YARN-2606 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Fix For: 2.6.0 Attachments: YARN-2606.patch, YARN-2606.patch, YARN-2606.patch, YARN-2606.patch While testing the Application Timeline Server, the server would not come up in a secure cluster, as it would keep trying to access hdfs without having done the secure login. It would repeatedly try authenticating and finally hit stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2320) Removing old application history store after we store the history data to timeline store
[ https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153406#comment-14153406 ] Zhijie Shen commented on YARN-2320: --- [~mayank_bansal], thanks for the review. bq. shouldn't we use N/A in convertToApplicationAttemptReport instead of null ? generateApplicationReport is fixed in YARN-2598. Attempt and container reports are different: while the app report hides the details if the user doesn't have access, the attempt and container reports are not shown at all. Removing old application history store after we store the history data to timeline store Key: YARN-2320 URL: https://issues.apache.org/jira/browse/YARN-2320 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2320.1.patch, YARN-2320.2.patch After YARN-2033, we should deprecate the application history store set. There's no need to maintain two sets of store interfaces. In addition, we should close out the outstanding JIRAs under YARN-321 about the application history store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2626) Document of timeline server needs to be updated
Zhijie Shen created YARN-2626: - Summary: Document of timeline server needs to be updated Key: YARN-2626 URL: https://issues.apache.org/jira/browse/YARN-2626 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen After YARN-2033, the document is no longer accurate. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2626) Document of timeline server needs to be updated
[ https://issues.apache.org/jira/browse/YARN-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2626: -- Target Version/s: 2.6.0 Document of timeline server needs to be updated --- Key: YARN-2626 URL: https://issues.apache.org/jira/browse/YARN-2626 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.6.0 Reporter: Zhijie Shen After YARN-2033, the document is no longer accurate. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2626) Document of timeline server needs to be updated
[ https://issues.apache.org/jira/browse/YARN-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2626: -- Component/s: timelineserver Affects Version/s: 2.6.0 Document of timeline server needs to be updated --- Key: YARN-2626 URL: https://issues.apache.org/jira/browse/YARN-2626 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.6.0 Reporter: Zhijie Shen After YARN-2033, the document is no longer accurate. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2594: - Attachment: YARN-2594.patch Attached an updated patch that removes the read lock from several methods in {{RMAppImpl}} that use {{currentAttempt}} only. [~kasha], [~jianhe], would you please take a look? Thanks, Wangda Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Attachments: YARN-2594.patch, YARN-2594.patch The ResourceManager sometimes becomes unresponsive. There was no exception in the ResourceManager log; it contains only the following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
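A hedged sketch of the shape of that change (illustrative names; the actual patch touches several {{RMAppImpl}} getters): a method that only dereferences the volatile {{currentAttempt}} reference no longer takes the app read lock, which removes it from the lock-ordering cycle behind the unresponsiveness.
{code}
// Before: readLock.lock()/unlock() wrapped the same body.
// After (illustrative): a plain volatile read, so the caller can never block
// behind a writer that is itself waiting on an attempt lock.
public ApplicationResourceUsageReport getApplicationResourceUsageReport() {
  RMAppAttempt attempt = this.currentAttempt;   // volatile field in RMAppImpl
  return attempt == null ? null : attempt.getApplicationResourceUsageReport();
}
{code}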
[jira] [Created] (YARN-2627) Add logs when attemptFailuresValidityInterval is enabled
Xuan Gong created YARN-2627: --- Summary: Add logs when attemptFailuresValidityInterval is enabled Key: YARN-2627 URL: https://issues.apache.org/jira/browse/YARN-2627 Project: Hadoop YARN Issue Type: Improvement Reporter: Xuan Gong Assignee: Xuan Gong After YARN-611, users can specify attemptFailuresValidityInterval for their applications. This is for testing/debugging purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2627) Add logs when attemptFailuresValidityInterval is enabled
[ https://issues.apache.org/jira/browse/YARN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2627: Attachment: YARN-2627.1.patch Add logs when attemptFailuresValidityInterval is enabled Key: YARN-2627 URL: https://issues.apache.org/jira/browse/YARN-2627 Project: Hadoop YARN Issue Type: Improvement Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2627.1.patch After YARN-611, users can specify attemptFailuresValidityInterval for their applications. This is for testing/debugging purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153485#comment-14153485 ] Jian Fang commented on YARN-1198: - I tried to merge in YARN-1857.3.patch and then merge in YARN-1198.7.patch, since people favor this patch over the .8 patch. It seems the change in the following method cancels out the update from YARN-1857.
{code}
private Resource getHeadroom(User user, Resource queueMaxCap,
    Resource clusterResource, Resource userLimit) {
  Resource headroom = Resources.subtract(
      Resources.min(resourceCalculator, clusterResource, userLimit, queueMaxCap),
      user.getConsumedResources());
  return headroom;
}
{code}
Shouldn't it be the following one if I merge both YARN-1857 and YARN-1198?
{code}
private Resource getHeadroom(User user, Resource queueMaxCap,
    Resource clusterResource, Resource userLimit) {
  Resource headroom = Resources.min(resourceCalculator, clusterResource,
      Resources.subtract(
          Resources.min(resourceCalculator, clusterResource, userLimit, queueMaxCap),
          user.getConsumedResources()),
      Resources.subtract(queueMaxCap, usedResources));
  return headroom;
}
{code}
Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Craig Welch Attachments: YARN-1198.1.patch, YARN-1198.2.patch, YARN-1198.3.patch, YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch, YARN-1198.8.patch Today headroom calculation (for the app) takes place only when * a new node is added to or removed from the cluster * a new container is assigned to the application. However, there are potentially a lot of situations that are not considered in this calculation: * If a container finishes, the headroom for that application will change and the AM should be notified accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue, then ** if app1's container finishes, not only app1's but also app2's AM should be notified about the change in headroom; ** similarly, if a container is assigned to either application, both AMs should be notified about their headroom; ** to simplify the whole communication process, it is ideal to keep headroom per user per LeafQueue so that everyone gets the same picture (apps belonging to the same user and submitted to the same queue). * If a new user submits an application to the queue, all applications submitted by all users in that queue should be notified of the headroom change. * Also, today headroom is an absolute number (I think it should be normalized, but that would not be backward compatible...). * Also, when the admin refreshes the queues, headroom has to be updated. These are all potential bugs in the headroom calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153491#comment-14153491 ] Craig Welch commented on YARN-1972: --- [~rusanu] [~vinodkv], as on [YARN-1063], we can go ahead and address these comments as part of the [YARN-2198] effort; it's not necessary to resolve them before these patches are committed. Implement secure Windows Container Executor --- Key: YARN-1972 URL: https://issues.apache.org/jira/browse/YARN-1972 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch, YARN-1972.delta.4.patch, YARN-1972.delta.5.patch, YARN-1972.trunk.4.patch, YARN-1972.trunk.5.patch h1. Windows Secure Container Executor (WCE) YARN-1063 adds the necessary infrastructure to launch a process as a domain user, as a solution for the problem of having a security boundary between processes executed in YARN containers and the Hadoop services. The WCE is a container executor that leverages the winutils capabilities introduced in YARN-1063 and launches containers as an OS process running as the job submitter user. A description of the S4U infrastructure used by YARN-1063 and the alternatives considered can be read on that JIRA. The WCE is based on the DefaultContainerExecutor. It relies on the DCE to drive the flow of execution, but it overrides some methods to the effect of: * changing the DCE-created user cache directories to be owned by the job user and by the nodemanager group. * changing the actual container run command to use the 'createAsUser' command of winutils task instead of 'create' * running the localization as a standalone process instead of an in-process Java method call. This in turn relies on the winutils createAsUser feature to run the localization as the job user. When compared to the LinuxContainerExecutor (LCE), the WCE has some minor differences: * it does not delegate the creation of the user cache directories to the native implementation. * it does not require special handling to be able to delete user files The WCE approach came from practical trial and error. I had to iron out some issues around the Windows script shell limitations (command line length) to get it to work, the biggest issue being the huge CLASSPATH that is commonplace in Hadoop container executions. The job container itself already deals with this via a so-called 'classpath jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer, launched as a separate container, the same issue had to be resolved, and I used the same 'classpath jar' approach. h2. Deployment Requirements To use the WCE one needs to set `yarn.nodemanager.container-executor.class` to `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` and set `yarn.nodemanager.windows-secure-container-executor.group` to a Windows security group name that the nodemanager service principal is a member of (the equivalent of the LCE's `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE does not require any configuration outside of Hadoop's own yarn-site.xml. For the WCE to work the nodemanager must run as a service principal that is a member of the local Administrators group, or as LocalSystem. This is derived from the need to invoke the LoadUserProfile API, which mentions these requirements in its specification. 
This is in addition to the SE_TCB privilege mentioned in YARN-1063, but this requirement automatically implies that the SE_TCB privilege is held by the nodemanager. For the Linux speakers in the audience, the requirement is basically to run the NM as root. h2. Dedicated high-privilege service Due to the high privilege required by the WCE, we had discussed the need to isolate the high-privilege operations into a separate process, an 'executor' service that is solely responsible for starting the containers (including the localizer). The NM would have to authenticate, authorize and communicate with this service via an IPC mechanism and use it to launch the containers. I still believe we'll end up deploying such a service, but the effort to onboard such a new platform-specific service onto the project is not trivial. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
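For reference, a hedged illustration of the two settings named above, expressed programmatically (in practice they go into yarn-site.xml); the group name is a placeholder, not a real default.
{code}
// Illustrative WCE deployment configuration; "hadoopnmgroup" is a hypothetical
// Windows security group that the NM service principal belongs to.
Configuration conf = new YarnConfiguration();
conf.set("yarn.nodemanager.container-executor.class",
    "org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor");
conf.set("yarn.nodemanager.windows-secure-container-executor.group",
    "hadoopnmgroup");
{code}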
[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user
[ https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153490#comment-14153490 ] Craig Welch commented on YARN-1063: --- [~rusanu] [~vinodkv], we can go ahead and address these comments as part of the [YARN-2198] effort; it's not necessary to resolve them before these patches are committed. Winutils needs ability to create task as domain user Key: YARN-1063 URL: https://issues.apache.org/jira/browse/YARN-1063 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Environment: Windows Reporter: Kyle Leckie Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, YARN-1063.5.patch, YARN-1063.6.patch, YARN-1063.patch h1. Summary: Securing a Hadoop cluster requires constructing some form of security boundary around the processes executed in YARN containers. Isolation based on Windows user isolation seems most feasible. This approach is similar to the approach taken by the existing LinuxContainerExecutor. The current patch to winutils.exe adds the ability to create a process as a domain user. h1. Alternative Methods considered: h2. Process rights limited by security token restriction: On Windows, access decisions are made by examining the security token of a process. It is possible to spawn a process with a restricted security token. Any of the rights granted by SIDs of the default token may be restricted. It is possible to see this in action by examining the security token of a sandboxed process launched by a web browser. Typically the launched process will have a fully restricted token and needs to access machine resources through a dedicated broker process that enforces a custom security policy. This broker process mechanism would break compatibility with the typical Hadoop container process. The container process must be able to utilize standard function calls for disk and network IO. I performed some work looking at ways to ACL the local files to the specific launched process without granting rights to other processes launched on the same machine, but found this to be an overly complex solution. h2. Relying on APP containers: Recent versions of Windows have the ability to launch processes within an isolated container. Application containers are supported for execution of WinRT-based executables. This method was ruled out due to the lack of official support for standard Windows APIs. At some point in the future Windows may support functionality similar to BSD jails or Linux containers; at that point support for containers should be added. h1. Create As User Feature Description: h2. Usage: A new sub-command was added to the set of task commands. Here is the syntax: winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE] Some notes: * The username specified is in the format of user@domain * The machine executing this command must be joined to the domain of the user specified * The domain controller must allow the account executing the command access to the user information. For this, join the account to the predefined group labeled Pre-Windows 2000 Compatible Access * The account running the command must have several rights on the local machine. 
These can be managed manually using secpol.msc: ** Act as part of the operating system - SE_TCB_NAME ** Replace a process-level token - SE_ASSIGNPRIMARYTOKEN_NAME ** Adjust memory quotas for a process - SE_INCREASE_QUOTA_NAME * The launched process will not have rights to the desktop, so it will not be able to display any information or create a UI. * The launched process will have no network credentials. Any access to network resources that requires domain authentication will fail. h2. Implementation: Winutils performs the following steps: # Enable the required privileges for the current process. # Register as a trusted process with the Local Security Authority (LSA). # Create a new logon for the user passed on the command line. # Load/Create a profile on the local machine for the new logon. # Create a new environment for the new logon. # Launch the new process in a job with the task name specified and using the created logon. # Wait for the job to exit. h2. Future work: The following work was scoped out of this check-in: * Support for non-domain users or machines that are not domain-joined. * Support for privilege isolation by running the task launcher in a high-privilege service with access over an ACLed named pipe. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153494#comment-14153494 ] Craig Welch commented on YARN-2198: --- Bringing over some comments from [YARN-1063] When looking this over to pickup context for 2198, I noticed a couple things: libwinutils.c CreateLogonForUser - confusing name, makes me think a new account is being created - CreateLogonTokenForUser? LogonUser? TestWinUtils - can we add testing specific to security? and from [YARN-1972] ContainerLaunch launchContainer - nit, why userName here, it's user everywhere else getLocalWrapperScriptBuilder - why not an override instead of conditional (see below wrt WindowsContainerExecutor) WindowsSecureContainerExecutor - I really think there should be a WindowsContainerExecutor and that we should go ahead and have differences move generally to inheritance rather than conditional (as far as reasonable/related to the change, and incrementally as we go forward, no need to boil the ocean, but it would be good to set a good foundation here) Windows specific logic, secure or not, should be based in this class. If the differences required for security specific logic are significant enough, by all means also have a WindowsSecureContainerExecutor which inherits from WindowsContainerExecutor. I think, as much as possible, the logic should be the same for both - with only the security specific functionality as a delta (right now, it looks like non-secure windows uses default for implementation, and may differ more from the windows secure than it should) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, YARN-2198.separation.patch, YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch YARN-1972 introduces a Secure Windows Container Executor. However this executor requires the process launching the container to be LocalSystem or a member of the a local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM to run as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. 
The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153499#comment-14153499 ] Craig Welch commented on YARN-1198: --- That's not intentional - I think it's just a side effect of where the changes are taking place, and it will require some manual fixup to keep both changes together. I expected that [YARN-1857] would be committed first, and then I would fix up this patch to reflect the change. Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Craig Welch Attachments: YARN-1198.1.patch, YARN-1198.2.patch, YARN-1198.3.patch, YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch, YARN-1198.8.patch Today headroom calculation (for the app) takes place only when * a new node is added to or removed from the cluster * a new container is assigned to the application. However, there are potentially a lot of situations that are not considered in this calculation: * If a container finishes, the headroom for that application will change and the AM should be notified accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue, then ** if app1's container finishes, not only app1's but also app2's AM should be notified about the change in headroom; ** similarly, if a container is assigned to either application, both AMs should be notified about their headroom; ** to simplify the whole communication process, it is ideal to keep headroom per user per LeafQueue so that everyone gets the same picture (apps belonging to the same user and submitted to the same queue). * If a new user submits an application to the queue, all applications submitted by all users in that queue should be notified of the headroom change. * Also, today headroom is an absolute number (I think it should be normalized, but that would not be backward compatible...). * Also, when the admin refreshes the queues, headroom has to be updated. These are all potential bugs in the headroom calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2627) Add logs when attemptFailuresValidityInterval is enabled
[ https://issues.apache.org/jira/browse/YARN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2627: Attachment: YARN-2627.2.patch Add logs when attemptFailuresValidityInterval is enabled Key: YARN-2627 URL: https://issues.apache.org/jira/browse/YARN-2627 Project: Hadoop YARN Issue Type: Improvement Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2627.1.patch, YARN-2627.2.patch After YARN-611, users can specify attemptFailuresValidityInterval for their applications. This is for testing/debugging purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153512#comment-14153512 ] Craig Welch commented on YARN-1198: --- [~leftnoteasy] [~john.jian.fang], it sounds like the .7 approach is the way to go. Jian had a tweak to this approach which he suggested here: [https://issues.apache.org/jira/browse/YARN-1198?focusedCommentId=14122078&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14122078] - on the whole the same thing happens, but it might be a cleaner way to do it. I was hoping to give it a go so that we could compare it with .7 before closing this up. Thoughts? Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Craig Welch Attachments: YARN-1198.1.patch, YARN-1198.2.patch, YARN-1198.3.patch, YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch, YARN-1198.8.patch Today headroom calculation (for the app) takes place only when * a new node is added to or removed from the cluster * a new container is assigned to the application. However, there are potentially a lot of situations that are not considered in this calculation: * If a container finishes, the headroom for that application will change and the AM should be notified accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue, then ** if app1's container finishes, not only app1's but also app2's AM should be notified about the change in headroom; ** similarly, if a container is assigned to either application, both AMs should be notified about their headroom; ** to simplify the whole communication process, it is ideal to keep headroom per user per LeafQueue so that everyone gets the same picture (apps belonging to the same user and submitted to the same queue). * If a new user submits an application to the queue, all applications submitted by all users in that queue should be notified of the headroom change. * Also, today headroom is an absolute number (I think it should be normalized, but that would not be backward compatible...). * Also, when the admin refreshes the queues, headroom has to be updated. These are all potential bugs in the headroom calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2627) Add logs when attemptFailuresValidityInterval is enabled
[ https://issues.apache.org/jira/browse/YARN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153519#comment-14153519 ] Zhijie Shen commented on YARN-2627: --- +1, will commit after Jenkins' feedback. Add logs when attemptFailuresValidityInterval is enabled Key: YARN-2627 URL: https://issues.apache.org/jira/browse/YARN-2627 Project: Hadoop YARN Issue Type: Improvement Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2627.1.patch, YARN-2627.2.patch After YARN-611, users can specify attemptFailuresValidityInterval for their applications. This is for testing/debug purpose. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-1963: -- Attachment: (was: YARN Application Priorities Design.pdf) Support priorities across applications within the same queue - Key: YARN-1963 URL: https://issues.apache.org/jira/browse/YARN-1963 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Reporter: Arun C Murthy Assignee: Sunil G It will be very useful to support priorities among applications within the same queue, particularly in production scenarios. It allows for finer-grained controls without having to force admins to create a multitude of queues, plus allows existing applications to continue using existing queues which are usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-1963: -- Attachment: YARN Application Priorities Design.pdf Attached updated design doc capturing comments. Thank you. Support priorities across applications within the same queue - Key: YARN-1963 URL: https://issues.apache.org/jira/browse/YARN-1963 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Reporter: Arun C Murthy Assignee: Sunil G Attachments: YARN Application Priorities Design.pdf It will be very useful to support priorities among applications within the same queue, particularly in production scenarios. It allows for finer-grained controls without having to force admins to create a multitude of queues, plus allows existing applications to continue using existing queues which are usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations
[ https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153526#comment-14153526 ] Wangda Tan commented on YARN-2494: -- I feel the Cluster is almost a superset of Collection. I prefer the set of method names [~cwelch] suggested. Maybe {{ClusterNodeLabelsCollection}} is slightly clearer than {{ClusterNodeLabels}}, but I think that name is too long :) [YARN-796] Node label manager API and storage implementations - Key: YARN-2494 URL: https://issues.apache.org/jira/browse/YARN-2494 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, YARN-2494.patch This JIRA includes APIs and storage implementations of the node label manager. NodeLabelManager is an abstract class used to manage labels of nodes in the cluster; it has APIs to query/modify - Nodes according to a given label - Labels according to a given hostname - Add/remove labels - Set labels of nodes in the cluster - Persist/recover changes of labels/labels-on-nodes to/from storage And it has two implementations to store modifications - Memory based storage: It will not persist changes, so all labels will be lost when the RM restarts - FileSystem based storage: It will persist/recover to/from a FileSystem (like HDFS), and all labels and labels-on-nodes will be recovered upon RM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
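As a rough illustration of the abstract API surface described in the YARN-2494 summary above, a minimal sketch follows; the class and method names here are assumptions for readability, not the actual patch contents.

{code}
import java.io.IOException;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the manager API described in YARN-2494; names and
// signatures are assumptions, not the attached patch.
public abstract class NodeLabelManagerSketch {
  /** Nodes that currently carry the given label. */
  public abstract Set<String> getNodesWithLabel(String label) throws IOException;

  /** Labels currently set on the given host. */
  public abstract Set<String> getLabelsOnNode(String hostname) throws IOException;

  /** Add labels to the cluster-wide collection of known labels. */
  public abstract void addClusterNodeLabels(Set<String> labels) throws IOException;

  /** Remove labels from the cluster-wide collection. */
  public abstract void removeClusterNodeLabels(Set<String> labels) throws IOException;

  /** Replace the labels on a set of nodes. */
  public abstract void setLabelsOnNodes(Map<String, Set<String>> hostToLabels) throws IOException;

  /** Persist pending changes; a memory-backed store may make this a no-op. */
  public abstract void persist() throws IOException;

  /** Recover labels and labels-on-nodes from the backing store on RM restart. */
  public abstract void recover() throws IOException;
}
{code}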
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153536#comment-14153536 ] Karthik Kambatla commented on YARN-2594: We need to handle getFinalApplicationStatus, and may be {{createAndGetApplicationReport}} as well. In the latter, we can replace direct access of {{diagnostics}} with {{getDiagnostics}} to avoid races on diagnostics. Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Attachments: YARN-2594.patch, YARN-2594.patch ResoruceManager sometimes become un-responsive: There was in exception in ResourceManager log and contains only following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2610) Hamlet should close table tags
[ https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153539#comment-14153539 ] Karthik Kambatla commented on YARN-2610: Checking this in.. Hamlet should close table tags -- Key: YARN-2610 URL: https://issues.apache.org/jira/browse/YARN-2610 Project: Hadoop YARN Issue Type: Bug Reporter: Ray Chiang Assignee: Ray Chiang Labels: supportability Attachments: YARN-2610-01.patch, YARN-2610-02.patch Revisiting a subset of MAPREDUCE-2993. The th, td, thead, tfoot, tr tags are not configured to close properly in Hamlet. While this is allowed in HTML 4.01, missing closing table tags tends to wreak havoc with a lot of HTML processors (although not usually browsers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153541#comment-14153541 ] Karthik Kambatla commented on YARN-2594: Also, it would be nice to add a comment next to the declaration of currentAttempt to say it is not protected by the readLock. Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Attachments: YARN-2594.patch, YARN-2594.patch ResoruceManager sometimes become un-responsive: There was in exception in ResourceManager log and contains only following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2610) Hamlet should close table tags
[ https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153555#comment-14153555 ] Hudson commented on YARN-2610: -- FAILURE: Integrated in Hadoop-trunk-Commit #6154 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6154/]) YARN-2610. Hamlet should close table tags. (Ray Chiang via kasha) (kasha: rev f7743dd07dfbe0dde9be71acfaba16ded52adba7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/webapp/hamlet/TestHamlet.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/webapp/view/TestInfoBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/hamlet/Hamlet.java * hadoop-yarn-project/CHANGES.txt Hamlet should close table tags -- Key: YARN-2610 URL: https://issues.apache.org/jira/browse/YARN-2610 Project: Hadoop YARN Issue Type: Bug Reporter: Ray Chiang Assignee: Ray Chiang Labels: supportability Fix For: 2.6.0 Attachments: YARN-2610-01.patch, YARN-2610-02.patch Revisiting a subset of MAPREDUCE-2993. The th, td, thead, tfoot, tr tags are not configured to close properly in Hamlet. While this is allowed in HTML 4.01, missing closing table tags tends to wreak havoc with a lot of HTML processors (although not usually browsers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153557#comment-14153557 ] Hadoop QA commented on YARN-2594: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672075/YARN-2594.patch against trunk revision ea32a66. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5182//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5182//console This message is automatically generated. Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Attachments: YARN-2594.patch, YARN-2594.patch ResoruceManager sometimes become un-responsive: There was in exception in ResourceManager log and contains only following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153608#comment-14153608 ] zhihai xu commented on YARN-2594: - Hi [~leftnoteasy], It will be good to use a local variable to save currentAttempt to avoid any potential null pointer exception in the future. RMAppAttempt attempt = this.currentAttempt; if (attempt != null) { return attempt.getTrackingUrl(); } Without lock, it is possible that this.currentAttempt will be changed between null check and calling getTrackingUrl. Using a local variable to save currentAttempt will solve this race condition. Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Attachments: YARN-2594.patch, YARN-2594.patch ResoruceManager sometimes become un-responsive: There was in exception in ResourceManager log and contains only following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
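For readability, a fenced version of the local-variable pattern zhihai xu describes above; this is a self-contained sketch (the Attempt interface stands in for RMAppAttempt), not the exact code in the YARN-2594 patch.

{code}
// Sketch of the race-free read pattern: copy the shared reference once,
// then do the null check and the dereference on the same local copy.
public class TrackingUrlSketch {
  /** Stand-in for RMAppAttempt; hypothetical, for illustration only. */
  interface Attempt {
    String getTrackingUrl();
  }

  private volatile Attempt currentAttempt; // may be swapped by another thread

  public String getTrackingUrl() {
    Attempt attempt = this.currentAttempt;  // single read of the shared field
    if (attempt != null) {
      return attempt.getTrackingUrl();      // cannot NPE even if the field changes now
    }
    return null;
  }
}
{code}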
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153611#comment-14153611 ] Remus Rusanu commented on YARN-2198: [~cwelch]: thanks for the review! I will address many of the comments with a new patch; in the meantime, some replies on issues I won't address: pom.xml - don’t see a /etc/hadoop or a wsce-site.xml, missed? RR: Not sure what you mean. Do you expect a default wsce-site.xml in hadoop-common/src/conf ? return (parent == null || parent2f.exists() || mkdirs(parent)) && (mkOneDir(p2f) || p2f.isDirectory()); so, I don't get this logic; I believe it will fail if the path exists and is not a directory. Why not just do if p2f doesn't exist mkdirs(p2f)? seems much simpler, and drops the need for mkOneDir RR: This is actually the result of a problem Kevin hit during test deployments when the NM has access to child dirs but is denied access to parent dirs. The old NM code would attempt to mkdir every dir in the parent path, all the way to /. With existing dirs with access denied, this would fail, hence the need for my change. There is already a check in the unmodified code for the parent existing and not being a dir, a couple of lines above my change. TestWinUtils: can we add testing specific to security? RR: I would like to add some, but it is not at all easy. The core tenet of the WSCE is the elevated privilege required for S4U impersonation and having tests depend on that would pose many problems (false failures). Basically, starting the hadoopwinutilsvc service on the test box is unfeasible. WindowsSecureContainerExecutor - I really think there should be a WindowsContainerExecutor RR: While I agree that the class architecture separation of secure vs. non-secure and Windows vs. Linux leaves room for improvement, it is not my goal with these JIRAs to address that problem. In fact I do have an explicit opposite mandate, to disturb all the non-secure code paths as little as possible, to minimize regression risks. Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, YARN-2198.separation.patch, YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch YARN-1972 introduces a Secure Windows Container Executor. However this executor requires the process launching the container to be LocalSystem or a member of the local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM running as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. 
My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
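To make the directory-creation behaviour discussed in the review reply above concrete, here is a self-contained sketch: only parents that do not already exist are created, so an existing but access-denied parent no longer fails the whole chain. The name mkdirsLenient is hypothetical; this is not the actual libwinutils or FileSystem change.

{code}
import java.io.File;

public class LenientMkdirsSketch {
  /**
   * Create {@code dir} and any missing parents. Parents that already exist are
   * left alone even if we cannot write to them; only missing ones are created.
   */
  public static boolean mkdirsLenient(File dir) {
    if (dir == null) {
      return false;
    }
    File parent = dir.getParentFile();
    // Recurse only while the parent is actually missing; an existing parent with
    // restrictive permissions is accepted as-is instead of being re-created.
    return (parent == null || parent.exists() || mkdirsLenient(parent))
        && (dir.mkdir() || dir.isDirectory());
  }

  public static void main(String[] args) {
    System.out.println(mkdirsLenient(new File("/tmp/wsce-sketch/a/b/c")));
  }
}
{code}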
[jira] [Commented] (YARN-2627) Add logs when attemptFailuresValidityInterval is enabled
[ https://issues.apache.org/jira/browse/YARN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153618#comment-14153618 ] Hadoop QA commented on YARN-2627: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672087/YARN-2627.2.patch against trunk revision cdf1af0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5184//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5184//console This message is automatically generated. Add logs when attemptFailuresValidityInterval is enabled Key: YARN-2627 URL: https://issues.apache.org/jira/browse/YARN-2627 Project: Hadoop YARN Issue Type: Improvement Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2627.1.patch, YARN-2627.2.patch After YARN-611, users can specify attemptFailuresValidityInterval for their applications. This is for testing/debug purpose. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2627) Add logs when attemptFailuresValidityInterval is enabled
[ https://issues.apache.org/jira/browse/YARN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153628#comment-14153628 ] Hadoop QA commented on YARN-2627: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672087/YARN-2627.2.patch against trunk revision cdf1af0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5185//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5185//console This message is automatically generated. Add logs when attemptFailuresValidityInterval is enabled Key: YARN-2627 URL: https://issues.apache.org/jira/browse/YARN-2627 Project: Hadoop YARN Issue Type: Improvement Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2627.1.patch, YARN-2627.2.patch After YARN-611, users can specify attemptFailuresValidityInterval for their applications. This is for testing/debug purpose. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades
[ https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153656#comment-14153656 ] Hadoop QA commented on YARN-2613: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672090/YARN-2613.3.patch against trunk revision f7743dd. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5186//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5186//console This message is automatically generated. NMClient doesn't have retries for supporting rolling-upgrades - Key: YARN-2613 URL: https://issues.apache.org/jira/browse/YARN-2613 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2613.1.patch, YARN-2613.2.patch, YARN-2613.3.patch While NM is rolling upgrade, client should retry NM until it comes up. This jira is to add a NMProxy (similar to RMProxy) with retry implementation to support rolling upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2627) Add logs when attemptFailuresValidityInterval is enabled
[ https://issues.apache.org/jira/browse/YARN-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153672#comment-14153672 ] Hudson commented on YARN-2627: -- FAILURE: Integrated in Hadoop-trunk-Commit #6155 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6155/]) YARN-2627. Added the info logs of attemptFailuresValidityInterval and number of previous failed attempts. Contributed by Xuan Gong. (zjshen: rev 9582a50176800433ad3fa8829a50c28b859812a3) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/CHANGES.txt Add logs when attemptFailuresValidityInterval is enabled Key: YARN-2627 URL: https://issues.apache.org/jira/browse/YARN-2627 Project: Hadoop YARN Issue Type: Improvement Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2627.1.patch, YARN-2627.2.patch After YARN-611, users can specify attemptFailuresValidityInterval for their applications. This is for testing/debug purpose. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
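For context, a rough illustration of the kind of INFO logging the commit message above describes (the configured attemptFailuresValidityInterval plus the number of failed attempts counted inside that window). This is a hedged sketch with invented values, not the code committed to RMAppImpl.

{code}
import java.util.ArrayList;
import java.util.List;

public class ValidityIntervalLogSketch {
  /** Count attempt failures that happened inside the validity window. */
  static int countRecentFailures(List<Long> failureTimestamps,
                                 long validityIntervalMs, long nowMs) {
    int count = 0;
    for (long ts : failureTimestamps) {
      if (validityIntervalMs <= 0 || nowMs - ts <= validityIntervalMs) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    List<Long> failures = new ArrayList<>();
    failures.add(now - 5_000);    // recent failure, inside the window
    failures.add(now - 600_000);  // old failure, outside the window
    long interval = 60_000;       // attemptFailuresValidityInterval, hypothetical value
    // The committed change logs roughly this kind of information at INFO level.
    System.out.println("attemptFailuresValidityInterval = " + interval
        + " ms, failed attempts in window = "
        + countRecentFailures(failures, interval, now));
  }
}
{code}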
[jira] [Updated] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2594: - Attachment: YARN-2594.patch [~zxu], bq. It will be good to use a local variable to save currentAttempt to avoid any potential null pointer exception in the future. Good catch! Addressed, [~kasha], bq. We need to handle getFinalApplicationStatus, and may be createAndGetApplicationReport as well. In the latter, we can replace direct access of diagnostics with getDiagnostics to avoid races on diagnostics. {{getFinalApplicationStatus}} has access to statemachine.getCurrentState(), and {{createAndGetApplicationReport}} has accesses on statemachine.getCurrentState() and other Fields. To minimize scope to solve the problem we can see now, I would suggest to keep other fields as-is. bq. Also, it would be nice to add a comment next to the declaration of currentAttempt to say it is not protected by the readLock. Addressed, New patch attached. Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Attachments: YARN-2594.patch, YARN-2594.patch, YARN-2594.patch ResoruceManager sometimes become un-responsive: There was in exception in ResourceManager log and contains only following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153697#comment-14153697 ] Steve Loughran commented on YARN-913: - A couple of offline comments from sanjay radia # don't publish full path in {{RegistryPathStatus}} fields; it only makes moving to indirection and cross references harder in future. # don't differentiate hadoop-classic IPC from hadoop protobuf in protocol list. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, YARN-913-013.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-90: -- Attachment: apache-yarn-90.7.patch Uploaded a new patch to address the comments by Jason. {quote} bq.I've changed it to Disk(s) health report: . My only concern with this is that there might be scripts looking for the Disk(s) failed log line for monitoring. What do you think? If that's true then the code should bother to do a diff between the old disk list and the new one, logging which disks turned bad using the Disk(s) failed line and which disks became healthy with some other log message. {quote} Fixed. We now have two log messages - one indicating when disks go bad and one when disks get marked as good. {quote} bq.Directories are only cleaned up during startup. The code tests for existence of the directories and the correct permissions. This does mean that container directories left behind for any reason won't get cleaned up unit the NodeManager is restarted. Is that ok? This could still be problematic for the NM work-preserving restart case, as we could try to delete an entire disk tree with active containers on it due to a hiccup when the NM restarts. I think a better approach is a periodic cleanup scan that looks for directories under yarn-local and yarn-logs that shouldn't be there. This could be part of the health check scan or done separately. That way we don't have to wait for a disk to turn good or bad to catch leaked entities on the disk due to some hiccup. Sorta like an fsck for the NM state on disk. That is best done as a separate JIRA, as I think this functionality is still an incremental improvement without it. {quote} The current code will only cleanup if the NM recovery can't be carried out. {noformat} if (!stateStore.canRecover()) { cleanUpLocalDirs(lfs, delService); initializeLocalDirs(lfs); initializeLogDirs(lfs); } {noformat} Will that handle the case you mentioned? bq. checkDirs unnecessarily calls union(errorDirs, fullDirs) twice. Fixed. bq. isDiskFreeSpaceOverLimt is now named backwards, as the code returns true if the free space is under the limit. Fixed. bq. getLocalDirsForCleanup and getLogDirsForCleanup should have javadoc comments like the other methods. Fixed. {quote} Nit: The union utility function doesn't technically perform a union but rather a concatenation, and it'd be a little clearer if the name reflected that. Also the function should leverage the fact that it knows how big the ArrayList will be after the operations and give it the appropriate hint to its constructor to avoid reallocations. {quote} Fixed. NodeManager should identify failed disks becoming good back again - Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs restart. This JIRA is to improve NodeManager to reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
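A small sketch of the sized-concatenation utility Jason's review comment asks for above: pre-size the ArrayList so no reallocation happens while combining the error and full directory lists. The helper name is hypothetical and not necessarily what apache-yarn-90.7.patch uses.

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DirListUtilSketch {
  /** Concatenate two lists, hinting the final size to the ArrayList constructor. */
  static <T> List<T> concat(List<T> first, List<T> second) {
    List<T> result = new ArrayList<>(first.size() + second.size());
    result.addAll(first);
    result.addAll(second);
    return result;
  }

  public static void main(String[] args) {
    List<String> errorDirs = Arrays.asList("/data1/yarn/local");
    List<String> fullDirs = Arrays.asList("/data2/yarn/local", "/data3/yarn/local");
    // e.g. building the combined "bad dirs" view once instead of unioning twice
    System.out.println(concat(errorDirs, fullDirs));
  }
}
{code}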
[jira] [Commented] (YARN-2623) Linux container executor only use the first local directory to copy token file in container-executor.c.
[ https://issues.apache.org/jira/browse/YARN-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153709#comment-14153709 ] zhihai xu commented on YARN-2623: - Thanks for the information. I think the better way to solve the issue is to choose the local directory with the most free disk space. I will implement the patch so that the token file is copied to that directory. Linux container executor only use the first local directory to copy token file in container-executor.c. --- Key: YARN-2623 URL: https://issues.apache.org/jira/browse/YARN-2623 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Environment: Linux container executor only use the first local directory to copy token file in container-executor.c. Reporter: zhihai xu Assignee: zhihai xu The Linux container executor only uses the first local directory to copy the token file in container-executor.c. If copying the token file to the first local directory fails, a localization failure is raised even though the token file could have been copied successfully to another local directory. The correct behavior would be to try the next local directory if the first one fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
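A minimal sketch of the selection policy described above, picking the local directory with the most usable space. Java is used here for illustration only; the actual YARN-2623 change would live in container-executor.c, and the helper name is hypothetical.

{code}
import java.io.File;
import java.util.Arrays;
import java.util.List;

public class LocalDirChooserSketch {
  /** Return the local dir with the most usable space, or null if the list is empty. */
  static File pickDirWithMostFreeSpace(List<File> localDirs) {
    File best = null;
    long bestFree = -1;
    for (File dir : localDirs) {
      long free = dir.getUsableSpace(); // 0 for non-existent paths
      if (free > bestFree) {
        bestFree = free;
        best = dir;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<File> dirs = Arrays.asList(new File("/hadoop/d1"), new File("/hadoop/d2"));
    System.out.println("Copy token file under: " + pickDirWithMostFreeSpace(dirs));
  }
}
{code}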
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153735#comment-14153735 ] Steve Loughran commented on YARN-913: - w.r.t first comment. putting path in the registry status field * we don't actually need to do this, not if we return the stat'd entries as a map of name:status. * and we can pull that operation,, currently called {{listFully}} out of operations and put in {{RegistryOperationsUtils}}. this will make clear it's a separate operation; we can emphasise it's non-atomic. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, YARN-913-013.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2624) Resource Localization fails on a cluster due to existing cache directories
[ https://issues.apache.org/jira/browse/YARN-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2624: Description: We have found resource localization fails on a cluster with following error in certain cases. {noformat} INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml, 1412027745352, FILE, null },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING} java.io.IOException: Rename cannot overwrite non empty destination directory /data/yarn/nm/filecache/27 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59) {noformat} was: We have found resource localization fails on a secure cluster with following error in certain cases. This happens at some indeterminate point after which it will keep failing until NM is restarted. {noformat} INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml, 1412027745352, FILE, null },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING} java.io.IOException: Rename cannot overwrite non empty destination directory /data/yarn/nm/filecache/27 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59) {noformat} Summary: Resource Localization fails on a cluster due to existing cache directories (was: Resource Localization fails on a secure cluster until nm are restarted) Resource Localization fails on a cluster due to existing cache directories -- Key: YARN-2624 URL: https://issues.apache.org/jira/browse/YARN-2624 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot We have found resource localization fails on a cluster with following error in certain cases. 
{noformat} INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml, 1412027745352, FILE, null },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING} java.io.IOException: Rename cannot overwrite non empty destination directory /data/yarn/nm/filecache/27 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153740#comment-14153740 ] Steve Loughran commented on YARN-913: - w.r.t [~aw]'s comments: bq. If a client needs to talk to more than one ZK, it sounds like they are basically screwed. If you are grabbing binding/configs via the CLI, it's not a worry, nor if you are talking to 1 ZK quorum with the same auth policy. Its when you start tuning SASL auth and some various timeouts that this arises. This is not an issue with the registry, it's the ZK client here. bq. I was mainly looking at the hostname pattern: {code} + String HOSTNAME_PATTERN = + ([a-z0-9]|[a-z0-9][a-z0-9\\-]*[a-z0-9]); {code} bq. It doesn't appear to support periods/dots. That's just the pattern for entries in the registry path itself; you can't give a service a name like -#foo as DNS won't like it. Stick whatever you want in the fields themselves. I'll javadoc that field Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, YARN-913-013.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153745#comment-14153745 ] Steve Loughran commented on YARN-913: - [~aw]: new constant for you + javadocs: {code} /** * Pattern of a single entry in the registry path: {@value}. * <p> * This is what constitutes a valid hostname according to current RFCs. * Alphanumeric first two and last one digit, alphanumeric * and hyphens allowed in between. * <p> * No upper limit is placed on the size of an entry. */ {code} Better? Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, YARN-913-013.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2624) Resource Localization fails on a cluster due to existing cache directories
[ https://issues.apache.org/jira/browse/YARN-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153749#comment-14153749 ] Anubhav Dhoot commented on YARN-2624: - What we see is a bunch of preexisting local resource cache directories conflict with the new resource download. The destination directory being chosen via uniqueNumberGenerator is choosing one of these and without [HADOOP-9438|https://issues.apache.org/jira/browse/HADOOP-9438] we dont know until the rename fails. Resetting uniqueNumberGenerator based on recoverResource does not seem to be enough. We may need to check the state of the NM's cache directory and reset to the highest number in the directory Resource Localization fails on a cluster due to existing cache directories -- Key: YARN-2624 URL: https://issues.apache.org/jira/browse/YARN-2624 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot We have found resource localization fails on a cluster with following error in certain cases. {noformat} INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml, 1412027745352, FILE, null },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING} java.io.IOException: Rename cannot overwrite non empty destination directory /data/yarn/nm/filecache/27 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
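A sketch of the directory scan suggested above: seed the unique-number generator from the largest numbered entry already present under the NM filecache, so new downloads never collide with leftover directories. The helper name and the wiring into FSDownload are assumptions, not the eventual patch.

{code}
import java.io.File;
import java.util.concurrent.atomic.AtomicLong;

public class FileCacheSeedSketch {
  /** Find the highest numeric directory name under the cache root, or 0 if none. */
  static long highestExistingCacheId(File cacheRoot) {
    long max = 0;
    File[] entries = cacheRoot.listFiles();
    if (entries != null) {
      for (File entry : entries) {
        try {
          max = Math.max(max, Long.parseLong(entry.getName()));
        } catch (NumberFormatException ignored) {
          // non-numeric entries (e.g. temporary files) are skipped
        }
      }
    }
    return max;
  }

  public static void main(String[] args) {
    File cacheRoot = new File("/data/yarn/nm/filecache"); // example path from the report
    AtomicLong uniqueNumberGenerator = new AtomicLong(highestExistingCacheId(cacheRoot));
    System.out.println("next cache dir id = " + uniqueNumberGenerator.incrementAndGet());
  }
}
{code}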
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153748#comment-14153748 ] Steve Loughran commented on YARN-913: - Oh, and I renamed the field: {code} /** * Pattern of a single entry in the registry path: {@value}. * <p> * This is what constitutes a valid hostname according to current RFCs. * Alphanumeric first two and last one digit, alphanumeric * and hyphens allowed in between. * <p> * No upper limit is placed on the size of an entry. */ String VALID_PATH_ENTRY_PATTERN = "([a-z0-9]|[a-z0-9][a-z0-9\\-]*[a-z0-9])"; {code} Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, YARN-913-013.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
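A quick usage sketch validating registry path entries against the expression quoted above, using only java.util.regex; the accepted and rejected example names are invented for illustration.

{code}
import java.util.regex.Pattern;

public class PathEntryValidationSketch {
  // Same expression as VALID_PATH_ENTRY_PATTERN above.
  static final Pattern VALID_PATH_ENTRY =
      Pattern.compile("([a-z0-9]|[a-z0-9][a-z0-9\\-]*[a-z0-9])");

  static boolean isValidEntry(String name) {
    return VALID_PATH_ENTRY.matcher(name).matches();
  }

  public static void main(String[] args) {
    System.out.println(isValidEntry("namenode-1"));  // true: hostname-style entry
    System.out.println(isValidEntry("-leading"));    // false: cannot start with a hyphen
    System.out.println(isValidEntry("has.dots"));    // false: dots are not allowed per entry
    System.out.println(isValidEntry("UPPER"));       // false: lower case only
  }
}
{code}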
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153750#comment-14153750 ] Hadoop QA commented on YARN-90: --- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672125/apache-yarn-90.7.patch against trunk revision 9582a50. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5187//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5187//console This message is automatically generated. NodeManager should identify failed disks becoming good back again - Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs restart. This JIRA is to improve NodeManager to reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2628) Capacity scheduler with DominantResourceCalculator carries out reservation even though slots are free
Varun Vasudev created YARN-2628: --- Summary: Capacity scheduler with DominantResourceCalculator carries out reservation even though slots are free Key: YARN-2628 URL: https://issues.apache.org/jira/browse/YARN-2628 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.5.1 Reporter: Varun Vasudev Assignee: Varun Vasudev We've noticed that if you run the CapacityScheduler with the DominantResourceCalculator, sometimes apps will end up with containers in a reserved state even though free slots are available. The root cause seems to be this piece of code from CapacityScheduler.java - {noformat} // Try to schedule more if there are no reservations to fulfill if (node.getReservedContainer() == null) { if (Resources.greaterThanOrEqual(calculator, getClusterResource(), node.getAvailableResource(), minimumAllocation)) { if (LOG.isDebugEnabled()) { LOG.debug("Trying to schedule on node: " + node.getNodeName() + ", available: " + node.getAvailableResource()); } root.assignContainers(clusterResource, node, false); } } else { LOG.info("Skipping scheduling since node " + node.getNodeID() + " is reserved by application " + node.getReservedContainer().getContainerId().getApplicationAttemptId()); } {noformat} The code is meant to check if a node has any slots available for containers. Since it uses the greaterThanOrEqual function, we end up in a situation where greaterThanOrEqual returns true even though we may not have enough CPU or memory to actually run the container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
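To make the described failure mode concrete, here is a toy computation of dominant shares, the comparison style DominantResourceCalculator uses. The cluster sizes and node values are invented, and this sketch is not the Hadoop implementation itself.

{code}
public class DominantShareSketch {
  /** Dominant share of a <memory, vcores> pair relative to the cluster total. */
  static double dominantShare(long memMB, int vcores, long clusterMemMB, int clusterVcores) {
    return Math.max((double) memMB / clusterMemMB, (double) vcores / clusterVcores);
  }

  public static void main(String[] args) {
    long clusterMem = 102_400; // 100 GB
    int clusterVcores = 100;

    // Node has plenty of memory free but zero vcores left.
    double nodeAvailable = dominantShare(8_192, 0, clusterMem, clusterVcores); // 0.08
    // Minimum allocation needs at least 1 GB and 1 vcore.
    double minAllocation = dominantShare(1_024, 1, clusterMem, clusterVcores); // 0.01

    // A dominant-share comparison says "available >= minimum" even though the
    // node cannot actually host the container (0 vcores < 1 vcore required).
    System.out.println(nodeAvailable >= minAllocation);
  }
}
{code}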
[jira] [Created] (YARN-2629) Make distributed shell use the domain-based timeline ACLs
Zhijie Shen created YARN-2629: - Summary: Make distributed shell use the domain-based timeline ACLs Key: YARN-2629 URL: https://issues.apache.org/jira/browse/YARN-2629 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen To demonstrate the usage of this feature (YARN-2102), it's good to make the distributed shell create the domain and post its timeline entities into this private space. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2629) Make distributed shell use the domain-based timeline ACLs
[ https://issues.apache.org/jira/browse/YARN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2629: -- Component/s: timelineserver Target Version/s: 2.6.0 Make distributed shell use the domain-based timeline ACLs - Key: YARN-2629 URL: https://issues.apache.org/jira/browse/YARN-2629 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen To demonstrate the usage of this feature (YARN-2102), it's good to make the distributed shell create the domain and post its timeline entities into this private space. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153771#comment-14153771 ] Karthik Kambatla commented on YARN-2594: Fair enough. We could improve the locking in RMAppImpl further, but I guess the follow-up JIRA to fix SchedulerApplicationAttempt would take care of things in a better way. +1, pending Jenkins. Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Attachments: YARN-2594.patch, YARN-2594.patch, YARN-2594.patch ResoruceManager sometimes become un-responsive: There was in exception in ResourceManager log and contains only following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153779#comment-14153779 ] zhihai xu commented on YARN-2594: - The new patch looks good to me. Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Attachments: YARN-2594.patch, YARN-2594.patch, YARN-2594.patch ResourceManager sometimes becomes unresponsive: there was no exception in the ResourceManager log, and it contained only the following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153824#comment-14153824 ] Karthik Kambatla commented on YARN-2179: Extending YarnClientImpl for the test seems reasonable to me. +1, assuming TestRemoteAppChecker is the only file changed in the latest patch. Initial cache manager structure and context --- Key: YARN-2179 URL: https://issues.apache.org/jira/browse/YARN-2179 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v10.patch, YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch Implement the initial shared cache manager structure and context. The SCMContext will be used by a number of manager services (i.e. the backing store and the cleaner service). The AppChecker is used to gather the currently running applications on SCM startup (necessary for an scm that is backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
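A minimal sketch of the testing approach being approved here. Only YarnClientImpl, ApplicationReport, YarnApplicationState, and the Mockito calls are real APIs; the class name and the choice of which method to stub are assumptions for illustration rather than the contents of TestRemoteAppChecker.
{code}
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.impl.YarnClientImpl;

// Hypothetical test helper: pretends every queried application is still RUNNING, so a
// checker built on top of the client can be exercised without a live ResourceManager.
class StubYarnClient extends YarnClientImpl {
  @Override
  public ApplicationReport getApplicationReport(ApplicationId appId) {
    ApplicationReport report = mock(ApplicationReport.class);
    when(report.getYarnApplicationState()).thenReturn(YarnApplicationState.RUNNING);
    return report;
  }
}
{code}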
[jira] [Created] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
Jian He created YARN-2630: - Summary: TestDistributedShell#testDSRestartWithPreviousRunningContainers fails Key: YARN-2630 URL: https://issues.apache.org/jira/browse/YARN-2630 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He The problem is that after YARN-1372, the re-launched AM will also receive previously failed AM container. And DistributedShell logic is not expecting this extra completed container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
[ https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2630: -- Description: The problem is that after YARN-1372, in work-preserving AM restart, the re-launched AM will also receive previously failed AM container. But DistributedShell logic is not expecting this extra completed container. (was: The problem is that after YARN-1372, the re-launched AM will also receive previously failed AM container. And DistributedShell logic is not expecting this extra completed container. ) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails - Key: YARN-2630 URL: https://issues.apache.org/jira/browse/YARN-2630 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He The problem is that after YARN-1372, in work-preserving AM restart, the re-launched AM will also receive previously failed AM container. But DistributedShell logic is not expecting this extra completed container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
[ https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2630: -- Attachment: YARN-2630.1.patch Uploaded a patch to make RMAppAttempt not return AM container. TestDistributedShell#testDSRestartWithPreviousRunningContainers fails - Key: YARN-2630 URL: https://issues.apache.org/jira/browse/YARN-2630 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2630.1.patch The problem is that after YARN-1372, in work-preserving AM restart, the re-launched AM will also receive previously failed AM container. But DistributedShell logic is not expecting this extra completed container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization
[ https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153888#comment-14153888 ] Jason Lowe commented on YARN-2387: -- +1 lgtm. Committing this. Resource Manager crashes with NPE due to lack of synchronization Key: YARN-2387 URL: https://issues.apache.org/jira/browse/YARN-2387 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.0 Reporter: Mit Desai Assignee: Mit Desai Priority: Blocker Attachments: YARN-2387.patch, YARN-2387.patch, YARN-2387.patch We recently came across a 0.23 RM crashing with an NPE. Here is the stacktrace for it. {noformat} 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34) at org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339) at java.lang.Thread.run(Thread.java:722) 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {noformat} On investigating a on the issue we found that the ContainerStatusPBImpl has methods that are called by different threads and are not synchronized. Even the 2.X code looks alike. We need to make these methods synchronized so that we do not encounter this problem in future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
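The fix is what the last paragraph asks for: serializing the merge/getProto path. A stripped-down analog (not the real ContainerStatusPBImpl, which wraps a protobuf builder) shows where the NPE window sits and how marking the methods synchronized closes it.
{code}
// One thread calls toString() (which builds the "proto") while another mutates pending
// fields; before the fix these methods were not synchronized, so the builder could be
// observed half-reset (null) in the middle of a merge.
public class StatusPBImplSketch {
  private StringBuilder builder = new StringBuilder(); // stands in for the protobuf builder
  private String proto;                                // stands in for the built proto

  public synchronized void setDiagnostics(String diag) { // mutator called from one thread
    if (builder == null) {
      builder = new StringBuilder(proto == null ? "" : proto);
    }
    builder.append(diag);
  }

  private synchronized void mergeLocalToProto() {
    proto = builder.toString();
    builder = null; // without synchronization, a concurrent caller could observe this null
  }

  public synchronized String getProto() {
    if (builder != null) {
      mergeLocalToProto();
    }
    return proto;
  }

  @Override
  public String toString() {
    return getProto(); // e.g. LeafQueue.completedContainer() logging the completed container
  }
}
{code}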
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153907#comment-14153907 ] Chris Trezzo commented on YARN-2179: Yes that was the only file changed. Initial cache manager structure and context --- Key: YARN-2179 URL: https://issues.apache.org/jira/browse/YARN-2179 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v10.patch, YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch Implement the initial shared cache manager structure and context. The SCMContext will be used by a number of manager services (i.e. the backing store and the cleaner service). The AppChecker is used to gather the currently running applications on SCM startup (necessary for an scm that is backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2610) Hamlet should close table tags
[ https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153921#comment-14153921 ] Jason Lowe commented on YARN-2610: -- Looks like branch-2.6 was just cut as this was going in and it missed that branch. Karthik could you cherry-pick to that branch as well? Hamlet should close table tags -- Key: YARN-2610 URL: https://issues.apache.org/jira/browse/YARN-2610 Project: Hadoop YARN Issue Type: Bug Reporter: Ray Chiang Assignee: Ray Chiang Labels: supportability Fix For: 2.6.0 Attachments: YARN-2610-01.patch, YARN-2610-02.patch Revisiting a subset of MAPREDUCE-2993. The th, td, thead, tfoot, tr tags are not configured to close properly in Hamlet. While this is allowed in HTML 4.01, missing closing table tags tends to wreak havoc with a lot of HTML processors (although not usually browsers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2610) Hamlet should close table tags
[ https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153941#comment-14153941 ] Karthik Kambatla commented on YARN-2610: Thanks for catching it, Jason. Just cherry-picked to branch-2.6 as well. Hamlet should close table tags -- Key: YARN-2610 URL: https://issues.apache.org/jira/browse/YARN-2610 Project: Hadoop YARN Issue Type: Bug Reporter: Ray Chiang Assignee: Ray Chiang Labels: supportability Fix For: 2.6.0 Attachments: YARN-2610-01.patch, YARN-2610-02.patch Revisiting a subset of MAPREDUCE-2993. The th, td, thead, tfoot, tr tags are not configured to close properly in Hamlet. While this is allowed in HTML 4.01, missing closing table tags tends to wreak havoc with a lot of HTML processors (although not usually browsers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153968#comment-14153968 ] Hadoop QA commented on YARN-2594: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672114/YARN-2594.patch against trunk revision 9582a50. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5188//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5188//console This message is automatically generated. Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Attachments: YARN-2594.patch, YARN-2594.patch, YARN-2594.patch ResoruceManager sometimes become un-responsive: There was in exception in ResourceManager log and contains only following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2631) Modify DistributedShell to enable LogAggregationContext
Xuan Gong created YARN-2631: --- Summary: Modify DistributedShell to enable LogAggregationContext Key: YARN-2631 URL: https://issues.apache.org/jira/browse/YARN-2631 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization
[ https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153975#comment-14153975 ] Hudson commented on YARN-2387: -- FAILURE: Integrated in Hadoop-trunk-Commit #6156 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6156/]) YARN-2387. Resource Manager crashes with NPE due to lack of synchronization. Contributed by Mit Desai (jlowe: rev feaf139b4f327d33011e5a4424c06fb44c630955) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerStatusPBImpl.java * hadoop-yarn-project/CHANGES.txt Resource Manager crashes with NPE due to lack of synchronization Key: YARN-2387 URL: https://issues.apache.org/jira/browse/YARN-2387 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.0 Reporter: Mit Desai Assignee: Mit Desai Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2387.patch, YARN-2387.patch, YARN-2387.patch We recently came across a 0.23 RM crashing with an NPE. Here is the stacktrace for it. {noformat} 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34) at org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339) at java.lang.Thread.run(Thread.java:722) 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {noformat} On investigating a on the issue we found that the ContainerStatusPBImpl has methods that are called by different threads and are not synchronized. Even the 2.X code looks alike. We need to make these methods synchronized so that we do not encounter this problem in future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2594: --- Target Version/s: 2.6.0 Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Attachments: YARN-2594.patch, YARN-2594.patch, YARN-2594.patch ResourceManager sometimes becomes unresponsive: there was no exception in the ResourceManager log, and it contained only the following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153982#comment-14153982 ] Craig Welch commented on YARN-2198: --- -re pom.xml - maybe I'm just confused, I saw a reference to this in the pom and assumed it needed to be somewhere in the project, I see it builds fine, so I guess no worries there. TestWinUtils - so what I had in mind was mocking the native bit and having some tests for the proper behavior of the java components under various conditions - i realize this won't test the native code, which is significant, but it will test the java code for expected native code behavior, and there's non-trivial java code, strikes me as possible/worthwhile WindowsSecureContainerExecutor - understandable as a tactical approach but I'm concerned with leaving it that way - among other things, there is quite a lot more testing opportunity with non-secure code paths as they will be exercised much more frequently in testing (doubly so with reference to your comment above...), by having the non-secure and secure line up more the secure path will end up being higher quality as most of it's codepaths will see a good deal more use/exercise/testing, especially when new functionality is added. Also, changes going forward should require less effort if the windows path is mostly shared between secure and unsecure execution Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, YARN-2198.separation.patch, YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch YARN-1972 introduces a Secure Windows Container Executor. However this executor requires the process launching the container to be LocalSystem or a member of the a local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM to run as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to the high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153981#comment-14153981 ] Karthik Kambatla commented on YARN-2594: Committing this. Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Attachments: YARN-2594.patch, YARN-2594.patch, YARN-2594.patch ResourceManager sometimes becomes unresponsive: there was no exception in the ResourceManager log, and it contained only the following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153983#comment-14153983 ] Craig Welch commented on YARN-2198: --- there are a number of changes which impact common multi-platform code, has this been tested on non-Windows with security enabled (Linux) as well as windows? it looks like this is only a 64 bit build now, where it used to be 64 and 32. I assume this is intentional and ok? It would be really nice if we could start to separate out some of this new functionality from winutils, e.g., make the elevated service functionality independent. I see that there is a jira for doing so down the road, which is good... it looks like the elevated privilages are just around creating local directories and (obviously) spawning the process. If a stand-alone service just created and set permissions on those directories, and the java code simply checked for their existance and then moved on if they were present, I think that a lot of the back-and-forth of the elevation could be dropped to just one call to create the base directory and a second to spawn/hand back the output handles. Is that correct? service.c // We're now transfering ownership of the duplicated handles to the caller + // If the RPC call fails *after* this point the handles are leaked inside the NM process this is a little alarming. Doesn't the close() call clean this up, regardless of success/ fail? Have we done any profiling to make sure we're not leaking threads, thread stacks, memory, etc, in at least the happy case (and preferably some unhappy cases also)? I think we need to, there's a fair bit of additional native code, and running it for a bit with a profiler could tell us quite a bit about whether or not we may be leaking something... why is this conditional check different from all the others? + dwError = ValidateConfigurationFile(); + if (dwError) { nit anonimous sp anonymous hadoop-common-project/hadoop-common/src/main/native/src/org_apache_hadoop.h just a line added, pls revert ElevatedFileSystem delete() it appears that the tests for existance, etc, are run in a non-elevated way, while the actions are elevated. Is it possible for permissions to be such that the non-elevated tests do not see files/directories which are present for permission reasons? should those not be elevated also? streamReaderThread.run - using the readLine() instead of following the simple buffer copy idiom in ShellCommandExecutor has some efficiency issues, granted it looks to be reading memory sized data so it may be no big deal, but it would be nice to follow the buffer-copy pattern instead ContainerExecutor comment on comment: On Windows the ContainerLaunch creates a temporary empty jar to workaround the CLASSPATH length not exactly, it looks like it creates a jar with a special manifest of other jars, it would be helpful to explain that in the comment so it's clear what's going on ContainerLaunch public void sanitizeEnv(...) Can we please move the process of generating a new reference jar out of the sanitizeEnv method into it's own method (called ?conditionally? after sanitizeEnv)? While there's a clear connection in terms of it's setting up the environment, it's building a new jar I think it is doing more than just manipulating variables, so it belongs in a dedicated method, which can be called in call() after sanitizeEnv I believe this also means that Path nmPrivateClasspathJarDir can be pulled from the sanitizeEnv signature. 
ContainerLocalizer LOG.info(String.format(nRet: %d, nRet)); - not sure this should be info level Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, YARN-2198.separation.patch, YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch YARN-1972 introduces a Secure Windows Container Executor. However this executor requires the process launching the container to be LocalSystem or a member of the a local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM to run as a privileged account, a very large surface area to review and protect. This proposal is
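On the streamReaderThread note above, the "buffer copy idiom" the reviewer refers to is the pattern Shell's command executor uses to drain process output. A minimal, illustrative sketch (the helper name is made up; it is not the patch code):
{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Drain a stream with a fixed char buffer instead of readLine(), so arbitrarily long
// lines do not force large intermediate Strings and output is captured even without
// a trailing newline.
public final class StreamDrainer {
  public static String drain(InputStream in) throws IOException {
    StringBuilder out = new StringBuilder();
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(in, Charset.defaultCharset()));
    char[] buf = new char[512];
    int n;
    while ((n = reader.read(buf, 0, buf.length)) > 0) {
      out.append(buf, 0, n);
    }
    return out.toString();
  }
}
{code}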
[jira] [Commented] (YARN-2254) change TestRMWebServicesAppsModification to support FairScheduler.
[ https://issues.apache.org/jira/browse/YARN-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154007#comment-14154007 ] zhihai xu commented on YARN-2254: - I uploaded a new patch YARN-2254.003.patch which rebases to the latest code base. change TestRMWebServicesAppsModification to support FairScheduler. -- Key: YARN-2254 URL: https://issues.apache.org/jira/browse/YARN-2254 Project: Hadoop YARN Issue Type: Improvement Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Labels: test Attachments: YARN-2254.000.patch, YARN-2254.001.patch, YARN-2254.002.patch, YARN-2254.003.patch TestRMWebServicesAppsModification skips the test if the scheduler is not CapacityScheduler. change TestRMWebServicesAppsModification to support both CapacityScheduler and FairScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2602) Generic History Service of TimelineServer sometimes not able to handle NPE
[ https://issues.apache.org/jira/browse/YARN-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154011#comment-14154011 ] Jian He commented on YARN-2602: --- looks good to me. Generic History Service of TimelineServer sometimes not able to handle NPE -- Key: YARN-2602 URL: https://issues.apache.org/jira/browse/YARN-2602 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Environment: ATS is running with AHS/GHS enabled to use TimelineStore. Running for 4-5 days, with many random example jobs running Reporter: Karam Singh Assignee: Zhijie Shen Attachments: YARN-2602.1.patch ATS is running with AHS/GHS enabled to use TimelineStore. Running for 4-5 day, with many random example jobs running . When I ran WS API for AHS/GHS: {code} curl --negotiate -u : 'http://TIMELINE_SERFVER_WEPBAPP_ADDR/v1/applicationhistory/apps/application_1411579118376_0001' {code} It ran successfully. However {code} curl --negotiate -u : 'http://TIMELINE_SERFVER_WEPBAPP_ADDR/ws/v1/applicationhistory/apps' {exception:WebApplicationException,message:java.lang.NullPointerException,javaClassName:javax.ws.rs.WebApplicationException} {code} Failed with Internal server error 500. After looking at TimelineServer logs found that there was NPE: -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154018#comment-14154018 ] Hudson commented on YARN-2594: -- FAILURE: Integrated in Hadoop-trunk-Commit #6157 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6157/]) YARN-2594. Potential deadlock in RM when querying ApplicationResourceUsageReport. (Wangda Tan via kasha) (kasha: rev 14d60dadc25b044a2887bf912ba5872367f2dffb) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/CHANGES.txt Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2594.patch, YARN-2594.patch, YARN-2594.patch ResoruceManager sometimes become un-responsive: There was in exception in ResourceManager log and contains only following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2180) In-memory backing store for cache manager
[ https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2180: --- Attachment: YARN-2180-trunk-v6.patch [~kasha] [~vinodkv] [~sjlee0] Attached is v6. Here are the major changes: 1. Moved the in-memory-implementation-specific logic to check for initial apps from the cleaner service to the InMemorySCMStore. Also updated unit tests. 2. Got rid of InMemorySCMStoreConfiguration and added those settings back to YarnConfiguration with an in-memory store prefix. 3. Added configuration around the AppChecker implementation in the in-memory store. 4. Changed synchronization of initialApps to use a separate lock object. 5. Annotated classes with private/evolving. 6. Addressed various notes from Karthik. One specific comment: bq. For resources that are not in the store, isn't the access time trivially zero? I am okay with returning -1 for those cases, but will returning zero help at call sites? I am going through and trying to verify if everything would be OK returning an access time of 0 instead of -1. If I remember correctly, this covered a case around SCM crashing and the Uploader service on the node manager. I will jog my memory and come up with a better response. The only place that this method is called is in the isResourceEvictable method. In-memory backing store for cache manager - Key: YARN-2180 URL: https://issues.apache.org/jira/browse/YARN-2180 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch, YARN-2180-trunk-v6.patch Implement an in-memory backing store for the cache manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
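Item 4 ("a separate lock object" for initialApps) refers to a common pattern; a small sketch under assumed names (the map below is only an illustrative stand-in for the store's initial-apps bookkeeping, not the patch's exact fields):
{code}
import java.util.HashMap;
import java.util.Map;

// Guard one shared collection with its own lock object rather than synchronizing on
// 'this', so store-wide synchronized methods and initial-app bookkeeping do not
// contend on (or deadlock against) the same monitor.
public class InitialAppsSketch {
  private final Object initialAppsLock = new Object();
  private final Map<String, Integer> initialApps = new HashMap<String, Integer>();

  void addInitialApp(String appId) {
    synchronized (initialAppsLock) {
      Integer count = initialApps.get(appId);
      initialApps.put(appId, count == null ? 1 : count + 1);
    }
  }

  boolean isInitialApp(String appId) {
    synchronized (initialAppsLock) {
      return initialApps.containsKey(appId);
    }
  }
}
{code}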
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154057#comment-14154057 ] Craig Welch commented on YARN-1198: --- Sorry, the above was off - the conversation happened offline, here was the tweak to .7 that Jian suggested: Hi Craig, I looked at your patch again. It's similar to what I thought. One thing is that now that headRoom is not application specific, it doesn't belong to application any more. We may make a member of LeafQueue#User. From CapacityScheduler#allocate, directly call LeafQueue #getAndCalculateHeadRoom , not going through SchedulerApplicationAttempt route to get the HeadRoom. I think this is simpler. do you think this will work? We may make a member of LeafQueue#User. To clarify: make the headRoom a variable of LeafQueue#User, and remove that from SchedulerAttempt we might, in this approach, do what we are doing in .7 but without the HeadroomProvider at all... I'm going to give a go at this... Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Craig Welch Attachments: YARN-1198.1.patch, YARN-1198.2.patch, YARN-1198.3.patch, YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch, YARN-1198.8.patch Today headroom calculation (for the app) takes place only when * New node is added/removed from the cluster * New container is getting assigned to the application. However there are potentially lot of situations which are not considered for this calculation * If a container finishes then headroom for that application will change and should be notified to the AM accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly if a container is assigned to any applications app1/app2 then both AM should be notified about their headroom. ** To simplify the whole communication process it is ideal to keep headroom per User per LeafQueue so that everyone gets the same picture (apps belonging to same user and submitted in same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also today headroom is an absolute number ( I think it should be normalized but then this is going to be not backward compatible..) * Also when admin user refreshes queue headroom has to be updated. These all are the potential bugs in headroom calculations -- This message was sent by Atlassian JIRA (v6.3.4#6332)
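A minimal sketch of the direction suggested above. LeafQueue.User is a real inner class in the CapacityScheduler, but the field and accessors here are assumptions about the eventual patch rather than its actual contents: the idea is simply that the queue keeps one headroom value per user, and the allocate path reads it directly instead of going through SchedulerApplicationAttempt.
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

// Sketch only: per-user headroom held on the queue's User object so every application
// of that user in the queue observes the same, freshly computed value.
class UserHeadroomSketch {
  static class User {
    private volatile Resource headroom = Resources.none();

    Resource getHeadroom() {
      return headroom;
    }

    void setHeadroom(Resource headroom) { // updated whenever the queue recomputes its limits
      this.headroom = headroom;
    }
  }
}
{code}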
[jira] [Commented] (YARN-2180) In-memory backing store for cache manager
[ https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154060#comment-14154060 ] Chris Trezzo commented on YARN-2180: I also removed the clearCache() method from SCMStore and InMemorySCMStore. In-memory backing store for cache manager - Key: YARN-2180 URL: https://issues.apache.org/jira/browse/YARN-2180 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch, YARN-2180-trunk-v6.patch Implement an in-memory backing store for the cache manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-913: Attachment: YARN-913-014.patch Updated patch # comments renames the {{HOSTNAME_PATTERN}} field (for AW) # registryOperationsStatus record holds the shortname of the stat'd record, not the full path (for Sanjay) # moves the {{listFull}} operation to list then stat the children out of the core {{RegistryOperations}} API and into {{RegistryUtils}}, as it is a utility action built from the lower level operations. Migration to this across the codebase. # made that stat operation robust against child entries being deleted during the action # same for the registry purge: there may be race conditions of overlapping delete operations ... this is not an error Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, YARN-913-013.patch, YARN-913-014.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
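A rough sketch of the "list then stat the children" utility described above, tolerating children deleted mid-listing the way the robustness points require. Since the registry API is what this patch introduces, the operations interface below is a stand-in rather than the real RegistryOperations/RegistryUtils classes; only PathNotFoundException is an existing Hadoop type.
{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.PathNotFoundException;

// listFull-style helper: list child names, stat each one, and simply skip children that
// were deleted between the list and the stat instead of failing the whole call.
public class ListFullSketch {
  public static <S> List<S> listThenStat(RegistryOps<S> ops, String path) throws IOException {
    List<S> stats = new ArrayList<S>();
    for (String child : ops.list(path)) {
      try {
        stats.add(ops.stat(path + "/" + child));
      } catch (PathNotFoundException deletedMeanwhile) {
        // Removed by a concurrent purge/delete; per the comment above, not an error.
      }
    }
    return stats;
  }

  // Minimal stand-in for the lower-level registry operations used here.
  interface RegistryOps<S> {
    List<String> list(String path) throws IOException;
    S stat(String path) throws IOException;
  }
}
{code}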
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154069#comment-14154069 ] Wangda Tan commented on YARN-2594: -- Thanks [~kasha], [~jianhe] and [~zxu] for review and commit! Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2594.patch, YARN-2594.patch, YARN-2594.patch ResourceManager sometimes becomes unresponsive: there was no exception in the ResourceManager log, and it contained only the following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2180) In-memory backing store for cache manager
[ https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2180: --- Attachment: (was: YARN-2180-trunk-v6.patch) In-memory backing store for cache manager - Key: YARN-2180 URL: https://issues.apache.org/jira/browse/YARN-2180 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch Implement an in-memory backing store for the cache manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2180) In-memory backing store for cache manager
[ https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2180: --- Attachment: YARN-2180-trunk-v6.patch Re-attached v6. In-memory backing store for cache manager - Key: YARN-2180 URL: https://issues.apache.org/jira/browse/YARN-2180 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch, YARN-2180-trunk-v6.patch Implement an in-memory backing store for the cache manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2602) Generic History Service of TimelineServer sometimes not able to handle NPE
[ https://issues.apache.org/jira/browse/YARN-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154087#comment-14154087 ] Hudson commented on YARN-2602: -- SUCCESS: Integrated in Hadoop-trunk-Commit #6158 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6158/]) YARN-2602. Fixed possible NPE in ApplicationHistoryManagerOnTimelineStore. Contributed by Zhijie Shen (jianhe: rev bbff96be48119774688981d04baf444639135977) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/TestSystemMetricsPublisher.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/SystemMetricsPublisher.java Generic History Service of TimelineServer sometimes not able to handle NPE -- Key: YARN-2602 URL: https://issues.apache.org/jira/browse/YARN-2602 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Environment: ATS is running with AHS/GHS enabled to use TimelineStore. Running for 4-5 days, with many random example jobs running Reporter: Karam Singh Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2602.1.patch ATS is running with AHS/GHS enabled to use TimelineStore. Running for 4-5 day, with many random example jobs running . When I ran WS API for AHS/GHS: {code} curl --negotiate -u : 'http://TIMELINE_SERFVER_WEPBAPP_ADDR/v1/applicationhistory/apps/application_1411579118376_0001' {code} It ran successfully. However {code} curl --negotiate -u : 'http://TIMELINE_SERFVER_WEPBAPP_ADDR/ws/v1/applicationhistory/apps' {exception:WebApplicationException,message:java.lang.NullPointerException,javaClassName:javax.ws.rs.WebApplicationException} {code} Failed with Internal server error 500. After looking at TimelineServer logs found that there was NPE: -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154084#comment-14154084 ] Vinod Kumar Vavilapalli commented on YARN-2578: --- bq. Instead of fixing it everywhere, how about we fix this in RPC itself? In https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488, instead of using 0 as the default value, the default could be looked up in the Configuration. No? +1. The default from conf is 1min. Assuming it all boils down the ping interval, we should fix it in common. NM does not failover timely if RM node network connection fails --- Key: YARN-2578 URL: https://issues.apache.org/jira/browse/YARN-2578 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wilfred Spiegelenburg Attachments: YARN-2578.patch The NM does not fail over correctly when the network cable of the RM is unplugged or the failure is simulated by a service network stop or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected as expected. The NM should than re-register with the new active RM. This re-register takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck. Reproduction test case which can be used in any environment: - create a cluster with 3 nodes node 1: ZK, NN, JN, ZKFC, DN, RM, NM node 2: ZK, NN, JN, ZKFC, DN, RM, NM node 3: ZK, JN, DN, NM - start all services make sure they are in good health - kill the network connection of the RM that is active using one of the network kills from above - observe the NN and RM failover - the DN's fail over to the new active NN - the NM does not recover for a long time - the logs show a long delay and traces show no change at all The stack traces of the NM all show the same set of threads. The main thread which should be used in the re-register is the Node Status Updater This thread is stuck in: {code} Node Status Updater prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.ipc.Client.call(Client.java:1395) - locked 0xed62f488 (a org.apache.hadoop.ipc.Client$Call) at org.apache.hadoop.ipc.Client.call(Client.java:1362) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) {code} The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out and we should be using a version which takes the RPC timeout (from the configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
[ https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154098#comment-14154098 ] Hadoop QA commented on YARN-2630: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672165/YARN-2630.1.patch against trunk revision 14d60da. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler org.apache.hadoop.yarn.server.resourcemanager.security.TestClientToAMTokens org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.TestRMAppAttemptTransitions The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5189//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5189//console This message is automatically generated. TestDistributedShell#testDSRestartWithPreviousRunningContainers fails - Key: YARN-2630 URL: https://issues.apache.org/jira/browse/YARN-2630 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2630.1.patch The problem is that after YARN-1372, in work-preserving AM restart, the re-launched AM will also receive previously failed AM container. But DistributedShell logic is not expecting this extra completed container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2632) Document NM Restart feature
Junping Du created YARN-2632: Summary: Document NM Restart feature Key: YARN-2632 URL: https://issues.apache.org/jira/browse/YARN-2632 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Junping Du As this is a new feature in YARN, we should document its behavior, configuration, and the things to pay attention to. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154128#comment-14154128 ] Jian He commented on YARN-2617: --- Thanks for updating ! - {{containerStatuses.add(status);}} is moved after this check {{status.getContainerState() == ContainerState.COMPLETE}}. In some cases(e.g. NM decommission), I think we still need to send the completeContainers across so that RM knows this container completes. - we may not need to change {{getNMContainerStatuses}}, as this method will be invoked only once on re-register. I’m afraid not sending the whole containers for recovery will hit some other race conditions. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2629) Make distributed shell use the domain-based timeline ACLs
[ https://issues.apache.org/jira/browse/YARN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2629: -- Attachment: YARN-2629.1.patch Created a patch that enables the timeline domain feature for the distributed shell. Users can specify the domain ID, the readers, and the writers via options when submitting a DS job. The DS client will automatically create the domain before submitting the app to YARN. The AM will get the domain ID from the environment and put all the entities into the domain with this ID. Make distributed shell use the domain-based timeline ACLs - Key: YARN-2629 URL: https://issues.apache.org/jira/browse/YARN-2629 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2629.1.patch To demonstrate the usage of this feature (YARN-2102), it's good to make the distributed shell create the domain and post its timeline entities into this private space. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
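A minimal sketch of the flow described in the update above, assuming the public TimelineClient, TimelineDomain, and TimelineEntity APIs (TimelineClient#putDomain comes from YARN-2446, which this patch depends on); the domain ID, reader/writer lists, and entity IDs are illustrative only, not the values used by the actual YARN-2629 patch.
{code}
// Sketch only: client side creates a timeline domain with reader/writer ACLs
// before submitting the app; AM side tags its entities with that domain ID.
import org.apache.hadoop.yarn.api.records.timeline.TimelineDomain;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TimelineDomainSketch {
  public static void main(String[] args) throws Exception {
    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      // Client side: create the private domain before the app is submitted.
      TimelineDomain domain = new TimelineDomain();
      domain.setId("DS_DOMAIN_1");        // hypothetical domain ID
      domain.setReaders("alice,bob");     // users/groups allowed to read
      domain.setWriters("alice");         // users/groups allowed to write
      client.putDomain(domain);           // requires YARN-2446 (TimelineClient#putDomain)

      // AM side: post entities into the domain (in the DS case the domain ID
      // would be passed to the AM, e.g. via an environment variable).
      TimelineEntity entity = new TimelineEntity();
      entity.setEntityType("DS_APP_ATTEMPT");
      entity.setEntityId("appattempt_0000000000000_0001_000001");
      entity.setDomainId("DS_DOMAIN_1");
      client.putEntities(entity);
    } finally {
      client.stop();
    }
  }
}
{code}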
[jira] [Commented] (YARN-2629) Make distributed shell use the domain-based timeline ACLs
[ https://issues.apache.org/jira/browse/YARN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154135#comment-14154135 ] Zhijie Shen commented on YARN-2629: --- The patch depends on the one on YARN-2446. Make distributed shell use the domain-based timeline ACLs - Key: YARN-2629 URL: https://issues.apache.org/jira/browse/YARN-2629 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2629.1.patch To demonstrate the usage of this feature (YARN-2102), it's good to make the distributed shell create the domain and post its timeline entities into this private space. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154140#comment-14154140 ] Hadoop QA commented on YARN-913: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672203/YARN-913-014.patch against trunk revision a469833. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5191//console This message is automatically generated. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, YARN-913-013.patch, YARN-913-014.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
[ https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2630: -- Attachment: YARN-2630.2.patch Fixed test failures. TestDistributedShell#testDSRestartWithPreviousRunningContainers fails - Key: YARN-2630 URL: https://issues.apache.org/jira/browse/YARN-2630 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2630.1.patch, YARN-2630.2.patch The problem is that after YARN-1372, in work-preserving AM restart, the re-launched AM will also receive previously failed AM container. But DistributedShell logic is not expecting this extra completed container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2254) change TestRMWebServicesAppsModification to support FairScheduler.
[ https://issues.apache.org/jira/browse/YARN-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154173#comment-14154173 ] Hadoop QA commented on YARN-2254: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672188/YARN-2254.003.patch against trunk revision a4c9b80. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5190//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5190//console This message is automatically generated. change TestRMWebServicesAppsModification to support FairScheduler. -- Key: YARN-2254 URL: https://issues.apache.org/jira/browse/YARN-2254 Project: Hadoop YARN Issue Type: Improvement Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Labels: test Attachments: YARN-2254.000.patch, YARN-2254.001.patch, YARN-2254.002.patch, YARN-2254.003.patch TestRMWebServicesAppsModification skips the test, if the scheduler is not CapacityScheduler. change TestRMWebServicesAppsModification to support both CapacityScheduler and FairScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154211#comment-14154211 ] Ming Ma commented on YARN-90: - Thanks, Varun, Jason. A couple of comments: 1. What if a dir transitions from the DISK_FULL state to the OTHER state? DirectoryCollection.checkDirs doesn't seem to update errorDirs and fullDirs properly. We could use a state machine for each dir and make sure each transition is covered. 2. The DISK_FULL state is counted toward the error-disk threshold by LocalDirsHandlerService.areDisksHealthy; the RM could later mark the NM NODE_UNUSABLE. If we believe DISK_FULL is mostly a temporary issue, should we consider the disks healthy if they stay in DISK_FULL only for a short period of time? 3. In AppLogAggregatorImpl.java, (Path[]) localAppLogDirs.toArray(new Path[localAppLogDirs.size()]): it seems the (Path[]) cast isn't necessary. 4. What is the intention of numFailures? The method getNumFailures isn't used. 5. Nit: it is better to expand import java.util.*; in DirectoryCollection.java and LocalDirsHandlerService.java. NodeManager should identify failed disks becoming good back again - Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch MAPREDUCE-3121 makes the NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good again), the NodeManager needs a restart. This JIRA is to improve the NodeManager to reuse good disks (which could have been bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
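To illustrate item 3 in the comment above: List#toArray(T[]) is generic and already returns a Path[], so the explicit (Path[]) cast adds nothing. A small self-contained example, with made-up variable names rather than the actual AppLogAggregatorImpl code:
{code}
// Demonstrates that the (Path[]) cast in the quoted expression is redundant.
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;

class ToArrayCastExample {
  static Path[] toPathArray(List<Path> localAppLogDirs) {
    // Equivalent to the expression quoted in the review comment, minus the cast:
    return localAppLogDirs.toArray(new Path[localAppLogDirs.size()]);
  }

  public static void main(String[] args) {
    List<Path> dirs = new ArrayList<Path>();
    dirs.add(new Path("/hadoop/d1/logs"));
    dirs.add(new Path("/hadoop/d2/logs"));
    Path[] arr = toPathArray(dirs);
    System.out.println(arr.length); // prints 2
  }
}
{code}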
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154222#comment-14154222 ] Karthik Kambatla commented on YARN-2179: Committing this. Initial cache manager structure and context --- Key: YARN-2179 URL: https://issues.apache.org/jira/browse/YARN-2179 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v10.patch, YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch Implement the initial shared cache manager structure and context. The SCMContext will be used by a number of manager services (i.e. the backing store and the cleaner service). The AppChecker is used to gather the currently running applications on SCM startup (necessary for an SCM that is backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)
[ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154239#comment-14154239 ] Hudson commented on YARN-1492: -- SUCCESS: Integrated in Hadoop-trunk-Commit #6161 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6161/]) YARN-2179. [YARN-1492] Initial cache manager structure and context. (Chris Trezzo via kasha) (kasha: rev 17d1202c35a1992eab66ea05dfd2baf219a17aec) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/RemoteAppChecker.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/test/java/org/apache/hadoop/yarn/server/sharedcachemanager/TestRemoteAppChecker.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/AppChecker.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/pom.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/SharedCacheManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/pom.xml * hadoop-yarn-project/hadoop-yarn/bin/yarn * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/sharedcache/SharedCacheStructureUtil.java * hadoop-yarn-project/CHANGES.txt truly shared cache for jars (jobjar/libjar) --- Key: YARN-1492 URL: https://issues.apache.org/jira/browse/YARN-1492 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.0.4-alpha Reporter: Sangjin Lee Assignee: Chris Trezzo Priority: Critical Attachments: YARN-1492-all-trunk-v1.patch, YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, shared_cache_design.pdf, shared_cache_design_v2.pdf, shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, shared_cache_design_v5.pdf, shared_cache_design_v6.pdf Currently there is the distributed cache that enables you to cache jars and files so that attempts from the same job can reuse them. However, sharing is limited with the distributed cache because it is normally on a per-job basis. On a large cluster, sometimes copying of jobjars and libjars becomes so prevalent that it consumes a large portion of the network bandwidth, not to speak of defeating the purpose of bringing compute to where data is. This is wasteful because in most cases code doesn't change much across many jobs. I'd like to propose and discuss feasibility of introducing a truly shared cache so that multiple jobs from multiple users can share and cache jars. This JIRA is to open the discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154240#comment-14154240 ] Hudson commented on YARN-2179: -- SUCCESS: Integrated in Hadoop-trunk-Commit #6161 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6161/]) YARN-2179. [YARN-1492] Initial cache manager structure and context. (Chris Trezzo via kasha) (kasha: rev 17d1202c35a1992eab66ea05dfd2baf219a17aec) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/RemoteAppChecker.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/test/java/org/apache/hadoop/yarn/server/sharedcachemanager/TestRemoteAppChecker.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/AppChecker.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/pom.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/src/main/java/org/apache/hadoop/yarn/server/sharedcachemanager/SharedCacheManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager/pom.xml * hadoop-yarn-project/hadoop-yarn/bin/yarn * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/sharedcache/SharedCacheStructureUtil.java * hadoop-yarn-project/CHANGES.txt Initial cache manager structure and context --- Key: YARN-2179 URL: https://issues.apache.org/jira/browse/YARN-2179 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Fix For: 2.7.0 Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v10.patch, YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch Implement the initial shared cache manager structure and context. The SCMContext will be used by a number of manager services (i.e. the backing store and the cleaner service). The AppChecker is used to gather the currently running applications on SCM startup (necessary for an scm that is backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
[ https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154247#comment-14154247 ] Hadoop QA commented on YARN-2630: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672219/YARN-2630.2.patch against trunk revision 9e9e9cf. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5192//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5192//console This message is automatically generated. TestDistributedShell#testDSRestartWithPreviousRunningContainers fails - Key: YARN-2630 URL: https://issues.apache.org/jira/browse/YARN-2630 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2630.1.patch, YARN-2630.2.patch The problem is that after YARN-1372, in work-preserving AM restart, the re-launched AM will also receive previously failed AM container. But DistributedShell logic is not expecting this extra completed container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-2617: --- Attachment: YARN-2617.4.patch Updated the patch. I am not sure whether I caught your point: send a completed container one time even if its corresponding Application is stopped, then delete it from context.getContainers(). If so, we need to modify the corresponding test cases. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.patch We ([~chenchun]) are testing RM work-preserving restart and found the following logs when we ran a simple MapReduce PI task. The NM continuously reported completed containers whose Application had already finished and whose AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on the NM is supposed to guarantee cleanup of already completed applications. But it only removes the appId from 'app.context.getApplications()' when ApplicationImpl receives the event 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM might not receive this event for a long time, or might never receive it. * For NonAggregatingLogHandler, it waits for YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, before it is scheduled to delete application logs and send the event. * For LogAggregationService, it might fail (e.g. if the user does not have HDFS write permission), and then it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154334#comment-14154334 ] Jian He commented on YARN-2617: --- bq. send a completed container one time even if its corresponding Application is stopped, then delete it from context.getContainers() Yep, because if we gracefully decommission a node, we also need to notify the RM that the containers running on this node have completed. BTW, once you upload a patch, you can click Submit Patch, which will trigger Jenkins to run the corresponding unit tests. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.patch We ([~chenchun]) are testing RM work-preserving restart and found the following logs when we ran a simple MapReduce PI task. The NM continuously reported completed containers whose Application had already finished and whose AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on the NM is supposed to guarantee cleanup of already completed applications. But it only removes the appId from 'app.context.getApplications()' when ApplicationImpl receives the event 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM might not receive this event for a long time, or might never receive it. * For NonAggregatingLogHandler, it waits for YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, before it is scheduled to delete application logs and send the event. * For LogAggregationService, it might fail (e.g. if the user does not have HDFS write permission), and then it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
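A minimal sketch of the approach agreed on above: report each completed container to the RM exactly once, even when its application is no longer running, and then drop it from the NM-side map. This is not the YARN-2617 patch; the class, field, and method names are simplified stand-ins for the NodeManager's Context#getContainers() and heartbeat handling.
{code}
// Sketch under stated assumptions: a plain map stands in for the NM context's
// container map, and the two methods stand in for building the heartbeat payload
// and for handling the RM's acknowledgement.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerState;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

class HeartbeatStatusCollector {
  private final ConcurrentMap<ContainerId, ContainerStatus> containers =
      new ConcurrentHashMap<ContainerId, ContainerStatus>();
  // Completed containers already carried by a heartbeat; removed once the RM acks.
  private final List<ContainerId> pendingCompletedContainers =
      new ArrayList<ContainerId>();

  // Build the status list for the next heartbeat.
  synchronized List<ContainerStatus> getContainerStatuses() {
    List<ContainerStatus> statuses = new ArrayList<ContainerStatus>();
    for (ContainerStatus status : containers.values()) {
      statuses.add(status);
      if (status.getState() == ContainerState.COMPLETE) {
        // Send the completed container once (this also covers NM decommission),
        // then remember it so it can be dropped after the RM acknowledges.
        pendingCompletedContainers.add(status.getContainerId());
      }
    }
    return statuses;
  }

  // Called when the RM acknowledges the heartbeat that carried the statuses.
  synchronized void removeCompletedContainersFromContext() {
    for (ContainerId id : pendingCompletedContainers) {
      containers.remove(id);
    }
    pendingCompletedContainers.clear();
  }
}
{code}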
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154337#comment-14154337 ] Jun Gong commented on YARN-2617: Got it. Thank you! NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.patch We ([~chenchun]) are testing RM work-preserving restart and found the following logs when we ran a simple MapReduce PI task. The NM continuously reported completed containers whose Application had already finished and whose AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on the NM is supposed to guarantee cleanup of already completed applications. But it only removes the appId from 'app.context.getApplications()' when ApplicationImpl receives the event 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM might not receive this event for a long time, or might never receive it. * For NonAggregatingLogHandler, it waits for YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, before it is scheduled to delete application logs and send the event. * For LogAggregationService, it might fail (e.g. if the user does not have HDFS write permission), and then it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154369#comment-14154369 ] Hadoop QA commented on YARN-2617: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672239/YARN-2617.4.patch against trunk revision 17d1202. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5193//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5193//console This message is automatically generated. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.patch We ([~chenchun]) are testing RM work-preserving restart and found the following logs when we ran a simple MapReduce PI task. The NM continuously reported completed containers whose Application had already finished and whose AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on the NM is supposed to guarantee cleanup of already completed applications. But it only removes the appId from 'app.context.getApplications()' when ApplicationImpl receives the event 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM might not receive this event for a long time, or might never receive it. * For NonAggregatingLogHandler, it waits for YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, before it is scheduled to delete application logs and send the event. * For LogAggregationService, it might fail (e.g. if the user does not have HDFS write permission), and then it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)